---
license: mit
---

## Model description
**InstructCell** is a multi-modal AI copilot that integrates natural language with single-cell RNA sequencing data, enabling researchers to perform tasks like cell type annotation, pseudo-cell generation, and drug sensitivity prediction through intuitive text commands.
By leveraging a specialized multi-modal architecture and our multi-modal single-cell instruction dataset, InstructCell reduces technical barriers and enhances accessibility for single-cell analysis.

**Instruct Version**: Supports generating only the answer portion without additional explanatory text, providing concise and task-specific outputs.

### How to use

We provide a simple example for quick reference. This demonstrates a basic **cell type annotation** workflow.

Make sure to specify the paths for `H5AD_PATH` and `GENE_VOCAB_PATH` appropriately:
- `H5AD_PATH`: Path to your `.h5ad` single-cell data file (e.g., `H5AD_PATH = "path/to/your/data.h5ad"`).
- `GENE_VOCAB_PATH`: Path to your gene vocabulary file (e.g., `GENE_VOCAB_PATH = "path/to/your/gene_vocab.npy"`).

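Before loading anything, it can help to confirm that both files exist. The following is a minimal sketch, not part of InstructCell: `missing_inputs` is a hypothetical helper, and the two paths are the placeholder values from the list above, to be replaced with your own.

```python
from pathlib import Path

# Placeholder paths taken from the examples above; replace with your own files.
H5AD_PATH = "path/to/your/data.h5ad"
GENE_VOCAB_PATH = "path/to/your/gene_vocab.npy"


def missing_inputs(*paths):
    """Return the paths that do not point to an existing file (hypothetical helper)."""
    return [str(p) for p in paths if not Path(p).is_file()]


missing = missing_inputs(H5AD_PATH, GENE_VOCAB_PATH)
if missing:
    print("Missing input files:", ", ".join(missing))
```

An empty list means both inputs were found and the example can proceed.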
```python
from mmllm.module import InstructCell
import anndata
import numpy as np
from utils import unify_gene_features

# Load the pre-trained InstructCell model from HuggingFace
model = InstructCell.from_pretrained("zjunlp/InstructCell-instruct")

# Load the single-cell data (H5AD format) and gene vocabulary file (numpy format)
adata = anndata.read_h5ad(H5AD_PATH)
gene_vocab = np.load(GENE_VOCAB_PATH)
adata = unify_gene_features(adata, gene_vocab, force_gene_symbol_uppercase=False)

# Select a random single-cell sample and extract its gene counts and metadata
k = np.random.randint(0, len(adata))
gene_counts = adata[k, :].X.toarray()
sc_metadata = adata[k, :].obs.iloc[0].to_dict()

# Define the model prompt with placeholders for metadata and gene expression profile
prompt = (
    "Can you help me annotate this single cell from a {species}? "
    "It was sequenced using {sequencing_method} and is derived from {tissue}. "
    "The gene expression profile is {input}. Thanks!"
)

# Use the model to generate predictions
for key, value in model.predict(
    prompt,
    gene_counts=gene_counts,
    sc_metadata=sc_metadata,
    do_sample=True,
    top_p=0.95,
    top_k=50,
    max_new_tokens=256,
).items():
    # Print each key-value pair
    print(f"{key}: {value}")
```

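The prompt in the example names several placeholders, so a quick check that the metadata can fill them may save a failed run. This is a minimal sketch under an assumption the card does not spell out: that `model.predict` fills each placeholder from the `sc_metadata` key of the same name and reserves `{input}` for the gene expression profile. `missing_prompt_fields` is a hypothetical helper, and the metadata values are illustrative only.

```python
from string import Formatter


def missing_prompt_fields(prompt, metadata, reserved=("input",)):
    """List prompt placeholders that metadata cannot fill (hypothetical helper).

    Names in `reserved` are skipped; here that is `input`, assumed to be
    filled with the gene expression profile rather than from the metadata.
    """
    fields = {name for _, name, _, _ in Formatter().parse(prompt) if name}
    return sorted(fields - set(reserved) - set(metadata))


prompt = (
    "Can you help me annotate this single cell from a {species}? "
    "It was sequenced using {sequencing_method} and is derived from {tissue}. "
    "The gene expression profile is {input}. Thanks!"
)

# Illustrative metadata only; in practice this would be the `sc_metadata` dict.
example_metadata = {"species": "human", "tissue": "liver"}
print(missing_prompt_fields(prompt, example_metadata))  # prints ['sequencing_method']
```

If the list is non-empty, either add the missing columns to `adata.obs` or drop those placeholders from the prompt.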
For more detailed explanations and additional examples, please refer to the Jupyter notebook [demo.ipynb](https://github.com/zjunlp/InstructCell/blob/main/demo.ipynb).

### Citation

If you use the code or data, please cite the following paper:

```bibtex
@article{fang2025instructcell,
  title={A Multi-Modal AI Copilot for Single-Cell Analysis with Instruction Following},
  author={Fang, Yin and Deng, Xinle and Liu, Kangwei and Zhang, Ningyu and Qian, Jingyang and Yang, Penghui and Fan, Xiaohui and Chen, Huajun},
  journal={arXiv preprint arXiv:2501.08187},
  year={2025}
}
```