---
license: mit
---

# DecoderTCR v0.1

DecoderTCR is a protein language model for T-cell receptor (TCR) and peptide-MHC (pMHC) complexes. The model is based on the ESM2 model family.

For model code and additional information on installation and usage, please see [the associated GitHub repository](https://github.com/czbiohub-chi/DecoderTCR).

## Model Architecture

DecoderTCR is built on a Transformer-based protein language model (ESM2 family).

### Core Architecture

The model follows the **ESM2** architecture, a deep Transformer encoder designed for protein sequences.

#### Embedding Layer

- Token embedding dimension: *d* (e.g., 1280)
- Learned positional embeddings
- Vocabulary includes:
  - 20 standard amino acids
  - Special tokens (mask, padding, BOS/EOS, unknown)

#### Transformer Stack

- Number of layers: *L* (e.g., 33)
- Hidden dimension: *d*
- Number of attention heads: *h* (e.g., 20)
- Multi-head self-attention:
  - Full-sequence, bidirectional attention
- Feed-forward network:
  - Intermediate dimension ≈ 4× *d*
  - Activation function: GELU
- Layer normalization: Pre-LayerNorm
- Residual connections around attention and feed-forward blocks

### Continual Training Setup

The model is initialized from a pretrained **ESM2 checkpoint** and further trained via continual pretraining with MLM objectives.

### Model Scale (Example Configurations)

| Model Variant | Parameters | Layers | Hidden Dim | Attention Heads |
| --- | --- | --- | --- | --- |
| ESM2-650M | ~650M | 33 | 1280 | 20 |
| ESM2-3B | ~3B | 36 | 2560 | 40 |

### Model Card Authors

Ben Lai

### Primary Contact Email

Ben Lai, ben.lai@czbiohub.org

To submit feature requests or report issues with the model, please open an issue on [the GitHub repository](https://github.com/czbiohub-chi/DecoderTCR).

### System Requirements

- Compute Requirements: GPU

## Intended Use

### Primary Use Cases

The DecoderTCR models are designed for the following primary use cases:

1. **TCR-pMHC Binding Prediction**: Predict the interaction between T-cell receptors (TCRs) and peptide-MHC complexes
2. **Interaction Scoring**: Calculate interface energy scores for TCR-pMHC interactions (see the illustrative sketch at the end of this section)
3. **Sequence Analysis**: Analyze TCR sequences and their interactions with specific peptides
4. **Immunology Research**: Support research in adaptive immunity, T-cell recognition, and antigen presentation

The models are particularly useful for:

- Identifying potential TCR-peptide binding pairs
- Screening TCR sequences for specific antigen recognition
- Understanding the molecular basis of T-cell recognition
- Supporting vaccine design and immunotherapy development

### Out-of-Scope or Unauthorized Use Cases

Do not use the model for the following purposes:

- Use that violates applicable laws, regulations (including trade compliance laws), or third-party rights such as privacy or intellectual property rights
- Any use that is prohibited by the [MIT license](https://github.com/czbiohub-chi/DecoderTCR/blob/main/LICENSE) and [Acceptable Use Policy](https://virtualcellmodels.cziscience.com/acceptable-use-policy)
- Clinical diagnosis or treatment decisions without proper validation
- Direct use in patient care without appropriate clinical validation and regulatory approval
- Use for purposes that could cause harm to individuals or groups

The models are research tools and should not be used as the sole basis for clinical or diagnostic decisions.
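To make the binding-prediction and scoring use cases concrete, the sketch below computes a masked-LM pseudo-log-likelihood for a TCR CDR3/peptide pair using a stock ESM2 checkpoint from the Hugging Face Hub. The checkpoint name, the simple sequence concatenation, and the scoring heuristic are assumptions made for this illustration only; DecoderTCR's actual inputs, weights, and interface-energy scoring are documented in [the GitHub repository](https://github.com/czbiohub-chi/DecoderTCR).

```python
# Illustrative sketch only: scores a TCR beta-chain CDR3 / peptide pair with a
# masked-LM pseudo-log-likelihood from a stock ESM2 checkpoint. The checkpoint
# name, the naive concatenation, and the scoring heuristic are assumptions;
# see the DecoderTCR GitHub repository for the model's actual interface.
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

model_name = "facebook/esm2_t33_650M_UR50D"  # base ESM2 backbone (assumption)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = EsmForMaskedLM.from_pretrained(model_name).eval()

def pseudo_log_likelihood(sequence: str) -> float:
    """Average log-probability of each residue when it is masked in turn."""
    input_ids = tokenizer(sequence, return_tensors="pt")["input_ids"][0]
    total, n = 0.0, 0
    with torch.no_grad():
        for i in range(1, input_ids.shape[0] - 1):  # skip BOS/EOS tokens
            masked = input_ids.clone()
            masked[i] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, i]
            total += torch.log_softmax(logits, dim=-1)[input_ids[i]].item()
            n += 1
    return total / n

# Hypothetical example pair: a TCR beta CDR3 loop and a 9-mer peptide,
# joined here by simple concatenation purely for illustration.
cdr3b, peptide = "CASSLGQAYEQYF", "GILGFVFTL"
print(pseudo_log_likelihood(cdr3b + peptide))
```

Higher (less negative) averages indicate sequences the masked language model finds more plausible; this generic protein-LM recipe is not the same as DecoderTCR's interface energy score.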
## Training Data

The models are trained on several large-scale protein sequence databases. The training data consists of:

- **TCR sequences**: Observed T-cell Space (OTS) for paired α/β TCR sequences.
- **Peptide-MHC sequences**: MHC Motif Atlas for peptide-MHC ligandomes, plus high-confidence synthetic interactions predicted with MixMHCpred.
- **Paired TCR-pMHC interactions**: VDJdb for paired TCR-pMHC interaction data.

## Continual Pre-training Strategy

This model is trained using a **continual pre-training curriculum** that adapts a pretrained ESM2 backbone to new protein domains while preserving previously learned representations.

### Overview

Continual pre-training proceeds in **multiple stages**, each leveraging different data regimes and masking strategies:

- Stage 1 emphasizes **abundant marginal sequence data**, encouraging robust component-level representations.
- Stage 2 incorporates **scarcer, structured, or interaction-rich data**, refining conditional dependencies without overwriting earlier knowledge.

The architecture, tokenizer, and objective remain unchanged throughout training; only the data distribution and masking strategy evolve.

### Stage 1: Component-Level Adaptation

In the first stage, the model is further pretrained on large collections of unpaired or weakly structured protein sequences relevant to the target domain.

- **Objective:** Masked Language Modeling (MLM)
- **Masking:** Component- or region-aware masking schedules that upweight functionally relevant positions
- **Purpose:**
  - Adapt the pretrained ESM2 representations to the target protein subspace
  - Learn domain-specific sequence statistics while retaining general protein knowledge

This stage acts as a regularizer, anchoring learning in large-scale marginal data before introducing more complex dependencies.

### Stage 2: Conditional / Interaction-Aware Refinement

In subsequent stages, the model is continually trained on **structured or paired sequences** that encode higher-order dependencies (e.g., interactions between protein regions or components).

- **Objective:** Masked Language Modeling (MLM)
- **Masking:** Joint masking across interacting regions to encourage cross-context conditioning
- **Purpose:**
  - Refine conditional relationships learned from limited paired data
  - Align representations across components without degrading Stage 1 task performance
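The sketch below shows, under stated assumptions, what the two masking regimes could look like in code: Stage 1's component-aware masking upweights a functionally relevant region, and Stage 2's joint masking ties masked positions in one component to positions in an interacting component. The region coordinates, masking rates, and pairing heuristic are illustrative only, not the schedules actually used to train DecoderTCR.

```python
# Illustrative sketch of the two masking regimes described above. Region
# annotations, rates, and the pairing heuristic are assumptions, not the
# exact DecoderTCR training schedules.
import numpy as np

rng = np.random.default_rng(0)

def component_aware_mask(seq_len, region_positions, base_rate=0.15, region_boost=2.0):
    """Stage 1: MLM masking that upweights functionally relevant positions
    (e.g., CDR loops) relative to the rest of the sequence."""
    rates = np.full(seq_len, base_rate)
    rates[list(region_positions)] = min(base_rate * region_boost, 1.0)
    return rng.random(seq_len) < rates

def joint_interaction_mask(len_a, len_b, pair_rate=0.15):
    """Stage 2: mask positions in two interacting components together so the
    model must reconstruct one side from the context of the other."""
    mask_a = rng.random(len_a) < pair_rate
    # Pair each masked position in component A with a random position in
    # component B (a stand-in for known or predicted contacts).
    partners = rng.integers(0, len_b, size=int(mask_a.sum()))
    mask_b = np.zeros(len_b, dtype=bool)
    mask_b[partners] = True
    return mask_a, mask_b

# Hypothetical example: a 120-residue TCR chain whose CDR3 spans positions
# 95-107, paired with a 9-mer peptide.
tcr_mask = component_aware_mask(120, range(95, 108))
tcr_joint, pep_joint = joint_interaction_mask(120, 9)
print(tcr_mask.sum(), tcr_joint.sum(), pep_joint.sum())
```

In practice the boosted region would come from real annotations (e.g., CDR loops) and the joint pairs from known or predicted contacts; the sketch is meant only to convey the shape of the two strategies.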
## Biases, Risks, and Limitations

### Potential Biases

- The model may reflect biases present in the training data, including:
  - Overrepresentation of certain HLA alleles or peptide types
  - Limited diversity in TCR sequences from specific populations
  - Bias toward well-studied antigen systems
- Certain TCR clonotypes or peptide types may be underrepresented in the training data

### Risks

Areas of risk may include, but are not limited to:

- **Inaccurate predictions**: The model may produce incorrect binding predictions, especially for novel sequences or rare HLA-peptide combinations
- **Overconfidence**: The model may assign high confidence to predictions that are actually uncertain
- **Biological misinterpretation**: Users may misinterpret model outputs as definitive biological facts rather than predictions
- **Clinical misuse**: Use in clinical settings without proper validation could lead to incorrect treatment decisions

### Limitations

- **Sequence length**: The model has a maximum sequence length (typically ~1024 tokens)
- **Novel sequences**: Performance may degrade on sequences very different from the training data
- **HLA diversity**: Limited training data for rare HLA alleles may affect prediction accuracy
- **Context dependency**: The model may not capture all biological context (e.g., post-translational modifications, cellular environment)
- **Computational requirements**: A GPU is recommended for optimal performance

### Caveats and Recommendations

- **Review and validate outputs**: Always review and validate model predictions, especially for critical applications
- **Experimental validation**: Model predictions should be validated experimentally before use in research or clinical contexts
- **Uncertainty awareness**: Be aware that predictions are probabilistic and carry uncertainty
- **Domain expertise**: Use the model in conjunction with domain expertise in immunology and T-cell biology
- **Version tracking**: Keep track of which model version and checkpoint you are using

We are committed to advancing the responsible development and use of artificial intelligence. Please follow our [Acceptable Use Policy](https://virtualcellmodels.cziscience.com/acceptable-use-policy) when engaging with our services. Should you have any security or privacy issues or questions related to the services, please reach out to our team at security@chanzuckerberg.com or privacy@chanzuckerberg.com, respectively.

## Acknowledgements

This model builds upon:

- **ESM2** by Meta AI (Facebook Research) for the base protein language model
- The broader computational biology and immunology research communities

Special thanks to the developers and contributors of the ESM models and the open-source tools that made this work possible.