license: mit
DecoderTCR v0.1
DecoderTCR is a protein language model for T-cell receptor (TCR) and peptide-MHC (pMHC) complexes. The model is based on the ESM2 model family.
For model code and additional information on installation and usage, please see the associated GitHub repository.
Model Architecture
DecoderTCR is built on a Transformer-based protein language model (ESM2 family).
Core Architecture
The model follows the ESM2 architecture, a deep Transformer encoder designed for protein sequences.
Embedding Layer
- Token embedding dimension: d (e.g., 1280)
- Learned positional embeddings
- Vocabulary includes:
  - 20 standard amino acids
  - Special tokens (mask, padding, BOS/EOS, unknown)
Transformer Stack
- Number of layers: L (e.g., 33)
- Hidden dimension: d
- Number of attention heads: h (e.g., 20)
- Multi-head self-attention:
  - Full-sequence, bidirectional attention
- Feed-forward network:
  - Intermediate dimension ≈ 4× d
  - Activation function: GELU
- Layer normalization: Pre-LayerNorm
- Residual connections around attention and feed-forward blocks
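For orientation, the sketch below shows a minimal pre-LayerNorm encoder block with the hyperparameters listed above (hidden dimension d, h attention heads, ≈4×d feed-forward, GELU). It is an illustrative PyTorch reconstruction of an ESM2-style block, not the packaged implementation.

```python
# Minimal sketch of a pre-LayerNorm Transformer encoder block in the ESM2 style.
# Illustrative only: the released model uses the ESM2 implementation, not this code.
import torch
import torch.nn as nn

class PreLNEncoderBlock(nn.Module):
    def __init__(self, d_model: int = 1280, n_heads: int = 20):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)          # pre-LN before attention
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn_norm = nn.LayerNorm(d_model)           # pre-LN before feed-forward
        self.ffn = nn.Sequential(                       # intermediate dim ≈ 4 × d
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor, padding_mask=None) -> torch.Tensor:
        # Full-sequence, bidirectional self-attention with a residual connection.
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h, key_padding_mask=padding_mask)
        x = x + attn_out
        # Feed-forward block with its own residual connection.
        x = x + self.ffn(self.ffn_norm(x))
        return x
```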
Continual Training Setup
The model is initialized from a pretrained ESM2 checkpoint and further trained via continual pretraining with a masked language modeling (MLM) objective.
Model Scale (Example Configurations)
| Model Variant | Parameters | Layers | Hidden Dim | Attention Heads |
|---|---|---|---|---|
| ESM2-650M | ~650M | 33 | 1280 | 20 |
| ESM2-3B | ~3B | 36 | 2560 | 40 |
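As a hedged illustration of the initialization step, the snippet below loads a public ESM2-650M checkpoint with the `fair-esm` package; the actual DecoderTCR continual-pretraining pipeline is provided in the GitHub repository.

```python
# Sketch: initializing from a pretrained ESM2 checkpoint (fair-esm package).
# The DecoderTCR training pipeline itself lives in the GitHub repository.
import esm

# Load the ESM2-650M backbone (33 layers, 1280 hidden dim, 20 heads).
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()

# Tokenize an example sequence; continual pretraining then resumes MLM updates
# on TCR / peptide-MHC data starting from these weights.
data = [("example", "CASSLGQAYEQYF")]
labels, strs, tokens = batch_converter(data)
print(tokens.shape)  # (batch, sequence length incl. BOS/EOS)
```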
Model Card Authors
Ben Lai
Primary Contact Email
Ben Lai (ben.lai@czbiohub.org). To submit feature requests or report issues with the model, please open an issue on the GitHub repository.
System Requirements
- Compute Requirements: GPU
Intended Use
Primary Use Cases
The DecoderTCR models are designed for the following primary use cases:
- TCR-pMHC Binding Prediction: Predict the interaction between T-cell receptors (TCRs) and peptide-MHC complexes
- Interaction Scoring: Calculate interface energy scores for TCR-pMHC interactions
- Sequence Analysis: Analyze TCR sequences and their interactions with specific peptides
- Immunology Research: Support research in adaptive immunity, T-cell recognition, and antigen presentation
The models are particularly useful for:
- Identifying potential TCR-peptide binding pairs
- Screening TCR sequences for specific antigen recognition
- Understanding the molecular basis of T-cell recognition
- Supporting vaccine design and immunotherapy development
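The scoring interface used for binding prediction and interface energy scores is defined in the GitHub repository. As a generic illustration only, the sketch below computes a masked-token pseudo-log-likelihood over a concatenated TCR/peptide string; the concatenation scheme, checkpoint, and function name are assumptions, not the DecoderTCR API.

```python
# Hypothetical sketch of scoring a TCR-peptide pair by masked-token
# pseudo-log-likelihood. The concatenation scheme and the exact scoring
# function used by DecoderTCR are defined in the GitHub repository;
# this only illustrates the general idea.
import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

def pseudo_log_likelihood(sequence: str) -> float:
    """Sum of log-probabilities of each residue when it is masked in turn."""
    _, _, tokens = batch_converter([("query", sequence)])
    total = 0.0
    with torch.no_grad():
        for i in range(1, tokens.shape[1] - 1):      # skip BOS/EOS
            masked = tokens.clone()
            masked[0, i] = alphabet.mask_idx         # mask one position
            logits = model(masked)["logits"]
            log_probs = torch.log_softmax(logits[0, i], dim=-1)
            total += log_probs[tokens[0, i]].item()
    return total

# Example: a CDR3beta sequence and a peptide joined for illustration only.
score = pseudo_log_likelihood("CASSLGQAYEQYF" + "GILGFVFTL")
print(score)
```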
Out-of-Scope or Unauthorized Use Cases
Do not use the model for the following purposes:
- Use that violates applicable laws, regulations (including trade compliance laws), or third party rights such as privacy or intellectual property rights
- Any use that is prohibited by the MIT license and Acceptable Use Policy.
- Clinical diagnosis or treatment decisions without proper validation
- Direct use in patient care without appropriate clinical validation and regulatory approval
- Use for purposes that could cause harm to individuals or groups
The models are research tools and should not be used as the sole basis for clinical or diagnostic decisions.
Training Data
The models are trained on multiple large-scale protein sequence databases. The training data consists of:
- TCR sequences: Observed T-cell Space (OTS) for paired $\alpha/\beta$ TCR sequences.
- Peptide-MHC sequences: MHC Motif Atlas for peptide-MHC ligandomes, plus high-confidence synthetic interactions predicted with MixMHCpred.
- Paired TCR-pMHC Interactions: VDJdb for paired TCR-pMHC interaction data.
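Purely for illustration, a paired training example drawn from these sources might be organized as follows; the field names and preprocessing are hypothetical, and the real data pipeline is in the GitHub repository.

```python
# Hypothetical sketch of how a single paired TCR-pMHC training record might be
# organized; field names are illustrative and do not reflect the actual
# preprocessing code, which is available in the GitHub repository.
from dataclasses import dataclass

@dataclass
class TCRpMHCRecord:
    tcr_alpha: str   # paired alpha-chain sequence (e.g., from OTS)
    tcr_beta: str    # paired beta-chain sequence
    peptide: str     # presented peptide (e.g., from MHC Motif Atlas / VDJdb)
    mhc: str         # MHC allele or pseudo-sequence

record = TCRpMHCRecord(
    tcr_alpha="CAVRDSNYQLIW",
    tcr_beta="CASSLGQAYEQYF",
    peptide="GILGFVFTL",
    mhc="HLA-A*02:01",
)
```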
Continual Pre-training Strategy
This model is trained using a continual pre-training curriculum that adapts a pretrained ESM2 backbone to new protein domains while preserving previously learned representations.
Overview
Continual pre-training proceeds in multiple stages, each leveraging different data regimes and masking strategies:
- Stage 1 emphasizes abundant marginal sequence data, encouraging robust component-level representations.
- Stage 2 incorporates scarcer, structured, or interaction-rich data, refining conditional dependencies without overwriting earlier knowledge.
The architecture, tokenizer, and objective remain unchanged throughout training; only the data distribution and masking strategy evolve.
Stage 1: Component-Level Adaptation
In the first stage, the model is further pretrained on large collections of unpaired or weakly structured protein sequences relevant to the target domain.
- Objective: Masked Language Modeling (MLM)
- Masking: Component- or region-aware masking schedules that upweight functionally relevant positions
- Purpose:
- Adapt the pretrained ESM2 representations to the target protein subspace
- Learn domain-specific sequence statistics while retaining general protein knowledge
This stage acts as a regularizer, anchoring learning in large-scale marginal data before introducing more complex dependencies.
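A minimal sketch of what a region-aware masking schedule could look like is shown below; the masking rates, the notion of "relevant" positions, and the function itself are illustrative assumptions rather than the released Stage 1 training code.

```python
# Illustrative region-aware MLM masking: positions flagged as functionally
# relevant (e.g., CDR loops) are masked more often than the rest of the
# sequence. Rates and the "relevant" definition are assumptions, not the
# released Stage 1 schedule.
import numpy as np

def region_aware_mask(seq_len, relevant, base_rate=0.15, boost=2.0, rng=None):
    """Return a boolean mask; relevant positions are masked ~boost x more often."""
    rng = rng or np.random.default_rng()
    rates = np.full(seq_len, base_rate)
    rates[relevant] = min(base_rate * boost, 1.0)  # upweight relevant positions
    return rng.random(seq_len) < rates

# Example: a 13-residue region with positions 3-9 treated as relevant.
relevant = np.zeros(13, dtype=bool)
relevant[3:10] = True
print(region_aware_mask(13, relevant).astype(int))
```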
Stage 2: Conditional / Interaction-Aware Refinement
In subsequent stages, the model is continually trained on structured or paired sequences that encode higher-order dependencies (e.g., interactions between protein regions or components).
- Objective: Masked Language Modeling (MLM)
- Masking: Joint masking across interacting regions to encourage cross-context conditioning
- Purpose:
- Refine conditional relationships learned from limited paired data
- Align representations across components without degrading Stage 1 task performance
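Below is a comparable sketch of joint masking across two interacting regions (e.g., a CDR3 loop and its bound peptide); the region boundaries and rates are again illustrative assumptions rather than the released Stage 2 schedule.

```python
# Sketch of Stage 2 joint masking: positions in two interacting regions (e.g., a
# CDR3 loop and the bound peptide) are masked together so the model must
# condition each region on the other. Region boundaries and rates are
# illustrative; the actual schedule is defined in the training code.
import numpy as np

def joint_region_mask(seq_len, region_a, region_b, pair_rate=0.3, rng=None):
    """Mask matched fractions of two interacting regions in the same example."""
    rng = rng or np.random.default_rng()
    mask = np.zeros(seq_len, dtype=bool)
    for region in (region_a, region_b):
        idx = np.arange(region.start, region.stop)
        n = max(1, int(round(pair_rate * len(idx))))
        mask[rng.choice(idx, size=n, replace=False)] = True
    return mask

# Example: TCR beta CDR3 occupies positions 0-12, peptide positions 13-21.
mask = joint_region_mask(22, slice(0, 13), slice(13, 22))
print(mask.astype(int))
```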
Biases, Risks, and Limitations
Potential Biases
- The model may reflect biases present in the training data, including:
  - Overrepresentation of certain HLA alleles or peptide types
  - Limited diversity in TCR sequences from specific populations
  - Bias toward well-studied antigen systems
- Certain TCR clonotypes or peptide types may be underrepresented in training data
Risks
Areas of risk may include but are not limited to:
- Inaccurate predictions: The model may produce incorrect binding predictions, especially for novel sequences or rare HLA-peptide combinations
- Overconfidence: The model may assign high confidence to predictions that are actually uncertain
- Biological misinterpretation: Users may misinterpret model outputs as definitive biological facts rather than predictions
- Clinical misuse: Use in clinical settings without proper validation could lead to incorrect treatment decisions
Limitations
- Sequence length: Inputs are limited to a maximum sequence length (typically ~1024 tokens)
- Novel sequences: Performance may degrade on sequences very different from training data
- HLA diversity: Limited training data for rare HLA alleles may affect prediction accuracy
- Context dependency: The model may not capture all biological context (e.g., post-translational modifications, cellular environment)
- Computational requirements: GPU is recommended for optimal performance
Caveats and Recommendations
- Review and validate outputs: Always review and validate model predictions, especially for critical applications
- Experimental validation: Model predictions should be validated experimentally before use in research or clinical contexts
- Uncertainty awareness: Be aware that predictions are probabilistic and may have uncertainty
- Domain expertise: Use the model in conjunction with domain expertise in immunology and T-cell biology
- Version tracking: Keep track of which model version and checkpoint you are using
We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when engaging with our services.
Should you have any security or privacy issues or questions related to the services, please reach out to our team at security@chanzuckerberg.com or privacy@chanzuckerberg.com respectively.
Acknowledgements
This model builds upon:
- ESM2 by Meta AI (Facebook Research) for the base protein language model
- The broader computational biology and immunology research communities
Special thanks to the developers and contributors of the ESM models and the open-source tools that made this work possible.