license: mit
DecoderTCR v0.1
DecoderTCR is a protein language model for T-cell receptor (TCR) and peptide-MHC (pMHC) complexes. The model is based on the ESM2 model family.
For model code and additional information on installation and usage, please see the associated GitHub repository.
Model Architecture
DecoderTCR is built on a Transformer-based protein language model (ESM2 family).
Core Architecture
The model follows the ESM2 architecture, a deep Transformer encoder designed for protein sequences.
Embedding Layer
- Token embedding dimension: d (e.g., 1280)
- Learned positional embeddings
- Vocabulary includes:
  - 20 standard amino acids
  - Special tokens (mask, padding, BOS/EOS, unknown)
Transformer Stack
- Number of layers: L (e.g., 33)
- Hidden dimension: d
- Number of attention heads: h (e.g., 20)
- Multi-head self-attention:
  - Full-sequence, bidirectional attention
- Feed-forward network:
  - Intermediate dimension ≈ 4× d
  - Activation function: GELU
- Layer normalization: Pre-LayerNorm
- Residual connections around attention and feed-forward blocks
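For orientation, the sketch below shows a minimal pre-LayerNorm encoder block with the hyperparameters listed above (hidden dimension d, h attention heads, ≈4×d feed-forward, GELU). It is an illustrative PyTorch reconstruction of an ESM2-style block, not the packaged implementation.

```python
# Minimal sketch of a pre-LayerNorm Transformer encoder block in the ESM2 style.
# Illustrative only: the released model uses the ESM2 implementation, not this code.
import torch
import torch.nn as nn

class PreLNEncoderBlock(nn.Module):
    def __init__(self, d_model: int = 1280, n_heads: int = 20):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)          # pre-LN before attention
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn_norm = nn.LayerNorm(d_model)           # pre-LN before feed-forward
        self.ffn = nn.Sequential(                       # intermediate dim ≈ 4 × d
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor, padding_mask=None) -> torch.Tensor:
        # Full-sequence, bidirectional self-attention with a residual connection.
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h, key_padding_mask=padding_mask)
        x = x + attn_out
        # Feed-forward block with its own residual connection.
        x = x + self.ffn(self.ffn_norm(x))
        return x
```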
Continual Training Setup
The model is initialized from a pretrained ESM2 checkpoint and further trained via continual pretraining with a masked language modeling (MLM) objective.
Model Scale (Example Configurations)
| Model Variant | Parameters | Layers | Hidden Dim | Attention Heads |
|---|---|---|---|---|
| ESM2-650M | ~650M | 33 | 1280 | 20 |
| ESM2-3B | ~3B | 36 | 2560 | 40 |
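As a hedged illustration of the initialization step, the snippet below loads a public ESM2-650M checkpoint with the `fair-esm` package; the actual DecoderTCR continual-pretraining pipeline is provided in the GitHub repository.

```python
# Sketch: initializing from a pretrained ESM2 checkpoint (fair-esm package).
# The DecoderTCR training pipeline itself lives in the GitHub repository.
import esm

# Load the ESM2-650M backbone (33 layers, 1280 hidden dim, 20 heads).
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()

# Tokenize an example sequence; continual pretraining then resumes MLM updates
# on TCR / peptide-MHC data starting from these weights.
data = [("example", "CASSLGQAYEQYF")]
labels, strs, tokens = batch_converter(data)
print(tokens.shape)  # (batch, sequence length incl. BOS/EOS)
```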
Model Card Authors
Ben Lai
Primary Contact Email
Ben Lai (ben.lai@czbiohub.org). To submit feature requests or report issues with the model, please open an issue on the GitHub repository.
System Requirements
- Compute Requirements: GPU
Intended Use
Primary Use Cases
The DecoderTCR models are designed for the following primary use cases:
- TCR-pMHC Binding Prediction: Predict the interaction between T-cell receptors (TCRs) and peptide-MHC complexes
- Interaction Scoring: Calculate interface energy scores for TCR-pMHC interactions
- Sequence Analysis: Analyze TCR sequences and their interactions with specific peptides
- Immunology Research: Support research in adaptive immunity, T-cell recognition, and antigen presentation
The models are particularly useful for:
- Identifying potential TCR-peptide binding pairs
- Screening TCR sequences for specific antigen recognition
- Understanding the molecular basis of T-cell recognition
- Supporting vaccine design and immunotherapy development
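The scoring interface used for binding prediction and interface energy scores is defined in the GitHub repository. As a generic illustration only, the sketch below computes a masked-token pseudo-log-likelihood over a concatenated TCR/peptide string; the concatenation scheme, checkpoint, and function name are assumptions, not the DecoderTCR API.

```python
# Hypothetical sketch of scoring a TCR-peptide pair by masked-token
# pseudo-log-likelihood. The concatenation scheme and the exact scoring
# function used by DecoderTCR are defined in the GitHub repository;
# this only illustrates the general idea.
import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

def pseudo_log_likelihood(sequence: str) -> float:
    """Sum of log-probabilities of each residue when it is masked in turn."""
    _, _, tokens = batch_converter([("query", sequence)])
    total = 0.0
    with torch.no_grad():
        for i in range(1, tokens.shape[1] - 1):      # skip BOS/EOS
            masked = tokens.clone()
            masked[0, i] = alphabet.mask_idx         # mask one position
            logits = model(masked)["logits"]
            log_probs = torch.log_softmax(logits[0, i], dim=-1)
            total += log_probs[tokens[0, i]].item()
    return total

# Example: a CDR3beta sequence and a peptide joined for illustration only.
score = pseudo_log_likelihood("CASSLGQAYEQYF" + "GILGFVFTL")
print(score)
```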
Out-of-Scope or Unauthorized Use Cases
Do not use the model for the following purposes:
- Use that violates applicable laws, regulations (including trade compliance laws), or third party rights such as privacy or intellectual property rights
- Any use that is prohibited by the MIT license and Acceptable Use Policy.
- Clinical diagnosis or treatment decisions without proper validation
- Direct use in patient care without appropriate clinical validation and regulatory approval
- Use for purposes that could cause harm to individuals or groups
The models are research tools and should not be used as the sole basis for clinical or diagnostic decisions.
Training Data
The models are trained on multiple large-scale protein sequence databases. The training data consists of:
- TCR sequences: Observed T-cell Space (OTS) for paired $\alpha/\beta$ TCR sequences.
- Peptide-MHC sequences: MHC Motif Atlas for peptide-MHC ligandomes, plus high-confidence synthetic interactions predicted with MixMHCpred.
- Paired TCR-pMHC Interactions: VDJdb for paired TCR-pMHC interaction data.
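Purely for illustration, a paired training example drawn from these sources might be organized as follows; the field names and preprocessing are hypothetical, and the real data pipeline is in the GitHub repository.

```python
# Hypothetical sketch of how a single paired TCR-pMHC training record might be
# organized; field names are illustrative and do not reflect the actual
# preprocessing code, which is available in the GitHub repository.
from dataclasses import dataclass

@dataclass
class TCRpMHCRecord:
    tcr_alpha: str   # paired alpha-chain sequence (e.g., from OTS)
    tcr_beta: str    # paired beta-chain sequence
    peptide: str     # presented peptide (e.g., from MHC Motif Atlas / VDJdb)
    mhc: str         # MHC allele or pseudo-sequence

record = TCRpMHCRecord(
    tcr_alpha="CAVRDSNYQLIW",
    tcr_beta="CASSLGQAYEQYF",
    peptide="GILGFVFTL",
    mhc="HLA-A*02:01",
)
```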
Continual Pre-training Strategy
This model is trained using a continual pre-training curriculum that adapts a pretrained ESM2 backbone to new protein domains while preserving previously learned representations.
Overview
Continual pre-training proceeds in multiple stages, each leveraging different data regimes and masking strategies:
- Stage 1 emphasizes abundant marginal sequence data, encouraging robust component-level representations.
- Stage 2 incorporates scarcer, structured, or interaction-rich data, refining conditional dependencies without overwriting earlier knowledge.
The architecture, tokenizer, and objective remain unchanged throughout training; only the data distribution and masking strategy evolve.
Stage 1: Component-Level Adaptation
In the first stage, the model is further pretrained on large collections of unpaired or weakly structured protein sequences relevant to the target domain.
- Objective: Masked Language Modeling (MLM)
- Masking: Component- or region-aware masking schedules that upweight functionally relevant positions
- Purpose:
- Adapt the pretrained ESM2 representations to the target protein subspace
- Learn domain-specific sequence statistics while retaining general protein knowledge
This stage acts as a regularizer, anchoring learning in large-scale marginal data before introducing more complex dependencies.
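A minimal sketch of what a region-aware masking schedule could look like is shown below; the masking rates, the notion of "relevant" positions, and the function itself are illustrative assumptions rather than the released Stage 1 training code.

```python
# Illustrative region-aware MLM masking: positions flagged as functionally
# relevant (e.g., CDR loops) are masked more often than the rest of the
# sequence. Rates and the "relevant" definition are assumptions, not the
# released Stage 1 schedule.
import numpy as np

def region_aware_mask(seq_len, relevant, base_rate=0.15, boost=2.0, rng=None):
    """Return a boolean mask; relevant positions are masked ~boost x more often."""
    rng = rng or np.random.default_rng()
    rates = np.full(seq_len, base_rate)
    rates[relevant] = min(base_rate * boost, 1.0)  # upweight relevant positions
    return rng.random(seq_len) < rates

# Example: a 13-residue region with positions 3-9 treated as relevant.
relevant = np.zeros(13, dtype=bool)
relevant[3:10] = True
print(region_aware_mask(13, relevant).astype(int))
```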
Stage 2: Conditional / Interaction-Aware Refinement
In subsequent stages, the model is continually trained on structured or paired sequences that encode higher-order dependencies (e.g., interactions between protein regions or components).
- Objective: Masked Language Modeling (MLM)
- Masking: Joint masking across interacting regions to encourage cross-context conditioning
- Purpose:
- Refine conditional relationships learned from limited paired data
- Align representations across components without degrading Stage 1 task performance
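Below is a comparable sketch of joint masking across two interacting regions (e.g., a CDR3 loop and its bound peptide); the region boundaries and rates are again illustrative assumptions rather than the released Stage 2 schedule.

```python
# Sketch of Stage 2 joint masking: positions in two interacting regions (e.g., a
# CDR3 loop and the bound peptide) are masked together so the model must
# condition each region on the other. Region boundaries and rates are
# illustrative; the actual schedule is defined in the training code.
import numpy as np

def joint_region_mask(seq_len, region_a, region_b, pair_rate=0.3, rng=None):
    """Mask matched fractions of two interacting regions in the same example."""
    rng = rng or np.random.default_rng()
    mask = np.zeros(seq_len, dtype=bool)
    for region in (region_a, region_b):
        idx = np.arange(region.start, region.stop)
        n = max(1, int(round(pair_rate * len(idx))))
        mask[rng.choice(idx, size=n, replace=False)] = True
    return mask

# Example: TCR beta CDR3 occupies positions 0-12, peptide positions 13-21.
mask = joint_region_mask(22, slice(0, 13), slice(13, 22))
print(mask.astype(int))
```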
Biases, Risks, and Limitations
Potential Biases
- The model may reflect biases present in the training data, including:
  - Overrepresentation of certain HLA alleles or peptide types
  - Limited diversity in TCR sequences from specific populations
  - Bias toward well-studied antigen systems
- Certain TCR clonotypes or peptide types may be underrepresented in training data
Risks
Areas of risk may include but are not limited to:
- Inaccurate predictions: The model may produce incorrect binding predictions, especially for novel sequences or rare HLA-peptide combinations
- Overconfidence: The model may assign high confidence to predictions that are actually uncertain
- Biological misinterpretation: Users may misinterpret model outputs as definitive biological facts rather than predictions
- Clinical misuse: Use in clinical settings without proper validation could lead to incorrect treatment decisions
Limitations
- Sequence length: Inputs are limited to a maximum sequence length (typically ~1024 tokens)
- Novel sequences: Performance may degrade on sequences very different from training data
- HLA diversity: Limited training data for rare HLA alleles may affect prediction accuracy
- Context dependency: The model may not capture all biological context (e.g., post-translational modifications, cellular environment)
- Computational requirements: GPU is recommended for optimal performance
Caveats and Recommendations
- Review and validate outputs: Always review and validate model predictions, especially for critical applications
- Experimental validation: Model predictions should be validated experimentally before use in research or clinical contexts
- Uncertainty awareness: Be aware that predictions are probabilistic and may have uncertainty
- Domain expertise: Use the model in conjunction with domain expertise in immunology and T-cell biology
- Version tracking: Keep track of which model version and checkpoint you are using
We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when engaging with our services.
Should you have any security or privacy issues or questions related to the services, please reach out to our team at security@chanzuckerberg.com or privacy@chanzuckerberg.com respectively.
Acknowledgements
This model builds upon:
- ESM2 by Meta AI (Facebook Research) for the base protein language model
- The broader computational biology and immunology research communities
Special thanks to the developers and contributors of the ESM models and the open-source tools that made this work possible.