---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- genomics
- virology
- dnabert
- foundation-model
- hvilm
- pathogenicity
- transmissibility
- host-tropism
- viral-genomics
datasets:
- VIRION
- BV-BRC
- VHDB
- duttaprat/HVUE
pipeline_tag: feature-extraction
widget:
- text: "ATGCGTACGTTAGCCGATCG"
  example_title: "Viral Sequence Example"
---

# HViLM-base: A Foundation Model for Viral Genomics

<div align="center">

[Code](https://github.com/duttaprat/HViLM) · [License](LICENSE) · [Model on Hugging Face](https://huggingface.co/duttaprat/HViLM-base)

</div>

## Model Description

**HViLM (Human Virome Language Model)** is the first foundation model specifically designed for comprehensive viral risk assessment through multi-task prediction of pathogenicity, host tropism, and transmissibility. Built through continued pre-training of [DNABERT-2](https://github.com/MAGICS-LAB/DNABERT_2) on 5 million viral genome sequences from the [VIRION database](https://virion.verena.org), HViLM captures universal viral genomic patterns relevant to human disease risk assessment.

**Paper**: *HViLM: A Foundation Model for Viral Genomics Enables Multi-Task Prediction of Pathogenicity, Transmissibility, and Host Tropism* (RECOMB 2026)

**Authors**: Pratik Dutta, Jack Vaska, Pallavi Surana, Rekha Sathian, Max Chao, Zhihan Zhou, Han Liu, and Ramana V. Davuluri

**Code & Benchmarks**: [GitHub Repository](https://github.com/duttaprat/HViLM)

---

## Key Features

- 🦠 **Viral-specialized pre-training** on 5M sequences from 10.8M genomes spanning 45+ viral families
- 🎯 **Multi-task predictions** across 3 epidemiologically critical tasks:
  - **Pathogenicity classification**: 95.32% average accuracy
  - **Host tropism prediction**: 96.25% accuracy
  - **Transmissibility assessment**: 97.36% average accuracy
- 📊 **[HVUE Benchmark](https://huggingface.co/datasets/duttaprat/HVUE)**: 7 curated datasets totaling 60K+ viral sequences
- 🔍 **Mechanistic interpretability**: identifies transcription factor binding site mimicry (42 conserved motifs)
- ⚡ **Parameter-efficient fine-tuning**: LoRA adaptation (~0.3M trainable parameters per task)
- 🚀 **State-of-the-art performance**: outperforms Nucleotide Transformer, GENA-LM, and DNABERT-MB

---

## Model Architecture

HViLM is built upon **DNABERT-2** (117M parameters), which uses the MosaicBERT architecture with:
- **Tokenization**: Byte Pair Encoding (BPE) with a vocabulary size of 4,096
- **Max sequence length**: 1,000 base pairs
- **Hidden size**: 768
- **Attention heads**: 12
- **Layers**: 12
- **Positional encoding**: Attention with Linear Biases (ALiBi)
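
To see the BPE tokenizer in action, the short sketch below prints the tokens for a toy sequence; the exact subwords depend on the released vocabulary.

```python
from transformers import AutoTokenizer

# Load the BPE tokenizer that ships with the checkpoint
tokenizer = AutoTokenizer.from_pretrained("duttaprat/HViLM-base", trust_remote_code=True)

tokens = tokenizer.tokenize("ATGCGTACGTTAGCCGATCG")
print(tokens)                 # variable-length BPE subwords, not fixed k-mers
print(tokenizer.vocab_size)   # expected: 4096
```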

**Continued pre-training** (see the sketch below):
- **Objective**: Masked Language Modeling (MLM)
- **Training data**: 5M viral sequence chunks (non-overlapping, 1,000 bp)
- **Data source**: VIRION database (clustered at 80% identity with MMseqs2)
- **Training**: 10 epochs, AdamW optimizer, learning rate 5e-5
- **Hardware**: 4x NVIDIA A100 GPUs (72 hours)
- **Performance**: 94.2% MLM accuracy on the validation set
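
For readers who want to approximate the continued pre-training recipe, here is a minimal sketch using masked language modeling with the Hugging Face `Trainer`. The toy dataset, batch size, and masking rate are placeholder assumptions; the paper's exact pipeline lives in the GitHub repository.

```python
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM, AutoTokenizer,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

# Start from the DNABERT-2 checkpoint, as HViLM did
tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)

# Toy stand-ins for the 5M non-overlapping 1,000 bp viral chunks
seqs = ["ATGCGTACGTTAGCCGATCG", "TTGACCGGTACGATCGATCG"]
train_dataset = Dataset.from_dict({"text": seqs}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

# Random masking for the MLM objective (15% is the BERT default, assumed here)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="hvilm-continued-pretraining",
    num_train_epochs=10,            # from the model card
    learning_rate=5e-5,             # from the model card
    per_device_train_batch_size=2,  # placeholder; not reported here
)

Trainer(model=model, args=args, data_collator=collator, train_dataset=train_dataset).train()
```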

---

## Installation

```bash
pip install transformers torch
# Optional extras used by the examples below:
pip install peft datasets
```

---

## Quick Start

### Basic Usage: Extract Sequence Embeddings

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "duttaprat/HViLM-base",
    trust_remote_code=True  # required for the custom architecture
)
model = AutoModel.from_pretrained(
    "duttaprat/HViLM-base",
    trust_remote_code=True
)

# Example: get embeddings for a viral sequence
viral_sequence = "ATGCGTACGTTAGCCGATCGATTACGCGTACGTAGCTAGCTAGCT"

# Tokenize
inputs = tokenizer(
    viral_sequence,
    return_tensors="pt",
    truncation=True,
    max_length=512,
    padding=True
)

# Generate embeddings
with torch.no_grad():
    outputs = model(**inputs)
# Depending on the custom remote code, the forward pass may return a plain
# tuple instead of a ModelOutput, so handle both cases
embeddings = outputs[0] if isinstance(outputs, tuple) else outputs.last_hidden_state
# embeddings: [batch_size, seq_len, 768]

print(f"Sequence embeddings shape: {embeddings.shape}")

# Mean pooling for a sequence-level representation
attention_mask = inputs['attention_mask']
mask_expanded = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
sum_embeddings = torch.sum(embeddings * mask_expanded, dim=1)
sum_mask = torch.clamp(mask_expanded.sum(dim=1), min=1e-9)
mean_embeddings = sum_embeddings / sum_mask

print(f"Mean sequence embedding shape: {mean_embeddings.shape}")  # [batch_size, 768]
```
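
As a quick usage example, mean-pooled embeddings can be compared with cosine similarity, for instance to relate two viral sequences. The helper below reuses the `tokenizer` and `model` loaded above; the pair of short sequences is purely illustrative.

```python
import torch
import torch.nn.functional as F

def embed(sequence: str) -> torch.Tensor:
    """Return the mean-pooled HViLM embedding for one DNA sequence."""
    inputs = tokenizer(sequence, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    hidden = outputs[0] if isinstance(outputs, tuple) else outputs.last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# Illustrative sequences; cosine similarity lies in [-1, 1]
emb_a = embed("ATGCGTACGTTAGCCGATCG")
emb_b = embed("ATGCGTACGTTAGCCGATCC")
print(f"Cosine similarity: {F.cosine_similarity(emb_a, emb_b).item():.4f}")
```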

### Fine-tuning on Your Own Task

For fine-tuning HViLM on custom viral classification tasks, please refer to the [GitHub repository](https://github.com/duttaprat/HViLM) for complete training scripts and examples.

```python
# Example fine-tuning setup (see GitHub for complete code)
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

# Load base model
model = AutoModel.from_pretrained("duttaprat/HViLM-base", trust_remote_code=True)

# Configure LoRA for parameter-efficient fine-tuning
lora_config = LoraConfig(
    r=8,                                # rank
    lora_alpha=16,                      # scaling factor
    target_modules=["query", "value"],  # attention projections
    lora_dropout=0.1,
    bias="none"
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # ~0.3M trainable parameters

# Add a classification head and train (see GitHub for details)
```
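
Since the repository holds the full training code, the snippet below is only a minimal sketch of one way to attach a classification head: mean-pool the backbone's hidden states and feed them to a linear layer. The pooling choice and head are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

class HViLMClassifier(nn.Module):
    """Illustrative wrapper: mean-pooled HViLM embeddings -> linear head."""

    def __init__(self, backbone, num_labels: int = 2, hidden_size: int = 768):
        super().__init__()
        self.backbone = backbone                      # LoRA-wrapped HViLM from above
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        hidden = outputs[0] if isinstance(outputs, tuple) else outputs.last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()   # zero out padding positions
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        return self.classifier(pooled)                # logits: [batch_size, num_labels]

clf = HViLMClassifier(model, num_labels=2)  # e.g. pathogenic vs. non-pathogenic
```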

---

## Performance on HVUE Benchmark

### Pathogenicity Classification

| Dataset | Sequences | Accuracy | F1 (%) | MCC (×100) |
|---------|-----------|----------|--------|------------|
| CINI | 159 | **87.74%** | 86.98 | 74.48 |
| BVBRC-CoV | 18,066 | **98.26%** | 98.26 | 96.52 |
| BVBRC-Calici | 31,089 | **99.95%** | 99.93 | 99.90 |
| **Average** | **49,314** | **95.32%** | **95.06** | **90.30** |

### Host Tropism Prediction

| Dataset | Sequences | Accuracy | F1 (%) | MCC (×100) |
|---------|-----------|----------|--------|------------|
| VHDB | 9,428 | **96.25%** | 91.34 | 91.24 |

### Transmissibility Assessment (R₀-based Classification)

| Viral Family | Sequences | Accuracy | F1 (%) | MCC (×100) |
|--------------|-----------|----------|--------|------------|
| Coronaviridae | ~3,000 | **97.45%** | 97.37 | 93.43 |
| Orthomyxoviridae | ~2,500 | **95.62%** | 95.44 | 91.07 |
| Caliciviridae | ~1,800 | **99.95%** | 99.95 | 99.90 |
| **Average** | **~7,300** | **97.36%** | **97.59** | **94.80** |

**Comparison with baselines**: HViLM consistently outperforms Nucleotide Transformer 500M-1000g, GENA-LM, and DNABERT-MB across all three tasks.

---

## Interpretability: Transcription Factor Mimicry

HViLM's attention mechanisms reveal biologically meaningful pathogenicity determinants through **molecular mimicry of host regulatory elements**:

- **42 conserved motifs** identified in high-attention regions of pathogenic coronaviruses
- **10 vertebrate transcription factors** targeted, including:
  - **Irf1** (Interferon Regulatory Factor 1): 8 convergent motifs for immune evasion
  - **Foxq1**: multiple motifs for epithelial cell tropism
  - **ZNF354A**: 6 motifs for chromatin regulation

This demonstrates that HViLM captures genuine biological mechanisms rather than spurious correlations.

---

## Training Data

### Pre-training Corpus

- **Source**: [VIRION database](https://virion.verena.org) (476,242 virus-host associations)
- **Genomes**: 10,817,265 unique NCBI accession numbers
- **Processing** (see the sketch below):
  - Segmented into non-overlapping 1,000 bp chunks
  - Clustered with MMseqs2 at an 80% identity threshold
- **Final dataset**: 5 million unique sequences
- **Coverage**: 45+ viral families across all Baltimore classification groups
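
The chunking step is straightforward to reproduce; the sketch below shows one way to segment a genome into non-overlapping 1,000 bp pieces (whether the pipeline keeps the short final chunk is not specified here). The subsequent redundancy reduction used MMseqs2 at 80% identity, e.g. something like `mmseqs easy-cluster seqs.fasta result tmp --min-seq-id 0.8`.

```python
def chunk_genome(seq: str, chunk_size: int = 1000) -> list:
    """Split one genome into non-overlapping chunks of `chunk_size` bp."""
    return [seq[i:i + chunk_size] for i in range(0, len(seq), chunk_size)]

# Toy example: a 2,400 bp genome yields chunks of 1,000 / 1,000 / 400 bp
genome = "ACGT" * 600
print([len(c) for c in chunk_genome(genome)])  # [1000, 1000, 400]
```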

---

## HVUE Benchmark Datasets

The **Human Virome Understanding Evaluation (HVUE)** benchmark consists of 7 curated datasets:

### Pathogenicity Prediction (3 datasets)
- **CINI**: 159 sequences, 4 viral families, manually curated from the literature
- **BVBRC-CoV**: 18,066 coronavirus sequences
- **BVBRC-Calici**: 31,089 calicivirus sequences

### Host Tropism Prediction (1 dataset)
- **VHDB**: 9,428 sequences, 30 viral families
- Binary classification: human-tropic (13.1%) vs. non-human-tropic (86.9%)

### Transmissibility Prediction (3 datasets)
- **Coronaviridae**: R₀-based classification (R₀ < 1 vs. R₀ ≥ 1)
- **Orthomyxoviridae**: R₀-based classification
- **Caliciviridae**: R₀-based classification

All datasets are available at **[🤗 duttaprat/HVUE](https://huggingface.co/datasets/duttaprat/HVUE)**.

### Download and Use
```python
from datasets import load_dataset

# Load a specific task
host_tropism = load_dataset("duttaprat/HVUE", data_dir="Host_Tropism")
pathogenicity = load_dataset("duttaprat/HVUE", data_dir="Pathogenecity")
transmissibility = load_dataset("duttaprat/HVUE", data_dir="Transmissibility")

# Load a specific split
train_data = load_dataset("duttaprat/HVUE", data_files="Host_Tropism/train.csv")
```
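
To feed HVUE sequences into HViLM, tokenize the loaded split. The column names below (`sequence`) are an assumption for illustration; inspect the dataset card or `column_names` for the actual schema.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("duttaprat/HViLM-base", trust_remote_code=True)
host_tropism = load_dataset("duttaprat/HVUE", data_dir="Host_Tropism")

# Column name is an assumption; check host_tropism["train"].column_names first
def tokenize_batch(batch):
    return tokenizer(batch["sequence"], truncation=True, max_length=512)

tokenized = host_tropism.map(tokenize_batch, batched=True)
print(tokenized)
```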

---

## Reproducing Paper Results

### Step 1: Download the HVUE Benchmark
```python
from datasets import load_dataset

# Download all three task datasets
host_tropism = load_dataset("duttaprat/HVUE", data_dir="Host_Tropism")
pathogenicity = load_dataset("duttaprat/HVUE", data_dir="Pathogenecity")
transmissibility = load_dataset("duttaprat/HVUE", data_dir="Transmissibility")
```

### Step 2: Fine-tune and Evaluate

To reproduce the results reported in the paper, clone the repository and follow the fine-tuning instructions:

```bash
# Clone repository
git clone https://github.com/duttaprat/HViLM.git
cd HViLM

# Install dependencies
pip install -r requirements.txt

# Reproduce pathogenicity results on the CINI dataset
cd finetune
bash scripts/run_patho_cini.sh

# Reproduce host tropism results
bash scripts/run_tropism_vhdb.sh

# Reproduce transmissibility results
bash scripts/run_r0_coronaviridae.sh
```

For detailed instructions, see the [GitHub repository](https://github.com/duttaprat/HViLM).

---

## Citation

If you use HViLM in your research, please cite our paper:

```bibtex
@article{dutta2025hvilm,
  title={HViLM: A Foundation Model for Viral Genomics Enables Multi-Task Prediction of Pathogenicity, Transmissibility, and Host Tropism},
  author={Dutta, Pratik and Vaska, Jack and Surana, Pallavi and Sathian, Rekha and Chao, Max and Zhou, Zhihan and Liu, Han and Davuluri, Ramana V.},
  journal={Submitted to RECOMB},
  year={2025},
  note={Under review}
}
```

If you use DNABERT-2 (the base model), please also cite:

```bibtex
@inproceedings{zhou2023dnabert2,
  title={DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome},
  author={Zhou, Zhihan and Ji, Yanrong and Li, Weijian and Dutta, Pratik and Davuluri, Ramana and Liu, Han},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2024}
}
```

---

## Model Card Authors

- **Pratik Dutta** (Senior Research Scientist, Stony Brook University)
- **Ramana V. Davuluri** (Professor, Stony Brook University)

---

## Contact

- **Email**: pratik.dutta@stonybrook.edu
- **Lab**: [Davuluri Lab, Stony Brook University](https://davulurilab.github.io/)
- **GitHub Issues**: [Report bugs or request features](https://github.com/duttaprat/HViLM/issues)

---

## Acknowledgments

This work builds upon [DNABERT-2](https://github.com/MAGICS-LAB/DNABERT_2) by Zhou et al. The pre-training data come from the [VIRION database](https://virion.verena.org), maintained by the Viral Emergence Research Initiative (Verena).

---

## License

This model is released under the **Apache License 2.0**.

---
| | ## Disclaimer |
| |
|
| | HViLM is a research tool for computational biology and should not be used as the sole basis for clinical or public health decisions. Predictions should be validated through experimental methods and expert analysis. |