dna-origin-classifier

A reference-free, alignment-free classifier that labels a DNA sequence by its source: human, other eukaryote, bacterial, viral, or engineered/synthetic. It loads from a 2 MB safetensors file, runs on a CPU at thousands of reads per second, consults no database, and because the score is an exact linear function of 8-mer counts it is also interpretable, invertible, and certifiable in closed form.

The model is small because the thing it detects is small. Sequence identity, the answer to "what is this", compresses to a few hundred thousand linear weights, and a 2 MB model recovers what an 8B genomic language model knows about origin. Sequence function, the answer to "what does this do", does not compress the same way: the same compact form trained on variant effect reaches far below the large model. That asymmetry runs through this card. It is computable in advance from the data alone, it is machine-checked over the reals, and run across the benchmarks that genomic and protein foundation models are measured on, it separates the tasks those models are credited with into the ones a linear baseline already solves and the ones where the large model actually earns its scale. The classifier is the anchor; the boundary is the result.

Method

The featurizer counts all 65,536 8-mers in a sequence, normalizes them to within-sequence frequency, and divides by a stored per-feature scale. Three discriminatively trained linear heads read that vector:

origin: five-class, human / eukaryote / bacteria / virus / engineered.
host: binary, human versus non-host (bacterial or viral).
engineered: binary, engineered or synthetic versus natural.

The discriminative fit is what separates this from the classical generative k-mer log-odds: on the same features it raises five-class accuracy from 0.63 to 0.71. Every parameter lives in model.safetensors (feature_scale, origin.weight/bias, host.weight/bias, engineered.weight/bias), 524,295 numbers in total. There is nothing else: no embedding table, no nonlinearity, no reference index.

Usage

from model import DnaOriginClassifier
clf = DnaOriginClassifier("model.safetensors")

seq = "ATGGCTAGCAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGGTGATGTT"
clf.classify(seq)          # 'human' | 'eukaryote' | 'bacteria' | 'virus' | 'engineered'
clf.host_score(seq)        # higher = more human/host-like
clf.engineered_score(seq)  # higher = more likely engineered/synthetic

clf.attribute(seq)         # exact per-base contribution to the host score (sums to the score)
clf.certify(seq)           # base substitutions that flip the host call

The repository ships the model with four small tools, each requiring only numpy and safetensors:

design.py generates sequences that maximize or minimize a head, either from scratch or by synonymous codon choice with the encoded protein held fixed.
dna_filter.py is a FASTQ/FASTA read filter built on the host head, in two modes: host depletion for pathogen enrichment, and human removal for privacy.
boundary.py computes the composition-solvability distance D for two classes of sequences, or for many, with no training. It is the criterion behind the boundary sections below.
Detector.v and CompositionBoundary.v are Rocq proofs, described where they are used.

Evaluation

On the test and novel-taxa splits of dna-origin-benchmark (24,950 fragments, 382 organisms, cluster-level splits so that train and test share no near-duplicate cluster):

task	head	test	novel taxa
human vs non-host	`host`	0.993	0.990
engineered vs natural	`engineered`	0.919	0.896

Five-class origin accuracy is 0.708 against a random baseline of 0.20. The novel-taxa column holds out entire clades, so the small drop from test to novel measures generalization to organisms never seen in training rather than memorization of databased sequence.

How it compares

On the same splits and tasks:

Database tools. Kraken2 against RefSeq matches the model on databased taxa (host sensitivity 0.972, specificity 1.000). On sequence absent from every database, Kraken2 classifies 0% of reads and BLAST 6.6%, while this model calls 100%, because it needs no reference to consult.
Learned reference-free models. A fine-tuned 110M genomic language model (DNABERT) matches it on host detection (0.989 against 0.990 on novel clades); a CNN and a BiLSTM trained on the same data land below at 0.94 and 0.97. The linear model reaches that level at roughly 200 times fewer parameters, on a CPU.

What the linear form gives you

Because the score is an exact linear function of 8-mer counts, the model supports operations a black-box classifier cannot do cleanly:

Certified robustness. certify returns, by greedy search with exact per-edit deltas, a count of base substitutions that flips a call, an upper bound on the true adversarial distance: a typical human read takes a median of 3 of 300 to read as non-host, an engineered sequence 4 to pass as natural. The matching lower bound is proved from the linear form and machine-checked in Detector.v (Rocq, over the rationals, no axioms and no admit): one substitution moves the score by at most 2k times the largest effective weight, so no edit within the margin divided by that constant can flip the call.
Exact attribution. attribute decomposes any call into per-base contributions that sum exactly to the score, with no gradient approximation. The same decomposition is the attribution-exactness identity in the proof file.
Inverse design. Coordinate ascent on a head designs sequences to order. From scratch it reaches host_score 22, far past the natural human ceiling of about 6. Holding a protein fixed, synonymous codon choice alone moves the same gene from host_score −15 to +14. The two heads give two synonymous channels at once, so a single gene stamps into any of the four (host, engineered) sign-quadrants with the protein unchanged and the detector reads the two bits back: a protein-preserving provenance watermark.
Weight arithmetic. Detectors compose in weight space. A host detector built purely by algebra on the origin rows, human − ½(bacteria + virus), never trained as such, scores 0.992 against the trained head's 0.993.

The 8-mer atlas

The model's learned weights are published as a standalone dataset, dna-origin-atlas: all 65,536 8-mer weights for each head, annotated with GC and CpG content. The dominant axis is biological. CpG-bearing 8-mers carry a mean host weight of −6.47 against +0.27 for non-CpG 8-mers, so the table encodes the vertebrate CpG-depletion (methylation) signature as a number. The discrimination rests on bulk composition of this kind rather than on a lexicon of named regulatory motifs.

The partition: identity is shallow, function is deep

One asymmetry runs through everything here. Sequence identity is shallow: it compresses to this 2 MB linear model, separable and invertible, reaching 0.99 AUROC on human versus non-host, and the learned origin space separates bacteria, eukaryotes, and viruses without a database. Sequence function is deep: a compact model trained on labeled variants reaches only 0.56 on variant effect, far below the 0.85 of the 8B language model, so the function signal does not compress from supervision the way identity does.

Origin-head weights alone separate bacteria, eukaryotes, and viruses in a PCA of the model's learned space; the single human point sits far from all three. No reference database is consulted.

The split is not specific to DNA. Building the same composition classifier on other molecules:

molecule	identity (composition)	function (composition)
DNA	0.99	0.56
protein	0.91	0.73
RNA	1.00	n/a

Identity compresses to composition across alphabets; function does not. The protein function entry reads 0.73 rather than chance because the amino-acid substitution itself carries chemical signal, but its information fraction (0.14) stays below the boundary, and the context-dependent part is absent. The boundary is predictable in information terms: a task is composition-solvable when the mutual information between the k-mer spectrum and its label exceeds about 0.2 of the label entropy. Across ten tasks spanning DNA and protein, a single threshold near 0.20 separates every identity task from every function task, and the two cases whose side is not clear in advance land where it predicts. Human against other eukaryotes, the hardest identity task at 0.26, stays above; missense pathogenicity, the strongest function signal at 0.14, stays below; pathogenicity read from the surrounding composition alone sits at chance.

Composition AUROC against the captured information fraction. Identity tasks (blue) lie above the threshold, function tasks (red) below, with DNA and protein on the same axis.

Generation makes the separation concrete. Coordinate ascent on the host head designs sequences with host_score up to 22, against a natural ceiling of about 6, and these designs sit far outside the region of composition space occupied by any natural class, where the detector saturates but no biology lives. The 8B language model rates those same designs as increasingly unnatural, so composition and neural naturalness are separable axes rather than two views of one thing. ADVERSARIAL.md covers the order-(k-1) evasion boundary and why a language model resists it; TOOL.md covers footprint, throughput, read length, and operating modes.

Natural sequences of every origin class occupy one connected region of composition space (center); coordinate-ascent designs (stars) land far outside it, where the host head saturates but no biology lives.

The composition boundary

The partition has a boundary that can be computed before any training and a proof that it holds. View a length-L sequence as N = L - k + 1 draws from a class-conditional k-mer distribution, which is all a composition classifier ever sees. The Bhattacharyya coefficient between the two classes' count distributions then factorizes exactly to rho^N, with rho the per-k-mer overlap sum of sqrt(theta_0 theta_1); the Bayes error of any count-based classifier is at most half of rho^N; and for a balanced label the mutual information equals the Jensen-Shannon divergence between the count distributions. Solvability is therefore set by D = -N ln rho, read off the class distributions with no classifier. Equal distributions give rho = 1 and chance; rho below 1 gives an error that vanishes with length. These four statements are machine-checked in CompositionBoundary.v (Rocq, over Coq's standard reals, no admit).

The first statement of the boundary treats overlapping k-mers as independent, which overstates separability where the sequence is correlated. The proof file also carries the correlated generalization: model the sequence as an order-1 Markov chain over states, with amp the initial Bhattacharyya amplitude and a transfer matrix whose entries are per-transition Bhattacharyya weights, and the path coefficient is the transfer-matrix evaluation, whose per-step growth is the top eigenvalue of that matrix rather than the scalar rho. The generalization reduces to rho^N exactly when the chain has no memory, which is the iid case, and that reduction is the machine-checked theorem. Recomputed at order 5, the distance tracks measured performance more closely (rank correlation with AUROC 0.92 against 0.87 for the iid form) and the worst overstatement, the enhancer tasks, falls from a distance of 9.0 to 2.3, in line with their 0.79 AUROC.

Left, the iid distance against measured AUROC; right, the order-5 Markov distance. The correlated form pulls the over-separated tasks back toward the line and raises the rank correlation.

A foundation model against the linear baseline

Run the criterion across the sixteen binary Nucleotide Transformer tasks, the genomic benchmark these models are sold on, and D predicts task by task whether composition suffices. A fine-tuned 110M DNABERT bears it out. On the promoters, where D is high and the 2 MB linear model already reaches 0.96, DNABERT adds only 0.02 to 0.04: a near tie, with little room for any model. The splice donor and acceptor sites are the opposite extreme: at D near 1 the linear model reaches about 0.75 while DNABERT reaches 0.99, a gain near a quarter of an AUROC, because the motif that separates the classes is positional and sits beneath the composition a bag of k-mers reads. The histone marks fall in between, with gains of 0.05 to 0.10 at intermediate D. Averaged over the benchmark, DNABERT's gain over the linear model is +0.097 below the boundary against +0.054 above it, and a full-data fine-tune on the decisive tasks (splice at 0.99, promoter at 0.99) reproduces the split. The foundation model's distinct capability concentrates where the criterion places it, scaling with how far below the boundary the task sits.

Left, D against test AUROC for the linear model and DNABERT across the sixteen tasks; right, the DNABERT-minus-linear gain against D, concentrated below the threshold, splice sites foremost.

The same line holds across suites and molecules. D and the linear baseline reproduce the partition over the Genomic Benchmarks and the available GUE tasks as well as over the Nucleotide Transformer set, about thirty tasks in three suites. Real foundation models confirm the cross-molecule reading: amino-acid composition separates eukaryote from bacterium above the boundary (D 4.1, where an ESM probe adds 0.05) but cannot resolve subcellular localization below it (D_min 0.4, where ESM adds 0.26); for RNA, composition and an RNA-FM probe both clear the human-versus-yeast non-coding identity task above the boundary (D 15, RNA-FM ahead by 0.03). In every molecule the learned model adds little above the line and most of its value below it.

Composition AUROC against D for DNA tasks from three suites plus the protein and RNA identity points, all on one axis.

A forward test fixes the direction. On a three-class splice dataset from a different source than any used to set the threshold, D was computed first and predicted that composition cannot separate any pair of classes (D between 0.13 and 0.22); the linear baseline then scored 0.53 to 0.66, confirming all three calls before a model was run. Splice sounds tractable, and the criterion calls its failure for composition in advance.

The consequence is a re-scoring rather than a new tool. A sequence-classification benchmark can be ordered by D with no training; its above-boundary entries are k-mer baselines that a 2 MB model matches, and a foundation model's distinct contribution is whatever it adds on the entries below. Variant and fitness assays such as ProteinGym are mutant ensembles rather than composition samples, so they sit outside this model and in the deep regime by construction. The full GUE set, BEND, and an RNA function benchmark need data not mirrored on HuggingFace, so RNA enters here on the identity side.

Screening and hybrid use

Certified screening frontier. To pass an engineered construct as natural, an adversary must make a median of 4 base substitutions (46% need 5 or more) or reproduce the full order-7 composition of a natural class. Detection holds for 60% of constructs after the best 3-edit attack and for none after 9 edits. Composition screening is therefore sound against light point edits and blind to a construct that reproduces natural composition, and the boundary states precisely where that blindness begins.
Strictly dominant hybrid. Composing this detector with a database method, taking the database call where it classifies and the composition call otherwise, reaches 0.994 correct on a mixed databased and undatabased set, above Kraken2 alone (0.337, blind to undatabased sequence) and this model alone (0.980), and by construction never worse than either component.

Calibration

The heads output linear margins, not calibrated probabilities. Use the argmax of logits for origin and the ranking or sign of host_score and engineered_score for the binary tasks.

Training data

All heads are fit on the train split of dna-origin-benchmark: RefSeq coding sequences across roughly 370 taxa, NCBI synthetic-construct and vector records, synonymous recodings, and environmental metagenome fragments. Splits are drawn at the cluster level so that evaluation never sees a near-duplicate of a training sequence, and a novel-taxa split holds out whole clades.

Files in this repository

model.safetensors, model.py: the classifier and its loader.
boundary.py: the composition-solvability distance D, for two classes or for many, no training.
Detector.v: machine-checked certified robustness and exact attribution for the linear detector.
CompositionBoundary.v: the boundary theorem (the factorization, the Bayes-error bound, the Jensen-Shannon identity, the dichotomy) and its Markov-correlated generalization.
design.py, dna_filter.py: inverse design and the read filter.
ADVERSARIAL.md, TOOL.md: the evasion analysis and the deployment characteristics.
Figures: tree_of_life.png, partition_law.png, analytical_crossover.png, manifold.png, composition_boundary_fm.png, cross_suite.png, markov_d.png.

References

Base model (derived by analysis): HuggingFaceBio/Carbon-8B. This classifier was derived from Carbon-8B by identifying the closed-form k-mer statistic that reproduces its host and identity discrimination. The derivation is analytical: it holds no Carbon weights or outputs and is fit to public NCBI sequence, hence MIT.
Benchmark and baselines: phanerozoic/dna-origin-benchmark and phanerozoic/dna-origin-atlas.
Foundation-model benchmarks compared against: the Nucleotide Transformer downstream tasks, the Genomic Benchmarks, GUE, with DNABERT, ESM-2, and RNA-FM as the learned models.
Background: k-mer composition classifiers for sequence origin (RDP Classifier, Wang et al. 2007; genomic signatures, Karlin and Burge 1995).

License

MIT.

Downloads last month: 1

Safetensors

Model size

524k params

Tensor type

F32

Model tree for phanerozoic/dna-origin-classifier

Base model

HuggingFaceBio/Carbon-8B

Finetuned

(1)

this model

Dataset used to train phanerozoic/dna-origin-classifier

Evaluation results

AUROC (test) on dna-origin-benchmark
test set self-reported

0.993
AUROC (novel taxa) on dna-origin-benchmark
test set self-reported

0.990
AUROC (test) on dna-origin-benchmark
test set self-reported

0.919
AUROC (novel taxa) on dna-origin-benchmark
test set self-reported

0.896
Accuracy (test, 5-class) on dna-origin-benchmark
test set self-reported

0.708