stratabert-tiny-smoke

Model Summary

This is a StrataBERT diagnostic checkpoint from run_001. Claim status: diagnostic_only. It is not a release-quality checkpoint and must not be used for public quality or efficiency claims.

Architecture

tokens -> embeddings -> [global attention / bidirectional SSM / local attention]* -> mask-aware pooling -> task head

Architecture class: StrataBertForSequenceClassification. Layer types: ['global_attention', 'ssm', 'local_attention']. Hidden size: 48. Max positions: 128.
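
The pipeline above ends in mask-aware pooling, but the card does not say which pooling operator is used. The following is a minimal sketch assuming masked mean pooling, written in plain Python so it runs without any framework. The function name `masked_mean_pool` is hypothetical and is not taken from the StrataBERT code.

```python
from typing import List


def masked_mean_pool(hidden: List[List[float]], mask: List[int]) -> List[float]:
    """Average token vectors where mask == 1 and ignore padding positions.

    Assumption: mean pooling. The card only says "mask-aware pooling".
    """
    if len(hidden) != len(mask):
        raise ValueError("hidden and mask must have the same length")
    dim = len(hidden[0]) if hidden else 0
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden, mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    if count == 0:
        # All-padding sequence: return zeros instead of dividing by zero.
        return total
    return [t / count for t in total]


# Two real tokens followed by one padding token whose values must not leak in.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

The padding vector is excluded from both the sum and the divisor, which is what makes the pooled output independent of how much padding a batch adds.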

Parameter Count

Total parameters: 2498404.
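
As a rough sanity check on this total, the arithmetic below estimates how much of it a token embedding table would account for. It assumes a single unfactorized embedding matrix of shape vocab_size x hidden_size, which this card does not confirm; the vocabulary size and hidden size are the values listed elsewhere in this card.

```python
VOCAB_SIZE = 50368      # from the Tokenizer entry of this card
HIDDEN_SIZE = 48        # from the Architecture section
TOTAL_PARAMS = 2498404  # from the Parameter Count section

# Assumption: one unfactorized vocab_size x hidden_size embedding matrix.
embedding_params = VOCAB_SIZE * HIDDEN_SIZE
remaining_params = TOTAL_PARAMS - embedding_params
embedding_share = embedding_params / TOTAL_PARAMS

print(f"embedding parameters: {embedding_params}")
print(f"remaining parameters: {remaining_params}")
print(f"embedding share: {embedding_share:.1%}")
```

Under that assumption the embedding table would hold 2,417,664 parameters, roughly 97% of the total, leaving about 80,000 for the encoder layers and task head.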

Training Data

Data artifacts:

train_index: data/eval_frozen/run_001/ag_news_train_index_sample64.json
eval_index: data/eval_frozen/run_001/ag_news_eval_index_sample200.json

Raw text is not embedded in this card or the frozen eval indices.

Objective Mix

task: 1.0

Teacher Models

No teacher model is used for this checkpoint.

Licenses

Project code license: MIT. Dataset audit summary:

ag_news_v001: restricted_noncommercial_unclear; No standard permissive license is declared.
arxiv_classification_v001: needs_review_full_text_rights; Selected HF repo does not declare a data license.
bc5cdr_v001: needs_review_bc5cdr_tner_mirror; No source-license research entry is present; manifest note: Canonical bigbio/bc5cdr script is disabled by current datasets versions; executable manifest uses TNER BC5CDR converted parquet.
conll2003_v001: restricted_avoid_publication_claims; Highest-risk MVP dataset because the source text is Reuters copyrighted newswire.
eurlex57k_v001: needs_review_lexglue_eurlex; No source-license research entry is present; manifest note: HF datasets metadata inspected with datasets.load_dataset_builder('coastalcph/lex_glue', 'eurlex') on 2026-06-10.
hyperpartisan_news_v001: needs_review_hyperpartisan_mirror; No source-license research entry is present; manifest note: HF parquet metadata inspected on 2026-06-10 via jonathanli/hyperpartisan-longformer-split.
imdb_v001: restricted_noncommercial_unclear; HF license tag is `other`, not a permissive license.
openpii_1m_v001: approved_cc_by_4_0_attribution_required; No source-license research entry is present; manifest note: HF datasets metadata inspected with datasets.load_dataset_builder('ai4privacy/pii-masking-openpii-1m', 'default') on 2026-06-10.
patent_classification_v001: needs_review_mirror_license; The selected ccdv sample repo does not declare its own license.
pubmed_200k_rct_v001: needs_review_pubmed_rct_mirror; No source-license research entry is present; manifest note: HF parquet metadata inspected on 2026-06-10.
scicite_v001: needs_review_allenai_scicite; No source-license research entry is present; manifest note: Legacy dataset script is disabled by current datasets versions; executable manifest uses HF converted parquet files.
twenty_newsgroups_v001: needs_review_dataset_card_blank; No source-license research entry is present; manifest note: HF parquet metadata inspected on 2026-06-10 via refs/convert/parquet.
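
Each audit entry above follows a `name: status; note` shape, so it can be screened mechanically. The sketch below is illustrative only: the helper names are hypothetical, the sample lines are abridged from the list above, and treating only statuses that start with `approved` as cleared is an assumption rather than a documented project rule.

```python
def parse_audit_line(line: str) -> dict:
    """Split 'name: status; note' into its three fields."""
    name, _, rest = line.partition(": ")
    status, _, note = rest.partition("; ")
    return {"name": name, "status": status, "note": note}


def is_approved(entry: dict) -> bool:
    # Assumption: only statuses starting with "approved" count as cleared.
    return entry["status"].startswith("approved")


# Abridged sample of the audit lines in this card.
lines = [
    "ag_news_v001: restricted_noncommercial_unclear; No standard permissive license is declared.",
    "openpii_1m_v001: approved_cc_by_4_0_attribution_required; No source-license research entry is present.",
    "conll2003_v001: restricted_avoid_publication_claims; Highest-risk MVP dataset.",
]
entries = [parse_audit_line(line) for line in lines]
blocked = [entry["name"] for entry in entries if not is_approved(entry)]
print(blocked)
```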

Intended Uses

Local smoke testing of StrataBERT checkpoint loading, evaluation scripts, and metadata plumbing.
Reproducibility checks for run_001 diagnostic artifacts.

Out-of-Scope Uses

Public benchmark claims.
Production classification or token-classification deployment.
Commercial reuse of dataset-derived behavior without legal review of the relevant datasets.

Evaluation

metric	value
`accuracy`	0.26
`macro_f1`	0.10317460317460318
`weighted_f1`	0.10730158730158731
`loss`	1.3858718490600586

Evaluation artifact: checkpoints/run_001/tiny_ag_news_smoke.
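
These numbers sit at chance level for AG News, which has four classes. The arithmetic below compares the reported loss with the cross-entropy of a uniform guess, and shows that the reported F1 scores are what a hypothetical constant predictor (one label for every example) would produce at 0.26 accuracy. This is consistent with, but does not prove, a model that predicts a single class.

```python
import math

NUM_CLASSES = 4   # AG News has four topic classes
ACCURACY = 0.26   # reported accuracy

# Cross-entropy of a uniform guess over four classes.
uniform_loss = math.log(NUM_CLASSES)

# Hypothetical constant predictor: every example gets the same label, so
# recall for that class is 1.0 and its precision equals the accuracy.
precision, recall = ACCURACY, 1.0
f1_predicted_class = 2 * precision * recall / (precision + recall)

# The three never-predicted classes score F1 = 0.
macro_f1 = f1_predicted_class / NUM_CLASSES
# The predicted class makes up a 0.26 share of the support.
weighted_f1 = ACCURACY * f1_predicted_class

print(f"uniform loss: {uniform_loss:.4f}")
print(f"macro F1:     {macro_f1:.6f}")
print(f"weighted F1:  {weighted_f1:.6f}")
```

The uniform-guess loss is ln(4), about 1.3863, against the reported 1.3859, and the derived macro and weighted F1 values match the table.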

Length-Bucketed Results

bucket	support	accuracy
`0_512`	200	0.26

Latency and Memory

item	value
device	cpu
batch size	2
sequence length	128
p50 latency ms	10.763351499917917
p95 latency ms	12.447670099209063
latency 95% CI ms	0.6102587742635365
examples/sec	180.17026675821398
tokens/sec	23061.79414505139
OOM status	not_oom
max batch under memory cap	2

Memory measurements are not release-grade in this diagnostic card unless explicitly listed above.
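
The throughput rows can be cross-checked against each other. The sketch below confirms that tokens/sec equals examples/sec times the sequence length, and derives the mean batch latency implied by examples/sec. That last step assumes examples/sec was computed as batch size divided by mean batch latency, which the card does not state.

```python
BATCH_SIZE = 2
SEQ_LEN = 128
EXAMPLES_PER_SEC = 180.17026675821398
TOKENS_PER_SEC = 23061.79414505139

# Every example is padded or truncated to SEQ_LEN tokens in this benchmark.
derived_tokens_per_sec = EXAMPLES_PER_SEC * SEQ_LEN

# Assumption: examples/sec = batch_size / mean batch latency.
implied_mean_latency_ms = BATCH_SIZE / EXAMPLES_PER_SEC * 1000.0

print(f"derived tokens/sec: {derived_tokens_per_sec:.2f}")
print(f"implied mean latency: {implied_mean_latency_ms:.2f} ms")
```

Under that assumption the implied mean latency is about 11.1 ms per batch, which falls between the reported p50 and p95 values.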

Hardware and Software

Training/eval torch: 2.12.0+cu130
CUDA available during checkpoint creation: False
Latency environment: {'cuda': '13.0', 'cuda_available': False, 'platform': 'Linux-6.14.0-37-generic-x86_64-with-glibc2.41', 'python': '3.12.13', 'torch': '2.12.0+cu130'}
Vast AI: None

Known Limitations

Random or tiny diagnostic training only; no release-quality pretraining.
Mandatory ModernBERT, Ettin, DeBERTa-v3, Longformer, BigBird, and embedding baselines are still pending.
Long-context 2k/4k/8k claims are unsupported by this card.
Dataset license caveats remain unresolved for public claims.

Ethical and Privacy Considerations

This checkpoint is diagnostic and should not be deployed. Dataset provenance and privacy review are incomplete for release use, and public token-classification claims require either a publication-safe replacement dataset or legal approval.

Reproducibility

Training command: scripts/finetune_classification.py --train-index data/eval_frozen/run_001/ag_news_train_index_sample64.json --train-split train --eval-index data/eval_frozen/run_001/ag_news_eval_index_sample200.json --eval-split test --max-train-examples 32 --max-eval-examples 64 --batch-size 8 --epochs 1 --max-length 96 --lr 5e-4 --seed 1337 --output runs/run_001/eval_reports/stratabert_tiny_ag_news_finetune_smoke.json --checkpoint-dir checkpoints/run_001/tiny_ag_news_smoke
Tokenizer: {'source': 'answerdotai/ModernBERT-base', 'vocab_size': 50368}
Seed: 1337
Checkpoint path: checkpoints/run_001/tiny_ag_news_smoke/model.safetensors
Evaluation reports: data/eval_frozen/run_001/ag_news_eval_index_sample200.json
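
One lightweight reproducibility check is to recount parameters directly from the checkpoint file. A safetensors file begins with an 8-byte little-endian header length followed by a JSON header listing each tensor's dtype, shape, and byte offsets, so shapes can be read without loading any weights. The sketch below uses only the standard library; the function name is hypothetical, and the demo runs on a tiny in-memory file rather than the real checkpoint.

```python
import json
import struct
from math import prod


def count_safetensors_params(data: bytes) -> int:
    """Sum tensor element counts from a safetensors header."""
    (header_len,) = struct.unpack("<Q", data[:8])
    header = json.loads(data[8:8 + header_len].decode("utf-8"))
    return sum(
        prod(entry["shape"])
        for name, entry in header.items()
        if name != "__metadata__"  # optional free-form metadata, not a tensor
    )


# Build a tiny in-memory file: one 3x4 F32 weight and one 4-element F32 bias.
header = {
    "w": {"dtype": "F32", "shape": [3, 4], "data_offsets": [0, 48]},
    "b": {"dtype": "F32", "shape": [4], "data_offsets": [48, 64]},
}
header_bytes = json.dumps(header).encode("utf-8")
blob = struct.pack("<Q", len(header_bytes)) + header_bytes + bytes(64)
n_params = count_safetensors_params(blob)
print(n_params)
```

To apply it to this checkpoint, read the file at the checkpoint path above in binary mode and pass the bytes to the function. The result may differ from the reported total if the file stores non-parameter buffers or if tied weights are saved only once.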

Citation

Use CITATION.cff from this repository. Title: StrataBERT: A Padding-Safe SSM-Attention Encoder for Efficient Long-Document Classification.

Exact Git Commit

Commit: no_commit_yet. Dirty worktree at checkpoint creation: True.
