---
base_model:
- allenai/scibert_scivocab_cased
datasets:
- ExponentialScience/DLT-Tweets
- ExponentialScience/DLT-Patents
- ExponentialScience/DLT-Scientific-Literature
language:
- en
license: cc-by-nc-4.0
library_name: transformers
pipeline_tag: feature-extraction
---
# LedgerBERT

[Paper: DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain](https://huggingface.co/papers/2602.22045) | [GitHub Repository](https://github.com/dlt-science/DLT-Corpus)

## Model Description

### Model Summary

LedgerBERT is a domain-adapted language model specialized for the Distributed Ledger Technology (DLT) field. It was created through continual pre-training of SciBERT on the DLT-Corpus, a comprehensive collection of 2.98 billion tokens from scientific literature, patents, and social media focused on blockchain, cryptocurrencies, and distributed ledger systems.

LedgerBERT captures DLT-specific terminology and concepts, making it particularly effective for NLP tasks involving blockchain technologies, cryptocurrency discourse, smart contracts, consensus mechanisms, and related domain-specific content.

- **Developed by:** Walter Hernandez Cruz, Peter Devine, Nikhil Vadgama, Paolo Tasca, Jiahua Xu
- **Model type:** BERT-base encoder (bidirectional transformer)
- **Language:** English
- **License:** CC-BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International)
- **Base model:** SciBERT (allenai/scibert_scivocab_cased)
- **Training corpus:** DLT-Corpus (2.98 billion tokens)

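A quick, informal way to see this domain adaptation is to probe the model on a masked DLT sentence. The sketch below is illustrative only: it assumes the masked-language-modeling head from continual pre-training ships with the checkpoint; if only the encoder weights were uploaded, the pipeline will initialize a fresh head and the predictions will not be meaningful.

```python
from transformers import pipeline

# Illustrative probe only: assumes the checkpoint retains the MLM head
# from continual pre-training.
fill_mask = pipeline("fill-mask", model="ExponentialScience/LedgerBERT")

for pred in fill_mask("Ethereum uses a Proof of [MASK] consensus mechanism."):
    print(pred["token_str"], round(pred["score"], 3))
```
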
### Model Architecture

- **Architecture:** BERT-base
- **Parameters:** 110 million
- **Hidden size:** 768
- **Number of layers:** 12
- **Attention heads:** 12
- **Vocabulary size:** 30,522 (SciBERT vocabulary)
- **Max sequence length:** 512 tokens

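These figures can be cross-checked against the hosted configuration. A small sanity check, using the model ID from the usage examples below:

```python
from transformers import AutoConfig

# Read the architecture numbers above from the hosted config.
config = AutoConfig.from_pretrained("ExponentialScience/LedgerBERT")
print(config.hidden_size)              # 768
print(config.num_hidden_layers)        # 12
print(config.num_attention_heads)      # 12
print(config.vocab_size)               # vocabulary size
print(config.max_position_embeddings)  # 512
```
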
## Intended Uses

### Primary Use Cases

LedgerBERT is designed for NLP tasks in the DLT domain, including, but not limited to:

- **Named Entity Recognition (NER)**: Identifying DLT-specific entities such as consensus mechanisms (e.g., Proof of Stake), blockchain platforms (e.g., Ethereum, Hedera), and cryptographic concepts (e.g., Merkle tree, hashing)
- **Text Classification**: Categorizing DLT-related documents, patents, or social media posts
- **Sentiment Analysis**: Analyzing sentiment in cryptocurrency news and social media
- **Information Extraction**: Extracting technical concepts and relationships from DLT literature
- **Document Retrieval**: Building search systems for DLT content
- **Question Answering (QA)**: Creating QA systems for blockchain and cryptocurrency topics

### Out-of-Scope Uses

- **Real-time trading systems**: LedgerBERT should not be used as the sole basis for automated trading decisions
- **Investment advice**: Not suitable for providing financial or investment recommendations without proper disclaimers
- **General-purpose NLP**: While LedgerBERT maintains general language understanding, it is optimized for DLT-specific tasks
- **Legal or regulatory compliance**: Should not be used for legal interpretation without expert review

## Training Details

### Training Data

LedgerBERT was continually pre-trained on the **DLT-Corpus**, consisting of:

- **Scientific Literature**: 37,440 documents, 564M tokens (1978-2025). See https://huggingface.co/datasets/ExponentialScience/DLT-Scientific-Literature
- **Patents**: 49,023 documents, 1,296M tokens (1990-2025). See https://huggingface.co/datasets/ExponentialScience/DLT-Patents
- **Social Media**: 22.03M documents, 1,120M tokens (2013 to mid-2023). See https://huggingface.co/datasets/ExponentialScience/DLT-Tweets

**Total:** 22.12 million documents, 2.98 billion tokens

For more details, see the DLT-Corpus collection: https://huggingface.co/collections/ExponentialScience/dlt-corpus-68e44e40d4e7a3bd7a224402

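The three subcorpora are hosted as standard datasets, so they can be streamed for inspection or further training. A minimal sketch; the split and column names here are assumptions, so check the dataset cards for the actual schema:

```python
from datasets import load_dataset

# Stream one subcorpus without downloading it in full; "train" split
# is an assumption -- see the dataset card for the actual splits.
ds = load_dataset("ExponentialScience/DLT-Patents", split="train", streaming=True)
print(next(iter(ds)))
```
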
### Training Procedure

**Continual Pre-training:**

Starting from SciBERT (which already captures multidisciplinary scientific content), LedgerBERT was trained using Masked Language Modeling (MLM) on the DLT-Corpus to adapt the model to DLT-specific terminology and concepts.

**Training hyperparameters:**

- **Epochs:** 3
- **Learning rate:** 5×10⁻⁵ with linear decay schedule
- **MLM probability:** 0.15 (standard BERT masking)
- **Warmup ratio:** 0.10
- **Batch size:** 12 per device
- **Sequence length:** 512 tokens
- **Weight decay:** 0.01
- **Optimizer:** Stable AdamW
- **Precision:** bfloat16

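For reference, the following is a minimal continual pre-training loop mirroring these hyperparameters with the `transformers` `Trainer`. It is a sketch, not the authors' exact training script: `tokenized_corpus` is a placeholder for a pre-tokenized DLT-Corpus of 512-token sequences, and plain AdamW (the `Trainer` default) stands in for the Stable AdamW optimizer reported above.

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from SciBERT and continue MLM pre-training on DLT text.
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_cased")
model = AutoModelForMaskedLM.from_pretrained("allenai/scibert_scivocab_cased")

# Standard BERT masking at the probability listed above.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="./ledgerbert-mlm",
    num_train_epochs=3,
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.10,
    per_device_train_batch_size=12,
    weight_decay=0.01,
    bf16=True,  # bfloat16 precision; requires suitable hardware
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collator,
    train_dataset=tokenized_corpus,  # placeholder: your pre-tokenized 512-token corpus
)
trainer.train()
```
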
## Limitations and Biases

### Known Limitations

- **Language coverage**: English only; does not support other languages
- **Temporal coverage**: Training data extends to mid-2023 for social media; may not capture very recent terminology
- **Domain specificity**: Optimized for DLT tasks; may underperform on general-purpose benchmarks compared to models like RoBERTa
- **Context length**: Limited to 512 tokens; longer documents require truncation or chunking (see the sketch below)

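For documents longer than 512 tokens, the tokenizer's built-in overflow handling gives a simple sliding-window chunking; a minimal sketch:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ExponentialScience/LedgerBERT")

long_text = "..."  # any document longer than 512 tokens

# Split into overlapping 512-token windows; `stride` controls the overlap.
chunks = tokenizer(
    long_text,
    max_length=512,
    truncation=True,
    return_overflowing_tokens=True,
    stride=64,
    padding="max_length",
    return_tensors="pt",
)
print(chunks["input_ids"].shape)  # (num_chunks, 512)
```
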
### Potential Biases

The model may reflect biases present in the training data:

- **Geographic bias**: English-language sources may over-represent certain regions
- **Platform bias**: Social media data comes only from Twitter/X; other platforms are not represented
- **Temporal bias**: More recent DLT developments are more heavily represented
- **Market bias**: Training on data from periods of market volatility may influence sentiment understanding
- **Source bias**: Certain cryptocurrencies (e.g., Bitcoin, Ethereum) are discussed more than others

### Ethical Considerations

- **Market manipulation risk**: Could be misused to analyze or generate content for market manipulation
- **Investment decisions**: Should not be used as the sole basis for financial decisions without proper risk disclaimers
- **Misinformation**: May reproduce, or fail to identify, false claims present in the training data
- **Privacy**: While usernames were removed from the social media data, care should be taken not to re-identify individuals

## How to Use

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModel

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("ExponentialScience/LedgerBERT")
model = AutoModel.from_pretrained("ExponentialScience/LedgerBERT")

# Example text
text = "Ethereum uses a Proof of Stake consensus mechanism for transaction validation."

# Tokenize and encode
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)

# Get contextual token embeddings
outputs = model(**inputs)
embeddings = outputs.last_hidden_state  # shape: (batch, seq_len, 768)
```

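`last_hidden_state` holds per-token vectors; for sentence-level features (e.g., the document-retrieval use case above), a common recipe is attention-mask-aware mean pooling. A minimal sketch building on the snippet above:

```python
import torch

# Mean-pool token embeddings, ignoring padding positions.
with torch.no_grad():
    outputs = model(**inputs)

mask = inputs["attention_mask"].unsqueeze(-1)           # (batch, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)  # (batch, 768)
sentence_embedding = summed / mask.sum(dim=1)           # average over real tokens

print(sentence_embedding.shape)  # torch.Size([1, 768])
```

Embeddings pooled this way can be compared with cosine similarity for retrieval or clustering.
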
### Fine-tuning for NER

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer

# Load for token classification
tokenizer = AutoTokenizer.from_pretrained("ExponentialScience/LedgerBERT")
model = AutoModelForTokenClassification.from_pretrained(
    "ExponentialScience/LedgerBERT",
    num_labels=num_labels  # set to the number of entity labels in your NER tagging scheme
)

# Fine-tune on your dataset; train_dataset and eval_dataset are your
# tokenized, label-aligned datasets
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    num_train_epochs=20,
    warmup_steps=500
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)

trainer.train()
```

### Fine-tuning for Sentiment Analysis

A version of LedgerBERT fine-tuned for market sentiment is available at: https://huggingface.co/ExponentialScience/LedgerBERT-Market-Sentiment

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("ExponentialScience/LedgerBERT-Market-Sentiment")
model = AutoModelForSequenceClassification.from_pretrained("ExponentialScience/LedgerBERT-Market-Sentiment")

text = "Bitcoin reaches new all-time high amid institutional adoption"
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
outputs = model(**inputs)
predictions = outputs.logits.argmax(dim=-1)
print(model.config.id2label[predictions.item()])  # map the class id to its label name
```

## Citation

If you use LedgerBERT in your research, please cite:

```bibtex
@misc{hernandez2026dlt-corpus,
  title={DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain},
  author={Walter Hernandez Cruz and Peter Devine and Nikhil Vadgama and Paolo Tasca and Jiahua Xu},
  year={2026},
  eprint={2602.22045},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.22045},
}
```

## Related Resources

- **DLT-Corpus Collection**: https://huggingface.co/collections/ExponentialScience/dlt-corpus-68e44e40d4e7a3bd7a224402
- **Scientific Literature Dataset**: https://huggingface.co/datasets/ExponentialScience/DLT-Scientific-Literature
- **Patents Dataset**: https://huggingface.co/datasets/ExponentialScience/DLT-Patents
- **Social Media Dataset**: https://huggingface.co/datasets/ExponentialScience/DLT-Tweets
- **Sentiment Analysis Dataset**: https://huggingface.co/datasets/ExponentialScience/DLT-Sentiment-News
- **Fine-tuned Market Sentiment Model**: https://huggingface.co/ExponentialScience/LedgerBERT-Market-Sentiment

## Model Card Contact

For questions or feedback about LedgerBERT, please open an issue on the model repository or contact the authors through the DLT-Corpus GitHub repository: https://github.com/dlt-science/DLT-Corpus