# BERTInvoiceCzechR (V2 – Synthetic + Random Layout + Real Layout Injection)
This model is a fine-tuned version of google-bert/bert-base-multilingual-cased for structured information extraction from Czech invoices.
It achieves the following results on the evaluation set:
- Loss: 0.1326
- Precision: 0.8120
- Recall: 0.7868
- F1: 0.7992
- Accuracy: 0.9700
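As a quick sanity check, the reported F1 is the harmonic mean of the reported precision and recall:

```python
# Verify that the reported F1 (0.7992) is the harmonic mean of the
# reported precision (0.8120) and recall (0.7868).
precision, recall = 0.8120, 0.7868
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.7992
```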
## Model description
BERTInvoiceCzechR (V2) represents an advanced stage in the training pipeline, combining synthetic data with realistic document layouts.
The model performs token-level classification to extract structured invoice fields:
- supplier
- customer
- invoice number
- bank details
- totals
- dates
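Token-level predictions are typically emitted as BIO tags and then merged into field spans. The sketch below shows this decoding step; the label names (`SUPPLIER`, `INVOICE_NUMBER`, ...) are illustrative assumptions and may differ from this model's actual tag set:

```python
# Minimal sketch of grouping token-level BIO predictions into invoice
# fields. Label names here are assumptions for illustration; the model's
# actual label set may differ.

def group_bio(tokens, labels):
    """Merge BIO-tagged tokens into (field, text) spans."""
    spans = []
    current_field, current_tokens = None, []
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            # A B- tag closes any open span and opens a new one.
            if current_field:
                spans.append((current_field, " ".join(current_tokens)))
            current_field, current_tokens = lab[2:], [tok]
        elif lab.startswith("I-") and current_field == lab[2:]:
            current_tokens.append(tok)
        else:
            # "O" or an inconsistent I- tag ends the open span.
            if current_field:
                spans.append((current_field, " ".join(current_tokens)))
            current_field, current_tokens = None, []
    if current_field:
        spans.append((current_field, " ".join(current_tokens)))
    return spans

tokens = ["Faktura", "č.", "2024001", "ABC", "s.r.o."]
labels = ["O", "O", "B-INVOICE_NUMBER", "B-SUPPLIER", "I-SUPPLIER"]
print(group_bio(tokens, labels))
# [('INVOICE_NUMBER', '2024001'), ('SUPPLIER', 'ABC s.r.o.')]
```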
This version introduces a key improvement: real invoice layouts with synthetic content, bridging the gap between artificial and real-world data.
## Training data
The dataset is composed of three main components:
- Synthetic template-based invoices
- Synthetic invoices with randomized layouts
- Hybrid invoices with real layouts and synthetic content
### Real layout injection
In the hybrid dataset:
- real invoice documents are used as layout templates
- original textual content is removed
- fields (e.g., supplier, customer, bank details) are replaced with synthetic data
- new content is rendered into the original spatial structure
This approach preserves:
- realistic spacing
- typography patterns
- structural complexity
while maintaining:
- full control over annotations
- label consistency
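The injection steps above can be sketched as swapping the text inside each field's bounding box while leaving the spatial structure untouched. Field names and the box representation below are assumptions for illustration:

```python
# Hedged sketch of real layout injection: keep the bounding boxes of a
# real invoice, but replace the original text with synthetic values.
# Field names and the Box format are illustrative assumptions.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Box:
    field: str    # semantic field, e.g. "supplier"
    bbox: tuple   # (x0, y0, x1, y1) taken from the real document layout
    text: str     # textual content rendered inside the box

def inject_synthetic(layout, synthetic_values):
    """Replace text per box while preserving the spatial structure."""
    return [
        replace(box, text=synthetic_values.get(box.field, box.text))
        for box in layout
    ]

real_layout = [
    Box("supplier", (40, 60, 300, 90), "Původní dodavatel a.s."),
    Box("invoice_number", (400, 60, 560, 80), "FV-2019-113"),
]
synthetic = {"supplier": "Nová Firma s.r.o.", "invoice_number": "2024001"}
new_layout = inject_synthetic(real_layout, synthetic)
print([(b.field, b.bbox, b.text) for b in new_layout])
```

Because the boxes are copied unchanged, annotations stay consistent by construction: the label of each box is known before the synthetic text is rendered into it.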
## Role in the pipeline
This model corresponds to:
V2 – Synthetic + layout augmentation + real layout injection
It is designed to:
- reduce the domain gap between synthetic and real invoices
- evaluate the impact of realistic spatial distributions
- serve as a bridge between purely synthetic training (V0–V1) and real data fine-tuning (V3)
## Intended uses
- Advanced research in document AI
- Evaluation of hybrid synthetic-real training strategies
- Invoice information extraction in semi-realistic conditions
- Benchmarking generalization improvements
## Limitations
- Still does not use fully real textual content
- Synthetic text may not capture all linguistic variability
- OCR noise and scanning artifacts are not fully represented
- Performance may still drop on unseen real-world edge cases
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 2
- seed: 42
- optimizer: adamw_torch_fused with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 0.1
- num_epochs: 10
- mixed_precision_training: Native AMP
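The hyperparameters above can be reconstructed as a `transformers` `TrainingArguments` configuration. This is an illustrative sketch, not the author's exact setup: `output_dir` is assumed, and the listed `lr_scheduler_warmup_steps: 0.1` looks like a warmup ratio rather than a step count, so it is expressed here as `warmup_ratio`:

```python
# Illustrative reconstruction of the training configuration.
# output_dir is an assumed name; warmup_ratio=0.1 interprets the listed
# "lr_scheduler_warmup_steps: 0.1" as a ratio, which is an assumption.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="bert-invoice-czech-r-v2",  # assumed name
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=2,
    seed=42,
    optim="adamw_torch_fused",
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    num_train_epochs=10,
    fp16=True,  # native AMP mixed-precision training
)
```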
### Training results
| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|---|
| No log | 1.0 | 87 | 0.1326 | 0.7356 | 0.7270 | 0.7312 | 0.9636 |
| No log | 2.0 | 174 | 0.1226 | 0.7985 | 0.7604 | 0.7790 | 0.9704 |
| No log | 3.0 | 261 | 0.1224 | 0.7880 | 0.7852 | 0.7866 | 0.9689 |
| No log | 4.0 | 348 | 0.1325 | 0.7557 | 0.7783 | 0.7668 | 0.9657 |
| No log | 5.0 | 435 | 0.1390 | 0.7655 | 0.8229 | 0.7932 | 0.9674 |
| 0.0733 | 6.0 | 522 | 0.1324 | 0.7709 | 0.8155 | 0.7926 | 0.9682 |
| 0.0733 | 7.0 | 609 | 0.1326 | 0.8123 | 0.7868 | 0.7994 | 0.9700 |
| 0.0733 | 8.0 | 696 | 0.1366 | 0.8109 | 0.7775 | 0.7938 | 0.9697 |
| 0.0733 | 9.0 | 783 | 0.1385 | 0.7893 | 0.7930 | 0.7912 | 0.9686 |
| 0.0733 | 10.0 | 870 | 0.1393 | 0.8044 | 0.7938 | 0.7991 | 0.9696 |
## Framework versions
- Transformers 5.0.0
- PyTorch 2.10.0+cu128
- Datasets 4.0.0
- Tokenizers 0.22.2