BERTInvoiceCzechR (V2 – Synthetic + Random Layout + Real Layout Injection)

This model is a fine-tuned version of google-bert/bert-base-multilingual-cased for structured information extraction from Czech invoices.

It achieves the following results on the evaluation set:

  • Loss: 0.1326
  • Precision: 0.8120
  • Recall: 0.7868
  • F1: 0.7992
  • Accuracy: 0.9700
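The card does not state how these metrics are computed; for token-classification cards they are typically entity-level scores over BIO tag sequences (seqeval-style). A minimal self-contained sketch of that scoring, with illustrative label names rather than the model's actual label set:

```python
# Sketch of entity-level precision/recall/F1 over BIO tag sequences,
# in the style commonly used for token-classification model cards.
# Label names below are illustrative assumptions.

def extract_entities(tags):
    """Collect (type, start, end) spans from a BIO tag sequence.
    Stray I- tags without a matching B- are ignored (a simplification)."""
    entities, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" flushes the last span
        boundary = tag.startswith("B-") or tag == "O" or (
            tag.startswith("I-") and tag[2:] != etype
        )
        if boundary:
            if etype is not None:
                entities.append((etype, start, i))
            etype, start = (tag[2:], i) if tag.startswith("B-") else (None, None)
    return entities

def prf1(true_tags, pred_tags):
    """Entity-level precision, recall, and F1 between two tag sequences."""
    gold = set(extract_entities(true_tags))
    pred = set(extract_entities(pred_tags))
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Under this scheme a predicted span only counts as correct when both its boundaries and its type match the gold span exactly, which is why entity-level F1 sits well below token accuracy.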

Model description

BERTInvoiceCzechR (V2) represents an advanced stage in the training pipeline, combining synthetic data with realistic document layouts.

The model performs token-level classification to extract structured invoice fields:

  • supplier
  • customer
  • invoice number
  • bank details
  • totals
  • dates

This version introduces a key improvement: real invoice layouts with synthetic content, bridging the gap between artificial and real-world data.
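As a rough illustration of what token-level extraction yields downstream, the sketch below groups per-token label predictions into the structured fields listed above. The label names and grouping logic are assumptions for illustration, not the model's actual label inventory:

```python
# Hypothetical sketch: turning per-token label predictions into structured
# invoice fields. Field names mirror the list above; the model's real
# label set may differ.

def tokens_to_fields(tokens, labels):
    """Group consecutive tokens sharing a field label into field values."""
    fields = {}
    for token, label in zip(tokens, labels):
        if label == "O":  # "outside" tokens carry no field
            continue
        field = label.split("-", 1)[-1]  # strip an optional B-/I- prefix
        fields.setdefault(field, []).append(token)
    return {field: " ".join(parts) for field, parts in fields.items()}

tokens = ["Faktura", "č.", "2024001", "Dodavatel", ":", "ACME", "s.r.o."]
labels = ["O", "O", "B-INVOICE_NUMBER", "O", "O", "B-SUPPLIER", "I-SUPPLIER"]
print(tokens_to_fields(tokens, labels))
# → {'INVOICE_NUMBER': '2024001', 'SUPPLIER': 'ACME s.r.o.'}
```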


Training data

The dataset is composed of three main components:

  1. Synthetic template-based invoices
  2. Synthetic invoices with randomized layouts
  3. Hybrid invoices with real layouts and synthetic content

Real layout injection

In the hybrid dataset:

  • real invoice documents are used as layout templates
  • original textual content is removed
  • fields (e.g., supplier, customer, bank details) are replaced with synthetic data
  • new content is rendered into the original spatial structure

This approach preserves:

  • realistic spacing
  • typography patterns
  • structural complexity

while maintaining:

  • full control over annotations
  • label consistency

Role in the pipeline

This model corresponds to:

V2 – Synthetic + layout augmentation + real layout injection

It is designed to:

  • reduce the domain gap between synthetic and real invoices
  • evaluate the impact of realistic spatial distributions
  • serve as a bridge between purely synthetic training (V0–V1) and real data fine-tuning (V3)

Intended uses

  • Advanced research in document AI
  • Evaluation of hybrid synthetic-real training strategies
  • Invoice information extraction in semi-realistic conditions
  • Benchmarking generalization improvements

Limitations

  • Still does not use fully real textual content
  • Synthetic text may not capture all linguistic variability
  • OCR noise and scanning artifacts are not fully represented
  • Performance may still drop on unseen real-world edge cases

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-05
  • train_batch_size: 16
  • eval_batch_size: 2
  • seed: 42
  • optimizer: adamw_torch_fused (betas=(0.9, 0.999), epsilon=1e-08); no additional optimizer arguments
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 0.1
  • num_epochs: 10
  • mixed_precision_training: Native AMP
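The warmup value of 0.1 reads like a warmup *ratio* (10% of total steps) rather than a step count; that interpretation is an assumption. Under it, the linear schedule behaves as in this minimal sketch:

```python
# Minimal sketch of a linear LR schedule with warmup, assuming the card's
# warmup value of 0.1 is a ratio of total training steps.

def linear_schedule(step, total_steps, warmup_ratio=0.1, base_lr=1e-5):
    """LR rises linearly to base_lr during warmup, then decays linearly to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

total = 870  # 10 epochs x 87 steps per epoch, matching the results table
```

With these numbers the learning rate peaks at 1e-05 around step 87 (end of epoch 1) and reaches 0 at step 870.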

Training results

Training Loss  Epoch  Step  Validation Loss  Precision  Recall  F1      Accuracy
No log         1.0    87    0.1326           0.7356     0.7270  0.7312  0.9636
No log         2.0    174   0.1226           0.7985     0.7604  0.7790  0.9704
No log         3.0    261   0.1224           0.7880     0.7852  0.7866  0.9689
No log         4.0    348   0.1325           0.7557     0.7783  0.7668  0.9657
No log         5.0    435   0.1390           0.7655     0.8229  0.7932  0.9674
0.0733         6.0    522   0.1324           0.7709     0.8155  0.7926  0.9682
0.0733         7.0    609   0.1326           0.8123     0.7868  0.7994  0.9700
0.0733         8.0    696   0.1366           0.8109     0.7775  0.7938  0.9697
0.0733         9.0    783   0.1385           0.7893     0.7930  0.7912  0.9686
0.0733         10.0   870   0.1393           0.8044     0.7938  0.7991  0.9696
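The headline metrics at the top of this card are closest to the epoch-7 row (F1 0.7994, best across all epochs). A tiny sketch of selecting the best epoch by F1 from the table data:

```python
# Per-epoch (precision, recall, f1) values copied from the table above.
results = {
    1: (0.7356, 0.7270, 0.7312), 2: (0.7985, 0.7604, 0.7790),
    3: (0.7880, 0.7852, 0.7866), 4: (0.7557, 0.7783, 0.7668),
    5: (0.7655, 0.8229, 0.7932), 6: (0.7709, 0.8155, 0.7926),
    7: (0.8123, 0.7868, 0.7994), 8: (0.8109, 0.7775, 0.7938),
    9: (0.7893, 0.7930, 0.7912), 10: (0.8044, 0.7938, 0.7991),
}
best_epoch = max(results, key=lambda e: results[e][2])  # select by F1
print(best_epoch)  # → 7
```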

Framework versions

  • Transformers 5.0.0
  • PyTorch 2.10.0+cu128
  • Datasets 4.0.0
  • Tokenizers 0.22.2
Model: TomasFAV/BERTInvoiceCzechV012 · ~0.2B parameters · F32 safetensors