# BERTInvoiceCzechR (V2 – Synthetic + Random Layout + Real Layout Injection)
This model is a fine-tuned version of google-bert/bert-base-multilingual-cased for structured information extraction from Czech invoices.
It achieves the following results on the evaluation set:
- Loss: 0.1326
- Precision: 0.8120
- Recall: 0.7868
- F1: 0.7992
- Accuracy: 0.9700
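As a quick sanity check, the reported F1 is the harmonic mean of the reported precision and recall:

```python
# Verify that the reported F1 (0.7992) is the harmonic mean of the
# reported precision (0.8120) and recall (0.7868).
precision, recall = 0.8120, 0.7868
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.7992
```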
## Model description
BERTInvoiceCzechR (V2) represents an advanced stage in the training pipeline, combining synthetic data with realistic document layouts.
The model performs token-level classification to extract structured invoice fields:
- supplier
- customer
- invoice number
- bank details
- totals
- dates
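Token-level predictions are typically emitted as BIO tags and then merged into field spans. The sketch below shows this decoding step; the label names (`SUPPLIER`, `INVOICE_NUMBER`, ...) are illustrative assumptions and may differ from this model's actual tag set:

```python
# Minimal sketch of grouping token-level BIO predictions into invoice
# fields. Label names here are assumptions for illustration; the model's
# actual label set may differ.

def group_bio(tokens, labels):
    """Merge BIO-tagged tokens into (field, text) spans."""
    spans = []
    current_field, current_tokens = None, []
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            # A B- tag closes any open span and opens a new one.
            if current_field:
                spans.append((current_field, " ".join(current_tokens)))
            current_field, current_tokens = lab[2:], [tok]
        elif lab.startswith("I-") and current_field == lab[2:]:
            current_tokens.append(tok)
        else:
            # "O" or an inconsistent I- tag ends the open span.
            if current_field:
                spans.append((current_field, " ".join(current_tokens)))
            current_field, current_tokens = None, []
    if current_field:
        spans.append((current_field, " ".join(current_tokens)))
    return spans

tokens = ["Faktura", "č.", "2024001", "ABC", "s.r.o."]
labels = ["O", "O", "B-INVOICE_NUMBER", "B-SUPPLIER", "I-SUPPLIER"]
print(group_bio(tokens, labels))
# [('INVOICE_NUMBER', '2024001'), ('SUPPLIER', 'ABC s.r.o.')]
```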
This version introduces a key improvement: real invoice layouts with synthetic content, bridging the gap between artificial and real-world data.
## Training data
The dataset is composed of three main components:
- Synthetic template-based invoices
- Synthetic invoices with randomized layouts
- Hybrid invoices with real layouts and synthetic content
### Real layout injection
In the hybrid dataset:
- real invoice documents are used as layout templates
- original textual content is removed
- fields (e.g., supplier, customer, bank details) are replaced with synthetic data
- new content is rendered into the original spatial structure
This approach preserves:
- realistic spacing
- typography patterns
- structural complexity
while maintaining:
- full control over annotations
- label consistency
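The injection steps above can be sketched as swapping the text inside each field's bounding box while leaving the spatial structure untouched. Field names and the box representation below are assumptions for illustration:

```python
# Hedged sketch of real layout injection: keep the bounding boxes of a
# real invoice, but replace the original text with synthetic values.
# Field names and the Box format are illustrative assumptions.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Box:
    field: str    # semantic field, e.g. "supplier"
    bbox: tuple   # (x0, y0, x1, y1) taken from the real document layout
    text: str     # textual content rendered inside the box

def inject_synthetic(layout, synthetic_values):
    """Replace text per box while preserving the spatial structure."""
    return [
        replace(box, text=synthetic_values.get(box.field, box.text))
        for box in layout
    ]

real_layout = [
    Box("supplier", (40, 60, 300, 90), "Původní dodavatel a.s."),
    Box("invoice_number", (400, 60, 560, 80), "FV-2019-113"),
]
synthetic = {"supplier": "Nová Firma s.r.o.", "invoice_number": "2024001"}
new_layout = inject_synthetic(real_layout, synthetic)
print([(b.field, b.bbox, b.text) for b in new_layout])
```

Because the boxes are copied unchanged, annotations stay consistent by construction: the label of each box is known before the synthetic text is rendered into it.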
## Role in the pipeline
This model corresponds to:
V2 – Synthetic + layout augmentation + real layout injection
It is designed to:
- reduce the domain gap between synthetic and real invoices
- evaluate the impact of realistic spatial distributions
- serve as a bridge between purely synthetic training (V0–V1) and real data fine-tuning (V3)
## Intended uses
- Advanced research in document AI
- Evaluation of hybrid synthetic-real training strategies
- Invoice information extraction in semi-realistic conditions
- Benchmarking generalization improvements
## Limitations
- Still does not use fully real textual content
- Synthetic text may not capture all linguistic variability
- OCR noise and scanning artifacts are not fully represented
- Performance may still drop on unseen real-world edge cases
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 2
- seed: 42
- optimizer: adamw_torch_fused with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 0.1
- num_epochs: 10
- mixed_precision_training: Native AMP
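The hyperparameters above can be reconstructed as a `transformers` `TrainingArguments` configuration. This is an illustrative sketch, not the author's exact setup: `output_dir` is assumed, and the listed `lr_scheduler_warmup_steps: 0.1` looks like a warmup ratio rather than a step count, so it is expressed here as `warmup_ratio`:

```python
# Illustrative reconstruction of the training configuration.
# output_dir is an assumed name; warmup_ratio=0.1 interprets the listed
# "lr_scheduler_warmup_steps: 0.1" as a ratio, which is an assumption.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="bert-invoice-czech-r-v2",  # assumed name
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=2,
    seed=42,
    optim="adamw_torch_fused",
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    num_train_epochs=10,
    fp16=True,  # native AMP mixed-precision training
)
```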
### Training results
| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|---|
| No log | 1.0 | 87 | 0.1326 | 0.7356 | 0.7270 | 0.7312 | 0.9636 |
| No log | 2.0 | 174 | 0.1226 | 0.7985 | 0.7604 | 0.7790 | 0.9704 |
| No log | 3.0 | 261 | 0.1224 | 0.7880 | 0.7852 | 0.7866 | 0.9689 |
| No log | 4.0 | 348 | 0.1325 | 0.7557 | 0.7783 | 0.7668 | 0.9657 |
| No log | 5.0 | 435 | 0.1390 | 0.7655 | 0.8229 | 0.7932 | 0.9674 |
| 0.0733 | 6.0 | 522 | 0.1324 | 0.7709 | 0.8155 | 0.7926 | 0.9682 |
| 0.0733 | 7.0 | 609 | 0.1326 | 0.8123 | 0.7868 | 0.7994 | 0.9700 |
| 0.0733 | 8.0 | 696 | 0.1366 | 0.8109 | 0.7775 | 0.7938 | 0.9697 |
| 0.0733 | 9.0 | 783 | 0.1385 | 0.7893 | 0.7930 | 0.7912 | 0.9686 |
| 0.0733 | 10.0 | 870 | 0.1393 | 0.8044 | 0.7938 | 0.7991 | 0.9696 |
## Framework versions
- Transformers 5.0.0
- PyTorch 2.10.0+cu128
- Datasets 4.0.0
- Tokenizers 0.22.2