library_name: transformers
license: apache-2.0
base_model: google/pix2struct-docvqa-base
tags:
  - generated_from_trainer
  - invoice-processing
  - information-extraction
  - czech-language
  - document-ai
  - multimodal-model
  - generative-model
  - synthetic-data
metrics:
  - f1
model-index:
  - name: Pix2StructCzechInvoice-V0
    results: []

Pix2StructCzechInvoice (V0 – Synthetic Templates Only)

This model is a fine-tuned version of google/pix2struct-docvqa-base for structured information extraction from Czech invoices.

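It achieves the following results on the evaluation set:

  • Loss: 0.5022 (final epoch, epoch 10, where F1 is 0.4665)
  • F1: 0.5907 (best value, reached at epoch 8, where validation loss is 0.3844)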

Model description

Pix2StructCzechInvoice (V0) is a generative multimodal model designed for document understanding.

Unlike token classification models (e.g., BERT, LiLT, LayoutLMv3), this model:

  • processes the entire document image
  • generates structured outputs as text sequences

The model is trained to extract key invoice fields such as:

  • supplier
  • customer
  • invoice number
  • bank details
  • totals
  • dates
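A minimal inference sketch, assuming the checkpoint loads with the standard Pix2Struct classes from Transformers, as the base model does. The function name `generate_invoice_text`, the `prompt` argument, and the `max_new_tokens` default are illustrative assumptions. This card does not state the repository id, the prompt used during fine-tuning, or whether the processor files ship with the checkpoint, so all three are left to the caller.

```python
def generate_invoice_text(image_path, model_id, prompt, processor_id=None, max_new_tokens=512):
    """Run one invoice image through a Pix2Struct checkpoint and return the raw generated text.

    model_id and prompt are supplied by the caller: neither is documented in the model card.
    """
    # Imported lazily so this module can be imported without the heavy dependencies.
    from PIL import Image
    from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

    # Assumption: fall back to the model repository if no separate processor id is given.
    processor = Pix2StructProcessor.from_pretrained(processor_id or model_id)
    model = Pix2StructForConditionalGeneration.from_pretrained(model_id)
    model.eval()

    image = Image.open(image_path).convert("RGB")
    # DocVQA-style Pix2Struct processors render the text prompt as a header above the image.
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```

If the fine-tuned repository does not include processor files, passing `processor_id="google/pix2struct-docvqa-base"` is one option to try. The returned string still has to be parsed into fields by whatever format the model was trained to emit.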

Training data

The dataset consists of:

  • synthetically generated invoice images
  • fixed template layouts
  • corresponding target text sequences representing structured fields
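The card does not specify how structured fields are serialized into target text. As a sketch only, one simple option is a deterministic JSON string. The helper names `serialize_fields` and `parse_fields`, the field keys, and the example values below are hypothetical and are not taken from the actual dataset.

```python
import json


def serialize_fields(fields):
    """Turn a dict of invoice fields into one deterministic target string."""
    # sort_keys gives a fixed field order; ensure_ascii=False keeps Czech diacritics readable.
    return json.dumps(fields, ensure_ascii=False, sort_keys=True)


def parse_fields(generated_text):
    """Parse generated text back into a dict; return {} if it is not a valid JSON object."""
    try:
        parsed = json.loads(generated_text)
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}


# Hypothetical example record (not from the training data).
example = {"supplier": "Příklad s.r.o.", "invoice_number": "2024001", "total": "1210.00"}
target = serialize_fields(example)
```

A fixed key order and a strict parser make formatting errors in generated output easy to detect, at the cost of rejecting outputs that are only slightly malformed.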

Key properties:

  • clean and consistent visual structure
  • no OCR noise (the model reads the image directly, with no separate OCR step)
  • controlled output formatting
  • no real-world documents

This represents the baseline dataset for generative multimodal models.


Role in the pipeline

This model corresponds to:

V0 – Synthetic template-based dataset only

It is used to:

  • establish a baseline for generative document models
  • compare with:
    • token classification approaches (BERT, LiLT)
    • multimodal encoders (LayoutLMv3)
  • evaluate feasibility of end-to-end extraction

Intended uses

  • End-to-end invoice information extraction from images
  • Document VQA-style tasks
  • Research in generative document understanding
  • Comparison with structured prediction approaches

Limitations

  • Trained only on synthetic data
  • Sensitive to output formatting inconsistencies
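  • Lower stability compared to token classification models (in this run, F1 stays at 0.0 for the first four epochs and falls from 0.5907 at epoch 8 to 0.4294 at epoch 9)
  • Requires careful evaluation: exact string matching and field-level structured metrics can score the same output differently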
  • Performance depends on generation quality
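To illustrate the difference between string matching and a structured metric, here is a sketch of a micro-averaged field-level F1 over parsed outputs. It is not necessarily the metric behind the F1 values reported in this card. The function names and the normalization rule (collapse whitespace, ignore case) are assumptions.

```python
def normalize(value):
    """Collapse whitespace and ignore case before comparing field values."""
    return " ".join(str(value).split()).casefold()


def field_f1(predictions, references):
    """Micro-averaged F1 over (field, value) pairs from lists of dicts."""
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        for key, ref_value in ref.items():
            if key in pred and normalize(pred[key]) == normalize(ref_value):
                tp += 1  # field present and value matches
            else:
                fn += 1  # reference field missing or wrong in the prediction
        for key, pred_value in pred.items():
            if key not in ref or normalize(pred_value) != normalize(ref[key]):
                fp += 1  # predicted field is spurious or has the wrong value
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one field correct up to spacing and case, one field wrong.
preds = [{"supplier": "ACME  s.r.o.", "total": "100"}]
refs = [{"supplier": "acme s.r.o.", "total": "121"}]
score = field_f1(preds, refs)
```

In this example a whole-sequence exact match would score the prediction as entirely wrong, while the field-level metric gives partial credit for the correct supplier.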

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0001
  • train_batch_size: 4
  • eval_batch_size: 1
  • seed: 42
  • optimizer: AdamW (ADAMW_TORCH_FUSED) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: cosine_with_restarts
  • lr_scheduler_warmup_steps: 0.1 (fractional value; presumably a ratio of total training steps)
  • num_epochs: 10
  • mixed_precision_training: Native AMP
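The listed values can be approximated with `TrainingArguments` as sketched below. This is a reconstruction from the list above, not the author's training script. The mapping of "Native AMP" to `fp16=True`, the reading of the warmup value 0.1 as a ratio, and the helper names are assumptions.

```python
# Keyword arguments reconstructed from the hyperparameter list in this card.
HPARAMS = {
    "learning_rate": 1e-4,
    "per_device_train_batch_size": 4,
    "per_device_eval_batch_size": 1,
    "seed": 42,
    "optim": "adamw_torch_fused",
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "cosine_with_restarts",
    # Reported as-is; assumed to be read as a fraction of total steps when below 1.
    "warmup_steps": 0.1,
    "num_train_epochs": 10,
    # "fp16": True,  # assumption: "Native AMP" usually corresponds to fp16 (needs a GPU)
}

# The results table in this card reports 300 steps per epoch over 10 epochs.
TOTAL_STEPS = 300 * 10


def warmup_steps_if_ratio(total_steps, ratio):
    """Warmup length in steps if the reported 0.1 is a ratio (assumption)."""
    return round(total_steps * ratio)


def build_training_arguments(output_dir):
    """Build TrainingArguments lazily so importing this module stays lightweight."""
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **HPARAMS)
```

Under the ratio interpretation, `warmup_steps_if_ratio(TOTAL_STEPS, 0.1)` gives 300 steps, i.e. the first epoch of this run.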

Training results

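| Training Loss | Epoch | Step | Validation Loss | F1     |
|---------------|-------|------|-----------------|--------|
| 3.1072        | 1.0   | 300  | 2.9769          | 0.0    |
| 2.6572        | 2.0   | 600  | 2.8684          | 0.0    |
| 2.4810        | 3.0   | 900  | 2.6349          | 0.0    |
| 1.7941        | 4.0   | 1200 | 1.6395          | 0.0    |
| 0.8458        | 5.0   | 1500 | 1.0680          | 0.2173 |
| 0.6198        | 6.0   | 1800 | 0.7713          | 0.4835 |
| 0.1999        | 7.0   | 2100 | 0.4331          | 0.5700 |
| 0.0946        | 8.0   | 2400 | 0.3844          | 0.5907 |
| 0.1020        | 9.0   | 2700 | 0.4066          | 0.4294 |
| 0.0842        | 10.0  | 3000 | 0.5022          | 0.4665 |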
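Since validation F1 peaks before the final epoch, it can help to pick the checkpoint by metric instead of taking the last one. The sketch below selects the best epoch from the results reported above. The `EpochResult` type and variable names are illustrative.

```python
from typing import NamedTuple


class EpochResult(NamedTuple):
    epoch: int
    step: int
    train_loss: float
    val_loss: float
    f1: float


# Values copied from the training results in this card.
RESULTS = [
    EpochResult(1, 300, 3.1072, 2.9769, 0.0),
    EpochResult(2, 600, 2.6572, 2.8684, 0.0),
    EpochResult(3, 900, 2.4810, 2.6349, 0.0),
    EpochResult(4, 1200, 1.7941, 1.6395, 0.0),
    EpochResult(5, 1500, 0.8458, 1.0680, 0.2173),
    EpochResult(6, 1800, 0.6198, 0.7713, 0.4835),
    EpochResult(7, 2100, 0.1999, 0.4331, 0.5700),
    EpochResult(8, 2400, 0.0946, 0.3844, 0.5907),
    EpochResult(9, 2700, 0.1020, 0.4066, 0.4294),
    EpochResult(10, 3000, 0.0842, 0.5022, 0.4665),
]

# Select by highest F1 and, separately, by lowest validation loss.
best_by_f1 = max(RESULTS, key=lambda r: r.f1)
best_by_loss = min(RESULTS, key=lambda r: r.val_loss)

print(f"best F1: epoch {best_by_f1.epoch} (F1={best_by_f1.f1})")
print(f"lowest validation loss: epoch {best_by_loss.epoch} (loss={best_by_loss.val_loss})")
```

Both criteria point to epoch 8 (step 2400). In a `Trainer` setup, the `load_best_model_at_end` and `metric_for_best_model` training arguments can automate this kind of selection; whether they were used for this model is not stated.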

Framework versions

  • Transformers 5.0.0
  • PyTorch 2.10.0+cu128
  • Datasets 4.0.0
  • Tokenizers 0.22.2