---
library_name: transformers
license: apache-2.0
base_model: google/pix2struct-docvqa-base
tags:
- generated_from_trainer
- invoice-processing
- information-extraction
- czech-language
- document-ai
- multimodal-model
- generative-model
- synthetic-data
metrics:
- f1
model-index:
- name: Pix2StructCzechInvoice-V0
  results: []
---

# Pix2StructCzechInvoice (V0 – Synthetic Templates Only)

This model is a fine-tuned version of [google/pix2struct-docvqa-base](https://huggingface.co/google/pix2struct-docvqa-base) for structured information extraction from Czech invoices.
It achieves the following results on the evaluation set:
- Loss: 0.5022
- F1: 0.5907

---

## Model description

Pix2StructCzechInvoice (V0) is a generative multimodal model designed for document understanding. Unlike token classification models (e.g., BERT, LiLT, LayoutLMv3), this model:

- processes the entire document image
- generates structured outputs as text sequences

The model is trained to extract key invoice fields such as:

- supplier
- customer
- invoice number
- bank details
- totals
- dates

---

## Training data

The dataset consists of:

- synthetically generated invoice images
- fixed template layouts
- corresponding target text sequences representing structured fields

Key properties:

- clean and consistent visual structure
- no OCR noise (end-to-end image input)
- controlled output formatting
- no real-world documents

This represents the **baseline dataset for generative multimodal models**.

---

## Role in the pipeline

This model corresponds to:

**V0 – Synthetic template-based dataset only**

It is used to:

- establish a baseline for generative document models
- compare with:
  - token classification approaches (BERT, LiLT)
  - multimodal encoders (LayoutLMv3)
- evaluate the feasibility of end-to-end extraction

---

## Intended uses

- End-to-end invoice information extraction from images (a minimal usage sketch is given at the end of this card)
- Document VQA-style tasks
- Research in generative document understanding
- Comparison with structured prediction approaches

---

## Limitations

- Trained only on synthetic data
- Sensitive to output formatting inconsistencies
- Lower stability compared to token classification models
- Requires careful evaluation (string matching vs. structured metrics; see the evaluation sketch at the end of this card)
- Performance depends on generation quality

---

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 4
- eval_batch_size: 1
- seed: 42
- optimizer: adamw_torch_fused with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
- lr_scheduler_type: cosine_with_restarts
- lr_scheduler_warmup_steps: 0.1
- num_epochs: 10
- mixed_precision_training: Native AMP

---

### Training results

| Training Loss | Epoch | Step | Validation Loss | F1     |
|:-------------:|:-----:|:----:|:---------------:|:------:|
| 3.1072        | 1.0   | 300  | 2.9769          | 0.0    |
| 2.6572        | 2.0   | 600  | 2.8684          | 0.0    |
| 2.4810        | 3.0   | 900  | 2.6349          | 0.0    |
| 1.7941        | 4.0   | 1200 | 1.6395          | 0.0    |
| 0.8458        | 5.0   | 1500 | 1.0680          | 0.2173 |
| 0.6198        | 6.0   | 1800 | 0.7713          | 0.4835 |
| 0.1999        | 7.0   | 2100 | 0.4331          | 0.5700 |
| 0.0946        | 8.0   | 2400 | 0.3844          | 0.5907 |
| 0.1020        | 9.0   | 2700 | 0.4066          | 0.4294 |
| 0.0842        | 10.0  | 3000 | 0.5022          | 0.4665 |

---

## Framework versions

- Transformers 5.0.0
- PyTorch 2.10.0+cu128
- Datasets 4.0.0
- Tokenizers 0.22.2
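
---

## Example usage

The card does not document an inference snippet, so the following is a minimal sketch using the standard Pix2Struct classes in `transformers`. The repository id, the prompt text, the image path, and the generation length are placeholders and assumptions, not values documented for this model; adapt them to your setup.

```python
import torch
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

# Placeholder repository id -- replace with the actual location of this checkpoint.
model_id = "<your-org>/Pix2StructCzechInvoice-V0"

processor = Pix2StructProcessor.from_pretrained(model_id)
model = Pix2StructForConditionalGeneration.from_pretrained(model_id)
model.eval()

# Any Czech invoice rendered as an image; the path is illustrative.
image = Image.open("invoice.png").convert("RGB")

# The base checkpoint is a DocVQA model, so the text prompt is rendered onto the
# image as a header. The exact prompt used during fine-tuning is not documented
# here; this one is an assumption.
prompt = "Extract the invoice fields."

inputs = processor(images=image, text=prompt, return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=512)

# The decoded string is the structured field sequence the model was trained to emit.
print(processor.decode(generated_ids[0], skip_special_tokens=True))
```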
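
---

## Evaluation notes

The reported F1 is computed over generated text, so the score depends on how the output sequence is parsed into fields. The exact target format and scoring script are not documented in this card; the sketch below assumes a simple `field: value` line format with exact string matching per field, purely to illustrate one way a field-level F1 can be computed.

```python
def parse_fields(sequence: str) -> dict[str, str]:
    """Parse a generated sequence of the assumed form 'field: value' per line."""
    fields = {}
    for line in sequence.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip().lower()] = value.strip()
    return fields


def field_f1(prediction: str, reference: str) -> float:
    """Exact-match field-level F1 between a predicted and a reference sequence."""
    pred = parse_fields(prediction)
    gold = parse_fields(reference)
    if not pred or not gold:
        return 0.0
    matched = sum(1 for key, value in pred.items() if gold.get(key) == value)
    precision = matched / len(pred)
    recall = matched / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Example: two of the three reference fields are reproduced exactly, so F1 is about 0.67.
print(field_f1(
    "invoice number: 2024001\ntotal: 1210 CZK\ndate: 2024-01-31",
    "invoice number: 2024001\ntotal: 1210 CZK\ndate: 2024-02-01",
))
```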