---
library_name: transformers
license: apache-2.0
base_model: TomasFAV/Pix2StructCzechInvoice
tags:
- generated_from_trainer
- invoice-processing
- information-extraction
- czech-language
- document-ai
- multimodal-model
- generative-model
- synthetic-data
- layout-augmentation
metrics:
- f1
model-index:
- name: Pix2StructCzechInvoice-V1
  results: []
---

# Pix2StructCzechInvoice (V1 – Synthetic + Random Layout)

This model is a fine-tuned version of [TomasFAV/Pix2StructCzechInvoice](https://huggingface.co/TomasFAV/Pix2StructCzechInvoice) for structured information extraction from Czech invoices.

It achieves the following results on the evaluation set:

- Loss: 0.4679
- F1: 0.6432

---

## Model description

Pix2StructCzechInvoice (V1) extends the baseline generative model by introducing layout variability into the training data. Unlike token classification models, this model:

- processes full document images
- generates structured outputs as text sequences

A minimal inference sketch is provided at the end of this card.

It is trained to extract key invoice fields:

- supplier
- customer
- invoice number
- bank details
- totals
- dates

---

## Training data

The dataset consists of:

- synthetically generated invoice images
- augmented variants with randomized layouts
- corresponding structured text outputs

Key properties:

- variable layout structure
- visual diversity (spacing, positioning, formatting)
- consistent annotation format
- fully synthetic data

This introduces **layout variability in the visual domain**, which is crucial for generative multimodal models.

---

## Role in the pipeline

This model corresponds to:

**V1 – Synthetic templates + randomized layouts**

It is used to:

- evaluate the effect of layout variability on generative models
- compare against:
  - V0 (fixed templates)
  - later hybrid and real-data stages (V2, V3)
- analyze the robustness of end-to-end extraction

---

## Intended uses

- End-to-end invoice extraction from images
- Document VQA-style tasks
- Research in generative document understanding
- Comparison with structured prediction models

---

## Limitations

- Still trained only on synthetic data
- Sensitive to output formatting inconsistencies
- Training instability (F1 fluctuates across epochs; see the training results below)
- Evaluation depends on string-matching quality (see the metric sketch at the end of this card)
- Less interpretable than token classification models

---

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (a configuration sketch is provided at the end of this card):

- learning_rate: 0.0001
- train_batch_size: 8
- eval_batch_size: 1
- seed: 42
- optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: cosine_with_restarts
- lr_scheduler_warmup_steps: 0.1
- num_epochs: 10
- mixed_precision_training: Native AMP

---

### Training results

| Training Loss | Epoch | Step | Validation Loss | F1     |
|:-------------:|:-----:|:----:|:---------------:|:------:|
| 0.1978        | 1.0   | 75   | 0.3757          | 0.5804 |
| 0.1031        | 2.0   | 150  | 0.3578          | 0.6399 |
| 0.0725        | 3.0   | 225  | 0.3504          | 0.6318 |
| 0.0512        | 4.0   | 300  | 0.3929          | 0.6396 |
| 0.0500        | 5.0   | 375  | 0.4072          | 0.6394 |
| 0.0462        | 6.0   | 450  | 0.4655          | 0.4377 |
| 0.0502        | 7.0   | 525  | 0.6320          | 0.3384 |
| 0.0528        | 8.0   | 600  | 0.4835          | 0.5018 |
| 0.0393        | 9.0   | 675  | 0.4679          | 0.6432 |
| 0.0392        | 10.0  | 750  | 0.5330          | 0.4931 |

---

## Framework versions

- Transformers 5.0.0
- PyTorch 2.10.0+cu128
- Datasets 4.0.0
- Tokenizers 0.22.2
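---

## Usage example

As noted in the model description, the model consumes a full document image and emits the extracted fields as a text sequence. Below is a minimal inference sketch using the standard `transformers` Pix2Struct classes. The checkpoint id, the input file name, and the generation length are placeholders, and the exact format of the generated field sequence is not specified by this card.

```python
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

# Placeholder checkpoint id; the exact repo id of the V1 weights may differ.
model_id = "TomasFAV/Pix2StructCzechInvoice"

processor = Pix2StructProcessor.from_pretrained(model_id)
model = Pix2StructForConditionalGeneration.from_pretrained(model_id)

image = Image.open("invoice.png")  # placeholder path to an invoice scan

# Pix2Struct renders the image into visual patches; no text prompt is assumed here.
inputs = processor(images=image, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=512)

# The extracted fields arrive as a single generated text sequence.
prediction = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(prediction)
```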
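---

## Training configuration sketch

For orientation, the hyperparameters listed under "Training procedure" map onto `Seq2SeqTrainingArguments` roughly as below. This is a reconstruction, not the original training script: the fractional `lr_scheduler_warmup_steps: 0.1` is read as a warmup ratio, `predict_with_generate` is assumed because evaluation is generation-based, and the output directory is a placeholder.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="pix2struct-czech-invoice-v1",  # placeholder
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=1,
    seed=42,
    optim="adamw_torch_fused",                 # AdamW, betas=(0.9, 0.999), eps=1e-8
    lr_scheduler_type="cosine_with_restarts",
    warmup_ratio=0.1,                          # assumed reading of the 0.1 value
    num_train_epochs=10,
    fp16=True,                                 # native AMP mixed precision
    predict_with_generate=True,                # assumption: F1 computed on generated text
)
```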
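---

## Evaluation metric sketch

The limitations above note that evaluation depends on string-matching quality. The actual metric implementation is not documented in this card; a plausible field-level F1 under exact string matching could look like the following illustrative sketch.

```python
def field_f1(pred_fields: dict[str, str], gold_fields: dict[str, str]) -> float:
    """Field-level F1 with exact string matching (hypothetical scoring).

    A predicted field counts as a true positive only when both the field
    name and its value match the gold annotation exactly.
    """
    tp = sum(1 for key, value in pred_fields.items() if gold_fields.get(key) == value)
    fp = len(pred_fields) - tp
    fn = len(gold_fields) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Example: one correct field, one wrong value, one missed field.
pred = {"invoice_number": "2023001", "total": "1 500 Kč"}
gold = {"invoice_number": "2023001", "total": "1 250 Kč", "supplier": "ACME s.r.o."}
print(field_f1(pred, gold))  # 1 TP, 1 FP, 2 FN -> precision 0.5, recall 1/3, F1 0.4
```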