---
license: apache-2.0
datasets:
- dedoc/paragraph_dataset
language:
- ru
- en
metrics:
- f1
- accuracy
---

# Paragraph classifier

The classifier performs binary classification of text lines in PDF or scanned documents.

For each document line, it determines whether:

* the line is the beginning of a new paragraph, or
* the line is a continuation of the previous paragraph.

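To illustrate how the two classes can be used downstream, here is a minimal sketch that merges lines into paragraphs from per-line binary labels. The function name `group_lines_into_paragraphs` and the boolean label encoding (`True` for "beginning of a new paragraph") are assumptions made for this example, not part of dedoc's API.

```python
from typing import List


def group_lines_into_paragraphs(lines: List[str], starts_paragraph: List[bool]) -> List[str]:
    """Merge text lines into paragraphs using per-line binary labels.

    Hypothetical helper: True means the line begins a new paragraph,
    False means it continues the previous one.
    """
    if len(lines) != len(starts_paragraph):
        raise ValueError("lines and labels must have the same length")

    paragraphs: List[List[str]] = []
    for line, is_start in zip(lines, starts_paragraph):
        # The first line always opens a paragraph, whatever its label says.
        if is_start or not paragraphs:
            paragraphs.append([line.strip()])
        else:
            paragraphs[-1].append(line.strip())
    return [" ".join(parts) for parts in paragraphs]


lines = [
    "Document analysis systems split pages",
    "into text lines before further processing.",
    "Paragraph detection then decides",
    "which lines belong together.",
]
labels = [True, False, True, False]
paragraphs = group_lines_into_paragraphs(lines, labels)
print(paragraphs)
```

Joining with a single space is a simplification; words hyphenated across line breaks would need extra handling.
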
For each line, a feature vector is built from the line's text and formatting; see
`dedoc/structure_extractors/feature_extractors/paragraph_feature_extractor.py` in [dedoc](https://github.com/ispras/dedoc).

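As a rough illustration of what text and formatting features for a line might look like, the sketch below computes a few toy features. These are hypothetical and are not the features implemented in dedoc's `paragraph_feature_extractor.py`; the function name `toy_line_features` and the `left_indent` parameter (assumed to come from layout analysis) are invented for this example.

```python
import re
from typing import Dict, Optional


def toy_line_features(text: str, left_indent: float, prev_text: Optional[str] = None) -> Dict[str, float]:
    """Illustrative features only; NOT the feature set used by dedoc."""
    stripped = text.strip()
    prev = prev_text.strip() if prev_text is not None else ""
    return {
        # Formatting: horizontal offset of the line (assumed to be given).
        "indent": float(left_indent),
        # Text: does the line start with an uppercase letter?
        "starts_upper": float(stripped[:1].isupper()),
        # Text: does the line start with a list marker such as "1." or "-"?
        "starts_with_list_marker": float(bool(re.match(r"^(\d+[.)]|[-*•])\s", stripped))),
        # Context: did the previous line end a sentence (or is there none)?
        "prev_ends_sentence": float(prev.endswith((".", "!", "?", ":")) or not prev),
        # Text: line length in characters.
        "length": float(len(stripped)),
    }


features = toy_line_features(
    "    Paragraph detection then decides",
    left_indent=4.0,
    prev_text="into text lines before further processing.",
)
print(features)
```

A vector like this could be fed to any binary classifier; which model and features dedoc actually uses is defined in the repository linked above.
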
* The training data are available in the [dedoc/paragraph_dataset](https://huggingface.co/datasets/dedoc/paragraph_dataset) dataset.
* The training script is [train_paragraph_classifier.py](https://github.com/ispras/dedoc/blob/master/scripts/train/train_paragraph_classifier.py).