
	---
	license: apache-2.0
	datasets:
	- dedoc/paragraph_dataset
	language:
	- ru
	- en
	metrics:
	- f1
	- accuracy
	---

	# Paragraph classifier

	The classifier is used for binary classification of text lines in PDF or scanned documents.

	For each document line, it determines whether:

	* the line begins a new paragraph, or

	* the line continues the previous paragraph.
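
To illustrate how the two labels above can be consumed downstream, here is a minimal sketch that merges lines into paragraphs from per-line predictions. The function name `group_into_paragraphs` and the boolean label encoding (`True` for "begins a new paragraph") are assumptions made for this example; they are not part of dedoc's API.

```python
from typing import List


def group_into_paragraphs(lines: List[str], starts_paragraph: List[bool]) -> List[str]:
    """Merge document lines into paragraphs using per-line binary labels.

    Hypothetical helper: starts_paragraph[i] is True when line i begins a new
    paragraph and False when it continues the previous one.
    """
    if len(lines) != len(starts_paragraph):
        raise ValueError("lines and starts_paragraph must have the same length")

    paragraphs: List[List[str]] = []
    for text, is_start in zip(lines, starts_paragraph):
        # The very first line always opens a paragraph, whatever its label says.
        if is_start or not paragraphs:
            paragraphs.append([text])
        else:
            paragraphs[-1].append(text)

    # Join the lines of each paragraph with single spaces.
    return [" ".join(parts) for parts in paragraphs]


lines = ["The classifier labels", "every line.", "A new paragraph", "starts here."]
labels = [True, False, True, False]
paragraphs = group_into_paragraphs(lines, labels)
```

With these inputs, the four lines collapse into two paragraphs, one per `True` label.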

	For each line, a feature vector is formed from the line's text and formatting; see
	`dedoc/structure_extractors/feature_extractors/paragraph_feature_extractor.py` in [dedoc](https://github.com/ispras/dedoc).
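
As a rough illustration of what "a feature vector formed from text and formatting" can look like, here is a toy sketch. The `Line` dataclass, the `line_features` function, and the six features chosen are all hypothetical and are not taken from `paragraph_feature_extractor.py`; the actual feature set is defined in that file.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Line:
    """Hypothetical container for one document line (not a dedoc class)."""
    text: str
    indent: float     # left offset of the line, in arbitrary units
    font_size: float
    bold: bool


def line_features(line: Line, prev: Optional[Line]) -> List[float]:
    """Build a toy feature vector from a line and its predecessor."""
    stripped = line.text.strip()
    prev_text = prev.text.strip() if prev is not None else ""
    # For the first line there is no predecessor, so the deltas are zero.
    prev_indent = prev.indent if prev is not None else line.indent
    prev_font = prev.font_size if prev is not None else line.font_size
    return [
        float(stripped[:1].isupper()),                      # starts with a capital letter
        float(prev_text.endswith((".", "!", "?", ":"))),    # previous line ends a sentence
        float(len(stripped)),                               # line length in characters
        line.indent - prev_indent,                          # indentation change
        line.font_size - prev_font,                         # font size change
        float(line.bold),                                   # bold formatting
    ]


first = Line("First line of text.", indent=20.0, font_size=12.0, bold=False)
second = Line("Second paragraph starts", indent=40.0, font_size=12.0, bold=True)
features = line_features(second, first)
```

A vector like this would then be passed to a binary classifier; which features and which classifier dedoc uses is determined by the extractor and the training script, not by this sketch.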


	* The training data are available in the [dedoc/paragraph_dataset](https://huggingface.co/datasets/dedoc/paragraph_dataset) dataset.

	* The training script is [train_paragraph_classifier.py](https://github.com/ispras/dedoc/blob/master/scripts/train/train_paragraph_classifier.py).