Add model card and metadata

#1
by nielsr HF Staff - opened
Files changed (1): README.md (+92, −0)

README.md ADDED:

---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- ocr
- vision-language-model
- document-understanding
---

# OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models

OCRVerse is a holistic OCR method that unifies text-centric OCR (extracting text from documents such as books and magazines) and vision-centric OCR (identifying visual elements in information-dense sources such as charts, web pages, and scientific plots) in a single end-to-end vision-language model.

- **Paper:** [OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models](https://huggingface.co/papers/2601.21639)
- **GitHub Repository:** [DocTron-hub/OCRVerse](https://github.com/DocTron-hub/OCRVerse)

## Usage Example

To use OCRVerse, please ensure you have the `transformers` library installed:

```shell
pip install "transformers>=4.57.0"
```
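
Loading the model with `device_map` (as in the example below) additionally requires the `accelerate` package; this is standard `transformers` behavior rather than something stated in the original card:

```shell
pip install accelerate
```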

### Text-Centric Document Parsing

Below is a simple example of using OCRVerse for document parsing tasks.

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

# Load the model and processor
model_path = 'DocTron/OCRVerse'
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    dtype="auto",
    device_map="cuda",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Prepare the input with an image and a text prompt
image_path = "path/to/your/image.jpg"
# We recommend using the following prompt for better performance
prompt = "Extract the main content from the document in the image, keeping the original structure. Convert all formulas to LaTeX and all tables to HTML."

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ]
    }
]

# Tokenize the conversation and move it to the model's device
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Generate the output deterministically (greedy decoding)
generated_ids = model.generate(**inputs, max_new_tokens=8192, do_sample=False)

# Strip the prompt tokens from each sequence before decoding
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```
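
### Vision-Centric Parsing

OCRVerse also targets vision-centric sources such as charts, web pages, and scientific plots; the pipeline is the same, with only the image and prompt swapped. The sketch below is not from the original card: it reuses the `model` and `processor` loaded above, and the chart-parsing prompt is an illustrative assumption rather than an author-recommended one.

```python
# Minimal sketch: vision-centric OCR on a chart image.
# Assumes `model` and `processor` are already loaded as in the snippet above.
# The prompt is a hypothetical example, not an official recommendation.
chart_path = "path/to/your/chart.png"
chart_prompt = "Parse the chart in the image and convert its underlying data into an HTML table."

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": chart_path},
            {"type": "text", "text": chart_prompt},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Greedy decoding, then strip the prompt tokens before decoding the answer
generated_ids = model.generate(**inputs, max_new_tokens=8192, do_sample=False)
generated_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
```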

## Citation

If you find this project useful, please cite our paper:

```bibtex
@article{zhong2026ocrverse,
  title={OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models},
  author={Yufeng Zhong and Lei Chen and Xuanle Zhao and Wenkang Han and Liming Zheng and Jing Huang and Deyang Jiang and Yilin Cao and Lin Ma and Zhixiong Zeng},
  journal={arXiv preprint arXiv:2601.21639},
  year={2026}
}
```