# TASK
You will extract structured information from the CONTEXT using the INPUT SCHEMA (JSON) below and return exactly ONE JSON object that matches the INPUT SCHEMA.
# INPUTS
1) Text document: provided in the CONTEXT block below.
2) SCHEMA: provided in the SCHEMA block below.
# OUTPUT (MANDATORY)
* Return ONLY a single JSON object. No prose, no code fences, no explanations, no backticks "```".
* The JSON must strictly match the SCHEMA keys, nesting, and types.
# GLOBAL RULES
* Extract ONLY what the SCHEMA asks for. Do not add nodes.
* If a value is missing or cannot be confidently determined:
  * For leaf fields (string/boolean/integer/number): use null
  * For arrays (including multi-label lists): use []
  * For lists of objects: return [] if no instances
* Language: If you must generate text (type "string"), write it in the SAME language as the CONTEXT.
* Grounding: Never hallucinate. Every value must be supported by the CONTEXT unless the SCHEMA type is "string" (which allows concise reformulation/inference still grounded in the text).
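The missing-value rules above can be sketched as a small helper that builds the "nothing found" output for a schema. This is an illustrative sketch only: the function name `empty_output` is hypothetical, and it assumes the schema conventions described in this prompt (a flat list of two or more strings is a mono-label enum, any other list is an array).

```python
def empty_output(schema):
    """Build the output to return when nothing can be extracted (hypothetical helper)."""
    if isinstance(schema, dict):
        # Objects keep every key; each child is filled recursively.
        return {key: empty_output(value) for key, value in schema.items()}
    if isinstance(schema, list):
        # A flat list of two or more strings is a mono-label enum: use null.
        if len(schema) >= 2 and all(isinstance(item, str) for item in schema):
            return None
        # Typed lists, lists of objects and multi-label lists: use [].
        return []
    # Leaf type name such as "string" or "integer": use null.
    return None


schema = {
    "name": "verbatim-string",
    "kind": ["x", "y"],
    "tags": [["a", "b"]],
    "items": [{"id": "integer"}],
}
print(empty_output(schema))
# {'name': None, 'kind': None, 'tags': [], 'items': []}
```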
# TYPE RULES
The JSON leaves can take the following base types:
* integer: An integer number.
* number: Any number, which may be a floating-point number or an integer.
* string: A string. It may be abstractive and may allow the model to return values deduced from knowledge or reasoning.
* verbatim-string: A `string` exactly as it appears in the input. This type is purely extractive: the string should be present exactly as it is in the input, preserving all characters including accents, symbols, emojis or any other Unicode character. The verbatim string shouldn’t contain newlines, tabs or multiple consecutive whitespace characters. Each such sequence should be represented by a single space.
* date: An ISO 8601 compliant date string. It may feature "reduced accuracy" and be of the form "YYYY-MM-DD", "YYYY-MM", "YYYY", "--MM-DD" (month and day with nullified year value), "YYYY-Www" (week date, the lowercase "w" characters are replaced with the week number) or "YYYY-Www-D" (week date with day number between 1 and 7).
* time: An ISO 8601 compliant time string. It may feature "reduced accuracy" and be of the form "hh:mm:ss.s", "hh:mm:ss", "hh:mm" or "hh". It may also include a timezone component of the form "+hh:mm", "-hh:mm", "+hh" or "-hh" appended to the former part.
* date-time: An ISO 8601 compliant date-time string ("YYYY-MM-DDThh:mm:ss.s+hh:mm"). It is composed of a date part, a time part, or both. It may feature "reduced accuracy", omitting certain components: on the date part if there is only a date part, or on the time part if there is a time part.
* duration: An ISO 8601 compliant duration string ("PnYnMnDTnHnMnS" where "n" are integers). It contains a date part and a time part separated by a "T" character, which contain several components: "nY" for years, "nM" for months (in the date part), "nW" for weeks (cannot be combined with "Y"/"M"/"D" in the same string), "nD" for days, "T" is the separator before time components, "nH" for hours, "nM" for minutes (in the time part), "nS" for seconds (may include decimals, e.g. "PT0.5S"). The duration string might feature "reduced accuracy" by combining the enumerated components in the same order, except the "PnW" component which cannot be mixed with the other date components ("Y"/"M"/"D").
* boolean: A boolean being either `true` or `false`.
* country: Uppercase 2-character country code following the ISO 3166-1 standard.
* currency: Uppercase 3-character currency code following the ISO 4217 standard. It covers list 1 (currently used currencies) and list 3 (old unused currencies).
* language: Lowercase 3-character language code following the ISO 639-3 standard. Retired (deprecated) codes are not supported.
* language-tag: Language tag following the IETF BCP 47 / RFC 5646 standard. A language tag identifies a language, optionally including its script, region, and variant. Its components must follow specific ISO or registry standards:
  - **Language subtag**: 2–3 letters (ISO 639-1/2) identifying the language, e.g., "en" for English.
  - **Script subtag** (optional): 4 letters (ISO 15924) indicating the writing system, e.g., "Latn".
  - **Region subtag** (optional): 2 letters (ISO 3166-1) or 3 digits (UN M.49) specifying a country or region, e.g., "US".
  - **Variant subtags** (optional): 4–8 alphanumeric characters providing dialect, orthography, or other variations, e.g., "oxendict".
  - **Extensions and private-use subtags**: single-letter extensions and subtags for custom usage, e.g., "x-custom".
* script: Titlecase 4-character script code following the ISO 15924 standard.
* url: An IRI (Internationalized Resource Identifier) following the RFC 3987 standard. An IRI extends the URI syntax defined in RFC 3986 by allowing Unicode characters beyond the ASCII set. It can identify resources using schemes such as HTTP, HTTPS, FTP, or mailto. When IRIs are transmitted in protocols that require ASCII-only encoding, they are converted to URIs through percent-encoding and Punycode (for internationalized domain names, per IDNA2008). Components:
  - **Scheme**: protocol identifier, e.g., "http", "https", "ftp".
  - **Authority**: includes user info, host, and port. Hosts may be internationalized domain names (IDN, RFC 5890+) using Unicode.
  - **Path**: hierarchical part of the resource, may include Unicode characters.
  - **Query**: additional data, introduced by "?", may include Unicode.
  - **Fragment**: identifier within the resource, introduced by "#".
* email-address: An email address string complying with the RFC 5322 and RFC 6531 standards. It is composed of a local part (username), the "@" separator, and a domain name. The local part may include dots, hyphens, underscores, or quoted strings, while the domain follows DNS naming rules and may include internationalized characters.
* phone-number: A phone number. If the region code (e.g. +1 for the United States and Canada) is present or can be inferred, the string complies with the ITU E.164 standard, e.g. `+14155552671`. Otherwise, the string only contains digits and stays as close as possible to the form present in the input document, e.g. a phone number appearing as `650.555.0123` is extracted as `6505550123`. If the value is E.164 compliant (with region code), it must also be diallable. For example, `+14155552671` is syntactically E.164 compliant but is not diallable, so not valid.
* iban: International Bank Account Number complying with the ISO 13616-1 standard. An IBAN consists of a two-letter ISO 3166-1 country code, two check digits, and up to thirty alphanumeric characters for the domestic bank account number (BBAN). The exact length and structure depend on the issuing country.
* bic: Business Identifier Code complying with the ISO 9362 standard. The first four characters are the business code, the next two are the ISO 3166-1 country code of the business, the next two are the location code, and the last three are the optional agency/branch code ("XXX" by default).
* unit-code: A UCUM (Unified Code for Units of Measure) unit code.
* region:XX: Uppercase subdivision code of up to 3 characters complying with ISO 3166-2, where "XX" is an uppercase 2-character ISO 3166-1 country code among: US, FR, IE, GB, IT, ES, DE, PT, CA, MX, BR, AU, JP, KR, CN, IN, VN, TH, RU, PL. For example, for region:US: "NY" for the state of New York, "DC" for the District of Columbia, or "GU" for the outlying area of Guam. For region:FR: "49" for the "Maine-et-Loire" département, "MQ" for the overseas region of Martinique, or "V" for the "Rhône-Alpes" région.
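A few of the type rules above can be checked mechanically. The sketch below is illustrative and not part of the extraction contract: the helper names `normalize_verbatim`, `digits_only` and `is_iso_duration` are hypothetical, and the duration pattern only covers the "PnYnMnDTnHnMnS" and "PnW" forms described in this list.

```python
import re


def normalize_verbatim(text):
    """Collapse newlines, tabs and runs of whitespace into single spaces.

    Note: this also strips leading and trailing whitespace.
    """
    return " ".join(text.split())


def digits_only(phone):
    """Fallback for phone numbers without an inferable region code: keep digits only."""
    return re.sub(r"\D", "", phone)


# Either a weeks-only duration, or date components optionally followed by "T" and time components.
_DURATION = re.compile(
    r"P(?:(?P<weeks>\d+W)"
    r"|(?P<date>(?:\d+Y)?(?:\d+M)?(?:\d+D)?)"
    r"(?:T(?P<time>(?:\d+H)?(?:\d+M)?(?:\d+(?:\.\d+)?S)?))?)"
)


def is_iso_duration(value):
    """Return True if value looks like a duration in the forms listed above."""
    match = _DURATION.fullmatch(value)
    if match is None:
        return False
    if match.group("weeks"):
        return True
    date, time = match.group("date"), match.group("time")
    if time == "":
        # A "T" separator must be followed by at least one time component.
        return False
    # A bare "P" carries no component at all and is rejected.
    return bool(date or time)


print(normalize_verbatim("Total\tdue:\n  42 €"))  # Total due: 42 €
print(digits_only("650.555.0123"))                # 6505550123
print(is_iso_duration("PT0.5S"))                  # True
print(is_iso_duration("P1W2D"))                   # False
```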
Additionally, the input schema may feature:
* lists of objects `[x]` (always strictly one element in the list): the schema may contain a list of leaves of a specific type such as `["verbatim-string"]`, in which case the node in the output JSON must be a list of leaves of this type. The list can also contain objects, i.e. dictionaries with nested keys, values and lists. The list in the output JSON can be empty if no information from the input context matches.
* an enum/classification `["choice1", "choice2"]` (list with at least two items): the output leaf must be an item from the enum list exactly as it is in the list, and NOT IN A LIST. In the previous example, the output leaf may be `"choice1"` or `"choice2"`, NOT IN A LIST. If multiple choices fit, choose the most relevant/accurate one. It can also be `null` if no information from the input context matches any of the choices.
* a multi-enum/multi-classification `[["A", "B", "C"]]` (nested list whose inner list has at least two items), i.e. multiple values that can be chosen. The node in the output JSON is a list containing zero or more of the values in the input, for example `["A","C"]`.
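The three list shapes above are distinguished purely by structure. As a minimal sketch of that distinction (the function name `classify_node` and the returned labels are hypothetical, chosen for illustration):

```python
def classify_node(node):
    """Classify a schema node by its shape (hypothetical helper)."""
    if isinstance(node, dict):
        return "object"
    if isinstance(node, str):
        return "leaf"
    if isinstance(node, list):
        # [["A", "B", "C"]]: one element that is itself a list of choices.
        if len(node) == 1 and isinstance(node[0], list):
            return "multi-enum"
        # ["verbatim-string"] or [{...}]: strictly one element giving the item type.
        if len(node) == 1:
            return "list"
        # ["choice1", "choice2"]: two or more string choices.
        if len(node) >= 2 and all(isinstance(item, str) for item in node):
            return "enum"
    raise ValueError(f"unsupported schema node: {node!r}")


print(classify_node(["verbatim-string"]))            # list
print(classify_node(["MIT", "Apache 2.0"]))          # enum
print(classify_node([["text generation", "other"]])) # multi-enum
```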
# DISAMBIGUATION
* If multiple mentions exist, pick the value most strongly linked to the field (by proximity, headings, or explicit cues). If still tied, choose the most specific mention.
* Do not cross-contaminate fields; each value must correspond to its intended field.
# CONSISTENCY CHECKS (before you answer)
* All required keys from the SCHEMA are present.
* Types match the SCHEMA (booleans as true/false, numbers as numbers, dates in ISO 8601).
* MONO-LABEL outputs are plain strings (or null), not arrays.
* MULTI-LABEL outputs are arrays (possibly empty), with only allowed labels.
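The checks above could be automated on the consumer side along the following lines. This is a sketch under assumptions, not a reference validator: `check_output` and `leaf_ok` are hypothetical names, and every type other than boolean, integer and number is assumed to be serialised as a JSON string without further format validation.

```python
def leaf_ok(type_name, value):
    """Check the JSON type of a non-null leaf value."""
    if type_name == "boolean":
        return isinstance(value, bool)
    if type_name == "integer":
        return isinstance(value, int) and not isinstance(value, bool)
    if type_name == "number":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    # Assumption: all remaining types (string, date, url, ...) are JSON strings.
    return isinstance(value, str)


def check_output(schema, output, path="$"):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if isinstance(schema, dict):
        if not isinstance(output, dict):
            return [f"{path}: expected an object"]
        for key in sorted(schema.keys() - output.keys()):
            problems.append(f"{path}.{key}: missing key")
        for key in sorted(output.keys() - schema.keys()):
            problems.append(f"{path}.{key}: key not in schema")
        for key in schema:
            if key in output:
                problems += check_output(schema[key], output[key], f"{path}.{key}")
    elif isinstance(schema, list) and len(schema) == 1 and isinstance(schema[0], list):
        # MULTI-LABEL: an array (possibly empty) of allowed labels only.
        if not isinstance(output, list):
            problems.append(f"{path}: multi-label output must be an array")
        else:
            for label in output:
                if label not in schema[0]:
                    problems.append(f"{path}: label {label!r} is not allowed")
    elif isinstance(schema, list) and len(schema) >= 2:
        # MONO-LABEL: a plain string from the enum, or null.
        if output is not None and not (isinstance(output, str) and output in schema):
            problems.append(f"{path}: mono-label output must be one allowed string or null")
    elif isinstance(schema, list) and len(schema) == 1:
        # Typed list: every item is checked against the single element type.
        if not isinstance(output, list):
            problems.append(f"{path}: expected an array")
        else:
            for index, item in enumerate(output):
                problems += check_output(schema[0], item, f"{path}[{index}]")
    elif isinstance(schema, str):
        if output is not None and not leaf_ok(schema, output):
            problems.append(f"{path}: value does not match type {schema!r}")
    else:
        problems.append(f"{path}: unsupported schema node")
    return problems


schema = {"kitchen": "boolean", "wifi_access": ["Yes", "No"], "amenities": [["WiFi", "Balcony"]]}
good = {"kitchen": True, "wifi_access": "Yes", "amenities": ["WiFi"]}
bad = {"kitchen": "true", "wifi_access": ["Yes"], "amenities": ["Pool"]}
print(check_output(schema, good))  # []
for problem in check_output(schema, bad):
    print(problem)
```

With the `bad` output, the sketch reports one problem per field: a string where a boolean is expected, a mono-label value wrapped in a list, and a label outside the allowed set.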
# EXAMPLES
## Schema 1
```
{
  "Model": {
    "Name": "",
    "Number of parameters": "string",
    "Number of token": "verbatim-string",
    "Architecture": ["verbatim-string"],
    "Author": "verbatim-string"
  },
  "Usage": {
    "Use case": [["text generation","code generation","image generation","audio generation","video generation","other"]],
    "Licence": ["MIT","Apache 2.0","OpenRail","Commercial","Other Open Source"]
  }
}
```
## Output 1
```
{
  "Model": {
    "Name": "llama2",
    "Number of parameters": "70b",
    "Number of token": null,
    "Architecture": ["transformers"],
    "Author": null
  },
  "Usage": {
    "Use case": ["text generation","code generation"],
    "Licence": "MIT"
  }
}
```
## Schema 2
```
{
  "kitchen": "boolean",
  "floor": "integer",
  "number_of_floors": "string",
  "wifi_access": ["Yes","No"],
  "child_friendly": ["Yes","No"],
  "pets_allowed": ["Yes","No"],
  "privacy": "boolean",
  "parking": ["Yes","No"],
  "central_location": ["City center","Beach","Forest","Mountain","Village","Other"],
  "amenities": [["WiFi","Air conditioning","Satellite TV","Balcony","Supermarket","Restaurants","Beach","Parking","Pets allowed","Bed linen","Towels"]],
  "surroundings": [
    {
      "name": "string",
      "distance": "string"
    }
  ]
}
```
## Output 2
```
{
  "kitchen": true,
  "floor": 4,
  "number_of_floors": "1",
  "wifi_access": "Yes",
  "child_friendly": "Yes",
  "pets_allowed": "No",
  "privacy": true,
  "parking": "Yes",
  "central_location": "City center",
  "amenities": ["WiFi","Supermarket","Restaurants","Beach","Parking","Towels"],
  "surroundings": [
    {
      "name": null,
      "distance": null
    }
  ]
}
```