---
license: mit
language:
- en
tags:
- statement-extraction
- named-entity-recognition
- t5
- gemma
- seq2seq
- nlp
- information-extraction
- corp-o-rate
pipeline_tag: text2text-generation
---

# Statement Extractor (T5-Gemma 2)

A fine-tuned T5-Gemma 2 model for extracting structured statements from text. Part of [corp-o-rate.com](https://corp-o-rate.com).

## Model Description

This model extracts subject-predicate-object triples from unstructured text, with automatic entity type recognition and coreference resolution.

- **Architecture**: T5-Gemma 2 (270M encoder + 270M decoder, 540M parameters total)
- **Training Data**: 77,515 examples from corporate and news documents
- **Final Eval Loss**: 0.209
- **Max Input Length**: 4,096 tokens
- **Max Output Length**: 2,048 tokens

## Usage

### Python

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

model = AutoModelForSeq2SeqLM.from_pretrained(
    "Corp-o-Rate-Community/statement-extractor",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "Corp-o-Rate-Community/statement-extractor",
    trust_remote_code=True,
)

text = "Apple Inc. announced a commitment to carbon neutrality by 2030."
inputs = tokenizer(f"<page>{text}</page>", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=2048, num_beams=4)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```

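For documents longer than the 4,096-token input limit, one option is to split the text into overlapping chunks and run each chunk through the model separately. A minimal sketch, assuming a character-based approximation of the token limit (the `chunk_pages` helper and its defaults are illustrative, not part of the model package):

```python
def chunk_pages(text: str, max_chars: int = 8000, overlap: int = 200):
    """Split a long document into overlapping chunks, each wrapped in <page> tags.

    max_chars roughly approximates the 4,096-token limit (about 3-4
    characters per token for English text); tune it for your data.
    The overlap reduces the chance of cutting a statement in half.
    """
    pages = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        pages.append(f"<page>{text[start:end]}</page>")
        if end == len(text):
            break
        start = end - overlap  # step back so chunks overlap
    return pages

doc = "Apple Inc. announced a commitment to carbon neutrality by 2030. " * 300
pages = chunk_pages(doc)
print(len(pages))
```

Character counts only approximate token counts; for exact control, tokenize each candidate chunk and trim it until it fits the 4,096-token budget.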
### Input Format

Wrap your text in `<page>` tags:

```
<page>Your text here...</page>
```

### Output Format

The model outputs XML containing the extracted statements:

```xml
<statements>
  <stmt>
    <subject type="ORG">Apple Inc.</subject>
    <object type="EVENT">carbon neutrality by 2030</object>
    <predicate>committed to</predicate>
    <text>Apple Inc. committed to achieving carbon neutrality by 2030.</text>
  </stmt>
</statements>
```

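The generated XML can be turned back into Python structures with the standard library. A minimal sketch (the `parse_statements` helper is illustrative, not part of the model package):

```python
import xml.etree.ElementTree as ET

def parse_statements(xml_text: str):
    """Parse the model's <statements> XML into a list of triple dicts."""
    root = ET.fromstring(xml_text)
    triples = []
    for stmt in root.findall("stmt"):
        subject = stmt.find("subject")
        obj = stmt.find("object")
        triples.append({
            "subject": subject.text,
            "subject_type": subject.get("type"),
            "predicate": stmt.find("predicate").text,
            "object": obj.text,
            "object_type": obj.get("type"),
            "text": stmt.find("text").text,
        })
    return triples

output = """<statements>
  <stmt>
    <subject type="ORG">Apple Inc.</subject>
    <object type="EVENT">carbon neutrality by 2030</object>
    <predicate>committed to</predicate>
    <text>Apple Inc. committed to achieving carbon neutrality by 2030.</text>
  </stmt>
</statements>"""

triples = parse_statements(output)
print(triples[0]["subject"])  # Apple Inc.
```

Since the XML is generated token by token, it can occasionally be malformed; in production, wrap `ET.fromstring` in a `try`/`except ET.ParseError` and decide how to handle failed parses.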
## Entity Types

| Type | Description |
|------|-------------|
| ORG | Organizations (companies, agencies) |
| PERSON | People (names, titles) |
| GPE | Geopolitical entities (countries, cities) |
| LOC | Locations (mountains, rivers) |
| PRODUCT | Products (devices, services) |
| EVENT | Events (announcements, meetings) |
| WORK_OF_ART | Creative works (reports, books) |
| LAW | Legal documents |
| DATE | Dates and time periods |
| MONEY | Monetary values |
| PERCENT | Percentages |
| QUANTITY | Quantities and measurements |

## Demo

Try the interactive demo at [statement-extractor.vercel.app](https://statement-extractor.vercel.app).

## Training

- Base model: `google/t5gemma-2-270m-270m`
- Training examples: 77,515
- Final eval loss: 0.209
- Refinement phase: learning rate 1e-6 for 0.2 additional epochs
- Recommended decoding: beam search with `num_beams=4`

## About corp-o-rate

This model is part of [corp-o-rate.com](https://corp-o-rate.com), an AI-powered platform for ESG analysis and corporate accountability.

## License

MIT License