Egypt Constitution VLM

This model is a fine-tuned Vision-Language Model (VLM) based on Gemma-3-4B-IT (quantized to 4-bit via Unsloth). It is designed to extract highly structured JSON data—including constitutional articles, page metadata, legal intent, and named entities—directly from scanned images of Arabic constitutional and legal documents.

Model Details

Model Description

This model leverages Parameter-Efficient Fine-Tuning (PEFT) using LoRA on the Gemma-3 architecture. By processing image inputs of scanned legal documents alongside specific instructions, it accurately transcribes Arabic text while simultaneously structuring the output into a predefined JSON schema. Vision layers, Language layers, Attention modules, and MLP modules were all targeted during the fine-tuning process.

  • Developed by: Mahmoud Essam
  • Model type: Vision-Language Model (VLM) with LoRA Adapters
  • Language(s) (NLP): Arabic (content extraction), English (JSON keys/schema)
  • License: Apache 2.0 (Inherited from Gemma-3)
  • Finetuned from model: unsloth/gemma-3-4b-it-unsloth-bnb-4bit

Uses

Direct Use

The primary use case is the digitization and structured data extraction from scanned Arabic legal documents (specifically the Egyptian Constitution). By providing a document image as input, the model outputs a structured JSON object containing:

  • Page Metadata: Source document, page number, language.
  • Hierarchy Context: Part and Chapter titles.
  • Articles: raw text, cleaned body text, legal intent, key entities, Arabic summaries, and keywords.
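For illustration, a minimal sketch of what one extracted page could look like. The field names follow the schema requested in the extraction prompt later in this card; all values here are invented placeholders, not real model output:

```python
import json

# Illustrative output schema; keys mirror the extraction prompt,
# values are invented placeholders.
example_output = {
    "page_metadata": {
        "source_document": "Egyptian Constitution",
        "page_number": 100,
        "language": "ar",
    },
    "hierarchy_context": {
        "part_title": "(part title in Arabic)",
        "chapter_title": "(chapter title in Arabic)",
    },
    "articles": [
        {
            "article_id": "art_001",
            "article_number": 1,
            "content": {
                "body_text": "(cleaned article text)",
                "key_entities": ["(entity)"],
                "legal_intent": "(one-line intent)",
            },
            "training_features": {
                "summary_ar": "(Arabic summary)",
                "keywords": ["(keyword)"],
            },
            "text_raw": "(raw transcription)",
        }
    ],
}

print(json.dumps(example_output, ensure_ascii=False, indent=2))
```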

Out-of-Scope Use

  • Recognition of highly illegible handwritten Arabic documents.
  • General-purpose conversational AI or chat tasks (it is highly specialized for JSON extraction).
  • Processing documents in languages other than Arabic.

Bias, Risks, and Limitations

  • Domain Specificity: The model is heavily optimized for formal Arabic legal and constitutional texts. It may hallucinate or underperform on standard conversational Arabic or vastly different document layouts (e.g., newspapers, unstructured letters).
  • Resolution Sensitivity: The model relies heavily on a specific image preprocessing pipeline (resizing and padding to 1024x1024). Feeding raw, unformatted images may degrade performance.

Recommendations

Users should ensure that input images undergo the identical preprocessing steps used during training (Grayscale, Autocontrast, Denoising, Sharpening, and Padding) to achieve optimal extraction accuracy. Human-in-the-loop verification is recommended for critical legal digitization tasks.

How to Get Started with the Model

Use the code blocks below to run inference with the model via Unsloth.

Load the Model

import torch
import json
from PIL import Image, ImageOps, ImageFilter
from unsloth import FastVisionModel

# 1. Load Model & Tokenizer
model_id = "Humachine/egypt-constitution-vlm"
model, tokenizer = FastVisionModel.from_pretrained(
    model_name=model_id,
    load_in_4bit=True,
    trust_remote_code=True
)

Image Preprocessing Logic

def preprocess_image(image_path: str, target_size: tuple = (1024, 1024)) -> Image.Image:
    image = Image.open(image_path).convert('L')
    image = ImageOps.autocontrast(image, cutoff=1)
    image = image.filter(ImageFilter.MedianFilter(size=3))
    image = image.filter(ImageFilter.SHARPEN)
    image.thumbnail(target_size, Image.Resampling.LANCZOS)
    
    padded = Image.new('L', target_size, color=255)
    offset = ((target_size[0] - image.width) // 2, (target_size[1] - image.height) // 2)
    padded.paste(image, offset)
    return padded.convert('RGB')

Prepare Inputs

image_path = "/content/0100.jpg"
image = preprocess_image(image_path)

instruction = (
    "Extract all articles from this Arabic constitutional document. "
    "Return a JSON object with keys: page_metadata, hierarchy_context, and articles. "
    "Each article must include: article_id, article_number, content "
    "(body_text, key_entities, legal_intent), training_features (summary_ar, keywords), "
    "and text_raw."
)

messages = [
    {"role": "user", "content": [{"type": "image", "image": image}, {"type": "text", "text": instruction}]}
]

Generate Output

# Switch the model into inference mode (Unsloth optimization)
FastVisionModel.for_inference(model)

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# The processor needs both the image and the templated text
inputs = tokenizer(
    image,
    text,
    add_special_tokens=False,
    return_tensors="pt",
).to("cuda")

output_tokens = model.generate(**inputs, max_new_tokens=2048, use_cache=True, temperature=0.2, do_sample=True)
output_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

print(output_text)
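The decoded reply may wrap the JSON object in markdown fences or surround it with prose, so it is worth parsing defensively before use. A minimal sketch (the helper name is ours, not part of the model's API):

```python
import json
import re

def extract_json(output_text: str) -> dict:
    """Pull the first JSON object out of the model's decoded reply,
    tolerating markdown code fences around it."""
    # Strip ```json ... ``` fences if the model emitted them
    cleaned = re.sub(r"```(?:json)?", "", output_text)
    start = cleaned.find("{")
    end = cleaned.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in model output")
    return json.loads(cleaned[start:end + 1])

# Example with a fenced reply
reply = '```json\n{"page_metadata": {"page_number": 100}}\n```'
data = extract_json(reply)
print(data["page_metadata"]["page_number"])  # 100
```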

Citation

If you use this model in your research or application, please cite it as follows:

@misc{essam2026egypt,
  author       = {Mahmoud Essam},
  title        = {Egypt Constitution VLM: A Vision-Language Model for Arabic Legal JSON Extraction},
  howpublished = {Hugging Face},
  year         = {2026},
  url          = {https://huggingface.co/Humachine/egypt-constitution-vlm}
}