data-use-annotation / ANNOTATION_GUIDE.md

rafmacalaba

docs: add comprehensive ANNOTATION_GUIDE.md

e746bfe 16 days ago

📝 Annotation Tool — Guide

A HuggingFace Spaces app for validating AI-extracted dataset mentions in World Bank documents.

Quick Start

For Annotators

1. Go to the Space URL and click 🤗 Sign in with HuggingFace
2. You'll see only your assigned documents in the dropdown
3. Navigate pages with ← Prev / Next →
4. Open the Data Mentions panel to validate each mention
5. Track your progress in the top-right: Progress: 📄 PDF 3/55 | 📑 Page 2/12 | 🏷️ Verified 4/8

Validation Actions

| Action | What it does |
| --- | --- |
| ✅ Correct | Confirms the AI extraction is a real dataset mention |
| ❌ Incorrect | Marks the extraction as wrong / not a dataset |
| Click tag badge | Change dataset type (named, descriptive, generic) |
| Highlight text → Annotate | Manually add a dataset mention the AI missed |
| 🗑️ Delete | Remove a dataset entry entirely |

Tip: If you try to click "Next" with unverified mentions, you'll get a confirmation prompt.

Document Assignments

Each annotator sees only their assigned documents. A configurable percentage of the documents (default 10%) is shared across all annotators as overlap documents, so that inter-annotator agreement can be measured.

Configuration File

annotation_data/annotator_config.yaml:

settings:
  overlap_percent: 10  # % of docs shared between all annotators

annotators:
  - username: rafmacalaba     # HuggingFace username
    docs: [2, 3, 14, ...]     # assigned doc indices
  - username: rafaelmacalaba
    docs: [1, 2, 10, ...]
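
To make the relationship between this config and the overlap set concrete, here is a minimal sketch that works on the structure above once it has been parsed into a dictionary (for example by `yaml.safe_load`). The doc indices are made-up sample values, and `docs_for` and `shared_docs` are hypothetical helpers, not functions from the app.

```python
# Hypothetical parsed config; the doc indices are sample values, not real assignments.
config = {
    "settings": {"overlap_percent": 10},
    "annotators": [
        {"username": "rafmacalaba", "docs": [2, 3, 14]},
        {"username": "rafaelmacalaba", "docs": [1, 2, 10]},
    ],
}


def docs_for(config: dict, username: str) -> list[int]:
    """Return the doc indices assigned to one annotator (empty if unknown)."""
    for entry in config["annotators"]:
        if entry["username"] == username:
            return list(entry["docs"])
    return []


def shared_docs(config: dict) -> list[int]:
    """Return the doc indices that appear in every annotator's list."""
    doc_sets = [set(entry["docs"]) for entry in config["annotators"]]
    if not doc_sets:
        return []
    return sorted(set.intersection(*doc_sets))
```

With the sample values, `shared_docs(config)` contains only doc 2, the one index both annotators have in common.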

Auto-Generate Assignments

# Preview assignment distribution:
uv run --with pyyaml python3 generate_assignments.py --dry-run

# Generate and save locally:
uv run --with pyyaml python3 generate_assignments.py

# Generate, save, and upload to HF:
uv run --with pyyaml,huggingface_hub python3 generate_assignments.py --upload

The script:

1. Reads annotator_config.yaml for the annotator list and overlap %
2. Shuffles all available docs (deterministic, seed=42)
3. Reserves overlap_percent % of the docs as overlap docs shared by ALL annotators
4. Splits the rest evenly across annotators
5. Saves the assignments back to the YAML
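
The steps above can be sketched roughly as follows. This is not the actual `generate_assignments.py`: the function name `assign_docs`, the rounding of the overlap count, and the round-robin split are assumptions made for illustration, and reading or writing the YAML is left out.

```python
import random


def assign_docs(doc_ids, usernames, overlap_percent=10, seed=42):
    """Sketch: reserve a shared overlap set, then split the rest evenly."""
    docs = list(doc_ids)
    random.Random(seed).shuffle(docs)  # deterministic shuffle for a fixed seed

    # Assumption: the overlap count is rounded to the nearest whole doc.
    n_overlap = round(len(docs) * overlap_percent / 100)
    overlap, rest = docs[:n_overlap], docs[n_overlap:]
    if not usernames:
        return {}, sorted(overlap)

    # Every annotator starts with the overlap docs.
    assignments = {user: list(overlap) for user in usernames}

    # Assumption: the remaining docs are dealt out round-robin.
    for i, doc in enumerate(rest):
        assignments[usernames[i % len(usernames)]].append(doc)

    return {user: sorted(ids) for user, ids in assignments.items()}, sorted(overlap)
```

Calling `assign_docs(range(20), ["a", "b"])` reserves 2 overlap docs and gives each annotator 9 of the remaining 18, so both end up with 11 docs.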

Adding a New Annotator

Add to annotation_data/annotator_config.yaml:
```
- username: new_hf_username
  docs: []
```
Re-run: uv run --with pyyaml,huggingface_hub python3 generate_assignments.py --upload
Add the username to ALLOWED_USERS in the Space settings

Manual Editing

You can manually edit the docs array for any annotator in the YAML file, then upload:

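uv run --with huggingface_hub python3 -c "
from huggingface_hub import HfApi

# Upload the edited config to the HF dataset repo (upload_file takes keyword arguments)
api = HfApi()
api.upload_file(
    path_or_fileobj='annotation_data/annotator_config.yaml',
    path_in_repo='annotation_data/annotator_config.yaml',
    repo_id='ai4data/annotation_data',
    repo_type='dataset',
)
"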

Per-Annotator Validation (Overlap Support)

Each dataset mention stores validations per-annotator in a validations array:

{
  "dataset_name": { "text": "DHS Survey", "confidence": 0.95 },
  "dataset_tag": "named",
  "validations": [
    {
      "annotator": "rafmacalaba",
      "human_validated": true,
      "human_verdict": true,
      "human_notes": null,
      "validated_at": "2025-02-24T11:00:00Z"
    },
    {
      "annotator": "rafaelmacalaba",
      "human_validated": true,
      "human_verdict": false,
      "human_notes": "This is a study name, not a dataset",
      "validated_at": "2025-02-24T11:05:00Z"
    }
  ]
}

Key behavior:

- Each annotator sees only their own validation status, so one annotator's verdict cannot bias another's
- The progress bar and the "Next" prompt count only your own verifications
- Tag edits (dataset_tag) are shared across annotators because they're factual, not judgment-based
- Re-validating updates your existing entry (it doesn't create a duplicate)
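
The update-in-place behavior can be illustrated with a short Python sketch. The app itself implements this in its JavaScript API routes, so the code below only mirrors the behavior described above; `upsert_validation` and `own_validation` are hypothetical names.

```python
from datetime import datetime, timezone


def upsert_validation(mention, annotator, verdict, notes=None):
    """Add or update one annotator's entry in mention['validations']."""
    entry = {
        "annotator": annotator,
        "human_validated": True,
        "human_verdict": verdict,
        "human_notes": notes,
        "validated_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    validations = mention.setdefault("validations", [])
    for i, existing in enumerate(validations):
        if existing.get("annotator") == annotator:
            validations[i] = entry  # re-validation replaces the old entry
            return mention
    validations.append(entry)
    return mention


def own_validation(mention, annotator):
    """Return only this annotator's entry, or None if they have not validated yet."""
    for existing in mention.get("validations", []):
        if existing.get("annotator") == annotator:
            return existing
    return None
```

Keying entries on the annotator name is what keeps the array free of duplicates, and looking up only the caller's entry is one way to avoid exposing other annotators' verdicts.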

Data Pipeline

`prepare_data.py` — Prepare & Upload Documents

# Dry run (scan only):
uv run --with huggingface_hub,requests,langdetect python3 prepare_data.py --dry-run

# Upload missing docs + update links:
uv run --with huggingface_hub,requests,langdetect python3 prepare_data.py

# Only update wbg_pdf_links.json:
uv run --with huggingface_hub,requests,langdetect python3 prepare_data.py --links-only

This script:

1. Scans local annotation_data/wbg_extractions/ for real _direct_judged.jsonl files
2. Detects each document's language with langdetect and excludes non-English documents (e.g. Arabic, French)
3. Uploads the English docs to the HF dataset
4. Updates wbg_pdf_links.json with has_revalidation and language fields
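
The scan and language-filter steps might look roughly like the sketch below. It is not the real `prepare_data.py`: the helper names are hypothetical, and the language detector is passed in as a callable so the example stays self-contained (in practice this could be `langdetect.detect`, which returns codes such as `en`, `fr`, or `ar`). How the script extracts sample text from each file is not shown in this guide, so the sketch takes the text as input.

```python
from pathlib import Path

SUFFIX = "_direct_judged.jsonl"


def find_judged_files(root):
    """Return all *_direct_judged.jsonl files under root, sorted by path."""
    return sorted(Path(root).rglob(f"*{SUFFIX}"))


def split_by_language(samples, detect):
    """Split {doc_name: sample_text} into English and excluded docs.

    `detect` is any callable that returns a language code for a string,
    for example langdetect.detect.
    """
    english, excluded = [], {}
    for name, text in samples.items():
        lang = detect(text)
        if lang == "en":
            english.append(name)
        else:
            excluded[name] = lang  # remember why the doc was skipped
    return english, excluded
```

Keeping the detected code for excluded docs makes it easy to record a language field later, as the last step above describes.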

Leaderboard 🏆

Click 🏆 Leaderboard in the top bar to see annotator rankings.

| Metric | Description |
| --- | --- |
| ✅ Verified | Number of mentions validated |
| ✍️ Added | Manually added dataset mentions |
| 📄 Docs | Number of documents worked on |
| ⭐ Score | `Verified + Added` |

Leaderboard results are cached for 2 minutes.
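
As a rough illustration of the scoring and caching described here, the sketch below ranks annotators by `Verified + Added` and keeps the result for 120 seconds. The field names (`verified`, `added`, `docs`) and the `TTLCache` class are assumptions for the example; the app's leaderboard route may be structured differently.

```python
import time


def rank_annotators(stats):
    """Sort annotators by score = verified + added (highest first)."""
    rows = [
        {"username": user, **counts, "score": counts["verified"] + counts["added"]}
        for user, counts in stats.items()
    ]
    return sorted(rows, key=lambda row: row["score"], reverse=True)


class TTLCache:
    """Keep one computed value for ttl_seconds (120 s matches the 2-minute cache)."""

    def __init__(self, ttl_seconds=120, clock=time.monotonic):
        self.ttl, self.clock = ttl_seconds, clock
        self.value, self.expires_at = None, float("-inf")

    def get(self, compute):
        """Return the cached value, recomputing it once the TTL has passed."""
        now = self.clock()
        if now >= self.expires_at:
            self.value, self.expires_at = compute(), now + self.ttl
        return self.value
```

A consequence of this kind of cache is that a validation submitted just now may take up to two minutes to show up in the rankings.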

API Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| `/api/documents?user=X` | GET | List documents (filtered by user assignment) |
| `/api/document?index=X&page=Y` | GET | Get page data for a specific document |
| `/api/validate` | PUT | Submit validation for a dataset mention |
| `/api/validate?doc=X&page=Y&idx=Z` | DELETE | Remove a dataset entry |
| `/api/leaderboard` | GET | Annotator rankings |
| `/api/pdf-proxy?url=X` | GET | Proxy PDF downloads (bypasses CORS) |
| `/api/auth/login` | GET | Start HF OAuth flow |
| `/api/auth/callback` | GET | OAuth callback |
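
For scripting against these endpoints, a small helper for building the query URLs can be handy. The sketch below only constructs URLs and sends nothing; `BASE_URL` is a placeholder and `api_url` is a hypothetical helper. The request body for `PUT /api/validate` is not documented here, so it is left out.

```python
from urllib.parse import urlencode

# Placeholder base URL; replace with the actual Space URL.
BASE_URL = "https://example.invalid"


def api_url(path, **params):
    """Build a URL for one of the endpoints listed above."""
    query = urlencode(params)
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


documents_url = api_url("/api/documents", user="rafmacalaba")
page_url = api_url("/api/document", index=3, page=2)
delete_url = api_url("/api/validate", doc=3, page=2, idx=0)
```

The HTTP method is still up to the client: `delete_url` removes an entry only when it is requested with DELETE.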

Architecture

hf_spaces_docker/
├── app/
│   ├── page.js                    # Main app (client component)
│   ├── globals.css                # All styles
│   ├── api/
│   │   ├── documents/route.js     # Doc listing + user filtering
│   │   ├── document/route.js      # Single page data
│   │   ├── validate/route.js      # Validate/delete mentions
│   │   ├── leaderboard/route.js   # Leaderboard stats
│   │   ├── pdf-proxy/route.js     # PDF CORS proxy
│   │   └── auth/                  # HF OAuth login/callback
│   └── components/
│       ├── AnnotationPanel.js     # Side panel with dataset cards
│       ├── AnnotationModal.js     # Manual annotation dialog
│       ├── DocumentSelector.js    # Document dropdown
│       ├── Leaderboard.js         # Leaderboard modal
│       ├── MarkdownAnnotator.js   # Text viewer with highlighting
│       ├── PageNavigator.js       # Prev/Next page buttons
│       ├── PdfViewer.js           # PDF iframe with loading state
│       └── ProgressBar.js         # PDF/Page/Verified pills
├── annotation_data/
│   ├── annotator_config.yaml      # Annotator assignments
│   └── wbg_data/
│       └── wbg_pdf_links.json     # Doc registry with URLs
├── prepare_data.py                # Upload docs to HF
└── generate_assignments.py        # Auto-assign docs to annotators

Environment Variables

| Variable | Required | Description |
| --- | --- | --- |
| `HF_TOKEN` | Yes | HuggingFace API token (read/write) |
| `OAUTH_CLIENT_ID` | Yes (Space) | HF OAuth app client ID |
| `OAUTH_CLIENT_SECRET` | Yes (Space) | HF OAuth app client secret |
| `ALLOWED_USERS` | Yes (Space) | Comma-separated HF usernames |
| `NEXTAUTH_SECRET` | Yes | Secret for cookie signing |
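
A quick pre-flight check of these variables can save a failed deploy. The sketch below is a hypothetical helper script, not part of the repository. It treats all five variables as required (the Space case) and assumes `ALLOWED_USERS` is matched exactly after trimming whitespace, which may differ from how the app compares usernames.

```python
import os

# Variables the table marks as required when running on the Space.
REQUIRED = [
    "HF_TOKEN",
    "OAUTH_CLIENT_ID",
    "OAUTH_CLIENT_SECRET",
    "ALLOWED_USERS",
    "NEXTAUTH_SECRET",
]


def missing_vars(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]


def parse_allowed_users(raw):
    """Split the comma-separated ALLOWED_USERS value into clean usernames."""
    return [user.strip() for user in raw.split(",") if user.strip()]


def is_allowed(username, env=os.environ):
    """Check a username against ALLOWED_USERS (assumption: exact match)."""
    return username in parse_allowed_users(env.get("ALLOWED_USERS", ""))
```

Running `missing_vars()` before starting the app lists whatever still needs to be set in the Space settings.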