
📝 Annotation Tool: Guide

A HuggingFace Spaces app for validating AI-extracted dataset mentions in World Bank documents.


Quick Start

For Annotators

  1. Go to the Space URL and click 🤗 Sign in with HuggingFace
  2. You'll see only your assigned documents in the dropdown
  3. Navigate pages with ← Prev / Next →
  4. Open the Data Mentions panel to validate each mention
  5. Track your progress in the top-right: Progress: 📄 PDF 3/55 | 📑 Page 2/12 | 🏷️ Verified 4/8

Validation Actions

| Action | What it does |
|---|---|
| ✅ Correct | Confirms the AI extraction is a real dataset mention |
| ❌ Incorrect | Marks the extraction as wrong / not a dataset |
| Click tag badge | Change dataset type (named, descriptive, generic) |
| Highlight text → Annotate | Manually add a dataset mention the AI missed |
| 🗑️ Delete | Remove a dataset entry entirely |

Tip: If you try to click "Next" with unverified mentions, you'll get a confirmation prompt.


Document Assignments

Each annotator sees only their assigned documents. A configurable percentage of the documents (default 10%) is set aside as overlap documents, which are assigned to every annotator so that inter-annotator agreement can be measured.

Configuration File

annotation_data/annotator_config.yaml:

settings:
  overlap_percent: 10  # % of docs shared between all annotators

annotators:
  - username: rafmacalaba     # HuggingFace username
    docs: [2, 3, 14, ...]     # assigned doc indices
  - username: rafaelmacalaba
    docs: [1, 2, 10, ...]
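
As a rough illustration of how this file drives the per-user document filter, the sketch below looks up one annotator's assigned indices in an already-parsed config. The function name `docs_for_user` and the shortened `docs` lists are illustrative assumptions, not the app's actual code or the real assignments.

```python
from typing import Any

# Hypothetical, already-parsed copy of annotator_config.yaml.
# The docs lists are shortened placeholders, not the real assignments.
config: dict[str, Any] = {
    "settings": {"overlap_percent": 10},
    "annotators": [
        {"username": "rafmacalaba", "docs": [2, 3, 14]},
        {"username": "rafaelmacalaba", "docs": [1, 2, 10]},
    ],
}


def docs_for_user(config: dict[str, Any], username: str) -> list[int]:
    """Return the sorted doc indices assigned to `username`, or [] if unknown."""
    for entry in config.get("annotators", []):
        if entry.get("username") == username:
            # `docs: []` (a new annotator with no assignments yet) yields [].
            return sorted(entry.get("docs") or [])
    return []


print(docs_for_user(config, "rafmacalaba"))
```

A user who is not listed gets an empty list here; how the real app treats unlisted users is not specified in this guide.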

Auto-Generate Assignments

# Preview assignment distribution:
uv run --with pyyaml python3 generate_assignments.py --dry-run

# Generate and save locally:
uv run --with pyyaml python3 generate_assignments.py

# Generate, save, and upload to HF:
uv run --with pyyaml,huggingface_hub python3 generate_assignments.py --upload

The script:

  • Reads annotator_config.yaml for the annotator list and overlap %
  • Shuffles all available docs (deterministic seed=42)
  • Reserves overlap_percent % of the docs as overlap docs shared by ALL annotators
  • Splits the rest evenly across annotators
  • Saves back to the YAML
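
The steps above can be sketched roughly as follows. This is a minimal approximation, not the contents of generate_assignments.py: the function name `assign_docs`, the rounding rule for the overlap size, and the round-robin split are all assumptions.

```python
import random


def assign_docs(doc_indices, usernames, overlap_percent=10, seed=42):
    """Shuffle docs deterministically, reserve an overlap set, split the rest."""
    docs = sorted(doc_indices)
    random.Random(seed).shuffle(docs)  # fixed seed gives a reproducible order

    # Assumed rounding rule: nearest whole number of overlap docs.
    n_overlap = round(len(docs) * overlap_percent / 100)
    overlap, rest = docs[:n_overlap], docs[n_overlap:]

    assignments = {}
    for i, user in enumerate(usernames):
        own = rest[i::len(usernames)]  # round-robin keeps shares within one doc
        assignments[user] = sorted(overlap + own)
    return assignments


result = assign_docs(range(20), ["rafmacalaba", "rafaelmacalaba"])
for user, docs in result.items():
    print(user, len(docs))
```

With 20 documents and two annotators at 10% overlap, each annotator ends up with 2 shared documents plus 9 of their own.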

Adding a New Annotator

  1. Add to annotation_data/annotator_config.yaml:
    - username: new_hf_username
      docs: []
    
  2. Re-run: uv run --with pyyaml,huggingface_hub python3 generate_assignments.py --upload
  3. Add the username to ALLOWED_USERS in the Space settings

Manual Editing

You can manually edit the docs array for any annotator in the YAML file, then upload:

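uv run --with huggingface_hub python3 -c "
from huggingface_hub import HfApi
api = HfApi()  # picks up the token from HF_TOKEN or a cached login
# upload_file accepts keyword arguments only
api.upload_file(
    path_or_fileobj='annotation_data/annotator_config.yaml',
    path_in_repo='annotation_data/annotator_config.yaml',
    repo_id='ai4data/annotation_data',
    repo_type='dataset',
)
"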

Per-Annotator Validation (Overlap Support)

Each dataset mention stores validations per-annotator in a validations array:

{
  "dataset_name": { "text": "DHS Survey", "confidence": 0.95 },
  "dataset_tag": "named",
  "validations": [
    {
      "annotator": "rafmacalaba",
      "human_validated": true,
      "human_verdict": true,
      "human_notes": null,
      "validated_at": "2025-02-24T11:00:00Z"
    },
    {
      "annotator": "rafaelmacalaba",
      "human_validated": true,
      "human_verdict": false,
      "human_notes": "This is a study name, not a dataset",
      "validated_at": "2025-02-24T11:05:00Z"
    }
  ]
}

Key behavior:

  • Each annotator only sees their own validation status (no bias)
  • Progress bar and "Next" prompt count only your verifications
  • Tag edits (dataset_tag) are shared: they're factual, not judgment-based
  • Re-validating updates your existing entry (doesn't create duplicates)
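
The "update, don't duplicate" behavior can be pictured with a small upsert over the validations array. This is a sketch only: the real logic lives in the Space's validate route, and the helper names `upsert_validation` and `is_verified_by` are hypothetical.

```python
from datetime import datetime, timezone


def upsert_validation(mention, annotator, verdict, notes=None):
    """Insert or replace `annotator`'s entry in mention['validations']."""
    entry = {
        "annotator": annotator,
        "human_validated": True,
        "human_verdict": verdict,
        "human_notes": notes,
        "validated_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    validations = mention.setdefault("validations", [])
    for i, existing in enumerate(validations):
        if existing.get("annotator") == annotator:
            validations[i] = entry  # re-validation overwrites the old entry
            break
    else:
        validations.append(entry)  # first validation by this annotator
    return mention


def is_verified_by(mention, annotator):
    """True if this annotator (and only this one) has validated the mention."""
    return any(
        v.get("annotator") == annotator and v.get("human_validated")
        for v in mention.get("validations", [])
    )


mention = {"dataset_name": {"text": "DHS Survey"}, "dataset_tag": "named"}
upsert_validation(mention, "rafmacalaba", True)
upsert_validation(mention, "rafmacalaba", False, "Changed my mind")
print(len(mention["validations"]))
```

Counting progress with `is_verified_by` for the signed-in user is what keeps one annotator's progress independent of another's on overlap documents.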

Data Pipeline

prepare_data.py: Prepare & Upload Documents

# Dry run (scan only):
uv run --with huggingface_hub,requests,langdetect python3 prepare_data.py --dry-run

# Upload missing docs + update links:
uv run --with huggingface_hub,requests,langdetect python3 prepare_data.py

# Only update wbg_pdf_links.json:
uv run --with huggingface_hub,requests,langdetect python3 prepare_data.py --links-only

This script:

  • Scans local annotation_data/wbg_extractions/ for real _direct_judged.jsonl files
  • Detects each document's language using langdetect and excludes non-English documents (e.g. Arabic, French)
  • Uploads English docs to HF dataset
  • Updates wbg_pdf_links.json with has_revalidation and language fields
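
The scan-and-filter part of this pipeline might look roughly like the sketch below. It is not prepare_data.py itself: the function name `find_english_docs` is hypothetical, the directory layout under wbg_extractions/ is assumed to be arbitrary (hence the recursive search), and the language detector is passed in as a callable so the example runs without extra packages. In practice `langdetect.detect`, which returns codes such as `"en"`, could be supplied, most likely applied to the extracted document text instead of the raw file contents.

```python
from pathlib import Path
from typing import Callable


def find_english_docs(
    root: Path,
    detect_language: Callable[[str], str],
    suffix: str = "_direct_judged.jsonl",
) -> list[Path]:
    """Return files under `root` ending in `suffix` whose text is detected as English."""
    english = []
    for path in sorted(root.rglob(f"*{suffix}")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        if detect_language(text) == "en":
            english.append(path)
    return english


if __name__ == "__main__":
    # Stand-in detector for demonstration; swap in langdetect.detect if installed.
    def naive(text: str) -> str:
        return "en" if " the " in text else "other"

    for doc in find_english_docs(Path("annotation_data/wbg_extractions"), naive):
        print(doc)
```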

Leaderboard 🏆

Click 🏆 Leaderboard in the top bar to see annotator rankings.

| Metric | Description |
|---|---|
| ✅ Verified | Number of mentions validated |
| ✍️ Added | Manually added dataset mentions |
| 📄 Docs | Number of documents worked on |
| ⭐ Score | Verified + Added |

Leaderboard results are cached for 2 minutes, so new validations may take up to that long to appear.
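
To make the scoring and caching concrete, here is a small sketch. Only the formula (Score = Verified + Added) and the 2-minute lifetime come from this guide. The names `rank` and `TTLCache`, the shape of the stats dictionary, and the alphabetical tie-break are assumptions, and the Space's actual leaderboard route may be structured differently.

```python
import time

CACHE_TTL_SECONDS = 120  # "cached for 2 minutes"
_UNSET = object()


def rank(stats):
    """stats: {username: {"verified": int, "added": int, "docs": int}}."""
    rows = [
        {"annotator": user, **s, "score": s["verified"] + s["added"]}
        for user, s in stats.items()
    ]
    # Highest score first; ties broken alphabetically (an assumption).
    return sorted(rows, key=lambda r: (-r["score"], r["annotator"]))


class TTLCache:
    """Recompute a value only after `ttl` seconds have passed."""

    def __init__(self, ttl=CACHE_TTL_SECONDS, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._value = _UNSET
        self._stamp = 0.0

    def get(self, compute):
        now = self._clock()
        if self._value is _UNSET or now - self._stamp >= self._ttl:
            self._value = compute()
            self._stamp = now
        return self._value


cache = TTLCache()
stats = {
    "rafmacalaba": {"verified": 4, "added": 1, "docs": 2},
    "rafaelmacalaba": {"verified": 3, "added": 5, "docs": 1},
}
for row in cache.get(lambda: rank(stats)):
    print(row["annotator"], row["score"])
```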


API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /api/documents?user=X | GET | List documents (filtered by user assignment) |
| /api/document?index=X&page=Y | GET | Get page data for a specific document |
| /api/validate | PUT | Submit validation for a dataset mention |
| /api/validate?doc=X&page=Y&idx=Z | DELETE | Remove a dataset entry |
| /api/leaderboard | GET | Annotator rankings |
| /api/pdf-proxy?url=X | GET | Proxy PDF downloads (bypasses CORS) |
| /api/auth/login | GET | Start HF OAuth flow |
| /api/auth/callback | GET | OAuth callback |
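
For scripting against these endpoints, query strings are easiest to build with a URL encoder, especially for the PDF proxy, whose `url` parameter is itself a URL. The sketch below only constructs request URLs. `BASE_URL` and the example PDF address are placeholders, the helper name `endpoint` is hypothetical, and the request body for PUT /api/validate is left out because this guide does not document it.

```python
from urllib.parse import urlencode

BASE_URL = "https://example.invalid"  # placeholder; use your Space's URL


def endpoint(path: str, **params) -> str:
    """Build a full URL for an API path with percent-encoded query parameters."""
    query = urlencode(params)
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


print(endpoint("/api/documents", user="rafmacalaba"))
print(endpoint("/api/document", index=3, page=2))
print(endpoint("/api/validate", doc=3, page=2, idx=0))  # send with DELETE
print(endpoint("/api/pdf-proxy", url="https://example.invalid/report.pdf"))
```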

Architecture

hf_spaces_docker/
├── app/
│   ├── page.js                    # Main app (client component)
│   ├── globals.css                # All styles
│   ├── api/
│   │   ├── documents/route.js     # Doc listing + user filtering
│   │   ├── document/route.js      # Single page data
│   │   ├── validate/route.js      # Validate/delete mentions
│   │   ├── leaderboard/route.js   # Leaderboard stats
│   │   ├── pdf-proxy/route.js     # PDF CORS proxy
│   │   └── auth/                  # HF OAuth login/callback
│   └── components/
│       ├── AnnotationPanel.js     # Side panel with dataset cards
│       ├── AnnotationModal.js     # Manual annotation dialog
│       ├── DocumentSelector.js    # Document dropdown
│       ├── Leaderboard.js         # Leaderboard modal
│       ├── MarkdownAnnotator.js   # Text viewer with highlighting
│       ├── PageNavigator.js       # Prev/Next page buttons
│       ├── PdfViewer.js           # PDF iframe with loading state
│       └── ProgressBar.js         # PDF/Page/Verified pills
├── annotation_data/
│   ├── annotator_config.yaml      # Annotator assignments
│   └── wbg_data/
│       └── wbg_pdf_links.json     # Doc registry with URLs
├── prepare_data.py                # Upload docs to HF
└── generate_assignments.py        # Auto-assign docs to annotators

Environment Variables

| Variable | Required | Description |
|---|---|---|
| HF_TOKEN | Yes | HuggingFace API token (read/write) |
| OAUTH_CLIENT_ID | Yes (Space) | HF OAuth app client ID |
| OAUTH_CLIENT_SECRET | Yes (Space) | HF OAuth app client secret |
| ALLOWED_USERS | Yes (Space) | Comma-separated HF usernames |
| NEXTAUTH_SECRET | Yes | Secret for cookie signing |
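
A quick pre-flight check for these variables could look like the sketch below. It reads "Yes (Space)" as "required only when running on the Space", which is an interpretation and not something this guide states outright. The helper names are hypothetical, and the app itself may parse `ALLOWED_USERS` differently (for example, with respect to case).

```python
import os

REQUIRED_ALWAYS = ("HF_TOKEN", "NEXTAUTH_SECRET")
REQUIRED_ON_SPACE = ("OAUTH_CLIENT_ID", "OAUTH_CLIENT_SECRET", "ALLOWED_USERS")


def parse_allowed_users(raw):
    """Split a comma-separated username list, dropping blanks and whitespace."""
    return [name.strip() for name in (raw or "").split(",") if name.strip()]


def missing_vars(env, on_space=True):
    """Return the names of required variables that are unset or empty."""
    names = REQUIRED_ALWAYS + (REQUIRED_ON_SPACE if on_space else ())
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    print("Missing:", missing_vars(os.environ))
    print("Allowed:", parse_allowed_users(os.environ.get("ALLOWED_USERS")))
```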