# Annotation Tool Guide

A HuggingFace Spaces app for validating AI-extracted dataset mentions in World Bank documents.

---

## Quick Start

### For Annotators

1. Go to the Space URL and click **Sign in with HuggingFace**
2. You'll see only your assigned documents in the dropdown
3. Navigate pages with **← Prev / Next →**
4. Open the **Data Mentions** panel to validate each mention
5. Track your progress in the top-right: `Progress: PDF 3/55 | Page 2/12 | Verified 4/8`

### Validation Actions

| Action | What it does |
|--------|-------------|
| ✅ **Correct** | Confirms the AI extraction is a real dataset mention |
| ❌ **Incorrect** | Marks the extraction as wrong / not a dataset |
| **Click tag badge** | Change the dataset type (named, descriptive, generic) |
| **Highlight text → Annotate** | Manually add a dataset mention the AI missed |
| 🗑️ **Delete** | Remove a dataset entry entirely |

> **Tip:** If you click "Next" while mentions are still unverified, you'll get a confirmation prompt.
| ## Document Assignments | |
| Each annotator sees only their assigned documents. A configurable percentage (default 10%) are **overlap documents** shared across all annotators for inter-annotator agreement measurement. | |
| ### Configuration File | |
| `annotation_data/annotator_config.yaml`: | |
| ```yaml | |
| settings: | |
| overlap_percent: 10 # % of docs shared between all annotators | |
| annotators: | |
| - username: rafmacalaba # HuggingFace username | |
| docs: [2, 3, 14, ...] # assigned doc indices | |
| - username: rafaelmacalaba | |
| docs: [1, 2, 10, ...] | |
| ``` | |
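If you're scripting against this file, a minimal load-and-validate sketch might look like the following. The helper names and the specific checks are illustrative, not part of the repo:

```python
def load_config(path="annotation_data/annotator_config.yaml"):
    """Load and sanity-check the annotator config (illustrative helper)."""
    import yaml  # pyyaml, available via `uv run --with pyyaml`
    with open(path) as f:
        return validate_config(yaml.safe_load(f))

def validate_config(cfg):
    """Reject out-of-range overlap percentages and duplicate usernames."""
    pct = cfg["settings"]["overlap_percent"]
    if not 0 <= pct <= 100:
        raise ValueError(f"overlap_percent out of range: {pct}")
    usernames = [a["username"] for a in cfg["annotators"]]
    if len(usernames) != len(set(usernames)):
        raise ValueError("duplicate annotator usernames")
    return cfg
```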
### Auto-Generate Assignments

```bash
# Preview assignment distribution:
uv run --with pyyaml python3 generate_assignments.py --dry-run

# Generate and save locally:
uv run --with pyyaml python3 generate_assignments.py

# Generate, save, and upload to HF:
uv run --with pyyaml,huggingface_hub python3 generate_assignments.py --upload
```

The script:
- Reads `annotator_config.yaml` for the annotator list and overlap %
- Shuffles all available docs (deterministic, seed=42)
- Reserves `overlap_percent`% of docs to be shared by ALL annotators
- Splits the rest evenly across annotators
- Saves the assignments back to the YAML
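The steps above can be condensed into a few lines of Python. `assign_docs` is a hypothetical sketch of what `generate_assignments.py` does, not its actual code:

```python
import random

def assign_docs(doc_ids, annotators, overlap_percent=10, seed=42):
    """Reserve a shared overlap set, then split the rest round-robin."""
    rng = random.Random(seed)              # deterministic shuffle
    docs = list(doc_ids)
    rng.shuffle(docs)
    n_overlap = max(1, len(docs) * overlap_percent // 100)
    overlap, rest = docs[:n_overlap], docs[n_overlap:]
    # Every annotator gets the overlap docs, for agreement measurement.
    assignments = {name: list(overlap) for name in annotators}
    for i, doc in enumerate(rest):         # even split of the remainder
        assignments[annotators[i % len(annotators)]].append(doc)
    return assignments
```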
### Adding a New Annotator

1. Add the annotator to `annotation_data/annotator_config.yaml`:
   ```yaml
   - username: new_hf_username
     docs: []
   ```
2. Re-run: `uv run --with pyyaml,huggingface_hub python3 generate_assignments.py --upload`
3. Add the username to `ALLOWED_USERS` in the Space settings
### Manual Editing

You can manually edit the `docs` array for any annotator in the YAML file, then upload:

```bash
uv run --with huggingface_hub python3 -c "
from huggingface_hub import HfApi
api = HfApi()
api.upload_file(path_or_fileobj='annotation_data/annotator_config.yaml',
                path_in_repo='annotation_data/annotator_config.yaml',
                repo_id='ai4data/annotation_data', repo_type='dataset')
"
```
---

## Per-Annotator Validation (Overlap Support)

Each dataset mention stores validations per annotator in a `validations` array:

```json
{
  "dataset_name": { "text": "DHS Survey", "confidence": 0.95 },
  "dataset_tag": "named",
  "validations": [
    {
      "annotator": "rafmacalaba",
      "human_validated": true,
      "human_verdict": true,
      "human_notes": null,
      "validated_at": "2025-02-24T11:00:00Z"
    },
    {
      "annotator": "rafaelmacalaba",
      "human_validated": true,
      "human_verdict": false,
      "human_notes": "This is a study name, not a dataset",
      "validated_at": "2025-02-24T11:05:00Z"
    }
  ]
}
```

**Key behavior:**
- Each annotator sees only **their own** validation status (no bias)
- The progress bar and the "Next" prompt count only **your** verifications
- Tag edits (`dataset_tag`) are shared: they're factual, not judgment-based
- Re-validating updates your existing entry (it doesn't create duplicates)
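The update-in-place behavior can be sketched as follows. `upsert_validation` is a hypothetical helper mirroring the JSON schema above, not the app's actual code:

```python
from datetime import datetime, timezone

def upsert_validation(mention, annotator, verdict, notes=None):
    """Update the annotator's existing validation entry, or append one."""
    validations = mention.setdefault("validations", [])
    entry = next((v for v in validations if v["annotator"] == annotator), None)
    if entry is None:                      # first validation by this annotator
        entry = {"annotator": annotator}
        validations.append(entry)
    entry.update({                         # re-validation overwrites in place
        "human_validated": True,
        "human_verdict": verdict,
        "human_notes": notes,
        "validated_at": datetime.now(timezone.utc).isoformat(),
    })
    return mention
```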
---

## Data Pipeline

### `prepare_data.py`: Prepare & Upload Documents

```bash
# Dry run (scan only):
uv run --with huggingface_hub,requests,langdetect python3 prepare_data.py --dry-run

# Upload missing docs + update links:
uv run --with huggingface_hub,requests,langdetect python3 prepare_data.py

# Only update wbg_pdf_links.json:
uv run --with huggingface_hub,requests,langdetect python3 prepare_data.py --links-only
```

This script:
- Scans local `annotation_data/wbg_extractions/` for real `_direct_judged.jsonl` files
- Detects each document's language with `langdetect` and excludes non-English documents (e.g. Arabic, French)
- Uploads English docs to the HF dataset
- Updates `wbg_pdf_links.json` with `has_revalidation` and `language` fields
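The language-filtering step might be sketched like this. The function name and doc shape are assumptions; the `detect` callable is injectable so the third-party `langdetect` dependency is only needed at runtime:

```python
def filter_english_docs(docs, detect=None):
    """Keep docs whose text detects as English; tag every doc's language."""
    if detect is None:
        from langdetect import detect  # third-party, per the --with flag above
    kept = []
    for doc in docs:
        try:
            lang = detect(doc["text"])
        except Exception:                  # langdetect raises on empty/odd text
            lang = "unknown"
        doc["language"] = lang             # recorded in wbg_pdf_links.json
        if lang == "en":
            kept.append(doc)
    return kept
```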
---

## Leaderboard

Click **Leaderboard** in the top bar to see annotator rankings.

| Metric | Description |
|--------|-------------|
| Verified | Number of mentions validated |
| Added | Manually added dataset mentions |
| Docs | Number of documents worked on |
| Score | `Verified + Added` |

Results are cached for 2 minutes.
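The 2-minute cache amounts to a simple TTL check, sketched below. This is illustrative; the app's actual route handler may differ:

```python
import time

_cache = {"data": None, "at": 0.0}
TTL = 120  # seconds; matches the 2-minute cache above

def get_leaderboard(compute, now=time.monotonic):
    """Return cached stats if fresh, else recompute and re-stamp."""
    if _cache["data"] is None or now() - _cache["at"] > TTL:
        _cache["data"] = compute()
        _cache["at"] = now()
    return _cache["data"]
```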
---

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/documents?user=X` | GET | List documents (filtered by user assignment) |
| `/api/document?index=X&page=Y` | GET | Get page data for a specific document |
| `/api/validate` | PUT | Submit a validation for a dataset mention |
| `/api/validate?doc=X&page=Y&idx=Z` | DELETE | Remove a dataset entry |
| `/api/leaderboard` | GET | Annotator rankings |
| `/api/pdf-proxy?url=X` | GET | Proxy PDF downloads (bypasses CORS) |
| `/api/auth/login` | GET | Start the HF OAuth flow |
| `/api/auth/callback` | GET | OAuth callback |
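As an illustration of calling the validation endpoint from a script, here is a client-side sketch using only the standard library. The JSON body fields are assumptions based on the table above; check the route handler for the real contract:

```python
import json
from urllib import request

def build_validate_request(base_url, doc, page, idx, verdict):
    """Build (but don't send) a PUT /api/validate request."""
    body = json.dumps({"doc": doc, "page": page, "idx": idx,
                       "human_verdict": verdict}).encode()
    return request.Request(f"{base_url}/api/validate", data=body,
                           method="PUT",
                           headers={"Content-Type": "application/json"})

# To actually send it (with your session cookie attached):
#   with request.urlopen(build_validate_request(...)) as resp:
#       print(resp.status)
```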
---

## Architecture

```
hf_spaces_docker/
├── app/
│   ├── page.js                    # Main app (client component)
│   ├── globals.css                # All styles
│   ├── api/
│   │   ├── documents/route.js     # Doc listing + user filtering
│   │   ├── document/route.js      # Single page data
│   │   ├── validate/route.js      # Validate/delete mentions
│   │   ├── leaderboard/route.js   # Leaderboard stats
│   │   ├── pdf-proxy/route.js     # PDF CORS proxy
│   │   └── auth/                  # HF OAuth login/callback
│   └── components/
│       ├── AnnotationPanel.js     # Side panel with dataset cards
│       ├── AnnotationModal.js     # Manual annotation dialog
│       ├── DocumentSelector.js    # Document dropdown
│       ├── Leaderboard.js         # Leaderboard modal
│       ├── MarkdownAnnotator.js   # Text viewer with highlighting
│       ├── PageNavigator.js       # Prev/Next page buttons
│       ├── PdfViewer.js           # PDF iframe with loading state
│       └── ProgressBar.js         # PDF/Page/Verified pills
├── annotation_data/
│   ├── annotator_config.yaml      # Annotator assignments
│   └── wbg_data/
│       └── wbg_pdf_links.json     # Doc registry with URLs
├── prepare_data.py                # Upload docs to HF
└── generate_assignments.py        # Auto-assign docs to annotators
```
---

## Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `HF_TOKEN` | Yes | HuggingFace API token (read/write) |
| `OAUTH_CLIENT_ID` | Yes (Space) | HF OAuth app client ID |
| `OAUTH_CLIENT_SECRET` | Yes (Space) | HF OAuth app client secret |
| `ALLOWED_USERS` | Yes (Space) | Comma-separated HF usernames |
| `NEXTAUTH_SECRET` | Yes | Secret for cookie signing |
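A startup-time sanity check for these variables might look like the following (hypothetical helper, not part of the repo):

```python
import os

REQUIRED = ["HF_TOKEN", "OAUTH_CLIENT_ID", "OAUTH_CLIENT_SECRET",
            "ALLOWED_USERS", "NEXTAUTH_SECRET"]

def check_env(env=os.environ):
    """Fail fast if a required variable is unset; return the user allow-list."""
    missing = [k for k in REQUIRED if not env.get(k)]
    if missing:
        raise RuntimeError("Missing env vars: " + ", ".join(missing))
    # ALLOWED_USERS is comma-separated, per the table above
    return {u.strip() for u in env["ALLOWED_USERS"].split(",")}
```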