Spaces: ai4data/data-use-annotation

rafmacalaba committed on Feb 24

Commit e746bfe (1 parent: 79ba9a0)

docs: add comprehensive ANNOTATION_GUIDE.md

ANNOTATION_GUIDE.md +223 -0

	@@ -0,0 +1,223 @@

+# 📝 Annotation Tool — Guide
+A HuggingFace Spaces app for validating AI-extracted dataset mentions in World Bank documents.
+---
+## Quick Start
+### For Annotators
+1. Go to the Space URL and click **🤗 Sign in with HuggingFace**
+2. You'll see only your assigned documents in the dropdown
+3. Navigate pages with **← Prev / Next →**
+4. Open the **Data Mentions** panel to validate each mention
+5. Track your progress in the top-right: `Progress: 📄 PDF 3/55 | 📑 Page 2/12 | 🏷️ Verified 4/8`
+### Validation Actions
+| Action | What it does |
+|--------|-------------|
+| ✅ **Correct** | Confirms the AI extraction is a real dataset mention |
+| ❌ **Incorrect** | Marks the extraction as wrong / not a dataset |
+| **Click tag badge** | Change dataset type (named, descriptive, generic) |
+| **Highlight text → Annotate** | Manually add a dataset mention the AI missed |
+| 🗑️ **Delete** | Remove a dataset entry entirely |
+> **Tip:** If you try to click "Next" with unverified mentions, you'll get a confirmation prompt.
+---
+## Document Assignments
+Each annotator sees only their assigned documents. A configurable percentage of the documents (default 10%) is reserved as **overlap documents**, which are assigned to every annotator so that inter-annotator agreement can be measured.
+### Configuration File
+`annotation_data/annotator_config.yaml`:
+```yaml
+settings:
+  overlap_percent: 10  # % of docs shared between all annotators
+annotators:
+  - username: rafmacalaba     # HuggingFace username
+    docs: [2, 3, 14, ...]     # assigned doc indices
+  - username: rafaelmacalaba
+    docs: [1, 2, 10, ...]
+```
+### Auto-Generate Assignments
+```bash
+# Preview assignment distribution:
+uv run --with pyyaml python3 generate_assignments.py --dry-run
+# Generate and save locally:
+uv run --with pyyaml python3 generate_assignments.py
+# Generate, save, and upload to HF:
+uv run --with pyyaml,huggingface_hub python3 generate_assignments.py --upload
+```
+The script:
+- Reads `annotator_config.yaml` for the annotator list and `overlap_percent`
+- Shuffles all available docs with a fixed seed (`seed=42`), so the result is deterministic
+- Reserves `overlap_percent`% of the docs as overlap docs, assigned to all annotators
+- Splits the remaining docs evenly across annotators
+- Writes the assignments back to the YAML file
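
The steps above can be sketched in a few lines of Python. This is only an illustration, not the code in `generate_assignments.py`: the function name `assign_docs`, the rounding rule for the overlap size, and the round-robin split are all assumptions.

```python
import random


def assign_docs(doc_ids, usernames, overlap_percent=10, seed=42):
    """Return {username: sorted doc ids}; overlap docs go to every annotator.

    Hypothetical helper: rounding and the round-robin split are assumptions.
    """
    docs = list(doc_ids)
    random.Random(seed).shuffle(docs)  # fixed seed -> same order on every run
    n_overlap = round(len(docs) * overlap_percent / 100)
    overlap, rest = docs[:n_overlap], docs[n_overlap:]

    assignments = {}
    for i, user in enumerate(usernames):
        own = rest[i::len(usernames)]  # every n-th remaining doc
        assignments[user] = sorted(overlap + own)
    return assignments


if __name__ == "__main__":
    result = assign_docs(range(20), ["annotator_a", "annotator_b"])
    for user, docs in result.items():
        print(user, docs)
```

With 20 documents and two annotators, each annotator ends up with 11 documents: 2 shared overlap documents plus 9 of their own.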
+### Adding a New Annotator
+1. Add an entry under `annotators` in `annotation_data/annotator_config.yaml`:
+   ```yaml
+   - username: new_hf_username
+     docs: []
+   ```
+2. Re-run: `uv run --with pyyaml,huggingface_hub python3 generate_assignments.py --upload`
+3. Add the username to `ALLOWED_USERS` in the Space settings
+### Manual Editing
+You can manually edit the `docs` array for any annotator in the YAML file, then upload:
+```bash
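+uv run --with huggingface_hub python3 -c "
+from huggingface_hub import HfApi
+api = HfApi()  # picks up HF_TOKEN or the cached login
+# upload_file takes keyword-only arguments
+api.upload_file(
+    path_or_fileobj='annotation_data/annotator_config.yaml',
+    path_in_repo='annotation_data/annotator_config.yaml',
+    repo_id='ai4data/annotation_data',
+    repo_type='dataset',
+)
+"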
+```
+---
+## Per-Annotator Validation (Overlap Support)
+Each dataset mention stores validations per-annotator in a `validations` array:
+```json
+{
+  "dataset_name": { "text": "DHS Survey", "confidence": 0.95 },
+  "dataset_tag": "named",
+  "validations": [
+    {
+      "annotator": "rafmacalaba",
+      "human_validated": true,
+      "human_verdict": true,
+      "human_notes": null,
+      "validated_at": "2025-02-24T11:00:00Z"
+    },
+    {
+      "annotator": "rafaelmacalaba",
+      "human_validated": true,
+      "human_verdict": false,
+      "human_notes": "This is a study name, not a dataset",
+      "validated_at": "2025-02-24T11:05:00Z"
+    }
+  ]
+}
+```
+**Key behavior:**
+- Each annotator sees only **their own** validation status, so other annotators' verdicts cannot bias them
+- The progress bar and the "Next" prompt count only **your** verifications
+- Tag edits (`dataset_tag`) are shared across annotators because they are treated as factual, not judgment-based
+- Re-validating a mention updates your existing entry instead of creating a duplicate
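
The update-in-place and per-annotator counting rules can be sketched as follows. The field names come from the JSON example above; the function names `upsert_validation` and `verified_by` are hypothetical, and the app's own logic lives in `validate/route.js` and may differ.

```python
from datetime import datetime, timezone


def upsert_validation(mention, annotator, verdict, notes=None):
    """Add or replace `annotator`'s entry in mention["validations"] (in place)."""
    entry = {
        "annotator": annotator,
        "human_validated": True,
        "human_verdict": verdict,
        "human_notes": notes,
        "validated_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    validations = mention.setdefault("validations", [])
    for i, existing in enumerate(validations):
        if existing.get("annotator") == annotator:
            validations[i] = entry  # re-validation overwrites, no duplicate
            return mention
    validations.append(entry)
    return mention


def verified_by(mentions, annotator):
    """Count mentions that `annotator` has validated; other annotators are ignored."""
    return sum(
        any(
            v.get("annotator") == annotator and v.get("human_validated")
            for v in m.get("validations", [])
        )
        for m in mentions
    )
```

Keying the lookup on `annotator` is what keeps overlap documents safe: two annotators can disagree on the same mention without overwriting each other.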
+---
+## Data Pipeline
+### `prepare_data.py` — Prepare & Upload Documents
+```bash
+# Dry run (scan only):
+uv run --with huggingface_hub,requests,langdetect python3 prepare_data.py --dry-run
+# Upload missing docs + update links:
+uv run --with huggingface_hub,requests,langdetect python3 prepare_data.py
+# Only update wbg_pdf_links.json:
+uv run --with huggingface_hub,requests,langdetect python3 prepare_data.py --links-only
+```
+This script:
+- Scans the local `annotation_data/wbg_extractions/` directory for real `_direct_judged.jsonl` files
+- Detects each document's language with `langdetect` and excludes non-English documents (for example Arabic and French)
+- Uploads the English docs to the HF dataset
+- Updates `wbg_pdf_links.json` with the `has_revalidation` and `language` fields
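
A rough sketch of the scan and language-filter steps is shown below. It is not taken from `prepare_data.py`: the helper names, the recursive directory search, and the `{doc_id: text}` input shape are assumptions. The detector is passed in as a callable so that `langdetect.detect` (which returns codes such as `en`, `fr`, `ar`) can be supplied by the caller.

```python
from pathlib import Path

SUFFIX = "_direct_judged.jsonl"


def find_extractions(root):
    """Return every *_direct_judged.jsonl file under `root`, sorted by path."""
    return sorted(Path(root).rglob(f"*{SUFFIX}"))


def keep_english(texts_by_doc, detect):
    """Split docs by detected language.

    texts_by_doc: {doc_id: text}; detect: callable returning a language code.
    Returns (english_doc_ids, {doc_id: language}).
    """
    languages = {doc: detect(text) for doc, text in texts_by_doc.items()}
    english = [doc for doc, lang in languages.items() if lang == "en"]
    return english, languages
```

In practice the call might look like `keep_english(texts, langdetect.detect)`, with the returned language map feeding the `language` field in `wbg_pdf_links.json`.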
+---
+## Leaderboard 🏆
+Click **🏆 Leaderboard** in the top bar to see annotator rankings.
+| Metric | Description |
+|--------|-------------|
+| ✅ Verified | Number of mentions validated |
+| ✍️ Added | Manually added dataset mentions |
+| 📄 Docs | Number of documents worked on |
+| ⭐ Score | `Verified + Added` |
+Leaderboard results are cached for 2 minutes, so new validations can take up to that long to show up.
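
The scoring rule and the 2-minute cache can be sketched like this. The real implementation is in `leaderboard/route.js`; the row shape, the `CachedLeaderboard` class, and the tie-breaking (input order is kept for equal scores) are assumptions made for illustration.

```python
import time

CACHE_TTL_SECONDS = 120  # the guide states a 2-minute cache


def rank(rows):
    """rows: dicts with `username`, `verified`, `added`, `docs`.

    Adds `score = verified + added` and sorts by score, highest first.
    """
    scored = [{**r, "score": r["verified"] + r["added"]} for r in rows]
    return sorted(scored, key=lambda r: r["score"], reverse=True)


class CachedLeaderboard:
    """Recompute the ranking at most once per CACHE_TTL_SECONDS."""

    def __init__(self, load_rows, clock=time.monotonic):
        self._load_rows = load_rows  # callable returning fresh rows
        self._clock = clock          # injectable for testing
        self._value = None
        self._expires = 0.0

    def get(self):
        now = self._clock()
        if self._value is None or now >= self._expires:
            self._value = rank(self._load_rows())
            self._expires = now + CACHE_TTL_SECONDS
        return self._value
```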
+---
+## API Endpoints
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/api/documents?user=X` | GET | List documents (filtered by user assignment) |
+| `/api/document?index=X&page=Y` | GET | Get page data for a specific document |
+| `/api/validate` | PUT | Submit validation for a dataset mention |
+| `/api/validate?doc=X&page=Y&idx=Z` | DELETE | Remove a dataset entry |
+| `/api/leaderboard` | GET | Annotator rankings |
+| `/api/pdf-proxy?url=X` | GET | Proxy PDF downloads (bypasses CORS) |
+| `/api/auth/login` | GET | Start HF OAuth flow |
+| `/api/auth/callback` | GET | OAuth callback |
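
For scripting against these endpoints, the query strings can be built with the standard library. The sketch below only constructs URLs and request objects and sends nothing; the base URL is a placeholder, the helper names are hypothetical, and any authentication the Space requires (for example the session cookie set by the OAuth flow) is not covered.

```python
import urllib.request
from urllib.parse import urlencode


def api_url(base_url, path, **params):
    """Build a URL for one of the endpoints in the table, with encoded params."""
    query = urlencode(params)
    return f"{base_url.rstrip('/')}{path}" + (f"?{query}" if query else "")


def delete_request(base_url, doc, page, idx):
    """Prepare (but do not send) DELETE /api/validate?doc=X&page=Y&idx=Z."""
    url = api_url(base_url, "/api/validate", doc=doc, page=page, idx=idx)
    return urllib.request.Request(url, method="DELETE")


if __name__ == "__main__":
    base = "https://example.invalid"  # placeholder, not the real Space URL
    print(api_url(base, "/api/documents", user="rafmacalaba"))
    print(api_url(base, "/api/document", index=3, page=2))
    print(api_url(base, "/api/pdf-proxy", url="https://example.invalid/a.pdf"))
```

Using `urlencode` matters most for `/api/pdf-proxy`, where the `url` parameter is itself a URL and must be percent-encoded.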
+---
+## Architecture
+```
+hf_spaces_docker/
+├── app/
+│   ├── page.js                    # Main app (client component)
+│   ├── globals.css                # All styles
+│   ├── api/
+│   │   ├── documents/route.js     # Doc listing + user filtering
+│   │   ├── document/route.js      # Single page data
+│   │   ├── validate/route.js      # Validate/delete mentions
+│   │   ├── leaderboard/route.js   # Leaderboard stats
+│   │   ├── pdf-proxy/route.js     # PDF CORS proxy
+│   │   └── auth/                  # HF OAuth login/callback
+│   └── components/
+│       ├── AnnotationPanel.js     # Side panel with dataset cards
+│       ├── AnnotationModal.js     # Manual annotation dialog
+│       ├── DocumentSelector.js    # Document dropdown
+│       ├── Leaderboard.js         # Leaderboard modal
+│       ├── MarkdownAnnotator.js   # Text viewer with highlighting
+│       ├── PageNavigator.js       # Prev/Next page buttons
+│       ├── PdfViewer.js           # PDF iframe with loading state
+│       └── ProgressBar.js         # PDF/Page/Verified pills
+├── annotation_data/
+│   ├── annotator_config.yaml      # Annotator assignments
+│   └── wbg_data/
+│       └── wbg_pdf_links.json     # Doc registry with URLs
+├── prepare_data.py                # Upload docs to HF
+└── generate_assignments.py        # Auto-assign docs to annotators
+```
+---
+## Environment Variables
+| Variable | Required | Description |
+|----------|----------|-------------|
+| `HF_TOKEN` | Yes | HuggingFace API token (read/write) |
+| `OAUTH_CLIENT_ID` | Yes (Space) | HF OAuth app client ID |
+| `OAUTH_CLIENT_SECRET` | Yes (Space) | HF OAuth app client secret |
+| `ALLOWED_USERS` | Yes (Space) | Comma-separated HF usernames |
+| `NEXTAUTH_SECRET` | Yes | Secret for cookie signing |
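
Before deploying, a small check like the one below can catch missing variables. It is a sketch under assumptions: the helper names are hypothetical, and whether the app trims whitespace around the usernames in `ALLOWED_USERS` is not stated in this guide.

```python
import os

REQUIRED_VARS = [
    "HF_TOKEN",
    "OAUTH_CLIENT_ID",
    "OAUTH_CLIENT_SECRET",
    "ALLOWED_USERS",
    "NEXTAUTH_SECRET",
]


def missing_vars(env=os.environ):
    """Return the required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def allowed_users(env=os.environ):
    """Parse the comma-separated ALLOWED_USERS value into a set of usernames."""
    raw = env.get("ALLOWED_USERS", "")
    return {u.strip() for u in raw.split(",") if u.strip()}


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Allowed users:", sorted(allowed_users()))
```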