rafmacalaba committed · Commit e746bfe · 1 Parent(s): 79ba9a0

docs: add comprehensive ANNOTATION_GUIDE.md

  1. ANNOTATION_GUIDE.md +223 -0
ANNOTATION_GUIDE.md ADDED
@@ -0,0 +1,223 @@
# 📝 Annotation Tool: Guide

A HuggingFace Spaces app for validating AI-extracted dataset mentions in World Bank documents.

---

## Quick Start

### For Annotators

1. Go to the Space URL and click **🤗 Sign in with HuggingFace**
2. You'll see only your assigned documents in the dropdown
3. Navigate pages with **← Prev / Next →**
4. Open the **Data Mentions** panel to validate each mention
5. Track your progress in the top-right: `Progress: 📄 PDF 3/55 | 📑 Page 2/12 | 🏷️ Verified 4/8`

### Validation Actions

| Action | What it does |
|--------|-------------|
| ✅ **Correct** | Confirms the AI extraction is a real dataset mention |
| ❌ **Incorrect** | Marks the extraction as wrong / not a dataset |
| **Click tag badge** | Change dataset type (named, descriptive, generic) |
| **Highlight text → Annotate** | Manually add a dataset mention the AI missed |
| 🗑️ **Delete** | Remove a dataset entry entirely |

> **Tip:** If you try to click "Next" with unverified mentions, you'll get a confirmation prompt.

---

## Document Assignments

Each annotator sees only their assigned documents. A configurable percentage of documents (default 10%) is reserved as **overlap documents**, which are shared across all annotators so that inter-annotator agreement can be measured.

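
As a rough illustration of what the overlap documents make possible, the sketch below computes Cohen's kappa for two annotators' verdicts on the same mentions. The guide does not say which agreement metric the project uses, so the metric, the function name `cohens_kappa`, and the sample verdicts are all assumptions.

```python
def cohens_kappa(verdicts_a, verdicts_b):
    """Cohen's kappa for two equal-length lists of boolean verdicts."""
    if len(verdicts_a) != len(verdicts_b) or not verdicts_a:
        raise ValueError("need two non-empty lists of equal length")
    n = len(verdicts_a)
    # Fraction of mentions on which the two annotators agree.
    observed = sum(a == b for a, b in zip(verdicts_a, verdicts_b)) / n
    # Agreement expected by chance, from each annotator's rate of "True".
    p_a = sum(verdicts_a) / n
    p_b = sum(verdicts_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts on four shared mentions.
annotator_a = [True, True, False, False]
annotator_b = [True, False, False, False]
print(cohens_kappa(annotator_a, annotator_b))
```

Kappa corrects raw agreement for chance, which matters when most extractions are correct and two annotators would agree often even by guessing.
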
### Configuration File

`annotation_data/annotator_config.yaml`:
```yaml
settings:
  overlap_percent: 10      # % of docs shared between all annotators

annotators:
  - username: rafmacalaba  # HuggingFace username
    docs: [2, 3, 14, ...]  # assigned doc indices
  - username: rafaelmacalaba
    docs: [1, 2, 10, ...]
```

### Auto-Generate Assignments

```bash
# Preview assignment distribution:
uv run --with pyyaml python3 generate_assignments.py --dry-run

# Generate and save locally:
uv run --with pyyaml python3 generate_assignments.py

# Generate, save, and upload to HF:
uv run --with pyyaml,huggingface_hub python3 generate_assignments.py --upload
```

The script:
- Reads `annotator_config.yaml` for the annotator list and overlap percentage
- Shuffles all available docs (deterministic, seed=42)
- Reserves `overlap_percent`% of the docs to be shared by ALL annotators
- Splits the rest evenly across annotators
- Saves the assignments back to the YAML

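
The steps above can be sketched roughly as follows. This is not the code in `generate_assignments.py`: the function name `assign_docs`, the rounding of the overlap count, and the round-robin split are assumptions made for illustration.

```python
import random


def assign_docs(doc_ids, usernames, overlap_percent=10, seed=42):
    """Return {username: sorted doc ids}; overlap docs go to every annotator."""
    docs = list(doc_ids)
    random.Random(seed).shuffle(docs)  # deterministic shuffle for a fixed seed
    # Assumption: the overlap count is rounded to the nearest whole document.
    n_overlap = round(len(docs) * overlap_percent / 100)
    overlap, rest = docs[:n_overlap], docs[n_overlap:]
    assignments = {}
    for i, user in enumerate(usernames):
        own = rest[i::len(usernames)]  # round-robin split of the remaining docs
        assignments[user] = sorted(overlap + own)
    return assignments


# Hypothetical run: 20 documents, two annotators, 10% overlap.
result = assign_docs(range(20), ["rafmacalaba", "rafaelmacalaba"])
for user, docs in result.items():
    print(user, docs)
```

With a fixed seed, re-running the function on the same inputs reproduces the same assignment, which is why adding an annotator and re-running changes everyone's list in a predictable way.
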
### Adding a New Annotator

1. Add to `annotation_data/annotator_config.yaml`:
   ```yaml
   - username: new_hf_username
     docs: []
   ```
2. Re-run: `uv run --with pyyaml,huggingface_hub python3 generate_assignments.py --upload`
3. Add the username to `ALLOWED_USERS` in the Space settings

### Manual Editing

You can manually edit the `docs` array for any annotator in the YAML file, then upload it:
```bash
uv run --with huggingface_hub python3 -c "
from huggingface_hub import HfApi
api = HfApi()
# Keyword arguments make the local path, repo path, and repo ID explicit
api.upload_file(
    path_or_fileobj='annotation_data/annotator_config.yaml',
    path_in_repo='annotation_data/annotator_config.yaml',
    repo_id='ai4data/annotation_data',
    repo_type='dataset',
)
"
```

---

## Per-Annotator Validation (Overlap Support)

Each dataset mention stores validations per annotator in a `validations` array:

```json
{
  "dataset_name": { "text": "DHS Survey", "confidence": 0.95 },
  "dataset_tag": "named",
  "validations": [
    {
      "annotator": "rafmacalaba",
      "human_validated": true,
      "human_verdict": true,
      "human_notes": null,
      "validated_at": "2025-02-24T11:00:00Z"
    },
    {
      "annotator": "rafaelmacalaba",
      "human_validated": true,
      "human_verdict": false,
      "human_notes": "This is a study name, not a dataset",
      "validated_at": "2025-02-24T11:05:00Z"
    }
  ]
}
```

**Key behavior:**
- Each annotator sees only **their own** validation status (to avoid bias)
- The progress bar and "Next" prompt count only **your** verifications
- Tag edits (`dataset_tag`) are shared: they're factual, not judgment-based
- Re-validating updates your existing entry (it doesn't create duplicates)

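
A minimal sketch of the update-in-place and own-view behavior described above, written in Python for illustration (the app's real logic lives in `validate/route.js`, which is not shown here). The helper names `upsert_validation` and `own_validation` are hypothetical; the field names come from the JSON example.

```python
from datetime import datetime, timezone


def upsert_validation(mention, annotator, verdict, notes=None):
    """Add or update one annotator's entry in mention['validations']."""
    entry = {
        "annotator": annotator,
        "human_validated": True,
        "human_verdict": verdict,
        "human_notes": notes,
        "validated_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    validations = mention.setdefault("validations", [])
    for i, existing in enumerate(validations):
        if existing.get("annotator") == annotator:
            validations[i] = entry  # re-validation replaces the old entry
            return mention
    validations.append(entry)
    return mention


def own_validation(mention, annotator):
    """Return only this annotator's entry, or None if not yet validated."""
    for v in mention.get("validations", []):
        if v.get("annotator") == annotator:
            return v
    return None


mention = {"dataset_name": {"text": "DHS Survey", "confidence": 0.95},
           "dataset_tag": "named"}
upsert_validation(mention, "rafmacalaba", True)
upsert_validation(mention, "rafmacalaba", False)  # updates, no duplicate
print(len(mention["validations"]))
```
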
---

## Data Pipeline

### `prepare_data.py`: Prepare & Upload Documents

```bash
# Dry run (scan only):
uv run --with huggingface_hub,requests,langdetect python3 prepare_data.py --dry-run

# Upload missing docs + update links:
uv run --with huggingface_hub,requests,langdetect python3 prepare_data.py

# Only update wbg_pdf_links.json:
uv run --with huggingface_hub,requests,langdetect python3 prepare_data.py --links-only
```

This script:
- Scans local `annotation_data/wbg_extractions/` for real `_direct_judged.jsonl` files
- Detects language using `langdetect` (excludes non-English documents: Arabic, French)
- Uploads English docs to the HF dataset
- Updates `wbg_pdf_links.json` with `has_revalidation` and `language` fields

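
The scan and registry-update steps might look something like the sketch below. It is only an illustration: the layout under `wbg_extractions/`, the shape of a registry entry, and the helper names `find_judged_files` and `update_link_entry` are assumptions, and language detection is left out so the example needs only the standard library. The language code is passed in as a string such as `"en"`, the kind of ISO 639-1 code `langdetect` returns.

```python
from pathlib import Path


def find_judged_files(root):
    """Return sorted paths of *_direct_judged.jsonl files anywhere under root."""
    return sorted(Path(root).rglob("*_direct_judged.jsonl"))


def update_link_entry(entry, has_file, language):
    """Return a copy of a registry entry with the two extra fields set.

    Assumption: has_revalidation simply records whether a judged file exists.
    """
    updated = dict(entry)
    updated["has_revalidation"] = bool(has_file)
    updated["language"] = language
    return updated


# Hypothetical registry entry; the real fields in wbg_pdf_links.json may differ.
entry = {"index": 1}
print(update_link_entry(entry, has_file=True, language="en"))
```
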
---

## Leaderboard 🏆

Click **🏆 Leaderboard** in the top bar to see annotator rankings.

| Metric | Description |
|--------|-------------|
| ✅ Verified | Number of mentions validated |
| ✍️ Added | Manually added dataset mentions |
| 📄 Docs | Number of documents worked on |
| ⭐ Score | `Verified + Added` |

Results are cached for 2 minutes.

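
To make the score formula and the two-minute cache concrete, here is a small Python sketch. It is not the implementation in `leaderboard/route.js`; `rank_annotators`, `TTLCache`, and the sample numbers are hypothetical.

```python
import time


def rank_annotators(stats):
    """stats maps username -> {"verified": int, "added": int, "docs": int}."""
    rows = [
        {"annotator": user, **s, "score": s["verified"] + s["added"]}
        for user, s in stats.items()
    ]
    return sorted(rows, key=lambda r: r["score"], reverse=True)


class TTLCache:
    """Cache one computed value for ttl_seconds (120 s = 2 minutes)."""

    def __init__(self, compute, ttl_seconds=120, clock=time.monotonic):
        self._compute, self._ttl, self._clock = compute, ttl_seconds, clock
        self._value, self._expires = None, None

    def get(self):
        now = self._clock()
        if self._expires is None or now >= self._expires:
            self._value = self._compute()  # recompute only after expiry
            self._expires = now + self._ttl
        return self._value


sample = {
    "rafmacalaba": {"verified": 4, "added": 1, "docs": 2},
    "rafaelmacalaba": {"verified": 5, "added": 3, "docs": 1},
}
cache = TTLCache(lambda: rank_annotators(sample))
for row in cache.get():
    print(row["annotator"], row["score"])
```
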
---

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/documents?user=X` | GET | List documents (filtered by user assignment) |
| `/api/document?index=X&page=Y` | GET | Get page data for a specific document |
| `/api/validate` | PUT | Submit validation for a dataset mention |
| `/api/validate?doc=X&page=Y&idx=Z` | DELETE | Remove a dataset entry |
| `/api/leaderboard` | GET | Annotator rankings |
| `/api/pdf-proxy?url=X` | GET | Proxy PDF downloads (bypasses CORS) |
| `/api/auth/login` | GET | Start HF OAuth flow |
| `/api/auth/callback` | GET | OAuth callback |

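
For scripting against these endpoints, the query strings can be built with the standard library as sketched below. The base URL `https://example.hf.space` is a placeholder and `api_url` is a hypothetical helper; the request body expected by `PUT /api/validate` is not documented in this guide, so it is not shown.

```python
from urllib.parse import urlencode, urljoin


def api_url(base, path, **params):
    """Join base and path, appending URL-encoded query parameters if any."""
    url = urljoin(base.rstrip("/") + "/", path.lstrip("/"))
    return f"{url}?{urlencode(params)}" if params else url


BASE = "https://example.hf.space"  # placeholder, not the real Space URL

print(api_url(BASE, "/api/documents", user="rafmacalaba"))
print(api_url(BASE, "/api/document", index=3, page=2))
# The PDF URL must be percent-encoded when passed as a query parameter.
print(api_url(BASE, "/api/pdf-proxy", url="https://a.org/x.pdf"))
```
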
---

## Architecture

```
hf_spaces_docker/
├── app/
│   ├── page.js                    # Main app (client component)
│   ├── globals.css                # All styles
│   ├── api/
│   │   ├── documents/route.js     # Doc listing + user filtering
│   │   ├── document/route.js      # Single page data
│   │   ├── validate/route.js      # Validate/delete mentions
│   │   ├── leaderboard/route.js   # Leaderboard stats
│   │   ├── pdf-proxy/route.js     # PDF CORS proxy
│   │   └── auth/                  # HF OAuth login/callback
│   └── components/
│       ├── AnnotationPanel.js     # Side panel with dataset cards
│       ├── AnnotationModal.js     # Manual annotation dialog
│       ├── DocumentSelector.js    # Document dropdown
│       ├── Leaderboard.js         # Leaderboard modal
│       ├── MarkdownAnnotator.js   # Text viewer with highlighting
│       ├── PageNavigator.js       # Prev/Next page buttons
│       ├── PdfViewer.js           # PDF iframe with loading state
│       └── ProgressBar.js         # PDF/Page/Verified pills
├── annotation_data/
│   ├── annotator_config.yaml      # Annotator assignments
│   └── wbg_data/
│       └── wbg_pdf_links.json     # Doc registry with URLs
├── prepare_data.py                # Upload docs to HF
└── generate_assignments.py        # Auto-assign docs to annotators
```

---

## Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `HF_TOKEN` | Yes | HuggingFace API token (read/write) |
| `OAUTH_CLIENT_ID` | Yes (Space) | HF OAuth app client ID |
| `OAUTH_CLIENT_SECRET` | Yes (Space) | HF OAuth app client secret |
| `ALLOWED_USERS` | Yes (Space) | Comma-separated HF usernames |
| `NEXTAUTH_SECRET` | Yes | Secret for cookie signing |
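
A quick pre-deployment check of these variables could be sketched as follows. The helper names are hypothetical, and the sketch treats all five variables as required, as they would be when running as a Space.

```python
import os

REQUIRED = ["HF_TOKEN", "OAUTH_CLIENT_ID", "OAUTH_CLIENT_SECRET",
            "ALLOWED_USERS", "NEXTAUTH_SECRET"]


def missing_vars(env=os.environ):
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]


def parse_allowed_users(value):
    """Split a comma-separated username list, ignoring blanks and spaces."""
    return [u.strip() for u in value.split(",") if u.strip()]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Allowed users:", parse_allowed_users(os.environ["ALLOWED_USERS"]))
```

Stripping whitespace around each username makes a value such as `rafmacalaba, rafaelmacalaba` behave the same as one without spaces.
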