lucymakeit committed on
Commit 704224b · verified · 1 Parent(s): a328ede

upload README.md

Files changed (1)
  1. README.md +206 -0
README.md ADDED
@@ -0,0 +1,206 @@
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ - fr
6
+ tags:
7
+ - graph-neural-network
8
+ - dom
9
+ - html
10
+ - node-classification
11
+ - web
12
+ - browser-automation
13
+ library_name: custom
14
+ pipeline_tag: token-classification
15
+ ---
16
+
17
+ # dom-node-classifier
18
+
19
+ ## Model description
20
+
21
+ `dom-node-classifier` is a GATv2 (Graph Attention Network v2) that classifies every node of an HTML DOM into one of 14 semantic classes. It is designed to serve as a perception layer for browser agents and web annotation pipelines.
22
+
23
+ The model takes a structured DOM representation (nodes with features + a tree edge index) and outputs a class label and confidence score per node. It does not process raw HTML or screenshots — the DOM must be pre-extracted into the JSON format described below.
24
+
25
+ **Architecture:** GATv2 with 3 message-passing layers, 4 attention heads, hidden dimension 128, and a learned input projection that mixes heterogeneous node features before graph propagation.
26
+
27
+ **Why GATv2 over GATv1?** GATv1 computes static attention: the ranking of neighbors by attention score is the same for every query node. GATv2 (Brody, Alon & Yahav, 2022) applies the attention vector after the non-linearity instead of before it, which makes the attention weights dynamic, i.e. dependent on the query node. This matters for DOM nodes whose relevance depends heavily on context.
28
+
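The difference between static and dynamic attention can be illustrated on scalar node features. This is a minimal sketch, not the model's code: the weights `W` and `a` are hand-picked toy values, and the function names are hypothetical. GATv1 scores a pair as `LeakyReLU(a^T [W h_i || W h_j])`, GATv2 as `a^T LeakyReLU(W [h_i || h_j])`.

```python
def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def gatv1_score(h_i, h_j):
    # LeakyReLU(a^T [W h_i || W h_j]) with toy W = [1, -1]^T, a = [1, 0.5, 1, 0.5]
    wh_i, wh_j = [h_i, -h_i], [h_j, -h_j]
    a = [1.0, 0.5, 1.0, 0.5]
    return leaky_relu(sum(w * x for w, x in zip(a, wh_i + wh_j)))

def gatv2_score(h_i, h_j):
    # a^T LeakyReLU(W [h_i || h_j]) with toy W = [[1, 1], [-1, -1]], a = [1, 1]
    z = [h_i + h_j, -(h_i + h_j)]
    a = [1.0, 1.0]
    return sum(w * leaky_relu(x) for w, x in zip(a, z))

def top_key(score_fn, query, keys):
    # Neighbor (key) that receives the highest attention score for this query.
    return max(keys, key=lambda k: score_fn(query, k))

keys = [-1.0, 1.0]
v1_choices = [top_key(gatv1_score, q, keys) for q in (1.0, -1.0)]
v2_choices = [top_key(gatv2_score, q, keys) for q in (1.0, -1.0)]
print(v1_choices, v2_choices)
```

With these toy weights, GATv1 prefers the same key for both queries, because LeakyReLU is monotonic and the key term alone decides the ranking. GATv2 picks a different key per query.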
29
+ ---
30
+
31
+ ## Intended uses
32
+
33
+ - **Browser agent perception:** replacing raw HTML with a typed, confidence-ranked element list to reduce LLM context usage.
34
+ - **DOM annotation:** automatically labeling nodes in a page corpus for downstream ML tasks.
35
+ - **Web research:** studying element-type distributions across sites, languages, and page categories.
36
+
37
+ ## Out-of-scope uses
38
+
39
+ - **Accessibility compliance:** the model classifies semantic roles as observed in the wild, not as defined by WCAG or ARIA specifications. Do not use it for accessibility audits.
40
+ - **Production-critical UX automation** without human oversight: F1 on the weakest classes (particularly `action_input`, `action_select`, `structure_dismissible`) is too low for fully unattended operation.
41
+ - **Adversarial robustness:** the model was not trained against adversarially obfuscated DOM structures.
42
+
43
+ ---
44
+
45
+ ## How to use
46
+
47
+ ```python
48
+ from model.inference import DOMClassifier
49
+ from pathlib import Path
50
+ import json
51
+
52
+ # Load from Hugging Face weights (model.safetensors and config.json must be in the same directory)
53
+ clf = DOMClassifier.from_checkpoint("checkpoints_final/model.safetensors")
54
+ # Or from a local .pt checkpoint: DOMClassifier.from_checkpoint("checkpoints_final/best.pt")
55
+
56
+ raw_page = json.loads(Path("examples/sample_page.json").read_text())
57
+ predictions = clf.classify_page(raw_page, action_only=False, min_confidence=0.5)
58
+
59
+ for p in predictions:
60
+     print(f"[{p['class']:25s}] {p['confidence']:.2f} {p['selector']}")
61
+ ```
62
+
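For the browser-agent use case, the predictions can be condensed into a short, confidence-ranked element list before they are passed to an LLM. The sketch below relies only on the `class`, `confidence`, and `selector` keys shown above. The `compact_elements` helper and the sample predictions are hypothetical and not part of the repository.

```python
def compact_elements(predictions, prefix="action_", limit=20):
    """Keep predictions whose class starts with `prefix`, highest confidence first."""
    kept = [p for p in predictions if p["class"].startswith(prefix)]
    kept.sort(key=lambda p: p["confidence"], reverse=True)
    return [f"{p['class']} {p['confidence']:.2f} {p['selector']}" for p in kept[:limit]]

# Hypothetical predictions in the shape printed by the loop above.
sample = [
    {"class": "noise", "confidence": 0.97, "selector": "div.wrapper"},
    {"class": "action_button", "confidence": 0.91, "selector": "button#buy"},
    {"class": "action_link_internal", "confidence": 0.99, "selector": "a.nav-home"},
]
lines = compact_elements(sample)
print("\n".join(lines))
```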
63
+ ### Input format
64
+
65
+ `raw_page` is a dict with the following top-level keys:
66
+
67
+ | Key | Type | Description |
68
+ |-----|------|-------------|
69
+ | `url` | string | Page URL (used for link feature computation) |
70
+ | `viewport` | dict `{width, height}` | Viewport dimensions in pixels |
71
+ | `nodes` | list of node dicts | One entry per DOM node |
72
+ | `edges` | list of `[src_idx, dst_idx]` pairs | Parent→child edges using node list indices |
73
+
74
+ Each node dict:
75
+
76
+ | Key | Required | Type | Description |
77
+ |-----|----------|------|-------------|
78
+ | `id` | yes | string | Unique node identifier |
79
+ | `tag` | yes | string | HTML tag name (e.g. `"button"`, `"div"`) |
80
+ | `text` | no | string | Visible text content (truncated to 200 chars) |
81
+ | `selector` | no | string | CSS selector (returned in predictions, not used as feature) |
82
+ | `classes` | no | list[str] | CSS class tokens |
83
+ | `attrs` | no | dict | HTML attributes (`href`, `id`, `type`, `role`, …) |
84
+ | `css` | no | dict | Computed CSS (`display`, `position`, `visibility`, `opacity`, `cursor`, `font_size`, `font_weight`, `z_index`) |
85
+ | `bbox` | no | dict `{x, y, width, height}` | Bounding box in pixels |
86
+ | `depth` | no | int | DOM depth from root |
87
+ | `n_children` | no | int | Number of direct children |
88
+ | `is_visible` | no | bool | Whether the node is visible |
89
+ | `in_viewport` | no | bool | Whether the node is in the initial viewport |
90
+ | `has_listeners_heuristic` | no | bool | Whether the node likely has JS event listeners |
91
+
92
+ Optional fields that are missing default to zero or empty values.
93
+
94
+ A complete example is in [`examples/sample_page.json`](examples/sample_page.json).
95
+
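A minimal `raw_page` and a sanity check over it might look like the sketch below. The `validate_page` helper is hypothetical, and treating all four top-level keys as required is an assumption. Only `id` and `tag` are marked as required in the node table above.

```python
def validate_page(raw_page):
    """Return a list of problems found in a raw_page dict (empty if none)."""
    problems = []
    for key in ("url", "viewport", "nodes", "edges"):  # assumed to be required
        if key not in raw_page:
            problems.append(f"missing top-level key: {key}")
    nodes = raw_page.get("nodes", [])
    for i, node in enumerate(nodes):
        for key in ("id", "tag"):  # the only required node fields
            if key not in node:
                problems.append(f"node {i} is missing required key: {key}")
    for src, dst in raw_page.get("edges", []):
        # Edges use indices into the node list, so both ends must be in range.
        if not (0 <= src < len(nodes) and 0 <= dst < len(nodes)):
            problems.append(f"edge [{src}, {dst}] points outside the node list")
    return problems

page = {
    "url": "https://example.com/",
    "viewport": {"width": 1280, "height": 720},
    "nodes": [
        {"id": "n0", "tag": "body"},
        {"id": "n1", "tag": "button", "text": "Buy now", "classes": ["btn"]},
    ],
    "edges": [[0, 1]],  # parent index -> child index
}
problems = validate_page(page)
print(problems)
```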
96
+ ---
97
+
98
+ ## Training data
99
+
100
+ The model was trained on a curated set of ~135 diverse web pages spanning e-commerce, SaaS, documentation, news, government, and forms, in English and French. Labels were generated by a deterministic heuristic pipeline based on HTML semantics, ARIA roles, CSS properties, and link structure — not by human annotators.
101
+
102
+ The training dataset is not publicly distributed.
103
+
104
+ ---
105
+
106
+ ## Training procedure
107
+
108
+ **Hardware:** NVIDIA L40S (48 GB VRAM)
109
+
110
+ **Hyperparameters:**
111
+
112
+ | Parameter | Value |
113
+ |-----------|-------|
114
+ | Epochs | 80 (early stopping, patience=15) |
115
+ | Batch size | 8 pages |
116
+ | Optimizer | AdamW |
117
+ | Learning rate | 1e-3 |
118
+ | LR schedule | Cosine annealing |
119
+ | Weight decay | 1e-4 |
120
+ | Dropout | 0.3 |
121
+ | Hidden dim | 128 |
122
+ | Attention heads | 4 |
123
+ | GATv2 layers | 3 |
124
+ | Class weighting | sqrt-inverse frequency |
125
+ | Edge augmentation | Reverse edges + sibling edges |
126
+
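The edge augmentation row can be read as follows. This is a sketch under assumptions: the card does not say whether sibling edges connect all siblings or only adjacent ones, so the hypothetical `augment_edges` below links adjacent siblings in both directions.

```python
def augment_edges(edges):
    """Add reverse edges and adjacent-sibling edges to parent->child edges."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
    augmented = [tuple(e) for e in edges]
    augmented += [(child, parent) for parent, child in edges]  # reverse edges
    for siblings in children.values():
        for left, right in zip(siblings, siblings[1:]):
            augmented += [(left, right), (right, left)]  # sibling edges (assumed adjacent-only)
    return augmented

tree = [[0, 1], [0, 2], [2, 3]]  # node 0 has children 1 and 2; node 2 has child 3
aug = augment_edges(tree)
print(aug)
```

Reverse edges let information flow from children back to parents, and sibling edges let neighboring elements (for example items of the same list) exchange information in a single message-passing step.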
127
+ **Feature vector (618 dims/node):**
128
+
129
+ | Feature block | Dims | Notes |
130
+ |---------------|------|-------|
131
+ | Tag one-hot | 51 | 50 tags + OOV bucket |
132
+ | Class hash | 128 | Hashing trick over CSS class tokens (Tailwind-robust) |
133
+ | Attribute presence | 17 | id, href, role, aria-*, type, placeholder, … |
134
+ | Computed CSS | 28 | display (11) + position (5) + 6 numeric CSS values |
135
+ | Bounding box | 5 | x, y, w, h, area (normalized by viewport) |
136
+ | Topology | 5 | depth, n_children, is_visible, in_viewport, has_listeners |
137
+ | Link semantics | 9 | absolute/relative/fragment/mailto, same-host/domain, path depth |
138
+ | Text embedding | 384 | MiniLM-L6-v2 sentence embedding (frozen) |
139
+
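The sketch below illustrates two points from the table: how the blocks could be laid out in one concatenated vector, and how a hashing trick maps arbitrary CSS class tokens to a fixed 128-dim block. The block order, the MD5-based hash, and the count-valued buckets are assumptions; the released code may differ. With the sizes as listed, the blocks add up to 627 dimensions rather than the 618 in the heading, so at least one listed size presumably differs from the implementation.

```python
import hashlib

# Block sizes as listed in the table above (the order is an assumption).
FEATURE_BLOCKS = [
    ("tag_one_hot", 51), ("class_hash", 128), ("attr_presence", 17),
    ("computed_css", 28), ("bbox", 5), ("topology", 5),
    ("link_semantics", 9), ("text_embedding", 384),
]

def block_offsets(blocks):
    """Return {name: (start, end)} slices for a concatenated feature vector."""
    offsets, start = {}, 0
    for name, size in blocks:
        offsets[name] = (start, start + size)
        start += size
    return offsets

def hash_classes(classes, dims=128):
    """Hashing trick: map each CSS class token to a bucket and count hits."""
    vec = [0.0] * dims
    for token in classes:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dims] += 1.0
    return vec

offsets = block_offsets(FEATURE_BLOCKS)
total_dims = offsets["text_embedding"][1]
class_vec = hash_classes(["btn", "btn-primary", "px-4", "hover:bg-blue-600"])
print(total_dims, sum(class_vec))
```

Because no vocabulary is stored, utility-class frameworks with a very large number of distinct tokens still map into the same fixed-size block, at the cost of occasional bucket collisions.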
140
+ **Validation criterion:** best checkpoint selected by macro-F1 on the validation split.
141
+
142
+ **Data split:** 70 / 15 / 15 train/val/test, stratified by page.
143
+
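A page-level split keeps all nodes of a page in the same partition, so no page contributes to both training and evaluation. The sketch below is a plain seeded shuffle over page ids and does not attempt any stratification. The `split_pages` helper, the seed, and the use of exactly 135 pages are illustrative assumptions.

```python
import random

def split_pages(page_ids, seed=0, train=0.70, val=0.15):
    """Shuffle page ids and cut them into train/val/test lists."""
    ids = list(page_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train)
    n_val = int(len(ids) * val)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train_ids, val_ids, test_ids = split_pages(range(135))
print(len(train_ids), len(val_ids), len(test_ids))
```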
144
+ ---
145
+
146
+ ## Evaluation results
147
+
148
+ Evaluated on a held-out test set (15% of pages, stratified split). Numbers reported as mean ± std across 5 independent training runs with different random seeds.
149
+
150
+ | Metric | Mean ± std | Min | Max |
151
+ |--------|-----------|-----|-----|
152
+ | Macro F1 | 0.825 ± 0.026 | 0.797 | 0.865 |
153
+ | Weighted F1 | 0.917 ± 0.032 | 0.882 | 0.965 |
154
+ | Action F1 (5 classes) | 0.895 ± 0.036 | 0.818 | 0.917 |
155
+
156
+ **Per-class F1, mean ± std across 5 seeds:**
157
+
158
+ | Class | Mean F1 | Std | Test support (best seed) |
159
+ |-------|---------|-----|--------------------------|
160
+ | `action_input` | 0.686 | 0.104 | 25 |
161
+ | `action_select` | 0.768 | 0.086 | 8 |
162
+ | `action_button` | 0.909 | 0.071 | 1 577 |
163
+ | `action_link_internal` | 0.996 | 0.004 | 3 119 |
164
+ | `action_link_external` | 0.996 | 0.003 | 327 |
165
+ | `structure_navigation` | 0.884 | 0.062 | 52 |
166
+ | `structure_region` | 0.770 | 0.140 | 52 |
167
+ | `structure_dismissible` | 0.363 | 0.073 | 158 |
168
+ | `structure_card` | 0.625 | 0.199 | 1 045 |
169
+ | `structure_list_item` | 0.974 | 0.015 | 3 885 |
170
+ | `content_heading` | 0.986 | 0.007 | 525 |
171
+ | `content_text` | 0.736 | 0.067 | 322 |
172
+ | `content_media` | 0.915 | 0.035 | 1 319 |
173
+ | `noise` | 0.938 | 0.022 | 18 345 |
174
+
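The relation between the per-class table and the aggregate metrics can be checked with a few lines of arithmetic. Macro F1 is the unweighted mean over the 14 classes, and the per-class means above reproduce the reported 0.825. A support-weighted mean comes out at about 0.93 rather than the reported 0.917, which is expected: the supports here come from the best seed only, whereas the reported figure averages per-seed weighted F1.

```python
# (mean F1, test support of the best seed), copied from the table above.
PER_CLASS = {
    "action_input": (0.686, 25),
    "action_select": (0.768, 8),
    "action_button": (0.909, 1577),
    "action_link_internal": (0.996, 3119),
    "action_link_external": (0.996, 327),
    "structure_navigation": (0.884, 52),
    "structure_region": (0.770, 52),
    "structure_dismissible": (0.363, 158),
    "structure_card": (0.625, 1045),
    "structure_list_item": (0.974, 3885),
    "content_heading": (0.986, 525),
    "content_text": (0.736, 322),
    "content_media": (0.915, 1319),
    "noise": (0.938, 18345),
}

def macro_f1(per_class):
    """Unweighted mean of per-class F1: every class counts equally."""
    return sum(f1 for f1, _ in per_class.values()) / len(per_class)

def weighted_f1(per_class):
    """Support-weighted mean of per-class F1: large classes dominate."""
    total = sum(n for _, n in per_class.values())
    return sum(f1 * n for f1, n in per_class.values()) / total

macro = macro_f1(PER_CLASS)
weighted = weighted_f1(PER_CLASS)
print(f"macro={macro:.3f} weighted={weighted:.3f}")
```

The gap between the two numbers shows how much `noise` and the link classes, which together account for most test nodes, lift the weighted score above the macro score.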
175
+ ---
176
+
177
+ ## Limitations
178
+
179
+ - **Low-support classes.** `action_input` (n=25) and `action_select` (n=8) have very small test sets — F1 estimates for these classes have high variance and should not be over-interpreted.
180
+ - **`structure_dismissible` is hard.** Cookie banners and modal overlays vary enormously across sites. Mean F1 of 0.363 reflects genuine label ambiguity, not a model bug.
181
+ - **Heuristic labels.** Training labels come from deterministic rules, not human annotation. Near-boundary elements (e.g. a decorative `<button>` vs. a functional one) may be mislabeled.
182
+ - **No price class.** Numerical price strings are classified as `noise`. This is a known gap.
183
+ - **Static DOM only.** The model operates on a single DOM snapshot. Dynamically loaded content, shadow DOM, and canvas elements are not modeled.
184
+ - **Dataset size and diversity.** ~135 pages, English and French only. Sites in other languages or with highly unusual layouts are out-of-distribution.
185
+
186
+ ---
187
+
188
+ ## Bias and ethical considerations
189
+
190
+ - The model encodes statistical regularities of how web developers structure pages in the training data. Sites that deviate from common patterns (niche CMS, custom frameworks) may see lower accuracy.
191
+ - The `noise` class is a catch-all for elements that don't fit other categories. Functional elements misclassified as `noise` (e.g. a decorative-looking but important button) will be silently dropped in `action_only=True` mode. Always set a confidence threshold and review low-confidence predictions.
192
+ - The model should not be used as the sole decision-maker for automated actions on behalf of users without oversight.
193
+
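One way to act on the advice about low-confidence predictions is to run with `action_only=False` and route uncertain nodes to a review queue instead of dropping them. This is a sketch only: the `needs_review` helper, the 0.8 threshold, and the sample predictions are hypothetical.

```python
def needs_review(predictions, threshold=0.8):
    """Return predictions whose confidence is below `threshold`, lowest first."""
    flagged = [p for p in predictions if p["confidence"] < threshold]
    return sorted(flagged, key=lambda p: p["confidence"])

# Hypothetical predictions; the second one may be a real button labeled as noise.
sample = [
    {"class": "action_button", "confidence": 0.95, "selector": "button#submit"},
    {"class": "noise", "confidence": 0.55, "selector": "div.cta"},
    {"class": "structure_dismissible", "confidence": 0.62, "selector": "div#cookie-banner"},
]
queue = needs_review(sample)
print([p["selector"] for p in queue])
```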
194
+ ---
195
+
196
+ ## License
197
+
198
+ Apache 2.0 — see [LICENSE](https://github.com/lucymakeit/dom-node-classifier/blob/main/LICENSE).
199
+
200
+ ## Citation
201
+
202
+ If you use this model in your work, a link back to this repository is appreciated.
203
+
204
+ ## Contact
205
+
206
+ Lucy Paureau · [lmi.rest](https://lmi.rest) · lucy.paureau@gmail.com