---
license: apache-2.0
language:
  - en
  - fr
tags:
  - graph-neural-network
  - dom
  - html
  - node-classification
  - web
  - browser-automation
library_name: custom
pipeline_tag: token-classification
---

# dom-node-classifier

## Model description

`dom-node-classifier` is a GATv2 (Graph Attention Network v2) that classifies every node of an HTML DOM into one of 14 semantic classes. It is designed to serve as a perception layer for browser agents and web annotation pipelines.

The model takes a structured DOM representation (nodes with features + a tree edge index) and outputs a class label and confidence score per node. It does not process raw HTML or screenshots — the DOM must be pre-extracted into the JSON format described below.

**Architecture:** GATv2 with 3 message-passing layers, 4 attention heads, hidden dimension 128, and a learned input projection that mixes heterogeneous node features before graph propagation.

**Why GATv2 over GAT v1?** The original GAT computes *static* attention: the ranking of neighbors by attention score is the same for every query node. GATv2 (Brody, Alon & Yahav, 2022) applies the attention vector after the non-linearity instead of before it, which makes the attention weights dynamic, i.e. dependent on the query node. This matters for DOM nodes whose relevance depends heavily on context.
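
The difference between the two scoring functions can be illustrated with a small NumPy sketch. It is not the model's code: the dimensions, random weights, and variable names are made up for illustration, and it only computes unnormalized attention scores on a fully connected toy graph.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

rng = np.random.default_rng(0)
n, d_in, d_out = 5, 8, 16            # toy sizes, not the model's
h = rng.normal(size=(n, d_in))       # one feature row per node

# GAT (v1): e_ij = LeakyReLU(a^T [W h_i || W h_j])
W1 = rng.normal(size=(d_in, d_out))
a1 = rng.normal(size=2 * d_out)
z = h @ W1
src = z @ a1[:d_out]                 # depends on the query node i only
dst = z @ a1[d_out:]                 # depends on the neighbor j only
e_gat = leaky_relu(src[:, None] + dst[None, :])

# GATv2: e_ij = a^T LeakyReLU(W [h_i || h_j])
W2 = rng.normal(size=(2 * d_in, d_out))
a2 = rng.normal(size=d_out)
pairs = np.concatenate([np.repeat(h, n, axis=0), np.tile(h, (n, 1))], axis=1)
e_gatv2 = (leaky_relu(pairs @ W2) @ a2).reshape(n, n)

# Row i holds the neighbor ranking seen by query node i.
gat_rankings = np.argsort(e_gat, axis=1)
gatv2_rankings = np.argsort(e_gatv2, axis=1)
print("GAT ranking identical for all queries:",
      bool((gat_rankings == gat_rankings[0]).all()))
```

In the GAT score the query and neighbor terms are added before a monotonic function, so every row of `gat_rankings` is the same. The GATv2 score mixes both nodes before the non-linearity, so its rankings are free to differ from one query to the next.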

---

## Intended uses

- **Browser agent perception:** replacing raw HTML with a typed, confidence-ranked element list to reduce LLM context usage.
- **DOM annotation:** automatically labeling nodes in a page corpus for downstream ML tasks.
- **Web research:** studying element-type distributions across sites, languages, and page categories.

## Out-of-scope uses

- **Accessibility compliance:** the model classifies semantic roles as observed in the wild, not as defined by WCAG or ARIA specifications. Do not use it for accessibility audits.
- **Production-critical UX automation** without human oversight: F1 on the weakest classes (particularly `action_input`, `action_select`, `structure_dismissible`) is insufficient for fully unattended operation.
- **Adversarial robustness:** the model was not trained against adversarially obfuscated DOM structures.

---

## How to use

```python
import json
from pathlib import Path

from model.inference import DOMClassifier

# Load the Hugging Face weights (model.safetensors + config.json must be in the same directory)
clf = DOMClassifier.from_checkpoint("checkpoints_final/model.safetensors")
# Or from a local .pt checkpoint:  DOMClassifier.from_checkpoint("checkpoints_final/best.pt")

raw_page = json.loads(Path("examples/sample_page.json").read_text())
predictions = clf.classify_page(raw_page, action_only=False, min_confidence=0.5)

# Each prediction is a dict with 'class', 'confidence' and 'selector' keys
for p in predictions:
    print(f"[{p['class']:25s}] {p['confidence']:.2f}  {p['selector']}")
```

### Input format

`raw_page` is a dict with the following top-level keys:

| Key | Type | Description |
|-----|------|-------------|
| `url` | string | Page URL (used for link feature computation) |
| `viewport` | dict `{width, height}` | Viewport dimensions in pixels |
| `nodes` | list of node dicts | One entry per DOM node |
| `edges` | list of `[src_idx, dst_idx]` pairs | Parent→child edges using node list indices |

Each node dict:

| Key | Required | Type | Description |
|-----|----------|------|-------------|
| `id` | yes | string | Unique node identifier |
| `tag` | yes | string | HTML tag name (e.g. `"button"`, `"div"`) |
| `text` | no | string | Visible text content (truncated to 200 chars) |
| `selector` | no | string | CSS selector (returned in predictions, not used as feature) |
| `classes` | no | list[str] | CSS class tokens |
| `attrs` | no | dict | HTML attributes (`href`, `id`, `type`, `role`, …) |
| `css` | no | dict | Computed CSS (`display`, `position`, `visibility`, `opacity`, `cursor`, `font_size`, `font_weight`, `z_index`) |
| `bbox` | no | dict `{x, y, width, height}` | Bounding box in pixels |
| `depth` | no | int | DOM depth from root |
| `n_children` | no | int | Number of direct children |
| `is_visible` | no | bool | Whether the node is visible |
| `in_viewport` | no | bool | Whether the node is in the initial viewport |
| `has_listeners_heuristic` | no | bool | Whether the node likely has JS event listeners |

Missing optional fields default to zero or empty values.
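
As an illustration of the format above, here is a minimal hand-written page plus a small sanity check. `validate_page` is a hypothetical helper, not part of the library, and it assumes that all four top-level keys are required, which the tables above do not state explicitly.

```python
def validate_page(page: dict) -> list[str]:
    """Return a list of problems; an empty list means the page looks well-formed."""
    problems = []
    # Assumption: all four top-level keys are mandatory.
    for key in ("url", "viewport", "nodes", "edges"):
        if key not in page:
            problems.append(f"missing top-level key: {key}")
    nodes = page.get("nodes", [])
    for i, node in enumerate(nodes):
        for key in ("id", "tag"):  # the two required node fields
            if key not in node:
                problems.append(f"node {i} is missing required key: {key}")
    for src, dst in page.get("edges", []):
        if not (0 <= src < len(nodes) and 0 <= dst < len(nodes)):
            problems.append(f"edge [{src}, {dst}] points outside the node list")
    return problems


minimal_page = {
    "url": "https://example.com/",
    "viewport": {"width": 1280, "height": 720},
    "nodes": [
        {"id": "n0", "tag": "body"},
        {"id": "n1", "tag": "button", "text": "Sign in", "attrs": {"type": "submit"}},
    ],
    "edges": [[0, 1]],  # parent -> child, by position in "nodes"
}

print(validate_page(minimal_page))
```

Running a check like this before inference catches the most common extraction mistake, an edge that refers to a node by `id` rather than by its index in `nodes`.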

A complete example is in [`examples/sample_page.json`](examples/sample_page.json).

---

## Training data

The model was trained on a curated set of ~135 diverse web pages spanning e-commerce, SaaS, documentation, news, government, and forms, in English and French. Labels were generated by a deterministic heuristic pipeline based on HTML semantics, ARIA roles, CSS properties, and link structure — not by human annotators.

The training dataset is not publicly distributed.

---

## Training procedure

**Hardware:** NVIDIA L40S (48 GB VRAM)

**Hyperparameters:**

| Parameter | Value |
|-----------|-------|
| Epochs | 80 (early stopping, patience=15) |
| Batch size | 8 pages |
| Optimizer | AdamW |
| Learning rate | 1e-3 |
| LR schedule | Cosine annealing |
| Weight decay | 1e-4 |
| Dropout | 0.3 |
| Hidden dim | 128 |
| Attention heads | 4 |
| GATv2 layers | 3 |
| Class weighting | sqrt-inverse frequency |
| Edge augmentation | Reverse edges + sibling edges |
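
The last two rows of the table can be sketched as follows. Both functions are illustrative assumptions rather than the training code: the weights are assumed to be `1 / sqrt(count)` rescaled to a mean of 1, and "sibling edges" is read as bidirectional links between adjacent children of the same parent.

```python
import math
from collections import Counter, defaultdict


def sqrt_inverse_frequency_weights(labels):
    """Per-class loss weights proportional to 1 / sqrt(class count)."""
    counts = Counter(labels)
    raw = {cls: 1.0 / math.sqrt(n) for cls, n in counts.items()}
    mean = sum(raw.values()) / len(raw)
    # Assumption: rescale so the weights average to 1.
    return {cls: w / mean for cls, w in raw.items()}


def augment_edges(edges):
    """Add reverse edges and adjacent-sibling edges to parent -> child edges."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)

    augmented = list(edges)
    augmented += [(child, parent) for parent, child in edges]  # reverse edges
    for siblings in children.values():
        # Assumption: only neighboring siblings are linked, in both directions.
        for left, right in zip(siblings, siblings[1:]):
            augmented += [(left, right), (right, left)]
    return augmented


weights = sqrt_inverse_frequency_weights(["noise"] * 100 + ["action_select"] * 4)
tree = [(0, 1), (0, 2), (0, 3)]
print(weights)
print(augment_edges(tree))
```

Taking the square root softens the correction: a class that is 25 times rarer gets 5 times the weight rather than 25 times, which limits how much a handful of rare nodes can dominate the loss.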

**Feature vector (618 dims/node):**

| Feature block | Dims | Notes |
|---------------|------|-------|
| Tag one-hot | 51 | 50 tags + OOV bucket |
| Class hash | 128 | Hashing trick over CSS class tokens (Tailwind-robust) |
| Attribute presence | 17 | id, href, role, aria-*, type, placeholder, … |
| Computed CSS | 28 | display (11) + position (5) + 6 numeric CSS values |
| Bounding box | 5 | x, y, w, h, area (normalized by viewport) |
| Topology | 5 | depth, n_children, is_visible, in_viewport, has_listeners |
| Link semantics | 9 | absolute/relative/fragment/mailto, same-host/domain, path depth |
| Text embedding | 384 | MiniLM-L6-v2 sentence embedding (frozen) |
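
Two of the feature blocks are simple enough to sketch. The code below is a hedged illustration, not the project's feature extractor: the choice of MD5 as the hash, the binary (rather than count) encoding, and the exact normalization of the bounding box are all assumptions.

```python
import hashlib


def hash_class_tokens(classes, n_buckets=128):
    """Hashing trick: map arbitrary CSS class tokens to a fixed-size binary vector."""
    vec = [0.0] * n_buckets
    for token in classes:
        # A stable hash keeps features reproducible across runs
        # (Python's built-in hash() is salted per process).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "little") % n_buckets
        vec[bucket] = 1.0
    return vec


def bbox_features(bbox, viewport):
    """x, y, w, h and area, each normalized by the viewport size."""
    vw, vh = viewport["width"], viewport["height"]
    return [
        bbox["x"] / vw,
        bbox["y"] / vh,
        bbox["width"] / vw,
        bbox["height"] / vh,
        (bbox["width"] * bbox["height"]) / (vw * vh),
    ]


class_vec = hash_class_tokens(["btn", "btn-primary", "px-4"])
box_vec = bbox_features(
    {"x": 0, "y": 0, "width": 640, "height": 360},
    {"width": 1280, "height": 720},
)
print(len(class_vec), box_vec)
```

Hashing needs no vocabulary, so utility-class frameworks that generate thousands of distinct tokens still produce a 128-dimensional block, at the cost of occasional collisions between unrelated tokens.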

**Validation criterion:** best checkpoint selected by macro-F1 on the validation split.

**Data split:** 70 / 15 / 15 train/val/test, stratified by page.
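
A page-level split can be sketched as below. `split_pages` is a hypothetical helper and the seed is arbitrary; the sketch only shows how whole pages are assigned to a split so that nodes from one page never appear in two of them, and it does not attempt any stratification.

```python
import random


def split_pages(page_ids, seed=0, train_frac=0.70, val_frac=0.15):
    """Shuffle page identifiers and cut them into train / val / test lists."""
    ids = sorted(page_ids)               # fixed starting order for reproducibility
    random.Random(seed).shuffle(ids)
    n_train = int(train_frac * len(ids))
    n_val = int(val_frac * len(ids))
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]         # remainder goes to the test split
    return train, val, test


pages = [f"page_{i:03d}" for i in range(135)]
train, val, test = split_pages(pages)
print(len(train), len(val), len(test))
```

Splitting by page rather than by node matters here because nodes within a page are strongly correlated; a node-level split would leak page structure from training into evaluation.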

---

## Evaluation results

Evaluated on a held-out test set (15% of pages, stratified split). Numbers are reported as mean ± std across 5 independent training runs with different random seeds.

| Metric | Mean ± std | Min | Max |
|--------|-----------|-----|-----|
| Macro F1 | 0.825 ± 0.026 | 0.797 | 0.865 |
| Weighted F1 | 0.917 ± 0.032 | 0.882 | 0.965 |
| Action F1 (5 classes) | 0.895 ± 0.036 | 0.818 | 0.917 |

**Per-class F1, mean ± std across 5 seeds:**

| Class | Mean F1 | Std | Test support (best seed) |
|-------|---------|-----|--------------------------|
| `action_input` | 0.686 | 0.104 | 25 |
| `action_select` | 0.768 | 0.086 | 8 |
| `action_button` | 0.909 | 0.071 | 1 577 |
| `action_link_internal` | 0.996 | 0.004 | 3 119 |
| `action_link_external` | 0.996 | 0.003 | 327 |
| `structure_navigation` | 0.884 | 0.062 | 52 |
| `structure_region` | 0.770 | 0.140 | 52 |
| `structure_dismissible` | 0.363 | 0.073 | 158 |
| `structure_card` | 0.625 | 0.199 | 1 045 |
| `structure_list_item` | 0.974 | 0.015 | 3 885 |
| `content_heading` | 0.986 | 0.007 | 525 |
| `content_text` | 0.736 | 0.067 | 322 |
| `content_media` | 0.915 | 0.035 | 1 319 |
| `noise` | 0.938 | 0.022 | 18 345 |
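
The two tables can be cross-checked against each other. Assuming macro F1 is the unweighted mean over the 14 classes and every class is scored in every run, the mean macro F1 across seeds equals the mean of the per-class mean F1 values. The snippet below copies the per-class means from the table above and recomputes it.

```python
# Per-class mean F1 values, copied from the table above.
per_class_f1 = {
    "action_input": 0.686,
    "action_select": 0.768,
    "action_button": 0.909,
    "action_link_internal": 0.996,
    "action_link_external": 0.996,
    "structure_navigation": 0.884,
    "structure_region": 0.770,
    "structure_dismissible": 0.363,
    "structure_card": 0.625,
    "structure_list_item": 0.974,
    "content_heading": 0.986,
    "content_text": 0.736,
    "content_media": 0.915,
    "noise": 0.938,
}

# Macro F1 gives every class the same weight, regardless of support.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(f"macro F1 = {macro_f1:.3f}")  # prints "macro F1 = 0.825"
```

Because every class counts equally, the 0.363 on `structure_dismissible` pulls the macro score down far more than it affects weighted F1, which is dominated by the 18 345 `noise` nodes.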

---

## Limitations

- **Low-support classes.** `action_input` (n=25) and `action_select` (n=8) have very small test sets — F1 estimates for these classes have high variance and should not be over-interpreted.
- **`structure_dismissible` is hard.** Cookie banners and modal overlays vary enormously across sites. The mean F1 of 0.363 most likely reflects label ambiguity rather than a model defect.
- **Heuristic labels.** Training labels come from deterministic rules, not human annotation. Near-boundary elements (e.g. a decorative `<button>` vs. a functional one) may be mislabeled.
- **No price class.** Numerical price strings are classified as `noise`. This is a known gap.
- **Static DOM only.** The model operates on a single DOM snapshot. Dynamically loaded content, shadow DOM, and canvas elements are not modeled.
- **Dataset size and diversity.** ~135 pages, English and French only. Sites in other languages or with highly unusual layouts are out-of-distribution.

---

## Bias and ethical considerations

- The model encodes statistical regularities of how web developers structure pages in the training data. Sites that deviate from common patterns (niche CMS, custom frameworks) may see lower accuracy.
- The `noise` class is a catch-all for elements that don't fit other categories. Misclassified functional elements (e.g. a decorative-looking but important button) will be silently dropped in `action_only=True` mode. Always set a confidence threshold and review low-confidence predictions.
- The model should not be used as the sole decision-maker for automated actions on behalf of users without oversight.
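
One way to act on the advice above is to route low-confidence predictions to a review queue instead of discarding them. The sketch below is a hypothetical post-processing step, not the library's implementation; it assumes only the `class` and `confidence` keys shown in the usage example and that action classes share the `action_` prefix.

```python
def triage(predictions, min_confidence=0.5, action_only=False):
    """Split predictions into those safe to act on and those needing review."""
    kept, review = [], []
    for p in predictions:
        if action_only and not p["class"].startswith("action_"):
            continue  # non-action classes are out of scope in this mode
        if p["confidence"] >= min_confidence:
            kept.append(p)
        else:
            review.append(p)  # keep for a human or a fallback check
    return kept, review


sample = [
    {"class": "action_button", "confidence": 0.93, "selector": "#submit"},
    {"class": "action_input", "confidence": 0.41, "selector": "input[name=q]"},
    {"class": "noise", "confidence": 0.88, "selector": "div.spacer"},
]
kept, review = triage(sample, min_confidence=0.5, action_only=True)
print([p["selector"] for p in kept], [p["selector"] for p in review])
```

This does not recover functional elements that were confidently labeled `noise`; for those, periodically sampling the `noise` predictions on new sites is the only check.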

---

## License

Apache 2.0 — see [LICENSE](https://github.com/lucydjo/dom-node-classifier/blob/main/LICENSE).

## Citation

If you use this model in your work, a link back to this repository is appreciated.

## Contact

Lucy Paureau · [lmi.rest](https://lmi.rest) · lucy.paureau@gmail.com