---
license: apache-2.0
language:
- en
- fr
tags:
- graph-neural-network
- dom
- html
- node-classification
- web
- browser-automation
library_name: custom
pipeline_tag: token-classification
---
# dom-node-classifier
## Model description
`dom-node-classifier` is a GATv2 (Graph Attention Network v2) that classifies every node of an HTML DOM into one of 14 semantic classes. It is designed to serve as a perception layer for browser agents and web annotation pipelines.
The model takes a structured DOM representation (nodes with features + a tree edge index) and outputs a class label and confidence score per node. It does not process raw HTML or screenshots — the DOM must be pre-extracted into the JSON format described below.
**Architecture:** GATv2 with 3 message-passing layers, 4 attention heads, hidden dimension 128, and a learned input projection that mixes heterogeneous node features before graph propagation.
**Why GATv2 over GATv1?** GATv1 computes static attention: the ranking of neighbors by attention score is the same for every query node. GATv2 (Brody, Alon & Yahav, 2022) applies the learned attention vector after the non-linearity instead of before it, which makes the attention dynamic, so the ranking can depend on the query node. This matters for DOM nodes whose relevance depends heavily on context.
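For reference, the two scoring functions can be written side by side. This is the standard formulation from the Brody et al. paper, not a description of this repository's exact code. Here $h_i$ is the query node, $h_j$ a neighbor, $W$ a learned weight matrix, $a$ a learned attention vector, and $\Vert$ denotes concatenation:

$$
e_{\text{GAT}}(h_i, h_j) = \mathrm{LeakyReLU}\left(a^\top [W h_i \,\Vert\, W h_j]\right),
\qquad
e_{\text{GATv2}}(h_i, h_j) = a^\top \mathrm{LeakyReLU}\left(W [h_i \,\Vert\, h_j]\right)
$$

In both variants the scores are normalized with a softmax over the neighbors $j$ of node $i$. In the first form, the argument of LeakyReLU splits into a term that depends only on $h_i$ plus a term that depends only on $h_j$. Because LeakyReLU is monotonic, the ordering of neighbors is then the same for every $i$. Moving $a$ outside the non-linearity removes that restriction.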
---
## Intended uses
- **Browser agent perception:** replacing raw HTML with a typed, confidence-ranked element list to reduce LLM context usage.
- **DOM annotation:** automatically labeling nodes in a page corpus for downstream ML tasks.
- **Web research:** studying element-type distributions across sites, languages, and page categories.
## Out-of-scope uses
- **Accessibility compliance:** the model classifies semantic roles as observed in the wild, not as defined by WCAG or ARIA specifications. Do not use it for accessibility audits.
- **Production-critical UX automation** without human oversight: F1 on low-support or low-scoring classes (particularly `action_input`, `action_select`, `structure_dismissible`) is insufficient for fully unattended operation.
- **Adversarial robustness:** the model was not trained against adversarially obfuscated DOM structures.
---
## How to use
```python
from model.inference import DOMClassifier
from pathlib import Path
import json
# Load Hugging Face weights (model.safetensors and config.json must be in the same directory)
clf = DOMClassifier.from_checkpoint("checkpoints_final/model.safetensors")
# Or from a local .pt checkpoint: DOMClassifier.from_checkpoint("checkpoints_final/best.pt")
raw_page = json.loads(Path("examples/sample_page.json").read_text())
predictions = clf.classify_page(raw_page, action_only=False, min_confidence=0.5)
for p in predictions:
    print(f"[{p['class']:25s}] {p['confidence']:.2f} {p['selector']}")
```
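For the browser-agent use case, one way to consume these predictions is to collapse them into a short, typed element list. The sketch below is only an illustration: `summarize_for_prompt` and `top_k` are hypothetical names that are not part of this repository, and the sample predictions are made up. Only the `class`, `confidence`, and `selector` keys come from the snippet above.

```python
from collections import defaultdict

# Made-up predictions for illustration; the keys mirror the snippet above.
predictions = [
    {"class": "action_button", "confidence": 0.97, "selector": "#checkout"},
    {"class": "action_button", "confidence": 0.61, "selector": ".promo > button"},
    {"class": "action_button", "confidence": 0.88, "selector": "form button"},
    {"class": "content_heading", "confidence": 0.93, "selector": "h1"},
]


def summarize_for_prompt(preds, top_k=2):
    """Keep the top_k most confident selectors per class, one line each."""
    by_class = defaultdict(list)
    for p in preds:
        by_class[p["class"]].append(p)

    lines = []
    for cls in sorted(by_class):
        # Highest confidence first, then truncate to top_k entries.
        best = sorted(by_class[cls], key=lambda p: p["confidence"], reverse=True)
        for p in best[:top_k]:
            lines.append(f"{cls} {p['confidence']:.2f} {p['selector']}")
    return lines


for line in summarize_for_prompt(predictions):
    print(line)
```

Capping the list per class keeps the prompt size bounded even on pages with hundreds of links or list items.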
### Input format
`raw_page` is a dict with the following top-level keys:
| Key | Type | Description |
|-----|------|-------------|
| `url` | string | Page URL (used for link feature computation) |
| `viewport` | dict `{width, height}` | Viewport dimensions in pixels |
| `nodes` | list of node dicts | One entry per DOM node |
| `edges` | list of `[src_idx, dst_idx]` pairs | Parent→child edges using node list indices |
Each node dict:
| Key | Required | Type | Description |
|-----|----------|------|-------------|
| `id` | yes | string | Unique node identifier |
| `tag` | yes | string | HTML tag name (e.g. `"button"`, `"div"`) |
| `text` | no | string | Visible text content (truncated to 200 chars) |
| `selector` | no | string | CSS selector (returned in predictions, not used as feature) |
| `classes` | no | list[str] | CSS class tokens |
| `attrs` | no | dict | HTML attributes (`href`, `id`, `type`, `role`, …) |
| `css` | no | dict | Computed CSS (`display`, `position`, `visibility`, `opacity`, `cursor`, `font_size`, `font_weight`, `z_index`) |
| `bbox` | no | dict `{x, y, width, height}` | Bounding box in pixels |
| `depth` | no | int | DOM depth from root |
| `n_children` | no | int | Number of direct children |
| `is_visible` | no | bool | Whether the node is visible |
| `in_viewport` | no | bool | Whether the node is in the initial viewport |
| `has_listeners_heuristic` | no | bool | Whether the node likely has JS event listeners |
Missing optional fields default to zero or empty values.
A complete example is in [`examples/sample_page.json`](examples/sample_page.json).
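A quick structural check before calling the classifier can catch malformed input early. The following is a minimal sketch, not part of the repository: `validate_raw_page` is a hypothetical helper, and it treats all four top-level keys as required, which the tables above do not state explicitly.

```python
REQUIRED_TOP_LEVEL = ("url", "viewport", "nodes", "edges")  # assumption: all required
REQUIRED_NODE_KEYS = ("id", "tag")  # marked "yes" in the node table


def validate_raw_page(page):
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in page:
            problems.append(f"missing top-level key: {key}")

    nodes = page.get("nodes", [])
    for i, node in enumerate(nodes):
        for key in REQUIRED_NODE_KEYS:
            if key not in node:
                problems.append(f"node {i} missing required key: {key}")

    # Edges are [src_idx, dst_idx] pairs indexing into the node list.
    for src, dst in page.get("edges", []):
        if not (0 <= src < len(nodes) and 0 <= dst < len(nodes)):
            problems.append(f"edge [{src}, {dst}] is out of range")
    return problems


# Smallest page that passes the checks: a body with one button child.
page = {
    "url": "https://example.com/",
    "viewport": {"width": 1280, "height": 720},
    "nodes": [
        {"id": "n0", "tag": "body"},
        {"id": "n1", "tag": "button", "text": "Buy now", "attrs": {"type": "submit"}},
    ],
    "edges": [[0, 1]],
}
print(validate_raw_page(page))
```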
---
## Training data
The model was trained on a curated set of ~135 diverse web pages spanning e-commerce, SaaS, documentation, news, government, and forms, in English and French. Labels were generated by a deterministic heuristic pipeline based on HTML semantics, ARIA roles, CSS properties, and link structure — not by human annotators.
The training dataset is not publicly distributed.
---
## Training procedure
**Hardware:** NVIDIA L40S (48 GB VRAM)
**Hyperparameters:**
| Parameter | Value |
|-----------|-------|
| Epochs | 80 (early stopping, patience=15) |
| Batch size | 8 pages |
| Optimizer | AdamW |
| Learning rate | 1e-3 |
| LR schedule | Cosine annealing |
| Weight decay | 1e-4 |
| Dropout | 0.3 |
| Hidden dim | 128 |
| Attention heads | 4 |
| GATv2 layers | 3 |
| Class weighting | sqrt-inverse frequency |
| Edge augmentation | Reverse edges + sibling edges |
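The last two rows of the table can be illustrated with a short sketch. It rests on assumptions that the card does not confirm: the weights are taken as proportional to `1 / sqrt(count)` and rescaled to average 1, and "sibling edges" are read as bidirectional links between consecutive children of the same parent. Both function names are hypothetical.

```python
import math
from collections import defaultdict


def sqrt_inverse_frequency_weights(class_counts):
    """Weight each class by 1/sqrt(count), rescaled so the weights average to 1."""
    raw = {c: 1.0 / math.sqrt(n) for c, n in class_counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {c: w / mean for c, w in raw.items()}


def augment_edges(edges):
    """Add child->parent edges and links between consecutive siblings."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)

    augmented = [tuple(e) for e in edges]                        # parent -> child
    augmented += [(child, parent) for parent, child in edges]    # reverse edges
    for kids in children.values():
        # Assumption: siblings are linked pairwise in edge-list order.
        for a, b in zip(kids, kids[1:]):
            augmented += [(a, b), (b, a)]
    return augmented


weights = sqrt_inverse_frequency_weights({"noise": 10000, "action_select": 100})
print(weights)
print(augment_edges([[0, 1], [0, 2]]))
```

With square-root damping, a class that is 100 times rarer receives 10 times the weight rather than 100 times, which limits how strongly very small classes dominate the loss.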
**Feature vector (618 dims/node):**
| Feature block | Dims | Notes |
|---------------|------|-------|
| Tag one-hot | 51 | 50 tags + OOV bucket |
| Class hash | 128 | Hashing trick over CSS class tokens (Tailwind-robust) |
| Attribute presence | 17 | id, href, role, aria-*, type, placeholder, … |
| Computed CSS | 28 | display (11) + position (5) + 6 numeric CSS values |
| Bounding box | 5 | x, y, w, h, area (normalized by viewport) |
| Topology | 5 | depth, n_children, is_visible, in_viewport, has_listeners |
| Link semantics | 9 | absolute/relative/fragment/mailto, same-host/domain, path depth |
| Text embedding | 384 | MiniLM-L6-v2 sentence embedding (frozen) |
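As an illustration of the "Class hash" row, the hashing trick maps an open-ended vocabulary of CSS class tokens into a fixed 128-slot vector. The sketch below is an assumption about how such a block could be built, not the repository's implementation: the choice of MD5, the binary (presence-only) encoding, and the name `hash_class_tokens` are all hypothetical.

```python
import hashlib

N_BUCKETS = 128  # matches the "Class hash" row above


def hash_class_tokens(tokens, n_buckets=N_BUCKETS):
    """Map CSS class tokens to a fixed-size multi-hot vector."""
    vec = [0.0] * n_buckets
    for tok in tokens:
        # A stable hash keeps bucket indices identical across processes,
        # unlike Python's built-in hash(), which is salted per run for str.
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:4], "big") % n_buckets
        vec[idx] = 1.0
    return vec


vec = hash_class_tokens(["flex", "items-center", "px-4"])
print(len(vec), sum(vec))
```

The vector length stays fixed however many distinct utility classes a site uses, at the cost of occasional collisions between unrelated tokens.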
**Validation criterion:** best checkpoint selected by macro-F1 on the validation split.
**Data split:** 70 / 15 / 15 train/val/test, stratified by page.
---
## Evaluation results
Evaluated on a held-out test set (15% of pages, stratified split). Numbers are reported as mean ± std across 5 independent training runs with different random seeds.
| Metric | Mean ± std | Min | Max |
|--------|-----------|-----|-----|
| Macro F1 | 0.825 ± 0.026 | 0.797 | 0.865 |
| Weighted F1 | 0.917 ± 0.032 | 0.882 | 0.965 |
| Action F1 (5 classes) | 0.895 ± 0.036 | 0.818 | 0.917 |
**Per-class F1, mean ± std across 5 seeds:**
| Class | Mean F1 | Std | Test support (best seed) |
|-------|---------|-----|--------------------------|
| `action_input` | 0.686 | 0.104 | 25 |
| `action_select` | 0.768 | 0.086 | 8 |
| `action_button` | 0.909 | 0.071 | 1 577 |
| `action_link_internal` | 0.996 | 0.004 | 3 119 |
| `action_link_external` | 0.996 | 0.003 | 327 |
| `structure_navigation` | 0.884 | 0.062 | 52 |
| `structure_region` | 0.770 | 0.140 | 52 |
| `structure_dismissible` | 0.363 | 0.073 | 158 |
| `structure_card` | 0.625 | 0.199 | 1 045 |
| `structure_list_item` | 0.974 | 0.015 | 3 885 |
| `content_heading` | 0.986 | 0.007 | 525 |
| `content_text` | 0.736 | 0.067 | 322 |
| `content_media` | 0.915 | 0.035 | 1 319 |
| `noise` | 0.938 | 0.022 | 18 345 |
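Macro F1 is the unweighted mean of the per-class F1 scores. Averaging is linear, so the mean macro F1 across seeds should equal the mean of the per-class means, assuming every seed scores the same 14 classes. The short check below uses the values from the table above.

```python
# Per-class mean F1 values copied from the table above.
per_class_mean_f1 = {
    "action_input": 0.686,
    "action_select": 0.768,
    "action_button": 0.909,
    "action_link_internal": 0.996,
    "action_link_external": 0.996,
    "structure_navigation": 0.884,
    "structure_region": 0.770,
    "structure_dismissible": 0.363,
    "structure_card": 0.625,
    "structure_list_item": 0.974,
    "content_heading": 0.986,
    "content_text": 0.736,
    "content_media": 0.915,
    "noise": 0.938,
}

# Unweighted mean over classes: every class counts equally, regardless of support.
macro_f1 = sum(per_class_mean_f1.values()) / len(per_class_mean_f1)
print(f"{macro_f1:.3f}")  # prints 0.825
```

Because every class counts equally, the low score on `structure_dismissible` pulls macro F1 well below weighted F1, which is dominated by the large `noise` class.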
---
## Limitations
- **Low-support classes.** `action_input` (n=25) and `action_select` (n=8) have very small test sets — F1 estimates for these classes have high variance and should not be over-interpreted.
- **`structure_dismissible` is hard.** Cookie banners and modal overlays vary enormously across sites. Mean F1 of 0.363 reflects genuine label ambiguity, not a model bug.
- **Heuristic labels.** Training labels come from deterministic rules, not human annotation. Near-boundary elements (e.g. a decorative `<button>` vs. a functional one) may be mislabeled.
- **No price class.** Numerical price strings are classified as `noise`. This is a known gap.
- **Static DOM only.** The model operates on a single DOM snapshot. Dynamically loaded content, shadow DOM, and canvas elements are not modeled.
- **Dataset size and diversity.** ~135 pages, English and French only. Sites in other languages or with highly unusual layouts are out-of-distribution.
---
## Bias and ethical considerations
- The model encodes statistical regularities of how web developers structure pages in the training data. Sites that deviate from common patterns (niche CMS, custom frameworks) may see lower accuracy.
- The `noise` class is a catch-all for elements that don't fit other categories. Misclassified functional elements (e.g. a decorative-looking but important button) will be silently dropped in `action_only=True` mode. Always set a confidence threshold and review low-confidence predictions.
- The model should not be used as the sole decision-maker for automated actions on behalf of users without oversight.
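The advice on thresholds and review can be sketched as a three-way triage. The function name and both cut-offs below are hypothetical placeholders, not recommended values. Only the `confidence` key is taken from the prediction format shown under "How to use".

```python
def triage(predictions, accept_at=0.9, review_at=0.5):
    """Split predictions into auto-accepted, needs-review, and dropped buckets."""
    accepted, review, dropped = [], [], []
    for p in predictions:
        if p["confidence"] >= accept_at:
            accepted.append(p)
        elif p["confidence"] >= review_at:
            review.append(p)   # route to a human or a slower fallback check
        else:
            dropped.append(p)
    return accepted, review, dropped


# Made-up predictions for illustration.
sample = [
    {"class": "action_button", "confidence": 0.95, "selector": "#buy"},
    {"class": "structure_dismissible", "confidence": 0.62, "selector": ".cookie-banner"},
    {"class": "noise", "confidence": 0.31, "selector": "div.spacer"},
]
accepted, review, dropped = triage(sample)
print(len(accepted), len(review), len(dropped))
```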
---
## License
Apache 2.0 — see [LICENSE](https://github.com/lucydjo/dom-node-classifier/blob/main/LICENSE).
## Citation
If you use this model in your work, a link back to this repository is appreciated.
## Contact
Lucy Paureau · [lmi.rest](https://lmi.rest) · lucy.paureau@gmail.com