---
license: apache-2.0
language:
- en
- fr
tags:
- graph-neural-network
- dom
- html
- node-classification
- web
- browser-automation
library_name: custom
pipeline_tag: token-classification
---

# dom-node-classifier

## Model description

`dom-node-classifier` is a GATv2 (Graph Attention Network v2) model that classifies every node of an HTML DOM into one of 14 semantic classes. It is designed to serve as a perception layer for browser agents and web annotation pipelines.

The model takes a structured DOM representation (nodes with features plus a tree edge index) and outputs a class label and confidence score per node. It does not process raw HTML or screenshots; the DOM must be pre-extracted into the JSON format described below.

**Architecture:** GATv2 with 3 message-passing layers, 4 attention heads, hidden dimension 128, and a learned input projection that mixes heterogeneous node features before graph propagation.

**Why GATv2 over GATv1?** GATv1's attention is static: the ranking of neighbors by attention score is the same for every query node. GATv2 (Brody, Alon & Yahav, 2022) applies the attention vector after the non-linearity rather than before it, which makes the attention weights dynamic, that is, query-dependent. This matters for DOM nodes whose relevance depends heavily on context.
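
For reference, the difference between the two scoring functions can be written out as follows. The notation follows the GAT and GATv2 papers (query node features $h_i$, neighbor features $h_j$, weight matrix $W$, attention vector $a$, concatenation $\Vert$); it is not taken from this repository's code.

```latex
% GAT (Velickovic et al., 2018): attention vector applied before the non-linearity
e_{\mathrm{GAT}}(h_i, h_j) = \mathrm{LeakyReLU}\left( a^{\top} \left[ W h_i \,\Vert\, W h_j \right] \right)

% GATv2 (Brody, Alon & Yahav, 2022): attention vector applied after the non-linearity
e_{\mathrm{GATv2}}(h_i, h_j) = a^{\top} \, \mathrm{LeakyReLU}\left( W \left[ h_i \,\Vert\, h_j \right] \right)

% In both cases the scores are normalized over the neighborhood of node i
\alpha_{ij} = \frac{\exp\left( e(h_i, h_j) \right)}{\sum_{j' \in \mathcal{N}_i} \exp\left( e(h_i, h_{j'}) \right)}
```

In the GAT form, $a$ and $W$ compose into a single linear map before the monotonic LeakyReLU, so the neighbor ranking cannot depend on the query. Moving $a$ outside the non-linearity removes that restriction.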

---

## Intended uses

- **Browser agent perception:** replacing raw HTML with a typed, confidence-ranked element list to reduce LLM context usage.
- **DOM annotation:** automatically labeling nodes in a page corpus for downstream ML tasks.
- **Web research:** studying element-type distributions across sites, languages, and page categories.

## Out-of-scope uses

- **Accessibility compliance:** the model classifies semantic roles as observed in the wild, not as defined by WCAG or ARIA specifications. Do not use it for accessibility audits.
- **Production-critical UX automation** without human oversight: F1 on the weakest classes (particularly `action_input`, `action_select`, and `structure_dismissible`) is too low for fully unattended operation.
- **Adversarial robustness:** the model was not trained against adversarially obfuscated DOM structures.

---

## How to use

```python
import json
from pathlib import Path

from model.inference import DOMClassifier

# Load Hugging Face weights (model.safetensors and config.json must be in the same directory)
clf = DOMClassifier.from_checkpoint("checkpoints_final/model.safetensors")
# Or from a local .pt checkpoint: DOMClassifier.from_checkpoint("checkpoints_final/best.pt")

raw_page = json.loads(Path("examples/sample_page.json").read_text())
predictions = clf.classify_page(raw_page, action_only=False, min_confidence=0.5)

# Print the class, confidence, and CSS selector of each prediction
for p in predictions:
    print(f"[{p['class']:25s}] {p['confidence']:.2f} {p['selector']}")
```
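
As a rough illustration of the "typed, confidence-ranked element list" use case, the predictions can be reduced to a short text listing before being handed to an LLM. This is a minimal sketch, not part of the package: `compact_action_list`, its `limit` parameter, and the sample data are hypothetical, and it assumes the action classes are exactly those whose name starts with `action_`.

```python
# Hypothetical post-processing helper; not part of the model package.
ACTION_PREFIX = "action_"


def compact_action_list(predictions, min_confidence=0.5, limit=20):
    """Keep action-class predictions above a threshold, highest confidence first."""
    actions = [
        p for p in predictions
        if p["class"].startswith(ACTION_PREFIX) and p["confidence"] >= min_confidence
    ]
    actions.sort(key=lambda p: p["confidence"], reverse=True)
    return [f"{p['class']} {p['confidence']:.2f} {p['selector']}" for p in actions[:limit]]


# Hand-written stand-in for the output of classify_page
sample = [
    {"class": "noise", "confidence": 0.97, "selector": "div.wrapper"},
    {"class": "action_button", "confidence": 0.91, "selector": "button#buy"},
    {"class": "action_link_internal", "confidence": 0.99, "selector": "a.nav-home"},
    {"class": "action_input", "confidence": 0.42, "selector": "input[name=q]"},
]
lines = compact_action_list(sample)
print("\n".join(lines))
```

With the sample above, only the internal link and the button survive the filter, in that order.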

### Input format

`raw_page` is a dict with the following top-level keys:

| Key | Type | Description |
|-----|------|-------------|
| `url` | string | Page URL (used for link feature computation) |
| `viewport` | dict `{width, height}` | Viewport dimensions in pixels |
| `nodes` | list of node dicts | One entry per DOM node |
| `edges` | list of `[src_idx, dst_idx]` pairs | Parent→child edges using node list indices |

Each node dict has the following keys:

| Key | Required | Type | Description |
|-----|----------|------|-------------|
| `id` | yes | string | Unique node identifier |
| `tag` | yes | string | HTML tag name (e.g. `"button"`, `"div"`) |
| `text` | no | string | Visible text content (truncated to 200 chars) |
| `selector` | no | string | CSS selector (returned in predictions, not used as a feature) |
| `classes` | no | list[str] | CSS class tokens |
| `attrs` | no | dict | HTML attributes (`href`, `id`, `type`, `role`, …) |
| `css` | no | dict | Computed CSS (`display`, `position`, `visibility`, `opacity`, `cursor`, `font_size`, `font_weight`, `z_index`) |
| `bbox` | no | dict `{x, y, width, height}` | Bounding box in pixels |
| `depth` | no | int | DOM depth from root |
| `n_children` | no | int | Number of direct children |
| `is_visible` | no | bool | Whether the node is visible |
| `in_viewport` | no | bool | Whether the node is in the initial viewport |
| `has_listeners_heuristic` | no | bool | Whether the node likely has JS event listeners |

Missing optional fields default to zero or empty values.

A complete example is in [`examples/sample_page.json`](examples/sample_page.json).
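
To make the format concrete, here is a minimal hand-written page with two nodes, plus a small sanity check that could be run before inference. All values are invented for illustration, `check_page` is a hypothetical helper rather than part of the package, and treating all four top-level keys as mandatory is an assumption of this sketch.

```python
# Illustrative two-node page; every value here is made up.
raw_page = {
    "url": "https://example.com/",
    "viewport": {"width": 1280, "height": 720},
    "nodes": [
        {"id": "n0", "tag": "body", "depth": 0, "n_children": 1},
        {
            "id": "n1",
            "tag": "button",
            "text": "Sign in",
            "selector": "body > button",
            "classes": ["btn", "btn-primary"],
            "attrs": {"type": "submit"},
            "css": {"display": "inline-block", "cursor": "pointer"},
            "bbox": {"x": 40, "y": 24, "width": 96, "height": 32},
            "depth": 1,
            "n_children": 0,
            "is_visible": True,
            "in_viewport": True,
            "has_listeners_heuristic": True,
        },
    ],
    "edges": [[0, 1]],  # parent index -> child index, into the nodes list
}


def check_page(page):
    """Hypothetical sanity check: return a list of problems (empty if none)."""
    problems = []
    for key in ("url", "viewport", "nodes", "edges"):
        if key not in page:
            problems.append(f"missing top-level key: {key}")
    nodes = page.get("nodes", [])
    for i, node in enumerate(nodes):
        for key in ("id", "tag"):  # the two required node keys
            if key not in node:
                problems.append(f"node {i} missing required key: {key}")
    ids = [n.get("id") for n in nodes]
    if len(ids) != len(set(ids)):
        problems.append("duplicate node ids")
    for src, dst in page.get("edges", []):
        if not (0 <= src < len(nodes) and 0 <= dst < len(nodes)):
            problems.append(f"edge [{src}, {dst}] out of range")
    return problems


print(check_page(raw_page))
```

Note that edges refer to positions in the `nodes` list, not to the `id` strings.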

---

## Training data

The model was trained on a curated set of ~135 diverse web pages spanning e-commerce, SaaS, documentation, news, government, and forms, in English and French. Labels were generated by a deterministic heuristic pipeline based on HTML semantics, ARIA roles, CSS properties, and link structure, not by human annotators.

The training dataset is not publicly distributed.

---

## Training procedure

**Hardware:** NVIDIA L40S (48 GB VRAM)

**Hyperparameters:**

| Parameter | Value |
|-----------|-------|
| Epochs | 80 max (early stopping, patience=15) |
| Batch size | 8 pages |
| Optimizer | AdamW |
| Learning rate | 1e-3 |
| LR schedule | Cosine annealing |
| Weight decay | 1e-4 |
| Dropout | 0.3 |
| Hidden dim | 128 |
| Attention heads | 4 |
| GATv2 layers | 3 |
| Class weighting | sqrt-inverse frequency |
| Edge augmentation | Reverse edges + sibling edges |

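
The edge augmentation row can be pictured with a small sketch. It starts from the parent→child edge list of the input format and adds reverse and sibling edges so that information can also flow upward and sideways during message passing. Two details are assumptions of this sketch, not documented behavior: sibling edges are added in both directions, and only adjacent siblings are linked. `augment_edges` is a hypothetical name.

```python
def augment_edges(edges):
    """Add reverse (child->parent) and sibling edges to parent->child edges.

    Assumptions of this sketch: siblings are linked in both directions, and
    only adjacent siblings (in edge-list order) are linked.
    """
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)

    augmented = [tuple(e) for e in edges]                      # original edges
    augmented += [(child, parent) for parent, child in edges]  # reverse edges
    for kids in children.values():
        for left, right in zip(kids, kids[1:]):                # adjacent siblings
            augmented += [(left, right), (right, left)]
    return augmented


# Node 0 has children 1, 2, 3; node 1 has child 4
edges = [[0, 1], [0, 2], [0, 3], [1, 4]]
print(augment_edges(edges))
```

Without the added edges, a node in a pure parent→child graph would only ever receive messages from its ancestors.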

**Feature vector (618 dims/node):**

| Feature block | Dims | Notes |
|---------------|------|-------|
| Tag one-hot | 51 | 50 tags + OOV bucket |
| Class hash | 128 | Hashing trick over CSS class tokens (Tailwind-robust) |
| Attribute presence | 17 | id, href, role, aria-*, type, placeholder, … |
| Computed CSS | 28 | display (11) + position (5) + 6 numeric CSS values |
| Bounding box | 5 | x, y, w, h, area (normalized by viewport) |
| Topology | 5 | depth, n_children, is_visible, in_viewport, has_listeners |
| Link semantics | 9 | absolute/relative/fragment/mailto, same-host/domain, path depth |
| Text embedding | 384 | MiniLM-L6-v2 sentence embedding (frozen) |

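
The class-hash block relies on the hashing trick: each CSS class token is hashed into one of a fixed number of buckets, so no vocabulary of class names is needed. Below is a minimal sketch of the idea with 128 buckets as in the table. The hash function (SHA-256), the binary rather than count encoding, and the name `hash_class_tokens` are assumptions; the actual feature extractor may differ.

```python
import hashlib


def hash_class_tokens(classes, n_buckets=128):
    """Map CSS class tokens to a fixed-size multi-hot vector (hashing trick).

    Assumptions of this sketch: SHA-256 as the hash, binary (not count) buckets.
    """
    vec = [0.0] * n_buckets
    for token in classes:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % n_buckets
        vec[bucket] = 1.0  # distinct tokens may collide in the same bucket
    return vec


vec = hash_class_tokens(["flex", "items-center", "px-4", "hover:bg-blue-600"])
print(len(vec), int(sum(vec)))
```

A plausible reading of "Tailwind-robust" is that utility-class frameworks produce an open-ended set of tokens, which a fixed one-hot vocabulary would not cover, whereas hashing always yields a vector of the same size.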

**Validation criterion:** best checkpoint selected by macro-F1 on the validation split.

**Data split:** 70 / 15 / 15 train/val/test, stratified by page.
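
The sqrt-inverse frequency class weighting listed in the hyperparameter table can be sketched as follows: each class gets a weight proportional to one over the square root of its relative frequency, which boosts rare classes less aggressively than plain inverse frequency. The rescaling to a mean weight of 1 and the toy label counts are assumptions of this sketch, and `sqrt_inverse_frequency_weights` is a hypothetical name.

```python
import math
from collections import Counter


def sqrt_inverse_frequency_weights(labels):
    """Weight each class by 1/sqrt(relative frequency), rescaled to mean 1.

    The rescaling is an assumption of this sketch, not a documented detail.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    raw = {c: 1.0 / math.sqrt(n / total) for c, n in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {c: w / mean for c, w in raw.items()}


# Toy label distribution: 900 / 90 / 10 nodes
labels = ["noise"] * 900 + ["action_button"] * 90 + ["action_select"] * 10
weights = sqrt_inverse_frequency_weights(labels)
for name, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{name:15s} {w:.3f}")
```

In this toy example the rarest class is 90 times less frequent than the most common one, but its weight is only about 9.5 times larger (the square root of 90).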

---

## Evaluation results

Evaluated on a held-out test set (15% of pages, stratified split). Numbers are reported as mean ± std across 5 independent training runs with different random seeds.

| Metric | Mean ± std | Min | Max |
|--------|-----------|-----|-----|
| Macro F1 | 0.825 ± 0.026 | 0.797 | 0.865 |
| Weighted F1 | 0.917 ± 0.032 | 0.882 | 0.965 |
| Action F1 (5 classes) | 0.895 ± 0.036 | 0.818 | 0.917 |

**Per-class F1, mean ± std across 5 seeds:**

| Class | Mean F1 | Std | Test support (best seed) |
|-------|---------|-----|--------------------------|
| `action_input` | 0.686 | 0.104 | 25 |
| `action_select` | 0.768 | 0.086 | 8 |
| `action_button` | 0.909 | 0.071 | 1,577 |
| `action_link_internal` | 0.996 | 0.004 | 3,119 |
| `action_link_external` | 0.996 | 0.003 | 327 |
| `structure_navigation` | 0.884 | 0.062 | 52 |
| `structure_region` | 0.770 | 0.140 | 52 |
| `structure_dismissible` | 0.363 | 0.073 | 158 |
| `structure_card` | 0.625 | 0.199 | 1,045 |
| `structure_list_item` | 0.974 | 0.015 | 3,885 |
| `content_heading` | 0.986 | 0.007 | 525 |
| `content_text` | 0.736 | 0.067 | 322 |
| `content_media` | 0.915 | 0.035 | 1,319 |
| `noise` | 0.938 | 0.022 | 18,345 |

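
The relation between the per-class table and the aggregate metrics can be checked with a few lines. Macro F1 is the unweighted mean over the 14 classes, so averaging the per-class means reproduces the reported macro F1. The support-weighted figure below is only indicative, because it combines 5-seed mean F1 values with the support of a single (best) seed and therefore need not match the reported weighted F1.

```python
# Per-class mean F1 and test support (best seed), copied from the table above.
per_class = {
    "action_input": (0.686, 25),
    "action_select": (0.768, 8),
    "action_button": (0.909, 1577),
    "action_link_internal": (0.996, 3119),
    "action_link_external": (0.996, 327),
    "structure_navigation": (0.884, 52),
    "structure_region": (0.770, 52),
    "structure_dismissible": (0.363, 158),
    "structure_card": (0.625, 1045),
    "structure_list_item": (0.974, 3885),
    "content_heading": (0.986, 525),
    "content_text": (0.736, 322),
    "content_media": (0.915, 1319),
    "noise": (0.938, 18345),
}

# Macro F1: every class counts equally, regardless of how many nodes it has
f1_values = [f1 for f1, _ in per_class.values()]
macro_f1 = sum(f1_values) / len(f1_values)

# Support-weighted F1: large classes such as noise dominate the average
total_support = sum(n for _, n in per_class.values())
weighted_f1 = sum(f1 * n for f1, n in per_class.values()) / total_support

print(f"macro F1 = {macro_f1:.3f}")  # macro F1 = 0.825
print(f"support-weighted F1 = {weighted_f1:.3f}")
```

Because macro F1 weights `action_select` (8 test nodes) the same as `noise` (18,345 test nodes), it is the stricter of the two summaries for rare classes.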

---

## Limitations

- **Low-support classes.** `action_input` (n=25) and `action_select` (n=8) have very small test sets, so F1 estimates for these classes have high variance and should not be over-interpreted.
- **`structure_dismissible` is hard.** Cookie banners and modal overlays vary enormously across sites. The mean F1 of 0.363 reflects genuine label ambiguity, not a model bug.
- **Heuristic labels.** Training labels come from deterministic rules, not human annotation. Near-boundary elements (e.g. a decorative `<button>` vs. a functional one) may be mislabeled.
- **No price class.** Numerical price strings are classified as `noise`. This is a known gap.
- **Static DOM only.** The model operates on a single DOM snapshot. Dynamically loaded content, shadow DOM, and canvas elements are not modeled.
- **Dataset size and diversity.** ~135 pages, English and French only. Sites in other languages or with highly unusual layouts are out-of-distribution.

---

## Bias and ethical considerations

- The model encodes statistical regularities of how web developers structure pages in the training data. Sites that deviate from common patterns (niche CMS, custom frameworks) may see lower accuracy.
- The `noise` class is a catch-all for elements that don't fit other categories. Misclassified functional elements (e.g. a decorative-looking but important button) will be silently dropped in `action_only=True` mode. Always set a confidence threshold and review low-confidence predictions.
- The model should not be used as the sole decision-maker for automated actions on behalf of users without oversight.
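
One way to act on the advice about thresholds and review is to split predictions into an auto-accepted set and a review queue. The sketch below is an illustration only: `triage` and both threshold values are hypothetical and untuned, and it assumes the predictions were obtained with `action_only=False` so that `noise` predictions are still present.

```python
def triage(predictions, accept_at=0.9, noise_review_below=0.7):
    """Split predictions into auto-accepted actions and items needing review.

    Thresholds are illustrative assumptions, not tuned values.
    """
    accepted, review = [], []
    for p in predictions:
        is_action = p["class"].startswith("action_")
        if is_action and p["confidence"] >= accept_at:
            accepted.append(p)
        elif is_action or (p["class"] == "noise" and p["confidence"] < noise_review_below):
            # Uncertain actions, plus shaky "noise" calls that may hide
            # a functional element
            review.append(p)
    return accepted, review


# Hand-written stand-in for the output of classify_page
sample = [
    {"class": "action_button", "confidence": 0.95, "selector": "button#pay"},
    {"class": "action_input", "confidence": 0.61, "selector": "input#email"},
    {"class": "noise", "confidence": 0.55, "selector": "div.cta"},
    {"class": "noise", "confidence": 0.99, "selector": "div.spacer"},
    {"class": "content_heading", "confidence": 0.50, "selector": "h2"},
]
accepted, review = triage(sample)
print([p["selector"] for p in accepted])
print([p["selector"] for p in review])
```

Here only `button#pay` is accepted automatically, while `input#email` and the low-confidence `div.cta` go to review.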

---

## License

Apache 2.0; see [LICENSE](https://github.com/lucymakeit/dom-node-classifier/blob/main/LICENSE).

## Citation

If you use this model in your work, a link back to this repository is appreciated.

## Contact

Lucy Paureau · [lmi.rest](https://lmi.rest) · lucy.paureau@gmail.com