	---
	license: apache-2.0
	language:
	- en
	- fr
	tags:
	- graph-neural-network
	- dom
	- html
	- node-classification
	- web
	- browser-automation
	library_name: custom
	pipeline_tag: token-classification
	---

	# dom-node-classifier

	## Model description

	`dom-node-classifier` is a GATv2 (Graph Attention Network v2) that classifies every node of an HTML DOM into one of 14 semantic classes. It is designed to serve as a perception layer for browser agents and web annotation pipelines.

	The model takes a structured DOM representation (nodes with features + a tree edge index) and outputs a class label and confidence score per node. It does not process raw HTML or screenshots — the DOM must be pre-extracted into the JSON format described below.

	Architecture: GATv2 with 3 message-passing layers, 4 attention heads, hidden dimension 128, and a learned input projection that mixes heterogeneous node features before graph propagation.

	Why GATv2 over GATv1? In the original GAT, attention is static: the ranking of neighbors by attention score is the same for every query node. GATv2 (Brody, Alon & Yahav, 2022) applies the attention vector after the non-linearity instead of before it, which makes the attention weights dynamic, that is, dependent on the query node. This matters for DOM nodes, whose relevance depends heavily on context.
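
For reference, the two scoring functions can be written side by side, following the notation of Brody, Alon & Yahav: $h_i$ is the query node's feature vector, $h_j$ a neighbor's, $W$ a learned weight matrix, $a$ a learned attention vector, and $\Vert$ denotes concatenation. This is the generic formulation from the paper, not a description of this repository's exact implementation.

```latex
\begin{aligned}
\text{GAT:}\quad   & e(h_i, h_j) = \mathrm{LeakyReLU}\!\left(a^{\top}\,[\,W h_i \,\Vert\, W h_j\,]\right) \\
\text{GATv2:}\quad & e(h_i, h_j) = a^{\top}\,\mathrm{LeakyReLU}\!\left(W\,[\,h_i \,\Vert\, h_j\,]\right) \\
\text{both:}\quad  & \alpha_{ij} = \operatorname{softmax}_j\!\left(e(h_i, h_j)\right)
\end{aligned}
```

In the GAT form, $a^{\top}$ and $W$ are applied back to back and collapse into a single linear map, so the neighbor that maximizes the score does not depend on $h_i$. Moving $a^{\top}$ outside the LeakyReLU removes that collapse.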

	---

	## Intended uses

	- Browser agent perception: replacing raw HTML with a typed, confidence-ranked element list to reduce LLM context usage.
	- DOM annotation: automatically labeling nodes in a page corpus for downstream ML tasks.
	- Web research: studying element-type distributions across sites, languages, and page categories.

	## Out-of-scope uses

	- Accessibility compliance: the model classifies semantic roles as observed in the wild, not as defined by WCAG or ARIA specifications. Do not use it for accessibility audits.
	- Production-critical UX automation without human oversight: F1 on the weakest classes (particularly `action_input`, `action_select`, and `structure_dismissible`) is too low for fully unattended operation.
	- Adversarial robustness: the model was not trained against adversarially obfuscated DOM structures.

	---

	## How to use

	```python
	from model.inference import DOMClassifier
	from pathlib import Path
	import json

	# Load the Hugging Face weights (model.safetensors and config.json must be in the same directory)
	clf = DOMClassifier.from_checkpoint("checkpoints_final/model.safetensors")
	# Or from a local .pt checkpoint: DOMClassifier.from_checkpoint("checkpoints_final/best.pt")

	raw_page = json.loads(Path("examples/sample_page.json").read_text())
	predictions = clf.classify_page(raw_page, action_only=False, min_confidence=0.5)

	for p in predictions:
	    print(f"[{p['class']:25s}] {p['confidence']:.2f} {p['selector']}")
	```
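
As a rough illustration of the browser-agent use case, the sketch below turns a prediction list into a short, confidence-ranked listing of actionable elements. It assumes only the three keys used in the snippet above (`class`, `confidence`, `selector`). The sample predictions and the helper `to_compact_listing` are hypothetical and not part of the repository.

```python
from typing import Dict, List

# Hypothetical output of clf.classify_page(...); the values are invented for illustration.
predictions: List[Dict] = [
    {"class": "noise", "confidence": 0.97, "selector": "div.wrapper"},
    {"class": "action_button", "confidence": 0.91, "selector": "button#submit"},
    {"class": "action_link_internal", "confidence": 0.99, "selector": "a.nav-home"},
    {"class": "action_input", "confidence": 0.58, "selector": "input[name=q]"},
]


def to_compact_listing(preds: List[Dict], max_items: int = 50) -> List[str]:
    """Keep action_* predictions, most confident first, one short line per element."""
    actions = [p for p in preds if p["class"].startswith("action_")]
    actions.sort(key=lambda p: p["confidence"], reverse=True)
    return [
        f"{p['class']} {p['confidence']:.2f} {p['selector']}"
        for p in actions[:max_items]
    ]


listing = to_compact_listing(predictions)
print("\n".join(listing))
```

Capping the list with `max_items` is one simple way to bound how much context the listing consumes.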

	### Input format

	`raw_page` is a dict with the following top-level keys:

	| Key | Type | Description |
	|-----|------|-------------|
	| `url` | string | Page URL (used for link feature computation) |
	| `viewport` | dict `{width, height}` | Viewport dimensions in pixels |
	| `nodes` | list of node dicts | One entry per DOM node |
	| `edges` | list of `[src_idx, dst_idx]` pairs | Parent→child edges using node list indices |
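
The page `url` matters because link features are relative to it. The sketch below shows how such features could be derived with the standard library. It is an illustration only: the function `link_features` and its output keys are hypothetical, it does not reproduce the model's actual link encoding, and it omits the same-domain check, which would need a public-suffix list.

```python
from urllib.parse import urljoin, urlparse


def link_features(page_url: str, href: str) -> dict:
    """Illustrative link features for one href, resolved against the page URL."""
    page = urlparse(page_url)
    raw = urlparse(href)
    resolved = urlparse(urljoin(page_url, href))

    is_mailto = raw.scheme == "mailto"
    is_fragment = href.startswith("#")
    is_absolute = raw.scheme in ("http", "https")
    segments = [seg for seg in resolved.path.split("/") if seg]

    return {
        "is_absolute": is_absolute,
        "is_relative": not (is_absolute or is_mailto or is_fragment),
        "is_fragment": is_fragment,
        "is_mailto": is_mailto,
        # A mailto link has no host, so it never counts as same-host.
        "same_host": (not is_mailto) and resolved.netloc == page.netloc,
        "path_depth": 0 if is_mailto else len(segments),
    }


relative = link_features("https://example.com/docs/", "guide/intro")
external = link_features("https://example.com/docs/", "https://other.org/a")
mail = link_features("https://example.com/docs/", "mailto:someone@example.com")
print(relative)
```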

	Each node dict:

	| Key | Required | Type | Description |
	|-----|----------|------|-------------|
	| `id` | yes | string | Unique node identifier |
	| `tag` | yes | string | HTML tag name (e.g. `"button"`, `"div"`) |
	| `text` | no | string | Visible text content (truncated to 200 chars) |
	| `selector` | no | string | CSS selector (returned in predictions, not used as feature) |
	| `classes` | no | list[str] | CSS class tokens |
	| `attrs` | no | dict | HTML attributes (`href`, `id`, `type`, `role`, …) |
	| `css` | no | dict | Computed CSS (`display`, `position`, `visibility`, `opacity`, `cursor`, `font_size`, `font_weight`, `z_index`) |
	| `bbox` | no | dict `{x, y, width, height}` | Bounding box in pixels |
	| `depth` | no | int | DOM depth from root |
	| `n_children` | no | int | Number of direct children |
	| `is_visible` | no | bool | Whether the node is visible |
	| `in_viewport` | no | bool | Whether the node is in the initial viewport |
	| `has_listeners_heuristic` | no | bool | Whether the node likely has JS event listeners |

	Missing optional fields default to zero or empty values.

	A complete example is in [`examples/sample_page.json`](examples/sample_page.json).
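
For orientation, here is a minimal hand-built `raw_page` together with a small structural check. The field names come from the tables above. The helper `validate_raw_page` is hypothetical and not part of the repository, and treating all four top-level keys as mandatory is an assumption.

```python
def validate_raw_page(raw_page: dict) -> list:
    """Return a list of problems found; an empty list means the structure is consistent."""
    problems = []
    for key in ("url", "viewport", "nodes", "edges"):
        if key not in raw_page:
            problems.append(f"missing top-level key: {key}")

    nodes = raw_page.get("nodes", [])
    for i, node in enumerate(nodes):
        for key in ("id", "tag"):  # the two required node fields
            if key not in node:
                problems.append(f"node {i}: missing required field {key}")

    ids = [node["id"] for node in nodes if "id" in node]
    if len(ids) != len(set(ids)):
        problems.append("node ids are not unique")

    # Edges refer to positions in the node list, so both ends must be valid indices.
    for src, dst in raw_page.get("edges", []):
        if not (0 <= src < len(nodes) and 0 <= dst < len(nodes)):
            problems.append(f"edge [{src}, {dst}] points outside the node list")
    return problems


raw_page = {
    "url": "https://example.com/",
    "viewport": {"width": 1280, "height": 720},
    "nodes": [
        {"id": "n0", "tag": "body", "depth": 0, "n_children": 1},
        {
            "id": "n1",
            "tag": "button",
            "text": "Sign in",
            "depth": 1,
            "attrs": {"type": "submit"},
            "bbox": {"x": 24, "y": 16, "width": 96, "height": 32},
        },
    ],
    "edges": [[0, 1]],  # body -> button, by node list index
}

print(validate_raw_page(raw_page))
```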

	---

	## Training data

	The model was trained on a curated set of ~135 diverse web pages spanning e-commerce, SaaS, documentation, news, government, and forms, in English and French. Labels were generated by a deterministic heuristic pipeline based on HTML semantics, ARIA roles, CSS properties, and link structure — not by human annotators.

	The training dataset is not publicly distributed.

	---

	## Training procedure

	Hardware: NVIDIA L40S (48 GB VRAM)

	Hyperparameters:

	| Parameter | Value |
	|-----------|-------|
	| Epochs | 80 (early stopping, patience=15) |
	| Batch size | 8 pages |
	| Optimizer | AdamW |
	| Learning rate | 1e-3 |
	| LR schedule | Cosine annealing |
	| Weight decay | 1e-4 |
	| Dropout | 0.3 |
	| Hidden dim | 128 |
	| Attention heads | 4 |
	| GATv2 layers | 3 |
	| Class weighting | sqrt-inverse frequency |
	| Edge augmentation | Reverse edges + sibling edges |
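
The last two rows can be made concrete with a small sketch. Both functions are illustrative assumptions rather than the training code: the weights are rescaled to average 1, and sibling edges connect only adjacent siblings, in both directions. The actual pipeline may normalize differently or connect all sibling pairs.

```python
import math
from collections import Counter


def sqrt_inverse_frequency_weights(labels):
    """Weight each class by 1 / sqrt(count), rescaled so the weights average to 1."""
    counts = Counter(labels)
    raw = {cls: 1.0 / math.sqrt(n) for cls, n in counts.items()}
    scale = len(raw) / sum(raw.values())
    return {cls: w * scale for cls, w in raw.items()}


def augment_edges(parent_child_edges):
    """Add child->parent edges and edges between adjacent siblings."""
    edges = set()
    children = {}
    for parent, child in parent_child_edges:
        edges.add((parent, child))
        edges.add((child, parent))  # reverse edge
        children.setdefault(parent, []).append(child)
    for siblings in children.values():
        for left, right in zip(siblings, siblings[1:]):
            edges.add((left, right))  # sibling edges, both directions
            edges.add((right, left))
    return sorted(edges)


# Toy label distribution: 100 noise nodes, 25 buttons, 4 selects.
labels = ["noise"] * 100 + ["action_button"] * 25 + ["action_select"] * 4
weights = sqrt_inverse_frequency_weights(labels)
print(weights)

# A parent (index 0) with three children (1, 2, 3).
augmented = augment_edges([(0, 1), (0, 2), (0, 3)])
print(augmented)
```

With the square root, a class that is 25 times rarer gets 5 times the weight rather than 25 times, which tempers the correction for very rare classes.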

	Feature vector (618 dims/node):

	| Feature block | Dims | Notes |
	|---------------|------|-------|
	| Tag one-hot | 51 | 50 tags + OOV bucket |
	| Class hash | 128 | Hashing trick over CSS class tokens (Tailwind-robust) |
	| Attribute presence | 17 | id, href, role, aria-*, type, placeholder, … |
	| Computed CSS | 28 | display (11) + position (5) + 6 numeric CSS values |
	| Bounding box | 5 | x, y, w, h, area (normalized by viewport) |
	| Topology | 5 | depth, n_children, is_visible, in_viewport, has_listeners |
	| Link semantics | 9 | absolute/relative/fragment/mailto, same-host/domain, path depth |
	| Text embedding | 384 | MiniLM-L6-v2 sentence embedding (frozen) |
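
The hashing trick in the "Class hash" row maps an open-ended vocabulary of CSS class tokens into a fixed number of buckets, so no class dictionary has to be maintained. A minimal sketch follows. The hash function (MD5), the multi-hot encoding, and the name `hash_class_tokens` are assumptions made for illustration. Only the bucket count of 128 comes from the table.

```python
import hashlib

NUM_BUCKETS = 128  # size of the "Class hash" block in the table


def hash_class_tokens(class_tokens, num_buckets=NUM_BUCKETS):
    """Multi-hot vector: each CSS class token switches on the bucket its hash maps to."""
    vec = [0.0] * num_buckets
    for token in class_tokens:
        # hashlib is used instead of the built-in hash() so that the mapping
        # is stable across Python processes.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "little") % num_buckets
        vec[bucket] = 1.0
    return vec


vec = hash_class_tokens(["flex", "px-4", "hover:bg-blue-600"])
print(len(vec), sum(vec))
```

Distinct tokens can collide in the same bucket. That is the usual trade-off of feature hashing: a fixed-size input in exchange for some loss of information.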

	Validation criterion: best checkpoint selected by macro-F1 on the validation split.

	Data split: 70 / 15 / 15 train/val/test, stratified by page.
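
One reading of this split is that it happens at the page level, so that all nodes of a page land in the same partition. The sketch below follows that reading. The function `split_pages`, the seed, and the page identifiers are hypothetical, and it does not attempt any balancing by page category or language.

```python
import random


def split_pages(page_ids, seed=0, train_frac=0.70, val_frac=0.15):
    """Shuffle page ids once, then cut into train / val / test partitions."""
    pages = sorted(page_ids)  # sort first so the result depends only on the seed
    random.Random(seed).shuffle(pages)
    n_train = int(len(pages) * train_frac)
    n_val = int(len(pages) * val_frac)
    train = pages[:n_train]
    val = pages[n_train:n_train + n_val]
    test = pages[n_train + n_val:]  # whatever remains
    return train, val, test


page_ids = [f"page_{i:03d}" for i in range(135)]
train, val, test = split_pages(page_ids)
print(len(train), len(val), len(test))
```

Splitting by page rather than by node avoids leaking near-identical sibling nodes from the same page into both training and evaluation data.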

	---

	## Evaluation results

	Evaluated on a held-out test set (15% of pages, stratified split). Numbers are reported as mean ± std across 5 independent training runs with different random seeds.

	| Metric | Mean ± std | Min | Max |
	|--------|-----------|-----|-----|
	| Macro F1 | 0.825 ± 0.026 | 0.797 | 0.865 |
	| Weighted F1 | 0.917 ± 0.032 | 0.882 | 0.965 |
	| Action F1 (5 classes) | 0.895 ± 0.036 | 0.818 | 0.917 |

	Per-class F1, mean ± std across 5 seeds:

	| Class | Mean F1 | Std | Test support (best seed) |
	|-------|---------|-----|--------------------------|
	| `action_input` | 0.686 | 0.104 | 25 |
	| `action_select` | 0.768 | 0.086 | 8 |
	| `action_button` | 0.909 | 0.071 | 1 577 |
	| `action_link_internal` | 0.996 | 0.004 | 3 119 |
	| `action_link_external` | 0.996 | 0.003 | 327 |
	| `structure_navigation` | 0.884 | 0.062 | 52 |
	| `structure_region` | 0.770 | 0.140 | 52 |
	| `structure_dismissible` | 0.363 | 0.073 | 158 |
	| `structure_card` | 0.625 | 0.199 | 1 045 |
	| `structure_list_item` | 0.974 | 0.015 | 3 885 |
	| `content_heading` | 0.986 | 0.007 | 525 |
	| `content_text` | 0.736 | 0.067 | 322 |
	| `content_media` | 0.915 | 0.035 | 1 319 |
	| `noise` | 0.938 | 0.022 | 18 345 |
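
As a quick consistency check, macro F1 is the unweighted mean of the per-class scores, so averaging the 14 mean F1 values in the table should give the reported macro F1. The snippet below does that arithmetic on the table values and also lists the classes that fall under an arbitrary, illustrative cut-off of 0.75.

```python
# Mean F1 per class, copied from the table above.
per_class_mean_f1 = {
    "action_input": 0.686,
    "action_select": 0.768,
    "action_button": 0.909,
    "action_link_internal": 0.996,
    "action_link_external": 0.996,
    "structure_navigation": 0.884,
    "structure_region": 0.770,
    "structure_dismissible": 0.363,
    "structure_card": 0.625,
    "structure_list_item": 0.974,
    "content_heading": 0.986,
    "content_text": 0.736,
    "content_media": 0.915,
    "noise": 0.938,
}

# Macro F1: every class counts equally, regardless of its support.
macro_f1 = sum(per_class_mean_f1.values()) / len(per_class_mean_f1)
print(f"macro F1 = {macro_f1:.3f}")  # macro F1 = 0.825

# Classes below an illustrative threshold, weakest first.
weak = sorted(
    (cls for cls, f1 in per_class_mean_f1.items() if f1 < 0.75),
    key=per_class_mean_f1.get,
)
print(weak)
```

Because `noise` dominates the support, weighted F1 sits well above macro F1. The macro figure is the one that reflects the weaker classes.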

	---

	## Limitations

	- Low-support classes. `action_input` (n=25) and `action_select` (n=8) have very small test sets — F1 estimates for these classes have high variance and should not be over-interpreted.
	- `structure_dismissible` is hard. Cookie banners and modal overlays vary enormously across sites. Mean F1 of 0.363 reflects genuine label ambiguity, not a model bug.
	- Heuristic labels. Training labels come from deterministic rules, not human annotation. Near-boundary elements (e.g. a decorative `<button>` vs. a functional one) may be mislabeled.
	- No price class. Numerical price strings are classified as `noise`. This is a known gap.
	- Static DOM only. The model operates on a single DOM snapshot. Dynamically loaded content, shadow DOM, and canvas elements are not modeled.
	- Dataset size and diversity. ~135 pages, English and French only. Sites in other languages or with highly unusual layouts are out-of-distribution.

	---

	## Bias and ethical considerations

	- The model encodes statistical regularities of how web developers structure pages in the training data. Sites that deviate from common patterns (niche CMS, custom frameworks) may see lower accuracy.
	- The `noise` class is a catch-all for elements that don't fit other categories. A functional element misclassified as `noise` (e.g. a decorative-looking but important button) is silently dropped in `action_only=True` mode. Always set a confidence threshold and review low-confidence predictions.
	- The model should not be used as the sole decision-maker for automated actions on behalf of users without oversight.
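
One way to put the threshold-and-review advice into practice is to sort predictions into three buckets before acting on them. The sketch below is a suggestion only: the helper `triage` and both thresholds are hypothetical and should be tuned on your own pages.

```python
def triage(predictions, accept_at=0.9, review_at=0.5):
    """Split predictions into auto-accepted, needs-human-review, and dropped."""
    accepted, review, dropped = [], [], []
    for p in predictions:
        if p["confidence"] >= accept_at:
            accepted.append(p)
        elif p["confidence"] >= review_at:
            review.append(p)
        else:
            dropped.append(p)
    return accepted, review, dropped


# Invented sample predictions using the keys returned by classify_page.
sample = [
    {"class": "action_button", "confidence": 0.96, "selector": "button.buy"},
    {"class": "structure_dismissible", "confidence": 0.62, "selector": "div#cookie-banner"},
    {"class": "action_input", "confidence": 0.41, "selector": "input.search"},
]

accepted, review, dropped = triage(sample)
print(len(accepted), len(review), len(dropped))
```

Keeping the dropped bucket, rather than discarding it, makes it possible to audit later what the agent never saw.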

	---

	## License

	Apache 2.0 — see [LICENSE](https://github.com/lucydjo/dom-node-classifier/blob/main/LICENSE).

	## Citation

	If you use this model in your work, a link back to this repository is appreciated.

	## Contact

	Lucy Paureau · [lmi.rest](https://lmi.rest) · lucy.paureau@gmail.com