---
license: apache-2.0
language:
- en
- fr
tags:
- graph-neural-network
- dom
- html
- node-classification
- web
- browser-automation
library_name: custom
pipeline_tag: token-classification
---

# dom-node-classifier

## Model description

`dom-node-classifier` is a GATv2 (Graph Attention Network v2) that classifies every node of an HTML DOM into one of 14 semantic classes. It is designed to serve as a perception layer for browser agents and web annotation pipelines.

The model takes a structured DOM representation (nodes with features + a tree edge index) and outputs a class label and confidence score per node. It does not process raw HTML or screenshots; the DOM must be pre-extracted into the JSON format described below.

**Architecture:** GATv2 with 3 message-passing layers, 4 attention heads, hidden dimension 128, and a learned input projection that mixes heterogeneous node features before graph propagation.

**Why GATv2 over GATv1?** GATv1 computes static attention: the ranking of attended neighbors is the same for every query node. GATv2 (Brody, Alon & Yahav, 2022) reorders the scoring function so that the attention vector is applied after the non-linearity rather than before it, which makes the attention weights dynamic and query-dependent. This matters for DOM nodes whose relevance depends heavily on context.

---

## Intended uses

- **Browser agent perception:** replacing raw HTML with a typed, confidence-ranked element list to reduce LLM context usage.
- **DOM annotation:** automatically labeling nodes in a page corpus for downstream ML tasks.
- **Web research:** studying element-type distributions across sites, languages, and page categories.

## Out-of-scope uses

- **Accessibility compliance:** the model classifies semantic roles as observed in the wild, not as defined by WCAG or ARIA specifications. Do not use it for accessibility audits.
- **Production-critical UX automation** without human oversight: F1 on thin classes (particularly `action_input`, `action_select`, `structure_dismissible`) is insufficient for fully unattended operation.
- **Adversarial robustness:** the model was not trained against adversarially obfuscated DOM structures.

---

## How to use

```python
from model.inference import DOMClassifier
from pathlib import Path
import json

# Load from HuggingFace weights (model.safetensors + config.json must be in the same directory)
clf = DOMClassifier.from_checkpoint("checkpoints_final/model.safetensors")

# Or from a local .pt checkpoint: DOMClassifier.from_checkpoint("checkpoints_final/best.pt")
raw_page = json.loads(Path("examples/sample_page.json").read_text())
predictions = clf.classify_page(raw_page, action_only=False, min_confidence=0.5)

for p in predictions:
    print(f"[{p['class']:25s}] {p['confidence']:.2f} {p['selector']}")
```

### Input format

`raw_page` is a dict with the following top-level keys:

| Key | Type | Description |
|-----|------|-------------|
| `url` | string | Page URL (used for link feature computation) |
| `viewport` | dict `{width, height}` | Viewport dimensions in pixels |
| `nodes` | list of node dicts | One entry per DOM node |
| `edges` | list of `[src_idx, dst_idx]` pairs | Parent→child edges using node list indices |

Each node dict:

| Key | Required | Type | Description |
|-----|----------|------|-------------|
| `id` | yes | string | Unique node identifier |
| `tag` | yes | string | HTML tag name (e.g. `"button"`, `"div"`) |
| `text` | no | string | Visible text content (truncated to 200 chars) |
| `selector` | no | string | CSS selector (returned in predictions, not used as feature) |
| `classes` | no | list[str] | CSS class tokens |
| `attrs` | no | dict | HTML attributes (`href`, `id`, `type`, `role`, …) |
| `css` | no | dict | Computed CSS (`display`, `position`, `visibility`, `opacity`, `cursor`, `font_size`, `font_weight`, `z_index`) |
| `bbox` | no | dict `{x, y, width, height}` | Bounding box in pixels |
| `depth` | no | int | DOM depth from root |
| `n_children` | no | int | Number of direct children |
| `is_visible` | no | bool | Whether the node is visible |
| `in_viewport` | no | bool | Whether the node is in the initial viewport |
| `has_listeners_heuristic` | no | bool | Whether the node likely has JS event listeners |

Missing optional fields default to sensible zeros/empty values. A complete example is in [`examples/sample_page.json`](examples/sample_page.json).

---

## Training data

The model was trained on a curated set of ~135 diverse web pages spanning e-commerce, SaaS, documentation, news, government, and forms, in English and French. Labels were generated by a deterministic heuristic pipeline based on HTML semantics, ARIA roles, CSS properties, and link structure, not by human annotators.

The training dataset is not publicly distributed.

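
To make the input format described under "Input format" concrete, here is a minimal sketch that builds a three-node `raw_page` by hand and checks its structure. The helpers `make_node` and `validate_page` are hypothetical and not part of this model's package. The check also assumes that all four top-level keys are required, which the tables above do not state explicitly.

```python
from typing import Any, List

# Per the node table, only `id` and `tag` are required on each node.
REQUIRED_NODE_KEYS = ("id", "tag")


def make_node(node_id: str, tag: str, **optional: Any) -> dict:
    """Build a node dict (hypothetical helper); optional fields pass through."""
    return {"id": node_id, "tag": tag, **optional}


def validate_page(page: dict) -> List[str]:
    """Return a list of structural problems in a raw_page dict (empty if none)."""
    problems = []
    # Assumption: all four top-level keys are treated as required here.
    for key in ("url", "viewport", "nodes", "edges"):
        if key not in page:
            problems.append(f"missing top-level key: {key}")
    nodes = page.get("nodes", [])
    for i, node in enumerate(nodes):
        for key in REQUIRED_NODE_KEYS:
            if key not in node:
                problems.append(f"node {i} missing required key: {key}")
    # Edges are [src_idx, dst_idx] pairs that index into the node list.
    for edge in page.get("edges", []):
        valid = len(edge) == 2 and all(
            isinstance(idx, int) and 0 <= idx < len(nodes) for idx in edge
        )
        if not valid:
            problems.append(f"edge {edge} does not reference valid node indices")
    return problems


raw_page = {
    "url": "https://example.com/",
    "viewport": {"width": 1280, "height": 720},
    "nodes": [
        make_node("n0", "body", depth=0, n_children=2),
        make_node("n1", "h1", text="Welcome", depth=1),
        make_node(
            "n2", "a", text="Sign in", attrs={"href": "/login"}, depth=1,
            bbox={"x": 20, "y": 80, "width": 64, "height": 24},
        ),
    ],
    # Parent -> child edges: body -> h1, body -> a
    "edges": [[0, 1], [0, 2]],
}

problems = validate_page(raw_page)
print(problems or "raw_page looks structurally valid")
```

A dict that passes this check has the shape expected by `clf.classify_page(raw_page, ...)` in the usage snippet, although the classifier itself may apply stricter or looser validation.
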

---

## Training procedure

**Hardware:** NVIDIA L40S (48 GB VRAM)

**Hyperparameters:**

| Parameter | Value |
|-----------|-------|
| Epochs | 80 (early stopping, patience=15) |
| Batch size | 8 pages |
| Optimizer | AdamW |
| Learning rate | 1e-3 |
| LR schedule | Cosine annealing |
| Weight decay | 1e-4 |
| Dropout | 0.3 |
| Hidden dim | 128 |
| Attention heads | 4 |
| GATv2 layers | 3 |
| Class weighting | sqrt-inverse frequency |
| Edge augmentation | Reverse edges + sibling edges |

**Feature vector (618 dims/node):**

| Feature block | Dims | Notes |
|---------------|------|-------|
| Tag one-hot | 51 | 50 tags + OOV bucket |
| Class hash | 128 | Hashing trick over CSS class tokens (Tailwind-robust) |
| Attribute presence | 17 | id, href, role, aria-*, type, placeholder, … |
| Computed CSS | 28 | display (11) + position (5) + 6 numeric CSS values |
| Bounding box | 5 | x, y, w, h, area (normalized by viewport) |
| Topology | 5 | depth, n_children, is_visible, in_viewport, has_listeners |
| Link semantics | 9 | absolute/relative/fragment/mailto, same-host/domain, path depth |
| Text embedding | 384 | MiniLM-L6-v2 sentence embedding (frozen) |

**Validation criterion:** best checkpoint selected by macro-F1 on the validation split.

**Data split:** 70 / 15 / 15 train/val/test, stratified by page.

---

## Evaluation results

Evaluated on a held-out test set (15% of pages, stratified split). Numbers reported as mean ± std across 5 independent training runs with different random seeds.


| Metric | Mean ± std | Min | Max |
|--------|-----------|-----|-----|
| Macro F1 | 0.825 ± 0.026 | 0.797 | 0.865 |
| Weighted F1 | 0.917 ± 0.032 | 0.882 | 0.965 |
| Action F1 (5 classes) | 0.895 ± 0.036 | 0.818 | 0.917 |

**Per-class F1, mean ± std across 5 seeds:**

| Class | Mean F1 | Std | Test support (best seed) |
|-------|---------|-----|--------------------------|
| `action_input` | 0.686 | 0.104 | 25 |
| `action_select` | 0.768 | 0.086 | 8 |
| `action_button` | 0.909 | 0.071 | 1 577 |
| `action_link_internal` | 0.996 | 0.004 | 3 119 |
| `action_link_external` | 0.996 | 0.003 | 327 |
| `structure_navigation` | 0.884 | 0.062 | 52 |
| `structure_region` | 0.770 | 0.140 | 52 |
| `structure_dismissible` | 0.363 | 0.073 | 158 |
| `structure_card` | 0.625 | 0.199 | 1 045 |
| `structure_list_item` | 0.974 | 0.015 | 3 885 |
| `content_heading` | 0.986 | 0.007 | 525 |
| `content_text` | 0.736 | 0.067 | 322 |
| `content_media` | 0.915 | 0.035 | 1 319 |
| `noise` | 0.938 | 0.022 | 18 345 |

---

## Limitations

- **Low-support classes.** `action_input` (n=25) and `action_select` (n=8) have very small test sets; F1 estimates for these classes have high variance and should not be over-interpreted.
- **`structure_dismissible` is hard.** Cookie banners and modal overlays vary enormously across sites. Mean F1 of 0.363 reflects genuine label ambiguity, not a model bug.
- **Heuristic labels.** Training labels come from deterministic rules, not human annotation. Near-boundary elements (e.g. a decorative `