# Lite Engine
PinchTab includes a **Lite Engine** that performs DOM capture (navigate, snapshot,
text extraction, click, and type) without requiring Chrome or Chromium. It is
powered by [Gost-DOM](https://github.com/gost-dom/browser) (v0.11.0, MIT), a headless
browser written in pure Go.
**Issue:** [#201](https://github.com/pinchtab/pinchtab/issues/201)
---
## Why a Lite Engine?
Chrome is the default execution backend for PinchTab. A real browser session handles
JavaScript rendering, bot-detection bypass, screenshots, and PDF generation. For many
workloads (static sites, wikis, news articles, APIs), none of these are needed.
| Property | Chrome | Lite |
|--------|--------|------|
| Memory per instance | ~200 MB | ~10 MB |
| Cold-start latency | 1–6 seconds | <100 ms |
| JavaScript rendering | yes | no |
| Screenshots / PDF | yes | no |
| No Chrome installation required | no | **yes** |
Lite wins at DOM-only workloads (3–4× faster navigate, 3× faster snapshot) and is the
right choice for containers, CI pipelines, and edge environments where Chrome is not
available.
---
## Architecture
### Engine Interface
All engines implement a common interface defined in `internal/engine/engine.go`:
```go
type Engine interface {
	Name() string
	Navigate(ctx context.Context, url string) (*NavigateResult, error)
	Snapshot(ctx context.Context, filter string) ([]SnapshotNode, error)
	Text(ctx context.Context) (string, error)
	Click(ctx context.Context, ref string) error
	Type(ctx context.Context, ref, text string) error
	Capabilities() []Capability
	Close() error
}
```
The Chrome engine wraps the existing CDP/chromedp pipeline. `LiteEngine` in
`internal/engine/lite.go` implements the same interface using Gost-DOM.
### Router (Strategy Pattern)
```
Request → Router → [Rule 1] → [Rule 2] → … → [Fallback Rule] → Engine
```
`Router` in `internal/engine/router.go` evaluates an ordered chain of `RouteRule`
implementations. The first rule that returns a non-`Undecided` verdict wins. Rules
are registered at startup and are hot-swappable via `AddRule()` / `RemoveRule()`.
No handler, bridge, or config change is needed when adding new routing logic: only a
`RouteRule` implementation and a single `router.AddRule(myRule)` call.
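The rule chain described above can be sketched in a few lines. This is a minimal illustration, not the code in `router.go`: the `Verdict` values, the `Request` struct, the `Decide` method, the `ruleFunc` adapter, and `newDemoRouter` are all hypothetical names chosen for the example. It also omits the locking a hot-swappable router would presumably need.

```go
package main

import "fmt"

// Verdict is a hypothetical result type for a routing rule.
type Verdict int

const (
	Undecided Verdict = iota // rule has no opinion; try the next rule
	UseLite
	UseChrome
)

// Request is a hypothetical, simplified view of an incoming operation.
type Request struct {
	Op  string // e.g. "navigate", "screenshot"
	URL string
}

// RouteRule mirrors the idea described in the text; the real interface may differ.
type RouteRule interface {
	Name() string
	Decide(req Request) Verdict
}

// Router evaluates rules in order and stops at the first decided verdict.
type Router struct {
	rules []RouteRule
}

// AddRule appends a rule to the end of the chain.
func (r *Router) AddRule(rule RouteRule) { r.rules = append(r.rules, rule) }

// RemoveRule drops every rule with the given name (in-place filter).
func (r *Router) RemoveRule(name string) {
	kept := r.rules[:0]
	for _, rule := range r.rules {
		if rule.Name() != name {
			kept = append(kept, rule)
		}
	}
	r.rules = kept
}

// Route returns the first non-Undecided verdict, or Undecided if no rule matched.
func (r *Router) Route(req Request) Verdict {
	for _, rule := range r.rules {
		if v := rule.Decide(req); v != Undecided {
			return v
		}
	}
	return Undecided
}

// ruleFunc adapts a plain function to RouteRule.
type ruleFunc struct {
	name string
	fn   func(Request) Verdict
}

func (f ruleFunc) Name() string               { return f.name }
func (f ruleFunc) Decide(req Request) Verdict { return f.fn(req) }

// newDemoRouter wires a capability-style rule ahead of a lite catch-all.
func newDemoRouter() *Router {
	r := &Router{}
	r.AddRule(ruleFunc{"capability", func(req Request) Verdict {
		if req.Op == "screenshot" || req.Op == "pdf" {
			return UseChrome
		}
		return Undecided
	}})
	r.AddRule(ruleFunc{"fallback-lite", func(Request) Verdict { return UseLite }})
	return r
}

func main() {
	r := newDemoRouter()
	for _, op := range []string{"screenshot", "navigate"} {
		fmt.Printf("%s -> chrome? %v\n", op, r.Route(Request{Op: op}) == UseChrome)
	}
}
```

Because the chain is ordered, placing the catch-all last is what lets more specific rules win.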
### Built-in Rules
| Rule | File | Behaviour |
|------|------|-----------|
| `CapabilityRule` | `rules.go` | Routes `screenshot`, `pdf`, `evaluate`, `cookies` → Chrome |
| `ContentHintRule` | `rules.go` | Routes URLs ending in `.html/.htm/.xml/.txt/.md` → Lite |
| `DefaultLiteRule` | `rules.go` | Catch-all: all remaining DOM ops → Lite (used in `lite` mode) |
| `DefaultChromeRule` | `rules.go` | Final fallback → Chrome (used in `chrome` and `auto` modes) |
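As an illustration of the content-hint idea, the check below matches the extensions from the table against the URL path. It is a sketch under assumptions: `looksStatic` and `staticExts` are hypothetical names, and ignoring the query string and letter case is a choice made for this example, not something the source confirms about `ContentHintRule`.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// staticExts lists the extensions named in the rules table.
var staticExts = map[string]bool{
	".html": true, ".htm": true, ".xml": true, ".txt": true, ".md": true,
}

// looksStatic reports whether the URL path ends in a static-content extension.
// Only u.Path is inspected, so query strings and fragments are ignored
// (an assumption of this sketch).
func looksStatic(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	ext := strings.ToLower(path.Ext(u.Path))
	return staticExts[ext]
}

func main() {
	for _, u := range []string{
		"https://example.com/docs/index.html?lang=en",
		"https://example.com/app",
	} {
		fmt.Println(u, looksStatic(u))
	}
}
```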
### Three Modes
| Mode | Behaviour |
|------|-----------|
| `chrome` | All requests go through Chrome. Backward-compatible default. |
| `lite` | DOM operations (navigate, snapshot, text, click, type) use Gost-DOM. Screenshot / PDF / evaluate / cookies fall through to Chrome (501 if Chrome is unavailable). |
| `auto` | Per-request routing via rules: capability and content-hint rules are evaluated first; unknown URLs fall back to Chrome. |
---
## Request Flow (Lite Mode)
```
POST /navigate  (server.engine=lite)
  │
  ▼
handlers/navigation.go → HandleNavigate()
  │
  ├─ useLite() == true
  │     │
  │     ▼
  │   LiteEngine.Navigate(ctx, url)
  │     ├─ HTTP GET url
  │     ├─ Strip <script> tags (x/net/html tokenizer)
  │     ├─ browser.NewWindowReader(reader)   [Gost-DOM]
  │     └─ return NavigateResult{TabID, URL, Title}
  │
  └─ w.Header().Set("X-Engine", "lite")
     JSON {"tabId": "lp-1", "url": "…", "title": "…"}
```
Snapshot then traverses the Gost-DOM document tree and maps HTML semantics to
accessibility roles (heading, link, button, textbox, …). Text walks the same tree and
collapses whitespace runs.
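The two transformations mentioned here, role mapping and whitespace collapsing, might look roughly like this. The function names and the exact tag-to-role table are assumptions for illustration; the mapping in `lite.go` is likely more complete.

```go
package main

import (
	"fmt"
	"strings"
)

// roleForTag maps an HTML tag name to an accessibility role.
// The mapping is illustrative; the roles LiteEngine actually emits may differ.
func roleForTag(tag string) string {
	switch strings.ToLower(tag) {
	case "h1", "h2", "h3", "h4", "h5", "h6":
		return "heading"
	case "a":
		return "link"
	case "button":
		return "button"
	case "input", "textarea":
		return "textbox"
	default:
		return "generic" // hypothetical fallback role
	}
}

// collapseWhitespace replaces every run of whitespace with a single space
// and trims both ends.
func collapseWhitespace(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	fmt.Println(roleForTag("H2"))
	fmt.Printf("%q\n", collapseWhitespace("  Hello \n\t world  "))
}
```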
---
## Capability Boundaries
| Operation | Lite | Chrome |
|-----------|------|--------|
| Navigate | ✅ (HTTP fetch + DOM parse) | ✅ |
| Snapshot | ✅ | ✅ |
| Text extraction | ✅ | ✅ |
| Click | ✅ (DOM event dispatch) | ✅ |
| Type | ✅ (DOM input events) | ✅ |
| Screenshot | ❌ (`501 Not Implemented`) | ✅ |
| PDF | ❌ (`501 Not Implemented`) | ✅ |
| Evaluate (JS) | ❌ (`501 Not Implemented`) | ✅ |
| Cookies | ❌ (`501 Not Implemented`) | ✅ |
| JavaScript-rendered SPAs | ❌ | ✅ |
| Bot-detection bypass | ❌ | ✅ |
`CapabilityRule` ensures screenshot/pdf/evaluate/cookies are always routed to Chrome
even in `lite` mode.
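A handler that answers `501 Not Implemented` when a Chrome-only operation arrives and no Chrome backend is available could be sketched as follows. `opHandler`, `chromeOnly`, and `statusFor` are hypothetical names for this example, and the real handlers may structure the check differently.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// chromeOnly lists the operations the capability table marks as unsupported in Lite.
var chromeOnly = map[string]bool{
	"screenshot": true, "pdf": true, "evaluate": true, "cookies": true,
}

// opHandler is a hypothetical handler factory: op names the operation and
// chromeAvailable says whether a Chrome backend can take over.
func opHandler(op string, chromeAvailable bool) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if chromeOnly[op] && !chromeAvailable {
			http.Error(w, op+" requires the Chrome engine", http.StatusNotImplemented)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

// statusFor runs the handler in-process and returns the response status code.
func statusFor(op string, chromeAvailable bool) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/"+op, nil)
	opHandler(op, chromeAvailable)(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor("screenshot", false))
	fmt.Println(statusFor("navigate", false))
}
```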
---
## Known Limitations
| Limitation | Detail |
|------------|--------|
| `<script>` tags | Gost-DOM panics on an un-initialized `ScriptHost`. Scripts are stripped before parse via `x/net/html` tokenizer. |
| `<a href>` click | Gost-DOM navigates on anchor click and may encounter scripts. `Click()` wraps execution in `defer recover()` and returns an error instead of panicking. |
| CSS `display:none` | Lite has no CSS engine so hidden elements still appear in the snapshot. |
| JavaScript-rendered content | Only the initial HTML is captured. SPAs (React, Next.js etc.) should use Chrome. |
| Sites that block HTTP bots | Stack Overflow and similar sites return 4xx/5xx to plain HTTP clients. Chrome bypasses this via a real browser session. |
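The `defer recover()` pattern mentioned for `Click()` is standard Go. A minimal sketch, with `safely` as a hypothetical helper name and a hand-written panic standing in for the library behaviour:

```go
package main

import (
	"errors"
	"fmt"
)

// safely runs fn and converts any panic into an ordinary error.
// The named return value lets the deferred function overwrite the result.
func safely(op string, fn func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("%s: recovered from panic: %v", op, r)
		}
	}()
	return fn()
}

func main() {
	err := safely("click", func() error {
		panic("script host not initialized") // stand-in for a library panic
	})
	fmt.Println(err)

	err = safely("click", func() error { return errors.New("element not found") })
	fmt.Println(err)
}
```

Ordinary errors pass through unchanged; only panics are rewritten, so callers can treat both cases as a failed operation.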
---
## Configuration
Set the engine in your config file:
```json
{
  "server": {
    "engine": "lite"
  }
}
```
The `engine` field is also forwarded to child bridge instances so every managed
instance in a multi-instance deployment uses the same mode.
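For illustration, reading and validating this field in Go could look like the sketch below. The `Config` struct covers only the fragment shown, `parseEngine` is a hypothetical helper, and treating an empty value as `chrome` is an assumption based on the modes table calling `chrome` the backward-compatible default.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Config mirrors only the fragment shown in this section; other fields are omitted.
type Config struct {
	Server struct {
		Engine string `json:"engine"`
	} `json:"server"`
}

// parseEngine reads the engine mode from raw JSON. An empty value falls back
// to "chrome" (an assumption of this sketch).
func parseEngine(raw []byte) (string, error) {
	var cfg Config
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return "", err
	}
	switch cfg.Server.Engine {
	case "":
		return "chrome", nil
	case "chrome", "lite", "auto":
		return cfg.Server.Engine, nil
	default:
		return "", fmt.Errorf("unknown engine %q", cfg.Server.Engine)
	}
}

func main() {
	mode, err := parseEngine([]byte(`{"server": {"engine": "lite"}}`))
	fmt.Println(mode, err)
}
```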
### Response Header
Responses served by the Lite engine include:
```
X-Engine: lite
```
This header is present on `navigate`, `snapshot`, and `text` responses when the lite
path was taken and is useful for observability and debugging.
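A client can use the header to confirm which path served a request. The sketch below runs against an in-process test server; `fakeNavigate` is a stand-in handler, not PinchTab's, and `checkLite` and `demoLite` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// fakeNavigate stands in for a navigate handler that took the lite path.
func fakeNavigate(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("X-Engine", "lite")
	w.Header().Set("Content-Type", "application/json")
	fmt.Fprint(w, `{"tabId":"lp-1"}`)
}

// checkLite posts to url and reports whether the response carries X-Engine: lite.
func checkLite(url string) (bool, error) {
	resp, err := http.Post(url, "application/json", nil)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return resp.Header.Get("X-Engine") == "lite", nil
}

// demoLite starts a throwaway server with the stand-in handler and checks it.
func demoLite() (bool, error) {
	srv := httptest.NewServer(http.HandlerFunc(fakeNavigate))
	defer srv.Close()
	return checkLite(srv.URL + "/navigate")
}

func main() {
	ok, err := demoLite()
	fmt.Println(ok, err)
}
```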
---
## Performance
Benchmark across 8 real-world websites, running the Navigate → Snapshot → Text
pipeline. Totals cover the 7 sites where both engines completed successfully:
| Metric | Lite | Chrome | Speedup |
|--------|-----:|-------:|--------:|
| Navigate total | 4,580 ms | 17,981 ms | **3.9×** faster |
| Snapshot total | 1,739 ms | 5,155 ms | **3.0×** faster |
| Text total | 925 ms | 500 ms | 0.5× (Chrome faster) |
| **Grand total** | **7,244 ms** | **23,636 ms** | **3.3× faster** |
Chrome is faster at text extraction because it runs Mozilla Readability.js in-browser.
Lite performs a raw DOM text walk, which is slower on very large pages (e.g. Wikipedia
CS: 687 ms vs 130 ms).
### When to use each engine
| Workload | Recommendation |
|----------|---------------|
| Static sites, wikis, news, blogs | **Lite**: 3–12× faster, no Chrome overhead |
| JavaScript-rendered SPAs | **Chrome**: Lite captures pre-JS HTML only |
| Sites that block HTTP clients | **Chrome**: real browser bypasses bot detection |
| Large-page snapshot / traversal | **Lite**: 3× faster snapshot |
| Text extraction on large articles | **Chrome**: Readability.js is more accurate |
| Screenshots, PDF, evaluate, cookies | **Chrome**: not supported in Lite |
---
## Code Layout
| File | Purpose |
|------|---------|
| `internal/engine/engine.go` | `Engine` interface, `Capability` constants, `Mode` enum, `NavigateResult` / `SnapshotNode` types |
| `internal/engine/lite.go` | `LiteEngine`: HTTP fetch, script stripping, Gost-DOM parse, role mapping |
| `internal/engine/router.go` | `Router`: ordered rule chain, `AddRule` / `RemoveRule` |
| `internal/engine/rules.go` | `CapabilityRule`, `ContentHintRule`, `DefaultLiteRule`, `DefaultChromeRule` |
| `internal/handlers/navigation.go` | `useLite()` fast path, `X-Engine` header |
| `internal/handlers/snapshot.go` | `SnapshotNode → A11yNode` conversion for lite path |
| `internal/handlers/text.go` | Lite text fast path |
| `cmd/pinchtab/cmd_bridge.go` | Router wiring from `config.Engine` at startup |
---
## Dependency
| Package | Version | License | Purpose |
|---------|---------|---------|---------|
| `github.com/gost-dom/browser` | v0.11.0 | MIT | Headless browser: HTML parsing, DOM traversal, event dispatch |
| `github.com/gost-dom/css` | v0.1.0 | MIT | CSS selector evaluation |
| `golang.org/x/net` | existing | BSD-3 | HTML tokenizer used for script stripping |