Duplicate from lodestones/tagger-experiment

	---
	license: apache-2.0

	tags:
	- image-classification
	- multi-label-classification
	- booru
	- tagger
	- danbooru
	- e621
	- dinov3
	- vit
	pipeline_tag: image-classification
	---

	# DINOv3 ViT-H/16+ Booru Tagger

	A multi-label image tagger trained on e621 and Danbooru annotations, using a
	[DINOv3 ViT-H/16+](https://huggingface.co/facebook/dinov3-vith16plus-pretrain-lvd1689m)
	backbone fine-tuned end-to-end with a single linear projection head.

	## Model Details

	| Property | Value |
	|---|---|
	| Backbone | `facebook/dinov3-vith16plus-pretrain-lvd1689m` |
	| Architecture | ViT-H/16+ · 32 layers · hidden dim 1280 · 20 heads · SwiGLU MLP · RoPE · 4 register tokens |
	| Head | `Linear((1 + 4) × 1280 → 74 625)` over the concatenated CLS + 4 register tokens |
	| Vocabulary | 74 625 tags (min frequency ≥ 50 across training set) |
	| Input resolution | Any multiple of 16 px; trained at 512 px, generalises to higher resolutions |
	| Input normalisation | ImageNet mean/std `[0.485, 0.456, 0.406]` / `[0.229, 0.224, 0.225]` |
	| Output | Raw logits; apply `sigmoid` for per-tag probabilities |
	| Parameters | ~632 M (backbone) + ~480 M (head) |
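
The head and input figures in the table can be checked with a little arithmetic. The following is a minimal sketch in plain Python, not code from the released scripts. It assumes a patch size of 16 (read off the "/16" in the model name), a bias term on the linear head, and rounding side lengths *down* to a multiple of 16; none of these are confirmed by the card. The helper names `snap_to_patch_multiple`, `token_count` and `normalise_pixel` are illustrative.

```python
PATCH = 16            # patch size implied by "ViT-H/16+" (assumption)
HIDDEN = 1280         # hidden dim from the table
NUM_REGISTERS = 4     # register tokens from the table
NUM_TAGS = 74_625     # vocabulary size from the table
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def snap_to_patch_multiple(size: int, patch: int = PATCH) -> int:
    """Round a side length down to a multiple of the patch size (at least one patch)."""
    return max(patch, (size // patch) * patch)


def token_count(height: int, width: int, patch: int = PATCH) -> int:
    """Sequence length for a snapped image: patch tokens + CLS + register tokens."""
    return (height // patch) * (width // patch) + 1 + NUM_REGISTERS


def normalise_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then apply the ImageNet mean/std."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, MEAN, STD))


head_in_features = (1 + NUM_REGISTERS) * HIDDEN
# Weights plus bias; the bias term is an assumption.
head_params = head_in_features * NUM_TAGS + NUM_TAGS

print(snap_to_patch_multiple(1000))   # 992
print(token_count(512, 512))          # 1029
print(head_in_features)               # 6400
print(f"{head_params / 1e6:.1f} M")   # 477.7 M
```

The head count of roughly 478 M agrees with the "~480 M" figure in the Parameters row.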

	## Training

	| Hyperparameter | Value |
	|---|---|
	| Training data | e621 + Danbooru (parquet) |
	| Batch size | 32 |
	| Learning rate | 1e-6 |
	| Warmup steps | 50 |
	| Loss | `BCEWithLogitsLoss` with per-tag `pos_weight = (neg/pos)^(1/T)`, cap 100 |
	| Optimiser | AdamW (β₁=0.9, β₂=0.999, wd=0.01) |
	| Precision | bfloat16 (backbone) / float32 (projection + loss) |
	| Hardware | 2× GPU, ThreadPoolExecutor + NCCL all-reduce |
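
The Loss row packs the class-imbalance weighting into one formula. Below is a sketch of it in plain Python, under several assumptions: `T` is treated as a temperature-like exponent whose value the card does not state (the `2.0` used here is purely illustrative), "cap 100" is read as clamping the weight at 100, and the weight scales only the positive term, as `pos_weight` does in PyTorch's `BCEWithLogitsLoss`. The function names are hypothetical.

```python
import math


def pos_weight(num_pos: int, num_neg: int, temperature: float, cap: float = 100.0) -> float:
    """Per-tag positive weight: (neg / pos) ** (1 / T), clamped at `cap`."""
    if num_pos <= 0:
        return cap  # assumption: a tag with no positives gets the maximum weight
    return min((num_neg / num_pos) ** (1.0 / temperature), cap)


def softplus(z: float) -> float:
    """log(1 + exp(z)) in a numerically stable form."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def weighted_bce_with_logits(logit: float, target: float, weight: float) -> float:
    """Single-element BCE on a raw logit, with the positive term scaled by `weight`."""
    # -log(sigmoid(x)) = softplus(-x) and -log(1 - sigmoid(x)) = softplus(x)
    return weight * target * softplus(-logit) + (1.0 - target) * softplus(logit)


# A tag with 1,000 positives and 999,000 negatives; T values are illustrative only.
print(round(pos_weight(1000, 999_000, temperature=2.0), 2))  # 31.61
print(pos_weight(1000, 999_000, temperature=1.0))            # 100.0 (capped)
```

With `T > 1` the exponent shrinks the raw `neg/pos` ratio, so rare tags are up-weighted less aggressively than with plain inverse-frequency weighting.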

	## Usage

	### Standalone (no `transformers` dependency)

	```python
	from inference_tagger_standalone import Tagger

	tagger = Tagger(
	    checkpoint_path="tagger_proto.safetensors",
	    vocab_path="tagger_vocab.json",
	    device="cuda",
	)

	tags = tagger.predict("photo.jpg", topk=40)
	# → [("solo", 0.98), ("anthro", 0.95), ...]

	# or threshold-based
	tags = tagger.predict("https://example.com/image.jpg", threshold=0.35)
	```
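
The two calls above differ only in how scores are selected. Here is a sketch of the post-processing that the "raw logits, apply `sigmoid`" output implies, in plain Python. `score_tags`, `select_topk` and `select_by_threshold` are hypothetical helper names, not functions from `inference_tagger_standalone.py`, and the vocabulary and logits are made up.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def score_tags(logits, idx2tag):
    """Pair each tag with its independent probability, highest first."""
    scored = [(tag, sigmoid(z)) for tag, z in zip(idx2tag, logits)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def select_topk(scored, k):
    """Keep the k highest-scoring tags."""
    return scored[:k]


def select_by_threshold(scored, threshold):
    """Keep every tag whose probability reaches the threshold."""
    return [(tag, p) for tag, p in scored if p >= threshold]


idx2tag = ["solo", "anthro", "sitting", "sketch"]   # toy vocabulary
logits = [4.0, 3.0, -0.2, -3.0]                     # made-up raw outputs

scored = score_tags(logits, idx2tag)
print([tag for tag, _ in select_topk(scored, 2)])             # ['solo', 'anthro']
print([tag for tag, _ in select_by_threshold(scored, 0.35)])  # ['solo', 'anthro', 'sitting']
```

Because this is multi-label classification, each probability is independent and the scores do not sum to 1. A threshold therefore returns a variable number of tags per image, while top-k always returns a fixed count.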

	### CLI

	```bash
	# top-30 tags, pretty output
	python inference_tagger_standalone.py \
	    --checkpoint tagger_proto.safetensors \
	    --vocab tagger_vocab.json \
	    --images photo.jpg https://example.com/image.jpg \
	    --topk 30

	# comma-separated string (pipe into diffusion trainer)
	python inference_tagger_standalone.py ... --format tags

	# JSON
	python inference_tagger_standalone.py ... --format json
	```

	### Web UI

	```bash
	pip install fastapi uvicorn jinja2 aiofiles

	python tagger_ui_server.py \
	    --checkpoint tagger_proto.safetensors \
	    --vocab tagger_vocab.json \
	    --port 7860
	# → open http://localhost:7860
	```

	## Files

	| File | Description |
	|---|---|
	| `*.safetensors` | Model weights (bfloat16) |
	| `tagger_vocab.json` | `{"idx2tag": [...]}`: 74 625 tag strings ordered by training frequency |
	| `inference_tagger_standalone.py` | Self-contained inference script (no `transformers` dep) |
	| `tagger_ui_server.py` | FastAPI + Jinja2 web UI server |
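
For anyone reading the vocabulary outside the bundled scripts, here is a small sketch of loading the `{"idx2tag": [...]}` layout and building the reverse lookup. `load_vocab` is a hypothetical helper, and the example writes a three-tag stand-in file to a temporary directory so that it runs without the real `tagger_vocab.json`.

```python
import json
import tempfile
from pathlib import Path


def load_vocab(path):
    """Return (idx2tag, tag2idx) from a {"idx2tag": [...]} JSON file."""
    with open(path, encoding="utf-8") as f:
        idx2tag = json.load(f)["idx2tag"]
    tag2idx = {tag: i for i, tag in enumerate(idx2tag)}
    return idx2tag, tag2idx


# Toy stand-in for tagger_vocab.json so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy_vocab.json"
    toy_path.write_text(json.dumps({"idx2tag": ["solo", "anthro", "1girl"]}), encoding="utf-8")
    idx2tag, tag2idx = load_vocab(toy_path)

print(idx2tag[1])        # anthro
print(tag2idx["1girl"])  # 2
```

The list index is the position of the tag's logit in the model output, so `idx2tag[i]` names the tag scored by `logits[i]`.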

	## Tag Vocabulary

	Tags are sourced from e621 and Danbooru annotations and cover:

	- Subject — species, character count, gender (`solo`, `duo`, `anthro`, `1girl`, `male`, …)
	- Body — anatomy, fur/scale/skin markings, body parts
	- Action / pose — `looking at viewer`, `sitting`, …
	- Scene — background, lighting, setting
	- Style — `digital art`, `hi res`, `sketch`, `watercolor`, …
	- Rating — explicit content tags are included; filter as needed for your use case
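
Filtering for the last point can be done on the predicted `(tag, score)` pairs after inference. A minimal sketch follows; `filter_tags` is a hypothetical helper and the blocklist entry is a placeholder, not a list of tags known to be in the vocabulary.

```python
def filter_tags(predictions, blocked):
    """Drop any (tag, score) pair whose tag is in the blocklist (case-insensitive)."""
    blocked_lower = {tag.lower() for tag in blocked}
    return [(tag, score) for tag, score in predictions if tag.lower() not in blocked_lower]


predictions = [("solo", 0.98), ("explicit", 0.91), ("sitting", 0.62)]  # made-up scores
blocklist = {"explicit"}  # placeholder; choose entries for your use case

print(filter_tags(predictions, blocklist))
# [('solo', 0.98), ('sitting', 0.62)]
```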

	Minimum tag frequency threshold: 50 occurrences across the combined dataset.
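
The frequency cut-off can be illustrated with a short sketch. This is not the author's preprocessing code: it assumes each post contributes at most one count per tag and that "ordered by training frequency" means most frequent first, with ties broken alphabetically. `build_vocab` is a hypothetical name.

```python
from collections import Counter


def build_vocab(tag_lists, min_count=50):
    """Keep tags seen in at least `min_count` posts, most frequent first."""
    # set(tags) counts a tag once per post even if it is listed twice.
    counts = Counter(tag for tags in tag_lists for tag in set(tags))
    kept = [(tag, n) for tag, n in counts.items() if n >= min_count]
    # Sort by descending count, breaking ties alphabetically for determinism.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [tag for tag, _ in kept]


# Toy corpus: "solo" appears 60 times, "anthro" 50 times, "rare_tag" 49 times.
posts = [["solo"]] * 10 + [["solo", "anthro"]] * 50 + [["rare_tag"]] * 49
print(build_vocab(posts))  # ['solo', 'anthro']
```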

	## Limitations

	- Evaluated on booru-style illustrations and furry art; performance on photographic
	images or other art styles is untested.
	- The vocabulary reflects the biases of e621 and Danbooru annotation practices.

	## License

	Apache 2.0