	---
	library_name: setfit
	license: mit
	base_model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
	tags:
	- setfit
	- onnx
	- attention-weights
	- context-compression
	- intent-classification
	- multilingual
	pipeline_tag: text-classification
	---

	# SetFit Multilingual OVR Router (ONNX with Attentions)

	This is a SetFit model exported to ONNX format and trained to classify LLM tasks into three semantic categories: Needle (Fact Retrieval), Reasoning (Logic/Analysis), and Summary (General Recap).

	The model is based on [paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) and has been modified to expose all 12 layers of raw attention weights.

	## Key Features

	- 3-Class Classification: High-precision separation of intents.
	- Multilingual: Native support for Russian, English, and 50+ other languages.
	- Attention Output: Every inference returns a full attention matrix `(batch, heads, seq_len, seq_len)` for all 12 layers.
	- Dual Precision: Both FP32 (`model.onnx`) and INT8 Quantized (`model_quantized.onnx`) versions are available.
	- Optimized for CPU: Fast ONNX inference via `onnxruntime`.
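As a rough illustration of how the attention output might be consumed, the sketch below reduces a list of 12 per-layer attention arrays, each shaped `(batch, heads, seq_len, seq_len)`, to one importance score per token. The random arrays stand in for real model outputs, the head count of 12 is assumed for illustration, and `token_importance` is a hypothetical helper rather than part of this repository or WAMP-proxy. The position and names of the attention tensors in the ONNX outputs are not specified here, so inspect `session.get_outputs()` before wiring this to the real model.

```python
import numpy as np

def token_importance(attentions):
    """Average attention received by each token across layers, heads, and queries.

    attentions: sequence of arrays shaped (batch, heads, seq_len, seq_len).
    Returns an array shaped (batch, seq_len) whose rows sum to 1.
    """
    stacked = np.stack(attentions, axis=0)      # (layers, batch, heads, seq, seq)
    per_pair = stacked.mean(axis=(0, 2))        # average layers and heads -> (batch, seq, seq)
    received = per_pair.mean(axis=1)            # average over query positions -> (batch, seq)
    return received / received.sum(axis=-1, keepdims=True)

# Synthetic stand-in for model outputs: 12 layers, batch 1, 12 heads, 6 tokens.
rng = np.random.default_rng(0)
raw = rng.random((12, 1, 12, 6, 6))
# Normalize each query row so it sums to 1, as softmax attention would.
fake_attentions = [layer / layer.sum(axis=-1, keepdims=True) for layer in raw]

importance = token_importance(fake_attentions)
print(importance.shape)
```

Averaging uniformly over layers and heads is only one possible reduction; weighting later layers more heavily or using attention rollout are alternatives.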

	## Classification Map
	- Label 0: Summary (Chatter, Recaps, TL;DR)
	- Label 1: Needle (Pinpoint facts, parameters, keys, IPs)
	- Label 2: Reasoning (Comparison, analysis, code debugging, logical chains)
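The map above can be applied to a probability vector with a small lookup. This is a minimal sketch: `ID2LABEL` and `decode` are illustrative names, the probabilities are made up, and it assumes the rows of the classifier weights are ordered by label id.

```python
import numpy as np

# Label ids as listed in the classification map.
ID2LABEL = {0: "Summary", 1: "Needle", 2: "Reasoning"}

def decode(probs):
    """Return (label, confidence) for each row of a (batch, 3) probability array."""
    probs = np.atleast_2d(probs)
    ids = probs.argmax(axis=-1)
    return [(ID2LABEL[int(i)], float(row[i])) for i, row in zip(ids, probs)]

example = np.array([[0.1, 0.7, 0.2]])  # made-up probabilities for illustration
print(decode(example))  # [('Needle', 0.7)]
```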

	## Project Origin

	This model is a core component of the [WAMP-proxy](https://github.com/naranor/wamp-proxy) project, an intelligent middleware for research into LLM context optimization.

	## Quick Inference (Python)

	```python
	import json

	import numpy as np
	import onnxruntime as ort
	from transformers import AutoTokenizer

	# 1. Load the ONNX graph, the tokenizer, and the logistic-regression head
	session = ort.InferenceSession("model.onnx")  # or "model_quantized.onnx" for INT8
	tokenizer = AutoTokenizer.from_pretrained(".")
	with open("router_weights_setfit.json", "r") as f:
	    weights = json.load(f)

	# 2. Prepare input
	text = "What is the database port?"
	inputs = tokenizer(text, return_tensors="np")
	onnx_inputs = {
	    "input_ids": inputs["input_ids"].astype(np.int64),
	    "attention_mask": inputs["attention_mask"].astype(np.int64),
	}

	# 3. Run; outputs[0] is treated as the token embeddings (batch, seq_len, hidden)
	outputs = session.run(None, onnx_inputs)
	embeddings = np.mean(outputs[0], axis=1)  # Mean pooling (single unpadded input)

	# 4. Predict probabilities (LogReg head)
	scores = np.dot(embeddings, np.array(weights["coef"]).T) + np.array(weights["intercept"])
	exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))  # numerically stable softmax
	probs = exp_scores / exp_scores.sum(axis=-1, keepdims=True)  # normalize per input row
	print(f"Probabilities: {probs}")
	```
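The snippet above averages over every token position, which suits a single unpadded input. For padded batches, padding tokens should be excluded from the mean. Separately, if the head was trained one-vs-rest (as the "OVR" in the model name suggests, though this card does not confirm it), per-class sigmoids normalized to sum to 1 may match the head more closely than a softmax. The sketch below shows both ideas on synthetic arrays; `masked_mean_pool` and `ovr_probabilities` are hypothetical helpers, not part of this repository.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Mean over real tokens only. Shapes: (batch, seq, hidden) and (batch, seq)."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # guard against all-padding rows
    return summed / counts

def ovr_probabilities(scores):
    """Per-class sigmoid, then row-wise normalization (assumes a one-vs-rest head)."""
    sig = 1.0 / (1.0 + np.exp(-scores))
    return sig / sig.sum(axis=-1, keepdims=True)

# Synthetic example: two real tokens and one padding token with a large value.
tokens = np.array([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = masked_mean_pool(tokens, mask)
print(pooled)  # the padding row is ignored, so the mean is 2.0 in each dimension

probs = ovr_probabilities(np.array([[0.0, 0.0, 0.0]]))
print(probs)  # equal scores give each class one third
```

To use these with the model, pass `outputs[0]` and `inputs["attention_mask"]` to `masked_mean_pool`, then feed the resulting scores to `ovr_probabilities` in place of the softmax.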

	## License
	MIT License.