NV-EmbedCode-7B for Hexagon NPU (QHexRT, v81)
On-device code-retrieval embedding bundle of nvidia/nv-embedcode-7b-v1 for the Qualcomm Hexagon v81 NPU (SM8850), runnable with the QHexRT C++ runtime. No Python in the hot path.
The model is a MistralBiDirectionalModel (bidirectional Mistral-7B: hidden size 4096, 32 layers, 32 query / 8 KV heads, head_dim 128, plain RoPE with theta 10000, full attention) followed by average pooling and L2 normalization, which yields a 4096-d embedding. The 7B encoder ships as 4 chained W8 context parts (llama_embed_sharded, the encoder-split engine) of 1.6 GB each, 6.4 GB in total, which fits the ~12 GB device.
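The pooling head described above (average pooling followed by L2 normalization) can be sketched on the host in a few lines of Python. This is an illustrative sketch, not the QHexRT implementation: the function names are hypothetical, and the use of an attention mask to skip padding tokens is an assumption about how the average is taken.

```python
import math

def mean_pool(token_states, attention_mask):
    """Average the hidden states of tokens whose mask entry is non-zero.

    Assumption: padding tokens are excluded from the average.
    """
    dim = len(token_states[0])
    total = [0.0] * dim
    count = 0
    for state, keep in zip(token_states, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(state):
                total[i] += value
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [value / count for value in total]

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]

def embed_from_states(token_states, attention_mask):
    """Pooling head: mean over tokens, then L2 normalization."""
    return l2_normalize(mean_pool(token_states, attention_mask))

# Toy example: 3 tokens of dimension 2 (the real model uses 4096);
# the last token is treated as padding.
states = [[1.0, 2.0], [3.0, 2.0], [100.0, 100.0]]
mask = [1, 1, 0]
embedding = embed_from_states(states, mask)
```

Because the output has unit length, the cosine between two embeddings reduces to a plain dot product.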
Device validation (v81, SM8850): device-vs-reference embedding cosine
| input | cosine |
|---|---|
| "function to compute fibonacci numbers" | 0.9992 |
| "def fib(n): ... fib(n-1)+fib(n-2)" | 0.9993 |
| "SELECT * FROM users WHERE age > 30" | 0.9993 |
| "binary search implementation in python" | 0.9993 |
The W8 7B encoder stays close to the fp32 reference (cosine ≥ 0.9992 on all four inputs). Code retrieval: cos(fib-query, fib-code) is 0.6878 on device vs 0.6872 for the reference, and 0.037 for an irrelevant SQL document, so the ranking is correct. Latency is about 930 ms per embedding. (The reference was computed with a host-side RoPE replication validated bit-exact against Hugging Face's real MistralDecoderLayer, because the model's shipped custom forward is incompatible with transformers 5.x.)
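The two checks reported here (device-vs-reference cosine, and ranking a relevant document above an irrelevant one) can be sketched on the host as follows. The helper names and the 0.999 threshold are hypothetical, and the vectors are toy 3-d values, not real model outputs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def matches_reference(device_vec, reference_vec, threshold=0.999):
    """Parity check; the threshold is an illustrative choice."""
    return cosine(device_vec, reference_vec) >= threshold

def rank_documents(query_vec, doc_vecs):
    """Return (name, score) pairs sorted by descending cosine."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for 4096-d embeddings.
query = [1.0, 0.0, 0.0]
docs = {
    "fib_code": [0.7, 0.7, 0.0],  # related to the query
    "sql_doc": [0.0, 0.1, 1.0],   # unrelated to the query
}
ranking = rank_documents(query, docs)
```

In a real check, `query` and the entries of `docs` would be the embeddings produced on device, and `matches_reference` would compare each one against its fp32 counterpart.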
What's inside (v81/)
- nvembedcode7b_enc_p{0..3}_w8.bin: the 4 chained W8 encoder parts (graphs nvembedcode7b_enc_p{k}_w8).
- nvembedcode7b_embed_f16.bin: token-embedding table (fp16, vocab 32000).
- tokenizer.json: Mistral sentencepiece BPE tokenizer.
- nv-embedcode-7b.json: the QHexRT manifest (host-op llama_embed_sharded, plain rope).
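Before pushing the bundle to a device, it can help to confirm that all seven files are present. The following is a minimal host-side sketch; the helper name is hypothetical, and it only checks that the files exist, not that their contents are valid.

```python
from pathlib import Path

# File names as listed for the v81/ bundle.
EXPECTED_FILES = [f"nvembedcode7b_enc_p{k}_w8.bin" for k in range(4)] + [
    "nvembedcode7b_embed_f16.bin",
    "tokenizer.json",
    "nv-embedcode-7b.json",
]

def missing_bundle_files(bundle_dir):
    """Return the expected file names that are absent from bundle_dir."""
    root = Path(bundle_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

For example, `missing_bundle_files("v81")` returns an empty list when the bundle is complete.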
Run
adb push v81 /data/local/tmp/wq/nb14emb
adb shell "cd /data/local/tmp/wq && LD_LIBRARY_PATH=. QHX_EMB_QUERY_PREFIX= QHX_EMB_DOC_PREFIX= \
./qhx_embed nb14emb/nv-embedcode-7b.json libQnnHtp.so libQnnSystem.so nb14emb \
'function to compute fibonacci numbers' 'def fib(n): return n if n<2 else fib(n-1)+fib(n-2)'"
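Embeddings are computed over raw text (no prefix: QHX_EMB_QUERY_PREFIX and QHX_EMB_DOC_PREFIX are both set to empty). A QNN context binary is pinned to its architecture and QAIRT version (v81, QAIRT 2.47, soc_model 87), so this bundle won't load on v79/v83.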
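When embedding arbitrary strings, shell quoting of the input texts is easy to get wrong. The sketch below builds the same adb invocation as the Run command from a Python list of texts, on the host and outside the on-device hot path. The function name is hypothetical, and it assumes the device shell accepts POSIX-style single-quoted arguments; it only constructs the command and does not run it.

```python
import shlex

REMOTE_ROOT = "/data/local/tmp/wq"  # working directory used in the Run command
BUNDLE = "nb14emb"                  # directory the v81 bundle was pushed to

def build_embed_command(texts, remote_root=REMOTE_ROOT, bundle=BUNDLE):
    """Build the argv list for `adb shell` that embeds each text."""
    if not texts:
        raise ValueError("at least one input text is required")
    # Empty prefixes: embeddings are computed over raw text.
    env = "LD_LIBRARY_PATH=. QHX_EMB_QUERY_PREFIX= QHX_EMB_DOC_PREFIX="
    binary = (
        f"./qhx_embed {bundle}/nv-embedcode-7b.json "
        f"libQnnHtp.so libQnnSystem.so {bundle}"
    )
    quoted_texts = " ".join(shlex.quote(text) for text in texts)
    remote = f"cd {shlex.quote(remote_root)} && {env} {binary} {quoted_texts}"
    return ["adb", "shell", remote]

command = build_embed_command([
    "function to compute fibonacci numbers",
    "def fib(n): return n if n<2 else fib(n-1)+fib(n-2)",
])
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a host with adb installed and a v81 device attached.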