NV-EmbedCode-7B → Hexagon NPU (QHexRT, v81)

On-device code-retrieval embedding bundle of nvidia/nv-embedcode-7b-v1 for the Qualcomm Hexagon v81 NPU (SM8850), runnable with the QHexRT C++ runtime. No Python in the hot path.

MistralBiDirectionalModel (bidirectional Mistral-7B: hidden 4096, 32 layers, 32 query / 8 KV heads, head_dim 128, plain rope theta 10000, full attention) + avg-pool + L2 → a 4096-d embedding. The 7B encoder ships as 4 chained W8 context parts (llama_embed_sharded, the encoder-split engine; 1.6 GB each, ≈ 6.4 GB total, which fits the ~12 GB device).
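The "avg-pool + L2" step can be illustrated on the host with a few lines of plain Python. This is only a sketch of the general technique, not the QHexRT implementation: the function name `mean_pool_l2` is hypothetical, and masking out padding tokens before averaging is an assumption about how the pooling treats padded positions.

```python
import math

def mean_pool_l2(hidden_states, attention_mask):
    """Average the token vectors where mask == 1, then scale to unit L2 norm.

    hidden_states: list of per-token vectors (lists of floats, equal length).
    attention_mask: list of 0/1 flags, one per token (assumed convention).
    """
    dim = len(hidden_states[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                summed[i] += x
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    mean = [s / count for s in summed]
    norm = math.sqrt(sum(x * x for x in mean))
    if norm == 0.0:
        return mean  # all-zero vector cannot be normalized
    return [x / norm for x in mean]

# Toy example: 3-d vectors stand in for the 4096-d hidden states.
tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [9.0, 9.0, 9.0]]  # last row is padding
mask = [1, 1, 0]
embedding = mean_pool_l2(tokens, mask)
```

Because the output has unit length, the cosine between two such embeddings reduces to a plain dot product.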

Device validation (v81, SM8850): device-vs-reference embedding cosine

| input | cosine |
|---|---|
| "function to compute fibonacci numbers" | 0.9992 |
| "def fib(n): ... fib(n-1)+fib(n-2)" | 0.9993 |
| "SELECT * FROM users WHERE age > 30" | 0.9993 |
| "binary search implementation in python" | 0.9993 |

The W8 7B bundle tracks the fp32 reference closely (cosine ≥ 0.9992 on all four inputs). Code retrieval: cos(fib-query, fib-code) is 0.6878 on device vs 0.6872 for the reference, and 0.037 for an irrelevant SQL doc, so the ranking is correct. Latency is ~930 ms per embedding. (The reference was computed with a host-rope replication validated bit-exact against HF's real MistralDecoderLayer, because the model's shipped custom forward is incompatible with transformers 5.x.)
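The retrieval check described above amounts to scoring each document by cosine similarity against the query and sorting. A minimal sketch follows; the helper names `cosine` and `rank` are hypothetical, and the 3-d vectors are made-up stand-ins rather than real model outputs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (na * nb)

def rank(query_vec, docs):
    """Return (score, name) pairs sorted from most to least similar."""
    scored = [(cosine(query_vec, vec), name) for name, vec in docs.items()]
    return sorted(scored, reverse=True)

# Toy vectors only; real embeddings from this bundle are 4096-d.
query = [1.0, 0.2, 0.0]
docs = {
    "fib_code": [0.8, 0.6, 0.1],  # related document
    "sql_doc": [0.0, 0.1, 1.0],   # unrelated document
}
ranked = rank(query, docs)
best_score, best_name = ranked[0]
```

The same `cosine` function applied to a device embedding and a reference embedding of the same input gives the kind of fidelity figure reported in the validation table.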

What's inside (v81/)

  • nvembedcode7b_enc_p{0..3}_w8.bin: the 4 chained W8 encoder parts (graphs nvembedcode7b_enc_p{k}_w8).
  • nvembedcode7b_embed_f16.bin: token-embedding table (fp16, vocab 32000).
  • tokenizer.json: Mistral sentencepiece BPE tokenizer.
  • nv-embedcode-7b.json: the QHexRT manifest (host-op llama_embed_sharded, plain rope).
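As a rough illustration of what the fp16 token-embedding table holds, the sketch below computes the size a headerless table would have and reads one row from such a buffer. The on-disk layout is an assumption (raw row-major [vocab, hidden], little-endian, no header); the actual file may differ, and `read_embedding_row` is a hypothetical helper, not part of QHexRT.

```python
import struct

HIDDEN = 4096        # hidden size from the model description
VOCAB = 32000        # vocab size from the bundle listing
BYTES_PER_FP16 = 2

def expected_table_bytes(vocab=VOCAB, hidden=HIDDEN):
    """Size of a raw fp16 [vocab, hidden] table with no header (assumed layout)."""
    return vocab * hidden * BYTES_PER_FP16

def read_embedding_row(buf, token_id, hidden):
    """Decode one token's row from a raw little-endian fp16 buffer."""
    row_bytes = hidden * BYTES_PER_FP16
    start = token_id * row_bytes
    end = start + row_bytes
    if token_id < 0 or end > len(buf):
        raise IndexError("token_id is outside the table")
    # struct format 'e' is IEEE 754 half precision.
    return list(struct.unpack(f"<{hidden}e", buf[start:end]))

# Tiny in-memory stand-in: 3 tokens x 2 dims instead of 32000 x 4096.
toy_table = struct.pack("<6e", 0.5, 1.0, 1.5, 2.0, 2.5, 3.0)
row = read_embedding_row(toy_table, 1, hidden=2)
table_bytes = expected_table_bytes()
```

Under these assumptions the table occupies 262,144,000 bytes (250 MiB), which is small next to the four encoder parts.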

Run

adb push v81 /data/local/tmp/wq/nb14emb
adb shell "cd /data/local/tmp/wq && LD_LIBRARY_PATH=. QHX_EMB_QUERY_PREFIX= QHX_EMB_DOC_PREFIX= \
  ./qhx_embed nb14emb/nv-embedcode-7b.json libQnnHtp.so libQnnSystem.so nb14emb \
  'function to compute fibonacci numbers' 'def fib(n): return n if n<2 else fib(n-1)+fib(n-2)'"

Embeddings are computed over raw text (no prefix). A QNN context binary is pinned to its arch and QAIRT version (v81, QAIRT 2.47, soc_model 87), so it won't load on v79/v83.
