Tags: Text Generation · Transformers · Safetensors · llama · sql · nvfp4 · quantized · vllm · blackwell · llmcompressor · conversational · text-generation-inference · 8-bit precision · compressed-tensors
Instructions to use pshashid/llama3.1B_8B_SQL_Finetuned_model with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use pshashid/llama3.1B_8B_SQL_Finetuned_model with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="pshashid/llama3.1B_8B_SQL_Finetuned_model")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("pshashid/llama3.1B_8B_SQL_Finetuned_model")
model = AutoModelForCausalLM.from_pretrained("pshashid/llama3.1B_8B_SQL_Finetuned_model")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use pshashid/llama3.1B_8B_SQL_Finetuned_model with vLLM:
Install from pip and serve the model

```shell
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "pshashid/llama3.1B_8B_SQL_Finetuned_model"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "pshashid/llama3.1B_8B_SQL_Finetuned_model",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

Use Docker

```shell
docker model run hf.co/pshashid/llama3.1B_8B_SQL_Finetuned_model
```
- SGLang
How to use pshashid/llama3.1B_8B_SQL_Finetuned_model with SGLang:
Install from pip and serve the model

```shell
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "pshashid/llama3.1B_8B_SQL_Finetuned_model" \
  --host 0.0.0.0 \
  --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "pshashid/llama3.1B_8B_SQL_Finetuned_model",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

Use Docker images

```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path "pshashid/llama3.1B_8B_SQL_Finetuned_model" \
  --host 0.0.0.0 \
  --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "pshashid/llama3.1B_8B_SQL_Finetuned_model",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

- Docker Model Runner
How to use pshashid/llama3.1B_8B_SQL_Finetuned_model with Docker Model Runner:
```shell
docker model run hf.co/pshashid/llama3.1B_8B_SQL_Finetuned_model
```
Llama 3.1 8B SQL, NVFP4 Quantized (Blackwell)
A Llama 3.1 8B model fine-tuned for text-to-SQL generation, quantized to NVFP4 for NVIDIA Blackwell GPUs (RTX 50-series) using llm-compressor.
Quantization Details
| Component | Format | Notes |
|---|---|---|
| Weights | NVFP4 | ~4.5 GB; native on Blackwell 5th-gen Tensor Cores |
| KV-Cache | FP8 | 50% memory vs. FP16; configured via vLLM |
| Activations | FP16 | lm_head kept in FP16 for output quality |
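The 50% KV-cache saving in the table follows directly from the element widths (1 byte per FP8 value vs. 2 per FP16 value). A quick back-of-envelope check, assuming the standard Llama 3.1 8B attention shape (32 layers, 8 grouped-query KV heads, head dimension 128):

```python
# Per-token KV-cache footprint for Llama 3.1 8B
# (assumed config: 32 layers, 8 KV heads, head dim 128).
layers, kv_heads, head_dim = 32, 8, 128

# Keys + values -> factor of 2 cached tensors per layer.
elements_per_token = 2 * layers * kv_heads * head_dim  # 65,536 elements

fp16_bytes = elements_per_token * 2  # 2 bytes per FP16 element
fp8_bytes = elements_per_token * 1   # 1 byte per FP8 element

print(fp16_bytes // 1024, "KiB/token in FP16")  # 128 KiB/token
print(fp8_bytes // 1024, "KiB/token in FP8")    # 64 KiB/token
print(f"{fp8_bytes / fp16_bytes:.0%} of FP16")  # 50% of FP16
```

At the 131,072-token context configured below, that halving is worth roughly 8 GB per sequence, which is what makes long contexts practical on a single RTX 5090.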
vLLM Inference (RTX 5090)
```shell
vllm serve pshashid/llama3.1B_8B_SQL_Finetuned_model \
  --dtype float16 \
  --quantization fp4 \
  --kv-cache-dtype fp8 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.85 \
  --enable-prefix-caching \
  --port 8000
```
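Once the server is up, any OpenAI-compatible client can query it. A minimal sketch of building a text-to-SQL request payload; the `build_chat_request` helper, the system-prompt wording, and the example schema are illustrative assumptions, not the prompt format the model was trained with:

```python
import json

MODEL = "pshashid/llama3.1B_8B_SQL_Finetuned_model"

def build_chat_request(question: str, schema: str) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions payload.

    Hypothetical helper: the system prompt here is a plausible
    text-to-SQL framing, not the model's documented template.
    """
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": f"Generate SQL for this schema:\n{schema}"},
            {"role": "user", "content": question},
        ],
        "temperature": 0,
        "max_tokens": 200,
    }

payload = build_chat_request(
    "How many users signed up in 2024?",
    "CREATE TABLE users (id INT, signup_date DATE);",
)
print(json.dumps(payload, indent=2))
# POST this to http://localhost:8000/v1/chat/completions,
# e.g. requests.post(url, json=payload), once the server above is running.
```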
Performance Targets (Dual RTX 5090 Pod, 8 Replicas)
| Metric | Target |
|---|---|
| Time to First Token | < 15ms |
| Throughput (1 replica) | ~200 tok/s |
| Aggregate (8 replicas) | 1,500+ tok/s |
| Max Concurrency | 100+ users |
Example Usage (Python)
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="pshashid/llama3.1B_8B_SQL_Finetuned_model",
    quantization="fp4",
    kv_cache_dtype="fp8",
    max_model_len=131072,
    enable_prefix_caching=True,
)

sampling = SamplingParams(temperature=0, max_tokens=200)
outputs = llm.generate(["SELECT"], sampling)
print(outputs[0].outputs[0].text)
```
Model tree for pshashid/llama3.1B_8B_SQL_Finetuned_model
- Base model: meta-llama/Llama-3.1-8B
- Finetuned from: meta-llama/Llama-3.1-8B-Instruct