oflorez/Wayra-Perplexity-Estimator-55M-TensorRT: TensorRT Optimized WayraPPL
A100-optimized TensorRT version of WayraPPL for high-throughput prediction of perplexity.
Hardware Requirements
This model works on NVIDIA A100 GPUs with:
- GPU Architecture: sm_80 (A100-80GB)
- CUDA: 12.8+
- TensorRT: 10.13.x
- Driver: 570.124.06+
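A quick pre-flight check can catch a mismatched environment before the engine is loaded. The sketch below compares detected values against the requirements listed above. The helper names (`parse_version`, `check_environment`) are illustrative and not part of this repository, and how the values are detected (for example, `torch.cuda.get_device_capability()` for the compute capability) is left to the caller.

```python
import re

# Requirements copied from the list above.
REQUIRED_COMPUTE_CAPABILITY = (8, 0)   # sm_80 (A100)
MIN_CUDA = (12, 8)                     # 12.8+
REQUIRED_TENSORRT_SERIES = (10, 13)    # 10.13.x
MIN_DRIVER = (570, 124, 6)             # 570.124.06+


def parse_version(text):
    """Turn a version string such as '570.124.06' into a tuple of ints."""
    return tuple(int(part) for part in re.findall(r"\d+", text))


def check_environment(compute_capability, cuda, tensorrt, driver):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    major, minor = compute_capability
    if (major, minor) != REQUIRED_COMPUTE_CAPABILITY:
        problems.append(f"GPU is sm_{major}{minor}, engine expects sm_80")
    if parse_version(cuda)[:2] < MIN_CUDA:
        problems.append(f"CUDA {cuda} is older than 12.8")
    if parse_version(tensorrt)[:2] != REQUIRED_TENSORRT_SERIES:
        problems.append(f"TensorRT {tensorrt} is not in the 10.13.x series")
    if parse_version(driver) < MIN_DRIVER:
        problems.append(f"Driver {driver} is older than 570.124.06")
    return problems


# Example values typed in by hand; replace them with detected ones.
for problem in check_environment((8, 0), "12.8", "10.13.2.6", "570.124.06"):
    print(problem)
```

An empty result means all four checks passed; each failed check contributes one message.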
Performance
- Throughput: ~50,000+ samples/sec (A100)
- Latency: <1ms per sample
- Batch Size: Up to 2048
- Memory: ~2GB GPU memory
Installation
pip install -r tensorrt_requirements.txt
python -c "import tensorrt; print(tensorrt.__version__)"
Usage
Option 1: PyTorch Model (Standard)
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("oflorez/Wayra-Perplexity-Estimator-55M-TensorRT")
model = AutoModel.from_pretrained("oflorez/Wayra-Perplexity-Estimator-55M-TensorRT")
texts = ["Your text here"]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
outputs = model(**inputs)
print(f"PPL: {outputs['ppl']}")
Option 2: TensorRT Engine (High Performance)
from tensorrt_inference import WayraPPLTensorRT
from transformers import AutoTokenizer
model = WayraPPLTensorRT("wayrappl_fp16_bs2048.engine")
tokenizer = AutoTokenizer.from_pretrained("oflorez/Wayra-Perplexity-Estimator-55M-TensorRT")
texts = ["Your text here"] * 1000
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
outputs = model.infer(inputs['input_ids'].numpy(), inputs['attention_mask'].numpy())
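The engine file name (`wayrappl_fp16_bs2048.engine`) and the Performance section both point to a batch limit of 2048, so larger collections presumably need to be split before calling `infer`. A minimal sketch of that chunking follows; `iter_batches` is an illustrative helper, not part of `tensorrt_inference`.

```python
def iter_batches(items, batch_size=2048):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = ["Your text here"] * 5000
batch_sizes = [len(batch) for batch in iter_batches(texts)]
print(batch_sizes)  # [2048, 2048, 904]
```

Each batch can then be tokenized and passed to `model.infer(...)` as in Option 2.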
Files Included
Use Cases
- Semantic Filtering
- Curriculum Learning
- Large-scale dataset cleaning (millions of documents)
- Real-time perplexity estimation
- High-throughput data quality assessment
- Production MLOps pipelines
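For dataset cleaning, a common pattern is to compare each estimated perplexity against a threshold. The sketch below shows one way to do that. The function name `filter_by_perplexity` and the bounds are hypothetical and would need tuning for a given corpus, and the scores are assumed to arrive as one float per input text.

```python
def filter_by_perplexity(texts, ppl_scores, min_ppl=0.0, max_ppl=float("inf")):
    """Keep texts whose perplexity lies within [min_ppl, max_ppl]."""
    if len(texts) != len(ppl_scores):
        raise ValueError("texts and ppl_scores must have the same length")
    return [t for t, p in zip(texts, ppl_scores) if min_ppl <= p <= max_ppl]


# Made-up scores for illustration only, not model output.
docs = ["clean sentence", "asdf qwer zxcv", "another clean one"]
scores = [12.5, 950.0, 30.1]
kept = filter_by_perplexity(docs, scores, max_ppl=100.0)
print(kept)  # ['clean sentence', 'another clean one']
```

For curriculum learning, the same scores could instead be used to sort documents from low to high perplexity.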
Model Details
- Base: Knowledge distillation from meta-llama/Llama-3.2-1B
- Architecture: GPT2-based Transformer blocks with perplexity heads
- Languages: Spanish, Portuguese, English
- Max Length: 512 tokens
- Precision: FP16 (TensorRT), FP32 (PyTorch)
Benchmarks (A100)
| Model Type | Throughput | Latency | Memory |
|---|---|---|---|
| Llama 3 1B | ~200/sec | 50ms | 8GB |
| Wayra PyTorch | ~1,000/sec | 10ms | 4GB |
| Wayra TensorRT | ~50,000/sec | <1ms | 2GB |
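To put the throughput figures in perspective, the arithmetic below estimates the wall-clock time for a fixed workload at each benchmarked rate. The 10-million-document workload is an arbitrary example, and the estimate ignores tokenization and I/O.

```python
# Approximate throughputs (samples/sec) taken from the benchmark figures.
THROUGHPUT = {
    "Llama 3 1B": 200,
    "Wayra PyTorch": 1_000,
    "Wayra TensorRT": 50_000,
}


def estimated_seconds(num_samples, samples_per_sec):
    """Time to process `num_samples` at a constant rate."""
    return num_samples / samples_per_sec


num_docs = 10_000_000  # hypothetical workload
for name, rate in THROUGHPUT.items():
    seconds = estimated_seconds(num_docs, rate)
    print(f"{name}: {seconds:,.0f} s ({seconds / 3600:.2f} h)")
```

At these rates the workload takes roughly 13.9 hours with the Llama baseline, 2.8 hours with the PyTorch model, and 200 seconds with the TensorRT engine, a 250x and 50x difference respectively.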
Troubleshooting
"TensorRT engine not compatible"
- Ensure you're using an A100-SXM4-80GB GPU (sm_80 architecture)
- Check CUDA version: nvidia-smi (should be 12.8+)
- Verify TensorRT: python -c "import tensorrt; print(tensorrt.__version__)" (should be 10.13.x)
"CUDA out of memory"
- Reduce batch size in inference
- Use gradient checkpointing if training
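One way to apply the first suggestion automatically is to halve the batch size whenever a batch fails with an out-of-memory error. This is a generic sketch: `infer_fn` stands in for whatever inference call is used, and the exception type that signals out-of-memory depends on the backend, so it is passed in as a parameter (Python's built-in `MemoryError` is only a placeholder default).

```python
def infer_with_backoff(infer_fn, items, batch_size=2048, oom_error=MemoryError):
    """Run `infer_fn` over `items` in batches, halving the batch size on OOM."""
    results = []
    start = 0
    while start < len(items):
        batch = items[start:start + batch_size]
        try:
            results.extend(infer_fn(batch))
        except oom_error:
            if batch_size == 1:
                raise  # cannot shrink any further
            batch_size = max(1, batch_size // 2)
            continue  # retry from the same position with a smaller batch
        start += len(batch)
    return results


def fake_infer(batch):
    """Stand-in for a real model: 'runs out of memory' above 4 items."""
    if len(batch) > 4:
        raise MemoryError("simulated OOM")
    return [len(text) for text in batch]


lengths = infer_with_backoff(fake_infer, ["a", "bb", "ccc"] * 4, batch_size=16)
print(lengths)
```

The reduced batch size is kept for the remaining items, so a single failure does not trigger repeated retries at the original size.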
Citation
@software{WayraPPL,
title={WayraPPL: High-Performance Perplexity Estimation of Data Novelty},
author={Omar U. Florez and LatamGPT Team},
year={2025},
url={https://huggingface.co/latam-gpt/Wayra-Perplexity-Estimator-55M}
}
License
Apache 2.0 - See LICENSE file
Note: This model is optimized for A100 GPUs. For other GPUs, use the PyTorch version or rebuild the TensorRT engine for your specific hardware.