---
title: SmolLM2 360M Instruct
emoji: 🏃
colorFrom: yellow
colorTo: green
sdk: gradio
sdk_version: 6.9.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: 'SmolLM2-360M-Instruct'
---
# SmolLM2 360M Instruct Demo

This Space demonstrates the SmolLM2-360M-Instruct model with a CPU fallback mechanism. It is designed to run efficiently even on the Hugging Face Free Tier (2 vCPUs).

## Overview

A minimal but production-ready LLM service built on:
* **Model:** SmolLM2-360M-Instruct (approx. 269MB, Apache 2.0).
* **Efficiency:** Runs on 2 vCPUs with a minimum of 2 GB RAM (the HF Free Tier provides up to 16 GB).
* **Scalability:** Small enough for local training and testing.

## Related Project: SmolLM2-customs

If you are interested in training small LLMs the lazy way, check out:
[https://github.com/VolkanSah/SmolLM2-customs](https://github.com/VolkanSah/SmolLM2-customs)

**Features of the custom implementation:**
* **FastAPI:** OpenAI-compatible `/v1/chat/completions` endpoint.
* **ADI (Anti-Dump Index):** Filters low-quality requests before they hit the model.
* **HF Dataset Integration:** Logs every request for later analysis and finetuning.
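An OpenAI-compatible endpoint like the one above can be called with any plain HTTP client. A minimal sketch, assuming the custom server runs locally on port 8000 (the base URL and model name here are placeholders, not confirmed by the project):

```python
import json
import urllib.request

def build_chat_request(prompt: str, model: str = "SmolLM2-360M-Instruct") -> dict:
    """Build an OpenAI-style /v1/chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }

def chat(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST the payload and return the assistant's reply text."""
    payload = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because the request/response shape follows the OpenAI schema, the official `openai` client library can also be pointed at the server by overriding its base URL.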

---

## Deployment & Usage

You do not need an API key for this public demo, but rate limits apply. 

### How to run your own instance:
1. **Duplicate/Clone** this Space.
2. **Environment Variables:** To use your own model access or private weights, add one of the following keys to your **Secrets**:
   * `HF_TOKEN`
   * `TEST_TOKEN`
   * `HUGGINGFACE_TOKEN`
   * `HF_API_TOKEN`

The code uses flexible token-resolution logic to remain compatible with older or custom key names.

## Technical Details

The inference pipeline uses `transformers` with `torch`. It automatically detects if a GPU is available; otherwise, it falls back to CPU execution without breaking the Gradio interface.
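A minimal sketch of that device fallback, assuming a standard `transformers` text-generation pipeline (the actual `app.py` may differ; the model ID below is the public SmolLM2 checkpoint on the Hub):

```python
import torch
from transformers import pipeline

def get_device() -> str:
    """Return 'cuda' when a GPU is visible, otherwise fall back to 'cpu'."""
    return "cuda" if torch.cuda.is_available() else "cpu"

def build_generator(model_id: str = "HuggingFaceTB/SmolLM2-360M-Instruct"):
    # pipeline() loads the tokenizer and model; `device` selects where
    # inference runs, so the same code works on GPU and CPU Spaces.
    return pipeline("text-generation", model=model_id, device=get_device())
```

Because the device is resolved at startup, the Gradio interface never needs to know whether it is serving from GPU or CPU.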