| language: | |
| - en | |
| license: mit | |
| library_name: transformers | |
| base_model: Qwen/Qwen2.5-Coder-1.5B-Instruct | |
| tags: | |
| - code | |
| - code-editing | |
| - merge | |
| - fastedit | |
| - qwen2 | |
| pipeline_tag: text-generation | |
| # FastEdit 1.7B | |
| A fine-tuned **Qwen2.5-Coder-1.5B-Instruct** for merging code edit snippets into source files. Given an original code chunk (~35 lines) and a compact edit snippet with context markers, the model produces the merged result. | |
| This model is designed to be used with the [FastEdit](https://github.com/parcadei/fastedit) toolkit, which handles AST scoping, deterministic edits, and post-processing. **Using the model directly requires the exact prompt format described below.** | |
| ## Model variants | |
| All variants are in this repo under subfolders: | |
| | Subfolder | Format | Size | Use case | | |
| |-----------|--------|------|----------| | |
| | `bf16/` | BF16 safetensors | 3.2 GB | Fine-tuning, reference, GPU serving via vLLM/TGI | | |
| | `mlx-8bit/` | MLX 8-bit | 1.7 GB | Apple Silicon (recommended for local use) | | |
| | `gguf/` | GGUF Q8_0 | 1.7 GB | llama.cpp, LM Studio, Ollama | | |
| ## Prompt format | |
| The model expects a specific 2-message chat format. **Using a different prompt will produce poor results.** | |
| ### System message | |
| ``` | |
| You are a coding assistant that helps merge code updates, ensuring every modification is fully integrated. /no_think | |
| ``` | |
| The `/no_think` suffix disables Qwen's thinking mode — without it, the model may emit thousands of reasoning tokens before producing output. | |
| ### User message | |
| ``` | |
| Merge all changes from the <update> snippet into the <code> below. | |
| - Preserve the code's structure, order, comments, and indentation exactly. | |
| - Output only the updated code, enclosed within <updated-code> and </updated-code> tags. | |
| - Do not include any additional text, explanations, placeholders, ellipses, or code fences. | |
| <code>{original_code}</code> | |
| <update>{update_snippet}</update> | |
| Provide the complete updated code. | |
| ``` | |
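The two messages above can be assembled programmatically. The following is a minimal sketch, assuming the template text is copied verbatim from this card and that its lines are separated by single newlines (the exact whitespace is an assumption). The names `SYSTEM_PROMPT`, `USER_INSTRUCTIONS`, and `build_messages` are illustrative and not part of the FastEdit toolkit.

```python
SYSTEM_PROMPT = (
    "You are a coding assistant that helps merge code updates, "
    "ensuring every modification is fully integrated. /no_think"
)

USER_INSTRUCTIONS = (
    "Merge all changes from the <update> snippet into the <code> below.\n"
    "- Preserve the code's structure, order, comments, and indentation exactly.\n"
    "- Output only the updated code, enclosed within <updated-code> and </updated-code> tags.\n"
    "- Do not include any additional text, explanations, placeholders, ellipses, or code fences."
)


def build_messages(original_code: str, update_snippet: str) -> list:
    """Return the system and user messages in the format described above."""
    # f-strings insert the code as-is, so braces inside the code are safe.
    user = (
        f"{USER_INSTRUCTIONS}\n"
        f"<code>{original_code}</code>\n"
        f"<update>{update_snippet}</update>\n"
        "Provide the complete updated code."
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user},
    ]
```

The returned list can be passed to a tokenizer's `apply_chat_template` or to any OpenAI-compatible chat endpoint.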
| ### Expected output | |
| The model outputs the merged code wrapped in `<updated-code>` tags: | |
| ``` | |
| <updated-code> | |
| def process(data): | |
|     try: | |
|         result = transform(data) | |
|         return result | |
|     except Error as e: | |
|         return {"error": str(e)} | |
| </updated-code> | |
| ``` | |
| ### Complete example | |
| **Original code** (what tree-sitter extracts for the target function): | |
| ```python | |
| def process(data): | |
|     result = transform(data) | |
|     return result | |
| ``` | |
| **Edit snippet** (what the user/agent writes): | |
| ```python | |
| def process(data): | |
|     try: | |
|         # ... existing code ... | |
|     except Error as e: | |
|         return {"error": str(e)} | |
| ``` | |
| **Model output:** | |
| ```python | |
| <updated-code> | |
| def process(data): | |
|     try: | |
|         result = transform(data) | |
|         return result | |
|     except Error as e: | |
|         return {"error": str(e)} | |
| </updated-code> | |
| ``` | |
| The model understands `# ... existing code ...` markers (and language-specific variants like `// ... existing code ...`) as instructions to preserve the original lines in that region. | |
| ## How it fits into FastEdit | |
| In production, the model is the **fallback** — not the primary path: | |
| 1. **AST scoping** — tree-sitter finds the target function by name (~35 lines), so the model never sees the whole file | |
| 2. **Deterministic text-match** — 74% of edits are resolved by matching context lines and splicing in new lines (0 tokens, <1ms) | |
| 3. **Model merge** — the remaining 26% of edits (structural changes like wrapping in try/catch, full rewrites) go to this model | |
| The model only ever processes ~35-line chunks. It was trained on function-scoped edits, not whole files. Feeding it large inputs will degrade quality. | |
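The "deterministic first, model as fallback" routing can be sketched as follows. This is a toy illustration, not the toolkit's actual matcher: `try_text_match` and `merge_edit` are hypothetical names, and the deterministic path here only handles the simplest case, a pure insertion between two context lines that are adjacent in the original.

```python
from typing import Callable, Optional


def try_text_match(original: str, snippet: str) -> Optional[str]:
    """Toy deterministic path for a pure insertion.

    The snippet must look like [context_before, new lines..., context_after],
    where the two context lines are adjacent in the original and that pair
    occurs exactly once. Anything else returns None.
    """
    orig = original.splitlines()
    snip = snippet.splitlines()
    if len(snip) < 3:
        return None
    before, new_lines, after = snip[0], snip[1:-1], snip[-1]
    hits = [
        i for i in range(len(orig) - 1)
        if orig[i] == before and orig[i + 1] == after
    ]
    if len(hits) != 1:
        return None  # ambiguous or structural edit: not safe to splice
    i = hits[0]
    return "\n".join(orig[: i + 1] + new_lines + orig[i + 1:])


def merge_edit(
    original: str,
    snippet: str,
    model_merge: Callable[[str, str], str],
) -> str:
    """Try the deterministic path first; call the model only if it fails."""
    merged = try_text_match(original, snippet)
    return merged if merged is not None else model_merge(original, snippet)
```

Here `model_merge` stands for whatever function sends the prompt to the model and parses its reply. Structural edits such as the try/except wrap shown earlier fail the context match and fall through to it.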
| ## Using without FastEdit | |
| If you want to use the model directly (without the toolkit), you need to: | |
| 1. **Scope the input yourself** — extract only the target function/class, not the whole file | |
| 2. **Use the exact prompt format** above — any other prompt produces worse results | |
| 3. **Parse the output** — extract text between `<updated-code>` and `</updated-code>` tags | |
| 4. **Handle edge cases** — the model may emit `<think>` blocks (strip them), use variant tag names (`<update-code>`, `<updated_code>`), or truncate output on long functions | |
| ```python | |
| from transformers import AutoModelForCausalLM, AutoTokenizer | |
| # BF16 (GPU / fine-tuning) | |
| model = AutoModelForCausalLM.from_pretrained("continuous-lab/FastEdit", subfolder="bf16", torch_dtype="auto") | |
| tokenizer = AutoTokenizer.from_pretrained("continuous-lab/FastEdit", subfolder="bf16") | |
| messages = [ | |
|     {"role": "system", "content": "You are a coding assistant that helps merge code updates, ensuring every modification is fully integrated. /no_think"}, | |
|     {"role": "user", "content": """Merge all changes from the <update> snippet into the <code> below. | |
| - Preserve the code's structure, order, comments, and indentation exactly. | |
| - Output only the updated code, enclosed within <updated-code> and </updated-code> tags. | |
| - Do not include any additional text, explanations, placeholders, ellipses, or code fences. | |
| <code>def process(data): | |
|     result = transform(data) | |
|     return result</code> | |
| <update>def process(data): | |
|     try: | |
|         # ... existing code ... | |
|     except Error as e: | |
|         return {"error": str(e)}</update> | |
| Provide the complete updated code."""} | |
| ] | |
| text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) | |
| inputs = tokenizer(text, return_tensors="pt").to(model.device) | |
| outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)  # greedy decoding; temperature=0 is rejected when sampling is enabled | |
| result = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True) | |
| # Parse: extract text between <updated-code> and </updated-code> | |
| ``` | |
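The parsing and edge-case steps listed above can be sketched with the standard library alone. This is a minimal example, assuming the tag variants named in this card are the only ones worth accepting; `extract_updated_code` is a hypothetical helper, not a FastEdit API.

```python
import re
from typing import Optional

# Canonical tag first, then the two variant spellings mentioned in this card.
_TAGS = ("updated-code", "update-code", "updated_code")
_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def extract_updated_code(raw: str) -> Optional[str]:
    """Return the merged code from a raw completion, or None if unparseable."""
    # Drop any closed <think> block before looking for the result tags.
    text = _THINK_RE.sub("", raw)
    for tag in _TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        if match:
            # Strip only surrounding newlines so indentation is preserved.
            return match.group(1).strip("\n")
    return None  # missing or unclosed tag, for example truncated output
```

A `None` result is a reasonable signal to retry with a larger `max_new_tokens` or to reject the edit rather than write partial code to disk.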
| ## Training | |
| - **Base model**: Qwen2.5-Coder-1.5B-Instruct | |
| - **Task**: Code edit merging across 13 languages | |
| ## Evaluation | |
| Tested on 22 structurally distinct edit patterns (73 cases) across 13 languages: | |
| | Path | Accuracy | Avg tokens | Avg latency | | |
| |------|----------|------------|-------------| | |
| | Deterministic (74% of edits) | 100% | 0 | <1ms | | |
| | Model (26% of edits) | 92% | ~40 | ~500ms | | |
| | **Combined** | **~98%** | **~10** | **~130ms** | | |
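The combined row follows from weighting each path by its share of edits. The short check below reproduces it, assuming the deterministic path contributes 0 tokens and roughly 0 ms (the table gives only an upper bound of 1 ms); `combine` is an illustrative helper name.

```python
def combine(det_share: float, det_value: float, model_value: float) -> float:
    """Weighted average over the two paths; the shares sum to 1."""
    return det_share * det_value + (1.0 - det_share) * model_value


accuracy = combine(0.74, 1.00, 0.92)    # 0.74 * 1.00 + 0.26 * 0.92
avg_tokens = combine(0.74, 0, 40)       # 0.26 * 40
avg_latency_ms = combine(0.74, 0, 500)  # 0.26 * 500

print(round(accuracy, 4), round(avg_tokens, 1), round(avg_latency_ms))
```

The results, about 0.979, 10.4, and 130, match the rounded figures in the table.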
| Per-language model accuracy (156-example benchmark): | |
| | Language | Accuracy | | |
| |----------|----------| | |
| | Python, Java, Kotlin, C, PHP | 92% | | |
| | JavaScript, TypeScript, Rust, Swift | 85% | | |
| | Go, C++, Ruby | 77% | | |
| ## Limitations | |
| - Performance degrades on inputs longer than ~100 lines. | |
| - Does not handle whole-file edits well — use the FastEdit toolkit's AST scoping. | |
| - The edit snippet must use `# ... existing code ...` markers (or language-equivalent) for context preservation. Without markers, the model treats the entire snippet as a replacement. | |
| - Languages not in the training set may work but are untested. | |
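When using the model without the toolkit, the input still has to be scoped to a single function or class. For Python sources this can be approximated with the standard-library `ast` module. The sketch below is a stand-in for illustration, not the toolkit's tree-sitter implementation, and `extract_function` is a hypothetical helper name.

```python
import ast
from typing import Optional


def extract_function(source: str, name: str) -> Optional[str]:
    """Return the source text of the first function or class called `name`."""
    tree = ast.parse(source)
    lines = source.splitlines()
    targets = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    for node in ast.walk(tree):
        if isinstance(node, targets) and node.name == name:
            # Decorators sit above node.lineno, so start from the earliest one.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            return "\n".join(lines[start - 1 : node.end_lineno])
    return None
```

The returned chunk is what goes into the `<code>` slot of the prompt. After the model replies, the same line range can be replaced in the file with the merged result.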
| ## License | |
| MIT | |