---
datasets:
- bigcode/the-stack
- bigcode/the-stack-v2
- bigcode/starcoderdata
- bigcode/commitpack
- nvidia/OpenCodeReasoning
library_name: transformers
inference: true
tags:
- code
license: mit
pipeline_tag: text-generation
---

# Spec Coder V1
**Spec Coder** is an open-source AI model designed to assist with fundamental coding tasks. It is built on the **Llama architecture**, so it can be loaded with common tools such as **llama.cpp** and **Ollama**, enabling flexible deployment both locally and in the cloud.

Trained on vast datasets, **Spec Coder** excels at generating code, completing code snippets, and understanding programming tasks across multiple languages. It can be used for code completion, debugging, and automated code generation, acting as a versatile assistant for developers.

**Spec Coder** is optimized for integration into developer tools, providing intelligent coding assistance and facilitating research in programming languages. Its transformer-based architecture, with 4 billion parameters, allows it to perform tasks efficiently across different environments.
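As a rough guide to deployment cost (a back-of-envelope estimate, not an official figure from the model's authors), the 4-billion-parameter count translates into the following approximate weight-memory footprints at common precisions, before activations and KV cache:

```python
# Back-of-envelope weight-memory estimate for a 4B-parameter model.
# Illustrative arithmetic only, not measured figures.
params = 4e9

bytes_per_param = {"fp32": 4, "fp16/bf16": 2, "int8": 1, "int4": 0.5}

for dtype, nbytes in bytes_per_param.items():
    gib = params * nbytes / 1024**3
    print(f"{dtype:>9}: ~{gib:.1f} GiB of weights")
```

In 16-bit precision this is roughly 7.5 GiB of weights, which is why quantized formats (e.g. GGUF via llama.cpp) are attractive for consumer GPUs.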

The model supports downstream adaptation, including supervised fine-tuning (SFT) and reinforcement learning (RL), to improve its performance on specific programming tasks.

# Training Data
- Total training tokens: ~4.3 trillion
- Corpus: The Stack, StarCoder training dataset, The Stack v2, CommitPack, OpenCodeReasoning, English Wikipedia

# Training Details
- Context window: 8,192 tokens
- Optimization: standard language modeling objective
- Hardware: cluster of 5× RTX 4090 GPUs
- Training duration: ~140 days (about 4.5 months)
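As a rough cross-check (simple division of the stated figures, assuming near-continuous training), these numbers imply a sustained throughput on the order of 70k tokens per second per GPU:

```python
# Back-of-envelope throughput implied by the stated training figures.
total_tokens = 4.3e12  # ~4.3 trillion tokens
days = 140
gpus = 5

seconds = days * 24 * 3600
cluster_tps = total_tokens / seconds
per_gpu_tps = cluster_tps / gpus

print(f"cluster: ~{cluster_tps:,.0f} tokens/s")
print(f"per GPU: ~{per_gpu_tps:,.0f} tokens/s")
```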

# Benchmarks
## RepoBench 1.1 (Python)
| Model | 2k | 4k | 8k | 12k | 16k | Avg | Avg ≤ 8k |
|---|---|---|---|---|---|---|---|
| Spec-Coder-4b-V1 | 30.42% | 38.55% | 36.91% | 32.75% | 30.34% | 34.59% | 36.23% |

## Syntax-Aware Fill-in-the-Middle (SAFIM)
| Model | Algorithmic | Control | API | Average |
|---|---|---|---|---|
| Spec-Coder-4b-V1 | 38.22% | 41.18% | 60.45% | 46.28% |

## HumanEval Infilling
| Model | Single-Line | Multi-Line | Random Span |
|---|---|---|---|
| Spec-Coder-4b-V1 | 72.34% | 45.65% | 39.12% |

# Limitations
- **Biases**: The model may reflect biases present in the public codebases it was trained on.
- **Security**: Code generated by the model may contain security vulnerabilities. Always review and audit generated code for potential risks before using it.
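One lightweight first line of defense (a sketch, not a substitute for a real security audit) is to verify that generated Python at least parses, and to flag obviously dangerous calls before anything is executed:

```python
import ast

def quick_check(source):
    """Coarse filter for generated Python: must parse, and must not call
    a few obviously dangerous functions. Not a substitute for an audit."""
    try:
        tree = ast.parse(source)
    except SyntaxError as err:
        return False, f"syntax error: {err.msg}"
    flagged = {"eval", "exec", "system"}
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            # Handle both bare names (eval) and attributes (os.system).
            name = getattr(func, "id", None) or getattr(func, "attr", None)
            if name in flagged:
                return False, f"flagged call: {name}"
    return True, "ok"

print(quick_check("def add(a, b):\n    return a + b"))  # passes
print(quick_check("import os\nos.system('rm -rf /')"))  # flagged
```

This catches only crude problems; injection flaws, unsafe dependencies, and logic bugs still require human review.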

# Sample Usage
Here is an example of running **Spec Coder** with the `transformers` library:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "SVECTOR-CORPORATION/Spec-Coder-4b-V1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Prompt with the start of a function and let the model complete it.
input_code = "def factorial(n):\n    if n == 0:"
inputs = tokenizer(input_code, return_tensors="pt")

# Pass the attention mask along with the input IDs, and bound the number
# of newly generated tokens rather than the total sequence length.
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    num_return_sequences=1,
)

generated_code = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated Python code:\n", generated_code)
```