JSCoder: JavaScript Code Completion Model (~300M)
A GPT-style decoder-only language model trained from scratch on ~1B tokens of JavaScript source code (sourced from The Stack). It supports both plain next-token completion and fill-in-the-middle (FIM) autocomplete at the cursor position (StarCoder-style PSM/SPM format).
Architecture
| Hyper-parameter | Value |
|---|---|
| Parameters | ~300M |
| Layers | 24 |
| Hidden dim | 1024 |
| Heads | 16 |
| Context window | 1024 tokens |
| Vocabulary | 8 192 (byte-level BPE, JS-tuned) |
| Positional encoding | RoPE |
| Normalization | RMSNorm |
| Activation | SwiGLU |
| Weight tying | Yes (embedding ↔ lm_head) |
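As a rough sanity check on the "~300M" figure, the table above can be turned into a back-of-the-envelope parameter count. This is a sketch under stated assumptions, not a readout of `model/gpt.py`: it assumes bias-free linear layers, two RMSNorm gains per block, and a SwiGLU hidden width of about 8/3 of the hidden dim, none of which the card states. The `Shape` class and `estimate_params` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Shape:
    n_layer: int = 24        # from the architecture table
    d_model: int = 1024      # from the architecture table
    vocab_size: int = 8192   # from the architecture table
    # Assumption: SwiGLU hidden width of roughly 8/3 * d_model (not stated in the card).
    ffn_hidden: int = int(8 * 1024 / 3)


def estimate_params(s: Shape) -> int:
    """Estimate the parameter count of a tied-embedding decoder-only transformer."""
    attn = 4 * s.d_model * s.d_model      # Q, K, V and output projections, no biases
    ffn = 3 * s.d_model * s.ffn_hidden    # gate, up and down projections of SwiGLU
    norms = 2 * s.d_model                 # two RMSNorm gain vectors per block
    per_block = attn + ffn + norms
    embedding = s.vocab_size * s.d_model  # counted once because lm_head shares it
    final_norm = s.d_model
    # RoPE adds no learned parameters, so positions contribute nothing here.
    return s.n_layer * per_block + embedding + final_norm


total = estimate_params(Shape())
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the estimate lands a little above 300M, which is consistent with the headline figure. A different feed-forward width in the real model would shift the total.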
Files
| File | Description |
|---|---|
| checkpoints/jscoder_300m/ckpt.pt | PyTorch checkpoint (model state-dict + config dict) |
| tokenizer/js_bpe.json | Byte-level BPE tokenizer (HuggingFace tokenizers format) |
| model/gpt.py | Model definition (GPT, GPTConfig) |
| tokenizer/tokenizer.py | JSCoderTokenizer wrapper |
| sample.py | Inference script (plain completion + FIM) |
Quick Start
git clone https://huggingface.co/YOUR_USERNAME/jscoder-300m
cd jscoder-300m
pip install torch tokenizers
Plain completion
python sample.py \
--ckpt checkpoints/jscoder_300m/ckpt.pt \
--prompt "// returns the sum of all numbers in the array
const sumArray = (items) => {
let result = 0;
for (const item of items) {" \
--max-new-tokens 80 --temperature 0.2
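To illustrate what a low temperature such as 0.2 does (together with the `top_k` cutoff used in the Python API example), here is a minimal sketch of one sampling step in plain Python. It is not taken from `sample.py`; the function name `sample_next` and the toy logits are hypothetical, and the real script presumably does the equivalent on tensors.

```python
import math
import random


def sample_next(logits, temperature=0.2, top_k=50, rng=random):
    """Pick one token id from raw logits using top-k filtering and temperature."""
    # Keep only the top_k highest-scoring token ids.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    kept = ranked[:top_k] if top_k else ranked
    # Dividing by a temperature below 1 sharpens the distribution.
    scaled = [logits[i] / temperature for i in kept]
    # Numerically stable softmax weights over the surviving candidates.
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(kept, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]
rng = random.Random(0)
draws = [sample_next(toy_logits, temperature=0.2, top_k=3, rng=rng) for _ in range(1000)]
print({tok: draws.count(tok) for tok in sorted(set(draws))})
```

At temperature 0.2 nearly all draws go to the highest-scoring token, which is usually what you want for code completion. Raising the temperature towards 1.0 spreads the draws across the other candidates.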
Fill-in-the-middle (autocomplete at cursor)
python sample.py \
--ckpt checkpoints/jscoder_300m/ckpt.pt \
--fim \
--prefix $'function sum(arr) {\n let total = 0;\n ' \
--suffix $'\n return total;\n}' \
--temperature 0.2
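For readers unfamiliar with FIM prompts, the following sketch shows how a prefix and suffix might be assembled into PSM and SPM order. The sentinel strings are the ones StarCoder uses and are an assumption here; the special tokens actually defined in `tokenizer/js_bpe.json` may differ, and `build_fim_prompt` is a hypothetical helper, not part of this repository.

```python
# Assumption: StarCoder's sentinel strings; this tokenizer's special tokens may differ.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str, mode: str = "psm") -> str:
    """Arrange the code before and after the cursor into a FIM prompt string."""
    if mode == "psm":
        # Prefix-Suffix-Middle: the model generates the middle after the last sentinel.
        return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"
    if mode == "spm":
        # Suffix-Prefix-Middle: the suffix comes first, so generation continues the prefix.
        return f"{FIM_PREFIX}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{prefix}"
    raise ValueError(f"unknown FIM mode: {mode!r}")


prefix = "function sum(arr) {\n  let total = 0;\n  "
suffix = "\n  return total;\n}"
print(build_fim_prompt(prefix, suffix, mode="psm"))
```

In both layouts the generated text is the missing middle, so generation should stop at the model's end-of-middle or end-of-text token.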
Python API
import torch
from model.gpt import GPT, GPTConfig
from tokenizer.tokenizer import JSCoderTokenizer
ckpt = torch.load("checkpoints/jscoder_300m/ckpt.pt", map_location="cpu")
model = GPT(GPTConfig(**ckpt["config"]))
model.load_state_dict(ckpt["model"])
model.eval()
tok = JSCoderTokenizer.load("tokenizer/js_bpe.json")
prompt = "// parses JSON safely\nfunction parseJSON(str) {\n try {"
ids = tok.encode(prompt)
idx = torch.tensor([ids], dtype=torch.long)
with torch.no_grad():
    out = model.generate(idx, max_new_tokens=100, temperature=0.2, top_k=50)
print(tok.decode(out[0].tolist()))
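Because the context window is 1024 tokens, a long prompt plus `max_new_tokens` can exceed what the model was trained on. Whether `model.generate` already crops its input is not stated here, so the helper below is a defensive sketch; `fit_prompt` is a hypothetical name, not part of the repository.

```python
CONTEXT_WINDOW = 1024  # from the architecture table


def fit_prompt(ids, max_new_tokens, context_window=CONTEXT_WINDOW):
    """Keep the most recent prompt tokens so prompt + completion fit the window."""
    if not 0 < max_new_tokens < context_window:
        raise ValueError("max_new_tokens must be between 1 and context_window - 1")
    budget = context_window - max_new_tokens
    # Slicing from the end keeps the code closest to the cursor.
    return ids[-budget:]


long_prompt = list(range(2000))  # stand-in for tok.encode(...) output
trimmed = fit_prompt(long_prompt, max_new_tokens=100)
print(len(trimmed), trimmed[-1])
```

Trimming from the left keeps the tokens nearest the cursor, which usually matter most for completion. For FIM prompts the prefix and suffix would need to be trimmed separately so the sentinel tokens survive.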
Capability Tiers
The model is most reliable on patterns that dominate its training data:
Tier 1 (high confidence):
- try/catch JSON parse / async fetch wrappers
- for-of accumulators
- Throttle / memoize (when scaffolded with the outer shell)
Tier 2 (partial; right structure, minor logic error):
- Word capitalisation, type guards, number validation
Tier 3 (scaffold required):
- Array.isArray ternaries, Set dedup, Object.assign merge, hasOwnProperty, deep clone
See inference.md for detailed prompt examples and scaffolding
strategies for each tier.
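As a purely illustrative guess at what "scaffolding" a Tier 3 pattern could look like, the sketch below builds a prompt that supplies the comment, the signature and the first lines of the body, leaving the model to finish the loop. It does not reproduce inference.md; `scaffold_prompt` and the `dedupe` example are hypothetical.

```python
def scaffold_prompt(comment: str, signature: str, body_lines: list[str]) -> str:
    """Build a JavaScript prompt that pre-writes the outer shell of a function."""
    lines = [f"// {comment}", f"{signature} {{"]
    lines += [f"  {line}" for line in body_lines]
    # End on an indented new line so the model continues inside the body.
    return "\n".join(lines) + "\n    "


prompt = scaffold_prompt(
    "removes duplicate values from an array",
    "function dedupe(items)",
    ["const seen = new Set();", "for (const item of items) {"],
)
print(prompt)
```

The idea is that naming the data structure (`Set`) and opening the loop steers the model towards the intended pattern, leaving it only the inner body to fill in.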
Training
Trained with a custom PyTorch loop (train.py) on sharded .bin token files
packed from ~1B tokens of JavaScript from The Stack.
- Tokenizer: byte-level BPE, 8 192 vocab, trained on the same corpus
- Optimizer: AdamW, lr=3e-4, cosine decay, warmup=500 iters
- Batch size: 512 tokens × grad-accum 128 → ~65k tokens/step
- Hardware: trained on cloud GPU (A5000+)
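The learning-rate schedule described above (linear warmup for 500 iterations, then cosine decay from 3e-4) can be sketched as follows. The total iteration count and the final learning-rate floor are not given in this card, so both are assumptions here: one pass over ~1B tokens and a floor of 10% of the peak rate. This is not copied from `train.py`.

```python
import math

MAX_LR = 3e-4                # from the card
WARMUP_ITERS = 500           # from the card
TOKENS_PER_STEP = 512 * 128  # 65,536, matching the "~65k tokens/step" figure
# Assumptions: a single pass over ~1B tokens and a floor of 10% of the peak rate.
MAX_ITERS = 1_000_000_000 // TOKENS_PER_STEP
MIN_LR = MAX_LR / 10


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    if step < WARMUP_ITERS:
        return MAX_LR * (step + 1) / WARMUP_ITERS
    if step >= MAX_ITERS:
        return MIN_LR
    progress = (step - WARMUP_ITERS) / (MAX_ITERS - WARMUP_ITERS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + cosine * (MAX_LR - MIN_LR)


for step in (0, WARMUP_ITERS, MAX_ITERS // 2, MAX_ITERS):
    print(step, f"{lr_at(step):.2e}")
```

With these assumed values the run is roughly 15k optimizer steps, so warmup covers only the first few percent of training.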
Limitations
- Trained on JavaScript only; will not generalise to other languages.
- Small vocabulary (8 192) causes slightly longer tokenisation of uncommon identifiers.
- Recursive / divide-and-conquer patterns are weak; the model has not seen enough of them to generalise reliably.
- Not RLHF-tuned; outputs are raw language model completions.
License
MIT