JSCoder - JavaScript Code Completion Model (~300M)

A GPT-style decoder-only language model trained from scratch on ~1B tokens of JavaScript source code (sourced from The Stack). It supports both plain next-token completion and fill-in-the-middle (FIM) autocomplete at the cursor position (StarCoder-style PSM/SPM format).

Architecture

Hyper-parameter       Value
Parameters            ~300M
Layers                24
Hidden dim            1024
Heads                 16
Context window        1024 tokens
Vocabulary            8 192 (byte-level BPE, JS-tuned)
Positional encoding   RoPE
Normalization         RMSNorm
Activation            SwiGLU
Weight tying          Yes (embedding ↔ lm_head)
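
As a sanity check on the ~300M figure, the parameter count can be estimated from the values above. This is a rough sketch, not the model's own accounting: the feed-forward width is not stated in this README, so `ffn_dim` is an assumed value (about 8/3 of the hidden dim, a common choice for SwiGLU), and bias-free linear layers are assumed as well.

```python
def estimate_params(n_layer=24, d_model=1024, vocab=8192,
                    ffn_dim=2730, tied=True):
    """Rough parameter estimate for a decoder-only transformer.

    ffn_dim is an ASSUMPTION (about 8/3 * d_model); the README does not state it.
    Linear layers are assumed to have no bias terms. RoPE adds no parameters.
    """
    attn = 4 * d_model * d_model      # Q, K, V and output projections
    mlp = 3 * d_model * ffn_dim       # SwiGLU: gate, up and down projections
    norms = 2 * d_model               # two RMSNorm gain vectors per block
    per_layer = attn + mlp + norms
    embed = vocab * d_model           # token embedding table
    head = 0 if tied else vocab * d_model  # lm_head shares the embedding when tied
    final_norm = d_model
    return n_layer * per_layer + embed + head + final_norm


if __name__ == "__main__":
    print(f"{estimate_params() / 1e6:.1f}M parameters (estimated)")
```

Under these assumptions the estimate lands slightly above 300M; a 4x feed-forward width would instead give roughly 400M, so the stated size suggests a narrower SwiGLU block.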

Files

File                               Description
checkpoints/jscoder_300m/ckpt.pt   PyTorch checkpoint (model state-dict + config dict)
tokenizer/js_bpe.json              Byte-level BPE tokenizer (HuggingFace tokenizers format)
model/gpt.py                       Model definition (GPT, GPTConfig)
tokenizer/tokenizer.py             JSCoderTokenizer wrapper
sample.py                          Inference script (plain completion + FIM)

Quick Start

git clone https://huggingface.co/YOUR_USERNAME/jscoder-300m
cd jscoder-300m
pip install torch tokenizers

Plain completion

python sample.py \
  --ckpt checkpoints/jscoder_300m/ckpt.pt \
  --prompt "// returns the sum of all numbers in the array
const sumArray = (items) => {
  let result = 0;
  for (const item of items) {" \
  --max-new-tokens 80 --temperature 0.2

Fill-in-the-middle (autocomplete at cursor)

python sample.py \
  --ckpt checkpoints/jscoder_300m/ckpt.pt \
  --fim \
  --prefix $'function sum(arr) {\n  let total = 0;\n  ' \
  --suffix $'\n  return total;\n}' \
  --temperature 0.2
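
For readers unfamiliar with the PSM/SPM layouts, the sketch below shows how a StarCoder-style FIM prompt is typically assembled from a prefix and a suffix. The sentinel strings are StarCoder's names and are an assumption here; the special tokens this model actually uses may differ (check tokenizer/js_bpe.json and sample.py). `build_fim_prompt` and `splice_completion` are hypothetical helpers, not part of this repository.

```python
# Sentinel names follow StarCoder; ASSUMED, not confirmed for this tokenizer.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str, mode: str = "psm") -> str:
    """Arrange the text before and after the cursor into a FIM prompt."""
    if mode == "psm":
        # Prefix-Suffix-Middle: the model generates the middle after the last sentinel.
        return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"
    if mode == "spm":
        # Suffix-Prefix-Middle: the suffix comes first, so generation
        # continues directly from the end of the prefix.
        return f"{FIM_PREFIX}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{prefix}"
    raise ValueError(f"unknown FIM mode: {mode!r}")


def splice_completion(prefix: str, middle: str, suffix: str) -> str:
    """Put the generated middle back at the cursor position."""
    return prefix + middle + suffix


if __name__ == "__main__":
    prefix = "function sum(arr) {\n  let total = 0;\n  "
    suffix = "\n  return total;\n}"
    print(build_fim_prompt(prefix, suffix, mode="psm"))
```

In either layout the generated text is the missing middle only; the editor is expected to splice it between the original prefix and suffix.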

Python API

import torch
from model.gpt import GPT, GPTConfig
from tokenizer.tokenizer import JSCoderTokenizer

ckpt = torch.load("checkpoints/jscoder_300m/ckpt.pt", map_location="cpu")
model = GPT(GPTConfig(**ckpt["config"]))
model.load_state_dict(ckpt["model"])
model.eval()

tok = JSCoderTokenizer.load("tokenizer/js_bpe.json")

prompt = "// parses JSON safely\nfunction parseJSON(str) {\n  try {"
ids = tok.encode(prompt)
idx = torch.tensor([ids], dtype=torch.long)

with torch.no_grad():
    out = model.generate(idx, max_new_tokens=100, temperature=0.2, top_k=50)

print(tok.decode(out[0].tolist()))
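
The `temperature` and `top_k` arguments above control how each next token is drawn. The following dependency-free sketch illustrates the usual meaning of these two knobs; whether `model.generate` implements them exactly this way is an assumption, and `sample_top_k` is a hypothetical helper written only for illustration.

```python
import math
import random


def sample_top_k(logits, temperature=0.2, top_k=50, rng=random):
    """Draw one token index using temperature scaling and top-k filtering.

    Illustrative only; temperature must be > 0.
    """
    k = min(top_k, len(logits))
    # Keep the indices of the k highest-scoring tokens.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Lower temperature sharpens the distribution towards the best token.
    scaled = [logits[i] / temperature for i in top]
    # Subtract the max before exponentiating for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = [0.1, 2.5, 0.3, 2.4]
    print(sample_top_k(toy_logits, temperature=0.2, top_k=2))
```

A low temperature such as 0.2 keeps completions close to the most likely continuation, which tends to suit code autocomplete.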

Capability Tiers

The model is most reliable on patterns that dominate its training data:

Tier 1 - high confidence:

  • try/catch JSON parsing and async fetch wrappers
  • for-of accumulators
  • Throttle / memoize (when scaffolded with the outer shell)

Tier 2 - partial (right structure, minor logic error):

  • Word capitalisation, type guards, number validation

Tier 3 - scaffold required:

  • Array.isArray ternaries, Set dedup, Object.assign merge, hasOwnProperty, deep clone

See inference.md for detailed prompt examples and scaffolding strategies for each tier.

Training

Trained with a custom PyTorch loop (train.py) on sharded .bin token files packed from ~1B tokens of JavaScript from The Stack.

Tokenizer:  byte-level BPE, 8 192 vocab, trained on the same corpus
Optimizer:  AdamW, lr=3e-4, cosine decay, warmup=500 iters
Batch size: 512 tokens × grad-accum 128 → ~65k tokens/step
Hardware:   trained on cloud GPU (A5000+)
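
The settings above can be turned into a small worked example: tokens per optimizer step, the number of steps for one pass over ~1B tokens, and a warmup-plus-cosine learning-rate curve. This is a sketch under stated assumptions, not the code in train.py: the total iteration count (one pass over the data) and the minimum learning rate (10% of the peak) are not given in this README and are chosen here only for illustration.

```python
import math

TOKENS_PER_STEP = 512 * 128                            # 65,536 tokens per optimizer step
STEPS_PER_PASS = 1_000_000_000 / TOKENS_PER_STEP       # about 15,259 steps for ~1B tokens


def lr_at(it, max_lr=3e-4, warmup_iters=500,
          max_iters=15_259, min_lr=3e-5):
    """Linear warmup followed by cosine decay.

    max_iters and min_lr are ASSUMPTIONS; only lr=3e-4 and warmup=500
    come from the README.
    """
    if it < warmup_iters:
        # Ramp linearly up to max_lr over the warmup period.
        return max_lr * (it + 1) / warmup_iters
    if it >= max_iters:
        return min_lr
    # Fraction of the decay phase completed, in [0, 1).
    progress = (it - warmup_iters) / (max_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (max_lr - min_lr)


if __name__ == "__main__":
    print(f"tokens/step: {TOKENS_PER_STEP}")
    for it in (0, 499, 5_000, 15_259):
        print(it, f"{lr_at(it):.2e}")
```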

Limitations

  • Trained on JavaScript only; will not generalise to other languages.
  • Small vocabulary (8 192) causes slightly longer tokenisation of uncommon identifiers.
  • Recursive / divide-and-conquer patterns are weak: the model has not seen enough of them to generalise reliably.
  • Not RLHF-tuned; outputs are raw language model completions.

License

MIT
