---
license: apache-2.0
library_name: transformers
pipeline_tag: text-classification
tags:
- text-classification
- code
- programming-language-identification
- language-detection
- modernbert
base_model: answerdotai/ModernBERT-base
datasets:
- cakiki/rosetta-code
- bigcode/the-stack
metrics:
- accuracy
- f1
---
# Programming Language Identification (100+ languages)

A ModernBERT classifier that identifies the programming language of a code
snippet across **107 languages**.
## Inference

### PyTorch

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "FrameByFrame/programming-language-identification-100plus"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    attn_implementation="eager",
    torch_dtype=torch.bfloat16,
).eval()

code = "def greet(name: str) -> None:\n    print(f'hello, {name}')"
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[int(logits.argmax(-1))])  # -> "Python"
```
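
If a confidence score or the runner-up labels are useful, the logits can be turned into probabilities with a softmax. The sketch below is a minimal, standard-library-only illustration. `top_k_labels` is a hypothetical helper, not something shipped with the model, and it assumes the logits for one snippet have been converted to a plain list of floats (for example with `logits[0].float().tolist()`).

```python
import math


def top_k_labels(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs for one snippet.

    `logits` is a flat list of floats, one per class, and `id2label` maps a
    class index to its name. Both parameter names are illustrative.
    """
    # Subtract the largest logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank class indices by descending probability and keep the first k.
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Toy logits for three classes (made-up numbers, not model output).
demo = top_k_labels([2.0, 0.5, -1.0], {0: "Python", 1: "Ruby", 2: "Perl"}, k=2)
print([label for label, _ in demo])  # ['Python', 'Ruby']
```

With the real model, `id2label` would be `model.config.id2label`, and a low top probability can serve as a signal to fall back to another heuristic.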
### Batch

```python
# Reuses `tokenizer` and `model` from the PyTorch example above.
# The three snippets are illustrative sample inputs.
py_code = "def add(a, b):\n    return a + b"
rust_code = 'fn main() {\n    println!("hello");\n}'
go_code = 'package main\n\nfunc main() {\n\tprintln("hello")\n}'
snippets = [py_code, rust_code, go_code]  # list of strings

inputs = tokenizer(
    snippets, return_tensors="pt", padding=True, truncation=True, max_length=512
)
with torch.no_grad():
    logits = model(**inputs).logits
for i, pred in enumerate(logits.argmax(-1).tolist()):
    print(snippets[i][:40].splitlines()[0], "→", model.config.id2label[pred])
```
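
With `padding=True`, every sequence in a batch is padded to the longest one, so memory use grows with the number of snippets passed at once. For a long list of inputs, one option is to run the batch code over fixed-size chunks. The helper below is a small sketch of that idea; `chunked` and the batch size are illustrative choices, not part of the model repo.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical usage: each chunk would be passed to the tokenizer and model
# exactly like `snippets` in the batch example above.
chunks = list(chunked(["a", "b", "c", "d", "e"], batch_size=2))
print(chunks)  # [['a', 'b'], ['c', 'd'], ['e']]
```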
### ONNX Runtime

An ONNX export lives in `onnx/`. Use it for CPU or GPU inference without
pulling PyTorch, which is handy for non-Python consumers and edge deployments.

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model_id = "FrameByFrame/programming-language-identification-100plus"
tokenizer = AutoTokenizer.from_pretrained(model_id)
ort_model = ORTModelForSequenceClassification.from_pretrained(
    model_id, subfolder="onnx"
)

code = "def greet(name: str) -> None:\n    print(f'hello, {name}')"
# return_tensors="pt" mirrors the PyTorch example, so this path still needs torch.
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
logits = ort_model(**inputs).logits
print(ort_model.config.id2label[int(logits.argmax(-1))])
```

**[Open Inference Notebook](https://huggingface.co/FrameByFrame/programming-language-identification-100plus/blob/main/inference_examples.ipynb)**: download and run in Colab or Jupyter.
## Evaluation

Held-out validation split (9,495 rows, 107 labels):

| metric | value |
|---|---|
| macro F1 | **0.9206** |
| accuracy | 0.9306 |

In the per-label comparison, this model wins on every shared label. Largest
gaps: ARM Assembly +0.354, Erlang +0.270, COBOL +0.216, Pascal +0.206,
Fortran +0.193, Mathematica/Wolfram +0.173.
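
For readers comparing the two reported metrics: accuracy weights every row equally, while macro F1 averages the per-label F1 scores, so rare languages count as much as common ones. The card does not say which tooling produced its numbers; the following is only a standard-library sketch of the usual definitions, with `accuracy_and_macro_f1` as a hypothetical helper name.

```python
from collections import Counter


def accuracy_and_macro_f1(y_true, y_pred):
    """Compute accuracy and unweighted mean per-label F1 for two label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative

    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1_scores.append(2 * tp[label] / denom if denom else 0.0)

    accuracy = sum(tp.values()) / len(y_true)
    return accuracy, sum(f1_scores) / len(f1_scores)


# Toy example: one rare-class miss barely dents accuracy but halves macro F1.
acc, macro_f1 = accuracy_and_macro_f1(
    ["C", "C", "C", "Go"],
    ["C", "C", "C", "C"],
)
print(round(acc, 4), round(macro_f1, 4))  # 0.75 0.4286
```

This gap between the two numbers is why macro F1 is the more informative figure for a label set with many low-resource languages.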
## Supported languages (107)

ABAP, APL, ARM Assembly, ATS, Ada, ActionScript, AppleScript, AutoHotkey,
AutoIt, Awk, BASIC, BQN, Batchfile, Befunge, C, C#, C++, COBOL, Ceylon,
Clojure, CoffeeScript, ColdFusion, Common Lisp, Component Pascal, Crystal, D,
Dart, E, Eiffel, Elixir, Emacs Lisp, Erlang, Euphoria, F#, Factor, Fantom,
Forth, Fortran, FreeBASIC, GAP, Go, Groovy, Haskell, Haxe, IDL, Io, J, Java,
JavaScript, Julia, Kotlin, LabVIEW, LFE, Lasso, Logtalk, Lua, M, M4, MATLAB,
MAXScript, Mathematica/Wolfram Language, Mercury, Modula-2, Modula-3, Nemerle,
NewLisp, Nim, OCaml, Objective-C, Oz, PHP, Pascal, Perl, Pike, PicoLisp,
PowerShell, Processing, Prolog, PureBasic, Python, QuickBASIC, R, REXX, Raku,
Racket, Rebol, Red, Ring, Ruby, Rust, SAS, Scala, Scheme, Scilab, Smalltalk,
Standard ML, Stata, Swift, Tcl, V, VBA, VBScript, Vala, Visual Basic .NET,
Wren, Zig, jq
## Training data

91,209 code samples across 107 languages, drawn from Rosetta Code
(`cakiki/rosetta-code`) and The Stack v1 (`bigcode/the-stack`). Labels were
independently verified by an LLM judge, and a small set of high-confidence
mislabels between mainstream languages was removed.

Splits are grouped by task to prevent task-level leakage:
72,549 / 9,495 / 8,880 rows (train / val / test).
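
Grouping by task means that all solutions to one task land in the same split, so the model is never evaluated on a task it saw during training in another language. The card does not describe how its split was implemented; the sketch below only illustrates the general idea. The `task` field name, the 80/10/10 fractions, and `split_by_group` itself are assumptions made for the example.

```python
import random
from collections import defaultdict


def split_by_group(rows, group_key, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split rows into train/val/test so each group appears in exactly one split."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    # Shuffle the group names, not the rows, so a group is never divided.
    names = sorted(groups)
    random.Random(seed).shuffle(names)

    n_train = int(len(names) * fractions[0])
    n_val = int(len(names) * fractions[1])
    buckets = {
        "train": names[:n_train],
        "val": names[n_train:n_train + n_val],
        "test": names[n_train + n_val:],
    }
    return {
        split: [row for name in members for row in groups[name]]
        for split, members in buckets.items()
    }


# Toy data: 10 hypothetical tasks, each solved in three languages.
rows = [
    {"task": f"task-{t}", "language": lang}
    for t in range(10)
    for lang in ("Python", "Rust", "Go")
]
splits = split_by_group(rows, group_key="task")
print({name: len(part) for name, part in splits.items()})
# {'train': 24, 'val': 3, 'test': 3}
```

Because whole groups are assigned, the row counts per split only approximate the requested fractions when groups differ in size.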
## Limitations

- Inputs are truncated to the first **512 tokens** (`max_length=512` in the
  examples above), so longer files are classified from their opening portion
  only.
- The classifier is purely content-based. If you have file extensions, treat
  them as a strong prior in a production pipeline.
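
One way to treat the extension as a prior is to up-weight the labels that the extension allows and renormalise, leaving the classifier to choose among them. The snippet below is a rough sketch under that assumption; `apply_extension_prior`, the `boost` factor, and the extension table are all hypothetical and would need tuning against real data.

```python
import os


def apply_extension_prior(probs, filename, ext_to_labels, boost=10.0):
    """Reweight classifier probabilities using the file extension, if known.

    probs: dict mapping label -> probability from the classifier.
    ext_to_labels: dict mapping a lowercase extension to the set of labels
    that plausibly use it (an illustrative table, not shipped with the model).
    """
    ext = os.path.splitext(filename)[1].lower()
    candidates = ext_to_labels.get(ext)
    if not candidates:
        # Unknown or missing extension: keep the content-based prediction.
        return dict(probs)

    weighted = {
        label: p * (boost if label in candidates else 1.0)
        for label, p in probs.items()
    }
    total = sum(weighted.values())
    return {label: w / total for label, w in weighted.items()}


# Toy probabilities (made up): the extension rules in Perl and Prolog,
# and the classifier's own scores decide between the two.
probs = {"Raku": 0.45, "Perl": 0.30, "Prolog": 0.25}
adjusted = apply_extension_prior(probs, "solve.pl", {".pl": {"Perl", "Prolog"}})
print(max(adjusted, key=adjusted.get))  # Perl
```

A multiplicative boost keeps the content-based signal in play, so an obviously mislabelled file can still override its extension when the classifier is confident enough.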