	---
	license: apache-2.0
	library_name: transformers
	pipeline_tag: text-classification
	tags:
	- text-classification
	- code
	- programming-language-identification
	- language-detection
	- modernbert
	base_model: answerdotai/ModernBERT-base
	datasets:
	- cakiki/rosetta-code
	- bigcode/the-stack
	metrics:
	- accuracy
	- f1
	---

	# Programming Language Identification (100+ languages)

	A ModernBERT classifier that identifies the programming language of a code
	snippet across 107 languages.

	## Inference

	### PyTorch

	```python
	import torch
	from transformers import AutoModelForSequenceClassification, AutoTokenizer

	model_id = "FrameByFrame/programming-language-identification-100plus"
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForSequenceClassification.from_pretrained(
	    model_id,
	    attn_implementation="eager",
	    torch_dtype=torch.bfloat16,
	).eval()

	code = "def greet(name: str) -> None:\n print(f'hello, {name}')"
	inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
	with torch.no_grad():
	    logits = model(**inputs).logits
	print(model.config.id2label[int(logits.argmax(-1))]) # -> "Python"
	```

	### Batch

	```python
	# py_code, rust_code and go_code are placeholders for your own code strings.
	snippets = [py_code, rust_code, go_code]
	inputs = tokenizer(
	    snippets, return_tensors="pt", padding=True, truncation=True, max_length=512
	)
	with torch.no_grad():
	    logits = model(**inputs).logits
	for i, pred in enumerate(logits.argmax(-1).tolist()):
	    print(snippets[i][:40].splitlines()[0], "→", model.config.id2label[pred])
	```
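
`argmax` returns only the top label. When a confidence score or a short list of candidates is more useful, the logits can be passed through a softmax first. The following is a minimal sketch that works on one row of logits as a plain Python list, so it runs without PyTorch; with the model above you would pass something like `logits[0].float().tolist()`. The `top_k` helper, the three-label mapping, and the logit values are illustrative assumptions, not part of this repository.

```python
import math


def top_k(logits, id2label, k=3):
    """Return the k most likely (label, probability) pairs for one row of logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Hypothetical stand-ins for model.config.id2label and one row of logits.
demo_labels = {0: "Python", 1: "Ruby", 2: "Go"}
demo_logits = [4.0, 1.0, 0.5]

for label, prob in top_k(demo_logits, demo_labels, k=2):
    print(f"{label}: {prob:.3f}")
```

A low top-1 probability is a reasonable signal to fall back on other evidence, such as the file extension.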

	### ONNX Runtime

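	An ONNX export lives in `onnx/`. Use it for CPU or GPU inference with ONNX
	Runtime. The exported graph itself does not need PyTorch, which is handy for
	non-Python consumers and edge deployments; the Optimum example here still
	passes PyTorch tensors (`return_tensors="pt"`).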

	```python
	from optimum.onnxruntime import ORTModelForSequenceClassification
	from transformers import AutoTokenizer

	model_id = "FrameByFrame/programming-language-identification-100plus"
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	ort_model = ORTModelForSequenceClassification.from_pretrained(
	    model_id, subfolder="onnx"
	)

	# `code` is the snippet string defined in the PyTorch example above.
	inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
	logits = ort_model(**inputs).logits
	print(ort_model.config.id2label[int(logits.argmax(-1))])
	```

	[Open Inference Notebook](https://huggingface.co/FrameByFrame/programming-language-identification-100plus/blob/main/inference_examples.ipynb) — download and run in Colab or Jupyter.

	## Evaluation

	Held-out validation split (9,495 rows, 107 labels):

	| metric | value |
	|---|---|
	| macro F1 | 0.9206 |
	| accuracy | 0.9306 |
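
Macro F1 averages the per-label F1 scores with equal weight, so a rarely seen language counts as much as a common one, while accuracy is dominated by the frequent labels. The sketch below computes both from two label lists using only the standard library. The function name and the toy labels are illustrative assumptions; the numbers reported here were not necessarily produced with this code.

```python
from collections import Counter


def accuracy_and_macro_f1(y_true, y_pred):
    """Compute accuracy and unweighted (macro) F1 over all labels seen."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gets a false positive
            fn[truth] += 1  # true label gets a false negative
    f1_scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1_scores.append(2 * tp[label] / denom if denom else 0.0)
    accuracy = sum(tp.values()) / len(y_true)
    return accuracy, sum(f1_scores) / len(f1_scores)


# Toy example: the single COBOL sample is misclassified as Python.
y_true = ["Python", "Python", "Python", "COBOL"]
y_pred = ["Python", "Python", "Python", "Python"]
acc, macro_f1 = accuracy_and_macro_f1(y_true, y_pred)
print(f"accuracy={acc:.4f} macro_f1={macro_f1:.4f}")
```

In the toy example accuracy stays at 0.75 while macro F1 drops to about 0.43, because the missed COBOL label contributes an F1 of zero.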


	Wins on every shared label. Largest gaps: ARM Assembly +0.354, Erlang +0.270,
	COBOL +0.216, Pascal +0.206, Fortran +0.193, Mathematica/Wolfram +0.173.

	## Supported languages (107)

	ABAP, APL, ARM Assembly, ATS, Ada, ActionScript, AppleScript, AutoHotkey,
	AutoIt, Awk, BASIC, BQN, Batchfile, Befunge, C, C#, C++, COBOL, Ceylon,
	Clojure, CoffeeScript, ColdFusion, Common Lisp, Component Pascal, Crystal, D,
	Dart, E, Eiffel, Elixir, Emacs Lisp, Erlang, Euphoria, F#, Factor, Fantom,
	Forth, Fortran, FreeBASIC, GAP, Go, Groovy, Haskell, Haxe, IDL, Io, J, Java,
	JavaScript, Julia, Kotlin, LabVIEW, LFE, Lasso, Logtalk, Lua, M, M4, MATLAB,
	MAXScript, Mathematica/Wolfram Language, Mercury, Modula-2, Modula-3, Nemerle,
	NewLisp, Nim, OCaml, Objective-C, Oz, PHP, Pascal, Perl, Pike, PicoLisp,
	PowerShell, Processing, Prolog, PureBasic, Python, QuickBASIC, R, REXX, Raku,
	Racket, Rebol, Red, Ring, Ruby, Rust, SAS, Scala, Scheme, Scilab, Smalltalk,
	Standard ML, Stata, Swift, Tcl, V, VBA, VBScript, Vala, Visual Basic .NET,
	Wren, Zig, jq

	## Training data

	91,209 code samples across 107 languages, drawn from Rosetta Code
	(`cakiki/rosetta-code`) and The Stack v1 (`bigcode/the-stack`). Labels were
	independently verified by an LLM judge, and a small set of high-confidence
	mislabels between mainstream languages was removed.

	Splits are grouped by task to prevent task-level leakage:
	72,549 / 9,495 / 8,880 rows (train / val / test).
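
Grouping by task means every solution to the same task lands in the same split, so the model is never evaluated on a task it saw during training in another language. One way to get that property is to derive the split from a hash of the task name. This is a sketch under assumptions: the `task` field, the split fractions, and the hashing approach are illustrative and not necessarily how these splits were built.

```python
import hashlib


def assign_split(task, val_frac=0.1, test_frac=0.1):
    """Map a task name to a split deterministically, so a task never straddles splits."""
    digest = hashlib.sha256(task.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + val_frac:
        return "val"
    return "train"


# Hypothetical samples; `task` is an assumed field name.
samples = [
    {"task": "FizzBuzz", "language": "Python"},
    {"task": "FizzBuzz", "language": "Rust"},
    {"task": "Quicksort", "language": "Go"},
]
for sample in samples:
    sample["split"] = assign_split(sample["task"])
    print(sample["task"], sample["language"], sample["split"])
```

Both FizzBuzz samples receive the same split regardless of language, which is the property that prevents task-level leakage.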

	## Limitations

	- Only the first 512 tokens of each input are used (the `max_length=512`
	setting in the inference examples); longer files are truncated before
	classification.
	- The classifier is purely content-based. If you have file extensions, treat
	them as a strong prior in a production pipeline.
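
One simple way to use a file extension as a prior is to up-weight the classifier's probabilities for the languages that extension usually indicates, then renormalize. The sketch below assumes you already have per-language probabilities as a dict. The `EXTENSION_HINTS` table, the `boost` factor, and the example probabilities are hypothetical values chosen for illustration.

```python
import os

# Hypothetical, partial mapping; some extensions are shared by several languages.
EXTENSION_HINTS = {
    ".py": {"Python"},
    ".rs": {"Rust"},
    ".m": {"MATLAB", "Objective-C"},
}


def rerank_with_extension(probs, filename, boost=5.0):
    """Multiply probabilities of extension-compatible languages by `boost`, then renormalize."""
    ext = os.path.splitext(filename)[1].lower()
    hinted = EXTENSION_HINTS.get(ext, set())
    weighted = {
        lang: p * (boost if lang in hinted else 1.0) for lang, p in probs.items()
    }
    total = sum(weighted.values())
    return {lang: w / total for lang, w in weighted.items()}


# Illustrative classifier output for an ambiguous snippet.
probs = {"MATLAB": 0.40, "Objective-C": 0.35, "Python": 0.25}
reranked = rerank_with_extension(probs, "solver.py")
print(max(reranked, key=reranked.get))
```

For an ambiguous extension such as `.m`, the prior narrows the candidates and the classifier's content-based scores still decide among them.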