---
license: apache-2.0
language:
- en
tags:
- code
- execution
- prediction
- language-generalization
- no-compiler
- python
- javascript
- lua
- cobol
- synthetic-languages
- transformers
- qwen2
pipeline_tag: text-generation
base_model: Qwen/Qwen2.5-1.5B
library_name: transformers
---

# CaaLM/CaaLM-v1

![CaaLM-v1 Logo](https://cdn-uploads.huggingface.co/production/uploads/670562d6ac129959c16f84d4/lsYHkWaSlewMkpgEaOJNP.png)

## What is this?

CaaLM (Code as a Language Model) is a 1.5B-parameter model that predicts the output of code — without a compiler, runtime, or interpreter.

You give it code. It tells you what it would print.

The interesting part: it was never trained on a fixed set of languages. Instead, it was trained on real languages (Python, JavaScript, Lua, COBOL) alongside 200 synthetically generated fake programming languages — each with randomized syntax but consistent semantics. The goal was to teach the model what *execution* means, not what any specific language looks like.

This means it can predict the output of languages it has never seen before.

## Performance

![Benchmark by Category](https://cdn-uploads.huggingface.co/production/uploads/670562d6ac129959c16f84d4/AZhDOGagSMRSNQmFu9bgC.png)

![Real vs Novel Fake Languages](https://cdn-uploads.huggingface.co/production/uploads/670562d6ac129959c16f84d4/HghKHvXpx-Ddta8on-WqV.png)

**Overall: 96.2% (50/52 tests)**

| Category | Accuracy | Passed/Total |
|---|---|---|
| Real: Python | 100% | 10/10 |
| Real: JavaScript | 100% | 8/8 |
| Real: Lua | 100% | 6/6 |
| Real: COBOL | 75% | 3/4 |
| Novel Fake: Tier 1 (assign + print) | 100% | 8/8 |
| Novel Fake: Tier 2 (conditionals) | 86% | 6/7 |
| Novel Fake: Tier 3 (loops) | 100% | 4/4 |
| Edge Cases | 100% | 5/5 |

The novel fake language tests use languages that were never seen during training — completely invented syntax like `SCRIBBLE @x BECOMES 7` or `WONDER n > 10`. The model infers semantics from context and gets them right.

### Known Failures

Two failures in the benchmark, both explainable:

- **COBOL zero-padding** — predicted `08` instead of `0008`. The value was right, but the `PIC 9(4)` zero-padding format was missed; likely a training-data consistency issue.
- **If-without-else** — when a conditional has no else branch and the condition is false, the correct output is empty. The model predicted `NO`, hallucinating an else branch. Most training data had if/else pairs, so it defaulted to that pattern.
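
For instance, a program of this shape (invented here for illustration, reusing the `BIND`/`WONDER` fake language shown later in this card; it is not one of the benchmark tests) exercises the failure mode:

```
Code:
BIND n TO 5
WONDER n > 10
SHOUT YES
STOP

Output:

```

Since `n > 10` is false and there is no else branch, the correct output is empty; the failure mode is predicting a phantom else-branch value instead.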

## How It Works

Input format:
```
Code:
<your code here>

Output:
```

The model completes the `Output:` section with the predicted stdout.

### Example — Real Language

```
Code:
a = 10
b = 20
print(a + b)

Output:
30
```

### Example — Novel Fake Language (never seen during training)

```
Code:
SCRIBBLE @x BECOMES 7
SCRIBBLE @y BECOMES 3
YELL @x + @y

Output:
10
```

```
Code:
BIND n TO 15
WONDER n > 10
SHOUT YES
STOP

Output:
YES
```

## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "CaaLM/CaaLM-v1",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("CaaLM/CaaLM-v1")
model.eval()

def predict_output(code: str) -> str:
    prompt = f"Code:\n{code}\n\nOutput:\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=128,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )

    return tokenizer.decode(
        outputs[0][inputs.input_ids.shape[1]:],
        skip_special_tokens=True
    ).strip()

# Real language
print(predict_output("a = 6\nb = 7\nprint(a * b)"))
# → 42

# Novel fake language
print(predict_output("STORE X := 10\nSTORE Y := 5\nSPEAK X + Y"))
# → 15
```

## Training

![Training Summary](https://cdn-uploads.huggingface.co/production/uploads/670562d6ac129959c16f84d4/UXPYmNvYDiIsfHR5JC55n.png)

### Data

Training data was split between real and synthetic languages:

**Real languages (8,000 examples total, 2,000 each):**
- Python — clean semantics, baseline
- JavaScript — type coercion, implicit behaviors
- Lua — minimal syntax, sparse
- COBOL — verbose, English-like, no conventional syntax markers

**Synthetic languages (120,000 examples total):**
- 200 procedurally generated fake languages
- Each language has randomized keywords, operators, variable styles, and block delimiters
- Semantics are consistent within each language, but syntax varies wildly across all 200
- Programs generated via a Python simulator — outputs are ground truth from actual execution
- Three complexity tiers: assign + print (30%), conditionals (40%), loops (30%)

The spec for each fake language is discarded after data generation. The model only ever sees `(code, output)` pairs — it never gets a syntax guide.
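
The generation pipeline itself is not published; the following is only a minimal sketch of the idea described above (randomize surface keywords, render a program, and take ground truth from a tiny simulator). All function names and keyword pools here are invented for illustration, not the real generator's vocabulary:

```python
import random

# Hypothetical keyword pools; the real generator's vocabulary is not published.
ASSIGN_KW = ["SCRIBBLE", "BIND", "STORE", "LET"]
PRINT_KW = ["YELL", "SHOUT", "SPEAK", "EMIT"]
BECOMES_KW = ["BECOMES", "TO", ":=", "IS"]

def make_language(rng: random.Random) -> dict:
    """Sample a random surface syntax; the semantics stay fixed."""
    return {
        "assign": rng.choice(ASSIGN_KW),
        "print": rng.choice(PRINT_KW),
        "becomes": rng.choice(BECOMES_KW),
        "sigil": rng.choice(["", "@", "$"]),
    }

def make_tier1_example(lang: dict, rng: random.Random) -> tuple[str, str]:
    """Render an assign+print program; ground truth comes from simulation."""
    env = {}
    lines = []
    for name in ("x", "y"):
        val = rng.randint(1, 9)
        env[name] = val  # the simulator tracks real state
        lines.append(f"{lang['assign']} {lang['sigil']}{name} {lang['becomes']} {val}")
    lines.append(f"{lang['print']} {lang['sigil']}x + {lang['sigil']}y")
    # Output is computed by actually "executing" the program, never guessed.
    return "\n".join(lines), str(env["x"] + env["y"])

rng = random.Random(0)
code, output = make_tier1_example(make_language(rng), rng)
print(code)
print("Output:", output)
```

Because the label is derived from the simulator's state rather than the surface text, any syntax randomization still yields a correct `(code, output)` pair.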

### Configuration

- **Base model:** Qwen/Qwen2.5-1.5B (base, not instruct)
- **Training method:** Full fine-tuning (no LoRA)
- **Loss masking:** Loss computed on output tokens only, not the prompt
- **Precision:** BF16
- **Optimizer:** AdamW (lr=2e-5, weight_decay=0.01)
- **Scheduler:** Cosine with 3% warmup
- **Batch size:** 8 per device × 4 gradient accumulation = 32 effective
- **Epochs:** 3
- **Max sequence length:** 512 tokens
- **Hardware:** NVIDIA A100 SXM4 40GB
- **Training time:** 66.5 minutes
- **Training cost:** ~$0.82
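
On loss masking: the standard way to compute loss on output tokens only is to set the prompt positions' labels to `-100`, the ignore index used by PyTorch cross-entropy and Hugging Face causal LM losses. A framework-free sketch of how such labels could be built (the helper name is illustrative, not from the training code):

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def build_labels(prompt_ids: list[int], output_ids: list[int]) -> tuple[list[int], list[int]]:
    """Concatenate prompt and output token ids; mask the prompt in the labels."""
    input_ids = prompt_ids + output_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + output_ids
    return input_ids, labels

# Toy ids standing in for a tokenized "Code: ... Output:" prompt and its answer.
prompt = [101, 7, 8, 9]
answer = [42, 43]
input_ids, labels = build_labels(prompt, answer)
print(input_ids)  # [101, 7, 8, 9, 42, 43]
print(labels)     # [-100, -100, -100, -100, 42, 43]
```

The model still attends to the prompt; it just receives no gradient for predicting prompt tokens, so all capacity goes into predicting the output.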

## Supported Operations

The model reliably handles:

- Variable assignment and arithmetic
- Print / output statements
- Conditionals (if/else)
- While loops with accumulator patterns
- String output
- Basic error behavior (empty output when conditions are not met; see Known Failures for one exception)

It does not handle functions, recursion, file I/O, complex data structures, pipes, or multi-line string manipulation. These may work in real languages thanks to Qwen's pretraining knowledge, but they are not guaranteed.

## Limitations

- No actual code execution — outputs are predictions, not guarantees
- If-without-else edge cases can produce hallucinated else branches
- COBOL numeric padding format is inconsistent
- Long programs (many steps) may degrade in accuracy as state complexity grows
- Novel fake languages with very unusual execution models (non-linear control flow, stack-based semantics) are untested
- Context window limits programs to ~512 tokens
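
Because outputs are predictions rather than guarantees, predictions for languages you *can* run are cheap to verify against a real interpreter. A sketch for Python using only the standard library (`run_python` and `check_prediction` are illustrative helpers, not part of this model's tooling):

```python
import io
from contextlib import redirect_stdout

def run_python(code: str) -> str:
    """Actually execute Python code and capture its stdout as ground truth."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        exec(code, {})  # exec'ing untrusted code is unsafe; sandbox in practice
    return buf.getvalue().strip()

def check_prediction(code: str, predicted: str) -> bool:
    """Compare a model prediction against the real execution result."""
    return run_python(code) == predicted

# e.g. with predicted = predict_output(code) from the Quick Start helper:
print(check_prediction("a = 6\nb = 7\nprint(a * b)", "42"))  # → True
```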

## Why

The original motivation was to ask: can a language model learn what *execution* means as an abstract concept, independent of any specific language's syntax?

The novel fake language results suggest yes, at least for basic programs. The model sees `WONDER x > 10` for the first time and figures out it's a conditional. It sees `SCRIBBLE @x BECOMES 7` and figures out it's assignment. It doesn't know these keywords — it infers them from the structure of the code and the patterns it learned during training.

Whether this scales to more complex programs, more alien execution models, or larger languages is an open question.

## Model Lineage

CaaLM-v1 is the first model in the CaaLM series, and a spiritual successor to the [LaaLM project](https://huggingface.co/LaaLM).

- **LaaLM-v1** — T5-base fine-tuned to simulate Linux shell commands (external state)
- **LaaLM-exp-v1** — Qwen 3B fine-tuned for conversational Linux terminal emulation (internal state)
- **CaaLM-v1** — Qwen 1.5B fine-tuned for language-agnostic code output prediction (current)

## License

Apache 2.0 (inherited from the Qwen 2.5 base model)