PermuFormer / README.md
ACDRepo's picture
Upload folder using huggingface_hub
eb03713 verified
|
Raw
History Blame Contribute Delete
5.88 kB
---
library_name: transformers
pipeline_tag: text-generation
tags:
- math
- combinatorics
- permutations
- algebraic-combinatorics
- llama
- causal-lm
---
# PermuFormer
PermuFormer is a small Llama-style causal language model trained on symbolic permutation tasks from algebraic combinatorics. It is intended as a specialist base model for permutation representation, reasoning, and finetuning experiments rather than as a general natural-language assistant.
The model operates on a compact word-level vocabulary for permutation syntax. Training examples are stored as pre-tokenized lists of tokens; at inference time, the Hugging Face tokenizer can also consume equivalent whitespace-separated strings. Prompts are formulaic equations: the left side specifies a permutation task and generation begins after the `=` token.
## Model Details
- **Architecture:** `LlamaForCausalLM`
- **Parameters:** about 75.7M
- **Layers:** 12
- **Hidden size:** 768
- **Attention heads:** 12 query heads, 4 key/value heads
- **MLP intermediate size:** 2048
- **Activation:** SiLU/SwiGLU
- **Position encoding:** RoPE, theta 10000
- **Vocabulary size:** 186
- **Context length used by tokenizer:** 1000 tokens
- **Checkpoint:** `step_2600000`
## Training Data
PermuFormer was trained autoregressively on synthetic permutation examples generated with exact combinatorial algorithms. The paper describes a dataset of 39.8M instances, approximately 2.66B tokens, over the symmetric groups `S_2` through `S_11`.
Training tasks cover three broad families:
- **Translation between encodings:** one-line notation, cycle notation, reduced Coxeter expressions, RSK tableaux, inversion vectors, and Lehmer codes.
- **Permutation statistics and properties:** length, descents, fixed points, sign/parity, cycle type, RSK shape, pattern avoidance, longest increasing/decreasing subsequences, and related statistics.
- **Algebraic operations and comparisons:** product/composition, inverse, powers, conjugation, commutator, relative products, multiplication by simple transpositions, complement, reverse, descent tests, and Bruhat order.
Some targets include computational witnesses before the final answer, for example inversion lists before a length answer or pattern witnesses before an avoidance answer.
## Usage
Use deterministic decoding for most evaluation-style tasks. Make sure special token IDs come from the tokenizer.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "YOUR_ORG/permuformer"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()
prompt = (
"<|endoftext|> n3 "
"1linebegin [ 3 , 1 , 2 ] 1lineend "
"in cyclenotationmake ="
)
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
output_ids = model.generate(
**inputs,
max_new_tokens=80,
do_sample=False,
eos_token_id=tokenizer.eos_token_id,
pad_token_id=tokenizer.pad_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=False))
```
### Prompt Format
Training data is represented as lists of token strings. When writing prompts as plain text, separate every token with spaces. Multi-digit integers, delimiters, and task names are individual tokens. A typical example starts with `<|endoftext|>`, then a size token such as `n7`, then the task expression, then `=`.
Translation example:
```text
<|endoftext|> n3 1linebegin [ 3 , 1 , 2 ] 1lineend in cyclenotationmake =
```
Property example:
```text
<|endoftext|> n3 1linebegin [ 3 , 2 , 1 ] 1lineend property lengthmake =
```
Algebraic operation example:
```text
<|endoftext|> n3 1linebegin [ 2 , 1 , 3 ] 1lineend inversemake =
```
## Evaluation Notes
The training code evaluates by exact match on the generated right-hand side after `=`. The local training log for this repository reports, at step 2,522,000 on a 2,560-example stratified evaluation sample:
- Overall exact match: **98.44%**
- Translation: **97.78%**
- Property/statistic tasks: **99.17%**
- Algebraic tasks: **98.36%**
These figures are from the local log and should be treated as checkpoint-adjacent repository metadata, not a full benchmark report for every downstream setting.
The paper also reports that PermuFormer is substantially more accurate than frontier general-purpose LLMs on a small held-out sample from the model's symbolic test distribution, while noting that the comparison is imperfect because PermuFormer was trained directly in this syntax.
## Finetuning
PermuFormer is designed to be finetuned on specialized permutation tasks. Experiments in the paper include:
- 231-avoidance and 2143-avoidance
- mHeight
- Schubert polynomial structure constants
- Kazhdan-Lusztig polynomial degree prediction
The repository's finetuning scripts compare starting from this pretrained checkpoint with training the same architecture from scratch.
## Limitations
- This is a specialist symbolic model. It expects the exact whitespace-tokenized syntax used during training and is brittle to natural-language paraphrases or malformed prompts.
- The model is trained on permutations of sizes represented in the training data, primarily `S_2` through `S_11`; behavior outside that regime is not guaranteed.
- Exact-match accuracy depends on canonical output formatting. Some mathematical tasks may have multiple valid answers, but evaluation expects the chosen canonical form.
- The model focuses on permutations. It does not natively handle broader combinatorial structures such as arbitrary graphs or partitions unless encoded through the supported task syntax.
- Outputs should be verified by exact combinatorial software for research-critical use.
## Citation
If you use this model, please cite the accompanying PermuFormer paper once citation details are available.