File size: 5,876 Bytes
24f62d5
eb03713
fbc206b
 
 
 
 
 
 
 
24f62d5
fbc206b
 
 
 
 
e694085
fbc206b
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
ccfa311
fbc206b
 
 
eb03713
 
 
fbc206b
eb03713
fbc206b
 
 
 
 
 
 
 
 
eb03713
fbc206b
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
e694085
fbc206b
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
eb03713
fbc206b
 
 
 
 
 
eb03713
 
 
fbc206b
 
 
 
 
 
 
 
 
 
eb03713
 
fbc206b
 
 
 
 
 
 
 
 
 
eb03713
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
---
library_name: transformers
pipeline_tag: text-generation
tags:
- math
- combinatorics
- permutations
- algebraic-combinatorics
- llama
- causal-lm
---

# PermuFormer

PermuFormer is a small Llama-style causal language model trained on symbolic permutation tasks from algebraic combinatorics. It is intended as a specialist base model for permutation representation, reasoning, and finetuning experiments rather than as a general natural-language assistant.

The model operates on a compact word-level vocabulary for permutation syntax. Training examples are stored as pre-tokenized lists of tokens; at inference time, the Hugging Face tokenizer can also consume equivalent whitespace-separated strings. Prompts are formulaic equations: the left side specifies a permutation task and generation begins after the `=` token.

## Model Details

- **Architecture:** `LlamaForCausalLM`
- **Parameters:** about 75.7M
- **Layers:** 12
- **Hidden size:** 768
- **Attention heads:** 12 query heads, 4 key/value heads
- **MLP intermediate size:** 2048
- **Activation:** SiLU/SwiGLU
- **Position encoding:** RoPE, theta 10000
- **Vocabulary size:** 186
- **Context length used by tokenizer:** 1000 tokens
- **Checkpoint:** `step_2600000`

## Training Data

PermuFormer was trained autoregressively on synthetic permutation examples generated with exact combinatorial algorithms. The paper describes a dataset of 39.8M instances, approximately 2.66B tokens, over the symmetric groups `S_2` through `S_11`.

Training tasks cover three broad families:

- **Translation between encodings:** one-line notation, cycle notation, reduced Coxeter expressions, RSK tableaux, inversion vectors, and Lehmer codes.
- **Permutation statistics and properties:** length, descents, fixed points, sign/parity, cycle type, RSK shape, pattern avoidance, longest increasing/decreasing subsequences, and related statistics.
- **Algebraic operations and comparisons:** product/composition, inverse, powers, conjugation, commutator, relative products, multiplication by simple transpositions, complement, reverse, descent tests, and Bruhat order.

Some targets include computational witnesses before the final answer, for example inversion lists before a length answer or pattern witnesses before an avoidance answer.

## Usage

Use deterministic decoding for most evaluation-style tasks. Make sure special token IDs come from the tokenizer.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "YOUR_ORG/permuformer"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

prompt = (
    "<|endoftext|> n3 "
    "1linebegin [ 3 , 1 , 2 ] 1lineend "
    "in cyclenotationmake ="
)

inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=80,
        do_sample=False,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=False))
```

### Prompt Format

Training data is represented as lists of token strings. When writing prompts as plain text, separate every token with spaces. Multi-digit integers, delimiters, and task names are individual tokens. A typical example starts with `<|endoftext|>`, then a size token such as `n7`, then the task expression, then `=`.

Translation example:

```text
<|endoftext|> n3 1linebegin [ 3 , 1 , 2 ] 1lineend in cyclenotationmake =
```

Property example:

```text
<|endoftext|> n3 1linebegin [ 3 , 2 , 1 ] 1lineend property lengthmake =
```

Algebraic operation example:

```text
<|endoftext|> n3 1linebegin [ 2 , 1 , 3 ] 1lineend inversemake =
```

## Evaluation Notes

The training code evaluates by exact match on the generated right-hand side after `=`. The local training log for this repository reports, at step 2,522,000 on a 2,560-example stratified evaluation sample:

- Overall exact match: **98.44%**
- Translation: **97.78%**
- Property/statistic tasks: **99.17%**
- Algebraic tasks: **98.36%**

These figures are from the local log and should be treated as checkpoint-adjacent repository metadata, not a full benchmark report for every downstream setting.

The paper also reports that PermuFormer is substantially more accurate than frontier general-purpose LLMs on a small held-out sample from the model's symbolic test distribution, while noting that the comparison is imperfect because PermuFormer was trained directly in this syntax.

## Finetuning

PermuFormer is designed to be finetuned on specialized permutation tasks. Experiments in the paper include:

- 231-avoidance and 2143-avoidance
- mHeight
- Schubert polynomial structure constants
- Kazhdan-Lusztig polynomial degree prediction

The repository's finetuning scripts compare starting from this pretrained checkpoint with training the same architecture from scratch.

## Limitations

- This is a specialist symbolic model. It expects the exact whitespace-tokenized syntax used during training and is brittle to natural-language paraphrases or malformed prompts.
- The model is trained on permutations of sizes represented in the training data, primarily `S_2` through `S_11`; behavior outside that regime is not guaranteed.
- Exact-match accuracy depends on canonical output formatting. Some mathematical tasks may have multiple valid answers, but evaluation expects the chosen canonical form.
- The model focuses on permutations. It does not natively handle broader combinatorial structures such as arbitrary graphs or partitions unless encoded through the supported task syntax.
- Outputs should be verified by exact combinatorial software for research-critical use.

## Citation

If you use this model, please cite the accompanying PermuFormer paper once citation details are available.