ReSyn: A Generalized Recursive Regular Expression Synthesis Framework
Abstract
A divide-and-conquer framework named ReSyn enhances regex synthesis accuracy by decomposing complex problems, combined with a parameter-efficient synthesizer called Set2Regex that handles example permutation invariance.
Existing Programming-By-Example (PBE) systems often rely on simplified benchmarks that fail to capture the high structural complexity of real-world regexes, such as deeper nesting and frequent use of union operations. To overcome the resulting performance drop, we propose ReSyn, a synthesizer-agnostic divide-and-conquer framework that decomposes a complex synthesis problem into manageable sub-problems. We also introduce Set2Regex, a parameter-efficient synthesizer that captures the permutation invariance of examples. Experimental results demonstrate that ReSyn significantly boosts accuracy across various synthesizers, and its combination with Set2Regex establishes a new state of the art on a challenging real-world benchmark. The complete source code, datasets, and pre-trained model checkpoints are publicly available at https://github.com/mrseongminkim/ReSyn.
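The divide-and-conquer idea in the abstract can be illustrated with a toy sketch. This is not the ReSyn implementation: the splitting heuristics (grouping examples by character-class signature for unions, cutting each example into character-class runs for concatenation), the `base_synthesizer` stand-in, and every function name below are assumptions made for illustration only. What it shows is the recursive shape: split a hard problem into sub-problems, hand each leaf to any pluggable synthesizer, and combine the results.

```python
import re
from itertools import groupby
from typing import Callable, Dict, List, Tuple

Synthesizer = Callable[[List[str]], str]
CLASS_REGEX = {"d": "[0-9]+", "l": "[a-z]+", "u": "[A-Z]+"}


def char_class(ch: str) -> str:
    """Map a character to a coarse class; anything else is its own class."""
    if "0" <= ch <= "9":
        return "d"
    if "a" <= ch <= "z":
        return "l"
    if "A" <= ch <= "Z":
        return "u"
    return ch


def runs(s: str) -> List[Tuple[str, str]]:
    """Split a string into maximal runs of one character class."""
    return [(cls, "".join(chars)) for cls, chars in groupby(s, key=char_class)]


def signature(s: str) -> Tuple[str, ...]:
    return tuple(cls for cls, _ in runs(s))


def base_synthesizer(examples: List[str]) -> str:
    """Hypothetical leaf synthesizer: a character class or a literal alternation."""
    sigs = {signature(s) for s in examples}
    if len(sigs) == 1:
        (sig,) = sigs
        if len(sig) == 1 and sig[0] in CLASS_REGEX:
            return CLASS_REGEX[sig[0]]
    literals = sorted({re.escape(s) for s in examples})
    return literals[0] if len(literals) == 1 else "(?:" + "|".join(literals) + ")"


def partition(examples: List[str]) -> List[List[str]]:
    """Union split (assumed heuristic): group examples sharing a signature."""
    groups: Dict[Tuple[str, ...], List[str]] = {}
    for s in examples:
        groups.setdefault(signature(s), []).append(s)
    return [groups[key] for key in sorted(groups)]


def segment(examples: List[str]) -> List[List[str]]:
    """Concatenation split (assumed heuristic): one column per run position."""
    rows = [[text for _, text in runs(s)] for s in examples]
    return [list(col) for col in zip(*rows)]


def synthesize_recursive(examples: List[str], synth: Synthesizer = base_synthesizer) -> str:
    if not examples:
        raise ValueError("at least one example is required")
    groups = partition(examples)
    if len(groups) > 1:  # divide on union, conquer each group
        return "(?:" + "|".join(synthesize_recursive(g, synth) for g in groups) + ")"
    columns = segment(examples)
    if len(columns) > 1:  # divide on concatenation, conquer each column
        return "".join(synthesize_recursive(c, synth) for c in columns)
    return synth(examples)  # small enough for the base synthesizer


examples = ["abc-123", "xy-7", "HELLO", "WORLD"]
pattern = synthesize_recursive(examples)
print(pattern)
```

Because `synth` is just a callable, any leaf synthesizer can be swapped in, which mirrors the "synthesizer-agnostic" framing in spirit. The toy returns the same regex for any ordering of the examples only because it sorts its groups and literals; that is a different mechanism from a learned set-based model such as Set2Regex.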
Community
We've released everything to use and build on ReSyn from the Hub:
- 📚 Dataset: https://huggingface.co/datasets/mrseongminkim/ReSyn
- 🤖 Pre-trained components (loadable via from_pretrained): Set2Regex · Router · Partitioner · Segmenter
- 🤖 Prax baseline: ReSyn-byt5-small
- 💻 Code: https://github.com/mrseongminkim/ReSyn
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- AutoVecCoder: Teaching LLMs to Generate Explicitly Vectorized Code (2026)
- Chiseling Out Efficiency: Structured Skeleton Supervision for Efficient Code Generation (2026)
- REBENCH: A Procedural, Fair-by-Construction Benchmark for LLMs on Stripped-Binary Types and Names (Extended Version) (2026)
- LLM4Branch: Large Language Model for Discovering Efficient Branching Policies of Integer Programs (2026)
- UniRTL: Unifying Code and Graph for Robust RTL Representation Learning (2026)
- Automated Large-scale CVRP Solver Design via LLM-assisted Flexible MCTS (2026)
- An Empirical Study of Speculative Decoding on Software Engineering Tasks (2026)
Neat approach to regex synthesis. PBE tools usually struggle once things get nested or rely heavily on unions, so the divide-and-conquer strategy here sounds like a solid way to handle that complexity.
I'm curious how the Set2Regex component handles cases where the provided examples are ambiguous or don't fully define the intended pattern. Does the permutation invariance help prune the search space significantly when the input set is small?
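To make the ambiguity raised in this question concrete, here is a small sketch using Python's standard `re` module. It is not taken from the paper and says nothing about how Set2Regex behaves; the examples and candidate patterns are made up for illustration. It only shows that a small example set can leave several quite different regexes equally consistent.

```python
import re
from typing import List

# Hypothetical positive and negative examples for an under-specified task.
positives = ["ab12", "cd34"]
negatives = ["abcd"]

# Hand-written candidate regexes, from tight to loose.
candidates = [
    r"(?:ab12|cd34)",      # memorizes the positives
    r"[a-z]{2}[0-9]{2}",   # two letters then two digits
    r"[a-z]+[0-9]+",       # letters then digits, any length
    r"\w{2}\d+",           # two word characters then digits
    r"[a-z0-9]+",          # too loose: also accepts the negative
]


def consistent(pattern: str, pos: List[str], neg: List[str]) -> bool:
    """True if the pattern fully matches every positive and no negative."""
    accepts_all = all(re.fullmatch(pattern, s) for s in pos)
    rejects_all = not any(re.fullmatch(pattern, s) for s in neg)
    return accepts_all and rejects_all


survivors = [p for p in candidates if consistent(p, positives, negatives)]
for p in survivors:
    print(p)
```

Four of the five candidates survive, and the result does not depend on the order in which the examples are listed, since consistency is a property of the example set. Choosing among the survivors takes either more examples or a prior over patterns.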
I made a podcast on it with ResearchPod, it makes it easy to get the key concepts on the go:
https://researchpod.app/episode/0eb2f81d-649c-4cf5-bd64-d1823b2bc89e
Get this paper in your agent:
hf papers read 2603.24624
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper 5
mrseongminkim/ReSyn-Partitioner
Datasets citing this paper 1
mrseongminkim/ReSyn