Do not copy and paste! Rewriting strategies for code retrieval
Abstract
This research investigates how different text rewriting strategies affect code retrieval performance, finding that full natural-language rewriting yields the largest gains and proposing entropy-based diagnostics to determine when such costly rewrites pay off.
Embedding-based code retrieval often suffers when encoders overfit to surface syntax. Prior work mitigates this by using LLMs to rephrase queries and corpora into a normalized style, but leaves two questions open: how much representational shift helps, and when is the per-query LLM call justified? We study a hierarchy of three rewriting strategies: stylistic rephrasing, NL-enriched PseudoCode, and full Natural-Language transcription, under joint query-corpus (QC, online) and corpus-only (C, offline) augmentation, across six CoIR benchmarks, five encoders, and three rewriters spanning independent model families (Qwen, DeepSeek, Mistral). We are the first to evaluate NL-enriched PseudoCode and snippet-level Natural Language as direct retrieval representations, rather than as transient intermediates. Full NL rewriting with QC yields the largest gains (+0.51 absolute NDCG@10 on CT-Contest for MoSE-18), while corpus-only rewriting degrades retrieval in 56 of 90 configurations (about 62%). We introduce two diagnostics, ΔH (token-entropy shift) and Δs (embedding cosine shift), and show that ΔH predicts retrieval gain under QC across all three rewriter families: pooled Spearman ρ = +0.436, p < 0.001 on DeepSeek+Codestral; ρ = +0.593 on Codestral alone; ρ = +0.356 on Qwen. This establishes ΔH as a cheap, rewriter-agnostic proxy for deciding when rewriting pays off before running retrieval. Our analysis reframes LLM rewriting as a cost-benefit decision: it is most effective as a remediation layer for lightweight encoders on code-dominant queries, with diminishing returns for strong encoders or NL-heavy queries.
Community
Our article explores how rewriting code into different forms can improve code retrieval systems.
We tested three rewriting strategies: rephrasing code, converting it into pseudocode, and translating it into full natural language.
We found that rewriting both the query and the code corpus together produces the best retrieval performance.
Our strongest results came from using full natural language rewriting, especially for smaller code encoders.
However, rewriting only the corpus often reduced performance, because the original query no longer matched the style of the rewritten code.
We also introduced a diagnostic called ∆H, which helps predict when rewriting will improve retrieval results.
Overall, our article demonstrates that LLM-based rewriting can significantly enhance code search when applied in the right conditions.
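The ΔH diagnostic can be approximated from token frequencies alone. Below is a minimal sketch of that idea, computing the Shannon-entropy shift between an original snippet and its rewrite; it assumes simple whitespace tokenization, whereas the paper may use a model tokenizer, so treat it as illustrative rather than the authors' exact implementation.

```python
import math
from collections import Counter

def token_entropy(tokens):
    """Shannon entropy (bits) of the empirical token distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def delta_h(original_tokens, rewritten_tokens):
    """ΔH: entropy shift from rewriting. A positive value means the
    rewrite spreads probability mass over a richer token vocabulary."""
    return token_entropy(rewritten_tokens) - token_entropy(original_tokens)

# Hypothetical example: a terse code snippet vs. an NL transcription.
code = "for i in range ( n ) : s += a [ i ]".split()
nl = "sum the first n elements of the array a into the accumulator s".split()
print(f"ΔH = {delta_h(code, nl):+.3f}")
```

Under the paper's finding, a query-corpus pair with a large positive ΔH would be a candidate where the per-query LLM rewrite is likely to pay off, and a small or negative ΔH suggests skipping it.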
Get this paper in your agent: hf papers read 2605.08299