Wait, We Don't Need to "Wait"! Removing Thinking Tokens Improves Reasoning Efficiency Paper • 2506.08343 • Published Jun 10, 2025 • 54
TESTEVAL: Benchmarking Large Language Models for Test Case Generation Paper • 2406.04531 • Published Jun 6, 2024 • 1
Scrub It Out! Erasing Sensitive Memorization in Code Language Models via Machine Unlearning Paper • 2509.13755 • Published Sep 17, 2025 • 19
ContextBench: A Benchmark for Context Retrieval in Coding Agents Paper • 2602.05892 • Published Feb 5 • 4
ExecVerify: White-Box RL with Verifiable Stepwise Rewards for Code Execution Reasoning Paper • 2603.11226 • Published Mar 11
TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks Paper • 2605.22535 • Published 2 days ago • 3
AgentOCR: Reimagining Agent History via Optical Self-Compression Paper • 2601.04786 • Published Jan 8 • 31
We Got Claude to Fine-Tune an Open Source LLM Article • burtenshaw, evalstate • Dec 4, 2025 • 627
MultiRef: Controllable Image Generation with Multiple Visual References Paper • 2508.06905 • Published Aug 9, 2025 • 21