Everything evaluation - an Ankush-Chander Collection

updated Mar 20

Reading list on evaluation metrics, benchmarks, frameworks, and datasets

Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting

Paper • 2310.11324 • Published Oct 17, 2023 • 1

Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs

Paper • 2509.01790 • Published Sep 1, 2025 • 7

POSIX: A Prompt Sensitivity Index For Large Language Models

Paper • 2410.02185 • Published Oct 3, 2024

A Survey on Evaluation of Large Language Models

Paper • 2307.03109 • Published Jul 6, 2023 • 43