Auditing Financial Autonomy: Enhanced Tasks and Multi-Dimensional Rubrics for Agentic Benchmarking
A three-tier rubric for the Finance sector of GDPval
Healthcare AI, Medical AI, Vision AI, Audio AI, AI Safety, Agentic Systems, Multilingual AI, Physical AI, Model Fine-tuning, Safety Evaluation, Data Annotation, Localization, Natural Language Processing, Benchmarking, AI Evaluation
Document-work benchmark for healthcare persona
Co-evolutionary adversarial training demo (DA vs CA)
Explore RL tasks and view model trajectories
Explore iOS agent benchmark tasks and view their run details
RL env & benchmark for enterprise BA agents
RL environment for sales & revenue-ops agents
Interactive demo for the MedMosaic medical-audio benchmark
RL environment & benchmark for clinical EHR agents