We introduce CREW (Cross function Enterprise Work Index), a benchmark for evaluating frontier AI models on long-horizon enterprise tasks.

| Agent | Occupation | Complexity | Scale | What It Tests | Verifiers |
|---|---|---|---|---|---|
| Fin Agent | Credit analyst | 32+ expert hours | 2,610 tasks, 26K+ PDFs | Multi-document reasoning → taxonomy-aware transaction categorization → business P&L construction | Programmatic: binary pass/fail |
| Enterprise Knowledge Agent | Senior business analyst | 16+ expert hours | 1,220 pitch-deck tasks, 45 video tasks, 279 preference pairs | Source faithfulness → narrative-arc-based storytelling → design coherence | Skill-based rubrics and preference pairs |
| Front-end Agent | Senior front-end engineer | 60-100 expert hours | 37 tasks, 147 expert preferences | Figma environment navigation → design system creation → build verification | Skill-based rubrics and preference pairs |

Results are available at evals.metaphi.ai/crew/leaderboard

Metaphi is an applied AI research lab founded on the mission of scaling out RL environments for long-horizon agents.
We partner with the world's leading domain experts to curate our environments and to train in-house reward models for programmatic verification of autonomous agents.

Website: metaphi.ai