AI model scoreboard.
Every submission to Fairy is scored by a senior expert across four dimensions: security (Sec), logic, readability (Read), and tests. We aggregate those scores anonymously to benchmark AI tools in production, not on toy benchmarks.
| Rank | Model | Reviews | Sec | Logic | Read | Tests | Overall |
|---|---|---|---|---|---|---|---|
| #1 | Claude Opus 4 (Anthropic) | 312 | 8.7 | 8.9 | 9.2 | 8.1 | 8.7 |
| #2 | Claude 3.7 Sonnet (Anthropic) | 487 | 8.4 | 8.6 | 9.0 | 7.9 | 8.5 |
| #3 | o3 (OpenAI) | 198 | 8.1 | 8.8 | 8.2 | 7.6 | 8.2 |
| #4 | GPT-4o (OpenAI) | 641 | 7.8 | 8.2 | 8.5 | 7.4 | 8.0 |
| #5 | Gemini 2.5 Pro | 274 | 7.6 | 8.0 | 8.3 | 7.1 | 7.8 |
| #6 | Cursor (Anysphere) | 389 | 7.1 | 7.6 | 8.1 | 6.9 | 7.4 |
| #7 | GitHub Copilot (GitHub) | 523 | 6.8 | 7.4 | 7.9 | 6.5 | 7.2 |
| #8 | GPT-4o mini (OpenAI) | 312 | 6.4 | 7.0 | 7.7 | 6.1 | 6.8 |
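The page does not say how the last score in each row is derived from the four dimension scores. One reading that fits the published numbers is an unweighted mean rounded half-up to one decimal place. The sketch below checks that reading against the rows above; the `overall` helper and the rounding mode are assumptions, not Fairy's documented formula.

```python
from decimal import Decimal, ROUND_HALF_UP

# (Sec, Logic, Read, Tests) and the published final score, copied from the table above.
ROWS = {
    "Claude Opus 4": (("8.7", "8.9", "9.2", "8.1"), "8.7"),
    "Claude 3.7 Sonnet": (("8.4", "8.6", "9.0", "7.9"), "8.5"),
    "o3": (("8.1", "8.8", "8.2", "7.6"), "8.2"),
    "GPT-4o": (("7.8", "8.2", "8.5", "7.4"), "8.0"),
    "Gemini 2.5 Pro": (("7.6", "8.0", "8.3", "7.1"), "7.8"),
    "Cursor": (("7.1", "7.6", "8.1", "6.9"), "7.4"),
    "GitHub Copilot": (("6.8", "7.4", "7.9", "6.5"), "7.2"),
    "GPT-4o mini": (("6.4", "7.0", "7.7", "6.1"), "6.8"),
}


def overall(scores):
    """Assumed formula: unweighted mean, rounded half-up to one decimal."""
    total = sum(Decimal(s) for s in scores)
    mean = total / Decimal(len(scores))
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


for model, (scores, published) in ROWS.items():
    computed = overall(scores)
    status = "matches" if computed == Decimal(published) else "differs"
    print(f"{model}: computed {computed}, published {published} ({status})")
```

`Decimal` is used instead of floats because two rows (Gemini 2.5 Pro at 7.75 and GitHub Copilot at 7.15) land exactly on a rounding boundary, where binary floating point could tip either way.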
Common failure patterns
The most frequent issues flagged by experts, by model.
| Model | Critical issues per review |
|---|---|
| Claude Opus 4 | 0.04 |
| Claude 3.7 Sonnet | 0.06 |
| o3 | 0.08 |
| GPT-4o | 0.12 |
| Gemini 2.5 Pro | 0.15 |
| Cursor | 0.19 |
| GitHub Copilot | 0.23 |
| GPT-4o mini | 0.31 |
Methodology
Each score is assigned by a verified senior expert with 5–15 years of experience. Experts review submissions on their merits and score the AI quality dimensions at the end of their review; they are not told that the scores are used for benchmarking. Scores are averaged across all reviewed submissions per model. A model needs at least 5 reviews to appear on this board.
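The aggregation described above can be sketched roughly as follows. This is a minimal illustration, not Fairy's actual pipeline: the `Review` record, the dimension keys, the `build_board` function, ranking by the mean of the four dimensions, and computing the critical rate as total critical issues divided by review count are all assumptions. Only the per-model averaging and the 5-review minimum come from the methodology.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

MIN_REVIEWS = 5  # threshold stated in the methodology
DIMENSIONS = ("sec", "logic", "read", "tests")  # hypothetical keys for the four dimensions


@dataclass
class Review:
    """Hypothetical record of one expert review (not Fairy's schema)."""
    model: str
    scores: dict  # dimension key -> score given by the expert
    critical_issues: int = 0


def build_board(reviews, min_reviews=MIN_REVIEWS):
    """Average each dimension per model and drop models with too few reviews."""
    by_model = defaultdict(list)
    for review in reviews:
        by_model[review.model].append(review)

    board = []
    for model, model_reviews in by_model.items():
        if len(model_reviews) < min_reviews:
            continue  # below the minimum: not shown on the board
        row = {"model": model, "reviews": len(model_reviews)}
        for dim in DIMENSIONS:
            row[dim] = round(mean(r.scores[dim] for r in model_reviews), 1)
        # Assumed definition of the per-review critical rate.
        total_critical = sum(r.critical_issues for r in model_reviews)
        row["critical_per_review"] = round(total_critical / len(model_reviews), 2)
        board.append(row)

    # Assumed ranking rule: highest mean across the four dimensions first.
    board.sort(key=lambda row: mean(row[d] for d in DIMENSIONS), reverse=True)
    for rank, row in enumerate(board, start=1):
        row["rank"] = rank
    return board
```

Calling `build_board` on a list of `Review` objects returns one dictionary per qualifying model, already ranked. A model with four reviews is left out entirely, mirroring the minimum-review rule.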