Fairy
Live benchmark

AI model scoreboard.

Every submission to Fairy is scored by a senior expert across four dimensions: security, logic, readability, and tests. We aggregate those scores anonymously to benchmark AI tools on production code, not on toy benchmarks.

Based on 3,136 reviewed submissions. All expert identities remain anonymous. Scores reflect real production code shipped by real teams.
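| Rank | Model | Reviews | Security | Logic | Readability | Tests | Overall |
|------|-------|---------|----------|-------|-------------|-------|---------|
| #1 | Claude Opus 4 (Anthropic) | 312 | 8.7 | 8.9 | 9.2 | 8.1 | 8.7 |
| #2 | Claude 3.7 Sonnet (Anthropic) | 487 | 8.4 | 8.6 | 9.0 | 7.9 | 8.5 |
| #3 | o3 (OpenAI) | 198 | 8.1 | 8.8 | 8.2 | 7.6 | 8.2 |
| #4 | GPT-4o (OpenAI) | 641 | 7.8 | 8.2 | 8.5 | 7.4 | 8.0 |
| #5 | Gemini 2.5 Pro (Google) | 274 | 7.6 | 8.0 | 8.3 | 7.1 | 7.8 |
| #6 | Cursor (Anysphere) | 389 | 7.1 | 7.6 | 8.1 | 6.9 | 7.4 |
| #7 | GitHub Copilot (GitHub) | 523 | 6.8 | 7.4 | 7.9 | 6.5 | 7.2 |
| #8 | GPT-4o mini (OpenAI) | 312 | 6.4 | 7.0 | 7.7 | 6.1 | 6.8 |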
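
The scoreboard's arithmetic can be checked with a short script. This is a sketch under one assumption the page does not state: that Overall is the unweighted mean of the four dimension scores, rounded half-up to one decimal. The `ROWS` name and tuple layout are illustrative; the numbers are copied from the scoreboard.

```python
from decimal import Decimal, ROUND_HALF_UP

# (model, reviews, security, logic, readability, tests, overall), as listed on the board.
ROWS = [
    ("Claude Opus 4", 312, "8.7", "8.9", "9.2", "8.1", "8.7"),
    ("Claude 3.7 Sonnet", 487, "8.4", "8.6", "9.0", "7.9", "8.5"),
    ("o3", 198, "8.1", "8.8", "8.2", "7.6", "8.2"),
    ("GPT-4o", 641, "7.8", "8.2", "8.5", "7.4", "8.0"),
    ("Gemini 2.5 Pro", 274, "7.6", "8.0", "8.3", "7.1", "7.8"),
    ("Cursor", 389, "7.1", "7.6", "8.1", "6.9", "7.4"),
    ("GitHub Copilot", 523, "6.8", "7.4", "7.9", "6.5", "7.2"),
    ("GPT-4o mini", 312, "6.4", "7.0", "7.7", "6.1", "6.8"),
]


def mean_one_decimal(scores):
    """Unweighted mean of the scores, rounded half-up to one decimal place."""
    total = sum(Decimal(s) for s in scores)
    return (total / len(scores)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


def check_rows(rows):
    """Return the models whose listed Overall differs from the assumed mean."""
    return [r[0] for r in rows if mean_one_decimal(r[2:6]) != Decimal(r[6])]


total_reviews = sum(r[1] for r in ROWS)
mismatches = check_rows(ROWS)
print(total_reviews, mismatches)
```

With the scoreboard's numbers this prints `3136 []`: the review counts sum to the stated 3,136 submissions, and every Overall value matches the assumed mean. That agreement is consistent with an unweighted mean, but it does not confirm that this is the formula Fairy uses.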

Common failure patterns

The most frequent issues flagged by experts, by model.

| Model | Most frequent issues | Critical / review |
|-------|----------------------|-------------------|
| Claude Opus 4 | incomplete error handling, missing input validation, doc gaps | 0.04 |
| Claude 3.7 Sonnet | missing input validation, edge case gaps, test coverage | 0.06 |
| o3 | verbose patterns, over-engineering, readability | 0.08 |
| GPT-4o | auth check missing, sql injection risk, hardcoded secrets | 0.12 |
| Gemini 2.5 Pro | missing rate limiting, insecure defaults, test coverage | 0.15 |
| Cursor | copy-paste patterns, missing validation, brittle tests | 0.19 |
| GitHub Copilot | stale pattern suggestions, security misses, missing context | 0.23 |
| GPT-4o mini | shallow analysis, incomplete implementation, missing edge cases | 0.31 |

Methodology

Each score is assigned by a verified senior expert with 5–15 years of experience. Experts review submissions on their merits and score the AI quality dimensions at the end of their review; they are not told that the purpose is benchmarking. Scores are averaged across all reviewed submissions per model. A model needs at least 5 reviews to appear on this board.
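
The aggregation described above can be sketched as follows. This is a minimal illustration, not Fairy's actual pipeline: the `Review` record, its field names, and `build_board` are hypothetical, and ranking by the mean of the four dimension averages is an assumption. Only the per-model averaging and the 5-review minimum come from the methodology.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

MIN_REVIEWS = 5  # from the methodology: fewer reviews means no board entry
DIMENSIONS = ("security", "logic", "readability", "tests")


@dataclass(frozen=True)
class Review:
    """One expert review of one submission (hypothetical record shape)."""
    model: str
    security: float
    logic: float
    readability: float
    tests: float


def build_board(reviews):
    """Average each dimension per model and drop models below MIN_REVIEWS."""
    by_model = defaultdict(list)
    for review in reviews:
        by_model[review.model].append(review)

    board = []
    for model, items in by_model.items():
        if len(items) < MIN_REVIEWS:
            continue  # not enough reviews to appear on the board
        row = {"model": model, "reviews": len(items)}
        for dim in DIMENSIONS:
            row[dim] = mean(getattr(r, dim) for r in items)
        # Assumption: overall is the unweighted mean of the dimension averages.
        row["overall"] = mean(row[dim] for dim in DIMENSIONS)
        board.append(row)
    return sorted(board, key=lambda row: row["overall"], reverse=True)


# Made-up sample data: model-a has 5 reviews, model-b only 4.
sample = [Review("model-a", 8.0, 9.0, 9.0, 8.0)] * 5
sample += [Review("model-b", 7.0, 7.0, 7.0, 7.0)] * 4
board = build_board(sample)
```

In this sample, `board` contains only `model-a`, because `model-b` falls below the 5-review minimum.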