Live benchmark

AI model scoreboard.

Every submission to Fairy is scored by a senior expert across four dimensions. We aggregate those scores anonymously to benchmark AI tools on real production code rather than on toy benchmarks.

Based on 3,136 reviewed submissions. All expert identities remain anonymous. Scores reflect real production code shipped by real teams.

Security, Logic, Readability, Tests

Rank	Model	Vendor	Reviews	Security	Logic	Readability	Tests	Overall
#1	Claude Opus 4	Anthropic	312	8.7	8.9	9.2	8.1	8.7
#2	Claude 3.7 Sonnet	Anthropic	487	8.4	8.6	9.0	7.9	8.5
#3	o3	OpenAI	198	8.1	8.8	8.2	7.6	8.2
#4	GPT-4o	OpenAI	641	7.8	8.2	8.5	7.4	8.0
#5	Gemini 2.5 Pro	Google	274	7.6	8.0	8.3	7.1	7.8
#6	Cursor	Anysphere	389	7.1	7.6	8.1	6.9	7.4
#7	GitHub Copilot	GitHub	523	6.8	7.4	7.9	6.5	7.2
#8	GPT-4o mini	OpenAI	312	6.4	7.0	7.7	6.1	6.8
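
The page does not say how the Overall column is derived. One plausible reading, and it is only an assumption, is an unweighted mean of the four dimension scores rounded to one decimal place. The sketch below checks the published rows against that reading and also confirms that the per-model review counts add up to the stated total. The helper names (`mean_of_dimensions`, `check_rows`) are illustrative, not part of any Fairy API.

```python
from decimal import Decimal, ROUND_HALF_UP

# Rows copied from the table above:
# (model, reviews, security, logic, readability, tests, overall)
ROWS = [
    ("Claude Opus 4", 312, "8.7", "8.9", "9.2", "8.1", "8.7"),
    ("Claude 3.7 Sonnet", 487, "8.4", "8.6", "9.0", "7.9", "8.5"),
    ("o3", 198, "8.1", "8.8", "8.2", "7.6", "8.2"),
    ("GPT-4o", 641, "7.8", "8.2", "8.5", "7.4", "8.0"),
    ("Gemini 2.5 Pro", 274, "7.6", "8.0", "8.3", "7.1", "7.8"),
    ("Cursor", 389, "7.1", "7.6", "8.1", "6.9", "7.4"),
    ("GitHub Copilot", 523, "6.8", "7.4", "7.9", "6.5", "7.2"),
    ("GPT-4o mini", 312, "6.4", "7.0", "7.7", "6.1", "6.8"),
]


def mean_of_dimensions(scores):
    """Assumed formula: unweighted mean, rounded half-up to one decimal."""
    total = sum(Decimal(s) for s in scores)
    return (total / len(scores)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


def check_rows(rows):
    """Return the models whose published Overall differs from the assumed formula."""
    return [r[0] for r in rows if mean_of_dimensions(r[2:6]) != Decimal(r[6])]


mismatches = check_rows(ROWS)
total_reviews = sum(r[1] for r in ROWS)
print(mismatches, total_reviews)  # → [] 3136
```

Decimal arithmetic is used instead of floats so that ties such as 7.75 round predictably. An empty mismatch list shows the published numbers are consistent with the assumed formula; it does not prove that Fairy computes Overall this way.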

Common failure patterns

The most frequent issues flagged by experts for each model, followed by that model's rate of critical issues per review.

Claude Opus 4

incomplete error handling

missing input validation

documentation gaps

0.04 critical / review

Claude 3.7 Sonnet

missing input validation

edge case gaps

test coverage

0.06 critical / review

o3

verbose patterns

over-engineering

readability

0.08 critical / review

GPT-4o

auth check missing

SQL injection risk

hardcoded secrets

0.12 critical / review

Gemini 2.5 Pro

missing rate limiting

insecure defaults

test coverage

0.15 critical / review

Cursor

copy-paste patterns

missing validation

brittle tests

0.19 critical / review

GitHub Copilot

stale pattern suggestions

security misses

missing context

0.23 critical / review

GPT-4o mini

shallow analysis

incomplete implementation

missing edge cases

0.31 critical / review

Methodology

Each score is assigned by a verified senior expert with 5–15 years of experience. Experts review submissions on their merits and score the AI quality dimensions at the end of their review; they are not told that the scores are used for benchmarking. Scores are averaged across all reviewed submissions per model. A model needs at least 5 reviews to appear on this board.
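
The aggregation described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Fairy's actual pipeline: the record shape (a model name paired with a dict of dimension scores), the function name `build_board`, and the rule that overall is the unweighted mean of the four averages are all hypothetical. Only the per-model averaging and the 5-review minimum come from the text.

```python
from collections import defaultdict
from statistics import mean

MIN_REVIEWS = 5  # threshold stated in the methodology
DIMENSIONS = ("security", "logic", "readability", "tests")


def build_board(reviews, min_reviews=MIN_REVIEWS):
    """Average each dimension per model and drop models with too few reviews.

    `reviews` is an iterable of (model, scores) pairs, where `scores` maps each
    dimension name to one expert's score. This record shape is hypothetical.
    """
    by_model = defaultdict(list)
    for model, scores in reviews:
        by_model[model].append(scores)

    board = []
    for model, rows in by_model.items():
        if len(rows) < min_reviews:
            continue  # too few reviews to appear on the board
        averages = {d: mean(r[d] for r in rows) for d in DIMENSIONS}
        # Assumption: overall is the unweighted mean of the four averages.
        averages["overall"] = mean(averages.values())
        board.append({"model": model, "reviews": len(rows), **averages})

    # Highest overall score first.
    return sorted(board, key=lambda row: row["overall"], reverse=True)


# Made-up data for illustration only: model-b has 4 reviews and is excluded.
demo_reviews = (
    [("model-a", {"security": 8, "logic": 8, "readability": 8, "tests": 8})] * 5
    + [("model-b", {"security": 10, "logic": 10, "readability": 10, "tests": 10})] * 4
)
demo_board = build_board(demo_reviews)
print([row["model"] for row in demo_board])  # → ['model-a']
```

Filtering before ranking means a model with a handful of unusually high scores cannot appear at the top of the board until it has enough reviews behind it.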