This is an evaluation set built from prompts that look reasonable on the surface but contain a hidden trap: a false premise, a leading framing, or an assumption that ought to be challenged. We run them through frontier models and grade the results.
Each prompt belongs to a family that groups similar traps. Each prompt was run multiple times to capture variance. Scoring combines human review with automated evaluation by a separate judge model, so you can see both the verdict and the reasoning behind it.
Run counts and average grading scores for each model, sliced by failure mode.
| Model | Physics sycophancy: runs | Physics sycophancy: hallucinated_advantages | Physics sycophancy: trap_detection |
|---|---|---|---|
| DeepSeek V4 Pro (via OpenRouter) | 3 | 0.23 | 0.97 |
| DeepSeek: DeepSeek V4 Flash (via OpenRouter) | 3 | 0.37 | 0.77 |
| Google: Gemini 3 Flash Preview (via OpenRouter) | 3 | 0.53 | 0.90 |
| Mistral: Mistral Small 4 (via OpenRouter) | 3 | 0.82 | 0.27 |
| MoonshotAI: Kimi K2.6 | 3 | 0.32 | 0.97 |
| OpenAI: GPT-5.5 (via OpenRouter) | 3 | 0.53 | 0.77 |
| xAI: Grok 4.3 (via OpenRouter) | 3 | 0.61 | 0.63 |
averaged across every response · best at top
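As a rough illustration of the "best at top" ordering, the sketch below sorts the models by their average `trap_detection` score from the table above. It assumes that a higher `trap_detection` is better; the page does not say which metric its own ranking uses, and the `rank_models` helper is hypothetical.

```python
# Average trap_detection scores copied from the table above.
trap_detection = {
    "DeepSeek V4 Pro (via OpenRouter)": 0.97,
    "DeepSeek: DeepSeek V4 Flash (via OpenRouter)": 0.77,
    "Google: Gemini 3 Flash Preview (via OpenRouter)": 0.90,
    "Mistral: Mistral Small 4 (via OpenRouter)": 0.27,
    "MoonshotAI: Kimi K2.6": 0.97,
    "OpenAI: GPT-5.5 (via OpenRouter)": 0.77,
    "xAI: Grok 4.3 (via OpenRouter)": 0.63,
}


def rank_models(scores, higher_is_better=True):
    """Return (model, score) pairs ordered best first.

    Assumption: "best" means the highest score when higher_is_better is True.
    Python's sort is stable, so tied models keep their input order.
    """
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=higher_is_better)


ranking = rank_models(trap_detection)
for model, score in ranking:
    print(f"{score:.2f}  {model}")
```

For a metric where lower is presumably better, such as `hallucinated_advantages`, the same helper could be called with `higher_is_better=False`.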
Each family groups prompts that probe the same underlying weakness in different ways.
Categories of weakness the prompts are designed to provoke.
This page ships with the underlying database. Run SQL against it directly in your browser; nothing is sent anywhere. Try `SELECT name FROM sqlite_master WHERE type = 'table';` to discover the schema.
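The same schema query can also be run outside the browser with Python's standard `sqlite3` module, assuming a local copy of the database file is available. In this minimal sketch the `list_tables` helper and the `example_results` table are hypothetical; an in-memory database stands in for the real file, whose name and schema are not given here.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in the connected SQLite database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table';"
    ).fetchall()
    return [name for (name,) in rows]


# Stand-in database so the sketch runs on its own. With a real copy of the
# file, connect to its path instead of ":memory:".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE example_results (model TEXT, score REAL)")  # hypothetical table

tables = list_tables(conn)
print(tables)
```

Once the table names are known, `PRAGMA table_info(<table>)` lists the columns of a given table.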