Where frontier models slip

An audit of language-model failure modes

This is an evaluation set built from prompts that look reasonable on the surface but contain a hidden trap — a false premise, a leading framing, an assumption that ought to be challenged. We run them through frontier models and grade the results.

Each prompt belongs to a family of similar traps. Every prompt was run multiple times per model to capture variance. Scoring combines human review with automated evaluation by a separate judge model, and each verdict is shown with the reasoning behind it.

How models score

Run counts and average grading scores for each model, sliced by failure mode.

Failure mode: Physics sycophancy

Model | Provider · model ID | Runs | hallucinated_advantages | trap_detection
DeepSeek V4 Pro (via OpenRouter) | openrouter · deepseek/deepseek-v4-pro | 3 | 0.23 | 0.97
DeepSeek: DeepSeek V4 Flash (via OpenRouter) | openrouter · deepseek/deepseek-v4-flash | 3 | 0.37 | 0.77
Google: Gemini 3 Flash Preview (via OpenRouter) | openrouter · google/gemini-3-flash-preview | 3 | 0.53 | 0.90
Mistral: Mistral Small 4 (via OpenRouter) | openrouter · mistralai/mistral-small-2603 | 3 | 0.82 | 0.27
MoonshotAI: Kimi K2.6 | openrouter · moonshotai/kimi-k2.6 | 3 | 0.32 | 0.97
OpenAI: GPT-5.5 (via OpenRouter) | openrouter · openai/gpt-5.5 | 3 | 0.53 | 0.77
xAI: Grok 4.3 (via OpenRouter) | openrouter · x-ai/grok-4.3 | 3 | 0.61 | 0.63

Legend: strong · mixed · weak

hallucinated_advantages (lower is better)

Averaged across every response, best at top. Scale: 0 to 1.

DeepSeek V4 Pro (via OpenRouter): 0.23
MoonshotAI: Kimi K2.6: 0.32
DeepSeek: DeepSeek V4 Flash (via OpenRouter): 0.37
Google: Gemini 3 Flash Preview (via OpenRouter): 0.53
OpenAI: GPT-5.5 (via OpenRouter): 0.53
xAI: Grok 4.3 (via OpenRouter): 0.61
Mistral: Mistral Small 4 (via OpenRouter): 0.82

trap_detection (higher is better)

Averaged across every response, best at top. Scale: 0 to 1.

DeepSeek V4 Pro (via OpenRouter): 0.97
MoonshotAI: Kimi K2.6: 0.97
Google: Gemini 3 Flash Preview (via OpenRouter): 0.90
DeepSeek: DeepSeek V4 Flash (via OpenRouter): 0.77
OpenAI: GPT-5.5 (via OpenRouter): 0.77
xAI: Grok 4.3 (via OpenRouter): 0.63
Mistral: Mistral Small 4 (via OpenRouter): 0.27
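
The two rankings above can be reproduced from the per-model averages reported on this page. The sketch below is illustrative only: the numbers are copied from the page, but the `rank` helper and the tie-breaking rule (ties keep their original table order, which is what Python's stable sort does) are assumptions, not a description of how the page itself sorts.

```python
# (model, hallucinated_advantages, trap_detection), copied from this page
SCORES = [
    ("DeepSeek V4 Pro (via OpenRouter)", 0.23, 0.97),
    ("DeepSeek: DeepSeek V4 Flash (via OpenRouter)", 0.37, 0.77),
    ("Google: Gemini 3 Flash Preview (via OpenRouter)", 0.53, 0.90),
    ("Mistral: Mistral Small 4 (via OpenRouter)", 0.82, 0.27),
    ("MoonshotAI: Kimi K2.6", 0.32, 0.97),
    ("OpenAI: GPT-5.5 (via OpenRouter)", 0.53, 0.77),
    ("xAI: Grok 4.3 (via OpenRouter)", 0.61, 0.63),
]


def rank(rows, metric_index, higher_is_better):
    """Return model names ordered best-first for one metric (hypothetical helper).

    sorted() is stable, so models with equal scores keep their input order.
    """
    ordered = sorted(rows, key=lambda row: row[metric_index], reverse=higher_is_better)
    return [row[0] for row in ordered]


# hallucinated_advantages: lower is better, so sort ascending
by_hallucination = rank(SCORES, metric_index=1, higher_is_better=False)
# trap_detection: higher is better, so sort descending
by_trap_detection = rank(SCORES, metric_index=2, higher_is_better=True)

for name in by_trap_detection:
    print(name)
```

Keeping the direction of "better" as an explicit argument avoids the easy mistake of sorting both metrics the same way.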

Prompt families

Each family groups prompts that probe the same underlying weakness in different ways.

Models tested

Failure modes

Categories of weakness the prompts are designed to provoke.

Explore the data yourself

This page ships with the underlying database. Run SQL against it directly in your browser — nothing is sent anywhere. Try SELECT name FROM sqlite_master WHERE type = 'table'; to discover the schema.
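
If you have a local copy of the SQLite file, the same discovery query can be tried from Python's standard `sqlite3` module. This is a minimal sketch under stated assumptions: the real schema is not listed on this page, so the table and column names below (`models`, `prompts`, `responses`) are hypothetical stand-ins created in a throwaway in-memory database.

```python
import sqlite3

# Hypothetical stand-in schema; the page does not list the real table names.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE models (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE prompts (id INTEGER PRIMARY KEY, family TEXT, body TEXT);
    CREATE TABLE responses (
        id INTEGER PRIMARY KEY,
        model_id INTEGER REFERENCES models(id),
        prompt_id INTEGER REFERENCES prompts(id),
        trap_detection REAL
    );
    """
)


def list_tables(connection):
    """Run the schema-discovery query quoted on the page and return table names."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table';"
    ).fetchall()
    return [name for (name,) in rows]


def list_columns(connection, table):
    """Return column names for one table; PRAGMA table_info puts the name at index 1."""
    rows = connection.execute(f"PRAGMA table_info({table})").fetchall()
    return [row[1] for row in rows]


tables = list_tables(conn)
for table in tables:
    print(table, list_columns(conn, table))
```

Once the real table names are known, `list_columns` gives enough detail to start writing joins and aggregates against them.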
