AI systems are breaking records on complex benchmarks, yet they falter on simpler tasks humans handle intuitively, a phenomenon dubbed jagged intelligence. This article explores this uneven capability, tracing its evolution across frontier models and the impact of reasoning models. We introduce SIMPLE, a new public benchmark of easy reasoning tasks solvable by high schoolers, aimed at enterprise AI, where reliability matters more than advanced math skills.
Since ChatGPT’s 2022 debut, foundation models have been marketed as chat interfaces. Now, reasoning models like OpenAI’s o3 and DeepSeek’s R1 leverage extra inference-time computation for step-by-step internal reasoning, boosting performance in math, engineering, and coding. This shift to scaling inference compute arrives as pretraining gains may be plateauing.
Benchmarking the Gaps
Traditional AI benchmarks measure peak performance on tough tasks, like graduate exams or complex code, creating new challenges as old ones are mastered. However, they overlook reliability and worst-case performance on basic tasks, masking jaggedness in “solved” areas.
Modern models outshine humans on some challenges but stumble unpredictably on others, unlike specialized tools such as calculators or photo editors, which behave consistently within their domain. Despite advances in modeling and training, this jaggedness persists. Two axes matter:
- Top-line capability: Accuracy on hard tasks, driving headlines (e.g., AI passing medical exams).
- Jaggedness: Consistency on simple, human-solvable tasks, key to robust automation.
SIMPLE targets easy problems where AI still lags, offering insights into jaggedness trends.
Evolution of Jaggedness
Will jaggedness shrink or grow as models advance? This question shapes enterprise AI success. Lacking jaggedness benchmarks, we created SIMPLE—a dataset of 225 simple questions, each solvable by at least 10% of high schoolers.
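To make the setup concrete, here is a minimal sketch of how a SIMPLE-style evaluation loop might run. The dataset layout and the `grade()` helper are illustrative assumptions, not the benchmark's actual format or grading method (real grading of free-form answers would need human or LLM judges):

```python
# Sketch of a SIMPLE-style evaluation loop.
# The dataset format and grade() helper are assumptions for illustration.

def grade(response: str, answer: str) -> bool:
    """Naive substring grading; real benchmarks need far more careful judging."""
    return answer.strip().lower() in response.strip().lower()

def evaluate(model_fn, dataset):
    """Return accuracy of model_fn over a list of {question, answer} dicts."""
    correct = sum(grade(model_fn(item["question"]), item["answer"]) for item in dataset)
    return correct / len(dataset)

# Toy usage with a stub "model" standing in for a real API call:
dataset = [
    {"question": "Where does Thanksgiving come before Christmas?",
     "answer": "On the calendar"},
]
stub_model = lambda q: "On the calendar, of course."
print(evaluate(stub_model, dataset))  # 1.0
```

The point of the harness is the metric, not the model: swapping `stub_model` for a real API call is the only change needed to score a frontier model.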
Example Questions from SIMPLE
- Question: A man must cross a river with a fox, a chicken, and a sack of corn. His boat holds him and three items (a relaxed version of the classic riddle, whose boat holds only one item). If left alone, the fox eats the chicken, or the chicken eats the corn. How can he do it in the minimum number of trips?
- Answer: All can go in one trip.
- Model Response: Both o1 and o3-mini repeat the classic riddle solution, ignoring the relaxed constraint.
- Question: Move 2 matchsticks in a diagram of 8 matchsticks (`!` for vertical, `__` for horizontal) to form 2 equal squares, leaving no stray matchsticks.
- Answer: Move the top-right horizontal and bottom-left vertical to the bottom right, forming a new square.
- Model Response: o1 fails to create two squares; o3-mini-high forms rectangles with incorrect matchstick counts.
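The relaxed river-crossing puzzle above can be checked mechanically. A short breadth-first search over bank states, a sketch whose state encoding and `min_trips` helper are our own illustration rather than anything from the benchmark, confirms that with a three-item boat the minimum is a single crossing:

```python
from collections import deque
from itertools import combinations

# BFS over river-crossing states to verify the relaxed puzzle:
# with a boat holding the man plus three items, one trip suffices.
ITEMS = frozenset({"fox", "chicken", "corn"})
UNSAFE = [{"fox", "chicken"}, {"chicken", "corn"}]  # pairs that can't be left unattended
CAPACITY = 3  # items the boat carries alongside the man

def min_trips():
    """Return the minimum number of crossings that moves everything safely."""
    start = (ITEMS, "L")  # (items on left bank, man's side)
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (left, side), trips = queue.popleft()
        if not left and side == "R":
            return trips  # goal: everything, and the man, on the right
        here = left if side == "L" else ITEMS - left
        for k in range(CAPACITY + 1):
            for cargo in map(frozenset, combinations(here, k)):
                if any(pair <= here - cargo for pair in UNSAFE):
                    continue  # leaving this bank would let something get eaten
                new_left = left - cargo if side == "L" else left | cargo
                state = (new_left, "R" if side == "L" else "L")
                if state not in seen:
                    seen.add(state)
                    queue.append((state, trips + 1))
    return None

print(min_trips())  # 1
```

The search is trivial for a human; the models' failure on it is exactly the jaggedness the benchmark is built to surface.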
Performance Trends
Evaluating today's top models alongside their predecessors on SIMPLE lets us trace jaggedness over time. In the accompanying chart, green marks school-level benchmarks and blue marks expert-level ones. School-level benchmarks saturated by 2023-2024, shifting the field's focus to harder tasks. Even taking the best response from gpt-4, gpt-4-turbo, gpt-4o, o1, and o3-mini, SIMPLE scores lowest among the school-level benchmarks. Reasoning models, however, show a roughly 30% improvement, suggesting that double-checking their work during inference reduces jaggedness and links reasoning to better simple-task performance.
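The "best of" aggregation described above can be stated precisely: a question counts as solved if any of the listed models answers it correctly. A small sketch, where `results` is an assumed `{model: {question_id: bool}}` mapping rather than the study's actual data structure:

```python
# Illustrative "best-of-models" accuracy: a question is solved if ANY
# listed model gets it right. The results layout is an assumption.
MODELS = ["gpt-4", "gpt-4-turbo", "gpt-4o", "o1", "o3-mini"]

def best_of_accuracy(results, question_ids):
    """Fraction of questions solved by at least one model."""
    solved = sum(
        any(results[m].get(q, False) for m in MODELS)
        for q in question_ids
    )
    return solved / len(question_ids)

# Toy data: one question solved only by o1, one solved by nobody.
toy = {m: {} for m in MODELS}
toy["o1"] = {"q1": True}
print(best_of_accuracy(toy, ["q1", "q2"]))  # 0.5
```

This best-of-ensemble score is an upper bound on any single model's accuracy, which makes SIMPLE's low ranking among school-level benchmarks all the more striking.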
Case Study
- Question: Where does Thanksgiving come before Christmas?
- Answer: On the calendar.
- Responses: o1 and o3-mini answer incorrectly; o3-mini-high gets it right after 49 seconds of reasoning (versus 34 seconds for o3-mini), suggesting the extra deliberation pays off.
Insights and Implications
Reasoning models transfer some of their top-line gains to simple tasks, but SIMPLE remains unsaturated. Jaggedness persists, with top-line progress outpacing worst-case improvements. This mirrors computing's history: machines excel in narrow domains, far outstrip human limits once applied there, yet always leave new challenges exposed.
Jaggedness may not just define AI—it could be computation’s inherent nature.