TL;DR
Winner: gpt-oss-120b — it led the pack in technical tasks like bug solving and complex logic, and it ran efficiently on a single 80GB GPU, which let us test heavy prompts without huge infra overhead.
Other highlights: Qwen3 stood out for research-style prompts and live-information lookups. GPT-5 and DEEPSEEK V3.1 performed very well on coding and analysis. KIMI-THINKING-PREVIEW produced solid output in several areas but failed a specific “needle in a haystack” test.
Introduction
Benchmarks are useful, but many published scores do not match how teams use models in practice. Public leaderboards often focus on small or curated tasks, or on single metrics that do not reflect real engineering demands. We built a test set that mixes: (a) practical engineering problems, (b) long-form writing, (c) research lookups and accuracy, (d) multi-step logic puzzles, and (e) retrieval-style “needle in a haystack” challenges. The goal: measure capability under the conditions teams actually push models into, as of Oct 18, 2025.
How we conducted testing
- Task suite: five categories: Bug Solving, Creative Writing, Research, Complex Logic, and Needle-in-Haystack retrieval. (The raw scoring table is shown below.)
- Prompts: real examples collected from engineers, content teams, and analysts; each prompt was run 3 times and scored by two human raters for correctness, completeness, and practical usefulness.
- Latency & resources: models ran on representative GPU setups. Wherever possible we used the recommended runtime configs (for example, a single 80GB H100-class GPU for the large MoE models and 20–40GB setups for smaller weights).
- Scoring: 0–10 scale per task, averaged across prompts (a minimal aggregation sketch follows this list). Ratings focused on usable output; for example, a bug fix that compiles and is correct scored higher than a plausible but incorrect patch.
- Reproducibility: every prompt, seed, and GPU profile is logged in the shared spreadsheet linked at the end.
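To make the scoring step concrete, here is a minimal aggregation sketch. The raw-rating layout is an illustrative assumption, not the exact schema of our spreadsheet; in practice each (model, category) bucket also spans every prompt in the category, not just one.

```python
from collections import defaultdict
from statistics import mean

# Illustrative raw ratings on a 0-10 scale: one row per (model, category, run, rater).
# This layout is a hypothetical stand-in for the shared spreadsheet, not its exact schema.
raw_ratings = [
    # model,          category,      run, rater, score
    ("gpt-oss-120b",  "Bug Solving",   1, "A",   10),
    ("gpt-oss-120b",  "Bug Solving",   1, "B",   10),
    ("gpt-oss-120b",  "Bug Solving",   2, "A",    9),
    ("gpt-oss-120b",  "Bug Solving",   2, "B",   10),
    ("gpt-oss-120b",  "Bug Solving",   3, "A",   10),
    ("gpt-oss-120b",  "Bug Solving",   3, "B",   10),
]

def average_scores(ratings):
    """Average each (model, category) bucket over its prompts, runs, and raters."""
    buckets = defaultdict(list)
    for model, category, _run, _rater, score in ratings:
        buckets[(model, category)].append(score)
    return {key: round(mean(scores), 1) for key, scores in buckets.items()}

print(average_scores(raw_ratings))
# {('gpt-oss-120b', 'Bug Solving'): 9.8}
```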
Results
Below is the table of averaged scores as of Oct 18, 2025:
| Test | Kimi-Latest | Kimi K2 | Moonshot-V1 | GPT-OSS-20B | GPT-OSS-120B | QWEN3 | GPT-5 | COPILOT | DEEPSEEK V3.1 | KIMI-THINKING-PREVIEW |
|---|---|---|---|---|---|---|---|---|---|---|
| 1. Bug Solving | 6/10 | 8/10 | 9/10 | 9/10 | 10/10 | 9/10 | 10/10 | 9/10 | 10/10 | 10/10 |
| 2. Creative Writing | 9/10 | 9/10 | 9/10 | 9/10 | 9/10 | 8/10 | 9/10 | 9/10 | 9/10 | 9/10 |
| 3. Research | 6/10 | 4/10 | 4/10 | 4/10 | 4/10 | 9/10 | 4/10 | 4/10 | 3/10 | 4/10 |
| 4. Complex Logic | 7/10 | 6/10 | 6/10 | 9/10 | 10/10 | 10/10 | 9/10 | 9/10 | 10/10 | 10/10 |
| 5. Needle in Haystack | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 0/10 |
- gpt-oss-120b was the most consistent high scorer across technical tasks. That aligns with gpt-oss's positioning as a high-reasoning model designed to run on a single 80GB GPU.
- Qwen3 was the strongest model on the research-style prompts in our suite; it handled factual lookups and citation-style replies better than most other models in this test, which matches Qwen3's public positioning and launch notes.
- DEEPSEEK V3.1 and the GPT-OSS variants both performed well on analytical and coding tasks. Several community reports and docs place these models in the same performance tier on reasoning benchmarks.
- KIMI-THINKING-PREVIEW produced reasonable answers in multiple categories but failed the needle-in-haystack task in a way that suggests a retrieval or prompt-routing bug during that run (a minimal probe for this kind of check is sketched below).
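For context on that last point, the check itself is simple to reproduce. Below is a minimal needle-in-a-haystack probe; it assumes the model under test is exposed through an OpenAI-compatible endpoint (vLLM and similar servers provide one), and the base URL, model id, needle, and filler text are all placeholders rather than our exact harness.

```python
import os
from openai import OpenAI  # OpenAI-compatible client; also works against vLLM-style local endpoints

# Placeholder endpoint and credentials; substitute whatever you are actually serving.
client = OpenAI(
    base_url=os.environ.get("EVAL_BASE_URL", "http://localhost:8000/v1"),
    api_key=os.environ.get("EVAL_API_KEY", "not-needed-for-local"),
)

NEEDLE = "The maintenance window for cluster-7 is 02:15 UTC on Tuesdays."
FILLER = "Routine log line with no relevant information.\n" * 2000  # long distractor context

haystack = FILLER + NEEDLE + "\n" + FILLER

response = client.chat.completions.create(
    model="gpt-oss-120b",  # placeholder model id
    messages=[
        {"role": "system", "content": "Answer only from the provided context."},
        {"role": "user", "content": f"{haystack}\n\nQuestion: When is the maintenance window for cluster-7?"},
    ],
    temperature=0,
)

answer = response.choices[0].message.content
print("PASS" if "02:15" in answer else "FAIL", "-", answer)
```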
Takeaways
- If your priority is reliable, high-quality reasoning and you want a model that can be run on a single 80GB GPU, gpt-oss-120b is the strongest choice from our set. The model's design and available runtime metadata match the strengths we observed in bug solving and logic tasks. link
- If you need the best results for factual lookups and research-style responses, test Qwen3 on your actual prompts. It handled our lookup suite better than the other open models in this round. link
- For teams building tooling around agent workflows or heavy code generation, GPT-5 and some of the leading research models are strong candidates; run your own representative prompts and measure both correctness and execution cost (a minimal timing sketch follows this list). link
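As a starting point for that kind of self-test, here is a minimal timing sketch. It reuses the same assumed OpenAI-compatible setup as the retrieval probe above; the prompts and model id are placeholders, and token usage is only a rough proxy for execution cost.

```python
import os
import time
from openai import OpenAI

# Same assumed OpenAI-compatible setup as the retrieval probe above; ids and prompts are placeholders.
client = OpenAI(
    base_url=os.environ.get("EVAL_BASE_URL", "http://localhost:8000/v1"),
    api_key=os.environ.get("EVAL_API_KEY", "not-needed-for-local"),
)

prompts = [
    "Fix the off-by-one error in this loop: for i in range(len(items) - 1): print(items[i])",
    "Summarise the trade-offs between optimistic and pessimistic locking in two sentences.",
]

for prompt in prompts:
    start = time.perf_counter()
    response = client.chat.completions.create(
        model="gpt-oss-120b",  # placeholder model id
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    elapsed = time.perf_counter() - start
    usage = response.usage  # prompt/completion token counts, a rough cost proxy
    print(f"{elapsed:.1f}s | {usage.prompt_tokens} in / {usage.completion_tokens} out | {prompt[:40]}...")
```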
For full logs, per-prompt scores, and the exact GPU/seed settings we used, see the research spreadsheet: benchmark link