You Won’t Believe Which Open Model Ran Best on Real-World Prompts

Ryan Wong October 18, 2025 AI, LLM, benchmarks, model-comparison, GPT-OSS, Qwen3, DeepSeek, research

TL;DR

Winner: gpt-oss-120b — it led the pack in technical tasks like bug solving and complex logic, and it ran efficiently on a single 80GB GPU, which let us test heavy prompts without huge infra overhead.

Other highlights: Qwen3 stood out for research-style prompts and live-information lookups. GPT-5 and DeepSeek V3.1 performed very well on coding and analysis. Kimi-Thinking-Preview produced solid output in several areas but failed a specific “needle in a haystack” test.

Introduction

Benchmarks are useful, but many published scores do not match how teams use models in practice. Public leaderboards often focus on small or curated tasks, or on single metrics that do not reflect real engineering demands. We built a test set that mixes: (a) practical engineering problems, (b) long-form writing, (c) research lookups and accuracy, (d) multi-step logic puzzles, and (e) retrieval-style “needle in a haystack” challenges. The goal: measure capability under the conditions you might actually push a model into, as of Oct 18, 2025.

How we conducted testing

  • Task suite: five categories — Bug Solving, Creative Writing, Research, Complex Logic, and Needle-in-Haystack retrieval. (The raw scoring table is shown below.)

  • Prompts: real examples collected from engineers, content teams, and analysts; each prompt was run 3 times and scored by two human raters for correctness, completeness, and practical usefulness.

  • Latency & resources: models ran on representative GPU setups. Wherever possible we used the recommended runtime configs (for example, a single 80GB H100-class GPU for the large MoE models and 20–40GB setups for the smaller weights).

  • Scoring: 0–10 scale per task, averaged across prompts (a minimal sketch of this aggregation appears after this list). Ratings focused on usable output: for example, a bug fix that compiles and is correct scored higher than a plausible but incorrect patch.

  • Reproducibility: every prompt, seed, and GPU profile is logged in the shared spreadsheet linked at the end.
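
For concreteness, here is a minimal sketch of how that per-category aggregation can be computed; the category name matches our suite, but the prompt IDs and raw ratings below are hypothetical placeholders, not values from our spreadsheet.

```python
# Illustrative aggregation: average each prompt's ratings (3 runs x 2 raters),
# then average across prompts to get the 0-10 category score reported below.
from statistics import mean

# Hypothetical structure: scores[category][prompt_id] -> all ratings for that prompt.
scores = {
    "Bug Solving": {
        "fix-off-by-one": [9, 10, 9, 9, 10, 10],  # 3 runs x 2 raters = 6 ratings
        "race-condition": [8, 8, 9, 7, 8, 9],
    },
}

def category_average(per_prompt: dict[str, list[int]]) -> float:
    """Average within each prompt first, then across prompts."""
    return mean(mean(ratings) for ratings in per_prompt.values())

for category, per_prompt in scores.items():
    print(f"{category}: {category_average(per_prompt):.1f}/10")
```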

Results

Below is the table of averaged scores as of Oct 18, 2025:

| Test | Kimi-Latest | Kimi K2 | Moonshot-V1 | GPT-OSS-20B | GPT-OSS-120B | Qwen3 | GPT-5 | Copilot | DeepSeek V3.1 | Kimi-Thinking-Preview |
|---|---|---|---|---|---|---|---|---|---|---|
| 1. Bug Solving | 6/10 | 8/10 | 9/10 | 9/10 | 10/10 | 9/10 | 10/10 | 9/10 | 10/10 | 10/10 |
| 2. Creative Writing | 9/10 | 9/10 | 9/10 | 9/10 | 9/10 | 8/10 | 9/10 | 9/10 | 9/10 | 9/10 |
| 3. Research | 6/10 | 4/10 | 4/10 | 4/10 | 4/10 | 9/10 | 4/10 | 4/10 | 3/10 | 4/10 |
| 4. Complex Logic | 7/10 | 6/10 | 6/10 | 9/10 | 10/10 | 10/10 | 9/10 | 9/10 | 10/10 | 10/10 |
| 5. Needle in Haystack | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 0/10 |

  • gpt-oss-120b was the most consistent high scorer across technical tasks. That aligns with gpt-oss’s positioning as a high-reasoning model designed to run on a single 80GB GPU.

  • Qwen3 was the strongest model for the research-style prompts in our suite; it handled factual lookups and citation-style replies better than most other models in this test. That corresponds with Qwen3’s public positioning and launch notes.

  • DeepSeek V3.1 and the GPT-OSS variants both performed well on analytical and coding demands. Several community reports and docs place these models in the same performance tier for reasoning benchmarks.

  • Kimi-Thinking-Preview produced reasonable answers in multiple categories but failed the needle-in-haystack task in a way that suggests a retrieval or prompt-routing bug during that run (a generic sketch of this kind of test follows below).
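
To make the needle-in-haystack category concrete, here is a minimal, generic sketch of how such a test is usually constructed; the filler text, the planted fact, and the pass/fail check are illustrative stand-ins, not the exact prompts from our suite.

```python
# Plant a single fact ("the needle") deep inside a long filler context,
# then check whether the model's answer contains it verbatim.
NEEDLE = "The deployment passphrase is 'violet-kumquat-42'."

def build_haystack(needle: str, filler_paragraphs: int = 400, position: float = 0.73) -> str:
    """Assemble a long context with the needle inserted at a relative position."""
    filler = ["Quarterly metrics were reviewed and archived without incident."] * filler_paragraphs
    filler.insert(int(len(filler) * position), needle)
    return "\n\n".join(filler)

prompt = (
    build_haystack(NEEDLE)
    + "\n\nQuestion: What is the deployment passphrase mentioned above? Answer exactly."
)

def score_needle_response(model_answer: str) -> int:
    """10 if the planted fact is retrieved verbatim, 0 otherwise (pass/fail)."""
    return 10 if "violet-kumquat-42" in model_answer else 0
```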

Takeaways

  • If your priority is reliable, high-quality reasoning and you want a model that can run on a single 80GB GPU, gpt-oss-120b is the strongest choice from our set (a serving sketch follows after this list). The model’s design and available runtime metadata match the strengths we observed in bug solving and logic tasks. link

  • If you need the best results for factual lookups and research-style responses, test Qwen3 on your actual prompts. It handled our lookup suite better than the other open models in this round. link

  • For teams building tooling around agent workflows or heavy code generation, GPT-5 and some of the leading research models are strong candidates; run your own representative prompts and measure both correctness and execution cost. link
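
A minimal sketch of the single-GPU setup referenced above, assuming the weights are published under the openai/gpt-oss-120b Hugging Face identifier and that your runtime (vLLM is one common option) exposes an OpenAI-compatible endpoint on localhost:8000; the exact flags and memory behaviour depend on your serving stack and version.

```python
# Assumption: an OpenAI-compatible server is already running locally, e.g.:
#   vllm serve openai/gpt-oss-120b --max-model-len 32768
# (model identifier and flags are illustrative, not verified for your setup)
from openai import OpenAI

# Point the standard OpenAI client at the local endpoint; the API key is unused.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Find and fix the bug in this function: ..."}],
    temperature=0.2,
)
print(resp.choices[0].message.content)
```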

For full logs, per-prompt scores, and the exact GPU/seed settings we used, see the research spreadsheet: benchmark link

