TL;DR: We ran 13 standardized tasks across four leading open source models and scored every response from 0 to 10. Kimi 2.6 came out on top with 112 out of 130. MiniMax M3 was the most reliable for live-data tasks. DeepSeek V4 Pro tied Kimi 2.6 for the best coding scores. Qwen 3.6 Plus showed strong analytical thinking but inconsistent execution. Here is exactly what we found.
Why We Did This
Picking an open source LLM for production work is harder than it looks. Leaderboard numbers tell you one thing. Actual behavior on the kinds of tasks your team needs to run tells you something else entirely.
We wanted to know: how do these models actually perform when pushed across a wide range of real tasks? Not just coding. Not just chat. Reasoning, writing, live data retrieval, multi-step instructions, document search, and more.
So we built a 13-question benchmark, ran all four models through every question, and graded each response independently with detailed reasoning.
The Four Models
| Model | Developer | Overall Score |
|---|---|---|
| Kimi 2.6 | Moonshot AI | 112 / 130 |
| MiniMax M3 | MiniMax AI | 104 / 130 |
| DeepSeek V4 Pro | DeepSeek AI | 96 / 130 |
| Qwen 3.6 Plus | Alibaba | 76 / 130 |
The 36-point gap between Kimi 2.6 and Qwen 3.6 Plus is not a marginal difference: it is more than a quarter of the 130 available points. It reflects consistent gaps across multiple categories, not just one or two bad questions.
How We Scored
Each of the 13 questions targeted a specific capability. Scores ranged from 0 to 10, using even numbers only. A 10 meant a complete and correct answer with solid reasoning shown. A 0 meant no useful output was produced.
The 13 questions covered these areas:
- Logic and reasoning: multi-step elimination puzzles, truth-teller logic
- Writing: humanizing a formal blog article to grade-8 reading level
- Creative constraints: rewriting a poem under 11 simultaneous rules
- Coding: fixing a buggy string compression function, normalizing nested expressions
- Reasoning and coding: resolving package dependency conflicts with backtracking
- Instruction following: applying a set of custom world-facts rules to derive answers
- Retrieval: finding a hidden code buried in a long document
- Tool use and live data: chaining live web search across multiple steps (Olympics year, stock prices, Nobel Prize winner)
- Deterministic reasoning: a 10-step string transformation with no room for error
Full Score Breakdown
| Question | Category | MiniMax M3 | Kimi 2.6 | DeepSeek V4 Pro | Qwen 3.6 Plus |
|---|---|---|---|---|---|
| Q1: Logic Puzzle | Logic / Reasoning | 6 | 6 | 6 | **8** |
| Q2: Truth-Teller Logic | Logic / Reasoning | 4 | **8** | **8** | 6 |
| Q3: Article Rewrite | Writing / Style | **10** | **10** | **10** | **10** |
| Q4: Constrained Poem | Creative / Constraints | 2 | **4** | **4** | **4** |
| Q5: String Compression | Coding | 8 | **10** | **10** | 8 |
| Q6: Expression Normaliser | Coding | 8 | **10** | **10** | 8 |
| Q7: Dependency Resolver | Reasoning / Coding | **10** | **10** | **10** | 6 |
| Q8: Instruction Following | Instruction Following | **10** | **10** | **10** | **10** |
| Q9: Document Retrieval | Retrieval | **10** | **10** | 8 | 6 |
| Q10: Weather Chain | Tool Use / Search | **10** | **10** | 2 | 2 |
| Q11: Stock Investment Chain | Tool Use / Search | **8** | **8** | 6 | 2 |
| Q12: String Transformation | Deterministic Reasoning | **10** | **10** | **10** | 4 |
| Q13: Nobel Prize Chain | Tool Use / Search | **8** | 6 | 2 | 2 |
| TOTAL | Max = 130 | 104 | 112 | 96 | 76 |
Bold indicates the highest score for that question among the four models.
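The totals can be re-derived from the per-question scores. The sketch below copies the rows from the table and checks them against the rubric (13 answers per model, even integers from 0 to 10). The names `SCORES`, `validate`, and `totals` are illustrative and are not part of the original benchmark harness.

```python
# Per-question scores, Q1..Q13 in order, copied from the score breakdown table.
SCORES = {
    "MiniMax M3":      [6, 4, 10, 2, 8, 8, 10, 10, 10, 10, 8, 10, 8],
    "Kimi 2.6":        [6, 8, 10, 4, 10, 10, 10, 10, 10, 10, 8, 10, 6],
    "DeepSeek V4 Pro": [6, 8, 10, 4, 10, 10, 10, 10, 8, 2, 6, 10, 2],
    "Qwen 3.6 Plus":   [8, 6, 10, 4, 8, 8, 6, 10, 6, 2, 2, 4, 2],
}


def validate(scores):
    """Check the rubric: 13 scores per model, each an even integer in 0..10."""
    for model, row in scores.items():
        assert len(row) == 13, f"{model}: expected 13 scores, got {len(row)}"
        assert all(s in range(0, 11, 2) for s in row), f"{model}: off-rubric score"


def totals(scores):
    """Sum each model's 13 scores into an overall total out of 130."""
    return {model: sum(row) for model, row in scores.items()}


validate(SCORES)
TOTALS = totals(SCORES)

# Print the ranking from highest to lowest total.
for model, total in sorted(TOTALS.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {total} / 130")
```

Running this reproduces the TOTAL row: 112 for Kimi 2.6, 104 for MiniMax M3, 96 for DeepSeek V4 Pro, and 76 for Qwen 3.6 Plus.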
Performance by Category
This table shows the average score per category. It is where the real differences become clear.
| Category | MiniMax M3 | Kimi 2.6 | DeepSeek V4 Pro | Qwen 3.6 Plus |
|---|---|---|---|---|
| Logic / Reasoning | 5.0 | 7.0 | 7.0 | 7.0 |
| Writing / Style | 10.0 | 10.0 | 10.0 | 10.0 |
| Creative / Constraints | 2.0 | 4.0 | 4.0 | 4.0 |
| Coding | 8.0 | 10.0 | 10.0 | 8.0 |
| Reasoning / Coding | 10.0 | 10.0 | 10.0 | 6.0 |
| Instruction Following | 10.0 | 10.0 | 10.0 | 10.0 |
| Retrieval | 10.0 | 10.0 | 8.0 | 6.0 |
| Tool Use / Search | 8.7 | 8.0 | 3.3 | 2.0 |
| Deterministic Reasoning | 10.0 | 10.0 | 10.0 | 4.0 |
The Tool Use / Search category is where the models separate most dramatically. MiniMax M3 averaged 8.7 and Kimi 2.6 averaged 8.0, while DeepSeek dropped to 3.3 and Qwen fell to 2.0. This single category accounts for 18 of the 36 points separating Kimi 2.6 from Qwen 3.6 Plus, and 14 of the 16 points separating Kimi 2.6 from DeepSeek V4 Pro.
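The category averages are plain means of the per-question scores, rounded to one decimal. A minimal sketch of that calculation follows; the question-to-category mapping is taken from the score breakdown, while the helper names (`category_averages`, `category_points`) are illustrative assumptions rather than the benchmark's actual tooling.

```python
from collections import defaultdict

# Category of each question, Q1..Q13 in order, from the score breakdown.
CATEGORIES = [
    "Logic / Reasoning", "Logic / Reasoning", "Writing / Style",
    "Creative / Constraints", "Coding", "Coding", "Reasoning / Coding",
    "Instruction Following", "Retrieval", "Tool Use / Search",
    "Tool Use / Search", "Deterministic Reasoning", "Tool Use / Search",
]

# Per-question scores, Q1..Q13 in order, from the score breakdown.
SCORES = {
    "MiniMax M3":      [6, 4, 10, 2, 8, 8, 10, 10, 10, 10, 8, 10, 8],
    "Kimi 2.6":        [6, 8, 10, 4, 10, 10, 10, 10, 10, 10, 8, 10, 6],
    "DeepSeek V4 Pro": [6, 8, 10, 4, 10, 10, 10, 10, 8, 2, 6, 10, 2],
    "Qwen 3.6 Plus":   [8, 6, 10, 4, 8, 8, 6, 10, 6, 2, 2, 4, 2],
}


def category_averages(row):
    """Group one model's scores by category and average each group."""
    buckets = defaultdict(list)
    for category, score in zip(CATEGORIES, row):
        buckets[category].append(score)
    return {cat: round(sum(vals) / len(vals), 1) for cat, vals in buckets.items()}


def category_points(row, category):
    """Total points one model earned in a single category."""
    return sum(s for c, s in zip(CATEGORIES, row) if c == category)


AVERAGES = {model: category_averages(row) for model, row in SCORES.items()}

for model, averages in AVERAGES.items():
    print(f"{model}: Tool Use / Search average = {averages['Tool Use / Search']}")

# How much of the Kimi-vs-Qwen gap comes from Tool Use / Search alone.
tool_gap = (category_points(SCORES["Kimi 2.6"], "Tool Use / Search")
            - category_points(SCORES["Qwen 3.6 Plus"], "Tool Use / Search"))
total_gap = sum(SCORES["Kimi 2.6"]) - sum(SCORES["Qwen 3.6 Plus"])
print(f"Tool use accounts for {tool_gap} of the {total_gap}-point gap")
```

Because Tool Use / Search covers three questions while most categories cover one, a weak result there costs up to 30 points, which is why it dominates the overall ranking.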
What Each Model Did Well and Where It Fell Short
MiniMax M3 (104/130)
MiniMax was the most reliable model for tasks that required live web data. It correctly identified the 2026 Winter Olympics in Milan Cortina when DeepSeek and Qwen both returned the 2022 Beijing games. It was also the only model to return the correct paragraph number and line number when retrieving a hidden code from a long document.
No zero scores. Completed all 13 questions with a full answer every time.
Its weak spot was logic. It chose the wrong truth-teller in Q2, picking B instead of the correct answer D. It also fabricated an answer for Q1 rather than identifying that the puzzle lacked enough information to solve.
Kimi 2.6 (112/130)
Kimi 2.6 had the highest overall score and was the strongest model across the most categories. It scored a perfect 10 on both coding questions, identified the correct truth-teller in Q2 with a verified key assignment, and retrieved the correct 2026 Olympics year.
Its main weakness was Q4, the constrained poem. The poem had 11 simultaneous rules to follow. Kimi attempted a full A to Z structure (52 lines), which was the most complete attempt of any model, but used the wrong tense throughout and broke the monovowel constraint repeatedly.
In Q13, Kimi had all the right data for Steps 1 through 9 but skipped the exchange rate division in Step 10 by arguing it was unnecessary. The formula must be followed as written. That reasoning error cost it points.
DeepSeek V4 Pro (96/130)
DeepSeek tied Kimi for the best coding performance and produced one of the most thorough logic analyses in Q2, working through every possible assignment before arriving at the correct answer.
It was also the most transparent of the four models when data was unavailable. Rather than making up stock prices or weather figures, it stated clearly that it could not retrieve the information and explained the calculation it would perform once the data was provided.
The problem is that live data retrieval is a large part of what modern agents actually do. DeepSeek scored 2 on the Weather Chain because it returned 2022 Beijing instead of the current 2026 games. It scored 2 on the Nobel Prize Chain because it retrieved the 2024 winner (Han Kang) instead of the 2025 winner (Laszlo Krasznahorkai). These are not edge cases. They are exactly the type of task where live search matters most.
Qwen 3.6 Plus (76/130)
Qwen produced the best Q1 analysis of all four models. Where other models fabricated a day-and-venue combination, Qwen correctly identified that the puzzle was underdetermined, explained the full elimination chain needed to solve it, and refused to guess.
That kind of analytical clarity shows up elsewhere too. The Q2 analysis was the most exhaustive of any model, running through every possible key assignment systematically.
But Qwen's execution was inconsistent. In Q12, a 10-step string transformation, Qwen got steps 1 through 6 exactly right and then made a binary chunking error in Step 7. That single mistake cascaded into a wrong hexadecimal value, a wrong conditional check, and a final answer of 250FF instead of the correct 0D2. Six clean steps, one slip, and the output is still wrong.
In Q7, it started with an incorrect dependency resolution, worked through it, and self-corrected mid-response. The final answer was right but the instability was visible.
The One Question That Separated Everyone
Q10, the Weather Chain, tells you the most about how each model handles real-world tool use.
The task: find out which country won the most gold medals at the most recent Winter Olympics, identify its capital, and retrieve the current temperature there.
Kimi 2.6 and MiniMax M3 both identified the 2026 Milan Cortina games, correctly named Norway with 18 golds, and retrieved live temperatures in Oslo. DeepSeek and Qwen both returned the 2022 Beijing games. That is not a small error. It means the model used training data instead of running a live search, and its training data was outdated.
The same pattern repeated in Q13. The Nobel Prize in Literature question requires knowing the 2025 winner. Models that searched returned Laszlo Krasznahorkai (correct). Models that relied on memory returned Han Kang (the 2024 winner). DeepSeek and Qwen both fell into this trap.
If your use case involves live data at all, this gap is meaningful.
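One way to catch this failure mode automatically is to tag each answer as fresh or stale based on which known entity it mentions. The sketch below is a hypothetical keyword check built only from the two questions discussed above; it is not the grader used in this benchmark, and the marker lists and the `classify_freshness` name are assumptions made for illustration.

```python
# Hypothetical markers: the current answer versus the outdated one for each question.
FRESHNESS_MARKERS = {
    "Q10": {"fresh": ("milan", "cortina"), "stale": ("beijing",)},
    "Q13": {"fresh": ("krasznahorkai",), "stale": ("han kang",)},
}


def classify_freshness(question_id, answer):
    """Return 'fresh', 'stale', or 'unknown' for a model's answer text.

    An answer that names the current entity counts as fresh even if it also
    mentions the older one (for example, when comparing the two).
    """
    markers = FRESHNESS_MARKERS[question_id]
    text = answer.lower()
    if any(marker in text for marker in markers["fresh"]):
        return "fresh"
    if any(marker in text for marker in markers["stale"]):
        return "stale"
    return "unknown"


# Made-up sample answers, used only to exercise the classifier.
samples = [
    ("Q10", "Norway led the Milan Cortina 2026 medal table."),
    ("Q10", "At Beijing 2022, Norway won the most gold medals."),
    ("Q13", "The most recent laureate is Han Kang."),
    ("Q13", "I could not retrieve this information."),
]
for question_id, answer in samples:
    print(question_id, classify_freshness(question_id, answer))
```

A keyword check like this only flags likely staleness. It cannot tell whether a model actually ran a search, so an "unknown" or "fresh" result still needs a human or a stricter grader to confirm the full answer.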
Choosing the Right Model
| Use Case | Best Choice | Why |
|---|---|---|
| Coding and implementation | Kimi 2.6 / DeepSeek V4 Pro | Both scored a perfect 10 on both coding questions |
| Live data and search tasks | MiniMax M3 | Most accurate year lookups, best document retrieval |
| Logic and reasoning | Kimi 2.6 | Correct Q2, clean multi-step derivation |
| Analytical depth | Qwen 3.6 Plus | Best structural analysis on open-ended problems |
| Consistency across all tasks | MiniMax M3 | No zeros, all questions answered fully |
| Best overall model | Kimi 2.6 | Highest score, strongest across most categories |
API Cost Comparison (as of June 2026)
All four models are open-weight and can be self-hosted with no per-token fee, although hardware and operations costs still apply. For teams using a managed API, pricing is pay-as-you-go with no monthly subscription required.
| Model | Developer | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|---|
| Kimi 2.6 | Moonshot AI | $0.95 | $4.00 | 256K |
| MiniMax M3 | MiniMax AI | $0.60 | $2.40 | 1M |
| DeepSeek V4 Pro | DeepSeek AI | $0.44 | $0.87 | 1M |
| Qwen 3.6 Plus | Alibaba | $0.33 | $1.95 | 1M |
Sources: Kimi official pricing · MiniMax official pricing · DeepSeek official pricing · Qwen via OpenRouter
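Per-token prices are easier to compare once they are applied to a concrete workload. The sketch below uses the prices from the table with a hypothetical monthly volume of 50M input tokens and 10M output tokens; that volume and the `workload_cost` helper are assumptions for illustration, so substitute your own traffic figures.

```python
# USD per 1M tokens as (input, output), copied from the pricing table.
PRICES = {
    "Kimi 2.6":        (0.95, 4.00),
    "MiniMax M3":      (0.60, 2.40),
    "DeepSeek V4 Pro": (0.44, 0.87),
    "Qwen 3.6 Plus":   (0.33, 1.95),
}


def workload_cost(model, input_tokens, output_tokens):
    """Estimated API cost in USD for a given number of input and output tokens."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical monthly workload: 50M input tokens, 10M output tokens.
INPUT_TOKENS = 50_000_000
OUTPUT_TOKENS = 10_000_000

COSTS = {m: workload_cost(m, INPUT_TOKENS, OUTPUT_TOKENS) for m in PRICES}
for model, cost in sorted(COSTS.items(), key=lambda kv: kv[1]):
    print(f"{model}: ${cost:.2f} per month")
```

At this assumed 5:1 input-to-output mix, DeepSeek V4 Pro comes out cheapest even though Qwen 3.6 Plus has the lowest input price, because output tokens cost more than twice as much on Qwen. Output-heavy workloads widen that difference further.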
Key Takeaways
The biggest finding from this benchmark is how much live data retrieval matters and how differently these models handle it.
Two of the four models failed on the same type of question: tasks that require searching for current information rather than recalling training data. MiniMax M3 and Kimi 2.6 handled these well. DeepSeek and Qwen did not.
The second finding is that high analytical capability does not guarantee high scores. Qwen produced the most rigorous reasoning on Q1 and Q2. But a single execution error at Step 7 of Q12 derailed an otherwise correct transformation and produced a wrong answer. If the output is wrong, the reasoning quality leading up to it does not fully compensate.
Third: writing quality was effectively equal across all four models. All four scored 10 on the article rewrite. Writing is no longer a differentiator at this tier.
Conclusion
Four strong open source models. Four very different profiles.
Kimi 2.6 is the best all-round choice for teams that need strong performance across coding, reasoning, and live data tasks. MiniMax M3 is the right pick when reliable real-world search and retrieval matters most. DeepSeek V4 Pro is a solid option for coding-focused workflows where live data retrieval is handled separately. Qwen 3.6 Plus is best suited to analytical and reasoning tasks where execution consistency is less critical.
The models are good. The gap between them is real. Knowing which one fits your workflow before you build around it will save you significant time.