Open Source LLM Evaluation Report - MiniMax vs Kimi vs DeepSeek vs Qwen

TL;DR: We ran 13 standardized tasks across four leading open source models and scored every response from 0 to 10. Kimi 2.6 came out on top with 112 out of 130. MiniMax M3 was the most reliable for real-world data tasks. DeepSeek V4 Pro tied Kimi for the best coding scores. Qwen 3.6 Plus showed strong analytical thinking but inconsistent execution. Here is exactly what we found.

Why We Did This

Picking an open source LLM for production work is harder than it looks. Leaderboard numbers tell you one thing. Actual behavior on the kinds of tasks your team needs to run tells you something else entirely.

We wanted to know: how do these models actually perform when pushed across a wide range of real tasks? Not just coding. Not just chat. Reasoning, writing, live data retrieval, multi-step instructions, document search, and more.

So we built a 13-question benchmark, ran all four models through every question, and graded each response independently with detailed reasoning.

The Four Models

Model	Developer	Overall Score
Kimi 2.6	Moonshot AI	112 / 130
MiniMax M3	MiniMax AI	104 / 130
DeepSeek V4 Pro	DeepSeek AI	96 / 130
Qwen 3.6 Plus	Alibaba	76 / 130

The 36-point gap between Kimi 2.6 and Qwen 3.6 Plus is not marginal. It is more than a quarter of the 130 available points, and it reflects consistent gaps across multiple categories, not just one or two bad questions.

How We Scored

Each of the 13 questions targeted a specific capability. Scores ranged from 0 to 10, using even numbers only. A 10 meant a complete and correct answer with solid reasoning shown. A 0 meant no useful output was produced.
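As a rough illustration, the rubric above can be written as a small check on a score sheet. This is a minimal sketch, not the tooling used for the report: the names `QUESTIONS`, `ALLOWED_SCORES`, `MAX_TOTAL`, and `validate_scores` are hypothetical.

```python
QUESTIONS = 13
ALLOWED_SCORES = {0, 2, 4, 6, 8, 10}  # 0 to 10, even numbers only
MAX_TOTAL = QUESTIONS * max(ALLOWED_SCORES)  # 13 questions * 10 points = 130


def validate_scores(scores):
    """Return the total if the score sheet obeys the rubric, else raise ValueError."""
    if len(scores) != QUESTIONS:
        raise ValueError(f"expected {QUESTIONS} scores, got {len(scores)}")
    for number, score in enumerate(scores, start=1):
        if score not in ALLOWED_SCORES:
            raise ValueError(f"Q{number}: {score} is not an even score from 0 to 10")
    return sum(scores)
```

A sheet with an odd score, or with fewer than 13 entries, is rejected before any total is computed.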

The 13 questions covered these areas:

Logic and reasoning: multi-step elimination puzzles, truth-teller logic
Writing: humanizing a formal blog article to grade-8 reading level
Creative constraints: rewriting a poem under 11 simultaneous rules
Coding: fixing a buggy string compression function, normalizing nested expressions
Reasoning and coding: resolving package dependency conflicts with backtracking
Instruction following: applying a set of custom world-facts rules to derive answers
Retrieval: finding a hidden code buried in a long document
Tool use and live data: chaining live web search across multiple steps (Olympics year, stock prices, Nobel Prize winner)
Deterministic reasoning: a 10-step string transformation with no room for error

Full Score Breakdown

Question	Category	MiniMax M3	Kimi 2.6	DeepSeek V4 Pro	Qwen 3.6 Plus
Q1: Logic Puzzle	Logic / Reasoning	6	6	6	8
Q2: Truth-Teller Logic	Logic / Reasoning	4	8	8	6
Q3: Article Rewrite	Writing / Style	10	10	10	10
Q4: Constrained Poem	Creative / Constraints	2	4	4	4
Q5: String Compression	Coding	8	10	10	8
Q6: Expression Normaliser	Coding	8	10	10	8
Q7: Dependency Resolver	Reasoning / Coding	10	10	10	6
Q8: Instruction Following	Instruction Following	10	10	10	10
Q9: Document Retrieval	Retrieval	10	10	8	6
Q10: Weather Chain	Tool Use / Search	10	10	2	2
Q11: Stock Investment Chain	Tool Use / Search	8	8	6	2
Q12: String Transformation	Deterministic Reasoning	10	10	10	4
Q13: Nobel Prize Chain	Tool Use / Search	8	6	2	2
TOTAL	Max = 130	104	112	96	76

Bold indicates the highest score for that question among the four models.
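The totals in the last row can be reproduced directly from the per-question scores. The sketch below copies the table into a dictionary (scores listed in order Q1 to Q13) and sums each column. The helper names `totals` and `ranking` are illustrative only.

```python
# Per-question scores from the table above, in order Q1..Q13.
SCORES = {
    "MiniMax M3":      [6, 4, 10, 2, 8, 8, 10, 10, 10, 10, 8, 10, 8],
    "Kimi 2.6":        [6, 8, 10, 4, 10, 10, 10, 10, 10, 10, 8, 10, 6],
    "DeepSeek V4 Pro": [6, 8, 10, 4, 10, 10, 10, 10, 8, 2, 6, 10, 2],
    "Qwen 3.6 Plus":   [8, 6, 10, 4, 8, 8, 6, 10, 6, 2, 2, 4, 2],
}


def totals(scores):
    """Sum each model's 13 scores (maximum 130)."""
    return {model: sum(values) for model, values in scores.items()}


def ranking(scores):
    """Return (model, total) pairs sorted from highest to lowest total."""
    return sorted(totals(scores).items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for model, total in ranking(SCORES):
        print(f"{model}: {total} / 130")
```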

Performance by Category

This table shows the average score per category. It is where the real differences become clear.

Category	MiniMax M3	Kimi 2.6	DeepSeek V4 Pro	Qwen 3.6 Plus
Logic / Reasoning	5.0	7.0	7.0	7.0
Writing / Style	10.0	10.0	10.0	10.0
Creative / Constraints	2.0	4.0	4.0	4.0
Coding	8.0	10.0	10.0	8.0
Reasoning / Coding	10.0	10.0	10.0	6.0
Instruction Following	10.0	10.0	10.0	10.0
Retrieval	10.0	10.0	8.0	6.0
Tool Use / Search	8.7	8.0	3.3	2.0
Deterministic Reasoning	10.0	10.0	10.0	4.0

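The Tool Use and Search category is where the models separate most dramatically. MiniMax M3 averaged 8.7 and Kimi 2.6 averaged 8.0, while DeepSeek dropped to 3.3 and Qwen fell to 2.0. The three questions in this category account for 18 of the 36 points separating Kimi 2.6 from Qwen 3.6 Plus, and 14 of the 16 points separating Kimi 2.6 from DeepSeek V4 Pro.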
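How much of the overall gap comes from this one category can be checked from the three Tool Use rows (Q10, Q11, Q13) and the totals in the score breakdown. This is a minimal sketch. The function names `category_average` and `gap_share` are hypothetical.

```python
# Scores on the three Tool Use / Search questions: Q10, Q11, Q13.
TOOL_USE = {
    "MiniMax M3": [10, 8, 8],
    "Kimi 2.6": [10, 8, 6],
    "DeepSeek V4 Pro": [2, 6, 2],
    "Qwen 3.6 Plus": [2, 2, 2],
}
# Overall totals from the full score breakdown.
TOTALS = {"MiniMax M3": 104, "Kimi 2.6": 112, "DeepSeek V4 Pro": 96, "Qwen 3.6 Plus": 76}


def category_average(scores):
    """Mean score for a category, rounded to one decimal as in the table."""
    return round(sum(scores) / len(scores), 1)


def gap_share(leader, other):
    """Return (tool-use gap, overall gap) in points between two models."""
    tool_gap = sum(TOOL_USE[leader]) - sum(TOOL_USE[other])
    overall_gap = TOTALS[leader] - TOTALS[other]
    return tool_gap, overall_gap


if __name__ == "__main__":
    print(gap_share("Kimi 2.6", "Qwen 3.6 Plus"))    # (18, 36)
    print(gap_share("Kimi 2.6", "DeepSeek V4 Pro"))  # (14, 16)
```

Read this way, Tool Use explains half of the distance between the top and bottom model, and nearly all of the distance between Kimi 2.6 and DeepSeek V4 Pro.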

What Each Model Did Well and Where It Fell Short

MiniMax M3 (104/130)

MiniMax was the most reliable model for tasks that required live web data. It correctly identified the 2026 Winter Olympics in Milan Cortina when DeepSeek and Qwen both returned the 2022 Beijing games. It was also the only model to return the correct paragraph number and line number when retrieving a hidden code from a long document.

MiniMax had no zero scores and gave a full answer to all 13 questions.

Its weak spot was logic. It chose the wrong truth-teller in Q2, picking B instead of the correct answer D. It also fabricated an answer for Q1 rather than identifying that the puzzle lacked enough information to solve.

Kimi 2.6 (112/130)

Kimi 2.6 had the highest overall score and was the strongest model across the most categories. It scored a perfect 10 on both coding questions, got the correct truth-teller in Q2 with a verified key assignment, and also retrieved the correct 2026 Olympics year.

Its main weakness was Q4, the constrained poem. The poem had 11 simultaneous rules to follow. Kimi attempted a full A to Z structure (52 lines), which was the most complete attempt of any model, but used the wrong tense throughout and broke the monovowel constraint repeatedly.

In Q13, Kimi had all the right data for Steps 1 through 9 but skipped the exchange rate division in Step 10 by arguing it was unnecessary. The formula must be followed as written. That reasoning error cost it points.

DeepSeek V4 Pro (96/130)

DeepSeek tied Kimi for the best coding performance and produced the most thorough logic analysis in Q2, working through every possible assignment before arriving at the correct answer.

It was also the most transparent of the four models when data was unavailable. Rather than making up stock prices or weather figures, it stated clearly that it could not retrieve the information and explained the calculation it would perform once the data was provided.

The problem is that live data retrieval is a large part of what modern agents actually do. DeepSeek scored 2 on the Weather Chain because it returned 2022 Beijing instead of the current 2026 games. It scored 2 on the Nobel Prize Chain because it retrieved the 2024 winner (Han Kang) instead of the 2025 winner (László Krasznahorkai). These are not edge cases. They are exactly the type of task where live search matters most.

Qwen 3.6 Plus (76/130)

Qwen produced the best Q1 analysis of all four models. Where other models fabricated a day-and-venue combination, Qwen correctly identified that the puzzle was underdetermined, explained the full elimination chain needed to solve it, and refused to guess.

That kind of analytical clarity shows up elsewhere too. Its Q2 analysis was also exhaustive, running through every possible key assignment systematically.

But Qwen's execution was inconsistent. In Q12, a 10-step string transformation, Qwen got steps 1 through 6 perfectly correct and then made a binary chunking error in Step 7. That single mistake cascaded into a wrong hexadecimal value, a wrong conditional check, and a final answer of 250FF instead of the correct 0D2. One slip in a ten-step chain and the output is still wrong.

In Q7, it started with an incorrect dependency resolution, worked through it, and self-corrected mid-response. The final answer was right but the instability was visible.

The One Question That Separated Everyone

Q10, the Weather Chain, tells you the most about how each model handles real-world tool use.

The task: find out which country won the most gold medals at the most recent Winter Olympics, identify its capital, and retrieve the current temperature there.

Kimi 2.6 and MiniMax M3 both identified the 2026 Milan Cortina games, correctly named Norway with 18 golds, and retrieved live temperatures in Oslo. DeepSeek and Qwen both returned the 2022 Beijing games. That is not a small error. It suggests the model answered from training data instead of running a live search, and that its training data was outdated.

The same pattern repeated in Q13. The Nobel Prize in Literature question requires knowing the 2025 winner. Models that searched returned László Krasznahorkai (correct). Models that relied on memory returned Han Kang (the 2024 winner). DeepSeek and Qwen both fell into this trap.
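One way to guard against this failure in an agent pipeline is a simple staleness check on any recalled "most recent" answer for a recurring event. The sketch below is a conservative heuristic, not something any of the tested models implement. The function `may_be_stale` and the June 2026 reference date are assumptions made for illustration.

```python
from datetime import date


def may_be_stale(recalled_year, cadence_years, today):
    """Return True when at least one full cycle has passed since the recalled
    edition, meaning a newer edition may exist and a live search is warranted."""
    return today.year - recalled_year >= cadence_years


if __name__ == "__main__":
    today = date(2026, 6, 1)  # assumed evaluation date, for illustration only

    # Winter Olympics recur every 4 years: a recalled 2022 edition needs checking.
    print(may_be_stale(2022, 4, today))  # True
    print(may_be_stale(2026, 4, today))  # False

    # Nobel Prizes are annual: a recalled 2024 laureate needs checking.
    print(may_be_stale(2024, 1, today))  # True
```

The check errs on the side of searching. It will sometimes flag an answer that is still current, which costs one extra lookup but avoids returning an outdated edition.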

If your use case involves live data at all, this gap is meaningful.

Choosing the Right Model

Use Case	Best Choice	Why
Coding and implementation	Kimi 2.6 / DeepSeek V4 Pro	Both scored 10 on both coding questions
Live data and search tasks	MiniMax M3	Highest Tool Use / Search average (8.7), full marks on document retrieval
Logic and reasoning	Kimi 2.6	Correct Q2, clean multi-step derivation
Analytical depth	Qwen 3.6 Plus	Best structural analysis on open-ended problems
Consistency across all tasks	MiniMax M3	No zeros, all questions answered fully
Best overall model	Kimi 2.6	Highest score, strongest across most categories

API Cost Comparison (as of June 2026)

All four models are open-weight and can be self-hosted with no per-token fee, although you still pay for your own infrastructure. For teams using the managed APIs, pricing is pay-as-you-go with no monthly subscription required.

Model	Developer	Input (per 1M tokens)	Output (per 1M tokens)	Context Window
Kimi 2.6	Moonshot AI	$0.95	$4.00	256K
MiniMax M3	MiniMax AI	$0.60	$2.40	1M
DeepSeek V4 Pro	DeepSeek AI	$0.44	$0.87	1M
Qwen 3.6 Plus	Alibaba	$0.33	$1.95	1M

Sources: Kimi official pricing · MiniMax official pricing · DeepSeek official pricing · Qwen via OpenRouter
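To turn the per-token prices above into a budget figure, multiply them by your expected token volume. The sketch below does this for a hypothetical monthly workload of 10M input tokens and 2M output tokens. The workload size and the names `PRICING` and `workload_cost` are assumptions for illustration, so substitute your own numbers.

```python
# USD per 1M tokens as (input, output), copied from the table above.
PRICING = {
    "Kimi 2.6": (0.95, 4.00),
    "MiniMax M3": (0.60, 2.40),
    "DeepSeek V4 Pro": (0.44, 0.87),
    "Qwen 3.6 Plus": (0.33, 1.95),
}


def workload_cost(model, input_tokens, output_tokens):
    """Estimated API cost in USD for the given token counts, rounded to cents."""
    input_price, output_price = PRICING[model]
    cost = input_tokens / 1e6 * input_price + output_tokens / 1e6 * output_price
    return round(cost, 2)


if __name__ == "__main__":
    # Hypothetical workload: 10M input tokens and 2M output tokens.
    for model in PRICING:
        print(f"{model}: ${workload_cost(model, 10_000_000, 2_000_000):.2f}")
```

For that assumed workload the estimates come to $17.50 for Kimi 2.6, $10.80 for MiniMax M3, $7.20 for Qwen 3.6 Plus, and $6.14 for DeepSeek V4 Pro. The ordering changes with the input-to-output ratio, because Qwen has the cheapest input tokens and DeepSeek the cheapest output tokens.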

Key Takeaways

The biggest finding from this benchmark is how much live data retrieval matters and how differently these models handle it.

Two of the four models failed on the same type of question: tasks that require searching for current information rather than recalling training data. MiniMax M3 and Kimi 2.6 handled these well. DeepSeek and Qwen did not.

The second finding is that high analytical capability does not guarantee high scores. Qwen produced some of the most rigorous reasoning on Q1 and Q2. But a single execution error in Q12 was enough to turn an otherwise correct chain of steps into a wrong answer. If the output is wrong, the reasoning quality leading up to it does not fully compensate.

Third: writing quality was effectively equal across all four models. All four scored 10 on the article rewrite. Writing is no longer a differentiator at this tier.

Conclusion

Four strong open source models. Four very different profiles.

Kimi 2.6 is the best all-round choice for teams that need strong performance across coding, reasoning, and live data tasks. MiniMax M3 is the right pick when reliable real-world search and retrieval matter most. DeepSeek V4 Pro is a solid option for coding-focused workflows where live data retrieval is handled separately. Qwen 3.6 Plus is best suited to analytical and reasoning tasks where execution consistency is less critical.

The models are good. The gap between them is real. Knowing which one fits your workflow before you build around it will save you significant time.