Jingyuan Zhang*, Kai Fu*, Yang Yue*, Chenxi Sun, Hongzhi Zhang, Yahui Liu, Xingguang Ji, Jia Fu, Tinghai Zhang, Yan Li, Qi Wang, Fuzheng Zhang, Guorui Zhou*†, Kun Gai
*Equal Contribution, †Corresponding Author
July 25, 2025
Klear Team, Kuaishou Technology
<aside> 📌
Improving the reasoning capabilities of large language models (LLMs) has recently attracted significant attention in the AI community. The current paradigm for developing strong reasoning models typically involves a two-stage approach: supervised fine-tuning (SFT) on distilled data, followed by reinforcement learning (RL). While the open-source community has flourished and open-source datasets are increasingly available, many critical training details remain unclear.
In this study, we present a comprehensive, open-source pipeline for training a high-performance reasoning model, named Klear-Qwen3-Thinking, starting from Qwen3-8B-Base. We balance training stability and exploratory behavior in RL through multiple strategies. Klear-Qwen3-Thinking-Preview achieves 76.4% on AIME 2025 and 63.9% on LiveCodeBench V5, improving by +13.7% and +8.8% over its SFT baseline, respectively. Notably, Klear-Qwen3-Thinking-Preview outperforms Qwen3-8B (thinking mode) and is competitive with DeepSeek-R1-0528-Qwen3-8B on math and coding, without distilling from DeepSeek-R1-0528.
👨💻 Github, 🤗 HF Model, 🤗 HF Dataset, 📖 Tech Report (coming soon), 🔎 Evaluation results
</aside>
<aside> 📌
Our team will continuously release and maintain advanced thinking models, with open-source access to our models, training recipes, and data.
</aside>
Performance comparison with SOTA models on AIME 24 & 25 and LiveCodeBench v5. Klear-SFT and Klear-Preview refer to Klear-Qwen3-Thinking-SFT and Klear-Qwen3-Thinking-Preview, respectively. Among 7B and 8B models, we outperform AceReason-Nemotron-1.1-7B (AceReason) and Qwen3-8B. Although we do not use any DeepSeek-R1-0528 data, we achieve results comparable to DeepSeek-R1-0528-Qwen3-8B. We also hold a clear advantage over larger models such as Qwen3-32B and DeepSeek-R1 (0120).
We use open-source SFT prompts for the cold start, including Art of Problem Solving (AoPS), OpenR1-Math-Raw (distilled from NuminaMath with DeepSeek-R1), Open-Thoughts-114k, and Nvidia-AceReason-1.1-SFT. We employ DeepSeek-R1 (0120) to generate long CoT reasoning trajectories. We deliberately avoid using DeepSeek-R1 (0528) as a teacher model to keep the comparison with the Qwen3 models fair, since they were released before 0528. In our experiments, we fine-tune Qwen3-8B-Base with the following hyperparameters: batch size = 256, max length = 32K, learning rate = 8e-5, epochs = 5.
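For concreteness, below is a minimal sketch of such an SFT run with Hugging Face transformers. The script name, dataset file, batch-size split, and data handling are our own assumptions; only the reported hyperparameters (batch size 256, 32K max length, learning rate 8e-5, 5 epochs) come from the text above.

```python
# sft_cold_start.py -- minimal sketch of the SFT cold-start stage.
# The dataset file and data handling are illustrative assumptions,
# not the exact Klear training code.
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "Qwen/Qwen3-8B-Base"
MAX_LENGTH = 32_768  # 32K context for long CoT trajectories

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

# Hypothetical JSONL file of (prompt, response) pairs, where responses are long
# CoT trajectories generated by DeepSeek-R1 (0120) for the open-source prompt sets.
dataset = load_dataset("json", data_files="klear_sft_cold_start.jsonl", split="train")

def tokenize(example):
    # Concatenate prompt and response, truncating to the 32K budget.
    text = example["prompt"] + example["response"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=MAX_LENGTH)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="klear-qwen3-thinking-sft",
    num_train_epochs=5,                # epochs = 5
    learning_rate=8e-5,                # learning rate = 8e-5
    per_device_train_batch_size=1,
    gradient_accumulation_steps=256,   # one way to reach a global batch size of 256
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```

A real run would of course shard the 8B model across GPUs (e.g., with DeepSpeed or FSDP) and likely pack multiple samples per 32K sequence; only the hyperparameters mirror those reported above.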
We initialize reinforcement learning from the cold-start model above. To scale RL effectively, it is crucial to develop training strategies that keep learning stable while encouraging the policy model's exploratory behavior. We therefore design the training scheme along two axes: training strategy and data curation.
We begin with a short-token-budget RL phase, in which responses exceeding 8K tokens receive a negative reward, to transition the model's reasoning pattern from distillation-driven imitation to self-exploration; a sketch of such a budgeted reward follows below.
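To make the reward shaping concrete, the sketch below shows one way such a budgeted reward could be implemented. The function name, the specific reward values, and the verifier interface are illustrative assumptions; the text only states that overlong responses under the 8K budget are assigned a negative reward.

```python
# Sketch of a short-token-budget reward (illustrative assumptions, not the
# exact Klear reward): correctness comes from a rule-based verifier, and any
# response longer than the budget is penalized regardless of correctness.

def budgeted_reward(response_token_ids, is_correct, budget=8192):
    """Return the scalar RL reward for one sampled response.

    response_token_ids: token ids of the generated response.
    is_correct: output of a verifier (math answer check / code unit tests).
    budget: token budget; overlong responses get a negative reward.
    """
    if len(response_token_ids) > budget:
        return -1.0          # overlong: penalize to discourage verbose distilled traces
    return 1.0 if is_correct else 0.0


# Example: a correct but 9,000-token response is still penalized under the 8K budget.
assert budgeted_reward(list(range(9000)), is_correct=True) == -1.0
assert budgeted_reward(list(range(3000)), is_correct=True) == 1.0
```

The point of the penalty is to push the policy away from simply reproducing verbose distilled traces while leaving the correctness signal intact.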
Key Findings: