On March 26, with the controversy over his departure earlier in the month finally subsiding, Junyang Lin (林俊旸), hailed as "Alibaba's youngest P10" and the soul of the Qwen model family, published a long essay on X, "From 'Reasoning' Thinking to 'Agentic' Thinking," laying out his analysis of how AI's technical paradigm is evolving. In it, Lin not only takes stock of the past but points clearly at the real battleground of future AI competition: an agentic era that moves beyond single-model comparisons and turns on systems, environments, and coordination.
The essay sketches a clear roadmap of AI capability evolution. Lin defines 2024-2025 as the "reasoning thinking" phase, represented by OpenAI's o1 and DeepSeek-R1, whose core achievement was proving that "thinking" can be a trainable, deliverable first-class capability. The essence of this phase was using reinforcement learning (RL) to obtain deterministic feedback in verifiable domains such as math and code, letting models "optimize for correctness rather than plausibility." Behind it, however, lay an enormous infrastructure challenge: reasoning RL evolved from a lightweight fine-tuning add-on into a systems-engineering problem requiring large-scale rollouts and high-throughput verification.
The real difficulty goes further still. The second part of the essay digs into the practical dilemma of fusing "thinking" and "instruct" modes. The analysis mirrors commercial reality: after Alibaba attempted the fusion in Qwen3, the subsequent 2507 releases shipped Instruct and Thinking as separate lines, because a large share of customers running batch operations still needed cost-effective, highly controllable instruct behavior.
The essay explicitly proposes "agentic thinking" as the core paradigm of next-generation AI. This marks a shift of the training target from the model itself to the model-plus-environment system. At the heart of agentic thinking is "thinking in order to act": it must handle problems a pure reasoning model never faces, such as deciding when to act, which tools to invoke, how to absorb uncertain environmental feedback, how to revise plans after failure, and how to stay coherent across many turns of interaction.
In Lin's view, advantage in the reasoning era came from better RL algorithms and feedback signals; in the agentic era, competitive advantage will rest on better environment design, tighter train-serve integration, and stronger multi-agent harness engineering. The environment itself becomes a first-class artifact, whose stability, realism, feedback richness, and exploit resistance are critical. At the same time, multi-agent organizations, systems built from planners, domain experts, and executor sub-agents, will become a core source of intelligence.
The essay can be read as a complete statement of Lin's technical philosophy, systematizing the thinking that drove Qwen's development during his tenure. It may also be a personal manifesto for what comes next: its emphasis on infrastructure and environment engineering for the "agent era" hints at the startup or research direction he favors.
The full text follows (the published article carried a Chinese translation produced by Qwen): From "Reasoning" Thinking to "Agentic" Thinking
The last two years reshaped how we evaluate models and what we expect from them. OpenAI's o1 showed that "thinking" could be a first-class capability, something you train for and expose to users. DeepSeek-R1 proved that reasoning-style post-training could be reproduced and scaled outside the original labs. OpenAI described o1 as a model trained with reinforcement learning to "think before it answers." DeepSeek positioned R1 as an open reasoning model competitive with o1.
That phase mattered. But the first half of 2025 was mostly about reasoning thinking: how to make models spend more inference-time compute, how to train them with stronger rewards, how to expose or control that extra reasoning effort. The question now is what comes next. I believe the answer is agentic thinking: thinking in order to act, while interacting with an environment, and continuously updating plans based on feedback from the world.
1. What the Rise of o1 and R1 Actually Taught Us
The first wave of reasoning models taught us that if we want to scale reinforcement learning in language models, we need feedback signals that are deterministic, stable, and scalable. Math, code, logic, and other verifiable domains became central because rewards in these settings are much stronger than generic preference supervision. They let RL optimize for correctness rather than plausibility. Infrastructure became critical.
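The contrast between verifiable rewards and preference supervision can be made concrete. Below is a minimal sketch of a deterministic reward for a math-style task; the `\boxed{...}` answer convention and function names are illustrative assumptions, not any lab's actual reward code.

```python
# A minimal sketch of a verifiable reward, assuming the task states its final
# answer in a \boxed{...} span (a common convention, used here illustratively).
import re


def extract_final_answer(completion: str):
    """Pull the last \\boxed{...} span out of a model completion, if any."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    return matches[-1].strip() if matches else None


def verifiable_reward(completion: str, reference: str) -> float:
    """Deterministic 0/1 reward: the final answer is right or it is not.

    Unlike a learned preference score, this signal cannot drift toward
    plausible-looking text -- it only pays for correctness.
    """
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == reference.strip() else 0.0
```

A reward of this shape is what makes math and code attractive RL domains: it is stable across reruns and cheap to scale, exactly the properties the paragraph above calls for.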
Once a model is trained to reason through longer trajectories, RL stops being a lightweight add-on to supervised fine-tuning. It becomes a systems problem. You need rollouts at scale, high-throughput verification, stable policy updates, efficient sampling. The emergence of reasoning models was as much an infra story as a modeling story. OpenAI described o1 as a reasoning line trained with RL, and DeepSeek R1 later reinforced that direction by showing how much dedicated algorithmic and infrastructure work reasoning-based RL demands. The first big transition: from scaling pretraining to scaling post-training for reasoning.
2. The Real Problem Was Never Just "Merge Thinking and Instruct"
At the beginning of 2025, many of us on the Qwen team had an ambitious picture in mind. The ideal system would unify thinking and instruct modes. It would support adjustable reasoning effort, similar in spirit to low / medium / high reasoning settings. Better still, it would automatically infer the appropriate amount of reasoning from the prompt and context, so the model could decide when to answer immediately, when to think longer, and when to spend much more computation on a truly difficult problem.
Conceptually, this was the right direction. Qwen3 was one of the clearest public attempts. It introduced "hybrid thinking modes," supported both thinking and non-thinking behavior in one family, emphasized controllable thinking budgets, and described a four-stage post-training pipeline that explicitly included "thinking mode fusion" after long-CoT cold start and reasoning RL.
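One way to picture a "controllable thinking budget" is as a serving-side truncation of the hidden reasoning segment. The sketch below is a toy illustration under assumed `<think>`/`</think>` delimiters; it is not Qwen3's actual implementation.

```python
# Toy sketch: enforce a thinking budget by truncating the reasoning span.
# The <think>/</think> marker names are assumptions for illustration.
def apply_thinking_budget(tokens, budget, think_open="<think>", think_close="</think>"):
    """Limit the reasoning segment of a generated token list to `budget` tokens.

    `tokens` is the full sequence as a list of token strings; everything
    between the markers is hidden reasoning, everything after is the answer.
    """
    if think_open not in tokens:
        return tokens  # non-thinking response: nothing to truncate
    start = tokens.index(think_open)
    end = tokens.index(think_close) if think_close in tokens else len(tokens)
    reasoning = tokens[start + 1:end][:budget]  # keep at most `budget` tokens
    return tokens[:start + 1] + reasoning + tokens[end:]
```

In a real system the cutoff would be applied during decoding rather than after the fact, but the control surface is the same: one knob over how much compute the visible "thinking" consumes.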
But merging is much easier to describe than to execute well. The hard part is data. When people talk about merging thinking and instruct, they often think first about model-side compatibility: can one checkpoint support both modes, can one chat template switch between them, can one serving stack expose the right toggles. The deeper issue is that the data distributions and behavioral objectives of the two modes are substantially different.
We did not get everything right when trying to balance model merging with improving the quality and diversity of post-training data. During that revision process, we also paid close attention to how users were actually engaging with thinking and instruct modes. A strong instruct model is typically rewarded for directness, brevity, formatting compliance, low latency on repetitive, high-volume enterprise tasks such as rewriting, labeling, templated support, structured extraction, and operational QA. A strong thinking model is rewarded for spending more tokens on difficult problems, maintaining coherent intermediate structure, exploring alternative paths, and preserving enough internal computation to meaningfully improve final correctness.
These two behavior profiles pull against each other. If the merged data is not carefully curated, the result is usually mediocre in both directions: the "thinking" behavior becomes noisy, bloated, or insufficiently decisive, while the "instruct" behavior becomes less crisp, less reliable, and more expensive than what commercial users actually want.
Separation remained attractive in practice. Later in 2025, after the initial hybrid framing of Qwen3, the 2507 line shipped distinct Instruct and Thinking updates, including separate 30B and 235B variants. In commercial deployment, a large number of customers still wanted high-throughput, low-cost, highly steerable instruct behavior for batch operations. For those scenarios, merging wasn't obviously a benefit. Separating the lines allowed teams to focus on solving the data and training problems of each mode more cleanly.
Other labs chose the opposite route. Anthropic publicly argued for an integrated model philosophy: Claude 3.7 Sonnet was introduced as a hybrid reasoning model where users could choose ordinary responses or extended thinking, and API users could set a thinking budget. Anthropic explicitly said they believed reasoning should be an integrated capability rather than a separate model. GLM-4.5 also publicly positioned itself as a hybrid reasoning model with both thinking and non-thinking modes, unifying reasoning, coding, and agent capabilities; DeepSeek later moved in a similar direction with V3.1's "Think & Non-Think" hybrid inference.
The key question is whether the merge is organic. If thinking and instruct are merely co-located inside one checkpoint but still behave like two awkwardly stitched personalities, the product experience remains unnatural. A truly successful merge requires a smooth spectrum of reasoning effort. The model should be able to express multiple levels of effort, and ideally choose among them adaptively. GPT-style effort control points toward this: a policy over compute, rather than a binary switch.
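A "policy over compute" can be sketched as a router that maps each request to an effort level rather than flipping a binary switch. The heuristic features and budget numbers below are placeholders; a production router would be learned, not hand-written.

```python
# Toy sketch of routing requests to reasoning-effort levels.
# Budgets and trigger phrases are illustrative assumptions only.
EFFORT_BUDGETS = {"low": 0, "medium": 1024, "high": 8192}  # thinking tokens

HARD_MARKERS = ("prove", "debug", "optimize", "step by step")


def choose_effort(prompt: str) -> str:
    """Pick an effort level from crude prompt features."""
    text = prompt.lower()
    if len(prompt) < 80 and not any(m in text for m in HARD_MARKERS):
        return "low"     # short routine request: answer immediately
    if any(m in text for m in HARD_MARKERS):
        return "high"    # explicit hard-problem cues: think at length
    return "medium"      # everything else: a moderate budget
```

The point of the sketch is the shape of the interface, a smooth spectrum of effort chosen per request, which is what distinguishes an organic merge from two stitched-together personalities.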
3. Why Anthropic's Direction Was a Useful Corrective
Anthropic's public framing around Claude 3.7 and Claude 4 was restrained. They emphasized integrated reasoning, user-controlled thinking budgets, real-world tasks, coding quality, and later the ability to use tools during extended thinking. Claude 3.7 was presented as a hybrid reasoning model with controllable budgets; Claude 4 extended that by allowing reasoning to interleave with tool use, while Anthropic simultaneously emphasized coding, long-running tasks, and agent workflows as primary goals.
Producing a longer reasoning trace doesn't automatically make a model more intelligent. In many cases, excessive visible reasoning signals weak allocation. If the model is trying to reason about everything in the same verbose way, it may be failing to prioritize, failing to compress, or failing to act. Anthropic's trajectory suggested a more disciplined view: thinking should be shaped by the target workload. If the target is coding, then thinking should help with codebase navigation, planning, decomposition, error recovery, and tool orchestration. If the target is agent workflows, then thinking should improve execution quality over long horizons rather than producing impressive intermediate prose.
This emphasis on targeted utility points toward something larger: we are moving from the era of training models to the era of training agents. We made this explicit in the Qwen3 blog, writing that "we are transitioning from an era focused on training models to one centered on training agents," and linking future RL advances to environmental feedback for long-horizon reasoning. An agent is a system that can formulate plans, decide when to act, use tools, perceive environment feedback, revise strategy, and continue over long horizons. It is defined by closed-loop interaction with the world.
4. What "Agentic Thinking" Really Means“智能體式思考”的真正含義
Agentic thinking is a different optimization target. Reasoning thinking is usually judged by the quality of internal deliberation before a final answer: can the model solve the theorem, write the proof, produce the correct code, or pass the benchmark. Agentic thinking is about whether the model can keep making progress while interacting with an environment.
The central question shifts from "Can the model think long enough?" to "Can the model think in a way that sustains effective action?" Agentic thinking has to handle several things that pure reasoning models can mostly avoid:
- Deciding when to stop thinking and take an action
- Choosing which tool to invoke and in what order
- Incorporating noisy or partial observations from the environment
- Revising plans after failures
- Maintaining coherence across many turns and many tool calls
In short, agentic thinking is a model reasoning through action.
Once the objective shifts from solving benchmark problems to solving interactive tasks, the RL stack changes. The infrastructure used for classical reasoning RL isn't enough. In reasoning RL, you can often treat rollouts as mostly self-contained trajectories with relatively clean evaluators. In agentic RL, the policy is embedded inside a larger harness: tool servers, browsers, terminals, search engines, simulators, execution sandboxes, API layers, memory systems, and orchestration frameworks. The environment is no longer a static verifier; it's part of the training system.
This creates a new systems requirement: training and inference must be more cleanly decoupled. Without that decoupling, rollout throughput collapses. Consider a coding agent that must execute generated code against a live test harness: the inference side stalls waiting for execution feedback, the training side starves for completed trajectories, and the whole pipeline operates far below the GPU utilization you would expect from classical reasoning RL. Adding tool latency, partial observability, and stateful environments amplifies these inefficiencies. The result is that experimentation slows and becomes painful long before you reach the capability levels you are targeting.
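The decoupling argument can be illustrated with a toy producer/consumer pipeline: rollout generation hands finished trajectories to the learner through a queue, so neither side blocks on the other's latency. Threads stand in for what would be separate inference and training services.

```python
# Toy sketch of train/serve decoupling via a trajectory queue.
# In a real system the actor and learner are separate services, not threads.
import queue
import threading


def run_decoupled(trajectories, batch_size):
    """One actor thread produces trajectories; one learner thread consumes."""
    q = queue.Queue()
    consumed = []

    def actor():  # the rollout/serving side: never waits on the learner
        for traj in trajectories:
            q.put(traj)

    def learner():  # the training side: blocks only until the next item
        for _ in range(batch_size):
            consumed.append(q.get())

    threads = [threading.Thread(target=actor), threading.Thread(target=learner)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed
```

The design choice the paragraph argues for is visible even at this scale: the actor's throughput is limited by environment latency, the learner's by batch assembly, and the queue keeps either stall from propagating to the other side.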
The environment itself also becomes a first-class research artifact. In the SFT era, we obsessed over data diversity. In the agent era, we should obsess over environment quality: stability, realism, coverage, difficulty, diversity of states, richness of feedback, exploit resistance, and scalability of rollout generation. Environment-building has started to become a real startup category rather than a side project. If the agent is being trained to operate in production-like settings, then the environment is part of the core capability stack.
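Treating the environment as a first-class artifact suggests auditing it the way one audits data. The sketch below checks two of the properties listed above, stability and exploit resistance, against an assumed `reset()`/`step()` interface; real environment QA is far more involved.

```python
# Toy environment audit: deterministic replay and a basic exploit probe.
# The env interface (reset() -> obs, step(a) -> (obs, reward)) is an assumption.
def audit_environment(make_env, probe_actions):
    """Return crude stability and exploit-resistance checks for an env."""

    def rollout():
        env = make_env()
        env.reset()
        total = 0.0
        for action in probe_actions:
            _, reward = env.step(action)
            total += reward
        return total

    stable = rollout() == rollout()  # same actions must yield same return

    env = make_env()
    env.reset()
    _, reward = env.step(None)       # degenerate "do nothing" action
    exploit_free = reward <= 0.0     # the env must not pay for inaction

    return {"stable": stable, "exploit_free": exploit_free}
```

Checks like these are cheap to run before any GPU time is spent, which is exactly why environment quality can be engineered rather than discovered mid-training.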
6. The Next Frontier Is More Usable Thought
My expectation is that agentic thinking will become the dominant form of thinking. I think it may eventually replace much of the old static-monologue version of reasoning thinking: excessively long, isolated internal traces that try to compensate for lack of interaction by emitting more and more text. Even on very difficult math or coding tasks, a genuinely advanced system should have the right to search, simulate, execute, inspect, verify, and revise. The objective is to solve problems robustly and productively.
The hardest challenge in training such systems is reward hacking. As soon as the model gets meaningful tool access, reward hacking becomes much more dangerous. A model with search might learn to look up answers directly during RL. A coding agent might exploit future information in a repository, misuse logs, or discover shortcuts that invalidate the task. An environment with hidden leaks can make the policy look superhuman while actually training it to cheat. This is where the agent era becomes much more delicate than the reasoning era. Better tools make the model more useful, but they also enlarge the attack surface for spurious optimization. We should expect the next serious research bottlenecks to come from environment design, evaluator robustness, anti-cheating protocols, and more principled interfaces between policy and world. Still, the direction is clear. Tool-enabled thinking is simply more useful than isolated thinking, and has a far better chance of improving real productivity.
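One concrete anti-cheating protocol implied above is filtering tool outputs that leak the answer during RL. The sketch below wraps a search tool with a deliberately simple leak check; real evaluator-robustness work is much harder, and `search_fn` is a hypothetical stand-in.

```python
# Toy sketch of a leak guard around a search tool during RL training.
# `search_fn` and the forbidden-span check are illustrative assumptions.
def guarded_search(search_fn, query, forbidden_spans):
    """Run the tool, then drop any result containing a known answer string.

    Without a guard like this, a policy with search access can "solve"
    benchmark tasks by looking up the answers, making reward meaningless.
    """
    results = search_fn(query)
    return [
        r for r in results
        if not any(span.lower() in r.lower() for span in forbidden_spans)
    ]
```

Substring matching is trivially evadable (paraphrase, encoding), which is the point: even this obvious defense shows why the policy-world interface becomes a research problem of its own.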
Agentic thinking will also mean harness engineering. The core intelligence will increasingly come from how multiple agents are organized: an orchestrator that plans and routes work, specialized agents that act like domain experts, and sub-agents that execute narrower tasks while helping control context, avoid pollution, and preserve separation between different levels of reasoning. The future is a shift from training models to training agents, and from training agents to training systems.
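The orchestrator-plus-specialists structure described above can be sketched as a routing layer. Agent roles, the plan format, and the routing rule below are all illustrative assumptions, not a description of any shipped system.

```python
# Minimal sketch of a multi-agent harness: a planner decomposes the task,
# and each step is routed to a role-specialized agent with its own context.
def orchestrate(task, agents, plan_fn):
    """Route planned subtasks to specialized agents; collect their outputs.

    Keeping each step's input isolated is what limits context pollution
    between different levels of reasoning.
    """
    results = []
    for step in plan_fn(task):           # planner produces role-tagged subtasks
        agent = agents[step["role"]]     # domain expert or executor sub-agent
        results.append(agent(step["input"]))
    return results
```

The intelligence in such a system lives as much in `plan_fn` and the role taxonomy as in any single agent, which is the essay's claim about harness engineering.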
Conclusion
The first phase of the reasoning wave established something important: RL on top of language models can produce qualitatively stronger cognition when the feedback signal is reliable and the infrastructure can support it.
The deeper transition is from reasoning thinking to agentic thinking: from thinking longer to thinking in order to act. The core object of training has shifted. It is the model-plus-environment system, or more concretely, the agent and the harness around it. That changes what research artifacts matter most: model architecture and training data, yes, but also environment design, rollout infrastructure, evaluator robustness, and the interfaces through which multiple agents coordinate. It changes what "good thinking" means: the most useful trace for sustaining action under real-world constraints, rather than the longest or most visible one.
It also changes where the competitive edge will come from. In the reasoning era, the edge came from better RL algorithms, stronger feedback signals, and more scalable training pipelines. In the agentic era, the edge will come from better environments, tighter train-serve integration, stronger harness engineering, and the ability to close the loop between a model's decisions and the consequences those decisions produce.
Notice: The content above (including the pictures and videos if any) is uploaded and posted by a user of NetEase Hao, which is a social media platform and only provides information storage services.