[agent] Agent Engineering 2024-2025
Q1 2024
| Date | Title | Type | One-line takeaway | Why it influenced consensus |
|---|---|---|---|---|
| 2024-01-22 | LangGraph | framework | Made agent loops + state tracking + human approvals explicit primitives. | Shifted community thinking from “prompt chains” to “controllable state machines” for agents. |
| 2024-02-05 | Understanding the planning of LLM agents: A survey | survey (paper) | Stabilized a shared taxonomy for planning (decompose/select/reflect/memory/modules). | Helped the field converge on what planning means and which sub-problems deserve separate evaluation. |
Q2 2024
| Date | Title | Type | One-line takeaway | Why it influenced consensus |
|---|---|---|---|---|
| 2024-04-11 | OSWorld | benchmark (paper) | Execution-based evaluation in real OS/app environments showed a large human–agent gap (best model far below human). | Made “real environment + reproducible execution” the gold standard for computer-use agents. |
| 2024-05-06 | SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering | paper/system | Demonstrated that agent-computer interfaces (ACI) can unlock meaningful gains, not just better base models. | Reframed progress as partly an interface / environment-design problem (the “agent UX” matters). |
| 2024-05-28 | Tool Learning with LLMs: A Survey | survey (paper) | Formalized tool learning as a staged pipeline (plan→select→call→respond) with stage-specific benchmarks. | Helped consensus converge that “tool use” is not one skill but multiple measurable failure points. |
| 2024-06-17 | τ-bench | benchmark (paper) | Showed <50% success and poor multi-trial reliability (pass^k) for tool-agent-user interactions in real domains. | Popularized “reliability across runs” as a core agent metric, not a nice-to-have. |
| 2024-06-27 | LangGraph Cloud (beta) | product/platform | Positioned “running agents reliably at scale” (persistence, queues, tracing) as a platform problem. | Reinforced an emerging consensus: production agents need infrastructure (ops/observability), not just prompts. |
Q3 2024
| Date | Title | Type | One-line takeaway | Why it influenced consensus |
|---|---|---|---|---|
| 2024-08-15 | Automated Design of Agentic Systems (ADAS) | paper | Proposed automating the search over agent system designs (meta-agents generating better agents). | Marked a shift from “design a workflow” to “optimize/search agent architectures,” similar to AutoML. |
| 2024-09-12 | Learning to reason with LLMs / o1-preview | model/research release | Framed stronger reasoning as RL-trained multi-step thinking that can reduce manual prompt scaffolding. | Increased consensus that reasoning improvements help planning—but did not remove the need for better tool loops and evaluation. |
Q4 2024 (Anthropic's standard-setting quarter)
| Date | Title | Type | One-line takeaway | Why it influenced consensus |
|---|---|---|---|---|
| 2024-10-10 | Orchestrating Agents: Routines and Handoffs (and Swarm repo) | platform guidance + OSS | Introduced “routines + handoffs” as a controllable multi-agent orchestration pattern; published Swarm as a reference implementation. | Made multi-agent systems feel like engineering patterns (handoffs/state) rather than magical “team of agents” demos. |
| 2024-10-22 | Anthropic computer use (Claude 3.5 + computer use) | product capability | Made GUI-level “computer use” a mainstream API feature (explicitly framed as a new capability). | Validated computer-use as a central agent modality (aligning with OSWorld-style evaluation). |
| 2024-11-25 | Model Context Protocol (MCP) introduced | standard/protocol | Proposed an open standard for secure, two-way connections between AI clients and external tools/data via MCP servers. | Shifted consensus toward interoperability: tool integration became a standards/ecosystem issue, not per-app glue. |
| 2024-12-19 | Building effective agents (Anthropic) | industry guidance/report | Argued successful agents use simple, composable patterns and clarified “workflows vs agents” tradeoffs. | Helped consolidate a pragmatic consensus: reliability often comes from constrained designs, not maximal autonomy. |
Q1 2025
| Date | Title | Type | One-line takeaway | Why it influenced consensus |
|---|---|---|---|---|
| 2025-01-23 | Introducing Operator (and Computer-Using Agent) | product + model note | Presented a web-operating agent that explicitly requires user confirmation at sensitive steps. | Mainstreamed “supervised autonomy” (handover points) as the default for real-world action agents. |
| 2025-02-02 | Introducing deep research (plus later system card) | product + safety report | Launched a multi-step web research agent emphasizing citations and risk mitigation (prompt injection, privacy, etc.). | Established “auditability (citations) + threat model” as core to research agents, not optional polish. |
| 2025-03-11 | New tools for building agents (Responses API, Agents SDK, tracing) | platform | Introduced agent primitives: unified Responses API + built-in tools + Agents SDK for orchestration and production tracing. | Signaled a platform-level consensus: agents require native tooling for orchestration/observability, not bespoke glue. |
| 2025-03-20 | Survey on Evaluation of LLM-based Agents | survey (paper) | Framed agent evaluation as fragmented and underdeveloped, calling for realism and broader metrics beyond accuracy. | Codified “evaluation” as a first-class research area and supported the pivot toward benchmark/tooling infrastructure. |
Deep Research: research goal → search → read → synthesize → generate report.
Highly structured.
ChatGPT Agent: goal → plan → tool selection → execute → observe → replan (loop back).
It runs continuously: a genuinely dynamic closed loop.
This was a quarter of engineering maturation: model capability is now largely sufficient, and system design has become the main battleground.
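The dynamic loop described above (goal → plan → tool selection → execute → observe → replan) can be sketched as a bounded control loop. This is a minimal toy, not any product's real implementation; every name (`run_agent`, `toy_planner`, the `Step` type) is hypothetical, and the planner here is deliberately trivial.

```python
from dataclasses import dataclass

@dataclass
class Step:
    tool: str
    args: dict

def run_agent(goal, tools, planner, max_steps=10):
    """Bounded plan -> execute -> observe -> replan loop."""
    observations = []
    for _ in range(max_steps):              # hard cap guards against infinite loops
        step = planner(goal, observations)  # replan from the latest observations
        if step is None:                    # planner signals completion
            return {"done": True, "observations": observations}
        result = tools[step.tool](**step.args)
        observations.append(result)         # feed the observation into the next plan
    return {"done": False, "observations": observations}

# Toy planner: keep adding 3 until the running total reaches the goal.
def toy_planner(goal, observations):
    total = observations[-1] if observations else 0
    if total >= goal:
        return None
    return Step(tool="add", args={"a": total, "b": 3})

tools = {"add": lambda a, b: a + b}
print(run_agent(10, tools, toy_planner))  # done after observations [3, 6, 9, 12]
```

The key structural point is that planning happens inside the loop on every iteration (a closed loop), rather than once up front (a fixed pipeline like Deep Research).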
Q2 2025
| Date | Title | Type | One-line takeaway | Why it influenced consensus |
|---|---|---|---|---|
| 2025-05-21 | New tools and features in the Responses API | platform update | Added remote MCP servers, background mode (async), reasoning summaries, and other production-oriented features. | Reinforced consensus that long-running agents need platform support for async execution, auditability, and interoperability. |
| 2025-05-29 | SWE-bench-Live | benchmark (paper) | Introduced continuously updatable, executable tasks to reduce staleness/contamination in coding-agent evaluation. | Strengthened consensus that static benchmarks get “solved” or leaked; evaluation must be live and reproducible. |
| 2025-06-25 | Gartner: >40% of agentic AI projects canceled by 2027 | industry report | Predicted widespread cancellations due to costs, unclear business value, and inadequate risk controls; warned about “agent washing.” | Moved enterprise consensus from hype to operational reality: ROI discipline and governance became non-negotiable. |
What is the UI's real role?
The GUI is a fallback layer that covers blind spots. In other words, the GUI is used only when there is:
- no API
- no structured interface
- no official integration
- no usable plugin
It is not the default path.
Then why do products still showcase the GUI? Because it:
- demonstrates "general-purpose execution capability"
- gives people confidence that "it can do anything"
- represents the marketing ceiling of capability
But in actual system design, engineers avoid the GUI wherever possible.
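The "API-first, GUI-fallback" preference above amounts to a fixed resolution order when choosing an executor for a capability. A minimal sketch, with entirely hypothetical integration names and registry shape:

```python
def resolve_executor(capability, integrations):
    """Pick the highest-fidelity integration available; GUI is the last resort."""
    # Preference order: official API > structured interface > plugin > GUI automation
    for kind in ("api", "structured", "plugin"):
        if kind in integrations.get(capability, {}):
            return kind, integrations[capability][kind]
    # No structured path exists: fall back to GUI automation (the blind-spot layer)
    return "gui", integrations[capability]["gui"]

integrations = {
    "send_invoice": {"api": "POST /invoices", "gui": "click-script"},
    "legacy_erp_entry": {"gui": "click-script"},  # blind spot: GUI is the only path
}

print(resolve_executor("send_invoice", integrations))      # ('api', 'POST /invoices')
print(resolve_executor("legacy_erp_entry", integrations))  # ('gui', 'click-script')
```

The GUI branch exists so the agent can claim full coverage, but the structured paths are always tried first.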
[2]
Reading the Gartner report left me with this feeling: visualization and tracking matter enormously. Everyone is now building agents; some are attempting general-purpose agents without yet realizing how hard that is, so the failure rate down the line will be high. But where exactly are the failure challenges? A consensus is still forming.
A three-stage model:
Stage 1: vertical agents
Stage 2: general-purpose tool orchestration
Stage 3: autonomous long-running agents
Gartner's prediction is not a rejection of Stage 1. It is saying: enterprises that try to jump straight to Stage 2 or Stage 3 will see very high failure rates.
Q3 2025
| Date | Title | Type | One-line takeaway | Why it influenced consensus |
|---|---|---|---|---|
| 2025-07-29 | Evaluation and Benchmarking of LLM Agents: A Survey | survey (paper) | Emphasized reliability/safety/objectives-×-process taxonomy; highlighted enterprise constraints (RBAC, compliance). | Cemented multi-dimensional “agent readiness” metrics (capability + reliability + safety + process). |
| 2025-08-26 | Assistants API deprecated; migrate to Responses | platform governance | Officially positioned Responses API as the future direction for building agents; set a sunset timeline. | Showed platform convergence: agent builders should rely on unified primitives (Responses) rather than bespoke beta abstractions. |
| 2025-08-27 | MCP-Bench | benchmark (paper) | Benchmarked tool-using agents on complex real-world tasks via MCP servers (tool discovery/coordination/precision). | Connected protocol adoption to measurable outcomes: standard access alone doesn’t guarantee good tool reasoning. |
From 2025 Q3 onward, discussion centered heavily on:
- Tool mis-selection
- Over-delegation
- Hallucinated state
- Permission drift
- Infinite-loop failure
An industry consensus gradually formed: the core challenge for agents is not reasoning but state consistency. Discussion of "state management frameworks" began to appear.
Q4 2025
| Date | Title | Type | One-line takeaway | Why it influenced consensus |
|---|---|---|---|---|
| 2025-10-28 | OSWorld-MCP | benchmark (paper) | Proposed fair evaluation of computer-use agents that can choose between GUI actions and MCP tool invocation. | Made hybrid evaluation (GUI + tools) a new norm, aligning benchmarks with real product architectures. |
| 2025-11-04 | Code execution with MCP (Anthropic) | platform/engineering note | Argued code execution can reduce context load, filter data, and improve efficiency/security for tool-heavy agents. | Strengthened consensus that scaling agents is a context/latency/state engineering problem as much as a modeling problem. |
| 2025-11-25 | MCP Specification (authoritative spec) | standard | Published authoritative protocol requirements for MCP (stabilizing it as infrastructure). | Maturation of standards signaled readiness for broader cross-vendor adoption and stable tooling ecosystems. |
| 2025-12-09 | Linux Foundation forms Agentic AI Foundation (AAIF) | governance event | Established vendor-neutral governance for MCP, OpenAI’s AGENTS.md, and other agent infrastructure projects. | Institutionalized interoperability and open governance as “critical infrastructure” for the agent ecosystem. |
Q1 2026
| Date | Title | Type | One-line takeaway | Why it influenced consensus |
|---|---|---|---|---|
| 2026-01-21 | 2026 Agentic Coding Trends Report (Anthropic) | industry report | Reported high AI usage in coding but limited “full delegation,” emphasizing supervision/validation in practice. | Grounded consensus toward “collaborative autonomy”: agents assist heavily but humans remain accountable. |
| 2026-01-26 | MCP Apps (official MCP extension) | standard/ecosystem | Enabled tools to return interactive UI components (forms/dashboards/workflows) in agent conversations. | Directly addressed the supervision UX bottleneck: users can understand/approve actions inside the loop at scale. |
| 2026-02-05 | Introducing OpenAI Frontier | enterprise platform | Framed the bottleneck as “how agents are built and run,” offering shared context + execution + eval/optimization + permissions. | Consolidated a mature consensus: enterprise agents need context layers, identity/permissions, and lifecycle evaluation. |
| 2026-02-11 | Power Platform Feb 2026 update (Power Apps MCP Server + enhanced agent feed) | enterprise product | Brought MCP server + “built-in human supervision” into business apps via an enhanced agent feed. | Validated that mainstream enterprise software is adopting the supervised-agent pattern as a default UX. |
| 2026-02-23 | Introducing Frontier Alliances | governance/enterprise program | Announced multi-year partnerships stressing workflow redesign and change management as deployment bottlenecks. | Elevated an emerging consensus: scaling agents is organizational engineering (process + governance), not just tech. |
The second-half-2025 consensus that "an agent is an engineering system" did not come out of nowhere; it grew out of the systematic friction exposed by large-scale enterprise pilots from 2024 through the first half of 2025. One calibration is needed, though: the consensus did not form "because there were too many failures"; it formed because practice began to scale, the complexity was exposed, and engineering problems became the main battleground. What follows is a structured explanation you can study and internalize.
Part 1: The causal chain (capability → boom → friction → engineering)
Stage 1: Capability leap (2024)
- Reasoning models stabilized
- Tool calling matured
- GUI automation became workable
- Multi-agent orchestration frameworks gained popularity
The industry's judgment: the technical conditions were met, and large-scale deployment could begin.
Stage 2: Large-scale pilots (2024 Q4 – 2025 Q2)
Enterprises began building:
- automated customer service
- automated approvals
- automated financial processing
- automated R&D assistance
- automated sales follow-up
Problems began to surface.
Stage 3: Reality friction exposed (2025 Q2–Q3)
Real problems erupted in concentrated form:
1. State consistency
- the agent forgets what it has already done
- tool return values diverge from the agent's internal reasoning
- variables drift across multi-step flows
2. Tool mis-selection
- calling the wrong API
- passing wrong call parameters
- repeating operations that have side effects
3. Runaway costs
- token usage explodes in multi-step loops
- GUI automation is extremely slow
- concurrent-run costs are unpredictable
4. Permissions and security
- write operations carry high risk
- data-leakage risk
- prompt injection
5. Operational difficulty
- impossible to debug
- impossible to replay
- impossible to trace decision paths
Enterprises realized: this is not a "model problem" but a "system problem".
Part 2: The core consensus that formed
In the second half of 2025, five stable beliefs gradually took hold across the industry:
Consensus 1: An agent is a "distributed system", not a "smart function"
The early misconception: agent = a smarter function call.
The mature view: agent = a long-lived distributed state machine.
It involves:
- multi-component coordination
- asynchronous execution
- state persistence
- concurrency control
- failure recovery
This is already a systems-engineering problem.
Consensus 2: The hard part is not reasoning but state management
The model can plan. What is hard is answering:
- What did the previous step do?
- Which variables have changed?
- Which side effects have already occurred?
- Is a compensation mechanism needed?
This closely resembles database transaction management.
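The database analogy above can be made concrete with a compensation (saga-style) sketch: record each side effect in a ledger only after it succeeds, and on failure undo the completed steps in reverse, like a rollback. All names here are illustrative, not a real framework.

```python
def run_with_compensation(steps):
    """Execute (name, action, compensate) triples; roll back completed steps on failure."""
    ledger = []                                 # which side effects actually happened
    try:
        for name, action, compensate in steps:
            action()
            ledger.append((name, compensate))   # record only after success
        return ledger, "committed"
    except Exception:
        for name, compensate in reversed(ledger):
            compensate()                        # undo in reverse order, like a DB rollback
        return ledger, "rolled_back"

log = []

def fail():
    raise RuntimeError("card declined")

steps = [
    ("reserve", lambda: log.append("reserve"), lambda: log.append("unreserve")),
    ("charge",  fail,                          lambda: log.append("refund")),
]
ledger, status = run_with_compensation(steps)
print(status, log)  # rolled_back ['reserve', 'unreserve']
```

Note that `charge` never reaches the ledger, so only `reserve` is compensated: the ledger answers exactly the questions above ("which side effects have already occurred?").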
Consensus 3: The tool layer is the biggest source of complexity
Not the model. Rather:
- unstable APIs
- changing tool schemas
- drifting response formats
- GUI version changes
Hence the growing emphasis on:
- tool contracts
- schema validation
- retry policies
- timeout strategies
Consensus 4: Observability is mandatory
A production agent must have:
- logging
- visual traces
- decision replay
- cost monitoring
- failure analysis
Without these, it cannot go into production.
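The minimum observability loop above reduces to one discipline: append every decision to a structured trace, so that cost accounting and replay fall out of the same record. A sketch with illustrative field names (real systems would use a tracing backend such as an OpenTelemetry exporter):

```python
import json, time

class Trace:
    def __init__(self):
        self.events = []

    def record(self, step, tool, args, result, tokens):
        """One structured event per agent decision."""
        self.events.append({
            "ts": time.time(), "step": step, "tool": tool,
            "args": args, "result": result, "tokens": tokens,
        })

    def total_tokens(self):
        """Cost monitoring: token spend summed over the whole run."""
        return sum(e["tokens"] for e in self.events)

    def replay(self):
        """Decision replay for debugging: the ordered path the agent took."""
        return [(e["step"], e["tool"], e["result"]) for e in self.events]

trace = Trace()
trace.record(1, "search", {"q": "MCP"}, "3 hits", tokens=120)
trace.record(2, "summarize", {"doc": "hit-1"}, "summary", tokens=340)
print(trace.total_tokens())       # 460
print(json.dumps(trace.replay()))
```

Because the trace is plain structured data, the same events feed logs, trace visualization, replay, and failure analysis without separate instrumentation.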
Consensus 5: API-first, GUI-fallback
The GUI is for demonstrating capability. Production systems prefer:
- APIs
- RPC
- structured tools
This is engineering rationality at work.
Part 3: Why did Gartner make that prediction?
Not because the technology regressed, but because market expectations ran too high. Enterprises wrongly assumed that:
- agents would restructure their processes automatically
- agents could replace large amounts of human labor
- no organizational change would be required
The reality is that agents require:
- process restructuring
- permission restructuring
- monitoring systems
- an operations team
Most enterprises were not prepared for this. Hence the high failure rate.
Part 4: What a mature agent engineering structure looks like
The architecture model that stabilized after 2025 Q4:
Planner Layer
↓
Tool Abstraction Layer
↓
Execution Sandbox
↓
State Store
↓
Monitoring & Audit
This is not prompt engineering. This is systems engineering.
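A toy wiring of the layered stack above shows how the pieces depend on each other: the planner's output flows through tool abstraction into a sandbox, and every outcome lands in both the state store and the monitor. Every class and interface here is illustrative, not a standard framework.

```python
class StateStore:
    def __init__(self):
        self.steps = []
    def append(self, entry):
        self.steps.append(entry)

class Monitor:
    def __init__(self):
        self.log = []
    def emit(self, event):
        self.log.append(event)

def sandboxed(fn, *args):
    """Stand-in for an execution sandbox: isolate and capture tool failures."""
    try:
        return {"ok": True, "value": fn(*args)}
    except Exception as e:
        return {"ok": False, "error": str(e)}

def run(plan, tools, store, monitor):
    for tool_name, arg in plan:                  # planner layer output
        fn = tools[tool_name]                    # tool abstraction layer
        result = sandboxed(fn, arg)              # execution sandbox
        store.append((tool_name, result))        # state store
        monitor.emit({"tool": tool_name, "ok": result["ok"]})  # monitoring & audit
    return store.steps

store, monitor = StateStore(), Monitor()
tools = {"double": lambda x: 2 * x, "boom": lambda x: 1 / 0}
run([("double", 21), ("boom", 0)], tools, store, monitor)
print([r["ok"] for _, r in store.steps])  # [True, False]
```

Note the property that makes this systems engineering rather than prompting: the failed `boom` step is contained by the sandbox, recorded in state, and visible to monitoring, instead of crashing the run.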
Part 5: A crucial cognitive upgrade
In 2024 the question was: is the model strong enough?
By the second half of 2025 the question became: is the system governable?
That shift is the mark of maturity.
Part 6: The core problems of agent engineering
If you want to study this field systematically, focus on these five dimensions:
- state consistency
- idempotency
- side-effect control
- failure recovery (recovery strategy)
- observability
These five matter more than raw model capability.
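Of the five dimensions, idempotency is the easiest to show in a few lines: give each side-effecting step a deterministic key so that a retried or replayed step executes at most once. A minimal in-memory sketch (a real system would persist the key store; all names are illustrative):

```python
executed = {}

def run_idempotent(key, action):
    """Execute action at most once per key; repeated calls return the cached result."""
    if key in executed:
        return executed[key]      # duplicate delivery: no second side effect
    result = action()
    executed[key] = result
    return result

sent = []

def send_email():
    sent.append("email#42")       # the side effect we must not duplicate
    return "ok"

print(run_idempotent("email#42", send_email))  # ok, email actually sent
print(run_idempotent("email#42", send_email))  # ok, cached; no duplicate send
print(sent)                                    # ['email#42'], exactly one send
```

This is the same mechanism that makes failure recovery safe: after a crash, the whole plan can simply be re-run, and already-completed steps become no-ops.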
Part 7: Summary in one sentence
The consensus that formed in the second half of 2025 was not "agents don't work"; it was "an agent is an engineering system and must be built with engineering discipline".
Part 8: How should you study this now?
If you want to truly understand the mature thinking of this period, do not read only papers. Study:
- distributed systems
- transaction management
- API design
- failure-recovery strategies
- monitoring and tracing
Agent engineering has entered the domain of systems engineering.
OpenAI is not unaware of the direction. Rather:
- internal progress is slow
- product risk is high
- external validation is needed
OpenClaw amounts to market validation.