[agent] Agent Engineering 2024-2025

 

 

Q1 2024

| Date | Title | Type | One-line takeaway | Why it influenced consensus |
| --- | --- | --- | --- | --- |
| 2024-01-22 | LangGraph | framework | Made agent loops + state tracking + human approvals explicit primitives. | Shifted community thinking from “prompt chains” to “controllable state machines” for agents. |
| 2024-02-05 | Understanding the planning of LLM agents: A survey | survey (paper) | Stabilized a shared taxonomy for planning (decompose/select/reflect/memory/modules). | Helped the field converge on what planning means and which sub-problems deserve separate evaluation. |
 

Q2 2024

| Date | Title | Type | One-line takeaway | Why it influenced consensus |
| --- | --- | --- | --- | --- |
| 2024-04-11 | OSWorld | benchmark (paper) | Execution-based evaluation in real OS/app environments showed a large human–agent gap (best model far below human). | Made “real environment + reproducible execution” the gold standard for computer-use agents. |
| 2024-05-06 | SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering | paper/system | Demonstrated that agent-computer interfaces (ACI) can unlock meaningful gains, not just better base models. | Reframed progress as partly an interface / environment-design problem (the “agent UX” matters). |
| 2024-05-28 | Tool Learning with LLMs: A Survey | survey (paper) | Formalized tool learning as a staged pipeline (plan→select→call→respond) with stage-specific benchmarks. | Helped consensus converge that “tool use” is not one skill but multiple measurable failure points. |
| 2024-06-17 | τ-bench | benchmark (paper) | Showed <50% success and poor multi-trial reliability (pass^k) for tool-agent-user interactions in real domains. | Popularized “reliability across runs” as a core agent metric, not a nice-to-have. |
| 2024-06-27 | LangGraph Cloud (beta) | product/platform | Positioned “running agents reliably at scale” (persistence, queues, tracing) as a platform problem. | Reinforced an emerging consensus: production agents need infrastructure (ops/observability), not just prompts. |
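τ-bench's pass^k idea can be made concrete with a small sketch: with a per-task success rate p, the chance of succeeding in all k independent trials is p^k, which punishes inconsistency far harder than plain accuracy does. A simplified illustration (τ-bench itself estimates pass^k from discrete trial outcomes; the numbers below are hypothetical):

```python
# Sketch of the pass^k reliability metric: the probability that an agent
# succeeds on a task in ALL of k independent trials. With per-task success
# rate p, pass^k = p**k, so averaging p**k over tasks punishes inconsistent
# agents much harder than plain accuracy does.
def pass_hat_k(per_task_success_rates, k):
    """Average pass^k over tasks, given each task's empirical success rate."""
    rates = list(per_task_success_rates)
    return sum(p ** k for p in rates) / len(rates)

# Hypothetical agent: 90% and 50% success on two tasks.
# pass^1 = (0.9 + 0.5) / 2 = 0.70, but pass^4 = (0.9**4 + 0.5**4) / 2 ≈ 0.359
print(pass_hat_k([0.9, 0.5], 1))  # plain accuracy
print(pass_hat_k([0.9, 0.5], 4))  # multi-trial reliability, much lower
```

The gap between pass^1 and pass^4 is exactly what “reliability across runs” measures.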
 
A realization from this period: having the model answer directly is often a poor strategy; working step by step with existing tools is the right path, which shifts the bottleneck away from raw model capability. This period also underlined the importance of observability, an early sign of the trend toward frequent, small interactions with the model.
 
Is the “select” stage in Tool Learning with LLMs: A Survey related to ToT-style search, where the model scores one path and compares it against the others? No: it means “choosing the more suitable tool”, not a systematic search strategy.
 

Q3 2024

| Date | Title | Type | One-line takeaway | Why it influenced consensus |
| --- | --- | --- | --- | --- |
| 2024-08-15 | Automated Design of Agentic Systems (ADAS) | paper | Proposed automating the search over agent system designs (meta-agents generating better agents). | Marked a shift from “design a workflow” to “optimize/search agent architectures,” similar to AutoML. |
| 2024-09-12 | Learning to reason with LLMs / o1-preview | model/research release | Framed stronger reasoning as RL-trained multi-step thinking that can reduce manual prompt scaffolding. | Increased consensus that reasoning improvements help planning—but did not remove the need for better tool loops and evaluation. |
 
Bridging note: in the first half of 2024, LLM tool use began to mature in both theory and practice; in the second half, standard-setting began, after which productization was only a matter of time.
 

Q4 2024 (Anthropic's standard-setting quarter)

| Date | Title | Type | One-line takeaway | Why it influenced consensus |
| --- | --- | --- | --- | --- |
| 2024-10-10 | Orchestrating Agents: Routines and Handoffs (and Swarm repo) | platform guidance + OSS | Introduced “routines + handoffs” as a controllable multi-agent orchestration pattern; published Swarm as a reference implementation. | Made multi-agent systems feel like engineering patterns (handoffs/state) rather than magical “team of agents” demos. |
| 2024-10-22 | Anthropic computer use (Claude 3.5 + computer use) | product capability | Made GUI-level “computer use” a mainstream API feature (explicitly framed as a new capability). | Validated computer-use as a central agent modality (aligning with OSWorld-style evaluation). |
| 2024-11-25 | Model Context Protocol (MCP) introduced | standard/protocol | Proposed an open standard for secure, two-way connections between AI clients and external tools/data via MCP servers. | Shifted consensus toward interoperability: tool integration became a standards/ecosystem issue, not per-app glue. |
| 2024-12-19 | Building effective agents (Anthropic) | industry guidance/report | Argued successful agents use simple, composable patterns and clarified “workflows vs agents” tradeoffs. | Helped consolidate a pragmatic consensus: reliability often comes from constrained designs, not maximal autonomy. |
 

Q1 2025

| Date | Title | Type | One-line takeaway | Why it influenced consensus |
| --- | --- | --- | --- | --- |
| 2025-01-23 | Introducing Operator (and Computer-Using Agent) | product + model note | Presented a web-operating agent that explicitly requires user confirmation at sensitive steps. | Mainstreamed “supervised autonomy” (handover points) as the default for real-world action agents. |
| 2025-02-02 | Introducing deep research (plus later system card) | product + safety report | Launched a multi-step web research agent emphasizing citations and risk mitigation (prompt injection, privacy, etc.). | Established “auditability (citations) + threat model” as core to research agents, not optional polish. |
| 2025-03-11 | New tools for building agents (Responses API, Agents SDK, tracing) | platform | Introduced agent primitives: unified Responses API + built-in tools + Agents SDK for orchestration and production tracing. | Signaled a platform-level consensus: agents require native tooling for orchestration/observability, not bespoke glue. |
| 2025-03-20 | Survey on Evaluation of LLM-based Agents | survey (paper) | Framed agent evaluation as fragmented and underdeveloped, calling for realism and broader metrics beyond accuracy. | Codified “evaluation” as a first-class research area and supported the pivot toward benchmark/tooling infrastructure. |
 
ChatGPT Deep Research: initial release, February 2, 2025
 

Deep Research: research goal → search → read → synthesize → generate report

Highly structured.

ChatGPT Agent: goal → plan → tool selection → execute → observe → replan

Continuously running; a genuinely dynamic closed loop.
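The dynamic loop above (goal → plan → tool selection → execute → observe → replan) can be sketched as a minimal closed loop. `plan_next_action` and the toy tool registry below are hypothetical stand-ins for a real model call and real tools:

```python
# Minimal sketch of the dynamic agent loop described above:
# goal -> plan -> tool selection -> execute -> observe -> replan.
# `plan_next_action` stands in for a model call; tools are plain functions.

def run_agent(goal, plan_next_action, tools, max_steps=10):
    """Run the closed loop until the planner signals completion."""
    observations = []
    for _ in range(max_steps):
        action = plan_next_action(goal, observations)  # (re)plan from state
        if action["tool"] == "finish":                 # planner decides to stop
            return action["result"]
        tool = tools[action["tool"]]                   # tool selection
        result = tool(**action["args"])                # execute
        observations.append((action, result))          # observe; feeds replanning
    raise RuntimeError("step budget exhausted")

# Toy planner: search once, then finish with whatever was observed.
def toy_planner(goal, observations):
    if not observations:
        return {"tool": "search", "args": {"query": goal}}
    return {"tool": "finish", "result": observations[-1][1]}

print(run_agent("MCP spec", toy_planner,
                {"search": lambda query: f"results for {query}"}))
```

The contrast with Deep Research is that the structured pipeline fixes the step order up front, while this loop re-decides the next step after every observation.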

 

This was:

the quarter of the engineering-maturity leap.

Model capability was largely sufficient;
system design became the main battleground.
 

Q2 2025

| Date | Title | Type | One-line takeaway | Why it influenced consensus |
| --- | --- | --- | --- | --- |
| 2025-05-21 | New tools and features in the Responses API | platform update | Added remote MCP servers, background mode (async), reasoning summaries, and other production-oriented features. | Reinforced consensus that long-running agents need platform support for async execution, auditability, and interoperability. |
| 2025-05-29 | SWE-bench-Live | benchmark (paper) | Introduced continuously updatable, executable tasks to reduce staleness/contamination in coding-agent evaluation. | Strengthened consensus that static benchmarks get “solved” or leaked; evaluation must be live and reproducible. |
| 2025-06-25 | Gartner: >40% of agentic AI projects canceled by 2027 | industry report | Predicted widespread cancellations due to costs, unclear business value, and inadequate risk controls; warned about “agent washing.” | Moved enterprise consensus from hype to operational reality: ROI discipline and governance became non-negotiable. |
 
[1]
17 July 2025: ChatGPT agent starts rolling out today to Pro, Plus, and Team.
The early browser-agent papers date back to 2023, the idea's germination phase.
Quite possibly, OpenAI was the first to productize this in a large-scale rollout.
 

What is the UI's real role?

The GUI is:

a fallback layer covering the blind spots.

That is, it is used only when there is:

    • no API

    • no structured interface

    • no official integration

    • no usable plugin

It is not the default path.
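The fallback rule above can be sketched as a simple interface-resolution policy; the capability names here are illustrative, not any real product's API:

```python
# Sketch of the "GUI as fallback layer" idea: prefer structured interfaces
# and only fall back to GUI automation when nothing structured exists.
# The capability names are illustrative, not a real API.

PREFERENCE_ORDER = ["api", "mcp_server", "plugin", "gui"]  # gui is last resort

def pick_interface(available):
    """Return the most structured interface an app exposes."""
    for kind in PREFERENCE_ORDER:
        if kind in available:
            return kind
    raise LookupError("no usable interface")

print(pick_interface({"api", "gui"}))   # structured path wins over GUI
print(pick_interface({"gui"}))          # GUI only when nothing else exists
```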

 

Then why do products still showcase the GUI?

Because it:

    • demonstrates “general execution capability”

    • inspires confidence that “it can do anything”

    • marks the marketing ceiling of capability

But in real system design:

engineers avoid the GUI wherever possible.

 

 

[2]

After reading the Gartner report, my feeling is that visualization and tracking are extremely important. Everyone has started building agents; some are attempting general-purpose agents without yet realizing how hard that is, so the failure rate will be high. But where exactly do the failure challenges lie? That consensus is still forming.

A three-stage model:

Stage 1: vertical agents
Stage 2: general tool orchestration
Stage 3: autonomous long-running agents

Gartner's prediction is not a rejection of Stage 1.

It is saying:

enterprises that try to jump straight to Stage 2 / Stage 3 will see very high failure rates.

 

Q3 2025

| Date | Title | Type | One-line takeaway | Why it influenced consensus |
| --- | --- | --- | --- | --- |
| 2025-07-29 | Evaluation and Benchmarking of LLM Agents: A Survey | survey (paper) | Emphasized a taxonomy spanning reliability, safety, and objectives × process; highlighted enterprise constraints (RBAC, compliance). | Cemented multi-dimensional “agent readiness” metrics (capability + reliability + safety + process). |
| 2025-08-26 | Assistants API deprecated; migrate to Responses | platform governance | Officially positioned Responses API as the future direction for building agents; set a sunset timeline. | Showed platform convergence: agent builders should rely on unified primitives (Responses) rather than bespoke beta abstractions. |
| 2025-08-27 | MCP-Bench | benchmark (paper) | Benchmarked tool-using agents on complex real-world tasks via MCP servers (tool discovery/coordination/precision). | Connected protocol adoption to measurable outcomes: standard access alone doesn’t guarantee good tool reasoning. |
 

From Q3 2025, discussion intensified around:

    • Tool mis-selection

    • Over-delegation

    • Hallucinated state

    • Permission drift

    • Infinite-loop failure

An industry consensus gradually took shape:

the core challenge for agents is not reasoning but state consistency.

Discussion of “state-management frameworks” began to appear.

Agents are engineering systems, not magic.
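One of these failure modes, infinite-loop failure, has a purely engineering mitigation: a loop guard that halts the run when the agent keeps repeating the same action or exhausts its step budget. An illustrative sketch (the names are hypothetical):

```python
# Loop guard against the "infinite loop failure" mode discussed above:
# halt when the agent repeats an identical action too often, or when the
# overall step budget is exhausted. Call `check` once per planned action.

class LoopGuard:
    def __init__(self, max_steps=20, max_repeats=3):
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self.counts = {}

    def check(self, action_key):
        """Raise if the agent appears to be looping."""
        self.steps += 1
        self.counts[action_key] = self.counts.get(action_key, 0) + 1
        if self.steps > self.max_steps:
            raise RuntimeError("step budget exhausted")
        if self.counts[action_key] > self.max_repeats:
            raise RuntimeError(f"repeated action detected: {action_key}")

guard = LoopGuard(max_repeats=2)
guard.check("search:foo")
guard.check("search:foo")
try:
    guard.check("search:foo")   # third identical call trips the guard
except RuntimeError as e:
    print(e)
```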
 

Q4 2025

| Date | Title | Type | One-line takeaway | Why it influenced consensus |
| --- | --- | --- | --- | --- |
| 2025-10-28 | OSWorld-MCP | benchmark (paper) | Proposed fair evaluation of computer-use agents that can choose between GUI actions and MCP tool invocation. | Made hybrid evaluation (GUI + tools) a new norm, aligning benchmarks with real product architectures. |
| 2025-11-04 | Code execution with MCP (Anthropic) | platform/engineering note | Argued code execution can reduce context load, filter data, and improve efficiency/security for tool-heavy agents. | Strengthened consensus that scaling agents is a context/latency/state engineering problem as much as a modeling problem. |
| 2025-11-25 | MCP Specification (authoritative spec) | standard | Published authoritative protocol requirements for MCP (stabilizing it as infrastructure). | Maturation of standards signaled readiness for broader cross-vendor adoption and stable tooling ecosystems. |
| 2025-12-09 | Linux Foundation forms Agentic AI Foundation (AAIF) | governance event | Established vendor-neutral governance for MCP, OpenAI’s AGENTS.md, and other agent infrastructure projects. | Institutionalized interoperability and open governance as “critical infrastructure” for the agent ecosystem. |
 

Q1 2026

| Date | Title | Type | One-line takeaway | Why it influenced consensus |
| --- | --- | --- | --- | --- |
| 2026-01-21 | 2026 Agentic Coding Trends Report (Anthropic) | industry report | Reported high AI usage in coding but limited “full delegation,” emphasizing supervision/validation in practice. | Grounded consensus toward “collaborative autonomy”: agents assist heavily but humans remain accountable. |
| 2026-01-26 | MCP Apps (official MCP extension) | standard/ecosystem | Enabled tools to return interactive UI components (forms/dashboards/workflows) in agent conversations. | Directly addressed the supervision UX bottleneck: users can understand/approve actions inside the loop at scale. |
| 2026-02-05 | Introducing OpenAI Frontier | enterprise platform | Framed the bottleneck as “how agents are built and run,” offering shared context + execution + eval/optimization + permissions. | Consolidated a mature consensus: enterprise agents need context layers, identity/permissions, and lifecycle evaluation. |
| 2026-02-11 | Power Platform Feb 2026 update (Power Apps MCP Server + enhanced agent feed) | enterprise product | Brought MCP server + “built-in human supervision” into business apps via an enhanced agent feed. | Validated that mainstream enterprise software is adopting the supervised-agent pattern as a default UX. |
| 2026-02-23 | Introducing Frontier Alliances | governance/enterprise program | Announced multi-year partnerships stressing workflow redesign and change management as deployment bottlenecks. | Elevated an emerging consensus: scaling agents is organizational engineering (process + governance), not just tech. |
 
 
 
 

The second-half-2025 consensus that "agents are engineering systems" did not appear out of thin air; it grew from the systemic friction exposed by large-scale enterprise pilots across 2024 and the first half of 2025. One calibration is needed, though:

the consensus did not form "because so many projects failed",
but "because practice reached scale, the complexity surfaced, and engineering problems became the main battleground".

Below is a structured explanation you can study and internalize.

 

1. The causal chain first (capability → hype → friction → engineering)

Stage 1: capability leap (2024)

  • Reasoning models stabilized

  • Tool calling matured

  • GUI automation became workable

  • Multi-agent orchestration frameworks went mainstream

The industry's judgment:

the technical preconditions are met; scaled deployment can begin.


Stage 2: large-scale pilots (2024 Q4 – 2025 Q2)

Enterprises started building:

  • automated customer service

  • automated approvals

  • automated finance processing

  • automated R&D assistance

  • automated sales follow-up

Problems began to surface.


Stage 3: real-world friction exposed (2025 Q2–Q3)

The real problems erupted all at once:

1️⃣ State consistency

  • The agent forgets what it has already done

  • Tool return values diverge from its internal reasoning

  • Variables drift across multi-step workflows

2️⃣ Tool mis-selection

  • Wrong API chosen

  • Wrong call parameters

  • Side-effecting operations executed twice

3️⃣ Runaway cost

  • Token usage explodes across multi-step loops

  • GUI automation is extremely slow

  • Concurrent runs have unpredictable cost

4️⃣ Permissions and security

  • Write operations are high-risk

  • Data-leak exposure

  • Prompt injection

5️⃣ Operational pain

  • Can't debug

  • Can't replay

  • Can't trace back the decision path

Enterprises realized:

this is not a "model problem"; it is a "systems problem".

 

2. The core consensus that formed

Over the second half of 2025, the industry settled on five stable insights:


Consensus 1: An agent is a "distributed system", not a "smart function"

Early misconception:

agent = a smarter function call.

Mature understanding:

agent = a long-lived distributed state machine.

It involves:

  • multi-component coordination

  • asynchronous execution

  • state persistence

  • concurrency control

  • failure recovery

That is already a systems-engineering problem.


Consensus 2: The hard part is not reasoning but state management

The model can plan.

The hard questions are:

  • What did the previous step do?

  • Which variables have changed?

  • Which side effects have already happened?

  • Is a compensation mechanism needed?

This resembles database transaction management.
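The transaction analogy can be sketched as a saga-style compensation log: every side-effecting step is recorded together with an action that undoes it, so a failed multi-step run can roll back in reverse order. A hypothetical sketch:

```python
# Saga-style compensation log, sketching the database-transaction analogy:
# each side-effecting step is recorded with a compensating action, so a
# failed multi-step run can be rolled back in reverse order.

class SideEffectLog:
    def __init__(self):
        self.compensations = []

    def run(self, step, compensate):
        """Execute a step and remember how to undo it."""
        result = step()
        self.compensations.append(compensate)
        return result

    def rollback(self):
        """Undo completed steps, most recent first."""
        while self.compensations:
            self.compensations.pop()()

state = {"tickets": []}
log = SideEffectLog()
log.run(lambda: state["tickets"].append("T-1"),
        lambda: state["tickets"].remove("T-1"))
try:
    raise RuntimeError("later step failed")
except RuntimeError:
    log.rollback()               # compensate: ticket creation is undone
print(state["tickets"])          # []
```

Real systems persist this log so rollback survives a process crash; the in-memory list is only for illustration.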


Consensus 3: The tool layer is the biggest source of complexity

Not the model, but:

  • unstable APIs

  • shifting tool schemas

  • drifting response formats

  • changing GUI versions

Hence the new emphasis on:

  • tool contracts

  • schema validation

  • retry policies

  • timeout strategies
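These practices can be combined in one small wrapper: schema-validate the tool's output, retry transient failures with backoff, and bound each attempt by a time budget. A sketch under assumed names (note the timeout here is checked after the call returns, not a true cancellation):

```python
# Sketch of a minimal tool contract: schema validation + retry with
# exponential backoff + a per-attempt time budget. Names are illustrative.
import time

def call_tool(fn, schema, retries=3, timeout=2.0, backoff=0.1):
    """Call a tool, enforcing a minimal contract on its output."""
    last_err = None
    for attempt in range(retries):
        start = time.monotonic()
        try:
            result = fn()
            if time.monotonic() - start > timeout:
                raise TimeoutError("tool exceeded time budget")
            # minimal schema check: required keys and their types
            for key, typ in schema.items():
                if not isinstance(result.get(key), typ):
                    raise ValueError(f"schema violation on {key!r}")
            return result
        except Exception as e:
            last_err = e
            time.sleep(backoff * (2 ** attempt))   # exponential backoff
    raise RuntimeError(f"tool failed after {retries} attempts: {last_err}")

# Hypothetical flaky tool: fails once with a transient error, then succeeds.
flaky_calls = {"n": 0}
def flaky_weather():
    flaky_calls["n"] += 1
    if flaky_calls["n"] < 2:
        raise ConnectionError("transient")
    return {"temp_c": 21.5, "city": "Berlin"}

print(call_tool(flaky_weather, {"temp_c": float, "city": str}))
```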


Consensus 4: Observability is mandatory

A production agent must have:

  • logs

  • visual traces

  • decision replay

  • cost monitoring

  • failure analysis

Without these it cannot go to production.
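A minimal version of such a trace: structured events with timestamps and token costs, serializable as JSON log lines for replay and audit. An illustrative sketch (the event shapes are hypothetical, not any framework's format):

```python
# Sketch of a structured run trace covering the requirements above:
# every decision and tool call is an event with a timestamp and token
# cost, and the whole run can be dumped as JSON lines for replay/audit.
import json
import time

class Trace:
    def __init__(self, run_id):
        self.run_id = run_id
        self.events = []

    def record(self, kind, **data):
        self.events.append({"t": time.time(), "kind": kind, **data})

    def total_cost(self):
        """Cost monitoring: sum token usage across all events."""
        return sum(e.get("tokens", 0) for e in self.events)

    def dump(self):
        """One JSON log line per event, suitable for replay and audit."""
        return "\n".join(json.dumps(e) for e in self.events)

trace = Trace("run-42")
trace.record("plan", tokens=350, decision="search docs")
trace.record("tool_call", tokens=120, tool="search", args={"q": "MCP"})
trace.record("tool_result", status="ok")
print(trace.total_cost())   # 470
```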


Consensus 5: API-first, GUI-fallback

The GUI is for demonstrating capability.

Production systems prefer:

  • APIs

  • RPC

  • structured tools

This is plain engineering rationality.

 

3. Why did Gartner make that prediction?

Not because the technology regressed, but because:

market expectations ran too high.

Enterprises assumed:

  • agents would restructure processes automatically

  • agents could replace large amounts of labor

  • no organizational change would be required

The reality:

  • agents require process redesign

  • permission redesign

  • monitoring systems

  • an operations team

Most enterprises were not ready.

Hence the high failure rate.

 

4. The truly mature agent engineering structure

The architecture that stabilized after 2025 Q4:

Reasoning Model
    ↓
Planner Layer
    ↓
Tool Abstraction Layer
    ↓
Execution Sandbox
    ↓
State Store
    ↓
Monitoring & Audit

This is not prompt engineering.

This is systems engineering.

 

5. A crucial cognitive upgrade

In 2024 everyone asked:

is the model strong enough?

By late 2025 the question became:

is the system governable?

That shift is the mark of maturity.

 

6. The core problems of agent engineering

If you want to study this systematically, focus on five dimensions:

  1. State consistency

  2. Idempotency

  3. Side-effect control

  4. Recovery strategy

  5. Observability

These five matter more than raw model capability.
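Dimension 2, idempotency, is the easiest to show in code: deduplicate side-effecting operations behind an idempotency key so a retried or replayed step never fires its effect twice. A sketch with in-memory storage (a real system would persist the keys):

```python
# Idempotency sketch: side-effecting operations are deduplicated by key,
# so a retried or replayed step returns the cached result instead of
# executing its effect a second time. Storage here is an in-memory dict.

class IdempotentExecutor:
    def __init__(self):
        self._done = {}

    def execute(self, key, effect):
        """Run `effect` once per key; replays return the cached result."""
        if key in self._done:
            return self._done[key]
        result = effect()
        self._done[key] = result
        return result

sent = []
ex = IdempotentExecutor()
ex.execute("email:invoice-17", lambda: sent.append("invoice-17") or "sent")
ex.execute("email:invoice-17", lambda: sent.append("invoice-17") or "sent")  # replay
print(len(sent))   # 1 -- the effect ran only once
```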

 

7. Summary in one sentence

The consensus that formed in late 2025 is not that "agents don't work",
but that "agents are engineering systems and must be built with engineering discipline".

 

8. How should you study this now?

To truly absorb the mature understanding of this phase:

don't just read papers.

Study:

  • distributed systems

  • transaction management

  • API design

  • failure-recovery strategies

  • monitoring and tracing

Agent engineering has entered the realm of systems engineering.

 
 
------------------------------------------------------------------------------------------------------
The systematic engineering described above is exactly what OpenClaw embodies and implements.

It is not that OpenAI didn't know the direction. Rather:

  • internal progress was slow

  • product risk was high

  • external validation was needed

OpenClaw effectively served as the market's validation.

 
posted @ 2024-02-07 13:29 郝壹贰叁