[agent] Agent Engineering 2024-2025

 

 

Q1 2024

| Date | Title | Type | One-line takeaway | Why it influenced consensus |
| --- | --- | --- | --- | --- |
| 2024-01-22 | LangGraph | framework | Made agent loops + state tracking + human approvals explicit primitives. | Shifted community thinking from “prompt chains” to “controllable state machines” for agents. |
| 2024-02-05 | Understanding the planning of LLM agents: A survey | survey (paper) | Stabilized a shared taxonomy for planning (decompose/select/reflect/memory/modules). | Helped the field converge on what planning means and which sub-problems deserve separate evaluation. |
 

Q2 2024

| Date | Title | Type | One-line takeaway | Why it influenced consensus |
| --- | --- | --- | --- | --- |
| 2024-04-11 | OSWorld | benchmark (paper) | Execution-based evaluation in real OS/app environments showed a large human–agent gap (best model far below human). | Made “real environment + reproducible execution” the gold standard for computer-use agents. |
| 2024-05-06 | SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering | paper/system | Demonstrated that agent-computer interfaces (ACI) can unlock meaningful gains, not just better base models. | Reframed progress as partly an interface / environment-design problem (the “agent UX” matters). |
| 2024-05-28 | Tool Learning with LLMs: A Survey | survey (paper) | Formalized tool learning as a staged pipeline (plan→select→call→respond) with stage-specific benchmarks. | Helped consensus converge that “tool use” is not one skill but multiple measurable failure points. |
| 2024-06-17 | τ-bench | benchmark (paper) | Showed <50% success and poor multi-trial reliability (pass^k) for tool-agent-user interactions in real domains. | Popularized “reliability across runs” as a core agent metric, not a nice-to-have. |
| 2024-06-27 | LangGraph Cloud (beta) | product/platform | Positioned “running agents reliably at scale” (persistence, queues, tracing) as a platform problem. | Reinforced an emerging consensus: production agents need infrastructure (ops/observability), not just prompts. |
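τ-bench's pass^k idea can be made concrete with a small sketch: with a per-task success rate p, the chance of succeeding in all k independent trials is p^k, which punishes inconsistency far harder than plain accuracy does. A simplified illustration (τ-bench itself estimates pass^k from discrete trial outcomes; the numbers below are hypothetical):

```python
# Sketch of the pass^k reliability metric: the probability that an agent
# succeeds on a task in ALL of k independent trials. With per-task success
# rate p, pass^k = p**k, so averaging p**k over tasks punishes inconsistent
# agents much harder than plain accuracy does.
def pass_hat_k(per_task_success_rates, k):
    """Average pass^k over tasks, given each task's empirical success rate."""
    rates = list(per_task_success_rates)
    return sum(p ** k for p in rates) / len(rates)

# Hypothetical agent: 90% and 50% success on two tasks.
# pass^1 = (0.9 + 0.5) / 2 = 0.70, but pass^4 = (0.9**4 + 0.5**4) / 2 ≈ 0.359
print(pass_hat_k([0.9, 0.5], 1))  # plain accuracy
print(pass_hat_k([0.9, 0.5], 4))  # multi-trial reliability, much lower
```

The gap between pass^1 and pass^4 is exactly what “reliability across runs” measures.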
 
A realization from this period: having the model answer directly is often a poor strategy; working step by step with existing tools is the right path, which shifts the bottleneck away from raw model capability. This period also underlined the importance of observability, an early sign of the trend toward frequent, small interactions with the model.
 
Is the “select” stage in Tool Learning with LLMs: A Survey related to ToT-style search, where the model scores one path and compares it against the others? No: it means “choosing the more suitable tool”, not a systematic search strategy.
 

Q3 2024

| Date | Title | Type | One-line takeaway | Why it influenced consensus |
| --- | --- | --- | --- | --- |
| 2024-08-15 | Automated Design of Agentic Systems (ADAS) | paper | Proposed automating the search over agent system designs (meta-agents generating better agents). | Marked a shift from “design a workflow” to “optimize/search agent architectures,” similar to AutoML. |
| 2024-09-12 | Learning to reason with LLMs / o1-preview | model/research release | Framed stronger reasoning as RL-trained multi-step thinking that can reduce manual prompt scaffolding. | Increased consensus that reasoning improvements help planning—but did not remove the need for better tool loops and evaluation. |
 
Bridging note: in the first half of 2024, LLM tool use began to mature in both theory and practice; in the second half, standard-setting began, after which productization was only a matter of time.
 

Q4 2024 (Anthropic's standard-setting quarter)

| Date | Title | Type | One-line takeaway | Why it influenced consensus |
| --- | --- | --- | --- | --- |
| 2024-10-10 | Orchestrating Agents: Routines and Handoffs (and Swarm repo) | platform guidance + OSS | Introduced “routines + handoffs” as a controllable multi-agent orchestration pattern; published Swarm as a reference implementation. | Made multi-agent systems feel like engineering patterns (handoffs/state) rather than magical “team of agents” demos. |
| 2024-10-22 | Anthropic computer use (Claude 3.5 + computer use) | product capability | Made GUI-level “computer use” a mainstream API feature (explicitly framed as a new capability). | Validated computer-use as a central agent modality (aligning with OSWorld-style evaluation). |
| 2024-11-25 | Model Context Protocol (MCP) introduced | standard/protocol | Proposed an open standard for secure, two-way connections between AI clients and external tools/data via MCP servers. | Shifted consensus toward interoperability: tool integration became a standards/ecosystem issue, not per-app glue. |
| 2024-12-19 | Building effective agents (Anthropic) | industry guidance/report | Argued successful agents use simple, composable patterns and clarified “workflows vs agents” tradeoffs. | Helped consolidate a pragmatic consensus: reliability often comes from constrained designs, not maximal autonomy. |
 

Q1 2025

| Date | Title | Type | One-line takeaway | Why it influenced consensus |
| --- | --- | --- | --- | --- |
| 2025-01-23 | Introducing Operator (and Computer-Using Agent) | product + model note | Presented a web-operating agent that explicitly requires user confirmation at sensitive steps. | Mainstreamed “supervised autonomy” (handover points) as the default for real-world action agents. |
| 2025-02-02 | Introducing deep research (plus later system card) | product + safety report | Launched a multi-step web research agent emphasizing citations and risk mitigation (prompt injection, privacy, etc.). | Established “auditability (citations) + threat model” as core to research agents, not optional polish. |
| 2025-03-11 | New tools for building agents (Responses API, Agents SDK, tracing) | platform | Introduced agent primitives: unified Responses API + built-in tools + Agents SDK for orchestration and production tracing. | Signaled a platform-level consensus: agents require native tooling for orchestration/observability, not bespoke glue. |
| 2025-03-20 | Survey on Evaluation of LLM-based Agents | survey (paper) | Framed agent evaluation as fragmented and underdeveloped, calling for realism and broader metrics beyond accuracy. | Codified “evaluation” as a first-class research area and supported the pivot toward benchmark/tooling infrastructure. |
 
ChatGPT Deep Research: initial release, February 2, 2025
 

Deep Research: research goal → search → read → synthesize → generate report

Highly structured.

ChatGPT Agent: goal → plan → tool selection → execute → observe → replan

Continuously running; a genuinely dynamic closed loop.
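The dynamic loop above (goal → plan → tool selection → execute → observe → replan) can be sketched as a minimal closed loop. `plan_next_action` and the toy tool registry below are hypothetical stand-ins for a real model call and real tools:

```python
# Minimal sketch of the dynamic agent loop described above:
# goal -> plan -> tool selection -> execute -> observe -> replan.
# `plan_next_action` stands in for a model call; tools are plain functions.

def run_agent(goal, plan_next_action, tools, max_steps=10):
    """Run the closed loop until the planner signals completion."""
    observations = []
    for _ in range(max_steps):
        action = plan_next_action(goal, observations)  # (re)plan from state
        if action["tool"] == "finish":                 # planner decides to stop
            return action["result"]
        tool = tools[action["tool"]]                   # tool selection
        result = tool(**action["args"])                # execute
        observations.append((action, result))          # observe; feeds replanning
    raise RuntimeError("step budget exhausted")

# Toy planner: search once, then finish with whatever was observed.
def toy_planner(goal, observations):
    if not observations:
        return {"tool": "search", "args": {"query": goal}}
    return {"tool": "finish", "result": observations[-1][1]}

print(run_agent("MCP spec", toy_planner,
                {"search": lambda query: f"results for {query}"}))
```

The contrast with Deep Research is that the structured pipeline fixes the step order up front, while this loop re-decides the next step after every observation.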

 

This was:

the quarter of the engineering-maturity leap.

Model capability was largely sufficient;
system design became the main battleground.
 

Q2 2025

| Date | Title | Type | One-line takeaway | Why it influenced consensus |
| --- | --- | --- | --- | --- |
| 2025-05-21 | New tools and features in the Responses API | platform update | Added remote MCP servers, background mode (async), reasoning summaries, and other production-oriented features. | Reinforced consensus that long-running agents need platform support for async execution, auditability, and interoperability. |
| 2025-05-29 | SWE-bench-Live | benchmark (paper) | Introduced continuously updatable, executable tasks to reduce staleness/contamination in coding-agent evaluation. | Strengthened consensus that static benchmarks get “solved” or leaked; evaluation must be live and reproducible. |
| 2025-06-25 | Gartner: >40% of agentic AI projects canceled by 2027 | industry report | Predicted widespread cancellations due to costs, unclear business value, and inadequate risk controls; warned about “agent washing.” | Moved enterprise consensus from hype to operational reality: ROI discipline and governance became non-negotiable. |
 
[1]
17 July 2025: ChatGPT agent starts rolling out today to Pro, Plus, and Team.
The early browser-agent papers date back to 2023, the idea's germination phase.
Quite possibly, OpenAI was the first to productize this in a large-scale rollout.
 

What is the UI's real role?

The GUI is:

a fallback layer covering the blind spots.

That is, it is used only when there is:

    • no API

    • no structured interface

    • no official integration

    • no usable plugin

It is not the default path.
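The fallback rule above can be sketched as a simple interface-resolution policy; the capability names here are illustrative, not any real product's API:

```python
# Sketch of the "GUI as fallback layer" idea: prefer structured interfaces
# and only fall back to GUI automation when nothing structured exists.
# The capability names are illustrative, not a real API.

PREFERENCE_ORDER = ["api", "mcp_server", "plugin", "gui"]  # gui is last resort

def pick_interface(available):
    """Return the most structured interface an app exposes."""
    for kind in PREFERENCE_ORDER:
        if kind in available:
            return kind
    raise LookupError("no usable interface")

print(pick_interface({"api", "gui"}))   # structured path wins over GUI
print(pick_interface({"gui"}))          # GUI only when nothing else exists
```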

 

Then why do products still showcase the GUI?

Because it:

    • demonstrates “general execution capability”

    • inspires confidence that “it can do anything”

    • marks the marketing ceiling of capability

But in real system design:

engineers avoid the GUI wherever possible.

 

 

[2]

After reading the Gartner report, my feeling is that visualization and tracking are extremely important. Everyone has started building agents; some are attempting general-purpose agents without yet realizing how hard that is, so the failure rate will be high. But where exactly do the failure challenges lie? That consensus is still forming.

A three-stage model:

Stage 1: vertical agents
Stage 2: general tool orchestration
Stage 3: autonomous long-running agents

Gartner's prediction is not a rejection of Stage 1.

It is saying:

enterprises that try to jump straight to Stage 2 / Stage 3 will see very high failure rates.

 

Q3 2025

| Date | Title | Type | One-line takeaway | Why it influenced consensus |
| --- | --- | --- | --- | --- |
| 2025-07-29 | Evaluation and Benchmarking of LLM Agents: A Survey | survey (paper) | Emphasized a taxonomy spanning reliability, safety, and objectives × process; highlighted enterprise constraints (RBAC, compliance). | Cemented multi-dimensional “agent readiness” metrics (capability + reliability + safety + process). |
| 2025-08-26 | Assistants API deprecated; migrate to Responses | platform governance | Officially positioned Responses API as the future direction for building agents; set a sunset timeline. | Showed platform convergence: agent builders should rely on unified primitives (Responses) rather than bespoke beta abstractions. |
| 2025-08-27 | MCP-Bench | benchmark (paper) | Benchmarked tool-using agents on complex real-world tasks via MCP servers (tool discovery/coordination/precision). | Connected protocol adoption to measurable outcomes: standard access alone doesn’t guarantee good tool reasoning. |
 

From Q3 2025, discussion intensified around:

    • Tool mis-selection

    • Over-delegation

    • Hallucinated state

    • Permission drift

    • Infinite-loop failure

An industry consensus gradually took shape:

the core challenge for agents is not reasoning but state consistency.

Discussion of “state-management frameworks” began to appear.

Agents are engineering systems, not magic.
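One of these failure modes, infinite-loop failure, has a purely engineering mitigation: a loop guard that halts the run when the agent keeps repeating the same action or exhausts its step budget. An illustrative sketch (the names are hypothetical):

```python
# Loop guard against the "infinite loop failure" mode discussed above:
# halt when the agent repeats an identical action too often, or when the
# overall step budget is exhausted. Call `check` once per planned action.

class LoopGuard:
    def __init__(self, max_steps=20, max_repeats=3):
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self.counts = {}

    def check(self, action_key):
        """Raise if the agent appears to be looping."""
        self.steps += 1
        self.counts[action_key] = self.counts.get(action_key, 0) + 1
        if self.steps > self.max_steps:
            raise RuntimeError("step budget exhausted")
        if self.counts[action_key] > self.max_repeats:
            raise RuntimeError(f"repeated action detected: {action_key}")

guard = LoopGuard(max_repeats=2)
guard.check("search:foo")
guard.check("search:foo")
try:
    guard.check("search:foo")   # third identical call trips the guard
except RuntimeError as e:
    print(e)
```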
 

Q4 2025

| Date | Title | Type | One-line takeaway | Why it influenced consensus |
| --- | --- | --- | --- | --- |
| 2025-10-28 | OSWorld-MCP | benchmark (paper) | Proposed fair evaluation of computer-use agents that can choose between GUI actions and MCP tool invocation. | Made hybrid evaluation (GUI + tools) a new norm, aligning benchmarks with real product architectures. |
| 2025-11-04 | Code execution with MCP (Anthropic) | platform/engineering note | Argued code execution can reduce context load, filter data, and improve efficiency/security for tool-heavy agents. | Strengthened consensus that scaling agents is a context/latency/state engineering problem as much as a modeling problem. |
| 2025-11-25 | MCP Specification (authoritative spec) | standard | Published authoritative protocol requirements for MCP (stabilizing it as infrastructure). | Maturation of standards signaled readiness for broader cross-vendor adoption and stable tooling ecosystems. |
| 2025-12-09 | Linux Foundation forms Agentic AI Foundation (AAIF) | governance event | Established vendor-neutral governance for MCP, OpenAI’s AGENTS.md, and other agent infrastructure projects. | Institutionalized interoperability and open governance as “critical infrastructure” for the agent ecosystem. |
 

Q1 2026

| Date | Title | Type | One-line takeaway | Why it influenced consensus |
| --- | --- | --- | --- | --- |
| 2026-01-21 | 2026 Agentic Coding Trends Report (Anthropic) | industry report | Reported high AI usage in coding but limited “full delegation,” emphasizing supervision/validation in practice. | Grounded consensus toward “collaborative autonomy”: agents assist heavily but humans remain accountable. |
| 2026-01-26 | MCP Apps (official MCP extension) | standard/ecosystem | Enabled tools to return interactive UI components (forms/dashboards/workflows) in agent conversations. | Directly addressed the supervision UX bottleneck: users can understand/approve actions inside the loop at scale. |
| 2026-02-05 | Introducing OpenAI Frontier | enterprise platform | Framed the bottleneck as “how agents are built and run,” offering shared context + execution + eval/optimization + permissions. | Consolidated a mature consensus: enterprise agents need context layers, identity/permissions, and lifecycle evaluation. |
| 2026-02-11 | Power Platform Feb 2026 update (Power Apps MCP Server + enhanced agent feed) | enterprise product | Brought MCP server + “built-in human supervision” into business apps via an enhanced agent feed. | Validated that mainstream enterprise software is adopting the supervised-agent pattern as a default UX. |
| 2026-02-23 | Introducing Frontier Alliances | governance/enterprise program | Announced multi-year partnerships stressing workflow redesign and change management as deployment bottlenecks. | Elevated an emerging consensus: scaling agents is organizational engineering (process + governance), not just tech. |
 
 
 
 

The second-half-2025 consensus that "agents are engineering systems" did not appear out of thin air; it grew from the systemic friction exposed by large-scale enterprise pilots across 2024 and the first half of 2025. One calibration is needed, though:

the consensus did not form "because so many projects failed",
but "because practice reached scale, the complexity surfaced, and engineering problems became the main battleground".

Below is a structured explanation you can study and internalize.

 

1. The causal chain first (capability → hype → friction → engineering)

Stage 1: capability leap (2024)

  • Reasoning models stabilized

  • Tool calling matured

  • GUI automation became workable

  • Multi-agent orchestration frameworks went mainstream

The industry's judgment:

the technical preconditions are met; scaled deployment can begin.


Stage 2: large-scale pilots (2024 Q4 – 2025 Q2)

Enterprises started building:

  • automated customer service

  • automated approvals

  • automated finance processing

  • automated R&D assistance

  • automated sales follow-up

Problems began to surface.


Stage 3: real-world friction exposed (2025 Q2–Q3)

The real problems erupted all at once:

1️⃣ State consistency

  • The agent forgets what it has already done

  • Tool return values diverge from its internal reasoning

  • Variables drift across multi-step workflows

2️⃣ Tool mis-selection

  • Wrong API chosen

  • Wrong call parameters

  • Side-effecting operations executed twice

3️⃣ Runaway cost

  • Token usage explodes across multi-step loops

  • GUI automation is extremely slow

  • Concurrent runs have unpredictable cost

4️⃣ Permissions and security

  • Write operations are high-risk

  • Data-leak exposure

  • Prompt injection

5️⃣ Operational pain

  • Can't debug

  • Can't replay

  • Can't trace back the decision path

Enterprises realized:

this is not a "model problem"; it is a "systems problem".

 

2. The core consensus that formed

Over the second half of 2025, the industry settled on five stable insights:


Consensus 1: An agent is a "distributed system", not a "smart function"

Early misconception:

agent = a smarter function call.

Mature understanding:

agent = a long-lived distributed state machine.

It involves:

  • multi-component coordination

  • asynchronous execution

  • state persistence

  • concurrency control

  • failure recovery

That is already a systems-engineering problem.


Consensus 2: The hard part is not reasoning but state management

The model can plan.

The hard questions are:

  • What did the previous step do?

  • Which variables have changed?

  • Which side effects have already happened?

  • Is a compensation mechanism needed?

This resembles database transaction management.
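The transaction analogy can be sketched as a saga-style compensation log: every side-effecting step is recorded together with an action that undoes it, so a failed multi-step run can roll back in reverse order. A hypothetical sketch:

```python
# Saga-style compensation log, sketching the database-transaction analogy:
# each side-effecting step is recorded with a compensating action, so a
# failed multi-step run can be rolled back in reverse order.

class SideEffectLog:
    def __init__(self):
        self.compensations = []

    def run(self, step, compensate):
        """Execute a step and remember how to undo it."""
        result = step()
        self.compensations.append(compensate)
        return result

    def rollback(self):
        """Undo completed steps, most recent first."""
        while self.compensations:
            self.compensations.pop()()

state = {"tickets": []}
log = SideEffectLog()
log.run(lambda: state["tickets"].append("T-1"),
        lambda: state["tickets"].remove("T-1"))
try:
    raise RuntimeError("later step failed")
except RuntimeError:
    log.rollback()               # compensate: ticket creation is undone
print(state["tickets"])          # []
```

Real systems persist this log so rollback survives a process crash; the in-memory list is only for illustration.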


Consensus 3: The tool layer is the biggest source of complexity

Not the model, but:

  • unstable APIs

  • shifting tool schemas

  • drifting response formats

  • changing GUI versions

Hence the new emphasis on:

  • tool contracts

  • schema validation

  • retry policies

  • timeout strategies
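These practices can be combined in one small wrapper: schema-validate the tool's output, retry transient failures with backoff, and bound each attempt by a time budget. A sketch under assumed names (note the timeout here is checked after the call returns, not a true cancellation):

```python
# Sketch of a minimal tool contract: schema validation + retry with
# exponential backoff + a per-attempt time budget. Names are illustrative.
import time

def call_tool(fn, schema, retries=3, timeout=2.0, backoff=0.1):
    """Call a tool, enforcing a minimal contract on its output."""
    last_err = None
    for attempt in range(retries):
        start = time.monotonic()
        try:
            result = fn()
            if time.monotonic() - start > timeout:
                raise TimeoutError("tool exceeded time budget")
            # minimal schema check: required keys and their types
            for key, typ in schema.items():
                if not isinstance(result.get(key), typ):
                    raise ValueError(f"schema violation on {key!r}")
            return result
        except Exception as e:
            last_err = e
            time.sleep(backoff * (2 ** attempt))   # exponential backoff
    raise RuntimeError(f"tool failed after {retries} attempts: {last_err}")

# Hypothetical flaky tool: fails once with a transient error, then succeeds.
flaky_calls = {"n": 0}
def flaky_weather():
    flaky_calls["n"] += 1
    if flaky_calls["n"] < 2:
        raise ConnectionError("transient")
    return {"temp_c": 21.5, "city": "Berlin"}

print(call_tool(flaky_weather, {"temp_c": float, "city": str}))
```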


Consensus 4: Observability is mandatory

A production agent must have:

  • logs

  • visual traces

  • decision replay

  • cost monitoring

  • failure analysis

Without these it cannot go to production.
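A minimal version of such a trace: structured events with timestamps and token costs, serializable as JSON log lines for replay and audit. An illustrative sketch (the event shapes are hypothetical, not any framework's format):

```python
# Sketch of a structured run trace covering the requirements above:
# every decision and tool call is an event with a timestamp and token
# cost, and the whole run can be dumped as JSON lines for replay/audit.
import json
import time

class Trace:
    def __init__(self, run_id):
        self.run_id = run_id
        self.events = []

    def record(self, kind, **data):
        self.events.append({"t": time.time(), "kind": kind, **data})

    def total_cost(self):
        """Cost monitoring: sum token usage across all events."""
        return sum(e.get("tokens", 0) for e in self.events)

    def dump(self):
        """One JSON log line per event, suitable for replay and audit."""
        return "\n".join(json.dumps(e) for e in self.events)

trace = Trace("run-42")
trace.record("plan", tokens=350, decision="search docs")
trace.record("tool_call", tokens=120, tool="search", args={"q": "MCP"})
trace.record("tool_result", status="ok")
print(trace.total_cost())   # 470
```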


Consensus 5: API-first, GUI-fallback

The GUI is for demonstrating capability.

Production systems prefer:

  • APIs

  • RPC

  • structured tools

This is plain engineering rationality.

 

3. Why did Gartner make that prediction?

Not because the technology regressed, but because:

market expectations ran too high.

Enterprises assumed:

  • agents would restructure processes automatically

  • agents could replace large amounts of labor

  • no organizational change would be required

The reality:

  • agents require process redesign

  • permission redesign

  • monitoring systems

  • an operations team

Most enterprises were not ready.

Hence the high failure rate.

 

4. The truly mature agent engineering structure

The architecture that stabilized after 2025 Q4:

Reasoning Model
    ↓
Planner Layer
    ↓
Tool Abstraction Layer
    ↓
Execution Sandbox
    ↓
State Store
    ↓
Monitoring & Audit

This is not prompt engineering.

This is systems engineering.

 

5. A crucial cognitive upgrade

In 2024 everyone asked:

is the model strong enough?

By late 2025 the question became:

is the system governable?

That shift is the mark of maturity.

 

6. The core problems of agent engineering

If you want to study this systematically, focus on five dimensions:

  1. State consistency

  2. Idempotency

  3. Side-effect control

  4. Recovery strategy

  5. Observability

These five matter more than raw model capability.
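Dimension 2, idempotency, is the easiest to show in code: deduplicate side-effecting operations behind an idempotency key so a retried or replayed step never fires its effect twice. A sketch with in-memory storage (a real system would persist the keys):

```python
# Idempotency sketch: side-effecting operations are deduplicated by key,
# so a retried or replayed step returns the cached result instead of
# executing its effect a second time. Storage here is an in-memory dict.

class IdempotentExecutor:
    def __init__(self):
        self._done = {}

    def execute(self, key, effect):
        """Run `effect` once per key; replays return the cached result."""
        if key in self._done:
            return self._done[key]
        result = effect()
        self._done[key] = result
        return result

sent = []
ex = IdempotentExecutor()
ex.execute("email:invoice-17", lambda: sent.append("invoice-17") or "sent")
ex.execute("email:invoice-17", lambda: sent.append("invoice-17") or "sent")  # replay
print(len(sent))   # 1 -- the effect ran only once
```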

 

7. Summary in one sentence

The consensus that formed in late 2025 is not that "agents don't work",
but that "agents are engineering systems and must be built with engineering discipline".

 

8. How should you study this now?

To truly absorb the mature understanding of this phase:

don't just read papers.

Study:

  • distributed systems

  • transaction management

  • API design

  • failure-recovery strategies

  • monitoring and tracing

Agent engineering has entered the realm of systems engineering.

 
 
------------------------------------------------------------------------------------------------------
The systematic engineering described above is exactly what OpenClaw embodies and implements.

It is not that OpenAI didn't know the direction. Rather:

  • internal progress was slow

  • product risk was high

  • external validation was needed

OpenClaw effectively served as the market's validation.

 
posted @ 2024-02-07 13:29 郝壹贰叁