🔴 Keep · `Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Tasks`

Reusable information

- Evaluates deployed coding agents rather than bare models:
  - Claude Opus 4.6 / Sonnet 4.6 / Haiku 4.5 → Claude Code harness
  - GPT-5.4 xhigh / GPT-5.4 mini → Codex harness
  - Kimi K2.5 → OpenCode harness
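- Explicitly states that tool mediation, file editing, shell access, and workspace management are all part of the deployed coding agent system
- Appendix A includes per-agent API endpoints, model identifiers, sampling settings, and harness invocation details (a fully reproducible evaluation configuration)
- Benchmark: Terminal-Bench 2.0 (Vals AI), which evaluates hard, real-world CLI tasks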
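Keep reason: the model-harness paired evaluation approach is worth learning from; the complete configuration in Appendix A can be used directly to build your own coding agent evaluation framework.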
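
One way to carry the model-harness pairing into your own evaluation framework is to record each pair as an explicit config entry. The sketch below is only an illustration: the `AgentConfig` class, its field names, and the `by_harness` helper are hypothetical, not taken from the paper. The endpoint and sampling fields are deliberately left empty because these notes do not record the Appendix A values.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AgentConfig:
    """One deployed coding agent: a model paired with the harness that runs it.

    Hypothetical structure for illustration; field names are assumptions.
    """
    model: str                          # model name as listed in the notes
    harness: str                        # harness the model is deployed in
    api_endpoint: Optional[str] = None  # not recorded here; see Appendix A
    sampling: dict = field(default_factory=dict)  # not recorded here; see Appendix A


# Pairings exactly as listed in the notes above.
AGENTS = [
    AgentConfig("Claude Opus 4.6", "Claude Code"),
    AgentConfig("Sonnet 4.6", "Claude Code"),
    AgentConfig("Haiku 4.5", "Claude Code"),
    AgentConfig("GPT-5.4 xhigh", "Codex"),
    AgentConfig("GPT-5.4 mini", "Codex"),
    AgentConfig("Kimi K2.5", "OpenCode"),
]


def by_harness(agents: list[AgentConfig]) -> dict[str, list[str]]:
    """Group model names by harness, preserving input order."""
    groups: dict[str, list[str]] = {}
    for agent in agents:
        groups.setdefault(agent.harness, []).append(agent.model)
    return groups


if __name__ == "__main__":
    for harness, models in by_harness(AGENTS).items():
        print(f"{harness}: {', '.join(models)}")
```

Keeping the harness as a first-class field, rather than implying it from the model name, makes it explicit that results belong to the model-harness pair and not to the bare model.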