Tom Literature Radar Draft · AI Agent Memory, Agentic RAG, and Long-Horizon Evaluation

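Instance: Tom
Produced: 2026-06-10 08:40 CST / 2026-06-10 00:40 UTC
Topic this round: AI agent memory systems, long-horizon personal assistant evaluation, Agentic RAG, retrieval / long-context evaluation
Draft purpose: for research-kb review and later serial merging; this round does not write to review/ or published/ and performs no GitHub writes.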

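1. Topic This Round

This round focuses on AI agents, RAG, retrieval, long context, long-term / multimodal memory, and evaluation benchmarks. It concentrates on new papers and engineering leads from May-June 2026 and prioritizes material that helps answer the following questions:

Should agent memory move from "semantic similarity retrieval" toward "execution state management / graph reconstruction"?
How should long-horizon personal assistants be evaluated on cross-session preferences, hidden intents, and proactivity?
How can computer-use / GUI agents be evaluated in a verifiable, auditable way?
How do new retrieval mechanisms, evidence integrity, and production-grade memory layers for Agentic RAG get deployed in practice?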

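2. Search Scope and Deduplication

2.1 Search sources

Academic platforms: arXiv, OpenReview, Hugging Face Papers, and the related search entry points of ACL Anthology / Papers with Code / Semantic Scholar.
Code and model platforms: GitHub, Hugging Face.
Official technical blogs / docs: Microsoft Foundry / Microsoft Learn.
CSDN: keep only articles with a concrete toolchain, steps, commands, or a real deployment workflow; filter out generic concept pieces, advertorials, and reposts that have no versions, commands, or source code.

2.2 Material read for deduplication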

/shared/research-kb/inbox/jay/2026-06-10-llm-finetuning-rag.md

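Jay's draft centers on LoRA/QLoRA fine-tuning plus RAG engineering practice. This round avoids re-listing its LoRA entries and only supplements the RAG direction with new leads on agent memory, agentic retrieval, and long-horizon evaluation.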

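3. Candidate Entries (10)

No.	Entry	Source	One-line value	Recommend for `registry/papers.jsonl`	Initial action
1	Beyond Semantic Organization: Memory as Execution State Management for Long-Horizon Agents / MAGE	arXiv `2606.06090`	Recasts agent memory from similarity retrieval as an "execution state tree", targeting error isolation and state continuity in long-horizon tasks.	Yes	Close read
2	Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents / MRAgent	arXiv `2606.06036`	Replaces one-shot retrieval with graph memory plus active reconstruction; compared against Mem0, MemoryOS, LangMem, and others on LoCoMo / LongMemEval.	Yes	Close read
3	π-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows	arXiv `2605.14678` / HF Papers	A 100-task benchmark for personal assistant proactivity, hidden intents, and cross-session dependencies.	Yes	Close read
4	OpenComputer: Verifiable Software Worlds for Computer-Use Agents	arXiv `2605.19769` / HF Papers	33 desktop applications and 1000 machine-checkable tasks, with emphasis on state verifiers and auditable trajectories.	Yes	Close read
5	Efficient RAG with Intent-Aware Retrieval and Semantics-Preserving Chunking	arXiv `2606.01240`	IAR dynamic hybrid retrieval plus SPC evidence-chunk repair, aimed at evidence-sensitive tasks such as multi-hop QA and FEVER.	Yes	Close read
6	M3Exam: Benchmarking Multimodal Memory for Realistic User-Agent Interactions	arXiv `2606.07402`	Evaluates multimodal, multi-session memory; feeding the full context actually lowers model quality, which is evidence that long context does not equal good memory.	Yes	Close-read candidate
7	ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment	arXiv `2606.00644`	Evaluates a research agent's forward-looking judgment, including direction forecasting, bottleneck diagnosis, and venue positioning.	Yes	Review
8	OpenViking	GitHub `volcengine/OpenViking`	Open-source context database; reports efficiency / accuracy comparisons on LoCoMo, agent experience memory, and HotpotQA.	No; recommend the tools/projects registry	Review; verify benchmark setup
9	Build and run agents at scale with Microsoft Foundry at Build 2026	Official Microsoft blog	Official summary of production directions for enterprise agents: long-term memory, procedural memory, trace/eval/OTel.	No	Reference for engineering topic page
10	Agentic RAG 的实现方式？ (How to Implement Agentic RAG?)	CSDN / TextIn	Low-barrier workflow combining OCR + Agent + memory + Coze knowledge base, with commands and config examples; product-oriented and needs strict review.	No	Kept after strict CSDN filtering; needs manual verification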

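4. High-Value Entries (5)

4.1 MAGE: Memory as Execution State Management for Long-Horizon Agents (⭐⭐⭐⭐⭐)

Link: https://arxiv.org/abs/2606.06090
HTML: https://arxiv.org/html/2606.06090v1
Core problem: existing RAG / memory systems organize history by semantic similarity, which mixes successful trajectories, failed trajectories, and locally relevant but state-discontinuous information; in long-horizon tasks this corrupts the execution state and causes cascading errors.
Method highlights:
MAGE organizes history as a two-level hierarchical state tree: the lower level records action-observation traces, and the upper level generates compressed summaries at subgoal / decision boundaries.
The agent's current state comes from the root-to-current path rather than from ad hoc concatenation of similar fragments.
Four operations: Grow, Compress, Maintain, Revise; when an error is detected, Revise can return to a goal boundary and open a new branch, isolating the failed trajectory.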
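The tree and its operations can be pictured with a small sketch. This is not MAGE's implementation: the names `StateNode` and `StateTree`, the `summarize` callback, and the exact behaviour of each method are assumptions made for illustration, and Maintain is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class StateNode:
    """One subgoal segment: raw trace steps plus an optional compressed summary."""
    goal: str
    parent: Optional["StateNode"] = None
    trace: List[str] = field(default_factory=list)  # lower level: action-observation steps
    summary: Optional[str] = None                    # upper level: written at a boundary
    failed: bool = False
    children: List["StateNode"] = field(default_factory=list)


class StateTree:
    """Hypothetical two-level state tree; each new segment hangs under the current one."""

    def __init__(self, root_goal: str):
        self.root = StateNode(goal=root_goal)
        self.current = self.root

    def grow(self, goal: str) -> StateNode:
        """Open a child segment under the current node and move into it."""
        node = StateNode(goal=goal, parent=self.current)
        self.current.children.append(node)
        self.current = node
        return node

    def record(self, action: str, observation: str) -> None:
        self.current.trace.append(f"{action} -> {observation}")

    def compress(self, summarize: Callable[[List[str]], str]) -> None:
        """At a subgoal boundary, store a summary of the current segment's trace."""
        self.current.summary = summarize(self.current.trace)

    def revise(self, goal: str) -> StateNode:
        """Mark the current segment failed, step back to its parent, open a new branch."""
        if self.current.parent is None:
            raise ValueError("cannot revise the root segment")
        self.current.failed = True
        self.current = self.current.parent
        return self.grow(goal)

    def context(self) -> List[str]:
        """Build context from the root-to-current path only; failed siblings never appear."""
        path = []
        node = self.current
        while node is not None:
            path.append(node)
            node = node.parent
        lines = []
        for n in reversed(path):
            if n.summary is not None and n is not self.current:
                lines.append(f"[{n.goal}] {n.summary}")
            else:
                lines.extend(f"[{n.goal}] {step}" for step in n.trace)
        return lines


tree = StateTree("book trip")
tree.grow("find flight")
tree.record("search", "3 results")
tree.compress(lambda steps: f"{len(steps)} step(s) done")
tree.grow("pay")
tree.record("submit card", "declined")
tree.revise("pay with voucher")        # the failed "pay" branch is left behind
tree.record("apply voucher", "ok")
ctx = tree.context()
```

In this toy run the declined card attempt stays in the tree for auditing but is absent from `ctx`, which is the isolation property the entry attributes to Revise.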
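Result leads: on MemoryArena, average task success rate improves by about 7.8-20.4 percentage points over baselines, and token consumption drops by about 55.1% compared with long-context approaches.
Assessment: this is the agent memory paper from this round most worth adding to the knowledge base. It elevates "memory" from a retrieval backend to an execution state management layer, and fits the Agent Memory / Long-Horizon Agent topic pages.
Recommendation: add to registry/papers.jsonl; close-read the method figures, the MemoryArena setup, and baseline fairness.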

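4.2 MRAgent: Memory is Reconstructed, Not Retrieved (⭐⭐⭐⭐⭐)

Link: https://arxiv.org/html/2606.06036v1
Core problem: a long-term memory query is not a one-shot top-k lookup; it requires reconstructing an evidence chain step by step along people, events, time, and semantic tags.
Method highlights:
Builds a graph memory in which cues, tags, contents, and similar elements form a traversable structure.
Maintains a reconstruction state: the set of active memory elements plus the evidence accumulated so far.
Over multiple rounds the LLM selects traversal actions, progressively expanding, pruning, and reconstructing the context.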
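The loop of "hold a reconstruction state, pick a traversal action, repeat" can be sketched as follows. This is an illustration under assumptions, not MRAgent's code: the adjacency-dict graph, the three action names, `ReconstructionState`, and the rule-based `demo_policy` (standing in for the LLM) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

Graph = Dict[str, List[str]]   # node id -> neighbour ids
Action = Tuple[str, str]       # ("expand" | "prune" | "stop", node id)


@dataclass
class ReconstructionState:
    active: Set[str]                                   # memory elements under consideration
    evidence: List[str] = field(default_factory=list)  # content accumulated so far


def reconstruct(graph: Graph, contents: Dict[str, str], cues: List[str],
                choose: Callable[[ReconstructionState], Action],
                max_rounds: int = 8) -> ReconstructionState:
    """Run up to max_rounds traversal steps, letting `choose` pick each action."""
    state = ReconstructionState(active={c for c in cues if c in graph})
    visited: Set[str] = set()
    for _ in range(max_rounds):
        kind, node = choose(state)
        if kind == "stop":
            break
        if node not in state.active:
            continue                      # ignore actions on inactive nodes
        state.active.discard(node)
        visited.add(node)
        if kind == "expand":
            if node in contents:
                state.evidence.append(contents[node])
            state.active.update(n for n in graph.get(node, []) if n not in visited)
        # for "prune" the node is simply dropped from the active set
    return state


def demo_policy(state: ReconstructionState) -> Action:
    """Toy stand-in for the LLM: prune nodes labelled noise, expand everything else."""
    if not state.active:
        return ("stop", "")
    node = sorted(state.active)[0]
    return ("prune" if node.startswith("noise") else "expand", node)


graph = {
    "cue:alice": ["event:trip", "noise:ad"],
    "event:trip": ["time:2024"],
    "noise:ad": [],
    "time:2024": [],
}
contents = {"event:trip": "Alice went to Kyoto", "time:2024": "in spring 2024"}
result = reconstruct(graph, contents, ["cue:alice"], demo_policy)
```

The `max_rounds` cap is a placeholder for whatever stopping condition the paper uses, which is one of the items flagged for close reading.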
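Experimental leads: compared against RAG, LangMem, A-Mem, MemoryOS, and Mem0 on LoCoMo and LongMemEval; the extracted results show clear gains on several metrics, with a claimed improvement of up to about 23%, alongside lower token / runtime cost.
Assessment: complementary to MAGE. MAGE emphasizes the "execution state path", while MRAgent emphasizes "evidence reconstruction on a graph". The two can be combined into a "post-RAG agent memory" feature.
Recommendation: add to registry/papers.jsonl; close-read the graph structure, routing prompt, stopping condition, and cost accounting.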

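4.3 π-Bench: Proactive Personal Assistant Agents in Long-Horizon Workflows (⭐⭐⭐⭐⭐)

Link: https://arxiv.org/abs/2605.14678
HF: https://huggingface.co/papers/2605.14678
Core problem: in personal assistant settings users often give underspecified requests; a benchmark needs to evaluate whether the agent can identify hidden intents and reuse preferences across tasks, not just complete the current explicit task.
Benchmark design:
100 multi-turn tasks covering 5 domain-specific user personas.
Introduces hidden intents, inter-task dependencies, and cross-session continuity.
Evaluates task completion and proactivity together, distinguishing "task done" from "implicit needs proactively met".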
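To make the completion / proactivity split concrete, here is a minimal scoring sketch. The benchmark's actual rubric is not described in this draft, so the `TaskResult` fields and both formulas are assumptions chosen only to show why the two numbers should be reported separately.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskResult:
    task_id: str
    completed: bool          # the explicit request was satisfied
    hidden_intents: int      # hidden intents annotated for the task (hypothetical field)
    intents_addressed: int   # hidden intents the agent handled without being asked


def aggregate(results: List[TaskResult]) -> Dict[str, float]:
    """Return completion rate and proactivity rate as two independent numbers."""
    if not results:
        raise ValueError("no results to aggregate")
    completion = sum(r.completed for r in results) / len(results)
    total_intents = sum(r.hidden_intents for r in results)
    addressed = sum(r.intents_addressed for r in results)
    proactivity = addressed / total_intents if total_intents else 0.0
    return {"completion": completion, "proactivity": proactivity}


results = [
    TaskResult("t1", completed=True, hidden_intents=2, intents_addressed=0),
    TaskResult("t2", completed=True, hidden_intents=2, intents_addressed=2),
    TaskResult("t3", completed=False, hidden_intents=0, intents_addressed=0),
]
scores = aggregate(results)
```

Task `t1` shows the case the benchmark cares about: the explicit task is done while every hidden intent is missed, which a single blended score would hide.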
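Assessment: very close to the personal AI assistant / studio management agent that Anan currently cares about. Especially relevant to OpenClaw-style systems: how to judge whether an agent should proactively ask follow-up questions, proactively reuse historical rules, or avoid being overly proactive.
Recommendation: add to registry/papers.jsonl; close-read the task schema, scoring rubrics, and cases where proactivity and completion conflict.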

4.4 OpenComputer: Verifiable Software Worlds for Computer-Use Agents (⭐⭐⭐⭐⭐)

Link: https://arxiv.org/abs/2605.19769
HTML: https://arxiv.org/html/2605.19769v1
HF: https://huggingface.co/papers/2605.19769
Core problem: evaluation of computer-use agents often relies on hand-built tasks, visual proxy metrics, or LLM-as-judge, which is hard to scale and hard to audit.
Method highlights:
Builds verifier-grounded software worlds.
App-specific state verifiers expose structured inspection endpoints of real applications.
A self-evolving verification layer repairs verifier failures using execution feedback.
A task-generation pipeline synthesizes realistic, machine-checkable desktop tasks.
An evaluation harness records full trajectories and produces auditable partial-credit rewards.
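A state verifier with partial credit can be sketched as a weighted list of predicates over structured application state. The `Check` type, the weighting scheme, and the audit record layout below are assumptions for illustration; the paper's verifier interface and reward definition still need to be read.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

AppState = Dict[str, Any]   # structured state as an inspection endpoint might return it


@dataclass
class Check:
    name: str
    weight: float
    predicate: Callable[[AppState], bool]


def verify(state: AppState, checks: List[Check]) -> Tuple[float, List[Dict[str, Any]]]:
    """Return (reward in [0, 1], audit log with one record per check)."""
    total = sum(c.weight for c in checks)
    if total <= 0:
        raise ValueError("checks must carry positive total weight")
    earned = 0.0
    audit: List[Dict[str, Any]] = []
    for c in checks:
        try:
            passed, error = bool(c.predicate(state)), None
        except Exception as exc:   # a broken check is logged as failed, never as passed
            passed, error = False, repr(exc)
        if passed:
            earned += c.weight
        audit.append({"check": c.name, "passed": passed,
                      "weight": c.weight, "error": error})
    return earned / total, audit


state = {"file_exists": True, "cell_A1": "42"}
checks = [
    Check("file exists", 1.0, lambda s: s["file_exists"]),
    Check("A1 holds 42", 2.0, lambda s: s["cell_A1"] == "42"),
    Check("document saved", 1.0, lambda s: s["saved"]),   # key missing from state on purpose
]
reward, audit = verify(state, checks)
```

Recording the exception text for the third check gives a hook for the kind of verifier repair the self-evolving layer is said to perform, since a failing check can be told apart from a crashing one.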
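Scale: 33 desktop applications and 1000 finalized tasks, spanning browsers, office, creative software, development environments, file management, communication, and more.
Result leads: the search summary shows GPT-5.4 at an overall success rate of about 68.3%, so nearly one third of tasks still fail; this suggests the benchmark is not yet saturated.
Assessment: fits the "Computer-use agent / GUI agent evaluation / verifier-based eval" topic page. Important for any automated desktop agent evaluation.
Recommendation: add to registry/papers.jsonl; close-read the verifier design and partial-credit reward, and check whether the tasks and evaluation harness are open-sourced.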

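4.5 Efficient RAG with Intent-Aware Retrieval and Semantics-Preserving Chunking (⭐⭐⭐⭐)

Link: https://arxiv.org/html/2606.01240v1
Core problem: complex RAG systems lose quality in retrieval channels, chunk semantic completeness, and evidence integrity; a single dense retriever or fixed chunking easily breaks multi-hop evidence chains.
Method highlights:
IAR: dynamically weights different retrieval channels according to query intent.
SPC: detects and repairs corrupted or semantically incomplete evidence chunks to protect evidence integrity.
Introduces small language models to mitigate the latency added by the iterative mechanism.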
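Intent-dependent channel weighting can be illustrated with a toy score fusion. Everything specific here is assumed: the two intents, the weight table, the keyword-based `classify_intent` (standing in for a learned classifier), and the premise that per-channel scores are already normalized to a comparable range.

```python
from typing import Dict, List, Tuple

# Hypothetical intent -> channel weights; the paper's intents and weights are not given here.
INTENT_WEIGHTS: Dict[str, Dict[str, float]] = {
    "factoid":   {"sparse": 0.7, "dense": 0.3},
    "multi_hop": {"sparse": 0.4, "dense": 0.6},
}


def classify_intent(query: str) -> str:
    """Toy stand-in for an intent classifier such as a small language model."""
    markers = (" and ", " both ", " compare ")
    padded = f" {query.lower()} "
    return "multi_hop" if any(m in padded for m in markers) else "factoid"


def fuse(query: str, channel_scores: Dict[str, Dict[str, float]],
         k: int = 3) -> List[Tuple[str, float]]:
    """Weight each channel's scores by query intent and return the top-k documents."""
    weights = INTENT_WEIGHTS[classify_intent(query)]
    fused: Dict[str, float] = {}
    for channel, scores in channel_scores.items():
        w = weights.get(channel, 0.0)   # channels without a weight contribute nothing
        for doc_id, score in scores.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + w * score
    # Sort by fused score, breaking ties by document id for determinism.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


channel_scores = {
    "sparse": {"d1": 1.0, "d2": 0.0},
    "dense":  {"d1": 0.0, "d2": 1.0},
}
factoid_top = fuse("who wrote Dune", channel_scores)[0][0]
multi_hop_top = fuse("compare Dune and Foundation", channel_scores)[0][0]
```

With identical channel scores, the two queries rank different documents first, which is the behaviour a fixed-weight hybrid retriever cannot produce.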
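Experimental scope: NQ-open, TriviaQA, WebQuestions, HotpotQA, 2WikiMultiHopQA, ELI5, FEVER; metrics include EM, F1, ROUGE, accuracy, and FEVER score.
Result leads: F1 on HotpotQA improves by about 2.65, and FEVER accuracy improves by about 1.5.
Assessment: leans toward RAG system optimization; fits after the RAG engineering topic in Jay's draft, adding an academic thread on "evidence integrity + intent-aware retrieval".
Recommendation: add to registry/papers.jsonl; verify whether the code is public and whether the latency / cost accounting is fair.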

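5. Other Entries Worth Watching

5.1 M3Exam: multimodal long-term memory evaluation

Link: https://arxiv.org/html/2606.07402v1
Highlights: extends long-term user-assistant interaction to multimodal history including text, images, and documents. Extracted snippets show that stuffing the entire conversation into context lowers answer quality across multiple models, with open models dropping more noticeably.
Verdict: good supporting evidence that long context does not equal long-term memory. Recommend adding to the registry, at slightly lower priority than MAGE/MRAgent.

5.2 ForeSci: evaluation of research-judgment agents

Link: https://arxiv.org/html/2606.00644v2
Highlights: moves from literature QA to evaluating a research agent's direction forecasting, bottleneck diagnosis, strategic planning, and venue/community judgment.
Verdict: highly relevant to running an academic research knowledge base itself, but its task construction needs review for objectivity. Recommend adding to the registry, tagged research-agent-eval.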

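5.3 OpenViking: open-source context database / agent memory tool

Link: https://github.com/volcengine/OpenViking
Highlights: the README reports evaluations of OpenViking 0.3.22 on LoCoMo, agent experience memory, and HotpotQA; HotpotQA top-20 retrieval is claimed to reach 91% accuracy with 0.23s latency.
Risk: the GitHub README benchmarks need verification of the reproduction scripts, hardware, data splits, and baseline parameters. Not for the papers registry for now; can go into the tools/projects list.

5.4 Microsoft Foundry Agent Service Build 2026: production-grade agent memory and evaluation

Link: https://devblogs.microsoft.com/foundry/agent-service-build2026
Highlights: the official post describes user memory, session memory, and procedural memory; early Tau-bench results for procedural memory claim +7-14% absolute success-rate gains; it also stresses linking OpenTelemetry traces with evaluation.
Verdict: not a paper, but a good addition to the "enterprise agent production stack" topic page: memory, isolation, durable state, trace-driven eval.

5.5 CSDN / TextIn: Agentic RAG 的实现方式 (How to Implement Agentic RAG)

Link: https://blog.csdn.net/TextIn666/article/details/161260551
Why kept: it has a clear workflow and commands, for example npm i -g clawmaster and clawmaster doctor && clawmaster serve, plus configuration paths for PaddleOCR / OpenClaw / PowerMem and the TextIn + Coze knowledge base.
Limitations: promotional in tone, with no strict version matrix, downloadable source code, or independent benchmark. Recommend treating it only as an "engineering case candidate" and not giving it high weight before manual review.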

6. Category Tags

Tag	Entries	Representative entries
`Agent Memory`	4	MAGE, MRAgent, M3Exam, OpenViking
`Long-Horizon Agent`	4	MAGE, π-Bench, MRAgent, OpenComputer
`Agent Evaluation`	5	π-Bench, OpenComputer, ForeSci, M3Exam, Microsoft Foundry eval
`Agentic RAG`	3	Efficient RAG IAR/SPC, CSDN TextIn, OpenViking HotpotQA
`Retrieval / Evidence Integrity`	2	Efficient RAG, OpenViking
`Computer-Use Agent`	1	OpenComputer
`Proactive Assistant`	1	π-Bench
`Multimodal Memory`	1	M3Exam
`Production Agent Stack`	2	Microsoft Foundry, OpenViking
`CSDN-工程实战候选` (CSDN engineering practice candidate)	1	TextIn Agentic RAG

7. Suggested Write Paths

7.1 Actual draft path this round

/shared/research-kb/inbox/tom/2026-06-10-agent-memory-rag-eval-radar.md

7.2 Suggested path for later review (suggestion only; not written this round)

/shared/research-kb/review/tom/2026-06-10-agent-memory-rag-eval-radar.md

7.3 Topic pages that could be split out later (suggestion only; not written this round)

research-kb/topics/agent-memory.md
research-kb/topics/long-horizon-agent-evaluation.md
research-kb/topics/agentic-rag-evidence-integrity.md
research-kb/topics/computer-use-agent-evaluation.md

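8. Close Reading / Review / Topic Page Updates Needed

Action	Entry	Reason
Close read	MAGE	May become a key paper for the shift of agent memory from RAG to state management.
Close read	MRAgent	Complements MAGE; represents graph reconstruction memory.
Close read	π-Bench	Close to personal assistants and OpenClaw-style systems; can guide where to draw the proactivity boundary.
Close read	OpenComputer	Verifier-based eval for computer-use agents is important.
Close read	Efficient RAG IAR/SPC	RAG evidence integrity and retrieval intent fit as a supplement to the engineering topic.
Review	ForeSci	The task is important, but the objectivity of its forward-looking judgment evaluation needs checking.
Review	OpenViking	High engineering value, but the README benchmarks need a reproducibility check.
Review	CSDN TextIn Agentic RAG	Has a workflow and commands, but is clearly promotional; cite only after filtering.
Topic page update	`agent-memory.md`	Add MAGE / MRAgent / M3Exam / OpenViking.
Topic page update	`long-horizon-agent-evaluation.md`	Add π-Bench / OpenComputer / ForeSci.
Topic page update	`agentic-rag-evidence-integrity.md`	Add IAR/SPC and the CSDN engineering case.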

9. Suggested Registry JSONL Snippets (for manual merging by a later sync task only)

Note: this round does not write directly to research-kb/registry/papers.jsonl.

{"title":"Beyond Semantic Organization: Memory as Execution State Management for Long-Horizon Agents","year":2026,"source":"arXiv","url":"https://arxiv.org/abs/2606.06090","tags":["agent-memory","long-horizon-agent","state-management","evaluation"],"priority":"high","status":"candidate"}
{"title":"Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents","year":2026,"source":"arXiv","url":"https://arxiv.org/html/2606.06036v1","tags":["agent-memory","graph-memory","long-context","retrieval"],"priority":"high","status":"candidate"}
{"title":"π-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows","year":2026,"source":"arXiv","url":"https://arxiv.org/abs/2605.14678","tags":["agent-evaluation","personal-assistant","proactivity","long-horizon"],"priority":"high","status":"candidate"}
{"title":"OpenComputer: Verifiable Software Worlds for Computer-Use Agents","year":2026,"source":"arXiv","url":"https://arxiv.org/abs/2605.19769","tags":["computer-use-agent","agent-evaluation","verifier","desktop-agent"],"priority":"high","status":"candidate"}
{"title":"Efficient RAG with Intent-Aware Retrieval and Semantics-Preserving Chunking","year":2026,"source":"arXiv","url":"https://arxiv.org/html/2606.01240v1","tags":["rag","agentic-rag","retrieval","evidence-integrity"],"priority":"medium-high","status":"candidate"}
{"title":"M3Exam: Benchmarking Multimodal Memory for Realistic User-Agent Interactions","year":2026,"source":"arXiv","url":"https://arxiv.org/html/2606.07402v1","tags":["multimodal-memory","agent-memory","long-context","benchmark"],"priority":"medium-high","status":"candidate"}
{"title":"ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment","year":2026,"source":"arXiv","url":"https://arxiv.org/html/2606.00644v2","tags":["research-agent","agent-evaluation","forecasting","scientific-judgment"],"priority":"medium","status":"candidate"}
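Before a manual merge, snippets like these can be checked mechanically. The sketch below takes its field names from the records above; treating every field as required, and flagging duplicate URLs, are assumptions about the registry's rules rather than documented policy, and `validate_jsonl` is a hypothetical helper name.

```python
import json
from typing import Dict, List

# Field names come from the snippets above; "all required" is an assumed rule.
REQUIRED = {"title": str, "year": int, "source": str, "url": str,
            "tags": list, "priority": str, "status": str}


def validate_jsonl(text: str) -> List[str]:
    """Return human-readable problems; an empty list means every line passed."""
    problems: List[str] = []
    seen_urls: Dict[str, int] = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue                       # blank lines are tolerated
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {lineno}: expected a JSON object")
            continue
        for key, expected in REQUIRED.items():
            if key not in record:
                problems.append(f"line {lineno}: missing field '{key}'")
            elif not isinstance(record[key], expected):
                problems.append(f"line {lineno}: field '{key}' should be {expected.__name__}")
        url = record.get("url")
        if isinstance(url, str):
            if url in seen_urls:
                problems.append(f"line {lineno}: duplicate url, first seen on line {seen_urls[url]}")
            else:
                seen_urls[url] = lineno
    return problems


sample = "\n".join([
    '{"title":"A","year":2026,"source":"arXiv","url":"https://example.org/a",'
    '"tags":["x"],"priority":"high","status":"candidate"}',
    '{"title":"B","year":"2026","source":"arXiv","url":"https://example.org/a",'
    '"tags":["x"],"priority":"high"}',
    '{not json}',
])
issues = validate_jsonl(sample)
```

The sample uses placeholder records on `example.org` rather than the real entries, so the duplicate-URL and wrong-type branches are exercised without altering the suggested snippets.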

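10. Not Included / Filtering Notes

CSDN articles that only introduce Agentic RAG in general terms, with no commands, source code, or version matrix: filtered out.
GitHub awesome lists: used only as leads, not listed as high-value entries, to keep directory-style pages from polluting the knowledge base.
CSDN download pages that require an account, cookies, private download links, or paid resources to reproduce: filtered out.
Unverified product benchmarks: marked only as "review / verify" and not written up as established conclusions.

This draft was produced automatically by the Tom instance · 2026-06-10 · awaiting manual review before a serial sync task merges it into the knowledge base.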