flyP afternoon light close read · 2026-06-21 (cron 3d8f503a · 15:50 CST)

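Compiled by: flyP
Compiled at: 2026-06-21 15:50 (Asia/Shanghai)
Task: research knowledge base · flyP close reading and critique · 3 runs per day (this is run 2)
Mode: light close read, 1 arXiv paper (v1, 2026-06-03) + 0 Substack
Quota: at most 1 Substack item per task; none consumed this run (morning-read used 0, this run also 0)
No overlap with: morning-read S-Agent (arXiv 2606.20515); tom's 6-21 radar, 4 candidates (SAC / S-Agent / ToolPrivBench / Qiskit RAG)
Relay: jay's 6-12 research-briefing listed VSTAT as a "close read" item but left it unfinished for 9 days; this run flyP picks it up from a methodology and critique angle
Note: this file only produces a GitHub-ready draft and performs no git writes; the final merge is handled serially by the sync task that Stephen coordinates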

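0. Paper selection and deduplication

Dimension	Notes
Candidate 1 (paper)	VSTAT: Benchmarking Visual State Tracking in Multimodal Video Understanding (arXiv:2606.03920, v1 2026-06-03)
Candidate 2 (backup)	Moment-Video (arXiv 2606.02522): same direction but focused on "momentary events"; weaker team signal than VSTAT
Candidate 3 (backup)	NA-VQA (CVPR 2026 Workshop): narrative-aligned, movie-scale QA; large, but no public arXiv v1
Rejected	(1) Thinking-with-Video, LongVideoAgent: already covered by flyP on 6-17/6-12; (2) VideoReasonBench: an older paper from 2025-05; (3) TemporalBench: already touched several times in flyP's main thread
Relation to this week's flyP main thread	Continues the theme of "evaluation critiques of MLLMs on specific sub-capabilities" from 6-19 UXBench (MLLM UX reasoning deficits), 6-18 Expense-of-Seeing (evaluation critique), and 6-17 multimodal-positional-evidence; also a direct counterpoint to the "MLLM as planner + spatial tools" paradigm of the 6-21 morning-read S-Agent: no amount of agents or tools rescues a failure of visual perception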

1. Paper: VSTAT: Benchmarking Visual State Tracking in Multimodal Video Understanding

1.1 Metadata

Paper: Benchmarking Visual State Tracking in Multimodal Video Understanding
arXiv: 2606.03920 (v1, 2026-06-03)
Authors: Sihyun Yu¹²†, Nanye Ma¹†, Pinzhi Huang¹†, Hyunseok Lee², Shusheng Yang¹, June Suk Choi², Ellis Brown¹, Oscar Michel¹, Boyang Zheng¹, Jinwoo Shin², Saining Xie¹
Affiliations: NYU (Saining Xie / Nanye Ma) + KAIST (Jinwoo Shin). Saining Xie is an author of the Cambrian series (yang2026cambrians); Nanye Ma is a core figure in generative modeling (first author of SiT)
HTML v1: https://arxiv.org/html/2606.03920v1
PDF: https://arxiv.org/pdf/2606.03920
Project page (to be checked): the abstract mentions a website ("this https URL"); the GitHub/HF repository link still needs to be looked up

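1.2 Core problem (in brief)

Most existing video MLLM benchmarks can be beaten with "a few key frames + final-state recognition": scores on MME-style benchmarks look high, yet whether models actually track visual state continuously remains unknown
Real video understanding requires continuous tracking of entities / states / events (e.g. keeping score in a basketball game, following the object in a three-cup shell game, reading characters typed on a keyboard)
This visual state tracking ability is a core prerequisite for robotics / long-horizon manipulation / embodied intelligence, yet nearly every SOTA MLLM scores close to the chance baselines on it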

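1.3 Key design

834 videos + 1,500 questions: roughly 8x the videos and 15x the questions of CP-Bench (101/101) and VET-Bench (100/100)
Three sources:
  - Blender synthetic: 9 3D environments, 450 clips (friendly to controlled variables)
  - YouTube real-world videos: 304 clips
  - Self-recorded scripted videos: 80 clips
Taxonomy: two axes (State Element × State Structure), annotated with perceptual-complexity factors:
  - State Element (what is tracked): Count / Location / Attribute
  - State Structure (how the state is organized): Atomic / Sequence / Set / Dictionary
  - Perceptual Complexity (perceptual difficulty): occlusion / camera motion / homogeneity / symbolic decoding / multi-entity attribution / event ambiguity
"Answers cannot be derived from a single frame / a few key frames / the final state": this forces models to process the entire video stream
Two question formats: Numerical (NQs) + MCQs (with carefully designed distractors)
Human-in-the-loop annotation and verification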
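
A quick arithmetic check of the composition and scale figures above, as a minimal sketch. All numbers are copied from this note; the variable names are illustrative only.

```python
# Clip counts per source, as reported in this note.
sources = {"blender_synthetic": 450, "youtube": 304, "self_recorded": 80}
total_videos = sum(sources.values())
total_questions = 1500

# (videos, questions) for the two prior benchmarks cited in this note.
prior = {"CP-Bench": (101, 101), "VET-Bench": (100, 100)}

# Scale ratio of VSTAT over each prior benchmark: (video ratio, question ratio).
scale = {
    name: (round(total_videos / v, 1), round(total_questions / q, 1))
    for name, (v, q) in prior.items()
}

print(total_videos)  # 834
print(scale)
```

The three sources do sum to 834 clips, and the ratios come out near 8x for videos and 15x for questions.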

1.4 Numbers (excerpt from Table 2; the core negative finding)

Baselines

Chance Level (Random): 26.1% Avg
Chance Level (Frequency): 37.8% Avg
Human Performance: 90.5% Avg (92.8 / 89.9 / 86.4 / 93.7 / 77.5 / 90.0 / 92.4, by state dimension)
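
How the two chance baselines might be computed, as a hedged sketch. It assumes "Random" means uniform guessing among the options and "Frequency" means always predicting the most common answer; the paper's exact definitions are not confirmed here, and the toy data is hypothetical.

```python
from collections import Counter


def random_chance(num_options):
    """Expected accuracy of uniform guessing, averaged over questions."""
    return sum(1 / n for n in num_options) / len(num_options)


def frequency_chance(answers):
    """Accuracy of always predicting the single most frequent answer."""
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)


# Hypothetical toy set: four 4-option MCQs with a skewed answer key.
toy_answers = ["B", "B", "B", "A"]
toy_num_options = [4, 4, 4, 4]

print(random_chance(toy_num_options))  # 0.25
print(frequency_chance(toy_answers))   # 0.75
```

Under these assumptions the frequency baseline is the stricter floor whenever the answer key is skewed, which is why the comparisons below use 37.8% rather than 26.1%.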

Proprietary (API): all 4 models

Model	Rank	Avg
Gemini-3.1 Pro (low)	1	44.4%
Gemini-3.1 Pro (high)	2	43.9%
Gemini-3.0 Flash (low)	3	39.8%
Gemini-3.0 Flash (high)	4	38.8%

The strongest model, Gemini-3.1 Pro at 44.4%, is 46.1pp below human performance (90.5%) and only 6.6pp above the frequency baseline (37.8%)

Open-source, Thinking mode

Model	Rank	Avg
MiMo-VL-7B	11	31.2%
InternVL3.5-8B-Thinking	13	30.2%
GLM-4.1V-9B-Thinking	14	30.2%
Qwen3VL-8B-Thinking	18	28.2%
Qwen3VL-4B-Thinking	19	26.0%

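All Thinking models land at 26.0-31.2%, below the 37.8% frequency baseline (Qwen3VL-4B-Thinking at 26.0% is even marginally below the 26.1% random baseline); thinking makes things worse (e.g. Qwen3VL-8B: 33.2% Instruct vs 28.2% Thinking)

Open-source, Instruct mode

Model	Rank	Avg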
LLaVA-OV-2-8B	1	35.1%
LLaVA-OV-2-8B (codec)	2	35.0%
Molmo2-4B	3	34.4%
Cambrian-S-7B	4	34.2%
Qwen3VL-8B	6	33.2%
LLaVA-OV-0.5B	20	21.3%

The strongest open-source model, LLaVA-OV-2-8B at 35.1%, is still below the 37.8% frequency baseline
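
To make the gaps in the tables above easier to compare, here is a small sketch that expresses each score relative to the frequency baseline and human performance. The scores are copied from this note; the `normalized` rescaling (0 = frequency baseline, 1 = human) is this note's own convenience measure, not a metric from the paper.

```python
HUMAN, FREQ, RANDOM = 90.5, 37.8, 26.1  # averages (%) from Table 2 as quoted here

scores = {
    "Gemini-3.1 Pro (low)": 44.4,
    "LLaVA-OV-2-8B": 35.1,
    "Qwen3VL-8B": 33.2,
    "MiMo-VL-7B": 31.2,
    "Qwen3VL-8B-Thinking": 28.2,
    "Qwen3VL-4B-Thinking": 26.0,
}


def normalized(acc, floor=FREQ, ceiling=HUMAN):
    """Rescale accuracy so that floor maps to 0 and ceiling maps to 1."""
    return (acc - floor) / (ceiling - floor)


for name, acc in scores.items():
    print(f"{name:24s} vs freq {acc - FREQ:+5.1f}pp  normalized {normalized(acc):+.2f}")
```

On this scale only the Gemini entry is positive, and it covers roughly an eighth of the distance from the frequency baseline to human performance.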

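1.5 Three control experiments (the sharpest part of the critique)

Exp A: Is frame subsampling the bottleneck?

Hypothesis: MLLMs fail because subsampling drops frames
Control: stretch the Blender videos in time (each event spans more frames)
Result: only marginal improvement, which rules out subsampling as the main cause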
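
The hypothesis tested in Exp A can be illustrated with a small sketch. It assumes fixed-rate sampling (timestamps at k / sample_fps), which may differ from how each model actually samples frames; the event timing and function names are hypothetical.

```python
import math


def sampled_frames_in_event(start_s, duration_s, sample_fps):
    """Count sample timestamps k / sample_fps inside [start, start + duration)."""
    first = math.ceil(start_s * sample_fps)
    last = math.ceil((start_s + duration_s) * sample_fps)
    return max(0, last - first)


def stretch(start_s, duration_s, factor):
    """Slow the video down by `factor`: every timestamp scales linearly."""
    return start_s * factor, duration_s * factor


event = (2.3, 0.4)  # hypothetical short event: starts at 2.3 s, lasts 0.4 s
for factor in (1, 4, 8):
    start_s, duration_s = stretch(*event, factor)
    frames = sampled_frames_in_event(start_s, duration_s, sample_fps=1.0)
    print(f"stretch x{factor}: {frames} sampled frame(s) inside the event")
```

At 1 fps the unstretched event falls entirely between samples, while stretching guarantees it is seen. That accuracy barely moves even then is what points away from subsampling as the main cause.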

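Exp B: Reasoning vs perception root-cause diagnosis

Design: pick a simple subset of VSTAT whose events can be manually transcribed into text
Two conditions: original video vs text transcript (explicit description per frame / per event)
Result: models struggle in the video condition but are almost perfect in the text condition (SOTA close to 100%)
Conclusion: the bottleneck is visual perception, not the reasoning engine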
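
The paired-condition logic of Exp B can be sketched as a per-question diagnosis. This is an illustration of the reasoning, not the authors' analysis code; the category names and the toy results are hypothetical.

```python
from collections import Counter


def diagnose(video_correct: bool, text_correct: bool) -> str:
    """Classify one question by its outcome under the two input conditions."""
    if video_correct and text_correct:
        return "solved"
    if text_correct:
        return "perception-limited"  # reasoning works once the events are given as text
    if video_correct:
        return "inconsistent"        # right on video but wrong on text
    return "reasoning-limited"       # fails even with a perfect transcript


# Hypothetical (video_correct, text_correct) outcomes for four questions.
toy_results = [(False, True), (False, True), (True, True), (False, False)]
summary = Counter(diagnose(v, t) for v, t in toy_results)
print(summary)
```

If nearly all failures land in the "perception-limited" bucket, as the near-100% text-condition result suggests, the reasoning engine is not the limiting factor.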

Exp C: Do agentic frameworks solve it?

Tested: MLLM-based video agents (wang2025active) + state-of-the-art coding agents (Claude Opus 4.7, OpenAI coding agent)
Result: the performance gap stays large; agents do not eliminate the failures
This is a direct negative signal for the "MLLM as planner + tools" paradigm of today's morning-read S-Agent

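1.6 Three failure modes

Three failure modes, distilled from mismatches between thinking traces and the video stream:
1. Event Recognition: the model fails to identify what happened in which frame
2. Entity Association: the model fails to link the same entity across frames (e.g. which cup holds the target throughout a shell game)
3. State Update: the event is recognized correctly but the internal state is not updated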
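
The three failure modes map onto the three steps of even the simplest tracker. Below is a minimal sketch for the shell-game case, assuming the swaps have already been transcribed into symbolic events (the hard part that the models fail at); the `Swap` type and the example sequence are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Swap:
    """One recognized event: the cups at positions a and b trade places."""
    a: int
    b: int


def track_ball(initial_position: int, events: list[Swap]) -> int:
    """Return the final position of the cup that hides the ball."""
    position = initial_position
    for event in events:           # event recognition: each swap must be in the list
        if position == event.a:    # entity association: is the ball's cup involved?
            position = event.b     # state update: carry the new position forward
        elif position == event.b:
            position = event.a
    return position


swaps = [Swap(0, 1), Swap(1, 2), Swap(0, 2)]
print(track_ball(0, swaps))                   # 0
print(track_ball(0, [swaps[0], swaps[2]]))    # 1 (one missed event changes the answer)
```

Given a correct event list the logic is trivial, which matches the near-perfect text-condition results; a single missed or misattributed swap is enough to produce a confident wrong answer.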

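1.7 Relation to this week's / this month's flyP main thread

Covered	Topic	Relation to VSTAT
6-19 UXBench	MLLM UX reasoning deficits	Same direction (evaluation critique)
6-18 Expense-of-Seeing	The cost of "seeing" in multimodal evaluation	Same direction (evaluation cost and realism)
6-17 multimodal-positional-evidence	Positional encoding / attention evidence	Mirror image (VSTAT focuses on visual perception failure)
6-19 V2PE	Extending long-context positional encoding to 1M+	Opposite: VSTAT says "even with good positional encoding, visual state tracking still collapses"
6-21 S-Agent	MLLM as planner + spatial tools	VSTAT directly contradicts it: "no amount of agents / tools rescues a failure of visual perception"

VSTAT is a fundamental counterexample to the 2026 multimodal agent paradigm (MLLM as planner + specialized tools): when the underlying visual perception is unreliable, the agent loop can amplify errors instead (the larger the thinking budget, the more tokens the model spends generating a plausible but wrong narrative; jay's 6-12 briefing already caught this)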

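1.8 Value and impact

For research: it exposes the reality behind the seemingly thriving long-video benchmarks; most SOTA MLLMs are near chance on state tracking
For industry: real deployments in robotics / long-horizon manipulation / agentic workflows need state tracking, so VSTAT is a must-run gate
For methodology: the perception vs reasoning root-cause diagnosis is a model for high-quality evaluation (control experiment + ablation thinking)
For agent design: a contrast with morning-read S-Agent. Agentic scaffolding does not fix a fundamental perception deficit; a dedicated perception-aware video backbone may need to be trained, or visual memory externalized (in line with the 6-17 Seeker / 6-19 GateMem ideas)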

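1.9 Reproduction risk (rough assessment)

✅ 834 clips + 1500 questions: a moderate dataset size, and it will be open-sourced (abstract: "open-source these labels")
✅ The 450 Blender synthetic clips are fully reproducible
⚠️ The 304 YouTube clips face copyright / takedown risk and may need rehosting
⚠️ Human-in-the-loop annotation: the protocol and quality details are in Appendix A and still need to be read
⚠️ No detailed formula for MRA-with-MCQ (the Table 2 footnote mentions a "reparsed MRA-with-MCQ metric"); it appears to be a VSTAT-specific metric, so the definition in the main text needs checking
⚠️ Gemini 3.1 Pro "low" / "high": presumably a thinking-budget control; the exact API parameters are still to be confirmed
⚠️ Sihyun Yu's first affiliation is NYU and Jinwoo Shin is at KAIST; with authors split across two institutions, open-source coordination may be somewhat slow
❌ No peer-review acceptance signal: v1 is dated 2026-06-03, long after CVPR 2026 submissions closed; it may be headed for NeurIPS 2026 (the May deadline has passed, so only if it was submitted before the arXiv posting) or ICCV 2027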
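
On the MRA-with-MCQ item above: "MRA" may refer to Mean Relative Accuracy as used for numerical answers in VSI-Bench, which shares authors with this paper. That link is an assumption of this note, and how VSTAT combines it with MCQ accuracy is unknown until the main text is checked. A sketch of the VSI-Bench-style definition for a single numerical answer:

```python
def mean_relative_accuracy(pred: float, target: float) -> float:
    """Fraction of tolerance levels at which the relative error is small enough.

    Assumed form: average over theta in {0.50, 0.55, ..., 0.95} of
    1[|pred - target| / |target| < 1 - theta]. Not confirmed for VSTAT.
    """
    if target == 0:
        raise ValueError("relative error is undefined for a zero target")
    rel_err = abs(pred - target) / abs(target)
    # Tolerances 1 - theta, i.e. 0.50, 0.45, ..., 0.05.
    tolerances = [(50 - 5 * i) / 100 for i in range(10)]
    return sum(rel_err < tol for tol in tolerances) / len(tolerances)


print(mean_relative_accuracy(10, 10))  # 1.0
print(mean_relative_accuracy(9, 10))   # 0.8
print(mean_relative_accuracy(20, 10))  # 0.0
```

Unlike exact match, this gives partial credit to near-miss counts, which matters when comparing NQ scores against the MCQ chance baselines.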

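1.10 Comparison with Expense-of-Seeing / UXBench

Dimension	6-18 Expense-of-Seeing	6-19 UXBench	6-21 VSTAT (this run)
What is evaluated	Video frame sampling and position	UI/UX reasoning	Visual state tracking
Evaluation method	Prompt vs answer position	9 UX subtask dimensions	Two-axis taxonomy (element × structure)
Core finding	Simple questions can be answered from the first/last frame alone	SOTA still weak at UX reasoning	SOTA near chance; agents do not fix it
Root-cause diagnosis	None	None	Three control experiments
Team	FAIR/Princeton circle	CHI/CSCW circle	NYU/KAIST circle

VSTAT's methodology (root-cause diagnosis + text control + temporal stretch) goes deeper than both Expense-of-Seeing and UXBench; it is a model for evaluation papers.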

1.11 Tags

#multimodal-benchmark #video-mllm #visual-state-tracking #negative-finding #gemini-3-1-pro #qwen3-vl #llava-onevision-2 #cambrian #molmo2 #perception-bottleneck #agent-fail #claude-opus-47 #blender-synthesis #cvpr-2027-watchlist #nyu #kaist #saining-xie #reproduction-medium #evaluation-methodology

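1.12 Follow-up actions

Must read: §2.1 dataset construction details + Appendix A (human-in-the-loop protocol)
Must read: §3.1-3.2 on the three control experiments (the abstract is captured, but the specific numbers need the PDF)
Watch for submission / acceptance signals at NeurIPS 2026 (May deadline), CVPR 2027, or ICCV 2027 (CVPR 2026 submissions closed before this v1)
Must check: project page URL (the abstract mentions a website, but the PDF gives no direct GitHub link)
Build a "counterexamples to the S-Agent paradigm" topic page together with the 6-21 morning-read S-Agent, notes/multimodal/agentic-mllm-perception-failure-2026.md, stringing VSTAT (this run) + 6-19 UXBench + 6-18 Expense-of-Seeing into an "evaluation critique + paradigm counterexample" theme
Link with 6-17 Seeker / 6-19 GateMem: externalized visual memory may be a remedy for the failures VSTAT exposes
Reminder: ask jay to track VSTAT's GitHub release (unlike S-Agent's S-300K pattern, VSTAT is 1,500 QA + 834 clips, so the release cost is lower)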

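2. Side-by-side comparison (only 1 paper this run, no Substack)

Dimension	VSTAT
Type	arXiv paper (v1 2026-06-03, single paper)
Core contributions	(1) benchmark of 834 clips + 1,500 QA; (2) two-axis taxonomy; (3) all 24 MLLMs fall far short of human performance; (4) three control experiments that localize the perception bottleneck; (5) agentic frameworks do not solve the problem
Category	Multimodal evaluation · video MLLM · state tracking
Date	2026-06-03 v1
Reproduction risk	Low to medium (synthetic part fully reproducible; YouTube part carries copyright risk; annotation protocol still to be checked)
Engineering value	High (evaluation gate for robotics / long-horizon agents)
Academic novelty	Medium to high (limited novelty as a benchmark, but the root-cause diagnosis is novel; together with UXBench and Expense-of-Seeing it forms an evaluation-critique theme)
Relation to this week's existing flyP directions	Continues 6-19 UXBench / 6-18 Expense-of-Seeing / 6-17 Seeker; contradicts the "MLLM as planner + tools" paradigm of the 6-21 morning-read S-Agent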

3. Topic-page suggestions for Stephen's sync task

notes/multimodal/agentic-mllm-perception-failure-2026.md: worth creating; merges this run's VSTAT + 6-19 UXBench + 6-18 Expense-of-Seeing + 6-17 multimodal-positional-evidence + a back-reference to the 6-21 morning-read S-Agent into an "agentic MLLM paradigm ≠ perception savior" topic page
notes/evaluation/2026-benchmark-critique-roundup.md: worth creating; merges 6-18 Expense-of-Seeing + 6-19 UXBench + this run's VSTAT + 6-19 mmlongembed (close read pending) into a "2026 evaluation-critique trends" topic page
notes/multimodal/video-mllm-state-tracking-2026.md: worth creating; merges this run's VSTAT + 6-17 Seeker + 6-19 GateMem + 6-12 LongVideoAgent + 6-20 NA-VQA (CVPR Workshop) into a "video MLLM state tracking and externalized memory" topic page

4. Tag summary

#afternoon-read #multimodal-benchmark #video-mllm #visual-state-tracking #negative-finding #perception-bottleneck #agent-fail #nyu #kaist #saining-xie #gemini-3-1-pro #qwen3-vl #llava-onevision-2 #cambrian #molmo2 #evaluation-methodology #blender-synthesis #reproduction-medium #v1-2026 #cvpr-2026-watchlist #neurips-2026-watchlist #cvpr-2027-watchlist

5. Suggested write paths

This close-read draft: /shared/research-kb/inbox/flyp/2026-06-21-afternoon-read-VSTAT-visual-state-tracking.md (this file)
Sync suggestions (coordinated by Stephen's sync task, not done in this run):
research-kb/published/notes/multimodal/2026-06-21-VSTAT-visual-state-tracking.md
research-kb/published/notes/evaluation/2026-06-21-VSTAT-evaluation-methodology.md
Topic pages (merged): research-kb/published/notes/multimodal/agentic-mllm-perception-failure-2026.md, research-kb/published/notes/evaluation/2026-benchmark-critique-roundup.md, research-kb/published/notes/multimodal/video-mllm-state-tracking-2026.md

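6. Open questions for human confirmation

VSTAT project page / GitHub / HF links: the abstract mentions a website, but the PDF does not give it directly
Copyright / download / rehost status of the 304 YouTube clips: affects reproducibility
Exact definition of the MRA-with-MCQ metric (mentioned in the Table 2 footnote): is it VSTAT-specific, or borrowed from Video-MME?
Exact API parameters behind Gemini 3.1 Pro "low" / "high" (thinking budget? temperature?)
Agent-experiment baselines (MLLM video agent + Claude Opus 4.7 + OpenAI coding agent): which agents exactly? Is the video agent Active from wang2025active, or LongVideoAgent?
Target venue: NeurIPS 2026 / CVPR 2027 / ICCV 2027? Any follow-up versions from the authors?
Detailed configuration of the 9 Blender environments (USD assets, camera paths, object-motion scripts): will it be open-sourced?
Thinking-mode models (e.g. Qwen3VL-8B-Thinking) score worse than their Instruct versions: is this consistent with the "larger thinking budget, stronger hallucination" signal noted on 6-19?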

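7. Why this run does not write to `review/` or `published/`

Per the shared knowledge-base write rules of 2026-06-09: this task only produces a GitHub-ready draft and does not run git commit / git push / gh pr; the final merge is handled serially by the sync task that Stephen coordinates. This file's path follows the /shared/research-kb/inbox/flyp/YYYY-MM-DD-topic.md naming convention.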

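8. Key judgments (for later close reads / topic pages to cite)

8.1 Core conclusions (short summary)

SOTA multimodal video MLLMs are close to the chance baselines on visual state tracking: Gemini-3.1 Pro 44.4% vs Human 90.5% vs Frequency baseline 37.8%
The bottleneck is perception, not reasoning: SOTA is almost perfect in the text-transcript condition
Frame subsampling is not the main cause: improvement after temporal stretching is marginal
The agentic paradigm does not save the day: both video agents and coding agents fail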

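8.2 Broader significance

VSTAT + Expense-of-Seeing + UXBench all point the same way: "MLLMs look ever stronger, but their basic visual abilities may be more fragile than assumed"
In contrast with the "MLLM as planner + tools" paradigm of morning-read S-Agent: tools address deciding when to call and integrating evidence, whereas perceiving the right thing in the first place is a lower-level problem
Inference: real breakthroughs may come from (1) redesigned visual backbones (perception-first), (2) externalized visual memory (Store-Retrieve-Reason loop), (3) multimodal RL with perception reward, rather than simply stacking parameters / agent steps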

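8.3 Concrete revision suggestions for morning-read S-Agent

S-Agent uses 4 classes of spatial tools to handle counting / measurement / orientation / relative position
But VSTAT shows that if the underlying visual perception is wrong (failed entity association, wrong event recognition), tool calls merely "do correct arithmetic on wrong data"
S-Agent should run an ablation on VSTAT: the gap between base VLM and VLM + S-Agent on VSTAT would quantify how much of the error tool calls can fix versus how much is perception error
This could be covered by 1-2 additional ablation experiments in the S-Agent paper, and could be sent as S-Agent author feedback (at the CVPR/NeurIPS rebuttal stage)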

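8.4 Division of labor with other instances in today's same time slot

tom 6-21 radar: 4 candidates (SAC / S-Agent / ToolPrivBench / Qiskit RAG); no overlap with this run
jay 6-21 late-evening supplement: focused on pgvector / Vector DB / KubeCon; no overlap with this run
flyP 6-21 morning-read: S-Agent spatial tool use; this run serves as its counterpoint
stephen 6-21 evening check: coordination draft; VSTAT was listed as a "close read" item 9 days ago (jay's 6-12 research-briefing) but never finished; this run completes it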

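8.5 Substack usage

This run: 0/1 (does not consume the morning quota)
Task total: 0/3 (1 item each for morning / afternoon / evening; this run stands at 0/1)
A later evening-read can still add 1 Substack item (suggested focus: newsletter commentary on the perception-failure theme)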