Paper Reading Notes: "Keeping Your Eye on the Ball"

Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers (NeurIPS 2021)
main idea: pooling along motion trajectories would provide a more natural inductive bias for video data, allowing the network to be invariant to camera motion.

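The video is first preprocessed into ST tokens, using a cuboid embedding to aggregate disjoint spatio-temporal blocks. Temporal and spatial positional encodings are then added to the features, and finally a learnable classification token is added to the sequence.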
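
The tokenization step can be sketched in NumPy as below. This is a minimal illustration, not the authors' code: the cuboid size (`t=2`, `p=16`), the embedding width `D=32`, the random matrices standing in for learned weights, and the choice to prepend the classification token are all assumptions made for the example.

```python
import numpy as np

def cuboid_tokens(video, t=2, p=16):
    """Cut a (frames, H, W, C) video into disjoint t x p x p cuboids, one flat vector each."""
    frames, H, W, C = video.shape
    assert frames % t == 0 and H % p == 0 and W % p == 0
    T, h, w = frames // t, H // p, W // p
    x = video.reshape(T, t, h, p, w, p, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)        # (T, h, w, t, p, p, C)
    return x.reshape(T, h * w, t * p * p * C)   # (T, S, cuboid_dim)

def embed(video, D=32, seed=0):
    """Project cuboids to D dims, add positional encodings, add a classification token."""
    rng = np.random.default_rng(seed)
    tokens = cuboid_tokens(video)
    T, S, dim = tokens.shape
    # Random matrices below are stand-ins for learned parameters (assumption).
    x = tokens @ (0.02 * rng.standard_normal((dim, D)))   # linear projection
    x = x + 0.02 * rng.standard_normal((T, 1, D))         # temporal positional encoding
    x = x + 0.02 * rng.standard_normal((1, S, D))         # spatial positional encoding
    cls = 0.02 * rng.standard_normal((1, D))              # classification token
    return np.concatenate([cls, x.reshape(T * S, D)], axis=0)

video = np.random.default_rng(1).standard_normal((8, 64, 64, 3))
sequence = embed(video)
print(sequence.shape)  # (65, 32): 1 classification token + S*T = 16*4 tokens
```

Here T = 4 and S = 16, so the sequence holds ST = 64 video tokens plus the classification token.
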
video self-attention
joint space-time attention

- Computation is quadratic in the number of tokens ($O(S^2T^2)$ for $ST$ tokens)
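
As a rough sketch of why the cost is quadratic, joint space-time attention flattens all ST tokens into one sequence and builds a single ST x ST attention matrix. The function names, the toy sizes and the usual `sqrt(D)` scaling are assumptions for illustration; learned projections and multiple heads are omitted.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(q, k, v):
    """q, k, v: (T, S, D). Every token attends to all S*T tokens at once."""
    T, S, D = q.shape
    qf, kf, vf = (a.reshape(T * S, D) for a in (q, k, v))
    attn = softmax(qf @ kf.T / np.sqrt(D))    # (S*T, S*T): the quadratic part
    return (attn @ vf).reshape(T, S, D), attn

rng = np.random.default_rng(0)
T, S, D = 4, 16, 8
x = rng.standard_normal((T, S, D))
out, attn = joint_attention(x, x, x)
print(attn.shape)  # (64, 64)
```
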
divided space-time attention (TimeSformer)

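- Advantage: the computation drops to $O(S^2T)$ or $O(ST^2)$
- The model analyzes time and space independently (it cannot reason about spatial and temporal information jointly)
- It processes a lot of redundant information, and it is inflexible and cannot make full use of the informative tokens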
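
A simplified sketch of divided attention, assuming temporal attention is applied first and spatial attention second, and reusing the input as query, key and value (no learned projections). It only illustrates the shapes of the two attention maps, whose sizes are S·T² and T·S² instead of (ST)².

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attend(x):
    """Self-attention over the second-to-last axis of x, batched over leading axes."""
    d = x.shape[-1]
    a = softmax(x @ np.swapaxes(x, -1, -2) / np.sqrt(d))
    return a @ x, a

def divided_attention(x):
    """x: (T, S, D). Attend over time per location, then over space per frame."""
    xt = np.swapaxes(x, 0, 1)            # (S, T, D)
    yt, a_time = attend(xt)              # a_time: (S, T, T)
    y = np.swapaxes(yt, 0, 1)            # back to (T, S, D)
    out, a_space = attend(y)             # a_space: (T, S, S)
    return out, a_time, a_space

rng = np.random.default_rng(0)
T, S, D = 4, 16, 8
out, a_time, a_space = divided_attention(rng.standard_normal((T, S, D)))
print(a_time.size, a_space.size, (S * T) ** 2)  # 256 1024 4096
```

Note that the temporal step only ever compares tokens at the same spatial index, which is the independence limitation listed above.
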
trajectory attention

- better able to characterize the temporal information contained in videos
- aggregates information along implicitly determined motion paths
- aims to share information along the motion path of the ball

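- Applied along the spatial dimension, independently for each frame
- Implicitly locates the trajectory's position at each time step (by comparing queries and keys)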
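
The first stage can be sketched as follows: for each query, a softmax over space is taken separately within every frame, giving one "trajectory token" per frame. The toy "ball" setup (a distinctive feature vector placed at a different location in each frame), the `sqrt(D)` scaling and all names are assumptions for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def trajectory_tokens(q, k, v):
    """q, k, v: (T, S, D). Returns y: (T, S, T', D) and attn: (T, S, T', S')."""
    D = q.shape[-1]
    scores = np.einsum('tsd,uvd->tsuv', q, k) / np.sqrt(D)
    attn = softmax(scores, axis=-1)            # normalize over space s' within each frame t'
    y = np.einsum('tsuv,uvd->tsud', attn, v)   # one pooled token per (query, frame)
    return y, attn

rng = np.random.default_rng(0)
T, S, D = 4, 16, 16
ball = 5.0 * np.ones(D)           # distinctive "ball" feature (toy assumption)
path = [3, 4, 9, 10]              # spatial index of the ball in each frame
k = rng.standard_normal((T, S, D))
for t, s in enumerate(path):
    k[t, s] = ball
y, attn = trajectory_tokens(k, k, k)

# For the query sitting on the ball in frame 0, find the most attended location per frame.
found = attn[0, path[0]].argmax(axis=-1)
print(found.tolist())  # [3, 4, 9, 10]
```

No explicit tracker is involved: the ball's location in each frame falls out of the query-key comparison.
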
trajectory aggregation


- Applied along the temporal dimension
- Complexity:
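
A sketch of the second stage, pooling along time over the per-frame trajectory tokens. The second set of projections (`Wq`, `Wk`, `Wv`) and the choice of the query as the trajectory token at the query's own frame are assumptions here, since the notes do not spell them out. The input `y` has shape (T, S, T', D), i.e. one token per query and per frame.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def trajectory_aggregation(y, Wq, Wk, Wv):
    """y: (T, S, T', D) trajectory tokens. Returns out: (T, S, D), attn: (T, S, T')."""
    T, S, _, D = y.shape
    frames = np.arange(T)
    q2 = y[frames, :, frames, :] @ Wq     # token at the query's own frame, (T, S, D)
    k2 = y @ Wk                           # (T, S, T', D)
    v2 = y @ Wv                           # (T, S, T', D)
    scores = np.einsum('tsd,tsud->tsu', q2, k2) / np.sqrt(D)
    attn = softmax(scores, axis=-1)       # normalize over time t' along each trajectory
    out = np.einsum('tsu,tsud->tsd', attn, v2)
    return out, attn

rng = np.random.default_rng(0)
T, S, D = 4, 16, 8
y = rng.standard_normal((T, S, T, D))     # stand-in for stage-one output
Wq, Wk, Wv = (0.1 * rng.standard_normal((D, D)) for _ in range(3))
out, attn = trajectory_aggregation(y, Wq, Wk, Wv)
print(out.shape, attn.shape)  # (4, 16, 8) (4, 16, 4)
```
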
Approximating attention
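Prototype-based attention approximation:
- Reconstruct the attention operation as accurately as possible while using as few prototypes as possible
- The authors select the R most orthogonal vectors from the queries and keys as the prototypes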


The maximum complexity is $RDN$.
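
A minimal sketch of the idea, under stated assumptions: prototypes are picked greedily (each new pick minimizes its largest absolute cosine similarity to those already chosen, starting from an arbitrary first row), and attention is routed through them as two small softmaxes, so no N x N matrix is ever formed. `N` is taken to be the number of tokens and `D` the feature dimension; the greedy rule and the function names are illustrative, not necessarily the authors' exact procedure.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def most_orthogonal_subset(x, R):
    """Greedily pick R rows of x (M, D) that are as mutually orthogonal as possible."""
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    chosen = [0]                                   # arbitrary starting row (assumption)
    max_sim = np.abs(xn @ xn[0])                   # similarity to the chosen set so far
    for _ in range(R - 1):
        max_sim[chosen] = np.inf                   # never pick the same row twice
        nxt = int(np.argmin(max_sim))
        chosen.append(nxt)
        max_sim = np.maximum(max_sim, np.abs(xn @ xn[nxt]))
    return x[chosen]

def prototype_attention(q, k, v, R):
    """q, k, v: (N, D). Approximate softmax attention via R prototypes."""
    D = q.shape[-1]
    p = most_orthogonal_subset(np.concatenate([q, k]), R)   # (R, D)
    omega1 = softmax(q @ p.T / np.sqrt(D))                  # (N, R): queries -> prototypes
    omega2 = softmax(p @ k.T / np.sqrt(D))                  # (R, N): prototypes -> keys
    return omega1 @ (omega2 @ v)                            # (N, D), cost on the order of R*D*N

rng = np.random.default_rng(0)
N, D, R = 64, 8, 6
q, k, v = (rng.standard_normal((N, D)) for _ in range(3))
print(prototype_attention(q, k, v, R).shape)  # (64, 8)
```

Multiplying `omega2 @ v` first is what keeps the cost linear in N; computing `omega1 @ omega2` first would rebuild the full N x N matrix.
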
Experiments


- Post author:sixwalter
- Create time:2023-03-13 00:00:00
- Post link:https://coelien.github.io/2023/03/13/paper-reading/paper_reading_062/
- Copyright Notice:All articles in this blog are licensed under BY-NC-SA unless stating additionally.