论文阅读笔记:“Keeping Your Eye on the Ball”
sixwalter

Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers (NeurIPS 2021)

Main idea: pooling along motion trajectories provides a more natural inductive bias for video data, allowing the network to be invariant to camera motion.

image-20230320124427201

The video is first preprocessed into $ST$ tokens, using a cuboid embedding to aggregate disjoint space-time patches. Temporal and spatial positional encodings are then added to the features, and finally a learnable classification token is prepended to the sequence.
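The cuboid embedding can be sketched in plain numpy; all sizes below (frames, resolution, cuboid shape, embedding dimension) are hypothetical, and the learned linear projection is replaced by a random matrix:

```python
import numpy as np

# Hypothetical sizes: a video of T=8 frames, 32x32 pixels, 3 channels,
# split into non-overlapping 2x16x16 cuboids, embedded into D=64 dims.
T, H, W, C = 8, 32, 32, 3
t, h, w, D = 2, 16, 16, 64

video = np.random.randn(T, H, W, C)
proj = np.random.randn(t * h * w * C, D)   # stand-in for the learned projection

# Split the video into disjoint t x h x w cuboids and flatten each one.
cuboids = (video
           .reshape(T // t, t, H // h, h, W // w, w, C)
           .transpose(0, 2, 4, 1, 3, 5, 6)
           .reshape(-1, t * h * w * C))
tokens = cuboids @ proj                    # one token per cuboid

print(tokens.shape)  # (16, 64): the token sequence, before positional encodings
```

Positional encodings and the classification token would then be added to this sequence.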

video self-attention

joint space-time attention

image-20230320093506840
  • Computational cost is quadratic in the number of tokens ($O(S^2T^2)$)
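A minimal numpy sketch of joint space-time attention, with made-up sizes and identity projections in place of learned ones, to show where the quadratic $(ST) \times (ST)$ score matrix comes from:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Joint space-time attention: every token attends to all S*T tokens at once,
# so the score matrix is (S*T) x (S*T) -- quadratic in the sequence length.
S, T, D = 4, 3, 8                      # hypothetical small sizes
x = np.random.randn(S * T, D)
Wq = Wk = Wv = np.eye(D)               # identity projections for the sketch
q, k, v = x @ Wq, x @ Wk, x @ Wv

scores = q @ k.T / np.sqrt(D)          # shape (S*T, S*T)
out = softmax(scores) @ v

print(scores.shape)  # (12, 12)
```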

divided space-time attention (TimeSformer)

image-20230320094305732
  • Advantage: the cost drops to roughly $O(T^2S + S^2T)$
  • The model treats time and space independently (it cannot reason over space and time jointly)
  • It still processes a large amount of redundant information, and is inflexible: it cannot make full use of the informative tokens
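The divided scheme can be sketched by reshaping the tokens to $(T, S, D)$ and attending along one axis at a time; the sizes and identity projections below are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(x):
    """Self-attention over the second-to-last axis (identity projections)."""
    d = x.shape[-1]
    a = softmax(x @ np.swapaxes(x, -1, -2) / np.sqrt(d))
    return a @ x

# Divided space-time attention: each spatial location first attends over time
# (S sets of T x T scores), then each frame attends over space (T sets of S x S).
S, T, D = 4, 3, 8
x = np.random.randn(T, S, D)

x_time = attend(x.transpose(1, 0, 2)).transpose(1, 0, 2)  # temporal attention
x_out = attend(x_time)                                    # spatial attention

print(x_out.shape)  # (3, 4, 8)
```

The two smaller score matrices are why the cost falls from $O(S^2T^2)$ to $O(T^2S + S^2T)$.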

trajectory attention

image-20230320092501493
  • better able to characterize the temporal information contained in videos
  • aggregates information along implicitly determined motion paths
  • aims to share information along the motion path of the ball
image-20230320095422970
  • Applied along the spatial dimension, with each frame handled independently
  • Implicitly locates the trajectory's position at each time step (by comparing queries and keys)
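A numpy sketch of this spatial stage (hypothetical sizes, no learned projections): the softmax is taken over the locations of each frame separately, so every query obtains one "trajectory token" per frame, implicitly pointing at where its content moved to:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Trajectory attention, spatial stage: a query at (t, s) attends over the S
# locations of every frame t' independently (softmax per frame).
S, T, D = 4, 3, 8
q = np.random.randn(T, S, D)                 # one query per (t, s)
k = np.random.randn(T, S, D)                 # keys per frame
v = np.random.randn(T, S, D)                 # values per frame

# scores[t, s, t', s'] compares query (t, s) with every key of frame t'.
scores = np.einsum('tsd,ued->tsue', q, k) / np.sqrt(D)
probs = softmax(scores, axis=-1)             # softmax over s' only (per frame)
traj = np.einsum('tsue,ued->tsud', probs, v) # trajectory tokens y_{st,t'}

print(traj.shape)  # (3, 4, 3, 8): one token per (query, frame) pair
```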

trajectory aggregation

image-20230320100927726 image-20230320101056901
  • Applied along the temporal dimension
  • Complexity: still quadratic overall, $O(S^2T^2)$, the same as joint attention
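The aggregation stage can then be sketched as a 1-D attention over each query's trajectory tokens, pooling along the implied motion path (again with made-up sizes and no learned projections):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Trajectory aggregation: each query at (t, s) owns T trajectory tokens (one
# per frame) and pools them with a 1-D temporal attention (T scores per query).
T, S, D = 3, 4, 8
q = np.random.randn(T, S, D)                 # one query per (t, s)
traj = np.random.randn(T, S, T, D)           # y_{st,t'} from the spatial stage

scores = np.einsum('tsd,tsud->tsu', q, traj) / np.sqrt(D)  # (T, S, T)
probs = softmax(scores, axis=-1)                           # softmax over frames t'
out = np.einsum('tsu,tsud->tsd', probs, traj)              # pooled along the path

print(out.shape)  # (3, 4, 8)
```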

Approximating attention

Prototype-based attention approximation:

  • Reconstruct the attention operation as accurately as possible while using as few prototypes as possible
  • The authors select the R most orthogonal vectors from the queries and keys as the prototypes
image-20230320111921268 image-20230320112333825

Since the dominant term has complexity $O(RDN)$ with $R \ll N$, the attention cost drops from $O(N^2)$ to $O(N)$.
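The prototype selection can be sketched as a greedy search for mutually orthogonal vectors; the function name and the exact greedy criterion below are illustrative, not the paper's code:

```python
import numpy as np

def pick_prototypes(x, R):
    """Greedily pick R unit vectors from x that are as mutually
    orthogonal as possible (a sketch of prototype selection)."""
    x = x / np.linalg.norm(x, axis=-1, keepdims=True)
    chosen = [0]                                  # start from the first vector
    for _ in range(R - 1):
        # highest cosine similarity of each candidate to the chosen set
        sim = np.abs(x @ x[chosen].T).max(axis=1)
        sim[chosen] = np.inf                      # never re-pick a prototype
        chosen.append(int(sim.argmin()))          # most orthogonal candidate
    return x[chosen]

N, D, R = 32, 8, 4
qk = np.random.randn(N, D)                        # stacked queries and keys
protos = pick_prototypes(qk, R)

print(protos.shape)  # (4, 8)
```

The approximation then routes information through the prototypes via two smaller attentions (of size N x R and R x N), which is linear in N for a fixed R.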

Experiments

image-20230320124357224 image-20230320125626322
  • Create time:2023-03-13 00:00:00
  • Post link:https://coelien.github.io/2023/03/13/paper-reading/paper_reading_062/
  • Copyright Notice:All articles in this blog are licensed under BY-NC-SA unless stating additionally.