论文阅读笔记:“TokenLearner:What Can 8 Learned Tokens Do for Images and Videos?(NIPS 2021)”
sixwalter Lv6

TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?(NIPS 2021)

网络结构

Framework for Video

image-20230306100347875

TokenLearner自适应地学习标记向量的集合,MHSA对时空关联进行建模,最后TokenFuser将它们结合起来,并重建为原始的输入tensor大小。

TokenLearner

image-20230306093942601

TokenFuser

  • token-wise linear layer: 在各个token间融合时空信息

  • token特征的反向映射(映射回原先的tensor shape)

image-20230306094504359

Embedding into ViT architecture

image-20230306094153553

实验

kinetics400上的SOTA对比

image-20230306101409167

charades数据集上的模型优化

将TokenLearner嵌入X3D中,将3D卷积换成了一对2D卷积和1D卷积。1D卷积采用是TokenLearner进行替换。并把MHSA替换未来vector transformer。

image-20230306103703827 image-20230306104545348
  • Post title:论文阅读笔记:“TokenLearner:What Can 8 Learned Tokens Do for Images and Videos?(NIPS 2021)”
  • Post author:sixwalter
  • Create time:2023-03-06 00:00:00
  • Post link:https://coelien.github.io/2023/03/06/paper-reading/paper_reading_060/
  • Copyright Notice:All articles in this blog are licensed under BY-NC-SA unless stating additionally.
 Comments