Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
arxiv.org › abs › 2101
Jan 28, 2021 · To overcome such limitations, we propose a new Tokens-To-Token Vision Transformer (T2T-ViT), which incorporates 1) a layer-wise Tokens-to-Token (T2T) transformation to progressively structurize the image into tokens by recursively aggregating neighboring tokens into one token (Tokens-to-Token), such that local structure represented by surrounding tokens can be modeled and the token length can be reduced; 2) an efficient backbone with a deep-narrow structure for vision transformers, motivated by CNN ...
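To make the aggregation step concrete, here is a minimal PyTorch sketch of one Tokens-to-Token iteration: re-structurize the token sequence into a 2D grid, then "soft split" it with an overlapping unfold so each new token concatenates a neighborhood of old tokens, reducing the token count while widening each token. The class name T2TAggregation, the kernel/stride/padding values, and the final linear projection are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class T2TAggregation(nn.Module):
    # One T2T step (sketch): tokens -> 2D grid -> overlapping unfold,
    # so each output token merges a k x k neighborhood of input tokens.
    # With stride 2 the token count drops roughly 4x per step.
    def __init__(self, dim, kernel=3, stride=2, padding=1):
        super().__init__()
        self.unfold = nn.Unfold(kernel_size=kernel, stride=stride, padding=padding)
        # Illustrative projection back to a working width; the paper's
        # exact layer layout may differ.
        self.proj = nn.Linear(dim * kernel * kernel, dim)

    def forward(self, tokens, h, w):
        b, n, c = tokens.shape                           # tokens: (B, N, C), N == h * w
        x = tokens.transpose(1, 2).reshape(b, c, h, w)   # re-structurize to a 2D grid
        x = self.unfold(x)                               # (B, C*k*k, N_new): neighbors merged
        return self.proj(x.transpose(1, 2))              # (B, N_new, C): fewer, re-projected tokens

# Usage: 196 tokens on a 14x14 grid shrink to 49 tokens.
t2t = T2TAggregation(dim=64)
out = t2t(torch.randn(2, 14 * 14, 64), h=14, w=14)
print(out.shape)  # torch.Size([2, 49, 64])

Per the abstract, the full model applies this transformation layer-wise and recursively, interleaved with transformer layers, until the token length is short enough to feed the deep-narrow backbone.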
Tokens-to-Token ViT: Training Vision Transformers From Scratch on ImageNet
https://openaccess.thecvf.com/content/ICCV2021/papers/Yuan_Toke…
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet. Li Yuan¹*, Yunpeng Chen², Tao Wang¹٬³, Weihao Yu¹, Yujun Shi¹, Zihang Jiang¹, Francis E.H. Tay¹, Jiashi Feng¹, Shuicheng Yan¹. ¹National University of Singapore, ²YITU Technology, ³Institute of Data Science, National University of Singapore. yuanli@u.nus.edu, yunpeng.chen@yitu-inc.com, …