A Robust Volumetric Transformer for Accurate 3D Tumor Segmentation

International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2021

Himashi Peiris

Munawar Hayat

Zhaolin Chen

Gary Egan

Mehrtash Harandi

ViT MedIm

207

182

26 Nov 2021

SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning

Computer Vision and Pattern Recognition (CVPR), 2021

Zicheng Liu

337

302

25 Nov 2021

PolyViT: Co-training Vision Transformers on Images, Videos and Audio

Valerii Likhosherstov

192

25 Nov 2021

MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning

Yu Qiao

208

24 Nov 2021

PhysFormer: Facial Video-based Physiological Measurement with Temporal Difference Transformer

Jingang Shi

344

241

23 Nov 2021

Efficient Video Transformers with Spatial-Temporal Token Selection

Zuxuan Wu

199

23 Nov 2021

Ice hockey player identification via transformers and weakly supervised learning

156

22 Nov 2021

Florence: A New Foundation Model for Computer Vision

Lu Yuan

...

Jianwei Yang

391

1,049

22 Nov 2021

Exploring Segment-level Semantics for Online Phase Recognition from Surgical Videos

IEEE Transactions on Medical Imaging (IEEE TMI), 2021

Xinpeng Ding

Xiaomeng Li

345

22 Nov 2021

Swin Transformer V2: Scaling Up Capacity and Resolution

...

553

2,413

18 Nov 2021

Evaluating Transformers for Lightweight Action Recognition

226

18 Nov 2021

Benchmarking and scaling of deep learning models for land cover image classification

Ioannis Papoutsis

Nikolaos Ioannis Bountos

Angelos Zavras

Dimitrios Michail

Christos Tryfonopoulos

454

18 Nov 2021

Mask-guided Spectral-wise Transformer for Efficient Hyperspectral Image Reconstruction

Computer Vision and Pattern Recognition (CVPR), 2021

Jing Lin

Radu Timofte

Luc Van Gool

167

344

15 Nov 2021

Relational Self-Attention: What's Missing in Attention for Video Understanding

Neural Information Processing Systems (NeurIPS), 2021

173

02 Nov 2021

With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition

British Machine Vision Conference (BMVC), 2021

Dima Damen

297

01 Nov 2021

Blending Anti-Aliasing into Vision Transformer

Neural Information Processing Systems (NeurIPS), 2021

213

28 Oct 2021

History Aware Multimodal Transformer for Vision-and-Language Navigation

299

309

25 Oct 2021

The Efficiency Misnomer

International Conference on Learning Representations (ICLR), 2021

278

112

25 Oct 2021

SCENIC: A JAX Library for Computer Vision Research and Beyond

206

18 Oct 2021

Object-Region Video Transformers

382

13 Oct 2021

StARformer: Transformer with State-Action-Reward Representations for Visual Reinforcement Learning

European Conference on Computer Vision (ECCV), 2021

406

12 Oct 2021

TAda! Temporally-Adaptive Convolutions for Video Understanding

International Conference on Learning Representations (ICLR), 2021

415

12 Oct 2021

Multi-Modal Pre-Training for Automated Speech Recognition

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2021

220

12 Oct 2021

Video Is Graph: Structured Graph Module for Video Action Recognition

Rongjie Li

Xiaojun Wu

Tianyang Xu

368

12 Oct 2021

EfficientPhys: Enabling Simple, Fast and Accurate Camera-Based Vitals Measurement

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2021

291

135

09 Oct 2021

Exploring the Limits of Large Scale Pre-training

211

133

05 Oct 2021

PETA: Photo Albums Event Recognition using Transformers Attention

International Conference on Pattern Recognition (ICPR), 2021

132

26 Sep 2021

Long-Range Transformers for Dynamic Spatiotemporal Forecasting

Zhe Wang

309

117

24 Sep 2021

Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers

Sharan Narang

994

137

22 Sep 2021

Audio-Visual Speech Recognition is Worth $32\times 32\times 8$ Voxels

183

20 Sep 2021