2104.00743
Towards General Purpose Vision Systems
Computer Vision and Pattern Recognition (CVPR), 2021
1 April 2021
Tanmay Gupta
Amita Kamath
Aniruddha Kembhavi
Derek Hoiem
Papers citing "Towards General Purpose Vision Systems"
47 / 47 papers shown
SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment
Guoxin Zang
Xue Li
Donglin Di
Lanshun Nie
Dechen Zhan
Yang Song
Lei Fan
VLM
244
1
0
10 Jul 2025
Light as Deception: GPT-driven Natural Relighting Against Vision-Language Pre-training Models
Ying Yang
Jie Zhang
Xiao Lv
Di Lin
Tao Xiang
Qing Guo
AAML
VLM
107
1
0
30 May 2025
IKIWISI: An Interactive Visual Pattern Generator for Evaluating the Reliability of Vision-Language Models Without Ground Truth
Md Touhidul Islam
Imran Kabir
Md. Alimoor Reza
Syed Masum Billah
128
0
0
28 May 2025
Cross-Model Transfer of Task Vectors via Few-Shot Orthogonal Alignment
Kazuhiko Kawamoto
Atsuhiro Endo
Hiroshi Kera
238
0
0
17 May 2025
ResNetVLLM -- Multi-modal Vision LLM for the Video Understanding Task
Ahmad Khalil
Mahmoud Khalil
A. Ngom
VLM
204
1
0
20 Apr 2025
Logic-RAG: Augmenting Large Multimodal Models with Visual-Spatial Knowledge for Road Scene Understanding
IEEE International Conference on Robotics and Automation (ICRA), 2025
Imran Kabir
Md. Alimoor Reza
Syed Masum Billah
ReLM
VLM
LRM
215
3
0
16 Mar 2025
Benchmarking Large and Small MLLMs
Xuelu Feng
Yunsheng Li
DongDong Chen
Mei Gao
Mengchen Liu
Junsong Yuan
Chunming Qiao
79
2
0
04 Jan 2025
Unlocking the Potential of Weakly Labeled Data: A Co-Evolutionary Learning Framework for Abnormality Detection and Report Generation
IEEE Transactions on Medical Imaging (IEEE TMI), 2024
Jinghan Sun
Dong-mei Wei
Zhe Xu
Donghuan Lu
Hong Liu
Hong Wang
Sotirios A. Tsaftaris
Jingyu Sun
Yefeng Zheng
Liansheng Wang
MedIm
285
0
0
18 Dec 2024
Locality Alignment Improves Vision-Language Models
International Conference on Learning Representations (ICLR), 2024
Ian Covert
Tony Sun
James Zou
Tatsunori Hashimoto
VLM
489
11
0
14 Oct 2024
Structured Spatial Reasoning with Open Vocabulary Object Detectors
Negar Nejatishahidin
Madhukar Reddy Vongala
Jana Kosecka
172
3
0
09 Oct 2024
Affordance-based Robot Manipulation with Flow Matching
Fan Zhang
Michael Gienger
530
40
0
02 Sep 2024
Universal Novelty Detection Through Adaptive Contrastive Learning
Computer Vision and Pattern Recognition (CVPR), 2024
Hossein Mirzaei
Mojtaba Nafez
Mohammad Jafari
Mohammad Bagher Soltani
Mohammad Azizmalayeri
Jafar Habibi
Mohammad Sabokrou
M. Rohban
190
11
0
20 Aug 2024
VDebugger: Harnessing Execution Feedback for Debugging Visual Programs
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Xueqing Wu
Zongyu Lin
Songyan Zhao
Te-Lin Wu
Pan Lu
Nanyun Peng
Kai-Wei Chang
LRM
238
3
0
19 Jun 2024
Q-GroundCAM: Quantifying Grounding in Vision Language Models via GradCAM
Navid Rajabi
Jana Kosecka
137
3
0
29 Apr 2024
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
Jiasen Lu
Christopher Clark
Sangho Lee
Zichen Zhang
Savya Khosla
Ryan Marten
Derek Hoiem
Aniruddha Kembhavi
VLM
MLLM
219
254
0
28 Dec 2023
Audio-Visual LLM for Video Understanding
Fangxun Shu
Lei Zhang
Hao Jiang
Cihang Xie
VLM
MLLM
213
60
0
11 Dec 2023
VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding
Yizhou Wang
Ruiyi Zhang
Haoliang Wang
Uttaran Bhattacharya
Yun Fu
Gang Wu
MLLM
215
18
0
04 Dec 2023
DAMEX: Dataset-aware Mixture-of-Experts for visual understanding of mixture-of-datasets
Yash Jain
Harkirat Singh Behl
Z. Kira
Vibhav Vineet
135
24
0
08 Nov 2023
Multitask Multimodal Prompted Training for Interactive Embodied Task Completion
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Georgios Pantazopoulos
Malvina Nikandrou
Amit Parekh
Bhathiya Hemanthage
Arash Eshghi
Ioannis Konstas
Verena Rieser
Oliver Lemon
Alessandro Suglia
LM&Ro
156
10
0
07 Nov 2023
RoboVQA: Multimodal Long-Horizon Reasoning for Robotics
IEEE International Conference on Robotics and Automation (ICRA), 2023
P. Sermanet
Tianli Ding
Jeffrey Zhao
Fei Xia
Debidatta Dwibedi
...
Pannag R Sanketi
Karol Hausman
Izhak Shafran
Brian Ichter
Yuan Cao
LM&Ro
207
94
0
01 Nov 2023
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
Wei-Ge Chen
Irina Spiridonova
Jianwei Yang
Jianfeng Gao
Chunyuan Li
MLLM
VLM
154
45
0
01 Nov 2023
Apollo: Zero-shot MultiModal Reasoning with Multiple Experts
Daniela Ben-David
Tzuf Paz-Argaman
Reut Tsarfaty
MoE
131
0
0
25 Oct 2023
Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API
Zhizheng Zhang
Wenxuan Xie
Xiaoyi Zhang
Yan Lu
180
15
0
07 Oct 2023
InstructDiffusion: A Generalist Modeling Interface for Vision Tasks
Computer Vision and Pattern Recognition (CVPR), 2023
Zigang Geng
Binxin Yang
Tiankai Hang
Chen Li
Shuyang Gu
...
Jianmin Bao
Zheng Zhang
Han Hu
DongDong Chen
Baining Guo
DiffM
VLM
225
154
0
07 Sep 2023
Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models
Navid Rajabi
Jana Kosecka
VLM
235
17
0
18 Aug 2023
DesCo: Learning Object Recognition with Rich Language Descriptions
Neural Information Processing Systems (NeurIPS), 2023
Liunian Harold Li
Zi-Yi Dou
Nanyun Peng
Kai-Wei Chang
ObjD
VLM
149
26
0
24 Jun 2023
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Muhammad Maaz
H. Rasheed
Salman Khan
Fahad Shahbaz Khan
MLLM
309
921
0
08 Jun 2023
BiomedGPT: A Unified and Generalist Biomedical Generative Pre-trained Transformer for Vision, Language, and Multimodal Tasks
Nature Network Boston (NNB), 2023
Kai Zhang
Jun Yu
Eashan Adhikarla
Rong Zhou
Zhilin Yan
...
Hang Zhang
Yong Chen
Shijie Zhao
Hongfang Liu
Lichao Sun
LM&MA
MedIm
196
11
0
26 May 2023
Type-to-Track: Retrieve Any Object via Prompt-based Tracking
Neural Information Processing Systems (NeurIPS), 2023
Pha Nguyen
Kha Gia Quach
Kris Kitani
Khoa Luu
235
31
0
22 May 2023
Token Boosting for Robust Self-Supervised Visual Transformer Pre-training
Computer Vision and Pattern Recognition (CVPR), 2023
Tianjiao Li
Lin Geng Foo
Ping Hu
Xindi Shang
Hossein Rahmani
Zehuan Yuan
Jing Liu
227
7
0
09 Apr 2023
Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models
A. Maharana
Amita Kamath
Christopher Clark
Joey Tianyi Zhou
Aniruddha Kembhavi
191
3
0
28 Mar 2023
ViM: Vision Middleware for Unified Downstream Transferring
IEEE International Conference on Computer Vision (ICCV), 2023
Yutong Feng
Biao Gong
Jianwen Jiang
Yiliang Lv
Yujun Shen
Deli Zhao
Jingren Zhou
181
2
0
13 Mar 2023
MINOTAUR: Multi-task Video Grounding From Multimodal Queries
Raghav Goyal
E. Mavroudi
Xitong Yang
Sainbayar Sukhbaatar
Leonid Sigal
Matt Feiszli
Lorenzo Torresani
Du Tran
183
8
0
16 Feb 2023
Generalized Decoding for Pixel, Image, and Language
Computer Vision and Pattern Recognition (CVPR), 2022
Xueyan Zou
Zi-Yi Dou
Jianwei Yang
Zhe Gan
Linjie Li
...
Lu Yuan
Nanyun Peng
Lijuan Wang
Yong Jae Lee
Jianfeng Gao
VLM
MLLM
ObjD
252
322
0
21 Dec 2022
Universal Object Detection with Large Vision Model
International Journal of Computer Vision (IJCV), 2022
Feng-Huei Lin
Wenze Hu
Yaowei Wang
Yonghong Tian
Guangming Lu
Fanglin Chen
Yong-mei Xu
Xiaoyu Wang
VLM
ObjD
247
8
0
19 Dec 2022
SatlasPretrain: A Large-Scale Dataset for Remote Sensing Image Understanding
IEEE International Conference on Computer Vision (ICCV), 2022
Favyen Bastani
Piper Wolters
Ritwik Gupta
Joe Ferdinando
Aniruddha Kembhavi
273
163
0
28 Nov 2022
A Survey of Computer Vision Technologies In Urban and Controlled-environment Agriculture
ACM Computing Surveys (ACM CSUR), 2022
Jiayun Luo
Boyang Albert Li
Cyril Leung
278
21
0
20 Oct 2022
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
International Conference on Learning Representations (ICLR), 2022
Jiasen Lu
Christopher Clark
Rowan Zellers
Roozbeh Mottaghi
Aniruddha Kembhavi
ObjD
VLM
MLLM
333
467
0
17 Jun 2022
PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Yuan Yao
Qi-An Chen
Ao Zhang
Wei Ji
Zhiyuan Liu
Tat-Seng Chua
Maosong Sun
VLM
MLLM
169
43
0
23 May 2022
GRIT: General Robust Image Task Benchmark
Tanmay Gupta
Ryan Marten
Aniruddha Kembhavi
Derek Hoiem
VLM
OOD
ObjD
137
35
0
28 Apr 2022
FindIt: Generalized Localization with Natural Language Queries
European Conference on Computer Vision (ECCV), 2022
Weicheng Kuo
Fred Bertsch
Wei Li
A. Piergiovanni
M. Saffar
A. Angelova
ObjD
174
18
0
31 Mar 2022
UnifiedQA-v2: Stronger Generalization via Broader Cross-Format Training
Daniel Khashabi
Yeganeh Kordi
Hannaneh Hajishirzi
217
73
0
23 Feb 2022
ASC me to Do Anything: Multi-task Training for Embodied AI
Jiasen Lu
Jordi Salvador
Roozbeh Mottaghi
Aniruddha Kembhavi
138
3
0
14 Feb 2022
Webly Supervised Concept Expansion for General Purpose Vision Models
European Conference on Computer Vision (ECCV), 2022
Amita Kamath
Christopher Clark
Tanmay Gupta
Eric Kolve
Derek Hoiem
Aniruddha Kembhavi
VLM
240
65
0
04 Feb 2022
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling
Zhengyuan Yang
Zhe Gan
Jianfeng Wang
Xiaowei Hu
Faisal Ahmed
Zicheng Liu
Yumao Lu
Lijuan Wang
293
131
0
23 Nov 2021
Hey AI, Can You Solve Complex Tasks by Talking to Agents?
Tushar Khot
Kyle Richardson
Daniel Khashabi
Ashish Sabharwal
RALM
LRM
160
15
0
16 Oct 2021
Cross-Task Generalization via Natural Language Crowdsourcing Instructions
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
Swaroop Mishra
Daniel Khashabi
Chitta Baral
Hannaneh Hajishirzi
LRM
402
837
0
18 Apr 2021