ResearchTrend.AI
  • Communities
  • Connect sessions
  • AI calendar
  • Organizations
  • Join Slack
  • Contact Sales
Papers
Communities
Social Events
Terms and Conditions
Pricing
Contact Sales
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2104.00743
  4. Cited By
Towards General Purpose Vision Systems
v1v2 (latest)

Towards General Purpose Vision Systems

Computer Vision and Pattern Recognition (CVPR), 2021
1 April 2021
Tanmay Gupta
Amita Kamath
Aniruddha Kembhavi
Derek Hoiem
ArXiv (abs)PDFHTML

Papers citing "Towards General Purpose Vision Systems"

47 / 47 papers shown
Title
SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment
SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment
Guoxin Zang
Xue Li
Donglin Di
Lanshun Nie
Dechen Zhan
Yang Song
Lei Fan
VLM
244
1
0
10 Jul 2025
Light as Deception: GPT-driven Natural Relighting Against Vision-Language Pre-training Models
Light as Deception: GPT-driven Natural Relighting Against Vision-Language Pre-training Models
Ying Yang
Jie Zhang
Xiao Lv
Di Lin
Tao Xiang
Qing Guo
AAMLVLM
107
1
0
30 May 2025
IKIWISI: An Interactive Visual Pattern Generator for Evaluating the Reliability of Vision-Language Models Without Ground Truth
IKIWISI: An Interactive Visual Pattern Generator for Evaluating the Reliability of Vision-Language Models Without Ground Truth
Md Touhidul Islam
Imran Kabir
Md. Alimoor Reza
Syed Masum Billah
128
0
0
28 May 2025
Cross-Model Transfer of Task Vectors via Few-Shot Orthogonal Alignment
Cross-Model Transfer of Task Vectors via Few-Shot Orthogonal Alignment
Kazuhiko Kawamoto
Atsuhiro Endo
Hiroshi Kera
238
0
0
17 May 2025
ResNetVLLM -- Multi-modal Vision LLM for the Video Understanding Task
ResNetVLLM -- Multi-modal Vision LLM for the Video Understanding Task
Ahmad Khalil
Mahmoud Khalil
A. Ngom
VLM
204
1
0
20 Apr 2025
Logic-RAG: Augmenting Large Multimodal Models with Visual-Spatial Knowledge for Road Scene Understanding
Logic-RAG: Augmenting Large Multimodal Models with Visual-Spatial Knowledge for Road Scene UnderstandingIEEE International Conference on Robotics and Automation (ICRA), 2025
Imran Kabir
Md. Alimoor Reza
Syed Masum Billah
ReLMVLMLRM
215
3
0
16 Mar 2025
Benchmarking Large and Small MLLMs
Benchmarking Large and Small MLLMs
Xuelu Feng
Yunsheng Li
DongDong Chen
Mei Gao
Mengchen Liu
Junsong Yuan
Chunming Qiao
79
2
0
04 Jan 2025
Unlocking the Potential of Weakly Labeled Data: A Co-Evolutionary
  Learning Framework for Abnormality Detection and Report Generation
Unlocking the Potential of Weakly Labeled Data: A Co-Evolutionary Learning Framework for Abnormality Detection and Report GenerationIEEE Transactions on Medical Imaging (IEEE TMI), 2024
Jinghan Sun
Dong-mei Wei
Zhe Xu
Donghuan Lu
Hong Liu
Hong Wang
Sotirios A. Tsaftaris
Jingyu Sun
Yefeng Zheng
Liansheng Wang
MedIm
285
0
0
18 Dec 2024
Locality Alignment Improves Vision-Language Models
Locality Alignment Improves Vision-Language ModelsInternational Conference on Learning Representations (ICLR), 2024
Ian Covert
Tony Sun
James Zou
Tatsunori Hashimoto
VLM
489
11
0
14 Oct 2024
Structured Spatial Reasoning with Open Vocabulary Object Detectors
Structured Spatial Reasoning with Open Vocabulary Object Detectors
Negar Nejatishahidin
Madhukar Reddy Vongala
Jana Kosecka
172
3
0
09 Oct 2024
Affordance-based Robot Manipulation with Flow Matching
Affordance-based Robot Manipulation with Flow Matching
Fan Zhang
Michael Gienger
530
40
0
02 Sep 2024
Universal Novelty Detection Through Adaptive Contrastive Learning
Universal Novelty Detection Through Adaptive Contrastive LearningComputer Vision and Pattern Recognition (CVPR), 2024
Hossein Mirzaei
Mojtaba Nafez
Mohammad Jafari
Mohammad Bagher Soltani
Mohammad Azizmalayeri
Jafar Habibi
Mohammad Sabokrou
M. Rohban
190
11
0
20 Aug 2024
VDebugger: Harnessing Execution Feedback for Debugging Visual Programs
VDebugger: Harnessing Execution Feedback for Debugging Visual ProgramsConference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Xueqing Wu
Zongyu Lin
Songyan Zhao
Te-Lin Wu
Pan Lu
Nanyun Peng
Kai-Wei Chang
LRM
238
3
0
19 Jun 2024
Q-GroundCAM: Quantifying Grounding in Vision Language Models via GradCAM
Q-GroundCAM: Quantifying Grounding in Vision Language Models via GradCAM
Navid Rajabi
Jana Kosecka
137
3
0
29 Apr 2024
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision,
  Language, Audio, and Action
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
Jiasen Lu
Christopher Clark
Sangho Lee
Zichen Zhang
Savya Khosla
Ryan Marten
Derek Hoiem
Aniruddha Kembhavi
VLMMLLM
219
254
0
28 Dec 2023
Audio-Visual LLM for Video Understanding
Audio-Visual LLM for Video Understanding
Fangxun Shu
Lei Zhang
Hao Jiang
Cihang Xie
VLMMLLM
213
60
0
11 Dec 2023
VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding
VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding
Yizhou Wang
Ruiyi Zhang
Haoliang Wang
Uttaran Bhattacharya
Yun Fu
Gang Wu
MLLM
215
18
0
04 Dec 2023
DAMEX: Dataset-aware Mixture-of-Experts for visual understanding of
  mixture-of-datasets
DAMEX: Dataset-aware Mixture-of-Experts for visual understanding of mixture-of-datasets
Yash Jain
Harkirat Singh Behl
Z. Kira
Vibhav Vineet
135
24
0
08 Nov 2023
Multitask Multimodal Prompted Training for Interactive Embodied Task
  Completion
Multitask Multimodal Prompted Training for Interactive Embodied Task CompletionConference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Georgios Pantazopoulos
Malvina Nikandrou
Amit Parekh
Bhathiya Hemanthage
Arash Eshghi
Ioannis Konstas
Verena Rieser
Oliver Lemon
Alessandro Suglia
LM&Ro
156
10
0
07 Nov 2023
RoboVQA: Multimodal Long-Horizon Reasoning for Robotics
RoboVQA: Multimodal Long-Horizon Reasoning for RoboticsIEEE International Conference on Robotics and Automation (ICRA), 2023
P. Sermanet
Tianli Ding
Jeffrey Zhao
Fei Xia
Debidatta Dwibedi
...
Pannag R Sanketi
Karol Hausman
Izhak Shafran
Brian Ichter
Yuan Cao
LM&Ro
207
94
0
01 Nov 2023
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation,
  Generation and Editing
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
Wei-Ge Chen
Irina Spiridonova
Jianwei Yang
Jianfeng Gao
Chun-yue Li
MLLMVLM
154
45
0
01 Nov 2023
Apollo: Zero-shot MultiModal Reasoning with Multiple Experts
Apollo: Zero-shot MultiModal Reasoning with Multiple Experts
Daniela Ben-David
Tzuf Paz-Argaman
Reut Tsarfaty
MoE
131
0
0
25 Oct 2023
Reinforced UI Instruction Grounding: Towards a Generic UI Task
  Automation API
Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API
Zhizheng Zhang
Wenxuan Xie
Xiaoyi Zhang
Yan Lu
180
15
0
07 Oct 2023
InstructDiffusion: A Generalist Modeling Interface for Vision Tasks
InstructDiffusion: A Generalist Modeling Interface for Vision TasksComputer Vision and Pattern Recognition (CVPR), 2023
Zigang Geng
Binxin Yang
Tiankai Hang
Chen Li
Shuyang Gu
...
Jianmin Bao
Zheng Zhang
Han Hu
DongDong Chen
Baining Guo
DiffMVLM
225
154
0
07 Sep 2023
Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language
  Models
Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models
Navid Rajabi
Jana Kosecka
VLM
235
17
0
18 Aug 2023
DesCo: Learning Object Recognition with Rich Language Descriptions
DesCo: Learning Object Recognition with Rich Language DescriptionsNeural Information Processing Systems (NeurIPS), 2023
Liunian Harold Li
Zi-Yi Dou
Nanyun Peng
Kai-Wei Chang
ObjDVLM
149
26
0
24 Jun 2023
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and
  Language Models
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language ModelsAnnual Meeting of the Association for Computational Linguistics (ACL), 2023
Muhammad Maaz
H. Rasheed
Salman Khan
Fahad Shahbaz Khan
MLLM
309
921
0
08 Jun 2023
BiomedGPT: A Unified and Generalist Biomedical Generative Pre-trained
  Transformer for Vision, Language, and Multimodal Tasks
BiomedGPT: A Unified and Generalist Biomedical Generative Pre-trained Transformer for Vision, Language, and Multimodal TasksNature Network Boston (NNB), 2023
Kai Zhang
Jun Yu
Eashan Adhikarla
Rong Zhou
Zhilin Yan
...
Hang Zhang
Yong Chen
Shijie Zhao
Hongfang Liu
Lichao Sun
LM&MAMedIm
196
11
0
26 May 2023
Type-to-Track: Retrieve Any Object via Prompt-based Tracking
Type-to-Track: Retrieve Any Object via Prompt-based TrackingNeural Information Processing Systems (NeurIPS), 2023
Pha Nguyen
Kha Gia Quach
Kris Kitani
Khoa Luu
235
31
0
22 May 2023
Token Boosting for Robust Self-Supervised Visual Transformer
  Pre-training
Token Boosting for Robust Self-Supervised Visual Transformer Pre-trainingComputer Vision and Pattern Recognition (CVPR), 2023
Tianjiao Li
Lin Geng Foo
Ping Hu
Xindi Shang
Hossein Rahmani
Zehuan Yuan
Jing Liu
227
7
0
09 Apr 2023
Exposing and Addressing Cross-Task Inconsistency in Unified
  Vision-Language Models
Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models
A. Maharana
Amita Kamath
Christopher Clark
Joey Tianyi Zhou
Aniruddha Kembhavi
191
3
0
28 Mar 2023
ViM: Vision Middleware for Unified Downstream Transferring
ViM: Vision Middleware for Unified Downstream TransferringIEEE International Conference on Computer Vision (ICCV), 2023
Yutong Feng
Biao Gong
Jianwen Jiang
Yiliang Lv
Yujun Shen
Deli Zhao
Jingren Zhou
181
2
0
13 Mar 2023
MINOTAUR: Multi-task Video Grounding From Multimodal Queries
MINOTAUR: Multi-task Video Grounding From Multimodal Queries
Raghav Goyal
E. Mavroudi
Xitong Yang
Sainbayar Sukhbaatar
Leonid Sigal
Matt Feiszli
Lorenzo Torresani
Du Tran
183
8
0
16 Feb 2023
Generalized Decoding for Pixel, Image, and Language
Generalized Decoding for Pixel, Image, and LanguageComputer Vision and Pattern Recognition (CVPR), 2022
Xueyan Zou
Zi-Yi Dou
Jianwei Yang
Zhe Gan
Linjie Li
...
Lu Yuan
Nanyun Peng
Lijuan Wang
Yong Jae Lee
Jianfeng Gao
VLMMLLMObjD
252
322
0
21 Dec 2022
Universal Object Detection with Large Vision Model
Universal Object Detection with Large Vision ModelInternational Journal of Computer Vision (IJCV), 2022
Feng-Huei Lin
Wenze Hu
Yaowei Wang
Yonghong Tian
Guangming Lu
Fanglin Chen
Yong-mei Xu
Xiaoyu Wang
VLMObjD
247
8
0
19 Dec 2022
SatlasPretrain: A Large-Scale Dataset for Remote Sensing Image
  Understanding
SatlasPretrain: A Large-Scale Dataset for Remote Sensing Image UnderstandingIEEE International Conference on Computer Vision (ICCV), 2022
Favyen Bastani
Piper Wolters
Ritwik Gupta
Joe Ferdinando
Aniruddha Kembhavi
273
163
0
28 Nov 2022
A Survey of Computer Vision Technologies In Urban and
  Controlled-environment Agriculture
A Survey of Computer Vision Technologies In Urban and Controlled-environment AgricultureACM Computing Surveys (ACM CSUR), 2022
Jiayun Luo
Boyang Albert Li
Cyril Leung
278
21
0
20 Oct 2022
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal TasksInternational Conference on Learning Representations (ICLR), 2022
Jiasen Lu
Christopher Clark
Rowan Zellers
Roozbeh Mottaghi
Aniruddha Kembhavi
ObjDVLMMLLM
333
467
0
17 Jun 2022
PEVL: Position-enhanced Pre-training and Prompt Tuning for
  Vision-language Models
PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language ModelsConference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Yuan Yao
Qi-An Chen
Ao Zhang
Wei Ji
Zhiyuan Liu
Tat-Seng Chua
Maosong Sun
VLMMLLM
169
43
0
23 May 2022
GRIT: General Robust Image Task Benchmark
GRIT: General Robust Image Task Benchmark
Tanmay Gupta
Ryan Marten
Aniruddha Kembhavi
Derek Hoiem
VLMOODObjD
137
35
0
28 Apr 2022
FindIt: Generalized Localization with Natural Language Queries
FindIt: Generalized Localization with Natural Language QueriesEuropean Conference on Computer Vision (ECCV), 2022
Weicheng Kuo
Fred Bertsch
Wei Li
A. Piergiovanni
M. Saffar
A. Angelova
ObjD
174
18
0
31 Mar 2022
UnifiedQA-v2: Stronger Generalization via Broader Cross-Format Training
UnifiedQA-v2: Stronger Generalization via Broader Cross-Format Training
Daniel Khashabi
Yeganeh Kordi
Hannaneh Hajishirzi
217
73
0
23 Feb 2022
ASC me to Do Anything: Multi-task Training for Embodied AI
ASC me to Do Anything: Multi-task Training for Embodied AI
Jiasen Lu
Jordi Salvador
Roozbeh Mottaghi
Aniruddha Kembhavi
138
3
0
14 Feb 2022
Webly Supervised Concept Expansion for General Purpose Vision Models
Webly Supervised Concept Expansion for General Purpose Vision ModelsEuropean Conference on Computer Vision (ECCV), 2022
Amita Kamath
Christopher Clark
Tanmay Gupta
Eric Kolve
Derek Hoiem
Aniruddha Kembhavi
VLM
240
65
0
04 Feb 2022
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language
  Modeling
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling
Zhengyuan Yang
Zhe Gan
Jianfeng Wang
Xiaowei Hu
Faisal Ahmed
Zicheng Liu
Yumao Lu
Lijuan Wang
293
131
0
23 Nov 2021
Hey AI, Can You Solve Complex Tasks by Talking to Agents?
Hey AI, Can You Solve Complex Tasks by Talking to Agents?
Tushar Khot
Kyle Richardson
Daniel Khashabi
Ashish Sabharwal
RALMLRM
160
15
0
16 Oct 2021
Cross-Task Generalization via Natural Language Crowdsourcing
  Instructions
Cross-Task Generalization via Natural Language Crowdsourcing InstructionsAnnual Meeting of the Association for Computational Linguistics (ACL), 2021
Swaroop Mishra
Daniel Khashabi
Chitta Baral
Hannaneh Hajishirzi
LRM
402
837
0
18 Apr 2021
1