Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

19 May 2015

Bryan A. Plummer

Liwei Wang

Christopher M. Cervantes

Papers citing "Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models"

50 / 1,322 papers shown

ChatSearch: a Dataset and a Generative Retrieval Model for General Conversational Image Retrieval Pattern Recognition (Pattern Recogn.), 2024 Zijia Zhao Longteng Guo Tongtian Yue Erdong Hu Shuai Shao Zehuan Yuan Hua Huang Qingbin Liu 145 4 0 24 Oct 2024
Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance Zhangwei Gao Zhe Chen Erfei Cui Yiming Ren Weiyun Wang ... Lewei Lu Tong Lu Yu Qiao Jifeng Dai Wenhai Wang VLM 367 84 0 21 Oct 2024
Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models Yufei Zhan Hongyin Zhao Yousong Zhu Fan Yang Ming Tang Jinqiao Wang MLLM 267 3 0 21 Oct 2024
Test-time Adaptation for Cross-modal Retrieval with Query Shift Haobin Li Peng Hu Qianjun Zhang Xi Peng Xiting Liu Mouxing Yang TTA 260 8 0 21 Oct 2024
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples Neural Information Processing Systems (NeurIPS), 2024 Baiqi Li Zhiqiu Lin Wenxuan Peng Jean de Dieu Nyandwi Daniel Jiang Zixian Ma Simran Khanuja Ranjay Krishna Graham Neubig Deva Ramanan AAML CoGe VLM 596 59 0 18 Oct 2024
Can MLLMs Understand the Deep Implication Behind Chinese Images? Annual Meeting of the Association for Computational Linguistics (ACL), 2024 Chenhao Zhang Xi Feng Yuelin Bai Xinrun Du Jinchang Hou ... Min Yang Wenhao Huang Chenghua Lin Ge Zhang Shiwen Ni ELM VLM 136 9 0 17 Oct 2024
LocateBench: Evaluating the Locating Ability of Vision Language Models Ting-Rui Chiang Joshua Robinson Xinyan Velocity Yu Dani Yogatama VLM ELM 212 0 0 17 Oct 2024
CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training ACM Multimedia (ACM MM), 2022 Zhiyuan Ma Jianjun Li Guohui Li Kaiyan Huang VLM 345 9 0 16 Oct 2024
CtrlSynth: Controllable Image Text Synthesis for Data-Efficient Multimodal Learning Qingqing Cao Mahyar Najibi Sachin Mehta CLIP DiffM 221 1 0 15 Oct 2024
Efficient and Effective Universal Adversarial Attack against Vision-Language Pre-training Models Fan Yang Yihao Huang Kaidi Wang Ling Shi G. Pu Yang Liu Jian Shu AAML VLM 213 2 0 15 Oct 2024
TULIP: Token-length Upgraded CLIP International Conference on Learning Representations (ICLR), 2024 Ivona Najdenkoska Mohammad Mahdi Derakhshani Yuki M. Asano Nanne van Noord Marcel Worring Cees G. M. Snoek VLM 377 13 0 13 Oct 2024
Towards Efficient Visual-Language Alignment of the Q-Former for Visual Reasoning Tasks Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024 Sungkyung Kim Adam Lee Junyoung Park Andrew Chung Jusang Oh Jay-Yoon Lee 96 9 0 12 Oct 2024
OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling Neural Information Processing Systems (NeurIPS), 2024 Linhui Xiao Xiaoshan Yang Fang Peng Yaowei Wang Changsheng Xu ObjD 406 20 0 10 Oct 2024
Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate Qidong Huang Xiaoyi Dong Pan Zhang Yuhang Zang Yuhang Cao Jiaqi Wang Dahua Lin Weiming Zhang Nenghai Yu 157 20 0 09 Oct 2024
ING-VP: MLLMs cannot Play Easy Vision-based Games Yet Haoran Zhang Hangyu Guo Shuyue Guo Meng Cao Wenhao Huang Jiaheng Liu Ge Zhang VLM MLLM LRM 212 4 0 09 Oct 2024
Compositional Entailment Learning for Hyperbolic Vision-Language Models International Conference on Learning Representations (ICLR), 2024 Avik Pal Max van Spengler Guido Maria D'Amely di Melendugno Alessandro Flaborea Fabio Galasso Pascal Mettes CoGe 329 31 0 09 Oct 2024
VISTA: A Visual and Textual Attention Dataset for Interpreting Multimodal Models Harshit Tolga Tasdizen CoGe VLM 146 1 0 06 Oct 2024
CoVLM: Leveraging Consensus from Vision-Language Models for Semi-supervised Multi-modal Fake News Detection Asian Conference on Computer Vision (ACCV), 2024 Devank Jayateja Kalla Soma Biswas 146 5 0 06 Oct 2024
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark International Conference on Learning Representations (ICLR), 2024 Wenhao Chai Enxin Song Y. Du Chenlin Meng Vashisht Madhavan Omer Bar-Tal Jeng-Neng Hwang Saining Xie Christopher D. Manning 3DV 605 89 0 04 Oct 2024
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning Haotian Zhang Mingfei Gao Zhe Gan Philipp Dufter Nina Wenzel ... Haoxuan You Zirui Wang Afshin Dehghan Peter Grasch Yinfei Yang VLM MLLM 287 64 1 30 Sep 2024
HELPD: Mitigating Hallucination of LVLMs by Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024 Fan Yuan Chi Qin Xiaogang Xu Piji Li VLM MLLM 145 9 0 30 Sep 2024
Harnessing Frozen Unimodal Encoders for Flexible Multimodal Alignment Computer Vision and Pattern Recognition (CVPR), 2024 Mayug Maniparambil Raiymbek Akshulakov Y. A. D. Djilali Sanath Narayan Ankit Singh Noel E. O'Connor VLM MLLM 132 2 0 28 Sep 2024
SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion Neural Information Processing Systems (NeurIPS), 2024 Ming Dai Lingfeng Yang Yihao Xu Zhenhua Feng Wankou Yang ObjD 414 37 0 26 Sep 2024
MIO: A Foundation Model on Multimodal Tokens Zekun Wang King Zhu Chunpu Xu Wangchunshu Zhou Jiaheng Liu ... Yuanxing Zhang Ge Zhang Ke Xu Jie Fu Wenhao Huang MLLM AuLLM 414 20 0 26 Sep 2024
A-VL: Adaptive Attention for Large Vision-Language Models AAAI Conference on Artificial Intelligence (AAAI), 2024 Junyang Zhang Mu Yuan Ruiguang Zhong Puhan Luo Huiyou Zhan Ningkang Zhang Chengchen Hu Xiangyang Li VLM 329 4 0 23 Sep 2024
Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model Li Zhou Xu Yuan Zenghui Sun Zikun Zhou Jingsong Lan VLM MLLM 836 7 0 20 Sep 2024
HiFi-CS: Towards Open Vocabulary Visual Grounding For Robotic Grasping Using Vision-Language Models V. Bhat Prashanth Krishnamurthy Ramesh Karri Farshad Khorrami 429 9 0 16 Sep 2024
NEVLP: Noise-Robust Framework for Efficient Vision-Language Pre-training Yiyi Tao Zhuoyue Wang Hang Zhang Lun Wang VLM 295 26 0 15 Sep 2024
Automatic Scene Generation: State-of-the-Art Techniques, Models, Datasets, Challenges, and Future Prospects IEEE Access (IEEE Access), 2024 Awal Ahmed Fime Saifuddin Mahmud Arpita Das Md. Sunzidul Islam Hong-Hoon Kim VGen 3DV 223 2 0 14 Sep 2024
Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024 Haoxuan Wang Qu He Jinlong Peng Hao Yang Mingmin Chi Yabiao Wang Mamba 229 7 0 13 Sep 2024
ComAlign: Compositional Alignment in Vision-Language Models Ali Abdollah Amirmohammad Izadi Armin Saghafian Reza Vahidimajd Mohammad Mozafari Amirreza Mirzaei Mohammadmahdi Samiei M. Baghshah CoGe VLM 186 1 0 12 Sep 2024
An Attribute-Enriched Dataset and Auto-Annotated Pipeline for Open Detection IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024 Pengfei Qi Yifei Zhang Wenqiang Li Youwen Hu Kunlong Bai ObjD 183 0 0 10 Sep 2024
Make Graph-based Referring Expression Comprehension Great Again through Expression-guided Dynamic Gating and Regression IEEE transactions on multimedia (IEEE TMM), 2024 Jingcheng Ke Dele Wang Jun-Cheng Chen I-Hong Jhuo Chia-Wen Lin Yen-Yu Lin 218 1 0 05 Sep 2024
No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning Manu Gaur Darshan Singh Makarand Tapaswi 883 2 0 04 Sep 2024
Pre-Training Multimodal Hallucination Detectors with Corrupted Grounding Data Spencer Whitehead Jacob Phillips Sean Hendryx 135 0 0 30 Aug 2024
See or Guess: Counterfactually Regularized Image Captioning ACM Multimedia (MM), 2024 Qian Cao Xu Chen Ruihua Song Xiting Wang Xinting Huang Yuchen Ren CML 174 3 0 29 Aug 2024
ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding ACM Multimedia (MM), 2024 Minghang Zheng Jiahua Zhang Qingchao Chen Yuxin Peng Yang Liu ObjD 262 5 0 29 Aug 2024
Rethinking Sparse Lexical Representations for Image Retrieval in the Age of Rising Multi-Modal Large Language Models K. Nakata Daisuke Miyashita Youyang Ng Yasuto Hoshi J. Deguchi 125 0 0 29 Aug 2024
Pixels to Prose: Understanding the art of Image Captioning Hrishikesh Singh Aarti Sharma Millie Pant 3DV VLM 190 2 0 28 Aug 2024
Evaluating Attribute Comprehension in Large Vision-Language Models Chinese Conference on Pattern Recognition and Computer Vision (CPRCV), 2024 Haiwen Zhang Zixi Yang Yuanzhi Liu Xinran Wang Zheqi He Kongming Liang Zhanyu Ma ELM 170 0 0 25 Aug 2024
Tangram: A Challenging Benchmark for Geometric Element Recognizing Jiamin Tang Chao Zhang Xudong Zhu Mengchi Liu LRM 51 1 0 25 Aug 2024
IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities AAAI Conference on Artificial Intelligence (AAAI), 2024 Bin Wang Chunyu Xie Dawei Leng Yuhui Yin MLLM 425 6 0 23 Aug 2024
Towards Deconfounded Image-Text Matching with Causal Inference ACM Multimedia (ACM MM), 2023 Wenhui Li Xinqi Su Dan Song Lanjun Wang Kun Zhang An-An Liu BDL CML 196 15 0 22 Aug 2024
RT-OVAD: Real-Time Open-Vocabulary Aerial Object Detection via Image-Text Collaboration Guoting Wei Xia Yuan Yu Liu Zhenhao Shang Kelu Yao Peng Wang Kelu Yao Chunxia Zhao Haokui Zhang Rong Xiao ObjD VLM 563 0 0 22 Aug 2024
A Survey on Integrated Sensing, Communication, and Computation IEEE Communications Surveys and Tutorials (COMST), 2024 Dingzhu Wen Yong Zhou Xiaoyang Li Yuanming Shi Kaibin Huang Khaled B. Letaief 216 109 0 15 Aug 2024
Can Large Language Models Understand Symbolic Graphics Programs? International Conference on Learning Representations (ICLR), 2024 Zeju Qiu Weiyang Liu Haiwen Feng Zhen Liu Tim Z. Xiao Katherine M. Collins J. Tenenbaum Adrian Weller Michael J. Black Bernhard Schölkopf 563 27 0 15 Aug 2024
How Well Can Vision Language Models See Image Details? Chenhui Gou Abdulwahab Felemban Faizan Farooq Khan Deyao Zhu Jianfei Cai Hamid Rezatofighi Mohamed Elhoseiny VLM MLLM 212 12 0 07 Aug 2024
Fairness and Bias Mitigation in Computer Vision: A Survey Sepehr Dehdashtian Ruozhen He Yi Li Guha Balakrishnan Nuno Vasconcelos Vicente Ordonez Vishnu Boddeti 337 11 0 05 Aug 2024
An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding European Conference on Computer Vision (ECCV), 2024 Wei Chen Mahdieh Hatamian Yu Wu 202 16 0 02 Aug 2024
MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts Xi Lin Akshat Shrivastava Liang Luo Srinivasan Iyer Mike Lewis Gargi Gosh Luke Zettlemoyer Armen Aghajanyan MoE 231 50 0 31 Jul 2024