ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1602.07332
  4. Cited By
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense
  Image Annotations

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

23 February 2016
Ranjay Krishna
Yuke Zhu
Oliver Groth
Justin Johnson
Kenji Hata
Joshua Kravitz
Stephanie Chen
Yannis Kalantidis
Li-Jia Li
David A. Shamma
Michael S. Bernstein
Fei-Fei Li
ArXivPDFHTML

Papers citing "Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations"

50 / 864 papers shown
Title
Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback
Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback
Wenyi Xiao
Ziwei Huang
Leilei Gan
Wanggui He
Haoyuan Li
Zhelun Yu
Hao Jiang
Fei Wu
Linchao Zhu
MLLM
39
22
0
22 Apr 2024
ECOR: Explainable CLIP for Object Recognition
ECOR: Explainable CLIP for Object Recognition
Ali Rasekh
Sepehr Kazemi Ranjbar
Milad Heidari
Wolfgang Nejdl
VLM
46
4
0
19 Apr 2024
ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images
ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images
Quan Van Nguyen
Dan Quang Tran
Huy Quang Pham
Thang Kien-Bao Nguyen
Nghia Hieu Nguyen
Kiet Van Nguyen
N. Nguyen
CoGe
39
3
0
16 Apr 2024
Understanding Multimodal Deep Neural Networks: A Concept Selection View
Understanding Multimodal Deep Neural Networks: A Concept Selection View
Chenming Shang
Hengyuan Zhang
Hao Wen
Yujiu Yang
43
5
0
13 Apr 2024
Generalized Contrastive Learning for Multi-Modal Retrieval and Ranking
Generalized Contrastive Learning for Multi-Modal Retrieval and Ranking
Tianyu Zhu
M. Jung
Jesse Clark
85
1
0
12 Apr 2024
VideoDistill: Language-aware Vision Distillation for Video Question
  Answering
VideoDistill: Language-aware Vision Distillation for Video Question Answering
Bo Zou
Chao Yang
Yu Qiao
Chengbin Quan
Youjian Zhao
VGen
44
1
0
01 Apr 2024
Deep Instruction Tuning for Segment Anything Model
Deep Instruction Tuning for Segment Anything Model
Xiaorui Huang
Gen Luo
Chaoyang Zhu
Bo Tong
Yiyi Zhou
Xiaoshuai Sun
Rongrong Ji
VLM
49
1
0
31 Mar 2024
Constructing Multilingual Visual-Text Datasets Revealing Visual
  Multilingual Ability of Vision Language Models
Constructing Multilingual Visual-Text Datasets Revealing Visual Multilingual Ability of Vision Language Models
Jesse Atuhurra
Iqra Ali
Tatsuya Hiraoka
Hidetaka Kamigaito
Tomoya Iwakura
Taro Watanabe
40
1
0
29 Mar 2024
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
Weifeng Lin
Xinyu Wei
Ruichuan An
Peng Gao
Bocheng Zou
Yulin Luo
Siyuan Huang
Shanghang Zhang
Hongsheng Li
VLM
63
33
0
29 Mar 2024
Hallucination Detection in Foundation Models for Decision-Making: A Flexible Definition and Review of the State of the Art
Hallucination Detection in Foundation Models for Decision-Making: A Flexible Definition and Review of the State of the Art
Neeloy Chakraborty
Melkior Ornik
Katherine Driggs-Campbell
LRM
57
9
0
25 Mar 2024
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
Ruyi Xu
Yuan Yao
Zonghao Guo
Junbo Cui
Zanlin Ni
Chunjiang Ge
Tat-Seng Chua
Zhiyuan Liu
Maosong Sun
Gao Huang
VLM
MLLM
37
102
0
18 Mar 2024
ST-LDM: A Universal Framework for Text-Grounded Object Generation in
  Real Images
ST-LDM: A Universal Framework for Text-Grounded Object Generation in Real Images
Xiangtian Xue
Jiasong Wu
Youyong Kong
L. Senhadji
Huazhong Shu
DiffM
43
1
0
15 Mar 2024
DAM: Dynamic Adapter Merging for Continual Video QA Learning
DAM: Dynamic Adapter Merging for Continual Video QA Learning
Feng Cheng
Ziyang Wang
Yi-Lin Sung
Yan-Bo Lin
Mohit Bansal
Gedas Bertasius
CLL
MoMe
31
10
0
13 Mar 2024
A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing
  Objects in 3D Scenes
A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing Objects in 3D Scenes
Ting Yu
Xiaojun Lin
Shuhui Wang
Weiguo Sheng
Qingming Huang
Jun-chen Yu
3DV
48
10
0
12 Mar 2024
Yi: Open Foundation Models by 01.AI
Yi: Open Foundation Models by 01.AI
01. AI
Alex Young
01.AI Alex Young
Bei Chen
Chao Li
...
Yue Wang
Yuxuan Cai
Zhenyu Gu
Zhiyuan Liu
Zonghong Dai
OSLM
LRM
121
497
0
07 Mar 2024
MeaCap: Memory-Augmented Zero-shot Image Captioning
MeaCap: Memory-Augmented Zero-shot Image Captioning
Zequn Zeng
Yan Xie
Hao Zhang
Chiyu Chen
Zhengjue Wang
Boli Chen
VLM
33
14
0
06 Mar 2024
How to Understand "Support"? An Implicit-enhanced Causal Inference
  Approach for Weakly-supervised Phrase Grounding
How to Understand "Support"? An Implicit-enhanced Causal Inference Approach for Weakly-supervised Phrase Grounding
Jiamin Luo
Jianing Zhao
Jingjing Wang
Guodong Zhou
43
0
0
29 Feb 2024
A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models
A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models
Xiujie Song
Mengyue Wu
Ke Zhu
Chunhao Zhang
Yanyi Chen
LRM
ELM
36
3
0
28 Feb 2024
Acquiring Linguistic Knowledge from Multimodal Input
Acquiring Linguistic Knowledge from Multimodal Input
Theodor Amariucai
Alexander Scott Warstadt
CLL
29
2
0
27 Feb 2024
CODIS: Benchmarking Context-Dependent Visual Comprehension for
  Multimodal Large Language Models
CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models
Fuwen Luo
Chi Chen
Zihao Wan
Zhaolu Kang
Qidong Yan
...
Xiaoyue Mi
Peng Li
Ning Ma
Maosong Sun
Yang Janet Liu
35
5
0
21 Feb 2024
MORE-3S:Multimodal-based Offline Reinforcement Learning with Shared
  Semantic Spaces
MORE-3S:Multimodal-based Offline Reinforcement Learning with Shared Semantic Spaces
Tianyu Zheng
Ge Zhang
Xingwei Qu
Ming Kuang
Stephen W. Huang
Zhaofeng He
OffRL
45
1
0
20 Feb 2024
CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion
CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion
Shoubin Yu
Jaehong Yoon
Mohit Bansal
77
4
0
08 Feb 2024
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
Chris Liu
Renrui Zhang
Longtian Qiu
Siyuan Huang
Weifeng Lin
...
Hao Shao
Pan Lu
Hongsheng Li
Yu Qiao
Peng Gao
MLLM
130
107
0
08 Feb 2024
Image Synthesis with Graph Conditioning: CLIP-Guided Diffusion Models
  for Scene Graphs
Image Synthesis with Graph Conditioning: CLIP-Guided Diffusion Models for Scene Graphs
Rameshwar Mishra
A. V. Subramanyam
DiffM
27
2
0
25 Jan 2024
CBVS: A Large-Scale Chinese Image-Text Benchmark for Real-World Short
  Video Search Scenarios
CBVS: A Large-Scale Chinese Image-Text Benchmark for Real-World Short Video Search Scenarios
Xiangshuo Qiao
Xianxin Li
Xiaozhe Qu
Jie M. Zhang
Yang Liu
Yu Luo
Cihang Jin
Jin Ma
VLM
27
0
0
19 Jan 2024
COCO is "ALL'' You Need for Visual Instruction Fine-tuning
COCO is "ALL'' You Need for Visual Instruction Fine-tuning
Xiaotian Han
Yiqi Wang
Bohan Zhai
Quanzeng You
Hongxia Yang
VLM
MLLM
30
2
0
17 Jan 2024
GroundingGPT:Language Enhanced Multi-modal Grounding Model
GroundingGPT:Language Enhanced Multi-modal Grounding Model
Zhaowei Li
Qi Xu
Dong Zhang
Hang Song
Yiqing Cai
...
Junting Pan
Zefeng Li
Van Tu Vu
Zhida Huang
Tao Wang
28
37
0
11 Jan 2024
Low-Resource Vision Challenges for Foundation Models
Low-Resource Vision Challenges for Foundation Models
Yunhua Zhang
Hazel Doughty
Cees G. M. Snoek
VLM
27
5
0
09 Jan 2024
BloomVQA: Assessing Hierarchical Multi-modal Comprehension
BloomVQA: Assessing Hierarchical Multi-modal Comprehension
Yunye Gong
Robik Shrestha
Jared Claypoole
Michael Cogswell
Arijit Ray
Christopher Kanan
Ajay Divakaran
28
0
0
20 Dec 2023
Jack of All Tasks, Master of Many: Designing General-purpose
  Coarse-to-Fine Vision-Language Model
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
Shraman Pramanick
Guangxing Han
Rui Hou
Sayan Nag
Ser-Nam Lim
Nicolas Ballas
Qifan Wang
Rama Chellappa
Amjad Almahairi
VLM
MLLM
45
29
0
19 Dec 2023
M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts
M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts
Mingsheng Li
Xin Chen
C. Zhang
Sijin Chen
Hongyuan Zhu
Fukun Yin
Gang Yu
Tao Chen
24
24
0
17 Dec 2023
TiMix: Text-aware Image Mixing for Effective Vision-Language
  Pre-training
TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training
Chaoya Jiang
Wei Ye
Haiyang Xu
Qinghao Ye
Mingshi Yan
Ji Zhang
Shikun Zhang
CLIP
VLM
14
4
0
14 Dec 2023
Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
Zeyi Sun
Ye Fang
Tong Wu
Pan Zhang
Yuhang Zang
Shu Kong
Yuanjun Xiong
Dahua Lin
Jiaqi Wang
VLM
CLIP
39
83
0
06 Dec 2023
Action Scene Graphs for Long-Form Understanding of Egocentric Videos
Action Scene Graphs for Long-Form Understanding of Egocentric Videos
Ivan Rodin
Antonino Furnari
Kyle Min
Subarna Tripathi
G. Farinella
EgoV
27
12
0
06 Dec 2023
Building Category Graphs Representation with Spatial and Temporal
  Attention for Visual Navigation
Building Category Graphs Representation with Spatial and Temporal Attention for Visual Navigation
Xiaobo Hu
Youfang Lin
Hehe Fan
Shuo Wang
Zhihao Wu
Kai Lv
31
3
0
06 Dec 2023
HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation
  in Video Understanding
HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding
Trong-Thuan Nguyen
Pha Nguyen
Khoa Luu
22
12
0
05 Dec 2023
Dolphins: Multimodal Language Model for Driving
Dolphins: Multimodal Language Model for Driving
Yingzi Ma
Yulong Cao
Jiachen Sun
Marco Pavone
Chaowei Xiao
MLLM
30
50
0
01 Dec 2023
No Representation Rules Them All in Category Discovery
No Representation Rules Them All in Category Discovery
S. Vaze
Andrea Vedaldi
Andrew Zisserman
OOD
34
31
0
28 Nov 2023
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Yanwei Li
Chengyao Wang
Jiaya Jia
VLM
MLLM
38
259
0
28 Nov 2023
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
Kunchang Li
Yali Wang
Yinan He
Yizhuo Li
Yi Wang
...
Jilan Xu
Guo Chen
Ping Luo
Limin Wang
Yu Qiao
VLM
MLLM
56
399
0
28 Nov 2023
Power Hungry Processing: Watts Driving the Cost of AI Deployment?
Power Hungry Processing: Watts Driving the Cost of AI Deployment?
Sasha Luccioni
Yacine Jernite
Emma Strubell
42
162
0
28 Nov 2023
Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware
  Direct Preference Optimization
Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization
Zhiyuan Zhao
Bin Wang
Linke Ouyang
Xiao-wen Dong
Jiaqi Wang
Conghui He
MLLM
VLM
32
106
0
28 Nov 2023
The curse of language biases in remote sensing VQA: the role of spatial
  attributes, language diversity, and the need for clear evaluation
The curse of language biases in remote sensing VQA: the role of spatial attributes, language diversity, and the need for clear evaluation
Christel Chappuis
Eliot Walt
Vincent Mendez
Sylvain Lobry
B. L. Saux
D. Tuia
23
3
0
28 Nov 2023
Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating
  Video-based Large Language Models
Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models
Munan Ning
Bin Zhu
Yujia Xie
Bin Lin
Jiaxi Cui
Lu Yuan
Dongdong Chen
Li-ming Yuan
ELM
MLLM
27
58
0
27 Nov 2023
Enhancing Scene Graph Generation with Hierarchical Relationships and
  Commonsense Knowledge
Enhancing Scene Graph Generation with Hierarchical Relationships and Commonsense Knowledge
Bowen Jiang
Zhijun Zhuang
Shreyas S. Shivakumar
Camillo J. Taylor
26
6
0
21 Nov 2023
Learning Scene Context Without Images
Learning Scene Context Without Images
Amirreza Rouhi
David Han
VLM
25
0
0
18 Nov 2023
To See is to Believe: Prompting GPT-4V for Better Visual Instruction
  Tuning
To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning
Junke Wang
Lingchen Meng
Zejia Weng
Bo He
Zuxuan Wu
Yu-Gang Jiang
MLLM
VLM
27
94
0
13 Nov 2023
Florence-2: Advancing a Unified Representation for a Variety of Vision
  Tasks
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
Bin Xiao
Haiping Wu
Weijian Xu
Xiyang Dai
Houdong Hu
Yumao Lu
Michael Zeng
Ce Liu
Lu Yuan
VLM
36
143
0
10 Nov 2023
u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model
u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model
Jinjin Xu
Liwu Xu
Yuzhe Yang
Xiang Li
Fanyi Wang
Yanchun Xie
Yi-Jie Huang
Yaqian Li
MoE
MLLM
VLM
27
12
0
09 Nov 2023
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning
Yifan Du
Hangyu Guo
Kun Zhou
Wayne Xin Zhao
Jinpeng Wang
Chuyuan Wang
Mingchen Cai
Ruihua Song
Ji-Rong Wen
VLM
MLLM
LRM
64
22
0
02 Nov 2023
Previous
123456...161718
Next