ResearchTrend.AI
  • Communities
  • Connect sessions
  • AI calendar
  • Organizations
  • Join Slack
  • Contact Sales
Papers
Communities
Social Events
Terms and Conditions
Pricing
Contact Sales
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1511.02283
  4. Cited By
Generation and Comprehension of Unambiguous Object Descriptions
v1v2v3 (latest)

Generation and Comprehension of Unambiguous Object Descriptions

7 November 2015
Junhua Mao
Jonathan Huang
Alexander Toshev
Oana-Maria Camburu
Alan Yuille
Kevin Patrick Murphy
    ObjD
ArXiv (abs)PDFHTMLGithub (164★)

Papers citing "Generation and Comprehension of Unambiguous Object Descriptions"

50 / 914 papers shown
Title
Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem?
Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem?Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Zichen Wen
Yifeng Gao
Weijia Li
Conghui He
Linfeng Zhang
LRM
296
31
0
17 Feb 2025
D-Attn: Decomposed Attention for Large Vision-and-Language Models
D-Attn: Decomposed Attention for Large Vision-and-Language Models
Chia-Wen Kuo
Sijie Zhu
Fan Chen
Xiaohui Shen
Longyin Wen
VLM
465
1
0
04 Feb 2025
LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models
LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language ModelsComputer Vision and Pattern Recognition (CVPR), 2025
Shenghao Fu
Q. Yang
Qijie Mo
Junkai Yan
Xihan Wei
Jingke Meng
Xiaohua Xie
Wei-Shi Zheng
MLLMObjDVLM
386
29
0
31 Jan 2025
MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation
MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation
Fu Rong
Meng Lan
Qian Zhang
Guang Dai
VOSVGen
514
3
0
23 Jan 2025
VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model
VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model
Xianwei Zhuang
Yuxin Xie
Yufan Deng
Liming Liang
Jinghan Ru
Yuguo Yin
Yuexian Zou
MLLMVLMLRM
269
27
0
21 Jan 2025
Advancing General Multimodal Capability of Vision-language Models with Pyramid-descent Visual Position Encoding
Advancing General Multimodal Capability of Vision-language Models with Pyramid-descent Visual Position EncodingAnnual Meeting of the Association for Computational Linguistics (ACL), 2025
Ziyang Chen
Mingxiao Li
Zhongfu Chen
Nan Du
Xiaolong Li
Yuexian Zou
283
3
0
19 Jan 2025
The Devil is in Temporal Token: High Quality Video Reasoning Segmentation
The Devil is in Temporal Token: High Quality Video Reasoning SegmentationComputer Vision and Pattern Recognition (CVPR), 2025
Sitong Gong
Yunzhi Zhuge
Lu Zhang
Zhiyong Yang
Pingping Zhang
Huchuan Lu
212
17
0
15 Jan 2025
Visual Large Language Models for Generalized and Specialized Applications
Jiayi Zhang
Zhixin Lai
Wentao Bao
Zhen Tan
Anh Dao
Kewei Sui
Jiayi Shen
Dong Liu
Huan Liu
Yu Kong
VLM
414
32
0
06 Jan 2025
Hierarchical Alignment-enhanced Adaptive Grounding Network for Generalized Referring Expression ComprehensionAAAI Conference on Artificial Intelligence (AAAI), 2025
Yaxian Wang
Henghui Ding
Shuting He
Xudong Jiang
Bifan Wei
Jun Liu
ObjD
209
7
0
03 Jan 2025
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language TasksNeural Information Processing Systems (NeurIPS), 2024
Jiannan Wu
Muyan Zhong
Sen Xing
Zeqiang Lai
Zhaoyang Liu
...
Lewei Lu
Tong Lu
Ping Luo
Yu Qiao
Jifeng Dai
MLLMVLMLRM
715
116
0
03 Jan 2025
ErgoChat: a Visual Query System for the Ergonomic Risk Assessment of Construction Workers
ErgoChat: a Visual Query System for the Ergonomic Risk Assessment of Construction Workers
Chao Fan
Qipei Mei
Xiaonan Wang
Xinming Li
138
4
0
31 Dec 2024
Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, EditingNeural Information Processing Systems (NeurIPS), 2024
Hao Fei
Shengqiong Wu
Hao Zhang
Tat-Seng Chua
Shuicheng Yan
419
70
0
31 Dec 2024
Towards Visual Grounding: A Survey
Towards Visual Grounding: A SurveyIEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
Linhui Xiao
Xiaoshan Yang
X. Lan
Yaowei Wang
Changsheng Xu
ObjD
807
26
0
28 Dec 2024
To Predict or Not To Predict? Proportionally Masked Autoencoders for
  Tabular Data Imputation
To Predict or Not To Predict? Proportionally Masked Autoencoders for Tabular Data Imputation
Jungkyu Kim
Kibok Lee
Taeyoung Park
308
3
0
26 Dec 2024
LLaVA-UHD v2: an MLLM Integrating High-Resolution Semantic Pyramid via Hierarchical Window Transformer
LLaVA-UHD v2: an MLLM Integrating High-Resolution Semantic Pyramid via Hierarchical Window Transformer
Yipeng Zhang
Yi Liu
Zonghao Guo
Yidan Zhang
Xuesong Yang
...
Xingtai Lv
Zhiyuan Liu
Tat-Seng Chua
Maosong Sun
Maosong Sun
MLLMVLM
295
3
0
18 Dec 2024
RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation
RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation
Kun Wu
Chengkai Hou
Jiaming Liu
Zhengping Che
Xiaozhu Ju
...
Zhenyu Wang
Pengju An
Siyuan Qian
Shanghang Zhang
Jian Tang
LM&Ro
465
79
0
18 Dec 2024
M$^3$-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation
M3^33-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object SegmentationComputer Vision and Pattern Recognition (CVPR), 2024
Zixuan Chen
Jiaxin Li
Liming Tan
Yejie Guo
Junxuan Liang
Cewu Lu
Yongqian Li
VOS
321
0
0
18 Dec 2024
UniReal: Universal Image Generation and Editing via Learning Real-world
  Dynamics
UniReal: Universal Image Generation and Editing via Learning Real-world DynamicsComputer Vision and Pattern Recognition (CVPR), 2024
Xi Chen
Zhifei Zhang
Chentao Song
Yuqian Zhou
Seunggeun Kim
...
Nanxuan Zhao
Yilin Wang
Hui Ding
Zhe Lin
Hengshuang Zhao
VGenDiffM
396
66
0
10 Dec 2024
Visual Lexicon: Rich Image Features in Language Space
Visual Lexicon: Rich Image Features in Language SpaceComputer Vision and Pattern Recognition (CVPR), 2024
Xudong Wang
Xingyi Zhou
Alireza Fathi
Trevor Darrell
Cordelia Schmid
VLM
167
6
0
09 Dec 2024
A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for
  Accelerating Large VLMs
A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMsComputer Vision and Pattern Recognition (CVPR), 2024
Wangbo Zhao
Yizeng Han
Jiasheng Tang
Hao Sun
Yibing Song
Kaidi Wang
Zinan Lin
Yang You
413
22
0
04 Dec 2024
Paint Outside the Box: Synthesizing and Selecting Training Data for Visual Grounding
Paint Outside the Box: Synthesizing and Selecting Training Data for Visual Grounding
Zilin Du
Haoxin Li
Jianfei Yu
Boyang Li
1.1K
1
0
01 Dec 2024
ROSE: Revolutionizing Open-Set Dense Segmentation with Patch-Wise Perceptual Large Multimodal Model
Kunyang Han
Yibo Hu
Mengxue Qu
Hailin Shi
Yao Zhao
Y. X. Wei
MLLMVLM3DV
536
1
0
29 Nov 2024
Enhancing Few-Shot Vision-Language Classification with Large Multimodal Model Features
Chancharik Mitra
Brandon Huang
Tianning Chai
Zhiqiu Lin
Assaf Arbelle
Rogerio Feris
Leonid Karlinsky
Trevor Darrell
Deva Ramanan
Roei Herzig
VLM
857
4
0
28 Nov 2024
ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
Qing Jiang
Gen Luo
Yuqin Yang
Yuda Xiong
Yihao Chen
Zhaoyang Zeng
Tianhe Ren
Lei Zhang
VLMLRM
471
22
0
27 Nov 2024
Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual
  Knowledge
Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual KnowledgeComputer Vision and Pattern Recognition (CVPR), 2024
Yaqi Zhao
Yuanyang Yin
Lin Li
Mingan Lin
Victor Shea-Jay Huang
Siwei Chen
Xin Wu
Baoqun Yin
Guosheng Dong
Wentao Zhang
220
3
0
25 Nov 2024
Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment
  in Multi-Modal Models
Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal Models
Wei Wang
Hao Sun
Qi Xu
Linfeng Li
Yiqing Cai
Botian Jiang
Hang Song
Xingcan Hu
Pengyu Wang
Li Xiao
148
8
0
14 Nov 2024
AD-DINO: Attention-Dynamic DINO for Distance-Aware Embodied Reference
  Understanding
AD-DINO: Attention-Dynamic DINO for Distance-Aware Embodied Reference Understanding
Hao Guo
Wei Fan
Baichun Wei
Jianfei Zhu
Jin Tian
Chunzhi Yi
Feng Jiang
225
0
0
13 Nov 2024
Finding NeMo: Negative-mined Mosaic Augmentation for Referring Image
  Segmentation
Finding NeMo: Negative-mined Mosaic Augmentation for Referring Image SegmentationEuropean Conference on Computer Vision (ECCV), 2024
Seongsu Ha
Chaeyun Kim
Donghwa Kim
Junho Lee
Sangho Lee
Joonseok Lee
226
6
0
03 Nov 2024
Phrase Decoupling Cross-Modal Hierarchical Matching and Progressive
  Position Correction for Visual Grounding
Phrase Decoupling Cross-Modal Hierarchical Matching and Progressive Position Correction for Visual GroundingIEEE transactions on multimedia (IEEE TMM), 2024
Minghong Xie
Ming Wang
Huafeng Li
Yafei Zhang
Dapeng Tao
Z. Yu
ObjD
142
5
0
31 Oct 2024
Referring Human Pose and Mask Estimation in the Wild
Referring Human Pose and Mask Estimation in the WildNeural Information Processing Systems (NeurIPS), 2024
Bo Miao
Mingtao Feng
Zijie Wu
Mohammed Bennamoun
Yongsheng Gao
Lin Wang
192
6
0
27 Oct 2024
Improving Multi-modal Large Language Model through Boosting Vision
  Capabilities
Improving Multi-modal Large Language Model through Boosting Vision Capabilities
Yanpeng Sun
Han Zhang
Qiang Chen
Xinyu Zhang
Nong Sang
Gang Zhang
Jingdong Wang
Zechao Li
152
10
0
17 Oct 2024
LocateBench: Evaluating the Locating Ability of Vision Language Models
LocateBench: Evaluating the Locating Ability of Vision Language Models
Ting-Rui Chiang
Joshua Robinson
Xinyan Velocity Yu
Dani Yogatama
VLMELM
208
0
0
17 Oct 2024
Context-Infused Visual Grounding for Art
Context-Infused Visual Grounding for Art
Selina Khan
Nanne van Noord
ObjD
170
2
0
16 Oct 2024
MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs
MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs
Yunqiu Xu
Linchao Zhu
Yi Yang
368
12
0
16 Oct 2024
MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained
  Vision-Language Understanding
MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding
Yue Cao
Yangzhou Liu
Zhe Chen
Guangchen Shi
Wenhai Wang
Danhuai Zhao
Tong Lu
215
15
0
15 Oct 2024
Investigating Human-Computer Interaction and Visual Comprehension in
  Text Generation Process of Natural Language Generation Models
Investigating Human-Computer Interaction and Visual Comprehension in Text Generation Process of Natural Language Generation Models
Yunchao Wang
Zihang Fu
Chaoqing Xu
Guodao Sun
Ronghua Liang
93
0
0
11 Oct 2024
OneRef: Unified One-tower Expression Grounding and Segmentation with
  Mask Referring Modeling
OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring ModelingNeural Information Processing Systems (NeurIPS), 2024
Linhui Xiao
Xiaoshan Yang
Fang Peng
Yaowei Wang
Changsheng Xu
ObjD
394
20
0
10 Oct 2024
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-trainingComputer Vision and Pattern Recognition (CVPR), 2024
Gen Luo
Xue Yang
Wenhan Dou
Zhaokai Wang
Jifeng Dai
Jifeng Dai
Yu Qiao
Xizhou Zhu
VLMMLLM
321
64
0
10 Oct 2024
HERM: Benchmarking and Enhancing Multimodal LLMs for Human-Centric
  Understanding
HERM: Benchmarking and Enhancing Multimodal LLMs for Human-Centric Understanding
Keliang Li
Zaifei Yang
Jiahe Zhao
Hongze Shen
Ruibing Hou
Hong Chang
Shiguang Shan
Xilin Chen
VLM
212
4
0
09 Oct 2024
ModalPrompt: Towards Efficient Multimodal Continual Instruction Tuning with Dual-Modality Guided Prompt
ModalPrompt: Towards Efficient Multimodal Continual Instruction Tuning with Dual-Modality Guided Prompt
Fanhu Zeng
Fei Zhu
Haiyang Guo
Xu-Yao Zhang
Cheng-Lin Liu
VLMCLL
234
15
0
08 Oct 2024
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI AgentsInternational Conference on Learning Representations (ICLR), 2024
Boyu Gou
Ruohan Wang
Boyuan Zheng
Yanan Xie
Cheng Chang
Yiheng Shu
Huan Sun
Eric Fosler-Lussier
LM&RoLLMAG
480
224
0
07 Oct 2024
DTVLT: A Multi-modal Diverse Text Benchmark for Visual Language Tracking
  Based on LLM
DTVLT: A Multi-modal Diverse Text Benchmark for Visual Language Tracking Based on LLM
Xuchen Li
Shiyu Hu
Xiaokun Feng
Dailing Zhang
Meiqi Wu
Jing Zhang
Kaiqi Huang
214
13
0
03 Oct 2024
Boosting Weakly-Supervised Referring Image Segmentation via Progressive
  Comprehension
Boosting Weakly-Supervised Referring Image Segmentation via Progressive ComprehensionNeural Information Processing Systems (NeurIPS), 2024
Zaiquan Yang
Yuhao Liu
Jiaying Lin
Gerhard Hancke
Rynson W. H. Lau
264
8
0
02 Oct 2024
World to Code: Multi-modal Data Generation via Self-Instructed
  Compositional Captioning and Filtering
World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and FilteringConference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Jiacong Wang
Bohong Wu
Haiyong Jiang
Xun Zhou
Xin Xiao
Haoyuan Guo
Jun Xiao
VLMVGen
269
14
0
30 Sep 2024
One Token to Seg Them All: Language Instructed Reasoning Segmentation in
  Videos
One Token to Seg Them All: Language Instructed Reasoning Segmentation in VideosNeural Information Processing Systems (NeurIPS), 2024
Zechen Bai
Tong He
Haiyang Mei
Pichao Wang
Ziteng Gao
Joya Chen
Lei Liu
Zheng Zhang
Mike Zheng Shou
VLMVOSMLLM
215
69
0
29 Sep 2024
A Parameter-Efficient Tuning Framework for Language-guided Object Grounding and Robot Grasping
A Parameter-Efficient Tuning Framework for Language-guided Object Grounding and Robot GraspingIEEE International Conference on Robotics and Automation (ICRA), 2024
Houjian Yu
Mingen Li
Alireza Rezazadeh
Yang Yang
Changhyun Choi
444
6
0
28 Sep 2024
You Only Speak Once to See
You Only Speak Once to SeeIEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Wenhao Yang
Jianguo Wei
Wenhuan Lu
Lei Li
VOS
154
4
0
27 Sep 2024
SimVG: A Simple Framework for Visual Grounding with Decoupled
  Multi-modal Fusion
SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal FusionNeural Information Processing Systems (NeurIPS), 2024
Ming Dai
Lingfeng Yang
Yihao Xu
Zhenhua Feng
Wankou Yang
ObjD
370
36
0
26 Sep 2024
PTQ4RIS: Post-Training Quantization for Referring Image Segmentation
PTQ4RIS: Post-Training Quantization for Referring Image SegmentationIEEE International Conference on Robotics and Automation (ICRA), 2024
Xiaoyan Jiang
Hang Yang
Kaiying Zhu
Xihe Qiu
Shibo Zhao
Sifan Zhou
MQ
135
2
0
25 Sep 2024
MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension
MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression ComprehensionConference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Ting Liu
Zunnan Xu
Yue Hu
Liangtao Shi
Zhiqiang Wang
Quanjun Yin
450
6
0
20 Sep 2024
Previous
12345...171819
Next