ResearchTrend.AI
Microsoft COCO Captions: Data Collection and Evaluation Server

1 April 2015
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, C. Lawrence Zitnick

Papers citing "Microsoft COCO Captions: Data Collection and Evaluation Server"

50 / 1,519 papers shown
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
European Conference on Computer Vision (ECCV), 2023
Rizhao Cai, Zirui Song, Dayan Guan, Zhenhao Chen, Xing Luo, Chenyu Yi, Alex C. Kot
MLLM, VLM
05 Dec 2023

Object Recognition as Next Token Prediction
Computer Vision and Pattern Recognition (CVPR), 2023
Kaiyu Yue, Borchun Chen, Jonas Geiping, Hengduo Li, Tom Goldstein, Ser-Nam Lim
04 Dec 2023

A Challenging Multimodal Video Summary: Simultaneously Extracting and Generating Keyframe-Caption Pairs from Video
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Keito Kudo, Haruki Nagasawa, Jun Suzuki, Nobuyuki Shimizu
04 Dec 2023

Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models
Andrés Villa, Juan Carlos León Alcázar, Alvaro Soto, Bernard Ghanem
MLLM, VLM
03 Dec 2023

Abstract Syntax Tree for Programming Language Understanding and Representation: How Far Are We?
Weisong Sun, Chunrong Fang, Yun Miao, Yudu You, Mengzhe Yuan, ..., Quanjun Zhang, An Guo, Xiang Chen, Yang Liu, Zhenyu Chen
01 Dec 2023

InstructSeq: Unifying Vision Tasks with Instruction-conditioned Multi-modal Sequence Generation
Rongyao Fang, Shilin Yan, Zhaoyang Huang, Jingqiu Zhou, Hao Tian, Jifeng Dai, Jiaming Song
MLLM
30 Nov 2023

TLDR: Text Based Last-layer Retraining for Debiasing Image Classifiers
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
Juhyeon Park, Seokhyeon Jeong, Taesup Moon
30 Nov 2023

Understanding and Improving In-Context Learning on Vision-language Models
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
Shuo Chen, Zhen Han, Bailan He, Mark Buckley, Juil Sock, Volker Tresp, Jindong Gu
29 Nov 2023

VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models
European Conference on Computer Vision (ECCV), 2023
Shicheng Li, Lei Li, Shuhuai Ren, Yuanxin Liu, Yi Liu, Rundong Gao, Xu Sun, Lu Hou
29 Nov 2023

MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Computer Vision and Pattern Recognition (CVPR), 2023
Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Raviteja Vemulapalli, Oncel Tuzel
CLIP, VLM
28 Nov 2023

LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
European Conference on Computer Vision (ECCV), 2023
Yanwei Li, Chengyao Wang, Jiaya Jia
VLM, MLLM
28 Nov 2023

Large Language Models Meet Computer Vision: A Brief Survey
Raby Hamadi
LM&MA
28 Nov 2023

IG Captioner: Information Gain Captioners are Strong Zero-shot Classifiers
European Conference on Computer Vision (ECCV), 2023
Chenglin Yang, Siyuan Qiao, Yuan Cao, Yu Zhang, Tao Zhu, Yaoyao Liu, Jiahui Yu
VLM
27 Nov 2023

Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models
Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, Li-ming Yuan
ELM, MLLM
27 Nov 2023

Fully Authentic Visual Question Answering Dataset from Online Communities
European Conference on Computer Vision (ECCV), 2023
Chongyan Chen, Xiyang Dai, Noel Codella, Yunsheng Li, Lu Yuan, Danna Gurari
27 Nov 2023

Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage and Sharing in LLMs
Yunxin Li, Zhenyu Liu, Wei Wang, Xiaochun Cao, Yuxin Ding, Min Zhang
27 Nov 2023

Large Language Models as Automated Aligners for benchmarking Vision-Language Models
Yuanfeng Ji, Chongjian Ge, Weikai Kong, Enze Xie, Zhengying Liu, Zhengguo Li, Ping Luo
MLLM, ELM
24 Nov 2023

Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models
European Conference on Computer Vision (ECCV), 2023
Yufei Zhan, Yousong Zhu, Zhiyang Chen, Fan Yang, E. Goles, Jinqiao Wang
ObjD
24 Nov 2023

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
European Conference on Computer Vision (ECCV), 2023
Lin Chen, Jinsong Li, Xiao-wen Dong, Pan Zhang, Conghui He, Yuan Liu, Feng Zhao, Dahua Lin
MLLM, VLM
21 Nov 2023

Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching
European Conference on Computer Vision (ECCV), 2023
Meng Chu, Zhedong Zheng, Wei Ji, Tingyu Wang, Tat-Seng Chua
21 Nov 2023

LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge
Gongwei Chen, Leyang Shen, Rui Shao, Xiang Deng, Liqiang Nie
VLM, MLLM
20 Nov 2023

Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention
Zuyao Chen, Jinlin Wu, Zhen Lei, Zhaoxiang Zhang, Changwen Chen
18 Nov 2023

Emu Edit: Precise Image Editing via Recognition and Generation Tasks
Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, Yaniv Taigman
16 Nov 2023

Towards Open-Ended Visual Recognition with Large Language Model
Qihang Yu, Xiaohui Shen, Liang-Chieh Chen
VLM
14 Nov 2023

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Computer Vision and Pattern Recognition (CVPR), 2023
Peng Jin, Ryuichi Takanobu, Caiwan Zhang, Xiaochun Cao, Li-ming Yuan
MLLM
14 Nov 2023

Detecting and Correcting Hate Speech in Multimodal Memes with Large Visual Language Model
Minh-Hao Van, Xintao Wu
VLM, MLLM
12 Nov 2023

Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
Computer Vision and Pattern Recognition (CVPR), 2023
Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, Lu Yuan
VLM
10 Nov 2023

Training CLIP models on Data from Scientific Papers
Calvin Metzger
VLM, CLIP
08 Nov 2023

TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models
Zhen Yang, Yingxue Zhang, Fandong Meng, Jie Zhou
VLM, MLLM
08 Nov 2023

OtterHD: A High-Resolution Multi-modality Model
Yue Liu, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, Ziwei Liu
VLM, MLLM
07 Nov 2023

MetaReVision: Meta-Learning with Retrieval for Visually Grounded Compositional Concept Acquisition
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Guangyue Xu, Parisa Kordjamshidi, Joyce Chai
02 Nov 2023

De-Diffusion Makes Text a Strong Cross-Modal Interface
Computer Vision and Pattern Recognition (CVPR), 2023
Chen Wei, Chenxi Liu, Siyuan Qiao, Zhishuai Zhang, Alan Yuille, Jiahui Yu
VLM, DiffM
01 Nov 2023

From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities
Information Fusion (Inf. Fusion), 2023
Md Farhan Ishmam, Md Sakib Hossain Shovon, M. F. Mridha, Nilanjan Dey
01 Nov 2023

CapsFusion: Rethinking Image-Text Data at Scale
Computer Vision and Pattern Recognition (CVPR), 2023
Qiying Yu, Quan-Sen Sun, Xiaosong Zhang, Yufeng Cui, Fan Zhang, Yue Cao, Xinlong Wang, Jingjing Liu
VLM
31 Oct 2023

Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Deepanway Ghosal, Navonil Majumder, Roy Ka-wei Lee, Amélie Reymond, Soujanya Poria
31 Oct 2023

Harvest Video Foundation Models via Efficient Post-Pretraining
Yizhuo Li, Kunchang Li, Yinan He, Yi Wang, Yali Wang, Limin Wang, Yu Qiao, Ping Luo
CLIP, VLM, VGen
30 Oct 2023

Impressions: Understanding Visual Semiotics and Aesthetic Impact
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Julia Kruk, Caleb Ziems, Diyi Yang
27 Oct 2023

CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection
Neural Information Processing Systems (NeurIPS), 2023
Chuofan Ma, Yi Jiang, Xin Wen, Zehuan Yuan, Xiaojuan Qi
ObjD, VLM
25 Oct 2023

Knowledge Editing for Large Language Models: A Survey
ACM Computing Surveys (ACM Comput. Surv.), 2023
Song Wang, Yaochen Zhu, Haochen Liu, Zaiyi Zheng, Chen Chen, Wenlin Yao
KELM
24 Oct 2023

Leveraging Image-Text Similarity and Caption Modification for the DataComp Challenge: Filtering Track and BYOD Track
Shuhei Yokoo, Peifei Zhu, Yuchi Ishikawa, Mikihiro Tanaka, Masayoshi Kondo, Hirokatsu Kataoka
23 Oct 2023

OV-VG: A Benchmark for Open-Vocabulary Visual Grounding
Chunlei Wang, Wenquan Feng, Xiangtai Li, Guangliang Cheng, Shuchang Lyu, Binghao Liu, Lijiang Chen, Qi Zhao
ObjD, VLM
22 Oct 2023

ITEm: Unsupervised Image-Text Embedding Learning for eCommerce
Baohao Liao, Michael Kozielski, Sanjika Hewavitharana, Jiangbo Yuan, Shahram Khadivi, Tomer Lancewicki
SSL
22 Oct 2023

On the Transferability of Visually Grounded PCFGs
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Yanpeng Zhao, Ivan Titov
21 Oct 2023

On the Language Encoder of Contrastive Cross-modal Models
Mengjie Zhao, Junya Ono, Zhi-Wei Zhong, Chieh-Hsin Lai, Yuhta Takida, Naoki Murata, Wei-Hsiang Liao, Takashi Shibuya, Hiromi Wakaki, Yuki Mitsufuji
VLM
20 Oct 2023

PrivImage: Differentially Private Synthetic Image Generation using Diffusion Models with Semantic-Aware Pretraining
Kecen Li, Chen Gong, Zhixiang Li, Yuzhong Zhao, Xinwen Hou, Tianhao Wang
19 Oct 2023

InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot Interactions
Hanbo Zhang, Jie Xu, Yuchen Mo, Tao Kong
18 Oct 2023

LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation
Computer Vision and Pattern Recognition (CVPR), 2023
Kibum Kim, Kanghoon Yoon, Jaeyeong Jeon, Yeonjun In, Jinyoung Moon, Donghyun Kim, Chanyoung Park
16 Oct 2023

Bounding and Filling: A Fast and Flexible Framework for Image Captioning
Zheng Ma, Changxin Wang, Bo Huang, Zi-Yue Zhu, Jianbing Zhang
15 Oct 2023

Leveraging Image Augmentation for Object Manipulation: Towards Interpretable Controllability in Object-Centric Learning
Jinwoo Kim, Janghyuk Choi, Jaehyun Kang, Changyeon Lee, Ho-Jin Choi, Seon Joo Kim
OCL
13 Oct 2023

From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models
Dongsheng Jiang, Yuchen Liu, Songlin Liu, Jiné Zhao, Hao Zhang, Zhen Gao, Xiaopeng Zhang, Jin Li, Hongkai Xiong
MLLM, VLM
13 Oct 2023

Page 11 of 31