ResearchTrend.AI
  • Communities
  • Connect sessions
  • AI calendar
  • Organizations
  • Join Slack
  • Contact Sales
Papers
Communities
Social Events
Terms and Conditions
Pricing
Contact Sales
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2026 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1504.00325
  4. Cited By
Microsoft COCO Captions: Data Collection and Evaluation Server
v1v2 (latest)

Microsoft COCO Captions: Data Collection and Evaluation Server

1 April 2015
Xinlei Chen
Hao Fang
Nayeon Lee
Ramakrishna Vedantam
Saurabh Gupta
Piotr Dollar
C. L. Zitnick
ArXiv (abs)PDFHTML

Papers citing "Microsoft COCO Captions: Data Collection and Evaluation Server"

50 / 1,519 papers shown
Cacophony: An Improved Contrastive Audio-Text Model
Cacophony: An Improved Contrastive Audio-Text ModelIEEE/ACM Transactions on Audio Speech and Language Processing (TASLP), 2024
Ge Zhu
Jordan Darefsky
Zhiyao Duan
AuLLM
325
22
0
10 Feb 2024
GPTs Are Multilingual Annotators for Sequence Generation Tasks
GPTs Are Multilingual Annotators for Sequence Generation Tasks
Juhwan Choi
Eunju Lee
Kyohoon Jin
Youngbin Kim
156
14
0
08 Feb 2024
Question Aware Vision Transformer for Multimodal Reasoning
Question Aware Vision Transformer for Multimodal Reasoning
Roy Ganz
Yair Kittenplon
Aviad Aberdam
Elad Ben Avraham
Oren Nuriel
Shai Mazor
Ron Litman
299
36
0
08 Feb 2024
Get What You Want, Not What You Don't: Image Content Suppression for
  Text-to-Image Diffusion Models
Get What You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models
Senmao Li
Joost van de Weijer
Taihang Hu
Fahad Shahbaz Khan
Qibin Hou
Yaxing Wang
Jian Yang
DiffM
237
43
0
08 Feb 2024
Variance Alignment Score: A Simple But Tough-to-Beat Data Selection
  Method for Multimodal Contrastive Learning
Variance Alignment Score: A Simple But Tough-to-Beat Data Selection Method for Multimodal Contrastive Learning
Yiping Wang
Yifang Chen
Wendan Yan
Kevin Jamieson
S. Du
263
7
0
03 Feb 2024
Can MLLMs Perform Text-to-Image In-Context Learning?
Can MLLMs Perform Text-to-Image In-Context Learning?
Yuchen Zeng
Wonjun Kang
Yicong Chen
Hyung Il Koo
Kangwook Lee
MLLM
263
14
0
02 Feb 2024
SCO-VIST: Social Interaction Commonsense Knowledge-based Visual
  Storytelling
SCO-VIST: Social Interaction Commonsense Knowledge-based Visual Storytelling
Eileen Wang
S. Han
Josiah Poon
281
5
0
01 Feb 2024
EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for
  Automated Audio Captioning
EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning
Jaeyeon Kim
Jaeyoon Jung
Jinjoo Lee
Sang Hoon Woo
CLIPVLM
218
43
0
31 Jan 2024
EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor
  Image Comprehension in Remote Sensing Domain
EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain
Wei Zhang
Miaoxin Cai
Tong Zhang
Zhuang Yin
Xuerui Mao
434
218
0
30 Jan 2024
Towards Unified Interactive Visual Grounding in The Wild
Towards Unified Interactive Visual Grounding in The Wild
Jie Xu
Hanbo Zhang
Qingyi Si
Yifeng Li
Xuguang Lan
Tao Kong
LM&Ro
299
5
0
30 Jan 2024
M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale
  Efficient Pretraining
M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining
Qingpei Guo
Furong Xu
Hanxiao Zhang
Wang Ren
Ziping Ma
Lin Ju
Jian Wang
Jingdong Chen
Ming Yang
VLMMLLM
220
6
0
29 Jan 2024
Muffin or Chihuahua? Challenging Multimodal Large Language Models with
  Multipanel VQA
Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQAAnnual Meeting of the Association for Computational Linguistics (ACL), 2024
Yue Fan
Jing Gu
KAI-QING Zhou
Qianqi Yan
Shan Jiang
Ching-Chen Kuo
Xinze Guan
Xin Eric Wang
293
11
0
29 Jan 2024
MM-LLMs: Recent Advances in MultiModal Large Language Models
MM-LLMs: Recent Advances in MultiModal Large Language ModelsAnnual Meeting of the Association for Computational Linguistics (ACL), 2024
Duzhen Zhang
Yahan Yu
Jiahua Dong
Chenxing Li
Dan Su
Chenhui Chu
Dong Yu
OffRLLRM
512
335
0
24 Jan 2024
Common-Sense Bias Modeling for Classification Tasks
Common-Sense Bias Modeling for Classification TasksAAAI Conference on Artificial Intelligence (AAAI), 2024
Miao Zhang
Zee fryer
Ben Colman
Ali Shahriyari
Gaurav Bharaj
486
0
0
24 Jan 2024
Enhancing Object Detection Performance for Small Objects through
  Synthetic Data Generation and Proportional Class-Balancing Technique: A
  Comparative Study in Industrial Scenarios
Enhancing Object Detection Performance for Small Objects through Synthetic Data Generation and Proportional Class-Balancing Technique: A Comparative Study in Industrial Scenarios
Jibinraj Antony
Vinit Hegiste
Ali Nazeri
Hooman Tavakoli
Snehal Walunj
Christiane Plociennik
Martin Ruskowski
197
3
0
23 Jan 2024
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning
  Capabilities
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning CapabilitiesComputer Vision and Pattern Recognition (CVPR), 2024
Boyuan Chen
Zhuo Xu
Sean Kirmani
Brian Ichter
Danny Driess
Pete Florence
Dorsa Sadigh
Leonidas Guibas
Fei Xia
LRMReLM
329
548
0
22 Jan 2024
Text-to-Image Cross-Modal Generation: A Systematic Review
Text-to-Image Cross-Modal Generation: A Systematic Review
Maciej Żelaszczyk
Jacek Mańdziuk
320
6
0
21 Jan 2024
LLMRA: Multi-modal Large Language Model based Restoration Assistant
LLMRA: Multi-modal Large Language Model based Restoration Assistant
Xiaoyu Jin
Yuan Shi
Bin Xia
Wenming Yang
201
8
0
21 Jan 2024
CBVS: A Large-Scale Chinese Image-Text Benchmark for Real-World Short
  Video Search Scenarios
CBVS: A Large-Scale Chinese Image-Text Benchmark for Real-World Short Video Search Scenarios
Xiangshuo Qiao
Xianxin Li
Xiaozhe Qu
Jie M. Zhang
Yang Liu
Yu Luo
Cihang Jin
Jin Ma
VLM
358
0
0
19 Jan 2024
Supervised Fine-tuning in turn Improves Visual Foundation Models
Supervised Fine-tuning in turn Improves Visual Foundation Models
Xiaohu Jiang
Yixiao Ge
Yuying Ge
Dachuan Shi
Chun Yuan
Ying Shan
VLMCLIP
256
14
0
18 Jan 2024
MM-Interleaved: Interleaved Image-Text Generative Modeling via
  Multi-modal Feature Synchronizer
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
Changyao Tian
Xizhou Zhu
Yuwen Xiong
Weiyun Wang
Zhe Chen
...
Tong Lu
Jie Zhou
Jiaming Song
Yu Qiao
Jifeng Dai
AuLLM
240
70
0
18 Jan 2024
Beyond Anti-Forgetting: Multimodal Continual Instruction Tuning with
  Positive Forward Transfer
Beyond Anti-Forgetting: Multimodal Continual Instruction Tuning with Positive Forward Transfer
Junhao Zheng
Qianli Ma
Zhen Liu
Binquan Wu
Hu Feng
CLL
352
26
0
17 Jan 2024
COCO is "ALL'' You Need for Visual Instruction Fine-tuning
COCO is "ALL'' You Need for Visual Instruction Fine-tuningIEEE International Conference on Multimedia and Expo (ICME), 2024
Xiaotian Han
Yiqi Wang
Bohan Zhai
Quanzeng You
Hongxia Yang
VLMMLLM
210
3
0
17 Jan 2024
Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual
  Concept Understanding
Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding
Yatong Bai
Utsav Garg
Apaar Shanker
Haoming Zhang
Samyak Parajuli
...
Eugenia D Fomitcheva
E. Branson
Aerin Kim
Somayeh Sojoudi
Kyunghyun Cho
193
2
0
09 Jan 2024
CaMML: Context-Aware Multimodal Learner for Large Models
CaMML: Context-Aware Multimodal Learner for Large Models
Yixin Chen
Shuai Zhang
Boran Han
Tong He
Bo Li
VLM
277
6
0
06 Jan 2024
Incorporating Visual Experts to Resolve the Information Loss in
  Multimodal Large Language Models
Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models
Xin He
Longhui Wei
Lingxi Xie
Qi Tian
314
13
0
06 Jan 2024
4DGen: Grounded 4D Content Generation with Spatial-temporal Consistency
4DGen: Grounded 4D Content Generation with Spatial-temporal Consistency
Yuyang Yin
Dejia Xu
Zinan Lin
Yao-Min Zhao
Yunchao Wei
3DGS
325
112
0
28 Dec 2023
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision,
  Language, Audio, and Action
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
Jiasen Lu
Christopher Clark
Sangho Lee
Zichen Zhang
Savya Khosla
Ryan Marten
Derek Hoiem
Aniruddha Kembhavi
VLMMLLM
282
274
0
28 Dec 2023
Visual Instruction Tuning towards General-Purpose Multimodal Model: A
  Survey
Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey
Jiaxing Huang
Jingyi Zhang
Kai Jiang
Han Qiu
Shijian Lu
198
30
0
27 Dec 2023
Cloud-Device Collaborative Learning for Multimodal Large Language Models
Cloud-Device Collaborative Learning for Multimodal Large Language Models
Guanqun Wang
Jiaming Liu
Chenxuan Li
Junpeng Ma
Yuan Zhang
...
Kevin Zhang
Maurice Chong
Ray Zhang
Yijiang Liu
Shanghang Zhang
215
18
0
26 Dec 2023
UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces
UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces
Jiannan Wu
Yi Jiang
Bin Yan
Huchuan Lu
Zehuan Yuan
Ping Luo
VOS
273
26
0
25 Dec 2023
Learning from Mistakes: Iterative Prompt Relabeling for Text-to-Image Diffusion Model Training
Learning from Mistakes: Iterative Prompt Relabeling for Text-to-Image Diffusion Model Training
Xinyan Chen
Jiaxin Ge
Tianjun Zhang
Jiaming Liu
Shanghang Zhang
VLMEGVM
465
2
0
23 Dec 2023
InternVL: Scaling up Vision Foundation Models and Aligning for Generic
  Visual-Linguistic Tasks
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen
Jiannan Wu
Wenhai Wang
Weijie Su
Guo Chen
...
Bin Li
Ping Luo
Tong Lu
Yu Qiao
Jifeng Dai
VLMMLLM
641
2,210
0
21 Dec 2023
Generative Multimodal Models are In-Context Learners
Generative Multimodal Models are In-Context Learners
Quan-Sen Sun
Yufeng Cui
Xiaosong Zhang
Fan Zhang
Qiying Yu
...
Yueze Wang
Yongming Rao
Jingjing Liu
Tiejun Huang
Xinlong Wang
MLLMLRM
374
422
0
20 Dec 2023
Misalign, Contrast then Distill: Rethinking Misalignments in
  Language-Image Pretraining
Misalign, Contrast then Distill: Rethinking Misalignments in Language-Image Pretraining
Bumsoo Kim
Yeonsik Jo
Jinhyung Kim
S. Kim
VLM
255
10
0
19 Dec 2023
CLIM: Contrastive Language-Image Mosaic for Region Representation
CLIM: Contrastive Language-Image Mosaic for Region Representation
Size Wu
Wenwei Zhang
Lumin Xu
Sheng Jin
Wentao Liu
Chen Change Loy
ObjDVLM
199
24
0
18 Dec 2023
M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts
M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts
Mingsheng Li
Xin Chen
C. Zhang
Sijin Chen
Erik Cambria
Fukun Yin
Gang Yu
Tao Chen
307
36
0
17 Dec 2023
Simple Image-level Classification Improves Open-vocabulary Object
  Detection
Simple Image-level Classification Improves Open-vocabulary Object DetectionAAAI Conference on Artificial Intelligence (AAAI), 2023
Ru Fang
Guansong Pang
Xiaolong Bai
ObjDVLM
292
21
0
16 Dec 2023
Tell Me What You See: Text-Guided Real-World Image Denoising
Tell Me What You See: Text-Guided Real-World Image DenoisingIEEE Open Journal of Signal Processing (IEEE Open J. Signal Process.), 2023
E. Yosef
Raja Giryes
DiffM
499
3
0
15 Dec 2023
VL-GPT: A Generative Pre-trained Transformer for Vision and Language
  Understanding and Generation
VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation
Jinguo Zhu
Xiaohan Ding
Yixiao Ge
Yuying Ge
Sijie Zhao
Hengshuang Zhao
Xiaohua Wang
Ying Shan
ViTVLM
191
46
0
14 Dec 2023
Pixel Aligned Language Models
Pixel Aligned Language ModelsComputer Vision and Pattern Recognition (CVPR), 2023
Jiarui Xu
Xingyi Zhou
Shen Yan
Xiuye Gu
Anurag Arnab
Chen Sun
Xiaolong Wang
Cordelia Schmid
MLLMVLM
291
17
0
14 Dec 2023
CL2CM: Improving Cross-Lingual Cross-Modal Retrieval via Cross-Lingual
  Knowledge Transfer
CL2CM: Improving Cross-Lingual Cross-Modal Retrieval via Cross-Lingual Knowledge TransferAAAI Conference on Artificial Intelligence (AAAI), 2023
Yabing Wang
Fan Wang
Jianfeng Dong
Hao Luo
VLM
206
20
0
14 Dec 2023
Depicting Beyond Scores: Advancing Image Quality Assessment through
  Multi-modal Language Models
Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language ModelsEuropean Conference on Computer Vision (ECCV), 2023
Zhiyuan You
Zheyuan Li
Jinjin Gu
Zhenfei Yin
Tianfan Xue
Chao Dong
EGVM
400
91
0
14 Dec 2023
Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image
  Captioning
Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image CaptioningAAAI Conference on Artificial Intelligence (AAAI), 2023
Zhiyue Liu
Jinyuan Liu
Fanrong Ma
CLIPVLM
254
20
0
14 Dec 2023
Genixer: Empowering Multimodal Large Language Models as a Powerful Data
  Generator
Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator
Henry Hengyuan Zhao
Pan Zhou
Mike Zheng Shou
MLLMSyDa
461
12
0
11 Dec 2023
Stellar: Systematic Evaluation of Human-Centric Personalized
  Text-to-Image Methods
Stellar: Systematic Evaluation of Human-Centric Personalized Text-to-Image Methods
Panos Achlioptas
Alexandros Benetatos
Iordanis Fostiropoulos
Dimitris Skourtis
276
10
0
11 Dec 2023
GlitchBench: Can large multimodal models detect video game glitches?
GlitchBench: Can large multimodal models detect video game glitches?Computer Vision and Pattern Recognition (CVPR), 2023
Mohammad Reza Taesiri
Tianjun Feng
Anh Totti Nguyen
Cor-Paul Bezemer
MLLMVLMLRM
329
18
0
08 Dec 2023
Lyrics: Boosting Fine-grained Language-Vision Alignment and
  Comprehension via Semantic-aware Visual Objects
Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects
Junyu Lu
Ruyi Gan
Di Zhang
Xiaojun Wu
Ziwei Wu
Renliang Sun
Jiaxing Zhang
Pingjian Zhang
Yan Song
MLLMVLM
229
22
0
08 Dec 2023
GPT4SGG: Synthesizing Scene Graphs from Holistic and Region-specific
  Narratives
GPT4SGG: Synthesizing Scene Graphs from Holistic and Region-specific Narratives
Zuyao Chen
Jinlin Wu
Zhen Lei
Zhaoxiang Zhang
Changwen Chen
296
6
0
07 Dec 2023
Open-Vocabulary Segmentation with Semantic-Assisted Calibration
Open-Vocabulary Segmentation with Semantic-Assisted Calibration
Yong Liu
Sule Bai
Guanbin Li
Yitong Wang
Yansong Tang
VLM
227
49
0
07 Dec 2023
Previous
123...91011...293031
Next
Page 10 of 31
Pageof 31