ResearchTrend.AI
Microsoft COCO Captions: Data Collection and Evaluation Server

1 April 2015
Xinlei Chen
Hao Fang
Tsung-Yi Lin
Ramakrishna Vedantam
Saurabh Gupta
Piotr Dollar
C. L. Zitnick

Papers citing "Microsoft COCO Captions: Data Collection and Evaluation Server"

50 / 1,387 papers shown
Training-free Boost for Open-Vocabulary Object Detection with Confidence Aggregation
Yanhao Zheng
Kai Liu
ObjD
26
1
0
12 Apr 2024
Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies
Zichao Li
Cihang Xie
E. D. Cubuk
CLIP
34
8
0
12 Apr 2024
Two Effects, One Trigger: On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Models
Simon Schrodi
David T. Hoffmann
Max Argus
Volker Fischer
Thomas Brox
VLM
58
0
0
11 Apr 2024
BRAVE: Broadening the visual encoding of vision-language models
Oğuzhan Fatih Kar
A. Tonioni
Petra Poklukar
Achin Kulshrestha
Amir Zamir
Federico Tombari
MLLM
VLM
47
25
0
10 Apr 2024
Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation
Luca Barsellotti
Roberto Amoroso
Marcella Cornia
Lorenzo Baraldi
Rita Cucchiara
VLM
DiffM
42
13
0
09 Apr 2024
Improving Interpretable Embeddings for Ad-hoc Video Search with Generative Captions and Multi-word Concept Bank
Jiaxin Wu
Chong-Wah Ngo
W. Chan
VGen
30
1
0
09 Apr 2024
MULTIFLOW: Shifting Towards Task-Agnostic Vision-Language Pruning
Matteo Farina
Massimiliano Mancini
Elia Cunegatti
Gaowen Liu
Giovanni Iacca
Elisa Ricci
VLM
42
2
0
08 Apr 2024
ByteEdit: Boost, Comply and Accelerate Generative Image Editing
Yuxi Ren
Jie Wu
Yanzuo Lu
Huafeng Kuang
Xin Xia
...
Shiyin Wang
Xuefeng Xiao
Yitong Wang
Min Zheng
Lean Fu
37
5
0
07 Apr 2024
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
Dongzhi Jiang
Guanglu Song
Xiaoshi Wu
Renrui Zhang
Dazhong Shen
Zhuofan Zong
Yu Liu
Hongsheng Li
VLM
30
20
0
04 Apr 2024
Unblind Text Inputs: Predicting Hint-text of Text Input in Mobile Apps via LLM
Zhe Liu
Chunyang Chen
Junjie Wang
Mengzhuo Chen
Boyu Wu
Yuekai Huang
Jun Hu
Qing Wang
29
10
0
03 Apr 2024
Weakly-Supervised 3D Scene Graph Generation via Visual-Linguistic Assisted Pseudo-labeling
Xu Wang
Yifan Li
Qiudan Zhang
Wen-Bin Wu
Mark Junjie Li
Jianmin Jiang
51
1
0
03 Apr 2024
ViTamin: Designing Scalable Vision Models in the Vision-Language Era
Jieneng Chen
Qihang Yu
Xiaohui Shen
Alan L. Yuille
Liang-Chieh Chen
3DV
VLM
36
24
0
02 Apr 2024
mChartQA: A universal benchmark for multimodal Chart Question Answer based on Vision-Language Alignment and Reasoning
Jingxuan Wei
Nan Xu
Guiyong Chang
Yin Luo
Bihui Yu
Ruifeng Guo
44
2
0
02 Apr 2024
VideoDistill: Language-aware Vision Distillation for Video Question Answering
Bo Zou
Chao Yang
Yu Qiao
Chengbin Quan
Youjian Zhao
VGen
47
1
0
01 Apr 2024
LLaMA-Excitor: General Instruction Tuning via Indirect Feature Interaction
Bo Zou
Chao Yang
Yu Qiao
Chengbin Quan
Youjian Zhao
44
6
0
01 Apr 2024
Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning
Rongjie Li
Yu Wu
Xuming He
MLLM
LRM
VLM
28
2
0
01 Apr 2024
From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models
Rongjie Li
Songyang Zhang
Dahua Lin
Kai-xiang Chen
Xuming He
VLM
42
14
0
01 Apr 2024
DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model
Lirui Zhao
Yue Yang
Kaipeng Zhang
Wenqi Shao
Yuxin Zhang
Yu Qiao
Ping Luo
Rongrong Ji
LM&Ro
LLMAG
VLM
29
3
0
31 Mar 2024
MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions
Kai Zhang
Yi Luan
Hexiang Hu
Kenton Lee
Siyuan Qiao
Wenhu Chen
Yu-Chuan Su
Ming-Wei Chang
VLM
LRM
39
34
0
28 Mar 2024
LocCa: Visual Pretraining with Location-aware Captioners
Bo Wan
Michael Tschannen
Yongqin Xian
Filip Pavetić
Ibrahim M. Alabdulmohsin
Xiao Wang
André Susano Pinto
Andreas Steiner
Lucas Beyer
Xiao-Qi Zhai
VLM
51
6
0
28 Mar 2024
ACES: Evaluating Automated Audio Captioning Models on the Semantics of Sounds
Gijs Wijngaard
Elia Formisano
Bruno L. Giordano
M. Dumontier
11
2
0
27 Mar 2024
Toward Interactive Regional Understanding in Vision-Large Language Models
Jungbeom Lee
Sanghyuk Chun
Sangdoo Yun
VLM
28
1
0
27 Mar 2024
Can 3D Vision-Language Models Truly Understand Natural Language?
Weipeng Deng
Jihan Yang
Runyu Ding
Jiahui Liu
Yijiang Li
Xiaojuan Qi
Edith Ngai
37
4
0
21 Mar 2024
Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models
Pablo Marcos-Manchón
Roberto Alcover-Couso
Juan C. Sanmiguel
Jose M. Martínez
VLM
49
18
0
21 Mar 2024
What if...?: Thinking Counterfactual Keywords Helps to Mitigate Hallucination in Large Multi-modal Models
Junho Kim
Yeonju Kim
Yonghyun Ro
LRM
MLLM
35
4
0
20 Mar 2024
As Firm As Their Foundations: Can open-sourced foundation models be used to create adversarial examples for downstream tasks?
Anjun Hu
Jindong Gu
Francesco Pinto
Konstantinos Kamnitsas
Philip H. S. Torr
AAML
SILM
32
5
0
19 Mar 2024
SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant
Guohao Sun
Can Qin
Jiamian Wang
Zeyuan Chen
Ran Xu
Zhiqiang Tao
MLLM
VLM
LRM
32
9
0
17 Mar 2024
LuoJiaHOG: A Hierarchy Oriented Geo-aware Image Caption Dataset for Remote Sensing Image-Text Retrival
Yuanxin Zhao
Mi Zhang
Bingnan Yang
Zhan Zhang
Jiaju Kang
Jianya Gong
30
2
0
16 Mar 2024
Refining Knowledge Transfer on Audio-Image Temporal Agreement for Audio-Text Cross Retrieval
Shunsuke Tsubaki
Daisuke Niizumi
Daiki Takeuchi
Yasunori Ohishi
Noboru Harada
Keisuke Imoto
21
1
0
16 Mar 2024
Generative Region-Language Pretraining for Open-Ended Object Detection
Chuang Lin
Yi-Xin Jiang
Lizhen Qu
Zehuan Yuan
Jianfei Cai
ObjD
VLM
53
13
0
15 Mar 2024
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Brandon McKinzie
Zhe Gan
J. Fauconnier
Sam Dodge
Bowen Zhang
...
Zirui Wang
Ruoming Pang
Peter Grasch
Alexander Toshev
Yinfei Yang
MLLM
37
186
0
14 Mar 2024
Eta Inversion: Designing an Optimal Eta Function for Diffusion-based Real Image Editing
Wonjun Kang
Kevin Galim
Hyung Il Koo
DiffM
31
5
0
14 Mar 2024
GiT: Towards Generalist Vision Transformer through Universal Language Interface
Haiyang Wang
Hao Tang
Li Jiang
Shaoshuai Shi
Muhammad Ferjad Naeem
Hongsheng Li
Bernt Schiele
Liwei Wang
VLM
37
10
0
14 Mar 2024
Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring
Yufei Zhan
Yousong Zhu
Hongyin Zhao
Fan Yang
Ming Tang
Jinqiao Wang
ObjD
36
12
0
14 Mar 2024
DAM: Dynamic Adapter Merging for Continual Video QA Learning
Feng Cheng
Ziyang Wang
Yi-Lin Sung
Yan-Bo Lin
Mohit Bansal
Gedas Bertasius
CLL
MoMe
31
10
0
13 Mar 2024
An Empirical Study of Parameter Efficient Fine-tuning on Vision-Language Pre-train Model
Yuxin Tian
Mouxing Yang
Yunfan Li
Dayiheng Liu
Xingzhang Ren
Xiaocui Peng
Jiancheng Lv
VLM
37
0
0
13 Mar 2024
Beyond Text: Frozen Large Language Models in Visual Signal Comprehension
Lei Zhu
Fangyun Wei
Yanye Lu
MLLM
VLM
46
17
0
12 Mar 2024
Synth$^2$: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings
Sahand Sharifzadeh
Christos Kaplanis
Shreya Pathak
D. Kumaran
Anastasija Ilić
Jovana Mitrović
Charles Blundell
Andrea Banino
VLM
46
9
0
12 Mar 2024
Transformer based Multitask Learning for Image Captioning and Object Detection
Debolena Basak
P. K. Srijith
M. Desarkar
24
1
0
10 Mar 2024
CLIP the Bias: How Useful is Balancing Data in Multimodal Learning?
Ibrahim M. Alabdulmohsin
Xiao Wang
Andreas Steiner
Priya Goyal
Alexander D'Amour
Xiao-Qi Zhai
34
16
0
07 Mar 2024
Popeye: A Unified Visual-Language Model for Multi-Source Ship Detection from Remote Sensing Imagery
Wei Zhang
Miaoxin Cai
Tong Zhang
Guoqiang Lei
Zhuang Yin
Xuerui Mao
27
6
0
06 Mar 2024
Neural Image Compression with Text-guided Encoding for both Pixel-level and Perceptual Fidelity
Hagyeong Lee
Minkyu Kim
Jun-Hyuk Kim
Seungeon Kim
Dokwan Oh
Jaeho Lee
DiffM
32
6
0
05 Mar 2024
When ControlNet Meets Inexplicit Masks: A Case Study of ControlNet on its Contour-following Ability
Wenjie Xuan
Yufei Xu
Shanshan Zhao
Chaoyue Wang
Juhua Liu
Bo Du
Dacheng Tao
26
2
0
01 Mar 2024
DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models
Muyang Li
Tianle Cai
Jiaxin Cao
Qinsheng Zhang
Han Cai
Junjie Bai
Yangqing Jia
Ming-Yu Liu
Kai Li
Song Han
DiffM
29
41
0
29 Feb 2024
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
Weiyun Wang
Yiming Ren
Hao Luo
Tiantong Li
Chenxiang Yan
...
Qingyun Li
Lewei Lu
Xizhou Zhu
Yu Qiao
Jifeng Dai
MLLM
45
47
0
29 Feb 2024
SynArtifact: Classifying and Alleviating Artifacts in Synthetic Images via Vision-Language Model
Bin Cao
Jianhao Yuan
Yexin Liu
Jian Li
Shuyang Sun
Jing Liu
Bo-Lu Zhao
DiffM
35
7
0
28 Feb 2024
Vision Language Model-based Caption Evaluation Method Leveraging Visual Context Extraction
Koki Maeda
Shuhei Kurita
Taiki Miyanishi
Naoaki Okazaki
38
2
0
28 Feb 2024
Acquiring Linguistic Knowledge from Multimodal Input
Theodor Amariucai
Alexander Scott Warstadt
CLL
29
2
0
27 Feb 2024
MCF-VC: Mitigate Catastrophic Forgetting in Class-Incremental Learning for Multimodal Video Captioning
Huiyu Xiong
Lanxiao Wang
Heqian Qiu
Taijin Zhao
Benliu Qiu
Hongliang Li
CLL
32
1
0
27 Feb 2024
Demonstrating and Reducing Shortcuts in Vision-Language Representation Learning
Maurits J. R. Bleeker
Mariya Hendriksen
Andrew Yates
Maarten de Rijke
VLM
40
3
0
27 Feb 2024