Microsoft COCO Captions: Data Collection and Evaluation Server


1 April 2015
Xinlei Chen
Hao Fang
Nayeon Lee
Ramakrishna Vedantam
Saurabh Gupta
Piotr Dollar
C. L. Zitnick

Papers citing "Microsoft COCO Captions: Data Collection and Evaluation Server"

50 / 1,387 papers shown
ArcSin: Adaptive ranged cosine Similarity injected noise for Language-Driven Visual Tasks
Yang Liu
Xiaomin Yu
Gongyu Zhang
Christos Bergeles
Prokar Dasgupta
Alejandro Granados
Sebastien Ourselin
42
2
0
27 Feb 2024
Towards Open-ended Visual Quality Comparison
Haoning Wu
Hanwei Zhu
Zicheng Zhang
Erli Zhang
Chaofeng Chen
...
Qiong Yan
Xiaohong Liu
Guangtao Zhai
Shiqi Wang
Weisi Lin
AAML
59
49
0
26 Feb 2024
CLoVe: Encoding Compositional Language in Contrastive Vision-Language Models
Santiago Castro
Amir Ziai
Avneesh Saluja
Zhuoning Yuan
Rada Mihalcea
MLLM
CoGe
VLM
34
5
0
22 Feb 2024
Vision-Language Navigation with Embodied Intelligence: A Survey
Peng Gao
Peng Wang
Feng Gao
Fei-Yue Wang
Ruyue Yuan
LM&Ro
37
2
0
22 Feb 2024
VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models
Jiawei Liang
Siyuan Liang
Man Luo
Aishan Liu
Dongchen Han
Ee-Chien Chang
Xiaochun Cao
42
37
0
21 Feb 2024
CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models
Fuwen Luo
Chi Chen
Zihao Wan
Zhaolu Kang
Qidong Yan
...
Xiaoyue Mi
Peng Li
Ning Ma
Maosong Sun
Yang Liu
40
5
0
21 Feb 2024
A Multimodal In-Context Tuning Approach for E-Commerce Product Description Generation
Yunxin Li
Baotian Hu
Wenhan Luo
Lin Ma
Yuxin Ding
Min-Ling Zhang
53
1
0
21 Feb 2024
CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples
Jianrui Zhang
Mu Cai
Tengyang Xie
Yong Jae Lee
LRM
43
18
0
20 Feb 2024
ConVQG: Contrastive Visual Question Generation with Multimodal Guidance
Li Mi
Syrielle Montariol
J. Castillo-Navarro
Xianjie Dai
Antoine Bosselut
D. Tuia
30
4
0
20 Feb 2024
Language-guided Image Reflection Separation
Haofeng Zhong
Yuchen Hong
Shuchen Weng
Jinxiu Liang
Boxin Shi
26
7
0
19 Feb 2024
Interpretable Embedding for Ad-hoc Video Search
Jiaxin Wu
Chong-Wah Ngo
16
29
0
19 Feb 2024
Cobra Effect in Reference-Free Image Captioning Metrics
Zheng Ma
Changxin Wang
Yawen Ouyang
Fei Zhao
Jianbing Zhang
Shujian Huang
Jiajun Chen
30
2
0
18 Feb 2024
Assessing News Thumbnail Representativeness: Counterfactual text can enhance the cross-modal matching ability
Yejun Yoon
Seunghyun Yoon
Kunwoo Park
21
0
0
17 Feb 2024
PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter
Junfei Xiao
Zheng Xu
Alan L. Yuille
Shen Yan
Boyu Wang
33
3
0
16 Feb 2024
Recovering the Pre-Fine-Tuning Weights of Generative Models
Eliahu Horwitz
Jonathan Kahana
Yedid Hoshen
50
9
0
15 Feb 2024
Social Reward: Evaluating and Enhancing Generative AI through Million-User Feedback from an Online Creative Community
Arman Isajanyan
Artur Shatveryan
David Kocharyan
Zhangyang Wang
Humphrey Shi
EGVM
68
5
0
15 Feb 2024
DoRA: Weight-Decomposed Low-Rank Adaptation
Shih-yang Liu
Chien-Yi Wang
Hongxu Yin
Pavlo Molchanov
Yu-Chiang Frank Wang
Kwang-Ting Cheng
Min-Hung Chen
27
340
0
14 Feb 2024
OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM
Yutao Hu
Tian-Xin Li
Quanfeng Lu
Wenqi Shao
Junjun He
Yu Qiao
Ping Luo
ELM
LM&MA
32
51
0
14 Feb 2024
Visually Dehallucinative Instruction Generation
Sungguk Cha
Jusung Lee
Younghyun Lee
Cheoljong Yang
MLLM
22
5
0
13 Feb 2024
A Benchmark for Multi-modal Foundation Models on Low-level Vision: from Single Images to Pairs
Zicheng Zhang
Haoning Wu
Erli Zhang
Guangtao Zhai
Weisi Lin
VLM
24
8
0
11 Feb 2024
Cacophony: An Improved Contrastive Audio-Text Model
Ge Zhu
Jordan Darefsky
Zhiyao Duan
AuLLM
43
11
0
10 Feb 2024
GPTs Are Multilingual Annotators for Sequence Generation Tasks
Juhwan Choi
Eunju Lee
Kyohoon Jin
Youngbin Kim
25
10
0
08 Feb 2024
Question Aware Vision Transformer for Multimodal Reasoning
Roy Ganz
Yair Kittenplon
Aviad Aberdam
Elad Ben Avraham
Oren Nuriel
Shai Mazor
Ron Litman
42
20
0
08 Feb 2024
Get What You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models
Senmao Li
J. Weijer
Taihang Hu
Fahad Shahbaz Khan
Qibin Hou
Yaxing Wang
Jian Yang
DiffM
45
27
0
08 Feb 2024
Variance Alignment Score: A Simple But Tough-to-Beat Data Selection Method for Multimodal Contrastive Learning
Yiping Wang
Yifang Chen
Wendan Yan
Kevin G. Jamieson
S. Du
28
5
0
03 Feb 2024
Can MLLMs Perform Text-to-Image In-Context Learning?
Yuchen Zeng
Wonjun Kang
Yicong Chen
Hyung Il Koo
Kangwook Lee
MLLM
33
9
0
02 Feb 2024
SCO-VIST: Social Interaction Commonsense Knowledge-based Visual Storytelling
Eileen Wang
S. Han
Josiah Poon
11
5
0
01 Feb 2024
EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning
Jaeyeon Kim
Jaeyoon Jung
Jinjoo Lee
Sang Hoon Woo
CLIP
VLM
23
21
0
31 Jan 2024
EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain
Wei Zhang
Miaoxin Cai
Tong Zhang
Zhuang Yin
Xuerui Mao
24
88
0
30 Jan 2024
Towards Unified Interactive Visual Grounding in The Wild
Jie Xu
Hanbo Zhang
Qingyi Si
Yifeng Li
Xuguang Lan
Tao Kong
LM&Ro
30
5
0
30 Jan 2024
M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining
Qingpei Guo
Furong Xu
Hanxiao Zhang
Wang Ren
Ziping Ma
Lin Ju
Jian Wang
Jingdong Chen
Ming Yang
VLM
MLLM
27
2
0
29 Jan 2024
Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA
Yue Fan
Jing Gu
Kai-Qing Zhou
Qianqi Yan
Shan Jiang
Ching-Chen Kuo
Xinze Guan
Xin Eric Wang
29
7
0
29 Jan 2024
MM-LLMs: Recent Advances in MultiModal Large Language Models
Duzhen Zhang
Yahan Yu
Jiahua Dong
Chenxing Li
Dan Su
Chenhui Chu
Dong Yu
OffRL
LRM
52
179
0
24 Jan 2024
Common-Sense Bias Modeling for Classification Tasks
Miao Zhang
Zee Fryer
Ben Colman
Ali Shahriyari
Gaurav Bharaj
30
0
0
24 Jan 2024
Enhancing Object Detection Performance for Small Objects through Synthetic Data Generation and Proportional Class-Balancing Technique: A Comparative Study in Industrial Scenarios
Jibinraj Antony
Vinit Hegiste
Ali Nazeri
Hooman Tavakoli
Snehal Walunj
Christiane Plociennik
Martin Ruskowski
31
2
0
23 Jan 2024
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
Boyuan Chen
Zhuo Xu
Sean Kirmani
Brian Ichter
Danny Driess
Pete Florence
Dorsa Sadigh
Leonidas J. Guibas
Fei Xia
LRM
ReLM
49
206
0
22 Jan 2024
Text-to-Image Cross-Modal Generation: A Systematic Review
Maciej Żelaszczyk
Jacek Mańdziuk
35
3
0
21 Jan 2024
LLMRA: Multi-modal Large Language Model based Restoration Assistant
Xiaoyu Jin
Yuan Shi
Bin Xia
Wenming Yang
36
4
0
21 Jan 2024
CBVS: A Large-Scale Chinese Image-Text Benchmark for Real-World Short Video Search Scenarios
Xiangshuo Qiao
Xianxin Li
Xiaozhe Qu
Jie M. Zhang
Yang Liu
Yu Luo
Cihang Jin
Jin Ma
VLM
33
0
0
19 Jan 2024
Supervised Fine-tuning in turn Improves Visual Foundation Models
Xiaohu Jiang
Yixiao Ge
Yuying Ge
Dachuan Shi
Chun Yuan
Ying Shan
VLM
CLIP
46
8
0
18 Jan 2024
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
Changyao Tian
Xizhou Zhu
Yuwen Xiong
Weiyun Wang
Zhe Chen
...
Tong Lu
Jie Zhou
Hongsheng Li
Yu Qiao
Jifeng Dai
AuLLM
85
42
0
18 Jan 2024
Beyond Anti-Forgetting: Multimodal Continual Instruction Tuning with Positive Forward Transfer
Junhao Zheng
Qianli Ma
Zhen Liu
Binquan Wu
Hu Feng
CLL
26
14
0
17 Jan 2024
COCO is "ALL" You Need for Visual Instruction Fine-tuning
Xiaotian Han
Yiqi Wang
Bohan Zhai
Quanzeng You
Hongxia Yang
VLM
MLLM
33
2
0
17 Jan 2024
Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding
Yatong Bai
Utsav Garg
Apaar Shanker
Haoming Zhang
Samyak Parajuli
...
Eugenia D Fomitcheva
E. Branson
Aerin Kim
Somayeh Sojoudi
Kyunghyun Cho
16
2
0
09 Jan 2024
CaMML: Context-Aware Multimodal Learner for Large Models
Yixin Chen
Shuai Zhang
Boran Han
Tong He
Bo Li
VLM
32
4
0
06 Jan 2024
Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models
Xin He
Longhui Wei
Lingxi Xie
Qi Tian
43
8
0
06 Jan 2024
4DGen: Grounded 4D Content Generation with Spatial-temporal Consistency
Yuyang Yin
Dejia Xu
Zhangyang Wang
Yao-Min Zhao
Yunchao Wei
3DGS
47
72
0
28 Dec 2023
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
Jiasen Lu
Christopher Clark
Sangho Lee
Zichen Zhang
Savya Khosla
Ryan Marten
Derek Hoiem
Aniruddha Kembhavi
VLM
MLLM
37
144
0
28 Dec 2023
Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey
Jiaxing Huang
Jingyi Zhang
Kai Jiang
Han Qiu
Shijian Lu
41
22
0
27 Dec 2023
Cloud-Device Collaborative Learning for Multimodal Large Language Models
Guanqun Wang
Jiaming Liu
Chenxuan Li
Junpeng Ma
Yuan Zhang
...
Kevin Zhang
Maurice Chong
Ray Zhang
Yijiang Liu
Shanghang Zhang
41
7
0
26 Dec 2023