ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

21 November 2023
Lin Chen
Jinsong Li
Xiao-wen Dong
Pan Zhang
Conghui He
Jiaqi Wang
Feng Zhao
Dahua Lin
    MLLM
    VLM

Papers citing "ShareGPT4V: Improving Large Multi-Modal Models with Better Captions"

50 / 467 papers shown
ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration
Haozhan Shen
Kangjia Zhao
Tiancheng Zhao
Ruochen Xu
Zilun Zhang
Mingwei Zhu
Jianwei Yin
95
4
0
25 Nov 2024
FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression
Yuke Zhu
Chi Xie
Shuang Liang
Bo Zheng
Sheng Guo
75
8
0
21 Nov 2024
DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving
Xianda Guo
Ruijun Zhang
Yiqun Duan
Yuhang He
Chenming Zhang
Shuai Liu
Long Chen
LRM
77
11
0
20 Nov 2024
PSA-VLM: Enhancing Vision-Language Model Safety through Progressive Concept-Bottleneck-Driven Alignment
Zhendong Liu
Yuanbi Nie
Yingshui Tan
Xiangyu Yue
Qiushi Cui
Chongjun Wang
Xiaoyong Zhu
Bo Zheng
68
0
0
18 Nov 2024
Mitigating Hallucination in Multimodal Large Language Model via Hallucination-targeted Direct Preference Optimization
Yuhan Fu
Ruobing Xie
X. Sun
Zhanhui Kang
Xirong Li
MLLM
33
3
0
15 Nov 2024
Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal Models
Wei Wang
Z. Li
Qi Xu
Linfeng Li
Yiqing Cai
Botian Jiang
Hang Song
Xingcan Hu
Pengyu Wang
Li Xiao
29
1
0
14 Nov 2024
Multimodal Instruction Tuning with Hybrid State Space Models
Jianing Zhou
Han Li
Shuai Zhang
Ning Xie
Ruijie Wang
Xiaohan Nie
Sheng Liu
Lingyun Wang
38
0
0
13 Nov 2024
Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions
Moran Yanuka
Assaf Ben-Kish
Yonatan Bitton
Idan Szpektor
Raja Giryes
VLM
47
2
0
13 Nov 2024
HumanVLM: Foundation for Human-Scene Vision-Language Model
Dawei Dai
Xu Long
Li Yutang
Zhang YuanHui
Shuyin Xia
VLM
MLLM
37
1
0
05 Nov 2024
SV-RAG: LoRA-Contextualizing Adaptation of MLLMs for Long Document Understanding
Jian Chen
R. Zhang
Yufan Zhou
Tong Yu
Franck Dernoncourt
J. Gu
Ryan Rossi
Changyou Chen
Tong Sun
34
0
0
02 Nov 2024
PIP-MM: Pre-Integrating Prompt Information into Visual Encoding via Existing MLLM Structures
Tianxiang Wu
Minxin Nie
Ziqiang Cao
MLLM
40
0
0
30 Oct 2024
Diffusion Beats Autoregressive: An Evaluation of Compositional Generation in Text-to-Image Models
Arash Marioriyad
Parham Rezaei
M. Baghshah
M. Rohban
CoGe
133
0
0
30 Oct 2024
GiVE: Guiding Visual Encoder to Perceive Overlooked Information
Junjie Li
Jianghong Ma
Xiaofeng Zhang
Yuhang Li
Jianyang Shi
33
0
0
26 Oct 2024
Rethinking Visual Dependency in Long-Context Reasoning for Large Vision-Language Models
Yucheng Zhou
Zhi Rao
Jun Wan
Jianbing Shen
LRM
23
17
0
25 Oct 2024
COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training
Haocheng Xi
Han Cai
Ligeng Zhu
Y. Lu
Kurt Keutzer
Jianfei Chen
Song Han
MQ
63
9
0
25 Oct 2024
Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data
Shuhao Gu
Jialing Zhang
Siyuan Zhou
Kevin Yu
Zhaohu Xing
...
Yufeng Cui
Xinlong Wang
Yaoqi Liu
Fangxiang Feng
Guang Liu
SyDa
VLM
MLLM
30
17
0
24 Oct 2024
WAFFLE: Multi-Modal Model for Automated Front-End Development
Shanchao Liang
Nan Jiang
Shangshu Qian
Lin Tan
17
0
0
24 Oct 2024
Robust Watermarking Using Generative Priors Against Image Editing: From Benchmarking to Advances
Shilin Lu
Zihan Zhou
Jiayou Lu
Yuanzhi Zhu
A. Kong
WIGM
82
10
0
24 Oct 2024
R-CoT: Reverse Chain-of-Thought Problem Generation for Geometric Reasoning in Large Multimodal Models
Linger Deng
Yuliang Liu
Bohan Li
Dongliang Luo
Liang Wu
...
Ziyang Zhang
Gang Zhang
Errui Ding
Yingying Zhu
Xiang Bai
ReLM
3DV
LRM
26
10
0
23 Oct 2024
JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark for Culture-aware Evaluation
Shota Onohara
Atsuyuki Miyai
Yuki Imajuku
Kazuki Egashira
Jeonghun Baek
Xiang Yue
Graham Neubig
Kiyoharu Aizawa
OSLM
103
1
0
22 Oct 2024
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
Long Xing
Qidong Huang
Xiaoyi Dong
Jiajie Lu
Pan Zhang
...
Yuhang Cao
Conghui He
Jiaqi Wang
Feng Wu
Dahua Lin
VLM
45
26
0
22 Oct 2024
Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance
Zhangwei Gao
Zhe Chen
Erfei Cui
Yiming Ren
Weiyun Wang
...
Lewei Lu
Tong Lu
Yu Qiao
Jifeng Dai
Wenhai Wang
VLM
64
24
0
21 Oct 2024
Beyond Filtering: Adaptive Image-Text Quality Enhancement for MLLM Pretraining
Han Huang
Yuqi Huo
Zijia Zhao
Haoyu Lu
Shu Wu
B. Wang
Qiang Liu
Weipeng Chen
Liang Wang
VLM
25
1
0
21 Oct 2024
Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models
Yufei Zhan
Hongyin Zhao
Yousong Zhu
Fan Yang
Ming Tang
Jinqiao Wang
MLLM
43
1
0
21 Oct 2024
Croc: Pretraining Large Multimodal Models with Cross-Modal Comprehension
Yin Xie
Kaicheng Yang
Ninghua Yang
Weimo Deng
Xiangzi Dai
...
Yumeng Wang
Xiang An
Yongle Zhao
Ziyong Feng
Jiankang Deng
MLLM
VLM
40
1
0
18 Oct 2024
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples
Baiqi Li
Zhiqiu Lin
Wenxuan Peng
Jean de Dieu Nyandwi
Daniel Jiang
Zixian Ma
Simran Khanuja
Ranjay Krishna
Graham Neubig
Deva Ramanan
AAML
CoGe
VLM
69
21
0
18 Oct 2024
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
Chengyue Wu
Xiaokang Chen
Z. F. Wu
Yiyang Ma
Xingchao Liu
...
Wen Liu
Zhenda Xie
Xingkai Yu
Chong Ruan
Ping Luo
AI4TS
57
74
0
17 Oct 2024
Harnessing Webpage UIs for Text-Rich Visual Understanding
Junpeng Liu
Tianyue Ou
Yifan Song
Yuxiao Qu
Wai Lam
Chenyan Xiong
Wenhu Chen
Graham Neubig
Xiang Yue
74
5
0
17 Oct 2024
Improving Multi-modal Large Language Model through Boosting Vision Capabilities
Yanpeng Sun
H. Zhang
Qiang Chen
Xinyu Zhang
Nong Sang
Gang Zhang
Jingdong Wang
Zechao Li
29
5
0
17 Oct 2024
A Survey on Data Synthesis and Augmentation for Large Language Models
Ke Wang
Jiahui Zhu
Minjie Ren
Z. Liu
Shiwei Li
...
Chenkai Zhang
Xiaoyu Wu
Qiqi Zhan
Qingjie Liu
Yunhong Wang
SyDa
38
15
0
16 Oct 2024
VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI
Sijie Cheng
Kechen Fang
Yangyang Yu
Sicheng Zhou
B. Li
Ye Tian
Tingguang Li
Lei Han
Yang Janet Liu
39
8
0
15 Oct 2024
MCTBench: Multimodal Cognition towards Text-Rich Visual Scenes Benchmark
Bin Shan
Xiang Fei
Wei Shi
An-Lan Wang
Guozhi Tang
Lei Liao
Jingqun Tang
Xiang Bai
Can Huang
VLM
25
5
0
15 Oct 2024
VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models
Xiaohan Lan
Yitian Yuan
Zequn Jie
Lin Ma
VLM
21
2
0
15 Oct 2024
Improving Long-Text Alignment for Text-to-Image Diffusion Models
Luping Liu
Chao Du
Tianyu Pang
Zehan Wang
Chongxuan Li
Dong Xu
VLM
51
5
0
15 Oct 2024
Adapt-∞: Scalable Continual Multimodal Instruction Tuning via Dynamic Data Selection
A. Maharana
Jaehong Yoon
Tianlong Chen
Mohit Bansal
31
0
0
14 Oct 2024
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models
Peng Xia
Siwei Han
Shi Qiu
Yiyang Zhou
Zhaoyang Wang
...
Chenhang Cui
Mingyu Ding
Linjie Li
Lijuan Wang
Huaxiu Yao
52
10
0
14 Oct 2024
SynFER: Towards Boosting Facial Expression Recognition with Synthetic Data
Xilin He
Cheng Luo
Xiaole Xian
Bing Li
Siyang Song
Muhammad Haris Khan
Weicheng Xie
L. Shen
Zongyuan Ge
36
4
0
13 Oct 2024
TULIP: Token-length Upgraded CLIP
Ivona Najdenkoska
Mohammad Mahdi Derakhshani
Yuki M. Asano
N. V. Noord
Marcel Worring
Cees G. M. Snoek
VLM
46
3
0
13 Oct 2024
VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment
Lei Li
Zhihui Xie
Mukai Li
Shunian Chen
Peiyi Wang
L. Chen
Yazheng Yang
Benyou Wang
Lingpeng Kong
Q. Liu
VLM
ALM
34
17
0
12 Oct 2024
Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models
Qin Liu
Chao Shang
Ling Liu
Nikolaos Pappas
Jie Ma
Neha Anna John
Srikanth Doss Kadarundalagi Raghuram Doss
Lluís Marquez
Miguel Ballesteros
Yassine Benajiba
34
4
0
11 Oct 2024
ElasticTok: Adaptive Tokenization for Image and Video
Wilson Yan
Matei A. Zaharia
Volodymyr Mnih
Pieter Abbeel
Aleksandra Faust
Hao Liu
VGen
43
6
0
10 Oct 2024
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
Gen Luo
Xue Yang
Wenhan Dou
Zhaokai Wang
Jifeng Dai
Yu Qiao
Xizhou Zhu
VLM
MLLM
62
25
0
10 Oct 2024
Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate
Qidong Huang
Xiaoyi Dong
Pan Zhang
Yuhang Zang
Yuhang Cao
Jiaqi Wang
Dahua Lin
Weiming Zhang
Nenghai Yu
49
5
0
09 Oct 2024
From Pixels to Tokens: Revisiting Object Hallucinations in Large Vision-Language Models
Yuying Shang
Xinyi Zeng
Yutao Zhu
Xiao Yang
Zhengwei Fang
Jingyuan Zhang
Jiawei Chen
Zinan Liu
Yu Tian
VLM
MLLM
100
1
0
09 Oct 2024
HERM: Benchmarking and Enhancing Multimodal LLMs for Human-Centric Understanding
Keliang Li
Zaifei Yang
Jiahe Zhao
Hongze Shen
Ruibing Hou
Hong Chang
Shiguang Shan
Xilin Chen
VLM
26
0
0
09 Oct 2024
MM-Ego: Towards Building Egocentric Multimodal LLMs for Video QA
Hanrong Ye
Haotian Zhang
Erik Daxberger
Lin Chen
Zongyu Lin
...
Haoxuan You
Dan Xu
Zhe Gan
Jiasen Lu
Yinfei Yang
EgoV
MLLM
77
12
0
09 Oct 2024
Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See
Phu Pham
Kun Wan
Yu-Jhe Li
Zeliang Zhang
Daniel Miranda
Ajinkya Kale
Chenliang Xu
29
5
0
08 Oct 2024
EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment
Yifei Xing
Xiangyuan Lan
Ruiping Wang
D. Jiang
Wenjun Huang
Qingfang Zheng
Yaowei Wang
Mamba
33
0
0
08 Oct 2024
R-Bench: Are your Large Multimodal Model Robust to Real-world Corruptions?
Chunyi Li
J. Zhang
Zicheng Zhang
H. Wu
Yuan Tian
...
Guo Lu
Xiaohong Liu
Xiongkuo Min
Weisi Lin
Guangtao Zhai
AAML
39
3
0
07 Oct 2024
LoTLIP: Improving Language-Image Pre-training for Long Text Understanding
Wei Wu
Kecheng Zheng
Shuailei Ma
Fan Lu
Yuxin Guo
Yifei Zhang
Wei Chen
Qingpei Guo
Yujun Shen
Zheng-Jun Zha
VLM
30
9
0
07 Oct 2024