Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2311.06242
Cited By
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
10 November 2023
Bin Xiao
Haiping Wu
Weijian Xu
Xiyang Dai
Houdong Hu
Yumao Lu
Michael Zeng
Ce Liu
Lu Yuan
VLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks"
50 / 105 papers shown
Title
SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation
Zekun Qi
Wenyao Zhang
Yufei Ding
Runpei Dong
Xinqiang Yu
...
Xin Jin
Kaisheng Ma
Zhizheng Zhang
He Wang
Li Yi
LM&Ro
131
3
0
18 Feb 2025
Understanding Classifier-Free Guidance: High-Dimensional Theory and Non-Linear Generalizations
Krunoslav Lehman Pavasovic
Jakob Verbeek
Giulio Biroli
Marc Mézard
48
0
0
11 Feb 2025
DynamicEarth: How Far are We from Open-Vocabulary Change Detection?
Kaiyu Li
Xiangyong Cao
Yupeng Deng
Chao Pang
Zepeng Xin
Deyu Meng
Zhi Wang
ObjD
69
1
0
22 Jan 2025
SST-EM: Advanced Metrics for Evaluating Semantic, Spatial and Temporal Aspects in Video Editing
Varun Biyyala
Bharat Chanderprakash Kathuria
Jialu Li
Youshan Zhang
50
0
0
13 Jan 2025
Grasp What You Want: Embodied Dexterous Grasping System Driven by Your Voice
Junliang Li
Kai Ye
Haolan Kang
Mingxuan Liang
Yuhang Wu
Zhenhua Liu
Huiping Zhuang
Rui Huang
Yongquan Chen
59
0
0
14 Dec 2024
CATALOG: A Camera Trap Language-guided Contrastive Learning Model
Julian D. Santamaria
Claudia Isaza
Jhony H. Giraldo
71
0
0
14 Dec 2024
Foundation Models and Adaptive Feature Selection: A Synergistic Approach to Video Question Answering
Sai Bhargav Rongali
M. Cui
Ankit Jha
Neha Bhargava
Saurabh Prasad
Biplab Banerjee
69
0
0
12 Dec 2024
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
Jiuhai Chen
Jianwei Yang
Haiping Wu
Dianqi Li
Jianfeng Gao
Tianyi Zhou
Bin Xiao
VLM
58
4
0
05 Dec 2024
Pinco: Position-induced Consistent Adapter for Diffusion Transformer in Foreground-conditioned Inpainting
Guangben Lu
Yuzhen Du
Zhimin Sun
Ran Yi
Yifan Qi
Yizhe Tang
Tianyi Wang
Lizhuang Ma
Fangyuan Zou
DiffM
72
1
0
05 Dec 2024
Schedule On the Fly: Diffusion Time Prediction for Faster and Better Image Generation
Zilyu Ye
Zhiyang Chen
Tiancheng Li
Zemin Huang
Weijian Luo
Guo-jun Qi
DiffM
72
4
0
02 Dec 2024
ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
Qing Jiang
Gen Luo
Yuqin Yang
Yuda Xiong
Yihao Chen
Zhaoyang Zeng
Tianhe Ren
Lei Zhang
VLM
LRM
97
6
0
27 Nov 2024
Interpreting Object-level Foundation Models via Visual Precision Search
Ruoyu Chen
Siyuan Liang
Jingzhi Li
Shiming Liu
Maosen Li
Zheng Huang
Hua Zhang
Xiaochun Cao
FAtt
82
3
0
25 Nov 2024
GIFT: A Framework for Global Interpretable Faithful Textual Explanations of Vision Classifiers
Éloi Zablocki
Valentin Gerard
Amaia Cardiel
Eric Gaussier
Matthieu Cord
Eduardo Valle
66
0
0
23 Nov 2024
Reward Fine-Tuning Two-Step Diffusion Models via Learning Differentiable Latent-Space Surrogate Reward
Zhiwei Jia
Yuesong Nan
Huixi Zhao
Gengdai Liu
EGVM
84
0
0
22 Nov 2024
MLAN: Language-Based Instruction Tuning Improves Zero-Shot Generalization of Multimodal Large Language Models
Jianhong Tu
Zhuohao Ni
Nicholas Crispino
Zihao Yu
Michael Bendersky
...
Ruoxi Jia
Xin Liu
Lingjuan Lyu
Dawn Song
Chenguang Wang
VLM
MLLM
49
0
0
15 Nov 2024
Boosting Latent Diffusion with Perceptual Objectives
Tariq Berrada
Pietro Astolfi
Jakob Verbeek
Melissa Hall
Marton Havasi
M. Drozdzal
Yohann Benchetrit
Adriana Romero Soriano
Karteek Alahari
38
0
0
06 Nov 2024
Fine-Tuning Vision-Language Model for Automated Engineering Drawing Information Extraction
Muhammad Tayyab Khan
Lequn Chen
Ye Han Ng
Wenhe Feng
Nicholas Yew Jin Tan
Seung Ki Moon
19
2
0
06 Nov 2024
Public Domain 12M: A Highly Aesthetic Image-Text Dataset with Novel Governance Mechanisms
Jordan Meyer
Nick Padgett
Cullen Miller
Laura Exline
29
4
0
30 Oct 2024
Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data
Shuhao Gu
Jialing Zhang
Siyuan Zhou
Kevin Yu
Zhaohu Xing
...
Yufeng Cui
Xinlong Wang
Yaoqi Liu
Fangxiang Feng
Guang Liu
SyDa
VLM
MLLM
30
17
0
24 Oct 2024
Lightweight Neural App Control
Filippos Christianos
Georgios Papoudakis
Thomas Coste
Jianye Hao
Jun Wang
Kun Shao
LM&Ro
44
4
0
23 Oct 2024
Frontiers in Intelligent Colonoscopy
Ge-Peng Ji
Jingyi Liu
Peng-Tao Xu
Nick Barnes
F. Khan
Salman Khan
Deng-Ping Fan
41
4
0
22 Oct 2024
Few-shot target-driven instance detection based on open-vocabulary object detection models
Ben Crulis
Barthélémy Serres
Cyril de Runz
Gilles Venturini
VLM
ObjD
19
0
0
21 Oct 2024
CtrlSynth: Controllable Image Text Synthesis for Data-Efficient Multimodal Learning
Qingqing Cao
Mahyar Najibi
Sachin Mehta
CLIP
DiffM
22
1
0
15 Oct 2024
TinyClick: Single-Turn Agent for Empowering GUI Automation
Pawel Pawlowski
Krystian Zawistowski
Wojciech Lapacz
Marcin Skorupa
Adam Wiacek
Sebastien Postansque
Jakub Hoscilowicz
MLLM
LLMAG
LRM
35
6
0
09 Oct 2024
Is What You Ask For What You Get? Investigating Concept Associations in Text-to-Image Models
Salma Abdel Magid
Weiwei Pan
Simon Warchol
Grace Guo
Junsik Kim
Mahia Rahman
Hanspeter Pfister
84
0
0
06 Oct 2024
Helpful DoggyBot: Open-World Object Fetching using Legged Robots and Vision-Language Models
Qi Wu
Zipeng Fu
Xuxin Cheng
Xiaolong Wang
Chelsea Finn
LM&Ro
26
8
0
30 Sep 2024
ComiCap: A VLMs pipeline for dense captioning of Comic Panels
Emanuele Vivoli
Niccoló Biondi
Marco Bertini
Dimosthenis Karatzas
33
3
0
24 Sep 2024
LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Models for Referring Expression Comprehension
Amaia Cardiel
Éloi Zablocki
Oriane Siméoni
Elias Ramzi
Matthieu Cord
VLM
23
0
0
18 Sep 2024
Synthetic data augmentation for robotic mobility aids to support blind and low vision people
Hochul Hwang
Krisha Adhikari
Satya Shodhaka
Donghyun Kim
21
0
0
17 Sep 2024
Playground v3: Improving Text-to-Image Alignment with Deep-Fusion Large Language Models
Bingchen Liu
Ehsan Akhgari
Alexander Visheratin
Aleks Kamko
Linmiao Xu
Shivam Shrirao
Joao Souza
Suhail Doshi
Daiqing Li
Daiqing Li
DiffM
MLLM
16
46
0
16 Sep 2024
AceParse: A Comprehensive Dataset with Diverse Structured Texts for Academic Literature Parsing
Huawei Ji
Cheng Deng
Bo Xue
Zhouyang Jin
Jiaxin Ding
Xiaoying Gan
Luoyi Fu
Xinbing Wang
Chenghu Zhou
17
0
0
16 Sep 2024
SOOD-ImageNet: a Large-Scale Dataset for Semantic Out-Of-Distribution Image Classification and Semantic Segmentation
Alberto Bacchin
Davide Allegro
Stefano Ghidoni
Emanuele Menegatti
26
1
0
02 Sep 2024
Training-Free Time-Series Anomaly Detection: Leveraging Image Foundation Models
Nobuo Namura
Yuma Ichikawa
AI4TS
13
1
0
27 Aug 2024
Building and better understanding vision-language models: insights and future directions
Hugo Laurençon
Andrés Marafioti
Victor Sanh
Léo Tronchon
VLM
34
60
0
22 Aug 2024
PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects
Junyi Li
Junfeng Wu
Weizhi Zhao
Song Bai
Xiang Bai
23
0
0
23 Jul 2024
An Empirical Comparison of Video Frame Sampling Methods for Multi-Modal RAG Retrieval
Mahesh Kandhare
Thibault Gisselbrecht
35
4
0
22 Jul 2024
VideoGameBunny: Towards vision assistants for video games
Mohammad Reza Taesiri
C. Bezemer
VLM
MLLM
33
2
0
21 Jul 2024
Words2Contact: Identifying Support Contacts from Verbal Instructions Using Foundation Models
Dionis Totsila
Quentin Rouxel
Jean-Baptiste Mouret
S. Ivaldi
36
1
0
19 Jul 2024
Precision at Scale: Domain-Specific Datasets On-Demand
Jesús M. Rodríguez-de-Vera
Imanol G. Estepa
Ignacio Sarasúa
Bhalaji Nagarajan
P. Radeva
21
2
0
03 Jul 2024
First Multi-Dimensional Evaluation of Flowchart Comprehension for Multimodal Large Language Models
Enming Zhang
Ruobing Yao
Huanyong Liu
Junhui Yu
Jiale Wang
ELM
LRM
37
0
0
14 Jun 2024
Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement
Xiyao Wang
Jiuhai Chen
Zhaoyang Wang
Yuhang Zhou
Yiyang Zhou
...
Tianyi Zhou
Tom Goldstein
Parminder Bhatia
Furong Huang
Cao Xiao
55
33
0
24 May 2024
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Marah Abdin
Sam Ade Jacobs
A. A. Awan
J. Aneja
Ahmed Hassan Awadallah
...
Li Lyna Zhang
Yi Zhang
Yue Zhang
Yunan Zhang
Xiren Zhou
LRM
ALM
50
995
0
22 Apr 2024
LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning
Junchi Wang
Lei Ke
MLLM
LRM
VLM
36
18
0
12 Apr 2024
LocCa: Visual Pretraining with Location-aware Captioners
Bo Wan
Michael Tschannen
Yongqin Xian
Filip Pavetić
Ibrahim M. Alabdulmohsin
Xiao Wang
André Susano Pinto
Andreas Steiner
Lucas Beyer
Xiao-Qi Zhai
VLM
35
5
0
28 Mar 2024
Towards Graph Foundation Models for Personalization
Andreas Damianou
Francesco Fabbri
Paul Gigioli
Marco De Nadai
Alice Wang
Enrico Palumbo
M. Lalmas
AI4CE
16
8
0
12 Mar 2024
Good at captioning, bad at counting: Benchmarking GPT-4V on Earth observation data
Chenhui Zhang
Sherrie Wang
19
17
0
31 Jan 2024
GSVA: Generalized Segmentation via Multimodal Large Language Models
Zhuofan Xia
Dongchen Han
Yizeng Han
Xuran Pan
Shiji Song
Gao Huang
VLM
23
40
0
15 Dec 2023
General Object Foundation Model for Images and Videos at Scale
Junfeng Wu
Yi-Xin Jiang
Qihao Liu
Zehuan Yuan
Xiang Bai
Song Bai
VOS
VLM
25
38
0
14 Dec 2023
When Foundation Model Meets Federated Learning: Motivations, Challenges, and Future Directions
Weiming Zhuang
Chen Chen
Lingjuan Lyu
C. L. P. Chen
Yaochu Jin
Lingjuan Lyu
AIFin
AI4CE
83
84
0
27 Jun 2023
A Survey on Segment Anything Model (SAM): Vision Foundation Model Meets Prompt Engineering
Chaoning Zhang
Fachrina Dewi Puspitasari
Sheng Zheng
Chenghao Li
Yu Qiao
...
Caiyan Qin
François Rameau
Lik-Hang Lee
Sung-Ho Bae
Choong Seon Hong
VLM
76
61
0
12 May 2023
Previous
1
2
3
Next