2311.06242
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
10 November 2023
Bin Xiao
Haiping Wu
Weijian Xu
Xiyang Dai
Houdong Hu
Yumao Lu
Michael Zeng
Ce Liu
Lu Yuan
VLM
Papers citing
"Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks"
50 / 105 papers shown
Title
Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI
Benjamin Raphael Ernhofer
Daniil Prokhorov
Jannica Langner
Dominik Bollmann
25
0
0
09 May 2025
Understanding and Mitigating Toxicity in Image-Text Pretraining Datasets: A Case Study on LLaVA
Karthik Reddy Kanjula
Surya Guthikonda
Nahid Alam
Shayekh Bin Islam
16
0
0
09 May 2025
HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation
Teng Hu
Zhentao Yu
Zhengguang Zhou
Sen Liang
Yuan Zhou
Qin Lin
Qinglin Lu
DiffM
VGen
50
0
0
07 May 2025
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
X. Zhang
Jintao Guo
Shanshan Zhao
Minghao Fu
Lunhao Duan
Guo-Hua Wang
Qing-Guo Chen
Zhao Xu
Weihua Luo
Kaifu Zhang
DiffM
57
0
0
05 May 2025
Compositional Image-Text Matching and Retrieval by Grounding Entities
Madhukar Reddy Vongala
Saurabh Srivastava
Jana Kosecka
CLIP
CoGe
VLM
34
0
0
04 May 2025
RESAnything: Attribute Prompting for Arbitrary Referring Segmentation
Ruiqi Wang
Hao Zhang
VLM
52
0
0
03 May 2025
Grounding Task Assistance with Multimodal Cues from a Single Demonstration
Gabriel Sarch
Balasaravanan Thoravi Kumaravel
Sahithya Ravi
Vibhav Vineet
A. D. Wilson
42
0
0
02 May 2025
RoboGround: Robotic Manipulation with Grounded Vision-Language Priors
Haifeng Huang
Xinyi Chen
Y. Chen
H. Li
Xiaoshen Han
Z. Wang
Tai Wang
Jiangmiao Pang
Zhou Zhao
LM&Ro
75
0
0
30 Apr 2025
An Evaluation of a Visual Question Answering Strategy for Zero-shot Facial Expression Recognition in Still Images
Modesto Castrillón-Santana
Oliverio J. Santana
David Freire-Obregón
Daniel Hernández-Sosa
J. Lorenzo-Navarro
48
0
0
30 Apr 2025
Neural network task specialization via domain constraining
Roman Malashin
Daniil Ilyukhin
49
0
0
28 Apr 2025
Anyprefer: An Agentic Framework for Preference Data Synthesis
Yiyang Zhou
Z. Wang
Tianle Wang
Shangyu Xing
Peng Xia
...
Chetan Bansal
Weitong Zhang
Ying Wei
Mohit Bansal
Huaxiu Yao
54
0
0
27 Apr 2025
Step1X-Edit: A Practical Framework for General Image Editing
S. Liu
Yucheng Han
Peng Xing
Fukun Yin
Rui Wang
...
Yibo Zhu
Binxing Jiao
X. Zhang
Gang Yu
Daxin Jiang
DiffM
93
2
0
24 Apr 2025
Physically Consistent Humanoid Loco-Manipulation using Latent Diffusion Models
Ilyass Taouil
Haizhou Zhao
Angela Dai
Majid Khadiv
DiffM
46
0
0
23 Apr 2025
UFO2: The Desktop AgentOS
Chaoyun Zhang
He Huang
Chiming Ni
J. Mu
Si Qin
...
Minghua Ma
Jian-Guang Lou
Qingwei Lin
Saravan Rajmohan
Dongmei Zhang
LLMAG
34
0
0
20 Apr 2025
Logits DeConfusion with CLIP for Few-Shot Learning
Shuo Li
F. Liu
Zehua Hao
X. Wang
Lingling Li
X. Liu
Puhua Chen
Wenping Ma
VLM
47
0
0
16 Apr 2025
SegEarth-R1: Geospatial Pixel Reasoning via Large Language Model
Kaiyu Li
Zepeng Xin
Li Pang
Chao Pang
Yupeng Deng
Jing Yao
Guisong Xia
Deyu Meng
Zhi Wang
Xiangyong Cao
VLM
LRM
37
0
0
13 Apr 2025
Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions
Tommaso Galliena
Tommaso Apicella
Stefano Rosa
Pietro Morerio
Alessio Del Bue
Lorenzo Natale
32
0
0
11 Apr 2025
AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations
Junli Liu
Qizhi Chen
Z. Wang
Yiwen Tang
Yiting Zhang
Chi Yan
Dong Wang
X. Li
Bin Zhao
CoGe
47
0
0
10 Apr 2025
Ternarization of Vision Language Models for use on edge devices
Ben Crulis
Cyril de Runz
Barthélémy Serres
Gilles Venturini
VLM
50
0
0
07 Apr 2025
REVEAL: Relation-based Video Representation Learning for Video-Question-Answering
Sofian Chaybouti
Walid Bousselham
Moritz Wolter
Hilde Kuehne
41
0
0
07 Apr 2025
VISTA-OCR: Towards generative and interactive end to end OCR models
Laziz Hamdi
Amine Tamasna
Pascal Boisson
Thierry Paquet
33
0
0
04 Apr 2025
STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection
Divya Velayudhan
A. Ahmed
Mohamad Alansari
Neha Gour
Abderaouf Behouch
...
Muzammal Naseer
Juergen Gall
Mohammed Bennamoun
Ernesto Damiani
N. Werghi
37
0
0
03 Apr 2025
A$^\text{T}$A: Adaptive Transformation Agent for Text-Guided Subject-Position Variable Background Inpainting
Yizhe Tang
Zhimin Sun
Yuzhen Du
Ran Yi
Guangben Lu
T. Hu
Luying Li
Lizhuang Ma
Fangyuan Zou
DiffM
35
0
0
02 Apr 2025
TurboFill: Adapting Few-step Text-to-image Model for Fast Image Inpainting
Liangbin Xie
D. Pakhomov
Zhonghao Wang
Zongze Wu
Ziyan Chen
...
Haitian Zheng
Zhifei Zhang
Zhe Lin
Jiantao Zhou
Chao Dong
35
0
0
01 Apr 2025
PRISM-0: A Predicate-Rich Scene Graph Generation Framework for Zero-Shot Open-Vocabulary Tasks
Abdelrahman Elskhawy
Mengze Li
Nassir Navab
Benjamin Busam
VLM
46
0
0
01 Apr 2025
IntrinsiX: High-Quality PBR Generation using Image Priors
Peter Kocsis
Lukas Höllein
Matthias Nießner
33
0
0
01 Apr 2025
EAP4EMSIG -- Enhancing Event-Driven Microscopy for Microfluidic Single-Cell Analysis
Nils Friederich
Angelo Jovin Yamachui Sitcheu
Annika Nassal
Erenus Yildiz
Matthias Pesch
...
D. Kohlheyer
Hanno Scharr
Johannes Seiffarth
K. Nöh
Ralf Mikut
29
0
0
30 Mar 2025
From Panels to Prose: Generating Literary Narratives from Comics
Ragav Sachdeva
Andrew Zisserman
39
0
0
30 Mar 2025
Efficient Adaptation For Remote Sensing Visual Grounding
Hasan Moughnieh
Mohamad Chalhoub
Hasan Nasrallah
Cristiano Nattero
Paolo Campanella
Giovanni Nico
A. Ghandour
42
0
0
29 Mar 2025
LeX-Art: Rethinking Text Generation via Scalable High-Quality Data Synthesis
Shitian Zhao
Qilong Wu
Xinyue Li
Bo Zhang
Ming-xing Li
...
H. Li
Yu Qiao
Peng Gao
Bin Fu
Zhen Li
EGVM
43
0
0
27 Mar 2025
Show or Tell? Effectively prompting Vision-Language Models for semantic segmentation
Niccolo Avogaro
Thomas Frick
Mattia Rigotti
A. Bartezzaghi
Filip Janicki
C. Malossi
Konrad Schindler
Roy Assaf
MLLM
VLM
58
0
0
25 Mar 2025
M3: 3D-Spatial MultiModal Memory
Xueyan Zou
Yuchen Song
Ri-Zhao Qiu
Xuanbin Peng
Jianglong Ye
Sifei Liu
Xiaolong Wang
3DGS
54
0
0
20 Mar 2025
A Recipe for Generating 3D Worlds From a Single Image
Katja Schwarz
Denys Rozumnyi
Samuel Rota Buló
Lorenzo Porzi
Peter Kontschieder
VGen
74
1
0
20 Mar 2025
VIPER: Visual Perception and Explainable Reasoning for Sequential Decision-Making
Mohamed Salim Aissi
Clemence Grislain
Mohamed Chetouani
Olivier Sigaud
Laure Soulier
Nicolas Thome
LRM
39
0
0
19 Mar 2025
MeshFleet: Filtered and Annotated 3D Vehicle Dataset for Domain Specific Generative Modeling
Damian Boborzi
Phillip Mueller
Jonas Emrich
Dominik Schmid
Sebastian Mueller
Lars Mikelsons
DiffM
65
0
0
18 Mar 2025
GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding
R. Hu
Lianghui Zhu
Yuxuan Zhang
Tianheng Cheng
Lei Liu
Heng Liu
Longjin Ran
Xiaoxin Chen
Wenyu Liu
Xinggang Wang
ObjD
56
0
0
13 Mar 2025
Florenz: Scaling Laws for Systematic Generalization in Vision-Language Models
Julian Spravil
Sebastian Houben
Sven Behnke
VLM
68
0
0
12 Mar 2025
Think Before You Segment: High-Quality Reasoning Segmentation with GPT Chain of Thoughts
Shiu-hong Kao
Yu-Wing Tai
Chi-Keung Tang
LRM
MLLM
49
0
0
10 Mar 2025
Towards More Accurate Personalized Image Generation: Addressing Overfitting and Evaluation Bias
Mingxiao Li
Tingyu Qu
Tinne Tuytelaars
Marie-Francine Moens
EGVM
36
0
0
09 Mar 2025
Task-Agnostic Attacks Against Vision Foundation Models
Brian Pulfer
Yury Belousov
Vitaliy Kinakh
Teddy Furon
S. Voloshynovskiy
AAML
68
0
0
05 Mar 2025
Enhancing Collective Intelligence in Large Language Models Through Emotional Integration
Likith Kadiyala
Ramteja Sajja
Y. Sermet
Ibrahim Demir
48
0
0
05 Mar 2025
Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions
Jun Yu Li
Che Liu
Wenjia Bai
Rossella Arcucci
Cosmin I. Bercea
Julia A. Schnabel
39
0
0
05 Mar 2025
LangGas: Introducing Language in Selective Zero-Shot Background Subtraction for Semi-Transparent Gas Leak Detection with a New Dataset
Wenqi Guo
Yiyang Du
Shan Du
67
1
0
04 Mar 2025
Teaching Metric Distance to Autoregressive Multimodal Foundational Models
Jiwan Chung
Saejin Kim
Yongrae Jo
J. Park
Dongjun Min
Youngjae Yu
69
0
0
04 Mar 2025
Kiss3DGen: Repurposing Image Diffusion Models for 3D Asset Generation
Jiantao Lin
Xin Yang
Meixi Chen
Yingjie Xu
D. Yan
Leyi Wu
Xinli Xu
Lie Xu
Shunsi Zhang
Ying-Cong Chen
52
1
0
03 Mar 2025
ClipGrader: Leveraging Vision-Language Models for Robust Label Quality Assessment in Object Detection
Hong Lu
Yali Bian
Rahul C. Shah
ObjD
VLM
76
0
0
03 Mar 2025
EigenShield: Causal Subspace Filtering via Random Matrix Theory for Adversarially Robust Vision-Language Models
Nastaran Darabi
Devashri Naik
Sina Tayebati
Dinithi Jayasuriya
Ranganath Krishnan
A. R. Trivedi
AAML
39
0
0
24 Feb 2025
Chitrarth: Bridging Vision and Language for a Billion People
Shaharukh Khan
Ayush Tarun
Abhinav Ravi
Ali Faraz
Akshat Patidar
Praveen Kumar Pokala
Anagha Bhangare
Raja Kolla
Chandra Khatri
Shubham Agarwal
VLM
110
1
0
21 Feb 2025
Contrastive Localized Language-Image Pre-Training
Hong-You Chen
Zhengfeng Lai
H. Zhang
X. Wang
Marcin Eichner
Keen You
Meng Cao
Bowen Zhang
Y. Yang
Zhe Gan
CLIP
VLM
62
7
0
20 Feb 2025
SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation
Zekun Qi
Wenyao Zhang
Yufei Ding
Runpei Dong
Xinqiang Yu
...
Xin Jin
Kaisheng Ma
Zhizheng Zhang
He Wang
Li Yi
LM&Ro
131
3
0
18 Feb 2025