ResearchTrend.AI
$VILA^2$: VILA Augmented VILA


24 July 2024
Yunhao Fang
Ligeng Zhu
Yao Lu
Yan Wang
Pavlo Molchanov
Jang Hyun Cho
Marco Pavone
Song Han
Hongxu Yin
    VLM

Papers citing "$VILA^2$: VILA Augmented VILA"

9 / 9 papers shown
VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks
Lawrence Jang
Yinheng Li
Charles Ding
Justin Lin
Paul Pu Liang
Dan Zhao
Rogerio Bonatti
K. Koishida
31
5
0
24 Oct 2024
Language-Image Models with 3D Understanding
Jang Hyun Cho
B. Ivanovic
Yulong Cao
Edward Schmerling
Yue Wang
...
Boyi Li
Yurong You
Philipp Krahenbuhl
Yan Wang
Marco Pavone
LRM
40
15
0
06 May 2024
When Do We Not Need Larger Vision Models?
Baifeng Shi
Ziyang Wu
Maolin Mao
Xin Wang
Trevor Darrell
VLM
LRM
44
23
0
19 Mar 2024
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
DeepSeek-AI
Xiao Bi
Deli Chen
Guanting Chen
...
Yao Zhao
Shangyan Zhou
Shunfeng Zhou
Qihao Zhu
Yuheng Zou
LRM
ALM
139
298
0
05 Jan 2024
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen
Jiannan Wu
Wenhai Wang
Weijie Su
Guo Chen
...
Bin Li
Ping Luo
Tong Lu
Yu Qiao
Jifeng Dai
VLM
MLLM
156
895
0
21 Dec 2023
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
244
4,186
0
30 Jan 2023
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
Pan Lu
Swaroop Mishra
Tony Xia
Liang Qiu
Kai-Wei Chang
Song-Chun Zhu
Oyvind Tafjord
Peter Clark
A. Kalyan
ELM
ReLM
LRM
207
1,089
0
20 Sep 2022
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Junnan Li
Dongxu Li
Caiming Xiong
S. Hoi
MLLM
BDL
VLM
CLIP
382
4,010
0
28 Jan 2022
WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
Krishna Srinivasan
K. Raman
Jiecao Chen
Michael Bendersky
Marc Najork
VLM
184
307
0
02 Mar 2021