What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models

24 May 2024
Abdelrahman Abdelhamed, Mahmoud Afifi, Alec Go
Topics: MLLM, VLM
Abstract

Large language models (LLMs) have been used effectively for many computer vision tasks, including image classification. In this paper, we present a simple yet effective approach for zero-shot image classification using multimodal LLMs. Specifically, we use a multimodal LLM to generate comprehensive textual representations of input images, which are then mapped to fixed-dimensional features in a cross-modal embedding space. These features are fused and passed to a linear classifier to perform zero-shot classification. Our method requires no per-dataset prompt engineering; instead, we use a single, straightforward set of prompts across all datasets. We evaluated our method on several datasets, and the results demonstrate its effectiveness, surpassing benchmark accuracy on multiple datasets. Averaged over ten benchmarks, our method achieved an accuracy gain of 6.2 percentage points over prior methods re-evaluated with the same setup, including a gain of 6.8 percentage points on ImageNet. Our findings highlight the potential of multimodal LLMs to enhance computer vision tasks such as zero-shot image classification, offering a significant improvement over traditional methods.
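
To make the pipeline concrete, here is a minimal Python sketch of the approach as described in the abstract: a multimodal LLM turns the image into a textual description, a CLIP-style encoder maps both the image and the description into a shared embedding space, and the resulting features are fused and fed to a linear classifier. The encoder choice (Hugging Face CLIP), the stubbed describe_image call, and fusion by concatenation are illustrative assumptions, not the authors' implementation.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# CLIP serves as the cross-modal embedding space; the paper's exact encoder may differ.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def describe_image(image: Image.Image) -> str:
    # Placeholder for the MLLM step: query a multimodal LLM with a fixed,
    # dataset-agnostic prompt (e.g. "What do you see in this image?") and
    # return its textual description. Stubbed here for illustration.
    return "a detailed textual description of the image"

@torch.no_grad()
def fused_features(image: Image.Image) -> torch.Tensor:
    # 1) The MLLM produces a comprehensive textual representation of the image.
    description = describe_image(image)
    # 2) Image and description are mapped to fixed-dimensional features in the
    #    shared cross-modal embedding space.
    inputs = processor(text=[description], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    # 3) Fuse the features; plain concatenation is one simple, illustrative choice.
    fused = torch.cat([img_emb, txt_emb], dim=-1)
    return torch.nn.functional.normalize(fused, dim=-1)

# 4) A linear classifier over the fused features performs the final classification;
#    its weights can be built from class-name text embeddings to stay zero-shot.
num_classes = 10  # illustrative
classifier = torch.nn.Linear(2 * model.config.projection_dim, num_classes, bias=False)

For a given image, classifier(fused_features(image)) would then yield per-class scores; the authors' precise fusion and classifier construction may differ from this sketch.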

@article{abdelhamed2024_2405.15668,
  title={What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models},
  author={Abdelrahman Abdelhamed and Mahmoud Afifi and Alec Go},
  journal={arXiv preprint arXiv:2405.15668},
  year={2024}
}