Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023 |
OpenVIS: Open-vocabulary Video Instance Segmentation. AAAI Conference on Artificial Intelligence (AAAI), 2023 |
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought. Neural Information Processing Systems (NeurIPS), 2023 |
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks. Neural Information Processing Systems (NeurIPS), 2023 |
Otter: A Multi-Modal Model with In-Context Instruction Tuning. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023 |
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. International Conference on Learning Representations (ICLR), 2023 |