The artificial intelligence revolution: Discover open source alternatives to GPT-4 Vision in LLaVA 1.5!

LLaVA 1.5: An open source alternative to GPT-4 Vision

The field of generative artificial intelligence is booming with the emergence of large multimodal models (LMMs) such as OpenAI's GPT-4 Vision. These models are transforming how we interact with AI systems by processing both text and images.

However, the closed, commercial nature of some of these technologies may hinder their universal adoption. It is in this context that the open source community steps in, advancing the LLaVA 1.5 model as a promising alternative to GPT-4 Vision.

The mechanics of LMMs

LMMs operate using a multi-component architecture. They combine a pre-trained vision encoder, a large language model (LLM) that interprets and responds to user instructions, and a multimodal connector that bridges vision and language.
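The three components above can be sketched as a toy forward pass: a (stand-in) vision encoder produces patch features, a connector projects them into the language model's embedding space, and the projected features are prepended to the text embeddings so the LLM sees one unified sequence. This is a minimal illustration with made-up dimensions, not any model's real implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only; real models are far larger)
VISION_DIM = 16   # output width of the vision encoder (e.g. CLIP)
LLM_DIM = 32      # embedding width of the language model
N_PATCHES = 4     # number of image patch features
N_TOKENS = 6      # number of text tokens

def vision_encoder(image):
    # Stand-in for a frozen pre-trained encoder: one feature row per patch
    return rng.standard_normal((N_PATCHES, VISION_DIM))

def connector(features, w):
    # Multimodal connector: project visual features into the LLM's space
    return features @ w

def llm_embed(tokens):
    # Stand-in for the LLM's token embedding lookup
    return rng.standard_normal((len(tokens), LLM_DIM))

w_proj = rng.standard_normal((VISION_DIM, LLM_DIM))
visual = connector(vision_encoder(image=None), w_proj)
textual = llm_embed(tokens=range(N_TOKENS))

# The LLM then attends over visual and textual "tokens" together
sequence = np.concatenate([visual, textual], axis=0)
print(sequence.shape)  # 4 visual tokens + 6 text tokens, each LLM_DIM wide
```

The key design point is that the connector is the only piece that has to learn anything new: both the vision encoder and the LLM arrive pre-trained.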

Their training takes place in two stages: an initial phase of alignment between vision and language, followed by fine-tuning on visual instructions. This process, although effective, often requires significant computing power and a large, carefully curated dataset.
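The two stages differ mainly in which parameters are allowed to update. A common recipe, sketched below with hypothetical parameter groups, freezes the vision encoder throughout, trains only the connector during alignment, then unfreezes the LLM for instruction tuning. The exact freezing schedule varies between models; this is an assumption-laden illustration of the pattern, not LLaVA's precise configuration.

```python
# Hypothetical parameter groups; "trainable" marks which weights update
model = {
    "vision_encoder": {"trainable": False},  # stays frozen throughout
    "connector":      {"trainable": False},
    "llm":            {"trainable": False},
}

def set_stage(model, stage):
    if stage == 1:
        # Stage 1: vision-language alignment -- only the connector learns
        model["connector"]["trainable"] = True
        model["llm"]["trainable"] = False
    elif stage == 2:
        # Stage 2: visual instruction tuning -- connector and LLM both learn
        model["connector"]["trainable"] = True
        model["llm"]["trainable"] = True
    return model

set_stage(model, 1)
stage1 = [k for k, v in model.items() if v["trainable"]]
print(stage1)  # ['connector']

set_stage(model, 2)
stage2 = [k for k, v in model.items() if v["trainable"]]
print(stage2)  # ['connector', 'llm']
```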

The advantages of LLaVA 1.5

LLaVA 1.5 relies on the CLIP model for visual encoding and Vicuna for language. The original model, LLaVA, used the text-only versions of ChatGPT and GPT-4 to generate visual instruction-tuning data, producing 158,000 training examples.

LLaVA 1.5 goes further by connecting the language model and visual encoder with a multi-layer perceptron (MLP), and by enriching its training data with visual question-answering datasets. This update, covering approximately 600,000 examples, allowed LLaVA 1.5 to outperform other open source LMMs on 11 of 12 multimodal benchmarks.
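The MLP connector mentioned above replaces the single linear projection used in the original LLaVA with two linear layers separated by a nonlinearity. A minimal numpy sketch, with made-up dimensions and a tanh-approximated GELU as the (assumed) activation:

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp_connector(features, w1, b1, w2, b2):
    # Two-layer MLP projector: linear -> GELU -> linear,
    # versus the original LLaVA's single linear projection
    return gelu(features @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(1)
VISION_DIM, HIDDEN, LLM_DIM = 16, 24, 32  # toy sizes
w1 = rng.standard_normal((VISION_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
w2 = rng.standard_normal((HIDDEN, LLM_DIM))
b2 = np.zeros(LLM_DIM)

patches = rng.standard_normal((4, VISION_DIM))  # 4 toy patch features
projected = mlp_connector(patches, w1, b1, w2, b2)
print(projected.shape)  # each patch now lives in the LLM's embedding space
```

The nonlinearity gives the connector more capacity to reshape visual features before they reach the language model, one of the small changes behind LLaVA 1.5's benchmark gains.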

The future of open source LMMs

The online demo of LLaVA 1.5, accessible to everyone, shows promising results even for a model trained on a limited budget. However, one restriction remains: because its training data was generated with ChatGPT, the model is limited to non-commercial use.

Despite this limitation, LLaVA 1.5 opens a door to the future of open source LMMs. Its cost-effectiveness, its scalable approach to generating training data, and its efficient visual instruction tuning foreshadow the innovations to come.

LLaVA 1.5 is only a first step, one that will resonate with the continued progress of the open source community. As more efficient and accessible models emerge, we can envision a future where generative AI technology is available to everyone, revealing the full potential of artificial intelligence.