Hugging Face introduces Idefics2, an 8B open-source visual language model

Hugging Face first released its Idefics visual language model in 2023, built on technology initially developed by DeepMind. Today, Idefics is receiving an upgrade with a smaller parameter count, an open license, and improved Optical Character Recognition (OCR) capabilities. Idefics2 is available now on Hugging Face.

Short for Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS, Idefics is a general multimodal model that can respond to text and image prompts. While its predecessor has 80 billion parameters, Idefics2 is a tenth of the size at 8 billion, comparable to DeepSeek-VL and LLaVA-NeXT-Mistral-7B.

Among its core capabilities, Idefics2 can manipulate images at their native resolution (up to 980 x 980 pixels) and native aspect ratio. Images no longer need to be resized to a fixed-size square, as is traditionally done in computer vision.

OCR abilities have been enhanced by integrating training data that requires transcribing text in an image or document. Hugging Face’s team has also improved Idefics’ ability to answer questions about charts, figures and documents.
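
For readers who want to try these capabilities, the model can be queried through the Hugging Face transformers library. The snippet below is a minimal sketch of that workflow; the checkpoint id ("HuggingFaceM4/idefics2-8b"), the local image file and the example question are assumptions for illustration, so check the model card on the Hub for the exact recommended usage.

```python
# Minimal sketch: asking Idefics2 a question about a local image with transformers.
# The checkpoint id and the image path are illustrative assumptions.
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b", torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("chart.png")  # hypothetical local chart image

# Build a chat-style prompt with an image placeholder followed by a text question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What trend does this chart show?"},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```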


Finally, the model’s architecture has been simplified, shifting away from the gated cross-attentions of Idefics1. “The images are fed to the vision encoder followed by a learned Perceiver pooling and a [Multilayer Perceptron] modality projection,” Hugging Face states in a blog post. “That pooled sequence is then concatenated with the text embeddings to obtain an (interleaved) sequence of image(s) and text(s).”
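
To make that data flow concrete, here is a minimal PyTorch sketch of the pooling-and-projection step described in the quote: patch features from the vision encoder are pooled by a learned Perceiver-style resampler, projected by an MLP into the language model’s embedding space, and concatenated with the text embeddings. The module sizes and dimensions are illustrative assumptions, not Idefics2’s actual hyperparameters.

```python
# Illustrative sketch of the Idefics2-style fusion path; all sizes are toy values.
import torch
import torch.nn as nn

class PerceiverPooler(nn.Module):
    """Cross-attends a fixed set of learned latent queries to the image patch features."""
    def __init__(self, vision_dim: int, num_latents: int = 64, num_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, vision_dim))
        self.cross_attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        queries = self.latents.unsqueeze(0).expand(patch_features.size(0), -1, -1)
        pooled, _ = self.cross_attn(queries, patch_features, patch_features)
        return pooled  # (batch, num_latents, vision_dim)

class ModalityProjection(nn.Module):
    """MLP that maps pooled vision features into the text embedding space."""
    def __init__(self, vision_dim: int, text_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, text_dim), nn.GELU(), nn.Linear(text_dim, text_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x)

# Toy dimensions for illustration only.
batch, num_patches, vision_dim, text_dim, text_len = 1, 729, 1152, 4096, 32
patch_features = torch.randn(batch, num_patches, vision_dim)  # vision encoder output
text_embeddings = torch.randn(batch, text_len, text_dim)      # LLM token embeddings

pooled = PerceiverPooler(vision_dim)(patch_features)           # (1, 64, 1152)
projected = ModalityProjection(vision_dim, text_dim)(pooled)   # (1, 64, 4096)

# The projected image sequence is concatenated (interleaved, in general) with the
# text embeddings and fed to the language model as one sequence.
fused_sequence = torch.cat([projected, text_embeddings], dim=1)  # (1, 96, 4096)
```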

How Hugging Face’s Idefics2 compares to other multimodal models. Image credit: Hugging Face

Idefics2 is built on two openly available models, Mistral-7B-v0.1 (the language backbone) and siglip-so400m-patch14-384 (the vision encoder), and was trained on a mixture of openly available datasets, including web documents, image-caption pairs, OCR data, rendered text and image-to-code data.
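
Both building blocks named above are standard checkpoints on the Hugging Face Hub and can be loaded independently, as sketched below. The repository prefixes ("google/" and "mistralai/") are assumed here, since the article names only the model identifiers themselves.

```python
# Minimal sketch: loading Idefics2's two backbones separately from the Hub.
# Repository prefixes are assumptions for illustration.
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer, AutoImageProcessor

# Vision encoder: SigLIP, which produces the patch features fed to the Perceiver pooler.
vision_encoder = AutoModel.from_pretrained("google/siglip-so400m-patch14-384")
image_processor = AutoImageProcessor.from_pretrained("google/siglip-so400m-patch14-384")

# Language backbone: Mistral 7B v0.1, which consumes the interleaved image/text sequence.
language_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
```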

Its release comes amid a wave of multimodal models introduced as the AI boom continues, including Reka’s new Core model, xAI’s Grok-1.5V and Google’s Imagen 2.