Hugging Face introduces Idefics2, an 8B open-source visual language model

Hugging Face first released its Idefics visual language model in 2023, built on technology initially developed by DeepMind. Today, Idefics is receiving an upgrade with a smaller parameter count, an open license, and improved Optical Character Recognition (OCR) capabilities. Idefics2 is available now on Hugging Face.

Short for Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS, Idefics is a general multimodal model that can respond to text and image prompts. While its predecessor has 80 billion parameters, Idefics2 is a tenth of the size at 8 billion, comparable to DeepSeek-VL and LLaVA-NeXT-Mistral-7B.

Among its core capabilities, Idefics2 can manipulate images at their native resolution (up to 980 x 980 pixels) and native aspect ratio. Images no longer need to be resized to a fixed-size square, as is traditionally done in computer vision.

OCR abilities have been enhanced by integrating training data that requires transcribing text in an image or document. Hugging Face’s team has also improved Idefics’ ability to answer questions about charts, figures and documents.
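
For readers who want to try these capabilities, the model can be queried through the Hugging Face transformers library. The snippet below is a minimal sketch of that workflow; the checkpoint id ("HuggingFaceM4/idefics2-8b"), the local image file and the example question are assumptions for illustration, so check the model card on the Hub for the exact recommended usage.

```python
# Minimal sketch: asking Idefics2 a question about a local image with transformers.
# The checkpoint id and the image path are illustrative assumptions.
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b", torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("chart.png")  # hypothetical local chart image

# Build a chat-style prompt with an image placeholder followed by a text question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What trend does this chart show?"},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```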


Finally, the model’s architecture has been simplified, shifting away from the gated cross-attentions of Idefics1. “The images are fed to the vision encoder followed by a learned Perceiver pooling and a [Multilayer Perceptron] modality projection,” Hugging Face states in a blog post. “That pooled sequence is then concatenated with the text embeddings to obtain an (interleaved) sequence of image(s) and text(s).”
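
To make that data flow concrete, here is a minimal PyTorch sketch of the pooling-and-projection step described in the quote: patch features from the vision encoder are pooled by a learned Perceiver-style resampler, projected by an MLP into the language model’s embedding space, and concatenated with the text embeddings. The module sizes and dimensions are illustrative assumptions, not Idefics2’s actual hyperparameters.

```python
# Illustrative sketch of the Idefics2-style fusion path; all sizes are toy values.
import torch
import torch.nn as nn

class PerceiverPooler(nn.Module):
    """Cross-attends a fixed set of learned latent queries to the image patch features."""
    def __init__(self, vision_dim: int, num_latents: int = 64, num_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, vision_dim))
        self.cross_attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        queries = self.latents.unsqueeze(0).expand(patch_features.size(0), -1, -1)
        pooled, _ = self.cross_attn(queries, patch_features, patch_features)
        return pooled  # (batch, num_latents, vision_dim)

class ModalityProjection(nn.Module):
    """MLP that maps pooled vision features into the text embedding space."""
    def __init__(self, vision_dim: int, text_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, text_dim), nn.GELU(), nn.Linear(text_dim, text_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x)

# Toy dimensions for illustration only.
batch, num_patches, vision_dim, text_dim, text_len = 1, 729, 1152, 4096, 32
patch_features = torch.randn(batch, num_patches, vision_dim)  # vision encoder output
text_embeddings = torch.randn(batch, text_len, text_dim)      # LLM token embeddings

pooled = PerceiverPooler(vision_dim)(patch_features)           # (1, 64, 1152)
projected = ModalityProjection(vision_dim, text_dim)(pooled)   # (1, 64, 4096)

# The projected image sequence is concatenated (interleaved, in general) with the
# text embeddings and fed to the language model as one sequence.
fused_sequence = torch.cat([projected, text_embeddings], dim=1)  # (1, 96, 4096)
```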

How Hugging Face’s Idefics2 compares to other multimodal models. Image credit: Hugging Face

Idefics2 is built on two openly available models, Mistral-7B-v0.1 (the language backbone) and siglip-so400m-patch14-384 (the vision encoder), and was trained on a mixture of openly available datasets, including web documents, image-caption pairs, OCR data, rendered text and image-to-code data.
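
Both building blocks named above are standard checkpoints on the Hugging Face Hub and can be loaded independently, as sketched below. The repository prefixes ("google/" and "mistralai/") are assumed here, since the article names only the model identifiers themselves.

```python
# Minimal sketch: loading Idefics2's two backbones separately from the Hub.
# Repository prefixes are assumptions for illustration.
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer, AutoImageProcessor

# Vision encoder: SigLIP, which produces the patch features fed to the Perceiver pooler.
vision_encoder = AutoModel.from_pretrained("google/siglip-so400m-patch14-384")
image_processor = AutoImageProcessor.from_pretrained("google/siglip-so400m-patch14-384")

# Language backbone: Mistral 7B v0.1, which consumes the interleaved image/text sequence.
language_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
```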

Its release comes amid a wave of multimodal models introduced as the AI boom continues, including Reka’s new Core model, xAI’s Grok-1.5V and Google’s Imagen 2.