Google’s new technique gives LLMs infinite context

A new paper by researchers at Google claims to give large language models (LLMs) the ability to work with text of infinite length. The paper introduces Infini-attention, a technique that configures language models in a way that extends their “context window” while keeping memory and compute requirements constant.

The context window is the number of tokens a model can work on at any time. For example, if your conversation with ChatGPT extends beyond the context window, its performance drops sharply and the model discards the tokens from the beginning of the conversation.

Organizations are customizing LLMs for their applications by inserting bespoke documents and knowledge into their prompts. Therefore, increasing context length has become one of the major efforts in improving models and gaining an advantage over competitors. 

Experiments reported by the Google research team indicate that models using Infini-attention can maintain their quality over one million tokens without requiring additional memory. Theoretically, the same trend should continue to even longer sequences.


Infini-attention

The Transformer, the deep learning architecture used in LLMs, has a “quadratic complexity” in memory footprint and computation time. This means that, for example, if you extend the input size from 1,000 to 2,000 tokens, the memory and computation required to process it would not just double; they would quadruple.

This quadratic relationship is due to the self-attention mechanism in transformers, which compares each element in the input sequence with every other element. 
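
To make the scaling concrete, here is a minimal Python sketch (not code from the paper) of naive single-head attention. The n-by-n score matrix is what drives the quadratic growth: doubling the sequence length from 1,000 to 2,000 tokens quadruples the memory needed for it.

import numpy as np

def naive_attention(Q, K, V):
    # Q, K, V: (n, d) matrices for a single attention head
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)          # (n, n) -- the quadratic term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                      # (n, d)

for n in (1_000, 2_000):
    # The score matrix alone holds n * n float32 values.
    print(n, "tokens ->", n * n * 4 / 1e6, "MB for the score matrix")
# 1,000 tokens -> 4.0 MB; 2,000 tokens -> 16.0 MB (4x, not 2x)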

In the past couple of years, researchers have developed different techniques to reduce the costs of extending the context length of LLMs. The paper describes Infini-attention as a “long-term compressive memory and local causal attention for efficiently modeling both long and short-range contextual dependencies.”

Infini-attention architecture

This means that Infini-attention keeps the classic attention mechanism in the transformer block and adds a “compressive memory” module to handle extended inputs. Once the input grows beyond the model’s local context length, the model stores the old attention states in the compressive memory component, which keeps a constant number of parameters for computational efficiency. To compute the final output, Infini-attention aggregates the compressive memory and the local attention contexts.

“Such a subtle but critical modification to the Transformer attention layer enables a natural extension of existing LLMs to infinitely long contexts via continual pre-training and finetuning,” the researchers write.
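
To make that description more concrete, here is a heavily simplified Python sketch of the idea: standard attention within the current segment, a fixed-size compressive memory for older segments, and a gate that blends the two. The function names, the feature map, and the fixed gate value are illustrative assumptions, not Google’s implementation.

import numpy as np

def elu_plus_one(x):
    # Non-negative feature map used for the linear-attention-style memory.
    return np.where(x > 0, x + 1.0, np.exp(x))

def local_attention(Q, K, V):
    # Standard dot-product attention within the current segment (causal masking omitted).
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def infini_style_segment(Q, K, V, M, z, beta=0.5):
    # M (d x d) and z (d,) are the constant-size compressive memory state.
    sigma_Q = elu_plus_one(Q)
    # Read long-term context from the compressive memory.
    A_mem = (sigma_Q @ M) / (sigma_Q @ z + 1e-6)[:, None]
    # Attend normally within the current segment.
    A_loc = local_attention(Q, K, V)
    # Blend long-term and local context (the paper learns this gate; 0.5 is a stand-in).
    out = beta * A_mem + (1 - beta) * A_loc
    # Fold this segment's key-value associations into the memory, then discard old states.
    sigma_K = elu_plus_one(K)
    M = M + sigma_K.T @ V
    z = z + sigma_K.sum(axis=0)
    return out, M, z

d = 64
M, z = np.zeros((d, d)), np.zeros(d)
for _ in range(4):                       # four segments; memory size never grows
    Q, K, V = (np.random.randn(128, d) for _ in range(3))
    out, M, z = infini_style_segment(Q, K, V, M, z)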

Infini-attention in action

The researchers tested Infini-attention Transformers on benchmarks that evaluate LLMs on very long input sequences. In long-context language modeling, Infini-attention outperformed other long-context transformer models, maintaining lower perplexity scores (a measure of how well the model predicts the next token; lower is better) while requiring 114x less memory.

In the “passkey retrieval” test, Infini-attention was able to retrieve a random number inserted into a long text of up to one million tokens. It also outperformed other long-context techniques on summarization tasks for texts of up to 500,000 tokens.
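
For readers unfamiliar with the benchmark, a passkey retrieval prompt can be built roughly as follows; the filler sentence, sizes, and wording here are assumptions rather than the paper’s exact setup.

import random

def build_passkey_prompt(total_words=10_000, filler="The grass is green. The sky is blue. "):
    # Hide a random passkey somewhere inside a long stretch of filler text.
    passkey = random.randint(10_000, 99_999)
    needle = f"The pass key is {passkey}. Remember it. "
    repeats = total_words // len(filler.split())
    haystack = [filler] * repeats
    haystack.insert(random.randrange(len(haystack)), needle)   # bury the passkey
    prompt = "".join(haystack) + "\nWhat is the pass key?"
    return prompt, passkey

prompt, answer = build_passkey_prompt()
print(len(prompt.split()), "words; expected answer:", answer)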

According to the paper, the tests were carried out on LLMs with 1 billion and 8 billion parameters. Google did not release the models or code, so other researchers have not been able to verify the results. However, the reported results are consistent with the performance reported for Gemini, which supports a context window of millions of tokens.

Applications of long context LLMs

Long-context LLMs have become an important area of research and competition among frontier AI labs. Anthropic’s Claude 3 supports up to 200,000 tokens, while OpenAI’s GPT-4 has a context window of 128,000 tokens.

One of the important benefits of LLMs with infinite context is the creation of custom applications. Currently, customizing LLMs for specific applications requires techniques such as fine-tuning or retrieval-augmented generation (RAG). While these techniques are very useful, they demand significant engineering effort.

An LLM with infinite context could, theoretically, enable you to insert all of your documents into the prompt and let the model pick the most relevant parts for each query. It could also enable you to customize the model by providing it with a long list of examples to improve its performance on specific tasks without the need to fine-tune it.

However, this does not mean that infinite context will replace other techniques. Rather, it will lower the barrier to entry, enabling developers and organizations to quickly create working prototypes without immense engineering effort. Eventually, organizations will still need to optimize their LLM pipelines to reduce costs and improve speed and accuracy as they scale.