
How LLMs are ushering in a new era of robotics

[Image: White-collar workers in a retro office type on computers beneath humanoid robots hanging from the ceiling. Credit: VentureBeat made with OpenAI DALL-E 3]



Recent months have seen a growing number of projects that use large language models (LLMs) to create robotics applications that seemed impossible before. Thanks to the power of LLMs and multi-modal models, researchers are creating robots that can process natural language commands and accomplish tasks that require complex reasoning.

The growing interest in the intersection of LLMs and robotics has also restored activity in the robotics startup community, with several companies securing hefty rounds of funding and releasing impressive demos. 

As the impressive advances in LLMs spill over into the physical world, we may be witnessing the start of a new era in robotics.

Language models for perception and reasoning

The classic way to create robotics systems requires complicated engineering efforts to build planning and reasoning modules. Creating useful interfaces for interacting with robots is also difficult, since people can phrase the same instruction in many different ways.


With the advent of LLMs and vision language models (VLMs), roboticists were able to enhance existing robotics systems in unprecedented ways. The first step in this direction was SayCan, a project by Google Research. SayCan used the semantic knowledge encoded in an LLM to help a robot reason about a task and determine which sequences of actions could help accomplish it. 

“SayCan was one of the more impactful papers on robotics,” AI and robotics research scientist Chris Paxton told VentureBeat. “One nice thing about SayCan from a system point of view is that it’s very modular. It lets you bring different pieces together to build a system that can do cool demos, and it was immediately very compelling.” 
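To make the idea concrete, here is a minimal sketch of how a SayCan-style planner combines the two signals. Both scoring functions are hypothetical stand-ins: in the actual system, the language score comes from an LLM's likelihood of a skill description given the instruction, and the affordance score from a learned value function that estimates whether the skill can succeed in the current scene.

```python
# Illustrative sketch of the SayCan idea, not Google's implementation.
def llm_score(instruction: str, skill: str) -> float:
    """Placeholder: how relevant this skill is to the instruction."""
    return 1.0 if skill.split()[-1] in instruction else 0.1

def affordance_score(skill: str, visible_objects: set) -> float:
    """Placeholder: how likely the robot is to succeed at this skill right now."""
    return 1.0 if skill.split()[-1] in visible_objects else 0.0

def pick_next_skill(instruction, skills, visible_objects):
    # SayCan's core move: pick the skill that is both useful and feasible.
    return max(
        skills,
        key=lambda s: llm_score(instruction, s) * affordance_score(s, visible_objects),
    )

skills = ["pick up the apple", "pick up the sponge", "open the drawer"]
print(pick_next_skill("bring me an apple", skills, visible_objects={"apple", "drawer"}))
# -> "pick up the apple"
```

The real system applies this scoring step repeatedly, appending each chosen skill to the plan until the task is complete.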

After SayCan, other researchers started exploring the use of language and vision models in robotics, and the community has been able to make progress in different directions. Some projects use general-purpose LLMs and VLMs to enable robotics applications while others try to customize existing models for robotics tasks.

“We’ve seen that using large language models and large vision models has made things like perception and reasoning much more accessible,” Paxton said. “It’s made a lot of robotic tasks seem more doable than they were before.”

Stringing together existing capabilities

One of the major limitations of classic robotics systems is control. Robotics teams can train robots for individual skills, such as opening doors and drawers or picking up and manipulating objects. However, training robots to combine these skills to accomplish complicated tasks is difficult, which is why classic systems are usually rigid and require explicit, step-by-step instructions for complex tasks.

VLMs and LLMs enable robots to map loosely defined instructions to a specific sequence of tasks within the range of the robot's skills. Interestingly, many frontier models can do this without any additional, robotics-specific training.

“I can take those different skills and, with these large language models, I can just string them together and reason about how I should use them,” Paxton said. “With the new visual language models like GPT-4V, we can see how these systems can come together and be useful in a wide range of applications.”
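In practice, that "stringing together" often looks like the sketch below: describe the robot's primitive skills to a chat model, ask for a plan in a structured format, and dispatch each step to the corresponding controller. The `call_llm` stub and the skill names are hypothetical placeholders for whatever model API and control primitives a given team uses.

```python
import json

# Hypothetical primitive skills the robot already supports.
SKILLS = {
    "go_to": lambda target: print(f"navigating to {target}"),
    "pick":  lambda obj: print(f"picking up {obj}"),
    "place": lambda obj, target: print(f"placing {obj} on {target}"),
}

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-model API call (e.g., a GPT-4V request).
    Returns a canned plan here so the sketch runs offline."""
    return json.dumps([
        {"skill": "go_to", "args": ["counter"]},
        {"skill": "pick",  "args": ["mug"]},
        {"skill": "go_to", "args": ["table"]},
        {"skill": "place", "args": ["mug", "table"]},
    ])

def plan_and_run(instruction: str) -> None:
    prompt = (
        f"Robot skills: {list(SKILLS)}. "
        f"Instruction: {instruction}. "
        "Reply with a JSON list of {skill, args} steps."
    )
    for step in json.loads(call_llm(prompt)):
        SKILLS[step["skill"]](*step["args"])  # dispatch each step to a skill

plan_and_run("bring the mug from the counter to the table")
```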

One example is GenEM, a technique developed by the University of Toronto, Google DeepMind and Hoku Labs that uses the vast social context available in large language models to create expressive behaviors for robots. GenEM uses GPT-4 to reason about the environment and determine, based on the robot's affordances, what kind of behavior it should engage in. For example, the LLM might determine that it is polite to nod at people to acknowledge their presence. It then translates this into specific actions the robot supports, such as moving its head up and down. GenEM does this using the vast knowledge contained in its training data as well as its in-context learning abilities, which enable it to map behaviors to API calls for the robot.
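The nodding example can be caricatured in a few lines of Python. Both functions below are stubs standing in for the two model steps GenEM chains together: one that picks a socially appropriate behavior, and one that maps it onto the robot's actual API.

```python
# Hypothetical robot head API; the real work maps behaviors onto whatever
# calls the target robot exposes, via in-context examples.
def set_head_pitch(degrees: float) -> None:
    print(f"head pitch -> {degrees} deg")

def social_reasoning(observation: str) -> str:
    """Stub for the first LLM call: choose an appropriate expressive behavior."""
    return "nod to acknowledge the person"

def behavior_to_api_calls(behavior: str) -> list:
    """Stub for the second LLM call: translate the behavior into supported
    API calls, here a simple up-and-down head motion."""
    if "nod" in behavior:
        return [(set_head_pitch, -15), (set_head_pitch, 0)]
    return []

observation = "a person walks up to the robot and makes eye contact"
for api_call, arg in behavior_to_api_calls(social_reasoning(observation)):
    api_call(arg)
```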

Another project is OK-Robot, a system created by Meta and New York University, which combines VLMs with movement-planning and object-manipulation modules to perform pick-and-drop operations in environments that the robot has never seen before.
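Conceptually, that kind of system is a pipeline: a VLM answers open-vocabulary queries about where things are, a navigation module drives the robot there, and manipulation modules handle grasping and releasing. The toy sketch below shows only the glue logic; every function is a hypothetical stand-in for a full module.

```python
# Toy pick-and-drop pipeline; each function stands in for a full module
# (open-vocabulary detection, navigation, grasping).
def locate(query: str) -> tuple:
    """Stub for a VLM-based open-vocabulary lookup against a scene map."""
    return {"blue mug": (1.2, 0.4), "bin": (3.0, 1.1)}[query]

def navigate_to(xy: tuple) -> None:
    print(f"driving to {xy}")

def grasp(obj: str) -> None:
    print(f"grasping {obj}")

def release() -> None:
    print("releasing object")

def pick_and_drop(obj: str, destination: str) -> None:
    navigate_to(locate(obj))
    grasp(obj)
    navigate_to(locate(destination))
    release()

pick_and_drop("blue mug", "bin")
```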

Some robotics startups are finding renewed success with the growing capabilities of language models. For example, Figure, a California-based robotics startup, recently raised $675 million to build humanoid robots powered by vision and language models. The company’s robots use OpenAI models to analyze instructions and plan their actions. 

It is important to note, however, that while LLMs and VLMs solve important problems, these robotics teams must still create the systems for primitive skills such as grasping and moving objects, avoiding obstacles, and navigating the environment.

“There’s a lot of other work that goes on at the level below that those models aren’t handling,” Paxton said. “And that’s the kind of stuff that is hard to do. And in a lot of ways, it’s because the data doesn’t exist. That’s what all these companies are building.”

Specialized foundation models

Another approach to using LLMs and VLMs is the development of specialized foundation models for robotics. These models usually build on the vast knowledge contained in pre-trained models and customize their architectures for robotic actions.

One of the most important projects in this respect was Google’s RT-2, a vision-language-action (VLA) model that takes perception data and language instructions as input and directly outputs action commands to the robot. 
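A defining trick of VLA models such as RT-2 is representing robot actions as discrete tokens in the model's output vocabulary, so the same network that emits words can emit motor commands. The sketch below illustrates only that discretization step; the 256-bin count and the normalized action range are arbitrary choices for illustration, not RT-2's actual configuration.

```python
import numpy as np

NUM_BINS = 256          # assumed number of bins per action dimension
LOW, HIGH = -1.0, 1.0   # assumed normalized action range

def action_to_tokens(action: np.ndarray) -> list:
    """Map each continuous action dimension to an integer token id."""
    clipped = np.clip(action, LOW, HIGH)
    return [int(v) for v in (clipped - LOW) / (HIGH - LOW) * (NUM_BINS - 1)]

def tokens_to_action(tokens: list) -> np.ndarray:
    """Decode token ids back to continuous values for the controller."""
    return np.array(tokens) / (NUM_BINS - 1) * (HIGH - LOW) + LOW

delta = np.array([0.10, -0.25, 0.05])  # e.g., an x/y/z end-effector displacement
tokens = action_to_tokens(delta)
print(tokens, tokens_to_action(tokens).round(2))
```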

Google DeepMind recently created RT-2-X, a more advanced version of RT-2 that adapts to different kinds of robot morphologies and can perform tasks that were not included in its training data. And RT-Sketch, a collaboration between DeepMind and Stanford University, uses rough hand-drawn sketches of a desired scene as goals for robot manipulation.

“These are a different approach where the model is now one huge policy that can do everything,” Paxton said. “That’s another exciting direction that is based on end-to-end learning where you take a camera feed and the robot figures out most of what it needs to do.”

Foundation models for robotics have also found their way into the commercial space. In March, Covariant announced RFM-1, an 8-billion parameter transformer model trained on text, images, videos, robot actions and a range of numerical sensor readings. Covariant aims to create a foundation model that can solve many tasks for different types of robots. 

And Project GR00T, announced at Nvidia GTC, is a general-purpose foundation model designed to enable humanoid robots to take text, speech, videos or even live demonstrations as input and process them to produce specific actions. 

Language models still have a lot of untapped potential and will continue to help robotics researchers make progress on fundamental problems. And as LLMs advance, we can expect those advances to usher in further innovation in robotics.