
Two Articles on the Great Challenge Ahead: Speech
Talking with Robots: Words, Language & Actions
“There’s a genius to human beings — from understanding idioms to manipulating our physical environments — where it seems like we just ‘get it.’ The same can’t be said for robots.” —Warehouse Automation
“Spoken language interaction with robots: Recommendations for future research.” Computer Speech & Language, Volume 71, January 2022.
Reasons why spoken language interaction with robots will greatly benefit human society include:
- Among the various ways to exchange information with robots, spoken language has the potential to be the fastest and most efficient. Speed is critical for robots capable of interacting with people in real time. Especially in operations where time is of the essence, slow performance is equivalent to failure. Speed is required not only during the action, but also in the human–robot communication, both prior to and during execution.
- Spoken language interaction will enable new dimensions of human–robot cooperative action, such as the real-time coordination of physical actions by human and robot.
- Spoken language interaction is socially potent (Bainbridge et al., 2011), and will enable robots to engage in more motivating, satisfying, and reassuring interactions, for example, when tutoring children, caring for the sick, and supporting people in dangerous environments.
- As robots become more capable, people will expect speech to be the primary way to interact with them.
- Robots that you can talk with may simply be better liked, a critical consideration for consumer robotics.
- Robots can be better communicators than disembodied voices (Deng et al., 2019); being co-present, a robot’s gestures and actions can reinforce or clarify a message, help manage turn-taking more efficiently, convey nuances of stance or intent, and so on.
- Building speech-capable robots is an intellectual grand challenge that will drive advances across the speech and language sciences and beyond.
Not every robot needs speech, but speech serves functions that are essential in many scenarios. Meeting these needs is, however, beyond the current state of the art.
Why don’t we have it yet?
At first glance, speech for robots seems like it should be a simple matter of plugging in some off-the-shelf modules and getting a talking robot (Moore, 2015). But it’s not that easy. This article will discuss the reasons at length, but here we give an initial overview of the relevant properties of robots and spoken communication.
What is a robot, in essence? While in some ways a robot is like any other AI system that needs to converse with humans, there are also fundamental differences. Notably, in general:
1. A robot is situated; it exists at a specific point in space, and interacts with the environment, affecting it and being affected by it.
2. A robot provides affordances; its physical embodiment affects how people perceive its actions, speech, and capabilities, and affects how they choose to interact with it.
3. A robot has very limited abilities, in both perception and action; it is never able to fully control or fully understand the situation.
4. A robot exists at a specific moment in time, but at a time when everything may be in a state of change: the environment, the robot’s current plans and ongoing actions, what it’s hearing, what it’s saying, and so on.
That Special Something Called Speech
Even the simplest human tasks are unbelievably complex.
The way we perceive and interact with the world requires a lifetime of accumulated experience and context. For example, if a person tells you, “I am running out of time,” you don’t immediately worry they are jogging on a street where the space-time continuum ceases to exist. You understand that they’re probably coming up against a deadline. And if they hurriedly walk toward a closed door, you don’t brace for a collision, because you trust this person can open the door, whether by turning a knob or pulling a handle.
A robot doesn’t innately have that understanding. And that’s the inherent challenge of programming helpful robots that can interact with humans. We know it as “Moravec’s paradox” — the idea that in robotics, it’s the easiest things that are the most difficult to program a robot to do. This is because we’ve had all of human evolution to master our basic motor skills, but relatively speaking, humans have only just learned algebra.
In other words, there’s a genius to human beings — from understanding idioms to manipulating our physical environments — where it seems like we just “get it.” The same can’t be said for robots.
Today, robots by and large exist in industrial environments, and are painstakingly coded for narrow tasks. This makes it impossible for them to adapt to the unpredictability of the real world. That’s why Google Research and Everyday Robots are working together to combine the best of language models with robot learning.
Called PaLM-SayCan, this joint research uses PaLM — or Pathways Language Model — in a robot learning model running on an Everyday Robots helper robot. This effort is the first implementation that uses a large-scale language model to plan for a real robot. It not only makes it possible for people to communicate with helper robots via text or speech, but also improves the robot’s overall performance and ability to execute more complex and abstract tasks by tapping into the world knowledge encoded in the language model.
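To make the planning idea concrete, here is a minimal, hypothetical sketch of asking a language model to decompose a request into skills the robot already has. The `plan_steps` function, the `llm` callable, and the skill names are stand-ins for illustration, not PaLM-SayCan's actual interfaces.

```python
from typing import Callable, List

# Hypothetical skill library for an everyday helper robot (illustrative names only).
SKILLS = ["find the sponge", "pick up the sponge", "bring it to the user"]

def plan_steps(llm: Callable[[str], str], request: str, skills: List[str]) -> List[str]:
    """Ask a language model to break a request into steps drawn from known skills."""
    prompt = (
        "Robot skills: " + "; ".join(skills) + "\n"
        "Request: " + request + "\n"
        "Plan, one skill per line:\n"
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]

# Canned stand-in for the language model, just to show the data flow end to end.
fake_llm = lambda prompt: "find the sponge\npick up the sponge\nbring it to the user"
for step in plan_steps(fake_llm, "Bring me a sponge to wipe the table", SKILLS):
    print(step)  # a real system would hand each step to the robot's low-level controller
```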
Using language to improve robots
PaLM-SayCan enables the robot to understand the way we communicate, facilitating more natural interaction. Language is a reflection of the human mind’s ability to assemble tasks, put them in context and even reason through problems. Language models also contain enormous amounts of information about the world, and it turns out that can be pretty helpful to the robot. PaLM can help the robotic system process more complex, open-ended prompts and respond to them in ways that are reasonable and sensible.
PaLM-SayCan shows that a robot’s performance can be improved simply by enhancing the underlying language model. When the system was integrated with PaLM, compared to a less powerful baseline model, we saw a 14% improvement in the planning success rate, or the ability to map out a viable approach to a task. We also saw a 13% improvement in the execution success rate, or the ability to successfully carry out a task. This is half the number of planning mistakes made by the baseline method. The biggest improvement, at 26%, is in planning long-horizon tasks, or those in which eight or more steps are involved. Here’s an example: “I left out a soda, an apple and water. Can you throw them away and then bring me a sponge to wipe the table?” Pretty demanding, if you ask me.
Making sense of the world through language
With PaLM, we’re seeing new capabilities emerge in the language domain such as reasoning via chain of thought prompting. This allows us to see and improve how the model interprets the task. For example, if you show the model a handful of examples with the thought process behind how to respond to a query, it learns to reason through those prompts. This is similar to how we learn by showing our work on our algebra homework.
So if you ask PaLM-SayCan, “Bring me a snack and something to wash it down with,” it uses chain of thought prompting to recognize that a bag of chips may be a good snack, and that “wash it down” means bring a drink. Then PaLM-SayCan can respond with a series of steps to accomplish this. While we’re early in our research, this is promising for a future where robots can handle complex requests.
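As a rough illustration, here is what such a few-shot chain-of-thought prompt could look like for the request above; the wording and format are invented for this sketch and are not PaLM-SayCan's actual prompt.

```python
# Invented few-shot chain-of-thought prompt: the worked example shows the model a
# reasoning trace before the plan, so its completion for the new request is nudged
# to reason first ("chips are a snack, 'wash it down' means a drink") and then plan.
COT_PROMPT = """\
Request: Bring me something to eat from the counter.
Thought: The user wants food. The apple on the counter is food, so I should bring the apple.
Plan: 1. find an apple  2. pick up the apple  3. bring it to the user

Request: Bring me a snack and something to wash it down with.
Thought:"""

# completion = llm.complete(COT_PROMPT)  # hypothetical call to a completion model
```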
Grounding language through experience
Complexity exists in both language and the environments around us. That’s why grounding artificial intelligence in the real world is a critical part of what we do in Google Research. A language model may suggest something that appears reasonable and helpful, but may not be safe or realistic in a given setting. Robots, on the other hand, have been trained to know what is possible given the environment. By fusing language and robotic knowledge, we’re able to improve the overall performance of a robotic system.
Here’s how this works in PaLM-SayCan: PaLM suggests possible approaches to the task based on language understanding, and the robot models do the same based on the feasible skill set. The combined system then cross-references the two to help identify more helpful and achievable approaches for the robot.
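Below is a minimal sketch of that cross-referencing step, under assumed interfaces: `llm_usefulness` stands in for how strongly the language model favors a skill as the next step, and `affordance` stands in for the robot's learned estimate that the skill can actually be completed from its current state. Neither name comes from the PaLM-SayCan code.

```python
from typing import Callable, Dict, Iterable

def choose_next_skill(
    skills: Iterable[str],
    llm_usefulness: Callable[[str], float],   # language side: is this skill helpful next?
    affordance: Callable[[str], float],       # robot side: can I actually do it right now?
) -> str:
    """Pick the skill that is both useful for the request and feasible in the scene."""
    scores: Dict[str, float] = {
        skill: llm_usefulness(skill) * affordance(skill) for skill in skills
    }
    return max(scores, key=scores.get)

# Toy example: the language model slightly prefers grabbing the sponge from the sink,
# but the robot can only reach the one on the counter, so the feasible option wins.
skills = ["pick up sponge from counter", "pick up sponge from sink"]
print(choose_next_skill(
    skills,
    llm_usefulness={"pick up sponge from counter": 0.4, "pick up sponge from sink": 0.6}.get,
    affordance={"pick up sponge from counter": 0.9, "pick up sponge from sink": 0.1}.get,
))
```

Multiplying the two scores is one simple way to require that a candidate step be both useful and feasible: a step the language model loves but the robot cannot currently perform scores near zero and is never chosen.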
Google Is Teaching Robots to Think for Themselves
Excerpted from Warehouse Automation