Virtual assistants go a long way towards making our lives easier. They find information for us, play songs and make calls, among many other daily tasks. But the Facebook Artificial Intelligence Research (FAIR) centre is far from satisfied. Researchers Douwe Kiela, Jason Weston, Harm de Vries, Kurt Shuster, Dhruv Batra and Devi Parikh want to take AI and virtual assistants to the next level. They want to use natural language to make travel more attractive.
Full comprehension of human language has long been a difficult goal. The research team at FAIR is teaching artificial intelligence systems to understand language by getting them to ‘guide’ virtual tourists around New York City. They have developed a new research task, called Talk The Walk, which explores this embodied AI approach while introducing a degree of realism not previously found in this area.
New AI Task
In this new AI task, two agents have to talk to each other to accomplish a goal: navigating to a specific location in the city, such as a tourist spot. Unlike many other tasks, the setting is not game-like. Here, the goal is to navigate through 360-degree images of actual city streets, in this case New York. The researchers at FAIR have created a guide agent that sees a map of the streets and the neighbourhood. Using a novel attention mechanism called MASC (Masked Attention for Spatial Convolutions), the researchers helped the guide bot focus on the right place on the map.
This work goes a long way towards improving the research community’s understanding of perception and communication. Those aspects can lead to grounded language learning and provide a stress test for language as a method of interaction. The FAIR team have also released the baselines and data set for Talk The Walk. This will help other researchers by providing a framework to evaluate embodied AI models, especially those related to dialogue.
The Unlikely Pairing Of Tourism And AI
FAIR researchers used 360-degree images to train the systems and demonstrate grounded language. They took images from five New York City neighbourhoods: Hell’s Kitchen, East Village, Financial District, Upper East Side and Williamsburg. These areas have grid-based layouts with four-cornered intersections, and the images serve as the first-person perception for the tourist agent.
In an era when simulation rules, FAIR’s bet on realism is a breath of fresh air in AI. The researchers have also succeeded in creating natural language conversations between two agents. In fact, they preferred real human talk to scripted messages such as those one sees on Google Maps. The interaction between AI agents promises to be closer to the language we use in our day-to-day life. Participants playing the guide and tourist roles were given the same navigation goals and constraints.
The guide agent has access only to a two-dimensional overhead map with landmarks and noticeable places such as restaurants and hotels. Neither agent can see what the other sees, so reaching a place requires close communication between the two. An episode (to borrow some RL terminology) ends when the guide agent predicts that the tourist agent has reached the required destination. If the prediction is accurate, the episode is marked as successful; otherwise it is marked as incorrect. There is no limit on the number of messages exchanged.
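The episode rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Location` and `Episode` types and the `evaluate_episode` function are hypothetical names made up here, not part of the released Talk The Walk code.

```python
from dataclasses import dataclass


# Hypothetical types for illustration only; not taken from the FAIR release.
@dataclass(frozen=True)
class Location:
    x: int  # column of the intersection on the neighbourhood grid
    y: int  # row of the intersection on the neighbourhood grid


@dataclass
class Episode:
    target: Location        # destination, known to the guide
    tourist: Location       # true tourist position, hidden from the guide
    messages_sent: int = 0  # counted but never capped


def evaluate_episode(episode: Episode, guide_predicts_arrival: bool) -> str:
    """Score an episode at the moment the guide makes its call."""
    if not guide_predicts_arrival:
        # The guide has not committed yet, so the dialogue simply continues.
        return "ongoing"
    # The guide's prediction is checked against the hidden true position.
    return "successful" if episode.tourist == episode.target else "incorrect"
```

The design point worth noticing is that the outcome depends on the guide's belief being right, not on the tourist's own knowledge, which is why the dialogue has to carry the localisation information.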
Building Communication Between Agents
Communication between artificially intelligent agents has always been a prime target. The research team at FAIR focused on natural language but also came up with communication protocols for bots that differ from human ones. There were two settings, as researchers Weston and Kiela put it:
“In the first setting, agents communicated via continuous vectors, meaning they transferred raw data to one another. Those continuous vectors included, for example, representations of what the tourists were observing and doing, to help the map-based guides localise their counterparts.”
Of the second setting, they said: “The second emergent communication setting took a different approach, using what the researchers refer to as synthetic language. In this setting, communication was far more simplified than natural language, using a very limited set of discrete symbols to convey information. By giving the bots the option of communicating in the simplest form possible, the interactions are fast and precise and give us a good idea of how well we could perform with natural language.”
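To make the contrast between the two settings concrete, here is a toy Python sketch. It rests on assumptions: the function names, the feature lists and the argmax rule for choosing a symbol are all invented for illustration, and the article does not say how the real agents produce their messages.

```python
def continuous_message(observation, action):
    """Continuous setting: pass the raw real-valued features through as the message."""
    # The message is just the observation features followed by the action features.
    return list(observation) + list(action)


def discrete_message(observation, vocab_size=4):
    """Synthetic-language setting: collapse the observation into one symbol id.

    Picking the index of the largest of the first `vocab_size` features is an
    assumption made for this sketch, standing in for a learned encoder.
    """
    scores = list(observation)[:vocab_size]
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical features for what the tourist sees and the move it just made.
observation = [0.1, 0.7, 0.2, 0.0]
action = [1.0, 0.0]

vector_msg = continuous_message(observation, action)  # a list of six floats
symbol_msg = discrete_message(observation)            # a single integer symbol
```

The continuous message keeps every number the tourist has, while the discrete one throws almost everything away in exchange for a tiny, unambiguous vocabulary.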
Building Environment-Based AI
The researchers make it very clear that this task is not a competition. They stated:
“Talk the Walk isn’t a competition between natural language and synthetic interactions but rather an attempt to offer clarity and quantifiable results related to the ultimate goal of creating machines that can effectively “talk” to humans and to one another.”
They also added that grounding AI tasks in reality is very important, and that environment-based AI will matter going forward. Attention mechanisms are again used extensively, translating embeddings of places according to tourist state transitions (such as turning left or right). This also helps the guide agent work out where the tourist agent currently is.
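One way to picture translating a map according to a state transition is a 3x3 mask used as a convolution kernel: with a single nonzero entry, it shifts every value on the grid by one cell. The sketch below is a hand-written toy under that assumption. The `OFFSETS` table, the function names and the hard-coded one-hot mask are hypothetical, and this is not the researchers' MASC implementation.

```python
# Hypothetical mapping from a reported move to a (dx, dy) step on the grid map.
# Row 0 is the top of the map, so moving north decreases y.
OFFSETS = {"north": (0, -1), "south": (0, 1), "east": (1, 0), "west": (-1, 0)}


def one_hot_mask(direction):
    """Build a 3x3 mask whose single 1 sits at the cell the move came from."""
    dx, dy = OFFSETS[direction]
    mask = [[0.0] * 3 for _ in range(3)]
    mask[1 - dy][1 - dx] = 1.0
    return mask


def masked_shift(grid, mask):
    """Slide the 3x3 mask over the grid (zero padding) and sum the products."""
    height, width = len(grid), len(grid[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            for ky in range(3):
                for kx in range(3):
                    sy, sx = y + ky - 1, x + kx - 1
                    if 0 <= sy < height and 0 <= sx < width:
                        total += mask[ky][kx] * grid[sy][sx]
            out[y][x] = total
    return out


# The guide's belief starts at the centre of a 3x3 map, then the tourist
# reports a move east, so the belief should slide one cell to the right.
belief = [[0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 0.0]]
belief_after_east = masked_shift(belief, one_hot_mask("east"))
```

A soft mask with weights spread over several cells would spread the belief over several neighbouring intersections instead of moving it to exactly one.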
Building systems grounded in reality can be hard, however. For example, the researchers did not take on some tasks, such as reading the lettering on signage. This shows how difficult building embodied AI can be, because it involves perceiving a given environment, navigating through it and communicating about it.