Does Artificial Intelligence Understand Human Language?
Artificial Intelligence has already become part of our everyday lives, and it is growing ever smarter, with ever-widening applications. Nowadays, even human lives can depend on it. So have machines really learned to understand us, or are they merely imitating comprehension? Read on to learn how scientists are trying to answer this far-from-idle question.
AI Studies
IBM Watson is a supercomputer that defeated the best players on the American TV quiz show Jeopardy!. A 2010 commercial dedicated to it claimed that Watson understood natural language in all its complexity and ambiguity. Despite this, the computer failed to revolutionize medicine with artificial intelligence. The lesson became apparent: just because a computer seems, at first glance, to understand language does not mean it really does.
Understanding everyday language is one of the main goals of AI development today. Scientists initially tried to program machines by hand to understand news, literature, and other texts. As the Watson example showed, this was a dead end: it is impossible to write down all the rules needed to understand text. So a different approach was developed. Instead of programming explicit rules, scientists let computers learn on their own by processing vast amounts of text and guessing missing or upcoming words; the result is a language model. Built with neural networks like OpenAI’s GPT-3, such models can generate surprisingly convincing text, display seemingly sophisticated reasoning about language, and even write poetry.
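To make the word-guessing idea concrete, here is a deliberately tiny sketch: a bigram model that simply counts which word tends to follow which in a small text and uses those counts to guess the next word. Real language models such as GPT-3 use deep neural networks trained on terabytes of text, but the underlying training signal, predicting words from context, is the same. The sample corpus below is invented purely for illustration.

from collections import Counter, defaultdict

# Toy corpus, invented purely for illustration.
corpus = (
    "the sports car overtook the mail truck because the sports car was faster "
    "the mail truck was slower than the sports car"
).split()

# Count how often each word follows each other word (a bigram model).
next_word_counts = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    next_word_counts[current][following] += 1

def guess_next(word):
    # Return the word most often seen right after `word` in the corpus.
    candidates = next_word_counts.get(word)
    return candidates.most_common(1)[0][0] if candidates else None

print(guess_next("sports"))  # -> car
print(guess_next("mail"))    # -> truck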
But has GPT-3, trained on thousands of websites, books, and encyclopedias, outperformed Watson? Does the model understand the language it generates? Unfortunately, engineers cannot yet reach an unambiguous conclusion. Philosophers have long tried to answer this question, but in the last ten years AI has left the laboratory and entered the real world, where a lack of understanding can have catastrophic consequences. For instance, IBM Watson was found to suggest numerous incorrect and unsafe treatments, and Google Translate made significant errors when translating medical recommendations for non-English-speaking patients.
How do you know if a machine understands you? In 1950, Alan Turing, one of the fathers of modern computer science, tried to answer this question and devised his famous game, known today as the Turing test. In the test, a person converses with both a computer and another person and, judging only by the answers to their questions, must determine which is which. According to Turing, if the person cannot tell the computer from the human, we can conclude that the computer is capable of thinking and understanding.
Unfortunately, Turing underestimated the computer’s ability to deceive. Even primitive chatbots, like the ELIZA program created by Joseph Weizenbaum in 1966, have repeatedly convinced people that they were talking to a sentient being.
In a 2012 article, three computer scientists, Hector Levesque, Ernest Davis, and Leora Morgenstern, proposed a more objective test called the “Winograd schema.” AI researchers later recognized this test as the best, though not perfect, way to evaluate a computer’s ability to understand. A Winograd schema, named after language researcher Terry Winograd, consists of a pair of sentences, each accompanied by a question, that differ by just one word.
Here is an example pair:
Sentence 1: I poured water from the bottle into the cup until it was full.
Question: What was full, the bottle or the cup?
Sentence 2: I poured water from the bottle into the cup until it was empty.
Question: What became empty, the bottle or the cup?
The sentences in each pair differ by a single word, and that word changes which object is being referred to. Answering the questions requires common sense. The Winograd schema was developed to assess precisely this kind of comprehension and was meant to correct the shortcomings of the Turing test.
The authors developed several hundred schemas designed so that a computer could not answer the questions with a Google search. In 2016, a competition based on Winograd schemas was held, and the best program managed only 58% correct answers, a result not much better than random guessing.
However, AI’s ability to solve Winograd schemas has improved dramatically with the advent of large neural-network language models. For example, in a 2020 study by the OpenAI lab, the GPT-3 model answered nearly 90% of the questions correctly. And models evaluated on the Winograd schema task in the SuperGLUE benchmark performed even more impressively, reaching about 97% accuracy. That’s pretty close to human performance, but is it real understanding?
Not really. Despite the scientists’ best efforts, the Winograd schema is not Google-proof. Like many language-comprehension tests, it contains loopholes that allow neural networks to give correct answers without understanding the sentences’ meaning. For example, consider the following sentences: “the sports car overtook the mail truck because it was going faster” and “the sports car overtook the mail truck because it was going slower.”
A language model trained on a large corpus of English-language text has absorbed the statistical association between “sports car” and “fast” and between “mail truck” and “slow,” and can therefore give the correct answer based on that association alone. Winograd schemas contain many similar loopholes.
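To show how such a loophole can be exploited, here is a minimal sketch in which made-up association scores between each candidate referent and the key adjective are enough to pick the “right” answer, with no comprehension involved. Real models derive such statistics implicitly from their training data; the numbers below are hypothetical.

# Hypothetical association scores; a real model learns such statistics
# implicitly from its training corpus.
association = {
    ("sports car", "faster"): 0.9,
    ("mail truck", "faster"): 0.2,
    ("sports car", "slower"): 0.1,
    ("mail truck", "slower"): 0.7,
}

def guess_referent(adjective, candidates=("sports car", "mail truck")):
    # Pick whichever candidate is more strongly associated with the adjective.
    return max(candidates, key=lambda c: association[(c, adjective)])

print(guess_referent("faster"))  # -> sports car
print(guess_referent("slower"))  # -> mail truck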
In 2019, researchers at the Allen Institute for AI created an improved version of the Winograd schema, called “WinoGrande,” which consists not of hundreds of sentences but of forty-four thousand. The researchers then used a simple algorithm to eliminate sentences that contained statistical loopholes. As expected, the remaining sentences proved much harder for computers.
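The gist of that filtering step can be sketched as follows; this is only an illustration of the idea, not the Allen Institute’s actual algorithm. Answer every question with a shallow word-association baseline and keep only the examples the baseline gets wrong. The mini-dataset and scores below are invented.

# Invented mini-dataset: each item names the key adjective and the referent
# a human would choose.
examples = [
    {"adjective": "faster", "answer": "sports car"},
    {"adjective": "parked", "answer": "mail truck"},
]

# Hypothetical association scores available to a shallow baseline.
association = {
    ("sports car", "faster"): 0.9, ("mail truck", "faster"): 0.2,
    ("sports car", "parked"): 0.4, ("mail truck", "parked"): 0.4,
}

def shallow_baseline(adjective, candidates=("sports car", "mail truck")):
    # Answer using word statistics alone, with no understanding.
    return max(candidates, key=lambda c: association[(c, adjective)])

# Keep only the examples the statistical shortcut fails on.
hard_examples = [
    ex for ex in examples
    if shallow_baseline(ex["adjective"]) != ex["answer"]
]
print(hard_examples)  # only the "parked" example survives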
Yet another surprise awaited the scientists. In the more than two years since WinoGrande was created, language models have grown even larger, and the larger they get, the better their results. Today, programs trained on terabytes of text and on thousands of WinoGrande examples give about 90% correct answers, while humans score about 94%.
Conclusion
It is still difficult to determine from test results whether an AI understands the text it processes. Thanks to loopholes, neural networks often achieve high scores on Winograd schemas and other popular assessments of general language comprehension.
The crux of the problem is that understanding language is impossible without understanding the world, and a computer that deals only with language has no understanding of the world. One needs to know what a sports car and a mail truck are, that cars can overtake each other, and that cars are objects in the material world driven by people with purposes of their own. All of this is self-evident to us, but not to computers, and such knowledge cannot simply be written down as text to train a model on. Some cognitive scientists argue that humans rely on innate, basic knowledge about space, time, and other core properties of the world as they learn and understand language.
If we want computers to master human language, perhaps we should first equip them with some of the basic knowledge humans possess from birth. Accordingly, when assessing a computer’s understanding, we should begin by evaluating its grasp of these basic principles. Teaching computers “childlike” knowledge looks like a huge step backward in light of the achievements of AI systems such as Watson and GPT-3. But if our goal is for AI to achieve genuine understanding, it might be the only way.