From the beginning of the digital age, we’ve looked to our computers for answers. Nowhere is this so evident as in the computer science discipline known as question answering, or QA. Overlapping the fields of natural language processing and information retrieval, QA initially utilized handcrafted knowledge bases to answer questions. Today, however, these systems increasingly use machine learning and pre-trained language models like OpenAI’s GPT-3 to achieve their results.
One of the newest and most innovative of these QA models has recently been developed at the Allen Institute for AI (AI2) in Seattle. Macaw, which loosely stands for “Multi-angle c(q)uestion answering,” was developed as an open-source project and is available to the community via GitHub.
If you’d like to see how Macaw works, AI2 is making its interactive demo available to the public starting today. You can use the demo to explore Macaw’s answers and compare them to those given by the GPT-3 language model on a benchmark set of questions.
Macaw is built on top of Google’s pre-trained open-source T5 language model, which is less than a tenth the size of the well-known GPT-3 language model. Yet despite its considerably smaller size, Macaw outperformed GPT-3 by more than 10 percentage points on Challenge300, a suite of 300 questions designed to push various limits of question-answering systems. In a performance comparison with three other QA systems, Macaw scored 75%, compared with 65% for both GPT-3 and AI21 Labs’ Jurassic-1 and 57% for Google’s T5-CBQA (T5 Closed Book QA).
“What’s so interesting to me is Macaw produces quite remarkable answers, to the extent it can even surprise someone like me who’s worked in AI for years,” said Peter Clark, project lead and senior research manager at AI2. Clark has worked in artificial intelligence for more than three decades.
Of the existing pre-trained QA systems, none had previously been able to perform as well as GPT-3’s few-shot model. A few-shot model is shown only a small number of worked examples of a task before it is asked to produce its own answers.
But that was before Macaw. The performance gap between Macaw and GPT-3 may seem counterintuitive given that GPT-3 has 175 billion parameters, while Macaw’s T5 model has only 11 billion. These parameters are the weights and biases in the model’s neural network, and their number can be thought of as a general indication of the scale and overall complexity of a pre-trained language model. In recent years, increased scale has been accompanied by improved capabilities. But Macaw’s approach to QA makes a huge difference.
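The figures quoted so far can be checked with a few lines of arithmetic. The sketch below uses only the rounded numbers given in this article (parameter counts and Challenge300 scores); the function and variable names are illustrative, not part of any Macaw release.

```python
# Rounded figures as quoted in this article.
GPT3_PARAMS = 175e9   # GPT-3 parameter count
MACAW_PARAMS = 11e9   # parameter count of the T5 model behind Macaw
SCORES = {"Macaw": 75, "GPT-3": 65, "Jurassic-1": 65, "T5-CBQA": 57}  # percent


def size_ratio(small: float, large: float) -> float:
    """Return the smaller model's size as a fraction of the larger one."""
    return small / large


def gaps_vs(baseline: str, scores: dict) -> dict:
    """Return the baseline's lead over every other system, in percentage points."""
    base = scores[baseline]
    return {name: base - s for name, s in scores.items() if name != baseline}


ratio = size_ratio(MACAW_PARAMS, GPT3_PARAMS)
gaps = gaps_vs("Macaw", SCORES)
print(f"Macaw is {ratio:.1%} the size of GPT-3")
print(gaps)
```

Because the published scores are rounded, the gaps computed here are approximate. The size ratio comes out well under one tenth, consistent with the comparison made earlier.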
Many early QA systems relied on querying a structured database for their answers: input a question and the system would output a corresponding answer. But more recently, QA systems have been based on pre-trained language models which have the potential for much greater versatility. In Macaw’s case, its multi-angle approach allows it to use different combinations of inputs and outputs to achieve surprisingly impressive results.
“Instead of just giving it one permutation,” Clark explains, “we’re giving it all of these different permutations and that has two advantages. One is, in principle, it should improve its performance in all of these individual tasks. And secondly, it allows a bit more flexibility in using the system.”
Macaw achieves this by using a combination of “slots” as its inputs and outputs. These slots are the Context, Question, Multiple-choice options, Answer and Explanation. Using different “angles,” or combinations of these slots, as the input lets the system generate a different, often more accurate output (see figure below).
For example, you might input a question along with its context in order to get an answer. Or you might give Macaw a question, an answer and the context and the system would return a set of multiple-choice options as its output. Macaw can even generate explanations to accompany its answers, though the study’s researchers consider these to be of lower quality than the other kinds of results the model generates.
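To make the idea of slots and angles concrete, here is a minimal sketch of how one angle might be turned into a single text input for a sequence-to-sequence model, and how the model's text output might be split back into slots. The `$slot$` notation, the lowercase slot names and both helper functions are assumptions made for illustration, loosely modeled on the style used in the project's public repository; the released model's exact format may differ.

```python
import re
from typing import Dict, List

# The five slots named in the article. The short lowercase names are an assumption.
SLOTS = ("context", "question", "mcoptions", "answer", "explanation")


def build_angle_input(outputs: List[str], inputs: Dict[str, str]) -> str:
    """Encode one angle as a string: requested slots first, then the given slots."""
    for name in list(outputs) + list(inputs):
        if name not in SLOTS:
            raise ValueError(f"unknown slot: {name}")
    if set(outputs) & set(inputs):
        raise ValueError("a slot cannot be both an input and an output")
    requested = [f"${name}$" for name in outputs]
    given = [f"${name}$ = {text}" for name, text in inputs.items()]
    return " ; ".join(requested + given)


def parse_slots(text: str) -> Dict[str, str]:
    """Split a string such as '$answer$ = gray ; $explanation$ = ...' into slots."""
    pattern = r"\$(\w+)\$ = (.*?)(?= ; \$\w+\$ = |$)"
    return dict(re.findall(pattern, text))


# One angle: given a question, ask for an answer and an explanation.
prompt = build_angle_input(
    ["answer", "explanation"],
    {"question": "What color is a cloudy sky?"},
)
# A made-up model response, used only to show the parsing step.
parsed = parse_slots("$answer$ = gray ; $explanation$ = Clouds block sunlight.")
```

Swapping which slots appear in `outputs` and which in `inputs` yields a different angle without changing the model, which is the flexibility Clark describes.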
“We’ve used it for generating explanations for questions and answers,” Clark explains. “So, we can say, we have an answer to this question. Can you explain it for us? And Macaw was able to do that as well.”
Macaw’s output is further improved by recursively assembling its inputs and outputs in different combinations, so they can be fed back into the system, often improving the accuracy of the final output. The result is a much stronger “zero-shot” performance. Zero-shot in this context refers to generating answers to questions for which Macaw has no prior labeled examples. This amounts to a kind of inference, a variation of the kind of reasoning people perform, reaching conclusions based on evidence. While it’s no surprise the system isn’t as good as we are at this, it’s still impressive.
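The feedback loop described above can be sketched as a chain of angles, where the slots produced at one step become inputs to the next. This is an illustration of the general idea only, not the researchers' procedure: `chain_angles` is a hypothetical helper, and `toy_model` is a stand-in that returns canned text so the example runs without any real model.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Slots = Dict[str, str]
# A model takes the wanted slot names plus the known slots and returns new slots.
Model = Callable[[Sequence[str], Slots], Slots]


def chain_angles(model: Model, known: Slots, steps: Sequence[Sequence[str]]) -> Slots:
    """Run several angles in sequence, feeding each step's outputs into the next."""
    state = dict(known)
    for wanted in steps:
        produced = model(wanted, state)
        # Keep only the slots that were actually requested at this step.
        state.update({name: produced[name] for name in wanted if name in produced})
    return state


# Record of (wanted slots, available slots) per call, to show what each step saw.
calls: List[Tuple[Tuple[str, ...], Tuple[str, ...]]] = []


def toy_model(wanted: Sequence[str], given: Slots) -> Slots:
    """Hypothetical stand-in for a real QA model; returns canned text."""
    calls.append((tuple(wanted), tuple(sorted(given))))
    canned = {
        "mcoptions": "(A) blue (B) gray (C) green",
        "answer": "gray",
        "explanation": "Clouds block direct sunlight.",
    }
    return {name: canned[name] for name in wanted}


# Step 1: question -> multiple-choice options.
# Step 2: question + options -> answer.
result = chain_angles(
    toy_model,
    {"question": "What color is a cloudy sky?"},
    [["mcoptions"], ["answer"]],
)
```

In this sketch the second call sees both the original question and the options generated in the first call, which mirrors how intermediate outputs are fed back into the system.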
Though Macaw reaches its answers very differently from how we do, it’s a little analogous to our own reasoning. Several pieces of information are often more helpful than a single item or data point, even though they may not all be directly relevant. Different contexts may also alter the conclusions we reach. At a certain level, the same can be said for Macaw.
One of the ongoing challenges in artificial intelligence is to give machines general common sense about the world, much as people have. To this end, AI2 has its Mosaic project, a team led by Yejin Choi that focuses on developing machine commonsense reasoning.
But Macaw also demonstrates a considerable degree of common sense as a result of being trained on millions of real-world questions and answers. Combined with its ability to perform zero-shot reasoning, it’s feasible that Macaw and other commonsense systems could one day support each other, contributing to and reinforcing each other’s capabilities.
Clark acknowledges this. “There is a huge overlap and our two teams do work very closely together,” he said. Details about Macaw’s approach and methods can be found in the study paper, “General-Purpose Question-Answering with Macaw” by Oyvind Tafjord and Peter Clark, both of AI2.