An Analysis of the AQuA Algebra Word Problem Dataset


The AQuA Dataset

DeepMind recently released AQuA, a dataset of multiple-choice algebra word problems designed to test how well current state-of-the-art deep learning techniques can reason.

Why Algebra Word Problems?

While deep learning has had great success in tasks such as image recognition and machine translation, it has had less success in domains that require reasoning. Algebra word problems are an example of a task that deep learning hasn't mastered yet.

DeepMind's Results on AQuA

In their paper, Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems, DeepMind report a model that answers 36.4% of the questions in the test set correctly. Since each question is multiple-choice with five options, a model that guesses at random would be expected to get 20% correct.

Question Variety in AQuA Dataset

I ran an analysis of the AQuA dataset to try to understand the significance of DeepMind's result of 36.4% accuracy.

First, I manually looked over the questions to get a sense of their difficulty and variety.

The questions vary widely. Here is a sample of two:

Question 1: Pascal has 96 miles remaining to complete his cycling trip . If he reduced his current speed by 4 miles per hour , the remainder of the trip would take him 16 hours longer than it would if he increased his speed by 50 % . What is his current speed Z ?

Question 2: What is the greatest possible ( straight line ) distance , between any two points on a hemisphere of radius 6 ?
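To illustrate the kind of reasoning involved, here is one way to work Question 1 by hand. This worked solution is my own sketch and is not taken from the dataset's rationales; the symbol v (Pascal's current speed in miles per hour) is introduced here for convenience. The slower trip at speed v - 4 takes 16 hours longer than the faster trip at speed 1.5v:

```latex
\begin{align*}
\frac{96}{v-4} - \frac{96}{1.5v} &= 16 \\
\frac{96}{v-4} - \frac{64}{v} &= 16 \\
96v - 64(v-4) &= 16v(v-4) \\
32v + 256 &= 16v^2 - 64v \\
v^2 - 6v - 16 &= 0 \\
(v-8)(v+2) &= 0
\end{align*}
```

Taking the positive root gives v = 8 miles per hour (check: 96/4 = 24 hours versus 96/12 = 8 hours, a difference of 16). The answer comes from setting up and solving a quadratic, not from a single arithmetic operation on the numbers in the text.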

Since there are a large number of question types, it didn't seem promising to try to match the questions against a fixed set of question templates, as was done in Kushman et al. (2014).

Baseline Analysis

I ran some basic analyses to see how well simple techniques would do on AQuA.

Find the number values and then guess the answer using heuristics

As a simple baseline, I wanted to see how accurate a simple guessing strategy would be.

The strategy is as follows: find the number tokens in the question, combine them using the basic operations +, -, *, and /, and check whether any of the resulting guesses matches one of the answer options. The full details are in the GitHub code, but here are some basic results on the training data.
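The strategy can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: the function names, the regular expression, the restriction to pairs of numbers, and the example question are all assumptions made for this sketch.

```python
import re
from itertools import permutations

# Assumption: a "number token" is a plain integer or decimal.
# Values such as "1,000" or fractions like "1/2" are not handled.
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def extract_numbers(text):
    """Return every number token found in the text as a float."""
    return [float(tok) for tok in NUMBER_RE.findall(text)]


def generate_guesses(numbers):
    """Combine every ordered pair of numbers with +, -, * and /."""
    guesses = set()
    for a, b in permutations(numbers, 2):
        guesses.update({a + b, a - b, a * b})
        if b != 0:  # skip division by zero
            guesses.add(a / b)
    return guesses


def matching_options(question, options, tol=1e-6):
    """Return indices of answer options whose value equals some guess."""
    guesses = generate_guesses(extract_numbers(question))
    matches = []
    for index, option in enumerate(options):
        values = extract_numbers(option)
        if len(values) != 1:
            continue  # skip options that are not a single number
        if any(abs(values[0] - g) <= tol for g in guesses):
            matches.append(index)
    return matches


# Hypothetical question in the dataset's tokenized style (not from AQuA).
question = "A shop sells 12 pens at 3 dollars each . What is the total cost ?"
options = ["A)15", "B)36", "C)40", "D)9", "E)100"]
print(matching_options(question, options))
```

In this made-up example three options match a guess (12 + 3, 12 * 3 and 12 - 3), which shows why questions with exactly one generated guess are counted separately from questions with at least one. The counts that follow come from the full code on GitHub, not from this sketch.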

Total number of questions: 97467

Questions with at least one percentage value: 13215

Questions with exactly one generated guess: 16246
Of these, questions with a correct guess: 5415

Questions with at least one generated guess: 37367
Of these, questions with a correct guess: 19932
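To put these counts in proportion, the short script below turns them into percentages. It assumes that each "correct guess" count is a subset of the group listed directly above it; the variable names are mine and are not taken from the repository.

```python
# Counts copied from the results above.
counts = {
    "total": 97467,
    "exactly_one_guess": 16246,
    "exactly_one_correct": 5415,
    "at_least_one_guess": 37367,
    "at_least_one_correct": 19932,
}


def pct(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Share of all questions for which the heuristic produces any guess.
coverage = pct(counts["at_least_one_guess"], counts["total"])
# Share of correct guesses within each group.
single_precision = pct(counts["exactly_one_correct"], counts["exactly_one_guess"])
any_precision = pct(counts["at_least_one_correct"], counts["at_least_one_guess"])
# Share of all questions that end up with a correct guess.
overall = pct(counts["at_least_one_correct"], counts["total"])

print(f"Questions with at least one guess: {coverage:.1f}% of all questions")
print(f"Correct among exactly-one-guess questions: {single_precision:.1f}%")
print(f"Correct among at-least-one-guess questions: {any_precision:.1f}%")
print(f"Correct guesses over all questions: {overall:.1f}%")
```

Under that assumption, the heuristic generates a guess for a little over a third of the questions, and roughly a fifth of all questions end up with a correct guess.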

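The upshot: a simple guessing strategy will not suffice on AQuA. The heuristic generates at least one guess for only 37,367 of the 97,467 training questions (about 38%) and a correct guess for 19,932 of them (about 20% of all questions), roughly the rate expected from choosing at random among five options. It is time to dig in and start making progress.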

GitHub Link

Next Steps

Use existing deep learning techniques to categorize the problems and enable more accurate answering strategies.