*1/7/2018*

DeepMind recently released AQuA, a dataset of multiple-choice algebra word problems designed to test state-of-the-art deep learning techniques.

While deep learning has had great success in tasks such as image recognition and machine translation, it has had less success in domains that require reasoning. Algebra word problems are an example of a task that deep learning hasn't mastered yet.

In the paper *Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems*, DeepMind's model answers 36.4% of the test-set questions correctly. Since each question is multiple-choice with five options, a model that guesses uniformly at random would be expected to get 20% correct.

I ran an analysis of the AQuA dataset to try to understand the significance of DeepMind's result of 36.4% accuracy.

First, I manually looked over the questions to get a sense of their difficulty and variety.

The questions vary widely. Here are two examples:

**Question 1:** Pascal has 96 miles remaining to complete his cycling trip . If he reduced his current speed by 4 miles per hour , the remainder of the trip would take him 16 hours longer than it would if he increased his speed by 50 % . What is his current speed Z ?

**Question 2:** What is the greatest possible ( straight line ) distance , between any two points on a hemisphere of radius 6 ?

Since there are a large number of question types, matching the questions against a fixed set of question templates, as Kushman et al. (2014) did, did not seem promising.

I ran some basic analyses to see how well simple techniques would do on AQuA.

As a baseline, I wanted to see how accurate a simple guessing strategy would be.

The strategy is as follows: find the number tokens in the question, combine them using the basic operations (+, -, *, /), and check whether any of the results match one of the answer options. The full details are in the GitHub code, but here are some basic results on the training data:

- Total number of questions: 97,467
- Questions with at least one percentage value: 13,215
- Questions with exactly one generated guess: 16,246
  - Number of questions with correct guess: 5,415
- Questions with at least one generated guess: 37,367
  - Number of questions with correct guess: 19,932
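The rates implied by these counts are easy to compute. The sketch below assumes that each "correct guess" count is a subset of the guessed-question count listed just before it; that nesting is an interpretation of the results, not something the analysis code states.

```python
# Counts copied from the training-set results above. Assumption: each
# "correct guess" count is a subset of the guessed-question count before it.
TOTAL_QUESTIONS = 97467
EXACTLY_ONE_GUESS = 16246
EXACTLY_ONE_CORRECT = 5415
AT_LEAST_ONE_GUESS = 37367
AT_LEAST_ONE_CORRECT = 19932


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


rates = {
    "questions with any guess": percent(AT_LEAST_ONE_GUESS, TOTAL_QUESTIONS),
    "correct, given exactly one guess": percent(EXACTLY_ONE_CORRECT, EXACTLY_ONE_GUESS),
    "correct, given at least one guess": percent(AT_LEAST_ONE_CORRECT, AT_LEAST_ONE_GUESS),
    "correct guess, over all questions": percent(AT_LEAST_ONE_CORRECT, TOTAL_QUESTIONS),
}

for name, value in rates.items():
    print(f"{name}: {value:.1f}%")
```

Read this way, the strategy produces any guess for only about 38% of questions, and when it produces exactly one guess, that guess is right only about a third of the time.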

The upshot: a simple guessing strategy will not suffice on AQuA. It produces any guess at all for only 37,367 of the 97,467 training questions (about 38%). It is time to dig in and start making progress.
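For readers who want to experiment, here is a minimal sketch of the guessing strategy described earlier. It is a hypothetical reconstruction, not the original analysis code: the number regex, the choice to combine every ordered pair of number tokens, the `"A)16"` option format, and the matching tolerance are all assumptions, and the helper names are illustrative.

```python
import itertools
import re

# Hypothetical sketch of the guessing baseline. The regex, the pairwise
# combination scheme, the "A)16"-style option format, and the tolerance are
# assumptions, not taken from the original analysis code.
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def extract_numbers(text):
    """Return every unsigned integer or decimal token in the text as a float."""
    return [float(token) for token in NUMBER_RE.findall(text)]


def parse_option(option):
    """Split an option such as 'C)16' into ('C', 16.0); return None if not numeric."""
    label, _, value = option.partition(")")
    try:
        return label.strip(), float(value.replace(",", "").strip())
    except ValueError:
        return None


def generate_guesses(numbers):
    """Apply +, -, * and / to every ordered pair of number tokens."""
    guesses = set()
    for a, b in itertools.permutations(numbers, 2):
        guesses.update((a + b, a - b, a * b))
        if b != 0:  # skip division by zero
            guesses.add(a / b)
    return guesses


def matching_options(question, options, tol=1e-6):
    """Return the labels of options whose value equals some guess within tol."""
    guesses = generate_guesses(extract_numbers(question))
    labels = []
    for option in options:
        parsed = parse_option(option)
        if parsed is None:
            continue  # non-numeric options can never match a guess
        label, value = parsed
        if any(abs(value - guess) <= tol for guess in guesses):
            labels.append(label)
    return labels


question = "Tom has 12 apples and buys 30 more . How many apples does he have now ?"
print(matching_options(question, ["A)18", "B)42", "C)50", "D)64", "E)72"]))  # ['A', 'B']
```

In this made-up example both A (30 - 12) and B (12 + 30) match, which shows how a question can yield more than one candidate answer. A question like Question 2 above, which contains a single number token, yields no guesses at all.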

The code for this analysis is on GitHub: https://github.com/ridddlr/AQuA-analysis

Next step: use existing deep learning techniques to categorize the problems, which could enable more accurate answering strategies.