Provenance for Deep Learning Models

12/7/2017

(After getting an overview of deep learning from the videos in CS231N and CS224N, my goal was to get my hands dirty with existing deep learning models. Here I give an overview of some exploratory experiments I ran to see how "data provenance" ideas can be applied to deep learning models, specifically focusing on models that perform Question Answering.)

Summary

The idea explored here: Can data provenance (or a similar concept) help explain why a deep learning model produces a particular output for a given input?

Since deep learning models are thought to be black-box functions, I was interested in seeing whether provenance could give insights on how deep learning models work.

Motivating Scenario

Suppose a deep learning model has been trained to perform Question Answering. The model then reads a paragraph from Wikipedia about Super Bowl 50, and is asked the following question:

Which NFL team represented the AFC at Super Bowl 50?

The model answers "National Football League" but the correct answer is "Denver Broncos". Can we use provenance to debug the deep learning model?

"Provenance" of Question Answering Systems

A Question Answering system can be viewed as a transformation (or a workflow of transformations) that takes as input the sequence of words in the question and outputs an answer:

Input elements: words in the question
Output: answer output by system

In data-oriented workflows, the provenance of an output element is the subset of input elements that contributed to the output element.

Here we explore whether a similar concept can be defined for deep learning models. The concept we propose is "importance."

"Important" words

Important words are those words in the question that have a greater influence on the output answer. For example, in the question:

(Q:) Which NFL team represented the AFC at Super Bowl 50?

we may guess that "NFL" is the important word that caused the model to incorrectly answer "National Football League".

Definition of importance

Let Q be the original question that the model answered with answer A. Let Q' be an edited version of question Q that omits word W. The model answers Q' with answer A'. If A is correct and A' is incorrect (or vice versa), then word W is important to question Q.

Intuitively, if omitting a word in the question caused the model's answer to change from correct to incorrect (or vice versa), the word is deemed "important."

Importance Example

We can ask the model the question Q above omitting the word "NFL" to form Q':

(Q':) Which team represented the AFC at Super Bowl 50?

If the model correctly answers Q' with "Denver Broncos", then we deem the word "NFL" important to the question Q.

Experiments

Setup

The deep learning model used in the experiments was trained on the SQuAD dataset. (SQuAD is described in the Appendix in more detail.)

This model is a simplified version of the model presented in Bidirectional Attention Flow for Machine Comprehension (ICLR 2017), with similar performance to the original implementation.

The code used for the model is from https://github.com/yolandawww/QASystem.

Results

I present examples of questions along with the words in the question deemed "important" in bold and underlined.

(Q1:) Which NFL team represented the AFC at Super Bowl 50? (Q1 is originally answered incorrectly: "National Football League"; correct answer is "Denver Broncos")
(Q2:) In what country is Normandy located? (Q2 is originally answered correctly: "France"; no words deemed important)
(Q3:) In what year was Nikola Tesla born? (Q3 is originally answered correctly: "1856")
(Q4:) In the Rankine cycle, what does water turn into when heated? (Q4 is originally answered correctly: "steam")
(Q5:) In what year did the Amazon experience a drought that may have been more extreme than in 2005? (Q5 is originally answered correctly: "2010")

Comments

Note that these questions have relatively few words deemed "important."

Also consider that these questions from SQuAD are answered in the context of a particular paragraph from Wikipedia. Thus, if the subject (e.g., Tesla in Q3) of the paragraph is already known, then the Question Answering system can answer correctly even if the word "Tesla" is omitted.

Potential Applications

Data Augmentation: By omitting certain words in the training data, we may find new questions that are well-posed and yet the system cannot answer correctly. By adding these new questions to our training set, we may improve the accuracy of our trained model.

Exploration: By finding words in the question that influence the model's answer, we can gain an intuition on how the model currently works, suggesting possible ways to improve the model.

Next Steps

For now, I will shift my focus away from SQuAD and onto word problems. There is a recently released dataset from DeepMind on algebra word problems. Since this dataset is newer than SQuAD, explorations of this dataset may be more fruitful.

Appendix

About the SQuAD Dataset

SQuAD (Stanford Question Answering Dataset) is a dataset that tests a machine's reading comprehension.

The questions and answers in the dataset are based on context paragraphs from Wikipedia.

After reading a paragraph from Wikipedia, the machine is asked a question about the paragraph.

Example:

Context paragraph (from Wikipedia article Super Bowl 50)

Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24-10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.

Question: Which NFL team represented the AFC at Super Bowl 50?

Answer: Denver Broncos

For the SQuAD dataset, the correct answer is always a subspan of the context paragraph.

Deep Learning Models for SQuAD

Deep learning models today perform worse than a human, but substantially better than the logistic regression baseline.

Performance Comparison (F1 Scores are listed below)

Human Performance: 91.2
Current Leader r-net (as of 12/7/17): 87.8
Model used in this writeup (Dev set performance): 71.9
Baseline (Logistic Regression): 51.0