A Comparison of Popular Machine Reading Comprehension Datasets

This is a quick guide for people who have newly joined the Machine Reading Comprehension army. Here I’ll give you some advice about which dataset to start with.


Teaching machines to read is an essential part of ‘True AI’. People have been making progress since the renaissance of deep learning, but we’re not even close: state-of-the-art models still struggle to beat a human kid. Our brain is so sophisticated that most people who claim their models are inspired by the human brain don’t actually know how the human brain works.

Then people came up with an idea: let’s start by training machines to answer reading comprehension questions, like children do, and use question-answering accuracy as an indirect measure of how well machines read and comprehend. That’s smart, because we need some metric to evaluate with. Since then, a lot of MRC datasets have come out. Of course, no existing dataset can certify that a model reads as well as a human, even if the model gets 100% accuracy on it, because all datasets have bias, and question answering is just a small part of human reading comprehension and reasoning.

I saw someone on Weibo (the Chinese Twitter) say that reading comprehension is simple, that it’s all about word pair comparison. Let me use some examples to show the kind of hard stuff that I think word comparison (or even state-of-the-art MRC systems) will have a hard time dealing with.

“The Elder Wand,” he said, and he drew a straight vertical line on the parchment. “The Resurrection Stone,” he said, and he added a circle on top of the line. “The Cloak of Invisibility,” he finished, enclosing both line and circle in a triangle, to make the symbol that so intrigued Hermione. “Together,” he said, “the Deathly Hallows.”
- Xenophilius Lovegood, Harry Potter and the Deathly Hallows


When I read this, I wondered: is the Resurrection Stone actually the philosopher’s stone that appeared in HP1? And is the Cloak of Invisibility the one Harry’s father left to him? Firstly, the span between HP1 and HP7 is really long (at least it’s long for me); can machines have such long-term memory? Secondly, how similar do you think “Resurrection” and “philosopher’s” are?

Here’s another example. When I first came to Montreal, Quebec last year, I saw this on the highway:


Every single word was like an out-of-vocabulary word for me but, unsurprisingly, I soon understood what those words meant and rarely made mistakes on the road. Can a machine do this?

Machine Comprehension Test (MCTest)

official website, paper, nice results 1   2

This is a dataset from MSR containing 660 stories. Each story has 4 human-written questions (Natural Language Questions, NLQ), and each question has 4 candidate answers. This is pretty much like reading comprehension questions for pupils. Most of the stories are short, the sentences are fairly short as well, and the vocabulary is small. The special thing about this dataset is its size: it is so small that almost any Deep Learning method overfits really fast, so most people use feature engineering or some word-matching-based method to deal with it. I think it would be interesting to try some transfer learning approach on it.
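To make "word-matching-based method" concrete, here is a minimal sketch of a bag-of-words baseline for a multiple-choice question. It is not any published MCTest system; the tokenizer, the function names, and the toy story are all made up for illustration.

```python
import re

def tokens(text):
    """Lowercase and split on non-letters; a deliberately crude tokenizer."""
    return set(re.findall(r"[a-z']+", text.lower()))

def pick_answer(story, question, candidates):
    """Return the index of the candidate whose words, pooled with the
    question's words, overlap the story the most."""
    story_words = tokens(story)
    question_words = tokens(question)
    scores = []
    for cand in candidates:
        bag = question_words | tokens(cand)
        scores.append(len(bag & story_words))
    return scores.index(max(scores))

# Toy example (not taken from MCTest).
story = "Tom went to the park with his dog. The dog chased a red ball."
question = "What did the dog chase?"
candidates = ["a cat", "a red ball", "a car", "a bird"]
print(candidates[pick_answer(story, question, candidates)])
```

A baseline like this has no parameters to overfit, which is part of why such methods stay competitive on a dataset this small.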

Children’s Book Test (CBT)

official website, paper, nice results 1   2   3

A dataset from FAIR built from children’s books. Each story is 20 consecutive sentences from a book, and the question (or query) is the following 21st sentence with one word removed. The dataset has 4 splits, classified by the type of word removed from the query: Named Entities, Common Nouns, Verbs, and Prepositions. This type of ‘fill in the blank’ query is called a Cloze-type question. For each question there are 10 candidate answers, which are taken from the story and all have the same POS as the correct answer word.
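The construction described above can be sketched roughly as follows. This is only an illustration of the idea, not FAIR's actual pipeline: the function name, the `XXXXX` blank marker, and the hand-supplied distractors are assumptions (the real dataset picks same-POS words from the story).

```python
import random

def make_cloze(sentences, answer, distractors, blank="XXXXX", seed=0):
    """Build one CBT-style example from 21 consecutive tokenized sentences.

    sentences[:20] become the context; sentences[20] becomes the query
    with the first occurrence of `answer` replaced by `blank`.
    """
    if len(sentences) < 21:
        raise ValueError("need at least 21 sentences")
    context = sentences[:20]
    words = sentences[20].split()
    if answer not in words:
        raise ValueError("answer must occur in the 21st sentence")
    idx = words.index(answer)
    query = " ".join(words[:idx] + [blank] + words[idx + 1:])
    candidates = [answer] + list(distractors)
    random.Random(seed).shuffle(candidates)  # hide the answer's position
    return {"context": context, "query": query,
            "candidates": candidates, "answer": answer}

# Toy input: 20 identical context sentences plus one query sentence.
sentences = ["Alice met Bob near the river ."] * 20 + ["Then Alice gave Bob the key ."]
distractors = ["Alice", "Carol", "Dave", "Eve", "Frank", "Grace", "Heidi", "Ivan", "Judy"]
example = make_cloze(sentences, "Bob", distractors)
print(example["query"])
```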

The CBT dataset is pre-processed and everything is tokenized, so you don’t need to worry about splitting punctuation from the last word of each sentence. You can use it directly to test your algorithm, which is super sweet.

CBT is much larger than MCTest: each split contains about 100K stories, which is good for training Deep Learning models. However, some aspects of it are less than ideal:

  1. It uses 20 sentences as context and the 21st sentence as query, which assumes that the information needed to answer the query is in the previous 20 sentences. This is not always true in the real world, as in the Harry Potter example.
  2. The candidate answers are taken from the story, but for some word types only a few words of that type occur in the story, and some words occur much more often than others, so the candidates are not well balanced. For example, if you always choose “said” from the 10 candidate answers in the Verb split, you’ll get ~25% accuracy.
  3. Although the queries come from the book and are in natural language, they’re not NLQ. Technically this is not even question answering, because there’s no question at all, just filling in blanks.
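The candidate imbalance in point 2 is easy to measure with a constant-answer baseline: find the most common answer in the training data, then always predict it. The sketch below assumes each example is a dict with `answer` and `candidates` keys, which is a made-up format for illustration, and the four toy examples are not real CBT data.

```python
from collections import Counter

def most_frequent_answer(examples):
    """Return the answer word that is correct most often."""
    counts = Counter(ex["answer"] for ex in examples)
    return counts.most_common(1)[0][0]

def constant_baseline_accuracy(examples, word):
    """Accuracy of always answering `word`, counted only when it is
    actually offered as a candidate."""
    hits = sum(1 for ex in examples
               if word in ex["candidates"] and ex["answer"] == word)
    return hits / len(examples)

# Toy Verb-split-like data.
toy = [
    {"answer": "said", "candidates": ["said", "went", "ran"]},
    {"answer": "went", "candidates": ["said", "went", "ran"]},
    {"answer": "said", "candidates": ["said", "saw", "ran"]},
    {"answer": "ran",  "candidates": ["went", "saw", "ran"]},
]
guess = most_frequent_answer(toy)
print(guess, constant_baseline_accuracy(toy, guess))
```

With 10 balanced candidates a constant guess would score around 10%, so any number well above that is a sign of skew in the split rather than comprehension.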

CNN/Daily Mail

official website, dataset download, paper, nice results 1   2   3   4

The CNN/Daily Mail QA dataset was released by Google DeepMind and is (AFAIK) the largest QA dataset. The CNN dataset contains over 90K CNN news stories with about 4 queries per story on average, which gives 380K story-question pairs. Daily Mail has about 200K news stories, also with about 4 queries each, which gives 880K story-question pairs.

All of the stories are real news, so the vocabulary is very large (CNN 120K, DM 210K). The questions are also Cloze-type, but unlike CBT questions they are built from the highlights of each news story, so the information needed to answer a question is guaranteed to be in the given story. Another interesting attribute of this dataset is that its entities are replaced by randomly assigned anonymous tokens like “@entity23”, which 1. makes the task easier by turning multi-word entities into single tokens; and 2. forces models trained on it to learn something beyond the entity itself, preventing them from building up global knowledge about an entity and answering questions from that knowledge.

The average number of entities per story is about 26, and the maximum is over 500, so it’s pretty challenging to choose the correct one from so many entities.
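Here is a rough sketch of the entity anonymization idea. It assumes the entity strings are already known (the real pipeline relies on entity detection and coreference, which is skipped here), and the function name and toy story are hypothetical. Only the “@entityN” token style comes from the dataset.

```python
import random

def anonymize(story, question, entities, seed=None):
    """Replace each entity string with a randomly assigned @entityN token.

    The same mapping is applied to the story and the question, and it is
    re-drawn for every story so a token carries no meaning across stories.
    """
    rng = random.Random(seed)
    ids = list(range(len(entities)))
    rng.shuffle(ids)
    mapping = {ent: f"@entity{i}" for ent, i in zip(entities, ids)}
    # Replace longer names first so a short name nested inside a longer
    # one does not break it apart.
    for ent in sorted(entities, key=len, reverse=True):
        story = story.replace(ent, mapping[ent])
        question = question.replace(ent, mapping[ent])
    return story, question, mapping

story = "Barack Obama met Angela Merkel in Berlin ."
question = "Angela Merkel hosted the meeting in Berlin"
entities = ["Barack Obama", "Angela Merkel", "Berlin"]
anon_story, anon_question, mapping = anonymize(story, question, entities, seed=1)
print(anon_story)
print(anon_question)
```

Note how the two-word names collapse into single tokens, which is the "makes it easier" half of the trade-off, while the shuffled ids take away any world knowledge tied to the name.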

The Stanford Question Answering Dataset (SQuAD)

official website, paper

This dataset was recently released by Stanford University. It contains about 100K question-answer pairs from 536 articles, and the story for each question is a paragraph from one of these articles. Questions in SQuAD are written by crowdworkers, so they’re NLQ.

The strengths of SQuAD are:

  1. Each question has responses from different crowdworkers, so it’s possible to rank answers instead of scoring them True/False. For example, the words where multiple answers overlap are the ones more crowdworkers agree on. This is particularly useful when the true answer to a question is not unique, or when some answers are partially true.
  2. The answers don’t have to be one word; they can also be multiple words or a sentence, which makes it more like a real-world situation.
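One way to turn the two points above into a score is a token-overlap F1 taken against the best-matching crowdworker answer, so a partially correct multi-word prediction gets partial credit. The sketch below is a simplified version of that idea, not the official SQuAD evaluation script: it only lowercases and splits on whitespace, and the example answers are made up.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two answers."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def best_f1(prediction, references):
    """Score a prediction against several human answers; keep the best."""
    return max(token_f1(prediction, ref) for ref in references)

# Three hypothetical crowdworker answers to the same question.
references = ["Denver Broncos", "the Denver Broncos", "Broncos"]
print(best_f1("Denver Broncos", references))
print(round(best_f1("the Broncos", references), 3))
```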

But on the other side, it has some cons:

  1. The generality of this dataset is apparently much lower than that of, say, the CNN dataset, because everything comes from 536 articles.
  2. Each training story contains only one paragraph, mostly just four or five sentences (CNN has ~1K words per story), which is too short.

LAnguage Modeling Broadened to Account for Discourse Aspects (LAMBADA)

paper (dataset not released yet)

This dataset was made by the University of Trento and the University of Amsterdam, and it contains a bunch of really hard reading comprehension questions. In the paper, the authors say “We show that LAMBADA exemplifies a wide range of linguistic phenomena, and that none of several state-of-the-art language models reaches accuracy above 1% on this novel benchmark.” Judging from the examples in the paper, the answer doesn’t need to be in the vocabulary of the input story, which means the model needs to generate the answer instead of picking the word with the highest probability in the story. Here is one example:

  • Context: He shook his head, took a step back and held his hands up as he tried to smile without losing a cigarette. “Yes you can,” Julia said in a reassuring voice. “I’ve already focused on my friend. You just have to click the shutter, on top, here.”
  • Target sentence: He nodded sheepishly, through his cigarette away and took the ____.
  • Target word: camera

The word ‘camera’ doesn’t appear in the given context, so the model has to generate it from seeing ‘click the shutter’, ‘focused on’, and so on. This is crazy for current models, because one needs to use an extra-large target vocabulary and choose one word from it, as people do in Machine Translation.
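A quick sanity check you can run on any cloze dataset is how often the target word actually occurs in the context, since that fraction is an upper bound for any model that can only pick words from the story. The helper names below are made up, and the only data is the single example quoted above (with straight quotes).

```python
import re

def words(text):
    """Lowercased word tokens, ignoring punctuation."""
    return re.findall(r"[a-z']+", text.lower())

def target_in_context(context, target):
    """True if the target word appears anywhere in the context."""
    return target.lower() in words(context)

def pick_from_story_upper_bound(examples):
    """Fraction of (context, target) pairs a pick-from-story model could
    possibly get right."""
    hits = sum(target_in_context(ctx, tgt) for ctx, tgt in examples)
    return hits / len(examples)

context = ("He shook his head, took a step back and held his hands up as he "
           "tried to smile without losing a cigarette. \"Yes you can,\" Julia "
           "said in a reassuring voice. \"I've already focused on my friend. "
           "You just have to click the shutter, on top, here.\"")
print(target_in_context(context, "camera"))
print(target_in_context(context, "shutter"))
```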

One more exciting thing about this dataset is that some models give 0% accuracy as baselines. This means one needs to come up with completely new models for answering this type of question, which might push people toward better ideas and models.

The Movie QA Dataset

paper (dataset not released yet)

This dataset is also from FAIR, and it is in the movie domain. It is designed for comparing QA performance when using Knowledge Bases (KBs), Information Extraction (IE), or the Wikipedia documents themselves, so each question (they seem to be NLQ) comes with the corresponding KB, IE output, and Wiki document.


That’s pretty much all I want to share today. Hopefully it’s helpful for MRC newbies.



  1. Anshul Chauhan
    Posted December 8, 2016 at 5:13 pm | Permalink

    Which one do you propose is the best paper to start implementing for comprehension reading? Or better has already a implementation available and generalizes well?

    • Eric
      Posted December 10, 2016 at 9:30 pm | Permalink

      Rudolf Kadlec, Martin Schmid, Ondrej Bajgar, and Jan Kleindienst. 2016. Text understanding with the attention sum reader network.
      This one is pretty simple and straightforward, and I think they’ve already open sourced their code.
