Microsoft has released a dataset of exactly 100,000 questions and answers, intended for researchers and educators. Microsoft is calling this dataset “MS MARCO” – that’s cleverly short for Microsoft Machine Reading Comprehension.
The dataset is useful to researchers who need a large amount of data for their machine learning experiments.
MS MARCO’s questions are sampled from real, anonymized Bing queries; the answers, however, are written by humans wherever they were able to summarize one. The answers are drawn from context passages obtained using the most advanced version of Bing.
The Human Element
The fact that humans write the answers is perhaps what makes this dataset most valuable – it’s not easy to get human-sourced answers for 100,000 queries.
Microsoft – like everybody else – is attempting to bring advanced machine learning to its AI assistant; asking Cortana “Who is the President of the United States?” might give you the correct result – Barack Obama – but the AI didn’t do any work for it.
Simple factual questions like that are much easier to answer; if you ask Cortana – or any other AI assistant – something more advanced like “Why did the Roman Empire fall?”, it will lead you to a search result instead.
The goal is to make these AI assistants intelligent enough to go through the web – like a human would – and work out a summarized response to the query.
The dataset provides examples of exactly that – for 100,000 questions, there are 100,000 summarized answers written by humans. It will help an AI learn to read natural language as humans would.
Microsoft has made the dataset available for free to researchers and educators – it is not meant for commercial use.