Ask Google Assistant or Cortana something like “What’s 4 + 4?” today and you’re likely to hear “8.” Ask a more difficult question, like “What did the ancient Greeks eat?” and chances are that, instead of answering directly, the assistant will point you toward a website to sift through for the answer.
Microsoft Machine Reading Comprehension (MS MARCO), a dataset of 100,000 questions and answers made available to researchers for the first time today, was made to change that.
By open-sourcing a dataset with answers written by humans, Microsoft hopes MS MARCO can make breakthroughs in artificial intelligence research, and begin to help AI read and understand language like humans would.
That way, instead of having to read through a website to find the answer to your question, you could ask a search engine or virtual assistant, which would skim documents and websites as a human does and then provide a complex or nuanced answer.
The 100,000 questions are based on queries that real people posed to the Bing search engine or the Cortana virtual assistant. The answers in MS MARCO were drawn from more than 200,000 documents or websites and summarized by a human.
“The team chose the anonymized questions based on the queries they thought would be more interesting to researchers. In addition, the answers were written by humans, based on real web pages, and verified for accuracy,” said a Microsoft blog post announcing the release of MS MARCO.
Many datasets used to train natural language processing systems today have notable shortcomings, the eight-person team that compiled MS MARCO argued in a paper published last month on the open research publication arxiv.org.
Most datasets used to train natural language processing (NLP) systems today do not use questions posed by real people, the team argued; they tend to draw on polished resources like Wikipedia rather than the less polished but more realistic queries that people actually ask.
MS MARCO is available to businesses and researchers, but the free download is for non-commercial use only.