Harvard Is Releasing A Free AI Training Dataset Backed By OpenAI And Microsoft
Harvard University has announced the release of a dataset of nearly one million public domain books. The release comes from Harvard's Institutional Data Initiative (IDI), which is financially backed by Microsoft and OpenAI. The dataset is intended for training AI models and is meant to serve as a resource for researchers and developers worldwide.
Background and Motivation
The Institutional Data Initiative is Harvard's effort to make its large repository of public domain works available for AI development. The initiative aims to "level the playing field" for AI researchers and startups by providing access to high-quality, openly available training data. The dataset, which includes works scanned by Google Books, is approximately five times the size of the Books3 dataset used to train Meta's Llama.