Dec 6, 2024 | AI Stories

Your Bluesky Posts May Now Be Fed To AI

The release and subsequent removal of a dataset of one million Bluesky posts have sparked debate in the tech community and beyond. The episode raises questions about data privacy, consent, and the ethics of using publicly available information for artificial intelligence (AI) research.

Background

Bluesky, a decentralized social media platform, came under scrutiny after one million of its public posts were scraped and uploaded as a dataset to Hugging Face, a popular platform for AI research. Daniel van Strien, a machine learning librarian at Hugging Face, created the dataset for use in developing language models and natural language processing tools.

The dataset included users' decentralized identifiers (DIDs) and came with a search function for finding content from specific users. Its release raised concerns about transparency and consent, because Bluesky users had not agreed to have their posts used for such purposes.