A machine learning librarian at Hugging Face just released a dataset composed of one million Bluesky posts, complete with when they were posted and who posted them, intended for machine learning research.
…
The data isn’t anonymous. In the dataset, each post is listed alongside the user’s decentralized identifier, or DID; van Strien also made a search tool for finding users based on their DID and published it on Hugging Face. A quick skim through the first few hundred of the million posts shows people doing normal types of Bluesky posting (arguing about politics, talking about concerts, saying stuff like “The cat is gay” and “When’s the last time yall had Boston baked beans?”), but the dataset has also swept up a lot of adult content.
It’s also noteworthy that the dataset is a “snapshot” of a moment in time on Bluesky, meaning it could, and probably does, include since-deleted posts.
This dataset could be used for “training and testing language models on social media content, analyzing social media posting patterns, studying conversation structures and reply networks, research on social media content moderation, [and] natural language processing tasks using social media data,” the project page says. “Out of scope use” includes “building automated posting systems for Bluesky, creating fake or impersonated content, extracting personal information about users, [and] any purpose that violates Bluesky’s Terms of Service.”
The dataset is already popular: as of this writing, it’s one of the top trending Hugging Face projects.