Study reveals 12.8B-image AI dataset contains millions of personal documents

A new study reveals that DataComp CommonPool, one of the largest open-source AI training datasets with 12.8 billion samples, contains millions of images with personally identifiable information including passports, credit cards, birth certificates, and identifiable faces. The findings highlight a fundamental privacy crisis in AI development, as researchers estimate hundreds of millions of personal documents may be embedded in datasets used to train popular image generation models like Stable Diffusion and Midjourney.

What you should know: Researchers audited just 0.1% of CommonPool’s data and found thousands of validated identity documents and over 800 job application materials linked to real people.

  • The study, published on arXiv, discovered credit cards, driver’s licenses, passports, birth certificates, résumés, and cover letters that were confirmed through LinkedIn searches.
  • Many résumés contained sensitive information including disability status, background check results, birth dates of dependents, race, and contact information for references.
  • Since CommonPool has been downloaded more than 2 million times, these privacy risks have likely been replicated across numerous downstream AI models.

The big picture: Web-scraped AI training data inherently contains private information that people never intended for machine learning use, challenging the industry’s assumption that publicly available content is fair game.

  • CommonPool draws from the same Common Crawl data source as LAION-5B, which trained major models including Stable Diffusion and Midjourney, suggesting similar privacy violations exist across multiple datasets.
  • The data was scraped between 2014 and 2022, meaning many images predate the existence of large AI models—making meaningful consent impossible.

Why privacy filters failed: Despite attempts to protect privacy, CommonPool’s automated face-blurring algorithm missed over 800 validated faces in the small sample audited, which extrapolates to an estimated 102 million undetected faces across the entire dataset.

  • The curators didn’t apply filters for known personally identifiable information strings like emails or social security numbers.
  • Even when faces are blurred, the feature is optional and can be removed, while image captions and metadata often contain additional personal details like names and locations.
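To make the gap concrete, the sketch below shows the kind of simple string-based PII filter the study says the curators did not apply. This is an illustrative assumption, not CommonPool's actual tooling, and the patterns are deliberately minimal; production PII detection requires far more robust methods.

```python
import re

# Illustrative only: simplistic patterns for two PII types named in the study
# (email addresses and US social security numbers). Not CommonPool's code.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # dashed US SSN form
}

def contains_pii(text: str) -> bool:
    """Return True if any known PII pattern appears in a caption or metadata string."""
    return any(pattern.search(text) for pattern in PII_PATTERNS.values())

print(contains_pii("Contact jane.doe@example.com about the resume"))  # True
print(contains_pii("A photo of a mountain at sunset"))                # False
```

Even a crude filter like this would catch some obvious leaks in captions and metadata, which underscores the study's point: the absence of any such string filtering left emails and SSNs in the released data, and as Agnew notes, even filtering cannot eliminate private data at this scale.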

What they’re saying: Experts emphasize that current web-scraping practices are fundamentally flawed and extractive.

  • “Anything you put online can [be] and probably has been scraped,” said William Agnew, a postdoctoral fellow in AI ethics at Carnegie Mellon University.
  • “It really illuminates the original sin of AI systems built off public data—it’s extractive, misleading, and dangerous to people who have been using the internet with one framework of risk,” explained Ben Winters, director of AI and privacy at the Consumer Federation of America.
  • “If you web-scrape, you’re going to have private data in there. Even if you filter, you’re still going to have private data in there, just because of the scale of this,” Agnew added.

Legal limitations: Current privacy laws offer insufficient protection against this type of data harvesting, particularly for research contexts.

  • Privacy regulations like GDPR and California’s CCPA don’t necessarily apply to academic researchers who created CommonPool and often have carve-outs for “publicly available” information.
  • Even when people successfully request data removal, the law remains unclear about whether already-trained models must be retrained or deleted.

The children’s privacy concern: Researchers found numerous examples of children’s personal information, including birth certificates, passports, and health status, often shared in contexts suggesting limited, specific purposes rather than broad AI training use.
