Researchers from Texas A&M University, the University of Texas, and Purdue University have demonstrated that training large language models on “junk data” from social media can significantly degrade their performance, causing effects similar to human “brain rot.” The study analyzed 100 million tweets to show how models trained on popular but superficial content performed worse on reasoning and memory benchmarks, raising concerns about the quality of data used to train future AI systems.
The big picture: As AI models increasingly rely on internet data for training, the quality of that content directly impacts model performance, with researchers warning that “heavily relying on Internet data leads LLM pre-training to the trap of content contamination.”
How they defined “junk data”: The researchers used two distinct approaches to identify low-quality training content from HuggingFace’s tweet corpus.
- High-engagement, short tweets were classified as junk based on the theory that “more popular but shorter tweets will be considered to be junk data” that maximizes engagement in trivial ways.
- A second classification used GPT-4o to identify tweets focused on “superficial topics (like conspiracy theories, exaggerated claims, unsupported assertions or superficial lifestyle content)” or those with “attention-drawing style (such as sensationalized headlines using clickbait language or excessive trigger words).”
- Graduate-student evaluators reviewed the GPT-4o classifications and agreed with them 76% of the time.
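The first classification rule above — short, high-engagement tweets count as junk — can be sketched as a simple filter. This is an illustrative reconstruction, not the paper's exact metric: the field names (`likes`, `retweets`, `replies`) and thresholds are hypothetical.

```python
def is_junk(tweet: dict, max_tokens: int = 30, min_engagement: int = 500) -> bool:
    """Flag short but highly popular tweets as junk training data.

    Thresholds are illustrative placeholders, not the study's values.
    """
    length = len(tweet["text"].split())
    engagement = tweet["likes"] + tweet["retweets"] + tweet["replies"]
    return length < max_tokens and engagement > min_engagement

# Toy corpus standing in for the tweet dataset.
corpus = [
    {"text": "you won't BELIEVE this one trick", "likes": 9000, "retweets": 4000, "replies": 800},
    {"text": "A long, substantive thread on model training: " + "word " * 40, "likes": 120, "retweets": 10, "replies": 4},
]

junk = [t for t in corpus if is_junk(t)]
control = [t for t in corpus if not is_junk(t)]
```

Splitting the corpus this way yields the junk and control pools that the study then mixed at varying ratios for pre-training.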
Key findings: Models trained with higher ratios of junk data showed measurably worse performance across multiple benchmarks.
- Reasoning (measured by the AI2 Reasoning Challenge, ARC) and long-context memory (measured by RULER) showed statistically significant declines as the share of junk data increased.
- Results were mixed on other measures, with some models showing improved performance on ethical norms and certain personality traits when trained on 50/50 junk-to-control ratios.
- The effects varied across the model families tested, including Llama 8B.
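The varying junk-to-control ratios mentioned above (such as the 50/50 mix) can be illustrated with a small sampling helper. The function name `make_mixture` and the toy pools are hypothetical; this only sketches the idea of holding training-set size fixed while varying the junk fraction.

```python
import random

def make_mixture(junk, control, junk_ratio: float, size: int, seed: int = 0):
    """Sample a fixed-size training set with a given fraction of junk examples.

    Illustrative of the study's junk-to-control ratio experiments;
    not the authors' actual sampling code.
    """
    rng = random.Random(seed)
    n_junk = int(size * junk_ratio)
    return rng.sample(junk, n_junk) + rng.sample(control, size - n_junk)

# Toy pools standing in for the classified tweet corpus.
junk_pool = [f"junk-{i}" for i in range(100)]
control_pool = [f"ctrl-{i}" for i in range(100)]

half_half = make_mixture(junk_pool, control_pool, junk_ratio=0.5, size=50)
```

Sweeping `junk_ratio` from 0 to 1 while keeping `size` constant is what lets the benchmark declines be attributed to data quality rather than data quantity.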
Why this matters: The findings arrive at a critical time when AI-generated content is flooding the internet, potentially creating a feedback loop where future models are trained on increasingly degraded data.
- The researchers warn this could lead to “model collapse” as AI systems are trained on content produced by other AI systems.
- They call for “a re-examination of current data collection from the Internet and continual pre-training practices.”
What the researchers recommend: The study emphasizes that “careful curation and quality control will be essential to prevent cumulative harms” in future AI development.
- Current practices of scraping internet data indiscriminately may need fundamental changes.
- The research suggests AI companies should prioritize data quality over quantity when building training datasets.