AI slop is a common online nuisance. But what makes a piece of text “slop”? 

As AI-generated text multiplies across the web, researchers Chantal Shaib and Byron Wallace embarked on a quality control mission with one question at its heart — what distinguishes worthwhile material from "slop"?

by Will Beeker

Chantal Shaib (left) and Byron Wallace

If you’ve spent time on social media recently, you’ve probably encountered AI-generated content that seemed especially low-quality or unnecessary. It may have given you a feeling of frustration or even disgust before you quickly scrolled away. Maybe it was a photo with an unnaturally warm glow, an article that said nothing of substance, or an image of Jesus Christ depicted as an amalgam of shrimp.  

This type of content has been derisively labeled “AI slop” by internet users, and when it comes to images, most users feel they simply know it when they see it. 

Beyond images, what about text? With almost 50% of ChatGPT usage involving writing and information seeking, according to a recent study from OpenAI, the ability to detect slop in text is equally critical.  

This was the challenge faced by a team of researchers including Khoury PhD student Chantal Shaib and her advisor Byron Wallace, who recently published a paper titled “Measuring AI ‘slop’ in Text.”  

“People can point to features in images that seem low quality or maybe a little contrived, but there’s no systematic way to figure out what slop looks like in text,” says Shaib, the paper’s lead author. 

Like many internet users, Shaib was intrigued by this new phenomenon. But unlike most users, she has a background in natural language processing and machine learning that uniquely positioned her to tackle this linguistic problem.

Shaib and her colleagues first set out to create a taxonomy of slop in text. They recruited experts from a wide array of disciplines, including linguistics and philosophy, to help develop a workable definition.  

They used qualitative content analysis and deductive coding to map the experts’ definitions of textual slop onto key metrics, including density, relevance, factuality, bias, structure, coherence, and tone. These metrics were then grouped together under broader themes like information utility, information quality, and style quality. 
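To make the taxonomy concrete, here is a minimal Python sketch of how the themes and metrics could be organized. The theme and metric names come from the article, but the exact grouping shown, the data structure, and the helper function are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: theme and metric names are from the article,
# but the specific metric-to-theme assignment here is an assumption.
SLOP_TAXONOMY = {
    "information_utility": ["density", "relevance"],
    "information_quality": ["factuality", "bias"],
    "style_quality": ["structure", "coherence", "tone"],
}

def metrics_for_theme(theme: str) -> list[str]:
    """Return the finer-grained metrics grouped under a broader theme."""
    return SLOP_TAXONOMY.get(theme, [])
```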

These themes provided a framework to quantify aspects of slop. With a working definition outlined, the researchers brought in copy editors to annotate AI-generated text taken from news articles and question-and-answer search engine queries, labeling passages that met their criteria for slop. 

The following passage was marked for factuality issues (the scientist is a real person, but did not speak this quote): 

“Climate change is like adding steroids to our weather,” says Dr. Michael Oppenheimer, a climate scientist at Princeton.

This redundant passage was marked for structural issues: 

But did you know there’s another important number—sort of like a “secret” code—printed just beneath the sell-by date? … Find the secret code, which is usually near the sell-by date. 
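A labeled passage like the one above might be recorded roughly as follows. This is a hypothetical sketch of an annotation record; the study's actual schema and tooling are not described at this level of detail.

```python
from dataclasses import dataclass, field

# Hypothetical record format for a copy editor's annotation (assumed, not
# the study's actual tooling).
@dataclass
class SlopAnnotation:
    passage: str                                 # the AI-generated text being judged
    domain: str                                  # e.g. "news" or "question-answering"
    issues: list = field(default_factory=list)   # e.g. ["factuality"], ["structure"]
    is_slop: bool = False                        # overall judgment

example = SlopAnnotation(
    passage='"Climate change is like adding steroids to our weather," ...',
    domain="news",
    issues=["factuality"],   # fabricated quote attributed to a real scientist
    is_slop=True,
)
```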

“We found that there was quite a decent amount of slop — about 35% of our text samples,” Shaib says. “But the point of this work wasn’t so much the prevalence of slop in AI-generated texts, but whether we could pick out the features that contribute to this overall assessment of slop.” 

The team found that “text lacking relevance and information, or containing factual errors or biased language, is consistently labeled as slop across domains.” In the case of news articles, annotators deemed text that was “verbose, off-topic, or contained tonal/framing issues” as likely to be slop. Conversely, with question-answer tasks, factuality and structural issues were the strongest predictors of slop. These results suggest that large language models (LLMs) need to be evaluated with respect to their intended use.  
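One simple way to ask which annotated features best predict an overall slop judgment is to fit a regression over the binary issue labels and compare coefficients. The sketch below uses placeholder data and is not the analysis the authors ran; it only illustrates the idea of relating per-feature annotations to the overall label.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy illustration, not the paper's analysis: rows are annotated passages,
# columns are binary issue flags from the taxonomy, y is the overall slop label.
feature_names = ["density", "relevance", "factuality", "bias",
                 "structure", "coherence", "tone"]
X = np.random.randint(0, 2, size=(200, len(feature_names)))  # placeholder annotations
y = np.random.randint(0, 2, size=200)                        # placeholder slop labels

model = LogisticRegression().fit(X, y)
for name, coef in zip(feature_names, model.coef_[0]):
    print(f"{name}: {coef:+.2f}")  # larger coefficients suggest stronger predictors
```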

The researchers also found that LLMs are notoriously bad at identifying slop, including in their own outputs, and that they struggle to grasp why well-written text is better than sloppy text.

“If we were to take an off-the-shelf model like one of the GPTs — even one capable of reasoning — give it the guide and ask it to identify what is sloppy, it fails to do so,” Shaib says. “We know that these models prefer their own outputs, but clearly they also can’t tell whether or not they’re producing text that’s useful to the user downstream.” 
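A rough sketch of the kind of check Shaib describes: hand an off-the-shelf chat model the annotation guide and a passage, then ask whether the passage is slop. The prompt wording, function name, and model choice below are illustrative assumptions, not the prompts or setup used in the study.

```python
from openai import OpenAI  # assumes the v1 OpenAI Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def flag_slop(guide: str, passage: str, model: str = "gpt-4o") -> str:
    """Ask a chat model to judge a passage against a slop guide.

    Illustrative only: not the study's actual prompts or evaluation setup.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": f"You are a copy editor. Use this annotation guide:\n{guide}"},
            {"role": "user",
             "content": f"Is the following passage slop? Explain briefly.\n\n{passage}"},
        ],
    )
    return response.choices[0].message.content
```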

Part of the trouble seems to come from AI experts focusing on correctness more than style.  

“Many benchmarks focus on accuracy, but few exist to evaluate the style and quality of the writing,” Shaib says. “We need to move beyond these traditional evaluations. My hope is that this research provides a framework for people to start assessing and evaluating texts beyond just correctness.” 

Shaib would also like to see larger-scale attempts at characterizing slop in text. 

“I think it would be very valuable to survey non-experts who interact with or come across AI-generated content and see what their take on it is,” she says. “Future work should continue focusing on developing automatic metrics for evaluating slop at scale.” 
