TL;DR
AI models seem obsessed with em-dashes because they've likely been trained on a lot of old books, especially ones from the 1800s and early 1900s, where em-dashes were everywhere. Earlier versions of AI didn't use them as much, but newer ones do, possibly because their training data now includes digitized classics full of long sentences and dramatic punctuation. Other ideas, like regional English habits or human feedback bias, don't match the data. In short, today's AI may write like it's been reading Moby-Dick.
Introduction
Why does AI writing sound like it's addicted to em-dashes, those long dashes that break up sentences? Once you start noticing it, you can't unsee it. Every time a chatbot writes something, there's an em-dash waiting to jump out at you. And the strange thing is, no one really knows why.
Research
Here are some theories. Some say it's because English writing already uses a lot of them, so the model just copies what it sees. But that doesn't quite fit: if humans used them that often, we wouldn't notice the difference. Others think the model uses them because they give it room to pause or switch ideas without committing too hard to a sentence ending. That sort of makes sense, but commas and semicolons could do the same job, so it doesn't feel like the full story either.
Then there's a theory that's a little more interesting. During the human-feedback stage of training, reviewers in certain countries rate the AI's answers to help it write better. Some people wondered if the English used in those regions naturally favors em-dashes, so the models “learned” that style. But when people actually looked into it, using data from the Nigerian component of the International Corpus of English (ICE-Nigeria), it turned out that those dialects use fewer em-dashes than standard American or British English. So that idea doesn't hold much weight.
The most convincing idea I've seen is about the kind of books the AI was trained on. Older books, especially from the 1800s and early 1900s, use a lot more em-dashes. A study on punctuation frequency in English text showed that the em-dash peaked around 1860 and then gradually declined. So if modern AI models are trained on a mix of old literature and scanned print books, it makes sense they'd pick up that habit.
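If you want to poke at that kind of claim yourself, the measurement is simple enough to sketch. Below is a minimal Python example that counts em-dashes per 10,000 characters in a handful of plain-text files; the file names are hypothetical placeholders, so point them at whatever public-domain novels or modern prose you have on hand.

```python
# Minimal sketch: compare em-dash density across a few plain-text files.
# The file names below are hypothetical placeholders, not a real dataset.
from pathlib import Path

EM_DASH = "\u2014"  # the em-dash character

def em_dashes_per_10k_chars(path: Path) -> float:
    """Return the number of em-dashes per 10,000 characters of text."""
    text = path.read_text(encoding="utf-8", errors="ignore")
    if not text:
        return 0.0
    return 10_000 * text.count(EM_DASH) / len(text)

if __name__ == "__main__":
    # Swap in your own files: e.g., an 1850s novel vs. a folder of recent blog posts.
    for name in ["moby_dick.txt", "middlemarch.txt", "modern_blog_posts.txt"]:
        p = Path(name)
        if p.exists():
            print(f"{name}: {em_dashes_per_10k_chars(p):.1f} em-dashes per 10k chars")
        else:
            print(f"{name}: not found (placeholder path)")
```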
Some researchers and writers, like Maria Sukhareva, have pointed out that many AI companies started digitizing classic books between 2022 and 2024, as they needed cleaner, higher-quality data for training. Court documents even revealed that Anthropic began scanning print books in early 2024, and OpenAI likely did something similar. So maybe all these models just spent a bit too long “reading” Victorian and Edwardian prose.
It's a bit poetic, really. These models might be picking up quirks from long-dead writers, learning their rhythm and punctuation without realizing it. Much as we absorb habits from the people we talk to, the AI may have soaked up the dashes, pauses, and pacing of older literature and blended them with modern phrasing.
Some people even joked that platforms like Medium might have played a part, since they automatically convert two hyphens (“--”) into an em-dash. The CEO of Medium once said they might be partly responsible, but that feels more like a funny coincidence than the true cause.
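For what it's worth, that kind of "smart punctuation" is easy to picture in code. Here's a tiny illustrative sketch of the double-hyphen substitution; it's an assumption about how such a feature might behave, not Medium's actual implementation.

```python
import re

def smart_dashes(text: str) -> str:
    """Replace an isolated double hyphen with an em-dash.

    Illustrative only: an assumption about how a 'smart punctuation'
    feature might behave, not any platform's real implementation.
    """
    # Skip longer hyphen runs (e.g., '----' used as a divider line).
    return re.sub(r"(?<!-)--(?!-)", "\u2014", text)

# The double hyphen comes back as a single em-dash character.
print(smart_dashes("The model paused--then kept going."))
```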
What fascinates me most is how something as tiny as a punctuation mark can reveal so much about what's underneath the surface of these systems. It's like hearing an accent you can't quite place. The AI sounds modern, but every em-dash carries a faint echo of the past, a whisper from the pages it was trained on.
References
For the curious, here are some related reads: