
How good are LLMs at thinking about language?
Originally posted on The Horizons Tracker.
AI platforms such as ChatGPT are often described as advanced prediction engines. Trained on vast collections of text—from news and novels to film scripts and online forums—they generate words by guessing what comes next. Their fluent answers can feel human, but genuine sentience remains out of reach.
New research [1] from the University of California, Berkeley, shows that these systems are moving closer to a distinctly human skill: thinking about language itself. The study suggests that some models can now analyse sentences much like a trained linguist, challenging the notion that only humans can reflect on how language works.
The ability to talk about and manipulate language, known as metalinguistic ability, has long been seen as a hallmark of human cognition. "Our new findings suggest that the most advanced large language models are beginning to bridge that gap," the researchers explain. "Not only can they use language, they can reflect on how it is organised."
Put to the test
The team tested several models, including different versions of OpenAI’s ChatGPT and Meta’s Llama 3.1, by feeding them 120 complex sentences. They asked the systems to analyse each sentence, judge whether it displayed certain linguistic features, and draw a syntactic tree showing its structure.
Consider the sentence “Eliza wanted her cast out.” Did Eliza want someone expelled, or did she want a plaster cast removed? Most models failed to notice the ambiguity. But OpenAI’s o1 model, designed for more advanced reasoning, both detected the double meaning and produced an accurate diagram.
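The two readings can be sketched as nested structures, a rough stand-in for the syntactic trees the models were asked to draw (the bracketing and labels here are illustrative, not taken from the study):

```python
# Two readings of "Eliza wanted her cast out", as nested tuples.
# Labels (S, NP, VP, ...) are standard syntactic categories; the exact
# analysis is a simplified illustration.

# Reading 1: "her" is the object, "cast out" the predicate —
# Eliza wanted [her] [cast out], i.e. someone expelled.
reading_1 = ("S", ("NP", "Eliza"),
                  ("VP", "wanted",
                        ("Clause", ("NP", "her"), ("VP", "cast", "out"))))

# Reading 2: "her cast" is a noun phrase, "out" a particle —
# Eliza wanted [her cast] [out], i.e. a plaster cast removed.
reading_2 = ("S", ("NP", "Eliza"),
                  ("VP", "wanted",
                        ("NP", "her", "cast"),
                        ("Particle", "out")))

# Same word string, different structure: the ambiguity most models missed.
assert reading_1 != reading_2
```

The point of the test is exactly this contrast: a model that only predicts surface text can emit the sentence either way, but only a model that represents structure can say that two distinct trees share one string.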
The researchers then explored a more challenging feature: recursion, the ability to embed phrases within phrases. Noam Chomsky described this as a defining property of human language—“the dog that chased the cat that climbed the tree barked loudly.” When prompted, o1 not only identified recursive structures but extended them, turning “Unidentified flying objects may have conflicting characteristics” into “Unidentified recently sighted flying objects may have conflicting characteristics.”
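The kind of nesting the researchers probed can be sketched with a small recursive function that embeds relative clauses inside noun phrases, in the spirit of Chomsky's example (the vocabulary lists are illustrative, not from the study):

```python
# A minimal sketch of linguistic recursion: each step embeds another
# "that <verb> ..." clause inside the noun phrase, so phrases nest
# within phrases to arbitrary depth.

def embed(nouns, verbs):
    """Build a noun phrase by recursively nesting relative clauses."""
    if len(nouns) == 1:
        return f"the {nouns[0]}"  # base case: no further embedding
    # Recursive case: wrap the rest of the phrase in a relative clause.
    return f"the {nouns[0]} that {verbs[0]} " + embed(nouns[1:], verbs[1:])

sentence = embed(["dog", "cat", "tree"], ["chased", "climbed"]) + " barked loudly"
print(sentence)
# → the dog that chased the cat that climbed the tree barked loudly
```

Recognising that a sentence has this self-embedding shape, and being able to extend it by one more layer, is the ability o1 demonstrated with the "unidentified flying objects" example.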
This performance stood out. It suggests that, at least in some respects, the latest models are doing more than mimicking surface patterns. They are showing signs of understanding how language fits together.
Such tests offer a useful benchmark for future research. They help separate genuine progress in AI from hype about its abilities. “Everyone knows what it’s like to talk about language,” the researchers conclude. “This gives us a clear way to measure how well these systems are really doing.”
Article source: How Good Are LLMs At Thinking About Language?
Header image source: Created by Bruce Boyes with Microsoft Designer Image Creator.
Reference:
1. Begus, G., Dabkowski, M., & Rhodes, R. (2025). Large Linguistic Models: Investigating LLMs' Metalinguistic Abilities. IEEE Transactions on Artificial Intelligence, 6(12), 3453-3467.