By Dr Kerrilee Lockyer
The hype around artificial intelligence (AI) has kept me more enthralled than the technology itself – which is saying something, because the technology is pretty remarkable. In an instant, I can ask any one of the popular chatbots to write a haiku about magpies in the style of US President Donald Trump’s Truth Social posts. I know you’re interested, so here it is:
Loud Magpies—very aggressive! Always attacking our GREAT people for no reason. Not nice! They swoop from the trees, total losers. They want your food and your shiny things. SAD! We need TOUGH birds back in our parks!
#MagpieScam #MakeParksSafeAgain
This showcases the capability of AI to process and mimic human language, even in communicatively complex contexts such as sarcasm, humour or metaphor (Andersson and McIntyre 2025). Like I said, pretty remarkable.
But I am more interested in the hype. It is difficult to discern the truth about this technology because the public discourse around it swings from one extreme to the other. Alongside those worrying about job losses and the future of work, there are many sceptics who scoff. Claims that editing can be replaced by AI need to be examined carefully, and that is what I tried to show in my presentation on 16 February.
If you are reading this, I invite you to try some experiments of your own to reveal what AI can’t do.
But first, let’s look at what AI is, starting by axing the name itself. It is much more accurate to use the term “large language model” (LLM), because that is essentially what it is. These models are built from huge datasets: all the words written on the internet, plus anything else these companies can get their hands on to scan and feed into their models.
Then these models have to break language up into what are called “tokens”. Words are broken into smaller segments, which do not always correspond to whole words; the segments are determined by statistical frequency. Common words are typically assigned a single token (e.g. “apple”), while more complex words are subdivided into multiple tokens based on common prefixes or suffixes (e.g. “tokenization” becomes “token” + “ization”). Even more complexity arises with irregular or less common spellings, so “write” might become “wr” + “i” + “te” because the model also has to accommodate “wrote” (“wr” + “o” + “te”).
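If you want to see this in action, here is a minimal sketch using OpenAI’s open-source tiktoken tokeniser (assuming it is installed with pip install tiktoken). The exact splits vary from tokeniser to tokeniser, so the fragments you see may differ from the examples above:

```python
# A minimal sketch of subword tokenisation using OpenAI's open-source
# tiktoken library (install with: pip install tiktoken).
# Exact splits vary by tokeniser, so the fragments may differ from the
# examples given in the text.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the encoding used by GPT-4-era models

for word in ["apple", "tokenization", "write", "wrote"]:
    ids = enc.encode(word)                   # word -> list of integer token IDs
    pieces = [enc.decode([i]) for i in ids]  # decode each ID back to its text fragment
    print(f"{word!r} -> token IDs {ids} -> pieces {pieces}")
```

Common words come back as a single token; rarer words come back as several frequency-based fragments.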
Each unique token is assigned a specific integer identification number (ID). These token IDs are then converted into vectors called “embeddings”: coordinates in a high-dimensional mathematical space. That space allows the system to calculate the relationships between tokens based on their proximity to one another. In other words, embeddings try to capture the semantic meaning of words, sentences and even entire documents.
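The idea of “proximity” can be illustrated with a toy example. The vectors below are invented purely for the demonstration; real embeddings have hundreds or thousands of dimensions learned from data:

```python
# A toy illustration of "proximity" between embeddings. These
# 4-dimensional vectors are invented for the demonstration; real models
# learn vectors with hundreds or thousands of dimensions.
import numpy as np

embeddings = {
    "magpie": np.array([0.9, 0.1, 0.3, 0.0]),
    "crow":   np.array([0.8, 0.2, 0.4, 0.1]),
    "haiku":  np.array([0.1, 0.9, 0.0, 0.5]),
}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: close to 1.0 means they
    # point the same way (semantically similar); close to 0 means they
    # are unrelated.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["magpie"], embeddings["crow"]))   # high (~0.98)
print(cosine_similarity(embeddings["magpie"], embeddings["haiku"]))  # low  (~0.18)
```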
The generation of text is a probabilistic operation. The model does not formulate a complete response in advance; instead, it executes a repetitive prediction cycle (there is a rough sketch of this loop in code after the list):
- The system analyses the input prompt and all previously generated tokens.
- A probability distribution is calculated across the entire vocabulary (often exceeding 100,000 tokens).
- The model selects the next token based on these calculated probabilities.
- This cycle continues until the system generates a specific “end of sequence (EOS)” token, which terminates the output.
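Here is that cycle as schematic code. The model function is a hypothetical stand-in for a trained network that returns one raw score per vocabulary token; nothing here is a real API, it is just the shape of the loop:

```python
# A schematic of the prediction cycle described above. "model" is a
# hypothetical stand-in for a trained network that returns one raw score
# (logit) per vocabulary token; it is not a real API.
import numpy as np

def generate(model, prompt_tokens, eos_id, max_new_tokens=200):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model(tokens)                 # 1. score every token in the vocabulary
        probs = np.exp(logits - logits.max())  # 2. softmax turns scores into probabilities
        probs /= probs.sum()
        next_id = int(np.random.choice(len(probs), p=probs))  # 3. pick the next token
        tokens.append(next_id)
        if next_id == eos_id:                  # 4. the "end of sequence" token stops the loop
            break
    return tokens
```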
Okay, so what does all that mean? The way LLMs use language is the basis of their remarkable ability to produce text, but it is also a ceiling they cannot break through. If these models can only process language one statistically determined fragment at a time, there are some very real limits on their abilities.
First and foremost, LLMs do not actually “know” anything. They simply calculate which word or word fragment (which “token”) is most likely to come next, based on the patterns in the vast datasets they were trained on. The larger the datasets, the more likely hallucinations become (Gomes et al. 2022). And, therefore, LLMs cannot:
- account for multiple contexts (linguistic context, historical context, social context, institutional context to name a few)
- be self-reflective on things like bias, truth and concepts
- understand words beyond the tokens they have been trained on (which is how you can get strange spelling errors like “wrIte”).
In other words, they cannot capture the human conceptual embodiment of meaning (Dreyfus 1992; Rump 2025). Don’t take my word for it: you can play around with this yourself.
If you open any of the popular chatbots, you can try today’s Wordle. Some models are getting better at solving it, but dig deeper and you will find they are outsourcing the process to other tools, such as writing and running Python code. When they don’t, most chatbots simply cannot do a Wordle, because they cannot break a word down to the character level.
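This is roughly what that “outsourcing” amounts to. Checking a guess against an answer is trivial character-level work for ordinary code, even though a pure next-token predictor never “sees” individual letters. The words below are just illustrative, and the scoring is simplified (it ignores the duplicate-letter edge cases of the real game):

```python
# Checking a Wordle guess against an answer is trivial character-level
# work for ordinary code. The words are illustrative, and the scoring is
# simplified (it ignores the duplicate-letter edge cases of the real game).
def score_wordle(guess: str, answer: str) -> list[str]:
    result = []
    for i, letter in enumerate(guess):
        if letter == answer[i]:
            result.append("green")   # right letter, right position
        elif letter in answer:
            result.append("yellow")  # letter occurs elsewhere in the answer
        else:
            result.append("grey")    # letter not in the answer at all
    return result

print(score_wordle("crane", "brine"))
# ['grey', 'green', 'grey', 'green', 'green']
```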
You can also try asking a question like: which movies directed by people born before 1955 won Oscars after 1995? Pretty straightforward, right? Yet you will find most LLMs will produce a table with glaringly obvious mistakes. When I last asked a chatbot to do this, it gave me Christopher Nolan (born in 1970), who won an Oscar for Oppenheimer in 2024. It was very confident in its response too. This happens because the question requires complex knowledge-base reasoning (Gomes et al. 2022): to answer it, you need to take two different sets of data and cross-reference them.
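Done explicitly in code, that cross-referencing is a simple join between two datasets. The entries below are examples only, not a complete list:

```python
# The cross-referencing the question demands, done explicitly with two
# tiny illustrative datasets (examples only, not a complete list).
directors_birth_year = {
    "Clint Eastwood": 1930,
    "Martin Scorsese": 1942,
    "Christopher Nolan": 1970,
}

best_director_wins = [
    ("Clint Eastwood", "Million Dollar Baby", 2005),
    ("Martin Scorsese", "The Departed", 2007),
    ("Christopher Nolan", "Oppenheimer", 2024),
]

# Join the two datasets: keep wins after 1995 by directors born before 1955.
for name, film, year in best_director_wins:
    if year > 1995 and directors_birth_year.get(name, 9999) < 1955:
        print(f"{name} (b. {directors_birth_year[name]}) won for {film} in {year}")

# Christopher Nolan is correctly filtered out because he was born in 1970.
```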
A more worrying side of LLMs is their tendency to reproduce biases. LLMs are built on datasets written by people about the world, so they are inherently shaped by the ideologies and views of those people. LLMs pick up on patterns and reproduce them in the texts they generate. For example, the inequalities.ai project shows how ChatGPT ranks different US states on qualities like laziness, trustworthiness and honesty; it consistently ranks lower-socioeconomic southern states unfavourably (Kerche et al. 2026).
LLMs are probabilistic rather than explicitly programmed systems, and most “safeguards” can be overcome simply by changing the prompts (Kerche et al. 2026). This is the “black box” nature of these models: no one really knows exactly how they work or what they will produce (Kruszelnicki 2026).
The upshot is that editing requires full visual and contextual awareness. Editors are better equipped for this because the human brain can take in a whole word, sentence or paragraph at once and immediately spot inconsistencies. This wonderful brain can also do parallel constraint checking, which means you can apply multiple rules at the same time: grammar, style, spelling, meaning and tone.
The human brain also has intuition and memory, which allow you to know what looks right without checking a reference. And it can reason flexibly, so you can understand subtle instructions or implied rules (like “A is not in the middle”) and adjust the text accordingly.
Most importantly, an editor is a social and moral being, and so you operate professionally and ethically. You know what is right and wrong in relation to the values of the context in which the writing is produced.
And that is what AI cannot do.
⸻
REFERENCES
Andersson M and McIntyre D (2025) “Can ChatGPT identify impoliteness? A study in the pragmatic awareness of a large language model”, Journal of Pragmatics, 239:16–36.
Dreyfus HL (1992) What computers still can’t do: a critique of artificial reason, MIT Press.
Friend S and Goffin K (2025) “Chatbot-fictionalism and empathetic AI: should we worry about AI when AI worries about us?”, Philosophical Psychology, 1–24, https://doi.org/10.1080/09515089.2025.2525320.
Gomes J, de Mello RC, Ströele V and de Souza JF (2022) “A hereditary attentive template-based approach for complex knowledge base question answering systems”, Expert Systems with Applications, 205, Article 117725, https://doi.org/10.1016/j.eswa.2022.117725.
Hume D (1957) The natural history of religion, Stanford University Press.
Kerche FW, Zook M and Graham M (2026) “The silicon gaze: a typology of biases and inequality in LLMs through the lens of place”, Platforms & Society, 3, https://doi.org/10.1177/29768624251408919.
Kruszelnicki K (7 February 2026) “The Great AI Safety Debrief Part One with Dr Petr Lebedev (406)” [podcast], Shirtloads of science, www.drkarl.com.
Messer U (2025) “How do people react to political bias in generative artificial intelligence (AI)?”, Computers in Human Behavior: Artificial Humans, 3, Article 100108, https://doi.org/10.1016/j.chbah.2024.100108.
Placani A (2024) “Anthropomorphism in AI: hype and fallacy”, AI Ethics, 4, 691–698, https://doi.org/10.1007/s43681-024-00419-4.
Rump J (2025) “AI, judgment, and nonconceptual content: a critique of Dreyfus in light of neuro-symbolic AI”, Phänomenologische Forschungen.
