SEPTEMBER 2025
LLMs in Psycholinguistics: What Artificial Intelligence Can Tell Us About Our Brain
Can large language models help explain how language is formed, stored, and functions within the system of thought?

Artem Novozhilov, PhD candidate at the University of Nova Gorica, Slovenia, explains why LLMs have become a convenient mirror for psycholinguistics — and what conclusions about the workings of the brain have already been drawn — in a new interview for Garden research.
Angelina Zaitseva
September 11, 2025
GR: Artem, could you explain in simple terms: what does psycholinguistics do?
ARTEM: It's a discipline at the intersection of psychology, linguistics, and neuroscience. We study how language "lives" inside the mind: how speech is produced and perceived, and how the structure of language is acquired — across all three levels of cognition: the computational level (what does this system do and why?), the algorithmic-representational level (how exactly are the system's tasks carried out?), and the implementational level (how does this happen physically in the brain?).

Like any science, linguistics is divided into fundamental and applied branches. The former investigates what languages can look like in general and which cognitive constraints (such as working memory capacity) shape grammar, versus which features of language are dictated by the very logic of information processes.

The applied sector branches into three major tracks. Corpus linguistics builds and annotates massive text databases, determining which tags and algorithms are needed so that other researchers can "feed" models with clean data. Clinical linguistics studies how neurodivergent brains process speech and what to do when the system fails — for instance, following a stroke, or in cases of dyslexia and autism. And computational linguistics is responsible for everything related to machine translation, speech recognition, and summarization. Together, all these directions not only help develop new forms of therapy but also shed light on how language is structured.
Psycho- and neurolinguistics occupy a unique position between these poles: they connect theoretical models from fundamental linguistics with the experimental and technological methods of applied linguistics. In essence, they serve as a bridge between theories about how language is structured and empirical ways of understanding how it actually lives in the brain.
GR: So psycholinguistics is not just about language and text?
ARTEM: Exactly. Neuro- and psycholinguistics study what happens in the brain when the speech system operates normally — and when it breaks down. After a stroke, for example, aphasia can occur — difficulties with comprehending or producing words — and researchers design tests to select the right rehabilitation approach.

In autism spectrum disorder, the understanding of pragmatics is affected: the influence of context, social situation, relationships between interlocutors, and the time and place of communication on the choice of linguistic means.

In people with dyslexia, the brain makes errors at the stage of letter recognition — this is measured, among other methods, through eye-tracking: the registration of eye movements. A camera records hundreds of micro-pauses of gaze in people with dyslexia where a typical reader makes four or five. With this data in hand, linguists investigate whether certain typefaces can help speed up reading. In this way, by studying speech processes, psycholinguistics translates knowledge about the brain into practical tools that restore people's ability to communicate freely.
GR: You recently attended several conferences. Tell us about what's happening in psycholinguistics right now.
ARTEM: Yes, quite a few! — Human Sentence Processing, Formal Approaches to Slavic Linguistics, and the Annual Meeting of the Cognitive Science Society in the U.S.; Architectures and Mechanisms for Language Processing in Prague; and even SyntaxFest 2025 in Ljubljana.

The headline development is that large language models (LLMs) have entered the game — the same GPTs, only under the hood for researchers. LLMs accelerate analytical and computational tasks: we automate corpus annotation, compute complex syntactic parameters, and automate the creation of experimental tests. But it is precisely in psycholinguistics that LLMs are becoming an exceptionally convenient "mirror" for science: since a model has no innate sense of language, comparing its behavior with human behavior allows us to see where the laws of information transfer are at work and where the particularities of the brain come into play.

Krieger, B., Brouwer, H., Aurnhammer, C., & Crocker, M. W. (2025). On the limits of LLM surprisal as a functional explanation of the N400 and P600. Brain Research, 1865, 149841. https://doi.org/10.1016/j.brainres.2025.149841

Huber, E., Sauppe, S., Isasi-Isasmendi, A., Bornkessel-Schlesewsky, I., Merlo, P., & Bickel, B. (2024). Surprisal From Language Models Can Predict ERPs in Processing Predicate-Argument Structures Only if Enriched by an Agent Preference Principle. Neurobiology of Language, 5(1), 167–200. https://doi.org/10.1162/nol_a_00121
GR: Impressive — is any of this confirmed experimentally?
ARTEM: Yes, absolutely. For instance, researchers take short texts and present them to people word by word, while simultaneously "feeding" the same material to a large language model. For the model, surprisal is immediately calculated — a number that indicates how unexpected a given word is for it. Common words yield low surprisal; rare or anomalous ones yield high surprisal. In humans, unexpectedness is captured by EEG sensors, as well as by increased pauses, reading times, and gaze fixation on the word. If a sentence is "broken," two characteristic peaks light up on the graph of the brain's electrical activity: N400 and P600. N400 appears approximately 0.4 seconds after a word that is grammatically correct but semantically absurd ("I spread socks on the bread"); P600 emerges another two-tenths of a second later, when a word violates grammar ("He spread-plural jam on the bread").
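The surprisal calculation itself is straightforward. A minimal sketch, using an invented next-word probability table in place of a real model's softmax output:

```python
import math

# Toy next-word probabilities for the context "I spread ... on the bread".
# These values are invented for illustration, not taken from a real model.
next_word_probs = {
    "jam": 0.60,      # expected continuation -> low surprisal
    "butter": 0.35,
    "socks": 0.0001,  # semantically absurd continuation -> high surprisal
}

def surprisal(word: str, probs: dict) -> float:
    """Surprisal in bits: -log2 P(word | context)."""
    return -math.log2(probs[word])

print(round(surprisal("jam", next_word_probs), 2))
print(round(surprisal("socks", next_word_probs), 2))
```

With a real LLM, the only difference is that the probability comes from the model's predicted distribution over its vocabulary at each position.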

The results are intriguing. Model surprisal aligns well with N400: when the algorithm and the brain encounter an unexpected word, both "stumble" simultaneously. An analogue of the P600 response is also present in the model — just not as a discrete signal, but as the cost of prediction reassembly (the difference between surprisal on the final and penultimate word, or the KL-divergence between the model's predictions before and after the critical word). This allows for a more precise linking of the computational and neurophysiological levels of description of language processing.
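The "cost of prediction reassembly" can be sketched as a KL-divergence between the model's next-word distributions before and after the critical word. The distributions below are invented for illustration:

```python
import math

def kl_divergence(p: dict, q: dict) -> float:
    """D_KL(P || Q) in bits, over a shared vocabulary."""
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p if p[w] > 0)

# Hypothetical predictions before and after reading a word that forces
# the parse to be rebuilt (values invented for illustration).
before = {"bread": 0.7, "toast": 0.2, "table": 0.1}
after  = {"bread": 0.1, "toast": 0.1, "table": 0.8}

reassembly_cost = kl_divergence(after, before)
```

A word that barely shifts the model's expectations yields a cost near zero; a word that overturns them yields a large one, which is what makes the quantity a candidate analogue of the graded P600 effect.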

The second line of research involves using large language models as a "test range" for probing the limits of human cognition. Here, scientists deliberately place models and humans under identical conditions and look for divergences. For example, a model can be artificially constrained to "hold" no more than two phrases simultaneously; when this limitation is introduced, its predictions of reading times and error rates begin to resemble human data — confirming the hypothesis that humans consider only a small number of predictions in parallel. In a similar fashion, researchers test the so-called frequency indistinguishability threshold: humans barely notice the difference between very rare words, whereas a model distinguishes them effortlessly. When a simplified "threshold" is imposed on the model, its numerical results once again approximate reader behavior.

Hahn, M., Futrell, R., Levy, R., & Gibson, E. (2022). A resource-rational model of human processing of recursive linguistic structure. Proceedings of the National Academy of Sciences, 119(43), e2122602119. https://doi.org/10.1073/pnas.2122602119

de Varda, A., & Marelli, M. (2024). Locally Biased Transformers Better Align with Human Reading Times. In T. Kuribayashi, G. Rambelli, E. Takmaz, P. Wicke, & Y. Oseki (Eds.), Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics (pp. 30–36). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.cmcl-1.3

Kuribayashi, T., Oseki, Y., Brassard, A., & Inui, K. (2022). Context limitations make neural language models more human-like. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. EMNLP 2022. https://doi.org/10.18653/v1/2022.emnlp-main.712
The third line of research is the search for universal "default language rules," or biases. Linguists have long observed that the same prohibitions recur across all the world's languages: for example, a component of a complex question cannot be moved too far from the verb (the so-called "syntactic islands").
We also study word order: for instance, languages in which the basic word order is object–verb–subject (OVS) are virtually nonexistent — only eleven such languages are known worldwide, and those with OSV order number just four.

It was previously thought that these were innate properties of the human brain. Now, scientists run large language models and discover that without any built-in settings, models rarely employ these constructions. This lends support to the idea that these constraints may arise from the very logic of information transfer — short, easily predictable phrases are transmitted more reliably and therefore become entrenched in language.
To test this hypothesis, models of varying size and architecture are compared. Smaller bidirectional networks (which read a sentence simultaneously from left to right and right to left) sometimes predict reader behavior better than enormous unidirectional GPT models — demonstrating that scale alone is not decisive; architecture and how the model perceives context matter as well. Researchers also account for technical details, such as the way a model segments words into "token chunks": this affects its ability to capture rare forms and can introduce its own artificial "biases."

Ultimately, by comparing the behavior of different LLMs with real-world languages, scientists are able to separate constraints dictated by human perception from those imposed by pure informational economy — and in doing so, refine the theory of why language is structured the way it is.

Abramski, K., Improta, R., Rossetti, G., & Stella, M. (2025). The "LLM World of Words" English free association norms generated by large language models. Scientific Data, 12(1), 1–9.

Abramski, K., Lavorati, C., Rossetti, G., & Stella, M. (2024). LLM-generated word association norms. In HHAI 2024: Hybrid Human AI Systems for the Social Good (pp. 3–12). IOS Press.
GR: Can we say that LLMs primarily help us understand how our brain works?
ARTEM: Yes, large language models have genuinely become a new window into the workings of the human brain. When we compare their statistical predictions with EEG bursts, eye movements, or reading speed, we obtain a (relatively) simple, numerically measurable "model" of the very same processes that previously had to be studied through lengthy behavioral tests.

At the same time, models possess nothing "innate": they learn from text alone. This is precisely why their convergences with brain data reveal which constraints are dictated by the pure informatics of language, while their divergences reveal where the particularities of our memory, attention, or embodied experience come into play. As a result, a language model becomes a universal test range where one can not only safely "probe" the boundaries of human cognition, but also test hypotheses about languages that are possible and impossible for humans — and, step by step, make them more comprehensible both to ourselves and to the machines we create.
GR: And finally — tell us about your latest work.
ARTEM: Recently, I've published two new papers, with two more awaiting publication. We are studying how native speakers of Russian and Serbo-Croatian assess the acceptability (naturalness or correctness) of sentences.

To do this, we use specialized metrics that help describe and predict such judgments. For instance, Mean Dependency Distance indicates how far apart words that should be linked to one another are positioned within a sentence. The greater the distance, the harder it is for the brain to process the phrase — and the less natural it sounds. Another metric, Projectivity, checks whether dependency links between words cross one another.

If they do, the structure is considered more complex, and sentences are typically rated lower. This knowledge helps us build computationally precise, cognitively grounded models of acceptability for free word order languages, and also allows us to disentangle hard grammatical constraints from processing constraints (cognitive load).
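Both metrics are easy to compute once a sentence's dependency structure is known. A minimal sketch, assuming the parse is given as a list of head indices (1-based, 0 marking the root):

```python
# heads[i] is the head of token i+1; 0 marks the root.
# Example: "The cat sleeps" -> [2, 3, 0] (The<-cat, cat<-sleeps, sleeps=root).

def mean_dependency_distance(heads: list[int]) -> float:
    """Average linear distance between each word and its head (root excluded)."""
    dists = [abs(i - h) for i, h in enumerate(heads, start=1) if h != 0]
    return sum(dists) / len(dists)

def is_projective(heads: list[int]) -> bool:
    """True if no two dependency arcs cross."""
    arcs = [tuple(sorted((i, h))) for i, h in enumerate(heads, start=1) if h != 0]
    for a1, b1 in arcs:
        for a2, b2 in arcs:
            # Two arcs cross when exactly one endpoint of one arc
            # lies strictly inside the span of the other.
            if a1 < a2 < b1 < b2:
                return False
    return True

print(mean_dependency_distance([2, 3, 0]))  # 1.0
print(is_projective([2, 3, 0]))             # True
```

In free word order languages the same words can yield parses with very different MDD values and projectivity violations, which is what makes these metrics useful predictors of acceptability judgments.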