when words can’t tell the whole story
developing causal world models through embodied cognition
written in late fall 2023, sharing now in may 2024
Although it’s been a while since I’ve written about AI and technology, I recently graduated with my CS degree from Waterloo and wanted to share an essay I wrote for a Cognitive Science class during my final fall term. It reflects my viewpoint so far (very much my own, not reflective of any past employer’s) on today’s LLM technologies and whether or not they approximate human intelligence.
This is an especially timely and hotly contested debate, given how easily the language output of tools like ChatGPT has become embedded into everyday life. Lately, it “appears” as though AI agents can answer any question we throw at them, and the more plausible-seeming ‘coherence’ we encounter in their long-form text output, the more we as a society risk developing an automation bias towards trusting AI-generated content. But researchers like Bender et al. (2021) have long argued that these machines are mere Stochastic Parrots: highly capable of regurgitating patterns from the data they were trained on, but limited by an inability to actually possess the world view of a human mind. My thoughts here explore, from a purely cognitive science lens, what type of agent might one day overcome the limitations of today’s LLMs.
The power of written language, and the degree to which it both conveys and contains knowledge, has always fascinated me as a storyteller and now an engineer. But are words enough to capture our world?
In November 2022, OpenAI released ChatGPT, a public chat interface for one of its black-box large language models (LLMs) called GPT-3.5. Within a year, the tool had gained public notoriety for its apparent adeptness at writing coherent text. But should the output of LLMs like ChatGPT be naively interpreted as “meaningful”? These concerns trace back to philosophical debates about what constitutes intelligence from the early days of cognitive science. In 1950, Alan Turing presented the Turing test to evaluate machines on a symbolic, functionalist definition of intelligence. Later in the 20th century, philosopher John Searle opposed Turing’s language-based test through his Chinese Room Experiment (1980), instead tying intelligence to an underlying causal relationship with the world. To build such a world model, concept empiricist Jesse Prinz argues that human concepts would have to be formed perceptually through exposure to external stimuli, a viewpoint which modern-day researchers label “embodied cognition” (Smith & Gasser, 2005). Meta’s Chief AI Scientist Yann LeCun (2022) agrees, positing that today’s LLMs could never achieve a human level of cognition due to their amodal and probabilistic nature.
Given this viewpoint, I believe that if a machine were to be trained in an action-oriented manner that allowed it to acquire a multimodal, sensory perception of the world, it would develop the causal world model that Searle argues is required for human intelligence, thereby surpassing the limitations of Turing’s initial test.
theories on the mind
it’s what’s inside that counts… right?
Before exploring how to build such a machine, it’s important to establish the relevant theories of what constitutes intelligence. These originated during the mid to late 20th century in the field of cognitive science, which stood at the intersection of computer science, philosophy, and psychology.
Alan Turing’s 1950 paper Computing Machinery and Intelligence outlines one such test to answer the question, ‘can machines think?’ In the Turing test, a machine is considered intelligent if it is capable of fooling a human interrogator into believing it is human through language-based conversation. Turing’s machines are restricted to ‘digital computers,’ which he describes as computational systems built to execute operations provided to them in an internal ‘book of rules.’ He subsequently implies that this skill of symbol manipulation is what enables digital computers “to carry out any operation that a human computer would” (p. 436), waving away concerns about “trying to make a ‘thinking machine’ more human by dressing it up in … artificial flesh” (p. 434). By bestowing intelligence upon a machine that is capable of producing human-like output, without its necessarily engaging with the world in any human way, Turing argues for a functionalist definition of intelligence that favours output over internal processes. In doing so, he insinuates that concepts are inherently symbolic, rather than being associated with the real world.
The natural question that follows from Turing’s test is whether the ability to string together text into coherent language output is truly sufficient to describe a human level of intelligence. In Minds, Brains, and Programs, John Searle (1980) presents the Chinese Room Thought Experiment to counter Turing’s assertion that a digital computer could replace a human one, using elements of the assertion itself. He replaces Turing’s digital computer with an English-speaking human who is handed a ‘rule book’ of instructions, written in English, for manipulating Chinese symbols, placing that person in a closed room observed by a third-party interrogator. Although the English speaker may apply enough rules to produce coherent-seeming Chinese output from the interrogator’s perspective, thereby passing Turing’s test, Searle argues that this person does not ‘understand’ Chinese as a native speaker does. This take reflects Searle’s causal definition of intelligence, which presupposes an intentionality behind the deployment of language such that it is knowingly directed at objects in the world. He argues that while the English speaker has become proficient at stringing together Chinese symbols, they have no idea what those symbols semantically mean. Thus, Searle implies that no formally symbolic model of mental states could possess the intentionality required to demonstrate understanding. However, he stops short of claiming that it is impossible to design a machine that could implement these causal processes, conceding that if such a machine did exist, it could exhibit what he defines as intelligence. Turing and Searle’s opposing viewpoints on what constitutes ‘understanding’ thus provide the foundation from which to evaluate machine intelligence.
today’s LLM technologies
did you chatGPT that?
Returning to the development of modern-day LLMs, I’ll first explain how they are currently trained, to better understand the limitations of the output they produce. Bender et al. (2021) explain that a language model (LM) is a system trained, in a self-supervised way, to predict the likelihood of a word occurring in a text document given the words that surround it. An LLM is a specific type of large-scale LM which, having been trained on long windows of text from a massive, web-scraped corpus, is able to output long sequences of text that mirror the distribution of the dataset it was trained on. A key limitation of these models is ‘hallucination’: the model produces output that is syntactically coherent (based on the distribution it learned in training) but semantically incorrect. Bender et al. argue that these errors occur because “the training data for LLMs is only form; they do not have access to meaning” (p. 615), mirroring Searle’s criticisms of Turing-style digital computers. The fact that LLMs make these kinds of mistakes indicates that their training process is flawed: all they have learned is to syntactically manipulate formal symbols, without a semantic understanding of what those symbols mean. Although today’s LLMs may learn the data distribution of the Internet well enough to conceivably pass Turing’s test, their ultimate inability to reliably produce factual output demonstrates the test’s inherent limitations. This suggests a need to train machines in a way that gives them the causal understanding of the world that Searle posits is required for intelligence.
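To make the ‘form, not meaning’ point concrete, here is a toy sketch of the training objective Bender et al. describe, with a simple bigram counter standing in for a real transformer (the corpus, the code, and the names in it are my own illustration, not anyone’s production system): the model only ever learns which tokens tend to follow which, and ‘generation’ is just sampling from those learned frequencies.

```python
import random
from collections import defaultdict, Counter

# Toy stand-in for an LM: real LLMs use transformers over long context
# windows, but the objective is the same in spirit; learn P(next token |
# preceding tokens) from raw text, and nothing else.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# "Training" here is just counting co-occurrences, i.e. fitting the
# distribution of the training text.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def sample_next(prev: str) -> str:
    """Sample the next token purely from learned frequencies."""
    tokens, weights = zip(*counts[prev].items())
    return random.choices(tokens, weights=weights)[0]

# "Generation" = repeatedly emitting a statistically plausible continuation.
token, output = "the", ["the"]
for _ in range(8):
    token = sample_next(token)
    output.append(token)
print(" ".join(output))  # fluent-looking form, but no model of cats or mats
```

Scale the corpus up to the web and swap the counter for a transformer and you get remarkably fluent long-form text, but nothing in this loop ever connects a token to the thing it names, which is exactly the gap Bender et al. and Searle are pointing at.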
a multi-sensory understanding of the world
because everybody knows a picture is worth a thousand words
(Author’s Note: a little after finishing this paper, I came across a new TED talk from Stanford Human-Centered AI Co-Director and “Godmother of AI” Dr. Fei-Fei Li, which argues that ‘spatially intelligent’ AI will develop a better understanding of the world than today’s LLMs do. Her points about the need for agents to “DO” in multimodal environments in order to develop a world model, i.e. to go beyond merely seeing that environment to also acting in it, are what inspired me to share my own views, which run in a similar vein.)
So, how might one encode such a causal understanding into a machine? Jesse Prinz (2004) proposes a proxytype theory of mental content, in which concepts “are mechanisms that allow us to enter into perceptually mediated, intentionality-conferring, causal relations with categories in the world” (p. 164). Prinz argues that our mental concepts are modality-specific, meaning they are tied in memory to the sense through which they were acquired as we acted in our spatial environments, whether that was sight, touch, or one of our other senses. This contrasts with the rationalist’s hope that we as humans can represent concepts amodally in the mind through abstract, symbolic representations, which is precisely the type of information that has been taught to LLMs so far, given their current training tasks.
Findings from developmental studies of concept acquisition in childhood provide support for Prinz’s theory. In a 2005 paper from Indiana University, Linda Smith and Michael Gasser present their hypothesis of ‘embodied cognition,’ finding that “intelligence emerges in the interaction of an agent with an environment and as a result of sensorimotor activity” (p. 13). A key element of their theory is a feature called reentry, “the explicit interrelating of multiple simultaneous representations [of a given concept] across modalities” (p. 14). They note that “observers of infants have long noted that they spend literally hours watching their own actions” (p. 14), linking their visual perceptions and tactile actions together as they form a conceptual understanding of what is happening in their world. Reentry explains why, when a person visually encounters an apple, that experience invokes the other sensory features the person has come to associate with the APPLE concept through real-world interactions across their various senses (its smell, its taste, the beautiful sheen of a fresh Red Delicious). This provides clear evidence for Prinz’s theory of concepts, and it points to a primary limitation of LLMs: they are not currently designed to acquire the perceptual understanding that humans leverage in order to learn, characterize, and deal with concepts.
Another limitation of today’s LLMs is that their current language-only training data is insufficient to provide an accurate representation of mental contents. Although language is one mode of characterizing a concept, an understanding of the concept’s name and its distribution in language alone is not enough to capture it. In his discussion of how perception is linked to understanding, Aristotle defines the ‘common sensibles,’ a set of higher-order perceptual capacities that unite the five senses in the manner required to understand complex relationships about the world. In doing so, Aristotle goes further than Prinz, arguing that when a complex concept is learned through a multi-sensory understanding, it is more objectively known to be true than if it were learned through a single sense. Thus, multiple modalities of data extending beyond text are necessary to build a world model that the machine, and its users, could objectively trust.
For this reason, learning from multimodal data could mitigate the issue of LLM hallucination. The research by Smith and Gasser (2005) supports Aristotle’s intuition that multisensory concept acquisition is critical for building an accurate mental model of complex concepts, such as TRANSPARENCY. In a test of whether infants could identify the existence of a transparent box containing a toy, participants who had been able to play with the box beforehand were more likely to find its opening than to reach for the toy directly (the latter indicating that they were misled by their visual sense into believing the box did not exist). In this scenario, the researchers argued that the “haptic cues from touching the transparent surfaces educated vision, and vision educated reaching and touch, enabling infants to find the openings” (p. 25). Similarly, it could be argued that LLM hallucination occurs because the machine has neither perceptual input data nor enough sources through which it could fully understand complex concepts. Without learning from a perceptual interrelationship between modalities, both humans and machines are easily led astray.
building agents that see, plan, and act
so monkey sees, but monkey also does
In his 2022 position paper, A Path Towards Autonomous Machine Intelligence, Meta’s Chief AI Scientist Yann LeCun describes a machine that builds a world model by learning perception-planning-action cycles. The machine is placed in scenarios where it engages in experiences of the world by capturing different types of perceptual data, and it represents those experiences as a collection of perception-action relationships in its memory, in much the same way that a human would token a proxytypical concept. In future tasks, the machine would propose an action, use the perceptual evidence in its memory to gauge that action’s likely outcome, and then alter the action it decides to take accordingly. Given the viewpoint that intelligence comes from perceptual interactions with the world, I feel this presents one feasible approach to learning from sensory data in a non-symbolic manner. Here, the agent is not being trained to make new predictions by learning the probability distribution of its training data, but rather to take new actions based on the tokened, sense-based concepts it stores in its memory, given its “life experiences.” Although the details of how this training architecture would work are still open engineering problems (which reach outside the cognitive science theories I explore here), this learning model represents a high-level philosophical paradigm shift in how to train an intelligent machine: from probability-based to action-oriented.
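As a rough, hypothetical sketch of what such a cycle could look like (the class names, the one-dimensional toy ‘world,’ and the nearest-neighbour memory below are my own simplifications, not LeCun’s actual architecture): the agent proposes candidate actions, uses remembered experience to gauge what each would do, acts on the most promising one, and then folds the perceived outcome back into its memory.

```python
import random
from dataclasses import dataclass, field

# Hypothetical sketch of a perception-plan-action cycle: an agent that plans
# against remembered experience instead of sampling the next token from a
# text distribution. Not LeCun's architecture, just the shape of the loop.

@dataclass
class WorldModel:
    memory: list = field(default_factory=list)  # (action, observed state change)

    def predict_outcome(self, state: float, action: float) -> float:
        """Gauge what an action would do, based on the most similar past action."""
        if not self.memory:
            return state  # no experience yet: assume the action changes nothing
        _, delta = min(self.memory, key=lambda m: abs(m[0] - action))
        return state + delta

    def update(self, state: float, action: float, outcome: float) -> None:
        """Store the perceived consequence of an action as a new experience."""
        self.memory.append((action, outcome - state))

def environment(state: float, action: float) -> float:
    """Stand-in for the external world the agent perceives and acts in."""
    return state + action + random.gauss(0.0, 0.05)

model, state, goal = WorldModel(), 0.0, 5.0
candidates = [-1.0, -0.5, 0.0, 0.5, 1.0]
for step in range(25):
    if step < len(candidates):
        action = candidates[step]  # explore: try each action once to gather experience
    else:
        # Plan: pick the action whose *predicted* outcome lands closest to the goal.
        action = min(candidates, key=lambda a: abs(model.predict_outcome(state, a) - goal))
    outcome = environment(state, action)  # act, then perceive what actually happened
    model.update(state, action, outcome)  # fold the experience back into memory
    state = outcome
print(f"final state after 25 steps: {state:.2f} (goal was {goal})")
```

The point of the sketch is the shape of the loop: the agent’s ‘knowledge’ lives in perception-action-outcome memories that it continually tests against the world, rather than in the token statistics of a corpus.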
an llm can’t think because it only guesses
although they can visualize our world, they can’t capture how we act in it
Clearly, there is evidence that concept acquisition in early development takes place through learning experiences with multi-sensory stimuli. It does not make sense to train models on a Turing-style symbolic representation of the world, given that humans do not learn about the world in that way. It follows that if a model were instead built to actively perceive its external world through sensory input, it could come to possess Prinz’s proxytypical concepts. Moreover, if this input were multimodal, the model would be able to accurately represent complex real-world concepts. This implies that a model with such an embodied cognition would act in a way that is based on true perception, and thereby understanding, of its surroundings. The last question that remains is whether this mechanism would truly achieve Searle’s definition of intelligence (remember, this requires understanding to include some form of intentionality, meaning the system would have to extend beyond pure symbol manipulation).
Searle himself raises one counterargument in Minds, Brains, and Programs, which he calls ‘The Robot Reply.’ It concedes the need for perception by placing the symbolic digital computer inside a robot with perceptual capacities that enable it to ‘see’ and ‘act,’ similar to the system I describe above. Searle argues that this ‘sensory’ input must still be converted into symbols by the computer, based on syntactic rules, and he thereby dismisses the entire Robot Reply as an example of intelligence, asserting that it is simply a larger version of a formal computational system. He has a point, but what Searle’s objection misses is a form of agentic learning that extends beyond simple rule-following. If I were merely proposing that an embodied cognition system would train existing LLM architectures on multimodal data types, Searle’s objections would remain applicable; LeCun agrees that extending the same training technique to multimodal LLMs would result in the same hallucinations we observe in text-based LLMs, because no matter the mode, our current architectures hold the fundamental limitation of being purely symbolic.
However, given the new action-oriented model that I believe is necessary for capturing perceptual experiences (something the probabilistic training style of today’s LLMs cannot achieve), Searle’s argument no longer holds, because I am envisioning a model that learns from input data rather than performing a prescribed operation on it. This may look like what LeCun proposes, or, more likely, will require robots that learn within our multimodal environments, planning and acting within them in a way that goes beyond probabilistic guesses. Perception of, and interaction with, the world is what will differentiate a more intelligent form of AI from the tools we use today. Ultimately, even though today’s Turing-style AI technologies wouldn’t pass Searle’s bar for semantic intelligence, it would be incorrect to claim that no computer will one day be able to do so.
Thanks for reading this exploration of my thoughts (as of late 2023, to be clear) on machine intelligence, drawn from my initial studies into theories of cognition and psychology as applied to AI. Since writing this essay, much discussion has emerged about the necessity for “intelligent” models to incorporate robotics into their learning process, which is something I would weave into this article if I were writing it now, in May 2024. So much to learn! As a recent grad, my thoughts in this space are very much burgeoning and contemplative: I’m highly curious about other theories, stances, and approaches to the topic, so please share. You can reach out to me on LinkedIn.
References
Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–623. https://doi.org/10.1145/3442188.3445922.
LeCun, Y. (2022, June 27). A path towards autonomous machine intelligence. OpenReview. Retrieved December 10, 2023 from https://openreview.net/forum?id=BZ5a1r-kVsf.
Prinz, J. J. (2004). Furnishing the mind: Concepts and their perceptual basis. MIT Press.
Searle, J. R. (1980). Minds, brains, and programs. Behavioral and Brain Sciences, 3(3), 417–457.
Smith, L., & Gasser, M. (2005). The development of embodied cognition: Six lessons from babies. Artificial Life, 11(1–2), 13–29. https://doi.org/10.1162/1064546053278973.
Turing, A. M. (1950). Computing machinery and intelligence. Mind, 59(236), 433–460.