Parrots
2023
Model input essentially guides inference through an extremely dense, high-dimensional data manifold learned from the training data. Prompting techniques such as Chain-of-Thought (CoT) and Tree-of-Thoughts (ToT) reduce task complexity by creating several input passes. With each additional token, whether produced during inference in CoT and ToT or supplied by human users, the model is incrementally guided through the high-dimensional space of learned representations - the transformer architecture learns data at many arbitrary levels of abstraction. Inside a multi-layer neural network (especially one built on self-attention like transformers), each successive layer ends up representing the input at a progressively more abstract level of meaning or function. These layers do not map cleanly onto classical linguistic categories (such as "words" at the bottom layer, "phrases" in the middle, and "semantics" at the top). Rather, the internal transformations discovered by the model are learned automatically from the training objective (e.g., next-token prediction) and can end up capturing a variety of features - often mixing word meaning, syntactic structure, and even task-specific clues in ways that can seem surprising from a human perspective. These methods home in on the "correct" manifold by increasing the conditional probability of arriving at the "correct" answer as more context is added. By correct answer, we really mean "expected" or "acceptable" from the perspective of a human reader, nothing more.
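As a concrete illustration of this last point, the following is a minimal sketch, assuming the Hugging Face transformers and torch packages and the publicly available "gpt2" checkpoint; the prompts, the target answer, and the continuation_logprob helper are our own illustrative choices, not a method from the literature. It scores how likely a causal LM considers the expected answer before and after a CoT-style step is added to the context.

```python
# Minimal sketch: does adding a CoT-style step raise the conditional probability
# the model assigns to the expected answer? (Assumes `transformers` and `torch`.)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum of log P(continuation token | prompt + preceding continuation tokens)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    # score only the continuation tokens, each conditioned on everything before it
    for pos in range(prompt_ids.shape[1], full_ids.shape[1]):
        token_id = full_ids[0, pos]
        total += log_probs[0, pos - 1, token_id].item()
    return total

bare = "Q: What is 17 * 3? A:"
cot = "Q: What is 17 * 3? Let's think step by step: 17 * 3 = 10*3 + 7*3 = 30 + 21. A:"
print(continuation_logprob(bare, " 51"), continuation_logprob(cot, " 51"))
```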
A transformer is composed of multiple layers, each of which receives inputs (token embeddings or the previous layer's outputs) and processes them through multi-head attention and feed-forward blocks. The output of each layer thus encodes the input in a slightly different "view", becoming more context-informed and more specialized. This incremental transformation means that, early on, the model may learn to identify basic patterns or relationships (like common word co-occurrences). Deeper in the network, those representations become increasingly context-dependent - what a given word or phrase represents can shift based on the surrounding text. In multi-head attention, each head learns statistical correlations over different parts or aspects of the input sequence. One head may learn to capture subject–verb agreement, another might latch onto semantic similarities, and yet another might look for discourse-level signals. By combining all these parallel attentions, each token at each layer gains a rich, multi-faceted representation that can capture many dimensions at once: syntax, semantics, and longer-range dependencies.
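As an architectural reference point for the description above, here is a minimal sketch of a single transformer block in PyTorch; the class name, dimensions, and the omission of masking, dropout, and positional encodings are our own simplifications for illustration, not a reproduction of any particular model.

```python
# Minimal sketch of one transformer block: multi-head self-attention + feed-forward,
# each wrapped in a residual connection and layer normalization.
import torch
import torch.nn as nn

class TinyTransformerBlock(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        # each head attends to the sequence through its own learned projections
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # residual + attention: every token re-represents itself in terms of the others
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # residual + position-wise feed-forward: a further nonlinear re-encoding
        return self.norm2(x + self.ff(x))

x = torch.randn(1, 10, 64)      # (batch, sequence length, embedding dimension)
block = TinyTransformerBlock()
print(block(x).shape)           # same shape, but a more context-informed "view"
```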
The aforementioned critiques of the evaluation metrics used in assessing LLMs invite a deeper exploration of general intelligence - specifically, how it can be reliably measured and observed in AI through rigorous and realistic tests that extend beyond linguistic capabilities to include broader cognitive functions. If we must define general intelligence (GI), one can use the "g factor", which refers to the ability to reason, plan, solve problems, think abstractly, and learn quickly across a wide range of domains. GI then involves higher-order cognitive processes that go beyond specific skills or knowledge domains.
A critical issue that arises in analyzing the reasoning capabilities of large and opaque models like the GPT series is training-test set cross-contamination, which becomes increasingly problematic for the most advanced models. The massive training datasets used, comprising extensive portions of the internet, are often untraceable and completely anonymous to researchers outside the initial developer groups - and, to some extent, even to the developers themselves - making replication studies impossible. The exact amount and identity of the data used to train models like GPT-3.5 or GPT-4 have not been publicly disclosed, posing a risk of rendering current benchmarking efforts meaningless due to cross-contamination.
Researchers have attempted to counter the contamination problem by using N-gram overlap as a detection metric, eliminating or withholding results for tests whose answers were present in the training data. However, this method has been criticized. Blodgett et al. point out, for example, that such heuristic approaches to mitigating biases in NLP systems can be problematic and may not fully address the underlying challenges due to a lack of normative reasoning. The method is further limited in that it fails to consider the context in which N-grams (sequences of 'N' items - words, characters, phonemes, etc. - from a given text or speech corpus) appear, and it may discount synonymous or analogous text worded differently. Additionally, the decision to use a 200-character window around detected N-grams during the training of GPT-3.5 is arbitrary and may not accurately reflect the influence of surrounding text on model learning.
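For concreteness, here is a minimal sketch of the kind of N-gram overlap heuristic discussed above, written by us for illustration; the default N and the 200-character scrub window are parameters mirroring the values mentioned in the text, not any lab's published pipeline.

```python
# Illustrative contamination check: flag a benchmark item if any of its N-grams
# appears verbatim in a training document, and scrub a fixed character window
# around a detected match (the arbitrary-window heuristic criticized above).
from typing import List, Set, Tuple

def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated(benchmark_item: str, training_docs: List[str], n: int = 8) -> bool:
    """True if the item shares any N-gram with any training document."""
    item_grams = ngrams(benchmark_item, n)
    return any(item_grams & ngrams(doc, n) for doc in training_docs)

def scrub_window(doc: str, match_start: int, match_end: int, window: int = 200) -> str:
    """Remove the matched span plus `window` characters on either side."""
    return doc[:max(0, match_start - window)] + doc[match_end + window:]
```

Note how the set intersection only catches exact token matches: paraphrased or reworded answers pass straight through, which is precisely the limitation raised above.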
We believe that the utility of LLMs in academic fields may be underscored by their potential to aid research, for example, through automation, as a dynamic search engine, as a knowledge aggregator, or simply by enabling engaging discussions. But many problems ensue, chief among them verifying the correctness of generated content. One way of tackling this issue is to generate ever more specialized datasets that target specific areas of interest within an academic field. We suppose, then, that benchmarking these systems is a valid compromise in evaluating LLMs. This is partly because any holistic evaluation, such as a simulator, would need to functionally encapsulate all known scientific data for that specific topic, or at least be capable of hierarchically modelling from first principles upward - arguably a more complex problem than building an LLM. A paradox emerges: to fact-check the output of LLMs, we would have to build yet another system whose purpose would be the same as that of developing LLMs, leading us down the rabbit hole we were trying to escape from in the first place.
If the reader finds the present argument contentious overall, then a question might help us converge: apart from temperature settings, which allow the consideration of less likely tokens, and from the biases, weights, and all other parameters that encode the correlations and patterns learned during training, through what other mechanism would reasoning emerge? Is it through these parameters or through some "ghost in the shell"? And if we agree that reasoning is emergent via the parameters, in the sense that it mirrors learned correlations, how would valid and sound novel correlations come about, if not by luck? In order to claim that LLMs are capable of reasoning beyond the training data, an experiment that answers these questions must be produced, showcasing some underlying mechanism by which the LLM (or any ANN) does reason, besides reproducing learned reasoning chains, producing spurious correlations, or making lucky guesses.
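To be explicit about what the temperature knob actually does, here is a minimal sketch of temperature-scaled sampling; the logits are a made-up toy example of our own.

```python
# Minimal sketch of temperature sampling: the inference-time knob that lets
# less likely tokens be considered, without changing any learned parameter.
import numpy as np

def sample_token(logits: np.ndarray, temperature: float, rng: np.random.Generator) -> int:
    """Sample a token index from temperature-scaled softmax probabilities."""
    scaled = logits / max(temperature, 1e-8)   # higher temperature -> flatter distribution
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.1])                     # made-up scores for three candidate tokens
print(sample_token(logits, temperature=0.2, rng=rng))  # nearly greedy: the top token dominates
print(sample_token(logits, temperature=1.5, rng=rng))  # flatter: unlikely tokens get real mass
```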
An experiment simple enough to highlight the main thesis of this paper is to attempt to convince an LLM to solve an unsolved scientific problem, such as a Millennium Prize problem, say P vs NP. This is the question of whether every problem whose solution can be verified quickly (in polynomial time, the class NP) can also be solved quickly (the class P). If we begin by asking GPT to list known strides and advances in the theory and then work from those, we will see the model quickly hit a ceiling. We argue here that this is because the region of the learned manifold that concerns information related to the P vs NP problem "runs out", and the model is left extrapolating from the limited data it has been trained on, without the ability to generate truly novel insights or reason beyond its probabilistic patterns. This limitation highlights the fundamental difference between data-driven models and human creative and logical reasoning, which can, even if slowly and constrained by experience, explore uncharted territories regardless of existing data.
Artificial Dualism
2023
We contend in this section that interpreting LLM outputs as an expression of reasoning, instead of as the output of an arbitrary probability function, is a mistake. In doing so we are attempting to address the bulk of truth statements or propositions that attribute higher-order cognitive functions to LLMs, or to any existing algorithm for that matter. We refer to this topic as the Artificial Dualism Problem (ADP). Support for ADP often rests on perceived unknowns in the context of Artificial Neural Networks (ANNs) - interestingly analogous to the God of the Gaps argument (GoG) - whereby a believer will ask the 'why' question until an unknown is discovered and readily place God there as the primal cause of that causal chain, e.g. the Big Bang. This is a space where arguments about properties such as sentience, consciousness, reasoning, and even self-awareness are often made about ANNs, such as GPT models or, more famously, Google's LaMDA model, as reported in a letter by Google engineer Blake Lemoine. When users of an algorithm cannot believe that the output is sufficiently explained by the underlying architecture or technology, they will readily attribute emergent intelligence, reasoning, or consciousness to the algorithm in order to explain the phenomenon. The difference between ADP and GoG is that in ADP we can empirically disprove the claim, while in GoG the claim is not falsifiable and is therefore not something that a scientist, or anyone in search of truth, should be too concerned with.
We argue then, based on our ChildPlay results and the results of many others, that ADP arises largely from overextending and misinterpreting unknowns as "emergent" properties of ANNs, while ignoring, understating, or misunderstanding the necessary conditions for higher-level cognitive functioning in human beings as we currently understand them.
Black Boxes R NOT Pitch Black
2023
AI models are often explained as input-output streams with a black box in the middle. They are further described mathematically by conditional probability formalisms. Whether this formalism is literal or not does not seem to us a matter of philosophical inquiry. While the fact that human behaviour has often been put into probabilistic terms as well makes the matter more difficult, it does not change the facts: we can fully describe and empirically understand LLMs or, more generally, ANNs, regardless of the complexity involved. Work of this sort is generally referred to as mechanistic interpretability research. We believe this to be important, since the assumptions leading to ADP can be shown to be incorrect simply by referring to a model's architecture or to the ANN algorithm in general.
Traditional Dualism posits a separation between the mind and the body, often invoking an immaterial soul or consciousness that cannot be fully understood or reduced to physical processes. Unlike the mind-body problem, where our understanding of consciousness is limited by the inaccessibility of the brain's full inner workings - since dissecting it effectively destroys its function - ADP can be disproved because we have complete access to the algorithm and architecture of ANNs. The mechanistic transparency of these models allows us to empirically demonstrate that the perceived cognitive capacities are merely the result of sophisticated pattern recognition, devoid of any true sentience or consciousness. Thus, while traditional Dualism remains a philosophical challenge, ADP results from a misconception that can be directly refuted by looking at the very design and function of the systems in question.
In order to disprove ADP, at least weakly, one can simply compare ANNs - or, in our case to be more specific, LLMs - with the relevant brain regions and processes that allow for decision-making, reasoning, and consciousness in humans, the only other object of study we have. The absence of algorithmic equivalents to critical neurobiological structures and processes, such as the prefrontal cortex's role in executive functions, the thalamus's integration of sensory and cognitive signals, and the hippocampus's involvement in memory consolidation, together with the brain's hyperconnectedness, should severely and fundamentally limit what claims regarding cognitive capacities one can make. Unlike the human brain, which seems to rely on recurrent and feedback loops that have been linked to consciousness, self-awareness, and abstract reasoning, LLMs operate on feedforward architectures devoid of such looped structures and overall complexity. In fact, we know LLMs perform token prediction through conditional probability without engaging in any higher-order cognitive functions characteristic of sentient beings, regardless of their underlying capacity to encode higher-order probability dependencies (token to token, parts of sentences to other parts, paragraphs to paragraphs, etc.). This structural limitation, we argue, precludes any emergence of higher-order thinking (including reasoning) within these models, reducing their training and subsequent outputs to sophisticated pattern recognition rather than genuine cognitive processing. They are, technically and theoretically, piggybacking off of the human training data, which is effectively the outcome of many functioning brains that are, as far as neuroscience can tell, actually thinking in the real sense of the word.
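To make the feedforward, conditional-probability claim concrete, the following is a minimal sketch of the autoregressive generation loop; toy_model is a stand-in of our own that returns arbitrary logits, not any real LLM.

```python
# Minimal sketch of autoregressive generation: each step is one feedforward pass
# that yields a conditional distribution over the next token; nothing persists
# between steps except the growing token sequence itself.
import numpy as np

VOCAB_SIZE = 50

def toy_model(token_ids: list) -> np.ndarray:
    """Stand-in for a single feedforward pass: returns logits over the vocabulary.

    Deterministic per context but otherwise arbitrary; a real LLM would run its
    attention and feed-forward layers here.
    """
    seed = hash(tuple(token_ids)) % (2 ** 32)
    return np.random.default_rng(seed).normal(size=VOCAB_SIZE)

def generate(model, token_ids: list, steps: int = 10) -> list:
    for _ in range(steps):
        logits = model(token_ids)                      # one feedforward pass over the whole context
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                           # P(next token | all previous tokens)
        token_ids = token_ids + [int(probs.argmax())]  # greedy pick; only the token list carries over
    return token_ids

print(generate(toy_model, [1, 2, 3]))
```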
TTT - The Turing Trap
2023
We now introduce the concept of the "Turing Trap" to describe the fallacy of anthropomorphizing cognitive capacities in ANNs simply because they can pass the Turing Test or generate human-like responses. The Turing Test, proposed by Alan Turing, suggests that if a machine can engage in conversation indistinguishably from a human, it can be considered intelligent. However, this test has been thoroughly criticized for not accounting for the underlying mechanisms that produce such responses, both in machines and in humans, as detailed above and elsewhere.
The Turing Trap then, we argue, occurs when observers mistake the surface-level performance of ANNs for genuine cognitive capacities. This mistake seems to us to be driven by anthropomorphism, the human tendency to attribute human-like qualities to non-human entities. We think that one of the primary factors is the high level of fluency and apparent coherence in the text generated by LLMs, which can create a powerful illusion of depth and intentionality. This illusion is strengthened by the model's ability to recall and reference information in ways that seem consistent with those of people who have memory and are capable of reasoning. The effect is exacerbated by reinforcement learning from human feedback, where human-like answers are further reinforced after general training by human technicians - which could be seen as overfitting on the task, if that is what is being evaluated. In fact, the training data used in most existing LLMs is so massive that no individual person knows what is in it, so it is unsurprising that users are caught off guard when the network turns out to be capable of producing one piece of information or another.
The cognitive biases behind the Turing Trap are further amplified by the design of LLMs themselves, which are optimized, through reinforcement learning with human feedback, to produce outputs that are not just accurate but also contextually appropriate and engaging. It follows that the more closely an AI's behavior mimics human-like interaction, the stronger the anthropomorphic response, and the higher the chance of mistakenly believing that these models possess higher cognitive functions, when, in fact, they are simply copying text, a capacity often referred to as "parroting".
On Language
2022
Professor Noam Chomsky's shadow is long. Among the many things he has said, one has particular importance for the development of AI: his assumption that humans are distinguished from other animals by the use of language. This has given a special character to human language in the field of Linguistics, where most see Chomsky as their godfather. Regardless, this particular argument - Chomsky or no Chomsky - has in turn affected AI research deeply. The assumption that human language has some magical property is, and has always been, guiding NLP funding, diverting all meaningful efforts toward a failed endeavor (or so I will argue): to build a general language model by solving human language explicitly, that is, by mapping its syntactical, grammatical, and probability rules.
The assumption is an old one and has been receding hand in hand with religious beliefs. First, the difference between animals and humans was said to be that of the ephemeral soul, something to which even modern philosophers dare attribute reason. This argument has been receding ever since it was made. Initially, with the forbidden dissections of human bodies, Da Vinci and others could see that humans deep down (literally) did not look much different from the common pig, and that there was no trace of a "soul". Then, with the dissection of the brain, the argument went from bad to worse. In the 17th century, Descartes decided, for no good reason, that the pineal gland must be the root of the soul and that the reason the soul could not be found was that the pineal gland was a communication device with the divine - a portal, if you will. This did not lend much more credence to the decaying argument.
Now in modernity, Cognitive Computational Neuroscience and its subfields have nearly taken the centuries-old argument out of its misery. They have shown that neurons are probably the root of reason and thinking - for if you break enough of them you can make a person go from an adult to a child, and this by just removing a bit of the frontal lobe. If you break a bit more in Broca's area, the person will produce nonsense speech - break a bit more and you end up with no speech! Start breaking neurons in the hippocampus and you end up with someone stuck in a 5–10 second memory loop. Break a bit more and you'll certainly end up with a large-sized vegetable. All in all, every single cognitive capacity, such as self-awareness, reasoning, and speech, can be attributed to a part or parts of the brain, since if you remove these you simply lose those capacities.
How is NLP trying to solve the problem of human language and, by assumption, Artificial General Intelligence? The current gold-standard NLP models like GPT-3 and, more recently, DALL-E are impressive, but so was the 1739 mechanical duck, the Canard Digérateur, built by Jacques de Vaucanson in an attempt to build artificial life. Neither of them is quite the real thing. In fact, despite the shiny complex arrays of cogs inside, they are both hollow, and frighteningly so. But just like the digesting duck got the attention of kings and queens, so does GPT-3. Chomsky's argument for distinguishing animals from humans is based on the capacity for language. The issue is that this is demonstrably false.
It has been a long-held scientific belief (i.e. justified by field evidence that was replicated and reproduced) that language differs across animals by only a measure of quality and content. Bees have their dances which can communicate the distance and the sugar content of a food source as well as the potential location for housing a better colony. They can even elicit voting for the moving of the colony, or dance in alarm in the presence of predators-this is without mentioning all the potential arrays of communication that might be happening via the use of pheromones.
Most insects - indeed, most animals - share these capacities for communication: they have different dances or sets of movements that mean particular things (visual communication), as well as a variety of pheromones, also used to convey particular meanings or contexts (e.g. trail, sex, and alarm pheromones). Insects and larger animals also communicate with sound: dolphins and whales have a rich language of clicks and whistles; bats have echolocation, which mimics a phonological loop (when you memorize a string of numbers you may repeat them in your mind - bats have an ongoing narrative of maps and symbols based on the bounces of their clicks); songbirds have names from birth, particular sounds reserved for them by their parents and then by their "peers". Crows will give you a name - a particular sound - if they see you more than once, and cats will slowly close their eyes in the presence of other cats if they mean no harm. The truth is, whether we like it or not, all animals that have a brain are capable of communicating information with each other. Information, then, together with a brain, is the key to understanding language.
The belief that other animals are simply automata whose actions are merely the outcome of a set of if-else statements or simple logical rules is wrong - this has been illustrated by experiment. Real-world environments are dynamic and chaotic. They require dynamic controllers capable of dealing with probability distributions in many different dimensions, capable of memory, context, self-awareness, and many other cognitive faculties. Although evolutionary pressures tend to minimize resource expenditure, the designs evolution outputs are far from following Occam's razor. This is to say that animals are not made simple by evolution; they might even be made complex: how else can the presence of a complex object like the brain be explained? Even insects must be capable of acting in complex environments where unpredictable states are the norm. Current Artificial Intelligence models and algorithms are not comparable to any reasonable degree with an ant or any other insect - an embodied agent acting in the real world - much less with a horse, a monkey, or even you.
I take human language, then, to be emergent. It may or may not emerge from only the human brain. If a child does not learn a language before a certain stage of brain development has occurred, evidence indicates that they won't ever be able to learn a natural language. Chomsky is correct in that languages have similarities - but I think these similarities arise not just because of the brain, but rather because of their function, which is to communicate about the real world (be it the inner mental world or the environment surrounding the speaking agent). Both human and animal languages carry information - Shannon, in his fundamental paper on information theory (A Mathematical Theory of Communication, C. E. Shannon, 1948), built the mathematical framework for analyzing information. Before him, Wittgenstein laid down the logical foundations of language and meaning (Tractatus Logico-Philosophicus, L. Wittgenstein, 1921). The culmination of these two treatises, I argue, is then the solution to language. Not human language, for human language is simply a subset of all languages, but any language, or the superset of languages. This is equivalent to solving the cause of causes, as opposed to solving any language explicitly, as recent research has all but exhausted itself in attempting.
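As a small concrete anchor for the appeal to Shannon, here is a minimal sketch of his entropy measure, the quantity of information carried by a source; the example strings are our own toy choices.

```python
# Minimal sketch of Shannon entropy: H = -sum(p * log2(p)) over the empirical
# symbol distribution of a message, measured in bits.
import math
from collections import Counter

def entropy_bits(message: str) -> float:
    counts = Counter(message)
    total = len(message)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(entropy_bits("aaaa"))   # 0.0 bits: a fully predictable signal carries no information
print(entropy_bits("abcd"))   # 2.0 bits: four equally likely symbols
```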
A framework must be defined, then, where the general capacity of the brain to manipulate symbols that map to other symbols (also known as an ontology or dynamic general-purpose semantic network) is central, and where the symbols are arbitrary. This is simply an alternative algorithmic path to AGI, one fundamentally based on the importance of symbols and their mapping, the processing of which requires Shannon's theory and the representation of which requires Wittgenstein's theory. And still, these will only serve as a substitute for the rampant transformers used in NLP, and not very well, at least not in terms of results. Even then, and regardless of the substitution, there is still a hole that needs filling regarding the loopiness present in the frontal cortex that yields self-awareness (or so we think), as well as the importance of heuristics such as the ones enabled by sentiments in the human brain (see Damasio et al.). In the end, the overall complexity of the brain and its cognitive functioning is still far from being approximated.
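As a rough illustration of what such a dynamic general-purpose semantic network could look like in code, here is a sketch of our own; the class and relation names are hypothetical, not an established framework.

```python
# Minimal sketch of a dynamic semantic network: arbitrary symbols mapped to other
# symbols via labelled relations, queryable and extensible at run time.
from collections import defaultdict

class SemanticNetwork:
    def __init__(self):
        # symbol -> relation -> set of target symbols
        self.edges = defaultdict(lambda: defaultdict(set))

    def relate(self, source: str, relation: str, target: str) -> None:
        self.edges[source][relation].add(target)

    def query(self, source: str, relation: str) -> set:
        return self.edges[source][relation]

net = SemanticNetwork()
net.relate("bee_dance", "communicates", "food_distance")
net.relate("bee_dance", "is_a", "signal")
print(net.query("bee_dance", "communicates"))   # {'food_distance'}
```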
LaMDA and other programs (e.g. DALL-E 2, GPT-3) are not sentient for this reason. Sentience, consciousness, and self-awareness are some of the terms that have been bent over backwards to try and fit them into our current algorithms. The reality is that none of the brain parts from which these characteristics arise are present in these algorithms - explicitly, implicitly, or functionally. So why and how would these language models ever have any perception whatsoever of anything if the perceiving mechanisms are not there to begin with? The belief that they are sentient regardless is an outcome of the confusion involved in the field of Linguistics and, by association, in the field of AI.
The belief is that language is the root of consciousness because we're humans and we're special and language is the last standing magical symbol of this special character. I am afraid this is wrong. The same argument made here regarding language is directly applicable to matters of consciousness. The difference across animal consciousness is not a binary one, but rather one of quality and content. And the reason is the same: the brain is the key to understanding cognitive phenomena - not a singular part, but the complex object in itself. All things emerge from it, some more specifically rooted in a particular cortex, others not so much.
I leave you with a reference to Feynman's "Cargo Cult Science" as an analogy to the problem: it is the case that people have been trying to imitate language instead of understanding what language is.
"In the South Seas, there is a Cargo Cult of people. During the war, they saw airplanes land with lots of good materials, and they want the same thing to happen now. So they've arranged to make things like runways, to put fires along the sides of the runways, to make a wooden hut for a man to sit in, with two wooden pieces on his head like headphones and bars of bamboo sticking out like antennas-he's the controller-and they wait for the airplanes to land. They're doing everything right. The form is perfect. It looks exactly the way it looked before. But it doesn't work. No airplanes land. So I call these things Cargo Cult Science because they follow all the apparent precepts and forms of scientific investigation, but they're missing something essential because the planes don't land."