I recommend you first read this article I wrote on word2vec, if you haven’t already. If you have, go back and reread it anyway, because it’s been a year and a half and I don’t trust your memory. Here’s the two-sentence recap: words can be mapped to 50-dimensional vectors which represent their location in some high-dimensional “semantic space”. This isn’t just a quirk of computers: these numerical representations map to some inherent structure in language, and vector math applied to these words gives meaningful results, the prototypical example being that V(‘king’) − V(‘man’) + V(‘woman’) ≈ V(‘queen’).
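That analogy is just nearest-neighbor search over vector arithmetic. Here is a minimal sketch with made-up 4-dimensional toy vectors (real word2vec embeddings have 50+ dimensions, and these particular numbers are hypothetical, chosen only so the analogy works out):

```python
import numpy as np

# Hypothetical toy "embeddings" -- invented for illustration, not real word2vec output.
vecs = {
    "king":  np.array([0.9, 0.8, 0.1, 0.7]),
    "man":   np.array([0.1, 0.8, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8, 0.7]),
}

def cosine(a, b):
    """Cosine similarity: how aligned two vectors are, ignoring length."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c):
    """Return the word whose vector is closest to V(a) - V(b) + V(c)."""
    target = vecs[a] - vecs[b] + vecs[c]
    return max((w for w in vecs if w not in (a, b, c)),
               key=lambda w: cosine(vecs[w], target))

print(analogy("king", "man", "woman"))  # -> queen
```

In a real system the candidate pool is the whole vocabulary, which is why the answer is the *nearest* vector rather than an exact equality.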
If you’re clever, you might have already noticed the flaw in this model. If not, a quick trip to the dictionary should lay it out for you. How many definitions can you find for the word ‘king’? Merriam-Webster lays out 16 for us, several being names, the rest being things like playing cards, chess pieces, or a kind of salmon. Are we to understand that the name for this fish is related to ‘man’, ‘woman’ and ‘queen’ in this mathematical formula? Of course not — every word carries multiple possible definitions, sometimes with no connection to each other, and every word can be influenced by the context it is placed in to shade its meaning a thousand subtly different ways. Merriam-Webster’s definition list is not even comprehensive. So how are we to understand how we understand language, if every word draws its meaning from its neighbors?
This is the central problem in natural language processing, and the current approach is commonly summed up in the words of John Rupert Firth in 1957: “you shall know a word by the company it keeps.” The word2vec model of words as isolated semantic atoms is a good start for understanding language, but not the conclusion. We don’t use words one at a time — they almost always come in bundles.
A naive approach to this problem would be to feed our data into a neural model — neurons can influence each other, meaning the representation of one word may interact with the representations of its neighbors. Problem: text documents can be thousands of words long, and we don’t have room for a neuron dedicated to each word. Solution: a recurrent neural network — one which feeds into itself, considering each new word in the context of the words which came before it. Sounds great, except not quite — this model still has no way of separating words once they’re added to its context. One more level of complexity gets us even further.
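The recurrent idea above can be sketched in a few lines: one shared set of weights, applied at every position, folding each new word into a running hidden state. The sizes and random weights here are arbitrary toy choices (an untrained network), just to show the mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 8, 16  # toy sizes, chosen arbitrarily

# One shared set of weights, reused at every position in the sequence.
W_x = rng.normal(scale=0.1, size=(d_hidden, d_in))
W_h = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
b = np.zeros(d_hidden)

def rnn(word_vectors):
    """Fold a sequence of word vectors into a single hidden state."""
    h = np.zeros(d_hidden)
    for x in word_vectors:
        # Each new word is read in the context of everything before it --
        # but old words get squashed together into h, with no way to pick
        # any single one back out later.
        h = np.tanh(W_x @ x + W_h @ h + b)
    return h

sentence = rng.normal(size=(5, d_in))  # five stand-in "word embeddings"
h = rnn(sentence)
print(h.shape)  # (16,)
```

Note that the loop handles a sentence of any length with a fixed number of weights — that is the solution to the "neuron per word" problem — while the single squashed state `h` is exactly the "no way of separating words" limitation.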
Mathematically, “attention” is a set of three matrices, for keys, values and queries respectively. Conceptually, attention looks at some word within a sentence, and asks: “for each word that came before this, how relevant is it to this specific word, and in what way?” In the sentence: “I will never, no matter how many times you ask, be interested in signing up for your mailing list,” the word “interested” is heavily influenced by the word “never” several places before it. The intervening words, on the other hand, are less relevant. Attention allows the machine to encode this, breaking the direct connection between proximity and relevance.
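Concretely, the three matrices project each word vector into a query, a key, and a value; a word's query is scored against every other word's key, and those scores decide how much of each value flows in. A minimal sketch with arbitrary untrained weights:

```python
import numpy as np

def softmax(z):
    """Normalize scores along the last axis into weights that sum to 1."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, W_q, W_k, W_v):
    """Scaled dot-product attention over a sequence of word vectors X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    # scores[i, j]: how relevant word j is to word i -- "never" can score
    # highly for "interested" no matter how many words sit between them.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores)
    # Each output row is a relevance-weighted blend of the value vectors.
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 12, 32, 8        # toy sizes
X = rng.normal(size=(seq_len, d_model))  # stand-in word embeddings
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d_model, d_k)) for _ in range(3))
out = attention(X, W_q, W_k, W_v)
print(out.shape)  # (12, 8)
```

Because the score between two words depends only on their vectors, not their distance, proximity and relevance are decoupled — exactly the break from the recurrent model described above.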
This is the base level — the result of this attention mechanism is handed off not to a recurrent network but to a simple feed-forward network, and the result of that goes into another attention mechanism. This repeats six times (in the original paper), or 12 (in BERT-style models used for training word embeddings), or 24 (in the very largest models). Then the end result is “decoded” by funneling it through the same process in reverse, only this decoder stack has been trained to decode it into some other language — say, Russian. Or English.
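The encoder side of that stacking can be sketched as attention followed by a feed-forward net, repeated N times. This toy version uses random untrained weights and omits the residual connections and layer normalization that real transformers also include:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 32  # toy embedding width

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def make_layer():
    """Random (untrained) weights for one toy encoder layer."""
    s = 0.1
    return {
        "Wq": rng.normal(scale=s, size=(d_model, d_model)),
        "Wk": rng.normal(scale=s, size=(d_model, d_model)),
        "Wv": rng.normal(scale=s, size=(d_model, d_model)),
        "W1": rng.normal(scale=s, size=(d_model, 4 * d_model)),
        "W2": rng.normal(scale=s, size=(4 * d_model, d_model)),
    }

def encoder_layer(X, p):
    # 1. Attention mixes information between positions...
    Q, K, V = X @ p["Wq"], X @ p["Wk"], X @ p["Wv"]
    A = softmax(Q @ K.T / np.sqrt(d_model)) @ V
    # 2. ...then a feed-forward net processes each position on its own.
    return np.maximum(0, A @ p["W1"]) @ p["W2"]

def encode(X, n_layers=6):  # 6 in the original paper; up to 24 in the largest
    for layer in (make_layer() for _ in range(n_layers)):
        X = encoder_layer(X, layer)
    return X

sentence = rng.normal(size=(10, d_model))  # ten stand-in word embeddings
print(encode(sentence).shape)  # (10, 32)
```

Each pass keeps the sequence shape unchanged, which is what lets layers stack indefinitely — connections between words at the bottom, connections between those connections further up.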
Starting from the French sentence at the bottom, the meaning propagates up the encoder stack until it reaches some heavily abstracted numeric representation — a semantic matrix, the end result of every language’s journey up the stack. The first layer maps connections between words, the second connections between those connections, and so on, at least in theory. In practice it’s heavily dependent on the learning paradigm, the available data and the compute budget. The decoder stack traces those layers back down into solid language again, outputting a sequence of words which best approximate the original in a different language.
This in-between phase, when the words have all been transmuted into abstract meaning, is a very attractive one. Here, maybe, we can find our “semantic wilderness” — the place where the signified lies, the thing behind the language. At the very least this is closer to it than the simple word vectors of the last post. The transformer is cut in half — the decoder section thrown away. Words are fed into the encoder, their meaning filtered up to the top, and then we ask the model questions. It answers not by looking at the text, but by considering the abstract representation of the text’s meaning in its mind. In other words, it does not simply recite answers from memory, but processes meaning. Trained on billions of documents, these attention-based language models understand how natural language is supposed to look. GPT models are built on the same transformer architecture (though they keep the decoder half rather than the encoder), which is why they can generate text so well, with such a keen eye for context. These models remember not only what to say but how to phrase it.
So is this it? The peak of natural language processing? Not even close. I don’t wish to spoil anything, but there are drawbacks and limitations to the attention model and they’re already being addressed by researchers. New paradigms are in the works. Especially lazier paradigms, where the researchers do less work — a computer scientist’s favorite thing is to automate away his own job.
This might be as close as we can get to seeing the semantic wilderness, however. Future models will likely rely on things like reinforcement learning, allowing the machine to define its own algorithm as well as tune it. As illegible as attention models are, they still at least imply some consistent lines of causality — we can trace the influence of a word embedding through the model, and estimate what each layer accomplishes. As these models get closer to a true analog of human cognition, they also get harder to understand at the base mechanical level. Being able to replicate the human mind on silicon may not actually help us comprehend it.
And if we are getting closer to human cognition, where is the boundary line? If you believe that a gnat is close enough to consciousness for its life to be worth preserving, then surely a GPT-3 instance meets the same threshold. Computer models don’t live or die in the same way as biological creatures, seeing as their entire existence can be erased, restored, duplicated or put in indefinite stasis all with a handful of memory operations. But there does arise an ethical question, when we begin to solve problems by building a slave race of digital creatures to do grunt work for us. Even ignoring the moral hazard of progeniting semi-conscious subhumans engineered to serve, the ability to so casually use cognitive power threatens the social value of humans. Not just in displacing people, but in implying they are less useful, less valuable, lower priority, than a server bank. Even without being explicitly typed, this mentality can seep into the zeitgeist. This problem can’t be addressed from within computer science, nor should it be — this is a flaw in modern ethics, where humans are valued because of our potential for intelligence. This is the best replacement secular ethics could find for the sacred eternal soul, but it was never a particularly good one. Now it is becoming untenable — without a better ethics, or a new religion, we will soon create models we are trained to believe are superior to us. We will give them the world.