I knew nothing about Transformers before yesterday; I am a lover of cognitive science because I overthink too much.
Words inherently lack precision. Often, I might have what feels like a brilliant, even genius, idea—but articulating it clearly becomes incredibly difficult. (Imagine how challenging it is for me to create a simple pitch deck.) Frequently, hilarious jokes form perfectly in my mind, yet when spoken aloud, a critical layer of meaning vanishes.
Every act of communication invisibly sheds tiny fragments of precision, and cumulatively, these losses reshape our intended meaning into something quite different in another’s understanding. The words we speak are, after all, merely concrete approximations of complex, abstract thoughts that reside within our minds. The truth is that meanings are inherently abstract and intricately multidimensional.
This process of converting abstract meanings into words I term "lossy compression." This compression happens repeatedly and in both directions: first, when translating thoughts into words, and again, when listeners decode these words through their unique cognitive frameworks. Communication, therefore, is fundamentally a two-way exchange of compressed abstractions.
The same lossy compression principle underlies the operation of language models (LMs), specifically Transformer architectures. We begin by converting our thoughts into words to communicate with LMs. These words (tokens) are then transformed by the embedding matrix into complex, ultra-high-dimensional vectors (GPT-3's embedding vectors have 12,288 dimensions). This extremely high-dimensional foundation for communication is precisely why LMs inherently possess a deeper and more sophisticated understanding than human cognitive processing typically allows.
With such extensive dimensional understanding, Transformers recalculate semantic meanings of words through attention mechanisms, using context and position to predict subsequent words. The depth and dimensionality of linguistic understanding achieved by this method surpass our intuitive comprehension. Objectively, LMs have already reached a level of standardized, unified, and exceptionally complex abstract thinking (prediction).
The final step, unembedding, dramatically compresses these intricately complex semantic abstractions back into comprehensible human language. Given that human understanding of language varies significantly due to individual cognitive interpretations, we barely realize how much semantic precision is lost during the LM’s abstraction, unembedding, and our subsequent interpretation. The extent of information lost in this compression is profound and largely unnoticed.
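To put a rough number on how profound that squeeze is, here is a back-of-envelope sketch. The figures are assumptions taken from the public GPT-3 description (a 12,288-dimensional final vector stored in 16-bit floats, a vocabulary of 50,257 tokens), and raw bit capacity is of course not the same thing as usable semantic information; the point is only the order of magnitude:

```python
import math

# Assumed GPT-3-scale figures (not measured here)
d_model = 12288        # dimensions of the final residual vector
bits_per_dim = 16      # fp16 storage per dimension
vocab_size = 50257     # BPE vocabulary size

vector_bits = d_model * bits_per_dim    # raw capacity of the vector the model "thinks" in
token_bits = math.log2(vocab_size)      # the most information one sampled token can carry

print(f"final vector:  ~{vector_bits:,} bits of raw capacity")
print(f"sampled token: ~{token_bits:.1f} bits at most")
print(f"squeeze at the word interface: ~{vector_bits / token_bits:,.0f}x")
```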
Words are lossy compression of thoughts
For humans like us, turning abstractions into words is something we do every single day (probably the reason we find linguistics and natural-language processing so intriguing). We often describe someone as "smart" to refer to their intellectual ability, but what "smart" hints at, at the same time, is the ability to turn the abstract into the tangible (or rather, the sensible).
If you think about it more deeply, this is not an easy thing to do at all. Cognitive psychology suggests that pretty much everything we have sensed since birth shapes uncontrollable, second-nature biases in everything we do. The fact that these biases exist in every single one of us makes it impossible to comprehend anything on the exact same page. The cognitive differences in the words themselves already create a huge gap between us, while our usage of words, which rests on our own sense of those words (semantic memory), creates another gap of invisible misunderstandings. Even if you happen to find the exact right fit for the abstraction in your mind, you can never know whether it is being comprehended the way you created it.
At the end of the day, words, our default medium of communication, pass through these three general stages of precision loss in any ordinary exchange.
However, this doesn't happen only in inter-human communication. As we discussed previously, the same sort of lossy compression happens in human-to-LLM communication (typing prompts into LLMs). You will see that even though words are the only medium we have for communication, they are, by their very nature, defective for the job.
How incredible vectors are at expressing meaning
Vectors are how Large Language Models deal with words, or rather, meanings. Everything you input is first mapped to token IDs, and these token IDs are then mapped into word embeddings by the embedding matrix.
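A minimal sketch of that lookup, with toy sizes and random weights (the token ids are made up for illustration; a real model would get them from its tokenizer, and GPT-3's actual width is much larger):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 50257, 64          # toy width; GPT-3's embeddings have 12,288 dimensions

W_E = rng.normal(size=(vocab_size, d_model))   # embedding matrix: one row per token id (random here)
token_ids = np.array([15496, 995])             # hypothetical ids for a two-token prompt
word_embeddings = W_E[token_ids]               # the "mapping" is literally a row lookup

print(word_embeddings.shape)                   # (2, 64): one high-dimensional vector per token
```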
One of the difficulties in creating machines like Large Language Models is getting these machines - essentially computers - to understand the meaning of words. We come to understand words through a lifetime of constant contextual learning, grounded in what we see and experience; machines can't do that. They can only learn by regression, by optimizing over data, driving a residual (loss) function toward its minimum.
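To make "learning by optimization" concrete, here is a deliberately tiny sketch: one parameter, a mean-squared-error loss, plain gradient descent. Real models do the same thing over billions of parameters, and none of these numbers come from an actual model:

```python
import numpy as np

# Toy data: we want the machine to learn w such that y ≈ w * x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x                           # the "true" relationship to be discovered

w, lr = 0.0, 0.01                     # start ignorant, take small steps
for _ in range(500):
    residual = w * x - y              # how wrong the current guess is
    grad = 2 * (residual * x).mean()  # derivative of the mean squared error w.r.t. w
    w -= lr * grad                    # step downhill

print(round(w, 3))                    # ≈ 2.0: "understanding" here is whatever minimizes the loss
```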
The way machines approach this problem of meaning is through high-dimensional vectors. I know I have probably mentioned this a few times already, but you will see how interesting and genuinely "contextually rich" these tokens are through a few fun examples, referencing 3Blue1Brown's DL5 and https://arxiv.org/pdf/1301.3781
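A quick way to feel how much structure these vectors soak up is the classic analogy test from that word2vec paper. This is a sketch assuming gensim and its downloadable "word2vec-google-news-300" vectors are available (roughly a 1.6 GB download the first time); the exact neighbours it returns may vary:

```python
import gensim.downloader as api

# Pre-trained word2vec vectors in the Mikolov et al. tradition
wv = api.load("word2vec-google-news-300")

# king - man + woman ≈ ?  Plain vector arithmetic recovers the analogy.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' is expected at or near the top of the list.
```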
What Transformers eventually do is update word embeddings with an unimaginable amount of contextual information (multi-head attention: each head computes an attention pattern over a $context^{2}$ grid; 96 such heads make up one multi-headed attention layer; 96 of those layers stacked together make up GPT-3's attention architecture).
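Here is what a single one of those attention patterns looks like in code (a one-head, numpy-only sketch with toy sizes and random weights; the GPT-3 figures in the comments, 2,048 context, 12,288 dimensions, 96 heads, 96 layers, come from the public description and are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
context, d_model, d_head = 8, 64, 16    # toy sizes; GPT-3 uses context 2048 and d_model 12288

X = rng.normal(size=(context, d_model))         # one embedding vector per token
W_Q = rng.normal(size=(d_model, d_head))        # learned in a real model, random here
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V

scores = Q @ K.T / np.sqrt(d_head)              # context x context grid of relevances
mask = np.triu(np.ones((context, context), dtype=bool), k=1)
scores[mask] = -np.inf                          # causal mask: no peeking at later tokens
pattern = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax each row

updated = pattern @ V                           # each token's vector, updated with context
print(pattern.shape, updated.shape)             # (8, 8) attention pattern, (8, 16) contextual update
```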
1. Large Language Models achieve prediction through reasoning with their final vector, but this vector isn't really about words (tokens) anymore—it's an abstraction of meaning itself. The Transformer architecture semantically updates each word's vector through attention mechanisms, layer by layer. By the time we reach the final layer, what emerges isn't a word representation at all, but rather a pure abstraction of meaning.
2. At the unembedding layer, we use the $W_U$ matrix to generate logits across the vocabulary, mapping these precisely expressed "meanings" onto the most probable "words" (see the sketch after this list). But this is inherently a lossy compression of meaning. First, the probabilistic selection itself degrades semantic precision. Second, when we interpret these words, we do so through our own cognitive frameworks—introducing yet another layer of precision loss.
3. This creates a fascinating paradox: when we communicate with LMs, we're navigating multiple layers of lossy semantic compression and decompression. Internally, the Transformer's vectors express meanings with extraordinary precision. But to enable human communication, LMs must convert words into vectors, then vectors back into words. The models have already developed the capability for pure "meaning-to-meaning" communication—we're just stuck using words as the interface.
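A minimal sketch of the unembedding step described in point 2 above (toy sizes, random weights, plain numpy; $W_U$ here stands in for a trained unembedding matrix, and the GPT-3-scale numbers in the comments are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 64, 50257       # toy width; GPT-3's residual vectors have 12,288 dims

final_vector = rng.normal(size=d_model)               # the last position's "meaning" vector
W_U = rng.normal(size=(vocab_size, d_model)) * 0.02   # unembedding matrix (random stand-in)

logits = W_U @ final_vector                           # one score per word in the vocabulary
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                  # softmax: a distribution over ~50k words

next_token = rng.choice(vocab_size, p=probs)          # sampling collapses the vector to one id
print(f"{d_model}-dim meaning vector -> single token id {next_token}")
```

Everything that distinguished that continuous vector from its nearest neighbours in meaning-space is gone the moment a single token id is drawn.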
-----
This piece remains unfinished - what you see above is what I started, but never had the time to finish, four months ago. It seemed impossible for me to finish it with the integrity I wanted, so I decided to just post it - that way it might have a chance of bringing a little inspiration instead of just sitting on my computer. I wish I had finished tracing these thoughts down into words, but this fittingly reflects the theme of this blog - words are a lossy compressor of our thoughts - even though the answer to this question already lies in the chatbots, in LLMs; or rather, we have found the right question for this answer.