# You Type. It Thinks. But How? Inside the LLMs Behind ChatGPT and Claude

> Every time you ask ChatGPT something, a fascinating sequence of events fires off inside a neural network trained on hundreds of billions of words. This article unpacks the entire journey — from the moment you hit send, to the moment a response appears. No PhD required.

* * *

## What Is an LLM?

Let's start with the acronym. **LLM** stands for **Large Language Model**.

Break it down:

*   **Large**: trained on enormous datasets (a significant chunk of the internet, books, Wikipedia, code, and much more, often hundreds of billions of words and, for recent models, trillions) and built from billions of parameters
    
*   **Language**: it works primarily with text, which it reads, interprets, and generates on almost any topic (many recent models also accept images or audio)
    
*   **Model**: a mathematical system (a neural network, specifically) that has learned statistical patterns from all that data
    

### What problem do LLMs actually solve?

Before LLMs, getting a computer to "understand" text was painfully rigid. You had to program explicit rules: *if the user says X, do Y*. The problem? Human language is gloriously messy. We abbreviate, use sarcasm, switch topics mid-sentence, and rely on context that computers had no way to grasp.

LLMs changed this entirely. Instead of rules, they learn *patterns* from data. By reading enough human text, they develop a statistical intuition for what words mean in context, how sentences flow, and what a "good answer" looks like.

### Popular LLMs you've heard of

| Model | Creator | What it powers |
| --- | --- | --- |
| GPT-4o | OpenAI | ChatGPT |
| Claude (Sonnet, Opus) | Anthropic | Claude.ai |
| Gemini | Google DeepMind | Google Search, Workspace |
| LLaMA 3 | Meta | Open-source apps, research |
| Mistral | Mistral AI | Open-source, enterprise |

### Where do you see LLMs in daily life?

LLMs are everywhere now. You're using them when you:

*   Ask ChatGPT to explain a bug in your code
    
*   Use GitHub Copilot to autocomplete a function
    
*   Get a summarized email draft in Gmail
    
*   Ask Siri or Google Assistant a complex question
    
*   Use Notion AI to rewrite a paragraph
    

They've become the invisible intelligence layer beneath most of the text-based tools you use every day.

* * *

## What Happens When You Send a Message to ChatGPT?

Let's trace the exact journey of your message.

### Step 1: You type a prompt

You write something like: *"Explain how neural networks learn."*

This seems simple enough. But to a computer, this is just a string of characters — it has no inherent meaning yet.

### Step 2: Your text is tokenized

Before your message even reaches the model, it's broken down into **tokens** (more on this shortly). Think of tokens as the building blocks the model uses to read text. "neural networks" might become `["neur", "al", " networks"]` or similar units, depending on the tokenizer.

### Step 3: The model processes the tokens

This is where the magic happens. Your tokenized input is passed through the LLM, a neural network with billions of parameters. It reads the entire context you've provided and computes a probability for every token in its vocabulary: how likely is each one to come next?

### Step 4: A response is generated — token by token

The model doesn't write the whole response at once. It generates **one token at a time**, each choice influenced by everything that came before. This is why ChatGPT's output streams to your screen piece by piece: you're watching tokens appear as the model produces them.
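
To make that loop concrete, here is a minimal Python sketch of token-by-token generation. It assumes a hand-written lookup table (`NEXT_TOKEN_PROBS`, invented for this example) in place of a real neural network, and it always takes the most likely token. The name `generate` is illustrative too.

```python
# Toy next-token table: for each token, the probability of each possible
# follower. The numbers are invented for illustration. A real LLM computes
# them with a neural network that looks at the whole context.
NEXT_TOKEN_PROBS = {
    "<start>": {"Neural": 0.9, "The": 0.1},
    "Neural": {" networks": 1.0},
    "The": {" model": 1.0},
    " networks": {" learn": 0.7, " are": 0.3},
    " learn": {" patterns": 0.8, " slowly": 0.2},
    " patterns": {"<end>": 1.0},
}


def generate(max_tokens=10):
    """Build a response one token at a time, always taking the likeliest token."""
    context = ["<start>"]
    for _ in range(max_tokens):
        # Tokens missing from the toy table simply end the response.
        probs = NEXT_TOKEN_PROBS.get(context[-1], {"<end>": 1.0})
        next_token = max(probs, key=probs.get)
        if next_token == "<end>":
            break
        context.append(next_token)  # the new token becomes part of the context
    return "".join(context[1:])


print(generate())  # Neural networks learn patterns
```

The important part is the last line of the loop: every generated token is fed back in before the next one is chosen.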

### Why ChatGPT is NOT copy-pasting from the internet

This is one of the biggest misconceptions. By default, the model behind ChatGPT has no search engine and no live internet connection during generation. (When web browsing is enabled, it works as a separate tool that fetches pages and places their text into the prompt.)

What it has instead is *compressed knowledge* — statistical patterns learned during training. When it explains photosynthesis, it's not pulling a Wikipedia article. It's reconstructing an explanation from patterns it absorbed across thousands of texts it read during training. The knowledge is baked into its weights, not fetched on demand.

* * *

## Why Computers Don't Understand Human Language

Here's the fundamental problem: **computers only understand numbers.**

At the hardware level, everything is a 0 or a 1. Your CPU processes numbers. Your GPU processes numbers. Memory stores numbers. Every image you see, every sound you hear, and every video you watch has been converted into a stream of numbers before your computer could process it.

Text is no different.

The letter "A" is stored as the number 65 (in ASCII). The emoji 😊 is stored as a Unicode code point. But raw character codes don't carry *meaning*. The fact that "bank" appears in both "river bank" and "bank account" isn't obvious from their character codes alone. Context is invisible at the character level.
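
You can check this yourself in Python. The short sketch below uses the built-in `ord` function; the helper name `char_codes` is made up for this example.

```python
def char_codes(text):
    """Return the Unicode code point of every character in the text."""
    return [ord(ch) for ch in text]


print(ord("A"))        # 65
print(hex(ord("😊")))  # 0x1f60a

river = char_codes("river bank")
money = char_codes("bank account")

# The four codes for "bank" are identical in both phrases, so the numbers
# alone say nothing about which meaning is intended.
print(river[-4:])  # [98, 97, 110, 107]
print(money[:4])   # [98, 97, 110, 107]
```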

So how do we bridge the gap between human language and numerical computation?

The answer is a two-step pipeline:

1.  **Tokenization** — break text into manageable chunks
    
2.  **Embeddings** — convert those chunks into rich numerical vectors that capture meaning
    

We'll look at tokenization next. Embeddings reappear in the end-to-end workflow, as the vectors that the Transformer layers operate on.
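
As a quick preview of what "capturing meaning" looks like, here is a small sketch with three hand-made embedding vectors. The values in `EMBEDDINGS` are invented for illustration (real embeddings are learned during training and have hundreds or thousands of dimensions), and cosine similarity is one common way to compare them.

```python
import math

# Invented 3-dimensional vectors. Related words are given similar directions.
EMBEDDINGS = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Return 1.0 for vectors pointing the same way, near 0 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["dog"]))  # close to 1
print(cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["car"]))  # much lower
```

Unlike raw character codes, these vectors let "similar meaning" show up as "similar numbers".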

* * *

## Tokenization: Breaking Language into Bite-Sized Pieces

### What is a token?

A **token** is a chunk of text that the model processes as a single unit. Tokens are *not* the same as words.

Tokens can be:

*   A full word: `"hello"` → `["hello"]`
    
*   Part of a word: `"unbelievable"` → `["un", "believ", "able"]`
    
*   A punctuation mark: `"!"` → `["!"]`
    
*   A space + word: `" the"` → `[" the"]`
    

As a rough rule of thumb, **1 token ≈ 0.75 words** in English, or about 4 characters. Context windows are measured in tokens too: OpenAI's GPT-4 Turbo, for example, accepts up to 128,000 tokens.

### Why not just use individual characters?

You could tokenize by character — but then `"intelligence"` becomes 12 separate units. The model would need to learn relationships across 12 steps instead of 1-3. That's computationally expensive and makes it harder to learn meaningful patterns.

### Why not just use whole words?

On the other extreme, using whole words means a massive vocabulary. The English language has 170,000+ words, plus typos, slang, technical terms, code snippets, and every other language in the training data. You'd need a lookup table with millions of entries.

Tokens hit a middle ground: a vocabulary of roughly 50,000 to 200,000 subword units that can express virtually any text efficiently.

### A real tokenization example

Let's tokenize: `"ChatGPT is surprisingly good at coding."`

It might break down like this:

```plaintext
["Chat", "G", "PT", " is", " surprisingly", " good", " at", " coding", "."]
```

Notice `"ChatGPT"` gets split (it's not a common word), but `"surprisingly"` stays whole (it's common enough to have its own token in most vocabularies). Each token then gets converted to a number — its **token ID** — which the model can actually process.
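
Here is a toy tokenizer that reproduces the split above. It is only a sketch: the nine-entry `VOCAB` is hand-written for this one sentence, and the greedy longest-match strategy is a simplification. Production tokenizers typically use byte-pair encoding with merge rules learned from data.

```python
# Hypothetical mini-vocabulary: token text -> token ID.
VOCAB = {
    "Chat": 0, "G": 1, "PT": 2, " is": 3, " surprisingly": 4,
    " good": 5, " at": 6, " coding": 7, ".": 8,
}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}
MAX_TOKEN_LEN = max(len(token) for token in VOCAB)


def tokenize(text):
    """Split text by repeatedly taking the longest vocabulary entry that fits."""
    tokens = []
    pos = 0
    while pos < len(text):
        for length in range(min(MAX_TOKEN_LEN, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in VOCAB:
                tokens.append(piece)
                pos += length
                break
        else:
            raise ValueError(f"no vocabulary entry matches at position {pos}")
    return tokens


def encode(text):
    """Text -> list of token IDs."""
    return [VOCAB[token] for token in tokenize(text)]


def decode(token_ids):
    """List of token IDs -> text."""
    return "".join(ID_TO_TOKEN[token_id] for token_id in token_ids)


sentence = "ChatGPT is surprisingly good at coding."
print(tokenize(sentence))
print(encode(sentence))  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

Note that decoding the IDs gives back the exact original string, spaces included, which is why the leading space is part of tokens like `" is"`.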

### How tokenization affects cost and context

LLM APIs typically charge by token. A 1,000-word article might be ~1,300 tokens. The model's **context window** (how much it can "read" at once) is also measured in tokens. This is why very long documents sometimes need to be chunked — the model has a limit on how many tokens it can process simultaneously.
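
The arithmetic is easy to sketch. The helpers below apply the 0.75 words-per-token rule of thumb; the function names and the price used in the example are placeholders, not any provider's real API or rates, and an exact count would need the model's actual tokenizer.

```python
import math

WORDS_PER_TOKEN = 0.75  # rough rule of thumb for English text


def estimate_tokens(text):
    """Estimate the token count from the word count."""
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)


def estimate_cost(token_count, price_per_million_tokens):
    """Cost for a given number of tokens at a (hypothetical) price."""
    return token_count / 1_000_000 * price_per_million_tokens


def chunk_words(text, max_tokens):
    """Split text into pieces that should each fit within max_tokens."""
    words = text.split()
    words_per_chunk = max(1, int(max_tokens * WORDS_PER_TOKEN))
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


article = "word " * 1000  # stand-in for a 1,000-word article
tokens = estimate_tokens(article)
print(tokens)                                            # 1334
print(estimate_cost(tokens, price_per_million_tokens=2.0))
print(len(chunk_words(article, max_tokens=400)))         # 4
```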

* * *

## Transformers: The Architecture That Changed Everything

If LLMs are the car, Transformers are the engine.

The **Transformer architecture** was introduced in a landmark 2017 paper titled *"Attention Is All You Need"* by researchers at Google. Before Transformers, language models used recurrent neural networks (RNNs), which processed text sequentially — one word at a time, left to right. This was slow and struggled with long-range dependencies (relating a word at the end of a long sentence to one at the beginning).

Transformers threw out the sequential approach entirely and introduced one core insight: **attention**.

### What is attention?

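The key idea is that when processing a word, the model should be able to look at *every other word* in the context simultaneously and decide which ones are relevant. This is called **self-attention**. (GPT-style models restrict this to the words that came before, since later ones haven't been generated yet.)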

Consider the sentence: *"The animal didn't cross the street because it was too tired."*

What does "it" refer to? The animal, not the street. Humans know this instantly from context. Self-attention gives the model a mechanism to figure this out — by assigning higher "attention weights" to "animal" when processing "it."

In mathematical terms, each token computes three vectors:

*   **Query (Q)** — "what am I looking for?"
    
*   **Key (K)** — "what do I contain?"
    
*   **Value (V)** — "what information do I pass along?"
    

The model computes how well each Query matches every Key, normalizes those scores, and uses them to weight the Value vectors. The result: a rich representation of each token that incorporates relevant context from the entire sequence.
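
In code, that computation is short. The sketch below implements scaled dot-product attention for a single Query in plain Python, using softmax as the normalization step. The tiny 2-dimensional Key, Value, and Query vectors are invented so that "it" lines up with "animal"; in a real Transformer they come from learned projections of the token embeddings.

```python
import math


def softmax(scores):
    """Normalize scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max avoids overflow
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Score one Query against every Key, then normalize with softmax."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


def attend(query, keys, values):
    """Blend the Value vectors using the attention weights."""
    weights = attention_weights(query, keys)
    width = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(width)]


# Invented vectors for three tokens from the example sentence.
tokens = ["animal", "street", "it"]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query_for_it = [2.0, 0.0]  # chosen to match the Key of "animal"

for token, weight in zip(tokens, attention_weights(query_for_it, keys)):
    print(f"{token}: {weight:.2f}")
print(attend(query_for_it, keys, values))
```

The largest weight lands on "animal", so the new representation of "it" is pulled toward the Value vector of "animal".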

### Why did Transformers change AI?

Before 2017, language models were good at specific tasks but failed at general understanding. After Transformers, the improvement curve went nearly vertical. Here's why:

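**Parallelism.** Unlike RNNs, Transformers process all tokens of a training sequence at once rather than one after another. This means you can train on much larger datasets much faster using modern GPU clusters. (Generating a response is still sequential, because each new token depends on the ones before it.)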

**Scalability.** As you add more parameters (more layers, wider layers, more attention heads), Transformers continue to improve in ways that RNNs did not. This "scaling law" behavior was unexpected, and it's what enabled GPT-3, GPT-4, Claude, and Gemini to exist.

**Long-range context.** Self-attention gives every token a direct connection to every other token in the context window, regardless of distance. A key fact mentioned 5,000 words earlier can be reached in a single attention step, just like one mentioned in the previous sentence, as long as both fit in the window.

### Why does almost every modern LLM use Transformers?

Because nothing better has emerged at scale. Researchers have tried other architectures — state space models like Mamba, hybrid approaches, recurrent alternatives — and while some show promise on specific benchmarks, Transformers remain the dominant architecture for frontier models.

The combination of attention, parallelism, and proven scalability has made them the de facto foundation for every major LLM: GPT-4, Claude, Gemini, LLaMA, Mistral, and beyond.

* * *

## Temperature: How Models Choose Their Words

One concept worth knowing as a developer or power user: **temperature**.

After the model computes probabilities for the next token, temperature controls how "creative" or "conservative" the sampling is:

*   **Low temperature (0.0–0.3):** The model almost always picks the highest-probability token. Responses are predictable and consistent, and at exactly 0 the choice becomes greedy and close to deterministic. Ideal for code generation or factual Q&A.
    
*   **High temperature (0.8–1.5):** The model samples more randomly from the probability distribution, picking tokens that aren't necessarily the most likely. Responses are more varied, surprising, and creative — but sometimes incoherent.
    

Think of it as the difference between a careful technical writer (low temperature) and a creative writing student who's had a bit too much coffee (high temperature).
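
Under the hood, temperature is usually applied by dividing the model's raw scores (logits) by the temperature before turning them into probabilities. Here is a minimal sketch of that idea. The tokens and logits are invented, the function names are illustrative, and treating a temperature of 0 as a plain "pick the top token" is a common convention rather than something every API guarantees.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtracting the max avoids overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(tokens, logits, temperature, rng=None):
    """Pick the next token; temperature 0 is treated as a greedy choice."""
    if temperature == 0:
        return tokens[logits.index(max(logits))]
    probs = softmax_with_temperature(logits, temperature)
    return (rng or random).choices(tokens, weights=probs, k=1)[0]


tokens = [" the", " a", " banana"]
logits = [2.0, 1.0, 0.5]  # invented scores for three candidate tokens

for temperature in (0.2, 1.0, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Running it shows the top token's probability shrinking as the temperature rises, which is exactly where the extra variety (and the occasional " banana") comes from.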

* * *

## Putting It All Together: The Complete LLM Workflow

Here's the full picture, end to end:

1.  You type a prompt
    
2.  The text is **tokenized** into token IDs
    
3.  Each token ID is converted into an **embedding** (a high-dimensional vector)
    
4.  The embeddings pass through **multiple Transformer layers**, each running self-attention and a feed-forward network
    
5.  The final layer outputs a probability distribution over the entire vocabulary
    
6.  The model **samples** from that distribution (adjusted by temperature) to pick the next token
    
7.  That token is appended to the context, and steps 3–6 repeat until the response is complete
    
8.  Token IDs are **decoded** back into human-readable text
    

The "intelligence" of an LLM emerges from billions of parameters tuned during training so that steps 3 to 5 produce useful, coherent output, one token at a time.
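
The numbered steps above can be wired together in a few dozen lines. Everything in this sketch is a stand-in chosen for illustration: a five-word vocabulary, whitespace "tokenization", one-hot vectors as embeddings, and a `toy_model` function that replaces the Transformer layers with a fixed rule. Only the shape of the pipeline is meant to be realistic.

```python
import math
import random

VOCAB = ["<end>", "neural", "networks", "learn", "patterns"]
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}

# Step 3 stand-in: a one-hot vector per token ID (real embeddings are learned).
EMBEDDINGS = [[float(i == j) for j in range(len(VOCAB))] for i in range(len(VOCAB))]


def toy_model(embedded_context):
    """Steps 4-5 stand-in: return one score (logit) per vocabulary entry.

    The rule here just favors the next entry in VOCAB after the last token.
    """
    last = embedded_context[-1]
    next_id = (last.index(max(last)) + 1) % len(VOCAB)
    return [5.0 if i == next_id else 0.0 for i in range(len(VOCAB))]


def sample(logits, temperature, rng):
    """Step 6: pick a token ID, greedily when temperature is 0."""
    if temperature == 0:
        return logits.index(max(logits))
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def respond(prompt, temperature=0.0, max_new_tokens=10, seed=0):
    rng = random.Random(seed)
    prompt_ids = [TOKEN_TO_ID[word] for word in prompt.lower().split()]  # step 2
    new_ids = []
    for _ in range(max_new_tokens):
        embedded = [EMBEDDINGS[i] for i in prompt_ids + new_ids]  # step 3
        logits = toy_model(embedded)                              # steps 4-5
        next_id = sample(logits, temperature, rng)                # step 6
        if VOCAB[next_id] == "<end>":
            break
        new_ids.append(next_id)                                   # step 7
    return " ".join(VOCAB[i] for i in new_ids)                    # step 8


print(respond("neural networks"))  # learn patterns
```

Swap `toy_model` for a network with billions of trained parameters and the surrounding loop stays essentially the same.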

* * *

## Key Takeaways

*   **LLMs** are large neural networks trained on massive text corpora to predict and generate language
    
*   **ChatGPT** generates responses token by token, using learned patterns — not live internet lookups
    
*   **Computers only understand numbers**, so text must be converted to numerical representations
    
*   **Tokenization** breaks text into subword units that balance vocabulary size and expressiveness
    
*   **Transformers** revolutionized NLP with self-attention, allowing every token to attend to every other token in parallel
    
*   **Temperature** controls the creativity vs. predictability of model outputs
    

The next time you ask ChatGPT a question, you're firing up one of the most complex mathematical structures humanity has ever built — one that speaks your language because it read enough of it to know the patterns by heart.

* * *

*Want to go deeper? Check out the original Transformer paper —* [*"Attention Is All You Need" (Vaswani et al., 2017)*](https://arxiv.org/abs/1706.03762) *— and Andrej Karpathy's "Let's build GPT" video on YouTube.*
