Chapter 2 of 20

How ChatGPT Works — LLMs, Tokens & Transformers

Understand large language models in plain English — what tokens are, how the transformer architecture works, what training and RLHF mean, why ChatGPT hallucinates, and what the temperature setting controls.

Meritshot13 min read
ChatGPTLLMTokensTransformerRLHFHallucination
All ChatGPT Chapters

Why Understanding the Machine Matters

You can drive a car without knowing how an internal combustion engine works. But if you understand that the engine needs petrol, operates within a temperature range, and cannot run on water, you become a far safer and more effective driver.

The same principle applies to ChatGPT. You do not need a PhD in machine learning to use it well, but understanding a handful of concepts — tokens, context windows, temperature, hallucination — will transform how you prompt it, help you diagnose when it goes wrong, and let you set the right expectations for every task.


Large Language Models in Plain English

A large language model is a mathematical function that takes a sequence of text as input and produces a probability distribution over what word (or, more precisely, what token) is most likely to come next.

That sounds abstract. Here is the intuition:

Imagine you have read every book, article, forum post, and website ever published. You have read so much that you have developed an extraordinarily refined sense of which word tends to follow which other word in which context. When someone hands you the sentence "The customer paid using" you might guess the next word is "cash", "card", "UPI", or "Paytm" — with UPI and Paytm being especially likely if the surrounding text has an Indian context.

An LLM is a system that has been trained to do exactly this, at a scale that dwarfs any individual's reading. GPT-4 was trained on roughly a trillion tokens of text. The "large" in large language model refers to two things: the size of the training data and the number of parameters (adjustable numerical weights) in the model.

The critical insight is that the model does not "know" facts in the way a database does. It does not look up "What is the capital of India?" in a table. Instead, it generates the most statistically plausible next tokens given your prompt — and because "New Delhi" follows "The capital of India is" in the vast majority of text it trained on, it produces "New Delhi" reliably.

This distinction — generation versus retrieval — explains both the power and the failure modes of ChatGPT.


What Is a Token?

Models do not process text one character at a time or one word at a time. They work with tokens, which are chunks of text that the model has learned to treat as a single unit.

Tokens are roughly 3-4 characters or 0.75 words on average in English. The exact boundaries depend on frequency: common words like "the", "is", "in" are each a single token. Rare or long words are split into multiple tokens.

Token Examples

Text: "ChatGPT is amazing"
Tokens: ["Chat", "G", "PT", " is", " amaz", "ing"]   ← approximate
Count: 6 tokens

Text: "UPI transfer of ₹500"
Tokens: ["U", "PI", " transfer", " of", " ₹", "500"]  ← approximate
Count: 6 tokens

Text: "Hello"
Tokens: ["Hello"]
Count: 1 token

The rupee symbol (₹) is a good example of a less-common character: it will consume one or more tokens on its own, whereas a common English word like "the" is a single token.

Why Tokens Matter for You

Cost: The OpenAI API charges per token consumed (both input and output). When you are building an application on top of ChatGPT, token count directly affects your bill. For example, if you are summarising support tickets for an e-commerce company like Meesho, each ticket might be 200-300 tokens. Processing a million tickets a month is 200-300 million tokens — a meaningful cost.

Context window: Every model has a maximum context window — the total number of tokens it can hold in memory at once, including your prompt and its response. GPT-4o has a 128,000-token context window, which is roughly 100,000 words or a short novel. If your conversation exceeds this limit, the model starts "forgetting" the earliest parts.

Prompt design: Concise prompts are better not just for cost but because they leave more room in the context window for the model's response and for any documents you paste in.


The Transformer Architecture (Conceptual)

ChatGPT is based on the transformer architecture, introduced by Google researchers in the landmark 2017 paper "Attention Is All You Need." You do not need to understand the mathematics, but grasping two concepts — attention and context — will make you a better user.

Attention: How the Model Relates Words to Each Other

Transformers process an entire sequence of tokens simultaneously (not word-by-word like older models). The attention mechanism allows the model to weigh how relevant each token is to every other token in the sequence when generating the next token.

A concrete example: In the sentence "The bank on the river was steep, but the bank refused my loan," the model must figure out that the first "bank" refers to a riverbank and the second "bank" refers to a financial institution. Attention allows the model to look at surrounding context — "river", "steep", "loan", "refused" — to disambiguate.

This ability to relate distant words to each other is what makes transformers dramatically better than older architectures at understanding nuanced language.

Context Window: Working Memory

Think of the context window as the model's working memory — everything it can "see" in one inference pass. Your full conversation history, system instructions, pasted documents, and the model's own previous responses all count toward this limit.

When the context window fills up, the model does not crash — it simply loses access to the oldest parts of the conversation. You may notice this in very long sessions: the model "forgets" something you mentioned early on. The fix is to start a new conversation and briefly recap the key context.


How Training Works

ChatGPT's capabilities come from two distinct training phases.

Phase 1: Pre-Training

In pre-training, the model is shown enormous quantities of text and trained to predict the next token. The training data includes web pages, books, academic papers, code repositories, and more — scraped over years and cleaned to remove the worst content.

No human tells the model what facts are true or what tone is appropriate during this phase. It simply learns the statistical patterns of language at massive scale. This is computationally expensive: training GPT-4 is estimated to have cost tens of millions of dollars in compute.

After pre-training, the model can generate fluent text, but it does not reliably follow instructions. Ask it to "explain compound interest" and it might continue the phrase with another question rather than answering — because on the internet, questions often appear after other questions.

Phase 2: Fine-Tuning with RLHF

RLHF stands for Reinforcement Learning from Human Feedback. This is the step that transforms a fluent-but-unfocused pre-trained model into the helpful, instruction-following assistant you use in ChatGPT.

The process works in three steps:

Step 1 — Supervised fine-tuning: Human trainers write example prompts and ideal responses. The model is fine-tuned to imitate these examples.

Step 2 — Reward model training: Human raters are shown several model responses to the same prompt and rank them from best to worst. A separate "reward model" is trained to predict these human preferences.

Step 3 — RL optimisation: The main model generates responses, the reward model scores them, and the main model is updated to produce responses that score higher. This loop continues until the model reliably produces responses humans prefer.

RLHF is why ChatGPT is helpful, relatively safe, and good at following complex instructions. It is also part of why it can sound overly agreeable — the human raters who trained the reward model tended to prefer confident, complete-sounding answers, which the model learns to produce even when it is uncertain.


Why ChatGPT Hallucinates

Hallucination — the model stating confident falsehoods — is one of the most discussed (and most misunderstood) behaviours of LLMs.

The root cause is architectural: the model is a next-token predictor. It produces the most statistically plausible continuation of your prompt. If the most plausible-sounding continuation happens to include a fabricated name, date, or citation, the model has no internal "fact-checking" mechanism that catches this before generation.

Examples of hallucination in practice

Prompt: "Cite three peer-reviewed papers on customer churn in Indian telecom."

Possible hallucinated output:
1. Sharma, R. & Gupta, A. (2021). "Churn Prediction in Jio 4G Networks."
   Journal of Telecom Management, 14(2), 45-62.

← This paper may not exist. The author names, journal name, volume, and
  page numbers are all plausible-sounding but potentially fabricated.

Hallucination is more likely when:

  • You ask for very specific facts (exact dates, statistics, citations)
  • The topic is niche and underrepresented in training data
  • The model is "pushed" toward a confident answer by the phrasing of your prompt

Mitigation strategies:

  • Ask for general explanations rather than specific citations, then verify those separately
  • Prompt the model to say "I don't know" or "I'm not sure" when it is uncertain ("If you don't know the answer with confidence, say so rather than guessing")
  • Cross-check any specific claim that will be acted upon

What the Temperature Setting Does

When you access ChatGPT through the API or through some interfaces, you can adjust a parameter called temperature. It controls how creative — or how predictable — the model's responses are.

Technically, temperature scales the probability distribution over possible next tokens before the model samples from it.

TemperatureEffectBest For
0.0Deterministic — always picks the single most likely tokenFactual Q&A, code generation, structured data extraction
0.3 – 0.7Balanced — mostly predictable with occasional variationMost everyday tasks, writing assistance, summaries
0.8 – 1.2Creative — more varied and surprising word choicesBrainstorming, creative writing, generating multiple options
Above 1.5Very unpredictable — can become incoherentRarely useful in practice

A practical analogy: Imagine you are writing a report about Zomato's 2025 revenue. At temperature 0, every time you ran the same prompt you would get virtually the same response — precise, consistent, and factual. At temperature 1.2, each run might produce a differently phrased, sometimes surprisingly creative take — useful if you want variety, risky if you need accuracy.

In the standard ChatGPT interface, temperature is managed automatically. The model uses a relatively low temperature for factual questions and a higher one for creative tasks. But when you build applications on the API, choosing temperature deliberately is an important skill.


The Full Pipeline: Prompt to Response

Putting it all together, here is what happens between you pressing Enter and the response appearing on screen:

1. Your text is tokenised
   "Explain UPI to a first-time smartphone user" → [tokens]

2. Tokens pass through the transformer layers
   Attention heads compute relationships between all tokens

3. The model generates output tokens one at a time
   Each token is sampled from the probability distribution (shaped by temperature)

4. Output tokens are decoded back into text
   [tokens] → "UPI stands for Unified Payments Interface..."

5. The response is streamed to your screen token by token
   (Which is why text appears character-by-character, not all at once)

The entire process for a typical response takes one to five seconds for a cloud-served model like ChatGPT, depending on response length and server load.


Common Pitfalls

Assuming the model "knows" something because it states it confidently. Confidence in tone is a byproduct of how LLMs generate text — they always pick plausible-sounding continuations. A confident statement and an accurate statement are not the same thing.

Expecting perfect consistency. Because the model samples probabilistically, the same prompt may produce slightly different outputs on different runs. If consistency is critical (e.g., for an automated pipeline), use temperature 0 and test thoroughly.

Blaming "outdated knowledge" for all errors. Some errors are not due to the knowledge cutoff — they are hallucinations about topics well within the training window. Distinguishing between a cutoff issue and a hallucination helps you choose the right mitigation.

Ignoring the context window in long conversations. If you notice the model seems to have "forgotten" something important you said 30 messages ago, it has — it has fallen outside the context window. Start a new chat and restate the essential context.

Confusing temperature with quality. Higher temperature does not mean better responses. It means more varied responses. For most analytical or factual tasks, a lower temperature produces more reliable output.


Practice Exercises

  1. Open ChatGPT and start a conversation. Type "What is 17 multiplied by 348?" Then verify the answer with a calculator. Repeat with "What is 17,432 divided by 0.0037 rounded to two decimal places?" Note how accuracy changes for harder arithmetic — this illustrates that the model is a pattern predictor, not a calculator.

  2. Ask ChatGPT: "Cite a specific peer-reviewed study on the impact of UPI on financial inclusion in rural India, including authors, journal, year, and DOI." Then search Google Scholar for the paper it names. This exercise builds the habit of verifying citations.

  3. Ask the same question three times in a row in three separate conversations: "Give me a creative tagline for a chai delivery app." Compare the three answers. This demonstrates the effect of probabilistic sampling — the outputs will differ even for the same prompt.

  4. Paste a 200-word passage from any article into ChatGPT and ask: "Summarise this in exactly three bullet points." Count the tokens roughly (number of words divided by 0.75) and think about how many such passages would fit in a 128,000-token context window.

  5. Ask ChatGPT to explain the transformer attention mechanism to you as if you are a Class 11 student who understands basic linear algebra. Evaluate whether the explanation is accurate and appropriately pitched — this tests your own understanding of what you just learned.


Summary

  • A large language model is a next-token prediction engine trained on massive text corpora; it generates text by sampling statistically plausible continuations, not by retrieving stored facts.
  • Tokens are the atomic units of text that models process — roughly 0.75 words each in English; token count affects API cost and the model's context window.
  • The transformer architecture uses attention mechanisms to relate every token to every other token in the context, enabling nuanced language understanding.
  • Training happens in two phases: large-scale pre-training on raw text, followed by RLHF fine-tuning using human feedback to shape the model into a helpful, instruction-following assistant.
  • Hallucination occurs because the model has no fact-checking layer — it produces plausible-sounding text even when the underlying claim is false; always verify specific facts.
  • Temperature controls output randomness: low temperature for precise and consistent responses, higher temperature for creative and varied ones.