~/nathan

building practical AI systems

session://blog/part-3-llms-and-modern-ml-the-new-fundamentals

$ cat posts/part-3-llms-and-modern-ml-the-new-fundamentals.md

blog/Tech/Feb 12, 2026

### Part 3 of 7

Part 3: LLMs and Modern ML (The New Fundamentals)

$ render article --theme terminal-notes

(Part 3 of a 7-part series on what it takes to be a strong ML engineer in 2026)

Check out Part 2: Training and Loss Functions (What You’re Actually Optimizing) here.

I thought I understood LLMs.

I mean, I could explain transformers. I knew about attention mechanisms. I had read the “Attention Is All You Need” paper (okay, skimmed it).

Then I tried to actually build something with LLMs in production.

Turns out, there’s a big gap between “I can use an API” and “I understand how this actually works and why it fails.”

Let me share what I’m learning as I try to cross that gap.

Tokenization: Where many problems begin (and nobody notices)

I’m embarrassed to admit how long I ignored tokenization. It seemed like an implementation detail — just the thing that converts text to numbers so the model can process it.

Wrong. Tokenization is where a lot of subtle problems originate.

What I’m learning about tokenizers:

They fragment your domain-specific language in unpredictable ways:

  • Medical terms get split into meaningless pieces
  • Code identifiers break in weird places
  • Non-English text gets brutally over-tokenized
  • Numbers can be one token or many, depending on formatting

Real example I tested:

  • “cardiovascular” → might be 1 token or 3 tokens depending on the tokenizer
  • “2023” → 1 token
  • “2,023” → 3 tokens (the comma breaks it)
  • “COVID-19” → 2–4 tokens
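You can see the formatting effect without downloading any model files. Below is a toy pre-tokenizer that mimics how GPT-style BPE splits text before merging (runs of letters, runs of digits, punctuation) — a stand-in for a real tokenizer like tiktoken or sentencepiece, so exact counts will differ, but the comma effect is the same:

```python
import re

# Toy pre-tokenizer mimicking GPT-style pre-tokenization: runs of letters,
# runs of digits, and individual punctuation marks become separate pieces.
# A real BPE would merge further, but the splits below survive merging.
PRETOKEN = re.compile(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]")

def count_tokens(text: str) -> int:
    return len(PRETOKEN.findall(text))

for s in ["2023", "2,023", "COVID-19"]:
    print(s, "->", count_tokens(s), "tokens")
# "2,023" splits into "2", ",", "023" — three pieces from one number
```

Run your actual domain strings through your actual tokenizer the same way; the point is to look, not to guess.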

Why this matters:

Token counts determine your costs:

  • Paying per token, not per word
  • That innocent comma just increased your costs by 3x for that number
  • Formatting choices affect your monthly bill
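Back-of-the-envelope math makes the billing point concrete. The numbers below (2K-token prompts, 10K requests/day, $3 per million input tokens) are purely illustrative, not any provider’s actual pricing:

```python
def monthly_cost(tokens_per_request: int, requests_per_day: int,
                 price_per_million: float) -> float:
    """Estimate monthly token spend; 30-day month, input tokens only."""
    monthly_tokens = tokens_per_request * requests_per_day * 30
    return monthly_tokens / 1_000_000 * price_per_million

# Hypothetical workload and price — plug in your own numbers
print(f"${monthly_cost(2000, 10_000, 3.0):,.2f}/month")  # $1,800.00/month
```

Now rerun it after a formatting change that bloats every prompt by 10% and watch the bill move with it.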

Token limits are hard walls:

  • 8K context means 8K tokens, not 8K words
  • Your carefully crafted prompt might get truncated mid-sentence
  • That JSON structure you’re asking for? Might not fit

Tokenization creates biases:

  • English: ~1 token per word
  • Chinese: ~2–3 tokens per character
  • Some African languages: even worse
  • Same semantic content, very different token costs

What I wish I’d known earlier:

  • Always check how your domain-specific text tokenizes
  • Test your prompts at different lengths — they might break at token boundaries
  • Budget for tokens, not characters
  • Different tokenizers behave very differently (GPT vs. Claude vs. Llama)

Attention and Context: The fundamental constraints

The thing about transformer attention that I didn’t fully appreciate: it’s not just a clever mechanism. It’s a fundamental constraint on what these models can do.

What “context window” actually means:

It’s not just “how much text fits”:

  • It’s how much the model can “pay attention to” simultaneously
  • Everything in that window competes for the model’s focus
  • More context doesn’t always mean better performance
  • The model can “forget” things at the start of a long context

Attention is O(n²):

  • Double your context length = 4x the compute
  • This is why long context is expensive
  • This is why there are hard limits on context size
  • This is why clever chunking strategies matter
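The quadratic term is easy to verify on paper. This sketch counts only the two n×n×d matrix products in attention (scores and weighted values), ignoring projections and softmax — enough to show the scaling:

```python
def attention_flops(n: int, d: int) -> int:
    # QK^T scores: ~n*n*d multiply-adds; attention-weighted values: ~n*n*d more.
    # Deliberately ignores QKV projections and softmax — just the quadratic part.
    return 2 * n * n * d

base = attention_flops(4096, 128)
doubled = attention_flops(8192, 128)
print(doubled / base)  # -> 4.0: double the context, 4x the attention compute
```

Real serving stacks use tricks (FlashAttention, sliding windows) to manage constants and memory, but the n² term is why those tricks exist at all.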

The “lost in the middle” problem:

  • Models pay more attention to the start and end of context
  • Information in the middle can get lost
  • This breaks naive RAG implementations
  • Putting your most important info in the middle is a recipe for failure
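One practical mitigation is to reorder retrieved documents so the strongest ones sit at the edges of the context. This is a sketch of that idea (similar in spirit to the “long-context reorder” trick some RAG libraries ship); the function name and approach are illustrative:

```python
def reorder_for_long_context(docs: list[str]) -> list[str]:
    """Place the highest-ranked docs at the edges of the context.

    Assumes docs is sorted best-first. Alternates docs between the front
    and the back so the weakest land in the middle, where models attend
    least. A heuristic, not a guarantee.
    """
    front, back = [], []
    for i, doc in enumerate(docs):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

# Best docs (d1, d2) end up first and last; weakest (d5) lands in the middle
print(reorder_for_long_context(["d1", "d2", "d3", "d4", "d5"]))
```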

KV Cache: The thing that determines your serving costs:

I didn’t understand KV cache until recently. Now I realize it’s critical for production LLM systems.

What it is:

  • During generation, the model caches key-value pairs from attention
  • Reuses them for each new token instead of recomputing
  • This is what makes generation feasible at all

Why it matters:

  • KV cache size grows with context length
  • It’s often the memory bottleneck in serving
  • Longer contexts = more cache = more memory = fewer concurrent requests
  • This is why context length and batch size trade off
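You can estimate the memory cost directly: per layer, the cache holds a K tensor and a V tensor of shape [batch, kv_heads, seq_len, head_dim]. The config below is an illustrative 7B-class model in fp16, not any specific released model:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_el: int = 2) -> int:
    # 2 tensors (K and V) per layer, each [batch, kv_heads, seq_len, head_dim]
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_el

# Illustrative 7B-class config: 32 layers, 32 KV heads, head_dim 128, fp16
gb = kv_cache_bytes(32, 32, 128, seq_len=8192, batch=1) / 1e9
print(f"{gb:.1f} GB of KV cache for one 8K-context request")
```

At roughly 4.3 GB per 8K-context request in this configuration, an 80 GB GPU fits only a handful of concurrent long-context requests — that’s the context-length/batch-size tradeoff in one number. (Grouped-query attention shrinks this by reducing `n_kv_heads`.)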

What I’m learning:

  • Monitor KV cache hit rates in production
  • Understand the memory cost of long contexts
  • Design prompts with cache efficiency in mind
  • Consider whether you really need that full context

Fine-tuning vs. LoRA vs. RAG:

Fine-tuning: When you need to change behavior fundamentally

Use when:

  • You need the model to learn domain-specific patterns
  • You have enough quality data (thousands of examples minimum)
  • You can afford the compute and storage
  • You need consistent behavior across all requests

Don’t use when:

  • Your knowledge changes frequently
  • You don’t have enough training data
  • You can’t afford full model copies
  • You just need to inject some facts

LoRA (Low-Rank Adaptation): Efficient adaptation

Use when:

  • You want most of the benefits of fine-tuning at a fraction of the cost
  • You need multiple adapted versions of the same base model
  • You want faster training
  • Storage/memory is constrained

Don’t use when:

  • You need to completely change model behavior
  • You’re just adding retrievable facts
  • Your task is already handled well by the base model
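The “cheaper” claim is just arithmetic: LoRA replaces a full d_in × d_out weight update with two low-rank factors, A (d_in × r) and B (r × d_out). A quick count for one 4096 × 4096 projection:

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    # LoRA trains two low-rank factors A (d_in x r) and B (r x d_out)
    # instead of updating the full d_in x d_out weight matrix.
    return d_in * rank + rank * d_out

full = 4096 * 4096                      # one full projection matrix
lora = lora_params(4096, 4096, rank=8)  # rank 8 is a common starting point
print(f"trainable fraction of this matrix: {lora / full:.4f}")  # 0.0039
```

Under half a percent of the parameters per adapted matrix — which is also why you can store many task-specific adapters against one base model.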

RAG (Retrieval-Augmented Generation): Dynamic knowledge

Use when:

  • Your knowledge updates frequently
  • You need to cite sources
  • You want explainability
  • Your knowledge base is too large for context

Don’t use when:

  • You need the model to internalize patterns, not just retrieve facts
  • Your retrieval quality is poor
  • Latency is critical
  • Your knowledge is small enough to fit in context

The answer: Often you combine them

  • Fine-tune for domain behavior
  • Use LoRA for task-specific adaptations
  • Add RAG for dynamic, updateable facts

RAG: The failure points nobody talks about in tutorials:

Chunking destroys semantic coherence:

  • Split at 512 tokens? You just broke an explanation in half
  • Split at paragraphs? Some paragraphs are 2 words, some are 200
  • Split at sentences? Complex documents have sentence dependencies
  • No perfect answer. Every strategy has tradeoffs
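For reference, here is the most common compromise — fixed-size token chunks with overlap, so a sentence cut at one boundary survives intact in the neighboring chunk. A sketch of one strategy among many, not a recommendation:

```python
def chunk_tokens(tokens: list[str], size: int = 512,
                 overlap: int = 64) -> list[list[str]]:
    """Fixed-size token chunking with overlap between adjacent chunks."""
    assert 0 <= overlap < size
    chunks, step = [], size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

toks = [f"t{i}" for i in range(1200)]
chunks = chunk_tokens(toks, size=512, overlap=64)
print(len(chunks), "chunks; chunk 2 starts at", chunks[1][0])
```

The overlap buys back some of the coherence the split destroys, at the price of duplicated tokens in your index — another tradeoff, not a fix.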

Embeddings don’t capture your domain’s notion of relevance:

  • General-purpose embeddings optimize for general similarity
  • Your domain has specific notions of “relevant”
  • “Similar words” ≠ “relevant for this query”
  • You might need domain-specific embedding models

Retrieval finds plausible-sounding but wrong information:

  • High semantic similarity ≠ correct answer
  • Model retrieves confident-sounding wrong information
  • Generates plausible answers from irrelevant context

Reranking helps but isn’t magic:

  • Cross-encoders are slow
  • Faster rerankers are less accurate
  • Reranking still depends on retrieval quality
  • Garbage in, garbage out. Better ranked garbage is still garbage

Grounding doesn’t actually ground:

  • Just because you retrieved relevant docs doesn’t mean the model used them
  • Models can ignore retrieved context and hallucinate anyway
  • Citation systems can cite irrelevant sources confidently
  • Validating that retrieval helped is surprisingly hard

What I’m learning to do:

Test retrieval independently:

  • Is your retrieval finding the right documents?
  • Measure this separately from generation quality
  • Track precision and recall of retrieval
  • Use retrieval-specific metrics (MRR, nDCG)
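MRR in particular takes about ten lines to compute from labeled queries — each query needs only a ranked list marked relevant/irrelevant:

```python
def mean_reciprocal_rank(ranked_relevance: list[list[bool]]) -> float:
    """MRR: average of 1/rank of the first relevant result per query.

    Each inner list marks the relevance of that query's ranked results;
    a query with no relevant result contributes 0.
    """
    total = 0.0
    for results in ranked_relevance:
        for rank, relevant in enumerate(results, start=1):
            if relevant:
                total += 1 / rank
                break
    return total / len(ranked_relevance)

# Three toy queries: relevant doc at rank 1, at rank 3, and never retrieved
print(mean_reciprocal_rank([[True, False], [False, False, True], [False, False]]))
```

A number like this, tracked over time against a fixed evaluation set, tells you whether a “RAG problem” is actually a retrieval problem — before you touch the generator.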

Validate that retrieval actually helps:

  • Compare with vs. without retrieval
  • Check if the model is actually using retrieved context
  • Measure hallucination rates with and without RAG
  • Don’t assume retrieval is helping. Measure it. If you can’t measure it, you can’t manage it.

Monitor in production:

  • Track which queries fail to retrieve relevant docs
  • Monitor retrieval latency
  • Watch for drift in retrieval quality
  • Detect when your knowledge base becomes stale

Hallucinations: Not bugs, features of the architecture

This is the hardest thing for me to accept: hallucinations aren’t bugs to fix with better prompting.

They’re fundamental to how autoregressive language models work.

Why models hallucinate:

They’re trained to always produce plausible text:

  • Rewarded for fluent, confident output
  • No inherent mechanism for uncertainty
  • Will confidently make things up rather than refuse

They don’t have internal fact databases:

  • Knowledge is distributed across parameters
  • No clean separation of “known facts” vs “uncertain guesses”
  • Can’t check their own knowledge reliably
  • Memorization and confabulation look similar

They are context-limited:

  • Can’t always retrieve relevant training information
  • Might have seen correct info but can’t access it
  • Will fill in gaps with plausible-sounding guesses

What this means for production:

You can’t eliminate hallucinations:

  • Better prompting helps but isn’t sufficient
  • RAG helps but isn’t perfect
  • Fine-tuning can reduce but not eliminate
  • You need system-level solutions, not just prompt engineering

Design for hallucinations:

  • Add verification steps
  • Use retrieval with source citation
  • Implement confidence thresholds
  • Build human review for critical outputs
  • Fail gracefully when uncertain
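The confidence-threshold idea can be as simple as a wrapper that refuses to generate when retrieval looks weak. Everything here is hypothetical — the function names, the `(docs, top_score)` retrieval contract, and the 0.7 threshold are placeholders for your own system:

```python
def answer_with_fallback(query: str, retrieve, generate,
                         min_score: float = 0.7) -> str:
    """Abstain instead of guessing when retrieval confidence is low.

    Assumed contracts (illustrative, not a real API):
      retrieve(query) -> (docs, top_score in [0, 1])
      generate(query, docs) -> str
    """
    docs, top_score = retrieve(query)
    if top_score < min_score:
        # Fail gracefully: refuse and route to review rather than hallucinate.
        return "I'm not confident enough to answer that; escalating to review."
    return generate(query, docs)

# Toy stand-ins for retrieve/generate to show both paths
print(answer_with_fallback("q", lambda q: ([], 0.2), lambda q, d: "answer"))
print(answer_with_fallback("q", lambda q: (["doc"], 0.9), lambda q, d: "answer"))
```

Retrieval score is a crude proxy for model confidence, but a crude abstention beats a fluent fabrication in most production settings.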

Measure hallucination rates:

  • Don’t just assume your system is accurate
  • Test with known ground truth
  • Monitor hallucinations in production
  • Track which types of queries cause hallucinations

What I have heard from production engineers

“Test your system at the token limit. That’s where things break.”

“RAG quality is 80% retrieval, 20% generation. Fix retrieval first.”

“Every LLM has different tokenization. Budget separately for each.”

“Hallucinations will happen. Your system needs to handle them gracefully.”

“Context length and serving costs trade off directly. Design accordingly.”

Tomorrow: Part 4 (Production systems — where models go to die)

What I’m learning about data pipelines, monitoring, deployment, and why most “model problems” are actually infrastructure problems.

And as always, if you spotted something I misunderstood, please correct me. I’m here to learn, not to be right.

$ ls related/