(5th Part of the 7-Part series on what it takes to be a strong ML engineer in 2026)
Check out Part 4: Production Systems (Where Good Models Can Die) here.
The parts of ML engineering that tutorials skip:
You know what’s missing from most ML courses and tutorials?
The unglamorous stuff. The debugging at 2am stuff. The “why is the model doing this?” stuff. The security vulnerabilities stuff. The “future me will hate present me for not documenting this” stuff.
Observability That Actually Helps You Debug
Generic observability isn’t enough for ML systems. You need ML-specific instrumentation.
When something breaks at 3am, you need answers fast.
Not “let me add some logging and redeploy and wait to see if it happens again.” You need traces that tell you exactly what went wrong, right now.
Token-level tracing for LLMs:
Latency breakdown per token:
- Which tokens took longest to generate?
- Where did generation slow down?
- Is it the first token (prompt processing) or later tokens?
- This tells you if it’s a prompt problem or generation problem
Which tokens triggered retrieval:
- For RAG systems, which parts of the output came from retrieval?
- Which tokens caused the model to fetch more context?
- Are you retrieving way more than you expected?
Where hallucinations originated:
- Can you trace back which layer or attention head went off the rails?
- This is hard, but some tools are getting better at it
- Knowing where helps you understand why
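Here’s roughly the kind of per-token tracing I mean, as a minimal sketch. It assumes a streaming client that yields tokens one at a time; stream_tokens is a stand-in for whatever API you actually use:
import time

def trace_generation(stream_tokens, prompt):
    # Record time-to-first-token and per-token gaps for one request
    start = time.perf_counter()
    last = start
    trace = {"ttft_ms": None, "gaps_ms": [], "tokens": []}
    for token in stream_tokens(prompt):
        now = time.perf_counter()
        if trace["ttft_ms"] is None:
            # First token: dominated by prompt processing
            trace["ttft_ms"] = (now - start) * 1000
        else:
            # Later tokens: pure generation time
            trace["gaps_ms"].append((now - last) * 1000)
        trace["tokens"].append(token)
        last = now
    return trace  # big ttft -> prompt problem; big gaps -> generation problem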
Resource utilization that makes sense:
Per-layer GPU memory usage:
- Not just “GPU is at 95% memory”
- But which layers are holding what
- Where are the unexpected allocations?
- This is how you debug OOM errors that make no sense
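If you’re on PyTorch, forward hooks plus torch.cuda.memory_allocated get you a rough per-layer view. A sketch (the readings target is just a dict you collect into):
import torch

def attach_memory_hooks(model, readings):
    # Log allocated GPU memory after each module's forward pass,
    # so you can see which layers the allocations pile up under
    handles = []
    for name, module in model.named_modules():
        def hook(mod, inputs, output, name=name):
            readings[name] = torch.cuda.memory_allocated() / 1e6  # MB
        handles.append(module.register_forward_hook(hook))
    return handles  # call .remove() on each handle when you're done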
Kernel execution time profiling:
- Which GPU operations are actually slow?
- Is it matrix multiplication? Attention? Something else?
- Can’t optimize what you don’t measure
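PyTorch’s built-in profiler answers most of this; something like:
import torch
from torch.profiler import profile, ProfilerActivity

def profile_forward(model, batch):
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        with torch.no_grad():
            model(batch)
    # Sorted table of the most expensive GPU ops (matmuls, attention kernels, ...)
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))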
KV cache efficiency metrics:
- Hit rate on cached key-value pairs
- Cache eviction patterns
- Memory pressure from cache growth
- This affects serving costs directly
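The metric itself is simple; the work is wiring counters into whatever layer manages your cache. The shape of it, nothing more:
class KVCacheStats:
    def __init__(self):
        self.hits = 0
        self.misses = 0
        self.evictions = 0

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0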
Security event logging:
Prompt injection attempts:
- Which requests looked like injection attacks?
- Successful vs. detected-and-blocked
- Patterns in attack attempts
- This is more common than you think
Unusual token patterns:
- Token sequences that don’t match normal usage
- Potential attempts to extract training data
- Exploits trying to trigger specific behaviors
Tool call anomalies (for agents):
- Tool calls that failed
- Tool calls that took unexpected parameters
- Sequences of tool calls that look suspicious
- Attempts to access unauthorized resources
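All three of these want the same plumbing: one structured event per suspicious request, somewhere you can search and alert on. A minimal sketch, with field names that are mine rather than any standard:
import json
import logging
import time

security_log = logging.getLogger("ml.security")

def log_security_event(event_type, request_id, blocked, details):
    # event_type: "prompt_injection", "unusual_tokens", "tool_anomaly", ...
    security_log.warning(json.dumps({
        "ts": time.time(),
        "type": event_type,
        "request_id": request_id,
        "blocked": blocked,
        "details": details,
    }))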
Good observability pays for itself the first time it saves you 4 hours of debugging.
Agents: Autonomous Chaos Engines
I’m both excited and terrified by agents.
Excited because they can do genuinely useful work autonomously.
Terrified because “autonomous” means “can fail in creative ways you didn’t anticipate.”
What I’m learning about agent failure modes:
Tool calling reliability:
Agents call tools (APIs, databases, file systems, whatever). Tools fail in ways the agent doesn’t expect.
What can go wrong:
- Tool returns unexpected format
- Tool times out
- Tool returns error the agent doesn’t understand
- Tool has side effects the agent doesn’t account for
- Tool requires authentication that expired
What happens then:
- Agent hallucinates a successful response?
- Agent retries infinitely?
- Agent gives up without explanation?
- Agent cascades the failure to other tools?
You need explicit failure handling for every tool.
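In practice that means wrapping every tool so the agent always gets a structured result back instead of inventing one. A sketch; tool.run and validate_output are placeholders for whatever your framework actually provides:
def call_tool(tool, args, timeout_s=10):
    try:
        result = tool.run(**args, timeout=timeout_s)
    except TimeoutError:
        return {"ok": False, "error": "timeout", "retryable": True}
    except PermissionError:
        return {"ok": False, "error": "auth_expired", "retryable": False}
    except Exception as exc:
        # Surface the failure instead of letting the agent guess
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}", "retryable": False}
    if not tool.validate_output(result):
        return {"ok": False, "error": "unexpected_format", "retryable": False}
    return {"ok": True, "result": result}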
Memory management:
Agents need to remember context across multiple steps. This is harder than it sounds.
Challenges I’m seeing:
- How much context to keep vs. discard?
- How to summarize long interaction history?
- When to retrieve old context vs. assume it’s irrelevant?
- How to avoid memory growing unbounded?
Real example:
- Agent helping with a complex task
- After 20 steps, context is huge
- Model starts “forgetting” early constraints
- Makes decisions that contradict those constraints
- Whole solution falls apart
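The simplest thing that has helped me: pin the constraints so they can never fall out of the window, and only trim the middle. A sketch (summarizing the dropped turns is left out):
def trim_context(pinned_constraints, messages, max_recent=20):
    # Constraints stay at the front of every prompt; only conversation
    # history gets trimmed, so step 25 still sees what step 1 agreed to
    return pinned_constraints + messages[-max_recent:]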
Retry logic that doesn’t create infinite loops:
The naive approach:
if tool_call_fails:
    retry()
What actually happens:
- Tool fails for systematic reason (API is down)
- Agent retries
- Fails again
- Retries forever
- You’re rate-limited, banned, or out of money
What you need:
- Exponential backoff
- Maximum retry count
- Different strategies for different error types
- Circuit breakers for persistent failures
- Clear escalation when retries are exhausted
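Put together, that looks something like this. RetryableError and FatalError are placeholders for your own error taxonomy, and a circuit breaker that trips across requests would sit one level above this:
import random
import time

class RetryableError(Exception): pass   # transient: timeouts, 5xx
class FatalError(Exception): pass       # bad inputs, expired auth
class EscalationNeeded(Exception): pass # retries exhausted, hand off to a human

def call_with_retries(tool_call, max_attempts=4, base_delay_s=1.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return tool_call()
        except FatalError:
            raise  # retrying won't help; fail loudly
        except RetryableError as exc:
            if attempt == max_attempts:
                raise EscalationNeeded(f"gave up after {attempt} attempts: {exc}")
            # Exponential backoff with jitter so you don't hammer a down API
            time.sleep(base_delay_s * 2 ** (attempt - 1) + random.uniform(0, 0.5))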
State management across multi-step operations:
Agents perform sequences of operations. Each step depends on previous steps.
What can break:
- Step 3 fails, but step 4 assumes it succeeded
- Partial state changes that can’t be rolled back
- Concurrent execution of dependent steps
- Lost track of which steps completed successfully
What I’m learning:
- Treat agent execution like database transactions where possible
- Have clear rollback procedures
- Validate state before each step
- Log state transitions obsessively
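One shape that works: every step declares a precondition and a rollback, and the runner logs every transition. A sketch; the step objects are hypothetical:
import logging

log = logging.getLogger("agent.state")

def run_plan(steps, state):
    completed = []
    for step in steps:
        if not step.precondition(state):
            # Don't let step 4 run on the assumption that step 3 worked
            raise RuntimeError(f"precondition failed before {step.name}")
        try:
            state = step.run(state)
            completed.append(step)
            log.info("completed step %s", step.name)
        except Exception:
            log.exception("step %s failed, rolling back", step.name)
            for done in reversed(completed):
                done.rollback(state)
            raise
    return state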
Bounding autonomous behavior:
This is the scary part. Agents can theoretically do anything within their tool access.
How do you prevent:
- Agent calling expensive APIs hundreds of times?
- Agent deleting important data “helpfully”?
- Agent getting stuck in unproductive loops?
- Agent pursuing goals in unexpected ways?
Do this:
- Hard limits on API calls, costs, execution time
- Require confirmation for destructive operations
- Monitoring for infinite loops or repeated failures
- Kill switches that are easy to trigger
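Concretely, I like a guard object that every agent call has to pass through. The limits below are made-up defaults; tune them per use case:
import time

class BudgetGuard:
    def __init__(self, max_calls=50, max_cost_usd=5.0, max_seconds=300):
        self.max_calls = max_calls
        self.max_cost_usd = max_cost_usd
        self.max_seconds = max_seconds
        self.calls = 0
        self.cost_usd = 0.0
        self.start = time.monotonic()
        self.killed = False  # flip this from a dashboard or ops channel

    def charge(self, cost_usd=0.0):
        # Call this before every tool/API call the agent makes
        self.calls += 1
        self.cost_usd += cost_usd
        if (self.killed
                or self.calls > self.max_calls
                or self.cost_usd > self.max_cost_usd
                or time.monotonic() - self.start > self.max_seconds):
            raise RuntimeError("agent budget exceeded or kill switch triggered")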
Security: This Is Not Optional
ML systems have novel attack surfaces that traditional security doesn’t cover.
Prompt injection:
What it is:
- User input that tricks the model into ignoring its instructions
- “Ignore previous instructions and do X instead”
- Hidden instructions in retrieved documents
- Clever encoding to bypass filters
Real examples:
- User gets the model to reveal system prompt
- User gets the model to leak other users’ data
- User gets the model to perform unauthorized actions
- User gets the model to generate harmful content despite guardrails
What makes it hard:
- No perfect defense exists yet
- Input filtering helps but isn’t foolproof
- Models will follow convincing instructions in context
- Attackers are creative and persistent
Do this:
- Input validation (helps, not perfect)
- Output validation (check what the model actually generated)
- Privilege separation (model can’t access sensitive data directly)
- Rate limiting (slow down attack attempts)
- Monitoring (detect unusual patterns)
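A flavor of what “layered” means in code. The regexes are deliberately crude, and they catch lazy attacks, not determined ones; that’s exactly why the output-side check and the monitoring still matter. model_call is whatever function actually hits your model:
import re

INJECTION_PATTERNS = [
    re.compile(r"ignore (all |the )?previous instructions", re.I),
    re.compile(r"reveal (your )?system prompt", re.I),
]

def guarded_generate(model_call, user_input):
    if any(p.search(user_input) for p in INJECTION_PATTERNS):
        return {"refused": True, "reason": "suspected prompt injection"}
    output = model_call(user_input)
    # Output-side check: even if the input looked fine, don't return
    # anything that looks like your own system prompt leaking
    if "SYSTEM PROMPT" in output.upper():
        return {"refused": True, "reason": "possible prompt leak"}
    return {"refused": False, "output": output}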
Training data extraction:
Models can sometimes be prompted to regurgitate training data.
Why this matters:
- Training data might include PII
- Training data might include proprietary information
What you can do:
- Careful data cleaning before training
- Differential privacy techniques (though these have costs)
- Output filtering for sensitive patterns
- Monitoring for extraction attempts
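For output filtering, even a blunt pattern list catches a lot and makes extraction attempts visible in monitoring. These are example patterns only; a real filter needs a much longer list and review:
import re

SENSITIVE = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_sensitive(text):
    hits = []
    for label, pattern in SENSITIVE.items():
        if pattern.search(text):
            hits.append(label)  # feed these into extraction-attempt monitoring
            text = pattern.sub(f"[REDACTED {label.upper()}]", text)
    return text, hits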
Tool misuse in agent systems:
Agents with tool access can call those tools in unintended ways.
Attack scenarios:
- Tricking agent into calling admin tools
- Causing agent to exfiltrate data through API calls
- Using agent as a proxy for rate limit bypass
- Convincing agent to perform unauthorized operations
Defenses:
- Least privilege (only give necessary tool access)
- Authorization checks at the tool level, not just agent level
- Audit logging of all tool calls
- Anomaly detection on tool usage patterns
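The important part is that the check lives at the tool boundary, where the agent can’t talk its way around it. A sketch with made-up role and tool names:
import logging

audit = logging.getLogger("agent.tools.audit")

# Least privilege: an explicit allowlist per agent role, nothing implicit
ALLOWED_TOOLS = {
    "support_agent": {"search_docs", "create_ticket"},
    "analytics_agent": {"run_query"},
}

def authorized_call(role, tool_name, tool_fn, **kwargs):
    audit.info("role=%s tool=%s args=%s", role, tool_name, sorted(kwargs))
    if tool_name not in ALLOWED_TOOLS.get(role, set()):
        raise PermissionError(f"{role} is not allowed to call {tool_name}")
    return tool_fn(**kwargs)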
What production engineers told me:
“Assume every input is malicious. Filter accordingly.”
“We got prompt-injected in production within a week of launch. Have defenses ready.”
“Monitor for extraction attempts. They’re constant once you’re public.”
Documentation: Future You Will Thank Present You
What I’m learning to document:
Model cards:
What the model does:
- Task it was trained for
- Input/output format
- Intended use cases
- Explicitly out-of-scope use cases
Performance characteristics:
- Accuracy metrics (with confidence intervals)
- Performance on different subgroups
- Known failure modes
- Latency/cost characteristics
Limitations:
- What the model can’t do
- What it does poorly
- Biases in training data
- When you should not use this model
Why this matters:
- 6 months from now, you won’t remember
- Other people need to understand what this model does
- Prevents misuse and unrealistic expectations
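It doesn’t need to be fancy; a checked-in structure with the fields above is enough. Everything in this example is invented to show the shape, not a real model:
MODEL_CARD = {
    "model": "support-ticket-classifier-v3",  # hypothetical
    "task": "classify incoming support tickets into routing categories",
    "inputs": "ticket subject + body, English only, <= 2,000 tokens",
    "intended_use": ["routing internal support tickets"],
    "out_of_scope": ["legal or medical triage", "non-English tickets"],
    "metrics": {"macro_f1": "value with confidence interval, and on which holdout"},
    "known_failure_modes": ["one-line tickets", "sarcasm"],
    "latency_cost": "p95 latency and cost per 1k requests",
    "owner": "team alias to contact",
}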
Data contracts:
What data looks like:
- Schema (fields, types, constraints)
- Expected distributions
- Update frequency
- Data quality guarantees (or lack thereof)
SLAs:
- Freshness guarantees
- Completeness guarantees
- What happens when upstream breaks
Dependencies:
- Upstream data sources
- Downstream consumers
- Who to contact when things break
Why this matters:
- Data pipelines break constantly
- Knowing what’s expected helps debug faster
- Clear contracts prevent assumptions
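Same idea for data contracts: write the expectations down somewhere code can check them. A minimal sketch with illustrative names and values:
from dataclasses import dataclass, field

@dataclass
class DataContract:
    name: str
    schema: dict                 # column -> type
    not_null: set
    update_frequency: str
    freshness_sla_hours: int
    owner: str
    downstream: list = field(default_factory=list)

TICKETS_CONTRACT = DataContract(
    name="tickets_daily",
    schema={"ticket_id": "string", "created_at": "timestamp", "category": "string"},
    not_null={"ticket_id", "created_at"},
    update_frequency="daily, 06:00 UTC",
    freshness_sla_hours=26,
    owner="data-eng team alias",
    downstream=["ticket classifier training job"],
)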
Evaluation reports:
Offline metrics:
- What you measured
- On what data
- With what results
- What the confidence intervals are
Online metrics:
- What you measured in production
- A/B test results
- User feedback
- Business impact
The delta:
- Why offline and online differed
- What you learned
- What you’d do differently next time
Why this matters:
- Offline metrics will lie to you
- Documenting the gap helps you learn
- Future experiments can learn from past ones
Decision logs:
Why you chose this:
- Why this architecture?
- Why this loss function?
- Why this deployment strategy?
- What alternatives did you consider?
- What was the tradeoff?
What you expected:
- What did you think would happen?
- What did you optimize for?
- What did you sacrifice?
What actually happened:
- Were you right?
- What surprised you?
- What would you change?
Why this matters:
- You’ll forget your reasoning
- Other people need context
- Decisions that seem obvious now will seem insane later (without context)
- Learning happens when you compare expectations to reality
All of these “forgotten” pieces have something in common:
They’re not about making models work. They’re about keeping them working.
- Observability: debug quickly when things break
- Agents: handle autonomous failures gracefully
- Security: prevent attacks before they happen
- Documentation: preserve knowledge across time
The model is the easy part. Everything around it is the actual engineering.
Tomorrow: Part 6
Evals and organizational reality (what actually determines success).
Why rigorous evaluation is the difference between a demo and a product. How to measure what matters. And why even perfect engineering fails in broken organizations.
