~/nathan

building practical AI systems

session://blog/part-7-putting-it-all-together-what-this-actually-means

$ cat posts/part-7-putting-it-all-together-what-this-actually-means.md

blog/Tech/Feb 20, 2026

### Part 7 of 7

Part 7: Putting It All Together (What This Actually Means)



$ render article --theme terminal-notes

(Final Post of the 7-Part series on what it takes to be a strong ML engineer in 2026)

Check out Part 6: Evals, Optimization and Organizational Reality (What Actually Determines Success) here.

Here’s what we’ve learned over these past 6 posts:

ML engineering in 2026 is vastly more complex than I thought when I started this journey.

It’s not just about training models. It’s about:

  • Understanding data deeply
  • Measuring rigorously
  • Building reliable systems
  • Deploying safely
  • Operating at scale
  • Working within organizations

And honestly? I’m still at the beginning of this learning curve.

But I’m starting to see the path forward more clearly. Let me share what I’m taking away from all this research and learning.

Stop Collecting Frameworks. Start Building Systems.

This is the biggest mindset shift I’m making.

The old approach (what I was doing):

  • Complete tutorial after tutorial
  • Add frameworks to my resume
  • Build toy projects that work on my laptop
  • Move on to the next shiny thing

The new approach (what I’m trying to do):

  • Pick one real problem
  • Build it end-to-end
  • Deploy it to production (even if it’s just for me)
  • Watch it break
  • Fix it
  • Learn from the entire cycle

What “end-to-end” actually means:

Not just:

  • Train a model
  • Get good offline metrics
  • Call it done

But:

  • Understand the data sources and their failure modes
  • Build pipelines that handle real-world messiness
  • Implement proper evaluation (offline AND online)
  • Deploy with monitoring and observability
  • Handle failures gracefully
  • Optimize for real constraints (cost, latency)
  • Iterate based on production feedback

Why this matters:

You learn more from one production system than from ten courses.

Because production teaches you things tutorials can’t:

  • How data pipelines break in unexpected ways
  • How models degrade silently over time
  • How latency requirements force difficult tradeoffs
  • How users interact with your system in ways you never imagined
  • How organizational constraints shape technical decisions

The questions I’m asking myself now:

Not “Do I know PyTorch?” but “Can I build a complete ML system?”

One that:

  • Ingests messy production data reliably
  • Trains without manual babysitting
  • Evaluates performance honestly
  • Deploys safely with proper monitoring
  • Scales to real usage
  • Degrades gracefully when things go wrong
  • Provides value that exceeds its costs
  • Can be maintained by someone else (or future me)

My Learning Plan Going Forward

Based on everything I’ve researched and learned, here’s what I’m focusing on:

Phase 1: Build one stack deeply

Pick PyTorch (my choice, but JAX is equally valid)

Go deep:

  • Not just .fit() and hope
  • But GPU memory management, profiling, optimization
  • Understanding what’s actually happening during training
  • Being able to debug when things go wrong
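To make "profiling" concrete, here's the kind of helper I mean: a tiny sketch that times one training step and reports peak Python heap. The name `profile_step` is my own invention, and it only sees CPU time and Python-side memory; for real GPU work the right tools are PyTorch's own `torch.profiler` and `torch.cuda.max_memory_allocated`.

```python
import time
import tracemalloc
from contextlib import contextmanager

@contextmanager
def profile_step(label):
    """Time one step and report peak Python heap usage.
    CPU/heap only; GPU profiling needs framework tools."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        print(f"{label}: {elapsed * 1000:.1f} ms, peak {peak / 1024:.1f} KiB")

# Usage: wrap the suspect part of a training loop.
with profile_step("forward+backward"):
    total = sum(i * i for i in range(100_000))  # stand-in for real work
```

Even a crude wrapper like this changes the habit: you stop guessing where the time goes and start measuring it.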

Build something real with it:

  • End-to-end pipeline
  • Real data with real problems
  • Production deployment
  • All the unglamorous parts included

Phase 2: Master the foundations

Statistics and evaluation:

  • Take a proper statistics course
  • Practice A/B testing with real data
  • Build evaluation pipelines
  • Learn to measure things that matter
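For example, deciding whether an A/B result is signal or noise often comes down to something like a two-proportion z-test. Here's a rough stdlib-only sketch; the function name and the traffic numbers are made up for illustration.

```python
from math import erf, sqrt

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test: did variant B convert
    differently from variant A? Returns (z statistic, p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled conversion rate under the null hypothesis (no difference).
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical experiment: 10% vs 12% conversion on 2000 users each.
z, p = two_proportion_z(conv_a=200, n_a=2000, conv_b=240, n_b=2000)
```

The point isn't this particular formula; it's that "variant B looks better" should always come with a number telling you how likely that difference is to be chance.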

Data engineering:

  • Learn SQL deeply (embarrassingly important)
  • Understand data pipelines
  • Practice feature engineering
  • Get comfortable with messy data
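What "messy data" means in practice: blank fields, stray whitespace, values that won't parse. A toy sketch of the defensive-parsing habit (the CSV snippet and `clean_rows` are invented for illustration):

```python
import csv
import io

# Hypothetical raw export: stray whitespace, blanks, bad numerics.
raw = """user_id,age,signup_source
1, 34 ,web
2,,email
3,abc,web
"""

def clean_rows(text):
    """Coerce types defensively; keep rejects so bad rows are
    logged and inspected instead of crashing the pipeline."""
    rows, rejects = [], []
    for row in csv.DictReader(io.StringIO(text)):
        try:
            age_field = row["age"].strip()
            rows.append({
                "user_id": int(row["user_id"]),
                "age": int(age_field) if age_field else None,  # missing -> None
                "signup_source": row["signup_source"].strip().lower(),
            })
        except ValueError:
            rejects.append(row)  # e.g. age="abc"
    return rows, rejects

rows, rejects = clean_rows(raw)
```

Every real pipeline I've read about has some version of this: parse what you can, quarantine what you can't, and never silently drop either.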

Distributed systems basics:

  • Understand how distributed systems fail
  • Learn about queues, retries, idempotency
  • Practice building fault-tolerant systems
  • Because ML systems are distributed systems
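The retries-and-idempotency bullet can be sketched in a few lines: retry transient failures with exponential backoff, and make the write idempotent so a retried request can't apply twice. A toy in-memory version (`retry`, `idempotent_write`, and the flaky call are all hypothetical):

```python
import time

def retry(fn, attempts=3, base_delay=0.01):
    """Retry with exponential backoff; re-raise after the last attempt."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** i)

_seen = {}

def idempotent_write(key, value):
    """First write wins; retried duplicates are no-ops, so a retry
    cannot double-apply its effect."""
    if key not in _seen:
        _seen[key] = value
    return _seen[key]

# A flaky call that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return idempotent_write("order-42", "created")

result = retry(flaky)
```

Retries without idempotency are how you end up charging a customer twice; the two ideas only work as a pair.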

Phase 3: Build production experience

Deploy real systems:

  • Start small (personal projects)
  • Gradually increase complexity
  • Experience the full deployment cycle
  • Learn from production failures

Focus on the boring parts too:

  • Monitoring and observability
  • Deployment automation
  • Pipeline reliability
  • Documentation

Measure everything:

  • Build intuition for what metrics matter
  • Practice connecting technical metrics to business value
  • Learn to detect problems before they compound
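One concrete habit here: track tail latency, not just the average, because the mean hides outliers. A dependency-free sketch using nearest-rank percentiles (the latency numbers are made up):

```python
def percentile(samples, pct):
    """Nearest-rank percentile: small, dependency-free,
    good enough for a homemade dashboard."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[k]

# Hypothetical per-request latencies in ms; one slow outlier.
latencies = [12, 15, 11, 14, 250, 13, 12, 16, 14, 13]
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
mean = sum(latencies) / len(latencies)
# The mean (37 ms) looks fine; the p95 (250 ms) exposes the problem.
```

This is exactly the kind of "detect problems before they compound" measurement: the average says everything is fine while one in twenty users is having a terrible time.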

Phase 4: Specialize

Only after the foundations are solid:

  • Pick a specialization (LLMs, computer vision, etc.)
  • Go deep on domain-specific challenges
  • Build expertise that’s rare and valuable

But not before. Specialization without foundations is fragile.

The Mistakes I’m Making (And Learning From)

Being honest about what’s not working:

Mistake 1: Starting with optimization too early

  • Tried to optimize before I had something working
  • Wasted time on micro-optimizations that didn’t matter
  • Should have focused on getting it working first
  • Lesson: Premature optimization is real

Mistake 2: Underestimating evaluation complexity

  • Thought “I’ll just check if the answers are good”
  • Reality: need systematic evaluation or I’m flying blind
  • Building proper eval framework retroactively is hard
  • Lesson: Build evaluation from day one
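What "evaluation from day one" can look like in its smallest form: a fixed labelled set, an accuracy number, and a list of failures to inspect. A toy sketch (the `predict` rule and the dataset are invented):

```python
def evaluate(predict, dataset):
    """Minimal regression-test style eval: run the model over a fixed
    labelled set, return accuracy plus the failures to inspect."""
    failures = []
    for example, expected in dataset:
        got = predict(example)
        if got != expected:
            failures.append((example, expected, got))
    accuracy = 1 - len(failures) / len(dataset)
    return accuracy, failures

# Hypothetical toy "model": a ticket is urgent if it mentions "down".
def predict(text):
    return "urgent" if "down" in text.lower() else "normal"

dataset = [
    ("Site is down", "urgent"),
    ("Password reset please", "normal"),
    ("Service DOWN again", "urgent"),
    ("Slow but working", "urgent"),  # a miss we want the eval to catch
]
accuracy, failures = evaluate(predict, dataset)
```

Even twenty hand-labelled examples run on every change beats "I'll just check if the answers are good", because the failures list tells you exactly what to fix next.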

Mistake 3: Ignoring edge cases initially

  • Focused on the happy path and didn’t think about error handling
  • Production immediately hit edge cases I didn’t anticipate
  • Lesson: Test failure modes early

Mistake 4: Not documenting as I go

  • Told myself “I’ll document it later”
  • Later me has no memory of why I made certain decisions
  • Now reverse-engineering my own code
  • Lesson: Document decisions when you make them

Mistake 5: Trying to learn everything at once

  • Started reading about 10 different topics simultaneously
  • Got overwhelmed
  • Made little progress on any of them
  • Lesson: Depth in one area > surface knowledge in many

What Success Actually Looks Like

I’m redefining what “good ML engineer” means to me.

Not:

  • Can list every framework
  • Has certifications in everything
  • Knows all the latest papers
  • Can talk impressively about transformers

But:

  • Can build complete systems that work in production
  • Understands when and why things fail
  • Measures rigorously and honestly
  • Makes appropriate tradeoffs consciously
  • Learns continuously from real systems
  • Communicates effectively with stakeholders
  • Ships value, not just code

The shift:

From “breadth of knowledge” to “depth of capability”

From “what I know” to “what I can build”

From “impressive on paper” to “effective in practice”

What I’m measuring myself on:

  • Have I built something that runs in production?
  • Does it actually solve a real problem?
  • Can I debug it when it breaks?
  • Am I measuring its impact honestly?
  • Am I learning from each iteration?

Not:

  • How many frameworks do I know?
  • How many courses have I completed?
  • How impressive does my resume look?

What I Need From You

If you’ve made it this far through all 7 parts, thank you.

Here’s how you can help:

If you’re ahead of me on this journey:

  • Correct my misconceptions (please!)
  • Share what you wish you’d known earlier
  • Point out what I’m still missing
  • Tell me which mistakes I should avoid

If you’re on this journey with me:

  • Share what you’re building
  • Let’s learn together
  • Compare notes on what’s working
  • Support each other through the hard parts

If you’re behind me on this journey:

  • Ask questions
  • Share what’s confusing you
  • Help me understand what needs better explanation
  • Let me know if this series was useful

Final Thoughts

This 7-part series started as a research project to understand what ML engineering actually requires.

It became a roadmap for my own learning journey.

I’m nowhere near the destination. I’m at the beginning.

But at least now I can see the path more clearly.

I know what I need to learn. What I need to build. What I need to measure. What I need to practice.

And I’m excited to walk this path, stumble along the way, and learn from every mistake.

Thank You

To everyone who read along, commented, corrected me, shared insights, and helped me learn:

Thank you.

This series was better because of your input.

And my learning journey is better because I’m not doing it alone.

Let’s keep learning together.

What’s your next step on your ML engineering journey? What are you going to build?

And if this series helped you at all, pass it on to someone else who might benefit.

The best way to learn is together.

Want to follow my journey as I build, fail, learn, and iterate? Hit follow.

Want to call out something I got wrong? Please do. I’d rather be corrected and learn than be wrong quietly.

Here’s to the journey ahead. 🚀

$ ls related/