Systems Thinking for AI Engineers
Software is fragile. Systems are robust.
The AI industry still has not internalized that line, even
as we race to ship agents, copilots, and retrieval pipelines
into production.
Here is a pattern I see repeated: an engineer builds a
prototype. The LLM is impressive. The demo lands. Stakeholders
nod. The team pushes to production. Then the API times out on
a Thursday night. The model hallucinates a legal citation. The
monthly bill arrives at three times the forecast. The system
does not fail gracefully. It just fails.
The problem was never the model. The problem was that nobody
designed the system.
Most AI engineers think in features, not in systems. They
optimize the prompt but ignore the fallback. They benchmark
the model but never test what happens when the model is
unavailable. They celebrate the happy path and never map the
failure modes.
This essay is about a different way of thinking -- one I
learned from 8+ years of shipping software at scale, where
uptime is non-negotiable and "it works on my machine" is not
a deployment strategy.
The Hardware Engineering Lens
The best analogy I have found for AI systems comes from
hardware engineering -- a world where components overheat,
signals degrade, and power supplies fluctuate. Hardware
engineering teaches that every component in a system is
trying to fail. Your job is to design the system so that
when individual parts fail (and they will), the whole thing
keeps working.
That mindset shapes everything about how I approach AI
systems. Three analogies borrowed from hardware that I
apply every day:
Voltage Regulators = Guardrails. A voltage regulator
takes unpredictable input (noisy, fluctuating, sometimes
spiking) and clamps it to a stable output range. Without one,
downstream components fry. LLM guardrails do the same thing.
They take the unpredictable output of a language model and
constrain it to an acceptable range. Both accept variable
input, both produce bounded output, and both dissipate the
excess. A voltage regulator sheds extra energy as heat. A
guardrail sheds hallucinated content as rejected tokens. And
critically, both have a design limit. Push past it, and the
protection fails. Knowing that limit is what separates
engineering from guesswork.
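Here is a minimal sketch of that idea in Python. The bracket-citation convention, the length cap, and the rejection rule are illustrative assumptions, not any particular guardrail library:

```python
from dataclasses import dataclass

@dataclass
class GuardrailResult:
    text: str
    rejected: bool
    reason: str = ""

def clamp_output(raw: str, allowed_sources: set, max_chars: int = 2000) -> GuardrailResult:
    """Constrain unpredictable model output to a bounded, acceptable range."""
    # Bound the size of the output -- the "stable output range".
    if len(raw) > max_chars:
        raw = raw[:max_chars]

    # Shed content that cites sources we never retrieved. A crude,
    # illustrative hallucination check; real guardrails go further.
    cited = {t.strip("[].,") for t in raw.split() if t.startswith("[")}
    unknown = cited - allowed_sources
    if unknown:
        # Past the design limit: reject rather than pass damaged output through.
        return GuardrailResult(text="", rejected=True,
                               reason=f"uncited sources: {sorted(unknown)}")

    return GuardrailResult(text=raw, rejected=False)
```

The point is the shape, not the specifics: variable input in, bounded output out, and an explicit rejection path when the design limit is exceeded.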
Signal-to-Noise Ratio = Hallucination Rate. In signal
processing, SNR measures useful signal relative to background
noise. Every AI system has its own SNR. The "signal" is
factually grounded, contextually relevant output. The "noise"
is hallucinations, irrelevant tangents, and confabulated
details. Better retrieval improves the signal. Better prompts
filter the noise. But here is the part most people miss: you
can also reduce noise at the source by constraining the
input. In hardware, a bandpass filter eliminates frequencies
outside your range of interest. In AI, you constrain the
context window to only the most relevant documents. Same
principle. Different medium.
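A sketch of that bandpass filter applied to retrieval. The score threshold, the token budget, and the assumption that the retriever hands back (text, score) pairs sorted best-first are placeholders for your own pipeline:

```python
def constrain_context(ranked_docs, min_score=0.75, max_tokens=3000):
    """Keep only documents inside the 'band' of interest.

    ranked_docs: list of (text, relevance_score), sorted best-first.
    """
    context, budget = [], max_tokens
    for text, score in ranked_docs:
        if score < min_score:
            break  # everything below the cutoff is noise, by construction
        cost = len(text) // 4  # rough chars-to-tokens estimate
        if cost > budget:
            break
        context.append(text)
        budget -= cost
    return context
```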
Circuit Breakers = Fallback Patterns. A physical circuit
breaker trips when current exceeds a safe threshold. It
sacrifices availability of a single circuit to protect the
building from fire. Software circuit breakers do the same:
when an API's error rate crosses a threshold, the breaker
trips, the system stops calling the failing service, and a
fallback takes over. One unprotected failure can cascade and
take out everything downstream. Every external dependency in
my AI systems gets a circuit breaker. Every single one.
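The pattern is small enough to fit in a few lines. This is a bare-bones sketch, not a production implementation; the thresholds and cool-down are illustrative:

```python
import time

class CircuitBreaker:
    """Stop calling a failing dependency after repeated failures."""

    def __init__(self, max_failures=5, reset_after=60.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = 0.0

    def call(self, primary, fallback):
        if self.failures >= self.max_failures:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()   # breaker is open: don't even try
            self.failures = 0       # cool-down elapsed: probe the service again
        try:
            result = primary()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()
```

Libraries exist for this, but the logic is worth internalizing: count failures, open the breaker, route to the fallback, probe again after a cool-down.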
The Five Properties of a Production-Ready System
Through building and breaking enough systems, I have arrived
at five properties that separate fragile software from
reliable infrastructure. These are not theoretical. They are
what I evaluate in every production system I touch.
1. Redundancy. No single point of failure. If your entire
AI feature depends on one API from one provider, you do not
have a system. You have a bet. Redundancy means multiple LLM
providers with automatic failover. It means cached embeddings
for your most common queries so that when the embedding
service goes dark, 60% of traffic is still served. It means
your retrieval layer can fall back from semantic search to
keyword search without the user seeing an error page.
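A sketch of provider failover, assuming you have wrapped each vendor's SDK in a callable with the same signature (the wrappers themselves are not shown):

```python
def complete_with_failover(prompt, providers):
    """Try each provider in order so a single-vendor outage never reaches the user.

    providers: list of callables taking a prompt and returning a completion.
    """
    last_error = None
    for call_provider in providers:
        try:
            return call_provider(prompt)
        except Exception as exc:  # timeouts, 5xx, rate limits
            last_error = exc
            continue
    raise RuntimeError("all providers failed") from last_error
```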
2. Defined Failure States. Every component must have a
known, tested failure mode. Not "it might crash" but "when
this component returns a 503, the system will respond with
X." I document failure states the way datasheets document
operating limits. If you cannot tell me exactly what happens
when your LLM provider returns a 429, your system is not
ready for production.
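In practice, that documentation can live in code, next to the handler that enforces it. The specific responses below are illustrative:

```python
# A defined failure state is an explicit, tested mapping from "what the
# dependency did" to "what the system does" -- kept in code, not in
# someone's head. Responses here are examples, not prescriptions.
FAILURE_RESPONSES = {
    429: {"action": "retry_with_backoff", "max_retries": 3, "user_message": None},
    503: {"action": "serve_cached_answer",
          "user_message": "Results may be less precise right now."},
    "timeout": {"action": "fallback_keyword_search",
                "user_message": "Results may be less precise right now."},
}

def handle_provider_failure(status):
    # Unknown failures get a defined state too: degrade, never crash.
    return FAILURE_RESPONSES.get(
        status, {"action": "serve_static_response",
                 "user_message": "Temporarily unavailable."})
```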
3. Observability. You cannot fix what you cannot see. This
means logging latency, token usage, and error rates per
request. It means alerts for cost anomalies, not just
error spikes. It means being able to replay a failed request
from your logs to understand exactly where the pipeline
broke. Observability is not a feature you add later. It is
the foundation you build on first.
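The minimum viable version is one structured record per request. This sketch uses only the standard library; the pricing constant is a placeholder for your provider's real rates:

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("ai_requests")

def log_request(prompt_tokens, completion_tokens, latency_ms, error=None,
                price_per_1k_tokens=0.002):
    """Emit one structured record per request: latency, tokens, cost, error."""
    record = {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "latency_ms": latency_ms,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        # Placeholder price -- substitute your provider's actual rates.
        "cost_usd": round((prompt_tokens + completion_tokens) / 1000
                          * price_per_1k_tokens, 6),
        "error": error,
    }
    logger.info(json.dumps(record))
```

With records like this, alerting on cost anomalies and replaying failed requests stops being a special project and becomes a query over your logs.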
4. Graceful Degradation. When something breaks, the system gets worse; it does not break. That is the difference between "search results are slightly less relevant right now" and "500 Internal Server Error." What does the feature look like without the AI
component? If the answer is "it doesn't exist," you have a
fragility problem. Every AI feature I build has a non-AI
fallback -- even if it is just a static response or a
redirect to a human.
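The shape of that fallback chain, sketched with placeholder callables standing in for your own semantic search, keyword search, and static response:

```python
def answer(query, semantic_search, keyword_search, static_response):
    """Each rung of the ladder is worse than the one above it, but none is a 500."""
    ladder = [
        (semantic_search, None),
        (keyword_search, "Results may be less precise right now."),
        (static_response, "Showing general help content while we recover."),
    ]
    for handler, notice in ladder:
        try:
            return handler(query), notice
        except Exception:
            continue  # fall through to the next rung
    # Last resort: a defined failure state, not a stack trace.
    return None, "We could not process this request. A human will follow up."
```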
5. Cost Awareness. The property most engineers ignore, and the one that kills the most projects. Unit economics are a system property, not just a business concern. If your cost-per-
request doubles at scale, your system has a design flaw. I
track cost the same way I track latency: per request, with
alerts on anomalies, with clear budgets per feature. I have
seen teams build impressive AI features that were quietly
burning $50K+/month because nobody put a cost ceiling on
token consumption. A system without cost awareness is a
system waiting to be shut down by finance.
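A sketch of the cost ceiling I am describing -- a per-feature budget with a hard kill switch and a crude anomaly check. The prices, budgets, and multiplier are placeholders:

```python
class CostBudget:
    """Per-feature spend tracking with a hard ceiling."""

    def __init__(self, monthly_budget_usd, price_per_1k_tokens):
        self.budget = monthly_budget_usd
        self.price = price_per_1k_tokens
        self.spent = 0.0

    def record(self, tokens):
        self.spent += tokens / 1000 * self.price

    def allow_request(self):
        # Kill switch: past the budget, route to the non-AI fallback instead.
        return self.spent < self.budget

    def is_anomaly(self, tokens, rolling_avg_tokens):
        # Flag requests consuming far more than the recent average.
        return tokens > 3 * rolling_avg_tokens
```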
Applying Systems Thinking to AI
Here is the mental shift: your LLM is not your system. It
is one component within your system.
This sounds obvious. It is not. Most AI engineering today
treats the model as the center of gravity. Everything else --
retrieval, caching, fallbacks, monitoring -- is an
afterthought.
Systems thinking inverts this. The model is a component with
known failure modes, just like a transistor in a circuit. And
just like a transistor, it needs supporting infrastructure to
function reliably.
The failure modes of a typical LLM-powered feature:
- API Timeouts. Your provider has an outage or throttles
your requests. Not an edge case. A certainty on a long
enough timeline.
- Hallucinations. The model generates plausible but
incorrect information. Not a bug. A fundamental property
of how language models work.
- Cost Spikes. A prompt change doubles your average token
consumption. A new user pattern triggers unexpectedly long
outputs. Your monthly bill triples.
- Model Deprecation. Your provider sunsets the model
version you depend on. Your carefully tuned prompts no
longer produce the same results.
Each of these is a known failure mode. None of them should
surprise you. And none of them should take your system down.
The systems thinker designs for these failures upfront. Not
out of pessimism, but because the probability of at least
one occurring in production approaches 100% over time. The
question is not "will it fail?" but "have I designed for the
failure?"
Case Study: Thursday Night, 8 PM, Embedding API Down
This story illustrates the difference between feature thinking
and systems thinking better than any abstraction.
I was running a RAG-powered support system. Real users, real
traffic, real expectations. Thursday evening around 8 PM, our
embedding provider started returning intermittent 503 errors.
Response times climbed from 200ms to 2 seconds, then to full
timeouts.
Here is what did not happen: the system did not go down.
Users did not see error pages. Nobody got paged.
Here is what did happen, in sequence:
90 seconds in: Our observability layer flagged the latency
increase. Alerts fired to the on-call channel. But by the
time I saw the alert, the automated response was already
underway.
Circuit breaker tripped after five consecutive failures.
The system stopped calling the failing service, preventing
backed-up requests from overwhelming everything downstream.
Graceful degradation activated. The retrieval layer fell
back from semantic search to a pre-computed keyword index. Was
it as good? No. Keyword search misses nuance. But users got
relevant-enough results instead of a blank screen.
Redundancy layer kicked in. Cached embeddings for the 500
most frequently asked queries served roughly 60% of incoming
traffic with zero quality loss.
Defined failure state communicated. Users saw a subtle
message: "Results may be less precise right now."
Within 30 minutes, we switched to our secondary embedding
provider -- a relationship negotiated specifically for this
scenario. Full semantic search was restored.
Total downtime: zero. Degraded service window: 30
minutes. Customer complaints: none.
None of this was heroic engineering. It was boring,
methodical systems thinking, applied months before the
incident. Every component had a fallback. Every failure mode
had a response. The system worked because we designed it to
work when things broke.
From Prompt Engineering to Systems Engineering
The industry talks a lot about "prompt engineering." That
framing is limiting. Maybe even harmful.
Prompt engineering frames the LLM as the system. Get the
prompt right, and everything works. But a perfect prompt
means nothing if the context window is stuffed with irrelevant
documents. It means nothing if the API is down. It means
nothing if the response costs $0.10 per query and your margin
is $0.02.
The shift I am advocating for: from prompt engineering to
systems engineering. The most important skill for an AI
engineer is not writing a better prompt. It is designing a
better system around that prompt -- retrieval, caching,
guardrails, observability, fallbacks, and cost controls.
When I evaluate an AI system, I do not start by reading the
prompts. I start by asking: "What happens when the model is
unavailable?" The answer to that question tells me more
about the system's maturity than any benchmark ever could.
This is what 8+ years of shipping software at scale taught
me. Not specific knowledge about any one tool or framework,
but a way of seeing. A discipline that assumes components
will fail and designs the system to absorb those failures.
The Pre-Ship Checklist
These are the questions I ask myself, and my teams, before
any AI feature goes to production. Print this out. Tape it to
your monitor. Argue about it in your next architecture review.
Redundancy
- What happens if your primary LLM provider is down for four hours?
- Do you have a cached or static fallback for your most critical user paths?
- Can you switch providers without redeploying?
Defined Failure States
- Can you name every failure mode of every external dependency?
- Does each failure mode have a documented, tested response?
- Have you actually tested those failure responses, or just theorized about them?
Observability
- Are you logging latency, token count, and error rate per request?
- Do you have alerts for cost anomalies, not just errors?
- Can you replay a failed request from your logs to diagnose the root cause?
Graceful Degradation
- If the AI component fails, does the user still get value from the feature?
- Is your degradation path tested in CI, or does it only exist in a design doc?
Cost Awareness
- What is your cost-per-request at 10x your current traffic?
- Do you have a kill switch if costs spike beyond your budget?
- Have you modeled what happens to your unit economics when the provider raises prices by 20%?
Systems are not built by optimists. They are built by
engineers who respect the ways things break and design
accordingly.
If you want to see these principles in action,
talk to my AI. It runs on the exact architecture described in this post --
redundancy, circuit breakers, graceful degradation, cost
awareness, all of it. Or
explore my work for more on how I approach reliability engineering.
Software is fragile. Systems are robust. Build the system.