why i built cruel
february 2026

3am. phone buzzes. production alert. something is wrong, but not with our code - our code is perfect. passed every test, every lint check, every code review. the problem? the ai provider started rate limiting us and our retry logic had a subtle bug that nobody ever caught. because in development, the api never fails.

users see a blank screen. classic.

this happened more than once. different projects, different providers, same pattern. everything works beautifully in development. the streams flow smoothly, the tokens arrive one by one, the json parses clean, the error boundaries sit there looking pretty, completely untested against real failures.

then you deploy and the real world introduces itself.

rate limits hit during your busiest hour. streams cut mid-sentence on your longest responses. structured output comes back with malformed json. context length errors surface only on conversations from your most engaged users - the ones you really don't want to lose. content filters trigger on inputs nobody on the team ever thought to test against.

and it's not any one provider's fault. this is just the nature of building on top of ai apis. every provider - whether it's openai, anthropic, google, mistral, cohere, or anyone else - has their own failure modes, their own error formats, their own rate limit behaviors. they're all doing incredible work pushing the boundaries of what's possible. but distributed systems fail. that's not a bug, it's physics.

the question isn't whether your ai integration will encounter failures in production. the question is whether you've tested what happens when it does.

the duct tape era

for the longest time, my approach to this problem was embarrassingly manual. need to test a rate limit? hardcode a mock response that returns a 429. need to test a stream cut? write a custom readable stream that stops halfway through. need to test a timeout? add a setTimeout that never resolves.

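roughly what that duct tape looked like, reconstructed from memory - the helpers below are illustrative, not code from any real project:

```ts
// a hand-rolled 429 mock and a stream that dies halfway - the kind of
// one-off helpers i kept rewriting. illustrative only, not real project code.
function mockRateLimitedFetch(): Promise<Response> {
  return Promise.resolve(
    new Response(JSON.stringify({ error: { message: "rate limited" } }), {
      status: 429,
      headers: { "retry-after": "20" },
    }),
  );
}

function streamThatCutsHalfway(chunks: string[]): ReadableStream<string> {
  let i = 0;
  return new ReadableStream<string>({
    pull(controller) {
      if (i >= Math.floor(chunks.length / 2)) {
        // simulate the connection dropping mid-response
        controller.error(new Error("connection reset"));
        return;
      }
      controller.enqueue(chunks[i++]);
    },
  });
}
```
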
copy and paste between projects. slightly different each time. never quite matching the real error format. always incomplete. always the thing i'd "get to later" and never actually finish.

and honestly? most of the time i just skipped it entirely. shipped the code, crossed my fingers, and hoped that the error handling i wrote based on reading the docs would actually work when a real failure hit.

spoiler: reading the docs is not the same as testing against real failures.

the retry logic that looks correct in a code review? it doesn't respect the retry-after header. the stream error handler that catches the right error type? it doesn't clean up the partial response in the ui. the circuit breaker pattern you implemented from that blog post? it's never actually been tripped.

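for the record, honoring retry-after isn't much code once you know you need it. a minimal sketch of the pattern, assuming a plain fetch-style call - none of this is cruel's api:

```ts
// retry with backoff that actually honors retry-after - a minimal sketch,
// assuming a fetch-style call. not cruel's api, just the pattern.
async function fetchWithRetry(url: string, maxRetries = 3): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(url);
    if (res.status !== 429 || attempt >= maxRetries) return res;

    // prefer the server's retry-after hint; fall back to exponential backoff
    const retryAfter = Number(res.headers.get("retry-after"));
    const delayMs =
      Number.isFinite(retryAfter) && retryAfter > 0
        ? retryAfter * 1000
        : 2 ** attempt * 1000;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
}
```
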
you don't know if your parachute works until you jump. and we were all jumping without ever testing the chute.

the idea

i work on the ai sdk team at vercel. it's an incredible team - lars, nico, and everyone else shipping tools that millions of developers use every day. being part of this team means i get to see how ai integrations work across the entire ecosystem. every provider, every framework, every edge case.

and i kept seeing the same pattern: developers build amazing ai features, test them against the happy path, ship to production, and then discover their error handling has gaps when real-world chaos hits.

the ai sdk already does incredible work abstracting away provider differences. unified api, streaming support, structured output, tool calling - all the hard parts handled cleanly. but the one thing no sdk can do for you is test your app's resilience against failures that only happen in production.

that's when it clicked. what if there was a library that could simulate every failure mode you'd encounter in production? not mocking the entire api - just wrapping your existing code with configurable chaos. tell it "fail 10% of the time" or "add random latency" or "cut the stream halfway through" and let your error handling prove itself.

not a mock. not a test framework. just chaos. realistic, configurable, provider-accurate chaos that works with anything async.

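concretely, the shape i had in mind looked something like this. a sketch only - the import path and option names (`failureRate`, `minLatencyMs`, and friends) are invented here, not cruel's documented api:

```ts
// a sketch of the shape i had in mind, not cruel's documented api -
// the import path and option names here are illustrative.
import { cruel } from "cruel";

// stand-in for whatever async call your app already makes
async function callModel(prompt: string): Promise<string> {
  const res = await fetch("https://api.example.com/v1/generate", {
    method: "POST",
    body: JSON.stringify({ prompt }),
  });
  return res.text();
}

const chaoticCallModel = cruel(callModel, {
  failureRate: 0.1,  // fail 10% of the time
  minLatencyMs: 200, // add random latency between these bounds
  maxLatencyMs: 3_000,
});

// same call site as before - only now it occasionally hurts
const reply = await chaoticCallModel("summarize this document");
```
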
building it

the core took a weekend. wrap any async function, inject failures at a configurable rate, add random latency between two bounds, occasionally just... never resolve. the fundamentals of chaos in about 200 lines of typescript.

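condensed far below 200 lines, the core idea is roughly this - a sketch of the technique, not cruel's actual implementation:

```ts
// the core idea, condensed - a sketch, not cruel's actual implementation.
type ChaosOptions = {
  failureRate?: number; // 0..1 chance of throwing
  hangRate?: number;    // 0..1 chance of never resolving
  minLatencyMs?: number;
  maxLatencyMs?: number;
};

function withChaos<A extends unknown[], R>(
  fn: (...args: A) => Promise<R>,
  opts: ChaosOptions = {},
): (...args: A) => Promise<R> {
  const { failureRate = 0, hangRate = 0, minLatencyMs = 0, maxLatencyMs = 0 } = opts;
  return async (...args: A) => {
    // occasionally just... never resolve
    if (Math.random() < hangRate) return new Promise<R>(() => {});
    // random latency between the two bounds
    const delay = minLatencyMs + Math.random() * (maxLatencyMs - minLatencyMs);
    if (delay > 0) await new Promise((r) => setTimeout(r, delay));
    // inject failures at the configured rate
    if (Math.random() < failureRate) throw new Error("chaos: injected failure");
    return fn(...args);
  };
}
```
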
then came the network simulation layer. packet loss, dns failures, disconnects, slow connections. then http chaos - status codes, rate limits, server errors. then stream manipulation - cuts, pauses, corruption, truncation. each layer building on the core but targeting specific failure domains.

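stream manipulation is the easiest layer to picture. a minimal sketch of a stream cut, again with invented names rather than cruel's real api:

```ts
// cut a stream after a fixed number of chunks - a minimal sketch of the
// stream-manipulation layer, with invented names.
function cutAfter<T>(chunks: number): TransformStream<T, T> {
  let seen = 0;
  return new TransformStream<T, T>({
    transform(chunk, controller) {
      if (seen++ >= chunks) {
        // surface the same abrupt failure a dropped connection would
        controller.error(new Error("chaos: stream cut mid-response"));
        return;
      }
      controller.enqueue(chunk);
    },
  });
}

// usage: pipe any readable stream through it
// const cruelStream = upstream.pipeThrough(cutAfter(12));
```
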
the ai sdk integration is where it got really interesting. i didn't want generic failures - i wanted failures that match real provider behavior exactly. when cruel simulates a rate limit, it returns the correct status code with a realistic retry-after header. when it simulates an overloaded error, the error object has the right shape, the right properties, the right behavior that the ai sdk's retry system expects.

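the difference between a generic failure and a provider-accurate one is small, but it's exactly what retry code keys off. a rough sketch of a simulated rate limit at the http level - the json body shape is approximate, not any specific provider's exact format:

```ts
// a simulated rate limit that looks like the real thing at the http level -
// status code plus retry-after header. the json body shape is approximate,
// not any specific provider's exact format.
function simulatedRateLimit(retryAfterSeconds = 30): Response {
  return new Response(
    JSON.stringify({
      error: { type: "rate_limit_error", message: "rate limit exceeded" },
    }),
    {
      status: 429,
      headers: {
        "content-type": "application/json",
        "retry-after": String(retryAfterSeconds),
      },
    },
  );
}
```
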
this means your error handling code sees exactly what it would see from a real provider failure. no surprises in production because you already tested against the real thing - or at least something indistinguishable from it.

then resilience patterns. circuit breaker, retry with backoff, bulkhead isolation, timeout wrappers, fallbacks. not because cruel is trying to be a resilience library - there are great ones already - but because when you're chaos testing, you want to verify these patterns actually work under pressure.

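that pairing is the point: chaos on one side, the pattern you're supposed to trust on the other, and a test that proves the pattern actually engages. a vitest sketch using the withChaos helper from the core sketch above - here the pattern under test is the humble fallback:

```ts
// verifying a pattern under pressure - a vitest sketch that assumes the
// withChaos helper from the core sketch above is in scope.
import { expect, test } from "vitest";

test("fallback answers when the primary call keeps failing", async () => {
  const primary = withChaos(async () => "live answer", { failureRate: 1 });
  const askWithFallback = async () => {
    try {
      return await primary();
    } catch {
      return "cached answer"; // the fallback path we claim to have
    }
  };

  // with a 100% failure rate, the fallback must carry every request
  await expect(askWithFallback()).resolves.toBe("cached answer");
});
```
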
zero dependencies. i was obsessive about this one. no runtime deps means no supply chain risk, no version conflicts, no transitive dependency nightmares. just typescript and your code. install it, import it, use it. nothing else comes along for the ride.

the name

i thought about this for longer than i'd like to admit. tested a bunch of names. "chaos-inject" felt corporate. "fault-line" felt geological. "havoc" was taken.

then i just thought about what chaos testing should feel like. it should be uncomfortable. it should break things you thought were solid. it should find the bugs you didn't know existed. it should be relentless and thorough and completely without sympathy for your assumptions.

it should be cruel.

that's the whole philosophy in one word. if your tests are gentle, your production failures will be brutal. better to find out now - in development, with a stack trace and a debugger and a cup of coffee - than at 3am from a production alert while your users watch a loading spinner that never stops.

what's next

cruel is open source and i'm building it in public. the core is stable, the ai sdk integration works, the resilience patterns are solid. but there's so much more to do.

better test matchers. more realistic failure scenarios. deeper integration with vitest and jest. maybe a visual dashboard that shows you exactly how your app behaves under different chaos profiles. provider-specific failure libraries that evolve as the apis evolve.

if you're building ai apps and you've ever been bitten by a production failure you didn't test for, give cruel a try. break things on purpose. find the bugs before your users do.

and if you find a failure mode i haven't thought of yet, open an issue. the cruelest ideas come from real production pain.

zero mercy.