I wonder if they could've designed better circuit breakers for situations like this. They're very common in electrical engineering, but I don't think they're as common in software design. It's something we should actually try to design in for situations like this.
They’re a fairly common design pattern: https://en.m.wikipedia.org/wiki/Circuit_breaker_design_patte.... However, they certainly aren’t implemented as often as they should be at service-level boundaries, which results in these sorts of cascading failures.
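For anyone who hasn't seen the pattern, here is a minimal sketch of what a breaker at a service boundary might look like. This is just an illustration, not any particular library's API: the names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, and `reset_timeout` are made up for the example, and it assumes single-threaded use (a real one would need locking).

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected untried."""


class CircuitBreaker:
    # Hypothetical, minimal breaker: closed -> open -> half-open -> closed.
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before tripping
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means "closed"

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of piling load on a sick dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            half_open = self.opened_at is not None
            if half_open or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()       # trip (or re-trip)
            raise
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The point is the fail-fast branch: once the dependency looks unhealthy, callers stop sending it traffic for a while instead of retrying into it, which is what keeps one failing service from dragging down everything upstream.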
Netflix was talking a lot about circuit breakers a few years ago, and had the Hystrix project. Looks like Hystrix is discontinued, so I'm not sure if there are good library solutions that are easy to adopt. Overall I don't see it getting talked about that frequently, beyond just exponential backoff inside a retry loop.
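For comparison, the "exponential backoff inside a retry loop" baseline usually looks something like the sketch below. The function name `retry_with_backoff` and its parameters are hypothetical, and the random jitter is an assumption on my part (a common addition so that clients don't all retry in lockstep), not something from the incident report.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep, rng=random.random):
    """Call func(), retrying on any exception with capped exponential backoff.

    sleep and rng are injectable so the behavior can be tested without
    actually waiting.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The cap doubles each attempt, up to max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # "Full jitter": sleep a random fraction of the cap.
            sleep(rng() * cap)
```

Note that this only spaces out one client's retries. Unlike a circuit breaker, it keeps no shared state about the health of the dependency, so every new request still starts from attempt zero.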
One of the big issues mentioned was that one of the circuit breakers they did have (client backoff) didn't function properly. So they did have a circuit breaker in the design, but it was broken.