Optimally Imperfect Engineering

Google’s SRE book is a well-grounded, lucid discussion of running modern software services. A highlight is the discussion of error budgets, which demonstrates why, in order to make good engineering decisions, it’s vital that we quantify the level of imperfection we are willing to accept.

Being explicit about what is “good enough” allows our teams to focus on delivering what is really impactful.

This idea is common knowledge. We often hear “perfect is the enemy of shipped”, and we see examples of the idea in Minimum Viable Products. Often, however, the full implications aren’t taken into account, and this causes real problems when we try to build the best software possible within our real-world constraints.

The problem isn’t that we don’t understand that these decisions are trade-offs. Instead, we often don’t define how imperfect we’re happy for our software to be, and that means our focus ends up in the wrong places.

Value Axes – Imperfection in every direction

We all know our software can’t be perfect. Of course we want it to be as good as possible. Ask a stakeholder what % of the time a service should be available and, when the question is presented in isolation, the answer may be “as close to 100% as possible”. But improving availability takes up time we could use elsewhere. Is this really how we deliver the most value to our customers?
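To see what an availability answer implies in practice, it can help to turn the percentage into permitted downtime. The snippet below is a minimal sketch of that arithmetic; the function name, the 30-day window and the three example targets are assumptions chosen for illustration, not figures from any particular service.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted by an availability target over a period.

    The 30-day default window is an illustrative assumption.
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_pct) / 100


# Hypothetical targets, used only to show how quickly the budget shrinks.
for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% available -> {minutes:.2f} minutes of downtime per 30 days")
```

Under these assumptions, each extra "nine" cuts the permitted downtime by a factor of ten: roughly 432 minutes at 99%, 43.2 minutes at 99.9% and 4.32 minutes at 99.99%. The engineering effort needed to stay inside the smaller budget is the time we could have used elsewhere.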

What does “good” even mean for our software? Any reasonable definition needs to spread its evaluation very wide. Our service delivers value by having features our customers get value from, by having those features available for the customer to use, by having low enough latency to deliver a good experience, and the list goes on.

I’ve never met a team who can’t imagine work which would improve their services along one of these axes. In fact, most of these axes extend indefinitely towards ultimately unreachable goals, e.g. 100% availability or 0ms response times. A “perfect” service would have all the features a user could possibly want, incredible performance, reliability, cost-efficiency and so on. This perfect service can’t be reached, and in practice probably should never be reached. Long before we come close to this perfect service, we should be focusing our effort somewhere else which delivers more value for each unit of our time. Importantly, this point of diminishing returns often arrives sooner than we think, at least initially.

We can imagine the quality axes for our service as a spider graph, with each axis stretching out to infinity and the scale representing the comparative value delivered.

Towards Optimal Imperfection

The fact that a perfect service is unattainable means we are, as always, in the messy world of trade-offs. Improving any possible attribute adds real value, but we have to decide which of these parameters we will push forward, and which we will leave. The opportunity cost (what we could be doing instead of the chosen work) is often higher than we intuitively believe.

Accepting that the perfect service doesn’t exist frees us from a mental constraint. No matter how long we work on this service, the wider software stack, or any part of the software, it will always be imperfect. We will have to accept some level of unavailability, features we wish the service had, outlier latency spikes and a million other imperfections. We are not closing in on perfection, and should never imagine we are. In a world where we’re seeking perfection, any work which extends our progress along any of the value axes would make sense. If instead we look at each and every value axis as an unending project of reducing imperfection, it frees us to evaluate when to stop moving along that axis.

When we consider a possible task to add to the backlog, the fact that it would improve our solution on one of our axes of value isn’t a good enough reason to do that work.

To make the example concrete, we may be able to reduce the latency between clicking play and a video starting to play. Will lowering this improve the customer experience? Absolutely! But this work would take up precious time we could spend improving the % of video streams which go uninterrupted. Which is more important is a difficult and business-specific question, but a framework for exposing and discussing these competing objectives is necessary to make the right choice. If our click-to-play latency is good enough, we shouldn’t improve it even if we can, because we’re implicitly choosing not to do other work with potentially more benefit.

We can pick nearly any axis and spend eternity pushing it towards a perfection we’ll never reach. At some point along that journey we should have stopped. Often, that point comes sooner than we intuitively think, and sometimes we miss it because the trade-offs aren’t made explicit when making the decision, or even when designing the initial solution.

The question we should really be asking is intuitive and well understood, but often answered poorly in practice: “Which axis has the lowest cost to move, per unit of benefit?”
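One way to make that question concrete is to write the estimates down and rank the candidate work by cost per unit of benefit. The sketch below assumes a team can attach a rough cost (in days) and a rough benefit score to each item; the Candidate class, the rank_by_cost_per_benefit helper and every number in it are hypothetical, invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A piece of possible work on one value axis (all fields are estimates)."""
    name: str
    axis: str
    cost_days: float  # estimated effort
    benefit: float    # estimated value, in whatever unit the team agrees on

    @property
    def cost_per_benefit(self) -> float:
        return self.cost_days / self.benefit


def rank_by_cost_per_benefit(candidates):
    """Cheapest cost per unit of benefit first; items with no benefit are dropped."""
    worthwhile = [c for c in candidates if c.benefit > 0]
    return sorted(worthwhile, key=lambda c: c.cost_per_benefit)


# Hypothetical backlog; the numbers are made up for the example.
backlog = [
    Candidate("Reduce click-to-play latency", "latency", cost_days=15, benefit=2),
    Candidate("Reduce interrupted streams", "reliability", cost_days=20, benefit=8),
    Candidate("New sign-in page", "features", cost_days=5, benefit=1),
]

for item in rank_by_cost_per_benefit(backlog):
    print(f"{item.cost_per_benefit:.1f} days per unit of benefit: {item.name}")
```

The ranking is only as good as the estimates behind it, and a single benefit score hides a lot of nuance. Its value is that it forces the comparison across axes to be written down and discussed.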

Designing Imperfect Software

Agreeing on the ways in which a service can be imperfect is essential to designing the right system initially and to deciding where to spend time in the ongoing maintenance and development of any system. We should make these explicit in our strategy, for example “We expect availability being over 99.9% will be a key driver of success”. We should also state what won’t drive success and where we can afford to be flexible, for example “We believe that, if time from click to starting the video playing is <3 seconds, further improvements add minimal user benefit”. A team armed with this type of knowledge is empowered to build an appropriate solution.
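Statements like these can also be recorded in a form a team can check measurements against. The following is a minimal sketch, assuming the two example thresholds quoted above (availability over 99.9%, click-to-play under 3 seconds); the Target class, the axis names and the measured values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """A 'good enough' threshold for one value axis."""
    axis: str
    good_enough: float
    higher_is_better: bool

    def is_met(self, measured: float) -> bool:
        if self.higher_is_better:
            return measured > self.good_enough
        return measured < self.good_enough


# Thresholds taken from the two example statements in the text.
TARGETS = [
    Target("availability_pct", 99.9, higher_is_better=True),
    Target("click_to_play_seconds", 3.0, higher_is_better=False),
]


def axes_needing_work(measurements: dict[str, float]) -> list[str]:
    """Axes whose measured value has not yet reached 'good enough'."""
    return [t.axis for t in TARGETS if not t.is_met(measurements[t.axis])]


# Hypothetical measurements for illustration only.
measured = {"availability_pct": 99.5, "click_to_play_seconds": 2.1}
print(axes_needing_work(measured))
```

With these made-up measurements only availability falls short, so further latency work would be a candidate to defer. Keeping the thresholds in one place makes it easier to revisit them as requirements change.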

Being clear that an availability goal has more value to our customers than a new sign-in page allows the work to be prioritized correctly. Being explicit about the value delivered by both functional and non-functional requirements, and about where we expect to stop seeing returns as we move along each axis, is core to good engineering decision making.

Have this discussion early. The interplay of values is complex and not easy to capture, but the discussions can help unearth important trade-offs which the teams may not be thinking about. Revisit these trade-offs to ensure your values stay aligned with changing requirements and with your knowledge of how your software delivers value, and to ensure your stakeholders, customers and teams stay aligned on the ways in which we accept our software will be imperfect.

All our decisions have opportunity costs: things we could do if we weren’t doing this. Always be on the lookout for times you’re improving some dimension of quality past the point where it gives the best return on your investment.