
March 28, 2019

Blame prevention through devops

This post is a follow-up to my previous post, The question that takes away all blame.
Blameless post-mortems, or blameless RCAs, are supposed to be the new normal in devops organisations, but all too often we see that first the team, and sometimes the person, to blame is sought, and only then do we tell them to 'fix it'.
You might’ve noticed that I wrote devops in all lower case in this post's title. I did that on purpose.

Devops sans capital 'D'


You typically see DevOps rather than devops, but I think that DevOps implies that it's about Development and Operations engineers working together. In my opinion, devops is about combining the responsibility of the development team and the responsibility of the operations team, turning them into the responsibility of a single team. That would make devops a matter of responsibility and less of an organisational concern.

The organisational aspect would then take the form of 'Product Teams', each responsible for a product, with a Product Owner who is accountable for that product.
But that's something for another day. This article is about blameless post-mortems and root cause analysis, viewed through devops-tinted glasses.
Blameless post-mortems come more naturally in environments of shared responsibility. Environments where the same people are responsible for both the quality of the product and its usage. Environments where devops is the combined responsibility of development and operations within a single team.

Silos vs Accountability

I have observed in a number of organisations that one of the main reasons for considering the move towards devops is the concept of shared responsibility, paired with the idea that silos prevent this sharing of responsibility. That is a misconception, though. Silos don't prevent shared responsibility, although culturally they'll probably inhibit it. The real problem is the lack of accountability in a siloed organisation. Or quite the opposite: too many people are accountable for different, conflicting objectives.

In a siloed organisation, each silo is primarily responsible, and even accountable, for its own output: its immediate contribution to the process of product delivery, but not the full delivery process itself. This means that when the outcome of the process is of the unwanted kind (it caused an incident), either one of the silos' outputs caused the problem (who is to blame?), or, when there is no single silo to blame, nobody can be held accountable.

This can go as far as a sales team being responsible for selling a product, where a signed contract is considered a success. The development team is responsible for changing the product, and the release of the change into production is considered a success. The operations team is responsible for 'running' the product, and it is considered successful when there are no incidents. That is the situation for SaaS vendors. For more traditional software companies, i.e. those that require implementation of a product at the customer site, the operations team is part of the customer's organisation, or at least the operations accountability typically lies with the customer. It gets more complicated, because there will likely also be an implementation team that is successful when the product is implemented according to the contract that was sold.
Success of the product is defined as selling/changing/operating/implementing it. With different people accountable for each of these successes, conflicts are imminent. So when a problem happens anywhere in the delivery, each of the accountable persons will explain that they're not to blame, because they were successful. And actually, sales and development were successful; operations and implementation were handed something that prevented them from being successful. The contract was signed on the assumption that missing features would be available by the time the implementation project reached completion. Future releases are feature complete and the functionality is fully tested, but the product is unmanageable, not performant and definitely not secure, and it can't be implemented because integrations with other systems aren't available until the end of the project.

Siloed organisations are structured around tasks, competencies and expertise. By centralising capabilities, these can be shared across products. Siloed organisations are built on shared service centres. The reason behind these structures is cost reduction through utilisation optimisation. I'm not a fan; see Perish or Survive, or being Efficient vs being Effective.

Output vs Outcome

It’s the difference between output and outcome that often drives ‘blaming’ in a post-mortem.

In siloed organisations, each silo's focus is on output: its own output. The silos are in many cases the result of centralising the responsibility for specific aspects of the delivery process, with a lack of accountability for the full process. Specialists are responsible for doing their 'thing' as efficiently as possible.

Often responsibility is mistaken for accountability, so these task-optimised teams, teams of experts, are held accountable for what they deliver, which is not the outcome of the process but the output of their effort. Because they are the experts, they perform their task for different products, i.e. they participate in various delivery processes, and they are held accountable for the total number of tasks they complete.

The problem, therefore, is that the silos operate truly independently of each other. Each silo services several product delivery processes, because servicing only one (product delivery) process would mean that the silo would be idle a significant amount of the time. Since the silos are there to optimise resource utilisation, idle time is undesired. Idle time is considered wasted time by many non-Lean'ers; in Lean, wasted time is time spent on something that is not immediately needed for the delivery of a product.

Every silo will work hard to meet its numbers. Meet its targets. And when the target is the number of tasks performed instead of the number of products delivered, we're doing a lot and contributing nothing.

Within the context of blameless post-mortems, a context where blaming should be prevented, we need to make sure that responsibility is shared for the outcome of the (product delivery) process, and that accountability is set up to manage that outcome. This means that development and operational responsibilities are both defined as contributing to the outcome of the process. Something devops shines at.



Thanks once again for reading my blog. Please don't be reluctant to Tweet about it, put a link on Facebook or recommend this blog to your network on LinkedIn. Heck, send the link of my blog to all your Whatsapp friends and everybody in your contact-list. But if you really want to show your appreciation, drop a comment with your opinion on the topic, your experiences or anything else that is relevant.

Arc-E-Tect


The text very explicitly communicates my own personal views, experiences and practices. Any similarities with the views, experiences and practices of any of my previous or current clients, customers or employers are strictly coincidental. This post is therefore my own, and I am the sole author of it and am the sole copyright holder of it.

March 21, 2019

The question that takes away all blame


Blameless post-mortems, or blameless RCAs, are supposed to be the new normal in devops organisations, but all too often we see that first the team, and sometimes the person, to blame is sought, and only then do we tell them to 'fix it'.
It’s for a large part a remnant of the siloed organisation and the culture that stems from it. And it is a matter of asking the wrong questions. This is, I would say, about 80% of the reason why we are unable to prevent incidents from recurring.

The cool part is in the ‘asking the wrong questions’ thing.
Now, before going into it, let me emphasise that post-mortems and RCAs are about preventing incidents from occurring again. You want to know the root cause because you want to prevent anything like the incident from ever happening again. Stop reading if you disagree.

‘What?’ is wrong

All too often, when we are dealing with the aftermath of an incident, we wonder ‘what caused this incident?’. Which is a valid question, but not one that is very valuable. The issue I have with this approach is that when we have an answer to this question, we think we found the root cause of the incident. Which we haven’t.

The answer to the question ‘What caused the incident?’ is typically something technical. There was not enough memory in the server. There was a bandwidth problem. There was a bug in the software that resulted in the incident.

There is something very satisfying in the answer to the question 'What caused the incident?': the gratification you get from knowing where the weakness in the product is. It's in the software, in the hardware configuration, in the network infrastructure. Because when we know where the weakness was, we know who is responsible for the weak component. It's the development team, the automation team, the network team. And when we know who is responsible for that component, we know who to blame and who to tell to go fix it.

In almost all of the RCAs I've been involved in, putting the blame on somebody was not about punishing that person; it was about identifying who should fix the problem.

The problem is in 'responsibility', because the person being held responsible is not necessarily the person that is accountable for the incident. Often, especially in a siloed organisation, there is no one accountable at all.

Although it is important to understand what went wrong and what caused the impact, we need to realise that this is not the same as understanding what caused the incident. We're not at the root cause just yet. But we want to be, because this investigation is painful. Colleagues are to blame, and the responsible persons must be brought to justice. They must be told that we can never, ever feel that impact again. And so we make sure that next time the impact will be bearable. We increase the memory in the server, increase the bandwidth in our network, fix the bug in our software. All holes are plugged. Ready to go.

If only we had addressed the root cause, it would all be hunky-dory.

‘Why?’ is right

The question that should be asked is not so much about what caused the incident, it’s about why the incident could occur in the first place.

That’s a tough question to answer. Why was there a bug in the software? Why was there not enough bandwidth? Why was there not enough memory in the server?

And that’s only the first ‘Y’.

A very common, tried-and-tested, effective way of identifying the 'real' root cause of an incident is the 5-Y method, better known as the five whys. In this approach you ask, up to five times, 'Why could the previous answer happen?'. Experience has taught us that going five levels deep will get you to the root cause of the problem; sometimes fewer levels are needed, and hardly ever more than five.

Let’s assume that the incident was due to insufficient bandwidth and let’s start asking ‘Why?’

  1. Why was there not enough bandwidth? Too many customers accessed the newly released API.
  2. Why did too many customers access the new API? Because we announced it prematurely in our global newsletter.
  3. Why was it announced in our global newsletter? Because the marketing manager wasn’t aware that the API was to be released following the ‘soft-launch protocol’.
  4. Why was the marketing manager not aware of the fact that the release was to follow the soft-launch protocol? Because she was not in the meeting in which it was decided to follow the soft-launch protocol.
  5. Why wasn’t she in the meeting in which it was decided that the API was going to follow the soft-launch protocol? Because she was on vacation and didn’t appoint a delegate.

Now we know why the incident could occur. Not what caused it, but why it could be caused. Making sure that the marketing manager or a delegate attends the meetings in which product launch strategies are decided will prevent this incident from occurring in the future.
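
For those who like to see the structure of such a why-chain spelled out, here is a minimal sketch in Python. It is purely illustrative: the incident, the questions and the answers are simply the example above, and nothing here pretends to be real post-mortem tooling.

    # A hypothetical sketch of recording a 5-Y (five whys) chain during a
    # post-mortem. The data is the bandwidth example from this post.
    from dataclasses import dataclass
    from typing import List


    @dataclass
    class Why:
        question: str
        answer: str


    def walk_the_whys(incident: str, chain: List[Why]) -> str:
        """Print the why-chain and return the last answer as the candidate root cause."""
        print(f"Incident: {incident}")
        for depth, why in enumerate(chain, start=1):
            print(f"  {depth}. {why.question}")
            print(f"     -> {why.answer}")
        return chain[-1].answer


    root_cause = walk_the_whys(
        "API unavailable due to insufficient bandwidth",
        [
            Why("Why was there not enough bandwidth?",
                "Too many customers accessed the newly released API."),
            Why("Why did too many customers access the new API?",
                "It was announced prematurely in the global newsletter."),
            Why("Why was it announced in the newsletter?",
                "The marketing manager wasn't aware of the soft-launch protocol."),
            Why("Why wasn't she aware of the soft launch?",
                "She was not in the meeting where it was decided."),
            Why("Why wasn't she in that meeting?",
                "She was on vacation and didn't appoint a delegate."),
        ],
    )
    print(f"Candidate root cause: {root_cause}")

The point is not the code, of course, but that the root cause only emerges at the end of the chain, well outside the technical domain where the 'What?' question would have stopped.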

Of course, the above is only an example, but it shows that by asking 'What?' the solution would be a costly technical one, whereas by asking 'Why?' the solution is better meeting attendance.

Another important conclusion you might've drawn is that asking 'What?' only involves technical people, further leading the solution down the costly technical path, whereas the 'Why?' question requires all parties involved in the delivery of the product (the API) to attend the post-mortem. Getting to the bottom of the incident's cause requires a multi-disciplinary team, just like delivering a product requires many disciplines.

It makes no sense to think that creating a success requires many disciplines, but that preventing the failure it turned into requires only a single one. There is no difference between delivering something that works and something that doesn't. Not from a product delivery perspective.

Product Owner or Problem Owner?

The question is of course: who would go through all this trouble and assemble all these people who are involved in delivering a product into the hands of our customers? It's the one who is held accountable for the incident. More importantly, it's the person who is held accountable for making sure the incident doesn't occur again.

The Product Owner would be my preferred role: ownership of the product implies ownership of the success of the product and all the challenges that come with it.

There is a follow-up story that you can find here.


Thanks once again for reading my blog. Please don't be reluctant to Tweet about it, put a link on Facebook or recommend this blog to your network on LinkedIn. Heck, send the link of my blog to all your Whatsapp friends and everybody in your contact-list. But if you really want to show your appreciation, drop a comment with your opinion on the topic, your experiences or anything else that is relevant.

Arc-E-Tect


The text very explicitly communicates my own personal views, experiences and practices. Any similarities with the views, experiences and practices of any of my previous or current clients, customers or employers are strictly coincidental. This post is therefore my own, and I am the sole author of it and am the sole copyright holder of it.

January 2, 2019

Cloud Native Enterprises - Rapid elasticity

Which don't have a lot to do with Cloud Native Apps, but everything with truly embracing the paradigm shifts the Cloud has brought to IT within the realm of business.

Read the Introduction first.

After reading the introduction to these posts you know what a cloud infrastructure is and what cloud native applications are, but what about cloud native enterprises? Well, these are enterprises that adhere to those same five essential characteristics. These enterprises, or organisations in general, cannot be modelled according to traditional enterprise models because of their specific market, competition, growth stage, etc. These enterprises need to be, by all accounts, cloud native in order to grow, succeed and be sustainable. Interestingly, but not surprisingly, they require The Cloud and Cloud Native Applications.
In coming posts I will address every essential characteristic of The Cloud, as defined by NIST, from the perspective of the Enterprise. Unlike most cases, I will post these within the next seven days, and I certainly hope before the coming weekend.
  • On-demand self-service. When online services and core systems really seamlessly integrate.
  • Broad network access. When customers, partners and users are distinct groups treated equal.
  • Resource pooling. When synergy across value chains makes the difference.
  • Measured services. When business resources are limited.
  • Rapid elasticity. When business is extremely unpredictable.

Very few organisations can say that they experience the same amount of business year round. Almost every organisation experiences something like a 'season'. There are always highs and lows in an organisation's load. This can be 'Black Friday/Cyber Monday' for retailers, tax season for the IRS or its equivalent in your country, or the summer holidays for hotels.
And the highs and lows, often very predictable, are not only in sales; production companies see peaks in order fulfilment as well.

In order to deal with these fluctuations in 'business load', organisations need to be elastic. The more elastic an organisation is, the better it will be able to handle the fluctuations. Mind, this is not the same as being agile. Being agile is being able to change direction as needed, in a timely manner. Being elastic is being able to change scale as needed. Also in a timely manner. Arguably, being elastic is more difficult to accomplish than being agile.

Elasticity is a matter of scaling up, and down. The more elastic an organisation can be, the more efficiently it can do its business. Efficiency here is a matter of spending just enough. This is another difference between elasticity and agility: the former is about being efficient in using resources, the latter about being effective in resource utilisation.

From the Cloud we know that elasticity means scaling the IT infrastructure up or down in order to manage fluctuating load. You no longer have to worry about under-utilisation of IT infrastructure, and thus about paying for infrastructure that is not being used. And at the same time, you don't have to worry when load increases, because the stability of the environment isn't compromised by over-utilisation of that same infrastructure.
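
To make that concrete on the IT side, here is a minimal sketch of the decision an elastic infrastructure keeps making. It is purely illustrative: the thresholds and the capacity model are made up, and a real setup would lean on a cloud provider's autoscaling service rather than hand-rolled logic like this.

    # A hypothetical sketch of an elastic scaling decision.
    # Thresholds and capacity steps are invented for illustration only;
    # real environments would use a cloud provider's autoscaling service.

    def desired_capacity(current: int, load_per_unit: float,
                         scale_up_at: float = 0.8, scale_down_at: float = 0.3,
                         minimum: int = 1) -> int:
        """Return the capacity we'd like, given the load per unit of capacity."""
        if load_per_unit > scale_up_at:
            return current + 1        # add a unit before we're saturated
        if load_per_unit < scale_down_at and current > minimum:
            return current - 1        # release a unit we'd otherwise pay for idly
        return current                # load sits in the comfortable band


    # Example: a 'season' ramping up and back down again.
    capacity = 2
    for load in [0.4, 0.7, 0.9, 0.95, 0.6, 0.2, 0.1]:
        capacity = desired_capacity(capacity, load)
        print(f"load per unit: {load:.2f} -> capacity: {capacity}")

The point of this post is that a Cloud Native Enterprise applies the same up-and-down reasoning to the business itself, not just to servers.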

In the Cloud Native Enterprise, we project these traits onto the organisation itself. Onto the business.

Traditionally, scaling an organisation is hard to accomplish. Extremely hard, in fact. For one, there is always the challenge of resource availability, where the resources are of course human resources. The most common way of addressing this is through contractors. The hiring and firing processes around contractors are far more flexible than those for permanent employees, thus bringing some elasticity to the organisation. But getting the 'bodies in' is only part of the challenge. Getting the 'right bodies in' is another aspect. Adding personnel with the right skills is possibly even harder: first you need to find them, and next you need to validate that you found them. The hiring process is cumbersome, and the further you need to stretch the organisational elastic band, the harder it becomes. The amount of effort needed doesn't increase linearly, but almost exponentially.
Organisations need to be able to scale up quickly, which often means that the HR department needs to be elastic as well. Again, the same challenges apply.
In an upbeat economy, there are more companies looking for the same scalability, and resources become even more of a challenge to find.
And I haven't even mentioned the strain on the organisation itself from growing too large. More management layers need to be introduced in order to grow and to manage that growth, and this is where scaling down becomes a problem. Adding management layers to the hierarchy to handle the size of the workforce is relatively easy compared to removing those layers again.

The more seasonal a market is, the more of a challenge elasticity becomes for an organisation. Tax-offices will be overloaded with calls to the help-desk as the deadline for submitting your tax-forms nears.

At one point I was involved as one of the expert witnesses in a court case where my client had succumbed to its own success. The client was in a business where 98% (!!!) of its revenue was generated in 5 consecutive days of the year. It couldn't handle the increased business, as it couldn't predict what that load would be, especially since in the months prior to their 'peak' they had taken over their largest competitor.

Organisations can't scale their workforce to the point where we can speak of elasticity. Scaling the workforce up and down is a matter of weeks, whereas elasticity is the ability to scale up and down in a matter of days, hours or even minutes or less.
This is where the Cloud comes in. It is not so much the elasticity of Cloud infrastructure that plays a role, but the ability of an organisation to utilise the Cloud such that it can scale its business efficiently. Cloud infrastructure is an important part when it comes to handling IT load, but true business elasticity comes from handling the fluctuating load at a business level.
Extreme automation of business processes allows an organisation to reduce its dependence on the specific business knowledge of its employees. Anybody can press a button or enter data when the 'system' does the heavy lifting of applying business rules and executing repetitive tasks. Understand that by following this paradigm, the organisation's reliance on less-educated employees is reduced to a bare minimum, and it can focus its efforts on attracting higher-educated employees: employees that require less managerial guidance, which allows for a flatter hierarchy. Flat hierarchies are more elastic.
Automated processes can, in addition, benefit from the technological elasticity of the Cloud, which makes this a double-edged sword.

Elastic organisations are not trivial. Far from it, in fact. The main reason is that traditionally, organisations scale with their business either by increasing their workforce through short-term contracts, i.e. contractors, who can be hired and fired as needed, or by attracting permanent employees and assuming a positive attitude towards the future, or by laying off personnel in advance when the future is perceived to be less positive. That is how organisations are used to managing the seasons.

The Cloud Native Enterprise, on the other hand, is far from traditional. At its core it embraces technology resources not to replace human resources, but to complement them. It invests in senior, higher-educated experts to reduce the reliance on management. In effect, it pays more to fewer people, in order to reduce the need to scale the workforce.

Concluding

And so the Cloud Native Enterprise is the enterprise where IT is part of the business when it comes to delivering business products, allowing the business to grow and shrink as needed, just ahead of time and with the flexibility of a rubber band.


Thanks once again for reading my blog. Please don't be reluctant to Tweet about it, put a link on Facebook or recommend this blog to your network on LinkedIn. Heck, send the link of my blog to all your Whatsapp friends and everybody in your contact-list. But if you really want to show your appreciation, drop a comment with your opinion on the topic, your experiences or anything else that is relevant.

Arc-E-Tect


The text very explicitly communicates my own personal views, experiences and practices. Any similarities with the views, experiences and practices of any of my previous or current clients, customers or employers are strictly coincidental. This post is therefore my own, and I am the sole author of it and am the sole copyright holder of it.