Praise for the stable services

If you have worked in web development, especially on the backend, you will almost certainly have heard praise heaped upon the “heroes” who put in the late-night hours saving some service during an on-call emergency.

As someone who has done jobs with an on-call aspect, I know it can be hard work. It’s not nice, and the recognition is certainly welcome. However, when was the last time someone received praise for managing a service that is rock solid and simply sits there quietly doing its job for months?

Maybe it is just me, but I’ve never heard of this happening. Within a security company (where uptime should be important), I’ve seen essential services where the only downtime is when a new version needs to be rolled out. Otherwise, they can more or less be forgotten.

I understand that things that are not seen can easily be forgotten, but I think praising only those who are seen struggling at 3am to fix things creates the wrong incentives. In fact, I have seen many developers at this security company who aren’t that concerned about stability. I’ve seen management say you’ve got to put in the hours in the “trenches” to get recognised.

This leads to burnout and employees leaving or, as I mentioned earlier, it incentivises just getting things out and not worrying about whether they are any good. It’s assumed they won’t be, and when you save the system later you’ll get praise. By doing those out-of-hours shifts, you’ll also be more likely to make mistakes during the day or simply do less because the fact is, you cannot be “on” all the time.

Some employees learn to love the praise for the hard work, then simply do not bother doing their normal job well and happily jump into the big emergency whether they’re needed or not.

As software is generally a service to enable real work (for example, security software enables analysts, and customers pay for analysts, not software), that real work can fall apart too when developers spend too much time on emergencies. Take this real-world example from this year. It’s slightly tweaked to protect the innocent.

We've been waiting for this since early 2019. It is needed so we can tell what type of widgets we are looking at in our tools. We cannot even discern the type of Doo-dad correctly. I'm glad we're approaching, maybe investigating the cloud requirements in future iterations.

This isn’t a one-off occurrence, and it shows an unhappy internal “customer” that just gives up. You risk losing those non-engineering employees because the quality is simply not there and everyone’s job is made harder. All because of management creating the wrong incentives.

Some of this might be that employers just assume employees won’t stick around, so they squeeze as much as they can out of them and let them burn out and move on. This is one reason many jobs simply don’t offer much in the way of training. The assumption is that the company pays for the training and then the employee moves on with the free knowledge.

As you might expect, it’s never assumed the company and culture are at fault. Instead, we blame workers for quiet quitting and just not being productive. There was a recent Washington Post article about this. I don’t think it’s surprising that the areas where productivity is taking a plunge are more on the white-collar side, probably more in tech than elsewhere, if I had to guess.

With software in particular, I do think it’s this always-on culture, DevOps, infrastructure as code, etc. that are at fault. Regardless of the out-of-hours work, all of it adds complexity. On top of that, when you’re short on sleep and not incentivised to make stable services, at some point it all falls apart.

I think if companies showed more praise for the work put into stable services that require little attention, then developers might change their behaviour. They could stick to reasonable hours and get a good sleep. It would also mean they could put that stable work to the side and focus on the new thing. If the company is lucky, it will also avoid burnout, and employees might stick around a bit longer.