Introduction
Building an application is one thing; supporting it once it’s up and running is a whole different ball game, with its own set of challenges.
In a perfect world, the code that is rolled out at the end of each sprint would work perfectly in production. There would never be any bugs, and there would never be any issues that force developers to roll back the code that had already been deployed.
Of course, we don’t live in a perfect world, and software development is always fraught with bugs, especially when a large team is working on a project.
The main task of such a team is to release the application without a single bug. However, even with a dedicated team of testers, bugs can still surface in production, so you need some kind of plan for fixing them.
When things come crashing down
Handling errors in production can be tricky, and the production environment isn’t every developer’s cup of tea. To make matters worse, there is often no documentation available, and developers are called in for troubleshooting support with little to no knowledge of the application.
Needless to say, errors in applications can lead to downtime, which in turn leads to lost revenue and a damaged reputation. Gartner analysts have estimated the average cost of downtime at $5,600 per minute, which is well over $300,000 per hour.
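To get a quick sense of what that figure means in practice, here is a back-of-the-envelope calculation in Python. The $5,600-per-minute rate is the Gartner average cited above; the 4.5-hour outage duration is a purely hypothetical example.

```python
# Back-of-the-envelope downtime cost estimate based on a per-minute rate.
COST_PER_MINUTE = 5_600       # USD, the Gartner average cited above
outage_minutes = 4.5 * 60     # hypothetical example: a 4.5-hour outage

estimated_cost = COST_PER_MINUTE * outage_minutes
print(f"Estimated cost of a {outage_minutes:.0f}-minute outage: ${estimated_cost:,.0f}")
# -> Estimated cost of a 270-minute outage: $1,512,000
```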
Here are some real-life examples of major outages:
Skype
Skype experienced its first major downtime in August 2007: for two days straight, users could neither make calls nor send messages. The cause was a problem with an algorithm in Skype’s networking software. The market capitalization of Skype’s owner, the online auction service eBay, fell by $1 billion on the first day of the outage, and its shares on the New York Stock Exchange dropped by 7%, from $36.20 to $33.64 per share.
Google Cloud
On June 2, 2019, Google Cloud Platform experienced a significant network outage that affected services hosted in parts of the US West, US East, and US Central regions. The outage impacted Google’s own applications, including G Suite and YouTube, and lasted more than four hours, which is notable given how critical Google’s services are to business customers. Google issued an official report on the incident several days later.
The complete unavailability of parts of Google’s network turned out to be caused by the network control plane inadvertently being taken offline. Google later explained that during the outage, a set of automated policies determined which services remained reachable through the unaffected parts of its network.
Facebook
A week after Forbes ranked Facebook’s founder Mark Zuckerberg above Steve Jobs on its list of the richest Americans, the social network suffered a major disruption. By the fall of 2010, the number of registered Facebook users had exceeded 500 million, so the outage came as a public shock and became the main news of the day.
Facebook’s management detailed the reasons for the incident: an automated system that checks for incorrect settings interpreted the changes made to the main configuration database as an error. A massive wave of “corrections” began, and the servers were overloaded with hundreds of thousands of requests per second.
An insider’s perspective on bug fixing in production
We met with Andrei Vorobjov, the Lead QA Engineer at Bamboo Agile, experienced in testing projects of any complexity and scale, to discuss all the subtleties that go into bug fixing during the production stage.
Describe your experience of dealing with software testing in production.
How does the typical process of bug fixing in production vary from company to company?
– The attitude towards bug fixing in production as part of the software development process depends heavily on the type of product the company builds. In software module development, bugs came at a high cost, not only financially but, first and foremost, reputationally: the client could start losing money right after a release, forcing the contractor to offer refunds, discounts, perks, and so on to smooth over the situation.
All those potential costs pushed companies to set up departments of professional support managers and to maintain separate teams focused solely on bug fixing in production.
Such a system works well. My first company usually handled it in the following way (a simple sketch of this flow follows the list):
1. The user finds a bug in an e-Commerce application (an online store) and informs the store about it, describing the bug in detail.
2. The online store (in this case, the client of the software development team) reports and describes the bug to the support department.
3. The support department passes the bug on to the QA department for reproduction. This step is needed to understand the bug’s nature and whether it is global or local.
4. After reproducing the bug and identifying it as global, a QA engineer writes it up and sends it to the team responsible for bug fixing.
5. Once fixed, the bug goes back for retesting. The fix is then shipped in a release if the bug is global (or delivered directly to the client that purchased the module if the bug is local).
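To make the flow above a bit more concrete, here is a minimal Python sketch of how such a ticket lifecycle could be modelled. The state names and transitions are our own illustration of the process described above, not an actual tool or workflow engine used by the team.

```python
from enum import Enum, auto


class BugState(Enum):
    """Lifecycle states of a production bug in the workflow described above."""
    REPORTED_BY_USER = auto()      # 1. the end user reports the bug to the store
    SENT_TO_SUPPORT = auto()       # 2. the store forwards it to the support department
    IN_REPRODUCTION = auto()       # 3. QA reproduces and classifies it (global or local)
    IN_FIXING = auto()             # 4. the bug-fixing team works on a fix
    IN_RETESTING = auto()          # 5. QA retests the fix
    RELEASED = auto()              #    global fix shipped in a release
    DELIVERED_TO_CLIENT = auto()   #    local fix delivered to the affected client


# Allowed transitions between states; global and local bugs branch at the end.
TRANSITIONS = {
    BugState.REPORTED_BY_USER: {BugState.SENT_TO_SUPPORT},
    BugState.SENT_TO_SUPPORT: {BugState.IN_REPRODUCTION},
    BugState.IN_REPRODUCTION: {BugState.IN_FIXING},
    BugState.IN_FIXING: {BugState.IN_RETESTING},
    BugState.IN_RETESTING: {BugState.RELEASED, BugState.DELIVERED_TO_CLIENT},
}


def advance(current: BugState, target: BugState) -> BugState:
    """Move a ticket to the next state, rejecting transitions that skip steps."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"Cannot move a bug from {current.name} to {target.name}")
    return target


# Example: a global bug walking through the whole chain.
state = BugState.REPORTED_BY_USER
for next_state in (BugState.SENT_TO_SUPPORT, BugState.IN_REPRODUCTION,
                   BugState.IN_FIXING, BugState.IN_RETESTING, BugState.RELEASED):
    state = advance(state, next_state)
print(state.name)  # -> RELEASED
```

The branch at the retesting step mirrors the global-versus-local distinction from step 5: a global fix goes into a regular release, while a local one is delivered straight to the client that purchased the module.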
Everything seems transparent and simple. But are there any hidden pitfalls involved in this process?
– That’s true, there are plenty of downsides. For example, there’s a Chinese-whispers effect: reproduction steps get rewritten over and over, and the long communication chain causes some bugs to get lost along the way. That said, every rule has its exceptions: bugs that could be fixed with a single comma didn’t have that far to travel and were quickly delivered to the development team.
How has the testing process changed since then, and what obstacles to seamless testing can still come up?
– Currently, while working on a product that doesn’t directly affect anyone’s costs (meaning no significant losses for clients’ businesses), we focus on what a bug means for an individual user, since in this case any bug is effectively global. In the current system, we release bug fixes in clusters (again, excluding the critical ones). Since there is no separate team working on fixes, this lets us focus on ongoing development without distracting the team, and it also optimizes the work by building releases primarily out of the fixed bugs. But even here there are new downsides, the most crucial being lack of time. On the testing side, the engineer may lack access to production and the opportunity to properly test the fix after the release.
What conclusions have you reached after working in such different environments, and what advice can you give to others?
– As a tester, I’ve come to appreciate the importance of judging bug criticality at a global level and have found that not all bugs are as frightening as the client sees them. Every company tries to optimize how it handles bug fixing in production. So whether your processes are well aligned or missing completely, I’d advise remembering the following:
- A tester is able to veto the release of an unreliable product, provided the decision is explained. This helps keep some bugs out of production.
- You cannot make bug-fixing decisions on process alone; common sense matters. If a bug isn’t worth the full procedure, it’s often better to put it on hold for a while or, on the contrary, to fix it fast by bypassing the cumbersome process machinery.
Conclusion
All in all, bug fixing in production is an unavoidable part of software development. While the process itself may be difficult and confusing, there are multiple ways to make the task easier for the development team. Great attention to detail, clear communication, and a well-defined troubleshooting strategy are all amazing assets in the programmers’ eternal battle against bugs.