No Country for Old QA

Emanuil Slavov

Aug 23, 2017 • 15 min read

In March 2017 I gave a presentation at a QA conference: QA Challenge Accepted 3.0. The title was No Country for Old QA and in it I summarized my experience from the last 15 years working in QA. It also included my thoughts on where our industry is going and what the future of this profession is. The presentation slides are in English, but I spoke in Bulgarian. After the conference was over, I received lots of requests to publish a summary of the talk in English, as people wanted to share bits and pieces with their teams. You can find the highlights below.

The Current State of QA

Different people have different opinions about what the QA role is and does. But until a few years ago it was a role present in almost every software company. Recently, however, a new breed of companies started emerging: they do not have a dedicated QA role. This trend usually begins with startups. It does not make economic sense to invest heavily in software quality until a product-market fit is found. GitHub, Stripe and Airbnb are examples of such companies. They are all private companies, and not small ones at that. As of March 2017, Airbnb’s valuation was $31B with ~5,000 employees. For comparison, Bulgaria’s GDP is $50B with a population of ~7,000,000.

Public companies that do not have a dedicated QA role

So the QA role does not exist in startups, but when the company grows it has to have QA, right? Here are some examples of big, publicly traded software companies that do not have such a role^[1]: Yahoo!, Facebook, Microsoft, Google. Facebook was a startup, but it never had dedicated QA. Even as it grew, it found ways to keep it that way^[2]. Even Microsoft, once proud of its 2:1 ratio^[3], is now phasing out the formal QA role.

OK, but in some industries you absolutely need QA, like in gaming? Wrong. Here is one example:

No Man's Sky - everything in this game is generated on the fly (pun intended).

This is a screenshot from a game called No Man’s Sky. The purpose of the game is to travel around the galaxy in a spaceship, collect resources, fight other explorers etc. The unique thing about this game is that whenever you approach a planet for the first time, the game autogenerates it, including everything on it: the geography, flora, fauna and resources. The number of planets is 18 quintillion. So each player gets to experience a unique, one-of-a-kind journey. How do you test all those planets in your lifetime? How do you test something nondeterministic? The short answer is: you can’t. What the developers of this game did was to create bots that fly around, land on planets, take some screenshots and a short video, and fly away to the next planet. The bots can’t visit all the planets, because our Sun will be long gone before they do. So they pick a small sample to land on. The screenshots and the videos are fed to TV screens in the developers’ room. The developers check the images from time to time for irregularities, e.g. a creature with 7 heads and 18 legs, or oversized vegetation. In case of a problem they adjust the algorithms appropriately.

The Cost of a Defect

Why do some companies not have a formal QA role while others do? Some of the reasons have to do with the cost of defects. If the cost is low, it does not make much sense to invest in heavy (and expensive) testing upfront.

Free vs Paid Product. If a product is free, and it has bugs, who are you going to complain to? You don’t pay a dime, you don’t have a support contract. If a product is free, it means that you are the product. You don’t pay with money, but you pay with your time, eyeballs, actions, information, social interactions etc. On the other hand, if a product is paid, it usually comes with an SLA and financial penalties for not meeting the terms. The cost of a defect is high in the latter case.

Startup. As mentioned above, the most important task of the startup is to find the right product market fit. Everything else comes after that. The cost of a defect is low as there is a high chance that the startup will run out of funding before delivering something useful. The worst part (for the current QA engineers) is that as startups mature and become public companies, they learn how to operate without dedicated QA. I expect this process to continue.

Monopoly. If your company is a monopoly then the price of a defect may be quite low. When you are the only game in town (Facebook, government, internal IT department of a company), the customers have no other choice but to tolerate you. This is pretty clear with Facebook: your drunk photo could not be uploaded for some reason, and in your rage what are you going to do? Use MySpace instead? On the other hand, if you are in a highly competitive market, the cost of defects may make or break your company’s public image.

Significant Impact. Think about the software that you’re writing. Can a defect cause significant money loss? I worked for a company that processed electronic money transactions. Each defect was costing us money, literally. The most expensive one that I’ve seen cost us 100,000 EUR, but we made one of our customers very happy. Can a defect set you back a significant amount of time? Consider the crash of the Schiaparelli Mars lander. A software defect caused the crash, setting the European Space Agency back by the time it took to build the lander as well as the time it took to reach Mars. Can a defect cause you not to be compliant? To operate in certain industries, compliance standards need to be followed: HIPAA and PCI are two examples. A defect can cause non-compliance, and in some cases you may have to pay heavy penalties to the regulators or may not be able to operate in such industries at all. Can a defect cause loss of life? In the case of one X-ray machine it did. On the other hand, let’s go back to the Facebook picture example: if the upload did not succeed on the first try, there is no significant impact.

Deploy Frequency. I used to work at a company that produced software distributed on compact discs. Every defect cost us a lot because, even if we issued a patch for it, it was up to the customers to decide when to apply it. We also had to support ‘rolling’ upgrades (including data migrations), meaning that you could upgrade from version 3.0 directly to 7.0 without going through the in-between versions. All this required heavy pre-release testing and there was no other way. Now, with SaaS and continuous delivery, there is only one place where your code resides, so it’s easy to upgrade and apply a fix. After a fix, the only thing your customers need to do is reload their browser. Being able to quickly deploy a fix to all customers means that in most cases the defect cost is low^[4].

What Happened in The Last 15 Years

I’ve worked as QA for the last 15 years and these are the most significant developments that happened during that period.

Salaries got higher. When I started working as a QA, my salary was 160 USD (after taxes). This is not much now but in 2002 it was a pretty good chunk for me, given that the standard of living in Bulgaria was low. Today, the salaries are 10-20 times higher.

Fewer QAs. 15 years ago it was so cheap to hire a QA that some companies had more QAs than developers. There was no need to invest in test automation or any time/effort saving activity as you had so many QA drones willing to work for peanuts doing the same repetitive tasks over and over again. Today, as a result of the higher salaries, there are significantly fewer QAs compared to developers^[9]. And it makes sense: you can have a product without QA, but not without a developer. The focus is on hiring developers; a QA may never get hired.

A lot more is expected from a QA. 15 years ago, hiring QAs consisted mostly of checking if they had any computer skills at all, e.g. at least MS Office literacy. Today, to get hired even as a junior QA, you need to have at least one of the following skills: relational DB knowledge, programming experience, networking knowledge or experience with hardware.

No dedicated QA teams. When I started as QA, most of the companies were working with the waterfall development methodology. There were big and independent teams: developers, QAs, product owners. Today pretty much no company works like that (except some outsourcers). The teams are now combined. In the majority of cases the people who are promoted to higher positions have either a development or a product owner background. Regular QA engineers do not see a career path on those teams and as a result tend to choose other careers. The constantly shrinking size of the QA teams also contributes to the fact that QAs do not see themselves moving towards management positions. There are just not enough employees in the QA team for a rigid hierarchy (junior, regular, senior, lead, manager). In most cases one QA lead is enough.

Moving to other positions. When I started working as QA, our team consisted of 12 engineers. Now only 40% of the original team still works as QA (in lead or management positions). 60% of that team have moved on to other endeavors. The two most common positions they moved to were development^[5] and product owner^[6]. It is fair to say that most of the people working as QA now will not retire working as QA. What’s more, most QA engineers consider this role a stepping stone, a foot in the door, to some other position in the IT industry with more potential for career growth.

Shrinking Cycles

There is another important development that happened in the last 15 years. When working with waterfall, planning, development and testing cycles lasted for 2-4 months each. I used to plan what each member of the QA team would do, day by day, 6 months in advance in an Excel Gantt chart. Needless to say this plan was never accurate, but at the time we didn’t know any better. Today, almost every company works with some sort of iterative development process with short release cycles: 1-2 weeks for SaaS, 3-4 weeks in the case of mobile apps or applications that need on-premise installation. In order to meet those deadlines, companies rely more and more on fast-feedback quality-related activities.

The ever shrinking time available for manual QA

This time is taken from manual testing. In order to secure more time for development, manual testing is being squeezed from the left by early defect detection activities (performed by the developers): static code analysis, code review, pair programming, automated tests. Since the pressure to release faster to the customers is huge, once a feature is ready, manual testing is also squeezed from the right^[7], usually by activities performed by operations: analyzing customer-reported defects, monitoring for errors and exceptions, mitigation techniques or even a full rollback in case of a catastrophic failure. All those ‘shift’ activities mean less time for manual testing. Since not much manual/exploratory testing is required^[8], in some cases those test activities are performed by the product owner or by the developers themselves. One can argue that shrinking the manual testing process leads to higher quality overall and faster cycle time. All of this paints a pretty bleak picture for the average QA engineer.

The PDCA Cycle

In classic project management theory there is the notion of the ‘holy trinity’. It consists of high product quality, low manufacturing price and short development time. The theory states that you can have only two of the three at the same time. However, if you want to continue to work as a QA, you need to help your organization achieve all three at the same time. What’s more, you need to be flexible and respond to changes. 15 years ago, your best bet to achieve the ‘holy trinity’ was to fill a room with a bunch of QA engineers and pay them 160 USD a month. Today, your best bet is the ‘shift’ left/right activities. But tomorrow we may require new thinking to achieve the ‘holy trinity’, possibly using artificial intelligence.

If you’ve studied classic management theory you may think that I’m full of bullshit. But I want to introduce you to William Edwards Deming. He is credited (along with Joseph Moses Juran), at least to a degree, for what we now call the Japanese Economic Miracle after the Second World War. Japan was a bankrupt country in 1945, with inflation of 100% for three consecutive years and destroyed infrastructure. Yet by 1967 it had risen to become the second largest economy in the world. This was accomplished in part by Deming insisting that by focusing on quality first, the other two parts of the holy trinity will fall into place.

Deming postulated 14 points for improving any system, and some of them we can directly relate to software engineering:

“Cease dependence on inspection to achieve quality. Eliminate the need for massive inspection by building quality into the product in the first place.”

We’ve already figured out that testing after the fact does not yield great results. We should put more effort into detecting and preventing defects in the planning and development phases.

“Improve constantly and forever the system of production and service, to improve quality and productivity, and thus constantly decrease costs.”

By focusing on quality, cost reduction and speed (productivity) will naturally follow.

“Break down barriers between departments.”

More than 50 years ago Deming was preaching what we are discovering just now with the so-called ‘agile’ development methodologies.

“The responsibility of supervisors must be changed from sheer numbers to quality.”

It’s always better to produce less with higher quality. Don’t rate people based on fallible metrics — number of bugs found/fixed, code coverage percentage achieved. Forget about premature optimization. Will the feature solve a customer problem? Is she willing to pay for this solution? What is the most optimal way to produce it?

There is a chart that sits in almost every Japanese factory. A chart used for product development as well as for problem solving. It was not originally created by Deming, but it was improved and popularized by him. It’s called the PDCA cycle - plan, do, check (study), act (adjust). It can also fit nicely with any software development methodology, as we also have the same stages.

Every activity affects quality.

I’ve added three more components to the chart above that also affect quality: People, Product and Process. Listed at every stage are some of the quality-related activities that can be performed.

The regular QAs are only responsible for three quality-related activities.

Now look at the chart above. If you’re a regular QA, your responsibilities are to participate in planning meetings (if you're lucky), where you can give your opinion, and to perform manual/automated testing during the development phase. There are lots of activities that can influence quality but, for various reasons, most of us never participate in them.

We can draw three conclusions from this graphic:

QAs can never be the only ones responsible for the quality of the product.
If you want to improve product quality you need to perform various activities at different stages.
Quality does not equal testing (manual or automated). There are lots of other activities, some even more important (and cheaper) than testing after a feature is completed.

Seven Steps

If you want to continue to work and grow as a QA, you need to study and apply different quality-related activities at different stages in the software development lifecycle. Get out of your comfort zone and start learning.

Here are seven steps to start your journey:

1. Increase the feedback loops. Figure out from what activities (other than testing) you can get information about the quality of the product. Where are the weakest links? For example, study customer-reported defects, monitor for errors and exceptions in the production environment, perform customer quality surveys, and measure how flaky your automated tests are and what causes the flakiness.
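As one way to turn test flakiness into a number you can follow, here is a minimal Python sketch. It assumes you can export test results as (commit, test name, passed) records; that record format and the function name `flaky_tests` are made up for illustration. A test is treated as flaky when it both passed and failed on the same commit, since the code did not change between those runs.

```python
from collections import defaultdict


def flaky_tests(runs):
    """Return {test_name: failure_rate} for tests that look flaky.

    `runs` is an iterable of (commit, test_name, passed) tuples.
    This record format is an assumption made for the example.
    """
    # Group the pass/fail outcomes of every test by the commit it ran on.
    outcomes = defaultdict(list)
    for commit, test, passed in runs:
        outcomes[(commit, test)].append(passed)

    # Count failures only on commits where the same test also passed.
    totals = defaultdict(lambda: [0, 0])  # test -> [failures, runs]
    for (_commit, test), results in outcomes.items():
        if True in results and False in results:
            totals[test][0] += results.count(False)
            totals[test][1] += len(results)

    return {test: failed / total for test, (failed, total) in totals.items()}


history = [
    ("a1", "test_login", True),
    ("a1", "test_login", False),   # same commit, different result: flaky
    ("a1", "test_login", True),
    ("a1", "test_payment", True),
    ("b2", "test_search", False),  # fails consistently: a real defect, not flaky
    ("b2", "test_search", False),
]
print(flaky_tests(history))
```

A test that fails every time on a commit is deliberately left out: that is a defect to fix, not flakiness to measure.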

2. Track and visualize trends. Don't keep the data you collect to yourself, plot it and show it to everyone in your organization. You’ll be able to see if a solution works by monitoring if the stats show improvement or not after you’ve implemented it.

“In God we trust, all others [must] bring data.” W. Edwards Deming.

3. Fix defects immediately. I’m a big advocate for a zero bugs policy and no defects backlog. If something is important, fix it right away. It impacts your customers, the ones who are paying you to use your product. When you find out about a defect, the knowledge is still fresh in your mind and you can fix it quickly. If you postpone the fix, you lose this advantage, people leave the company and the fix always takes longer. Make sure to implement mechanisms to prevent this bug from happening again (automated tests, alerting, database checks, purging of old/unused data).

4. Eliminate classes of defects. Sometimes when you spot a defect, it’s part of a larger class of defects. Analyze each defect and figure out if you can eliminate the parent class of similar defects. For example, we had a PHP backend that contained SQL injections. We and our customers would find them on a regular basis and fix them one by one. We got tired of this whack-a-mole game and we wrote a custom tool that would scan the PHP code and detect all SQL injections. We got rid of that problem reliably once and for all.
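To give a feel for what scanning for a whole class of defects can look like, here is a deliberately naive Python sketch. It is not the tool described above, whose internals are not covered here. It only flags lines where a string literal that starts an SQL statement is followed by a PHP variable, which is a rough hint that the query is built by concatenation or interpolation. The function name and the regular expressions are assumptions made for this example, and a real scanner would need proper parsing to avoid false positives and misses.

```python
import re

# A quote followed by an SQL verb: the start of a query written as a string literal.
SQL_START = re.compile(r"""["']\s*(SELECT|INSERT|UPDATE|DELETE)\b""", re.IGNORECASE)
# Any PHP variable, e.g. $id or $_GET.
PHP_VAR = re.compile(r"\$[A-Za-z_]\w*")


def find_suspect_queries(php_source):
    """Return (line_number, line) pairs that look like SQL built from variables."""
    suspects = []
    for number, line in enumerate(php_source.splitlines(), start=1):
        match = SQL_START.search(line)
        # Only look for variables that appear after the start of the query text.
        if match and PHP_VAR.search(line, match.end()):
            suspects.append((number, line.strip()))
    return suspects


sample = '''<?php
$id = $_GET["id"];
$rows = $db->query("SELECT * FROM users WHERE id = " . $id);
$stmt = $db->prepare("SELECT * FROM users WHERE id = ?");
'''
for number, line in find_suspect_queries(sample):
    print(f"line {number}: {line}")
```

Run on every commit, a check like this can fail the build as soon as a new suspicious query appears, instead of waiting for someone to find the injection in production.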

5. Create mitigation strategies. Defects will happen; it’s not possible to eliminate all of them forever. The smart thing to do is to expect the defects and build mitigation strategies. As Todd Conklin says, safety is the presence of capacity (to fail). When a defect occurs, the consequences should not be catastrophic. The examples are highly domain specific, but here are some of them: use try/catch/finally blocks to gracefully handle an exception and recover, use database transactions, run periodic checks for the integrity of your data, restart a service in case of a failure, retry a failed network connection. Some of these techniques are described in great detail in the book Release It! by Michael Nygard.
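As an illustration of one of these mitigations, retrying a failed network connection, here is a minimal Python sketch. The helper name `call_with_retry` and its parameters are made up for this example; which exceptions count as transient and how long to wait depend on your domain.

```python
import time


def call_with_retry(operation, attempts=3, base_delay=0.5,
                    retry_on=(ConnectionError, TimeoutError),
                    sleep=time.sleep):
    """Call operation(); on a transient error wait and try again.

    Gives up after `attempts` tries and re-raises the last error, so a
    persistent failure is still visible to the caller and to monitoring.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of retries: surface the failure instead of hiding it
            # Exponential backoff: 0.5s, 1s, 2s, ... with the default base_delay.
            sleep(base_delay * 2 ** (attempt - 1))


# Usage: an operation that fails twice and then succeeds.
calls = {"count": 0}

def unreliable_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("service temporarily unavailable")
    return "ok"

print(call_with_retry(unreliable_fetch, base_delay=0.01))
```

Only errors listed in `retry_on` are retried. Anything else, such as a bug that raises `ValueError`, fails right away, because repeating it would only delay the inevitable.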

6. Eliminate waste. Waste is everything that the customer is not paying for. It may be code (an unused feature), documentation that nobody reads, or an unnecessary activity performed constantly (e.g. testing before each release that the software runs on Windows XP, or on Internet Explorer 9). Constantly ask why we need to do an activity and how this activity helps the customer. “We’ve always done it this way” is your enemy.

7. Share knowledge. Whatever you learn, it’s your obligation to quickly share it with your company. The knowledge will spread rapidly and will be of use to everyone. Even if it may not be directly applicable, it may spark other ideas for improvements. If the knowledge is not company confidential, share it with a wider audience. Write a blog post or present it at a conference. This is the way to grow and improve the QA community.