Practical Techniques to Achieve Quality in Large Software Projects
High quality is hard to achieve and very expensive, but it’s worth every penny and must be taken extremely seriously. There is no silver bullet – just lines of defense. The good news is that with the proper lines of defense, quality becomes incremental: it only goes up with every release. With enough test coverage and quality tools you can substantially increase the quality of your product and protect it from regressions.
When you embark upon a large software project you need to figure out how to think about quality. And you should start with testing. Not all tests are created equal. Here are the most common types:
- Unit tests
- Invariant tests (asserts, etc.)
- Functional tests
- Stress tests
- Performance tests
- Scalability tests
- Customer workload based testing
- Scenario based testing
In this blog post we will discuss test coverage as well as how to manage the long tail of rarely surfaced bugs using force multipliers – a slew of engineering techniques that transform our existing test suite to produce new ways of testing the product. Each of these has been a small “aha” moment and yielded good results.
With testing, one of the hardest things is deciding exactly what to test. Every test you create or run has a cost, and this is where a tester’s intuition shines. When your product has a lot of features it’s easy to poke around, but the real question is how you can do it in a systematic way.
It’s important to track which areas of the product are tested well and which are not, and this is made all the more difficult by the many moving parts of the codebase. Here are a few MemSQL examples:
- Correctness of read queries in the presence of write queries
- Correctness of replication in the presence of alter table
It helps tremendously to test combinations of features. It’s especially hard because different people or even teams could work on those features and the problem may only arise when both features are involved at the same time.
The number of possible combinations explodes very quickly. However, we’ve noticed that in practice the majority of bugs surface in the combination of just two features.
So, one way to tackle the combinatorial explosion is to only consider pairs of features and write tests for every pair. You can also employ n-wise testing and use a different combination every time.
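As a rough illustration of the pairwise idea, the sketch below enumerates every unordered pair of features and reports which pairs still lack a test. The feature names and the helper functions are hypothetical, chosen only for this example; they are not MemSQL's actual feature list or tooling.

```python
from itertools import combinations

# Hypothetical feature names, used only for illustration.
FEATURES = ["replication", "alter_table", "backup", "read_queries", "write_queries"]


def feature_pairs(features):
    """Return every unordered pair of distinct features, in a stable order."""
    return list(combinations(sorted(features), 2))


def uncovered_pairs(features, covered):
    """Return the pairs that no test covers yet.

    `covered` is an iterable of 2-element pairs; order within a pair is ignored.
    """
    done = {tuple(sorted(pair)) for pair in covered}
    return [pair for pair in feature_pairs(features) if pair not in done]


pairs = feature_pairs(FEATURES)
todo = uncovered_pairs(
    FEATURES,
    [("replication", "alter_table"), ("write_queries", "read_queries")],
)
```

With n features there are n(n-1)/2 pairs, so the work grows quadratically rather than exponentially, which is what makes the pairwise approach tractable.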
On top of that you should also think about dimensions. A classical test dimension for a SQL database is “all data types” or “all possible positions within a SQL query such as the where clause, select clause, group by, etc.” When testing a specific feature you want to test it with every dimension.
We wrote down all feature pairs and dimensions into a giant spreadsheet. After that we fanned out the test work to the engineering team to color the matrix green.
You can write a lot of tests, but it is continuous integration testing that allows developers to make far-reaching changes without a paralyzing fear of undetected breaks. Such integration testing should be fast and scalable. You need stellar test infrastructure to frictionlessly add, manage, and run these tests. With a well-built execution system, the number of functional tests can grow safely and quickly.
Building good test infrastructure is similar to writing a consumer product for your engineering team. The easier it is to produce tests, the more engaged the engineering team is and the more tests run against your system every day. In the ideal world, when you modify a single line of code you should instantly know what tests you just broke. Practically it’s not always possible because of the computational complexity of test runs, but you should push towards getting instant feedback as much as possible.
A perfect test execution system would run the entire test suite after every push. We used to run every test before commits on each developer’s machine. When that started taking too long, we ran only a subset of tests locally and performed full test runs every night on a dedicated box. When the dedicated box couldn’t handle the load, we built a cloud-based test execution platform called “Psyduck.”
The Psyduck platform can run any subset of our tests against any version of our codebase. Furthermore, if you want to run tests against local code changes, Psyduck accepts patches, and so we enforce running Psyduck before pushing code. For example:
    memcompute1:~/memsql/memsqltest nikita(arcpatch-D2497)$ ./psy test --filter=.nikita
    Compressing patch...done.
    Sending 5340 bytes to S3.....done.
    patch id = 901be82c33ee4b07b142d102660e3206

Psyduck uses a combination of in-house hardware along with hundreds of Amazon spot instances to run tests as fast as possible in parallel.
Last summer a MemSQL intern built a real-time visualization layer called Liveduck and it became an instant hit. This interface displays all ongoing and recently completed tests along with metadata such as the associated engineer and the pass/fail count.
Psyduck provides us with a quick metric on where we stand now and where we need to be at release. As we approach a release, the team pushes towards what we call “green Psyduck”. This means that all tests are passing.
Another property of an excellent test system is that every test must run the same on a developer’s machine as it does on the platform. Furthermore, each test should be self-contained so that it requires zero setup. This will greatly assist in test suite maintenance and accelerate failure investigation.
Every one of these tricks took MemSQL quality to the next level. Here is the list.
Transforms let you put a twist on all existing tests. By implementing a few functions in Python you can munge both the incoming query and the expected output. Then, you can run your entire functional test suite via a transform. MemSQL has many transforms. Here are a few:
- Replication transform. All the read queries are directed to a replication slave.
- Subquery transform. Every read query is wrapped in a subquery.
- Backup restore transform. As queries run, the transform performs backups and restores and makes sure the results of the queries stay the same.
We have had internal transform competitions, where the goal is to write the transform that finds the most bugs. When you write transforms you feel like you are standing on the shoulders of giants. With a few lines of code, you can stress the system with thousands of new tests.
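To make the idea concrete, here is a minimal sketch of what a subquery transform might look like. The class name, the two method names, and the `run_with_transform` helper are assumptions made for this example, not MemSQL's actual transform interface, and SQLite (via Python's built-in `sqlite3` module) stands in for the database under test.

```python
import sqlite3


class SubqueryTransform:
    """Hypothetical transform: wrap every read query in a subquery."""

    def transform_query(self, query):
        stripped = query.strip().rstrip(";")
        # Only rewrite plain SELECT statements; leave writes and DDL untouched.
        if not stripped.lower().startswith("select"):
            return query
        return f"SELECT * FROM ({stripped}) AS sub"

    def transform_expected(self, rows):
        # Wrapping a query in a subquery should not change its result set.
        return rows


def run_with_transform(transform, tests, execute):
    """Run (query, expected_rows) tests through a transform; return failing queries."""
    failures = []
    for query, expected in tests:
        actual = execute(transform.transform_query(query))
        if actual != transform.transform_expected(expected):
            failures.append(query)
    return failures


# SQLite stands in for the system under test in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])


def execute(query):
    # Sort so that the comparison does not depend on row order.
    return sorted(conn.execute(query).fetchall())


tests = [("SELECT a FROM t WHERE a > 1;", [(2,), (3,)])]
failures = run_with_transform(SubqueryTransform(), tests, execute)
```

The existing test list is reused unchanged; only the transform differs from run to run. A real transform would likely need more care, for instance with queries whose results depend on ORDER BY.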
Random query generators produce SQL statements using a given grammar. The cool part is that a generator can read the bison grammar of the actual product and use it to generate random queries. There are a bunch of force multipliers applicable to random query generation:
- Produce many different queries that give the same result (for example, identical queries over tables with different indexes) and cross-verify the results.
- Build intelligence into the generator to gradually increase the complexity of the generated tests. This ensures that the system is robust against simple cases before you proceed into more complex cases. If you start with complex cases you won’t be able to converge to a quality system fast enough.
- Once a bug is identified, the generator automatically reduces the complexity of the query to come up with the smallest possible reproducible test case. Random query generators can produce extremely intimidating queries. If the repro is left “as is” it would take a lot more time for a developer to investigate the root cause of the failure.
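The three ideas above can be sketched in a few dozen lines. This is a toy under stated assumptions: the grammar is hard-coded rather than read from a bison file, SQLite stands in for the product, and every function and table name is hypothetical. Predicates grow in nesting depth from simple to complex, each one is run against a plain table and an indexed copy of the same data, and a greedy reducer shrinks a failing list of conjuncts.

```python
import random
import sqlite3

COLUMNS = ["a", "b"]
TABLES = ("t_plain", "t_indexed")


def gen_predicate(rng, depth):
    """Random WHERE predicate; `depth` is the nesting level of AND/OR."""
    if depth == 0:
        col = rng.choice(COLUMNS)
        op = rng.choice(["<", "<=", "=", ">=", ">"])
        return f"{col} {op} {rng.randint(0, 9)}"
    joiner = rng.choice(["AND", "OR"])
    left = gen_predicate(rng, depth - 1)
    right = gen_predicate(rng, depth - 1)
    return f"({left} {joiner} {right})"


def make_db():
    """Two tables with identical data; only one of them has an index."""
    conn = sqlite3.connect(":memory:")
    rows = [(i % 10, (i * 7) % 10) for i in range(50)]
    for table in TABLES:
        conn.execute(f"CREATE TABLE {table} (a INTEGER, b INTEGER)")
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    conn.execute("CREATE INDEX idx_a ON t_indexed (a)")
    return conn


def mismatches(conn, rng, max_depth, per_depth):
    """Run predicates of growing complexity on both tables; return disagreements."""
    bad = []
    for depth in range(max_depth + 1):  # simple cases first
        for _ in range(per_depth):
            where = gen_predicate(rng, depth)
            plain, indexed = (
                sorted(conn.execute(f"SELECT a, b FROM {t} WHERE {where}").fetchall())
                for t in TABLES
            )
            if plain != indexed:
                bad.append(where)
    return bad


def reduce_conjuncts(conjuncts, still_fails):
    """Greedily drop conjuncts while `still_fails` keeps returning True."""
    current = list(conjuncts)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if candidate and still_fails(candidate):
                current = candidate
                changed = True
                break
    return current


bad = mismatches(make_db(), random.Random(0), max_depth=3, per_depth=20)
# Pretend the bug reproduces whenever "c IS NULL" is present.
smallest = reduce_conjuncts(
    ["a > 1", "b LIKE 'x%'", "c IS NULL", "d < 5"],
    lambda conjuncts: "c IS NULL" in conjuncts,
)
```

In a real setup `still_fails` would re-run the candidate query against the product and check whether the original failure still occurs.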
Now that you have created a lot of functional tests, you can leverage them to run functional stress. The idea is that you take all these tests and randomly throw them at the system in parallel. When you do that you can’t really verify the test results; all you care about are crashes and internal consistency checks.
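A minimal sketch of such a stress driver follows, assuming a hypothetical `run_query` callable and an `is_healthy` consistency check. `ToyServer` is a stand-in invented for this example; a real driver would talk to the actual server and would probably need to distinguish expected errors from genuine crashes.

```python
import random
import threading
from concurrent.futures import ThreadPoolExecutor


def functional_stress(queries, run_query, is_healthy, workers=8, rounds=100, seed=0):
    """Fire randomly chosen queries in parallel, ignoring their results.

    Only two things count as failure: `run_query` raising an exception
    (standing in for a crash) or `is_healthy()` returning False afterwards.
    """
    rng = random.Random(seed)
    picks = [rng.choice(queries) for _ in range(rounds)]
    crashes = []

    def worker(query):
        try:
            run_query(query)  # result deliberately discarded
        except Exception as exc:
            crashes.append((query, repr(exc)))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(worker, picks))
    return crashes, is_healthy()


class ToyServer:
    """Hypothetical stand-in for the system under test."""

    def __init__(self):
        self.lock = threading.Lock()
        self.rows = 0
        self.inserts = 0

    def run(self, query):
        with self.lock:
            if query == "insert":
                self.rows += 1
                self.inserts += 1
                return None
            if query == "select":
                return self.rows
            raise RuntimeError(f"unsupported query: {query}")

    def consistent(self):
        # Internal consistency check: the row count must match the inserts seen.
        return self.rows == self.inserts


server = ToyServer()
crashes, healthy = functional_stress(["insert", "select"], server.run, server.consistent)
boom_crashes, _ = functional_stress(["boom"], server.run, server.consistent)
```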
This is a very neat trick. Every time there is a condition that is true 99.9% of the time (like memory allocations), you inject a failure once per unique call stack. So if you run the query “select * from t”, the first time the server will throw an error at the first memory allocation. Then when you run this command again, the first memory allocation succeeds because this call stack has already been failed once (and saved in the stackhasher), but the next allocation fails, and so forth. If you keep re-running a query until it succeeds, you can deterministically test each possible resource failure. We have a transform which does exactly this with every test in our code base, and it has proven invaluable.
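The mechanism can be sketched in Python, with the caveat that this is only an analogy: the real stackhasher lives inside the server and hashes native call stacks, whereas the toy below uses Python's `traceback` module. All names here (`StackHasher`, `allocate`, `run_query`, `run_until_success`) are hypothetical.

```python
import traceback


class StackHasher:
    """Inject a failure exactly once per unique call stack."""

    def __init__(self):
        self.seen = set()

    def should_fail(self):
        # Identify the call site by (file, line, function) of every frame
        # above this method.
        stack = tuple(
            (frame.filename, frame.lineno, frame.name)
            for frame in traceback.extract_stack()[:-1]
        )
        if stack in self.seen:
            return False
        self.seen.add(stack)
        return True


hasher = StackHasher()


def allocate(n):
    """Toy allocation that fails the first time each call stack reaches it."""
    if hasher.should_fail():
        raise MemoryError("injected allocation failure")
    return bytearray(n)


def run_query():
    """Toy 'query' that allocates from three distinct call sites."""
    header = allocate(8)
    rows = allocate(64)
    footer = allocate(8)
    return len(header) + len(rows) + len(footer)


def run_until_success(fn, max_attempts=100):
    """Re-run fn until no injected failure fires; return (result, failures seen)."""
    failures = 0
    for _ in range(max_attempts):
        try:
            return fn(), failures
        except MemoryError:
            failures += 1
    raise RuntimeError("did not converge")


result, failures = run_until_success(run_query)
```

Here the toy query hits three injected failures, one per allocation site, before the fourth attempt succeeds. Each attempt exercises exactly one new error path, which is what makes the approach deterministic.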
Even though you test relentlessly, you still need pathways for your customers to surface issues back to your engineering team for investigation. You want to introduce minidump and full core dump functionality, in addition to logging, so that bugs can be diagnosed correctly.
One of the first steps we took towards quality was to have every change go through a code review. To assist with this process a good tool is a must. We use Phabricator because it integrates into our workflow and is fun and easy to use.
Code review helps promote a quality-oriented culture in the office. Engineers will trend towards producing better code and more tests to increase the number of positive comments they receive. Furthermore, having such a system in place helps new engineers learn the ropes faster.
It’s entirely worthwhile to invest in testing infrastructure from Day One. By doing so, you can achieve the following goals:
- Use stellar tools and force multipliers to do more with fewer engineers and really push on the long tail of bugs
- Always be aware of the current state of the quality of the product
- Quickly stabilize and ship the product!