Database capacity planning

Read data in, write data out. In its purest form, this is what a computer does. Building a high-performance data processing system requires accounting for how much data must move, where it must go, and what computation it needs along the way. The trick is to establish the size and heft of your data and focus on its flow. Identifying and correcting bottlenecks in that flow will help you build a low-latency system that scales over time.

Characterizing your system

Before taking action, characterize your system using the following 8 factors:

  1. Working set size
    Set of data a system needs to address during normal operation. A complex system will have many distinct working sets, but one or two of them usually dominate.
  2. Average transaction size
    Working set of a single transaction performed by the system.
  3. Request rate
    Expected throughput. The combination of throughput and transaction size governs most of the total data flow of the system.
  4. Update rate
    Measure of how often data is added, deleted, and edited.
  5. Consistency
    Time required for an update to spread through the system.
  6. Locality
    Portion of a working set a request needs access to.
  7. Computation
    Amount of math that needs to run on the data.
  8. Latency
    Expected time for transactions to return a success or failure.
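
As a rough illustration, the eight factors can be written down as a single record, which makes the "throughput times transaction size" estimate of data flow explicit. This is a minimal sketch: the class name, field names, units, and every number below are assumptions made up for the example, not measurements from a real system.

```python
from dataclasses import dataclass


@dataclass
class SystemProfile:
    """The eight factors; field names and units are illustrative assumptions."""
    working_set_bytes: int          # 1. working set size
    avg_transaction_bytes: int      # 2. average transaction size
    requests_per_sec: float         # 3. request rate (expected throughput)
    updates_per_sec: float          # 4. update rate
    consistency_window_sec: float   # 5. time for an update to spread
    locality_fraction: float        # 6. share of the working set one request touches
    compute_ops_per_request: float  # 7. computation
    latency_target_sec: float       # 8. latency

    def data_flow_bytes_per_sec(self) -> float:
        # Throughput times transaction size: most of the total data flow.
        return self.requests_per_sec * self.avg_transaction_bytes


# Hypothetical workload, chosen only to make the arithmetic easy to follow.
profile = SystemProfile(
    working_set_bytes=500 * 10**9,
    avg_transaction_bytes=4_000,
    requests_per_sec=20_000,
    updates_per_sec=2_000,
    consistency_window_sec=1.0,
    locality_fraction=0.001,
    compute_ops_per_request=1e5,
    latency_target_sec=0.05,
)
print(f"{profile.data_flow_bytes_per_sec() / 1e6:.0f} MB/s")  # 80 MB/s
```

Even a crude record like this gives you one number, here roughly 80 MB/s, to compare against what your network and disks can actually move.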

Identifying bottlenecks

After pinpointing these characteristics, it should be possible to determine the dominant operation responsible for data congestion. The answer might seem obvious, but identifying the true bottleneck gives you one core factor to focus on.

The pizzeria example

Let’s say you own a pizza shop and want to make more money. If there are long lines to order, you can double the number of registers. If the pizzas arrive late, you can work on developing a better rhythm. You might even try raising the oven temperature a bit. But fundamentally, a pizza shop’s bottleneck is the size of its oven. Even if you get everything else right, you won’t be able to move more pizzas per day without expanding your oven’s capacity or buying a second one.
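
The pizzeria argument can be sketched in a few lines: the shop's throughput is set by its slowest stage, so improving any other stage changes nothing. The stage names and hourly capacities below are hypothetical values chosen for illustration.

```python
# Hypothetical capacities for each stage of the shop, in pizzas per hour.
stage_capacity = {
    "registers": 60,
    "prep": 45,
    "oven": 24,
    "delivery": 40,
}


def bottleneck(capacities):
    """Return the (stage, capacity) pair with the lowest capacity."""
    return min(capacities.items(), key=lambda item: item[1])


stage, limit = bottleneck(stage_capacity)
print(stage, limit)  # oven 24

# Doubling the registers leaves the shop's throughput unchanged.
stage_capacity["registers"] *= 2
assert bottleneck(stage_capacity) == ("oven", 24)

# A second oven moves the bottleneck to the next-slowest stage.
stage_capacity["oven"] *= 2
print(bottleneck(stage_capacity))  # ('delivery', 40)
```

Note that fixing the true bottleneck does not remove the limit; it hands the limit to whichever stage is now slowest.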

If you can’t clearly see a fundamental bottleneck, change a constraint and see what shifts in response. What happens if you have to cut the latency requirement by 10x? Or halve the number of computers? What tricks could you get away with if you relaxed the constraint on consistency? It’s common to take the initial constraints as true and unmoving, but they rarely are. Creativity in the questions has more leverage than creativity in the answers.
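
One way to run this kind of what-if is to model each resource as a simple ratio of demand to supply, then change a constraint and watch which ratio moves toward 1.0 first. The sketch below is an assumption-laden toy: the function name, the per-machine capacities, and the workload figures are all invented for the example.

```python
def utilization(machines, *, working_set_gb, flow_mb_s, ops_per_sec,
                ram_gb=64, net_mb_s=100, cpu_ops_per_sec=2e6):
    """Fraction of each cluster resource consumed; above 1.0 means overloaded.

    The per-machine capacities are hypothetical defaults, not recommendations.
    """
    return {
        "memory": working_set_gb / (machines * ram_gb),
        "network": flow_mb_s / (machines * net_mb_s),
        "cpu": ops_per_sec / (machines * cpu_ops_per_sec),
    }


# Hypothetical workload: 500 GB working set, 80 MB/s of data flow, 4M ops/s.
workload = dict(working_set_gb=500, flow_mb_s=80, ops_per_sec=4e6)

# Halve the number of computers repeatedly and see which resource is tightest.
for machines in (16, 8, 4):
    usage = utilization(machines, **workload)
    tightest = max(usage, key=usage.get)
    print(f"{machines} machines: {tightest} at {usage[tightest]:.0%}")
```

In this made-up workload memory is the tightest resource at every cluster size and is overcommitted at four machines, which suggests the working set, not throughput, is the constraint worth questioning.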

If you’re looking to build a well-designed computing system, I contributed an in-depth article to InfoQ that provides use cases and real-world examples.