A few years ago, I wrote about the pros and cons of data sharding and data partitioning based on real-world examples. But it recently occurred to me that many may still be unsure when to use database sharding techniques and the expected benefits. The below article addresses three key questions to ask before you begin any data sharding effort:
- Why shard a database?
- When to database shard?
- What are my options to consider instead of sharding?
Question 1: Why shard a database?
The primary appeal of sharding a database is that it can help facilitate horizontal scaling, also known as scaling out. Horizontal scaling is the practice of adding more machines to an existing stack to spread out the load. This allows for more traffic and faster processing times.
A relational database running on a single machine is relatively easy to scale up by simply upgrading its hardware and resources. At some point, however, a database will be limited in terms of storage and computing power when you simply cannot add more hardware. Therefore, having the capability to scale horizontally makes your design much more flexible.
Another frequent reason to utilize a sharded database architecture is to speed up query response times. Sometimes the only execution plan possible on queries may be a full table scan for certain tables, and adding indexes would be ineffective. For an application with a large database, queries can become prohibitively slow. By sharding one table into multiple shards, queries can be written so the full table scan is over a much smaller number of rows resulting in improved query performance.
Sharding can also help to make an application more reliable by mitigating the impact of outages. If your application or website relies on an un-sharded database, an outage has the potential to make the entire application unavailable. With a sharded database, an outage is likely to affect only a single shard, making only some parts of the application or data unavailable as opposed to the entire application.
Question 2: When to database shard?
To shard or not to shard…that is the question. Whether or not to implement a sharded database architecture is almost always a matter of debate. Some view sharding as the inevitable outcome for a database that reaches a certain size. Others, due to the complexity sharding adds, regard it as a last resort that should be avoided unless no other options exist.
Because of this added complexity, sharding is usually only performed when handling very large amounts of data or addressing query performance issues that cannot be addressed with other methods. Here are some common situations where sharding a database may be warranted:
- Query bias results in execution plans that cannot be modified by other methods.
- The amount of data exceeds the storage capacity of a single database node.
- Slow response times and timeouts due to:
- The volume of writes or reads exceeding the database capabilities.
- The network bandwidth required by the application outpaces the bandwidth available to a single database node and any read replicas.
Question 3: What are my options to consider instead of sharding?
The first difficulty that people encounter with sharding is the sheer complexity of properly implementing a sharded database architecture. If not done correctly, improper sharding can lead to lost data or corrupted tables.
Shards can become imbalanced over time as new data is added. The database would likely need to be repaired and re-sharded to allow for a more even data distribution.
It can be very difficult and time consuming to un-shard a database. Rebuilding the original un-sharded architecture would require merging the new partitioned data with the old backups or transforming the partitioned DB back into a single database.
Before implementing the complexity and extra maintenance associated with sharding, you will want to examine other options for optimizing your database:
- Utilize materialized views, query hints, and indexes
- There is no replacement for good SQL skills and writing queries properly.
- Data caching
- Many applications call the same 10% of queries over and over. Caching stores the data that has already been requested in memory, allowing the application to read it from memory when required later as opposed to performing a physical read from disk.
- Replicate to secondary databases
- This is another great strategy to improve read performance. It allows you to write to one database, replicate the database to secondary databases to be used for read only. This lessens the possibility of read locks and other blocks, as well as reducing contention for the same resources.
- Vertical scaling
- Vertical scaling or scaling up, is the process of upgrading the hardware of an existing server by adding more RAM or CPU. “Just throw more hardware at it” is sometimes the right choice.
If your database grows past a certain point, none of these strategies will be enough to improve performance on their own. In such cases, sharding may indeed be the best option for you.
For more information on how Profit Point can help, check out these pages: