Last weekend I met up with some DBA friends for lunch. Try as you might, the conversation always seems to return to talking shop and soon we were discussing the differences between data partitioning and sharding.
There seems to always be a lot of conflicting opinions on this topic, so I felt it was worthy of a quick comparison.
Defining Database Sharding and Partitioning
Partitioning is a general term used to describe the breaking up of your logical data elements into multiple entities typically for the purpose of performance, availability, or maintainability.
Sharding is the equivalent of “horizontal partitioning.” When you shard a database, you create replications of the table schema, then divide what data is stored in each shard based on a shard key. For example, you might shard a material database using Material Id as a shard key – you could store ranges 0-20000 in one shard and 20001-40000 in a different shard. When choosing a shard key, you want to consider data-access patterns, query biased and storage space considerations to ensure you are distributing load and space across shards evenly.
“Vertical partitioning” involves creating tables with fewer columns and using additional tables to store the remaining columns. Normalization also involves this splitting of columns across tables, but vertical partitioning goes beyond that and partitions columns even when already normalized. For example, a customer might only have one credit account number on file, yet I might choose to put the credit account information into a separate table with a Customer Id reference so that I have the flexibility to move that information into a separate database or different security context, etc. Vertical Partitioning can also be used to help reduce IO. Assume we have a table with a lot of data such as a material master. Only about 5% of the columns are normally queried such as material number and description. Assuming the frequent queries will lead to clustered index scans, we want to minimize the number of IO operations, so we want to make the row as small as possible. Infrequently queried columns such as material characteristics can be vertically partitioned. A partition is a physically separate file that comprises a subset of rows of a logical file which occupies the same CPU, memory and storage node as its peer partitions.
A shard is a physical computer node comprised of a CPU, memory, and storage. A shard’s schema (and integrity constraints) may be replicated across as many shards as needed. Shards may contain non-partitioned and partitioned tables.
By using shards and partitions together we now effectively have two keys which we can use to logically chunk out the data. Choosing those keys depends on the query biases of the primary applications reading and writing data.
Real world examples
Social Media Example
A social media site could shard its data by User Key with user data residing on different data nodes. But within those data nodes, they could also partition data based on Create Date where items posted could be broken down by week, month or other appropriate time slice. This sharding and partitioning scheme would make sense for a social media site because of the normal query biases – people normally look at the information that pertains to them. The data for a person can live on the same shard even if some of that data is replicated from other user shards.
Supply Chain Example
The online tracking system of a package delivery company, such as UPS or FedEx, might shard its data by active tracking number with groups of tracking numbers residing on different data nodes. People will typically query tracking numbers that are in transit and are looking for package movement information. So the shard would contain the tracking number and transit tracking information. The billing information such as sender address might all reside in one or two shards since it would be accessed infrequently.
Sharding and partitioning are powerful additions to the dba tool kit which are certainly worth consideration.