Compiling Ideas
Compiling Ideas Podcast
Building a Custom Database System from Scratch
0:00
-57:39

Building a Custom Database System from Scratch

Building a database from scratch is a multi-faceted engineering journey, touching on storage engines, indexing data structures, network protocols, and distributed algorithms. This article distills the key components of a database system — from how data is stored on disk (row-oriented vs. column-oriented layouts) to how queries find that data quickly (indexes like B-trees, LSM trees, geospatial structures, etc.), and onward to the complexities of scaling out (replication strategies, sharding/partitioning schemes, rebalancing data across nodes, routing requests in a cluster, and maintaining consistency via consensus algorithms). The takeaway is a deep appreciation for the trade-offs and design decisions involved at each layer. By understanding these internals, engineers gain insight into why databases behave the way they do and how to tailor a custom system to specific needs — or simply to become a power user of existing systems.

Introduction

Why would anyone build a database from scratch, given the abundance of battle-tested databases available? The reasons range from education (to truly understand database internals) to innovation (to meet specialized requirements not handled by off-the-shelf systems). Imagine needing a high-performance time-series database for a novel hardware device, or an embeddable database with custom on-disk formats — sometimes building your own is the only way to get exactly what you need. In any case, designing a database is an enlightening exercise in computer science and software engineering. It forces you to confront fundamental challenges in data representation, concurrency, fault tolerance, and distributed consistency. This article acts as a guide, as if a seasoned professor were walking you through the major design decisions and components of a database system. We’ll go deep into each critical part, maintaining rigor without glossing over hard parts, to illuminate what it really takes to create a custom database from scratch.

Storage Models: Row-Oriented vs. Columnar

One of the first decisions in building a database is how to lay out data in storage. The two classic models are row-oriented (row store) and column-oriented (column store). In a row-oriented design, each row’s fields are stored contiguously, meaning all the data for a single record sits next to each other on disk or in memory. This is the traditional layout used by relational databases like MySQL and Postgres, and it’s optimized for transactional workloads — fast reading or writing of whole records (e.g. fetching or inserting an entire user record). By contrast, a column-oriented layout groups together values from the same column for all rows. For example, if you have a table with columns (Name, City, Sales), a column store might physically store all the names in one segment, all the cities in another, and so on. This approach is powerful for analytical queries that perform aggregate operations on many rows but only a few columns — since the database can scan just the relevant columns without touching entire row objects.

Why it matters: Row vs. column storage has profound performance implications. Row stores excel at OLTP (Online Transaction Processing) scenarios where you frequently read or write individual records and need all their fields at once (e.g. updating one user’s profile). Appending a new record is as simple as writing a new row to disk, and reading a record brings in all its fields in one IO swoop. However, row stores are less efficient for OLAP (Online Analytical Processing) queries that scan large portions of the dataset but only for a subset of columns — think of summing all sales figures, or computing an average on one field across millions of rows. In those cases, a columnar store shines: it can read the Sales column in a tight, contiguous block and skip all other data, making memory usage and CPU cache utilization much more efficient. Column stores also compress data better (each column often has homogeneous data, ideal for compression algorithms), which further speeds up large scans. The downside is that writing a new record in a pure column store means updating multiple separate locations (one per column), which is slower for single-row operations. As a database builder, you might even choose a hybrid: some modern systems use a row store for recent data (for fast writes) and a column store for older data (for efficient analytics), or support both modes. Understanding your target use case is crucial: if you need fast transactions, lean toward a row-oriented design; if you need fast analytics on big data, a columnar format could be worth the complexity.

Indexing Structures: B-Trees, LSM Trees, Geospatial Indexes, and Skiplists

Efficient data access in a database almost always relies on indexes. An index is an auxiliary data structure that allows the database to quickly locate the records that satisfy a query, rather than scanning every record. The choice of indexing structure will shape your database’s read/write performance characteristics and its capabilities. Let’s explore some common index structures and where they fit in:

  • B-Tree Indexes: The B-Tree (and its variants like B+Tree) is a balanced search tree optimized for block storage (disks or SSDs). B-Trees keep keys in sorted order and ensure that the tree’s height is logarithmic in the number of entries, so lookups, insertions, and deletions can all be done in O(log n) time. Crucially, B-trees are designed to minimize disk I/O: each node can contain many keys (tuning the “branching factor” or node size to match the disk page size), which means a search touches only a few nodes (disk pages) even for millions of records. Most relational databases use B-tree indexes for a wide range of queries, especially those involving key lookups or range scans on sorted data (e.g. “find all users with last name between ‘Johnson’ and ‘Jones’”). B-trees have consistent read/write performance and are a great general-purpose index. If you implement a B-tree in your custom database, you’ll need to handle splitting and merging tree nodes as entries grow or shrink, ensure the tree stays balanced, and manage concurrency (e.g. latches or locks on tree nodes) if you allow concurrent access. It’s non-trivial, but B-trees are a time-tested foundation for database indexing.

  • LSM Trees: The Log-Structured Merge-Tree takes a different approach that favors write-heavy workloads. Systems like Cassandra, RocksDB, and LevelDB use LSM trees internally. The idea is to accumulate writes in an in-memory sorted structure (often a skiplist or binary tree) and periodically flush sequential runs of data to disk (into files known as SSTables), merging those sorted runs in the background. This turns random writes into sequential writes on disk, which is much faster on HDDs and even SSDs. The trade-off is that reads can be slower (because a given key might be present in multiple sorted files and memory, requiring a search through each) unless mitigated by bloom filters or partitioned indexes. LSM trees excel for scenarios with high write throughput or where data arrives in streams. Implementing an LSM-tree means dealing with components like a memtable (the in-memory structure for new writes, often implemented as a skiplist for quick sorted insert), SSTables (append-only sorted files on disk), and a compaction process that merges and re-sorts data files to keep read performance in check. It’s more complex than a B-tree, but very powerful for certain workloads (e.g. time-series inserts, logging, or IoT data).

  • Geospatial and Other Specialized Indexes: Sometimes your data isn’t just one-dimensional (like a numeric key) but multi-dimensional — for example, geographic coordinates (latitude, longitude) for location data, or complex data types like vectors. Geospatial indexes like R-Trees are designed to handle multi-dimensional range queries (e.g. “find all points within this bounding box”) efficiently. R-trees partition space into rectangles and are often used in GIS systems or features like MySQL’s spatial extensions and PostGIS. Another example is an inverted index for text search, which maps words to lists of documents (this underlies search engines and features like MySQL’s FULLTEXT indexes or Elasticsearch’s core). These specialized indexes might not be needed in every custom database, but it’s worth knowing that general-purpose structures (B-trees, hash tables) might not suffice for certain queries. If you plan for full-text search, consider an inverted index; for geospatial, consider an R-tree or geo-hash based index, etc. You might even integrate an existing library for these rather than writing from scratch.

  • Skiplists and Hash Indexes: A skiplist is a probabilistic data structure that maintains multiple layers of linked lists to achieve O(log n) search time, serving as an alternative to balanced trees. Skiplists are simpler to implement than B-trees and have excellent in-memory performance; in fact, Redis uses skiplists for its sorted set implementation. In a custom database, you might use a skiplist for in-memory indexing (like the memtable in an LSM engine) or for smaller datasets entirely in memory. Hash indexes (using hash tables) are another structure — they excel at point queries (exact matches) but don’t support range scans. Many databases use hash indexes for lookup by key, but one must be mindful of hash collisions and the lack of ordering (you can’t efficiently get the “next” key from a hash index). In summary, a well-rounded database often ends up using multiple index types: B-trees or LSM for general data, plus maybe specialized ones (geospatial, text) as optional add-ons. The art of building a DB is picking the right index for the job, or even allowing the user (or query optimizer) to choose index types per table.

Takeaway: Indexes are what make queries fast. If you forego them, your custom database will end up “full-table-scanning” and performing poorly on all but the tiniest data. But every index you add comes with write overhead and storage cost, so part of the design is balancing which indexes to build and maintain. Many modern systems choose LSM trees for write-heavy workloads (trading some read cost) or B-trees for mixed read/write workloads. Geospatial and text search require completely different structures to be efficient. As a database designer, you have to study your expected usage pattern and maybe implement a handful of core index structures. The good news is that decades of research provide a blueprint — you don’t need to invent a new data structure to build a database (unless you’re aiming for research novelty), but you do need to implement and tune these structures carefully.

Query Interfaces: Network Protocols and API Design

Once you can store data and retrieve it quickly on a single node, you need to expose it to users (or application developers). Designing the query interface is another crucial aspect of building a database. This means deciding how clients will connect to your database and what language or protocol they’ll use to query and modify data.

The most powerful (and complex) approach is to implement a query language, like SQL. SQL (Structured Query Language) is the gold standard for relational databases — if you implement even a subset of SQL, clients could express rich queries (joins, filters, aggregations) and use existing tools to interact with your database. However, writing an SQL parser, planner, and executor is a monumental task on its own. Many custom or niche databases forego full SQL in favor of simpler APIs. For example, you might design a key-value store interface (get/set by key), or a document-oriented API (storing JSON objects with certain query patterns), or a graph query API, depending on your data model. The interface should be tailored to the data model and use cases of your system — there’s no point in a complex SQL engine if your database is meant for simple key-value lookups or time-series appends.

Equally important is the network protocol. Clients need to talk to your database over some channel, typically TCP/IP. You might create a custom binary protocol (many databases do this for efficiency — e.g., PostgreSQL has its own wire protocol, as do Cassandra, MongoDB, etc.), or you could use a simpler approach like RESTful HTTP+JSON for queries. A binary protocol is more work to implement but can be much more efficient (saving bandwidth and parsing cost). For instance, a client library could send a binary-encoded request that says “GET key123 from table X” and receive back a packed binary response. Alternatively, a REST API might represent that request as an HTTP GET to a URL like /table/X/key/123, returning the result as JSON. The REST approach is human-friendly and leverages standard HTTP servers and libraries, but it typically incurs more overhead (HTTP headers, JSON serialization) and might not be suitable for high-throughput systems.

When building a database from scratch, a pragmatic approach for the interface is: start simple and iterate. You could begin with a basic REPL console or command-line where you accept simple commands (many early-stage databases have a crude text protocol). Then you might implement a more structured API or an actual client library in some language. Remember that if your database is intended to be embedded (like SQLite), the “query interface” might just be function calls in a library, not a network service at all.

If you do choose to implement SQL or a custom query language, you’ll need a query planner/optimizer to parse the query, figure out which indexes or operations to use, and an execution engine to run the plan. This involves a lot: lexing, parsing, building an AST (abstract syntax tree), transforming it into a relational algebra or execution plan, and then executing (e.g., performing joins, scans, aggregations). Each of those could be an article (or a career!) in itself. Many custom databases avoid reinventing SQL and instead provide a narrower interface that’s easier to implement — for example, a time-series database might only allow querying data by time range and predefined aggregation functions, which is much simpler than full SQL.

Finally, consider client drivers. If you want developers to use your database, providing libraries (in languages like Python, Java, etc.) that hide the raw networking and present a convenient API will make adoption easier. These drivers speak your wire protocol under the hood. Alternatively, embracing an existing protocol (for example, some databases are MySQL-protocol compatible, so they can be used with MySQL client libraries) can bootstrap you into an ecosystem at the cost of conforming to that protocol’s expectations.

In summary, the query interface is the “face” of your database. It should be designed with usability, performance, and future-proofing in mind. A simple key-value API might suffice for an internal tool, whereas a public-facing database might need the expressiveness of SQL or the accessibility of a REST API. Whatever you choose, make sure the network interactions are efficient — for example, support pipelining or batching of requests if using a custom protocol, so clients aren’t stuck waiting on round-trip latency for each request. And think about authentication, access control, and encryption (TLS) on the wire — these are critical for any real deployment.

Replication Strategies: Single-Leader, Multi-Leader, and Leaderless

If you want your database to scale beyond a single node or to be fault-tolerant, replication is essential. Replication means keeping copies of data on multiple nodes (servers) so that if one goes down, others still have the data, and so that read load (and sometimes write load) can be spread across nodes. There are several replication architectures to choose from, each with its own trade-offs in terms of consistency, availability, and complexity:

  • Single-Leader Replication: Also known as master-slave or primary-secondary, this approach designates one node as the leader (master) that handles all writes. That leader propagates the write updates to one or more follower nodes (slaves) which apply the changes. All reads can be served either by the leader or by any of the up-to-date followers (often reads are spread out to followers to scale read throughput). The big advantage here is simplicity: with one leader, you avoid write conflicts (no concurrent writes to the same data from different nodes) and the data changes can be applied in a defined order. If the leader fails, the system can fail over (elect a new leader) using a consensus algorithm (more on that later), during which time writes are briefly halted. Most traditional relational databases (PostgreSQL, MySQL, etc.) use single-leader replication for clustering: you can only write to the primary, but you can distribute reads to secondaries. As a custom database builder, implementing single-leader replication involves: choosing a replication log format (binary log of changes), deciding on synchronous vs asynchronous replication (synchronous means a write is not acknowledged until at least one follower has it; asynchronous means the leader can ack immediately and followers catch up later), and handling failover (detecting a dead leader and promoting a follower). If done asynchronously, you get eventual consistency — followers might lag a bit behind the leader, but catch up eventually.

  • Multi-Leader Replication: In this setup, you allow multiple nodes to accept writes (masters). The leaders must then sync their changes with each other. This can be useful for geographically distributed databases (where each data center has a local writable replica) or for improving write throughput (multiple leaders handling different parts of the workload). The challenge is conflict resolution: if two leaders accept writes on the same data concurrently (e.g., user 123 is updated in US and Europe at the “same” time), you need a strategy to resolve inconsistencies. Common conflict resolution strategies include “last write wins” (simple but can lose data), version vectors (each write carries a version and you keep multiple versions if conflict, as in Amazon’s Dynamo), or application-defined resolution (e.g., sum up two counters). Multi-leader (also called multi-master) is more complex to implement. You need to ensure that every leader’s changes eventually propagate to all other leaders (often done with an all-to-all replication mesh or via a global transaction log). Also, writing becomes less linear — there’s no single serial order of writes unless you impose one via a separate mechanism. Some modern systems (like PostgreSQL with bi-directional replication, or certain multi-region database services) offer multi-leader, but they often caution that conflicts are the application’s responsibility to resolve. If you implement this, you’ll need to build a conflict resolution scheme into your database or provide hooks for the application to resolve them.

  • Leaderless Replication: This is the model used by systems like Cassandra and Amazon’s Dynamo. There is no single primary node; instead, clients can write to any replica node, and the system will propagate the writes in the background. Consistency is maintained through quorums and read repair. For example, in Cassandra, you might have a replication factor of 3. A client write will go to, say, 2 of the 3 replicas (the write quorum), and reads will query 2 of 3 and compare results (the read quorum). As long as the sets overlap, you can detect the latest write. There’s no leader election, which means higher availability for writes (no single point of failure), but the trade-off is that it’s eventually consistent — reads might temporarily see out-of-date data if not all replicas have converged. Leaderless systems often use techniques like hinted handoff (if a replica is down, others keep the update for it and deliver later) and anti-entropy (background processes to reconcile divergent replicas). Implementing leaderless replication is advanced: you’ll be dealing with vector clocks or timestamps to resolve concurrent writes (like “last write wins” based on timestamps or merging via CRDTs in some cases). You also have to build read-repair or background consistency jobs to ensure that eventually all replicas have the latest data. This model favors availability (writes can always succeed to some replica) at the cost of sometimes reading stale data — unless the client asks for strong consistency by reading a majority of replicas.

Choosing a replication strategy in your custom database depends on CAP considerations (the famous CAP theorem). Single-leader tends toward strong consistency (at least with a failover downtime or synchronous replication) at the cost of availability during leader failover. Multi-leader and leaderless lean towards better availability and partition tolerance, at the cost of weaker consistency or more complex conflict handling. It’s worth noting you can also implement configurable consistency: for instance, Cassandra allows configuring the required number of replicas that must acknowledge a write or respond to a read (the R/W quorum numbers). As a database designer, you might allow tuning how many replicas constitute a quorum, giving users control over the consistency/availability trade-off.

In practice, a lot of distributed databases start with single-leader because it’s simpler, and only venture to multi-leader or leaderless if necessary for the product’s goals. If you do go for replication, plan how to monitor replication lag, how to recover a failed node (catch it up with the others), and how to handle a split-brain (network partition that isolates two halves of the cluster, each potentially thinking it’s the leader). This is where the next topic, partitioning and then consensus, come into play, because managing a cluster of nodes introduces many new challenges.

Partitioning: Hash vs. Range Sharding and Consistent Hashing

No single machine can hold infinite data or serve infinite queries with low latency. Partitioning (also called sharding) is the technique of splitting your database into pieces that can be distributed across multiple nodes. Each node handles a subset of the data (and associated queries). The goal is to scale out horizontally — more nodes can store more data and handle more load. However, partitioning introduces a new set of design decisions: how to divide the data, how to locate which node has what data, and how to rebalance when the cluster grows or shrinks.

Two common partitioning schemes are range-based sharding and hash-based sharding:

  • Range-Based Sharding: You assign continuous ranges of the data key space to different shards. For example, if your primary key is an integer or a string, you might say “IDs starting with A–G go to shard 1, H–N to shard 2, and so on”. This keeps data with similar keys together, which is great for range queries: if a user asks for all records between ID 100 and 200, the system might only need to talk to one shard (the one that covers that range). Many relational systems use range partitioning (e.g., splitting a table by date ranges, or alphabetically by some key prefix). The downside is the potential for skew: if the data isn’t uniform, some shards might end up with much more data or load than others. For instance, if shard 1 handles “A-G” and most of your customers have names starting with A or B, shard 1 will be hot and large, while other shards are underutilized. You can mitigate this by careful choice of ranges or by splitting shards that become too large (as the PlanetScale example noted, you might reshard if one range grows faster than others). Another challenge: if you don’t know the data distribution upfront, you might start with one big range on one shard and have to dynamically split it as data grows — this can cause a lot of data movement at once. Despite these challenges, range sharding is straightforward and works well when queries need ordering. Many systems that care about range queries (like ordered scans) prefer this approach.

  • Hash-Based Sharding: Instead of ranges, you can shard by hashing the key. You pick a hash function (like MD5 or SHA or even a modulo operation) to map each record’s key to a pseudo-random number, then assign portions of the hash space to different shards. The appeal is that if the hash function is good, it will evenly distribute data across shards (avoiding the hotspot problem of naive ranges). For example, a simple modulo sharding: shard_number = hash(key) mod N (where N is number of shards). This ensures a pretty even distribution if keys are random enough. Hash sharding makes it harder to do range queries (because adjacent keys likely reside on different shards), but it balances load nicely. If your access pattern is mostly key-based lookups (e.g., get user by ID) and not range scans, hashing is often the way to go. A drawback: adding or removing a shard in naive modulo schemes is painful, because it changes the N and effectively remaps most keys to new shards (lots of data movement). This is where consistent hashing comes in as an improvement.

  • Consistent Hashing: Consistent hashing is a technique to make hash-based partitioning more flexible when scaling cluster size. The idea (originating from the Dynamo system) is to map both nodes and keys into the same hash space (often visualized as a ring). Each node owns the keys in the segment of the ring from its position to the next node’s position. When a node is added or removed, you don’t remap everything — you only remap the keys that fall into the segments that changed. In practice, consistent hashing often involves creating many virtual “buckets” on the ring (each node gets several buckets to even out distribution) and a lookup mechanism to find for a given key, what’s the next bucket/node clockwise on the ring. The result is that adding a new node only steals a subset of keys (those in the new node’s range) from existing nodes, and similarly removing a node only causes its keys to be scattered to others. This minimizes rebalancing work. The trade-off is that consistent hashing by itself doesn’t preserve any key order locality (it’s essentially randomizing keys onto nodes), so range queries are hard (you might have to query many nodes for a range). However, it’s excellent for cache systems or high-scalability key-value stores where each access is independent. Consistent hashing was popularized by Dynamo and is used in systems like Cassandra, Riak, and many distributed caches.

When implementing partitioning in your database, you’ll also need a way to keep track of the mapping (which node has which keys). In range sharding, this could be as simple as a configuration table that lists the ranges for each shard. In hash or consistent hashing, it might be an algorithm (like you compute the location, possibly with an in-memory table of node positions for consistent hash). Some databases employ a directory service or config server — a central place clients or nodes can ask, “where should I send this key?”. Others use a formula so each node can independently compute the target shard.

Consistent hashing vs. fixed sharding: If you have a fixed number of shards and don’t expect to change often, a modulo hash might be fine. If you plan on growing the cluster dynamically, consistent hashing or a similar technique is very useful to avoid massive data reshuffles. For example, YugabyteDB (a distributed SQL database) uses a variant of consistent hashing where they break data into many tiny “tablets” upfront (e.g., 10,000) and then just move whole tablets around when scaling — a similar idea to avoid big moves.

To ground this: imagine you’re building a multi-node key-value store. If you expect users will often scan ranges of keys in sorted order, you might choose range partitioning on the key (like “a to e on node1, f to j on node2, …”). If you expect purely random access, you might do hash partitioning. You could also do composite approaches, like range-sharding on a primary key and hash-sharding within each range (to avoid hotspots). Partitioning is a big design space. The main point is you have to decide how to split, and that decision will reflect the query patterns you want to optimize for.

Rebalancing: Moving Data When Partitions Change

Partitioning solves the distribution of data, but what happens when the partition layout needs to change? Perhaps you added a new node to the cluster to handle more data, or a node died and its data needs to be redistributed to remaining nodes. Rebalancing is the process of moving data between nodes to restore a balanced state after such changes.

Consider a simple scenario: you have 4 nodes, each holding roughly 25% of the data. Now you add a 5th node. Ideally, you want each to end up with ~20% of the data. That means some data currently on nodes 1–4 must migrate to node 5. How do we do this efficiently and without downtime? The exact method depends on your partitioning scheme:

  • If you used range sharding, adding a node usually means splitting one of the existing ranges (or several) and transferring the ownership of some ranges to the new node. For example, if node 4 held keys from “M” to “Z” and you add a new node, you might decide the new node will take over keys from “T” to “Z”. The system would then move all records with keys >= “T” from node 4 to the new node. During this process, you might have to double-write (send new updates to both old and new until fully switched) or briefly lock that range from writes. A careful design can do rebalancing online, but it’s complex — you often need to coordinate so that queries during the move still find the data (maybe by looking at both old and new locations).

  • With hash/consistent hashing, adding a node automatically defines which key ranges (on the hash ring) that node will now own. For instance, node5 might fall between node4 and node1 on the ring, so it takes a slice of keys from what node1 used to handle. The database then needs to copy all those keys to node5. Ideally, you do this while still serving queries: one approach is to stream data in the background and at some point “flip a switch” to make node5 the owner for new requests of that slice. Consistent hashing has the advantage that you’re only moving keys from a portion of one (or a few) nodes, rather than from all nodes.

  • If a node fails, rebalancing is about restoring replicas or moving data off the failed node’s shard. In a single-leader replicated system, if a secondary fails, it’s not urgent (the data is elsewhere), you just arrange a new replica. If the leader fails, a new leader is elected and you might create a new replica to replace the lost one. In a sharded system (without replication), if one shard’s node died, that data is unavailable until recovered from backup or replicated copy. Many systems will replicate each partition to multiple nodes precisely so that rebalancing on failure just means failing over to a replica and maybe creating a new replica elsewhere.

The process of rebalancing typically involves data transfer (which can be heavy — imagine terabytes moving over the network). A well-designed system will throttle rebalancing traffic to avoid overloading the cluster or impacting client operations too much. Some systems perform rebalancing gradually and even continuously (always making tiny adjustments to keep things balanced). Others require a manual trigger or a maintenance window.

One strategy to reduce rebalancing pain is using virtual shards (like the tablet concept earlier, or many small hash buckets). If you have, say, 1000 small partitions and 4 nodes, each node has ~250. When you add a node, you can just move ~200 partitions to the new node (so each has ~200). Because partitions are granular, you didn’t have to split any in the middle, just relocated whole chunks. This limits the metadata and complexity of each move. Systems like Azure Cosmos DB and others use this idea of many tiny partitions and a map of partition->node.

From an implementation perspective, to rebalance you’ll likely need an internal mechanism: maybe a background thread or a management utility that can pause incoming writes for a partition, copy its data to a new node, and then update a partition map. Or, if you can’t pause long (in a live system), you might do a more complex dance: copy data while tracking changes (like copying in bulk then sending over any new writes that occurred during the copy window), similar to how table migrations in databases work.

Rebalancing also touches on whether your system is elastic. If users expect to add nodes on the fly to scale, you must have smooth rebalancing. If your cluster is relatively static (like fixed at 3 nodes), you might not implement automated rebalancing at all initially.

To summarize: partitioning decides initial data distribution; rebalancing deals with changing that distribution. It’s one of the harder parts of a distributed database — done poorly, it can cause downtime or massive performance hiccups. But it’s critical for long-running systems as data growth or hardware changes are inevitable. At the very least, your custom database should have a plan for how to redistribute data if a node is added, removed, or fails, even if the first version of your system requires some manual steps to do it.

Request Routing in a Distributed Cluster

When your data is partitioned and replicated across many nodes, a fundamental question arises: How does a client query reach the right node that has the data it needs? This is the routing problem. In a single-node database, the client always connects to that one node. In a distributed database, we have options for routing, typically falling into a few patterns:

  1. Client-Side Routing (Smart Clients): Here the application or client library is aware of the cluster topology and partitioning. The client knows exactly which node is responsible for the data it needs, and it directly connects or sends the request to that node. For example, Apache Kafka clients maintain metadata about which broker hosts which partition of a topic, and they send messages directly to the correct broker. This approach can be very efficient (no extra hops), but requires the clients to be fairly sophisticated — they must keep up with cluster membership changes and partition map updates. If you’re building a database with an officially supported client library, you can embed this logic there. If you expect users to connect via generic tools or simple interfaces, client-side routing might be less practical.

  2. Routing Tier or Proxy: In this model, clients don’t need to know about multiple nodes; they connect to a proxy server or a routing service, which in turn figures out where to send each request. The proxy holds the cluster partition map and forwards client requests to the appropriate database node. MongoDB, for instance, uses a routing process called mongos in sharded setups – clients connect to mongos as if it were a single MongoDB, and mongos dispatches queries to the right shard(s) and combines results if needed. Similarly, in Redis clustering, you can use an upstream proxy (like Twemproxy) that splits commands to the right shards. The advantage here is simplicity for the client and centralized control – the proxy can handle things like caching of the map, retries, maybe even load-balancing. The downside is the proxy can become a bottleneck or single point of failure (though you can run multiple proxies). Also, it adds one more network hop for every request which can add latency.

  3. Server-Side Routing (Mesh): In this approach, client connects to any one node of the database cluster, and that node itself will route or coordinate the request to the correct node that has the data. There are two sub-patterns here: (a) request forwarding, where the receiving node forwards the request internally to the owner node and maybe returns the result back to client (so the client is unaware of the internal forwarding); or (b) redirection, where the node tells the client “Actually, you should ask node X for that” (like an HTTP redirect, but for DB protocol). Cassandra uses a forwarding model: any node can accept a request for any key and will forward it to the coordinator for that key (often the node itself acts as the coordinator, merging results from replicas). Redis Cluster uses redirection: if you send a command to the wrong shard, it returns a MOVE or ASK error with the address of the correct shard, and then the client (if smart or built-in) retries to the given node. Server-side routing makes the system very transparent to clients (they can connect anywhere), and it avoids a single choke-point (every node can act as a router). But it means every node needs to at least have some knowledge of the partition map (to know where to forward to), and it can complicate the client slightly in the redirect case.

So, which to choose for your database? If you’re building a closed system with provided client libraries, client-side routing can be nice for efficiency. If you want a simple out-of-the-box experience, a proxy or server-side routing might be better. Often, systems evolve: you might start with a simple proxy (easier to implement — just concentrate routing logic in one place), and later optimize to let clients or servers route more directly.

Implementation notes: If using client-side or proxy, you’ll need a reliable way to distribute and update the partition map. This could be as simple as a config file, or a special query (“get cluster map”) that clients call at startup, or using an external service like ZooKeeper/etcd (some systems store cluster metadata in a consensus-backed service, so clients can fetch it). If using server-side forwarding, nodes on start-up need to know or discover the topology — this could be via a gossip protocol (nodes gossip who is responsible for what), or also via a central service.

Routing also interacts with replication. If a data item has multiple replicas, which one do you route to? A common approach is to designate one replica as the “leader” for that item and route to that (ensuring consistent reads), or route reads to nearest replica (for speed, at risk of stale data). In a single-leader setup, that naturally maps: route all writes (and maybe reads) to the leader of the partition. In a leaderless setup, you might route writes to whichever node (or multiple nodes) and let the consistency protocol handle it, and route reads either to a coordinator or to multiple nodes for quorum. These considerations mean your routing layer might also need to be aware of replication roles, not just partition ownership.

In summary, request routing is the glue that connects the client to the correct node in a distributed database. It can be implemented in the client, in a middle layer, or in the nodes themselves. Each has pros and cons, but the ultimate goal is the same: given a key or query, find out which partition or node should handle it, and get it there with minimal overhead. A well-designed routing mechanism is key to the cluster’s transparency and performance — the best systems make sharding almost invisible to the user, aside from increased capacity.

Consensus Algorithms: Raft, Paxos, and ZooKeeper for Coordination

When you move to a distributed database with replication and partitioning, you quickly encounter the need for coordination: electing leaders, agreeing on configuration changes, and generally ensuring that multiple nodes agree on certain critical pieces of state. This is where consensus algorithms come in. Consensus means that nodes (which may fail or have network issues) can still reach agreement on some value or action. It’s an extensively studied problem in distributed systems, and the foundational algorithms are Paxos and more recently Raft, with systems like ZooKeeper providing a practical implementation for use.

  • Paxos: Paxos is a family of protocols (originating from Leslie Lamport’s work) for achieving consensus on a single value (or a sequence of values, in multi-Paxos) in a network that can drop or delay messages. Paxos guarantees that even if some nodes fail, as long as a majority of nodes are functioning and communicable, they will agree on the same value. In a database context, Paxos could be used to agree on who is the leader of a shard, or to agree on the order of transactions (as in Google Spanner, which uses Paxos for cross-node transaction ordering). Paxos is famously non-trivial to understand and implement correctly, but its key property is safety — it won’t produce conflicting results — and it will make progress if a majority of nodes are working. Conceptually, Paxos has roles like proposers, acceptors, learners; proposals go through rounds until one is chosen. Many systems instead use an implementation or a library rather than coding Paxos from scratch, unless consensus is their core innovation. Paxos ensures a single value is agreed upon even in the presence of failures— you can think of it as a rigorous voting algorithm that nodes use to pick, say, “Node 3 will be the leader for partition 5”.

  • Raft: Raft emerged around 2013 as an alternative to Paxos that is designed to be easier to understand and implement. It’s a consensus algorithm that, like Paxos, achieves agreement on a log of operations (usually) among distributed nodes. Raft is often described in terms of a leader that replicates log entries to followers — it uses heartbeats and randomized timers to elect a leader, and then that leader takes client commands, appends them to its log, and replicates to followers, waiting for a majority to acknowledge (commit) before considering an entry committed. If the leader fails, the others hold an election for a new leader. Raft’s claim to fame is understandability; it’s structured in relatively modular pieces (leader election, log replication, safety) and many open-source implementations exist (etcd’s consensus uses Raft, for example). For your custom database, using Raft could solve a lot of problems: you could embed a Raft library to manage a replicated write-ahead log among nodes, which might handle both replication and failover. The advantage is that Raft is a ready blueprint for building a strongly consistent replication (many “NewSQL” distributed databases use Raft to replicate their state machine so they appear as a single consistent node to outside). Its main disadvantage might be performance at very large scale — Paxos and Raft both require a majority of nodes to respond to commit an operation, so if you have a large cluster split into many small consensus groups (e.g., per shard) it can work well, but each group typically is like 3–5 nodes. Scaling consensus to hundreds of nodes usually means partitioning into many independent consensus groups.

  • ZooKeeper / ZAB: Apache ZooKeeper is not an algorithm per se but a distributed coordination service that many systems use. Under the hood, ZooKeeper uses an algorithm called ZAB (ZooKeeper Atomic Broadcast), which is similar to Paxos in that it orders updates in a fault-tolerant way. ZooKeeper exposes a higher-level interface: essentially a small hierarchical key-value store with strong consistency. Systems use ZooKeeper to store configuration, leader election signals, locks, etc. For example, in older Hadoop and HBase, ZooKeeper would hold which node is the primary, what nodes exist in the cluster, etc. Rather than implementing consensus inside your database, one approach is to use ZooKeeper or etcd as an external brain: whenever you need to elect a leader or agree on membership, you let ZooKeeper handle it (by writing to a znode and having others watch it). ZooKeeper’s ZAB ensures total order broadcast, meaning all nodes apply changes in the same order, and it’s tailored to configuration management (fast reads, ordered writes). In the context of building your own DB, using ZooKeeper can offload the hard parts of consensus — you’d run a ZooKeeper cluster and use its API to, say, coordinate sharding info or elect a primary for replication. The downside is an external dependency and some latency overhead in using it.

Why do you need consensus at all? Consider leader election: in a single-leader replication, how do the nodes decide who should be leader if the original dies? They need to agree on one leader (split-brain with two masters would be bad). That is effectively a consensus problem — they are choosing one node as the value. Another scenario: consistent hashing ring membership — if one node thinks the ring has 5 nodes and another thinks 6, you get chaos. So you often maintain cluster membership through a consensus mechanism so that all nodes have a consistent view of who is in the cluster and their roles. If you implement transactions that span nodes, you might need consensus for committing a transaction across them (though two-phase commit is another mechanism, it has issues unless layered on consensus for failure handling).

Implementing Paxos from scratch is notoriously difficult; Raft is easier, and there are libraries for it in many languages. If your goal is an academic exercise, you might attempt your own Raft. If the goal is a production system, you might lean on existing work (for instance, etcd is a production-grade key-value store that exposes a consensus-backed store — you can use that to build leader election or metadata storage for your cluster).

To wrap up: consensus algorithms are the secret sauce that allow a distributed database to function reliably. They ensure that when the world (network, hardware) is imperfect, your system’s critical decisions are made in a consistent manner. Whether you choose to build it in (via Raft) or rely on an external service (ZooKeeper/etcd), understanding the basics of Paxos/Raft will make you a better designer of the whole system. Paxos set the stage by showing how nodes can agree on one value even with failures; Raft made it practical to implement a replicated state machine with a leader-based approach (often favored for its simplicity and understandability); ZooKeeper provided a usable tool so you might not have to implement these from scratch at all. Many modern databases (CockroachDB, TiDB, etc.) use Raft internally for consistency. If you’re building a single-node or eventually-consistent system, you might avoid consensus initially, but as soon as you need coordination (e.g., failover without human intervention), you’re in consensus territory.

Designing and implementing a custom database from scratch is an enormous but rewarding challenge. We’ve walked through the core components — storage engines, indexing data for fast access, exposing query interfaces, replicating data across nodes for reliability, partitioning data to scale out, rebalancing as the system grows, routing requests in a cluster, and keeping everything consistent with consensus protocols. Each of these topics is deep in its own right; together, they form the anatomy of a database system.

The journey of building a database teaches profound lessons about trade-offs. Every design decision — row vs column storage, B-tree vs LSM, single-leader vs leaderless — comes with benefits and drawbacks. There is no one-size-fits-all: the “best” design depends on your specific use case and constraints. By understanding these internals, you gain an intuition for why, for example, one database excels at analytics and another at quick key-value lookups, or why one system guarantees consistency at the cost of availability while another does the opposite. Even if you never write your own database from scratch (and many engineers won’t, and don’t need to), knowing how they work under the hood makes you far better at using and tuning them.

Moreover, the field is ever-evolving. New hardware (like NVM memory or fast networks) and new demands (like geo-distributed data, or AI and big data workloads) keep database architecture a rich area of innovation. The fundamental challenges we discussed — efficient data access, distribution, consistency — remain central, but the solutions adapt over time. Thus, studying database internals is not just an academic exercise; it’s key to understanding the next wave of data engineering problems.

In closing, building a database system is like building a miniature cosmos of computing: it encompasses algorithms, data structures, networking, concurrency, and fault tolerance. It forces you to think about how data and commands flow through a system, and how to make that flow robust and fast. Whether you’re motivated by curiosity, the needs of a particular project, or the challenge of it, diving into database internals will sharpen your skills and appreciation for the systems that store and retrieve the world’s data every millisecond of every day. By tackling the design points outlined in this article, you’ll be well on your way to creating a custom database system that is not only functional but thoughtfully engineered. Good luck, and happy building!

Discussion about this episode

User's avatar

Ready for more?