One of the challenges with making changes in a database environment is that undoing those changes can be hard. What’s often preferred is rolling forward with a new change to correct the issue, but that’s often done with limited analysis and thought. Instead, we hope our staff makes a quick patch and a better decision under pressure than they did with more time to examine the problem. That works if it’s a simple mistake that was made in implementation but not if we haven’t designed our solution well at the start.
I ran across an article on DoorDash that I thought was interesting. During the pandemic, their business exploded and they outgrew their Aurora PostgreSQL database. They migrated to CockroachDB, a distributed SQL database that is compatible with the PostgreSQL wire protocol and can (theoretically) scale much higher.
The thing I found interesting is that the engineers at DoorDash were trying to break apart their monolith and get better scalability. They did this by extracting certain tables into their own clusters, each with a single writer, which should help them handle a larger workload. They wanted to use their main identity table as a test, which I assume is the table that tracks each user in the system. They tried to migrate this table and cut over to a new cluster four times before a fifth attempt worked.
I think any large migration is fraught with issues, but I appreciated the design here that allowed them to roll back their change and revert to the previous version of the database. That’s something I don’t see many teams think about or build into their database change process. I think having a clear, known, tested way to undo changes is important, at least for some of your tables.
There are two pieces of advice they give that I often give to customers as well. First, learn to spread out changes across batches. When I work with Flyway customers, I always let them know they need to think of a migration script as a unit of deployment and break their changes apart as best they can. Those scripts often also become units of rollback, so keep them small. Not necessarily every change in its own script, but don’t bundle too many things together.
Second, keep things simple. Too often I find engineers build clever solutions that make sense to them, but no one else. You never know the quality of your next hire, so don’t overcomplicate things without a really good reason.
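To make the first piece of advice a little more concrete, here is a minimal sketch of small migrations that each carry their own undo step. This is not Flyway's API or DoorDash's process; it is a hand-rolled illustration using Python's built-in sqlite3 module, and the names (Migration, schema_history, customer, ix_customer_name) are all made up for the example.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Migration:
    """One small unit of deployment, paired with the statement that undoes it."""
    version: int
    apply_sql: str
    undo_sql: str


# Hypothetical changes: one object per migration, so each can be undone alone.
MIGRATIONS = [
    Migration(1,
              "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)",
              "DROP TABLE customer"),
    Migration(2,
              "CREATE INDEX ix_customer_name ON customer (name)",
              "DROP INDEX ix_customer_name"),
]


def current_version(conn):
    """Return the highest applied version, or 0 if nothing has been applied."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_history (version INTEGER PRIMARY KEY)"
    )
    row = conn.execute("SELECT MAX(version) FROM schema_history").fetchone()
    return row[0] or 0


def migrate(conn):
    """Apply every pending migration in version order, recording each one."""
    done = current_version(conn)
    for m in sorted(MIGRATIONS, key=lambda m: m.version):
        if m.version > done:
            with conn:  # commits after each migration, rolls back on error
                conn.execute(m.apply_sql)
                conn.execute(
                    "INSERT INTO schema_history (version) VALUES (?)",
                    (m.version,),
                )


def undo_last(conn):
    """Undo only the most recent migration and return the new version."""
    done = current_version(conn)
    if done == 0:
        return 0
    m = next(m for m in MIGRATIONS if m.version == done)
    with conn:
        conn.execute(m.undo_sql)
        conn.execute("DELETE FROM schema_history WHERE version = ?", (done,))
    return current_version(conn)
```

Calling migrate(conn) on a fresh connection applies both changes, and undo_last(conn) then removes only the index, leaving the table in place. Because each script is small, the rollback is small too, which is the point of not bundling too many things together.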
Did their process work? They’ve grown to about 1.9PB of data. That’s a lot of food orders. They’ve also had other metrics of success, and seem to be saving time for their tech team, which is often one of the main reasons to build a better process and use it consistently.
Steve Jones
Listen to the podcast at Libsyn, Spotify, or iTunes.


“Too often I find engineers build clever solutions that make sense to them, but no one else”
Steve – this is SOOOO true. I previously worked for a software company dealing in accounting software, and I used to try to make the dev guys understand why they can’t do development (on anything that is customer facing, like the UI) in a bubble and have only our own people test it, people who are very familiar with the program.
I also work with some video game developers (under NDA, so I can only say so much) and it’s the same with them. I am part of a group of gamers who work directly with the devs, and since they started doing that, the IP we are involved with has gotten much better. But they were doing much of the same: creating things they thought were clever but were a “What is this?” to the customer.
It’s taken me a long time to start to recognize when some cool or neat trick is a problem. The more I’ve learned and the more I see junior developers struggle, the more I realize that clever code sometimes creates more problems. It’s hard to get younger or less experienced people to see that.
Regarding DoorDash’s situation, do you believe an RDBMS like SQL Server or Oracle would not be able to handle their kind of load/activity? I would think a system like that needs accuracy like banks (but not so much as banks), and I heard that the distributed cloud-based systems are more about redundancy than precision.
It’s a scale issue. PostgreSQL is comparable to SQL Server and Oracle. Bad code will scale poorly compared to good code, but depending on how many inserts/updates there are and how hot a page/partition/table might be, the load surely could overload SQL Server. That’s one reason why In-Memory OLTP objects came about: to reduce locks.
There are customers that overload the insert capability, and definitely the identity allocation.
No idea if this could be handled by one server or a shard of a few, but I do think some of the NoSQL and cloud versions of databases scale higher for inserts or updates to a single table. However, a lot of the impact on how high you scale comes from architecture.