>*Please ask yourself why.* Looking for a way to hands-free scale very cheap...

rbranson · on Dec 7, 2010

It's not as "hands-free" as you'd like to believe. Check out the MongoDB sharding introduction[1]. There are some pretty big caveats. Very few people are using auto-sharding at scale in production (bit.ly and BoxedIce are all I know of).

There are other operational issues with MongoDB. MongoDB can only do a repair if there is twice as much disk space available as the database uses, and the server must be effectively brought offline to do this. To reclaim unused disk space, you have to do a, you guessed it, compact/repair. Want to do a backup? The accepted way to do this is to have a dedicated slave that can be write-locked for however long it takes to do your backup. They suggest using LVM snapshots to make this short, but disk performance on volumes with LVM snapshots is terrible.
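To make the disk space point concrete, here is a rough Python sketch of a pre-flight check you could run before starting a repair. It assumes the rule of thumb above means "free space on the volume is at least `factor` times the size of the data directory"; the helper names, the `factor` parameter, and the example path are all made up for illustration and are not part of MongoDB.

```python
import os
import shutil


def dir_size_bytes(path):
    """Sum the sizes of regular files under path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def has_headroom(used_bytes, free_bytes, factor=2.0):
    """Assumed rule of thumb: free space must be >= factor * data size."""
    return free_bytes >= factor * used_bytes


def check_data_dir(path, factor=2.0):
    """Compare the data directory's size with free space on its volume."""
    used = dir_size_bytes(path)
    free = shutil.disk_usage(path).free
    return has_headroom(used, free, factor)
```

Something like `check_data_dir("/var/lib/mongodb")` (path is only an example) would then tell you whether it's even worth scheduling the downtime.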

I would consider using MongoDB for a setup that would either be non-critical, be completely within memory with bounded growth (which itself sort of begs the question...), or involve mostly write-once data, such as feeds, analytics, and comment systems.

[1] http://www.mongodb.org/display/DOCS/Sharding+Introduction

metamemetics · on Dec 7, 2010

Well, my platform is n $20 Linodes to start. I'm clustering the Python application across them using uwsgi+nginx (all I have to do is add an IP address in the config to scale), so it's a given that I'll shard the database across them as well. If you feel I should avoid Mongo, would you recommend Cassandra instead?

Regardless, I think my initial question about when to denormalize data applies to any database, including scaled MySQL, but perhaps it was a better question for Stack Overflow.

rbranson · on Dec 7, 2010

Cassandra has its own hurdles, but I think if we're talking about getting your mind in the right place, it might be a better answer. Cassandra definitely has a much more mature scalability implementation that isn't caveat-ridden like MongoDB's is. It's operating at scale at both Twitter and Facebook.

Cassandra has online compaction, but still requires up to 2x space for compaction. However, Cassandra does not have to do a full scan of the entire database to do compaction, and almost never actually uses the 2x space. It's also much easier to maintain a Cassandra cluster, because each instance shares equal responsibility, and replication topology is handled for you.

Despite what their fans will say, these are both beta-quality products.