Needlessly Technical: NoSQL vs YesSQL
June 24th, 2010“Needlessly Technical” articles are dry, and boring, and cover some sort of computer-programming topic.
Greg is currently involved in the process of trying to wrap his head around NoSQL. Technologies such as Cassandra, Memcached, CouchDB, MongoDB, BigTable, SmallTable, MediumTable, KitchenTable, SofaDB, CoffeeTable…
As a dude who is made of cheese occasionally trying to keep tabs on this whole “Web Development” thing, here’s what I’m seeing so far.
Pros
NoSQL solutions (Cassandra, Memcached) are trivially parallelizable.
Following the NoSQL debate in the ACM magazines, industry shills have been sending in article after article complaining that “SQL already does that“.
Every serious SQL DBMS (e.g., Greenplum, Asterdata, Vertica, Paraccel, etc.) written in the last 10 years has provided shared nothing scalability, and any new effort would be remiss if it did not do likewise.
– of course, when they say “Serious SQL DBMS”, they mean SQL databases that independently cost about as much as helicopters.
Most web development is driven, not by the enterprise crowd, but by the hobbyist crowd, people who can only really afford to use MySQL or PostgreSQL. MySQL clustering is roughly as complicated and horrifying as testicle surgery, and even then they specify that such a solution will only work well with ‘mainly primary key access’ and ’simple joins’ – essentially, if you don’t read the entire book High Performance MySQL and follow every dictate to the letter, you’re fucked. Other options include distributing MySQL through an external software solution – sharding. Also complicated and ugly. Distributed, high-availability MySQL requires a “Database Guy”.
The NoSQL solutions, however, make parallelization first-class, at the expense of other fun features from MySQL, like complex joins, or data integrity. Setting up a four or five computer memcached or Cassandra cluster is as easy as a high-school girl at a college party.
In NoSQL solutions, the only way is the fast way.
Once again, I point at the book High Performance MySQL, because this is an important thing to consider – the very importance of this book makes it quite clear that it is possible – nay, easy – to write Low Performance MySQL. Glancing through the myriad optimizations that one can make to tables and queries – the sheer amount of voodoo required is staggering. MySQL is a complex and complicated beast, one that allows the user to do most anything, at the price of making it quite easily to accidentally do things very wrong.
NoSQL solutions are cryptic and difficult to understand, at first, but the most cryptic and difficult thing about them is exactly how limited the querying languages that they express really are.
In CouchDB, for example, the only way to query is to write a ‘map’ function – a function that takes a single unit of data and produces a ‘key’ for that data. Then, a ‘reduce’ function, a function that folds keys together. After that, the software queries against the set of keys produced by the map and reduce functions. Expressing even simple operations in this manner can be complicated and difficult.
But being as this is the only way to query CouchDB, it is all-but guaranteed that actually accessing data in this manner will be quite snappy. There are no thousands of optimizations to be made because the program itself is so much simpler than in MySQL.
Flexibility
No Schema required for a key-value store. Chunk any ol’ data in there. This can turn out to be pretty darned handy – in a project that I was working on, one where I took output from Python’s staggeringly sextastic Universal Feed Parser – well, the output from the parser came back as a Python object, a tree, one which serialized quite neatly into JSON and was just as easily crammed into a CouchDB database. A bot to check RSS feeds and store them in a database, all in about 2 pages of code.
HTTP/JSON access
This one is MongoDB/CouchDB specific, but the ability to directly query a HTTP database and get a response in JSON is neat for those constructing rich Javascript apps. There needs to be no intermediate data layer at all, just connect the app right to the DB.
Cons
HTTP/JSON access
Again, MongoDB/CouchDB specific, but the ability to directly query a HTTP database and get a response in JSON is all well and good, but the lack of any authentication layer means that either all of the data is open-access, all of the data is private-to-the-server (which means an intermediary layer is needed), or a homebrew authentication layer needs to be concocted by hand.
Documentation
As it would turn out, documentation for anything but memcached is spotty at best. Nobody knows quite precisely how to work these glossy new software behemoths, because nobody ever bothered to publish more than the most cursory of docs for them. And the main database projects, what little documentation they have is better than the individual language adaptations. Looking at the couchdb-python module, the only way to discern how to use the library at all was to read the code.
In the words of Joshua Bloch, “Documentation matters. No matter how good an API, it won’t get used without good documentation. Document every exported API element: every class, method, field, and parameter.” (Note that the conference talk he mentions, “How to Design a Good API and Why it Matters”, was excellent.)
In NoSQL solutions, the only way is the fast way.
I listed this as a “pro” before, but … well, the restrictive query environment is a huge downside, too. Sometimes it’s nice to be able to give SQL some gargantuan query to puzzle over for 10 seconds, eventually producing exactly the report needed, in sorted order.
In fact, for complicated web applications – anything with users, user profiles, or complicated hat-mechanics, the ability to use INNER JOINs, LEFT JOINs, RIGHT JOINs, UNIONs, INTERSECTs, and SORT BYs can be very, very helpful.
Structure and Data Constraints
Well-structured data can be it’s own documentation, sometimes. Just looking at the “CREATE TABLE” statement in a database can tell a developer a great deal about how the software works. This is _not as easy_ in a non-relational database, especially one where any given unit of data can be slightly different.
MySQL can be pretty damn fast
With a Database Guy on the team, somebody familiar with optimizing MySQL tables and queries, somebody who can shard, shuck, jive, and moustache, MySQL can be very fast, while still offering the modern features (JOINs, sorting, data integrity) that people have come to enjoy from databases.
That ACID taste in your throat
Most NoSQL solutions subscribe to the principle of eventual consistency – given enough time, the data will eventually propagate along the entire cluster, but there’s no guarantees that data in one place is the same in another. For non-critical data, like, say, ‘blog comments’ or ‘high scores’, this is not a problem. For other applications – say, payment processing, or perhaps some sort of complicated hat system – ACID consistency is probably not such a bad idea.

Verdict
I think the verdict here is pretty obvious. You should wear more nice hats. They make you look dapper, and keep your head dry when it rains.
