Needlessly Technical: NoSQL vs YesSQL

June 24th, 2010

“Needlessly Technical” articles are dry, and boring, and cover some sort of computer-programming topic.

Greg is currently involved in the process of trying to wrap his head around NoSQL. Technologies such as Cassandra, Memcached, CouchDB, MongoDB, BigTable, SmallTable, MediumTable, KitchenTable, SofaDB, CoffeeTable…

As a dude who is made of cheese occasionally trying to keep tabs on this whole “Web Development” thing, here’s what I’m seeing so far.

Pros

NoSQL solutions (Cassandra, Memcached) are trivially parallelizable.

Following the NoSQL debate in the ACM magazines, industry shills have been sending in article after article complaining that “SQL already does that”.

Every serious SQL DBMS (e.g., Greenplum, Asterdata, Vertica, Paraccel, etc.) written in the last 10 years has provided shared nothing scalability, and any new effort would be remiss if it did not do likewise.

– of course, when they say “Serious SQL DBMS”, they mean SQL databases that independently cost about as much as helicopters.

Most web development is driven, not by the enterprise crowd, but by the hobbyist crowd, people who can only really afford to use MySQL or PostgreSQL. MySQL clustering is roughly as complicated and horrifying as testicle surgery, and even then they specify that such a solution will only work well with ‘mainly primary key access’ and ‘simple joins’ – essentially, if you don’t read the entire book High Performance MySQL and follow every dictate to the letter, you’re fucked. Other options include distributing MySQL through an external software solution – sharding. Also complicated and ugly. Distributed, high-availability MySQL requires a “Database Guy”.

The NoSQL solutions, however, make parallelization first-class, at the expense of other fun features from MySQL, like complex joins, or data integrity. Setting up a four or five computer memcached or Cassandra cluster is as easy as a high-school girl at a college party.

In NoSQL solutions, the only way is the fast way.

Once again, I point at the book High Performance MySQL, because this is an important thing to consider – the very importance of this book makes it quite clear that it is possible – nay, easy – to write Low Performance MySQL. Glancing through the myriad optimizations that one can make to tables and queries – the sheer amount of voodoo required is staggering. MySQL is a complex and complicated beast, one that allows the user to do most anything, at the price of making it quite easy to accidentally do things very wrong.

NoSQL solutions are cryptic and difficult to understand, at first, but the most cryptic and difficult thing about them is exactly how limited the querying languages that they express really are.

In CouchDB, for example, the only way to query is to write a ‘map’ function – a function that takes a single document and emits key/value pairs for it. Then, optionally, a ‘reduce’ function – a function that folds the values emitted under each key together. After that, the software queries against the keys produced by the map and reduce functions. Expressing even simple operations in this manner can be complicated and difficult.

But being as this is the only way to query CouchDB, it is all-but guaranteed that actually accessing data in this manner will be quite snappy. There are no thousands of optimizations to be made because the program itself is so much simpler than in MySQL.
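To make the map/reduce idea a bit more concrete, here is a minimal Python sketch of the concept. It is not CouchDB's actual API (real CouchDB views are normally written in JavaScript); the names `map_doc`, `reduce_values`, and `build_view`, and the sample documents, are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical documents; a real CouchDB database would hold JSON documents.
DOCS = [
    {"type": "comment", "author": "greg", "words": 12},
    {"type": "comment", "author": "alice", "words": 30},
    {"type": "comment", "author": "greg", "words": 8},
    {"type": "post", "author": "greg", "words": 400},
]

def map_doc(doc):
    """Map step: look at ONE document, yield zero or more (key, value) pairs."""
    if doc["type"] == "comment":
        yield doc["author"], doc["words"]

def reduce_values(values):
    """Reduce step: fold all the values that share a key into one result."""
    return sum(values)

def build_view(docs):
    """Run map over every document, group by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_doc(doc):
            grouped[key].append(value)
    # Keep the result ordered by key, the way a view index is ordered.
    return {key: reduce_values(grouped[key]) for key in sorted(grouped)}

view = build_view(DOCS)
print(view)
```

Even "total comment words per author" takes two functions and a grouping pass, which is roughly the complaint above. The upside is that once the view is built, a lookup is just a lookup by key.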

Flexibility

No Schema required for a key-value store. Chunk any ol’ data in there. This can turn out to be pretty darned handy – in a project that I was working on, one where I took output from Python’s staggeringly sextastic Universal Feed Parser – well, the output from the parser came back as a Python object, a tree, one which serialized quite neatly into JSON and was just as easily crammed into a CouchDB database. A bot to check RSS feeds and store them in a database, all in about 2 pages of code.
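For a rough idea of what "chunk any ol' data in there" looks like, here is a small Python sketch. The `store` dict is only a stand-in for a document database (a real CouchDB would be reached over HTTP), and `save`, `load`, and the sample entries are hypothetical names invented for this example, not the code from the feed bot.

```python
import json

# A stand-in for a document store: just a dict of id -> JSON text.
store = {}

def save(doc_id, doc):
    """Serialize any JSON-friendly tree and store it; no schema is checked."""
    store[doc_id] = json.dumps(doc)

def load(doc_id):
    """Parse the stored JSON back into Python dicts and lists."""
    return json.loads(store[doc_id])

# Two feed entries with different shapes; neither needs a table definition.
save("entry-1", {"title": "Hats", "tags": ["pork pie", "fedora"]})
save("entry-2", {"title": "Cheese", "author": {"name": "Greg"}, "updated": "2010-06-24"})

print(load("entry-1")["tags"])
print(load("entry-2")["author"]["name"])
```

Notice that nothing stops the second entry from carrying fields the first one lacks. That is the handy part, and also the part that bites later.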

HTTP/JSON access

This one is MongoDB/CouchDB specific, but the ability to directly query an HTTP database and get a response in JSON is neat for those constructing rich Javascript apps. There needs to be no intermediate data layer at all, just connect the app right to the DB.

Cons

HTTP/JSON access

Again, MongoDB/CouchDB specific: the ability to directly query an HTTP database and get a response in JSON is all well and good, but the lack of any authentication layer means that either all of the data is open-access, all of the data is private-to-the-server (which means an intermediary layer is needed), or a homebrew authentication layer needs to be concocted by hand.

Documentation

As it would turn out, documentation for anything but memcached is spotty at best. Nobody knows quite precisely how to work these glossy new software behemoths, because nobody ever bothered to publish more than the most cursory of docs for them. And what little documentation the main database projects have is still better than that of the individual language adaptations. Looking at the couchdb-python module, the only way to discern how to use the library at all was to read the code.

In the words of Joshua Bloch, “Documentation matters. No matter how good an API, it won’t get used without good documentation. Document every exported API element: every class, method, field, and parameter.” (Note that the conference talk he mentions, “How to Design a Good API and Why it Matters”, was excellent.)

In NoSQL solutions, the only way is the fast way.

I listed this as a “pro” before, but … well, the restrictive query environment is a huge downside, too. Sometimes it’s nice to be able to give SQL some gargantuan query to puzzle over for 10 seconds, eventually producing exactly the report needed, in sorted order.

In fact, for complicated web applications – anything with users, user profiles, or complicated hat-mechanics – the ability to use INNER JOINs, LEFT JOINs, RIGHT JOINs, UNIONs, INTERSECTs, and ORDER BYs can be very, very helpful.
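Here is the sort of report that is one query in SQL and a small project in a key-value store. This sketch uses Python's built-in `sqlite3` module as a stand-in for MySQL (an assumption made so the example runs anywhere); the `users` and `hats` tables and their contents are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE hats  (id INTEGER PRIMARY KEY,
                        user_id INTEGER REFERENCES users(id),
                        style TEXT NOT NULL);
    INSERT INTO users VALUES (1, 'greg'), (2, 'alice'), (3, 'bob');
    INSERT INTO hats  VALUES (1, 1, 'pork pie'), (2, 1, 'fedora'), (3, 2, 'bowler');
""")

# LEFT JOIN keeps users with no hats; GROUP BY and ORDER BY shape the report.
rows = conn.execute("""
    SELECT u.name, COUNT(h.id) AS hat_count
    FROM users AS u
    LEFT JOIN hats AS h ON h.user_id = u.id
    GROUP BY u.id, u.name
    ORDER BY hat_count DESC, u.name
""").fetchall()

print(rows)
```

The hatless user still shows up in the report with a count of zero, which is exactly the kind of detail that takes extra hand-written code without a LEFT JOIN.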

Structure and Data Constraints

Well-structured data can be its own documentation, sometimes. Just looking at the “CREATE TABLE” statement in a database can tell a developer a great deal about how the software works. This is _not as easy_ in a non-relational database, especially one where any given unit of data can be slightly different.

MySQL can be pretty damn fast

With a Database Guy on the team, somebody familiar with optimizing MySQL tables and queries, somebody who can shard, shuck, jive, and moustache, MySQL can be very fast, while still offering the modern features (JOINs, sorting, data integrity) that people have come to enjoy from databases.

That ACID taste in your throat

Most NoSQL solutions subscribe to the principle of eventual consistency – given enough time, the data will eventually propagate along the entire cluster, but there’s no guarantees that data in one place is the same in another. For non-critical data, like, say, ‘blog comments’ or ‘high scores’, this is not a problem. For other applications – say, payment processing, or perhaps some sort of complicated hat system – ACID consistency is probably not such a bad idea.
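A toy Python model of what "eventual" means in practice. This is a deliberately simplified sketch, not how Cassandra or any real system replicates data; the `Replica` class and the `write` function are hypothetical names invented for the example.

```python
class Replica:
    """One node of a hypothetical cluster: a dict plus a queue of pending updates."""

    def __init__(self):
        self.data = {}
        self.inbox = []

    def apply_pending(self):
        """Apply updates that other nodes have sent but we have not yet seen."""
        for key, value in self.inbox:
            self.data[key] = value
        self.inbox.clear()

def write(primary, others, key, value):
    """Write locally right away; other replicas only hear about it later."""
    primary.data[key] = value
    for node in others:
        node.inbox.append((key, value))

a, b = Replica(), Replica()
write(a, [b], "high_score", 9001)

stale = b.data.get("high_score")   # b has not caught up yet
b.apply_pending()                  # replication eventually happens
fresh = b.data.get("high_score")
print(stale, fresh)
```

Between the write and the catch-up, the two replicas disagree. For a high score, nobody cares; for an account balance, that window is where the trouble lives.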

Pork Pie

Verdict

I think the verdict here is pretty obvious. You should wear more nice hats. They make you look dapper, and keep your head dry when it rains.

Ship Early, Ship Often, Advertise Early, Fail

December 7th, 2009

The internet is abuzz with response to Jeff Atwood’s latest post – “Version 1 Sucks, But Ship It Anyway”.

You don’t need me to summarize it for you – it’s right there. In the link. Release early, release often. It’s well known to programmers, thanks to the full chapter devoted to it in the seminal “The Cathedral and the Bazaar”. While open-source software is characterized by RERO, the web lives by it. Deployment of new features is free, and feedback can be instant. Joy!

However, one thing that is frequently ignored is… how soon should I start telling people about my software? Advertising to people too early in a project’s lifespan basically spells instant and permanent death for that product.

While it may be good to release early, the ‘beta’ of a new web service is very, very likely to drive people away with its lack of polish. While users love the ‘new’, users similarly hate products that are unusable, half-baked, or still in development. Note that Google Mail, Google Maps, and StackOverflow were all complete products before they started spreading the word amongst their users – while they have enjoyed frequent revision and update since, their first public offerings were robust enough to draw people in. They were ‘beta’ in the sense that they could still use a little bit of revision and polish, not ‘beta’ in the sense that they were still sussing out fundamental bugs. World of Warcraft could be considered a product with a release-often mentality, but they did not release early – the WoW beta was a long period of polishing a product that was already seriously ready for prime time – and the people who participated in the beta were among the first subscribers.

Now, on the other hand, the game Cities XL by Monte Cristo enjoyed a beta period that was almost more of an alpha period. Gamers involved had to deal with buggy client-side and server-side code, forbidding them from communicating with one another, using fundamentally important game features (‘trading’), saving, or even connecting with the server much of the time. First impressions were formed, and while Cities XL did well in reviews as a fairly run-of-the-mill city-builder with a lot of unnecessarily tacked-on MMO features, the game suffered in public opinion as a buggy, underperforming mess of half-implemented features.

Let’s also look at startups.com, a site that reddit has been advertising. This is a company that raised VC, earned a mention on TechCrunch, purchased the domain name for six-figures, purchased the software (from StackOverflow) for another high premium, and then, it would seem, immediately pushed their product into an advertising stage. I’m not sure how the startups.com community was formed (how is babby formed?), but it would seem that either startups.com tried to outsource community building or they just left it up to whatever spammer came along. Seriously, this is a company by-the-numbers – it barely even needs a tech person on staff. One thing it does need, however, is community building, and advertising it on reddit before it was ready just shot it in the foot.

So, while I believe in “Release Early, Release Often”, I’d like to append two corollaries.

  • Never release anything that’s broken. – There’s a significant difference between ‘needs polish’ and ‘doesn’t work’.
  • Do not advertise a product that is not prepared to impress. – Otherwise you’re just creating negative word-of-mouth.

Needlessly Technical Executive Summary: The Best Software Writing

October 28th, 2009

“Executive Summary” articles are produced when I read something, and then, in order to remember it, attempt to mush the entire message into a few words.

“Needlessly Technical” articles are dry, and boring, and cover some sort of computer-programming topic.

So, I’m going through the book “The Best Software Writing”, which Joel Spolsky edited as an attempt to put some bloggers into print. I’m summarizing everything in very-short-form for the sake of trying to remember it all.

Style is Substance – Ken Arnold

Here.

  • For any given language, there are a few acceptable coding styles
  • No style is any better than the other
  • Arguing about style takes time
  • A common style is a good idea
  • Enforcing a universal per-language style at the compiler level is a Good Idea
  • This enforcement must be mandatory.

The Pitfalls Of Outsourcing Programmers : Why Some Software Companies Confuse The Box With The Chocolates – Michael Bean

Here.

  • You lose the ability to be competitive or differentiate yourself in anything that you outsource.
  • This is okay for non-critical business functions – like the boxes that you ship chocolates in.
  • This is not okay for critical business functions – like the chocolates themselves.
  • “Software” is almost always “The Chocolates”.

ICSOC04 Talk – Adam Bosworth

Here.

  • Very complicated abstractions that make for very simple systems (regular expressions, SGML, C++) are often replaced by very simple abstractions that make for very complex systems (Google, HTML, PHP)
  • A lot of very successful systems are “simple, sloppy, flexible, human”, and a lot of failed systems are “clean, crisp, clear, correct”.
  • When a technology becomes ubiquitous, the ‘media’ ceases to be the message, and people become far more concerned with the content.
  • “The currency of reputation and judgement is the answer to the tragedy of the commons”
  • Machine learning, inference, and data mining all provide ways to sort through the massive amount of data we regularly face. AI is mainstream and vitally important. (Disagree? Okay, turn off your spam filter. )

Autistic Social Software – Danah Boyd

Here.

  • Many failed (and a few popular) social networks attempt to codify human interaction in simple technical ways.
  • No popular social network even approaches actual human social interaction. (I’d say you get a much more human ‘community’ from IRC or phpBB )
  • Many times, these social networks are popular, not because they model social interaction, or replace it (they don’t) but because of the ways that people use them alongside social interaction – to play games, or to interact with people they might not talk to often.
  • Flexible systems are popular because they can be repurposed to work in many ways.

There are three ways to make technology work with people.

  1. Make a technology, market it, force adoption.
  2. Make a technology, observe how users use the technology, try to support these activities
  3. Understand a group of people, try to develop software that works for them.

(Or, ideally, start at 3, then do 2. )

Why Not Just Block The Apps That Rely On Undocumented Behavior – Raymond Chen

Here.

  • Incompatibility makes people less likely to want to upgrade
  • Breaking software makes people less likely to want to develop software for your systems
  • “Lock-in” is fundamentally important to OS companies – breaking software hurts ‘lock-in’.
  • Thus, at least in older Windows releases, there are layers upon layers of compatibility hacks to keep popular software working from release to release.

Strong Typing vs. Strong Testing – Bruce Eckel

Here.

  • Static type-checking brings the gift of compile-time correctness checking. With dynamic type-checking, one must wait until runtime, and some bugs are never found.
  • Duck typing allows for much more concise and elegant code.
  • A 20 line Java program is reproduced in Python in 10 lines. Bam.
  • Even in statically typed languages, compilation does not imply correctness.
  • Proper unit testing catches many more errors than static type checking. It should be used on all projects, static or dynamic.
  • If you’re going to do proper unit testing anyways, why include all of the needless static type checking cruft?

LumberJack launch!

July 13th, 2009

The LumberJack IRC-loggin’ package has launched!

Galleria

July 2nd, 2009

So, I’m having trouble finding a reasonable gallery software.

Honestly, what I want is a service that allows me to export seamlessly from iPhoto, not worry about storage space or download limits, share with arbitrary people, allow for posted comments, look nice, run fast, and be free.

I know, that’s a tall order.

flickr limits you to 100MB / month, which is a bit of a pain if you take more than a few pictures a month. Google Picasaweb limits you to 1 gigabyte forever, which seems almost worse. flickr pro costs money, SmugMug costs more money, and MobileMe costs the most of all.

That pretty much leaves “Gallery”, but the last time I tried running it on my DreamHost account, it was so slow as to be not even worthwhile. Yeah, DreamHost may allow you to install a wiki or a gallery, but if you want to USE them, you’re up slow creek without a paddle.

So, what that leaves, then, is iPhoto’s “Web Export” function. Now, I’m not sure about iPhoto ‘09, but iPhoto ‘08 produces HTML that’s lightweight, but… well… there’s no way to style it. Would it be so hard to include a couple of hooks so that one could toss a CSS stylesheet on the pages?

So now, my solution is to just post iPhoto galleries (here), and then, in the FAR OFF FUTURE, I plan to write a script that iterates through them and adds a CSS-stylesheet link to each page- and maybe wraps all of the ‘descriptions’ in a <small> tag. Maybe a link to some external Javascript for special effects. Who knows?
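For what it's worth, the FAR OFF FUTURE script could look something like this Python sketch. It assumes the exported pages are UTF-8 files with a `</head>` tag somewhere in them, which I have not verified against iPhoto's actual output; the stylesheet path `/gallery.css` and the function names are placeholders.

```python
import re
from pathlib import Path

# Hypothetical stylesheet path; point this at whatever CSS file you use.
STYLESHEET_TAG = '<link rel="stylesheet" type="text/css" href="/gallery.css">'

# Matches the closing head tag regardless of letter case.
HEAD_CLOSE = re.compile(r"</head\s*>", re.IGNORECASE)

def add_stylesheet(html, tag=STYLESHEET_TAG):
    """Insert the <link> tag just before </head>, unless it is already there."""
    if tag in html:
        return html
    match = HEAD_CLOSE.search(html)
    if match is None:
        return html  # no closing head tag to hook into; leave the page alone
    return html[:match.start()] + tag + "\n" + html[match.start():]

def style_gallery(root):
    """Rewrite every .html file under root; return how many files changed."""
    changed = 0
    for page in Path(root).rglob("*.html"):
        original = page.read_text(encoding="utf-8")
        updated = add_stylesheet(original)
        if updated != original:
            page.write_text(updated, encoding="utf-8")
            changed += 1
    return changed
```

Calling `style_gallery("path/to/exported/galleries")` would walk the export folder. Because `add_stylesheet` skips pages that already contain the tag, running it again after the next export only touches the new pages.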

Web Developer Tools

June 12th, 2009

I might have posted this before, but for developers looking for a shorn-down Firebug equivalent for Webkit/Safari, you can enter the following snippet into your Mac’s Terminal:

defaults write com.apple.Safari WebKitDeveloperExtras -bool true

This will give you the ability to “Inspect Element”, which is very, very useful. It brings up the Mac Web Developer Console, a sort of useful multi-tool for the web. In fact, after the appearance of the ubiquitous and incredibly useful FireBug, it would seem that most of the major browsers are going the way of integrated web-development tool modules. Nice!

FireBug

I’m not going to lie — FireBug completely changed the way that I write markup. Instead of laboriously fiddling with a CSS value and refreshing repeatedly (How does 74px look? How does 80px look? ), I just toss in a nonsense value, then inspect the element I’m working with in FireBug. From there, I can select any of the values that I want to play with, and hold the up/down keys to watch that value change in real-time. You can also add new CSS properties on the fly to see how they change the page. Useful!

The Safari Web Developer Console allows you to inspect elements and change CSS, but you can’t slide elements. Which is okay, I guess. I mean, every product works a little bit differently.

Size

The Safari Console *also* allows you to determine the full download size of a page of your website. Notably, this website’s index page, pictures and all, occupies less than 100kb of space — but Javascript plugins (jQuery, jQuery UI, and the plugin for code rendering) and sIFR eat up another 400kb of space, making the whole site a bit of a pig. Alternatively, I’ve written full-on Wordpress sites that pack tidily into less than 70kb. Google’s front-page clocks in at a svelte 36kb, and Yahoo’s front-page a beefy 1Mb.

Notably, this functionality exists in FireBug, but it’s not nearly as pretty or intuitive, so I never really played around with it.

In fact, the Safari Console has a few features that could be really helpful for someone pulling a page to bits. It lists all of the images used on the page, where they come from, and their size and dimensions. It also lists all of the stylesheets and javascript elements loaded on a page. Actually, looking at my site, one of the reasons that it’s so big is that the JavaScript Code Highlighter that I’m using depends on a number of ‘brush’ files (one for each language family being highlighted), each with a very large, identical copyright notice. I bet trimming that down could cut a couple kilobytes from this site’s hefty frame.

JavaScript

When I started programming in Javascript, I found also that FireBug logs Javascript errors in a tidy format that makes it painfully clear what’s gone wrong with the page. As well, you can toss a console.log("foo") statement in your page, to print output directly to the FireBug console. For debugging, this is like sex cheese.

Notably, this also fills me with frustration whenever a jQuery library module starts throwing an obscure error. Am I doing something wrong?

The Safari Console seems also to have a console for logging. I haven’t really fiddled with it. For that sort of work, I’m almost guaranteed to go to FireBug unless it’s a Safari-specific error.

Okay, Windows, You’re Up To Bat

So, Safari and Firefox both have robust developer console systems, albeit each one with strengths and weaknesses. What does Internet Explorer have?

crickets

tumbleweed

Actually, you’d be surprised. Internet Explorer is home to a small set of development tools as well, for those developers in the unfortunate circumstance of having to develop for it as a primary. I haven’t tried any of them, and they seem to be quite the mishmash of miscellany, but it’s good to know that the tools exist.

And The Rest

Google Chrome has a very rudimentary set of built-in developer tools, although it’s so small as to be nearly meaningless; you can’t even change CSS on the fly.

That just leaves Opera, and, while it was an empty field for a while, they’re currently working on Dragonfly, a yet-another-developer-console, for Opera. Considering, however, Opera’s much-greater-pull in the mobile world, they’re integrating a host of tools that work with assorted mobiles. It’s still in ‘alpha’, though, so the software might not be as rock solid as one might hope.

Anything Is Better Than Nothing.

June 4th, 2009

In my CMPT 475 (Software Engineering 2) class, the professor has repeatedly posited that “Something Is Always Better Than Nothing” — so far, in reference to the development of naming standards and the adoption of development methodologies.

Horseshit.

Now, ‘no development methodology’ may be a disorganized way to go about development, but a small team can make things work with little more than a source control system and a little bit of proximity.

It’s easy, however, to imagine both development methodologies and naming standards that would be actively detrimental to a project, both in terms of wasting time and in terms of actively confusing the system.

Naming

Starting with naming schemes— first and foremost, we can easily imagine a naming scheme that is actively detrimental to any project.

Start at n = 1. Every variable will be named _n, where ‘n’ is the order of introduction of the variable in the program.

This is, of course, a horrible, horrible naming scheme. A single line might end up looking like _12 = _14 + _29; . The only sensible way to make it work would be to maintain a table of data, mapping ‘n’ values to variable contents (maybe in code comments)? — nevertheless, there is no way that this naming scheme could be made elegant.

That’s— of course— a theoretical naming standard, and a pretty nonsensical one at that. Similar naming systems have popped up at theDailyWtf — the ‘a, b, c, aa, bb, cc’ variable naming standard seems to appear a lot, agonizingly enough. In general, though, we’ll consider this scheme totally nonsensical.

There are other schemes, however, that are very serious and standard, and still actively detrimental to a project. Here I refer to Hungarian Notation— and by Hungarian Notation, I mean “Systems Hungarian”, not the original “Apps Hungarian”. Apps Hungarian, when applied judiciously, can be a good thing for a project. Joel Spolsky wrote an entire article about it, here:

Apps Hungarian had very useful, meaningful prefixes like “ix” to mean an index into an array, “c” to mean a count, “d” to mean the difference between two numbers (for example “dx” meant “width”), and so forth.

Systems Hungarian had far less useful prefixes like “l” for long and “ul” for “unsigned long” and “dw” for double word, which is, actually, uh, an unsigned long. In Systems Hungarian, the only thing that the prefix told you was the actual data type of the variable.

This was a subtle but complete misunderstanding of Simonyi’s intention and practice, and it just goes to show you that if you write convoluted, dense academic prose nobody will understand it and your ideas will be misinterpreted and then the misinterpreted ideas will be ridiculed even when they weren’t your ideas. So in Systems Hungarian you got a lot of dwFoo meaning “double word foo,” and doggone it, the fact that a variable is a double word tells you darn near nothing useful at all. So it’s no wonder people rebelled against Systems Hungarian.

When it comes down to it, Systems Hungarian has some uses, but in many, many arenas (for example, a small project in a dynamically typed language) it is utterly useless. It’s just a waste of time that introduces unnecessary code ugliness.
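A quick Python sketch of the difference. The "ix" and "c" prefixes come from the quote above; the `int_` and `lst_` prefixes and both function names are invented here purely to illustrate the Systems flavour, and the snake_case spelling is my adaptation for Python.

```python
# Apps Hungarian: the prefix says what KIND of thing the value is.
# "ix" = index into a list, "c" = count.
def last_row(rows):
    c_rows = len(rows)        # a count of rows
    ix_last = c_rows - 1      # an index: count minus one
    return rows[ix_last]

# Systems Hungarian: the prefix only repeats the data type.
# In a dynamically typed language this tells the reader nothing new.
def last_row_systems(lst_rows):
    int_a = len(lst_rows)
    int_b = int_a - 1         # both are "int"; which one is safe to index with?
    return lst_rows[int_b]

print(last_row(["a", "b", "c"]), last_row_systems(["a", "b", "c"]))
```

Both functions do the same thing. In the first, `rows[c_rows]` would look wrong at a glance, because a count is not an index. In the second, `lst_rows[int_a]` looks perfectly fine right up until it raises an IndexError.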

Methodology

Okay, development methodology can contribute a lot to the success of a project, I will admit. A carefully selected methodology can help a software project come in on time and on budget.

The trouble is, however, the postulate that any development methodology is better than no development methodology.

Once again: two guys with a Subversion repository, and no development methodology at all, will likely outperform two guys using a heavyweight system like the RUP. The amount of overhead and paperwork involved in a heavyweight methodology almost necessitates a full-time project manager to learn it and deal with it all. Sure, for larger, longer projects where you’re managing hordes of replaceable cogs, you’re going to need some serious Process, but if you’re building a little blog package with one other guy, well, too much Process just wastes your time and holds you back.

Conclusion

Seriously, the ‘anything is better than nothing’ mindset is almost always a very bad idea in the tech world. Most of the time, ‘something is better than nothing’: I’d rather have the Vault or even Microsoft SourceSafe (shudder) than no source control at all. But we must not confuse minimally functional solutions with actively detrimental solutions.