Nov 27 2005

The Web Is Not A Normalized Relational Database

I had lunch with Stan James on Friday at Pasquini’s Pizzeria.  Stan is the creator of Outfoxed and was introduced to me by Seth Goldstein, who is one of the guys behind AttentionTrust.org and has recently launched Root Markets (Seth has a long essay up about Root Markets: Media Futures: From Theory to Practice that is very interesting (and complex) if you are into this stuff).

Stan’s moving to Boulder to be in the middle of the Internet software development universe (ok – he’s moving back here because it’s a much better place to live than Silicon Valley, but don’t tell anyone).  We spent a bunch of time getting to know each other and talked about the research he’d been doing for his master’s thesis in Cognitive Science at the University of Osnabrueck, and how this led to Outfoxed.  Oh – and we ate a huge delicious pizza.

I’d been playing with Outfoxed for a few days on my computer at home (I have a computer at home that I’ll install anything on) and was sort of getting it.  An hour with Stan helped a lot.  When I combine what Outfoxed is figuring out for me with the data I’m getting from Root’s Vault (my clickstream / attention data) I can see how this could be really useful to me in a few weeks once I’ve got enough data built up. More in a few weeks.

We then started talking about something I’ve been thinking about for a while.  My first business was a software consulting business that built database application software.  As a result, the construct of a relational database was central to everything I did for a number of years.  In the mid-1990s when I started doing web stuff, I was amazed at how little most people working on web and Internet software really understood about relational databases.  This has obviously changed (and improved) while evolving rapidly as a result of the semantic web, XML, and other data exchange approaches.  But – this shit got too complicated for me. Then Google entered the collective consciousness and put a very simple UI in front of all of this for search, eliminating the need for most of humanity to learn how to use a SELECT statement (ok – others – like the World Wide Web Wanderer by Matthew Gray (net.Genesis) and Yahoo did it first – but Google was the tipping point.)

I started noticing something about a year ago – the web was becoming massively denormalized.  If you know anything about relational databases, you know that sometimes you denormalize data to improve performance (usually because of a constraint of your underlying DBMS) but usually you want to try to keep your database normalized.  If you don’t know about databases, just think denormalization=bad.  As a result of the proliferation of user-generated content (and the ease with which it was created), services were appearing all over the place to capture that same data: reviews (books, movies, restaurants), people, jobs, stuff for sale.  “Smart” people were putting the data in multiple places (systems) – really smart people were writing software to automate this process.
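
If you haven’t lived in database land, here is a rough sketch of the difference, using Python’s built-in sqlite3 module.  Everything in it is made up for illustration: the table names, the restaurant, and the two “services” (site_one, site_two) are hypothetical and not how any real site stores its data.  The normalized half stores the restaurant’s name once and points at it; the denormalized half copies the same review into each service.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized: each fact lives in one place and is referenced by key.
cur.executescript("""
CREATE TABLE restaurant (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE review (
    id INTEGER PRIMARY KEY,
    restaurant_id INTEGER REFERENCES restaurant(id),
    body TEXT
);
INSERT INTO restaurant VALUES (1, 'Pizzeria A');
INSERT INTO review VALUES (1, 1, 'Huge delicious pizza');
""")

# Denormalized: the same review is copied into every (hypothetical) service.
cur.executescript("""
CREATE TABLE service_copy (service TEXT, restaurant_name TEXT, body TEXT);
INSERT INTO service_copy VALUES ('site_one', 'Pizzeria A', 'Huge delicious pizza');
INSERT INTO service_copy VALUES ('site_two', 'Pizzeria A', 'Huge delicious pizza');
""")

# Rename the restaurant. In the normalized schema that is a single-row update,
# and every review sees the new name through the join.
cur.execute("UPDATE restaurant SET name = 'Pizzeria B' WHERE id = 1")
normalized_names = [row[0] for row in cur.execute(
    "SELECT r.name FROM review v JOIN restaurant r ON r.id = v.restaurant_id"
)]

# In the denormalized case only one copy happens to get updated,
# so the copies now disagree with each other.
cur.execute(
    "UPDATE service_copy SET restaurant_name = 'Pizzeria B' "
    "WHERE service = 'site_one'"
)
denormalized_names = sorted(row[0] for row in cur.execute(
    "SELECT DISTINCT restaurant_name FROM service_copy"
))

print(normalized_names)    # one consistent name
print(denormalized_names)  # two conflicting names
```

The second query coming back with two different names for the same restaurant is the classic cost of denormalization: the copies drift apart unless someone (or some piece of software) keeps them in sync.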

Voila – the web is now a massively denormalized database.  I’m not sure if that’s good or bad (in the case of the web, denormalization does not necessarily equal bad).  However, I think it’s a construct that is worth pondering as the amount of denormalized data appears to me to be increasing geometrically right now. 

Stan and I talked about this for a while and he taught me a few things.  Stan is going to be a huge net-add to the Boulder software community – I’m looking forward to spending more time with him.