Drowning in Data

Wired had a good article about the pictures we take with digital cameras, and how hard it is to archive them in usable form.

It reminds me of when my roommate back in my first year of grad school finally got a digital camera. He was so excited about his new toy that he went around the apartment taking pictures of everything–including his trash can. The contrast between his world and the world of my great-grandparents (or, more precisely, the contrast between me looking back on my great-grandparents’ world and his future great-grandchildren looking back on his and my world) was striking. My great-grandmother passed away a few months ago, a day shy of her 104th birthday. If I want to explore her life, the only resources I have are a few old letters she saved from her childhood, yellowed picture albums and far more yellowed newspaper clippings.
My old roommate’s future descendants, though, could be drowning in information: every email he’s ever written and thousands (by that time, millions) of pictures he’s taken, ranging in importance from a picture of him and his yet-unmet bride eating their wedding cake together to a picture of his trash can, taken on the second day of owning a digital camera.
(I first thought about this a long time ago.)

There are a couple of lines from a T.S. Eliot poem, “The Rock,” which, taken out of context, fit all this very well:

Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?

The future of information (really, the future of data, from which information must first be gleaned before it can be distilled into knowledge; and can we ever really hope for wisdom?) relies on being able to filter what we need from the dross. Google surprised us all by showing that search was better than manual labelling. Why construct a knowledge ontology (like the one Cyc uses) documenting in propositional logic the birth and death dates of George Washington? Why bother when a Google search can get at it without me having to expend the man-hours constructing such a database?

The Wired article says that, as far as images go, it’s a problem of associating the right metadata with each piece of data. The final page of the article lays out a number of solutions to the data-lacking-metadata problem: manual tagging, cameras writing standardized metadata, data mining of personal computer info, AI-driven image recognition, and that wonderful would-be panacea called “Social Networks” that are supposed to save the world (if we ever find a use for them, but that’s another blog entry…).

I’d hazard a guess that the future is not in manual tagging, or in anything relying on humans to annotate data by rote. With all respect for the niche del.icio.us has brought us, I think Google has taught us the strength of smart algorithms running on top of lots of data–that their utility can easily surpass human-driven annotation. We’ve had those algorithms for analyzing text for a while now–natural language processing is built on the shoulders of those techniques–now we need the goods for images.

A couplet of Xhan’s (taken out of context, too) fits both this situation and, ironically, Eliot’s original context in The Rock:

This is twenty first century light, and the
darkness never seemed to shine so bright

3 Comments

  1. I think Google actually uses human input a lot: most links out there are arguably created by humans.

    Posted on 20-Oct-04 at 09:57 | Permalink
  2. Touché. As a computer scientist it’s too easy to focus on the neat algorithms and forget about the source of data. I think the neat thing about Google was that they took data that was created for an entirely different purpose, and found something emergent in it (at least back before people started to game Google, people wrote links to populate their micro-world, with little thought of any big picture). Thanks for the correction; I’ll need to re-think my criticism of manual tagging.

    But I still wonder: is manual tagging the way to go? There are the focused dataphiles who care enough to metamark everything they have, but what about the lazy masses?

    Posted on 22-Oct-04 at 08:36 | Permalink
  3. I think that Google is also showing the strains of scaling–in terms of spamming, and data overload. It’s still useful to me for general queries, for very specific searching…but it’s not particularly useful in getting a feel for zeitgeist, for knowing what my peers think is important or useful, for effective resource discovery.

    (Oh…and thanks for the Eliot couplet; I’ve long wondered where those lines came from!)

    Posted on 25-Oct-04 at 10:05 | Permalink