Cognitive release?
11-Sep-08
Twittering provides just enough cognitive release that I haven’t been feeling the blogging itch. It’s so much lower-maintenance.
Disclaimer: The following web space does not contain my own opinions, but merely linguistic representations thereof.
Spent this last week in NY, for a summit on Machine Learning for work. Had a wonderful time. Networking-wise I found the experience much better than academia (more possibility, ease for future collaboration with the folks I met).
Jetlag did a number on me (going to sleep at West Coast Time, then waking up in time for 8am EST talks…). Feh. Talks were worth it, though.
Met a bunch of really interesting people. It was great to talk to other folks that are doing classification at Google (including taxonomic classification like I’m doing!). Learned a lot (note to self: read up more on Gibbs sampling, latent Dirichlet allocation, and the RCV1 corpus). Also got to hang out with Chih-Jen Lin a bit, and had great discussions about academia, publications (quality versus quantity of publications in China/Taiwan), and how to reform the academic publication system to give better signals as to what is readworthy. Right now, reviewers and conference organizers are the gatekeepers, and that doesn’t scale well. Think of how many smart people read academic publications, and yet the only way that they can give feedback is to publish something themselves. That’s such a high cost to communicate that it leads to stagnation and monoculturality in the community (echo chamber!). I wish I could easily see, for the profs I respect, a “Papers I read last year that I found really influential” list. Aggregate these and you get great, quantitative metrics of a paper’s worth. Also, “best in conference” awards for papers are so short-sighted; we really don’t know what’s good until a couple of years later. It would be great to have a “best 2 years ago” retrospective award for conferences and journals. CiteULike starts to address these issues, but it isn’t in wide use and it’s not perfect.
The city itself
Conversations at work recently have turned again and again to consciousness and self-awareness (what, you thought “Android” was just a phone? ;) ). Now, I’m not going to belabor the point with discussions of artificial intelligence and yet another amateur’s resummarization of Searle’s Chinese Room[1]. Instead, I’ve been thinking about self-awareness in groups of humans.
A bullet-point braindump:
Hmm, I’ll have to think more about this… so many premature thoughts… And most of them the result of only 4 hours of sleep for the last couple days. My apologies, dear anonymous reader, for the unpolished words, the undeveloped concepts, the flaws. “Time past and time future / Allow but a little consciousness.”
[1] (In any case, I love Ben Goertzel’s take on the situation, which, to paraphrase: “When the time comes, and you’re actually arguing with the computer whether it is self-aware or not, then the point is already moot, isn’t it?”)
Two things this weekend:
find . | grep "/cache$" | xargs rm -rf
Basically, that removed the outdated cache. Not sure if it was an internal file format thing due to a different version, or a timestamp thing, or what. But it works now.
Also, this morning, via conversation with Hao-Chuan:
I need to think more about intellectual foraging. Metaphor of information tracking/consumption, based on food tracking/consumption.
(aside: It’s now August, last post in April… where has time gone? I think I was a lot more motivated to blog when I was back in Academia. The atmosphere back then was a bit… stagnating intellectually, so the internets became my vent. Now, here at Google, I’m in general more intellectually fulfilled, and I work around great people every day. This is so strange; I thought it was supposed to be the opposite (academia being the haven and nurturer of free thinking, and industry being the great pit of stagnation). Both, at least in my own microcosm, are anything but).
Migrating my life away from the ISI servers, as I don’t know how much longer I’ll have access to them. That means this blog needs a new home. And this is where it will stay, I guess, perhaps for the next decade at least…
My email, too. It’s now nick-at-motespacedotcom. Hosting everything myself, away from university hardware. The old email addresses I had will remain indefinitely, but I’m phasing them out. I suppose it’s good to change things up, but I’m going to miss fairuz, my old server that was sitting on a fat pipe out where ARPANET was birthed and, coincidentally, a couple floors below ICANN.
I suppose this is all a roundabout way of saying that my academic life is unfortunately on a bit of hiatus right now. I’m taking a one- or two-year leave of absence from USC, and am working for Google in the interim. When I come back, I’ll likely be transitioning away from Computer Aided Language Learning (that half-written thesis will be good for kindling next time I go camping, perhaps), and into the Ontology depths of Natural Language Processing.
Went to see the Magritte exhibit at LACMA today. Thoughts:
As a break from Artificial Intelligence ponderings, I was recently mailed a sampler of Darjeeling oolong teas so that I could participate in an “on-line tasting”. Thanks to T-Ching, and Phyll Sheng for arranging the tasting. And, finally, a huge thanks to Lochan Tea for providing the leaves, and for pioneering this new form of tea production.
First, an explanatory note: Oolong teas are “partially fermented”, in that they’re halfway between unfermented green tea and completely fermented black tea. Typically, Oolongs are produced in Taiwan and Southern China. Darjeeling, India, by contrast, has traditionally produced black teas. A while ago, the enterprising and globalizing Lochan growers decided to try preparing their Darjeeling leaves using traditional Chinese methods to produce Oolongs.
The three samples I tasted were quite exciting. Definitely a fusion of tastes, different from both typical Darjeeling and typical Oolong, but maintaining enough qualities of each that you can tell that it’s a mix of the two.
I brewed all three of these by Lochan’s provided recommendations (1 cup water with 1 teaspoon dry leaves, brewed for 3-4 minutes, 3 brewings) rather than typical Chinese style (higher leaf:water ratio, shorter brewing time, more brewings). Once I have a bit more time I’d like to go back and try brewing again, but using the Gong Fu method. Just curious.
All three were brewed using Glacier Springs water (good water is a requisite when tasting tea, and I like the flavors that a high mineral content water brings out). The teas were rated on a scale of 1 (worst) to 5 (best), with a focus on flavor/smell/huigan rather than leaf appearance.
Here are my tasting notes:
Thinking a bit about AI this weekend. 30 years ago, we tried to imagine what life would be like in 2010. Intelligent Agents, Strong AI, etc etc. It’s a bit disheartening that the pinnacle of AI we have to show for our efforts amounts to things like PageRank and phrase-based statistical machine translation.
This is not to say that either of these algorithms is bad; on the contrary, they accomplish exactly what they set out to do, and they do it well. But there’s no magic to them. No glimmer of human-like intelligence behind them. They show us that the major accomplishment of AI for these past few decades is one of statistics (treating measurable phenomena like the trust metric of a website, or how a word in one language corresponds to a word in another) rather than one of intelligence in the more generalizable sense.
I suppose this isn’t a bad thing—the things we build do what they were built to do, after all—but still, the idealistic part of me that grew up reading Asimov and Heinlein… that part of me can’t help but wish that submarines could swim.
I ran across Leonard Richardson’s Ultra Gleeper again yesterday. I hadn’t seen it for a year, and it’s been good looking at it with new eyes since I’ve begun hacking in earnest on machine learning problems and measuring “interestingness” of RSS posts.
The project is interesting because he aims to solve the same problem I do (automatically find interesting/relevant things on the net), but goes about it in a COMPLETELY different way–different in the “automatic finding” and also different in the “deriving interestingness”.
For automatic harvesting of possible material, Richardson looks to a number of resources (not only whatever is pointed at by his RSS feeds, but also whatever is pointed at by what his RSS feeds point at—basically, he wants to search not only his information sources, but also what his own information sources treat as sources themselves). He also harvests from technorati, some custom google queries, delicious (until Joshua got mad =) ), etc.
Good stuff. This harvesting technique definitely broadens the search space. If I look at it in natural language processing terms, I would say it increases “recall”: over the set of all possible interesting articles, looking at a larger set of articles on the whole increases our chances of finding interesting stuff that a more conservative algorithm wouldn’t find. However, if we take this approach then we’d have to have much stronger algorithms that can guess interestingness; otherwise precision will suffer. Now, I want to say that the RSS feeds that I already subscribe to are nearly guaranteed to point to “interesting” pages, since otherwise I wouldn’t subscribe to them.
In other words, first-degree information sources (what I treat as a source) are high precision but low recall. Second-degree sources (what my sources treat as a source), by contrast, have much higher recall (their quantity grows on the order of n-squared rather than n), but take a hit to precision.
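To make the trade-off concrete, here is a minimal sketch on a toy link graph. Everything in it is invented for illustration: the feed and page names, the link structure, and the hand-labeled `interesting` set are assumptions, not data from Ultra Gleeper or from my own feeds.

```python
def precision_recall(candidates, relevant):
    """Return (precision, recall) of a candidate set against a relevant set."""
    hits = len(candidates & relevant)
    precision = hits / len(candidates) if candidates else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def pointed_at(sources, links):
    """Collect every page linked to by any of the given sources."""
    found = set()
    for source in sources:
        found |= links.get(source, set())
    return found


# Hypothetical link graph: source -> pages it points at.
links = {
    "feed_a": {"a1", "a2"},
    "feed_b": {"b1", "a2"},
    "a1": {"x1", "x2", "x3"},
    "a2": {"x3", "x4"},
    "b1": {"x5", "x6", "x7"},
}
my_feeds = {"feed_a", "feed_b"}

# Hypothetical ground truth: the pages I would actually find interesting.
interesting = {"a1", "a2", "b1", "x1", "x4", "x5"}

# First degree: what my feeds point at.
first_degree = pointed_at(my_feeds, links)
# Second degree: additionally, what those pages point at.
second_degree = first_degree | pointed_at(first_degree, links)

print("first degree: ", precision_recall(first_degree, interesting))
print("second degree:", precision_recall(second_degree, interesting))
```

In this toy graph the first-degree set is small and entirely interesting, while the second-degree set finds everything interesting but drags in uninteresting pages along the way. Real numbers would of course depend on the feeds and on how "interesting" gets labeled.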
With an order of magnitude (or more!) more information, we’d need big changes: changes in GUI, changes in expectation from the program, changes in ranking algorithm.
Richardson also addresses the last of these points, the change in ranking algorithm. To compute interestingness, he does something smart that’s almost like a “reverse PageRank”. He says “things are interesting if they point to interesting things” (contrast this with Google’s PageRank, which says “things are interesting if they are pointed to by interesting things”).
The good thing about this is that, once you have some initial seed data, it becomes a sort of passive goodness metric. It bootstraps off of the initial data you provide so that it can continue learning even after you stop providing it data. Something close to semi-supervised machine learning, in other words: a little labeled seed data, then learning from the link structure alone.
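Here is a rough sketch of how the "interesting if it points to interesting things" idea could be bootstrapped from a few seed ratings. This is my own guess at the shape of such an algorithm, not Richardson's actual implementation: the function name `propagate_interest`, the `damping` factor, the averaging rule, and the toy data are all assumptions.

```python
def propagate_interest(out_links, seed_scores, damping=0.5, iterations=20):
    """Spread interestingness backwards along links.

    A page with no seed rating gets `damping` times the average score of the
    pages it links to. Seed ratings (the data the user provided) stay fixed.
    """
    pages = set(out_links) | set(seed_scores)
    for targets in out_links.values():
        pages |= set(targets)

    scores = {page: seed_scores.get(page, 0.0) for page in pages}
    for _ in range(iterations):
        updated = {}
        for page in pages:
            if page in seed_scores:
                updated[page] = seed_scores[page]
                continue
            targets = out_links.get(page, [])
            if targets:
                average = sum(scores[t] for t in targets) / len(targets)
                updated[page] = damping * average
            else:
                updated[page] = 0.0  # links to nothing we know about
        scores = updated
    return scores


# Hypothetical data: page -> pages it links to, plus a few hand ratings.
out_links = {
    "blog_x": ["good_1", "good_2"],
    "blog_y": ["good_1", "dull_1"],
    "blog_z": ["blog_x"],
}
seed_scores = {"good_1": 1.0, "good_2": 1.0, "dull_1": 0.0}

scores = propagate_interest(out_links, seed_scores)
for page in sorted(scores, key=scores.get, reverse=True):
    print(f"{page}: {scores[page]:.2f}")
```

Note the direction of flow: score travels from a link's target back to the page doing the linking, which is the opposite of PageRank. `blog_z` was never rated and links to nothing that was rated, yet it still picks up a score through `blog_x`, which is the "keeps learning after you stop providing data" part.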
Lots of things to think on.