copious spare time – sardonick

Tracking Browsing History

mote — Mon, 27 Jun 2011 02:59:38 +0000

I’ve long wanted a way to track/store (and later search?) my browser history.
Why do modern browsers throw any of this away? I estimate I consume far less than 50MB of HTML per day (flash videos and large files excluded). I want to be able to search over this, and gather stats about my browsing habits.

Until now I'd thought of doing this with a local proxy running on my machine. This morning I realized that a much simpler Greasemonkey script should be able to do this. My hypothetical script will inject into every web page a 1x1 image served from some directory on a server I own. In the image's URL, I'll encode the page's URL and any parameters passed in (plus the current time or some random number to keep the browser from caching the image). Then, tracking my browser history is just grep'ing the server logs for everything served from that directory, and decoding the metadata.

I like this because it's centralized (it aggregates browsing across many machines in one place) and super simple to install and manage. It doesn't save content, unfortunately, though a separate script running on my server could do that by parsing the logs and fetching the page contents. That wouldn't work for dynamic pages or ajax, but every click in gmail or calendar isn't as important as other pages.
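The log-decoding half of this could look roughly like the sketch below. It is only an illustration: it assumes an Apache/nginx-style access log, a made-up /track/ directory for the image, and a made-up query parameter "u" holding the percent-encoded page URL (with "t" as the cache-busting timestamp). None of those names come from a real setup.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Hypothetical directory that serves the 1x1 tracking image.
TRACK_PREFIX = "/track/"

# Matches the quoted request part of an access-log line: "GET /path HTTP/1.1"
REQUEST_RE = re.compile(r'"GET (\S+) HTTP/[\d.]+"')


def visited_url(log_line):
    """Return the page URL encoded in a tracking-image request, or None."""
    match = REQUEST_RE.search(log_line)
    if match is None:
        return None
    request = urlsplit(match.group(1))
    if not request.path.startswith(TRACK_PREFIX):
        return None
    # parse_qs percent-decodes the values; "u" is the assumed parameter name.
    values = parse_qs(request.query).get("u")
    return values[0] if values else None


sample_line = (
    '1.2.3.4 - - [27/Jun/2011:02:59:38 +0000] '
    '"GET /track/pixel.gif?u=http%3A%2F%2Fexample.com%2Fa%3Fb%3D1&t=1309143578 HTTP/1.1" 200 43'
)
print(visited_url(sample_line))
```

Running this over every line of the log and keeping the non-None results would give the browsing history, in visit order, for every machine that has the script installed.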

Ranking Algorithms for My Feedreader

mote — Mon, 21 Mar 2011 05:46:59 +0000

I have been using a home-brew feedreader for the last 6 years or so. It’s a river-of-news style aggregator that ranks posts in order of “interestingness” rather than date, with the most interesting entries that I haven’t seen yet at the top.

Interestingness is derived via my click interactions: if my feedreader shows me an article and I click on it, I’m implicitly voting that the article is interesting. If it shows me something and I don’t click on it, I’m saying it’s not interesting. All of this click data is used to train a naive bayesian classifier, which classifies each new entry as it comes in.
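For the curious, the general idea can be sketched in a few lines. This is not my actual feedreader code, just a toy word-based naive Bayes with add-one smoothing; the class name, the choice of whitespace-split words as features, and the sample titles are all made up for illustration.

```python
import math
from collections import Counter


class ClickClassifier:
    """Toy naive Bayes over words, trained on clicked vs. skipped entries."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.entry_counts = {True: 0, False: 0}

    def train(self, text, clicked):
        self.entry_counts[clicked] += 1
        self.word_counts[clicked].update(text.lower().split())

    def log_odds(self, text):
        """Smoothed log P(clicked | text) minus log P(skipped | text)."""
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        total_entries = sum(self.entry_counts.values())
        score = 0.0
        for label, sign in ((True, 1), (False, -1)):
            total_words = sum(self.word_counts[label].values())
            # Add-one smoothing on the class prior and on each word likelihood.
            score += sign * math.log((self.entry_counts[label] + 1) / (total_entries + 2))
            for word in text.lower().split():
                likelihood = (self.word_counts[label][word] + 1) / (total_words + len(vocab) + 1)
                score += sign * math.log(likelihood)
        return score


clf = ClickClassifier()
clf.train("python machine learning tricks", clicked=True)
clf.train("celebrity gossip roundup", clicked=False)
print(clf.log_odds("learning python") > clf.log_odds("celebrity roundup"))
```

Sorting unseen entries by log_odds, highest first, gives the "most interesting at the top" ordering.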

There are some great advantages to sorting things by interest: there’s no sense of digital guilt when I don’t visit my reader for a few days (if something interesting happens while I’m on a week-long vacation, for instance, it’ll still stay at the top of my queue). Also, I can subscribe to a decently large number of feeds (300 or so, at last count) without feeling information overload.

This ranking system has been very good from a precision point of view (the items that are recommended to me are usually stuff I’m interested in). However, I’ve been feeling lately that recall or coverage is lacking (my top-ranked items are too similar, too echo-chambery, too drawn from the same sources).

(One difficulty I’m having is that precision is very easy to measure. Recall or coverage is much harder to quantify, with just the click/attention data I have available).

So… I’m now beginning to consider changing my ranking algorithm up. Maybe I can capture different views:

show me stuff from my friends only (or some pre-specified list of must-reads)
show me stuff from blogs I haven’t looked at in a while
show me stuff from blogs that publish very infrequently
show me stuff that each blog considers abnormally interesting (does it have an abnormally high number of comments/likes/upboats/etc compared to the typical post?)
show me stuff restricted by content type (video, img, comic, news, blog)
or to generalize, show me stuff that will take a certain estimated time to consume (an image is quick, as is a tweet, a short blog entry is longer, a multi-page economist article is longer still)
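One way the "abnormally interesting" view could be scored is a plain z-score of a post's comment count against that blog's own history. This is a sketch of one possible definition, not something I've built; the function name and the sample numbers are invented.

```python
from statistics import mean, pstdev


def abnormality(comment_count, past_counts):
    """Z-score of a post's comment count against the same blog's past posts."""
    if len(past_counts) < 2:
        return 0.0  # not enough history to call anything abnormal
    spread = pstdev(past_counts)
    if spread == 0:
        return 0.0  # every past post had the same count; no scale to compare against
    return (comment_count - mean(past_counts)) / spread


typical_blog_history = [2, 4, 3, 3, 3]
print(abnormality(30, typical_blog_history))  # far above this blog's usual level
print(abnormality(3, typical_blog_history))   # a perfectly ordinary post
```

Because the score is relative to each blog's own baseline, a quiet blog with one unusually busy post can outrank a popular blog having a normal day.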

Authority, Influence in Social Networks [tentative thoughts]

mote — Sun, 30 Jan 2011 04:47:30 +0000

I spent the day fiddling around with twitter and buzz, to see what signals I have at my disposal.

Eventually I’d like to get some metrics that quantify a few different aspects of human relationships:

Global influence (how much influence does this user have upon the world). This is pretty straightforward.
Local influence (how much influence does the user have within his more personal social sphere). This is less straightforward and much more interesting. Relatedly, who are the top influencers for an individual or for a clique of people? And can we get an InfluenceRank(a, b) between any two people, or a person and a group, etc.?
Level of friendship, or closeness (how vague is that, huh?)
sub-graphs within a user’s FOAFs & FOAFOAFs that correspond to different social circles/publics/social identities. I’m pretty sure this is a well-studied problem, but it’s interesting to run the numbers for myself.

I’m just getting started, so here’s a working braindump…

I’d like to come up with some more rigorous definitions for these metrics (maybe look in some social psychology journals? read up on social networks?). And there is plenty of other stuff I want to measure, too…

Note: some of these are by definition unidirectional (influence). Are any relationships or relationship-metrics bidirectional? (is friendship itself?)

Now, the signals that I have access to:

num followers
num followers in FOAF network
num followers in FOAFOAF network
num_replies(a, b)
num_reshares(a, b) (not in buzz, though…)
num_likes(a, b)
more?

These signals should also be normalized over how much a person communicates or follows in general. All we have is the observation “a is following b” or “a is talking to b”; we don’t know the internal impedance in a’s mind. Do they follow lots of people, or is the fact that they are following this one person a more significant event?

I should probably also look at reciprocity. min(replies(a, b), replies(b, a)) for 2 users a and b will be very useful. Add on a minimum threshold (say, 3), and there’s a good proxy for friendship.
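As a rough sketch, the reciprocity proxy is only a couple of lines. The dict keyed by (sender, recipient) pairs is just an assumed data layout for illustration, and the user names and counts are invented.

```python
def reciprocity(replies, a, b):
    """min(replies(a, b), replies(b, a)), with replies keyed by (sender, recipient)."""
    return min(replies.get((a, b), 0), replies.get((b, a), 0))


def likely_friends(replies, a, b, threshold=3):
    """Friendship proxy: both users have replied to each other at least `threshold` times."""
    return reciprocity(replies, a, b) >= threshold


# Invented reply counts.
reply_counts = {
    ("alice", "bob"): 7,
    ("bob", "alice"): 4,
    ("alice", "celebrity"): 12,  # one-sided: the celebrity never replies
}
print(likely_friends(reply_counts, "alice", "bob"))
print(likely_friends(reply_counts, "alice", "celebrity"))
```

Taking the minimum is what makes the metric symmetric: a fan who replies a lot to someone who never answers scores zero.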

Another problem is that many of these metrics are so sparse! Just because A is friends with B doesn’t mean that A is going to necessarily comment/like/reshare that often.

I should probably also eliminate the “celebrities” of the network (people with friends/followers above a certain amount), or at least treat them differently. These users are closer to proxies for measuring ideology or worldview of their followers, rather than “friends” in the canonical sense.

The hardest (most interesting?) part of all this will be evaluation. Once I have a metric, how can I quantify how good it is, beyond just eyeballing it? I have no labeled data…

This afternoon, I had some decent success approximating local influence as

num_followers_in_foaf_network - 0.01 * num_followers_globally

(varying that 0.01 constant was a means of penalizing the global popularity of a person… keeping it at 0.01 got me the tech people who influence me personally, 0.05-0.1 got me my non-computery real-life-friends).

This one also worked nicely:

num_followers_in_foaf_network / (1 + log(num_followers_globally))
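Written out as code, the two approximations above look like this. The function names are made up, the natural log is an assumption (the formula doesn't depend much on the base), and the max(..., 1) guard is an addition so that a user with zero followers doesn't blow up the log.

```python
import math


def local_influence_linear(foaf_followers, global_followers, penalty=0.01):
    """FOAF followers minus a penalty proportional to global popularity."""
    return foaf_followers - penalty * global_followers


def local_influence_log(foaf_followers, global_followers):
    """FOAF followers damped by the log of global popularity (natural log assumed)."""
    return foaf_followers / (1 + math.log(max(global_followers, 1)))


# Invented numbers: a well-known tech person vs. a real-life friend.
for name, foaf, world in [("tech_celebrity", 40, 1_000_000), ("real_life_friend", 12, 150)]:
    print(name, local_influence_linear(foaf, world), local_influence_log(foaf, world))
```

The linear version punishes global popularity much harder than the log version, which is why nudging the penalty constant up shifts the ranking from well-known people toward real-life friends.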

p.s. Many thanks to the authors of python-twitter and buzz-python-client, you made my life a lot easier…

Wrapped Up In Books

mote — Sun, 09 Jan 2011 08:41:43 +0000

A while ago, I decided I wanted to keep better track of what books I read. So, I created a low-maintenance google spreadsheet form to help me. I suppose I could have gone with a text file or rolled my own webapp to facilitate this, but a spreadsheet form seemed like the best balance of ease-of-data-entry, available-anywhere, and don’t-have-to-write-any-code.

This turned out to be a good choice — I didn’t realize at the time, but Google Spreadsheets has some simple but useful visualization tools. Looking back at my results, some surprising patterns emerged. Y’see, in addition to just tracking standard metadata (author/title/summary), I also ranked the books in a few different dimensions.

[Aside: I’ve been using this stat tracker for fiction books, mostly. My reading habits are different with nonfiction, technical books, and poetry. With nonfiction and technical stuff (machine learning, mostly) I’ll read 5 or 6 at once, skipping around. Maybe this is the influence of the web on my information-foraging. Poetry is even more sporadic: I pick up a book when the mood strikes (I’d really like to be more methodical about poetry, but it always seems to be a mood thing).]

Some analysis…

First, here’s a plot of how many books I’ve entered data for per date. You can see I was pretty good about entering data promptly in 2009, while in 2010 I got a bit lazy and didn’t enter data until I had multiple books to add at once. I wish there were a more automatic way to track and enter this data, especially now that I’m using an ebook reader (Sony Ereader).

My books-read-per-year is interesting, as well:

2008 [2 months]: 3 (scale this up to 18)
2009: 28 books
2010: 42 books

I hadn’t made a conscious decision to read more books this last year. If I were to hazard a guess at the reason, it’d be that in 2009 I read primarily paper books and in 2010 I used an ereader (it certainly is a lot more convenient, both in terms of carrying books around and also selection/ease of procurement of books).

So, now the main plots: I’m very surprised at the distributions! Such nice curves! I would have never thought myself so… normal.

Note there is serious selection bias in the books I read. I’m not going to start reading a book I’m not interested in. All these graphs have a right-shifted mean that reflects this.

The metrics I’m ranking over reflect the fact that I’m tracking fiction:

Character (how well are characters portrayed? Are they flat like Asimov’s worst, or round?)
Ideas (how interesting/novel are the ideas? Most science fiction I read for the ideas, so this is an important metric for me)
Plot (awkward? good?)
Overall

I wish desktop software or web software offered me a way to track this data online, ideally in a social fashion. Goodreads has ranking, but nothing multifaceted like this. Which is a shame, because sometimes I want to immerse myself in a fully-formed world, sometimes I want to soak in ideas.

Consolidating Music Metadata

mote — Mon, 04 Dec 2006 21:48:41 +0000

Finally finished a script a couple weekends ago to synchronize data between Amarok, Rhythmbox, and iTunes. I now use Amarok exclusively, and it’d been bugging me for a long time that my old metadata from multiple machines and multiple apps was locked away and unexploitable. So I fixed that, for myself at least. I harvest everything into a common format and populate a big ol’ database with it. Then I merge all the metadata together (averaging and adding, whatever, where necessary).
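The merge step is roughly the following idea. This is an illustrative sketch rather than the actual script: the common format here (a dict keyed by (artist, title) with optional "play_count" and "rating" fields) is invented, and the real thing has to read each player's own database first.

```python
from collections import defaultdict


def merge_tracks(*libraries):
    """Merge per-player track records keyed by (artist, title).

    Play counts are added; ratings are averaged over the players that have one.
    """
    plays = defaultdict(int)
    ratings = defaultdict(list)
    for library in libraries:
        for key, record in library.items():
            plays[key] += record.get("play_count", 0)
            if record.get("rating") is not None:
                ratings[key].append(record["rating"])
    merged = {}
    for key in plays:
        scores = ratings.get(key)  # .get avoids creating empty entries
        merged[key] = {
            "play_count": plays[key],
            "rating": sum(scores) / len(scores) if scores else None,
        }
    return merged


# Invented sample data in the assumed common format.
amarok = {("Some Artist", "Some Song"): {"play_count": 10, "rating": 4}}
itunes = {
    ("Some Artist", "Some Song"): {"play_count": 5, "rating": 5},
    ("Other Artist", "Other Song"): {"play_count": 1},
}
merged_library = merge_tracks(amarok, itunes)
print(merged_library)
```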

The code is ugly for now, so no public release. I might clean it up some time if anybody else wants it. Just ask.

Back

mote — Tue, 10 Jan 2006 21:02:51 +0000

Back now. And married, too! Strangely enough, not much is different, with a few subtle exceptions:

our kitchen is a bit more well-equipped from the wedding gifts
I get to wake up every morning next to a beautiful woman
our apartment is an absolute mess from all the moving boxes
I suddenly have a little free time, because there’s no more wedding to plan
I don’t have to say goodbye to Mindy at night any more.

The wedding turned out wonderfully, if a bit hectic (all the “usual” last-minute wedding preparations, plus the emotional strain, plus moving Mindy’s stuff into my apartment, plus the wonderful-but-tiring opportunity to host 6 out-of-country guests (bridesmaids, plus the parents-in-law, plus my sister-in-law-in-law (err, my brother-in-law’s wife… what’s the name of that relationship?))). It was Mindy’s parents’ first time in the States, which meant it was the first time our parents met, and also the first time that I got an opportunity to really host them. That last bit was really good: the opportunity to host them. In the past, I had always been visiting them, which meant it was them driving me around, them treating me to good restaurants, them cooking for me. I find it hard, generally, to serve Taiwanese parents. (Don’t misunderstand, the hard part is not in finding motivation to serve, but in getting them to let you serve them. From my ?? perspective, the parent-child relationship is so fixed that it’s almost awkward for them to receive care instead of giving care.) However, now that they were on my home turf, heh… I finally got to treat them at restaurants, drive them around, cook for them. It was wonderful to be able to return the love, finally.

But man, that last week before the wedding was hectic.

More later: hopefully plenty of pictures, an itinerary of wine tasting (the blogosphere was disappointingly uninformative on good Santa Barbara vineyards to visit), and Santa Barbara food.

emergent road maps and route-finding

mote — Wed, 28 Sep 2005 23:38:38 +0000

Cold-medicine induced altered mental-state yesterday gave me an interesting idea:
Right now all the mapping companies (mapquest, yahoo maps, google maps) use NavTeq to gather data. Once (if ever) cell phone usage data tied to GSM becomes public (a la MIT’s Mobile Landscape), I wonder if you could aggregate data of people going from point A to point B, and look up solutions to the travelling salesman problem.

Similar to the concept of emergent garden paths in landscape architecture and the wisdom of crowds, let’s let public agreement decide the best way to go from one place to another.
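In its simplest form, "let public agreement decide" could mean picking the most-travelled sequence of waypoints between two places. The sketch below does only that (it is a popularity vote, not a travelling-salesman solver), and it assumes each trip has already been reduced to an ordered list of named waypoints, which is the hard part and is hand-waved here. All names and data are invented.

```python
from collections import Counter


def most_popular_route(trips, start, end):
    """Return the most frequently travelled waypoint sequence from start to end."""
    routes = Counter(
        tuple(trip)
        for trip in trips
        if len(trip) >= 2 and trip[0] == start and trip[-1] == end
    )
    if not routes:
        return None
    route, _count = routes.most_common(1)[0]
    return list(route)


# Invented traces: most people skip the highway and cut through the park road.
observed_trips = [
    ["home", "park_road", "office"],
    ["home", "highway", "office"],
    ["home", "park_road", "office"],
    ["home", "park_road", "cafe"],
]
print(most_popular_route(observed_trips, "home", "office"))
```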

What do you think?

More interesting reading:
Wikipedia on the Wisdom of Crowds
IAWiki on Emergent Architecture

Data Synchronization/Backup Headaching

mote — Sat, 24 Sep 2005 17:35:15 +0000

A friend of mine’s recent hard drive catastrophe finally got me around to setting up a decent, cron’ed backup for all the stuff I don’t store in my svn server (mp3s, photos, and other media just don’t change enough to merit the overhead of checking them into a VCS).

RSync would work well, you’d think… Except that I have all this stuff on a fat32 drive (for the rare boot back into windows). And RSync does NOT play well with fat32, as I’m finding.

mote@server1 /media $ rsync -av -e "ssh -l mote" server2:/fat32/data/media/test_dir /media/test_dir

receiving file list … done
created directory /media/test_dir
rsync: failed to set times on "/media/test_dir/test_dir": Operation not permitted (1)
rsync: mkstemp "/media/test_dir/test_dir/.1.txt.3FfWKE" failed: Operation not permitted (1)
rsync: mkstemp "/media/test_dir/test_dir/.2.txt.IT6nr2" failed: Operation not permitted (1)
rsync: failed to set times on "/media/test_dir/test_dir": Operation not permitted (1)

sent 56 bytes received 212 bytes 21.44 bytes/sec
total size is 22 speedup is 0.08
rsync error: some files could not be transferred (code 23) at main.c(1173)

Google brings up this blog entry dealing with similar troubles. It suggests some rsync workarounds, and looking into FullSync (supposedly rsync with a solution to the fat32 headaches).
Unison (slick but crashy from what I hear) is also a possible option. (Feature comparison for rsync and unison)

UPDATE:
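for a quick get-rsync-working thing, "rsync -rvt" works while "rsync -av" doesn’t. It’s that damn "-a" that was causing the problems: it is shorthand for -rlptgoD, so on top of -r and -t it also tries to preserve symlinks, permissions, owner and group, which fat32 presumably can’t store. I’ll still look into unison, though.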
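For the cron job itself, a small wrapper along these lines would do. It is a sketch, not what is actually running: the function names are made up, and the only rsync options used are the ones from this post (-r, -v, -t and -e).

```python
import subprocess


def fat32_rsync_command(source, destination, ssh_user=None):
    """Build an rsync command that uses -rvt instead of -a, which failed on fat32."""
    command = ["rsync", "-rvt"]
    if ssh_user is not None:
        # Same remote-shell option as the command line shown earlier in this post.
        command += ["-e", f"ssh -l {ssh_user}"]
    return command + [source, destination]


def run_backup(source, destination, ssh_user=None):
    """Run the backup; raises CalledProcessError if rsync exits non-zero (e.g. code 23)."""
    subprocess.run(fat32_rsync_command(source, destination, ssh_user), check=True)


print(fat32_rsync_command("server2:/fat32/data/media/test_dir", "/media/test_dir", ssh_user="mote"))
```

Letting a failed rsync raise an exception means cron will mail the traceback instead of the backup silently rotting.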

sxsw microaggregated

mote — Tue, 15 Mar 2005 19:27:52 +0000

Some day, I need to go to SXSW–if only for the experience of being around so many creative nerds at the same time. I never see some of my friends so brimming over with ideas and vitality of life as when they come back from something like this…

In the meantime, I can live vicariously through my blogroll…

Leonard: 0, 1, 2
Liz: 0, 1, 2, 3

Nixie Clocks

mote — Mon, 17 Jan 2005 19:08:39 +0000

Someday I’d like to build a clock made from old-school nixie tubes. There’s something warm, reassuringly retro about these things. Like the comforting clicks of old keyboards when you juxtapose them with the cold feedbacklessness of touchscreen inputs. Slick and streamlined makes for professional and productive user interfaces, I’ll grant you… but there’s something soothing about these dinosaurs–soothing like the smell of old oil, sawdust, and electronics in your grandpa’s garage…

Anyways…
I’ll need to cultivate some quality electronics-hacking-skillz before I can attempt something like this.
Why is the world filled with so many things to learn?

Google image search for nixie clocks.