Tracking Browsing History

26-Jun-11

I’ve long wanted a way to track/store (and later search?) my browser history.
Why do modern browsers throw any of this away? I estimate I consume far less than 50M of HTML per day (Flash videos and large files excluded). I want to be able to search over this and gather stats about my browsing habits. Until now I'd thought of doing this with a local proxy running on my machine; this morning I realized that a much simpler Greasemonkey script should be able to do it. My hypothetical script will inject into every web page a 1x1 image served from some directory on a server I own. In the image's URL, I'll encode the page's URL and any parameters passed in (plus the current time or a random number, to keep the browser from caching the image). Then tracking my browser history is just grepping the server logs for everything served from that directory and decoding the metadata. I like this because it's centralized (it aggregates browsing across many machines in one place) and super simple to install and manage. It doesn't save content, unfortunately, though a separate script running on my server could do that by parsing the logs and fetching each page's contents. That wouldn't work for dynamic pages or ajax, but every click in Gmail or Calendar isn't as important as other pages.
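To make the encode/decode round trip concrete, here is a minimal sketch of the server-side half in Python. The endpoint `https://example.com/track/pixel.gif` and the parameter names `u` and `t` are placeholders I made up for illustration; the browser-side script that would inject the image is not shown.

```python
import time
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint: a directory on a server you control that serves a 1x1 image.
PIXEL_BASE = "https://example.com/track/pixel.gif"


def build_pixel_url(page_url, now=None):
    """Encode the visited page's full URL (query string included) plus a
    timestamp into the image URL; the timestamp doubles as a cache-buster."""
    ts = int(time.time() if now is None else now)
    return PIXEL_BASE + "?" + urlencode({"u": page_url, "t": ts})


def decode_request_path(request_path):
    """Recover (page_url, timestamp) from the request path found in an access log."""
    params = parse_qs(urlsplit(request_path).query)
    return params["u"][0], int(params["t"][0])


pixel = build_pixel_url("http://news.example.org/story?id=42&page=2", now=1309046400)
# An access log would record only the path and query string of this request.
logged_path = pixel[len("https://example.com"):]
print(decode_request_path(logged_path))
```

Percent-encoding the page URL as a single query parameter keeps its own `?` and `&` characters from being confused with the pixel's parameters when the log is decoded later.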

Now Githubbing

14-May-11

Have started uploading a lot of my stuff to github.

It feels strange and vulnerable. I can talk as much as I want about my code, but uploading to internet is something else entirely. It’s a crazy thing to let people peek inside the sausage factory.

A nice side effect, though, is that internet starts fixing my bugs for me. Nice.

So, for starters:

Ranking Algorithms for My Feedreader

20-Mar-11

I have been using a home-brew Feedreader for the last 6 years or so. It’s a river-of-news style aggregator, that ranks posts in order of “interestingness” rather than date, with the most interesting entries that I haven’t seen yet at the top.

Interestingness is derived via my click interactions: if my feedreader shows me an article and I click on it, I’m implicitly voting that the article is interesting. If it shows me something and I don’t click on it, I’m saying it’s not interesting. All of this click data is used to train a naive bayesian classifier, which classifies each new entry as it comes in.
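As a rough illustration of the idea (not the actual feedreader code, which isn't shown here), this is a small sketch of a two-class naive Bayes classifier trained on click/skip feedback. The class name, the word-level tokenizer, and the Laplace smoothing are all assumptions made for the example.

```python
import math
import re
from collections import Counter


class ClickNaiveBayes:
    """Two-class multinomial naive Bayes: 'clicked' vs 'skipped'."""

    def __init__(self):
        self.word_counts = {"clicked": Counter(), "skipped": Counter()}
        self.doc_counts = Counter()

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, clicked):
        """Record one shown entry and whether it was clicked."""
        label = "clicked" if clicked else "skipped"
        self.doc_counts[label] += 1
        self.word_counts[label].update(self.tokenize(text))

    def _log_score(self, tokens, label, vocab_size):
        total_docs = sum(self.doc_counts.values())
        # Laplace-smoothed class prior and per-word likelihoods.
        score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
        total_words = sum(self.word_counts[label].values())
        for tok in tokens:
            count = self.word_counts[label][tok]
            score += math.log((count + 1) / (total_words + vocab_size))
        return score

    def interestingness(self, text):
        """Estimate P(clicked | text), a value in (0, 1) to sort unseen entries by."""
        tokens = self.tokenize(text)
        vocab = set(self.word_counts["clicked"]) | set(self.word_counts["skipped"])
        vocab_size = max(len(vocab), 1)
        pos = self._log_score(tokens, "clicked", vocab_size)
        neg = self._log_score(tokens, "skipped", vocab_size)
        # Normalize the two log scores into a probability without overflow.
        m = max(pos, neg)
        return math.exp(pos - m) / (math.exp(pos - m) + math.exp(neg - m))


nb = ClickNaiveBayes()
nb.train("machine learning ranking tricks", clicked=True)
nb.train("celebrity gossip roundup", clicked=False)
entries = ["new ranking paper on machine learning", "more celebrity gossip"]
for entry in sorted(entries, key=nb.interestingness, reverse=True):
    print(entry)
```

Sorting unseen entries by the estimated click probability, rather than by date, is what produces the "most interesting first" river.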

There are some great advantages to sorting things by interest: there’s no sense of digital guilt when I don’t visit my reader for a few days (if something interesting happens while I’m on a week-long vacation, for instance, it’ll still stay at the top of my queue). Also, I can subscribe to a decently large number of feeds (300 or so, at last count) without feeling information overload.

This ranking system has been very good from a precision point of view (the items that are recommended to me are usually stuff I’m interested in). However, I’ve been feeling lately that recall or coverage is lacking (my top-ranked items are too similar, too echo-chambery, too drawn from the same sources).

(One difficulty I’m having is that precision is very easy to measure. Recall or coverage is much harder to quantify, with just the click/attention data I have available).

So… I’m now beginning to consider changing my ranking algorithm up. Maybe I can capture different views:

  • show me stuff from my friends only (or some pre-specified list of must-reads)
  • show me stuff from blogs I haven’t looked at in a while
  • show me stuff from blogs that publish very infrequently
  • show me stuff that each blog considers abnormally interesting (does it have an abnormally high amount of comments/likes/upboats/etc compared to the typical post)
  • show me stuff restricted by content type (video, img, comic, news, blog)
  • or to generalize, show me stuff that will take a certain estimated time to consume (an image is quick, as is a tweet, a short blog entry is longer, a multi-page Economist article is longer still)
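One of the views above, "abnormally interesting for this blog", can be sketched as a z-score of a post's comment count against that blog's own history. This is only one possible reading of the idea; the feed names and counts below are invented for illustration.

```python
from statistics import mean, pstdev


def abnormality(comment_count, blog_history):
    """Z-score of a post's comment count against its own blog's past posts.
    Returns 0.0 when there is too little history or no variation."""
    if len(blog_history) < 2:
        return 0.0
    sigma = pstdev(blog_history)
    if sigma == 0:
        return 0.0
    return (comment_count - mean(blog_history)) / sigma


# Hypothetical comment counts for past posts on two feeds.
history = {
    "big_blog": [200, 220, 180, 210, 190],
    "small_blog": [2, 3, 1, 2, 2],
}
new_posts = [("big_blog", 230), ("small_blog", 12)]
ranked = sorted(new_posts, key=lambda p: abnormality(p[1], history[p[0]]), reverse=True)
print(ranked)
```

Normalizing per blog means a quiet blog's 12-comment post can outrank a busy blog's 230-comment post, which is the point of the view.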

Visualizing Command Line History

13-Mar-11

So, after documenting how I save a timestamped log of my bash history, I got curious about what kind of analyses I could pull out of it.

(caveat: I only started this logging about a month ago, so there aren’t as many data points as I’d like. However, there is enough to see some interesting trends emerging).

Day of Week

First, here is the spread of activity over day-of-week for my machine at home. I found this surprising! I’d expected my weekend hacking projects to show a significant weekend effect, but I had not anticipated the Thursday slump. It’s interesting when data shows us stuff about ourselves that we didn’t realize. I have no idea what causes the Tuesday mini-spike.

Next, I have activity per hour-of-day, broken up by weekends-only and weekdays-only (because my behavior differs significantly between these two sets).

Weekends

Both charts clearly show my average sleeping times. Weekends show a bump of morning hacking and evening hacking, with less computer time than I’d have expected in the middle of the day.

Weekdays

I love the evening just-got-home-from-work-and-finished-with-dinner spike for the weekdays, followed by evidence of late-night hacking (probably too late for my own good).

Where to go from here

I wonder if the unexpected Tuesday spike and 6pm-weekday spikes are legitimate phenomena or artifacts due to data sparsity. It will be interesting to check back in with this data in a few more months to see how it smooths out. (Ugh, daylight savings time is going to mess with this a bit =/ ).

Also, this only measures one aspect of my activity in a day–stuff typed at the command line, which is mostly programming-related. I would love to plot other information alongside it (emails sent, lines of code written, instant messages sent, songs played, GPS-based movement). I’m tracking much of this already. I’ll need a good way of visualizing all of these signals together, as the graph is going to get a bit crowded. Maybe I’ll pick up that Tufte book again…

(And, speaking of visualization, I think a heatmap of activity per hour of the week would be interesting as well… Google Spreadsheets doesn’t do those, though, so while I have the data I couldn’t whip one up easily tonight).
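The counting half of that heatmap is easy to sketch. The snippet below assumes log lines that start with a `YYYYMMDD.HHMM.SS` timestamp, as in the log described in the "Saving Command Line History" post; the function names are mine, and rendering the grid as colors is left to whatever plotting tool is at hand.

```python
from datetime import datetime


def parse_timestamp(line):
    """Parse the leading 'YYYYMMDD.HHMM.SS' field of a log line."""
    stamp = line.split(" ", 1)[0]
    return datetime.strptime(stamp, "%Y%m%d.%H%M.%S")


def hour_of_week_counts(lines):
    """Return a 7x24 grid of command counts: rows Monday..Sunday, columns hours 0..23."""
    grid = [[0] * 24 for _ in range(7)]
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            ts = parse_timestamp(line)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        grid[ts.weekday()][ts.hour] += 1
    return grid


sample = [
    "20110306.1819.03 /home/mote/dev/load 515 ls",
    "20110306.1819.09 /home/mote/dev/load 516 gvim run_all.sh",
    "20110307.0902.11 /home/mote/dev/load 517 svn st",
]
grid = hour_of_week_counts(sample)
for day, row in zip("Mon Tue Wed Thu Fri Sat Sun".split(), grid):
    print(day, " ".join(f"{n:2d}" for n in row))
```

Timestamps are taken as local wall-clock time, so the daylight-saving shift mentioned above would still show up as a one-hour smear in the grid.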

Lastly, what’s the purpose of this all? What do I want to accomplish from this analysis? They’re nice-looking graphs, for sure. And honestly there is a bit of narcissistic pleasure in self-discovery. And I suppose it’s good to realize things like the mid-week slump (exhaustion from work? external calendar factors?) are happening.

But I’m eventually hoping for something less passive than just observation. Later I look forward to using this data to change myself. I can imagine later setting goals (in bed by a certain hour, up by a certain hour, no coding on day-x vs more coding on day-y) and letting the statistics show my progress towards those goals.

Saving Command Line History

12-Mar-11

I’ve never been satisfied with the defaults for the way Linux and OS X save command line history. For all practical purposes, when we’re talking about text files, we have infinite hard drive space. Why not save every command that we ever type?

First, A Roundup of What’s Out There

Here’s the baseline of what I started with, in bash:

declare -x HISTFILESIZE=1000000000
declare -x HISTSIZE=1000000

But there are a few problems with this: bash and zsh sometimes corrupt their history files, and multiple terminals sometimes don’t interact properly. A few pages have suggested hacks to PROMPT_COMMAND to get terminals to play well together:

briancarper.net

  • relatedly, shopt -s histappend (for bashrc)
  • export PROMPT_COMMAND="history -n; history -a" (upon every bash prompt, write out to history and read in the latest history). While this works, it feels a bit hacky

tonyscelfo.com has a more formalized version of the above.

Further down the rabbit-hole, this guy has a quite complicated script to output each session’s history to a uniquely ID’d .bash_history file. Good, but it only exports upon exit from a session (which I rarely do… for me, sessions either crash (which doesn’t trigger the write) or I don’t close them… still, it’s an interesting idea).

(Aside: shell-sink was an interesting solution to this issue, though it had its own set of problems. Privacy, for one: in case I type passwords at the command prompt, I would really rather not have this stuff live on the web. Also, it’s now obsolete and has been taken down, so it’s not even an alternative anymore.) Links, for posterity:
[1] [2] [3]

Now, what I finally decided to use

Talking to some folks at work, I found this wonderful hack: modify $PROMPT_COMMAND to output to a history file manually… but also output a little context — the timestamp and current path, along with the command. Beautiful!

export PROMPT_COMMAND='if [ "$(id -u)" -ne 0 ]; then echo "`date` `pwd` `history 1`" >> ~/.shell.log; fi'

ZSH doesn’t have $PROMPT_COMMAND but it does have an equivalent.

For posterity, here’s what I ended up with:

  • zsh:

    function precmd() {
      if [ "$(id -u)" -ne 0 ]; then
        FULL_CMD_LOG=/export/hda3/home/mote/logs/zsh_history.log;
        echo "`/bin/date +%Y%m%d.%H%M.%S` `pwd` `history -1`" >> ${FULL_CMD_LOG};
      fi
    }

  • bash:


    case "$TERM" in
      xterm*|rxvt*)
        DISP='echo -ne "\033]0;${USER}@${HOSTNAME}: ${PWD/$HOME/~}\007"'
        BASHLOG='/home/mote/logs/bash_history.log'
        SAVEBASH='if [ "$(id -u)" -ne 0 ]; then echo "`/home/mote/bin/ndate` `pwd` `history 1`" >> ${BASHLOG}; fi'
        PROMPT_COMMAND="${DISP};${SAVEBASH}"
        ;;
      *)
        ;;
    esac

This gets ya a wonderful logfile, full of context, with no risk of corruption:

20110306.1819.03 /home/mote/dev/load 515 ls
20110306.1819.09 /home/mote/dev/load 516 gvim run_all.sh
20110306.1819.32 /home/mote/dev/load 517 svn st
20110306.1819.35 /home/mote/dev/load 518 svn add log_screensaver.py
20110306.1819.49 /home/mote/dev/load 519 svn ci -m "script to log if screensaver is running"

(As an aside, you’ll notice that these commands are all timestamped. Imagine the wealth of personal infometrics data that we can mine from here! When am I most productive (as measured by command-density-per-time-of-day?). What really are my working hours? When do I wake? Sleep? Lunch? )

Next up, need to make a `history`-like command to tail more copy-pastable stuff out of this file.
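A first stab at that could look like the sketch below. It assumes each log line is `timestamp directory history-number command`, as in the sample above, and that directory names contain no whitespace; the function name is made up for illustration.

```python
import re
from collections import deque

# timestamp, working directory (assumed to contain no whitespace),
# history number, then the command itself.
LOG_LINE = re.compile(r"^(\S+)\s+(\S+)\s+(\d+)\s+(.*)$")


def tail_commands(lines, n=10):
    """Return the last n commands from the log, stripped of timestamp,
    directory, and history number so they can be pasted straight into a shell."""
    recent = deque(maxlen=n)
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match:
            recent.append(match.group(4))
    return list(recent)


sample = [
    "20110306.1819.03 /home/mote/dev/load 515 ls",
    "20110306.1819.09 /home/mote/dev/load 516 gvim run_all.sh",
    "20110306.1819.32 /home/mote/dev/load 517 svn st",
    "20110306.1819.35 /home/mote/dev/load 518 svn add log_screensaver.py",
]
for command in tail_commands(sample, n=2):
    print(command)
```

Passing an open file object instead of the sample list works the same way, since the function only iterates over lines.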

Impressionism

20-Feb-11

Spent a lazy Sunday afternoon at the Getty, mostly looking at a handful of Monets.

Aside from being beautiful, they’re amazing glimpses into visual processing in the human brain. Impressionism is, at its core, lossy compression, right?

(To digress a bit, I look at Impressionism as a reaction against photography, with the painter saying to himself: “Look, this camera can capture direct reality far better than I ever could. So what is my role as a painter and artist, now? My brush can never get the colors quite right, the perspective and angle quite perfect. Where is my niche, that I am not obsolete?”

So, impressionism says “my rough strokes can capture the spirit of reality better than the overt literal capture of a camera & lens”.)

So, impressionism is lossy compression, like a too-small .jpg (or, perhaps more accurately, one of those 8-bit tribute albums). It throws away information while still attempting to retain the overall picture. But the information it chooses to throw out seems to imply a wonderful exploitation of the human visual perception system.


Sporadic dashes of green become ship masts, staccato jabs of orange the sun, vague blotches of purple become the fog. But not overtly.

(Classic computer vision & object recognition approaches would certainly work quite poorly on paintings like these. I’d be halfway curious to try to design a system that could do it well).

I wonder what it was like for the first folks exploring this technique. Especially because, standing so close to the canvas, it’s easy to see the literal but hard to get the gestalt.

Authority, Influence in Social Networks [tentative thoughts]

29-Jan-11

I spent the day fiddling around with twitter and buzz, to see what signals I have at my disposal.

Eventually I’d like to get some metrics that quantify a few different aspects of human relationships:

  • Global influence (how much influence does this user have upon the world). This is pretty straightforward.
  • Local influence (how much influence does the user have within his more personal social sphere). This is less straightforward and much more interesting. Relatedly, who are the top influencers for an individual or for a clique of people. And can we get an InfluenceRank(a, b) between any two people, or a person and a group, etc.
  • Level of friendship, or closeness (how vague is that, huh?)
  • sub-graphs within a user’s FOAFs & FOAFOAFs that correspond to different social circles/publics/social identities. I’m pretty sure this is a well-studied problem, but it’s interesting to run the numbers for myself.

I’m just getting started, so here’s a working braindump…

I’d like to come up with some more rigorous definitions for these metrics (maybe look in some social psychology journals? read up on social networks?). And there is plenty of other stuff I want to measure, too…

Note: some of these are by definition unidirectional (influence). Are any relationships or relationship-metrics bidirectional? (Is friendship itself?)

Now, the signals that I have access to:

  • num followers
  • num followers in FOAF network
  • num followers in FOAFOAF network
  • num_replies(a, b)
  • num_reshares(a, b) (not in buzz, though…)
  • num_likes(a, b)
  • more?

These signals should also be normalized by how much a person communicates or follows in general. All we have is the observation “a is following b” or “a is talking to b”; we don’t know the internal impedance in a’s mind. Do they follow lots of people, or is the fact that they are following this one person a more significant event?

I should probably also look at reciprocity. min(replies(a, b), replies(b, a)) for 2 users a and b will be very useful. Add on a minimum threshold (say, 3), and there’s a good proxy for friendship.
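The reciprocity proxy is simple enough to write down directly. In this sketch the reply counts are invented, and `replies[(x, y)]` is assumed to mean "number of times x replied to y".

```python
from collections import Counter


def reciprocity(replies, a, b):
    """min(replies(a, b), replies(b, a)): the weaker direction of the conversation."""
    return min(replies[(a, b)], replies[(b, a)])


def likely_friends(replies, threshold=3):
    """Unordered pairs whose reciprocity meets the threshold."""
    pairs = {tuple(sorted(pair)) for pair in replies}
    return sorted(p for p in pairs if reciprocity(replies, p[0], p[1]) >= threshold)


# Hypothetical reply counts: replies[(x, y)] = number of times x replied to y.
replies = Counter({
    ("alice", "bob"): 5, ("bob", "alice"): 4,
    ("alice", "celeb"): 9,                      # one-sided: celeb never replies
    ("bob", "carol"): 2, ("carol", "bob"): 7,   # below threshold in one direction
})
print(likely_friends(replies))
```

Taking the minimum rather than the sum is what filters out one-sided attention, such as replies aimed at someone who never answers.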

Another problem is that many of these metrics are so sparse! Just because A is friends with B doesn’t mean that A is going to necessarily comment/like/reshare that often.

I should probably also eliminate the “celebrities” of the network (people with friends/followers above a certain amount), or at least treat them differently. These users are closer to proxies for measuring the ideology or worldview of their followers than “friends” in the canonical sense.

The hardest (most interesting?) part of all this will be evaluation. Once I have a metric, how can I quantify how good it is, beyond just eyeballing it? I have no labeled data…

This afternoon, I had some decent success approximating local influence as

num_followers_in_foaf_network - 0.01 * num_followers_globally

(varying that 0.01 constant was a means of penalizing the global popularity of a person… keeping it at 0.01 got me the tech people who influence me personally, 0.05-0.1 got me my non-computery real-life-friends).

This one also worked nicely:

num_followers_in_foaf_network / (1 + log(num_followers_globally))
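Both approximations fit in a few lines. In the sketch below the natural log is assumed (the base isn't specified above), a guard against taking the log of zero has been added, and the two accounts and their follower counts are invented to show how the penalty constant can flip a ranking.

```python
import math


def local_influence_linear(foaf_followers, global_followers, penalty=0.01):
    """Followers inside my FOAF network, minus a penalty for global popularity."""
    return foaf_followers - penalty * global_followers


def local_influence_log(foaf_followers, global_followers):
    """Followers inside my FOAF network, damped by log of global popularity.
    Natural log is an assumption; max(..., 1) guards against log(0)."""
    return foaf_followers / (1 + math.log(max(global_followers, 1)))


# Hypothetical accounts: (followers in my FOAF network, followers globally).
accounts = {"tech_person": (60, 2000), "real_life_friend": (25, 150)}

for penalty in (0.01, 0.05):
    ranked = sorted(
        accounts,
        key=lambda name: local_influence_linear(*accounts[name], penalty=penalty),
        reverse=True,
    )
    print(penalty, ranked)

for name, counts in accounts.items():
    print(name, round(local_influence_log(*counts), 2))
```

With these made-up numbers the globally popular account ranks first at a penalty of 0.01 and drops below the friend at 0.05, the same kind of shift described above.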

p.s. Many thanks to the authors of python-twitter and buzz-python-client, you made my life a lot easier…

Wrapped Up In Books

09-Jan-11

A while ago, I decided I wanted to keep better track of what books I read. So, I created a low-maintenance google spreadsheet form to help me. I suppose I could have gone with a text file or rolled my own webapp to facilitate this, but a spreadsheet form seemed like the best balance of ease-of-data-entry, available-anywhere, and don’t-have-to-write-any-code.

This turned out to be a good choice — I didn’t realize at the time, but Google Spreadsheets has some simple but useful visualization tools. Looking back at my results, some surprising patterns emerged. Y’see, in addition to just tracking standard metadata (author/title/summary), I also ranked the books in a few different dimensions.

[Aside: I’ve been using this stat tracker for fiction books, mostly. My reading habits are different with nonfiction, technical books, and poetry. Nonfiction and technical stuff (machine learning, mostly) I’ll read 5 or 6 at once, skipping around. Maybe this is the influence of the web on my information-foraging. Poetry is even more sporadic, picking up a book when the mood strikes (I’d really like to be more methodical about poetry, but it always seems to be a mood thing).]

Some analysis…

First, here’s a plot of how many books I’ve entered data for per date. You can see I was pretty good about entering data promptly in 2009, while in 2010 I got a bit lazy and didn’t enter data until I had multiple books to add at once. I wish there was a more automatic way to track this data and enter it automatically, especially now that I’m using an ebook reader (Sony Ereader).

[chart: form response frequency]

My books-read-per-year is interesting, as well:

2008 [2 months]: 3 (scale this up to 18)
2009: 28 books
2010: 42 books

I hadn’t made a conscious decision to read more books this last year. If I were to hazard a guess at this stat, it’d be that in 2009 I read primarily paper books and 2010 I used an ereader (it certainly is a lot more convenient, both in terms of carrying books around and also selection/ease of procurement of books).

So, now the main plots: I’m very surprised at the distributions! Such nice curves! I would have never thought myself so… normal.

Note there is serious selection bias in the books I read. I’m not going to start reading a book I’m not interested in. All these graphs have a right-shifted mean that reflects this.

The metrics I’m ranking over reflect the fact that I’m tracking fiction:

[charts: character rank, idea rank, plot, overall]

  • Character (how well are characters portrayed? Are they flat like Asimov’s worst, or round?)
  • Ideas (how interesting/novel are the ideas? Most science fiction I read for the ideas, so this is an important metric for me)
  • Plot (awkward? good?)
  • Overall

I wish desktop software or web software offered me a way to track this data online, ideally in a social fashion. Goodreads has ranking, but nothing multifaceted like this. Which is a shame, because sometimes I want to immerse myself into a fully-formed world, sometimes I want to soak in ideas.

Random Thoughts on Taiwan

30-Dec-10

Wrapping up my trip in Taiwan, I’m struck by a thousand random thoughts…

  • Taiwanese fashion iterates more quickly than fashion in the United States, at a breakneck pace, and even I, fashion-challenged as I am, can see this. Before, I thought it was the result of cheaper clothes that wear out more quickly. Now I am not sure… are they closer to the source/origin of new trends?
  • Taiwanese architecture also iterates more quickly. There’s more adventurism in building design here (with the associated successes and mistakes that you’d expect!). Maybe there’s some other cultural aesthetics (desire for quick change?) at play here.
  • my fingernails grow 2x faster here; cause unknown.
  • Taiwanese is a dying language, sadly (it has a warmth and down-to-earthness that Mandarin lacks). Even as far south as Taichung, kids listen to their parents talk to them in Taiwanese, and then answer back in Mandarin. In the office, no one spoke Taiwanese to one another. There is some cultural preservation backlash (early in my visit, I attended a play that was intentionally set in Taiwanese), but I’m afraid it isn’t enough.
  • My mind staggers at Taiwanese economics. For lunch one day I had a $20NT ($0.8 USD) fatty-pork-with-sauce-atop-rice, followed by a $150NT ($5 USD) coffee. That’s a greater-than-6x ratio. Imagining a typical cheap $5 lunch in the US, can I imagine following it with a $30 cup of coffee?

    There are other interesting economic forces at play. Service fees are substantially cheaper compared to the US. As are locally manufactured goods and foods. Gasoline is about 2x as expensive, I think. Luxury goods are about the same price as in the States (but this is absolute price… relative currency strengths make them about 3x as expensive). How does this all shape society?

  • There are far too many binglang (a carcinogenic nut chewed by the working class) trees here for local consumption. I see way more groves than could be consumed. Do they, then, export? To where?
  • While job types are divided largely along race lines in the US, I see them more divided along age lines here. “The old man the machines”. The old are the ones who work in the factories, in construction, etc. The young are in the cleaner, air-conditioned stores. I don’t know which injustice/imbalance (tw or us) makes me more sad =(.
  • Public vs private: On one hand, Taiwanese keep their private lives very private. On the other hand, they literally air their laundry where everyone can see it. The public turns an obligatory blind eye. Is this a necessity of being short on space, or are there different sets of aesthetics and decency here?
  • Android phones are very common here! On the street, and in commercials.
  • City governments could easily use mechanical turk / crowdsourcing to translate signs and notices. Wonder if they are agile enough to do this? (America’s governments would not be).
  • My favorite game on the street: distinguish the european foreigner from the american foreigner. Or, distinguish the ABC from the local Taiwanese. It’s interesting. There are subtle differences in clothing, mannerisms, that most of the time the conscious mind doesn’t catch but the unconscious mind can pick up on.
  • Tainan restaurant traffic is shaped by information cascades

Tea Drinking Notes, Yuanlin

29-Dec-10

    1. Oriental Beauty (fully oxidized): honey flavored, sweet and dry. light wheat notes, but mostly a strong honeyed flavor. Reminded us of darjeeling.

    2. 10y aged Taiwanese oolong: quite sweet and salivation-inducing. Very smooth and drinkable, though a little too simple and un-nuanced for my tastes.

    3. 2009 loose-leaf black puerh (puerh leaves, but fully oxidized like a black tea). Interesting! The dry and wet leaves smell like a black tea, but the liquor smells and drinks like a puerh. Fruit notes.

    4-8: A series of plantation and wild puerhs, drunk in progression from younger to older (2009 down to 2005). These ran the gamut of smokey to mild, sweet to bitter (though none of that bile lincong, thank God). Part of the motivation for this series was to differentiate between plantation and wild tree, and also the effects of aging (however, I felt there was enough variation between the individual teas of each type that they outweighed any inter-type variation we might have seen).

    9-11: Older puerhs (2001, 1994, 1988 in succession). The 2001 was a tiebing puerh, wild tree, from a mountain (near Yiwu) whose name I didn’t get. Smooth and bitter (in a good way). It was a nice break from the younger puerhs we’d been having. The biggest improvement was in the mouthfeel rather than the taste (full and thick, whereas the younger stuff had been closer to water).

    The 1994 had an improved mouthfeel, and a good earthy flavor. The 1988 (Qiwu) continued this trend with a sweeter, smoother earthiness. It was slightly faded tasting.

    With these finished, we moved to the final two teas of the day…

    12. 1975 7572 Orchid-scent: Sweet and full, but in an indirect way. Subtle on the tongue. A sweet huigan (most of the character of this tea doesn’t come in the taste, but in the aftertaste, which I absolutely love). The mouthfeel is thick and smooth, with a menthol coolness after swallowing. Brimming with qi.

    13. 1960s hong yin
    Even more depth of character. Deep earthiness and menthol. Louder and direct, while the 7572 was more indirect (though just as strong, in its own right). While this is a wonderful, wonderful tea (and the “better” one, if you count by price alone), the subtlety of the 7572 was by far my favorite of the day.

    (Sorry, I was too busy enjoying my last two teas to take any real notes…).

    Random tea notes:
    * My uncle’s Chinese is heavily-Taiwanese-accented, so it’s sometimes hard to follow. He doesn’t say lu cha but li cha. It’s not guoyu he speaks, but goyi.

    * There’s a distinction between hand-picked and machine-clipped when harvesting leaves, and you can tell pretty easily by the edge of the leaf’s stem. Not sure how much effect this has on taste (other than the manually-harvested being perhaps of better quality).

    * The puerh bubble started peaking in earnest in 2007. So teas from 2007 especially (and, to a lesser extent, from the following years) are overharvested (and some are fake). If you can (and if you can afford it), it’s best to buy teas from before then.

    * Ugh. Was up until 4am the night after, from all the caffeine. But very, very worth it.