research – sardonick
http://motespace.com/blog
Disclaimer: The following web space does not contain my own opinions, merely linguistic representations thereof.

Visualizing Command Line History
http://motespace.com/blog/2011/03/13/visualizing-command-line-history/ (Sun, 13 Mar 2011)

So, after documenting how I save a timestamped log of my bash history, I got curious about what kinds of analyses I could pull out of it.

(caveat: I only started this logging about a month ago, so there aren’t as many data points as I’d like. However, there is enough to see some interesting trends emerging).

Day of Week

First, here is the spread of activity over day-of-week for my machine at home. I found this surprising! I’d expected my weekend hacking projects to show a significant weekend effect, but I hadn’t anticipated the Thursday slump. It’s interesting when data shows us stuff about ourselves that we didn’t realize. I have no idea what causes the Tuesday mini-spike.

Next, I have activity per hour-of-day, broken up by weekends-only and weekdays-only (because my behavior differs significantly between these two sets).
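
Here’s roughly how these counts come out of the log. This is a minimal sketch in Python, assuming the ndate-style timestamps from my logging setup (lines starting with %Y%m%d.%H%M.%S) and a placeholder log path:

    # Aggregate the timestamped shell log into day-of-week and
    # hour-of-day counts (weekday vs. weekend buckets). Assumes lines
    # like "20110306.1819.03 /home/mote/dev/load 515 ls"; the log path
    # below is a placeholder.
    import collections
    import datetime
    import os

    LOG = os.path.expanduser("~/.shell.log")  # placeholder path

    by_day = collections.Counter()
    by_hour = {"weekday": collections.Counter(),
               "weekend": collections.Counter()}

    with open(LOG) as f:
        for line in f:
            try:
                t = datetime.datetime.strptime(line.split(None, 1)[0],
                                               "%Y%m%d.%H%M.%S")
            except (ValueError, IndexError):
                continue  # skip malformed lines
            by_day[t.strftime("%a")] += 1
            bucket = "weekend" if t.weekday() >= 5 else "weekday"
            by_hour[bucket][t.hour] += 1

    print(by_day)
    print(by_hour["weekday"])
    print(by_hour["weekend"])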

Weekends

Both charts clearly show my average sleeping times. Weekends show a bump of morning hacking and evening hacking, with less computer time than I’d have expected in the middle of the day.

Weekdays

I love the evening just-got-home-from-work-and-finished-with-dinner spike for the weekdays, followed by evidence of late-night hacking (probably too late for my own good).

Where to go from here

I wonder if the unexpected Tuesday spike and 6pm-weekday spikes are legitimate phenomena or artifacts due to data sparsity. It will be interesting to check back in with this data in a few more months to see how it smooths out. (Ugh, daylight savings time is going to mess with this a bit =/ ).

Also, this only measures one aspect of my activity in a day (stuff typed at the command line, which is mostly programming-related). I would love to plot other information alongside it (emails sent, lines of code written, instant messages sent, songs played, GPS-based movement). I’m tracking much of this already. I’ll need a good way of visualizing all of these signals together, as the graph is going to get a bit crowded. Maybe I’ll pick up that Tufte book again…

(And, speaking of visualization, I think a heatmap of activity per hour of the week would be interesting as well… Google Spreadsheets doesn’t do those, though, so while I have the data I couldn’t whip one up easily tonight).
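
If I do get around to a heatmap, matplotlib can do what Google Spreadsheets can’t. A sketch, reusing the same log parsing and placeholder path as above:

    # Hour-of-week heatmap: counts[day][hour] over a 7x24 grid.
    import datetime
    import os

    import matplotlib.pyplot as plt
    import numpy as np

    LOG = os.path.expanduser("~/.shell.log")  # placeholder path
    counts = np.zeros((7, 24))

    with open(LOG) as f:
        for line in f:
            try:
                t = datetime.datetime.strptime(line.split(None, 1)[0],
                                               "%Y%m%d.%H%M.%S")
            except (ValueError, IndexError):
                continue
            counts[t.weekday(), t.hour] += 1

    plt.imshow(counts, aspect="auto", interpolation="nearest")
    plt.yticks(range(7), ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])
    plt.xlabel("hour of day")
    plt.colorbar(label="commands")
    plt.show()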

Lastly, what’s the purpose of all this? What do I want to accomplish with this analysis? They’re nice-looking graphs, for sure. And honestly, there is a bit of narcissistic pleasure in self-discovery. And I suppose it’s good to realize that things like the mid-week slump (exhaustion from work? external calendar factors?) are happening.

But I’m eventually hoping for something less passive than just observation; I look forward to using this data to change myself. I can imagine setting goals (in bed by a certain hour, up by a certain hour, no coding on day x vs. more coding on day y) and letting the statistics show my progress toward those goals.

Saving Command Line History
http://motespace.com/blog/2011/03/12/saving-command-line-history/ (Sun, 13 Mar 2011)

I’ve never been satisfied with the defaults for the way Linux & OS X save command line history. For all practical purposes, when we’re talking about text files, we have infinite hard drive space. Why not save every command we ever type?

First, A Roundup of What’s Out There

Here’s the baseline of what I started with, in bash:

declare -x HISTFILESIZE=1000000000
declare -x HISTSIZE=1000000

But there are a few problems with this: bash and zsh sometimes corrupt their history files, and multiple terminals sometimes don’t interact properly. A few pages have suggested hacks to PROMPT_COMMAND to get terminals to play well together:

briancarper.net

  • relatedly, shopt -s histappend (for bashrc)
  • export PROMPT_COMMAND="history -n; history -a" (on every bash prompt, read in any new history from other sessions and append this session’s latest commands). While this works, it feels a bit hacky

tonyscelfo.com has a more formalized version of the above.

Further down the rabbit-hole, this guy has a rather complicated script to output each session’s history to a uniquely ID’d .bash_history file. Good, but it only exports upon exit from a session, which rarely happens for me: sessions either crash (which doesn’t trigger the write) or I never close them. Still, it’s an interesting idea.

(Aside: shell-sink was an interesting solution to this issue, though it had its own set of problems, chiefly privacy implications: in case I ever type passwords at the command prompt, I would really rather not have this stuff live on the web. Also, it’s now obsolete and taken down, so it’s not even an alternative anymore). Links, for posterity:
[1] [2] [3]

Now, what I finally decided to use

Talking to some folks at work, I found this wonderful hack: modify $PROMPT_COMMAND to append to a history file manually… but also record a little context (the timestamp and current path) along with the command. Beautiful!

export PROMPT_COMMAND='if [ "$(id -u)" -ne 0 ]; then echo "`date` `pwd` `history 1`" >> ~/.shell.log; fi'

ZSH doesn’t have $PROMPT_COMMAND, but its precmd() hook is equivalent.

For posterity, here’s what I ended up with:

  • zsh:

    function precmd() {
      if [ "$(id -u)" -ne 0 ]; then
        FULL_CMD_LOG=/export/hda3/home/mote/logs/zsh_history.log
        echo "`/bin/date +%Y%m%d.%H%M.%S` `pwd` `history -1`" >> ${FULL_CMD_LOG}
      fi
    }

  • bash:


    case "$TERM" in
    xterm*|rxvt*)
    DISP='echo -ne "\033]0;${USER}@${HOSTNAME}: ${PWD/$HOME/~}\007"'
    BASHLOG='/home/mote/logs/bash_history.log'
    SAVEBASH='if [ "$(id -u)" -ne 0 ]; then echo "`/home/mote/bin/ndate` `pwd` `history 1`" >> ${BASHLOG}; fi'
    PROMPT_COMMAND="${DISP};${SAVEBASH}"
    ;;
    *)
    ;;
    esac

This gets ya a wonderful logfile, full of context, with no risk of corruption:

20110306.1819.03 /home/mote/dev/load 515 ls
20110306.1819.09 /home/mote/dev/load 516 gvim run_all.sh
20110306.1819.32 /home/mote/dev/load 517 svn st
20110306.1819.35 /home/mote/dev/load 518 svn add log_screensaver.py
20110306.1819.49 /home/mote/dev/load 519 svn ci -m "script to log if screensaver is running"

(As an aside, you’ll notice that these commands are all timestamped. Imagine the wealth of personal infometrics data we can mine from here! When am I most productive, as measured by command density per time of day? What really are my working hours? When do I wake? Sleep? Lunch?)

Next up: I need to make a `history`-like command to tail more copy-pastable output out of this file.
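
A minimal sketch of what that might look like, in Python (the path and names are placeholders; it just strips the "timestamp pwd histnum" prefix so the last N commands paste cleanly):

    # Print the last N commands from the context log, context stripped.
    import sys

    LOG = "/home/mote/logs/bash_history.log"  # placeholder path
    n = int(sys.argv[1]) if len(sys.argv) > 1 else 20

    with open(LOG) as f:
        for line in f.readlines()[-n:]:
            parts = line.rstrip("\n").split(None, 3)  # stamp, pwd, num, cmd
            if len(parts) == 4:
                print(parts[3])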

Authority, Influence in Social Networks [tentative thoughts]
http://motespace.com/blog/2011/01/29/authority-influence-in-social-networks-tentative-thoughts/ (Sun, 30 Jan 2011)

I spent the day fiddling around with twitter and buzz, to see what signals I have at my disposal.

Eventually I’d like to get some metrics that quantify a few different aspects of human relationships:

  • Global influence (how much influence does this user have upon the world). This is pretty straightforward.
  • Local influence (how much influence does the user have within his more personal social sphere). This is less straightforward and much more interesting. Relatedly, who are the top influencers for an individual, or for a clique of people? And can we get an InfluenceRank(a, b) between any two people, or between a person and a group, etc.?
  • Level of friendship, or closeness (how vague is that, huh?)
  • sub-graphs within a user’s FOAFs & FOAFOAFs that correspond to different social circles/publics/social identities. I’m pretty sure this is a well-studied problem, but it’s interesting to run the numbers for myself.

I’m just getting started, so here’s a working braindump…

I’d like to come up with some more rigorous definitions for these metrics (maybe look in some social psychology journals? read up on social networks?). And there’s plenty of other stuff I want to measure, too…

Note: some of these are by definition unidirectional (influence). Are any relationships or relationship-metrics bidirectional? (Is friendship itself?)

Now, the signals that I have access to:

  • num followers
  • num followers in FOAF network
  • num followers in FOAFOAF network
  • num_replies(a, b)
  • num_reshares(a, b) (not in buzz, though…)
  • num_likes(a, b)
  • more?

These signals should also be normalized by how much a person communicates or follows in general. All we have is the observation “a is following b” or “a is talking to b”; we don’t know the internal impedance in a’s mind: do they follow lots of people, or is the fact that they are following this one person a more significant event?

I should probably also look at reciprocity. min(replies(a, b), replies(b, a)) for two users a and b will be very useful. Add a minimum threshold (say, 3), and there’s a good proxy for friendship.
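
As a sketch (the function and the num_replies structure are hypothetical, mine rather than anything from an API):

    # Reciprocity proxy for friendship: both sides must have replied to
    # each other at least THRESHOLD times. num_replies is a hypothetical
    # dict keyed on ordered (user, user) pairs.
    THRESHOLD = 3

    def is_friend(num_replies, a, b):
        reciprocity = min(num_replies.get((a, b), 0),
                          num_replies.get((b, a), 0))
        return reciprocity >= THRESHOLD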

Another problem is that many of these metrics are so sparse! Just because A is friends with B doesn’t mean that A is going to necessarily comment/like/reshare that often.

I should probably also eliminate the “celebrities” of the network (people with friends/followers above a certain threshold), or at least treat them differently. These users are closer to proxies for measuring the ideology or worldview of their followers, rather than “friends” in the canonical sense.

The hardest (most interesting?) part of all this will be evaluation. Once I have a metric, how can I quantify how good it is, beyond just eyeballing it? I have no labeled data…

This afternoon, I had some decent success approximating local influence as

num_followers_in_foaf_network - 0.01 * num_followers_globally

(varying that 0.01 constant was a means of penalizing the global popularity of a person… keeping it at 0.01 got me the tech people who influence me personally, while 0.05-0.1 got me my non-computery real-life friends).

This one also worked nicely:

num_followers_in_foaf_network / (1 + log(num_followers_globally))
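
In code, the two scores look something like this (a sketch; the follower counts come from wherever you pull them, e.g. python-twitter, and the function names are mine):

    import math

    # First metric: penalize global popularity linearly. penalty=0.01
    # surfaced tech influencers for me; 0.05-0.1 surfaced real-life friends.
    def local_influence_linear(foaf_followers, global_followers,
                               penalty=0.01):
        return foaf_followers - penalty * global_followers

    # Second metric: damp global popularity logarithmically instead.
    def local_influence_log(foaf_followers, global_followers):
        return foaf_followers / (1.0 + math.log(max(global_followers, 1)))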

p.s. Many thanks to the authors of python-twitter and buzz-python-client; you made my life a lot easier…

Thoughts on Publishing in Academia
http://motespace.com/blog/2006/11/08/thoughts-on-publishing-in-academia/ (Wed, 08 Nov 2006)

To paraphrase/quote David Klein:

Publications would be so much better if we were forward-thinking instead of rigorous in our testing. It seems like people judge a paper’s value by “in 10 years, will someone find a hole in the rigor of my testing procedure?” I would rather judge a paper by “does this make me have an interesting idea about the field that I’ve never thought of before?”

Advice on Writing One’s Dissertation
http://motespace.com/blog/2006/11/02/advice-on-writing-ones-dissertation/ (Fri, 03 Nov 2006)

All dissertations require four months of uninterrupted work.

  • The last month of work takes 0.5 calendar months.
  • The second to last month takes 1.5 calendar months.
  • The first two months can take years, and they usually do.

Prof. Daniel Berry, U. Waterloo

Sigh… if only this were less true.

What we can learn from Folksonomy
http://motespace.com/blog/2006/06/24/what-we-can-learn-from-folksonomy-and-delicious/ (Sun, 25 Jun 2006)

Outward-facing Questions:

  • The great thing about delicious and folksonomy is that it creates an ontology as an emergent byproduct of individual self-serving efforts (that is, personal bookmarking). I’m wondering if we can take a similar tack to solve other AI problems.

Inward-facing Questions:

  • What is the best way to represent the evolution of a tag’s meaning (evolution on both the individual and the group scale)? Folksonomy is a lot more dynamic than a fixed ontology, so we might not be able to use the same old tools.
  • Folksonomy is the relationship between three types of information: tags, tagged objects, and the users who tag them. What information can we derive about each that is not explicit in the structure? You can call this “tag grouping”, “neighbor search”, “related items”… but it’s really all just clustering. What are the differences when you cluster each?
  • Continuing from the last question: it’s most intuitive to hierarchically cluster tags; this maps well onto the formal “ontology” model that information architects and NLP researchers are comfortable dealing with. But what happens when we hierarchically cluster users and tagged items? What does hierarchy imply about the relationships between parents, children, and siblings in the resulting structure?
  • What are the differences in (tags, users, items) between digg, delicious, flickr, and citeulike?

Ah, to have time to pursue these….

Finished the Dissertation Proposal
http://motespace.com/blog/2006/06/19/finished-the-dissertation-proposal/ (Mon, 19 Jun 2006)

Ahhh, I’m done. Now, don’t that feel good. 71 pages on building a computational model of language learner errors. Phew, now to sleep.

On Rexa
http://motespace.com/blog/2006/04/30/on-rexa/ (Sun, 30 Apr 2006)

Rexa, a new player in community bibliography management, was opened to the public a couple weeks ago.

Here’s a blog post from the PI on this project (Andrew McCallum) detailing the announcement, and a little more here, from Matthew Hurst’s Data Mining blog.

A cursory use of the system shows it to be a sort of “new-generation CiteSeer”, with slightly smarter NLP and data mining, and a halfhearted attempt at facet-driven organization. They mention folksonomy in explaining their tags, but from what I can tell, the implementation seems to be straight-up facet-based personal information management rather than actual tag-sharing and folksonomy. But it’s a start. And the release is accompanied by promises to make it smarter (especially on the data mining side).

All I can say is, you can tell it was made by NLP guys and data miners and not social software guys. Interface-wise, it’s not too friendly (eh? I need to create an account before I can even begin browsing through it?? Before I’m even presented with a link to the “about” page??). And the interface looks like it was designed by a C++ monkey rather than an HTML monkey.

And I won’t even comment on the poor coverage of publications (Andrew promises to improve this). Err, actually, it looks like I did just comment.

These things being said, they have some GREAT approaches: smart data mining, as well as automatic extraction of author and grant profiles along with the usual paper aggregation (and with the promise of forthcoming extraction/aggregation of conferences and research communities!)… it looks like they realize that research (like soylent green) is made of PEOPLE and not just papers.

The thing that really excites me is the suggested examples of tags that they use as seeds for the future folksonomy:

“hot”, “seminal”, “classic”, “controversial”, “enjoyable”

This is exciting because, if this tagging becomes widespread and mainstream, we’ll FINALLY have a better metric of the value of a publication in academia. Think about it: right now, there are only two kinds of people who can tell the rest of the academic world that a paper is “valuable”: (1) the people on the acceptance/review committee of a conference or journal, and (2) the people who choose to cite the paper in the bibliographies of their own publications. Neither of these is too good: the first group is very exclusive and small in number (at best biased, and at worst unknowledgeable in the research niche of a paper’s focus), and the second group requires a high investment of effort just to communicate value (you need to publish a paper just to cast a vote; and who ever reads bibliographies closely anyway, unless they’re already looking for something specific?).

The upshot is that, of the many people who read an article, only a very small few get to formally, aggregatably comment on its worth. That’s a lot of untapped, already-invested effort. I would love to see some sort of paper ranking system become more mainstream!

On The Success of LaTeX
http://motespace.com/blog/2006/03/31/on-the-success-of-latex/ (Sat, 01 Apr 2006)

I suspect that the success of LaTeX (and its ubiquity as a format for thesis-writing) is in part due to the fact that learning its arcane subtleties is a wonderful source of procrastination.

What a glorious escape from having to do actual paper-writing!

Nature: “Scientists must embrace a culture of sharing and rethink their vision of databases”
http://motespace.com/blog/2005/12/01/nature-scientists-must-embrace-a-culture-of-sharing-and-rethink-their-vision-of-databases/ (Fri, 02 Dec 2005)

Good editorial in Nature, “Let Data Speak to Data”:

Web tools now allow data sharing and informal debate to take place alongside published papers. But to take full advantage, scientists must embrace a culture of sharing and rethink their vision of databases.

That being said, I find it more than a little ironic that this diatribe is published in Nature, of all places. Nature, the stalwart of the old academic regime… its method of publication (high subscription costs, stringent peer-review-by-the-few instead of the peer-review-by-the-masses that blogs enable) seems quite opposed to the open “let information be free!” attitude of this editorial.

But then again, isn’t it a good sign that the readership is thinking things like this, and that the editors are publishing such sentiments?
