Thoughts on Publishing in Academia

To paraphrase/quote David Klein:

publications would be so much better if we were forward-thinking instead of rigorous in our testing. It seems like people judge a paper’s value by “in 10 years, will someone find a hole in the rigor of my testing procedure”. I would rather judge a paper by “does this make me have an interesting idea about the field that I’ve never thought of before”

Advice on Writing One’s Dissertation

All dissertations require four months of uninterrupted work.

  • The last month of work takes 0.5 calendar months.
  • The second to last month takes 1.5 calendar months.
  • The first two months can take years, and they usually do.

Prof. Daniel Berry, U. Waterloo

Sigh… if only this were less true.

What we can learn from Folksonomy

Outward-facing Questions:

  • The great thing about delicious and folksonomy is that it creates an ontology as an emergent byproduct of individual self-serving efforts (that is, personal bookmarking). I’m wondering if we can take a similar tack to solve other AI problems.

Inward-facing Questions:

  • What is the best way to represent the evolution of a tag’s meaning (evolution on both the individual and group scale)? Folksonomy is a lot more dynamic than a fixed ontology, so we might not be able to use the same old tools.
  • Folksonomy is the relationship between three types of information: tags, tagged objects, and the users who tag them. What information can we derive from each that is not explicit in the structure? You can call this “tag grouping”, “neighbor search”, “related items”… but it’s really all just clustering. What are the differences when you cluster each?
  • Continuing from the last question: it’s most intuitive to hierarchically cluster tags, since this maps well onto the formal “ontology” model that information architects and NLP researchers are comfortable dealing with. But what happens when we hierarchically cluster users and tagged items? What does hierarchy imply about the relationships between parents, children, and siblings in the resulting structure?
  • What are the differences in (tags, users, items) between digg, delicious, flickr, and citeulike?
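To make the “it’s really all just clustering” point concrete, here is a rough sketch of how the same (user, tag, item) triples can be projected three ways. The sample data and the `cooccurrence` helper are made up for illustration, and the function only produces pairwise co-occurrence counts, which is the raw similarity signal a real clustering algorithm (hierarchical or otherwise) would start from.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical bookmarking data: (user, tag, item) triples.
TRIPLES = [
    ("alice", "python", "docs.python.org"),
    ("alice", "programming", "docs.python.org"),
    ("bob", "python", "docs.python.org"),
    ("bob", "python", "zope.org"),
    ("bob", "web", "zope.org"),
    ("carol", "web", "zope.org"),
    ("carol", "photos", "flickr.com"),
]

FIELDS = {"user": 0, "tag": 1, "item": 2}


def cooccurrence(triples, cluster_on, shared_by):
    """For each pair of `cluster_on` values, count how many distinct
    `shared_by` values they have in common."""
    a, b = FIELDS[cluster_on], FIELDS[shared_by]
    # Group the things we want to cluster by the thing they share.
    groups = defaultdict(set)
    for triple in triples:
        groups[triple[b]].add(triple[a])
    # Every pair inside a group co-occurs once for that group.
    counts = defaultdict(int)
    for members in groups.values():
        for x, y in combinations(sorted(members), 2):
            counts[(x, y)] += 1
    return dict(counts)


tag_pairs = cooccurrence(TRIPLES, "tag", "item")    # "tag grouping"
user_pairs = cooccurrence(TRIPLES, "user", "item")  # "neighbor search"
item_pairs = cooccurrence(TRIPLES, "item", "tag")   # "related items"
```

Same function, three different questions, depending only on which column gets clustered and which column counts as evidence of similarity.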

Ah, to have time to pursue these….

Finished the Dissertation Proposal

Ahhh, I’m done. Now, don’t that feel good. 71 pages on building a computational model of language learner errors. Phew, now to sleep.

On Rexa

Rexa, a new player in community bibliography management, was opened to the public a couple weeks ago.

Here’s a blog post from the PI on this project (Andrew McCallum) detailing the announcement, and a little more here, from Matthew Hurst’s Data Mining blog.

A cursory use of the system shows it to be a sort of “new generation citeseer”, with a little smarter NLP and data mining, and a halfhearted attempt at facet-driven organization. They mention folksonomy in explaining their tags, but from what I can tell, implementation seems to be more like straight-up facet-based personal information management, rather than actual tag-sharing and folksonomy. But, it’s a start. And, the release is accompanied by promises to make it smarter (especially on the data mining side).

All I can say is, you can tell it was made by NLP guys and data miners and not social software guys. Interface-wise, it’s not too friendly (eh? I need to create an account before I can even begin browsing through it?? Before I’m even presented with a link to the “about” page??). And the interface looks like it was designed by a C++ monkey rather than an HTML monkey.

And I won’t even comment on the poor coverage of publications (Andrew promises to improve this). Err, actually looks like I did just comment.

These things being said, they have some GREAT approaches: smart data mining, as well as automatic extraction of author and grant profiles along with the usual paper aggregation (and with promise of forthcoming extraction/aggregation of conferences and research communities!)… it looks like they realize that research (like soylent green) is made of PEOPLE and not just papers.

The thing that really excites me is the suggested examples of tags that they use as seeds for the future folksonomy:

“hot”, “seminal”, “classic”, “controversial”, “enjoyable”

This is exciting because, if this tagging becomes more widespread and mainstream, we’ll FINALLY have a better metric of the value of a publication in academia. Think about it: right now, there are only two kinds of people who can tell the rest of the academic world that a paper is “valuable”: (1) the people on the acceptance/review committee for a conference or journal, and (2) the people who choose to cite a paper in the bibliography of their own publications. And neither of these is too good–the first group is very exclusive and small in number (and at best biased, and at worst unknowledgeable in the research niche of a paper’s focus), and the second group has to make a high investment of effort to communicate value (you need to publish a paper just to put in a vote–and who ever reads bibliographies closely anyway, unless they’re already looking for something specific?).

The upshot is that, of so many people who read an article, only a very small few get to formally, aggregatably comment on its worth. That’s a lot of untapped, already-invested effort. I would love to see some sort of paper ranking system become more mainstream!
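As a back-of-the-envelope sketch of what such a ranking could look like: tally reader-applied tags per paper, one vote per reader per tag, and sort on whichever tag you care about. The readers, papers, and function names below are invented for illustration and have nothing to do with how Rexa actually stores or uses its tags.

```python
from collections import Counter, defaultdict

# Hypothetical reader-submitted tags: (reader, paper, tag).
READER_TAGS = [
    ("r1", "paper-A", "seminal"),
    ("r2", "paper-A", "seminal"),
    ("r2", "paper-A", "enjoyable"),
    ("r3", "paper-B", "controversial"),
    ("r1", "paper-B", "hot"),
    ("r3", "paper-A", "classic"),
    ("r1", "paper-A", "seminal"),  # duplicate vote, ignored below
]


def tag_profiles(reader_tags):
    """Count how many distinct readers applied each tag to each paper."""
    profiles = defaultdict(Counter)
    seen = set()
    for vote in reader_tags:
        if vote in seen:  # one vote per (reader, paper, tag)
            continue
        seen.add(vote)
        _reader, paper, tag = vote
        profiles[paper][tag] += 1
    return profiles


def rank_by_tag(profiles, tag):
    """Papers carrying `tag`, most-tagged first (ties broken by name)."""
    hits = [(paper, counts[tag]) for paper, counts in profiles.items()
            if counts[tag] > 0]
    return sorted(hits, key=lambda pc: (-pc[1], pc[0]))


profiles = tag_profiles(READER_TAGS)
seminal = rank_by_tag(profiles, "seminal")
```

The point is only that a vote here costs a reader a few seconds, versus a whole publication for a citation.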

On The Success of LaTeX

I suspect that the success of LaTeX–and its ubiquity as a format for thesis-writing–is in part due to the fact that learning its arcane subtleties is a wonderful source of procrastination.

What a glorious escape from having to do actual paper-writing!

Nature: “Scientists must embrace a culture of sharing and rethink their vision of databases”

Good editorial in Nature, “Let Data Speak to Data”:

Web tools now allow data sharing and informal debate to take place alongside published papers. But to take full advantage, scientists must embrace a culture of sharing and rethink their vision of databases.

That being said, I find it more than a little ironic that this diatribe is published in Nature, of all places. Nature, the stalwart of the old academic regime… its method of publication (high cost of subscription, stringent peer-review-by-the-few instead of peer-review-by-the-masses as blogs are) seems quite opposed to the open, “let information be free!” attitude in this editorial.

But then again, isn’t it a good sign that the readership is thinking things like this, and that the editors are publishing such sentiments?

Bibliographic Management

Bibliography Management Linkdump:

  • Bibdesk : an excellent BibTeX database management system. Beautiful. But for mac only.
  • jabref : an open-source, java BibTeX database management system. Lacks Bibdesk’s panache, but not bad.
  • bibtexml: an excellent tool. Takes .bib files, converts them to xml, and then uses DTDs or XSLTs to mark them down to html APA, MLA, or whatever. This is the type of thing that XML was made for. Requires sablotron or another xslt engine to work.
  • citeulike : folksonomy + bibliography. A delicious clone built to manage academic paper metadata. Good for storing data, finding new papers to read, and making what I’ve read public.

Painfully Learning Zope

My research demands that I write an interface for native speakers to annotate sound files of learner speech. Up until now, my poor annotators had been using an excel sheet I generated via a python script, with one column that pointed to sound files on the web. It worked, and it had nifty features like excel’s builtin autocomplete, but it was easy to run into versioning problems with the halfway-completed excel sheets floating around.
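For the curious, the sheet-generating script was roughly this shape. This is a reconstruction rather than the original code: the base URL, the file names, and the column headers are placeholders, and it writes CSV (which Excel opens directly) instead of a native Excel file.

```python
import csv
import io

# Hypothetical location of the learner sound files on the web.
BASE_URL = "http://example.org/sounds/"


def build_annotation_sheet(filenames, out):
    """Write one row per sound file: a URL column that points at the
    file, plus an empty column for the annotator to fill in."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["sound_file_url", "annotation"])
    for name in filenames:
        writer.writerow([BASE_URL + name, ""])


# Write to an in-memory buffer here; a real run would pass an open file.
buf = io.StringIO()
build_annotation_sheet(["utt001.wav", "utt002.wav"], buf)
sheet_text = buf.getvalue()
```

Every annotator gets their own copy of the output, which is exactly where the versioning headaches come from.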

Now, much of our project’s work is done in python, so the powers that be say “hey, write us a web app in python that does this job”. No prob, python has lots of web app frameworks (cherrypy, twisted, django, snakelets, mod_python (and .psp pages)). And, it was actually a Good Thing, because I’d always wanted to learn web app programming (It’s embarrassing, actually. My ivory-tower programming experience has been a lot of working with statistics, machine learning, natural language processing, but I’ve never done things like web programming, database programming, etc; I’ve read php and mod_perl code, but reading is of course much different from writing). So, mod_python and psp it was. They proved to be intuitive enough to get some working teach-myself-how-to-do-this stuff code in a couple hours.

However, project requirements change. “We want you to do it in Zope or Plone” became the new order of the day. Been wrestling with learning zope/plone for the past 4 days or so… The architecture has a lot of promise, but in many ways it’s frustratingly immature. It can make things look really slick… but documentation is disappointingly unclear/convoluted. There are many links out there for learning this stuff, but very few good ones.

After much searching, dev shed ended up having a high concentration of good links. I wish I had found this howto, “Creating Basic Zope Applications”, in particular, 5 days ago.

Zope seems full of inconsistencies. And it’s not very pythonic. Take, for instance, a mishmash of “here”, “this”, and “self” used hodgepodge to fulfil the function of python’s “self”. What’s up with that?

Ning

Phew, Ning is meme-du-jour. It’s basically a web toolkit to create social software. It’s the first product I’ve seen to come out of Marc Andreessen’s stealth startup 24 Hour Laundry. (reference)

My hunch might be wrong, but it seems to be web2.0 applied to raw application development. What I mean is this: the typical read-write-web facilitates user-contributed data, and the social sharing of user-contributed data. Ning looks like it facilitates user-contributed code, and the social sharing of user-contributed code.

And, by providing a good development platform, it encourages mash-ups between applications, data-sharing, etc. I am curious if this enablement is just inward-focused or also outward-focused. That is, is it just as easy to API into a Ning app from another webapp outside Ning as it is for one Ning app to talk to another?

I’ll be able to tell more after I get my beta developer account, which according to Gordon should be “any hour now” =p.

But, wow, this looks like it could be the sandbox to end all sandboxen.

Update: ahh, here is the business model (i.e. where the money’s gonna come from):

the third party ad networks such as Google AdSense don’t look warmly upon more than one person running ads on an App or a page. Hence the trade for running apps on Ning is that we offer free app creation, management, hosting, security, and shared services, and - in return - you open your code to inspire other developers and refrain from running third party ads. We totally understand if this is not for everybody.