
Learner Language Modeling at NICT

About a month ago, some researchers from NICT (Japan’s National Institute of Information and Communications Technology) came to visit ISI and give a series of short presentations on their work. Among those presenting was Emi Izumi, whose research is very similar to mine. She, and a few others over there, have been working on modeling mistakes in learner language, specifically the English of typical school-taught Japanese learners. I expect they have encountered far fewer logistical hurdles than we have with tactical language (namely, shortage of language-learner-speakers, native-speaker-annotators, and pre-existing speech data models)… lucky them.

Interestingly, their work is very complementary to my own: while I have concentrated on phonology-related errors, they have put more effort into syntax and morphosyntax. It looks like there’s a lot of potential for further cooperation here =).

They have also created a healthy-sized annotated database of learner speech, the NICT JLE Corpus. In accordance with their research, the corpus is rich with syntactic errors (but, unfortunately, mispronunciations are replaced with learner-intended words where they are understandable).

I’m curious how I can use this corpus to benefit my own research. I expect many errors to be language-dependent, unique to the interaction between the L1 and L2 involved, but I am sure there are some language universals that come into play as well. And, as I’m dealing with a paucity of data, I can at least use a model of the Japanese learners as a bootstrap.
Of course, once I get enough data, it will be really cool to compare relative statistics and get a glimpse of what exactly is universal…

I have uploaded a few of Izumi’s papers to my CiteULike page.