Tag: corpus linguistics


27

Aug 2010

Jukuu is interesting

Sinosplice reader Matthew recently introduced me to Jukuu, a database of sample sentences. (The Chinese name, 句酷, is a pun on the word 句库, meaning “sentence base,” in the naming tradition of nciku.) There are some really interesting things happening on Jukuu. Here’s a screenshot of the results of a search for “get”:

Jukuu search results for "get"

I enjoyed some of those random sentences. Some things worth noting:

– The sample sentences in the screenshot above are all taken from About Face 3, a well-known book on goal-directed design, which has been published in multiple languages.
– Jukuu offers not only multiple translations (grouped by part of speech), but also the distribution of those various parts of speech in its database (that’s what the pie graph at the right represents).
– Jukuu also offers other word forms (词形) for “get” (in this case, “gets,” “getting,” “got,” “gotten,” and even “getable”).
– If you click on one of the translations in the top right, the resulting page shows you sentences with just that translation of “get” (for example, this one for 得到).
– You can get similar results without going the “exact translation” route by just searching for multiple words, in a mix of English and Chinese. (The sentences aren’t censored, either. Have fun with that!)
– If you go to the “get” results page, further down the right column, you also see links for “adjectives frequently preceding this word,” “verbs frequently preceding this word,” “prepositions frequently preceding this word,” and “nouns frequently following this word.”

This kind of thing is a linguist’s dream, and it can only be accomplished through corpus analysis with part-of-speech tagging, which is a ton of work. It’s really cool to see a resource like this publicly available online.
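For the curious, here is a rough sketch of how features like the part-of-speech pie graph and the “adjectives frequently preceding this word” lists could be computed once a corpus has been tagged. I have no idea how Jukuu actually does it; the tiny hand-tagged corpus, the tag names, and the function names below are all made up for illustration.

```python
from collections import Counter

# Hypothetical toy corpus: each sentence is a list of (token, tag) pairs.
# A real corpus would have millions of sentences and tags from an automatic tagger.
CORPUS = [
    [("I", "PRON"), ("get", "VERB"), ("a", "DET"), ("quick", "ADJ"), ("answer", "NOUN")],
    [("quick", "ADJ"), ("answer", "NOUN"), ("helps", "VERB")],
    [("the", "DET"), ("right", "ADJ"), ("answer", "NOUN")],
    [("they", "PRON"), ("answer", "VERB"), ("quickly", "ADV")],
]


def pos_distribution(corpus, word):
    """Count how often `word` appears with each part-of-speech tag."""
    return Counter(
        tag
        for sentence in corpus
        for token, tag in sentence
        if token.lower() == word
    )


def preceding_words(corpus, word, tag):
    """Count words carrying `tag` that appear immediately before `word`."""
    counts = Counter()
    for sentence in corpus:
        # Pair each token with the token that follows it.
        for (prev_token, prev_tag), (token, _) in zip(sentence, sentence[1:]):
            if token.lower() == word and prev_tag == tag:
                counts[prev_token.lower()] += 1
    return counts


if __name__ == "__main__":
    print(pos_distribution(CORPUS, "answer"))        # data for a pie graph
    print(preceding_words(CORPUS, "answer", "ADJ"))  # adjectives before "answer"
```

The counting is the easy part. The “ton of work” is getting reliable tags onto the sentences in the first place.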


29

Jul 2010

Randy and the Half-Life of Irregular Verbs

Last night I met up with Randy Alexander of Sinoglot, Yuwen, and Echoes of Manchu for dinner and imported beers. We had a great chat, with topics ranging from English and Chinese linguistics, to sci-fi and (evil genius) Joel Martinsen, to the Sinoglot crew and how they tricked Randy into learning Manchu.

We started talking about some of our favorite linguistics articles, on Language Log or elsewhere, and I brought up the one about the half-life of irregular verbs in English. I wanted to send Randy a link, but I was dismayed to discover that the original article by Harvard University mathematician Erez Lieberman is now behind a paywall. All you can find are articles linking to what was once a freely accessible article.

But I dug some more (we’re still quite a few years away from regularizing to “digged,” I’m guessing), and I eventually found what looks like a freely available copy of the original article, “Quantifying the evolutionary dynamics of language,” courtesy of our friends at NIH. Unfortunately, what’s still missing is the great chart the original paper included, which ordered irregular verbs by frequency and gave time estimates (in years) for the regularization of each. (There is an unordered list in text file format linked to in the article, though.)
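The paper’s headline result is that an irregular verb’s half-life scales with the square root of its usage frequency, so a verb used 100 times less often regularizes about 10 times as fast. Here is a minimal sketch of that relationship. It only compares two verbs relative to each other; the function name and the frequencies are placeholders of mine, not figures from the paper, and it does not attempt to reproduce the chart’s estimates in years.

```python
import math


def relative_half_life(freq_a, freq_b):
    """Return how many times longer verb A should stay irregular than verb B,
    assuming half-life is proportional to the square root of usage frequency."""
    if freq_a <= 0 or freq_b <= 0:
        raise ValueError("frequencies must be positive")
    return math.sqrt(freq_a / freq_b)


if __name__ == "__main__":
    # Hypothetical frequencies (any consistent unit, e.g. occurrences per million words).
    common_verb = 100.0
    rare_verb = 1.0
    print(relative_half_life(common_verb, rare_verb))
```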

What does this have to do with Chinese? I’d love to see similar studies for modern Mandarin. Sure, there are no conjugations for Chinese verbs, so it wouldn’t be about the regularization of irregular verbs. But it could be about variable pronunciations of certain words (like 角色, or 说服), or selection of characters (is it or ?). A good chunk of Chinese academia is still obsessed with standardization and what is “correct,” so you don’t see many objective studies, but that attitude won’t last forever. Chinese corpus linguistics is relatively young, but it’s making great strides, and I really look forward to seeing this kind of research in the future.

What research of this type would you like to see?