A Realistic Look at the Challenges of Reading Chinese

The following is a guest article written by a Sinosplice reader, Julian Suddaby. I have followed it with some commentary of my own.

Warning: if you’re a member of the “Chinese is super easy” faction, this article might annoy you a little, but be sure to read through to the end!


How Many Characters?

by Julian Suddaby, 2014-02-13

Introduction

I asked Google “how many chinese characters do I need to learn” and the best sites I found pointed to linguist Jun Da’s website and used his data to argue that 3,500 characters should be enough for most people, since you’ll then know around 99.5% of the characters in general circulation. [1] Is that really enough?

Well, if you’ve got to that point, congratulations. It’s an achievement. But you may not want to stop accumulating characters just yet. Indeed, sad to say, at 3,500 you won’t even be able to read Jun Da’s name, since 笪 is way down at frequency #5,231. [2] So how many, then, do you need to learn? Well, that depends on one question that you should ask yourself: what exactly do you want to read?

A Newspaper

Students often want to read Chinese newspapers. The Southern Weekly 南方周末 being a popular choice, I took the ten most popular articles over the previous thirty days and ran them through a computer program that checked them against Jun Da’s most frequent 3,500 characters. The results are fairly encouraging for the Chinese student, I think: if you knew the 3,500 you’d only encounter forty-four new characters over the course of those ten articles, and twenty-nine of those you’d only see once and so would probably just take a guess at from context and move on. But you’d possibly want to look up 甄, a pseudonymous surname given to the subject of one of the articles (and thus appearing thirty-five times); 闰, used in the name of a Zhejiang corporation which appears to have buried five hundred tons of poisonous chemicals in their backyard (seven appearances); and 驿, used in the name of a company involved in an online security breach (also seven appearances). [3]

So, while you probably shouldn’t throw out your dictionary just yet, it does seem that trying to read a newspaper won’t be a disheartening experience.
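
The check described above is straightforward to reproduce. What follows is only a minimal Python sketch, not the program actually used for this article: the function names are made up for illustration, the tiny `known` set is a stand-in for a real top-3,500 list, and it assumes that only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) counts as a character.

```python
from collections import Counter


def is_han(ch):
    # Assumption: only the basic CJK Unified Ideographs block counts.
    return "\u4e00" <= ch <= "\u9fff"


def unknown_characters(text, known):
    """Count every Chinese character in `text` that is not in `known`."""
    return Counter(ch for ch in text if is_han(ch) and ch not in known)


# Hypothetical toy data; a real run would load a full frequency list
# and the full text of the articles.
known = set("这是一个公司的名字")
text = "这是闰土公司,闰是一个名字。"

counts = unknown_characters(text, known)

# Unknown characters, most frequent first.
for ch, n in counts.most_common():
    print(ch, n)

# Characters seen only once, which a reader might guess and skip.
seen_once = [ch for ch, n in counts.items() if n == 1]
```

Sorting by frequency matters more than the raw total: a character that turns up thirty-five times in one article is worth looking up, while one seen once can usually be guessed from context.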

A Children’s Book

Children’s novels are another popular choice of reading material for language students. Shen Shixi is a well-regarded children’s novelist, whose Jackal and Wolf has recently been translated into English by Helen Wang. I ran an analysis on another of Shen’s novels, 《鸟奴》 (lit. “Bird Slave”). This is, character-wise, much more difficult than the newspaper articles, with two hundred and one characters not in the top 3,500. Ninety of those are used more than once. As you’d expect from Shen, the “king of animal fiction”, animal-related vocabulary is one particular problem here, and you’ll probably end up very confused if you don’t look up 鹩, used two hundred and eighty-four times; 喙, used thirty-six times; and 獾, used twenty-two times. [4]

The novel is about two hundred and forty pages long, and so you should expect to find a character you don’t recognize on most pages.

A Wuxia Novel

Jin Yong’s novels remain firm favorites. Rather than starting with the four volumes and 1,300 pages of The Legend of the Condor Heroes 《射雕英雄传》, students might perhaps try A Deadly Secret 《连城诀》, which is just four hundred pages or so. In those four hundred pages you’ll encounter two hundred and ninety-six characters not in the top 3,500. The most frequently used are from the protagonists’ names (水笙, 水岱, and 万圭), but there are plenty of new common nouns and verbs used multiple times as well. [5]

On a page-by-page basis, you should recognize more characters than in the Shen Shixi novel above. In terms of the total number of unfamiliar characters, however, A Deadly Secret is more of a challenge.

A Modern Classic

Lu Xun’s A Call to Arms 《呐喊》, despite collecting stories he wrote at a very early stage of modern Chinese literary vernacularization, should not be much more difficult than the two novels above—at least in terms of basic character recognition. Two hundred and thirty unseen characters in total, with 闰 (remember that one from above?), 珂 (used in a name) and 锵 (a sound) taking the top three spots. [6]

Conclusion

Even from this very cursory analysis, it appears that if your goal is to read Chinese fiction comfortably without a dictionary, you’re going to need to recognize more than 3,500 characters. Chinese writers use characters well into the four or five thousand frequency range very regularly.

So although reaching 3,500 is worth celebrating, I wouldn’t stop trying to acquire characters just yet. Keep reading and dictionary-checking, and don’t abandon memorizing/spaced repetition if that’s something you find helpful. [7] You’ll still be coming across new characters for a long, long time…. [8]

  1. See http://lingua.mtsu.edu/chinese-computing/statistics/index.html.
  2. 笪 Dá (a surname here, but means “a coarse mat of rushes or bamboo”, with 旦 dàn providing the phonetic). Here and later I’m using Wenlin as my main reference for character glosses.
  3. 甄 Zhēn (a surname here, but originally meaning “to make pottery” and thus composed of 垔 and 瓦, but with no phonetic clue), 闰 rùn (used in a name here, but means “intercalary”; the much more common 润 shares the same pronunciation), 驿 yì (used together with 站 to mean “post/courier station”; right-hand side is the phonetic, as in 译).
  4. 鹩 liáo (“wren”, with the left-hand side providing the phonetic), 喙 huì (“snout; mouth; beak”, with both 口 and 彖 radicals semantic; no phonetic clue), 獾 huān (“badger”, with the right-hand side phonetic).
  5. 笙 shēng (“reed-pipe instrument”, bottom is the phonetic), 岱 Dài (“Taishan mountain”, top is the phonetic), 圭 guī (“jade tablet”, cf. 挂 or 桂 for the pronunciation).
  6. 珂 kē (“a jade-like stone”, right-hand side is the phonetic), 锵 qiāng (“clang”, right-hand side is the phonetic).
  7. For the more technologically-oriented student, another option may be available: thanks to the increasing availability of texts in machine-readable formats students could run their own frequency analysis on a text they wanted to read and pre-learn characters they don’t already know. It’s a pity there don’t seem to be any easy-to-use programs or websites that offer this functionality.
  8. It should also be noted that single character recognition is only part of reading Chinese, and is not on its own a good measure of reading proficiency. That said, the relative ease of measuring character recognition and frequency may justify its limited use as a self-diagnostic and motivational tool for learners of Chinese.
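
As a rough illustration of the frequency analysis suggested in note 7, here is a minimal Python sketch. It is not an existing tool: `coverage`, `prelearn_list`, and the `target` parameter are hypothetical names, the sample text is a toy, and only characters in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) are counted. It measures how much of a text's running characters a reader already knows, then lists the unknown characters, most frequent first, that would need to be learned to reach a chosen coverage.

```python
from collections import Counter


def han_counts(text):
    # Assumption: only the basic CJK Unified Ideographs block counts.
    return Counter(ch for ch in text if "\u4e00" <= ch <= "\u9fff")


def coverage(text, known):
    """Fraction of the running Chinese characters in `text` found in `known`."""
    counts = han_counts(text)
    total = sum(counts.values())
    if total == 0:
        return 1.0
    return sum(n for ch, n in counts.items() if ch in known) / total


def prelearn_list(text, known, target=0.99):
    """Unknown characters to learn, most frequent first, until `target` coverage."""
    counts = han_counts(text)
    total = sum(counts.values())
    covered = sum(n for ch, n in counts.items() if ch in known)
    to_learn = []
    for ch, n in counts.most_common():
        if ch in known:
            continue
        if total == 0 or covered / total >= target:
            break
        to_learn.append(ch)
        covered += n
    return to_learn


# Hypothetical toy text: 鸟 x4, 鹩 x3, 喙 x2, 獾 x1.
sample = "鹩鹩鹩喙喙獾鸟鸟鸟鸟"
known = {"鸟"}
start = coverage(sample, known)
study = prelearn_list(sample, known, target=0.9)
```

Note that this counts characters rather than words, so it shares the limitation raised in note 8: a high coverage figure says nothing about whether the words built from those characters will be understood.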

The following is my response:

Interesting! This sort of helps make a case for the importance of graded readers. (Have you seen Mandarin Companion?)

While I know your intent is to SEEK THE TRUTH, the overall tone of the article is, unfortunately, a little discouraging for struggling learners. For me, this totally highlights the need for materials that give the learner a sense of accomplishment for having reached 300, 500, 1000 characters, rather than an incessant message saying, “STILL NOT GOOD ENOUGH.”

His response:

You’re quite right, I suppose I am a little too rigidly 实事求是 in the piece! I completely agree with you about the need to avoid the demotivating “still not good enough” feeling and message that permeates most Chinese teaching materials (how I remember my exasperation when the 高级 textbook still required fifty plus new vocabulary items per short text!). There’s really a huge need for more good reading materials with limited character/vocabulary ranges, and your graded readers look fantastic.

13 Comments to “A Realistic Look at the Challenges of Reading Chinese”

  1. ChinaMatt says:

    I’m not sure I’ve cracked 1000 yet. Of course, learning without a class is rather difficult. Wish I could remember the graded readers I had from China–they were great for reinforcing vocabulary and sentence patterns. Wish I could find those here in Taiwan.

  2. laurenth says:

    “thanks to the increasing availability of texts in machine-readable formats students could run their own frequency analysis on a text they wanted to read and pre-learn characters they don’t already know. It’s a pity there don’t seem to be any easy-to-use programs or websites that offer this functionality.”

    Chinese Word Extractor helps you do just that.

    http://www.zhtoolkit.com/posts/2011/09/new-software-chinese-word-extractor/#more-292

    It’s an opensource program, usable both online and offline.

  3. Graham Bond says:

    Discussion of quantities of characters doesn’t really make much sense to me in terms of the broader subject of ‘learning Chinese’. I frequently come across characters that I have had regular exposure to (primarily from SRS-based flashcard work). I know exactly how the character is pronounced (or the range of potential pronunciations) and could probably list off at least two different potential English translations. And yet, when encountered in text or speech, I may well be totally dumbfounded by exactly what that character is doing to contribute to meaning. Herein lies the danger of being over-reliant on SRS (and, by implication, over-reliant on hard ‘numbers’).

    For what it’s worth, the flashcard set that I maintain daily contains around 8,000 individual cards. These have between one and eight characters on each (at an average of around 2.5, or 3, I’d guess) meaning there are easily more than 20,000 individual characters in my flashcard deck. Obviously, many, many of these are repeated – which in itself points to the fact that meanings often change drastically when the same character is collocated differently – though I guess I have some experience of seeing and using perhaps 3,000 distinct characters. Even now, reading newspapers and novels (something I try and do at least ‘a bit’ of on a daily basis) remains a challenge, even though I can make sense of most things without a dictionary.

    Numbers really aren’t everything. That kind of donkey work is perhaps necessary if one wants to be able to understand complex texts, but it remains only a ‘first step’ and, even then (as John intimates), only one of several different potential first steps. Rather than racking up big figures, what matters is being aware of how characters work within different collocations and, then, how those collocations operate within a variety of contexts.

    In short Chinese IS hard, but that’s why we do it, right?

  4. Lucas says:

    You don’t stop learning characters, or Chinese in general. You just keep hacking away at it forever and see how far you can get before you die.

    • Adam Stout says:

      I wish there was a “like” button for your comment. Depressing and cynical, but humorous and true enough to make me laugh. Thanks!

  5. Brendan says:

    I think part of the problem here is defining what we mean when we say we “know” a character — and, for that matter, what we mean by “reading.” There are plenty of characters that are rare on their own but more or less recognizable in context; there are others, like 笪 or 甄, that are going to be clearly marked as names when they appear in context. If someone’s goal is extensive reading, I think it’s perfectly fine to skip over these: “When contacted, Mr. [BLOB] said….”

    And of course with tools like Pleco and Wenlin and Peraperakun, the pronunciation is only a mouseover or a tap away. Personal names are tough, but the good news is that the lazy approach works well enough: learn the pronunciation, or at least look it up quickly, and forget about the meaning. (Incidentally, 垔 is phonetic in 甄. Not the most helpful of clues, admittedly.)

    Basically, I think that the relative ease or difficulty of the approach is going to end up depending a lot on the learner’s focus. If you focus on the characters, and on knowing the pronunciation and meaning of every single character, you’re going to have a hard time of it. If you focus on the words, and on the general import of the sentences in which the words appear, you’re going to have a much easier time.

  6. Stavros says:

    One thing a learner (a learner of anything, not just languages) needs to understand is that the brain works on efficiency – that is, it only remembers what it needs. The general consensus among language learners is that necessity tends to dictate what is learnt and what is discarded.

    I am currently reading and listening to the book 我的青春谁做主. It is based on the TV show which was a hit a few years back. Sure enough, there are tonnes of vocab items, idiomatic phrases and grammar points which baffle me. But I let them go. I am getting the gist of the story so I don’t bother looking anything up (this is partly due to being familiar with the TV show). But one pertinent point about this book: although it is written for native speakers, I’ve yet to come across a single character which I don’t recognize (I am in the 3,500 range).

  7. This is a ridiculous post, and a ridiculous way to look at language learning.

    The exact same thing happens to native speakers, in any language! Who cares how many words or characters or chengyu or whatever you know? It’s not important.

    If I picked up any English literature classic tomorrow and started reading, there would be plenty of unfamiliar words, but the odds are I just wouldn’t notice them, as skipping over words we don’t know is a habit ingrained in us since a young age. As Bokane said, you just read it as Mr. ?. There are so many words in English which are mispronounced for exactly that reason – people who have never heard it said read it in a book and take a guess at it, and end up being wrong. One day someone who knows corrects them, and so it goes.

    I personally think it’s a waste of time to sit there and try and find a magic formula for how many characters and/or words you need to know to read x text. Just do what you do in your native language – pick up a book that looks interesting and start reading it. If it’s too hard, give up and find an easier one. Do the same if it’s too boring.

    Less time analyzing, more time reading!

  8. Andrew Cockerham says:

    Good post. I just downloaded “The Country of the Blind” from Mandarin Companion. Thanks so much for creating such a useful resource! I’m also a developer, and curious to talk through some ideas for improvements to the app based on the discussions above. Would love to chat more…. How is best to contact you, John?

  9. Ben says:

    If anything I find this encouraging about the 3500 character statistic. Of the low-frequency characters you displayed, the vast majority were for names, and therefore not an impediment to comprehension of the overall piece. But of course, knowing characters is just the start: just because you recognise the characters 金庸 is using doesn’t mean you’ll be able to understand 连城诀.

  10. Adam Stout says:

    Anyone have good resources for gauging the number of characters you actually know? Tests, etc.? I have tried clavisinica.com but wonder about its accuracy and would like to try something else for comparison.

    PS – The Mandarin Companion graded readers are the best thing I’ve ever read in Chinese!

Sinosplice and all material found herein © 2002-2014, John Pasden. All rights reserved.