A Character-Counting Challenge
10 Sep 2009
My recent post on the Wikimedia Commons Stroke Order Project prompted Mark of Toshuo.com to decry the relative dearth of traditional characters being added to the project. To this, David on Formosa reminded Mark that there are also a large number of characters shared by the traditional and simplified character sets.
At this point I’ll interject a visual aid (gotta love them Venn diagrams!):
All this got me thinking about the following question: If “s” represents the characters in the simplified set not shared with the traditional set, “t” represents the characters in the traditional set not shared with the simplified set, and “u” represents the characters shared by the two sets, then how many characters belong to groups s, t, and u, respectively?
It seems like a simple enough question, but it’s actually quite tricky for a number of reasons.
First, the total number of Chinese characters in existence varies according to source, and largely depends on how many non-standard variants you want to include in your total set. You can be reasonably certain the total number is less than 50,000, but that’s still a pretty ridiculously large number when most Chinese people regularly use fewer than 5,000. For basic purposes of comparison, it makes sense to limit your set to a certain number of commonly used characters, but which set? One from the PRC? From Taiwan? From Hong Kong? From Unicode?
Second, you might be tempted to think that s = t, because simplified characters were “simplified from” traditional characters. This isn’t true, however, because in many cases multiple traditional forms were conflated into one simplified form. To give a very common example, traditional characters 干, 幹, and 乾 are all written 干 in simplified. So adding these three characters adds 1 to u, 2 to t, and 0 to s. There are lots of similar cases, so clearly t is going to be significantly larger than s. But by how many characters?
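To make the bookkeeping concrete, here is a minimal Python sketch of how s, t, and u could be counted from a traditional-to-simplified mapping. The mapping is a tiny hand-made toy (the 干/幹/乾 example from above plus 學 → 学), and the name count_groups is my own invention for illustration; a real count would need a full mapping table from whichever character set you settle on.

```python
# Toy traditional -> simplified mapping. Illustrative only, not a real dataset.
TRAD_TO_SIMP = {
    "干": "干",
    "幹": "干",
    "乾": "干",
}


def count_groups(trad_to_simp):
    """Return (s, t, u) for a traditional -> simplified mapping.

    s: forms found only in the simplified set
    t: forms found only in the traditional set
    u: forms shared by both sets
    """
    traditional = set(trad_to_simp.keys())
    simplified = set(trad_to_simp.values())
    s = len(simplified - traditional)
    t = len(traditional - simplified)
    u = len(traditional & simplified)
    return s, t, u


print(count_groups(TRAD_TO_SIMP))  # (0, 2, 1)

# Adding a one-to-one pair such as 學 -> 学 adds 1 to both s and t.
print(count_groups({**TRAD_TO_SIMP, "學": "学"}))  # (1, 3, 1)
```

One-to-one simplifications keep s and t in step, so the gap between t and s comes entirely from the many-to-one cases.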
I’d be very interested to see a concrete answer to this question, regardless of the character limit used. I also wonder how the proportions of s, t, and u vary as the character limit is increased, and more and more low-frequency characters are included.
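For anyone who wants to try, here is a rough Python sketch of how the counts might be tracked as the character limit grows. It assumes you have a list of traditional characters ordered by frequency plus a traditional-to-simplified mapping. Both are made-up toy data below (the ordering in particular is hypothetical), and the function name counts_at_cutoffs is my own. It also assumes the limit is applied to the traditional list, which is only one of several reasonable choices.

```python
# Hypothetical frequency-ordered list of traditional characters (toy data).
TRAD_BY_FREQUENCY = ["的", "是", "國", "學", "干", "幹", "乾"]

# Toy traditional -> simplified mapping covering the list above.
TRAD_TO_SIMP = {
    "的": "的", "是": "是", "國": "国", "學": "学",
    "干": "干", "幹": "干", "乾": "干",
}


def counts_at_cutoffs(trad_by_freq, trad_to_simp, cutoffs):
    """Map each cutoff n to (s, t, u) for the n most frequent traditional characters."""
    results = {}
    for n in cutoffs:
        traditional = set(trad_by_freq[:n])
        simplified = {trad_to_simp[ch] for ch in traditional}
        s = len(simplified - traditional)  # simplified-only forms
        t = len(traditional - simplified)  # traditional-only forms
        u = len(traditional & simplified)  # shared forms
        results[n] = (s, t, u)
    return results


for n, (s, t, u) in counts_at_cutoffs(TRAD_BY_FREQUENCY, TRAD_TO_SIMP, [2, 4, 7]).items():
    print(f"top {n}: s={s}, t={t}, u={u}")
# top 2: s=0, t=0, u=2
# top 4: s=2, t=2, u=2
# top 7: s=2, t=4, u=3
```

With real frequency data, the same loop would show how quickly t pulls ahead of s as rarer characters enter the set.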
If you’ve got an answer, I’d love to hear from you!