Can Project Naptha Read Chinese Text in Images?

Yesterday Project Naptha hit Hacker News. It offers a way to extract electronic text from image files through a simple Chrome browser extension. Excited to see that both simplified and traditional Chinese are supported, I immediately installed the extension and tried it out.

The results? Unfortunately, not so great.

When it doesn’t work at all

First of all, the script needs to recognize the text in the image. This first step doesn’t always go too well, even if the text seems relatively clear to the human eye. Let’s look at some cases where the extension found nothing, despite the Chinese text being pretty legible.

In this first case, the font is non-standard. OK, fair enough. That’s to be expected.

Testing Project Naptha with Chinese

In this next case, the text is pretty clear, but the contrast is poor.

Testing Project Naptha with Chinese

In this final example, the text is fairly clear to the human eye, but also low-res and slanted. That probably makes it difficult for the algorithm.

Testing Project Naptha with Chinese

When it sort of works

In many other cases, some text was identified, but not enough for the extension to be really useful for anything. Here are some images where Project Naptha could identify some text, and the “select all text” function was applied. (The blue boxes show what Project Naptha identified in the images as “text.” Sometimes they are bizarrely incorrect.)

Some examples:

Testing Project Naptha with Chinese

Testing Project Naptha with Chinese

Testing Project Naptha with Chinese

Testing Project Naptha with Chinese

Testing Project Naptha with Chinese

I found the last two quite surprising, considering how clear, straightforward, and high-res the text is.

When it actually works

Sometimes it was relatively successful in identifying the text. In these cases you must first set the language to Chinese (either simplified or traditional, depending on the text). There’s a cool effect showing you that some processing is going on. When that’s done, you can copy and paste the text.

Testing Project Naptha with Chinese

But… it might not be exactly what you were hoping for.

This selected Chinese text yielded the following copy-paste results:

Testing Project Naptha with Chinese

> 总统亲 ã热fl地接

> \早、待了葫芦兄妹

If it had correctly captured all the text, it would have been:

> 10、总统亲自热情地接

> 待了葫芦兄妹

This one is better:

Testing Project Naptha with Chinese

> 雹电二怪对兄妹俩尽效使用现代

> 化武器况妹俩也不示弱 麝芦神功连

> 连使出 胭宙电二怪打入深深的山沟

It should have been:

> 355、雷电二怪对兄妹俩尽使用现代

> 化武器况妹俩也不示弱,葫芦神功连

> 连使出,把雷电二怪打入深深的山沟

Also, my sample size is too small to make any definite conclusions, but it seems like the extension works better for simplified characters than for traditional.

Conclusion

I don’t mean to sound overly critical. This is amazing technology here, and the fact that it launched with any support for Chinese characters at all is pretty awesome (and brave)! I’m sure the technology will improve with time, and that is going to be tremendously helpful to Chinese learners.

To put this in perspective, the development of OCR (optical character recognition) for mobile devices meant that you could point your cell phone's camera at any characters you see and get feedback on what the characters say (sometimes). Project Naptha does the same thing, but for your home browsing experience. For me, that's where I do most of my Chinese reading, so it's even more important. Once this technology is perfected, as long as you have a tool to help you read electronic Chinese text, you're all set!

Personally, I think this is especially great news for comics. It’s no coincidence that I tested this extension out on comic book text. I’m really looking forward to seeing how this extension develops.

John Pasden

John is a Shanghai-based linguist and entrepreneur, founder of AllSet Learning.

Comments

  1. Nice! You would expect the OCR could do better with most of those images. Here is our Hanping Camera’s effort at the first line of the second image (which is very low res 400×355 – is that the resolution you used, John?) http://t.co/gtgVrfB8zn
