Japanese Text Analysis and Readability Tools

Feel like you’re wasting your time trying to read difficult texts? It’s time to start using tools that measure text difficulty.

A measure of how difficult a text is to read is called readability. Currently, the readability of Japanese texts can be estimated by three methods: sentence features, kanji level, and vocabulary frequency lists. Each method has its own assumptions and limitations.

Sentence Feature Analysis

NagoyaObi Project

Satoshi Sato’s NagoyaObi Project provides a tool that analyzes sentence features (abstract). The tool counts the English, Hiragana, Katakana, Kanji and punctuation characters in a text to infer the number of sentences it has. Each character type is weighted mathematically, based on graded Japanese school texts, to produce a weighted difficulty score. Texts graded as easy to read will probably be short and use little Kanji. Beyond showing how a text’s composition compares to corpus data, this tool is probably not too helpful.
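To make the character counting step concrete, here is a minimal Python sketch. It is not the NagoyaObi implementation: the function names, the Unicode ranges and the small punctuation set are my own simplifying assumptions, and it stops at raw counts without any weighting or scoring.

```python
from collections import Counter

# Assumption: a small, incomplete set of Japanese punctuation marks.
PUNCTUATION = set("。、！？「」『』・…")


def char_type(ch):
    """Classify one character using simplified Unicode ranges."""
    # Checked first because "・" sits inside the Katakana block.
    if ch in PUNCTUATION:
        return "punctuation"
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:  # CJK Unified Ideographs (basic block only)
        return "kanji"
    if ch.isascii() and ch.isalpha():
        return "english"
    return "other"


def char_counts(text):
    """Count characters per type, ignoring whitespace."""
    return dict(Counter(char_type(ch) for ch in text if not ch.isspace()))


print(char_counts("猫はネコです。"))
```

A real scorer would then turn these counts into ratios and weight them against graded texts, which is the part this sketch leaves out.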

Kanji Level Analysis

Tokyo International University

On the other hand, there are tools that grade a text by Kanji level, which can be used to reinforce Kanji already learned. Kanji levels include the Japanese school grades (Jouyou), the Japanese Language Proficiency Test (JLPT), and WaniKani. One tool that can analyze Jouyou levels is Kanji Sieve. JLPT level can be analyzed by Jisho, Tokyo International University, and EasyPronunciation. WaniKani level can be analyzed by Jisho, or the browser script Kanji Highlighter. There are also JavaScript libraries, such as Muzukashii for JLPT and Jouyou levels, and Kanji Levels for JLPT, Jouyou and WaniKani levels.
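The idea behind these graders can be sketched in a few lines of Python. This is not how any of the tools above are implemented. SAMPLE_GRADES is a tiny hand-picked sample of Jouyou school grades that I included for illustration only, and kanji_levels is a hypothetical helper. A real tool would load a complete Jouyou, JLPT or WaniKani table.

```python
from collections import Counter

# Assumption: a tiny illustrative sample, not a complete grade table.
SAMPLE_GRADES = {"日": 1, "本": 1, "語": 2, "読": 2, "難": 6}


def is_kanji(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def kanji_levels(text, grades):
    """Return (kanji count per level, set of kanji missing from the table)."""
    by_level = Counter()
    unlisted = set()
    for ch in text:
        if not is_kanji(ch):
            continue
        level = grades.get(ch)
        if level is None:
            unlisted.add(ch)
        else:
            by_level[level] += 1
    return dict(by_level), unlisted


levels, unlisted = kanji_levels("日本語は難しい", SAMPLE_GRADES)
print(levels, unlisted)
```

The highest level that appears, or the share of kanji above your current level, gives a rough idea of whether a text is within reach.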

Vocabulary Frequency List Analysis

Tokyo International University

Vocabulary frequency lists show how relevant each word in a text is, by counting how often each word is used. Words may be compared to other texts, such as a corpus on a specific subject, to find how important each word is within that subject. This makes frequency lists useful for focusing on memorizing only the important words. However, this method is only as reliable as the chosen text or corpus, which may be outdated or may not contain enough subject-relevant data. One available type of frequency list is built from previous JLPT texts. The JLPT level of vocabulary can be analyzed by Jisho and Tokyo International University. Alternatively, you can create your own frequency lists using Wareya’s analyzer, or Squares.net, which uses the Yahoo API.
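As a rough illustration of what a frequency list is, the Python sketch below counts words in a text that has already been split into words. Japanese has no spaces, so a real tool needs a segmenter first. That step is assumed to have happened here, and frequency_list is a hypothetical helper rather than code from any of the tools mentioned.

```python
from collections import Counter


def frequency_list(tokens):
    """Return (word, count, share of all tokens), most frequent first."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return [(word, n, n / total) for word, n in counts.most_common()]


# Assumption: the text was segmented into words beforehand.
tokens = ["猫", "は", "魚", "が", "好き", "。", "猫", "は", "寝る", "。"]
for word, count, share in frequency_list(tokens):
    print(word, count, round(share, 2))
```

Running the same function over a subject corpus and comparing the shares is one simple way to see which words matter most for that subject.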

Brochtrup’s Text Analysis Tool

Christopher Brochtrup's Analysis Tool

Christopher Brochtrup’s Japanese Text Analysis Tool is a Windows application that combines several of the tools above. The application features an outdated version of Satoshi Sato’s NagoyaObi Project analysis method, along with vocabulary and kanji frequency list creation. In particular, it can create personalized readability reports from one’s own vocabulary list (e.g. from Anki) to show whether the vocabulary in a text is already known. Vocabulary is extracted from texts using dictionary lookups via MeCab, which should be more accurate than other methods. However, the readability reports seemed inaccurate in my testing, despite using a known vocabulary list generated from Edict. Furthermore, results are saved to text files, which can be cumbersome to view.

Krause’s Japanese Known Word Checker

Kai Krause's Known Word Checker

Finally, my tool, Japanese Known Word Checker, compares a text to one’s own vocabulary list (e.g. from Anki) and displays whether words from the text are already known. It is simpler than the readability reports in Brochtrup’s application, and focuses primarily on the number of unknown words. A list of unknown words is generated, and can be easily copied and saved to a text file. Word detection relies on the rules of the Japanese language and TinySegmenter, and is being improved regularly. Because the tool is client-side, it is slower than a native or server-side application, but it can be saved and used offline in most web browsers.
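The core comparison can be sketched in Python as follows. This is a simplified illustration, not the tool's actual code. It assumes the text has already been segmented into words, and check_known_words is a hypothetical name.

```python
def check_known_words(tokens, known_words):
    """Return (unknown words in order of first appearance, known share)."""
    known = set(known_words)
    unknown = []
    seen = set()
    known_count = 0
    for word in tokens:
        if word in known:
            known_count += 1
        elif word not in seen:
            # Record each unknown word once, keeping reading order.
            seen.add(word)
            unknown.append(word)
    coverage = known_count / len(tokens) if tokens else 0.0
    return unknown, coverage


# Assumption: tokens come from a segmenter, known_words from e.g. an Anki export.
tokens = ["私", "は", "猫", "が", "好き"]
unknown, coverage = check_known_words(tokens, ["私", "は", "が"])
print(unknown, coverage)
```

The unknown list is what you would copy into a study deck, and the coverage figure gives a quick sense of how much of the text you can already read.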

Now, go save your time and improve your confidence!
