The 2,136 Japanese Joyo Kanji web-accessible database and search engine

 For the 2,136 kanji published in the 2010 revision of the Commonly-used Kanji Table (Joyo Kanji), you can search the Web for kanji characteristics such as frequency, radical type, and radical frequency, based on a corpus of 11 years of the Mainichi Newspaper (2000 to 2010). The newspaper corpus contains 368,841 word types and 282,816,611 word tokens. In addition to single kanji, you can search for two-kanji compound words (jukugo) by choosing "Select Jukugo from Database" on the search site.


See details in this paper:

Tamaoka, Katsuo, Shogo Makioka, Sander Sanders and Rinus G. Verdonschot (2017). www.kanjidatabase.com: a new interactive online database for psychological and linguistic research on Japanese kanji and their compound words. Psychological Research, 81, 696-708. [Download]


Japanese, Korean, Chinese and Vietnamese kanji-word Web-accessible database and search engine

 Two-kanji compound words are frequently used in Japanese and account for about 70% of the 51,962 headwords in a Japanese dictionary (Yokosawa & Umeda, 1988). Using 2,060 kanji words listed in the Former Japanese Language Proficiency Test (2007, Revised Version), this search engine for the Web-accessible database can retrieve word types, printed frequencies in the Asahi Newspaper (14 years of articles, 1985 to 1998) and the Mainichi Newspaper (11 years of articles, 2000 to 2010), and semantic relationships between Japanese and Chinese. Pearson's product-moment correlation coefficient between the two newspapers (Asahi and Mainichi) for 2,029 words is r = 0.87 (p < .001).
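
As an illustration of the correlation reported above, the following is a minimal sketch of a Pearson product-moment correlation between the frequencies of the same words in two corpora. The frequency lists are invented for the example, and the sketch assumes raw counts; the source does not state whether raw or transformed frequencies were used.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical frequency counts for the same five words in two newspapers
asahi = [120, 45, 300, 8, 77]
mainichi = [110, 60, 280, 15, 70]
r = pearson_r(asahi, mainichi)
print(f"r = {r:.2f}")
```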

In addition, the phonological closeness of two-kanji compound words across the four languages (Japanese, Korean, Chinese and Vietnamese) can be searched using two indicators: “phonemic similarity” and “phonological distance”. Phonemic similarity ranges from 0 to 1, and the higher the value, the higher the similarity. Phonological distance, on the other hand, is the generalized Levenshtein distance calculated by the sdists function (Buchta & Hahsler, 2016) provided in the R cba package. It is given as an integer, and the larger the number, the lower the similarity.
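
To make the distance measure concrete, here is a minimal Python sketch of a Levenshtein distance with configurable operation costs. It is a stand-in for illustration, not the sdists implementation from the R cba package, and the romanized strings in the example are rough, hypothetical transcriptions rather than entries from the database. The formula for phonemic similarity is not given here, so it is not sketched.

```python
def levenshtein(a, b, ins_cost=1, del_cost=1, sub_cost=1):
    """Edit distance between sequences a and b with configurable costs."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * del_cost]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + del_cost,                         # delete x
                curr[j - 1] + ins_cost,                     # insert y
                prev[j - 1] + (0 if x == y else sub_cost),  # match or substitute
            ))
        prev = curr
    return prev[-1]

# Hypothetical romanized phoneme strings for one cognate word in two languages
distance = levenshtein("kazoku", "kajok")
print(distance)
```

With unit costs, identical strings give 0 and each additional mismatch raises the value, which matches the reading that a larger distance means lower similarity.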


See details in these papers:

  • Yu, S., Kim, J. & Tamaoka, K. (2018). Construction of an online search engine for a database on phonological similarity and phonological distance of two-kanji compound words in Japanese, Korean, Chinese and Vietnamese. [Download]
  • Yu, S., & Tamaoka, K. (2015). A Web-accessible search engine for grammatical category for orthographically-similar two-kanji compound words between Japanese, Korean and Chinese. [Download]
  • Xiong, K., & Tamaoka, K. (2014). A database of grammatical categories for orthographically-similar two-kanji compound words among Japanese, Korean and Chinese. [Download]
  • Park, S., Xiong, K., & Tamaoka, K. (2014). A database of grammatical categories for orthographically-similar two-kanji compound words among Japanese, Korean and Chinese. [Download]

Mora and Bi-mora Frequency Web-accessible Database

 Mora frequencies and bi-mora frequencies are calculated from a corpus of 18 years of articles in the Mainichi Newspaper (1998 to 2015). The type frequency of morphological units in this newspaper corpus is 663,243, while the token frequency, excluding symbols, is 398,406,147. For example, the mora /ka/ can be searched in either hiragana or katakana; its type frequency is 70,548 and its token frequency is 31,176,377. In addition, up to 1,000 words that include /ka/ are presented in order of frequency. Bi-mora frequencies can be searched as well.

 For example, when you enter /kawa/ in hiragana or katakana, the bi-mora type frequency is displayed as 2,377 and the token frequency as 659,508. Again, up to 1,000 words that include /kawa/ are presented in order of frequency.
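
The distinction between type and token frequency for moras and bi-moras can be sketched as follows. This is an illustration under stated assumptions, not the database's procedure: the word list and counts are invented, readings are given in hiragana only, and a word is assumed to count once per mora sequence it contains, even if that sequence occurs several times in the word.

```python
from collections import Counter

# Small kana that combine with the preceding kana into a single mora
SMALL_KANA = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")

def split_moras(kana):
    """Split a kana string into moras; small kana attach to the previous kana."""
    moras = []
    for ch in kana:
        if ch in SMALL_KANA and moras:
            moras[-1] += ch
        else:
            moras.append(ch)
    return moras

def ngram_frequencies(word_counts, n):
    """Return (type_freq, token_freq) Counters for mora n-grams.

    Assumption: type frequency is the number of distinct words containing
    the n-gram, and token frequency is the summed count of those words.
    """
    type_freq, token_freq = Counter(), Counter()
    for reading, count in word_counts.items():
        moras = split_moras(reading)
        grams = {"".join(moras[i:i + n]) for i in range(len(moras) - n + 1)}
        for gram in grams:
            type_freq[gram] += 1
            token_freq[gram] += count
    return type_freq, token_freq

# Hypothetical readings and corpus counts
word_counts = {"かわ": 50, "かわいい": 20, "さくら": 30, "きょうかい": 10}
mora_types, mora_tokens = ngram_frequencies(word_counts, 1)
bimora_types, bimora_tokens = ngram_frequencies(word_counts, 2)
print(mora_types["か"], mora_tokens["か"])
print(bimora_types["かわ"], bimora_tokens["かわ"])
```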


Word Frequency Web-accessible Database and Search Engine

 Using 18 years of articles from the Mainichi Newspaper, this Web-accessible search engine lets you search for various word frequencies. For example, if you enter /sakura/ in hiragana or katakana, all the lexical items with that pronunciation are displayed. The most frequent is 桜 in kanji, which occurs 15,729 times, followed by さくら in hiragana at 6,975 times and サクラ in katakana at 1,582 times.
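
A lookup of this kind can be sketched as below, using the three counts quoted above. The table layout, the `lookup` function, and the katakana-to-hiragana normalization are assumptions made for illustration; the search engine's actual implementation is not described here.

```python
def to_hiragana(text):
    """Convert katakana characters to hiragana; leave everything else unchanged."""
    # Katakana U+30A1..U+30F6 sit exactly 0x60 above their hiragana counterparts
    return "".join(
        chr(ord(ch) - 0x60) if "ァ" <= ch <= "ヶ" else ch
        for ch in text
    )

# (surface form, reading in hiragana, frequency); counts from the example above
ENTRIES = [
    ("さくら", "さくら", 6975),
    ("桜", "さくら", 15729),
    ("サクラ", "さくら", 1582),
]

def lookup(query, entries=ENTRIES):
    """Return (form, frequency) pairs sharing the query's reading, most frequent first."""
    key = to_hiragana(query)
    hits = [(form, freq) for form, reading, freq in entries if reading == key]
    return sorted(hits, key=lambda item: item[1], reverse=True)

for form, freq in lookup("サクラ"):
    print(form, freq)
```

Normalizing the query to one script means that hiragana and katakana input return the same list of orthographic variants.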