Data

The Kanjimori Database

Composition data for thousands of Chinese characters, regularly updated as new evidence emerges.

Rokuto Secret Data Nerd

The Kanjimori database contains composition data for thousands of Chinese characters, with an emphasis on those used in Japanese (kanji). It's a living database: entries are regularly updated and refined as new evidence for a character's origin emerges. Check out the database report below for insights based on our data!

Database Report

Dataset Completion Progress

Character Complexity Variation Across Kanken Levels

Distribution of Roots and Phono-Semantic Compounds Across Kanken Levels

Dataset Breakdown by Character Type

Sem = Semantic, Ph-Sem = Phono-Semantic
*Phono-semantic roots are characters that were historically phono-semantic compounds but whose components have since been corrupted beyond recognition.

Sources

The Kanjimori database is in many ways a culmination of community efforts to better understand the origins of Chinese characters. Some key sources used in constructing the database include:

Kanji Composition

Wiktionary – The Wiktionary team has made phenomenal progress toward documenting the composition and origin of Chinese characters. Their entries were often the starting point for our investigations into the evolution of a kanji. Wiktionary is the work of hundreds of individual contributors, and contributor names can be seen in the "view history" tab of each kanji page.
Xiaoxuetang – Xiaoxuetang (小學堂) is a database of historical Chinese character forms run by the Institute of Information Science (IIS, 情報科学研究所) at Academia Sinica (中央研究院) in Taipei, Taiwan. The database is an excellent source of ancient glyph traces for studying changes in glyph composition and form over time.
角川新字源 – A Japanese kanji dictionary that includes historical forms and likely origins for each character. This book was an important reference for kanji origins, particularly as a source of perspectives on less well understood kanji. The authors include 小川環樹, 西田太一郎, 赤塚忠, 阿辻哲次, 釜谷武志, and 木津祐子.
漢字の体系 – A modernized Japanese kanji dictionary written by 白川静 that organizes characters by shared components. This book was a helpful resource for identifying overlooked Japanese-relevant characters that share a given component. Kanji entries include historical forms as well as brief explanations of character origins. We recommend this book for advanced learners, as it embodies the same philosophy of learning kanji through composition that Kanjimori advocates.
Zdic – Zdic is an online Chinese dictionary that includes historical and variant forms for many characters. Zdic is a great resource for studying variant forms of kanji that contain vestiges of ancient character forms.
Hanziyuan – Hanziyuan (漢字字源) is a site cataloguing historical forms of Chinese characters. It's a convenient reference for comparing characters across different stages of their evolution. Key contributors to Hanziyuan include Richard Sears, Ann Wu, and Dixin Yan.

Words & Definitions

KANJIDIC – KANJIDIC is a kanji database we referenced for many kanji properties including stroke counts, readings, and meanings. The database was created by Jim Breen and is currently maintained by the Electronic Dictionary Research and Development Group (EDRDG).
JMdictDB – JMdictDB is a Japanese dictionary database and is the source for most word definitions on Kanjimori. Like KANJIDIC, JMdictDB is the work of Jim Breen and is maintained by the Electronic Dictionary Research and Development Group (EDRDG).

We sincerely thank the authors and contributors of these sources for helping make Kanjimori possible!

Public Dataset

Public distributions of the data are unavailable while Kanjimori is under development, but we plan to make the data public through regular releases in the near future. Check back later for updates!