Wednesday 6 August 2014

Kanji duplicate reduction script


漢字 · かんじ


With 975 characters left in the file (39 rows of 25), it is time to consider removing those kanji that are neither frequent in use nor rated 'general use'. But that requires a software script ... or an app !

弥哀徳尊庶霊沼鳥飛歴魚沼満跳故恭遊后蔬菜培実質允恭
傍櫻華段階妥推弥呉橋斉斉済帰畔弥呉橋根欄橋蘇我馬嶋
珍評判録坦掘済崇否争崇蘇我飛鳥蘇我済推創浮鳥舞軸線
溶鳥隠背岳雄御身得示能測澤隆伊賀掘遺跡城之越遺跡遺
跡指保護遺跡属箇涌落涌状護岬屈添改武悲舎残荒橘掘受
継録浜荒幹瀬頃追憶努展先駆示飛鳥城跡掘得知昭掘査城
条坊坪最幅細屈底玉掘催汀線複雑湾底玉縁奈濃緑紫層属
盆縁層質冷院朱雀院淳院残富湧貯巧往姿郊離別荘頃嵯峨
覚嵯峨離遺貴遺曽跡貴寝普遍寝寝遣院塩釜松浮条院丹橋
線奈受継莫催情緒換活絵絵漢詩仮院毛越極拝架橋渡御曼
越詳橘俊綱残割遣秘得張貫乞求求克遣展先駆示情緒遣筋
身父院藤頼藤頼専武打頼鎌倉受継永福頼尊毛越無院精舎
荘激死弟藤泰衡将鎮魂鎌倉育委員階阿弥陀薬伽藍遺認眼
掘査継続約掘果伽藍徐鎌倉卿資領荘得第巨富富御臨釣松
訪藤瑠璃澄比類激増鏡眺望巧将軍足満荘譲受拡荘第層楼
閣舎閣閣望楼閣俯瞰視橋閣往能竜瀑鏡湖丸極難足劣満洞
御迎幸仰満死鹿閣際破壊放職鳳章承鎌倉僧隆盛宋墨詩院
院院狭凝院院余刈段橋架堰横橋浮照夢窓疎輩夢窓疎愛芳
傑測知里似夢窓疎遊苦録残他芸匹最峰夢窓疎芳龍瑞鎌倉
帰僧蘭渓隆徳院龍養院座視智院堺衆湯飲器客湯寄屋湯客
玄別専細座向機能趣待庵府訓郡官休庵将軍城屋際遊廻遊
盛屋遊遊照兼栗香松趣熊熊屋随壊状圭郎職務余暇録資収
退職究励収資余単究伊勢栗催展身故郷桑松信九華究漢詣
漢詩晩詩激争展身横芸阿弥藤紹衷渋栄庸菱親睦網別慶雲
縣朋無鄰庵琵琶湖疏芝碧雲荘坂浜慶雲甥城亘郎各受継扇
湖荘郎座掲匠視展掲眺望視線類趣強飾郎津蘆荘継昭頃寅
郎推雑継急速雑全強類求運搬容易照項雑保鉄狭浮活改善
運動視覚評機能視視打活改善盟綱領項協協究宇精執筆保
也奇屋紹究全測昭院院批判打兼雪準慮指摘兼偕茨城別照
毛越磐郡院磐郡駒根乗谷倉福福芳苔府龍府恵指梨甲妙退
蔵院府龍曹府徳院府鹿閣府徳龍院府慈照銀閣府圓徳院府
府乗院奈奈養院庫粉竹知知院粉福芳徳龍院玉養浩福福福
福敦賀青滋賀軒蘭滋賀浜玄滋賀彦根滋賀津桂離府院離府
醍醐宝院府願院府条城丸御府府院府徳府詩府府松府幡根
城丸紅渓養翠音院鳥鳥徳城徳城御閣徳徳松玄栗条城城丸
紅渓音院渉慶雲滋賀浜無鄰庵府府府碧雲荘府慶阪府阪荘
府依奈奈裏妙庫登録念最登録件温荘潮遊個最無鄰菴慶温
荘昭福府松尾松府城府根阪府堺堺根足根根専誌念照欧鼓
橋篭奔放制職招制幕府薩摩動伊郎雇革真真視覚販売際第

The duplicate kanji are VERY visible now. 

is one example in the row

紅渓音院渉慶雲滋賀浜無鄰庵府府府碧雲荘府慶阪府阪荘

In the almost 40 rows above it, occurs 33 times. Run through the lines again. Do you start to see them ? Try separating the lines with a blank line. Try shortening the lines.
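To check a count like the one above without trusting the eye, a short script can tally the repeats. What follows is only a minimal sketch in Python, not the script used for this file; the function names count_repeats and drop_repeats are made up for illustration, and the sample is the single row quoted above rather than the whole file.

```python
from collections import Counter


def count_repeats(text):
    """Return (character, count) pairs for characters seen more than once.

    Whitespace (including linefeeds) is ignored. Pairs are ordered from
    the most frequent character to the least frequent.
    """
    counts = Counter(ch for ch in text if not ch.isspace())
    return [(ch, n) for ch, n in counts.most_common() if n > 1]


def drop_repeats(text):
    """Keep only the first occurrence of each character, in original order.

    Whitespace is dropped too, so the result is one unbroken line.
    """
    seen = set()
    kept = []
    for ch in text:
        if ch.isspace() or ch in seen:
            continue
        seen.add(ch)
        kept.append(ch)
    return "".join(kept)


# The example row quoted above (assumption: pasted in as a plain string).
row = "紅渓音院渉慶雲滋賀浜無鄰庵府府府碧雲荘府慶阪府阪荘"

for ch, n in count_repeats(row):
    print(ch, n)

print(drop_repeats(row))
```

Pasting the whole block of rows into the string instead of the single row gives the counts for the full file.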

To shorten the lines, you could first remove all '\n' linefeeds, and then use a regexp such as

FIND  expression :   (..........)
REPLACE expr:     \1\n

which in Notepad++ (with Search Mode set to "Regular expression") will give 97 rows of 10 characters, plus a last row of 5, for the file above. It says "for every 10 characters, return that selection followed by a linefeed."

The 1 in \1 refers to the first expression in parentheses. If we'd had a second, it would have been \2.
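The same find-and-replace can be tried outside Notepad++. Here is a rough Python equivalent, as a sketch only: the function name rewrap and the width parameter are my own, and (.{10}) is just a shorter way of writing the ten dots.

```python
import re


def rewrap(text, width=10):
    """Remove existing line breaks, then break the text every `width` characters."""
    # Step 1: drop all linefeeds (and carriage returns, if the file has them).
    flat = text.replace("\r", "").replace("\n", "")
    # Step 2: same idea as FIND (..........) / REPLACE \1\n.
    # \1 is whatever the first group in parentheses matched.
    pattern = r"(.{%d})" % width
    return re.sub(pattern, r"\1\n", flat)


# One 25-character row becomes two rows of 10 and a last row of 5.
print(rewrap("紅渓音院渉慶雲滋賀浜無鄰庵府府府碧雲荘府慶阪府阪荘"))
```

A leftover shorter than the width does not match the pattern, so it stays as a final short row with no linefeed after it.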

What features make complex kanji appear to be the same ? Are some easily confused if not next to each other ?

Funny, but when I read 鳥鳥 side by side I just KNOW that is not horse (馬) ! But seen alone, I can be unsure ... horse or bird (鳥) ? Crow (烏) ?




2 comments:

KanjiRecog said...

These remain in the file but are not in the 2500 newspaper kanji list used by jisho.org

俯圓坦廻櫻汀涌瀑璃甥疏瞰箇縣苔莫菴蔬蘆詣鄰醐陀

so I will remove them now.

KanjiRecog said...

Now the remaining characters include some not in the KLC 2,288 set

九也伽偕允卿圭堰堺宋峨嵯嶋幡彦敦曼桂橘浩淳澤琵琶瑞碧磐篭翠薩醍鳳龍

which has no number kanji for 0 to 10 (the original is 2,300 kanji), so now I remove them.
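Both removals in these comments are the same operation: keep what is in a reference list, set aside what is not. A minimal Python sketch of that step follows. The name split_by_reference is hypothetical, and the reference list is assumed to be available as a plain string of kanji; how to get the jisho.org or KLC list into that form is not covered here.

```python
def split_by_reference(text, reference_chars):
    """Split text into (kept, dropped) against a reference set of characters.

    kept    : the text with every character outside the reference removed
              (whitespace is left in place so the rows survive)
    dropped : each removed character once, in order of first appearance
    """
    allowed = set(reference_chars)
    kept = []
    dropped = []
    seen = set()
    for ch in text:
        if ch.isspace() or ch in allowed:
            kept.append(ch)
        elif ch not in seen:
            seen.add(ch)
            dropped.append(ch)
    return "".join(kept), "".join(dropped)


# Tiny made-up reference list, for illustration only.
reference = "鳥飛院府"
kept, dropped = split_by_reference("鳥俯飛院\n俯瞰府", reference)
print(kept)
print(dropped)
```

Printing the dropped characters before deleting anything gives a list like the ones posted above, which can be checked by eye first.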