Showing posts with label regexp. Show all posts

Thursday, 7 August 2014

plain text 800 kanji


漢字 · かんじ

Aule Kanji Pages · Kanji Recog Pages

This is my exercise page, reduced to 800 kanji, of which 360 are unique:

弥哀徳尊庶霊沼鳥飛歴沼満跳故恭遊后菜培実
質恭傍華段階推弥呉橋斉斉済帰畔弥呉橋根欄
橋蘇我馬珍評判録掘済崇否争崇蘇我飛鳥蘇我
済推創浮鳥舞軸線溶鳥隠背岳雄御得能測隆伊
賀掘遺跡城越遺跡遺跡保護遺跡属落状護岬屈
添武悲舎残荒掘受継録浜荒幹瀬頃追憶努展駆
飛鳥城跡掘得知昭掘査城条坊坪最幅細屈底掘
催線複雑湾底縁奈濃緑紫層属盆縁層質冷院朱
雀院院残富湧貯巧往姿郊離別荘頃覚離遺貴遺
曽跡貴寝普遍寝寝遣院塩釜松浮条院橋線奈受
継催情緒換絵絵漢仮院越極拝架橋渡御越詳俊
綱残割遣秘得張貫求求克遣展駆情緒遣筋院藤
頼藤頼専武頼鎌倉受継福頼尊越無院精舎荘激
死弟藤泰衡将鎮魂鎌倉育委員階阿弥薬藍遺認
眼掘査継続約掘果藍徐鎌倉資領荘得富富御臨
釣松訪藤瑠澄類激増鏡眺望巧将軍満荘譲受拡
荘層楼閣舎閣閣望楼閣橋閣往能竜鏡湖極難劣
満洞御迎仰満死鹿閣際破壊放職承鎌倉僧隆盛
墨院院院狭凝院院余刈段橋架横橋浮照夢窓疎
輩夢窓疎愛芳傑測知似夢窓疎遊苦録残他芸匹
最峰夢窓疎芳鎌倉帰僧蘭渓隆徳院養院座智院
衆湯飲器客湯寄屋湯客玄別専細座向機能趣待
庵訓郡官休庵将軍城屋際遊遊盛屋遊遊照兼栗
香松趣熊熊屋随壊状郎職務余暇録資収退職究
励収資余単究伊勢栗催展故郷桑松信華究漢漢
晩激争展横芸阿弥藤紹衷渋栄庸菱親睦網別慶
雲朋無庵湖芝雲荘坂浜慶雲城亘郎受継扇湖荘
郎座掲匠展掲眺望線類趣強飾郎津荘継昭頃寅
郎推雑継急速雑強類求運搬容易照項雑保鉄狭
浮善運動覚評機能善盟綱領項協協究精執筆保
奇屋紹究測昭院院批判兼雪準慮摘兼茨城別照
越郡院郡駒根乗谷倉福福芳恵梨甲妙退蔵院曹
徳院鹿閣徳院慈照銀閣徳院乗院奈奈養院庫粉
知知院粉福芳徳院養福福福福賀青滋賀軒蘭滋
賀浜玄滋賀根滋賀津離院離宝院願院条城御院
徳松根城紅渓養音院鳥鳥徳城徳城御閣徳徳松
玄栗条城城紅渓音院渉慶雲滋賀浜無庵雲慶阪
阪荘依奈奈裏妙庫登録念最登録件温潮遊個最
無慶温昭福松尾松城根阪専誌念照欧鼓橋奔放
制職招制幕摩動伊郎雇革覚販売際遊遊盛屋遊
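Counts like the ones quoted above can be checked with a few lines of Python. This is only a sketch: the function names `kanji_stats` and `drop_repeats` are made up for illustration, and `sample` holds just the first two rows of the page rather than the whole file.

```python
def kanji_stats(text):
    """Return (total, unique) character counts, ignoring whitespace."""
    chars = [ch for ch in text if not ch.isspace()]
    return len(chars), len(set(chars))


def drop_repeats(text):
    """Keep only the first occurrence of each character, in original order."""
    chars = [ch for ch in text if not ch.isspace()]
    # dict.fromkeys preserves insertion order, so the first sighting wins.
    return "".join(dict.fromkeys(chars))


# First two rows of the page above, used as a small sample.
sample = (
    "弥哀徳尊庶霊沼鳥飛歴沼満跳故恭遊后菜培実\n"
    "質恭傍華段階推弥呉橋斉斉済帰畔弥呉橋根欄"
)

total, unique = kanji_stats(sample)
print(total, unique)
print(drop_repeats(sample))
```

Pasting the full page into `sample` would give the totals for the whole exercise.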




Wednesday, 6 August 2014

Kanji duplicate reduction script



With 975 characters left in the file (just under 40 rows of 25), it is time to consider removing those kanji that are neither frequently used nor rated 'general use'. But that requires a script... or an app!
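A minimal sketch of what such a script might look like in Python. The `allowed` set is a hypothetical stand-in for a real list of frequent or general-use kanji, which would have to come from elsewhere. The name `keep_only` is illustrative too.

```python
def keep_only(text, allowed):
    """Drop every character not in `allowed`; whitespace (newlines) is kept."""
    return "".join(ch for ch in text if ch.isspace() or ch in allowed)


# Hypothetical keep-list: a real one would hold the frequent / general-use kanji.
allowed = set("徳尊鳥飛歴満実質")

# First row of the file, used as a small sample.
row = "弥哀徳尊庶霊沼鳥飛歴魚沼満跳故恭遊后蔬菜培実質允恭"
print(keep_only(row, allowed))
```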

弥哀徳尊庶霊沼鳥飛歴魚沼満跳故恭遊后蔬菜培実質允恭
傍櫻華段階妥推弥呉橋斉斉済帰畔弥呉橋根欄橋蘇我馬嶋
珍評判録坦掘済崇否争崇蘇我飛鳥蘇我済推創浮鳥舞軸線
溶鳥隠背岳雄御身得示能測澤隆伊賀掘遺跡城之越遺跡遺
跡指保護遺跡属箇涌落涌状護岬屈添改武悲舎残荒橘掘受
継録浜荒幹瀬頃追憶努展先駆示飛鳥城跡掘得知昭掘査城
条坊坪最幅細屈底玉掘催汀線複雑湾底玉縁奈濃緑紫層属
盆縁層質冷院朱雀院淳院残富湧貯巧往姿郊離別荘頃嵯峨
覚嵯峨離遺貴遺曽跡貴寝普遍寝寝遣院塩釜松浮条院丹橋
線奈受継莫催情緒換活絵絵漢詩仮院毛越極拝架橋渡御曼
越詳橘俊綱残割遣秘得張貫乞求求克遣展先駆示情緒遣筋
身父院藤頼藤頼専武打頼鎌倉受継永福頼尊毛越無院精舎
荘激死弟藤泰衡将鎮魂鎌倉育委員階阿弥陀薬伽藍遺認眼
掘査継続約掘果伽藍徐鎌倉卿資領荘得第巨富富御臨釣松
訪藤瑠璃澄比類激増鏡眺望巧将軍足満荘譲受拡荘第層楼
閣舎閣閣望楼閣俯瞰視橋閣往能竜瀑鏡湖丸極難足劣満洞
御迎幸仰満死鹿閣際破壊放職鳳章承鎌倉僧隆盛宋墨詩院
院院狭凝院院余刈段橋架堰横橋浮照夢窓疎輩夢窓疎愛芳
傑測知里似夢窓疎遊苦録残他芸匹最峰夢窓疎芳龍瑞鎌倉
帰僧蘭渓隆徳院龍養院座視智院堺衆湯飲器客湯寄屋湯客
玄別専細座向機能趣待庵府訓郡官休庵将軍城屋際遊廻遊
盛屋遊遊照兼栗香松趣熊熊屋随壊状圭郎職務余暇録資収
退職究励収資余単究伊勢栗催展身故郷桑松信九華究漢詣
漢詩晩詩激争展身横芸阿弥藤紹衷渋栄庸菱親睦網別慶雲
縣朋無鄰庵琵琶湖疏芝碧雲荘坂浜慶雲甥城亘郎各受継扇
湖荘郎座掲匠視展掲眺望視線類趣強飾郎津蘆荘継昭頃寅
郎推雑継急速雑全強類求運搬容易照項雑保鉄狭浮活改善
運動視覚評機能視視打活改善盟綱領項協協究宇精執筆保
也奇屋紹究全測昭院院批判打兼雪準慮指摘兼偕茨城別照
毛越磐郡院磐郡駒根乗谷倉福福芳苔府龍府恵指梨甲妙退
蔵院府龍曹府徳院府鹿閣府徳龍院府慈照銀閣府圓徳院府
府乗院奈奈養院庫粉竹知知院粉福芳徳龍院玉養浩福福福
福敦賀青滋賀軒蘭滋賀浜玄滋賀彦根滋賀津桂離府院離府
醍醐宝院府願院府条城丸御府府院府徳府詩府府松府幡根
城丸紅渓養翠音院鳥鳥徳城徳城御閣徳徳松玄栗条城城丸
紅渓音院渉慶雲滋賀浜無鄰庵府府府碧雲荘府慶阪府阪荘
府依奈奈裏妙庫登録念最登録件温荘潮遊個最無鄰菴慶温
荘昭福府松尾松府城府根阪府堺堺根足根根専誌念照欧鼓
橋篭奔放制職招制幕府薩摩動伊郎雇革真真視覚販売際第

The duplicate kanji are VERY visible now.

府 is one example, in the row

紅渓音院渉慶雲滋賀浜無鄰庵府府府碧雲荘府慶阪府阪荘

In the almost 40 rows above, 府 occurs 33 times. Run through the lines again. Do you start to see them? Try separating the lines with a blank line. Try shortening the lines.
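If the eye needs help, the repeats can also be counted. The sketch below uses Python's `collections.Counter` on the single row quoted above. `repeated_kanji` is just an illustrative name, and the same call would work on the whole file.

```python
from collections import Counter


def repeated_kanji(text, min_count=2):
    """List (kanji, count) pairs for characters seen at least `min_count` times."""
    counts = Counter(ch for ch in text if not ch.isspace())
    # most_common() sorts from the most frequent character downwards.
    return [(ch, n) for ch, n in counts.most_common() if n >= min_count]


row = "紅渓音院渉慶雲滋賀浜無鄰庵府府府碧雲荘府慶阪府阪荘"
for ch, n in repeated_kanji(row):
    print(ch, n)
```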

You could first remove all '\n' linefeeds, and then use a regexp such as

Find expression:     (..........)
Replace expression:  \1\n

which in Notepad++ will give almost 100 rows of 10 characters for the file above. It says "for every 10 characters, return that selection followed by a linefeed."

The \1 in the replacement refers to whatever was matched by the first expression in parentheses. If we'd had a second pair of parentheses, it would have been \2.
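The same find-and-replace can be tried outside the editor. Here is a rough Python equivalent using the standard `re` module. The function name `rewrap` and its `width` parameter are my own, and `.{10}` is simply a shorter way of writing ten dots.

```python
import re


def rewrap(text, width=10):
    """Remove existing line breaks, then break the text every `width` characters."""
    flat = text.replace("\r", "").replace("\n", "")
    pattern = "(.{" + str(width) + "})"  # same idea as (..........) for width=10
    # \1 puts back the captured characters; \n adds the linefeed after them.
    return re.sub(pattern, r"\1\n", flat)


row = "紅渓音院渉慶雲滋賀浜無鄰庵府府府碧雲荘府慶阪府阪荘"
print(rewrap(row))
```

Any characters left over at the end (fewer than `width`) stay on a final, shorter line.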

What features make complex kanji appear to be the same? Are some easily confused if not next to each other?

Funny, but when I read 鳥鳥 side by side I just KNOW that it is not horse (馬)! But seen alone, I can be unsure... horse or bird (鳥)? Crow (烏)?