Kanji Sieve v0.2


It has taken me a little longer than I thought to get to version 0.2 of Kanji Sieve, mainly because of the work needed to make it look better across platforms and to avoid problems a user might hit that wouldn’t be an issue for me as the developer.
However, as someone actually downloaded, looked at and commented on my initial little solution, I looked at Kanji Sieve again. A little encouragement will always prompt me to continue projects. This time I’ve taken a bit more care over the look of the Windows file. On the suggestion of Tom Hodgers I used the Meiryo font and allowed the user to change the font size of the sample text.

I added non-Jyouyou kanji and katakana words to the sieve. This may be a bit indiscriminate. What I’m doing with katakana is searching for runs of katakana and hoping these are words. They may not be. For non-Jyouyou I try to eliminate all Roman characters, kana, Jyouyou kanji, and punctuation. What’s left over in a Japanese text should be non-Jyouyou kanji. Again, unusual punctuation and foreign characters may slip through here. I do have some plans to try to refine this panel though.
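
The two filters described above could be sketched roughly like this in Python (not FileMaker, and not the actual Kanji Sieve calculation). The function names, the Unicode ranges chosen and the tiny `jouyou` set in the usage note are assumptions made for illustration; a real run would need the full Jyouyou list, and this sketch ignores half-width kana and full-width Roman characters.

```python
import re

# Runs of full-width katakana (U+30A1 to U+30FA) plus the long-vowel mark (U+30FC).
KATAKANA_RUN = re.compile(r"[\u30A1-\u30FA\u30FC]+")


def katakana_runs(text):
    """Return every run of katakana, in the hope that each run is a word."""
    return KATAKANA_RUN.findall(text)


def leftover_characters(text, jouyou):
    """Drop ASCII, Japanese punctuation, kana and Jyouyou kanji; list what remains."""
    leftovers = []
    for ch in text:
        code = ord(ch)
        if code < 0x80:                   # ASCII letters, digits, punctuation, spaces
            continue
        if 0x3000 <= code <= 0x303F:      # CJK symbols and punctuation
            continue
        if 0x3040 <= code <= 0x30FF:      # hiragana and katakana
            continue
        if ch in jouyou:                  # caller-supplied set of Jyouyou kanji
            continue
        if ch not in leftovers:           # report each leftover only once
            leftovers.append(ch)
    return leftovers
```

With a toy set such as `jouyou = set("私飲見")`, the text `私はコーヒーを飲んで、薔薇を見た。Tokyo!` gives the katakana run `コーヒー` and the leftovers `薔` and `薇`. Anything outside the listed ranges, such as accented Latin letters, would still end up in the leftovers, which is the indiscriminate behaviour mentioned above.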

Trying it out, I was surprised at the amount of non-Jyouyou kanji a friend of mine used in her mixi diary. I would have expected a larger proportion of kana and Jyouyou kanji in a personal diary. I wonder if it is due to using a word processor: it’s easier to generate those kanji, and presumably she can expect ordinary friends to read them easily. If she were writing by hand it might be different.

Lastly, I incorporated a little hack I put together to replace kanji with keywords. I did this to demonstrate how little meaning you get from keywords alone, especially when the most popular English keywords, the ones that appear as the first entry in KANJIDIC, are a bit dreadful at times. These panels may or may not survive into the next version. If my notebook ever sees the light of day, I’d generate the keywords from the user’s input, which may at least be more useful, and perhaps generate an XML file with the keyword furigana as pop-ups.
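
As a rough illustration of the keyword replacement (a Python sketch, not the FileMaker hack itself), each kanji with a known keyword is swapped for that keyword in brackets. The function name and the three-entry `keywords` table in the usage note are hand-picked assumptions, not entries copied from the dictionary file.

```python
def replace_with_keywords(text, keywords):
    """Swap each character that has a keyword for that keyword in square brackets."""
    return "".join(
        f"[{keywords[ch]}]" if ch in keywords else ch
        for ch in text
    )
```

With `keywords = {"日": "day", "本": "book", "語": "word"}`, the text `日本語を読む` becomes `[day][book][word]を読む`, which shows how far a string of single-kanji keywords can drift from what the word means.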

Further plans
I’d at least like to solve exporting. At the moment I have an issue with the flow of records of unknown length in printouts. It may just have to be an XML export.
I may make it into a multi-record solution.
I also found something very similar at the Reading Tutor web site at Tokyo International University, which has the added benefit of producing custom glossaries for articles. If I could understand how they parse for individual words I’d implement this myself.

Download from my new permanent Kanji Sieve page.

––update 11Apr10––
I’ve corrected my oversight in not filtering for half-width kana or full-width Roman characters. The non-Jyouyou and katakana panels should work a bit better now.
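
One way to deal with width variants, sketched here in Python and not necessarily how the FileMaker fix works, is to fold them to their ordinary forms before any filtering. This uses Unicode NFKC normalisation from the standard library; the function name `normalise_width` is made up for the example.

```python
import unicodedata


def normalise_width(text):
    """Fold half-width katakana to full-width and full-width Roman letters and digits to ASCII."""
    # NFKC also folds other compatibility characters, so it changes more than width alone.
    return unicodedata.normalize("NFKC", text)
```

For instance, `normalise_width("ｶﾞｲﾄﾞ ＡＢＣ１２３")` returns `ガイド ABC123`, so the half-width kana can be matched as ordinary katakana and the full-width letters dropped along with ASCII.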

10. April 2010 by ロバート
Categories: 02 reading • 読む事 | 8 comments

Comments (8)

  1. Hi Robert.

    Once again, a nice job.

    Size 18 Meiryo font looks fantastic on my small screened Ideapad S10e with 800×600 resolution.
    Some lone katakana figures/other signs show up as non-Jyouyou kanji.

    I forgot to mention to you last week that a number of years ago I successfully used the ChaSen Japanese parser with the JGloss program. It’s worth having a look at to see if you can use it (I think it’s still free). Just Google it.

    Bye for now.

    Tom

    • It seems I forgot to take half-width katakana and full-width romaji into account. Easy enough to fix. I must test for accented Latin characters now too. Oh well, work in progress. I was already working on user-definable characters to ignore.

      I forgot about JGloss. I played with it quite a long time ago before I really had any reading skills.

      I think parsing for words in Japanese is beyond FileMaker’s capabilities really. From what I gather it’s a bit brute force. You find a run of kanji and potential okurigana and check combinations against a dictionary until you get something that seems to work. The only sensible way to do it in FileMaker would be through a plugin. If I had the skills to write a plugin I wouldn’t be working in FileMaker to begin with!
      Maybe passing the text to a third party (Rikai?) via a webform within FileMaker and parsing the returned result might be more feasible.
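
The brute-force idea in this reply can be sketched as a greedy longest-match lookup. This is a Python illustration under stated assumptions: it is not how ChaSen, JGloss or Reading Tutor work internally, and the function name, the `max_len` limit and the toy dictionary in the usage note are invented for the example.

```python
def longest_match_segment(text, dictionary, max_len=8):
    """Greedy left-to-right split: at each position take the longest dictionary entry."""
    words = []
    i = 0
    while i < len(text):
        match = text[i]  # fall back to a single character if nothing longer matches
        # Try the longest candidate first, shrinking down to two characters.
        for length in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        words.append(match)
        i += len(match)
    return words
```

With the toy dictionary `{"日本語", "日本", "本", "読む"}`, the text `日本語を読む` splits into `日本語`, `を` and `読む`. A greedy match like this can pick the wrong word when entries overlap, which is one reason a proper morphological analyser is the more dependable route.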

  2. JGloss uses its own internal Kanji parser or the “ChaSen morphological analyzer”.
    ChaSen 2.1 is also a standalone program and is still available as a free download. It comes with its own dictionary. http://chasen.naist.jp/hiki/ChaSen/

    This program (chasen.exe) may be invoked for use as a text converter for page reading.

    Have a look at the following page: http://gorogoro-lab.com/pukiwiki.ini.txt
    which shows how to invoke ChaSen or Kakasi in a Wiki page.

    Cheers,
    Tom

  3. Hi Robert.

    Have a look at MeCab:
    http://sourceforge.net/projects/mecab/files/

    Looks interesting.

    Cheers,
    Tom

  4. Hello Robert,

    “Bound USR file. (right click and save to disk 288 KB) To update replace a previous bound file with this copy. There is no need to download the complete runtime again. You need a full copy of FileMaker or the Shiawase Runtime Engine to open it”.

    Bound USR file for version 0.21 is named Kanji_Sieve.USR and will not open with the version 0.2 Shiawase_Runtime_engine.exe. Error message says will only open “Kanji Sieve 0.2.USR”.

    I changed the file name to Kanji Sieve 0.2.USR but the engine does not recognize it. Had to open directly with FileMaker.

    Cheers,
    Tom

  5. Darn, FileMaker is binding the files differently than I thought. Thanks for pointing that out. For the time being it seems the only way is to download the complete runtime.

    I’ve investigated ChaSen and MeCab. I might have a way to integrate them into Kanji Sieve. I think it’ll take me a while though, and it may be a bit more difficult for the user to install.
