Notes for development track: IndLinux meet, 16/17 May, 2009
Current status, and existing problems
- Several locales do not use a common sorting definition file: ta_IN, as_IN, or_IN, and en_US are known culprits
- Testing with mixed languages in one file needed.
- Testing in each of multiple language locales needed
- Process to add new languages: Nepali, Kashmiri, Sindhi, Chhattisgarhi, and Maithili were added recently. Need to document the process (and whom to contact) somewhere online.
- Data needed for new locales: Konkani, Bodo, Manipuri, Urdu (IN), Lepcha, Santhali, Saurashtra, Dogri. Gora to pursue the ones he can find. Pravin to follow up on Urdu.
- glibc locales:
- Progress in the last year
- Collation issue was addressed.
- Indic Mashup
- Locale Related Stuff
- Links to current Unicode sorting data:
- Unicode collation table: http://www.unicode.org/Public/UCA/latest/allkeys.txt
- Most people use en_US, but that locale file does not include iso14651_t1_common (the file with the collation rules for all languages), so selecting en_US does not give correct collation for our languages.
- glibc locales:
General plan is to push glibc work through Red Hat and Pravin.
- Automated testing of locales: Pravin to write a script for testing locale file directly.
- Fixing en_US: en_US should include iso14651_t1_common. Pravin will check this and submit a patch. Similar problems exist in ta_IN, as_IN, and or_IN.
- Plan for Oriya: Test the current locale data. If it is correct, move it to iso14651_t1_common; otherwise, Gora will provide correct locale data, and that will be moved to iso14651_t1_common.
- Assamese: Amit will check the correct sorting order of Bengali; we will append the same.
- Tamil: Ramdas and Santhosh will test and submit a patch.
- Qt locale data: Santhosh to submit patch. Some links:
- https://bugs.kde.org/show_bug.cgi?id=176537 . Problem: The Qt isLetter function returns wrong data for Indic characters; e.g., for ा it returns punctuation. This breaks word-boundary detection. People to help him by making test cases for as many Indic languages as possible.
- Test file: http://pravins.fedorapeople.org/main.cpp
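The word-boundary problem above can be illustrated outside Qt. A minimal sketch in Python, using the standard unicodedata module: the Unicode general category of ा (U+093E) is Mc (spacing combining mark), so a word-boundary test needs to accept marks as well as letters. The helper names is_word_char and split_words are hypothetical and are not part of Qt or of the test file linked above; treating ZWJ/ZWNJ as word characters is also an assumption.

```python
import unicodedata

# ZWNJ and ZWJ have category Cf, but occur inside Indic words (assumption:
# we want them kept with the word).
JOINERS = "\u200c\u200d"


def is_word_char(ch):
    """Accept letters (L*), combining marks (M*), decimal digits (Nd), ZWJ/ZWNJ."""
    cat = unicodedata.category(ch)
    return cat[0] in ("L", "M") or cat == "Nd" or ch in JOINERS


def split_words(text):
    """Split text into words, keeping matras attached to their base letters."""
    words, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


if __name__ == "__main__":
    print(unicodedata.category("\u093e"))  # category of the AA matra
    print(split_words("\u0930\u093e\u092e \u0906\u092f\u093e"))
```

Similar test strings for other Indic scripts could serve as the test cases asked for above.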
- Resolution with Unicode: Lower priority, after glibc is fixed. Do later; start talking with DIT, and plan for the longer term.
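For the automated locale testing planned above, a minimal sketch of what such a script might look like, in Python. This is not Pravin's script. It relies on locale.strxfrm, which uses the C library's collation data, so on a glibc system it exercises the installed locale files. The function names sort_in_locale and check_locales are hypothetical, and the expected order for each language has to come from a language expert.

```python
import locale


def sort_in_locale(words, locale_name):
    """Sort words with the collation rules of locale_name.

    Returns None if the locale is not installed on this system.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # always restore


def check_locales(words, expected, locale_names):
    """Compare the sorted order in each locale against the expected order."""
    report = {}
    for name in locale_names:
        result = sort_in_locale(words, name)
        if result is None:
            report[name] = "not installed"
        elif result == expected:
            report[name] = "ok"
        else:
            report[name] = "mismatch"
    return report


if __name__ == "__main__":
    # Placeholder data: replace with a word list and its verified order.
    words = ["b", "c", "a"]
    expected = ["a", "b", "c"]
    for name, status in check_locales(words, expected, ["C", "en_US.UTF-8", "or_IN.UTF-8"]).items():
        print(name, status)
```

Running the same mixed-language word list under en_US and under each Indic locale would cover both testing items listed under current status.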
OCR: Google Tesseract: Debayan
This is incomplete: Please see the presentation at: http://www.indlinux.org/wiki/images/Indic_ocr.pdf
- Developed originally by HP
- Now being developed by Google as an open-source project.
Steps in OCR
... Missing description ...
- Under 50 trainable characters in English. Over 1800 trainable characters in Bengali/Hindi. Possible solutions are:
- Divide characters into base letters, and matras. Reduces character classes to around 300.
- Curved cut segmenters, e.g., as in OCROpus.
- Make OCR available to public.
- Web interface
- Integration with Silpa
- Create a feedback-based learning system.
- Need much more testing data: Images with corresponding text.
- Finding new project members
- Progress in OCR is slow on all fronts:
- ISI, Kolkata
- Google and Tesseract also slow on Indic OCR.
- 98% accuracy
- Trivial generation of training data.
- Automated accuracy calculation.
- Learn from web-based feedback.
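One way the automated accuracy calculation could work, given images with corresponding text, is a character-level comparison of the OCR output against the known text. The sketch below uses Levenshtein edit distance. This metric is an assumption, not something decided at the meet, and the names edit_distance and char_accuracy are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def char_accuracy(truth, ocr_output):
    """Fraction of the ground-truth characters that were recognised correctly."""
    if not truth:
        return 1.0 if not ocr_output else 0.0
    return max(0.0, 1.0 - edit_distance(truth, ocr_output) / len(truth))


if __name__ == "__main__":
    print(char_accuracy("abcd", "abxd"))
```

Because the comparison is per code point, a wrong matra counts as one error and a wrong base letter as another.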
Simple OCR: Shantanu, Gora
- First goal is to identify a known font in a known script at a known size. Later, relax these restrictions a step at a time.
- Steps in simple OCR:
- Break into lines, with separate thin lines for matras above, and under.
- Recombine lines with matras above/under into larger lines.
- Break lines into disjoint letters: First, simple letter breaks, and then more complicated algorithms to separate into distinct letters.
- Combine into compound letters. Each compound letter consists of a base letter, plus any matras above and under.
- For each compound letter, do an OCR, separately for base letter, and for each matra above/under. At the moment, the OCR engine is a simple XOR of the input glyph (base letter, or matra above/under) with each glyph in the font, rendered at the known size. This works reasonably well.
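The XOR comparison in the last step can be sketched as follows. Bitmaps are represented here as lists of rows of 0/1 pixels, all the same size; that representation, and the names xor_distance and recognise, are assumptions for illustration and not the actual implementation.

```python
def xor_distance(a, b):
    """Number of pixels that differ between two same-sized binary bitmaps."""
    return sum(pa ^ pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


def recognise(glyph, font_glyphs):
    """Return the label of the font glyph with the fewest differing pixels."""
    return min(font_glyphs, key=lambda label: xor_distance(glyph, font_glyphs[label]))


if __name__ == "__main__":
    # Toy 3x3 "font"; a real one would be rendered at the known size.
    font = {
        "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
        "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
    }
    noisy = [[0, 1, 0], [0, 1, 0], [0, 1, 1]]  # an "I" with one stray pixel
    print(recognise(noisy, font))
```

Base letters and matras would each be matched against their own subset of the font, as described in the step above.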
- Removal of restrictions:
- Automatic detection of font size: Can be done from line height.
- Automatic detection of font: For XOR OCR engine, all we need to know is whether the glyphs being compared are base letters, or matras above/under. For simple scripts like Oriya, this can be done automatically by an initial pass through the glyphs in the font, and classifying them by their vertical positioning.
- Identify script automatically: From language detection schemes. This is not an immediate priority as it is perfectly fair to ask the user to specify this.
- Immediate tasks:
- Architecture: Pluggable modules for languages, and for pre-processing, segmentation, OCR, post-processing, etc.
- Other OCR engines (see some thoughts below)
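The line-breaking step and the line-height estimate mentioned above could be prototyped as below. This sketch assumes a binarised page given as a list of rows of 0/1 pixels and splits it wherever a row contains no ink (a horizontal projection). The notes do not say which method is actually used, and find_lines and median_line_height are hypothetical names.

```python
def find_lines(image):
    """Return (top, bottom) row ranges, bottom exclusive, of runs of inked rows."""
    lines, start = [], None
    for y, row in enumerate(image):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # a new line begins
        elif not has_ink and start is not None:
            lines.append((start, y))       # blank row ends the line
            start = None
    if start is not None:
        lines.append((start, len(image)))  # line runs to the page bottom
    return lines


def median_line_height(lines):
    """Median height of the detected lines, usable as a rough font-size cue."""
    heights = sorted(bottom - top for top, bottom in lines)
    return heights[len(heights) // 2] if heights else 0


if __name__ == "__main__":
    page = [
        [0, 0, 0],
        [1, 1, 0],
        [0, 1, 0],
        [0, 0, 0],
        [0, 0, 1],
    ]
    lines = find_lines(page)
    print(lines, median_line_height(lines))
```

Thin strips found this way would correspond to the separate matra lines, which then need recombining with their neighbouring main line.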
Follow-up technical work on OCR
- Grab-bag of additional thoughts for simple OCR
- Methods for OCR engine
- Adaptive image recognition
- Neural net
- Letter features
- Use degraded images as Tesseract does.
- Links for Tesseract from Debayan
- Debayan's blog: http://hacking-tesseract.blogspot.com/
- Tesseract project site: http://code.google.com/p/tesseract-ocr/
- Debayan's project for Indic OCR with Tesseract: http://code.google.com/p/tesseractindic/
- Training Tesseract: http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract
- Google group for Tesseract: http://groups.google.com/group/tesseract-ocr
- Thread on Tesseract Google group: http://groups.google.com/group/tesseract-ocr/browse_thread/thread/5495c4e348a4b272/6a14c25cafb84a5f?lnk=gst&q=dawg#6a14c25cafb84a5f
Gora - How to create an aspell dictionary distribution
- Existing aspell6 Indian-language wordlists: Only Hindi has phonetic rules as of yet.
- Bengali, v-0.01.1-1, 110752 words
- Gujarati, v0.03-0, 75105 words
- Hindi, v0.02-0, 83522 words
- Malayalam, v0.03-1, 142591 words
- Marathi, v0.10-0, 70673 words
- No Nepali wordlist
- Oriya, v0.03-1, 1029 words
- Punjabi, v0.01-1, 2045 words
- Tamil, v20040424-1, 13940 words
- Telugu, v0.01-2, 125111 words
- No Urdu wordlist
- Existing Hunspell Indian-language wordlists:
- Bengali, v-0.01.1-1, 110750 words
- Gujarati, v2006-10-15, 168957 words
- Hindi, v2007-02-19, 15990 words
- Malayalam, v2008-05-19, 142591 words
- Marathi, v2006-09-26, 12631 words
- Nepali, v2006-10-17, 36849 words
- Oriya, v2005-01-25, 1029 words
- Punjabi, 2005-01-25, 2045 words
- Tamil from SMC site, 27995 words
- No Telugu wordlist
- Urdu, v2007-01-26, 33649 words
- Notes on resolving differences between files:
- Bengali: Wordlists are similar, but 25% of the words differ (seemingly mostly in ZWJ/ZWNJ usage). Run this by Sankarshan, Runa, and the Ankur Bangla maintainers.
- Gujarati: Hunspell wordlist has many more words. Send phonetic rules to Kartik. Let him resolve it, and make a new aspell distribution.
- Hindi: Gora to make Hunspell distribution.
- Malayalam wordlists are identical.
- Marathi Hunspell distribution seems to be the old one from Janabhaaratii. Gora to make new one.
- No Nepali wordlist for aspell. Gora to make one, and get approval from Nepalinux.
- Oriya wordlists are identical. Gora to update both.
- Punjabi wordlists are identical.
- Tamil: Ask Santhosh for clarification. SMC Tamil wordlist has many more words than OpenOffice Wiki.
- Telugu: Gora to make Hunspell distribution, clear with Telugu team.
- Urdu: Gora to make Hunspell distribution.
- Phonetic rules from Foss.in 2008 discussions:
- Hunspell phonetic rules for Hindi : http://smc.org.in/~santhosh/spellcheck/hi_IN.aff - Completed.
- Hunspell Tamil wordlist: http://smc.org.in/~santhosh/spellcheck/ta_IN.aff and http://smc.org.in/~santhosh/spellcheck/ta_IN.dic
- Gora to make Hunspell equivalents of all aspell phonetic rules that are available.
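For the wordlist differences noted above (e.g., Bengali, where the mismatch seems to be mostly ZWJ/ZWNJ), a comparison that ignores the two joiner characters can show how much of the difference they account for. A minimal sketch, assuming each wordlist has already been read into a list of words; the function names are hypothetical.

```python
# Translation table that deletes ZWNJ (U+200C) and ZWJ (U+200D).
JOINER_TABLE = str.maketrans("", "", "\u200c\u200d")


def strip_joiners(word):
    """Remove ZWJ and ZWNJ from a word."""
    return word.translate(JOINER_TABLE)


def compare_wordlists(list_a, list_b):
    """Count common and differing words, and how many words unique to list_a
    match a word unique to list_b once joiners are ignored."""
    set_a, set_b = set(list_a), set(list_b)
    only_a, only_b = set_a - set_b, set_b - set_a
    normalised_b = {strip_joiners(w) for w in only_b}
    joiner_only = {w for w in only_a if strip_joiners(w) in normalised_b}
    return {
        "common": len(set_a & set_b),
        "only_a": len(only_a),
        "only_b": len(only_b),
        "joiner_only": len(joiner_only),
    }


if __name__ == "__main__":
    aspell_words = ["ab", "c\u200dd", "e"]   # placeholder data
    hunspell_words = ["ab", "cd", "f"]
    print(compare_wordlists(aspell_words, hunspell_words))
```

If joiner_only is close to only_a, the two lists differ almost entirely in joiner usage, which is the question to put to the maintainers.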
Parag - converting raw wordlist to hunspell - wordxtr
The wordxtr project is hosted at http://fedorahosted.org/wordxtr . The wordxtr package is also available in Fedora 10 onwards.
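For orientation, the target format is simple: a Hunspell .dic file starts with a line giving the word count, followed by one word per line. The sketch below shows that conversion in Python. It is not wordxtr's implementation or interface (not shown here); make_dic is a hypothetical name, and affix flags are left out.

```python
def make_dic(raw_text):
    """Build the contents of a flag-less Hunspell .dic file from a raw wordlist.

    Blank lines and duplicates are dropped; the first output line is the
    word count, as the .dic format expects.
    """
    words = sorted({line.strip() for line in raw_text.splitlines() if line.strip()})
    return "\n".join([str(len(words))] + words) + "\n"


if __name__ == "__main__":
    raw = "b\na\n\nb\n"  # placeholder wordlist with a duplicate and a blank line
    print(make_dic(raw), end="")
```

A matching .aff file is also needed; for a UTF-8 wordlist it would at least declare the encoding with a SET UTF-8 line.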
Font converters: Shantanu
Encompasses other converters.
Handwriting recognition: Rahul Bhalerao
- Available backends
- Integration with ibus
- UI Problems
- fast draw-suggest-n-defaultinput
- conjuncts and syllables
- multilingual issues
- Training process
- stroke editor
- cellwriter training
- Upload the developments and guidelines
- Define a roadmap and distribute it through indlinux
- Contributions expected for the training data
- Parent Project for addressing Indic Handwriting Recognition
- iAkshar https://fedorahosted.org/iakshar/
- The portal is yet to be populated with information
Transliteration with suggestions
- Need input on follow-up letters in other languages
- Need language coordinators to check hyphenation for their languages at http://santhoshtr.livejournal.com/15266.html