Development track

Notes for the development track: IndLinux meet, 16/17 May 2009

Locales, and related work

Current status, and existing problems

  • Santhosh:
    • Several locales do not use the common sorting definition file (iso14651_t1_common): ta_IN, as_IN, or_IN, and en_US are known culprits
    • Testing with mixed languages in one file is needed.
    • Testing in each of the multiple language locales is needed.
  • Rahul:
    • Process to add new languages: recently added Nepali, Kashmiri, Sindhi, Chhattisgarhi, and Maithili. The process (and whom to contact) needs to be documented somewhere online.
    • Data needed for new locales: Konkani, Bodo, Manipuri, Urdu (IN), Lepcha, Santhali, Saurashtra, Dogri. Gora to pursue the ones he can find; Pravin to follow up on Urdu.
  • Pravin:
    • glibc locales:
      • Progress made over the last year:
        • The collation issue was addressed
        • Indic Mashup
        • Other locale-related work
    • Links to current Unicode sorting data:
    • Problems:
      • Most people use en_US, but this locale file does not include iso14651_t1_common (the file with collation rules for all languages), so selecting en_US does not give correct collation for our languages.

Task list

The general plan is to push the glibc work through Red Hat and Pravin.

  • Automated testing of locales: Pravin to write a script for testing locale files directly (see the sketch after this list).
  • Fixing en_US: en_US should include iso14651_t1_common. Pravin will check this and submit a patch. Similar problems exist in ta_IN, as_IN, and or_IN.
    • Plan for Oriya: test the current data; if it is correct, it will be moved into iso14651_t1_common, else Gora will provide correct locale data, which will then be moved into iso14651_t1_common.
    • Assamese: Amit will check the correct sorting order of Bengali, which we will then append.
    • Tamil: Ramdas and Santhosh will test and submit a patch.
  • Qt locale data: Santhosh to submit a patch. Some links:
  • Resolution with Unicode (lower priority; after glibc is fixed): do this later, start talking with DIT, and plan for the longer term.
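
A minimal sketch, in Python, of the kind of direct locale-testing script proposed above. The locale name, sample words, and expected order are illustrative assumptions (the expected order simply follows standard Devanagari alphabetical order, ka < kha < ga); real test data should come from native speakers.

  import locale

  # Hypothetical test data: locale -> (input words, expected sorted order).
  TESTS = {
      "hi_IN.UTF-8": (["गाय", "कमल", "खरगोश"],
                      ["कमल", "खरगोश", "गाय"]),
  }

  def check_locale(name, words, expected):
      """Sort words under the given locale and report any mismatch."""
      try:
          locale.setlocale(locale.LC_COLLATE, name)
      except locale.Error:
          print(name + ": locale not installed, skipping")
          return
      got = sorted(words, key=locale.strxfrm)
      print(name + ": " + ("OK" if got == expected else "MISMATCH: got %r" % (got,)))

  if __name__ == "__main__":
      for name, (words, expected) in TESTS.items():
          check_locale(name, words, expected)

Running the same word list under en_US and under the language's own locale would also expose the missing iso14651_t1_common problem noted above, since strxfrm here uses whatever LC_COLLATE rules glibc provides.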

OCR: Google Tesseract: Debayan

These notes are incomplete; please see the presentation at http://www.indlinux.org/wiki/images/Indic_ocr.pdf

Tesseract

  • Developed originally by HP
  • Now being developed by Google as an open-source project.

Steps in OCR

  • Segmentation
  • Deskewing
  • Training

... Missing description ...

Problems

  • English has under 50 trainable characters, while Bengali/Hindi have over 1800. Possible solutions are:
    • Divide characters into base letters and matras; this reduces the character classes to around 300 (see the sketch after this list).
    • Curved-cut segmenters, e.g., as in OCRopus.
  • Make OCR available to the public:
    • Web interface
    • Integration with Silpa
    • Create a feedback-based learning system.
  • Need much more testing data: Images with corresponding text.
  • Finding new project members
  • Progress in OCR is slow on all fronts:
    • ISI, Kolkata
    • Google and Tesseract are also slow on Indic OCR.
  • Targets:
    • 98% accuracy
    • Trivial generation of training data.
    • Automated accuracy calculation.
    • Learn from web-based feedback.
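
A minimal sketch of the base-letter/matra split proposed above, done at the Unicode level: combining marks (category M*) are treated as matras and counted separately from base letters. A real OCR trainer works on glyph images rather than codepoints, so this only illustrates how the split shrinks the number of character classes.

  import unicodedata

  def split_units(text):
      """Group each base letter with its trailing matras/combining marks."""
      units = []
      for ch in text:
          # Category starting with 'M' = combining mark (matra, nukta, ...).
          if unicodedata.category(ch).startswith("M") and units:
              units[-1] += ch
          else:
              units.append(ch)
      return units

  def character_classes(text):
      """Count distinct base letters and distinct matras separately."""
      bases, marks = set(), set()
      for ch in text:
          (marks if unicodedata.category(ch).startswith("M") else bases).add(ch)
      return bases, marks

  if __name__ == "__main__":
      sample = "किताब"  # ka + i-matra, ta + aa-matra, ba
      print(split_units(sample))  # ['कि', 'ता', 'ब']
      bases, marks = character_classes(sample)
      print(len(bases), "base letters,", len(marks), "matras")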

Simple OCR: Shantanu, Gora

  • The first goal is to identify a known font in a known script at a known size. Later, relax these restrictions a step at a time.
  • Steps in simple OCR:
    • Segmentation
      • Break into lines, with separate thin lines for matras above, and under.
      • Recombine lines with matras above/under into larger lines.
      • Break lines into disjoint letters: First, simple letter breaks, and then more complicated algorithms to separate into distinct letters.
      • Combine into compound letters. Each compound letter consists of a base letter, plus any matras above and under.
      • For each compound letter, run OCR separately on the base letter and on each matra above/under. At the moment, the OCR engine is a simple XOR of the input glyph (base letter, or matra above/under) with each glyph in the font, rendered at the known size. This works reasonably well (a sketch of the XOR matcher appears after this list).
  • Removal of restrictions:
    • Automatic detection of font size: Can be done from line height.
    • Automatic detection of font: For the XOR OCR engine, all we need to know is whether the glyphs being compared are base letters, or matras above/under. For simple scripts, like Oriya, this can be done automatically by an initial pass through the glyphs in the font, classifying them by their vertical positioning.
    • Identify script automatically: From language detection schemes. This is not an immediate priority as it is perfectly fair to ask the user to specify this.
  • Immediate tasks:
    • Architecture: Pluggable modules for languages, and for pre-processing, segmentation, OCR, post-processing, etc.
    • Other OCR engines (see some thoughts below)
    • GUI
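
A minimal sketch of the XOR engine described above, assuming glyphs have already been segmented and rendered as same-sized binary bitmaps; the function names and the toy 3x3 bitmaps are illustrative, not from an existing codebase.

  import numpy as np

  def xor_distance(a, b):
      """Number of differing pixels between two same-sized binary bitmaps."""
      return int(np.count_nonzero(np.logical_xor(a, b)))

  def match_glyph(unknown, font_glyphs):
      """Return the character whose font glyph differs least from `unknown`.

      `font_glyphs` maps each character to its bitmap rendered at the known
      font size; the best match is the one with the fewest differing pixels.
      """
      return min(font_glyphs, key=lambda ch: xor_distance(unknown, font_glyphs[ch]))

  if __name__ == "__main__":
      # Toy 3x3 "glyphs" standing in for rendered base letters or matras.
      font_glyphs = {
          "|": np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]], dtype=bool),
          "-": np.array([[0, 0, 0], [1, 1, 1], [0, 0, 0]], dtype=bool),
      }
      scanned = np.array([[0, 1, 0], [0, 1, 0], [0, 0, 0]], dtype=bool)
      print(match_glyph(scanned, font_glyphs))  # "|", despite the missing pixel

The same pixel distance also gives a crude confidence score, which a GUI could use to flag low-confidence letters for manual review.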

Follow-up technical work on OCR

Spellchecker

Gora - How to create an aspell dictionary distribution

Parag - converting a raw wordlist to hunspell format - wordxtr (see the sketch below)

The wordxtr project is hosted at http://fedorahosted.org/wordxtr; the wordxtr package is also available in Fedora 10 onwards.
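
A minimal sketch of the raw-wordlist-to-hunspell conversion, illustrating the idea rather than wordxtr's actual code: a .dic file starts with an approximate word count followed by one word per line, while affix rules live in a separate .aff file, which this sketch leaves minimal.

  def wordlist_to_hunspell(wordlist_path, dic_path, aff_path, encoding="UTF-8"):
      """Write a flag-less hunspell .dic/.aff pair from a raw wordlist."""
      with open(wordlist_path, encoding=encoding) as f:
          words = sorted({line.strip() for line in f if line.strip()})
      with open(dic_path, "w", encoding=encoding) as f:
          f.write(str(len(words)) + "\n")      # first line: word count
          f.write("\n".join(words) + "\n")     # one word per line, no flags
      with open(aff_path, "w", encoding=encoding) as f:
          f.write("SET " + encoding + "\n")    # minimal .aff: just the encoding

  # Example with hypothetical file names:
  # wordlist_to_hunspell("ml_words.txt", "ml_IN.dic", "ml_IN.aff")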

Font converters: Shantanu

Desktop GUI

Encompasses other converters.

Handwriting recognition: Rahul Bhalerao

  • Available backends
    • cellwriter
    • tomoe
  • Integration with ibus
  • UI Problems
    • a fast draw, suggest, and default-input flow
    • conjuncts and syllables
  • Multilingual issues
  • Training process
    • stroke editor
    • cellwriter training
  • Todo
    • Upload the developments and guidelines
    • Define a roadmap and distribute it through IndLinux
  • Contributions expected for the training data

Santhosh

Transliteration with suggestions

  • Need input on follow-up letters in other languages (see the sketch below)
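
A minimal sketch of how follow-up-letter suggestions might work, assuming a simple Latin-to-Devanagari prefix map; both the tiny mapping and this reading of "follow-up letters" are assumptions for illustration only.

  ROMAN_TO_DEVANAGARI = {
      "k": "क", "kh": "ख", "g": "ग", "gh": "घ",
      "a": "अ", "aa": "आ", "i": "इ",
  }

  def suggestions(typed):
      """Return every mapped sequence that could still follow the input."""
      return {seq: ROMAN_TO_DEVANAGARI[seq]
              for seq in ROMAN_TO_DEVANAGARI if seq.startswith(typed)}

  if __name__ == "__main__":
      # After typing "k", both क (k) and ख (kh) are still possible, so the
      # UI can show ख as a suggestion while defaulting to क.
      print(suggestions("k"))   # {'k': 'क', 'kh': 'ख'}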

Hyphenation

Several other projects are hosted at http://smc.org.in/silpa/

Links to development presentations