IndicMeet09


This page is now out of date, and will not be updated further. See here for further details on the meet.



Indic Meet 09

Venue:

Dates:
16th - 17th May 2009

Proposed Agenda

Multi Language Text: Collation Testing and Fixing

Collation definitions in glibc are ready for some of the languages of India, but not for all. Work is under way to add collation rules for bn_IN (Pravin S). For some languages the existing collation rules in glibc are wrong, technically and/or linguistically.

  1. Collect the status of glibc collation for each language
  2. Prepare or locate test cases for the languages that already have a collation definition
  3. ta_IN, or_IN and as_IN require technical fixes to the existing collation (Santhosh is working on the ta_IN fix)
  4. Test the data in various locales, e.g. test Tamil data in the Malayalam locale. Verify the output and fix any bugs
  5. Merge the test data of all languages into a single file and sort it in many locales. Verify the output (Santhosh found big differences between the outputs) and fix it
  6. Experiment 1: Are the glibc collation rules followed by all applications in GNOME, KDE, etc.?
  7. Experiment 2: Is string search in applications locale-aware, or just a byte comparison? Does it follow the canonical equivalence definitions of glibc?
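
Steps 4 and 5 and Experiment 2 could be tried out with a short script along the following lines. This is only a minimal sketch: it assumes Python on a glibc system, the helper names (sort_in_locale, first_difference, canonically_equal) are made up for illustration, and the locale names in the demo are examples that may not be installed. Canonical equivalence is checked here with Python's own Unicode tables, not with glibc.

```python
import locale
import unicodedata


def sort_in_locale(words, locale_name):
    """Sort words with the collation rules of locale_name.

    Returns None if that locale is not installed on this system.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # remember the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # always restore


def first_difference(order_a, order_b):
    """Return (index, item_a, item_b) for the first position where two
    orderings disagree, or None if they agree over their common length."""
    for i, (a, b) in enumerate(zip(order_a, order_b)):
        if a != b:
            return i, a, b
    return None


def canonically_equal(a, b):
    """True if a and b are canonically equivalent (same NFD form),
    even when their code point sequences differ."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


if __name__ == "__main__":
    # Sample Tamil and Malayalam words; replace with real test data.
    data = ["அம்மா", "ആന", "கடல்", "മഴ"]
    for name in ("ta_IN.UTF-8", "ml_IN.UTF-8"):  # example locale names
        result = sort_in_locale(data, name)
        print(name, result if result is not None else "locale not installed")
```

Sorting the same mixed file in two locales and passing both results to first_difference points at the first entry where the locales disagree, which is a convenient place to start looking for a collation bug.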

Qt locale data verification

  1. Many low-level functions of QChar misbehave for Indic languages. Prepare test cases, test, report bugs and fix them
  2. Make the C/C++ wide character functions compatible with the Qt Unicode character functions

Verifying the Unicode character data that Qt uses, and verifying CLDR data

  1. Qt extracts data from the Unicode character database for its string operations. Verify it and fix any errors
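
One way to get reference values to compare Qt's results against is to dump the relevant properties from an independent copy of the Unicode character database. The sketch below uses Python's unicodedata module for that; the function name reference_properties is made up for illustration, and the values reflect whichever Unicode version the Python build ships (see unicodedata.unidata_version), which may differ from the version Qt was built with.

```python
import unicodedata


def reference_properties(ch):
    """Collect the Unicode properties of one character that a
    character-level test case would typically want to check."""
    return {
        "codepoint": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),        # e.g. Lo, Mn, Mc, Nd
        "combining": unicodedata.combining(ch),      # canonical combining class
        "decomposition": unicodedata.decomposition(ch),
        "digit": unicodedata.digit(ch, None),        # None if not a digit
    }


if __name__ == "__main__":
    print("Unicode data version:", unicodedata.unidata_version)
    # KA, VOWEL SIGN I, VIRAMA, DIGIT ZERO (Devanagari), VOWEL SIGN O (Bengali)
    for ch in "\u0915\u093f\u094d\u0966\u09cb":
        print(reference_properties(ch))
```

The printed table can then be checked by hand, or in a test program, against what the corresponding QChar functions return for the same code points.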

Project: Desktop dictionaries for all Indian Languages

  1. Prepare desktop dictionaries for all languages, based on the DICT protocol
  2. If data is not available, create a basic dictionary so that others can add data later (bilingual English to xx_IN desktop dictionary)
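
A basic bilingual dictionary can start from a plain word list. The sketch below turns such a list into colon-delimited ":headword:definition" lines. It is an assumption that this is the layout read by the dictfmt tool from dictd with its -j option, so check the dictfmt man page on your system before relying on it. The function name and the two sample entries are illustrative only.

```python
def to_jargon_format(entries):
    """Render a {headword: definition} mapping as ':headword:definition'
    lines, sorted by headword, one entry per line."""
    lines = []
    for headword in sorted(entries):
        if ":" in headword:
            # A colon would be confused with the field delimiter.
            raise ValueError(f"headword contains ':': {headword!r}")
        # Collapse internal whitespace so each entry stays on one line.
        definition = " ".join(entries[headword].split())
        lines.append(f":{headword}:{definition}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Sample English to Malayalam entries; real data comes from contributors.
    sample = {"water": "വെള്ളം", "book": "പുസ്തകം"}
    print(to_jargon_format(sample), end="")
```

Keeping the source as a simple mapping means contributors can add entries later without touching the generated dictionary files.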

Project: Hyphenation pattern definition for all Indian Languages

  1. Prepare, test and package hyphenation patterns for Indian languages (mainly for OpenOffice, but for web pages too)
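
For testing patterns before packaging them, a small pattern matcher is handy. The sketch below applies Liang-style patterns (letters with digits between them, where an odd digit allows a break and a higher digit wins) to a word. It is a simplified illustration: it ignores minimum left/right lengths and exception lists, and the patterns in the demo are invented examples, not real patterns for any language.

```python
DIGITS = "0123456789"


def parse_pattern(pattern):
    """Split a pattern such as 'a1b' into its letters ('ab') and the
    digit that sits in each gap ([0, 1, 0]); missing digits count as 0."""
    letters = []
    points = [0]
    for ch in pattern:
        if ch in DIGITS:
            points[-1] = int(ch)
        else:
            letters.append(ch)
            points.append(0)
    return "".join(letters), points


def hyphenate(word, patterns):
    """Return the pieces of word, split wherever the patterns allow a break."""
    padded = "." + word + "."            # '.' marks the word boundaries
    scores = [0] * (len(padded) + 1)     # scores[i] = gap before padded[i]
    for pattern in patterns:
        letters, points = parse_pattern(pattern)
        start = padded.find(letters)
        while start != -1:
            for i, value in enumerate(points):
                scores[start + i] = max(scores[start + i], value)
            start = padded.find(letters, start + 1)
    pieces = []
    last = 0
    for k in range(1, len(word)):
        # The gap before word[k] is the gap before padded[k + 1].
        if scores[k + 1] % 2 == 1:
            pieces.append(word[last:k])
            last = k
    pieces.append(word[last:])
    return pieces


if __name__ == "__main__":
    # Invented pattern: allow a break after the Devanagari vowel sign AA.
    print("-".join(hyphenate("भारत", ["\u093e1"])))
```

Running a word list through hyphenate and reviewing the output with native speakers is one way to catch breaks that fall in linguistically wrong places.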

Spellchecker Improvements

  1. Prepare and test Hunspell phonetic rules for Hindi, and if possible for other languages too

Project: Legacy document conversion from font encoded data to Unicode: Req Collection

  1. Develop a framework for converting legacy documents with ASCII font-encoded data to Unicode (text or .doc/.odt documents; inputs from more languages are required)
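
The core of such a framework is usually a per-font mapping table plus a reordering pass, because legacy fonts store glyphs in visual order while Unicode uses logical order. The sketch below shows both steps for Devanagari. The mapping table is entirely hypothetical (it does not describe any real font), and the reordering handles only the simple case of vowel sign I before a single consonant, not conjuncts.

```python
import re

# HYPOTHETICAL font-code table, for illustration only.
SAMPLE_MAP = {
    "d": "\u0915",   # KA
    "[k": "\u0916",  # KHA, a two-character code to show longest match
    "j": "\u0930",   # RA
    "k": "\u093e",   # vowel sign AA
    "f": "\u093f",   # vowel sign I, typed BEFORE the consonant in visual order
}

# Vowel sign I followed by one Devanagari consonant (U+0915 to U+0939).
VISUAL_I = re.compile("\u093f([\u0915-\u0939])")


def convert(text, table):
    """Convert font-encoded text to Unicode using longest-match lookup,
    then move vowel sign I after the consonant it belongs to.
    Characters without a table entry are passed through unchanged."""
    keys = sorted(table, key=len, reverse=True)  # try longer codes first
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            out.append(text[i])
            i += 1
    return VISUAL_I.sub("\\1\u093f", "".join(out))


if __name__ == "__main__":
    print(convert("fd", SAMPLE_MAP))   # vowel sign I reordered after KA
    print(convert("[kk", SAMPLE_MAP))  # '[k' matched as one code, then AA
```

Keeping the table as data separate from the conversion code is what would let contributors add new fonts and languages without changing the framework itself.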

Indic OCR - Debayan Banerjee

Objectives/Deliverables for Indic Meet

I shall first demonstrate the working of the OCR on some sample images. Then I plan to explain the working of the OCR system at a higher level, followed by a demonstration of the problems that exist in the present system and the potential solutions I have in mind. I shall also demonstrate how to train this OCR for a particular language. This should be over in 75 minutes. Then we move on to the problems I am facing and discuss possible solutions. Here are a few problems to tackle:

  1. Learning about the various efforts made in the past: BOCRA, Aksharbodh, etc.
  2. Dealing with the post-OCR spell-checker problem
  3. A better segmentation algorithm: the Ocropus curved cut segmenter, its merits and demerits
  4. Reducing the number of character classes to be trained, as explained at http://hacking-tesseract.blogspot.com/2009/05/bengali-stats.html
  5. Talking to Santhosh Thottingal about integrating the service into Silpa
  6. How to build a web interface that can train the OCR engine from user input


Indic OCR - Gora Mohanty, Shantanu, others

Other approaches to OCR

Discussions

Brainstorming on the following topics

  1. L10N process: testing the localized desktop for usability
  2. Indic computing status: prepare a status table and a roadmap
  3. A task tracker for Indic computing
  4. Documentation on Indic computing
  5. Indic OCR

Participants

  1. G Karunakar
  2. Santhosh Thottingal
  3. Ramakrishna Reddy Yekulla
  4. Kartik Mistry (kartik at debian dot org)
  5. Jinesh K J
  6. Debayan Banerjee
  7. Sri Ramadoss M (amachu at amachu dot net)
  8. Pravin Satpute

Localisation team representatives

  1. Assamese: Amitakhya Phukan (local)
  2. Bengali: Sayamindu?
  3. Chhatisgarhi: Ravishankar Shrivastava
  4. Gujarati: Kartik Mistry
  5. Hindi: Ravishankar, Ravikant, Karunakar
  6. Kannada: H P Nadig?
  7. Kashmiri: Aadil Kak
  8. Maithili: Sangeeta Kumari
  9. Malayalam: Who from SMC? Mahesh?
  10. Marathi: ?
  11. Oriya: Manoj Giri (local)
  12. Punjabi: ?
  13. Sanskrit: Siji (from CDAC)?
  14. Tamil: Ramadoss
  15. Telugu: Sunil Mohan?
  16. Urdu: Syed Shikeb

Logistics

Some travel sponsorship would come through Sarai. More is needed for accommodation and food.

The venue should have 1-2 rooms with suitable arrangements to accommodate up to 20 people, Internet over Wi-Fi and LAN, and an LCD projector. Participants are to bring their own laptops.
