IndicMeet09
This page is now out of date, and will not be updated further. See here for further details on the meet.
Indic Meet 09
Venue:
Dates:
16th - 17th May 2009
Proposed Agenda
Multi Language Text: Collation Testing and Fixing
Collation definitions in glibc are ready for some Indian languages, but not for all. Work is under way to add collation rules for bn_IN (Pravin S). For some languages the existing collation rules in glibc are wrong (technically and/or linguistically).
- Collect the glibc collation status of each language
- Prepare or locate test cases for the languages that already have a collation definition
- ta_IN, or_IN and as_IN require technical fixes in the existing collation (Santhosh is working on the ta_IN fix)
- Test the data in various locales, e.g. test Tamil data in the Malayalam locale. Verify the output and fix any bugs
- Mix the test data of all languages into a single file and sort it in many locales. Verify the output (Santhosh found big differences between the outputs) and fix it
- Experiment 1: Are the glibc collation rules followed by all applications in GNOME, KDE, etc.?
- Experiment 2: Does string search in applications use locale-aware comparison or just byte comparison? Do they follow the canonical equivalence definitions of glibc?
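The cross-locale sorting test and the canonical-equivalence question above could be explored with a small script along these lines. This is only a sketch: the function names are made up for illustration, the locale names assume the corresponding glibc locales are installed, and the sample words are placeholders rather than real test data.

```python
import locale
import unicodedata


def sort_in_locale(words, locale_name):
    """Sort words using the collation rules of locale_name.

    Returns None when the locale is not installed on this system.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # remember current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


def compare_locales(words, locale_names):
    """Return {locale_name: sorted_words} for every locale that is available."""
    results = {}
    for name in locale_names:
        ordered = sort_in_locale(words, name)
        if ordered is not None:
            results[name] = ordered
    return results


def canonically_equal(a, b):
    """True when a and b are canonically equivalent (compared after NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


if __name__ == "__main__":
    # Placeholder mixed-script sample; real test files would be much larger.
    sample = ["\u0b95\u0bbe", "\u0d15\u0d3e", "\u0b85", "\u0d05"]
    outputs = compare_locales(sample, ["ta_IN.UTF-8", "ml_IN.UTF-8", "C"])
    for name, ordered in outputs.items():
        print(name, ordered)

    # Tamil vowel sign O, precomposed versus its two-part canonical form.
    composed = "\u0b95\u0bca"
    decomposed = "\u0b95\u0bc6\u0bbe"
    print("byte-equal:", composed == decomposed)
    print("canonically equal:", canonically_equal(composed, decomposed))
```

Differences between the per-locale outputs point to locales worth inspecting by hand. An application whose search treats the two Tamil strings as different is most likely comparing code points or bytes without normalizing.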
Qt locale data verification
- Many low-level QChar functions misbehave for Indic languages. Prepare test cases, test, report bugs and fix
- Make the C/C++ wide-character functions compatible with the Qt Unicode character functions
Verifying the Unicode character data that Qt uses and verifying CLDR data
- Qt extracts data from the Unicode Character Database for its string operations. Verify and fix
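One possible way to prepare test cases is to dump a reference table of character properties from an independent source and compare it with what Qt reports for the same code points. The sketch below uses Python's `unicodedata` module as that reference. The function name and the choice of the Devanagari block are assumptions for illustration, and Python's Unicode version (`unicodedata.unidata_version`) may differ from the one bundled with a given Qt release.

```python
import unicodedata


def reference_properties(start, end):
    """Collect Unicode properties for assigned code points in [start, end]."""
    rows = {}
    for cp in range(start, end + 1):
        ch = chr(cp)
        name = unicodedata.name(ch, "")
        if not name:
            continue  # skip code points with no name in this Unicode version
        rows[cp] = {
            "name": name,
            "category": unicodedata.category(ch),
            "combining": unicodedata.combining(ch),
            "decimal": unicodedata.decimal(ch, None),
            "decomposition": unicodedata.decomposition(ch),
        }
    return rows


if __name__ == "__main__":
    print("Unicode version:", unicodedata.unidata_version)
    # Devanagari block, U+0900 to U+097F.
    for cp, props in reference_properties(0x0900, 0x097F).items():
        print(f"U+{cp:04X}", props["category"], props["combining"], props["name"])
```

Each row can then be checked against the corresponding Qt result for the same code point, with any mismatch recorded as a candidate bug report.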
Project: Desktop dictionaries for all Indian Languages
- Prepare desktop dictionaries for all languages, based on the DICT protocol.
- If data is not available, create a basic dictionary so that others can add data later (bilingual English to xx_IN desktop dictionary)
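For orientation, a DICT (RFC 2229) lookup is a plain-text exchange: the client sends a `DEFINE` command and the server answers with numbered status lines. The sketch below only builds the command and parses a status line, without opening a network connection. The helper names and the database name `eng-hin` are hypothetical.

```python
def build_define_command(database, word):
    """Build a DICT DEFINE command as UTF-8 bytes, quoting the word."""
    escaped = word.replace("\\", "\\\\").replace('"', '\\"')
    return f'DEFINE {database} "{escaped}"\r\n'.encode("utf-8")


def parse_status(line):
    """Split a DICT status line into (numeric code, remaining text)."""
    code, _, text = line.strip().partition(" ")
    return int(code), text


if __name__ == "__main__":
    # "eng-hin" is a made-up database name for an English to Hindi dictionary.
    print(build_define_command("eng-hin", "water"))
    print(parse_status("552 no match\r\n"))
```

A status code of 552 means the word was not found, which is the case a basic starter dictionary would hit most often until others contribute data.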
Project: Hyphenation pattern definition for all Indian Languages
- Prepare, test and package hyphenation patterns for Indian languages (mainly for OpenOffice, but for web pages too)
Spellchecker Improvements
- Prepare and test Hunspell phonetic rules for Hindi and, if possible, for other languages too
Project: Legacy document conversion from font encoded data to Unicode: Requirements Collection
- Develop a framework for converting legacy documents with ASCII font-encoded data to Unicode (text or .doc/.odt documents; inputs from more languages required)
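At its core, such a framework could be a per-font mapping table applied with longest-match-first replacement. The sketch below shows only that core, under stated assumptions: the mapping is invented for illustration and does not correspond to any real legacy font, and the function names are hypothetical.

```python
# Invented mapping for illustration only; a real table would be built per font.
HYPOTHETICAL_FONT_MAP = {
    "k": "\u0915",   # DEVANAGARI LETTER KA
    "kh": "\u0916",  # DEVANAGARI LETTER KHA
    "a": "\u093e",   # DEVANAGARI VOWEL SIGN AA
}


def build_converter(mapping):
    """Return a function that converts text using longest-match replacement."""
    max_len = max((len(key) for key in mapping), default=0)

    def convert(text):
        out = []
        i = 0
        while i < len(text):
            # Try the longest possible legacy sequence first.
            for length in range(min(max_len, len(text) - i), 0, -1):
                chunk = text[i:i + length]
                if chunk in mapping:
                    out.append(mapping[chunk])
                    i += length
                    break
            else:
                out.append(text[i])  # no mapping: copy the character unchanged
                i += 1
        return "".join(out)

    return convert


convert = build_converter(HYPOTHETICAL_FONT_MAP)

if __name__ == "__main__":
    print(convert("kha"))
```

A table lookup alone is unlikely to be enough for real documents. Fonts that store text in visual order, for example, would also need a reordering pass, which is one reason inputs from more languages are needed.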
Indic OCR - Debayan Banerjee
Objectives/Deliverables for Indic Meet
I shall first demonstrate the working of the OCR on some sample images. Then I plan to explain the working of the OCR system on a higher level. It shall be followed by a demonstration of the problems that exist in the present system and potential solutions that I have in mind. I shall demonstrate how to train this OCR for a particular language. This should be over in 75 minutes. Then we move on to the problems I am facing. We have a discussion on possible solutions. Here are a few problems to tackle:
- Learning about the various efforts made in the past. BOCRA / Aksharbodh etc
- Dealing with the post-OCR spell-checker problem
- A better segmentation algorithm. The OCRopus curved-cut segmenter: merits and demerits
- Reducing number of character classes to be trained as explained at http://hacking-tesseract.blogspot.com/2009/05/bengali-stats.html
- Talk to Santhosh Thottingal about integrating the service to Silpa
- How to build a web interface that can train the OCR engine from user input.
Indic OCR - Gora Mohanty, Shantanu, others
Other approaches to OCR
Discussions
Brainstorming on the following topics
- L10N process. Testing the localized desktop for usability
- Indic computing status: prepare a status table and a roadmap
- A task tracker for Indic computing
- Documentation on Indic computing
- Indic OCR
Participants
- G Karunakar
- Santhosh Thottingal
- Ramakrishna Reddy Yekulla
- Kartik Mistry (kartik at debian dot org)
- Jinesh K J
- Debayan Banerjee
- Sri Ramadoss M (amachu at amachu dot net)
- Pravin Satpute
Localisation team representatives
- Assamese: Amitakhya Phukan (local)
- Bengali: Sayamindu?
- Chhattisgarhi: Ravishankar Shrivastava
- Gujarati: Kartik Mistry
- Hindi: Ravishankar, Ravikant, Karunakar
- Kannada: H P Nadig?
- Kashmiri: Aadil Kak
- Maithili: Sangeeta Kumari
- Malayalam: Who from SMC? Mahesh?
- Marathi: ?
- Oriya: Manoj Giri (local)
- Punjabi: ?
- Sanskrit: Siji (from CDAC)?
- Tamil: Ramadoss
- Telugu: Sunil Mohan?
- Urdu: Syed Shikeb
Logistics
Some travel sponsorship would come through Sarai. More is needed for accommodation and food.
The venue should have 1-2 rooms with suitable arrangements to accommodate up to 20 people, Internet over Wi-Fi and LAN, and an LCD projector. Participants are to bring their own laptops.

