From IndLinux

Planning for Indic activities at FOSS.IN
(These need not be limited to activities that satisfy the CfP criteria.) Proposals submitted to FOSS.IN/2008


Indic Computing: What Next

  • User experience.
    • How many users are using computers in their mother tongue?
    • How comfortable are they with xx_IN desktops?
    • How can we get feedback from them, and what do they expect from an xx_IN desktop?
    • How are we planning to address user experience issues?
  • Roadmap
    • Collation
    • Corpus
    • Morphological analysis tools: Stemmer, lemmatiser
    • Better data indexing and searching for Indic (brainstorming)
    • Spellcheck: addressing inflection/agglutination
    • Language-neutral interface (brainstorming)
  • Community
    • l10n teams: addressing deadlines with more team members
    • FUEL
    • Public awareness and training on Indic Desktop and computing
    • Collaboration of all language communities
proposed by Santhosh 10:46, 29 September 2008 (IST)
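The collation item above is easy to motivate with a small sketch. In the Python below, the weight table is purely illustrative (it is not real locale or CLDR data): plain code-point sorting disagrees with the order a Hindi dictionary would use as soon as nukta consonants appear.

```python
# Toy illustration of the collation problem: Python's default sort
# compares Unicode code points, which need not match dictionary order
# for Indic scripts. The weight table below is purely illustrative,
# not real locale data.

KA, QA, KHA = "\u0915", "\u0958", "\u0916"   # क, क़ (precomposed), ख
words = [KHA, KA, QA]

# Code-point order puts क़ (U+0958) after ख (U+0916), because the nukta
# consonants sit in a separate block of the code chart, not for any
# linguistic reason.
print(sorted(words))                     # ['क', 'ख', 'क़']

# A locale instead assigns explicit collation weights; Hindi
# dictionaries generally place क़ next to क, before ख.
weights = {KA: 0, QA: 1, KHA: 2}

def collation_key(word):
    # Characters missing from the table sort after all known ones.
    return [weights.get(ch, len(weights)) for ch in word]

print(sorted(words, key=collation_key))  # ['क', 'क़', 'ख']
```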
  • Geometric algorithms for Indic fonts
    • Vector renderings of Indic fonts
    • Extraction of geometric rules underlying glyphs
    • Identification of common shapes in fonts
    • Parametric definitions of such shapes
  • Indic OCR
    • Possibility of adapting Tesseract OCR for our needs
    • Current work done
    • Improvements required
    • Future of Indic OCR


Add here any talks being submitted (not required that they have been selected).

  • Building Tools using HindiASR - Sachin

This talk is intended to explain to the Indian open source community how to integrate a Hindi Automatic Speech Recognition system into any application they intend to build. It will discuss the various steps and issues involved in integration, give a brief outline of the Sphinx APIs, cover creating various language models, etc.
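For context on the language-model step: Sphinx-style statistical language models are trained from a transcript file in which each sentence is wrapped in <s> ... </s> markers. A minimal sketch of preparing such a corpus (the sentences are made up):

```python
# Sketch: preparing a sentence corpus in the format expected by the
# CMU-Cambridge LM toolkit used with Sphinx, where every sentence is
# wrapped in <s> ... </s> markers. The sentences here are made up.

def to_lm_corpus(sentences):
    """Wrap each stripped sentence in the <s> ... </s> LM-training markers."""
    return "\n".join(f"<s> {s.strip()} </s>" for s in sentences)

sentences = ["नमस्ते", "आप कैसे हैं"]
print(to_lm_corpus(sentences))
# <s> नमस्ते </s>
# <s> आप कैसे हैं </s>
```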

  • Speech Enabled Applications in Indian Languages - Santhosh 13:23, 13 October 2008 (IST)

How to develop speech-enabled applications? How to add Indic speech to them? Also, a screen reader in Indic languages for localized desktops.
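One low-tech way an application can produce Indic speech is to shell out to a free TTS engine such as espeak. The sketch below assumes espeak is installed and has a voice for the target language; the "hi" voice name is an assumption, not something confirmed by this proposal.

```python
# Sketch: a localized application speaking Indic text by shelling out
# to the espeak TTS engine. Assumes espeak is installed and provides a
# voice for the language; the default voice name "hi" is an assumption.
import subprocess

def espeak_command(text, voice="hi"):
    """Build the espeak argument list; kept separate so it can be tested."""
    return ["espeak", "-v", voice, text]

def say(text, voice="hi"):
    subprocess.run(espeak_command(text, voice), check=True)

# say("नमस्ते")  # uncomment on a machine with espeak installed
```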


Add here any workouts being submitted (not required that they have been selected).

  • Add Indic script support to Tesseract OCR - Debayan This workout will be highly interactive. We shall have a few scanned documents in Indic fonts. Participants will be asked to code the required modules/methods for the Tesseract OCR in a programming language of their choice. The one with the best results shall be converted to C/C++ and added to Tesseract. Training data will be provided by me. The current plan is to cover Hindi and Bengali. Required features include:
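As a flavour of the kind of small module a participant might prototype before it is ported to C/C++, here is a toy binarization step; the fixed threshold and list-of-lists representation are illustrative only, not part of the workout plan.

```python
# Toy example of an OCR preprocessing module: binarize a grayscale page
# (0-255 values) into ink (1) and background (0) before glyph
# segmentation. The fixed threshold is illustrative; real systems use
# adaptive thresholding.

def binarize(pixels, threshold=128):
    return [[1 if p < threshold else 0 for p in row] for row in pixels]

page = [[12, 240], [130, 90]]
print(binarize(page))  # [[1, 0], [0, 1]]
```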
  • Sorting in Indian language locales for glibc - Gora The idea is to complete checking the sorting order for as many Indian languages as we can manage. There should be little technical work left, apart from correcting bugs, and adding characters for the latest version of Unicode. This will be combined into one workout, along with the next item.
    • Background: The goal is to have a single file defining the sorting order for all Indian languages, including English. This can then simply be included into the LC_COLLATE section for each language locale.
    • Status: The sorting in indicsort should broadly work now, though it has been tested only for Hindi and Oriya. Pravin Satpute has also worked on sorting in Marathi, and has views on this topic here. See also the following page on collation data.
    • Preparatory work (before the event): For each Indian language in Unicode, prepare character tables as on this page, and as per Pravin's suggestions. Also prepare a sample list of words for each language. Circulate these, and hopefully resolve any issues beforehand.
    • Immediate work (at the event): Do a quick check of sorting in each language. Get the glibc folk (Drepper?) involved in immediately submitting a patch.
    • Longer-term work (after the event): Resolve any differences between this and the Unicode CLDR with respect to Indic sorting. Get package maintainers for Linux distributions involved in pushing the changes upstream.
    • Volunteers: Native speakers of each language are badly needed. Please add your name and the language(s) that you wish to be involved in, here: Gora (Oriya)
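The "single shared file" idea from the background note might look like the following glibc locale fragment; the `copy` directive is standard glibc locale syntax, but the file name indic_common is hypothetical, for illustration only.

```
LC_COLLATE
% Pull in a single shared definition of the Indic sorting order.
% "indic_common" is a hypothetical file name used for illustration;
% each xx_IN locale would carry the same copy directive.
copy "indic_common"
END LC_COLLATE
```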
  • Enhancing Indic spell-checking - Gora Indic spell-checking works fairly well with aspell, and it should be possible to adapt the rules used for aspell to Hunspell. At the same time, we will add rules for advanced features of aspell, namely the phonetic rules, for as many Indian languages as possible. This will be combined into one workout, along with the previous item.
    • Background: aspell and Hunspell allow the incorporation of advanced rules for spell-checking. Perhaps the most important aspect for Indian languages is to add phonetic rules. This has currently been done for Hindi, Oriya, and Punjabi, though only tested for Hindi. Also, affix rules can be added for deriving words from bases in the dictionary. Apparently, with Hunspell, it is also possible to add rules to handle agglutinative languages like Malayalam. A general write-up is here. See also the page on phonetic rules for aspell, a cookbook on creating aspell dictionaries, and a proposal to review the Hindi dictionary.
    • Preparatory work (before the event): For each Indian language in Unicode, prepare phonetic rules for aspell as on this page. This file is also included in the latest aspell Hindi dictionary distribution. Please note that the format is a little quirky.
    • Immediate work (at the event): Go over the phonetic rules for each language. Build an interface to allow dictionary review and the addition of affix flags, which is a painstaking job. Discuss merging aspell and Hunspell, and using a common spell-checking interface. Port all aspell dictionaries to Hunspell. Also, discuss using Hunspell for agglutinative languages.
    • Longer-term work (after the event): Finish porting the aspell dictionaries to Hunspell, including the advanced rules. Finish up work on a common spell-checking interface, and look at language-neutral interfaces. Consider adding spell-checking plugins for various applications.
    • Volunteers: Native speakers of each language are badly needed. Please add your name and the language(s) that you wish to be involved in, here: Gora (Oriya)
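The affix rules mentioned above take roughly the following shape in a Hunspell .aff file. The flag letter A and the derivation लड़का -> लड़कों are illustrative, not taken from any existing dictionary.

```
# Hypothetical Hunspell affix rule: derive the oblique plural of
# masculine nouns ending in ा (e.g. लड़का -> लड़कों).
# Format: SFX <flag> <strip> <add> <condition>
SFX A Y 1
SFX A ा ों ा
```

Words in the .dic file carrying flag A would then match both the base and the derived form, which is how a dictionary stays compact for inflecting languages.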
  • Machine translation - Pranav, Nandeep, Shantanu, Rakesh
    • In this session, we will start with a sample website and guide you through the usage of the existing Indic language APIs. We will use a combination of a transliteration API, which allows Indian-language text to be typed with ease, and a sample translation API for automated translation between English and Hindi. Please check this space for more details on the tools we will be using, and for specific documentation.
    • We will then discuss creating APIs for different open source machine translation engines. This, in my opinion, is important, as these are the tools that give us flexibility, both in improving a tool's efficiency and in adding language options. If time permits, we could build an API for one of them.
    • We also need to concentrate on extending the APIs to translation tools like Poedit and Entrans, through the mailing list. It would be great if we are able to extend both Poedit and Entrans before the event.
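A common API over different MT engines could be as small as the sketch below. The interface and the dictionary-backed engine are hypothetical; the toy class merely stands in for a real engine such as Apertium or Moses.

```python
# Sketch of a common interface that different open source MT engines
# could sit behind. The glossary-backed engine below is a toy stand-in
# for a real engine (e.g. Apertium or Moses) used to exercise the API.

class MTEngine:
    def translate(self, text, source, target):
        raise NotImplementedError

class GlossaryEngine(MTEngine):
    """Toy word-for-word 'engine'; unknown words pass through unchanged."""
    def __init__(self, glossary):
        self.glossary = glossary

    def translate(self, text, source, target):
        return " ".join(self.glossary.get(w, w) for w in text.split())

engine = GlossaryEngine({"hello": "नमस्ते", "world": "दुनिया"})
print(engine.translate("hello world", "en", "hi"))  # नमस्ते दुनिया
```

A tool like Poedit or Entrans would only see the `translate` method, so swapping engines would not require changes on the tool side.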
  • Translation management Framework - Karunakar

This workout is not meant to be a short 3-hour or 1-day one, but one extended over a longer period: before, during, and after the event. The precursor to the workout would be to understand the requirements and put a design and workflow in place, which would then be coded. A running prototype should already exist by the eve of the event. During the event, depending on the number of participants willing to join in, a fully functional application is to be readied, and a dry run done for one language, with mock assistance from a pseudo team. Feedback could then be taken in to reimplement some of the functionality.


Indic contributors: if you are planning to attend, list your name here.

  • Sachin Joshi
  • Gora Mohanty
  • Debayan Banerjee
  • Nandeep Mali
  • Shantanu Choudhary
  • Pranava Swaroop