OCR

From IndLinux

Character Recognition for Indic Languages

Requirements

The aim of this project is to create a set of libraries and applications for computer-based character
recognition of Indic languages.
The project has two parts: (i) character recognition from scanned images, and (ii) handwriting recognition.
Character recognition from scanned images can be restricted to a known set of fonts, but handwriting
recognition has to be generic.

Generic Features

  1. Library should be able to recognize all the characters from the Unicode subset for Indic languages.
  2. Character definitions should be separate from the identification process so that new characters can be added by any lay user through some interface. This implies that the character database should be separate from the library.
  3. Identification process should be modularized, with initial image processing to remove artifacts, some pre-processing to improve character recognition, letter characterization, adaptation to document, and use of linguistic context. Different algorithms in each of these areas should be available as plugins, with the final choice of a default set of plugins that works the best on average.
  4. Library should be portable. It should run on all architectures available today.
  5. First release should be able to identify characters with at least 90% success.

{add more}
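
The modular layout described in items 2 and 3 could be sketched roughly as follows. This is only an illustration under stated assumptions, not a proposed API: the names `CharacterDatabase`, `Pipeline`, and `remove_stray_marks` are hypothetical, an image is modelled as a plain list of pixel rows, and the matcher is a deliberately trivial exact comparison standing in for a real characterization plugin.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Assumption: an image is a list of pixel rows, 0 = background, 1 = ink.
Image = List[List[int]]
Stage = Callable[[Image], Image]


@dataclass
class CharacterDatabase:
    """Character templates kept outside the library so users can add to them."""
    templates: Dict[str, Image] = field(default_factory=dict)

    def add(self, char: str, template: Image) -> None:
        self.templates[char] = template


@dataclass
class Pipeline:
    """Runs interchangeable stage plugins in order, then consults the database."""
    stages: List[Stage]
    database: CharacterDatabase

    def recognize(self, image: Image) -> str:
        for stage in self.stages:
            image = stage(image)
        # Placeholder matcher: exact template equality. A real plugin would
        # compare strokes or shapes instead of raw pixels.
        for char, template in self.database.templates.items():
            if template == image:
                return char
        return ""


def remove_stray_marks(image: Image) -> Image:
    """Toy cleanup plugin: clear ink pixels with no 4-connected ink neighbour."""
    h = len(image)
    w = len(image[0]) if h else 0
    out = [row[:] for row in image]
    for y in range(h):
        for x in range(w):
            if not image[y][x]:
                continue
            neighbours = [(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)]
            if not any(0 <= ny < h and 0 <= nx < w and image[ny][nx]
                       for ny, nx in neighbours):
                out[y][x] = 0
    return out


# Usage: the database is populated separately from the pipeline code.
db = CharacterDatabase()
db.add("\u0915", [[1, 1, 1], [0, 1, 0], [0, 0, 0]])  # made-up template for KA
pipeline = Pipeline(stages=[remove_stray_marks], database=db)
result = pipeline.recognize([[1, 1, 1], [0, 1, 0], [1, 0, 0]])
```

Because each stage is just a function from image to image, alternative algorithms can be swapped in by changing the `stages` list, and new characters can be added by calling `add` without touching the pipeline.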

Outline of generic process

  1. Improved image processing:
    • Removing stray marks
    • Curved baseline
    • Underlining, shaded background, reverse video
  2. Preprocessing
    • Pattern removal
    • Skeletonization
    • Slope, size, slant normalization
  3. Better letter characterization
    • Recognize strokes and morphological shape instead of treating letter as a collection of pixels
    • Simple and intuitive methods rather than complex algorithms, e.g., 2D Fourier transforms work well in principle but are susceptible to noise.
    • Shape and pattern matching.
    • Pattern-point extraction, e.g., counting the number of curves and corners.
  4. Adapting to document
    • Automatically recognizing and adapting to the font used.
    • Responding to document content, e.g., switching to a specialized glossary if the document is recognized as a technical one.
  5. Linguistic context
    • Multi-character recognition: Human beings are much more likely to scan text and recognize words rather than single letters. The shape of a common word is often automatically recognized.
    • Forward and reverse Markov models to predict the next letter in a sequence forming a word.
    • Using special features of languages, e.g., using the shirorekha in Hindi/Bengali, focussing on the parts of the letter that carry identifying information in Oriya, etc.
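
As a rough illustration of the forward and reverse Markov models mentioned under linguistic context, the sketch below builds a first-order (bigram) model over characters from a word list. It is a minimal example under stated assumptions, not a recommended design: the class name `BigramModel` is hypothetical, the training words are toy romanized strings rather than a real lexicon, and each Python string element (a Unicode code point) is treated as one symbol, which for Indic scripts means consonants and vowel signs are counted separately.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional


class BigramModel:
    """First-order Markov model over characters, trained on a word list.

    With reverse=False it models the letter that follows a given letter;
    with reverse=True it models the letter that precedes a given letter.
    """

    def __init__(self, words: Iterable[str], reverse: bool = False) -> None:
        self.counts: Dict[str, Counter] = defaultdict(Counter)
        for word in words:
            seq = word[::-1] if reverse else word
            for context, neighbour in zip(seq, seq[1:]):
                self.counts[context][neighbour] += 1

    def probability(self, context: str, candidate: str) -> float:
        """Estimate P(candidate | context) from raw counts (no smoothing)."""
        seen = self.counts.get(context)
        if not seen:
            return 0.0
        return seen[candidate] / sum(seen.values())

    def most_likely(self, context: str) -> Optional[str]:
        """Return the most frequent neighbour of context, or None if unseen."""
        seen = self.counts.get(context)
        if not seen:
            return None
        return seen.most_common(1)[0][0]


# Toy training data (assumption: stands in for a real word list).
words = ["kamal", "kamala", "kalam", "kam"]
forward = BigramModel(words)
backward = BigramModel(words, reverse=True)

next_after_a = forward.most_likely("a")    # most frequent letter after "a"
prev_before_m = backward.most_likely("m")  # most frequent letter before "m"
```

In a recognizer, such probabilities could be used to rank candidate letters when the shape-based match is ambiguous, combining the forward estimate from the previous letter with the reverse estimate from the following one.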

Character Recognition from scanned images

  1. Should recognize different fonts
  2. If possible, extract style information too.

{add more}

Handwriting Recognition

{todo}


Existing non-Indic solutions
