Optical Character Recognition (OCR)

15-Jan-2020: Bharati Script

Researchers from IIT Madras have already developed a unified script for nine Indian languages, named the Bharati script. Going a step further, they have now developed a method for reading documents in Bharati script using a multi-lingual optical character recognition (OCR) scheme.

The scheme involves first separating (or segmenting) the document into text and non-text. The text is then segmented into paragraphs, sentences, words and letters. Each letter has to be recognised as a character in a standard encoding such as ASCII or Unicode. A letter has various components, such as the basic consonant, consonant modifiers and vowels.

Bharati is an alternative script for the languages of India, developed by a team at the Indian Institute of Technology (IIT) Madras led by Dr. Srinivasa Chakravarthy.

The scripts that have been integrated include Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Telugu, Kannada, Malayalam and Tamil.

The Bharati characters are made up of three tiers stacked vertically. The consonant at the root of the letter is placed in the centre and the modifiers are in the top and bottom tiers. Bharati has, in general, 17 vowels and 22 consonants.

It is hoped that a common script for the entire country will bring down many communication barriers in India.

28-Apr-2019: IIT Madras team develops easy OCR system for nine Indian languages

Taking a cue from European languages, several of which have the same (Roman letter–based) script, Srinivasa Chakravarthy's team at IIT Madras has, over the last decade, developed a unified script for nine Indian languages, named the Bharati script. The team has now gone a step further since developing the script: it has developed a method for reading documents in Bharati script using a multi-lingual optical character recognition (OCR) scheme. The team has also created a finger-spelling method that can be used to generate a sign language for hearing-impaired persons. In collaboration with TCS Mumbai, the researchers have found a way for persons with hearing disability to generate signatures using this finger-spelling technique.

The scripts that have been integrated include Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Telugu, Kannada, Malayalam and Tamil. English and Urdu have not been integrated so far. The Urdu and English alphabet systems have a very different phonetic organisation, but that does not rule out a mapping; one is quite possible.

In general, optical character recognition schemes involve first separating (or segmenting) the document into text and non-text. The text is then segmented into paragraphs, sentences, words and letters. Each letter has to be recognised as a character in a standard encoding such as ASCII or Unicode. A letter has various components, such as the basic consonant, consonant modifiers and vowels.
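
The segmentation steps described above can be illustrated with a minimal sketch. It uses projection profiles, a common textbook technique, and is not claimed to be the method used by the IIT Madras team. The image representation (a list of rows with 1 for ink and 0 for background) and the names `runs`, `segment_lines` and `segment_glyphs` are assumptions made for illustration only.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumed format: 1 = ink pixel, 0 = background


def runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index ranges where the profile is non-zero."""
    spans = []
    start = None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i                      # a run of ink begins
        elif not value and start is not None:
            spans.append((start, i))       # the run ended at a blank gap
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def segment_lines(image: Image) -> List[Tuple[int, int]]:
    """Split a page into text lines using the amount of ink in each row."""
    row_profile = [sum(row) for row in image]
    return runs(row_profile)


def segment_glyphs(image: Image, top: int, bottom: int) -> List[Tuple[int, int]]:
    """Split one text line (rows top..bottom-1) into glyphs using column ink."""
    width = len(image[0])
    col_profile = [sum(image[r][c] for r in range(top, bottom)) for c in range(width)]
    return runs(col_profile)


# Toy page: two text lines separated by a blank row.
page = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]

lines = segment_lines(page)
glyphs = [segment_glyphs(page, top, bottom) for top, bottom in lines]
print(lines)
print(glyphs)
```

A real system would add a text/non-text separation stage before this and a classifier after it that maps each glyph to a Unicode code point; both are omitted here.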

Easy to read: The scripts of Indian languages pose a problem for such character recognition because the vowel and consonant-modifier components are attached to the main consonant part. This difficulty is removed in the Bharati script, which can be easily read. In Bharati characters, these different components are segmentable by design, so OCR works quite accurately. OCR engines give almost 100% accuracy even with mild noise added.

Three-tiered structure: The ease in design comes about because the Bharati characters are made up of three tiers stacked vertically. The consonant at the root of the letter is placed in the centre and the modifiers are in the top and bottom tiers.
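
The three-tier layout can be sketched in code as follows. This is a minimal illustration only: it assumes the three tiers have equal height, which the source does not state, and the names `split_tiers` and `has_ink` are hypothetical.

```python
from typing import Dict, List

Glyph = List[List[int]]  # assumed format: 1 = ink pixel, 0 = background


def split_tiers(glyph: Glyph) -> Dict[str, Glyph]:
    """Slice a glyph image into top, middle and bottom tiers.

    Assumption (not from the source): the tiers are of equal height.
    """
    height = len(glyph)
    if height == 0 or height % 3 != 0:
        raise ValueError("this sketch assumes a non-zero height divisible by 3")
    tier = height // 3
    return {
        "top": glyph[:tier],              # modifier tier
        "middle": glyph[tier:2 * tier],   # root consonant tier
        "bottom": glyph[2 * tier:],       # modifier tier
    }


def has_ink(region: Glyph) -> bool:
    """Report whether any pixel in the region is set."""
    return any(any(row) for row in region)


# Toy glyph: a modifier in the top tier, a consonant in the middle, nothing below.
glyph = [
    [0, 1, 0],
    [0, 1, 0],
    [1, 1, 1],
    [1, 0, 1],
    [0, 0, 0],
    [0, 0, 0],
]

tiers = split_tiers(glyph)
occupied = {name: has_ink(region) for name, region in tiers.items()}
print(occupied)
```

Because each tier occupies its own horizontal band, a recogniser can classify the consonant and each modifier separately instead of treating every consonant-modifier combination as a distinct shape.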

In collaboration with Sunil Kopparappu of Innovation Labs, TCS, Mumbai, the team has developed a universal finger-spelling language for the nine Indian languages. They are working on a system that can help people sign documents using a finger-spelling method, and future plans include developing a new Braille system with the Bharati script.