Document Language Classification Using Deep Learning Approaches

Files

201911019_Sarathi_Shah_MTech_Thesis_Dean Research.pdf (2.01 MB)

Date

2021

Authors

Shah, Sarathi Surendra

Abstract

Optical character recognition (OCR) is the task of recognizing characters or text in digital document images. OCR has been widely researched for many years because of its applications in various fields: it supports natural language processing of documents, conversion of document text to speech, semantic analysis of text, and searching within documents. Multilingual OCR works with documents that contain more than one language. Most OCR models, however, have been created and optimized for a particular language. When dealing with multiple languages or with translation of documents, one therefore needs to detect the language of the document first and then pass the document to the OCR model optimized for that language. Most prior work in this area focuses on identifying scripts; because a Convolutional Neural Network (CNN) can learn appropriate features, our work focuses instead on language detection using learned features. We propose two CNN-based classification models: one classifies Gujarati and English at word level, and the other classifies six Indian languages at page level. For page-level classification, we use a hierarchical method in which a binary classification is followed by a multiclass classification to improve detection accuracy. Most current approaches do not use a hierarchy and hence fail to identify the language correctly. The proposed hierarchical approach uses a CNN to detect six Indian languages (Tamil, Telugu, Kannada, Hindi, Marathi, and Gujarati) in printed documents, based on the text content of a page. Experiments are performed on scanned government documents, and the results indicate that the proposed approach performs better than other similar methods. An advantage of our approach is that it is based on features extracted from the entire page rather than from words or characters, and it can also be applied to handwritten documents.
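
The two-stage idea described in the abstract (a binary decision followed by a multiclass decision) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not say how the binary stage partitions the six languages, so the `GROUPS` split below is an assumption made only for illustration, and `classify_page` and the stub classifiers are hypothetical stand-ins for trained CNNs.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical grouping, chosen only for illustration. The abstract does not
# state how the binary stage divides the six languages.
GROUPS: Dict[str, Tuple[str, ...]] = {
    "group_a": ("Tamil", "Telugu", "Kannada"),
    "group_b": ("Hindi", "Marathi", "Gujarati"),
}

Classifier = Callable[[Any], str]


def classify_page(
    page: Any,
    binary_stage: Classifier,
    multiclass_stages: Dict[str, Classifier],
) -> str:
    """Route a page through a binary stage, then a group-specific stage."""
    group = binary_stage(page)  # stage 1: pick one of two groups
    if group not in multiclass_stages:
        raise KeyError(f"no multiclass stage registered for {group!r}")
    language = multiclass_stages[group](page)  # stage 2: pick a language
    if language not in GROUPS[group]:
        raise ValueError(f"{language!r} is not a member of {group!r}")
    return language


# Stub classifiers standing in for trained models (assumption: a real system
# would call a CNN on the page image instead of reading toy fields).
def stub_binary(page: Dict[str, Any]) -> str:
    return "group_a" if page["toy_score"] < 0.5 else "group_b"


def stub_group_a(page: Dict[str, Any]) -> str:
    return GROUPS["group_a"][page["toy_index"]]


def stub_group_b(page: Dict[str, Any]) -> str:
    return GROUPS["group_b"][page["toy_index"]]


STAGES = {"group_a": stub_group_a, "group_b": stub_group_b}

result = classify_page({"toy_score": 0.8, "toy_index": 2}, stub_binary, STAGES)
print(result)
```

Keeping the stages as plain callables means each one can be trained and replaced independently, and the second stage only has to separate the languages inside its own group.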

Keywords

Optical Character Recognition, Document Language Classification, Convolutional Neural Network, Indian Languages

Citation

Shah, Sarathi Surendra (2021). Document Language Classification Using Deep Learning Approaches. Dhirubhai Ambani Institute of Information and Communication Technology. viii, 38 p. (Acc.No: T00946)

URI

http://ir.daiict.ac.in/handle/123456789/1011

Collections

M Tech Dissertations
