M Tech Dissertations

Permanent URI for this collection: http://ir.daiict.ac.in/handle/123456789/3

Search Results

Now showing 1 - 6 of 6
  • Item (Open Access)
    Study on cross lingual information retrieval in Indian languages
    (Dhirubhai Ambani Institute of Information and Communication Technology, 2016) Chander, Ankush; Majumder, Prasenjit
    In this article we evaluate and report the effectiveness of various available resources, including machine translation systems (MTS) and machine-readable dictionaries, in the retrieval of Indian language documents. Machine translation systems lend themselves to the task of crossing the language barrier in Cross Lingual Information Retrieval (CLIR) because of their easy availability. In our work we picked three online translation systems, Google, Bing, and the Technology Development for Indian Languages (TDIL) translation service, and assessed their performance in the task of CLIR. We also did a query-wise analysis to determine the issues faced when relying on machine translation systems in multilingual retrieval. Our experiments show that translation quality not only depends on the source language but also differs considerably from query to query within a source-target language pair. We also explored the different translation difficulties faced in using MTS. We then evaluated the machine-readable dictionaries using a naive dictionary-based approach, which is seen as the simplest implementation of CLIR. Finally, we explored the possibility of enhancing the naive dictionary results using the word embedding method word2vec, followed by error analysis.
  • Item (Open Access)
    Integrating semantics into biomedical information retrieval
    (Dhirubhai Ambani Institute of Information and Communication Technology, 2015) Thakrar, Fenny; Majumder, Prasenjit
    Integrating semantics into Biomedical Information Retrieval is concerned with studying the meaning of concepts and focusing on their relationships. We use a semantic document representation approach to apply domain-specific knowledge to the information retrieval system. Single-word and multi-word concepts are extracted from each document using an external semantic structure, the UMLS Metathesaurus. Word sense disambiguation is performed on the extracted concepts to distinguish between different concept senses, and the document is then represented in the form of UMLS concepts. The documents and queries are represented in semantic space and fed to an information retrieval system, which ranks the documents according to the given query. We performed experiments on the TREC 2014 CDS Task data and its 30 queries. We experimented with two types of retrieval technique: single-word and multi-word retrieval. The results obtained using conceptual information retrieval are compared with the results obtained using traditional term-based retrieval. The conceptual IR approach performed better than the term-based IR system on the evaluation metrics MAP, P10 and RPrec. Within conceptual IR, single-word retrieval performed better than multi-word retrieval, and the system with query expansion performed better than the system without it.
  • Item (Open Access)
    Text retrieval from the degraded document images
    (Dhirubhai Ambani Institute of Information and Communication Technology, 2015) Vasani, Hiral; Mitra, Suman K.
    Image binarization is used to obtain a black and white text document from a colored one. It can be treated as an image segmentation task that separates the text from the background. Such a black and white document can be used in many applications, such as Optical Character Recognition (OCR). Text documents suffer from various types of degradation that make image binarization a challenging task. This thesis presents the work done to design a technique that segments text from the background. In this method, the document image is first darkened in order to enhance the text (foreground) in it. The text image is then processed separately so as to suppress the background. The two images so obtained are combined in such a way that the suppressed background is retained from the second image and the enhanced text is taken from the first image. This pre-processed image is then binarized using an existing thresholding technique. The binarized image is subjected to some post-processing in order to remove unwanted smaller components and other noise. The output image so obtained is compared to the ground truth using several evaluation parameters. The results of the algorithm are compared with those of existing binarization techniques.
  • Item (Open Access)
    Summarizing medical texts for effective retrieval
    (Dhirubhai Ambani Institute of Information and Communication Technology, 2015) Iyer, Ganesh R; Majumder, Prasenjit
    User-centered health information retrieval is a challenging and important problem in information retrieval. In this work, we apply medical resources to bridge the vocabulary mismatch between lay users and medical documents. We also apply text summarization techniques to reduce each document to its relevant information while pruning irrelevant information. We provide a survey of medical resources and of the application of text summarization in information retrieval. The primary research goals were to investigate the use of medical resources in query expansion and of text summarization in indexing. The experiments were performed as part of a CLEF eHealth Task, an overview of which is provided. From our experiments we observed that a summarized index can be used to replace a full collection index. Also, a compression rate of 40-80% outperformed the baseline, indicating that retrieval on the summarized collection can indeed improve performance. Using MeSH (Medical Subject Headings) as a thesaurus to supplement the query terms improved retrieval for certain queries. We obtained the best MAP score among all teams, 0.415, using query expansion with discharge summaries.
  • Item (Open Access)
    Query expansion in biomedical information retrieval
    (Dhirubhai Ambani Institute of Information and Communication Technology, 2015) Sankhavara, Jainisha; Majumder, Prasenjit
    Retrieving relevant information from biomedical documents is a new and challenging task. Health-related articles from the biomedical and life sciences literature are a good source of knowledge when searching for information relevant to a patient's medical case report. Medical case reports describe a patient's medical condition, i.e. medical history, current symptoms, tests performed, ongoing treatments, etc. Articles related to a medical case report can help clinicians provide the best care for their patients. For example, a successful treatment described in an article for patients of a particular age group with a particular medical history and symptoms may guide the treatment of patients with a similar case report. This thesis focuses on applying query expansion techniques and fusing them for the biomedical domain, specifically for retrieving biomedical articles from the literature that are relevant to a particular case report. Along with the traditional query expansion techniques, query expansion using external medical knowledge is also carried out and compared with the state-of-the-art query expansion technique, Incremental Blind Feedback. The external knowledge source is the UMLS Metathesaurus, a network of medical concepts. The Text REtrieval Conference (TREC) provided the data for this research as part of its Clinical Decision Support (CDS) track in 2014, for which results of traditional query expansion techniques and of fusion with manual feedback are reported. The fusion run gives consistent results across the evaluation metrics considered. The results of the Incremental Blind Feedback technique are comparable to the best of TREC CDS 2014. With respect to query type, queries of type 'diagnosis' and 'treatment' performed better than those of type 'test'.
  • Item
    Understanding user intent in community question answering
    (Dhirubhai Ambani Institute of Information and Communication Technology, 2014) Shah, Harsh Kaushikbhai; Majumder, Prasenjit
    Community Question Answering (CQA) services such as Yahoo! Answers and Quora were created mainly to overcome the limitations of Web search engines by helping users get information from a community. A CQA system stores a large number of questions, each with a number of possible answers, and many questions are repeated. If the system understands the user intent behind a question, it can recognize similar questions, find relevant answers, and hence recommend potential answers more effectively. The approach of this thesis is to classify CQA questions, according to user intent, into three categories: objective, subjective, and social. To understand the user intent of a question, we first extract text features and metadata features and then use machine learning algorithms to build a predictive model that classifies questions into these three categories. This is a supervised learning model. We have a very limited number of labeled questions and a large number of unlabeled questions. To improve question classification we therefore also use co-training, a semi-supervised learning algorithm, which uses a small set of labeled questions plus a large number of unlabeled questions. Our results show that the co-training approach, which treats text features and metadata features as two views, works better than the supervised learning approach that simply applies the two types of features together. This is because co-training, as a semi-supervised learning method, can make use of a large number of unlabeled questions in addition to the small set of labeled questions.
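
The two-view co-training idea described in the last abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the tiny Naive Bayes classifier, the function names (`co_train`, `predict`), the one-example-per-view selection rule, and the toy questions and metadata tokens are all assumptions made for illustration. Each view (text tokens, metadata tokens) trains its own classifier, and in every round each classifier moves the unlabeled question it is most confident about into the labeled set.

```python
import math
from collections import Counter


class NaiveBayes:
    """Tiny multinomial Naive Bayes over token lists, with add-one smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)  # class counts, used as unnormalized priors
        self.counts = {c: Counter() for c in self.classes}
        for tokens, label in zip(docs, labels):
            self.counts[label].update(tokens)
        self.vocab = set().union(*self.counts.values())
        return self

    def predict_proba(self, tokens):
        logp = {}
        for c in self.classes:
            total = sum(self.counts[c].values()) + len(self.vocab) + 1
            lp = math.log(self.prior[c])
            for t in tokens:
                lp += math.log((self.counts[c][t] + 1) / total)
            logp[c] = lp
        top = max(logp.values())
        unnorm = {c: math.exp(v - top) for c, v in logp.items()}
        z = sum(unnorm.values())
        return {c: v / z for c, v in unnorm.items()}


def co_train(labeled, unlabeled, rounds=5):
    """labeled: [((text_tokens, meta_tokens), label)]; unlabeled: [(text_tokens, meta_tokens)]."""
    labeled, pool = list(labeled), list(unlabeled)

    def fit_views():
        ys = [y for _, y in labeled]
        return [NaiveBayes().fit([x[v] for x, _ in labeled], ys) for v in (0, 1)]

    for _ in range(rounds):
        if not pool:
            break
        picks = {}  # pool index -> label assigned by one of the views
        for view, model in enumerate(fit_views()):
            best = None
            for i, x in enumerate(pool):
                if i in picks:
                    continue
                probs = model.predict_proba(x[view])
                label = max(probs, key=probs.get)
                if best is None or probs[label] > best[2]:
                    best = (i, label, probs[label])
            if best is not None:
                picks[best[0]] = best[1]
        # Move the picked examples into the labeled set (highest index first).
        for i in sorted(picks, reverse=True):
            labeled.append((pool.pop(i), picks[i]))
    return fit_views(), labeled


def predict(models, example):
    """Combine the two views by multiplying their class probabilities."""
    combined = {}
    for view, model in enumerate(models):
        for label, p in model.predict_proba(example[view]).items():
            combined[label] = combined.get(label, 1.0) * p
    return max(combined, key=combined.get)


# Hypothetical toy data: (text view, metadata view) per question.
labeled = [
    ((["what", "capital", "france"], ["few_answers"]), "objective"),
    ((["which", "phone", "best"], ["many_answers"]), "subjective"),
    ((["anyone", "online", "tonight"], ["chatty"]), "social"),
]
unlabeled = [
    (["what", "capital", "spain"], ["few_answers"]),
    (["which", "laptop", "best"], ["many_answers"]),
    (["anyone", "awake", "tonight"], ["chatty"]),
]

models, grown = co_train(labeled, unlabeled)
print(predict(models, (["what", "capital", "italy"], ["few_answers"])))
```

In practice the confidence threshold, the number of examples added per round, and the base classifier would all need tuning; the sketch only shows how the two feature views take turns enlarging the labeled set.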