Theses and Dissertations

Permanent URI for this collection: http://ir.daiict.ac.in/handle/123456789/1

Search Results

  • Item (Open Access)
    Entity Based Query Processing For Retrieval And Summarization In Biomedical Domain
    (Dhirubhai Ambani Institute of Information and Communication Technology, 2021) Sankhavara, Jainisha; Majumder, Prasenjit
    The exponential growth of biomedical literature poses many challenges for search. Addressing users' complex information needs requires rigorous semantic processing of biomedical text, and biomedical information access has emerged as a discipline for this reason. Traditional information access methods such as matching, ranking, entity processing, and entity-entity relationship processing are challenged in this domain. These are the major building blocks used to frame queries that represent complex information needs in biomedical and clinical information access. This thesis performs query processing using different IR and bioNLP techniques and studies their effects on retrieval and summarization. Various biomedical query reformulation techniques are carried out and compared for biomedical document retrieval. Query expansion, one such reformulation technique, is carried out using relevance feedback and pseudo relevance feedback. The relevance feedback approach uses information about documents actually known to be relevant to the query, whereas the pseudo relevance feedback approach has no such information and instead uses the top retrieved documents, which are assumed to be relevant. A combined approach of relevance feedback and pseudo relevance feedback is proposed; it is based on feedback document discovery and uses various classification and clustering techniques on biomedical documents to identify good documents for feedback. This approach uses relevance feedback for a number of documents and tries to learn relevance for the remaining documents. Feedback document discovery based query expansion shows improvement over relevance feedback based query expansion for biomedical document retrieval.
    An improved version of this feedback document discovery based query expansion approach is also proposed, in which entity features are weighted according to the types of the entities and the query; it improves document retrieval over the previous version without feature weighting. Automatic query expansion techniques based on feedback rely on two feedback sources: feedback document selection and feedback term selection. In the biomedical domain, medical entities are more meaningful than surface words, so entity based processing is necessary for any application in this domain. This thesis also includes a survey of advances in biomedical entity identification, covering the identification process, community-identified challenges in the area, available resources, approaches to biomedical entity identification, and a comparison of the techniques proposed in the literature. UMLS is a biomedical resource that brings together many health and biomedical vocabularies and standards. It contains categorized biomedical entities and their relations, with semantic information. A novel query expansion technique is proposed that uses knowledge from UMLS for feedback term selection, expanding queries with biomedical entities. The proposed method takes the UMLS entities in a query together with their related entities identified by UMLS and constructs a query-specific graph of biomedical entities for term selection. This query reformulation approach shows improvement over pseudo relevance feedback and state-of-the-art UMLS based query reformulation approaches. The amount of information available to clinicians and clinical researchers is growing exponentially. The documents are long, and the number of documents on a given topic is large. To synthesize these documents, text summarization attempts to reduce text so that users can quickly understand the relevant source information.
    In the biomedical domain, various summarization techniques have been developed in recent years. Text summarization may help medical practitioners with their information and knowledge management tasks. This work focuses on query focused biomedical text summarization, where the summary should be related to the query. Entity-based processing is incorporated into the summarization process along with word-embedding based similarity. The aim is to use query reformulation in summarization and examine how it affects the summaries, in particular whether expanded queries yield better summaries.
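The pseudo relevance feedback idea described in this abstract can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the whitespace tokenizer, the raw term-frequency scoring, and the names `tokenize` and `expand_query` are all assumptions made for the example.

```python
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenizer; a stand-in for real biomedical tokenization."""
    return text.lower().split()

def expand_query(query, ranked_docs, top_k=2, n_terms=3):
    """Pseudo relevance feedback: assume the top_k ranked documents are relevant
    and append their n_terms most frequent terms that are not already in the query."""
    query_terms = tokenize(query)
    counts = Counter()
    for doc in ranked_docs[:top_k]:
        counts.update(t for t in tokenize(doc) if t not in query_terms)
    # Sort by descending frequency, then alphabetically, for deterministic output.
    candidates = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return query_terms + [term for term, _ in candidates[:n_terms]]

# Toy ranked list (hypothetical data); only the first two documents feed the expansion.
ranked = [
    "aspirin reduces stroke risk",
    "aspirin therapy stroke prevention",
    "knee surgery recovery",
]
expanded = expand_query("stroke", ranked, top_k=2, n_terms=2)
print(expanded)
```

Passing documents that are known to be relevant, instead of the top of the ranking, turns the same function into a simple form of true relevance feedback.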
  • Item (Open Access)
    Study on cross lingual information retrieval in Indian languages
    (Dhirubhai Ambani Institute of Information and Communication Technology, 2016) Chander, Ankush; Majumder, Prasenjit
    In this article we evaluate and report the effectiveness of various available resources, including machine translation systems (MTS) and machine readable dictionaries, in the retrieval of Indian language documents. Machine translation systems lend themselves to the task of crossing the language barrier in Cross Lingual Information Retrieval (CLIR) because of their easy availability. We selected three online translation systems, Google, Bing, and the Technology Development for Indian Languages (TDIL) translation service, and assessed their performance in CLIR. We also performed a query-wise analysis to determine the issues faced when relying on machine translation systems in multilingual retrieval. Our experiments show that translation quality not only depends on the source language but also differs considerably from query to query within a source-target language pair. We also explored the translation difficulties encountered when using MTS. We then evaluated the machine readable dictionaries using a naive dictionary based approach, which is regarded as the simplest implementation of CLIR. Finally, we explored the possibility of enhancing the naive dictionary results using the word embedding method word2vec, followed by an error analysis.
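A naive dictionary based query translation of the kind mentioned above might look like the sketch below. The dictionary entries are hypothetical transliterated examples, and the policy of keeping every translation and passing unknown terms through unchanged is an assumption, not a detail taken from the thesis.

```python
def translate_query(query, dictionary):
    """Naive dictionary-based query translation: replace each source term with
    all of its dictionary translations; keep out-of-vocabulary terms unchanged."""
    translated = []
    for term in query.lower().split():
        translated.extend(dictionary.get(term, [term]))
    return translated

# Hypothetical English-to-Hindi (transliterated) entries, for illustration only.
toy_dictionary = {
    "water": ["pani", "jal"],
    "river": ["nadi"],
}
result = translate_query("River water pollution", toy_dictionary)
print(result)
```

Keeping all translations of an ambiguous term is the simplest choice; it tends to add noise, which is one reason a follow-up step such as embedding-based filtering is attractive.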
  • Item (Open Access)
    Integrating semantics into biomedical information retrieval
    (Dhirubhai Ambani Institute of Information and Communication Technology, 2015) Thakrar, Fenny; Majumder, Prasenjit
    Integrating semantics into biomedical information retrieval is concerned with studying the meaning of concepts and the relationships between them. We use a semantic document representation approach to bring domain-specific knowledge into the information retrieval system. Single and multi word concepts are extracted from each document using an external semantic resource, the UMLS Metathesaurus. Word sense disambiguation is performed on the extracted concepts to resolve their different senses, and the document is then represented in the form of UMLS concepts. Documents and queries are represented in this semantic space and fed to an information retrieval system, which ranks the documents according to the given query. We performed experiments on the TREC 2014 CDS Task data and its 30 queries, testing two retrieval techniques: single word and multi word retrieval. The results of conceptual information retrieval are compared with those of traditional term based retrieval. The conceptual IR approach outperformed the term based IR system on the evaluation metrics MAP, P10 and RPrec. Within conceptual IR, single word retrieval outperformed multi word retrieval, and query expansion outperformed the system without query expansion.
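For readers unfamiliar with the metrics named above, the following sketch computes P10, RPrec and average precision (whose mean over queries is MAP) using their standard definitions with binary relevance. The function names and toy data are illustrative assumptions; the thesis itself would have used an official evaluation tool.

```python
def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top k retrieved documents that are relevant (P10 when k=10)."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def r_precision(ranking, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    return precision_at_k(ranking, relevant, r) if r else 0.0

def average_precision(ranking, relevant):
    """Sum of precision values at the rank of each retrieved relevant document,
    divided by the total number of relevant documents."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP: mean of average precision over (ranking, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy example (hypothetical document ids): relevant documents at ranks 1 and 3.
ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
print(average_precision(ranking, relevant), r_precision(ranking, relevant))
```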
  • Item (Open Access)
    Text retrieval from the degraded document images
    (Dhirubhai Ambani Institute of Information and Communication Technology, 2015) Vasani, Hiral; Mitra, Suman K.
    Image binarization is used to obtain a black and white text document from a colored one. It can be treated as an image segmentation task that separates the text from the background. The resulting black and white document can be used in many applications, such as Optical Character Recognition (OCR). Text documents suffer from various types of degradation that make image binarization challenging. This thesis presents a technique that segments text from the background. In this method, the document image is first darkened to enhance the text (foreground). The document image is also processed separately to suppress the background. The two resulting images are combined so that the suppressed background is retained from the second image and the enhanced text is taken from the first. This pre-processed image is then binarized using an existing thresholding technique. The binarized image is post-processed to remove unwanted small components and other noise. The output image is compared with the ground truth using several evaluation parameters, and the results of the algorithm are compared with those of existing binarization techniques.
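The abstract does not say which existing thresholding technique is used. As one plausible stand-in, the sketch below applies Otsu's global threshold to a tiny grayscale image; the enhancement, background suppression, and post-processing steps of the thesis are not reproduced, and the function names and pixel values are assumptions for illustration.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0-255) that maximizes between-class variance,
    where class 0 holds pixels <= t and class 1 holds pixels > t."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no pixels in class 0 yet
        w1 = total - w0
        if w1 == 0:
            break  # all pixels are in class 0; no further split is possible
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t

def binarize(image, threshold):
    """Map pixels at or below the threshold to 0 (text) and the rest to 255."""
    return [[0 if p <= threshold else 255 for p in row] for row in image]

# Toy 2x4 grayscale image (hypothetical values): dark text on a light background.
image = [
    [200, 210, 30, 205],
    [40, 220, 35, 215],
]
t = otsu_threshold([p for row in image for p in row])
binary = binarize(image, t)
print(t, binary)
```

A global threshold like this works on clean scans; degraded documents are exactly where it struggles, which motivates the pre-processing described in the abstract.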
  • Item (Open Access)
    Summarizing medical texts for effective retrieval
    (Dhirubhai Ambani Institute of Information and Communication Technology, 2015) Iyer, Ganesh R; Majumder, Prasenjit
    User centered health information retrieval is a challenging and important problem in information retrieval. In this work, we apply medical resources to bridge the vocabulary mismatch between lay users and medical documents. We also apply text summarization techniques to reduce each document to its relevant information while pruning irrelevant information. We provide a survey of medical resources and of the application of text summarization in information retrieval. The primary research goals were to investigate the use of medical resources in query expansion and of text summarization in indexing. The experiments were performed as part of a CLEF eHealth Task, an overview of which is provided. From our experiments we observed that a summarized index can replace a full collection index. Moreover, a compression rate of 40-80% outperformed the baseline, indicating that retrieval on the summarized collection can indeed improve performance. Using MeSH (Medical Subject Headings) as a thesaurus to supplement the query terms improved retrieval for certain queries. Using query expansion with discharge summaries, we obtained a MAP score of 0.415, the best among all teams.
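To illustrate indexing a summarized collection, here is a minimal extractive summarizer that keeps a fixed fraction of sentences. The frequency-based scoring is an assumption, not the thesis's method, and how the `keep_ratio` parameter maps onto the "compression rate" reported above is not specified in the abstract.

```python
import math
import re
from collections import Counter

def summarize(document, keep_ratio=0.5):
    """Keep the highest-scoring sentences, in their original order.
    A sentence's score is the mean document-level frequency of its words."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    if not sentences:
        return ""
    freq = Counter(w for s in sentences for w in re.findall(r"\w+", s.lower()))

    def score(sentence):
        words = re.findall(r"\w+", sentence.lower())
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    n_keep = max(1, math.ceil(keep_ratio * len(sentences)))
    # Rank by descending score; ties keep the earlier sentence first.
    ranked = sorted(range(len(sentences)), key=lambda i: (-score(sentences[i]), i))
    kept = sorted(ranked[:n_keep])
    return " ".join(sentences[i] for i in kept)

# Toy document (hypothetical text): the off-topic last sentence is pruned.
doc = "Aspirin treats pain. Aspirin treats fever. The weather was nice."
summary = summarize(doc, keep_ratio=0.5)
print(summary)
```

The summaries, rather than the full documents, would then be passed to the indexer.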
  • Item (Open Access)
    Query expansion in biomedical information retrieval
    (Dhirubhai Ambani Institute of Information and Communication Technology, 2015) Sankhavara, Jainisha; Majumder, Prasenjit
    Retrieving relevant information from biomedical documents is a new and challenging task. Health related articles from the biomedical and life sciences literature are a good source of knowledge when searching for information relevant to a patient's medical case report. Medical case reports describe a patient's medical condition, i.e. medical history, current symptoms, tests performed, ongoing treatments, etc. Articles related to a medical case report can help clinicians give their patients the best care. For example, a successful treatment described in an article for patients of a particular age group with a particular medical history and symptoms may be advisable for patients with a similar medical case report. This thesis focuses on applying query expansion techniques, and fusing them, in the biomedical domain, specifically for retrieving articles from the biomedical literature that are relevant to a particular case report. Along with the traditional query expansion techniques, query expansion using external medical knowledge is carried out and compared with the state-of-the-art query expansion technique, Incremental Blind Feedback. The external knowledge source is the UMLS Metathesaurus, a network of medical concepts. The Text REtrieval Conference (TREC) provided the data for this research as part of its Clinical Decision Support (CDS) track in 2014, for which results of traditional query expansion techniques and of fusion with manual feedback are reported. The fusion run gives consistent results across the evaluation metrics considered. The results of the Incremental Blind Feedback technique are comparable to the best of TREC CDS-2014. By query type, queries of type 'diagnosis' and 'treatment' performed better than those of type 'test'.
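The abstract does not name the fusion method. One common choice is CombSUM over min-max normalized scores, sketched below under that assumption; the run contents and the names `min_max_normalize` and `comb_sum` are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale one run's scores to [0, 1] so that runs are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def comb_sum(runs):
    """CombSUM fusion: sum each document's normalized scores across runs
    and rank by the fused score (ties broken by document id)."""
    fused = {}
    for run in runs:
        for doc, s in min_max_normalize(run).items():
            fused[doc] = fused.get(doc, 0.0) + s
    return sorted(fused, key=lambda d: (-fused[d], d))

# Two toy runs (hypothetical scores), e.g. from two query expansion variants.
run_a = {"d1": 3.0, "d2": 2.0, "d3": 1.0}
run_b = {"d2": 0.9, "d3": 0.5, "d4": 0.1}
fused_ranking = comb_sum([run_a, run_b])
print(fused_ranking)
```

Documents retrieved by several runs accumulate score, which is one reason a fused run tends to behave more consistently across metrics than any single run.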