Theses and Dissertations
Permanent URI for this collection: http://ir.daiict.ac.in/handle/123456789/1
9 results
Search Results
Item Open Access
On designing DNA codes and their applications (Dhirubhai Ambani Institute of Information and Communication Technology, 2019) Limbachiya, Dixita; Gupta, Manish K.

Bio-computing uses complexes of biomolecules such as DNA (deoxyribonucleic acid), RNA (ribonucleic acid) and proteins to perform computational processes for encoding and processing data. In 1994, L. Adleman introduced the field of DNA computing by solving an instance of the Hamiltonian path problem using a set of DNA sequences and biotechnology lab methods. The experiment relied on DNA hybridization, which is the backbone of any computation using DNA sequences; however, hybridization is also a source of errors. To use DNA for computing, a specific set of DNA sequences (a DNA code) satisfying particular properties (DNA code constraints) that avoid cross-hybridization is designed for each task. The contributions of this dissertation fall into two parts: 1) designing DNA codes using algebraic coding theory, and 2) codes for DNA data storage systems that encode data in DNA. The main research objective in designing DNA codes over the quaternary alphabet {A, C, G, T} is to find the largest possible set of M codewords, each of length n, such that any two codewords are at distance at least d and satisfy the desired constraints that are feasible for practical implementation. In the literature, various computational and theoretical approaches have been used to design sets of DNA codewords that are sufficiently dissimilar; in particular, DNA codes have been constructed using coding-theoretic approaches over fields and rings. In this dissertation, one such approach is used to generate DNA codes from the ring R = Z4 + wZ4, where w^2 = 2 + 2w. Some algebraic properties of the ring R are explored.
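The reverse-complement check that underlies cross-hybridization avoidance can be sketched in a few lines of Python. This is an illustrative formulation, not the thesis's ring-theoretic construction; the helper names and the constraint form (every codeword's reverse complement at Hamming distance at least d from every codeword) are one common convention.

```python
# Illustrative sketch: the Watson-Crick reverse-complement constraint
# used to discourage cross-hybridization between DNA codewords.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(codeword):
    """Watson-Crick reverse complement of a DNA codeword."""
    return "".join(COMPLEMENT[base] for base in reversed(codeword))

def hamming(x, y):
    """Hamming distance between two equal-length codewords."""
    return sum(a != b for a, b in zip(x, y))

def satisfies_rc_constraint(code, d):
    """One common formulation: every codeword's reverse complement must be
    at Hamming distance >= d from every codeword in the code."""
    return all(hamming(reverse_complement(x), y) >= d
               for x in code for y in code)

print(reverse_complement("ACGT"))  # "ACGT" is self-reverse-complementary
```

Note that a self-reverse-complementary codeword such as ACGT violates the constraint for any d > 0, which is why such words are excluded from DNA codes.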
To define an isometry from the elements of the ring R to DNA, a new distance called the Gau distance is introduced. The Gau distance motivates a distance-preserving map called the Gau map f, whose linearity and closure properties are obtained. General conditions on the generator matrix over the ring R for the resulting DNA code to satisfy the reverse and reverse-complement constraints are derived. Using this map, several new classes of DNA codes satisfying the Hamming distance, reverse and reverse-complement constraints are given. Families of DNA codes via Simplex-type codes, first-order and rth-order Reed-Muller-type codes, and Octa-type codes are developed. Some of the constructed DNA codes are optimal with respect to bounds on M, the size of the code. These DNA codes can be used for a myriad of applications, one of which is data storage. DNA is stable, robust and reliable; theoretically, it is estimated that one gram of DNA can store 455 EB (1 exabyte = 10^18 bytes). These properties make DNA a potential candidate for data storage. However, a DNA data storage system faces various practical constraints, and in this work we construct DNA codes with some of these constraints to store data in DNA efficiently. One practical constraint is avoiding runs of repeated bases (runlengths) of the same DNA nucleotide: each DNA codeword should avoid long runlengths. In this thesis, codes are proposed for data storage that disallow runlengths of any base, to develop error-free DNA data storage codes. A fixed GC-weight u (the number of G and C nucleotides in a DNA codeword) is another requirement for DNA codewords used in DNA storage.
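The no-runlength and fixed GC-weight constraints can be illustrated with a brute-force count. The thesis derives such counts analytically; this sketch simply enumerates all quaternary words, so it is only feasible for tiny n.

```python
from itertools import product

def count_codewords(n, u):
    """Brute-force count of length-n DNA strings with GC-weight exactly u
    and no two equal adjacent bases (the no-runlength constraint)."""
    count = 0
    for word in product("ACGT", repeat=n):
        if any(a == b for a, b in zip(word, word[1:])):
            continue  # violates no-runlength
        if sum(base in "GC" for base in word) == u:
            count += 1
    return count

print(count_codewords(2, 1))  # -> 8
```

For n = 2, u = 1 the count is 8: one position carries a G or C (2 choices of position, 2 of base), the other an A or T (2 choices), and adjacency is never violated since the two bases differ.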
DNA codewords with large GC-weight lead to insertion and deletion (indel) errors in the DNA reading and amplification processes; thus, it is crucial to fix the GC-weight of a DNA code. In this work, we propose methods that generate families of codes for DNA data storage systems satisfying the no-runlength and fixed GC-weight constraints. The first method is constrained coding over the quaternary alphabet; the second is a DNA Golay subcode that uses ternary encoding. Constrained quaternary coding is presented to generate DNA codes for data storage, along with a construction algorithm for finding families of DNA codes with the no-runlength and fixed GC-weight constraints. The number of DNA codewords of fixed GC-weight with the no-runlength constraint is enumerated exactly; prior work only gave bounds on the number of such codewords. We further observe that the bound in previous work does not take into account the distance of the code, which is essential for data reliability, so we derive a lower bound on the number of codewords that accounts for distance along with the fixed GC-weight and no-runlength constraints. In the second method, we demonstrate the Golay subcode method to encode data in a variable-chunk architecture of DNA using ternary encoding. N. Goldman et al. introduced the first proof of concept of DNA data storage in 2013 by encoding data in DNA without error correction, which motivated us to implement this method. While implementing it, we identified a bottleneck: the fixed-length chunk architecture limits the amount of data that can be encoded. We therefore propose a modified scheme using a non-linear family of ternary codes based on the Golay subcode that supports a flexible-length chunk architecture for data encoding in DNA.
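How ternary encoding enforces the no-runlength constraint can be sketched with a rotating code in the spirit of Goldman et al.'s scheme: each trit selects one of the three bases that differ from the previously written base, so a run of identical bases can never occur. This is a simplified illustration, not the thesis's Golay-subcode construction; the starting base is an arbitrary assumption.

```python
# Rotating ternary-to-DNA encoding sketch: each trit picks one of the
# three bases different from the previous base, so no runlength > 1 occurs.
BASES = "ACGT"

def trits_to_dna(trits, prev="A"):
    """Map a trit sequence (values 0-2) to DNA with no repeated bases."""
    out = []
    for t in trits:
        choices = [b for b in BASES if b != prev]  # the 3 bases != previous
        prev = choices[t]
        out.append(prev)
    return "".join(out)

seq = trits_to_dna([0, 1, 2, 0, 2])
assert all(a != b for a, b in zip(seq, seq[1:]))  # no-runlength holds
print(seq)  # -> "CGTAT"
```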
By using the ternary Golay subcode, two substitution errors can be corrected. In a nutshell, the significant contribution of this thesis is the design of DNA codes with specific constraints. First, DNA codes from a ring are proposed using algebraic coding, by defining a new distance (the Gau distance) and map (the Gau map); these DNA codes satisfy the reverse, reverse-complement and complement constraints with minimum Hamming distance, and several families of these codes and their properties are studied. Second, DNA codes are developed using constrained coding and the Golay subcode method that satisfy the no-runlength and GC-weight constraints for a DNA data storage system.

Item Open Access
Wind energy forecasting using recurrent neural networks (Dhirubhai Ambani Institute of Information and Communication Technology, 2018) Rani, Neha; Joshi, Manjunath V.

Wind energy has the potential to meet all our electricity demands and is a cost-effective source of energy. Power operators dealing with electricity generation from wind energy often face problems due to fluctuations in wind energy and the uncertainty associated with it. Therefore, system operators need enough resources to ensure the proper functioning of the system under fluctuating wind energy generation. Proper wind energy forecasting is a crucial part of the smart energy grid. Machine learning has become very popular due to its fast training and good forecasting performance, and this capability can be applied to predicting wind energy. In this thesis, an efficient recurrent-neural-network-based wind energy forecaster is proposed. An Elman network is developed for short-term forecasting; specifically, we consider 24-hour-ahead forecasts. An Elman neural network is a kind of recurrent neural network that captures dynamic dependencies in the data through the feedback path present in the network.
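A minimal forward pass of an Elman-style network can be sketched as follows. This is illustrative only: the layer sizes and random weights are hypothetical, not those trained in the thesis; the point is that the hidden state from the previous time step feeds back in as context.

```python
import numpy as np

# Minimal Elman-network forward pass (illustrative sizes, untrained weights).
rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 8, 1          # e.g. a (4 x 1) input feature vector

W_xh = rng.normal(size=(n_hidden, n_in))       # input -> hidden
W_hh = rng.normal(size=(n_hidden, n_hidden))   # context (recurrent) weights
W_hy = rng.normal(size=(n_out, n_hidden))      # hidden -> forecast

def forward(sequence):
    """Run the network over a sequence of feature vectors; the context
    layer carries the previous hidden state into each new step."""
    h = np.zeros(n_hidden)                     # context starts empty
    for x in sequence:
        h = np.tanh(W_xh @ x + W_hh @ h)       # hidden state uses context
    return W_hy @ h                            # forecast from final state

y = forward([rng.normal(size=n_in) for _ in range(24)])  # 24 hourly inputs
print(y.shape)  # (1,)
```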
It has three main layers, i.e., an input layer, a hidden layer, and a context layer that captures the dynamic behaviour of the time series. Supervised learning is used for forecasting, with meteorological data as training features. Three Elman networks with different input feature vectors are developed: two weather-sensitive networks with input vectors of size (4 x 1) and (7 x 1), respectively, and a non-weather-sensitive network with an input vector of size (4 x 1); these networks are compared with other conventional methods. Experimental results show a significant reduction in the evaluation criteria, i.e., Root Mean Square Error (RMSE) and Mean Absolute Percentage Error (MAPE), for the proposed method when compared to other approaches.

Item Open Access
Generative Adversarial Networks for Speech Technology Applications (Dhirubhai Ambani Institute of Information and Communication Technology, 2018) Shah, Neil; Patil, Hemant A.

The deep learning renaissance has enabled machines to understand observed data in terms of a hierarchy of representations, allowing them to learn complicated nonlinear relationships between representative pairs. In the context of speech, deep learning architectures such as Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs) are traditional supervised learning algorithms employing Maximum Likelihood (ML)-based optimization. These techniques reduce a numerical error between the generated output and the ground truth. However, a performance gap remains in various speech applications because numerical estimates may not correlate with the human perception mechanism. Generative Adversarial Networks (GANs), on the other hand, reduce a distributional divergence rather than minimizing numerical errors and hence may synthesize samples with improved perceptual quality.
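The contrast between an MMSE-style numerical objective and an adversarial one can be sketched as follows. This is a toy illustration with made-up numbers, not the thesis's MMSE-GAN formulation: MMSE penalizes element-wise error against the ground truth, while the (non-saturating) adversarial generator loss penalizes samples the discriminator scores as fake, regardless of numerical closeness.

```python
import numpy as np

def mmse_loss(generated, target):
    """ML/MMSE-style objective: mean squared numerical error."""
    return np.mean((generated - target) ** 2)

def adversarial_generator_loss(disc_scores_on_fake):
    """Non-saturating GAN generator loss: -log D(G(z))."""
    return -np.mean(np.log(disc_scores_on_fake))

fake, real = np.array([0.9, 1.1]), np.array([1.0, 1.0])
print(mmse_loss(fake, real))                        # ~0.01
print(adversarial_generator_loss(np.array([0.5])))  # ~0.693, i.e. log 2
```

An MMSE-regularized GAN simply optimizes a weighted sum of the two terms, trading numerical fidelity against distributional realism.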
However, the vanilla GAN (v-GAN) architecture may generate a spectrum that belongs to the true desired distribution but does not correspond to the spectral frames given at the input. To address this issue, the Minimum Mean Square Error (MMSE)-regularized MMSE-GAN and CNN-GAN architectures are proposed for the Speech Enhancement (SE) task. Objective evaluation shows improvement in speech quality and suppression of background interference over state-of-the-art techniques. The effectiveness of the proposed MMSE-GAN is explored in other speech technology applications, such as Non-Audible Murmur-to-Whisper Speech Conversion (NAM2WHSP), Query-by-Example Spoken Term Detection (QbE-STD), and Voice Conversion (VC). In QbE-STD, a DNN-based GAN with cross-entropy regularization is proposed for extracting an unsupervised posterior feature representation (uGAN-PG), trained on labeled Gaussian Mixture Model (GMM) posteriorgrams. Moreover, the ability of the Wasserstein GAN (WGAN) to improve optimization stability and provide a meaningful loss metric, one that correlates with generated sample quality and the generator's convergence, is also exploited. To that effect, MMSE-WGAN is proposed for the VC task and its performance is compared with the MMSE-GAN and DNN-based approaches.

Item Open Access
Replay spoof detection using handcrafted features (Dhirubhai Ambani Institute of Information and Communication Technology, 2018) Tapkir, Prasad Anil; Patil, Hemant A.

In the past few years, there has been noteworthy demand for Automatic Speaker Verification (ASV) systems in numerous applications. The increased use of ASV systems for voice biometrics comes with the threat of spoofing attacks. ASV systems are vulnerable to five types of spoofing attacks, namely, impersonation, Voice Conversion (VC), Speech Synthesis (SS), twins and replay.
Among these, replay poses a greater threat to the ASV system than the other spoofing attacks, as it requires neither specific expertise nor sophisticated equipment; replay attacks take little effort and are the most accessible. Replayed speech can be modeled as the convolution of the genuine speech with the impulse responses of the microphone, multimedia speaker, recording environment and playback environment. Detecting replay attacks becomes harder with high-quality intermediate devices and clean recording and playback environments. In this thesis, we propose three novel handcrafted cepstral feature sets for the replay spoof detection task, namely, Magnitude-based Spectral Root Cepstral Coefficients (MSRCC), Phase-based Spectral Root Cepstral Coefficients (PSRCC) and Empirical Mode Decomposition Cepstral Coefficients (EMDCC). In addition, we explore the significance of a Teager Energy Operator (TEO) phase feature for replay spoof detection. The EMDCC feature set replaces the filterbank structure with the Empirical Mode Decomposition (EMD) technique to obtain subband signals; the number of subbands obtained using EMD is greater for replayed speech than for genuine speech. The MSRCC and PSRCC feature sets are extracted using the spectral root cepstrum of the speech signal, which spreads the effect of the additional impulse responses in replayed speech over the entire quefrency domain. The TEO phase feature set contributes additional security-relevant information when fused with magnitude-based features such as Mel Frequency Cepstral Coefficients (MFCC). Experiments are performed on the ASVspoof 2017 challenge database, and all systems are implemented using a Gaussian Mixture Model (GMM) classifier.
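The convolutional replay model described in this abstract can be sketched numerically: the replayed signal is the genuine speech convolved with the impulse responses of the playback chain. The impulse responses below are made-up toy arrays, not measured device or room responses.

```python
import numpy as np

# Toy replay model: genuine speech convolved with hypothetical
# loudspeaker and room impulse responses.
genuine = np.array([1.0, 0.5, -0.25, 0.0])
h_speaker = np.array([1.0, 0.3])        # hypothetical device response
h_room = np.array([1.0, 0.0, 0.1])      # hypothetical room echo

replayed = np.convolve(np.convolve(genuine, h_speaker), h_room)
print(replayed.shape)  # (7,): full convolution lengthens the signal
```

The extra impulse responses are exactly what cepstral features can expose, since convolution in time becomes addition in the cepstral domain.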
All the feature sets perform better than the ASVspoof 2017 challenge baseline Constant Q Cepstral Coefficients (CQCC) system.

Item Open Access
Multi-class diagnosis of diabetic retinopathy using deep learning (Dhirubhai Ambani Institute of Information and Communication Technology, 2018) Shrivastava, Udit; Joshi, Manjunath V.

Diabetic Retinopathy (DR) is the main cause of blindness in the modern world. As per studies, around 40-45% of people suffering from diabetes develop DR in the later stages of life. All forms of diabetic eye disease have the potential to cause vision impairment or blindness. Early stages of DR show very small and intricate features such as micro-aneurysms (swelling of blood vessels) and hard exudates (protein deposits), whereas the severe and proliferative stages show more prominent features such as hemorrhages (blood clots), neovascularization (abnormal growth of vessels) and macular edema. Detecting such small and complex features in fundus images is a tedious, time-consuming process that requires an experienced ophthalmologist, which calls for an automated diagnosis system to reduce the burden on clinicians. In this thesis, we propose a Convolutional Neural Network (CNN) based automated diagnosis system that can classify the various stages of diabetic retinopathy accurately. A hierarchical approach is adopted, breaking the classification task into two stages: in the first stage we perform binary classification to separate positive and negative samples, and in the second stage five-class classification is performed on the images classified as true positive, false positive and false negative in the first stage.
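The two-stage hierarchical scheme can be sketched with stand-in classifiers. This is purely illustrative: the thesis uses Inception-v3 features with SVMs, whereas the scalar "lesion score" and thresholds here are invented, and the sketch routes only screened-positive samples to the severity grader.

```python
# Sketch of two-stage hierarchical diagnosis with hypothetical stand-ins.
def hierarchical_predict(x, binary_clf, five_class_clf):
    """Stage 1: DR / no-DR screening. Stage 2: severity grading, run only
    on samples the screener flags as positive."""
    if binary_clf(x) == 0:        # screened as healthy
        return 0
    return five_class_clf(x)      # severity grade

# Hypothetical stand-ins: thresholds on a single scalar "lesion score".
binary = lambda x: int(x > 0.5)
grader = lambda x: min(4, 1 + int(x * 4))

print(hierarchical_predict(0.2, binary, grader))  # 0 (healthy)
print(hierarchical_predict(0.9, binary, grader))  # 4 (severe grade)
```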
Our proposed method uses the Inception-v3 network for feature extraction, taking the features of the second-last layer together with the features from the second-last layer of an auxiliary classifier. These extracted features are concatenated into a single feature vector to train a Support Vector Machine (SVM). For multiclass classification, the SVM assigns each sample to one of the five classes. Experiments are conducted on the "Kaggle" dataset, and our proposed approach attains an accuracy of 91% on validation data for binary classification and 78% for multiclass classification. These results are better than recent methods for multiclass classification of diabetic retinopathy.

Item Open Access
Schema based indexing for namespace mapping of raw SPARQL and summarization of LOD (Dhirubhai Ambani Institute of Information and Communication Technology, 2015) Hapani, Hitesh; Jat, P. M.

Linked Open Data (LOD) in the Semantic Web is growing day by day. Datasets are available that can be used in different applications; however, identifying useful datasets in the cloud, determining their quality and obtaining inductive information from them are tasks that still need to be addressed. As traffic on the LOD cloud increases, identifying a useful dataset becomes more difficult, because no useful summaries of the datasets are available. When querying any dataset through an endpoint, the most cumbersome part is remembering the URIs for resources; there is no known interface that provides URIs for user terms. Some standards exist for providing summaries and metadata about datasets, but none is yet universally accepted. The index structure proposed in this thesis gives schema-level information about a dataset and provides URI information for it.
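The kind of term-to-URI lookup such an index enables can be sketched as a simple mapping. This is a toy: the namespaces below are real, widely used RDF vocabularies, but the thesis's index structure carries richer schema-level information than a flat dictionary.

```python
# Toy term-to-URI index: users type plain terms, the index supplies
# full URIs for building SPARQL queries.
index = {
    "name":  "http://xmlns.com/foaf/0.1/name",
    "label": "http://www.w3.org/2000/01/rdf-schema#label",
    "type":  "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
}

def expand(term):
    """Map a user term to its full URI (raises KeyError if unknown)."""
    return index[term.lower()]

print(expand("Label"))  # http://www.w3.org/2000/01/rdf-schema#label
```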
This index structure has been implemented on both local and remote dataset servers in this thesis.

Item Open Access
Text negation normalization (Dhirubhai Ambani Institute of Information and Communication Technology, 2014) Dubey, Mohnish; Dasgupta, Sourish

Item Open Access
Text normalization and non-monotonic knowledge base revision for consistent ontology learning (Dhirubhai Ambani Institute of Information and Communication Technology, 2013) Shah, Kushal; Dasgupta, Sourish

Item Open Access
Study of Bayesian learning of system characteristics (Dhirubhai Ambani Institute of Information and Communication Technology, 2008) Sharma, Abhishek; Jotwani, Naresh D.

This thesis deals with the scheduling algorithms implemented in computer systems and with the creation of a probabilistic network that predicts system behaviour. The aim is to provide better, optimized results for any system where scheduling is performed. The material presented in this report gives an overview of the field and paves the way to subsequent topics covering the theory of Bayesian networks, learning Bayesian networks, and concepts related to process scheduling. A Bayesian network is a graphical model of the probabilistic relationships among a set of random variables (discrete or continuous); such models have several advantages for data analysis. The goal of learning is to find the Bayesian network that best represents the joint probability distribution; one approach is to find the network that maximizes the likelihood of the data or, more conveniently, its logarithm. We describe methods for learning both the parameters and the structure of a Bayesian network, including techniques for learning with complete data. We apply these Bayesian network learning methods to data samples generated from the operating system scheduling environment.
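Maximum-likelihood parameter learning for a discrete Bayesian network reduces to counting: P(X = x | Parents = p) is estimated as the fraction of samples with parent configuration p in which X = x. A minimal sketch follows; the toy samples are invented, not the thesis's scheduling traces.

```python
from collections import Counter

# ML estimation of a conditional probability table (CPT) by counting.
samples = [("short", "fast"), ("short", "fast"), ("short", "slow"),
           ("long", "slow")]  # hypothetical (job_type, completion) pairs

def mle_cpt(samples):
    """Estimate P(child | parent) from (parent, child) samples."""
    joint = Counter(samples)
    parent = Counter(p for p, _ in samples)
    return {(p, x): joint[(p, x)] / parent[p] for (p, x) in joint}

cpt = mle_cpt(samples)
print(cpt[("short", "fast")])  # 2/3
```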
Various results are produced, tested and verified for scheduling algorithms (FCFS, SJF, RR and PW) using an operating system scheduling simulator implemented in Java. The simulator code is modified according to the requirements to fulfil the necessary tasks.
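The waiting-time comparison such a simulator reports can be sketched for FCFS versus SJF. This is an illustrative Python sketch (the thesis's simulator is in Java), and it assumes all jobs arrive simultaneously at time zero.

```python
# Average waiting time under a fixed service order, all arrivals at t = 0.
def avg_waiting_time(burst_times):
    waiting, elapsed = 0, 0
    for b in burst_times:
        waiting += elapsed   # this job waits for all earlier jobs
        elapsed += b
    return waiting / len(burst_times)

jobs = [6, 8, 7, 3]                      # hypothetical CPU bursts
fcfs = avg_waiting_time(jobs)            # run in arrival order
sjf = avg_waiting_time(sorted(jobs))     # shortest job first
print(fcfs, sjf)  # 10.25 7.0 -- SJF minimizes average waiting time
```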