KRISHI
ICAR RESEARCH DATA REPOSITORY FOR KNOWLEDGE MANAGEMENT
(An Institutional Publication and Data Inventory Repository)
Please use this identifier to cite or link to this item:
http://krishi.icar.gov.in/jspui/handle/123456789/73739
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Prabina Kumar Meher | en_US |
dc.contributor.author | Tanmaya Kumar Sahu | en_US |
dc.contributor.author | Shachi Gahoi | en_US |
dc.contributor.author | Subhrajit Satpathy | en_US |
dc.contributor.author | Atmakuri Ramakrishna Rao | en_US |
dc.date.accessioned | 2022-08-07T07:34:34Z | - |
dc.date.available | 2022-08-07T07:34:34Z | - |
dc.date.issued | 2019-07-01 | - |
dc.identifier.citation | Prabina Kumar Meher, Tanmaya Kumar Sahu, Shachi Gahoi, Subhrajit Satpathy, Atmakuri Ramakrishna Rao, (2019). Evaluating the performance of sequence encoding schemes and machine learning methods for splice sites recognition, Gene, 705, 113-126, https://doi.org/10.1016/j.gene.2019.04.047 | en_US |
dc.identifier.issn | Not Available | - |
dc.identifier.uri | http://krishi.icar.gov.in/jspui/handle/123456789/73739 | - |
dc.description | Not Available | en_US |
dc.description.abstract | Identification of splice sites is imperative for prediction of gene structure. Machine learning-based approaches (MLAs) have been reported to be more successful than the rule-based methods for identification of splice sites. However, the strings of alphabets should be transformed into numeric features through sequence encoding before using them as input in MLAs. In this study, we evaluated the performances of 8 different sequence encoding schemes i.e., Bayes kernel, density and sparse (DS), distribution of tri-nucleotide and 1st order Markov model (DM), frequency difference distance measure (FDDM), paired-nucleotide frequency difference between true and false sites (FDTF), 1st order Markov model (MM1), combination of both 1st and 2nd order Markov model (MM1 + MM2) and 2nd order Markov model (MM2) in respect of predicting donor and acceptor splice sites using 5 supervised learning methods (ANN, Bagging, Boosting, RF and SVM). The encoding schemes and machine learning methods were first evaluated in 4 species i.e., A. thaliana, C. elegans, D. melanogaster and H. sapiens, and then performances were validated with another four species i.e., Ciona intestinalis, Dictyostelium discoideum, Phaeodactylum tricornutum and Trypanosoma brucei. In terms of ROC (receiver-operating-characteristics) and PR (precision-recall) curves, FDTF encoding approach achieved higher accuracy followed by either MM2 or FDDM. Further, SVM was found to achieve higher accuracy (in terms of ROC and PR curves) followed by RF across encoding schemes and species. In terms of prediction accuracy across species, the SVM-FDTF combination was optimum than other combinations of classifiers and encoding schemes. Further, splice site prediction accuracies were observed higher for the species with low intron density. To our limited knowledge, this is the first attempt as far as comprehensive evaluation of sequence encoding schemes for prediction of splice sites is concerned. We have also developed an R-package EncDNA (https://cran.r-project.org/web/packages/EncDNA/index.html) for encoding of splice site motifs with different encoding schemes, which is expected to supplement the existing nucleotide sequence encoding approaches. This study is believed to be useful for the computational biologists for predicting different functional elements on the genomic DNA. | en_US |
dc.description.sponsorship | Not Available | en_US |
dc.language.iso | English | en_US |
dc.publisher | Not Available | en_US |
dc.relation.ispartofseries | Not Available | - |
dc.subject | Gene prediction | en_US |
dc.subject | Intron density | en_US |
dc.subject | Markov model | en_US |
dc.subject | Sequence encoding | en_US |
dc.subject | Supervised learning | en_US |
dc.title | Evaluating the performance of sequence encoding schemes and machine learning methods for splice sites recognition. | en_US |
dc.title.alternative | Not Available | en_US |
dc.type | Research Paper | en_US |
dc.publication.projectcode | Not Available | en_US |
dc.publication.journalname | Gene | en_US |
dc.publication.volumeno | 705 | en_US |
dc.publication.pagenumber | 113-126 | en_US |
dc.publication.divisionUnit | Not Available | en_US |
dc.publication.sourceUrl | 10.1016/j.gene.2019.04.047 | en_US |
dc.publication.authorAffiliation | ICAR::Indian Agricultural Statistics Research Institute | en_US |
dc.publication.authorAffiliation | ICAR::Indian Agricultural Research Institute | en_US |
dc.publication.authorAffiliation | ICAR::National Bureau of Plant Genetics Resources | en_US |
dc.publication.authorAffiliation | International Crops Research Institute for Semi Arid Tropics | en_US |
dc.ICARdataUseLicence | http://krishi.icar.gov.in/PDF/ICAR_Data_Use_Licence.pdf | en_US |
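
The abstract above notes that nucleotide strings must be turned into numeric features before a machine learning method can use them. As a rough illustration of that step only, the sketch below shows a generic sparse (one-hot) encoding in Python. It is not taken from the paper or from the EncDNA package; the names `SPARSE_CODE` and `sparse_encode`, the A/C/G/T column order, and the example window are all assumptions made for illustration, and the paper's "density and sparse (DS)" scheme may differ in detail.

```python
from typing import Dict, List, Tuple

# Assumed mapping (illustrative, not from the paper): each nucleotide becomes
# a 4-bit indicator vector in the order A, C, G, T.
SPARSE_CODE: Dict[str, Tuple[int, int, int, int]] = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def sparse_encode(sequence: str) -> List[int]:
    """Return a flat 0/1 feature vector with 4 entries per nucleotide."""
    features: List[int] = []
    for base in sequence.upper():
        if base not in SPARSE_CODE:
            # Ambiguous symbols such as N are rejected rather than guessed.
            raise ValueError(f"unsupported nucleotide: {base!r}")
        features.extend(SPARSE_CODE[base])
    return features


if __name__ == "__main__":
    # Hypothetical 9-nt window around a donor site (GT dinucleotide inside).
    window = "CAGGTAAGT"
    vector = sparse_encode(window)
    print(len(vector), vector[:8])
```

A vector like this, one per candidate site, is the kind of fixed-length numeric input that classifiers such as SVM or RF expect; the other schemes named in the abstract (for example MM1 or FDTF) would replace the indicator values with frequency- or probability-based features.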
Appears in Collections: AEdu-IASRI-Publication
Files in This Item:
There are no files associated with this item.
Items in KRISHI are protected by copyright, with all rights reserved, unless otherwise indicated.