THE APPROACH TOWARDS FEATURE SELECTION FOR HIGH DIMENSIONAL DATA
Shodhganga@INFLIBNET
View Archive Info
Field | Value |
Title |
THE APPROACH TOWARDS FEATURE SELECTION FOR HIGH DIMENSIONAL DATA
— |
|
Contributor |
Dr. Vijendra Singh
|
|
Subject |
High Dimensional, Feature Selection, Filter Method, Wrapper Method, Hybrid Method, Random Forest, Mutual Information, Classification
|
|
Description |
Feature subset selection is an important preprocessing step in data mining because of the rapid growth in data dimensionality. High-dimensional data is described by many attributes, and there may be correlations or overlaps between these attributes. Data points become increasingly sparse as the dimensionality increases, and this extreme sparseness is believed to greatly affect the performance of classifiers. Reducing the number of features with feature selection algorithms therefore plays a vital role in achieving better classifier accuracy. The reduced feature set helps in many mining tasks: it makes patterns easier to understand, improves accuracy, shortens classifier training time, and reduces overfitting. A major challenge arises with microarray data. In bioinformatics, data is high dimensional, with a large number of genes and a small sample size. A typical example is data collected from cancer patients, where a smaller subset of genes must be identified so that it can be used easily for diagnostic purposes in clinical practice.
The main objective of this thesis is to overcome the curse of dimensionality in high-dimensional datasets by selecting a subset of relevant features that improves classification accuracy. To achieve this objective, the thesis first proposes an algorithm that provides an efficient feature evaluation criterion for removing irrelevant and redundant features from high-dimensional data. This algorithm has been tested on data from three different domains: text, image, and microarray. Second, a hybrid algorithm is proposed that uses different bio-inspired optimization techniques. Depending on the optimization technique applied, the algorithm is named TGRASP or FFGRASP.
Further, to overcome the major challenge of microarray data, namely a small sample size and a large number of genes, a third algorithm named RFST is proposed. In addition, the thesis investigates the use of the Multivariate Adaptive Regression Splines (MARS) algorithm for classification of microarray data. The feature subset found using the MARS algorithm has been compared, in terms of accuracy, with Random Forest, RFST, and other existing feature selection techniques. To identify the best candidate genes among the feature subsets selected by different feature selection algorithms, a new hybrid feature selection algorithm is developed.
Finally, a qualitative mutual information-based feature selection algorithm is proposed that first balances the dataset and then reduces the number of features in cancer microarray data. The proposed algorithms provide an optimal number of features with high classification accuracy. Adding a qualitative measure to mutual information improved robustness and performed better than the proposed RFST algorithm.
— |
|
Date |
2018-10-17T09:50:23Z
2-7-2012 2018 08/10/2018 |
|
Type |
Ph.D.
|
|
Identifier |
http://hdl.handle.net/10603/218975
|
|
Language |
English
|
|
Relation |
IEEE
|
|
Rights |
university
|
|
Format |
165p.
— DVD |
|
Coverage |
Data mining
|
|
Publisher |
Gurgaon
The Northcap University (Formerly ITM University, Gurgaon) Department of CSE and IT |
|
Source |
University
|
|
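The Description above refers to filter-style selection that removes irrelevant and redundant features using mutual information. As a rough illustration only, and not the thesis's own algorithms (TGRASP, FFGRASP, RFST, and the qualitative mutual information method are not specified in this record), the general idea can be sketched in Python. The function names, the toy gene columns, the thresholds `min_relevance` and `max_overlap`, and the normalization by feature entropy are all assumptions made for this example. It also assumes features have already been discretized.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Estimate I(X; Y) in bits from two equal-length sequences of discrete values."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    # Sum p(x, y) * log2(p(x, y) / (p(x) * p(y))), written with raw counts.
    return sum(
        (c / n) * log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


def select_features(columns, labels, k, min_relevance=1e-9, max_overlap=0.9):
    """Greedy filter (illustrative): keep up to k features that are relevant
    to the labels and not near-duplicates of an already chosen feature."""
    relevance = {
        name: mutual_information(values, labels)
        for name, values in columns.items()
    }
    ranked = sorted(columns, key=relevance.get, reverse=True)
    chosen = []
    for name in ranked:
        if len(chosen) == k:
            break
        if relevance[name] <= min_relevance:
            continue  # irrelevant: carries no information about the class
        # H(X) = I(X; X); positive here because a relevant feature is not constant.
        entropy = mutual_information(columns[name], columns[name])
        redundant = any(
            mutual_information(columns[name], columns[other]) / entropy >= max_overlap
            for other in chosen
        )
        if not redundant:
            chosen.append(name)
    return chosen


# Hypothetical, already discretized expression levels for eight samples.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "g1": [0, 0, 0, 0, 1, 1, 1, 1],  # fully informative
    "g2": [0, 0, 0, 0, 1, 1, 1, 1],  # duplicate of g1 (redundant)
    "g3": [0, 1, 0, 1, 0, 1, 0, 1],  # independent of the class (irrelevant)
    "g4": [0, 0, 0, 1, 1, 1, 1, 1],  # partly informative
}

selected = select_features(columns, labels, k=3)
print(selected)
```

In this toy data the sketch keeps `g1` and `g4`: `g2` is dropped as a duplicate of `g1`, and `g3` is dropped because it shares no information with the class labels. Real microarray data would first need discretization and, as the Description notes, possibly class balancing.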