Record Details

THE APPROACH TOWARDS FEATURE SELECTION FOR HIGH DIMENSIONAL DATA

Shodhganga@INFLIBNET

 
 
Field Value
 
Title THE APPROACH TOWARDS FEATURE SELECTION FOR HIGH DIMENSIONAL DATA

 
Contributor Dr. Vijendra Singh
 
Subject High Dimensional, Feature Selection, Filter Method, Wrapper Method, Hybrid Method, Random Forest, Mutual Information, Classification
 
Description Feature subset selection is an important preprocessing step in data mining because of the huge increase in dimensionality. High-dimensional data is described by many attributes, and there may be correlations or overlaps between these attributes. Data points become increasingly sparse as the dimensionality increases. This extreme sparseness is believed to greatly affect the performance of classifiers. Therefore, reducing the features with feature selection algorithms plays a vital role in achieving better classifier accuracy. The reduced set of features helps in many mining tasks: it makes patterns easier to understand, improves accuracy, reduces the training time of the classifier, and reduces overfitting. A major challenge arises when the data is microarray in nature. In bioinformatics, data is high dimensional, with a large number of genes and a small sample size. A typical example is data collected from cancer patients, where a smaller subset of genes must be identified so that it can be used easily in clinical practice for diagnostic purposes.
The main objective of this thesis is to overcome the curse of dimensionality in high-dimensional datasets by selecting a subset of relevant features that improves classification accuracy. To achieve this objective, the thesis first proposes an algorithm that provides an efficient feature evaluation criterion for removing irrelevant and redundant features from high-dimensional data. This algorithm has been tested on three different data domains: text, image, and microarray. Second, a hybrid algorithm is proposed that uses different bio-inspired optimization techniques. Depending on the optimization technique applied, the algorithm is named TGRASP or FFGRASP.
Further, to overcome the major challenge of microarray data, namely a small sample size and a large number of genes, a third algorithm named RFST is proposed. The thesis also investigates the use of the Multivariate Adaptive Regression Splines (MARS) algorithm for classification of microarray data. The feature subset found using the MARS algorithm has been compared in terms of accuracy with Random Forest, RFST, and other existing feature selection techniques. To analyze the best candidate genes from the different feature subsets selected by different feature selection algorithms, a new hybrid feature selection algorithm is developed.
Finally, a qualitative mutual information-based feature selection algorithm is proposed that first balances the dataset and then reduces the number of features in cancer microarray data. The proposed algorithms provide an optimal number of features with high classification accuracy. Adding a qualitative measure alongside mutual information has been shown to improve robustness and to perform better than the proposed RFST algorithm.

 
Date 2018-10-17T09:50:23Z
2-7-2012
2018
08/10/2018
 
Type Ph.D.
 
Identifier http://hdl.handle.net/10603/218975
 
Language English
 
Relation IEEE
 
Rights university
 
Format 165p.

DVD
 
Coverage Data mining
 
Publisher Gurgaon
The Northcap University (Formerly ITM University, Gurgaon)
Department of CSE and IT
 
Source University