Talk

Tuesday, 22 September 2009
3:30 p.m.
Room S-05-170
LKH-Eingangsgebäude (LKH entrance building), Auenbruggerplatz 2, 5th floor, 8036 Graz




Multiple Testing Schemes for Feature Selection in Neural Network Models

by Michele La Rocca (and Cira Perna)
Department of Economics and Statistics, University of Salerno, Italy



Abstract

Feature selection (or variable selection) is a critical step in constructing statistical regression, pattern classification, or time series models that generalize well. Irrelevant or redundant input variables can lead to overfitting and poor generalization, especially for models that suffer from the curse of dimensionality in the number of input variables. Determining the most appropriate inputs to a model provides a better understanding of the underlying process that generated the data and has a significant impact on the performance of the model and of the associated algorithms for classification, prediction and data analysis. The importance of this topic has grown in recent years as computing power has encouraged the modelling of data sets of ever-increasing size. Data mining applications in finance, marketing and bioinformatics are obvious examples.

The talk deals with variable selection in artificial neural networks. Interest in this class of models stems from their ability to accurately represent the complex, non-linear behaviour of relatively poorly understood processes, which makes them particularly well suited to problems characterized by complex, noisy, irrelevant or partial information. However, selecting an adequate neural network model is always a hard task, due to the "atheoretical" nature of the tool and its intrinsic misspecification. The problem is not new, and several approaches have been proposed in the literature in both frequentist and Bayesian frameworks.

In this talk, a novel strategy for input selection in neural network modelling is discussed. The approach is in the same spirit as those based on relevance indexes, which measure the input-output sensitivity, but, to avoid the data snooping problem, the familywise error rate is controlled by using a multiple testing scheme. Unlike existing testing solutions, the approach does not require any a priori identification of a proper set of variables to test, a requirement that can often lead to sequential testing schemes and, as a consequence, to loss of control over the true size of the test. The sampling distribution of the test statistic involved is approximated by subsampling, a resampling scheme which is able to deliver consistent results under very weak assumptions. Although the method is discussed within the framework of artificial neural networks, it can be applied to any non-linear model with very minor modifications, as long as the input-output sensitivities, or other variable relevance measures, can be computed or estimated. Results of the application of the novel procedure to neural network modelling of the Euro exchange rates will also be discussed.
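
To make the idea of the abstract more concrete, the following is a minimal sketch, not the authors' procedure. It assumes a simplified single-step rule: each input gets a scaled relevance statistic, the distribution of the maximum statistic is approximated by subsampling without replacement, and an input is flagged as relevant when its full-sample statistic exceeds the (1 - alpha) quantile of the subsampled maxima. The function names `relevance` and `subsample_max_test`, the subsample size `b`, the scaling by sample size, and the use of a squared least-squares slope in place of a neural network sensitivity are all assumptions made for illustration.

```python
import numpy as np


def relevance(X, y):
    """Hypothetical relevance measure: squared least-squares slope per input.

    This is a stand-in (assumption) for the average squared input-output
    sensitivity of a fitted neural network; any relevance measure that can
    be recomputed on a subsample could be plugged in here instead.
    """
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1:] ** 2  # drop the intercept


def subsample_max_test(X, y, b, n_sub=500, alpha=0.05, seed=0):
    """Single-step max-statistic test with a subsampling critical value.

    Assumed simplification: one common cutoff for all inputs, taken as the
    (1 - alpha) quantile of the maximum subsample statistic.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    stat_full = n * relevance(X, y)  # one scaled statistic per input
    max_sub = np.empty(n_sub)
    for i in range(n_sub):
        idx = rng.choice(n, size=b, replace=False)  # subsample, no replacement
        max_sub[i] = np.max(b * relevance(X[idx], y[idx]))
    crit = np.quantile(max_sub, 1 - alpha)
    return stat_full > crit, stat_full, crit


# Synthetic demo data (assumption): only the first two inputs matter.
rng = np.random.default_rng(42)
X = rng.standard_normal((2000, 4))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.standard_normal(2000)

reject, stats, crit = subsample_max_test(X, y, b=200)
print("relevant inputs:", np.flatnonzero(reject))
```

Because all inputs are compared against one cutoff derived from the maximum statistic, no subset of candidate variables has to be chosen in advance. A stepwise refinement of the cutoff would likely be less conservative, but it is outside the scope of this sketch.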

