Presenter/Author Information

Wenyan Wu
Robert May
Graeme C. Dandy
Holger R. Maier

Keywords

artificial neural networks, data splitting

Start Date

1-7-2012 12:00 AM

Abstract

Data splitting is an important step in the artificial neural network (ANN) development process, whereby data are divided into training, test and validation subsets to ensure good generalization ability of the model. Because only one split of the data is typically used when developing ANN models, data splitting has a significant impact on the performance of the final model by potentially introducing bias and variance into the model development process. Therefore, it is important to find a robust data splitting method that results in an ANN model representing the underlying data generation process of a given dataset. In practice, ANN models developed using different data splitting methods are often assessed based on validation results. Previous research, however, has found that validation results alone are not adequate for assessing the performance of ANN models. Data splitting methods have the potential to bias the validation results by allocating extreme observations to the training set, so that the test and validation sets contain fewer patterns than the training set. Consequently, the generalization ability of the model may be compromised and the trained model cannot be adequately validated. This paper introduces a method for fairly comparing different data splitting methods for developing ANN models. The methodology is applied to compare a number of well-known data splitting techniques in the context of some hydrological ANN modeling problems.
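As a rough illustration of the three-way split described above, the following Python sketch partitions a dataset at random into training, test and validation subsets and reports the range covered by each. It is not the comparison method introduced in the paper: the function names (random_split, subset_summary), the 60/20/20 proportions and the synthetic "flow" series are all assumptions made for this example only.

```python
import random


def random_split(n_samples, train_frac=0.6, test_frac=0.2, seed=0):
    """Randomly partition indices 0..n_samples-1 into three subsets.

    The 60/20/20 default proportions are an assumption for illustration;
    whatever remains after the training and test subsets is used for validation.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(round(train_frac * n_samples))
    n_test = int(round(test_frac * n_samples))
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    valid = indices[n_train + n_test:]
    return train, test, valid


def subset_summary(values, subset):
    """Return (count, minimum, maximum) of the observations in one subset."""
    chosen = [values[i] for i in subset]
    return len(chosen), min(chosen), max(chosen)


# Synthetic, skewed series standing in for a hydrological variable (made-up data).
rng = random.Random(42)
flows = [rng.lognormvariate(0.0, 1.0) for _ in range(200)]

train, test, valid = random_split(len(flows))
for name, subset in (("training", train), ("test", test), ("validation", valid)):
    count, low, high = subset_summary(flows, subset)
    print(f"{name:>10}: n={count:3d}  min={low:.3f}  max={high:.3f}")
```

Comparing the minimum and maximum of each subset against the full series gives a quick indication of whether the extreme observations ended up mostly in the training set, which is the situation the abstract warns can bias validation results.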


A method for comparing data splitting approaches for developing hydrological ANN models
