Presenter/Author Information

Marina Erechtchoukova, York University, Canada

Keywords

Supervised classification; Extreme Gradient Boosting; Short-term prediction; Flood event; Training sets

Start Date

17-9-2020 3:00 PM

End Date

17-9-2020 3:20 PM

Abstract

Watershed management decisions require trustworthy and efficient predictive models. Classification machine learning algorithms, which represent a data-driven approach to hydrological model development, have become increasingly popular in recent years. A distinctive feature of their use, however, is that a reliable model must be developed on data from the recent past and then applied to data collected in the future. Generated predictions can be incorporated into management decisions only after their reliability has been assessed, which is traditionally done through estimates of the model generalization error. Since a model is created by training (i.e., calibrating) a machine learning algorithm on a representative data set, the way spatially and temporally dispersed data are transformed into training and testing sets becomes critical for developing a predictive tool. Merely evaluating model performance on data samples unseen during training is insufficient for model validation. This study investigated approaches to building training sets that yield models reliable for short-term prediction of flood events over extended lead times in a watershed with a flashy response to precipitation. The original computational scheme was based on a framework incorporating time-delay embedding applied to time-series data from all observation sites of a watershed. Stratified random sampling was compared with chronological splits of complete data sets and of stratified data reflecting distinct classes of hydrological events. While stratified random sampling provides average estimates of model performance, practical needs dictate choosing a model that performs better on events occurring outside the time interval used for training. The computational experiments were conducted on data collected during years with different hydrological characteristics. The results of this study are presented in this paper.
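
The contrast between the sampling schemes named in the abstract can be illustrated with a minimal sketch. It is not the study's implementation: the water-level values, the flood threshold, the single-site series, and the helper names (`delay_embed`, `chronological_split`, `stratified_random_split`) are all hypothetical, and the classifier itself (Extreme Gradient Boosting in the study) is omitted. The sketch only shows how time-delay embedding turns a series into labelled samples and how a chronological split keeps all test events later than the training interval, whereas a stratified random split mixes time periods while preserving class proportions.

```python
import random
from collections import defaultdict


def delay_embed(series, n_lags, lead):
    """Turn one time series into (feature_window, target_index) samples.

    Each window holds the n_lags most recent values up to time t;
    the target index is t + lead, i.e. the step to be predicted.
    """
    rows = []
    for t in range(n_lags - 1, len(series) - lead):
        window = series[t - n_lags + 1 : t + 1]
        rows.append((list(window), t + lead))
    return rows


def chronological_split(samples, train_fraction):
    """Train on the earliest samples and test on the latest ones."""
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]


def stratified_random_split(samples, labels, train_fraction, seed=0):
    """Split at random within each class so class proportions are kept."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label][:]
        rng.shuffle(members)
        cut = int(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


# Hypothetical water-level readings and flood threshold (assumptions).
levels = [0.2, 0.3, 0.2, 0.9, 1.4, 0.8, 0.3, 0.2, 0.4, 1.1, 1.6, 0.7]
THRESHOLD = 1.0

samples = delay_embed(levels, n_lags=3, lead=1)
# Label 1 marks a flood-class target, 0 a non-flood target.
labels = [int(levels[target] > THRESHOLD) for _, target in samples]

chrono_train, chrono_test = chronological_split(samples, 0.7)
strat_train, strat_test = stratified_random_split(samples, labels, 0.7)
```

With the chronological split, every test target lies after the last training target, which mirrors the practical setting described above; the stratified random split gives no such guarantee, so its performance estimate averages over time periods.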


Developing Training Sets for Hydrological Prediction Based on Supervised Classification
