Keywords
Extreme hydrological events; Imbalanced data; Oversampling; Multi-year high-flow approach
Start Date
5-7-2022 12:00 PM
End Date
8-7-2022 9:59 AM
Abstract
Classification machine learning algorithms have been used successfully in hydrological prediction. In particular, they have the potential to predict extreme hydrological events accurately and efficiently. To produce reliable predictions, supervised classification algorithms are “trained” on samples of pre-processed data supplied by hydrological monitoring networks. In many cases, however, data sets are dominated by elements from some classes, leaving other classes underrepresented. In hydrological data sets, most elements describe low-flow events, and relatively few correspond to the high-flow events that occur only occasionally in natural waterbodies. This imbalance in class representation in training data sets may bias a model and cause significant errors in predictions of minority-class events. The present study aimed to evaluate the extent of imbalance in the data and how that extent affects the accuracy of modeling results. To mitigate the imbalance in the data used for training models, two approaches were investigated: oversampling and a multi-year high-flow approach. The results of the study are presented in the paper.
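As an illustration of the oversampling idea mentioned in the abstract, the following is a minimal sketch of random oversampling, in which minority-class samples are duplicated at random until every class matches the size of the largest class. It is not the authors' implementation: the function name `random_oversample`, the toy discharge values, and the threshold of 5.0 used to label high-flow events are all assumptions made for this example.

```python
import numpy as np


def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until all classes
    have as many rows as the largest class (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    index_parts = []
    for cls, n in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        index_parts.append(idx)
        if target > n:
            # Sample with replacement from the existing rows of this class.
            index_parts.append(rng.choice(idx, size=target - n, replace=True))
    keep = np.concatenate(index_parts)
    return X[keep], y[keep]


# Toy data (assumed values): eight low-flow and two high-flow observations.
flow = np.array([1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 1.4, 1.0, 9.5, 11.2])
X = flow.reshape(-1, 1)
y = (flow > 5.0).astype(int)  # 1 = high-flow event (assumed threshold)

X_bal, y_bal = random_oversample(X, y)
print(np.bincount(y), np.bincount(y_bal))
```

Because the added rows are exact copies of existing high-flow observations, this kind of resampling should be applied only to the training split, so that duplicated events do not leak into the data used for evaluation.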
Key Laboratory of the Virtual Geographic Environment, Ministry of Education of PR China, Nanjing Normal University, Nanjing, Jiangsu, China