Streams in small watersheds often exhibit diel fluctuations, in which streamflow oscillates on a 24-hour cycle. These fluctuations, which we investigate in this study, are an informative indicator of environmental processes. However, in environmental data sets, as in many others, the amount of noise varies from one data point to another. Some points are collected under relatively clear and well-defined conditions, while others are affected by known or unknown confounding factors that may reduce their validity. Such points may or may not remain useful for training, depending on how much uncertainty they contain. We submit that when the clarity, or 'confidence', associated with individual data points varies, as it notably does in environmental data, an approach that takes this confidence into account during the training phase is beneficial. We propose a methodological framework for assigning confidence to individual data records and augmenting training with that information. We then apply this methodology to two data sets: a simulated data set and a real-world environmental science data set focused on streamflow diel signals. The simulated data set allows a thorough understanding of the nature of the data involved, and the environmental science data set provides a real-world case study of the methodology applied to noisy data. The results of both studies indicate that using confidence in training improves performance and assists the data mining process.



College and Department

Physical and Mathematical Sciences; Computer Science



Date Submitted


Document Type





machine learning, data mining, data, data processing, pre-processing, confidence, prioritization, environmental science, hydrology, diel, diel fluctuation, diel signal, streamflow, hydrogeology, watershed