Keywords
human-altered streams, multivariate data clustering, generative topographic mapping, feature relevance determination, ecological status
Start Date
1-7-2006 12:00 AM
Abstract
The high dimensionality of real data sets usually hampers the interpretability of the results of their analysis. In a previous study, stream data that are part of the knowledge base of an environmental decision support system were explored through clustering and visualization. The interpretability of these clustering results would be improved by a feature selection strategy based on a method capable of ranking the observed features according to their relative relevance. In this paper, we use one such method, which is an integral part of a probabilistic model for multivariate data clustering and visualization: Generative Topographic Mapping. The feature relevance determination method estimates a saliency for each feature, which is a measure of its influence on the clustering structure of the data. It is, therefore, a fully unsupervised interpretation of relevance. Its application to the available stream data shows that chemical parameters dominate the clustering structure, which indicates that they might also be relevant for the prediction of the streams’ ecological status. Furthermore, no feature is deemed irrelevant by the model, a fact that supports expert decisions in the pre-processing stage of the mining of these data.
Finding relevant features for the characterization of the ecological status of human altered streams using a constrained mixture model
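The idea of an unsupervised feature saliency mentioned in the abstract can be illustrated with a small sketch. It is not the paper's method: Generative Topographic Mapping constrains the mixture centres to the image of a low-dimensional latent grid, and that constraint is omitted here. The sketch assumes a saliency formulation of the kind often used for unsupervised feature selection in Gaussian mixtures, in which each feature d is drawn from a component-specific Gaussian with probability rho_d and from a Gaussian shared by all components otherwise, with all parameters fitted by plain EM. The names `fit_feature_saliency` and `gaussian_pdf`, the equal mixing weights, the initialisation and the synthetic two-feature data are all assumptions made for illustration.

```python
import numpy as np


def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density, broadcasting over array inputs."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def fit_feature_saliency(X, n_components=2, n_iter=200, var_floor=1e-6):
    """EM for an equal-weight diagonal Gaussian mixture with feature saliencies.

    Hypothetical helper for illustration only (no GTM latent-grid constraint).
    Returns (rho, log_liks): one saliency per feature, and the log-likelihood
    evaluated at the start of every iteration.
    """
    N, D = X.shape
    K = n_components
    # Deterministic start: component means at evenly spaced per-feature percentiles.
    q = 100.0 * (np.arange(K) + 0.5) / K
    mu = np.percentile(X, q, axis=0)            # (K, D) component means
    var = np.tile(X.var(axis=0), (K, 1))        # (K, D) component variances
    bg_mean = X.mean(axis=0)                    # (D,) shared "irrelevant" density
    bg_var = X.var(axis=0)                      # (D,)
    rho = np.full(D, 0.5)                       # (D,) saliencies
    x = X[:, None, :]                           # (N, 1, D) for broadcasting
    tiny = np.finfo(float).tiny                 # guards against log(0) and 0/0
    log_liks = []
    for _ in range(n_iter):
        # E-step: joint posterior over component membership and feature relevance.
        a = rho * gaussian_pdf(x, mu, var)                   # (N, K, D) relevant part
        b = (1.0 - rho) * gaussian_pdf(x, bg_mean, bg_var)   # (N, 1, D) irrelevant part
        c = a + b + tiny
        log_joint = np.log(c).sum(axis=2) - np.log(K)        # (N, K)
        log_norm = np.logaddexp.reduce(log_joint, axis=1)    # (N,)
        log_liks.append(log_norm.sum())
        r = np.exp(log_joint - log_norm[:, None])            # (N, K) responsibilities
        u = r[:, :, None] * a / c                            # component k, feature relevant
        v = r[:, :, None] - u                                # component k, feature irrelevant
        # M-step: weighted means and variances, then the saliencies.
        u_sum = u.sum(axis=0)                                # (K, D)
        mu = (u * x).sum(axis=0) / u_sum
        var = np.maximum((u * (x - mu) ** 2).sum(axis=0) / u_sum, var_floor)
        v_sum = v.sum(axis=(0, 1))                           # (D,)
        bg_mean = (v * x).sum(axis=(0, 1)) / v_sum
        bg_var = np.maximum(
            (v * (x - bg_mean) ** 2).sum(axis=(0, 1)) / v_sum, var_floor
        )
        # Keep saliencies strictly inside (0, 1) so neither branch vanishes.
        rho = np.clip(u_sum.sum(axis=0) / N, 1e-6, 1.0 - 1e-6)
    return rho, np.array(log_liks)


# Synthetic example (not the stream data): feature 0 separates two groups,
# feature 1 is noise with the same distribution in both groups.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=400)
informative = rng.normal(np.where(labels == 0, -4.0, 4.0), 0.5)
noise = rng.normal(0.0, 1.0, size=400)
X = np.column_stack([informative, noise])

rho, log_liks = fit_feature_saliency(X)
print("estimated saliencies:", rho)
```

On data of this kind one would expect the group-separating feature to receive the larger saliency. Plain maximum likelihood does not, however, force the saliency of an uninformative feature towards zero, so the estimates are better read as a relative ranking than as a hard selection rule.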