Keywords
gc-ms, data mining, pah, workflow
Start Date
1-7-2010 12:00 AM
Abstract
Polycyclic Aromatic Hydrocarbons (PAHs) are compounds found in the environment that can be harmful to humans. They are typically formed due to incomplete combustion and as such remain after burning coal, oil, petrol, diesel, wood, household waste and so forth. Testing laboratories routinely screen soil and water samples taken from potentially contaminated sites for PAHs using Gas Chromatography Mass Spectrometry (GC-MS). A GC-MS device produces a chromatogram which is processed by an analyst to determine the concentrations of PAH compounds of interest. In this paper we investigate the application of data mining techniques to PAH chromatograms in order to provide reliable prediction of compound concentrations. A workflow engine with an easy-to-use graphical user interface is at the heart of processing the data. This engine allows a domain expert to set up workflows that can load the data, preprocess it in parallel in various ways and convert it into data suitable for data mining toolkits. The generated output can then be evaluated using different data mining techniques, to determine the impact of preprocessing steps on the performance of the generated models and for picking the best approach. Encouraging results for predicting PAH compound concentrations, in terms of correlation coefficients and root-mean-squared error are demonstrated.
Predicting Polycyclic Aromatic Hydrocarbon Concentrations in Soil and Water Samples
Polycyclic Aromatic Hydrocarbons (PAHs) are compounds found in the environment that can be harmful to humans. They are typically formed due to incomplete combustion and as such remain after burning coal, oil, petrol, diesel, wood, household waste and so forth. Testing laboratories routinely screen soil and water samples taken from potentially contaminated sites for PAHs using Gas Chromatography Mass Spectrometry (GC-MS). A GC-MS device produces a chromatogram which is processed by an analyst to determine the concentrations of PAH compounds of interest. In this paper we investigate the application of data mining techniques to PAH chromatograms in order to provide reliable prediction of compound concentrations. A workflow engine with an easy-to-use graphical user interface is at the heart of processing the data. This engine allows a domain expert to set up workflows that can load the data, preprocess it in parallel in various ways and convert it into data suitable for data mining toolkits. The generated output can then be evaluated using different data mining techniques, to determine the impact of preprocessing steps on the performance of the generated models and for picking the best approach. Encouraging results for predicting PAH compound concentrations, in terms of correlation coefficients and root-mean-squared error are demonstrated.