Stream C: Processing environmental information including data mining, machine learning, GIS, remote sensing

R package MVGHD: causal inference procedure for geographic high-dimensional

Stéphane Bourrelly, University of Lyon III
Pascal Auquier, Aix-Marseille University

Keywords

Variable selection, random forests, high-dimensional dataset, lattice.

Location

Session C1: VI Data Mining for Environmental Sciences Session

Start Date

12 July 2016, 8:30 AM

End Date

12 July 2016, 8:50 AM

Abstract

Causal statistical inference is a key issue in machine learning. The goal is to design procedures that select relevant subsets of explanatory variables, which can help scientists better understand the mechanisms underlying the phenomena they study. We present a causal statistical inference procedure designed specifically for Geographic High-dimensional Datasets (GHD). The promise of discovering unknown informative factors is as great as the intrinsic learning challenges of these complex datasets, which are increasingly common in the fields of environment, health, ecology, epidemiology, geography, agriculture, etc. Firstly, we point out the difference between variable selection strategies designed for “understanding” and those designed for “predicting”. We then review the characteristics of the few variable selection strategies suitable for causal statistical inference. Next, we highlight the complexity of GHD through the dataset included in the presented R package. This dataset was created to better understand the health impacts of 63 environmental factors, drawing on the hundreds of sources of the so-called “French environmental big data” and on the medical database LEA. At geographical scales, the variety of available data allows the study of unexplored environmental factors (e.g. chronic exposure to trace metals or radiation). However, the GIS aggregations performed to create the final spatial indicators are very time-consuming and decrease the accuracy of the sources. Therefore, the studied phenomenon (e.g. morbidity) is usually represented by both a numerical and a multiclass spatial indicator. In addition, the potential explanatory spatial indicators (e.g. environmental factors) may be qualitative or quantitative and are more or less correlated. Moreover, they are known for only a very small number of spatial units.
Secondly, from the previous considerations we explain why, at present, only the heuristic variable selection strategies based on Random Forests (RF) are able to handle GHD. We then present, step by step, the background of the causal inference procedure MVGHD through the convenient (beta) functions of the R package, applied to this eco-epidemiological GHD. For each step we explain how to run the procedure in order to select relevant subsets of explanatory variables. ‘mvg.tune(.)’ optimizes the RF parameters through a trade-off between statistical accuracy and computational time. ‘mvg.select(.)’ selects and compares subsets of explanatory spatial indicators using two different, customised variable selection strategies. ‘mvg.estimate(.)’ assesses the significance of the results using cross-validation techniques. ‘mvg.display(.)’ provides statistical summaries, charts and maps to help interpret the results. Finally, we conclude on the strengths and limits of the understanding thus gained of the role played by the combined effects of environmental factors on health risks.
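The abstract does not specify how the package ranks variables, so the following is only a generic illustration of one widely used idea behind forest-based variable selection: out-of-bag permutation importance. It is a minimal sketch in plain Python, not the MVGHD implementation and not a port of its R functions. The names `fit_stump` and `oob_permutation_importance`, the synthetic data (40 spatial units, 6 indicators, only indicator 0 informative), and the use of one-split trees in place of full trees are all assumptions made for brevity.

```python
import random
from statistics import mean


def fit_stump(rows, y, features):
    """Fit a one-split regression tree using only the candidate features."""
    best = None
    for j in features:
        values = sorted({row[j] for row in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [yi for row, yi in zip(rows, y) if row[j] <= t]
            right = [yi for row, yi in zip(rows, y) if row[j] > t]
            ml, mr = mean(left), mean(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    if best is None:  # every candidate feature is constant in this sample
        m = mean(y)
        return features[0], float("inf"), m, m
    return best[1:]


def oob_permutation_importance(X, y, n_trees=300, mtry=2, seed=0):
    """Average increase in out-of-bag error when one indicator is shuffled."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    importance = [0.0] * p
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of units
        in_bag = set(boot)
        oob = [i for i in range(n) if i not in in_bag]
        if not oob:
            continue
        feats = rng.sample(range(p), mtry)  # random subset of indicators
        j, t, ml, mr = fit_stump([X[i] for i in boot], [y[i] for i in boot], feats)

        def mse(values):
            return mean((y[i] - (ml if v <= t else mr)) ** 2
                        for i, v in zip(oob, values))

        original = [X[i][j] for i in oob]
        shuffled = original[:]
        rng.shuffle(shuffled)
        # A one-split tree uses a single indicator, so only that one is scored.
        importance[j] += (mse(shuffled) - mse(original)) / n_trees
    return importance


# Synthetic stand-in for a small geographic dataset (hypothetical values).
rng = random.Random(42)
n_units, n_indicators = 40, 6
X = [[rng.random() for _ in range(n_indicators)] for _ in range(n_units)]
y = [3.0 * row[0] + rng.gauss(0.0, 0.1) for row in X]

importance = oob_permutation_importance(X, y)
ranking = sorted(range(n_indicators), key=lambda j: importance[j], reverse=True)
for j in ranking:
    print(f"indicator {j}: importance {importance[j]:.3f}")
```

Indicators whose importance stays near zero are candidates for removal. Because each tree is evaluated on the units left out of its bootstrap sample, the ranking needs no separate hold-out set, which matters when only a small number of spatial units is available.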

Included in

Civil Engineering Commons, Data Storage Systems Commons, Environmental Engineering Commons, Hydraulic Engineering Commons, Other Civil and Environmental Engineering Commons
