Introduction to the modelIntegration Package

The R package ‘modelIntegration’ aggregates several probability distributions into a single integrated one. Suppose that several independent methods are used to observe a deterministic element and that each method represents it as a probability distribution. We then deal with a family of probability distributions that provide alternative descriptions of the same object. The problem is how to combine the information from these prior estimates. This package implements the posterior integration method [Kryazhimskiy, 2013]. For comparison, it also implements simple averaging of the input distributions.

Methods

The posterior integration method [Kryazhimskiy, 2013; Kryazhimskiy, 2016] is based on the assumption that model outcomes are mutually compatible, i.e., that applying the model ensemble should yield identical outcomes. Formally, the integrated distribution is the normalized product of the original estimates: \begin{equation} p(z)=\frac{p_1(z)\, p_2(z) \cdots p_n(z)}{\sum_{z' \in Z}{p_1(z')\, p_2(z') \cdots p_n(z')}} \end{equation}

where \(p_1,p_2,\dots,p_n\) are the prior distributions on \(Z\) associated with the methods \(1,\dots,n\), and \(Z\) is a finite set containing more than one element.
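The normalized product above can be sketched in a few lines of base R. This is only an illustration of the formula, not the package's implementation: the helper name product_integration is hypothetical, and the priors are assumed to be numeric vectors of equal length defined on the same set \(Z\).

```r
# Hypothetical helper (not part of modelIntegration): normalized product
# of n discrete priors given as equal-length probability vectors.
product_integration <- function(priors) {
  stopifnot(length(priors) >= 1,
            length(unique(lengths(priors))) == 1)
  numerator <- Reduce(`*`, priors)  # p_1(z) * p_2(z) * ... * p_n(z) for each z
  total <- sum(numerator)           # sum of the same product over all z' in Z
  if (total == 0) {
    # The formula is undefined when no outcome has positive mass under all priors.
    stop("the priors have no common support; the product cannot be normalized")
  }
  numerator / total
}

p1 <- c(0.75, 0.25)
p2 <- c(0.50, 0.50)
product_integration(list(p1, p2))
# [1] 0.75 0.25
```

Because \(p_2\) is uniform here, it carries no information and the product simply returns \(p_1\). The explicit check on the denominator is a design choice of this sketch; the formula itself presupposes a positive denominator.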

Alternatively, prior estimates can be combined using simple averaging. This approach describes the distribution of outcomes of a random experiment in which one of the priors is chosen at random with probability \(1/n\) and an outcome is then drawn according to the chosen prior. Namely, \begin{equation} p(z)=\frac{p_1(z)+p_2(z)+ \dots +p_n(z)}{n} \end{equation}
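The averaging formula and its interpretation as a two-stage random experiment can be checked with a short base R sketch. The helper name average_integration, the example priors, and the number of draws are assumptions made for illustration; none of them come from the package.

```r
# Hypothetical helper (not part of modelIntegration): pointwise average
# of n discrete priors given as equal-length probability vectors.
average_integration <- function(priors) {
  stopifnot(length(unique(lengths(priors))) == 1)
  Reduce(`+`, priors) / length(priors)
}

z <- c(1, 2)
priors <- list(c(0.75, 0.25), c(0.50, 0.50))
average_integration(priors)
# [1] 0.625 0.375

# Two-stage experiment: pick a prior with probability 1/n, then draw an outcome.
set.seed(1)
n_draws <- 10000
chosen <- sample(length(priors), n_draws, replace = TRUE)
draws <- vapply(chosen,
                function(k) sample(z, 1, prob = priors[[k]]),
                numeric(1))
# Empirical frequencies; these should be close to the exact average above.
table(draws) / n_draws
```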

Data

To explore the basic usage of modelIntegration, we’ll start with the built-in forest_npp and forest_npp90 data frames. These datasets contain probability distribution tables for the net primary production (NPP) of forest ecosystems in seven bioclimatic zones in Russia, reported in [Kryazhimskiy et al., 2015]. The documentation of the datasets is available via the ?forest_npp and ?forest_npp90 calls.

dim(forest_npp)

## [1] 1131 17

colnames(forest_npp)

##  [1] "npp"                        "LEA_Tundra"
##  [3] "LEA_Tundra_Northern_Taiga"  "LEA_Middle_Taiga"
##  [5] "LEA_Southern_Taiga"         "LEA_Temperate"
##  [7] "LEA_Steppe"                 "LEA_Deserts"
##  [9] "LEA_Total"                  "DGVM_Tundra"
## [11] "DGVM_Tundra_Northern_Taiga" "DGVM_Middle_Taiga"
## [13] "DGVM_Southern_Taiga"        "DGVM_Temperate"
## [15] "DGVM_Steppe"                "DGVM_Deserts"
## [17] "DGVM_Total"

Basic usage

The main function of the modelIntegration package is integrate. It accepts several representations of probability distributions. Discrete distributions are supplied through the pdfs argument, which supports a ‘table-based’ format. A continuous distribution is discretized using its cdf, supplied in cdfs. In this case, each bin center equals the value of the corresponding outcome, and the bin width is determined from the adjacent outcome values in the range. The common range of the random variables (shared by all prior distributions) is set in the vals argument.

example1 <- integrate(
  vals = forest_npp[, 1],
  pdfs = as.list(forest_npp[c("LEA_Tundra", "DGVM_Tundra")]))
summary(example1)

##        Product  Average
## mean 189.29034 213.6184
## std   42.78502  74.0616

example2 <- integrate(
  vals = forest_npp90[, 1],
  pdfs = as.list(forest_npp90["LEA_Tundra"]),
  cdfs = list("DGVM_Tundra" = function(x)(pnorm(x, mean = 202, sd = 52))))
summary(example2)

##        Product   Average
## mean 183.73562 212.92005
## std   43.87124  79.16872
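The discretization of a cdf onto a grid of outcome values, as used for the cdfs argument, might look roughly like the following base R sketch. The exact binning rule of the package is not stated in this document, so the choices here are assumptions: bin edges are placed at the midpoints between adjacent outcome values, the two outer bins extend by half the neighbouring spacing, and the resulting masses are renormalized to sum to one. The helper name discretize_cdf and the grid are hypothetical.

```r
# Hypothetical helper (not part of modelIntegration): turn a cdf into a
# discrete distribution whose bins are centered on the outcome values.
discretize_cdf <- function(vals, cdf) {
  n <- length(vals)
  stopifnot(n >= 2, !is.unsorted(vals, strictly = TRUE))
  mids <- (vals[-n] + vals[-1]) / 2             # inner edges: midpoints (assumption)
  lower <- vals[1] - (vals[2] - vals[1]) / 2    # outer edges: half-spacing (assumption)
  upper <- vals[n] + (vals[n] - vals[n - 1]) / 2
  edges <- c(lower, mids, upper)
  mass <- diff(cdf(edges))                      # probability mass of each bin
  mass / sum(mass)                              # renormalize the truncated tails
}

vals <- seq(0, 400, by = 10)                    # illustrative grid, not the package data
probs <- discretize_cdf(vals, function(x) pnorm(x, mean = 202, sd = 52))
vals[which.max(probs)]                          # outcome value with the largest mass
```

Renormalizing after truncation keeps the result a proper probability vector; whether the package truncates, renormalizes, or assigns the tail mass to the outer bins is not specified here.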

Aggregated distributions

The two integrated estimates can be accessed with the product and average calls, respectively. The package also provides descriptive statistics for the integrated distributions and the priors.

example <- integrate(c(1, 2), list(c(0.75, 0.25), c(0.75, 0.25)))
product(example)

##   x prob
## 1 1  0.9
## 2 2  0.1

average(example)

##   x prob
## 1 1 0.75
## 2 2 0.25

statistics(example)

##             P1        P2 Product   Average
## mean 1.2500000 1.2500000     1.1 1.2500000
## std  0.4330127 0.4330127     0.3 0.4330127
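The mean and std rows can be reproduced by hand from a discrete distribution. The sketch below assumes that std is the population standard deviation of the distribution (no sample correction), which is consistent with the numbers printed above; the helper name dist_stats is hypothetical and not part of the package.

```r
# Hypothetical helper (not part of modelIntegration): mean and standard
# deviation of a discrete distribution with outcomes x and probabilities prob.
dist_stats <- function(x, prob) {
  stopifnot(length(x) == length(prob))
  m <- sum(x * prob)                 # expected value
  s <- sqrt(sum(prob * (x - m)^2))   # population standard deviation
  c(mean = m, std = s)
}

dist_stats(c(1, 2), c(0.9, 0.1))     # product distribution: mean 1.1, std 0.3
dist_stats(c(1, 2), c(0.75, 0.25))   # average distribution: mean 1.25, std 0.4330127
```

In this example the product distribution has a smaller standard deviation than either prior (0.3 against 0.433), while the average of two identical priors leaves the distribution unchanged.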

References

[1] Kryazhimskiy, A.V. (2013). Posterior integration of independent stochastic estimates. IIASA Interim Report. IR-13-006.

[2] Kryazhimskiy, A.V. (2016). Posteriori integration of probabilities. Elementary theory. Theory of Probability and its Applications, 60(1): 62-87.

[3] Kryazhimskiy, A., Rovenskaya, E., Shvidenko, A., Gusti, M., Shchepashchenko, D. & Veshchinskaya, V. (2015). Towards harmonizing competing models: Russian forests’ net primary production case study. Technological Forecasting & Social Change, 98: 245-254.