regression-tests.core

-main

(-main & args)
Runs hypothesis tests and estimates confidence intervals for a multiple linear regression model
whose error terms have an unknown joint distribution, using OLS and resampling.

This includes: 1) hypothesis testing using the permutation method
                  (model and coefficient significance),
               2) estimating confidence intervals for regression model parameters
                  using bootstrapping,
               3) estimating approximation accuracy using bootstrapping,
               4) bootstrap hypothesis testing on spatial autocorrelation.

Saves the results to a set of CSV files
in the root execution directory.

Arguments: path
           n-replications
           test-id
           path2

The type of analysis is specified by the [test-id] argument.
Supported values: "permutations"
                  "bootstrap-regression"
                  "bootstrap-accuracy"
                  "iid"

The original sample is read from the CSV file at [path].
The first row should contain variable labels.
The first column contains values of the response y.
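
The input layout described above can be illustrated with a short Python sketch. It is not the project's Clojure code: the function name `read_sample` and the sample data are hypothetical, and prepending a constant column for the intercept b_0 is an assumption based on X having p+1 columns.

```python
import csv
import io

def read_sample(text):
    """Parse CSV text: labels in the first row, response y in the first column."""
    rows = list(csv.reader(io.StringIO(text)))
    labels, data = rows[0], rows[1:]
    y = [float(r[0]) for r in data]
    # Assumption: a constant 1.0 is prepended so that X has p+1 columns (intercept first).
    X = [[1.0] + [float(v) for v in r[1:]] for r in data]
    return labels, y, X

# Hypothetical sample with p = 2 explanatory variables.
sample = "y,x1,x2\n1.0,0.5,2.0\n2.0,1.5,1.0\n3.5,2.5,0.0\n"
labels, y, X = read_sample(sample)
```

With this sample, `X` has three rows and three columns (intercept, x1, x2).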


Model:
      y=Xb+eps,
      eps ~ F(0,sigma^2).

      y: [n x 1] vector of the response.
      X: [n x (p+1)] matrix of the explanatory variables.
      b: [(p+1) x 1] vector of the regression coefficients.
      eps: [n x 1] vector of the independent and
           identically distributed errors with common distribution F
           having mean 0 and finite variance sigma^2.
      n: number of observations.
      p: number of explanatory variables in the input file.

Assumptions: error terms are independent and identically distributed.

Regression coefficients are estimated using ordinary least squares (OLS).
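
As an illustration of the OLS step, the following Python sketch solves the normal equations (X'X)b = X'y with Gauss-Jordan elimination and computes R-square. It is a minimal sketch using only the standard library, not the project's Clojure implementation; the function names `ols` and `r_squared` and the data are hypothetical.

```python
def ols(X, y):
    """Return b solving the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(k)]
    for c in range(k):
        # Partial pivoting: bring the largest remaining entry of column c to row c.
        piv = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def r_squared(X, y, b):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    fitted = [sum(xj * bj for xj, bj in zip(row, b)) for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Hypothetical data: intercept column plus one explanatory variable.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.1, 2.9, 5.2, 6.8]
b = ols(X, y)
r2 = r_squared(X, y, b)
```

Solving the normal equations directly is adequate for small, well-conditioned problems; a QR-based solver is usually preferred when the columns of X are nearly collinear.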


<!-- [test-id] = "permutations" -->
Hypothesis testing (permutation tests):
   output: regression-tests/permutation_tests.csv
           regression-tests/permutation_r2_sample.csv

  1) Overall model significance - exact permutation test on R-square.
      H0: b_1=b_2=...=b_p=0.

      out: approximate p-value, calculated after [n-replications] permutations,
            with a 95% normal-approximation interval.

  2) Significance of the i-th coefficient - approximate permutation
     test (Freedman & Lane, 1983) on t-statistic.
      H0: b_i=0.

      out: approximate p-value, calculated after [n-replications] permutations,
            with a 95% normal-approximation interval.
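
The overall-significance test can be sketched in Python as follows. This is an illustration only, not the project's Clojure code. For brevity it uses a single explanatory variable, so R-square is the squared Pearson correlation; with p > 1 the model would be refitted by OLS for each permutation. The p-value is assumed to be the proportion of permuted R-square values at least as large as the observed one, and the interval is assumed to be p +/- 1.96*sqrt(p(1-p)/B) clipped to [0, 1]. The function names and the data are hypothetical.

```python
import math
import random

def r_squared_simple(x, y):
    """R-square of a one-predictor OLS fit (squared Pearson correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def permutation_p_value(x, y, n_replications, rng):
    """Permute y, recompute R-square, and estimate the p-value with a 95% interval."""
    observed = r_squared_simple(x, y)
    hits = 0
    for _ in range(n_replications):
        y_perm = y[:]
        rng.shuffle(y_perm)  # under H0 the pairing of y with x is arbitrary
        if r_squared_simple(x, y_perm) >= observed:
            hits += 1
    p = hits / n_replications
    # Normal approximation to the binomial proportion (assumed form of the interval).
    half = 1.96 * math.sqrt(p * (1.0 - p) / n_replications)
    return p, max(0.0, p - half), min(1.0, p + half)

rng = random.Random(42)
x = [float(i) for i in range(12)]
y = [2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]  # hypothetical data with a strong linear signal
p, lo, hi = permutation_p_value(x, y, 1000, rng)
```

The coefficient-level test differs in what is permuted: the Freedman and Lane procedure permutes residuals of the reduced model rather than the raw response.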

<!-- [test-id] = "bootstrap-regression" -->
Confidence intervals (bootstrapping):
   output: regression-tests/regression-stat-bootstrap.csv

   estimates:
           b_0, b_1, ..., b_p;
           R-square, MSE (mean squared error).
   out: mean with 95% percentile confidence interval.

   bootstrap scheme: percentile bootstrap (Efron & Tibshirani, 1993),
                     lower bound - value of the sorted bootstrap sample at the
                     position given by the largest integer not greater than
                     alpha/2*[n-replications],
                     upper bound - value of the sorted bootstrap sample at the
                     position given by the smallest integer not less than
                     (1-alpha/2)*[n-replications].

   significance level: alpha=0.05 (95% confidence level).
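
The percentile bounds described above can be sketched in Python as follows. This is an illustration rather than the project's code: the function name `percentile_interval` is hypothetical, and treating the positions as 1-based and clamping them to the valid range are assumptions.

```python
import math

def percentile_interval(replicates, alpha=0.05):
    """Percentile bootstrap interval from a list of bootstrap replicates."""
    s = sorted(replicates)
    b = len(s)
    lo_pos = math.floor(alpha / 2 * b)       # largest integer not greater than alpha/2 * B
    hi_pos = math.ceil((1 - alpha / 2) * b)  # smallest integer not less than (1 - alpha/2) * B
    # Assumption: positions are 1-based and clamped to [1, B].
    lo_pos = min(max(lo_pos, 1), b)
    hi_pos = min(max(hi_pos, 1), b)
    return s[lo_pos - 1], s[hi_pos - 1]

# With 1000 replicates and alpha = 0.05 the bounds sit at positions 25 and 975.
lower, upper = percentile_interval(list(range(1, 1001)))
```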


<!-- [test-id] = "bootstrap-accuracy" -->
Target values are read from the CSV file at [path2].
The first row should contain variable labels.
The first column contains values of the response y.
The second column contains the group id of
the given sample value.
The remaining columns contain values of the explanatory variables.

Accuracy (single value):
      pho_j=|y~_j-y^_j|,
      y^_j=X'_j*b.

      y~_j: j-th element of the [N x 1] vector of the true observed values (in [path2]).
      y^_j: j-th element of the [N x 1] vector of the fitted values from the model.
      X'_j: j-th row of the [N x (p+1)] matrix of the explanatory variables (in [path2]).
      N: number of observations (in [path2]).

Accuracy (in subset):
      pho(k,S)=min{pho- >= 0 : #{i in S | pho_i <= pho-} >= k*m}.

      pho(k,S): the (100*k)-th percentile of the accuracy sample.
      k: belongs to [0,1].
      S: subset of values (subset in [path2]).
      m: number of values in S.
      #: denotes the number of elements in the set.

Accuracy estimates: pho(Q_1,S)=pho(0.25,S),
                    pho(Q_2,S)=pho(0.50,S),
                    pho(Q_3,S)=pho(0.75,S),
                    pho_max(S)=pho(1,S).

Accuracy estimates are calculated for each group in [path2].
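
The accuracy definitions can be illustrated with a short Python sketch (not the project's Clojure code). It computes pho(k,S) as the smallest error value for which at least k*m errors in S do not exceed it, then applies this per group. The function names, the small tolerance guarding against floating-point rounding in k*m, and the data are assumptions made for the example.

```python
import math

def absolute_errors(y_true, y_fitted):
    """pho_j = |y~_j - y^_j| for each target observation."""
    return [abs(t - f) for t, f in zip(y_true, y_fitted)]

def accuracy_percentile(errors, k):
    """Smallest value pho- >= 0 such that at least k*m errors are <= pho-."""
    s = sorted(errors)
    # Assumption: subtract a tiny tolerance so that rounding in k*m does not add a step.
    need = math.ceil(k * len(s) - 1e-12)
    if need <= 0:
        return 0.0  # k = 0: the condition already holds at pho- = 0
    return s[need - 1]

def accuracy_by_group(errors, group_ids, ks=(0.25, 0.5, 0.75, 1.0)):
    """Q_1, Q_2, Q_3 and max accuracy estimates for each group id."""
    groups = {}
    for err, gid in zip(errors, group_ids):
        groups.setdefault(gid, []).append(err)
    return {gid: [accuracy_percentile(errs, k) for k in ks]
            for gid, errs in groups.items()}

# Hypothetical observed and fitted values with two groups.
errors = absolute_errors([10, 12, 9, 20], [11, 12, 6, 15])
per_group = accuracy_by_group(errors, ["a", "a", "b", "b"])
```

In the bootstrap run these estimates would be recomputed for every replication, and the resulting samples summarised by their mean and percentile interval.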

Confidence intervals (bootstrapping):
   output: regression-tests/accuracy-bootstrap.csv
           regression-tests/accuracy-sample.csv

   estimates:
           pho(Q_1), pho(Q_2), pho(Q_3), pho_max
           all calculated after [n-replications] replications.
   out: mean with 95% percentile confidence interval.


   bootstrap scheme: percentile bootstrap (Efron & Tibshirani, 1993).

   significance level: alpha=0.05 (95% confidence level).


<!-- [test-id] = "iid" -->
Bootstrap hypothesis testing on spatial autocorrelation:
   output: regression-tests/independence-tests-bootstrap.csv
           regression-tests/morans-i-test-sample.csv
           regression-tests/geary-c-test-sample.csv

   estimates: Moran's I (Moran, 1950), Geary's C (Geary, 1954) coefficients.
   out: mean with 95% percentile confidence interval, p-value.

   bootstrapping: a bootstrap sample (pairs) is drawn from the original residuals
                  with replacement [n-replications] times.

   significance level: alpha=0.05 (95% confidence level).

   The two-tailed p-value is calculated as twice the minimum of
   1) the proportion of bootstrap statistics less than or equal to the test statistic
   (for the original sample) and 2) the proportion of bootstrap statistics greater than
   the test statistic (for the original sample).
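
The two-tailed p-value rule can be written as a few lines of Python. This is an illustrative sketch, not the project's code; the function name `two_tailed_p_value` and the sample values are hypothetical.

```python
def two_tailed_p_value(bootstrap_stats, observed):
    """Twice the smaller of the two tail proportions around the observed statistic."""
    b = len(bootstrap_stats)
    at_or_below = sum(1 for s in bootstrap_stats if s <= observed) / b
    above = sum(1 for s in bootstrap_stats if s > observed) / b
    return 2 * min(at_or_below, above)

# Hypothetical bootstrap statistics 1..10 and an observed statistic near the lower tail.
p_value = two_tailed_p_value(list(range(1, 11)), 2)
```

Here 2 of the 10 bootstrap statistics are at or below the observed value and 8 are above it, so the result is 2 * 0.2 = 0.4.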

   The original neighbours matrix of spatial proximity is row-normalized: row i is divided
   by the number of neighbours of the i-th observation.
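
The normalization and the two statistics can be sketched in Python as follows. This is an illustration, not the project's Clojure implementation. It uses the textbook forms of Moran's I and Geary's C with mean-centred values; whether the project centres the residuals or uses a different scaling is not stated here, so that is an assumption. The function names and the data are hypothetical.

```python
def row_normalize(neighbours):
    """Divide each row of a 0/1 neighbours matrix by its number of neighbours."""
    out = []
    for row in neighbours:
        count = sum(row)
        out.append([v / count if count else 0.0 for v in row])
    return out

def morans_i(w, e):
    """Moran's I: (n / S0) * sum_ij w_ij d_i d_j / sum_i d_i^2, with d = e - mean(e)."""
    n = len(e)
    mean = sum(e) / n
    d = [v - mean for v in e]
    s0 = sum(sum(row) for row in w)
    num = sum(w[i][j] * d[i] * d[j] for i in range(n) for j in range(n))
    return (n / s0) * num / sum(v * v for v in d)

def gearys_c(w, e):
    """Geary's C: ((n - 1) / (2 S0)) * sum_ij w_ij (e_i - e_j)^2 / sum_i d_i^2."""
    n = len(e)
    mean = sum(e) / n
    s0 = sum(sum(row) for row in w)
    num = sum(w[i][j] * (e[i] - e[j]) ** 2 for i in range(n) for j in range(n))
    return ((n - 1) / (2.0 * s0)) * num / sum((v - mean) ** 2 for v in e)

# Hypothetical example: four observations on a line, each adjacent to its neighbours.
neighbours = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
w = row_normalize(neighbours)
residuals = [1.0, 2.0, 3.0, 4.0]
i_stat = morans_i(w, residuals)
c_stat = gearys_c(w, residuals)
```

For this smoothly increasing sequence Moran's I is positive and Geary's C is below 1, both of which indicate positive spatial autocorrelation.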


## Usage

      # Runs permutation tests for a multiple linear regression model specified by data in [path].
      $ lein run "path.csv" 10000 "permutations"

      # Estimates confidence intervals by bootstrapping for a multiple linear regression model
      # specified by data in [path].
      $ lein run "path.csv" 10000 "bootstrap-regression"

      # Approximates points in [path2] by a multiple linear regression model
      # specified by data in [path] using bootstrapping.
      $ lein run "path.csv" 10000 "bootstrap-accuracy" "cells.csv"

      # Runs bootstrap tests with the weight matrix in [path2] to check spatial autocorrelation
      # in residuals from a multiple linear regression model specified by data in [path].
      $ lein run "path.csv" 10000 "iid" "neighbours.csv"

## References
    [1] Anderson, M. (2001). Permutation tests for univariate or multivariate analysis of variance and regression.
        Canadian Journal of Fisheries and Aquatic Sciences, 58(3): 626-639. DOI: 10.1139/f01-004.
    [2] Freedman, D., & Lane, D. (1983). A Nonstochastic Interpretation of Reported Significance Levels.
        Journal of Business & Economic Statistics, 1(4): 292-298. DOI: 10.2307/1391660.
    [3] Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife.
        Annals of Statistics, 7(1): 1-26. DOI: 10.1214/aos/1176344552.
    [4] Efron, B., & Tibshirani, R. (1993). An Introduction to the Bootstrap.
        New York: Chapman and Hall.
    [5] Geary, R. (1954). The Contiguity Ratio and Statistical Mapping.
        The Incorporated Statistician, 5(3): 115-145. DOI: 10.2307/2986645.
    [6] Moran, P. (1950). Notes on Continuous Stochastic Phenomena.
        Biometrika, 37(1-2): 17-23. DOI: 10.2307/2332142.
    [7] Lin, K.-P., Long, Z.-H., & Ou, B. (2011). The Size and Power of Bootstrap Tests for Spatial Dependence in a Linear Regression Model.
        Computational Economics, 38(2): 153-171. DOI: 10.1007/s10614-010-9224-0.