Thursday, July 12, 2012

GSoC Midterm Evaluations

My goal for the GSoC was to add to statsmodels the capability to estimate multivariate, mixed-data, unconditional and conditional densities with nonparametric methods, along with nonparametric regression and other popular nonparametric models.
The goal for the first half of the GSoC was to create the building blocks of a nonparametric estimation library: multivariate unconditional density estimation, multivariate conditional density estimation and nonparametric regression. Many of the models to come depend heavily on these three classes. Most of the work that I set out to do in my GSoC application has been completed. However, several parts still need to be completed and/or optimized.

The challenge when coding nonparametric estimators is twofold. On the one hand, you have to make sure that the estimators are correctly implemented and yield the right results; on the other, you need to program them efficiently, because many nonparametric methods are very computationally intensive and the time they need to converge can be prohibitively long.

Here is an overview of the completed parts:

I) Unconditional Multivariate Density Estimation


This is the fundamental class in nonparametric estimation. The goal is to calculate the joint probability density of several variables, which can be of mixed types (continuous, discrete ordered and discrete unordered): P(x1, x2, ..., xN). This is implemented in the UKDE class (acronym for Unconditional Kernel Density Estimation):

UKDE(tdat, var_type, bw)

where:
tdat is a list of 1D arrays comprising the training data
var_type is a string that specifies the variable types (e.g. 'ccouc' specifies that the variables are of type continuous, continuous, ordered, unordered and continuous, in that order)
bw is either a user-specified bandwidth, in which case it is a 1D vector of bandwidth values, or a string that specifies the bandwidth selection method. The possible methods are:

'cv_ls' for cross-validation least squares
'cv_ml' for cross-validation maximum likelihood
'normal_reference' for Scott's rule of thumb (aka the normal reference rule of thumb)

The UKDE class has the following two methods:

UKDE.pdf(edat)
UKDE.cdf(edat)

The pdf(edat) method calculates the probability density function at the evaluation points edat. edat is an N x K array, where K is the number of variables in tdat and N is the number of evaluation points (sets of points). If edat is not specified by the user, the pdf is estimated at the training data tdat.

The structure of UKDE.cdf(edat) is similar. The difference is that it returns the cumulative distribution function evaluated at edat: P(X<=x).
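As an illustration, here is a minimal sketch of the interface described above (the data and variable names are hypothetical, and the module is imported as in the examples later in this post):

import numpy as np
import statsmodels.nonparametric as nparam

np.random.seed(42)
x = np.random.normal(size=200)        # a continuous variable
y = np.random.poisson(3, size=200)    # an ordered discrete variable

dens = nparam.UKDE(tdat=[x, y], var_type='co', bw='normal_reference')
pdf_vals = dens.pdf()    # joint density f(x, y) at the training points
cdf_vals = dens.cdf()    # joint CDF P(X <= x, Y <= y) at the training points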

All components of UKDE (pdf, cdf, bw methods etc.) are tested in test_nonparametric2.py (with both real and generated data), and all results are identical (or very close) to those reported by the np package in R written by Jeff Racine (whose textbook and papers are the primary references for the methods discussed here). A few univariate graphical examples of the nonparametric estimation of popular distributions (F, Poisson, Pareto, Laplace etc.) are presented in ex_univar_kde.py.

The convergence time for the bandwidth selection methods has been tested and is consistent with the theoretical results.

II) Conditional Multivariate Density Estimation

The density of a set of variables Y=(y1,y2,...,yP), conditional on X=(x1,x2,...,xQ), is denoted f(Y|X) and equals the ratio of the joint density and the marginal density of X: f(Y|X) = f(Y,X)/f(X).
One difficulty that arises in conditional estimation is that the bandwidth selection procedure in the conditional case differs from the one in the unconditional case, which prevents simple recycling of the code in UKDE.

Conditional density estimation is implemented with the CKDE class (acronym for Conditional Kernel Density Estimation):
CKDE(tydat, txdat, dep_type, indep_type, bw)

where:
tydat is the dependent variable(s)
txdat is the independent variable(s)
dep_type and indep_type are strings specifying the types of the dependent and independent variables
bw is either user-specified bandwidth values or a bandwidth selection method (as in UKDE)

CKDE can also calculate the PDF and CDF of the conditional density with the two methods
CKDE.pdf(eydat, exdat) and CKDE.cdf(eydat, exdat)
If left unspecified, eydat and exdat default to tydat and txdat.

Note: currently eydat needs to be N x (P+Q) and exdat is NxQ. To be fixed soon.
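Here is a minimal sketch of the interface (hypothetical data; the defaults are used for eydat and exdat, so the shape caveat above does not apply):

import numpy as np
import statsmodels.nonparametric as nparam

np.random.seed(42)
x = np.random.normal(size=150)
y = 2 * x + np.random.normal(size=150)    # y depends on x

dens_c = nparam.CKDE(tydat=[y], txdat=[x], dep_type='c',
                     indep_type='c', bw='normal_reference')
pdf_vals = dens_c.pdf()    # f(y|x) evaluated at the training points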

The functionality of CKDE, including the pdf, cdf and bandwidth estimates, has been cross-checked with the np package and produces identical or very similar results. All tests are in test_nonparametric2.py.
The cross-validation least squares bandwidth selection method is still a little faster in R, so some optimization of the code for speed could be required.
The convergence time for the bandwidth selection methods has been tested and is consistent with the theoretical results.

III) Nonparametric Regression

The model is y = g(X) + e.
A regression estimates the conditional mean E[y|X]. When the functional form of g(X) is not specified, it can be estimated with kernel methods.

Nonparametric regression can be implemented with the class Reg:
Reg(tydat, txdat, var_type, reg_type, bw)
where
tydat is the dependent (left-hand-side) variable
txdat is a list of independent variables
var_type is a string specifying the types of the independent variables in txdat
reg_type is the type of estimator. It can take on two values: 
    'lc' - local constant (default)
    'll' - local linear estimator
bw: is either the user-specified vector of bandwidth values or a bandwidth selection method:
    'cv_ls' - cross validation least squares

Reg.mean(edat) returns the conditional mean E[y|X=edat]. If edat is not specified, it defaults to the training data txdat.
Reg.r_squared() returns the R-squared of the model (the model fit)
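Putting the pieces together, a minimal sketch of the interface just described (hypothetical data):

import numpy as np
import statsmodels.nonparametric as nparam

np.random.seed(42)
x = np.random.uniform(-2, 2, size=100)
y = x ** 2 + np.random.normal(scale=0.25, size=100)   # nonlinear relationship

model = nparam.Reg(tydat=[y], txdat=[x], var_type='c',
                   reg_type='ll', bw='cv_ls')
fitted = model.mean()        # E[y|X] at the training points
fit = model.r_squared()      # model fit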

The bandwidth selection methods, R-Squared and conditional mean have been tested and the results are identical or very similar to those produced by the np package in R. 

IV) List of completed elements

Kernels:
  • Gaussian kernel (for continuous variables)
  • Aitchison-Aitken kernel (for unordered discrete variables)
  • Wang-van Ryzin kernel (for ordered discrete variables; both discrete kernels are sketched below this list)
  • Epanechnikov kernel -- has been dropped from KernelFunctions.py (can be added again at some point if we decide to give the user the ability to specify kernels. However, the literature is unanimous that the kernel type is of small importance to the density estimation)
  • "Convolution kernels" for all three types of kernels (used in the conditional pdf and cdf estimation
  • "CDF" kernels for all three types of kernels (the integrals of the kernels) used for the fast estimation of conditional and unconditional cumulative distribution
Density Estimation
  • Normal reference rule of thumb bandwidth selection method (Scott's) for both UKDE and CKDE
  • Cross-validation least squares for both UKDE and CKDE for mixed data types
  • Cross-validation maximum likelihood for both UKDE and CKDE for mixed data types
  • Probability density function (PDF) for both UKDE and CKDE for mixed data types
  • Cumulative distribution function (CDF) for both UKDE and CKDE for mixed data types
Nonparametric Regression
  • Local constant estimator
  • Local linear estimator
  • Cross validation least squares bandwidth selection method for both estimators
  • Conditional mean for both estimators
  • R-Squared (fit for the model)
Other
  • Tests for all density and regression methods and bandwidth selection procedures
  • Graphical example with popular distributions in ex_univar_kde.py
  • Docstrings with references, LaTeX formulas and (some) examples
The plug-in bandwidth selection methods have been dropped because of their lack of robustness in a multivariate setting and inferiority to the data-driven methods like cv_ls and cv_ml.

V) Still to do:

  • Add marginal effects for regression
  • Add significance tests for regression
  • Add AIC bandwidth selection method to regression class
(I am currently working on these and they might be ready by the end of the week.)

Wednesday, July 11, 2012

KDE estimate of Pareto RV


import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.nonparametric as nparam

a = 2
N = 150
# numpy's pareto draws from the Lomax (Pareto II) distribution, which starts
# at 0; shifting by 1 gives the classical Pareto that stats.pareto(a) describes
support = np.random.pareto(a, size=N) + 1
rv = stats.pareto(a)
ix = np.argsort(support)

dens_normal = nparam.UKDE(tdat=[support], var_type='c', bw='normal_reference')
dens_cvls = nparam.UKDE(tdat=[support], var_type='c', bw='cv_ls')
dens_cvml = nparam.UKDE(tdat=[support], var_type='c', bw='cv_ml')

plt.figure(3)
plt.plot(support[ix], rv.pdf(support[ix]), label='Actual')
plt.plot(support[ix], dens_normal.pdf()[ix], label='Scott')
plt.plot(support[ix], dens_cvls.pdf()[ix], label='CV_LS')
plt.plot(support[ix], dens_cvml.pdf()[ix], label='CV_ML')
plt.title("Nonparametric Estimation of the Density of a Pareto Distributed Random Variable")
plt.legend()

Out-of-sample performance of OLS vs. Nonparametric Regression

The true usefulness of a regression estimator should be judged by its out-of-sample performance. It is easy to achieve a perfect fit for a finite sample by fitting an N-order polynomial to the data. However, the estimator will then suffer from "high variance", i.e. poor out-of-sample forecasting accuracy.

Let's build on the example from the previous post http://statsmodels-np.blogspot.com/2012/07/nonparametric-regression.html

To summarize, we are using the ccard dataset in statsmodels, which has data on the average credit card expenditures of 72 individuals. We are interested in studying the relationship between income and credit card expenditure. We saw that, while a linear relationship is inadequate, a quadratic relationship provides a good fit for the data.

Let's use the first 57 observations to train our estimates. To get the kernel regression estimate in statsmodels:

model = nparam.Reg(tydat=[np.log(avgexp[0:57])], txdat=[income[0:57]], var_type='c', bw='cv_lc')


We obtain the least-squares coefficients for the model Log(AvgExp) = a + b1*income + b2*income^2 + e to be:
a = 3.11, b1 = 0.72, and b2 = -0.04

We then use the estimates from both models to forecast the remaining 15 observations. The accuracy of the forecast is judged by the sum of squared forecast errors (SSFE) and the mean squared forecast error (MSFE).
To obtain the nonparametric out-of-sample forecasts we simply specify the edat input of the Cond_Mean method of the Reg class:

np_fcast = model.Cond_Mean(edat=income[57:72])
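The error metrics can then be computed along these lines (a sketch; avgexp is the same expenditure series used to fit the model):

import numpy as np

actual = np.log(avgexp[57:72])    # the 15 held-out observations
err = actual - np_fcast
ssfe = np.sum(err ** 2)           # sum of squared forecast errors
msfe = np.mean(err ** 2)          # mean squared forecast error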

The SSFE with OLS is 12.09, while in the nonparametric case it is only 3.416.

The MSFE with OLS is 0.81, while in the nonparametric case it is significantly lower: 0.23.

Nonparametric Regression

Intro

To regress a left-hand-side variable y on several right-hand-side variables X = {x_1, x_2, ..., x_N} means to estimate the conditional expected value (mean) E[y|X]. Another way to formulate the problem is to say that y = g(X) + e, where e is the error term or residual.

Traditionally, calculating the conditional mean is done by minimizing the sum of squared residuals e = y - g(X), where we assume that we know the true functional form. For example, we could assume a linear relationship between X and y, so that g(X) = a + b1*x_1 + ... + bN*x_N. Then the problem reduces to estimating the unknown coefficients a, b1, b2, ..., bN.

Although computationally efficient, this approach suffers from one major limitation: the researcher needs to specify the functional form of g(X) beforehand, which is often difficult to do. Therefore, the results of the traditional estimation are only as good as the assumption about the underlying functional form of the relationship between y and X.

In a nonparametric regression, the researcher avoids these uncomfortable assumptions about the data generating process and can estimate the conditional mean E[y|X] with minimal assumptions about the data. The nonparametric approach relies on kernel estimates of g(X).

Example

For this example we will use the ccard dataset in statsmodels. The data consist of the average credit card expenditures of 72 individuals, along with information about their income, whether they own or rent a house, and their age. Suppose we are interested in studying the effects of income on credit card spending.
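For reference, the variables can be pulled from the dataset along these lines (a sketch; AVGEXP and INCOME are the relevant columns in statsmodels.datasets.ccard):

import numpy as np
from statsmodels.datasets import ccard

data = ccard.load().data     # record array with the 72 observations
avgexp = data['AVGEXP']      # average credit card expenditure
income = data['INCOME']      # income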

One way to proceed is parametrically, by assuming that we know the functional relationship between Log(AvgExp) and income. The plot of the data suggests that a linear estimator would fit the data with high bias and that a quadratic relationship may be more appropriate. We can use OLS to estimate the regression
Log(AvgExp) = a + b1*income + b2*income^2 + e
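A sketch of this parametric benchmark, assuming avgexp and income as loaded above:

import numpy as np
import statsmodels.api as sm

X = sm.add_constant(np.column_stack((income, income ** 2)))
ols_res = sm.OLS(np.log(avgexp), X).fit()
ols_res.params    # the estimated coefficients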

Alternatively, we can proceed nonparametrically, without any strong assumptions about the functional relationship between credit card expenditure and income. To do that within statsmodels we can use the nonparametric estimators:

import statsmodels.nonparametric as nparam
model = nparam.Reg(tydat=[np.log(avgexp)], txdat=[income], var_type='c', bw='cv_lc')

where var_type specifies the type of the independent variable (continuous in this case) and bw specifies the bandwidth selection method and regression estimator (in this case cross-validation least squares with the local constant estimator).

We can obtain the fitted values (the conditional mean) by
np_fitted = model.Cond_Mean()

and we can plot them against the OLS estimates.

Note that we didn't have to specify a quadratic relationship (i.e. include income^2 in the list of independent variables) in the nonparametric approach.

We can obtain the R-squared of the nonparametric fit with
model.R2()
which is 0.25.

The bandwidth estimate is 0.78 (model.bw)

As usual, the nonparametric approach comes at a cost: the researcher needs more data for robust estimates. Furthermore, the data-driven bandwidth selection methods are significantly more computationally intensive.

However, a nonparametric approach will usually outperform an incorrectly specified parametric approach. For example, if we assume a linear relationship between the log of average expenditure and income, Log(AvgExp) = a + b*income + e, then the sum of squared residuals for OLS is 69.32, while for the nonparametric estimator it is lower: 67.37.