Wednesday, July 11, 2012

Nonparametric Regression

Intro

To regress a left-hand-variable y on several right-hand-side variables X = {x_1, x_2,...,x_N} means to estimate the conditional expected value (mean) E[y|X]. Another way to formulate the problem is to say that y = g(X) +e , where e is the error term or residual.

Traditionally, calculating the conditional mean is done by minimizing the sum of squared residuals e = y - g(X), where we assume that we know the true functional form. For example we could assume a linear relationship between X and y so that g(X) = a + b1*x_1 + ...+ bN*x_N. Then the problem is reduced to estimating the unknown coefficients a, b1, b2, ... , bN.

Although computationally efficient, this approach suffers from one major limitation. The researcher needs to specify beforehand the functional form of g(X), which is often difficult to do. Therefore, the results of the traditional estimation are only as good as the assumption of the underlying function form of the relationship between y and X. 

In a non-parametric regression, the researcher can avoid making this uncomfortable assumptions about data generating process and can estimate the conditional mean E[y|X] with minimal assumptions about the data. The non-parametric approach relies on using kernel density estimates of g(X).

Example

For this example we will use ccard dataset in statsmodels. The data consists of the average credit card expenditures of 72 individuals along with information about their income, whether they own or rent a house and their age. Suppose we are interested in studying the effects of income on credit card spending. 

One way to proceed is parametrically by assuming that we know the functional relationship between Log(AvgExp) and income. The plot of the data suggests that a linear estimator will fit the data with high bias and maybe a quadratic relationship is more appropriate. We can use the OLS to estimate the following regression 
Log(AvgExp) = a + b1*income + b2* income^2 + e

Alternatively we can proceed non-parametrically without any strong assumptions about the functional relationship between credit card expenditure and income. To do that within statsmodels we can use the nonparametric estimators:

import statsmodels.nonparametric as nparam
model = nparam.Reg(tydat=[np.log(avgexp)], txdat=[income], var_type='c', bw='cv_lc')

where var_type specifies the type of the dependent variable (continuous in this case) and bw specifies that bandwidth regression estimator (in this case the cross validation least squares local constant estimator)

we can obtain the fitted values (the conditional mean) by
np_fitted = model.Cond_Mean()

and we can plot them against the OLS estimates:

Note that we didn't have to specify a quadratic relationship (include income^2 in the list of dependent variables) in the nonparametric approach.

We can obtain the R-Squared in the nonparametric approach:
model.R2()
which is 0.25

The bandwidth estimate is 0.78 (model.bw)

As usual, the nonparametric approach comes at a cost - the researcher needs more data for robust estimates. Furthermore, the data-driven bandwidth selection methods are significantly more computationally intensive. 

However a nonparametric approach will usually outperform an incorrectly specified parametric approach. For example if we assume a linear relationship between the log of average expenditure and income: Log(AvgExp) = a +b*income + e, then the sum of squared residuals for the OLS is 69.32, while for the nonparametric estimator is lower: 67.37






No comments:

Post a Comment