statsmodels provides data sets (i.e. data and meta-data) for use in examples, tutorials, model testing, etc.
The Rdatasets project gives access to the datasets available in R’s core datasets package and many other common R packages. All of these datasets are available to statsmodels by using the get_rdataset function. For example:
In [1]: import statsmodels.api as sm
In [2]: duncan_prestige = sm.datasets.get_rdataset("Duncan", "car")
In [3]: print(duncan_prestige.__doc__)
In [4]: duncan_prestige.data.head(5)
get_rdataset(dataname[, package, cache]) | Download and return an R dataset. |
get_data_home([data_home]) | Return the path of the statsmodels data dir. |
clear_data_home([data_home]) | Delete all the content of the data home cache. |
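The two cache helpers in the table above can be pictured with a minimal sketch of a data-home directory. The names resolve_data_home, clear_cache, EXAMPLE_DATA_HOME, and the default folder are hypothetical and chosen only for illustration; the statsmodels functions may resolve and clear their directory differently.

```python
import os
import shutil
import tempfile

# Hypothetical names for illustration; not part of statsmodels.
ENV_VAR = "EXAMPLE_DATA_HOME"
DEFAULT_DIR = os.path.join("~", "example_data")


def resolve_data_home(data_home=None):
    """Return the cache directory, creating it if it does not exist."""
    if data_home is None:
        # An explicit argument wins; otherwise fall back to the
        # environment variable, then to the default folder.
        data_home = os.environ.get(ENV_VAR, DEFAULT_DIR)
    data_home = os.path.expanduser(data_home)
    os.makedirs(data_home, exist_ok=True)
    return data_home


def clear_cache(data_home=None):
    """Delete the cache directory and everything inside it."""
    shutil.rmtree(resolve_data_home(data_home))


# Work inside a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    scratch = os.path.join(tmp, "cache")
    home = resolve_data_home(scratch)
    created = os.path.isdir(home)
    clear_cache(scratch)
    removed = not os.path.exists(scratch)
```

Passing an explicit path, as in the usage lines, keeps the sketch from touching the home directory.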
Load a dataset:
In [5]: import statsmodels.api as sm
In [6]: data = sm.datasets.longley.load()
The Dataset object follows the bunch pattern explained in the dataset proposal.
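For readers unfamiliar with the bunch pattern, here is a minimal sketch of the idea: a dictionary whose keys can also be read as attributes. The class name Bunch and the sample values are assumptions made for illustration; this is not presented as the statsmodels Dataset class itself.

```python
class Bunch(dict):
    """Dictionary whose keys are also readable as attributes."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Point the instance namespace at the dict itself, so that
        # bunch.key and bunch["key"] refer to the same entry.
        self.__dict__ = self


# Made-up values, for illustration only.
example = Bunch(endog=[1.0, 2.0, 3.0], endog_name="y", names=["y"])

# Attributes set later are visible as dictionary keys too.
example.note = "added later"
```

With this layout, example.endog and example["endog"] return the same list, which is what makes attribute access such as data.endog possible on a bunch-style object.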
Most datasets hold convenient representations of the data in the attributes endog and exog:
In [7]: data.endog[:5]
In [8]: data.exog[:5,:]
Univariate datasets, however, do not have an exog attribute.
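Code that must handle both kinds of dataset can check for the attribute before using it. The sketch below uses SimpleNamespace objects as stand-ins for loaded datasets, and the helper describe is hypothetical; only the presence or absence of exog mirrors the behaviour described above.

```python
from types import SimpleNamespace


def describe(dataset):
    """Return (number of observations, number of regressors)."""
    # getattr with a default avoids an AttributeError when the
    # dataset is univariate and therefore has no exog.
    exog = getattr(dataset, "exog", None)
    n_obs = len(dataset.endog)
    n_regressors = 0 if exog is None else len(exog[0])
    return n_obs, n_regressors


# Stand-in objects with made-up values, not real statsmodels datasets.
univariate = SimpleNamespace(endog=[1.0, 2.0, 3.0])
multivariate = SimpleNamespace(
    endog=[1.0, 2.0],
    exog=[[1.0, 0.5], [1.0, 0.7]],
)

uni_shape = describe(univariate)
multi_shape = describe(multivariate)
```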
Variable names can be obtained by typing:
In [9]: data.endog_name
In [10]: data.exog_name
If the dataset does not have a clear interpretation of what should be an endog and exog, then you can always access the data or raw_data attributes. This is the case for the macrodata dataset, which is a collection of US macroeconomic data rather than a dataset with a specific example in mind. The data attribute contains a record array of the full dataset and the raw_data attribute contains an ndarray with the names of the columns given by the names attribute.
In [11]: type(data.data)
In [12]: type(data.raw_data)
In [13]: data.names
Many users may prefer to get the datasets as a pandas DataFrame or Series object. Each of the dataset modules provides a load_pandas function, which returns a Dataset instance with the data as pandas objects:
In [14]: data = sm.datasets.longley.load_pandas()
In [15]: data.exog
In [16]: data.endog
With pandas integration in the estimation classes, the metadata will be attached to model results:
In [17]: y, x = data.endog, data.exog
In [18]: res = sm.OLS(y, x).fit()
In [19]: res.params
In [20]: res.summary()
If you want to know more about the dataset itself, you can access the following attributes, again using the Longley dataset as an example:
>>> dir(sm.datasets.longley)[:6]
['COPYRIGHT', 'DESCRLONG', 'DESCRSHORT', 'NOTE', 'SOURCE', 'TITLE']