The Datasets Package

statsmodels provides datasets (i.e. data and metadata) for use in examples, tutorials, model testing, etc.

Using Datasets from R

The Rdatasets project gives access to the datasets available in R’s core datasets package and many other common R packages. All of these datasets are available to statsmodels by using the get_rdataset function. For example:

In [1]: import statsmodels.api as sm

In [2]: duncan_prestige = sm.datasets.get_rdataset("Duncan", "car")

In [3]: print(duncan_prestige.__doc__)

In [4]: duncan_prestige.data.head(5)

R Datasets Function Reference

get_rdataset(dataname[, package, cache])    Download and return an R dataset.
get_data_home([data_home])                  Return the path of the statsmodels data directory.
clear_data_home([data_home])                Delete all the content of the data home cache.
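
The caching idea behind get_data_home and clear_data_home can be sketched with the standard library alone. This is a minimal illustration, not the statsmodels implementation: the helper names resolve_data_home and clear_cache, the default folder name, and the placeholder file are all assumptions made for the example.

```python
import os
import shutil
import tempfile


def resolve_data_home(data_home=None):
    """Return a cache directory, creating it if needed (hypothetical helper)."""
    if data_home is None:
        # Assumed default location, chosen for this sketch only.
        data_home = os.path.join("~", "example_data_cache")
    data_home = os.path.expanduser(data_home)
    os.makedirs(data_home, exist_ok=True)
    return data_home


def clear_cache(data_home=None):
    """Delete the cache directory and everything in it (hypothetical helper)."""
    shutil.rmtree(resolve_data_home(data_home))


# Work inside a throwaway directory so nothing in the home folder is touched.
scratch = tempfile.mkdtemp()
cache_dir = resolve_data_home(os.path.join(scratch, "cache"))

# Pretend a downloaded dataset was stored in the cache.
cached_file = os.path.join(cache_dir, "Duncan.csv")
with open(cached_file, "w") as handle:
    handle.write("placeholder\n")

was_cached = os.path.exists(cached_file)
clear_cache(cache_dir)
still_cached = os.path.exists(cached_file)

# Remove the throwaway directory itself.
shutil.rmtree(scratch)
```

Passing an explicit directory, as done here, keeps the example from writing anywhere outside the temporary folder.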

Usage

Load a dataset:

In [5]: import statsmodels.api as sm

In [6]: data = sm.datasets.longley.load()

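The Dataset object follows the bunch pattern explained in the datasets proposal: the data and its metadata are stored together and exposed as attributes of one object.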
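
For readers unfamiliar with the bunch pattern, here is one common way to write it in plain Python. It is a sketch of the general idea, not the statsmodels Dataset class; the class name Bunch and all values are made up for illustration.

```python
class Bunch(dict):
    """A dict whose keys can also be read and written as attributes."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        # Store new attributes as dictionary entries.
        self[name] = value


# Made-up values, used only to show the access pattern.
example = Bunch(
    endog=[1.0, 2.0, 3.0],
    exog=[[10.0, 0.1], [20.0, 0.2], [30.0, 0.3]],
    endog_name="y",
    exog_name=["x1", "x2"],
)

first_response = example.endog[0]      # attribute access
same_response = example["endog"][0]    # dictionary access to the same data
example.note = "metadata travels with the data"
```

The appeal of the pattern is that data and descriptive fields live in one object and can be reached with either syntax.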

Most datasets hold convenient representations of the data in the attributes endog and exog:

In [7]: data.endog[:5]

In [8]: data.exog[:5,:]

Univariate datasets, however, do not have an exog attribute.
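
Code that accepts arbitrary datasets should therefore not assume that exog is present. A small sketch of a defensive check follows; types.SimpleNamespace stands in for loaded datasets, and the helper name describe and all numbers are placeholders rather than anything provided by statsmodels.

```python
from types import SimpleNamespace


def describe(dataset):
    """Summarize a dataset-like object, tolerating a missing exog."""
    # getattr with a default covers both "no attribute" and "attribute is None".
    exog = getattr(dataset, "exog", None)
    if exog is None:
        return "univariate: {} observations".format(len(dataset.endog))
    return "{} observations, {} regressors".format(
        len(dataset.endog), len(exog[0])
    )


# Stand-ins for loaded datasets; the numbers are placeholders.
univariate = SimpleNamespace(endog=[5.0, 11.0, 16.0])
multivariate = SimpleNamespace(endog=[1.0, 2.0], exog=[[1.0, 0.5], [1.0, 0.7]])

univariate_summary = describe(univariate)
multivariate_summary = describe(multivariate)
```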

Variable names can be obtained by typing:

In [9]: data.endog_name

In [10]: data.exog_name

If the dataset does not have a clear interpretation of what should be an endog and exog, then you can always access the data or raw_data attributes. This is the case for the macrodata dataset, which is a collection of US macroeconomic data rather than a dataset with a specific example in mind. The data attribute contains a record array of the full dataset and the raw_data attribute contains an ndarray with the names of the columns given by the names attribute.

In [11]: type(data.data)

In [12]: type(data.raw_data)

In [13]: data.names

Loading data as pandas objects

Many users will prefer to get the datasets as a pandas DataFrame or Series object. Each of the dataset modules provides a load_pandas function, which returns a Dataset instance with the data as pandas objects:

In [14]: data = sm.datasets.longley.load_pandas()

In [15]: data.exog

In [16]: data.endog

With pandas integration in the estimation classes, the metadata will be attached to model results:

In [17]: y, x = data.endog, data.exog

In [18]: res = sm.OLS(y, x).fit()

In [19]: res.params

In [20]: res.summary()

Extra Information

If you want to know more about the dataset itself, you can access the following attributes, again using the Longley dataset as an example:

>>> dir(sm.datasets.longley)[:6]
['COPYRIGHT', 'DESCRLONG', 'DESCRSHORT', 'NOTE', 'SOURCE', 'TITLE']
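
Those uppercase names are ordinary module-level attributes, so they can be collected programmatically. The sketch below uses a stand-in module built with types.ModuleType rather than a real statsmodels dataset module; the module name, the helper collect_metadata, and the strings are all placeholders.

```python
import types

# Stand-in for a dataset module; every value here is a placeholder.
fake_dataset = types.ModuleType("fake_dataset")
fake_dataset.TITLE = "Example title"
fake_dataset.SOURCE = "Example source"
fake_dataset.NOTE = "Example note"
fake_dataset.load = lambda: None


def collect_metadata(module):
    """Return the all-uppercase, string-valued attributes of a module."""
    return {
        name: getattr(module, name)
        for name in dir(module)
        if name.isupper() and isinstance(getattr(module, name), str)
    }


metadata = collect_metadata(fake_dataset)
```

With statsmodels installed, passing sm.datasets.longley to the same helper should gather its descriptive strings in the same way, assuming they are plain strings.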

Additional information

  • The idea for a datasets package was originally proposed by David Cournapeau, with later updates by Skipper Seabold.
  • To add datasets, see the notes on adding a dataset.