About the stats module¶

The stats module is composed of the wrappers for statistical distributions and wrappers for other statistical function and summaries. These two follow completely different approaches which are discussed in the two sections below, one per block.

This page assumes you have already read Intro to the stats module.

About distribution wrappers¶

There are only two classes that serve as wrappers to all statistical distributions in scipy.stats, one for continuous distributions and another for discrete ones. This has some drawbacks related to preservation of named dimensions and coordinates, but we believe its simplicity and ease of maintenance outweighs this drawbacks.

The two wrappers XrContinuousRV and XrDiscreteRV take as first argument the scipy distribution to be wrapped and then optional args and kwargs. Methods take also a mixture of args and kwargs, mimicking the behaviour of scipy distributions where the scale for example can be defined either at creation time or when calling a method and it can be passed as a positional or keyword argument.

The xarray einstats wrapper classes however, instead of initializing the distributions at creation store the distribution and initialization args and kwargs. Then, whenever a method is called the args and kwargs provided via the method and the ones provided at initialization are combined and broadcasted. This happens in the _broadcast_args method of xarray_einstats.stats.XrRV. The combined+broadcasted arguments are used to call the scipy distribution via xarray.apply_ufunc, which ensures that the shapes will be compatible.

As the same wrappers are used for all distributions, even if both positional and keyword arguments are broadcasted, they are used as provided when calling apply_ufunc. The main drawback of this approach is that apply_ufunc is only able to preserve the dimensions and coordinates of positional arguments. Therefore, given two equivalent wrappers, one using positional and another using keyword arguments, there are some edge cases where the one using keyword arguments will return numpy arrays instead of DataArrays. Values are the same in both cases, but one case has lost all information about named dimensions and coordinates. The arguably more common and annoying case of such behaviour is with the .rvs method.

import numpy as np
from scipy import stats
from xarray_einstats.tutorial import generate_mcmc_like_dataset
from xarray_einstats import stats as xtats
ds = generate_mcmc_like_dataset(3)

dist_pos = xtats.XrContinuousRV(stats.norm, ds["mu"], ds["sigma"])
dist_kw = xtats.XrContinuousRV(stats.norm, loc=ds["mu"], scale=ds["sigma"])

rvs_pos = dist_pos.rvs(size=5, random_state=7)
rvs_kw = dist_kw.rvs(size=5, random_state=7)
allclose = np.allclose(rvs_pos, rvs_kw)

print(f"Output type of rv_pos: {type(rvs_pos)}")
print(f"Output type of rv_kw:  {type(rvs_kw)}")
print(f"\nCheck all values are indeed equal in both cases: {allclose}")

Output type of rv_pos: <class 'xarray.core.dataarray.DataArray'>
Output type of rv_kw:  <class 'xarray.core.dataarray.DataArray'>

Check all values are indeed equal in both cases: True

In other methods, this is more complicated to trigger, because only one positional argument is enough to preserve all information. As the rest of the methods convert array input to xarray under the hood, the following code doesn’t lose any labels:

About statistical function and summary wrappers¶

Most wrappers here are minimal wrappers, that generally spend more time handling argument defaults. The general pattern of these wrappers is the following:

Handle arguments. Arguments whose information is not needed by the wrapper generally default to None and are not included in the dict passed to apply_ufunc as kwargs argument. This covers us from having to track changes in scipy and update our argument defaults.
(optional) Take care of arguments that accept array values, that in xarray_einstats take DataArrays. They are broadcasted and aligned so the computation works.
Stack/reshape if necessary. xarray_einstats uses a dims argument that differs from scipy axis because it takes strings and because sequences of strings are also valid. When multiple dimensions are provided via dims they are stacked before calling scipy as it only takes integer axis
Call the scipy function.

Steps 3 and 4 are generally done via xarray_einstats.stats._apply_reduce_func and xarray_einstats.stats._apply_nonreduce_func. However, if necessary, they are done manually (for now this only happens with median_abs_deviation)