About the stats module¶
The stats module is composed of the wrappers for statistical distributions and wrappers for other statistical function and summaries. These two follow completely different approaches which are discussed in the two sections below, one per block.
This page assumes you have already read Intro to the stats module.
About distribution wrappers¶
There are only two classes that serve as wrappers to all statistical
distributions in scipy.stats
, one for continuous distributions
and another for discrete ones. This has some drawbacks related to preservation
of named dimensions and coordinates, but we believe its simplicity and ease of
maintenance outweighs this drawbacks.
The two wrappers XrContinuousRV
and
XrDiscreteRV
take as first
argument the scipy distribution to be wrapped and then optional
args and kwargs. Methods take also a mixture of args and kwargs,
mimicking the behaviour of scipy distributions where the
scale for example can be defined either at creation time
or when calling a method and it can be passed as a positional
or keyword argument.
The xarray einstats wrapper classes however, instead of initializing the distributions
at creation store the distribution and initialization args and kwargs. Then,
whenever a method is called the args and kwargs provided via the method
and the ones provided at initialization are combined and broadcasted.
This happens in the _broadcast_args
method of xarray_einstats.stats.XrRV
.
The combined+broadcasted arguments are used to call the scipy distribution via
xarray.apply_ufunc
, which ensures that the shapes will be compatible.
As the same wrappers are used for all distributions, even if both positional and
keyword arguments are broadcasted, they are used as provided when calling apply_ufunc
.
The main drawback of this approach is that apply_ufunc
is only able to preserve
the dimensions and coordinates of positional arguments. Therefore, given
two equivalent wrappers, one using positional and another using keyword arguments,
there are some edge cases where the one using keyword arguments will return numpy
arrays instead of DataArray
s. Values are the same in both cases, but one
case has lost all information about named dimensions and coordinates.
The arguably more common and annoying case of such behaviour is with the .rvs
method.
import numpy as np
from scipy import stats
from xarray_einstats.tutorial import generate_mcmc_like_dataset
from xarray_einstats import stats as xtats
ds = generate_mcmc_like_dataset(3)
dist_pos = xtats.XrContinuousRV(stats.norm, ds["mu"], ds["sigma"])
dist_kw = xtats.XrContinuousRV(stats.norm, loc=ds["mu"], scale=ds["sigma"])
rvs_pos = dist_pos.rvs(size=5, random_state=7)
rvs_kw = dist_kw.rvs(size=5, random_state=7)
allclose = np.allclose(rvs_pos, rvs_kw)
print(f"Output type of rv_pos: {type(rvs_pos)}")
print(f"Output type of rv_kw: {type(rvs_kw)}")
print(f"\nCheck all values are indeed equal in both cases: {allclose}")
Output type of rv_pos: <class 'xarray.core.dataarray.DataArray'>
Output type of rv_kw: <class 'numpy.ndarray'>
Check all values are indeed equal in both cases: True
In other methods, this is more complicated to trigger, because only one positional argument is enough to preserve all information. As the rest of the methods convert array input to xarray under the hood, the following code doesn’t lose any labels:
dist_kw.pdf(np.linspace(-3, 3))
<xarray.DataArray (point: 50, chain: 4, draw: 10, team: 6)> Size: 96kB 0.07148 0.06092 0.03051 0.01553 0.06261 ... 0.0711 0.08956 0.07408 0.05843 Coordinates: * point (point) float64 400B -3.0 -2.878 -2.755 -2.633 ... 2.755 2.878 3.0 * chain (chain) int64 32B 0 1 2 3 * draw (draw) int64 80B 0 1 2 3 4 5 6 7 8 9 * team (team) <U1 24B 'a' 'b' 'c' 'd' 'e' 'f'
About statistical function and summary wrappers¶
Most wrappers here are minimal wrappers, that generally spend more time handling argument defaults. The general pattern of these wrappers is the following:
Handle arguments. Arguments whose information is not needed by the wrapper generally default to
None
and are not included in thedict
passed toapply_ufunc
askwargs
argument. This covers us from having to track changes in scipy and update our argument defaults.(optional) Take care of arguments that accept array values, that in
xarray_einstats
takeDataArray
s. They are broadcasted and aligned so the computation works.Stack/reshape if necessary.
xarray_einstats
uses adims
argument that differs from scipyaxis
because it takes strings and because sequences of strings are also valid. When multiple dimensions are provided viadims
they are stacked before calling scipy as it only takes integeraxis
Call the scipy function.
Steps 3 and 4 are generally done via xarray_einstats.stats._apply_reduce_func
and
xarray_einstats.stats._apply_nonreduce_func
. However, if necessary, they are
done manually (for now this only happens with median_abs_deviation
)