xarray-einstats aims to cover two important limitations one can encounter
when working with xarray objects:
Labeled dimensions become a nuisance instead of being helpful#
If you need to perform many specific or niche computations such as
linear algebra or statistical operations, you end up with a choice
between calling the functions as is or wrapping them with
This means either losing (or ignoring) all the dims and coords information or
adding multiple lines of code per operation.
The half way solution between those two would be to group all operations and wrap the result, which is actually the first case, we need to operate on numpy arrays within the wrapped function so are forced to ignore the labels. At the end of the day, we can’t take advantage of the labels during those generally critical computations yet we are still forced to write longer code to not loose the info completely.
Moreover, the arguments to
apply_ufunc are not intuitive
(nor they can be given the virtually endless flexibility they allow).
We end up writing longer more verbose code but it doesn’t really become
Functions don’t work on high dimensional data#
Most functions were designed for 1-3D data and many have not been updated
to broadcast and handle multidimensional data properly. Usually this
is due to the functions requesting a specific number of dimensions
for the input arguments (i.e. the covariance matrix of a multidimensional
normal must be
(N, N), it can’t be
(..., N, N)) or because
axes argument only accepts integers.
When this happens, wrapping those functions and allowing users to call them with labeled data and labels instead of axis numbers, they are still not useful. If you have 4+ dimensions, chances are you’ll need to reduce two at a time at some point.
We believe the best way to tackle those two issues if by wrapping
our target functions using
apply_ufunc and extending them when necessary.
Extending can range from some stacking or automatic renaming
of dimensions to avoid conflicts and repeated names to using
numba to generate proper broadcasting ufuncs out of shape limited
Our goal is not to reimplement any of those functions ourselves,
and we will “move back” to a simpler extension once the upstream function
is improved/extended. Extensions however are still necessary because
upstreaming those changes take too much time and/or effort. In fact,
in many cases the process has already started, see this GitHub issue
numpy.random.multivariate_normal for example,
but it is often hard to reach an agreement on what is the best way forward.
The general approach is to wrap our target functions using
which generally reduces to a 1-3 line wrapper changing
input/output_core_dims as needed.
We believe even such a simple wrapper can have a huge impact as it makes
the code more clear and allows any user to use such functions, as many
might be intimidated by
The second approach is some minimal extension using xarray and string methods. Some examples of this are
stacking some dimensions of the input array, calling a function that accepts only integer
axis, and then unstacking before returning the result
creating or renaming new dimensions when dimensions are repeated or duplicated.
Lastly, the 3rd approach which is much more complex and requires significant
extra effort is to use
jax to extend the function to proper
support for multidimensional data, then wrap it using
As we have said, this is not our goal and we only resort to this option
for critical functions.
We also combine this layering approach to wrappers with strict modularity.
Wrappers are structured in completely independent modules
and not imported explicitly in the highest level namespace
(except for some numpy only wrappers).
xarray-einstats depend only on xarray even if we
have modules wrapping functions from other libraries which are
then only optional dependencies.
More details about the design decisions and implementation details of each module can be found in their specific “About …” pages: