Background¶
Why xarray-einstats?¶
xarray-einstats
aims to cover two important limitations one can encounter
when working with xarray objects:
Labeled dimensions become a nuisance instead of being helpful¶
If you need to perform many specific or niche computations such as
linear algebra or statistical operations, you end up with a choice
between calling the functions as is or wrapping them with xarray.apply_ufunc
.
This means either losing (or ignoring) all the dims and coords information or
adding multiple lines of code per operation.
The half way solution between those two would be to group all operations and wrap the result, which is actually the first case, we need to operate on numpy arrays within the wrapped function so are forced to ignore the labels. At the end of the day, we can’t take advantage of the labels during those generally critical computations yet we are still forced to write longer code to not loose the info completely.
Moreover, the arguments to apply_ufunc
are not intuitive
(nor they can be given the virtually endless flexibility they allow).
We end up writing longer more verbose code but it doesn’t really become
more clear.
Functions don’t work on high dimensional data¶
Most functions were designed for 1-3D data and many have not been updated
to broadcast and handle multidimensional data properly. Usually this
is due to the functions requesting a specific number of dimensions
for the input arguments (i.e. the covariance matrix of a multidimensional
normal must be (N, N)
, it can’t be (..., N, N)
) or because
the axis
/axes
argument only accepts integers.
When this happens, wrapping those functions and allowing users to call them with labeled data and labels instead of axis numbers, they are still not useful. If you have 4+ dimensions, chances are you’ll need to reduce two at a time at some point.
Approach¶
We believe the best way to tackle those two issues if by wrapping
our target functions using apply_ufunc
and extending them when necessary.
Extending can range from some stacking or automatic renaming
of dimensions to avoid conflicts and repeated names to using
numba
to generate proper broadcasting ufuncs out of shape limited
functions.
Our goal is not to reimplement any of those functions ourselves,
and we will “move back” to a simpler extension once the upstream function
is improved/extended. Extensions however are still necessary because
upstreaming those changes take too much time and/or effort. In fact,
in many cases the process has already started, see this GitHub issue
about extending numpy.random.multivariate_normal
for example,
but it is often hard to reach an agreement on what is the best way forward.
The general approach is to wrap our target functions using apply_ufunc
,
which generally reduces to a 1-3 line wrapper changing axis
for dims
that converts dims
to input/output_core_dims
as needed.
We believe even such a simple wrapper can have a huge impact as it makes
the code more clear and allows any user to use such functions, as many
might be intimidated by apply_ufunc
.
The second approach is some minimal extension using xarray and string methods. Some examples of this are
stacking some dimensions of the input array, calling a function that accepts only integer
axis
, and then unstacking before returning the resultcreating or renaming new dimensions when dimensions are repeated or duplicated.
Lastly, the 3rd approach which is much more complex and requires significant
extra effort is to use numba
or jax
to extend the function to proper
support for multidimensional data, then wrap it using apply_ufunc
.
As we have said, this is not our goal and we only resort to this option
for critical functions.
We also combine this layering approach to wrappers with strict modularity.
Wrappers are structured in completely independent modules
and not imported explicitly in the highest level namespace
(except for some numpy only wrappers).
This makes xarray-einstats
depend only on xarray even if we
have modules wrapping functions from other libraries which are
then only optional dependencies.
More details about the design decisions and implementation details of each module can be found in their specific “About …” pages: