Python-Xarray einstats v0.1.1: Stats, linear algebra and einops for xarray

ArviZ

Jan 14, 2022


Latest Release: v0.1.1

Initial release of xarray_einstats.

xarray_einstats extends array manipulation libraries for use with xarray. It starts with 4 modules:

linalg -> extends functionality from numpy.linalg module

stats -> extends functionality from scipy.stats module

einops -> extends einops library, which needs to be installed

numba -> miscellaneous extensions (only numpy.histogram for now) that need numba to accelerate and/or vectorize the functions. numba needs to be installed to use this module

The version number v0.1.1 reflects the second attempt at uploading to PyPI

⚠️
Caution: This project is still in a very early development stage

Installation

To install, run

(.venv) $ pip install xarray-einstats

Overview

As stated on their website:

xarray makes working with multi-dimensional labeled arrays simple, efficient and fun!

xarray code is often more verbose, but generally because it is clearer and thus less error prone and more intuitive. Here are some examples of this trade-off where we believe the increased clarity is worth the extra characters:

| numpy | xarray |
|---|---|
| a[2, 5] | da.sel(drug="paracetamol", subject=5) |
| a.mean(axis=(0, 1)) | da.mean(dim=("chain", "draw")) |
| a.reshape((-1, 10)) | da.stack(sample=("chain", "draw")) |
| a.transpose(2, 0, 1) | da.transpose("drug", "chain", "draw") |
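
To make the comparison concrete, here is a minimal sketch that runs several of these pairs side by side. The array `a`, its shape, and the dimension and coordinate names are assumptions made up for illustration; they are not the data used elsewhere on this page.

```python
import numpy as np
import xarray as xr

# Hypothetical data: 4 chains, 10 draws, 3 drugs (names chosen for illustration)
rng = np.random.default_rng(0)
a = rng.normal(size=(4, 10, 3))
da = xr.DataArray(
    a,
    dims=("chain", "draw", "drug"),
    coords={"drug": ["paracetamol", "ibuprofen", "aspirin"]},
)

# Positional indexing vs. label-based selection
value_np = a[2, 5, 0]
value_xr = da.sel(drug="paracetamol").isel(chain=2, draw=5)

# Reducing over axes by position vs. by name
mean_np = a.mean(axis=(0, 1))
mean_xr = da.mean(dim=("chain", "draw"))

# Flattening two axes vs. stacking two named dimensions
flat_np = a.reshape((-1, 3))
flat_xr = da.stack(sample=("chain", "draw"))

# Reordering axes by position vs. by name
transposed_np = a.transpose(2, 0, 1)
transposed_xr = da.transpose("drug", "chain", "draw")
```

In each pair the xarray call names the dimensions it acts on, so it keeps working if the underlying axis order changes.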

In some other cases, however, using xarray can result in overly verbose code that often also becomes less clear. xarray-einstats provides wrappers around some numpy and scipy functions (mostly numpy.linalg and scipy.stats) and around einops, with an API and features adapted to xarray.

⚠️
Attention: A nicer rendering of the content below is available at our documentation

Data for examples

The examples in this overview page use the DataArrays from the Dataset below (stored as the ds variable) to illustrate xarray-einstats features:

xarray-einstats provides two wrapper classes, {class}`xarray_einstats.XrContinuousRV` and {class}`xarray_einstats.XrDiscreteRV`, that can be used to wrap any distribution in {mod}`scipy.stats` so they accept {class}`~xarray.DataArray` as inputs.

We can evaluate the logpdf using inputs that wouldn't align if using numpy in a couple lines:
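
As a rough illustration of the alignment problem, the sketch below evaluates a normal logpdf with plain `xarray.apply_ufunc` and `scipy.stats.norm.logpdf` on made-up arrays. The names `x`, `mu` and `sigma` and their dimensions are assumptions, and this is not the xarray-einstats wrapper API itself; it only shows the name-based broadcasting that such wrappers rely on.

```python
import numpy as np
import xarray as xr
from scipy import stats

# Hypothetical inputs: evaluation points and per-team parameters
x = xr.DataArray(np.linspace(-2.0, 2.0, 5), dims="point")
mu = xr.DataArray(np.array([-1.0, 0.0, 1.0]), dims="team")
sigma = xr.DataArray(np.array([0.5, 1.0, 2.0]), dims="team")

# With raw numpy arrays, shapes (5,) and (3,) cannot be broadcast together
try:
    stats.norm.logpdf(x.values, mu.values, sigma.values)
    numpy_broadcasts = True
except ValueError:
    numpy_broadcasts = False

# With labeled arrays, inputs are broadcast by dimension name
logpdf = xr.apply_ufunc(stats.norm.logpdf, x, mu, sigma)
```

The result has one entry per combination of `point` and `team`, with no manual reshaping of the inputs.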

einops uses a convenient notation inspired by Einstein notation to specify operations on multidimensional arrays. It uses spaces as delimiters between dimensions, parentheses to indicate splitting or stacking of dimensions, and -> to separate the input and output dimension specifications. xarray-einstats uses an adapted notation, translates it to einops notation, and calls {func}`xarray.apply_ufunc` under the hood.

Why change the notation? There are three main reasons, one for each element of the einops notation: ->, the space as delimiter, and parentheses:

In xarray dimensions are already labeled. In many cases, the left side in the einops notation is only used to label the dimensions. In fact, 5/7 examples in https://einops.rocks/api/rearrange/ fall in this category. This is not necessary when working with xarray objects.

In xarray dimension names can be any {term}`hashable <xarray:name>`. xarray-einstats only supports strings as dimension names, but even so, a string can contain spaces, so the space can't be used as a delimiter.

In xarray dimensions are labeled and the order doesn't matter. This might seem the same as the first reason but it is not. When splitting or stacking dimensions you need (and want) the names of both parent and children dimensions. In some cases, for example stacking, we can autogenerate a default name, but in general you'll want to give a name to the new dimension. After all, dimension order in xarray doesn't matter and there isn't much to be done without knowing the dimension names.

:::{attention}
We also provide some cruder wrappers with syntax closer to einops. We are still experimenting to find the right balance between being clear, semantic and flexible, yet concise.

These raw_ wrappers, like {func}`xarray_einstats.einops.raw_rearrange`, impose several extra constraints on accepted xarray inputs, in addition to dimension names being strings.

The example data we are using on this page uses single-word alphabetical dimension names, which allows us to demonstrate both side by side.
:::

xarray-einstats uses two separate arguments, one for the input pattern (optional) and another for the output pattern. Each is a list of dimensions (strings) or dimension operations (lists or dictionaries). Some examples:

We can combine the chain and draw dimensions and name the resulting dimension sample using a list with a single dictionary. The team dimension is not present in the pattern and is not modified.
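
For reference, the same stacking can be sketched with plain xarray. The array below is a hypothetical stand-in for `ds.atts`, assuming dimensions `chain`, `draw` and `team` with lengths 4, 10 and 6; it is not the actual example data.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for ds.atts (dimension lengths are assumptions)
atts = xr.DataArray(np.zeros((4, 10, 6)), dims=("chain", "draw", "team"))

# Combine chain and draw into a single dimension named sample;
# team is not mentioned and is left untouched
stacked = atts.stack(sample=("chain", "draw"))

# The new dimension ends up last; transpose if a specific order is needed
reordered = stacked.transpose("sample", "team")
```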

Note that following xarray convention, new dimensions and dimensions on which we operated are moved to the end. This only matters when you access the underlying array with .values or .data and you can always transpose using {meth}`xarray.Dataset.transpose`, but it can matter. You can change the pattern to enforce the output dimension order:

Now to a more complicated pattern. We will split the chain and team dimensions, then combine the resulting split dimensions with each other.

    rearrange(
        ds.atts,
        # use dicts to specify which dimensions to split, here we *need* to use a dict
        in_dims=[{"chain": ("chain1", "chain2")}, {"team": ("team1", "team2")}],
        # combine split chain and team dims between them
        # here we don't use a dict so the new dimensions get a default name
        out_dims=[("chain1", "team1"), ("team2", "chain2")],
        # set the lengths of split dimensions as kwargs
        chain1=2, chain2=2, team1=2, team2=3
    )

    raw_rearrange(
        ds.atts,
        "(chain1 chain2)=chain (team1 team2)=team -> (chain1 team1) (team2 chain2)",
        chain1=2, chain2=2, team1=2, team2=3
    )
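
To see what this pattern does to the underlying data, here is a sketch of the same split-and-recombine steps in plain numpy. It assumes the array has axes (chain, draw, team) with lengths 4, 10 and 6 and that the untouched draw axis comes first in the output; both are assumptions for illustration, not a description of how xarray-einstats implements it.

```python
import numpy as np

n_chain, n_draw, n_team = 4, 10, 6
chain1, chain2, team1, team2 = 2, 2, 2, 3

# Hypothetical array with axes (chain, draw, team)
a = np.arange(n_chain * n_draw * n_team).reshape(n_chain, n_draw, n_team)

# Split chain -> (chain1, chain2) and team -> (team1, team2)
split = a.reshape(chain1, chain2, n_draw, team1, team2)

# Reorder axes to (draw, chain1, team1, team2, chain2)
reordered = split.transpose(2, 0, 3, 4, 1)

# Merge (chain1, team1) and (team2, chain2) into two new axes
out = reordered.reshape(n_draw, chain1 * team1, team2 * chain2)
```

The positional bookkeeping in the transpose and reshape calls is exactly what the named patterns above spare you from writing by hand.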

There is no one-size-fits-all solution, but knowing the function we are wrapping we can easily make the code more concise and clear. Without xarray-einstats, to invert a batch of matrices stored in a 4d array you have to do:

    inv = xarray.apply_ufunc(   # output is a 4d labeled array
        numpy.linalg.inv,
        batch_of_matrices,      # input is a 4d labeled array
        input_core_dims=[["matrix_dim", "matrix_dim_bis"]],
        output_core_dims=[["matrix_dim", "matrix_dim_bis"]]
    )

To calculate its norm instead, it becomes:

    norm = xarray.apply_ufunc(  # output is a 2d labeled array
        numpy.linalg.norm,
        batch_of_matrices,      # input is a 4d labeled array
        input_core_dims=[["matrix_dim", "matrix_dim_bis"]],
        # reduce over the two core dims, which apply_ufunc moves to the end
        kwargs={"axis": (-2, -1)},
    )
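
A self-contained version of both calls can be sketched as follows. The `batch` and `experiment` dimension names and the random, well-conditioned matrices are assumptions added for illustration. Note that `numpy.linalg.norm` is given `axis=(-2, -1)` through `kwargs`, so that it reduces over only the two matrix dimensions instead of flattening the whole array.

```python
import numpy
import xarray

# Hypothetical input: a 5 x 2 batch of well-conditioned 3 x 3 matrices
rng = numpy.random.default_rng(0)
values = 2 * numpy.eye(3) + 0.1 * rng.normal(size=(5, 2, 3, 3))
batch_of_matrices = xarray.DataArray(
    values, dims=("batch", "experiment", "matrix_dim", "matrix_dim_bis")
)

core_dims = ["matrix_dim", "matrix_dim_bis"]

# Inverse: core dims are consumed and produced, so the output stays 4d
inv = xarray.apply_ufunc(
    numpy.linalg.inv,
    batch_of_matrices,
    input_core_dims=[core_dims],
    output_core_dims=[core_dims],
)

# Norm: core dims are consumed only, so the output is 2d
norm = xarray.apply_ufunc(
    numpy.linalg.norm,
    batch_of_matrices,
    input_core_dims=[core_dims],
    kwargs={"axis": (-2, -1)},
)
```

Both calls repeat the core dimension names and require knowing how each numpy function treats its trailing axes, which is the kind of boilerplate a wrapper can hide.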

arviz-devs/xarray-einstats

## Fix codecov

Jan 23, 2022
