Introducing anndata: indexing, views and HDF5-backing

Published: December 23, 2017. Updated: December 28, 2017.

See a more recent version of this.

With anndata, we recently released a package for handling annotated data in scalable Python-based data analysis pipelines.

Here, we introduce basic properties of the central object, AnnData.

import numpy as np
import pandas as pd
import anndata as ad
print(ad.__version__)
0.4.1

Let us generate some example data:

# number of observations
n_obs = 1000
# say we measure the time of observing the data points
# add them to a dataframe for storing some annotation
obs = pd.DataFrame()
obs['time'] = np.random.choice(['day 1', 'day 2', 'day 4', 'day 8'], n_obs)
# set the names of variables/features to the following
# ['A', 'B', 'C', ..., 'AA', 'BB', 'CC', ..., 'AAA', ...]
from string import ascii_uppercase
var_names = [i*letter for i in range(1, 10) for letter in ascii_uppercase]
# number of variables
n_vars = len(var_names)
# dataframe for annotating the variables
var = pd.DataFrame(index=var_names)
# the data matrix of shape n_obs x n_vars
X = np.arange(n_obs*n_vars).reshape(n_obs, n_vars)

and init an AnnData object.

# we're using an integer data type just for prettier outputs
# the default 'float32' is flexible and precise enough for most purposes
adata = ad.AnnData(X, obs=obs, var=var, dtype='int32')

The convention is that observations/samples of variables/features are stored in the rows of a data matrix $\mathbf{X}$. This is the convention of the modern classics of Statistics (Hastie et al., 2009) and Machine Learning (Murphy, 2012), the convention of dataframes both in Python and R and the established machine learning and statistics packages in Python (statsmodels, scikit-learn).

print(adata)
AnnData object with n_obs × n_vars = 1000 × 234 
    obs_keys = ['time']

The names of observations and of variables are as follows.

print(adata.obs_names[:10].tolist())
print(adata.obs_names[-10:].tolist())
print(adata.var_names[:10].tolist())
print(adata.var_names[-10:].tolist())
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[990, 991, 992, 993, 994, 995, 996, 997, 998, 999]
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
['QQQQQQQQQ', 'RRRRRRRRR', 'SSSSSSSSS', 'TTTTTTTTT', 'UUUUUUUUU', 'VVVVVVVVV', 'WWWWWWWWW', 'XXXXXXXXX', 'YYYYYYYYY', 'ZZZZZZZZZ']

This is a large telephone matrix.

print(adata.X)
[[     0      1      2 ...,    231    232    233]
 [   234    235    236 ...,    465    466    467]
 [   468    469    470 ...,    699    700    701]
 ..., 
 [233298 233299 233300 ..., 233529 233530 233531]
 [233532 233533 233534 ..., 233763 233764 233765]
 [233766 233767 233768 ..., 233997 233998 233999]]

Indexing and Views

Similar to numpy arrays, AnnData objects can either hold actual data or reference another AnnData object. In the later case, they are referred to as "view", as in numpy.

Indexing AnnData objects always returns views, which has two advantages:

  • no new memory is allocated
  • it is possible to modify the underlying AnnData object

You can get an actual AnnData object from a view by calling .copy() on the view. Usually, this is not necessary, as any modification of elements of a view (calling .[] on an attribute of the view) internally calls .copy() and makes the view an AnnData object that holds actual data. See the example below.

adata
AnnData object with n_obs × n_vars = 1000 × 234 
    obs_keys = ['time']
adata[:, 'A']
View of AnnData object with n_obs × n_vars = 1000 × 1 
    obs_keys = ['time']

Get the first three elements of a column.

print(adata[:3, 'A'].X)
[  0 234 468]

Set the first three elements of a column.

adata[:3, 'A'].X = [0, 0, 0]
print(adata[:5, 'A'].X)
[  0   0   0 702 936]

However, if you try to access parts of a view of an AnnData, the content will be copied and a data-storing object will be generated.

adata_subset = adata[:5, ['A', 'B']]
adata_subset
View of AnnData object with n_obs × n_vars = 5 × 2 
    obs_keys = ['time']
adata_subset.obs['foo'] = range(5)

Now adata_subset stores the actual data and is no longer just a reference to adata.

adata_subset
AnnData object with n_obs × n_vars = 5 × 2 
    obs_keys = ['time', 'foo']

You can also slice with sequences or boolean indices.

adata[adata.obs['time'].isin(['day 1', 'day 2'])].obs.head()
time
1 day 1
2 day 2
3 day 2
4 day 1
6 day 1

Writing the results to disk

adata.write('./write/my_results.h5ad')
!h5ls './write/my_results.h5ad'
X                        Dataset {1000, 234}
obs                      Dataset {1000}
uns                      Group
var                      Dataset {234}
adata.write_csvs('./write/my_results_csvs', )
!ls './write/my_results_csvs'
obs.csv  obsm.csv uns      var.csv  varm.csv

Backing the object on disk

# Convenience method for computing the size of objects
def print_size_in_MB(x):
    print('{:.3} MB'.format(x.__sizeof__()/1e6))

The size of our AnnData object is about 1 MB.

print_size_in_MB(adata)
1.01 MB

As is, the AnnData object can is essentially a collection of simpler data containers: arrays, sparse matrices, dataframes.

You just saw that if you index AnnData, you get a view on elements of these data containers that essentially behaves the same as the containers themselves, but doesn't take additional memory.

You can do something similar when backing an AnnData object with a file, then AnnData will act as a view on this file, and still essentially behave the same.

adata.isbacked
False

Switch to "backed" mode by setting a backing file name. It's an HDF5 file in the AnnData formatting convention, hence we use the extension ".h5ad".

adata.filename = './write/test.h5ad'

Simply switch back by setting .filename = None. This will load the whole object back into memory.

See whether the backing file has been created.

adata.isbacked
True
print_size_in_MB(adata)
0.0773 MB

Indexing in "backed" mode

Retrieving the first column.

print(adata[:3, 'A'].X)
[0 0 0]
adata
AnnData object with n_obs × n_vars = 1000 × 234 backed at './write/test.h5ad'
    obs_keys = ['time']

Setting some elements to a new value.

adata[:3, 'A'].X = [1, 1, 1]
print(adata[:3, 'A'].X)
[1 1 1]

The file has the same structure as above when we wrote a whole AnnData object to disk.

!h5ls './write/test.h5ad'
X                        Dataset {1000, 234}
obs                      Dataset {1000}
uns                      Group
var                      Dataset {234}

Backing only affects the data matrix X. All annotations are kept in memory. If you make changes to them, you have to write them to disk.

Call .write() to write changes made to the annotations to disk and close the file.

If you didn't make changes to the annotations, you can also call .file.close() in order to merely close the file.

adata.file.isopen
True
adata.write()

Note: The file is closed after this, but accessing an object will reopen it.

adata.file.isopen
False
!h5ls './write/test.h5ad'
X                        Dataset {1000, 234}
obs                      Dataset {1000}
uns                      Group
var                      Dataset {234}
adata.X
<HDF5 dataset "X": shape (1000, 234), type "<i4">
adata.file.isopen
True

Copying the backed object

In order to copy a backed object, you need to pass a new backing file name.

adata_new = adata.copy(filename='./write/test1.h5ad')
adata_new
AnnData object with n_obs × n_vars = 1000 × 234 backed at './write/test1.h5ad'
    obs_keys = ['time']
!ls ./write/
my_results.h5ad   test.h5ad         test1_subset.h5ad
my_results_csvs   test1.h5ad

Moving the backing file

You can move the backing file on disk simply by resetting the filename.

adata
AnnData object with n_obs × n_vars = 1000 × 234 backed at './write/test.h5ad'
    obs_keys = ['time']
adata.filename = './write/test1.h5ad'
!ls ./write/
my_results.h5ad   my_results_csvs   test1.h5ad        test1_subset.h5ad
adata
AnnData object with n_obs × n_vars = 1000 × 234 backed at './write/test1.h5ad'
    obs_keys = ['time']
!ls ./write/
my_results.h5ad   my_results_csvs   test1.h5ad        test1_subset.h5ad

Views on backed AnnData objects

Only one thing is different from the non-backed case.

adata_subset = adata[:5, ['A', 'B']]
adata_subset
View of AnnData object with n_obs × n_vars = 5 × 2 backed at './write/test1.h5ad'
    obs_keys = ['time']

You cannot just set an element of an attribute of the subset, as copying the object requires setting a filename. Hence, do the following.

adata_subset = adata[:5, ['A', 'B']].copy(filename='./write/test1_subset.h5ad')
adata_subset
AnnData object with n_obs × n_vars = 5 × 2 backed at './write/test1_subset.h5ad'
    obs_keys = ['time']
adata_subset.obs['foo'] = range(5)
adata_subset
AnnData object with n_obs × n_vars = 5 × 2 backed at './write/test1_subset.h5ad'
    obs_keys = ['time', 'foo']
adata_subset.write()