Introducing anndata: indexing, views and HDF5-backing
Published: December 23, 2017. Updated: December 28, 2017.
See a more recent version of this.
With anndata, we recently released a package for handling annotated data in scalable Python-based data analysis pipelines.
Here, we introduce basic properties of the central object, AnnData.
import numpy as np
import pandas as pd
import anndata as ad
print(ad.__version__)
0.4.1
Let us generate some example data:
# number of observations
n_obs = 1000
# say we measure the time of observing the data points
# add them to a dataframe for storing some annotation
obs = pd.DataFrame()
obs['time'] = np.random.choice(['day 1', 'day 2', 'day 4', 'day 8'], n_obs)
# set the names of variables/features to the following
# ['A', 'B', 'C', ..., 'AA', 'BB', 'CC', ..., 'AAA', ...]
from string import ascii_uppercase
var_names = [i*letter for i in range(1, 10) for letter in ascii_uppercase]
# number of variables
n_vars = len(var_names)
# dataframe for annotating the variables
var = pd.DataFrame(index=var_names)
# the data matrix of shape n_obs x n_vars
X = np.arange(n_obs*n_vars).reshape(n_obs, n_vars)
and init an AnnData object.
# we're using an integer data type just for prettier outputs
# the default 'float32' is flexible and precise enough for most purposes
adata = ad.AnnData(X, obs=obs, var=var, dtype='int32')
The convention is that observations/samples of variables/features are stored in the rows of a data matrix $\mathbf{X}$. This is the convention of the modern classics of Statistics (Hastie et al., 2009) and Machine Learning (Murphy, 2012), the convention of dataframes both in Python and R and the established machine learning and statistics packages in Python (statsmodels, scikit-learn).
print(adata)
AnnData object with n_obs × n_vars = 1000 × 234
obs_keys = ['time']
The names of observations and of variables are as follows.
print(adata.obs_names[:10].tolist())
print(adata.obs_names[-10:].tolist())
print(adata.var_names[:10].tolist())
print(adata.var_names[-10:].tolist())
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[990, 991, 992, 993, 994, 995, 996, 997, 998, 999]
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
['QQQQQQQQQ', 'RRRRRRRRR', 'SSSSSSSSS', 'TTTTTTTTT', 'UUUUUUUUU', 'VVVVVVVVV', 'WWWWWWWWW', 'XXXXXXXXX', 'YYYYYYYYY', 'ZZZZZZZZZ']
This is a large telephone matrix.
print(adata.X)
[[ 0 1 2 ..., 231 232 233]
[ 234 235 236 ..., 465 466 467]
[ 468 469 470 ..., 699 700 701]
...,
[233298 233299 233300 ..., 233529 233530 233531]
[233532 233533 233534 ..., 233763 233764 233765]
[233766 233767 233768 ..., 233997 233998 233999]]
Indexing and Views
Similar to numpy arrays, AnnData objects can either hold actual data or reference another AnnData object. In the later case, they are referred to as "view", as in numpy.
Indexing AnnData objects always returns views, which has two advantages:
- no new memory is allocated
- it is possible to modify the underlying AnnData object
You can get an actual AnnData object from a view by calling .copy()
on the view. Usually, this is not necessary, as any modification of elements of a view (calling .[]
on an attribute of the view) internally calls .copy()
and makes the view an AnnData object that holds actual data. See the example below.
adata
AnnData object with n_obs × n_vars = 1000 × 234
obs_keys = ['time']
adata[:, 'A']
View of AnnData object with n_obs × n_vars = 1000 × 1
obs_keys = ['time']
Get the first three elements of a column.
print(adata[:3, 'A'].X)
[ 0 234 468]
Set the first three elements of a column.
adata[:3, 'A'].X = [0, 0, 0]
print(adata[:5, 'A'].X)
[ 0 0 0 702 936]
However, if you try to access parts of a view of an AnnData, the content will be copied and a data-storing object will be generated.
adata_subset = adata[:5, ['A', 'B']]
adata_subset
View of AnnData object with n_obs × n_vars = 5 × 2
obs_keys = ['time']
adata_subset.obs['foo'] = range(5)
Now adata_subset
stores the actual data and is no longer just a reference to adata
.
adata_subset
AnnData object with n_obs × n_vars = 5 × 2
obs_keys = ['time', 'foo']
You can also slice with sequences or boolean indices.
adata[adata.obs['time'].isin(['day 1', 'day 2'])].obs.head()
time | |
---|---|
1 | day 1 |
2 | day 2 |
3 | day 2 |
4 | day 1 |
6 | day 1 |
Writing the results to disk
adata.write('./write/my_results.h5ad')
!h5ls './write/my_results.h5ad'
X Dataset {1000, 234}
obs Dataset {1000}
uns Group
var Dataset {234}
adata.write_csvs('./write/my_results_csvs', )
!ls './write/my_results_csvs'
obs.csv obsm.csv [34muns[m[m var.csv varm.csv
Backing the object on disk
# Convenience method for computing the size of objects
def print_size_in_MB(x):
print('{:.3} MB'.format(x.__sizeof__()/1e6))
The size of our AnnData object is about 1 MB.
print_size_in_MB(adata)
1.01 MB
As is, the AnnData object can is essentially a collection of simpler data containers: arrays, sparse matrices, dataframes.
You just saw that if you index AnnData, you get a view on elements of these data containers that essentially behaves the same as the containers themselves, but doesn't take additional memory.
You can do something similar when backing an AnnData object with a file, then AnnData will act as a view on this file, and still essentially behave the same.
adata.isbacked
False
Switch to "backed" mode by setting a backing file name. It's an HDF5 file in the AnnData formatting convention, hence we use the extension ".h5ad".
adata.filename = './write/test.h5ad'
Simply switch back by setting .filename = None
. This will load the whole object back into memory.
See whether the backing file has been created.
adata.isbacked
True
print_size_in_MB(adata)
0.0773 MB
Indexing in "backed" mode
Retrieving the first column.
print(adata[:3, 'A'].X)
[0 0 0]
adata
AnnData object with n_obs × n_vars = 1000 × 234 backed at './write/test.h5ad'
obs_keys = ['time']
Setting some elements to a new value.
adata[:3, 'A'].X = [1, 1, 1]
print(adata[:3, 'A'].X)
[1 1 1]
The file has the same structure as above when we wrote a whole AnnData object to disk.
!h5ls './write/test.h5ad'
X Dataset {1000, 234}
obs Dataset {1000}
uns Group
var Dataset {234}
Backing only affects the data matrix X. All annotations are kept in memory. If you make changes to them, you have to write them to disk.
Call .write()
to write changes made to the annotations to disk and close the file.
If you didn't make changes to the annotations, you can also call .file.close()
in order to merely close the file.
adata.file.isopen
True
adata.write()
Note: The file is closed after this, but accessing an object will reopen it.
adata.file.isopen
False
!h5ls './write/test.h5ad'
X Dataset {1000, 234}
obs Dataset {1000}
uns Group
var Dataset {234}
adata.X
<HDF5 dataset "X": shape (1000, 234), type "<i4">
adata.file.isopen
True
Copying the backed object
In order to copy a backed object, you need to pass a new backing file name.
adata_new = adata.copy(filename='./write/test1.h5ad')
adata_new
AnnData object with n_obs × n_vars = 1000 × 234 backed at './write/test1.h5ad'
obs_keys = ['time']
!ls ./write/
my_results.h5ad test.h5ad test1_subset.h5ad
[34mmy_results_csvs[m[m test1.h5ad
Moving the backing file
You can move the backing file on disk simply by resetting the filename.
adata
AnnData object with n_obs × n_vars = 1000 × 234 backed at './write/test.h5ad'
obs_keys = ['time']
adata.filename = './write/test1.h5ad'
!ls ./write/
my_results.h5ad [34mmy_results_csvs[m[m test1.h5ad test1_subset.h5ad
adata
AnnData object with n_obs × n_vars = 1000 × 234 backed at './write/test1.h5ad'
obs_keys = ['time']
!ls ./write/
my_results.h5ad [34mmy_results_csvs[m[m test1.h5ad test1_subset.h5ad
Views on backed AnnData objects
Only one thing is different from the non-backed case.
adata_subset = adata[:5, ['A', 'B']]
adata_subset
View of AnnData object with n_obs × n_vars = 5 × 2 backed at './write/test1.h5ad'
obs_keys = ['time']
You cannot just set an element of an attribute of the subset, as copying the object requires setting a filename. Hence, do the following.
adata_subset = adata[:5, ['A', 'B']].copy(filename='./write/test1_subset.h5ad')
adata_subset
AnnData object with n_obs × n_vars = 5 × 2 backed at './write/test1_subset.h5ad'
obs_keys = ['time']
adata_subset.obs['foo'] = range(5)
adata_subset
AnnData object with n_obs × n_vars = 5 × 2 backed at './write/test1_subset.h5ad'
obs_keys = ['time', 'foo']
adata_subset.write()