
Analysis flow#

Here, we’ll track typical data transformations that occur during analysis, such as subsetting.

For a more general introduction, read Project flow first.

Setup#

# a lamindb instance containing Bionty schema
!lamin init --storage ./analysis-usecase --schema bionty
✅ saved: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-10-23 17:45:30)
✅ saved: Storage(uid='XmX5Jbqn', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase', type='local', updated_at=2023-10-23 17:45:30, created_by_id=1)
💡 loaded instance: testuser1/analysis-usecase
💡 did not register local instance on hub

import lamindb as ln
import lnschema_bionty as lb

lb.settings.organism = "human"  # globally set organism
lb.settings.auto_save_parents = False
💡 loaded instance: testuser1/analysis-usecase (lamindb 0.57.2)
ln.track()
💡 notebook imports: lamindb==0.57.2 lnschema_bionty==0.33.0
💡 Transform(uid='eNef4Arw8nNMz8', name='Analysis flow', short_name='analysis-flow', version='0', type=notebook, updated_at=2023-10-23 17:45:32, created_by_id=1)
💡 Run(uid='kVTOQgyAx68YsbVkoQHD', run_at=2023-10-23 17:45:32, transform_id=1, created_by_id=1)

Track cell types, tissues and diseases#

We fetch an example dataset from LaminDB that has a few cell type, tissue and disease annotations:

adata = ln.dev.datasets.anndata_with_obs()
adata
AnnData object with n_obs × n_vars = 40 × 100
    obs: 'cell_type', 'cell_type_id', 'tissue', 'disease'
adata.var_names[:5]
Index(['ENSG00000000003', 'ENSG00000000005', 'ENSG00000000419',
       'ENSG00000000457', 'ENSG00000000460'],
      dtype='object')
adata.obs[["tissue", "cell_type", "disease"]].value_counts()
tissue  cell_type                disease                   
brain   my new cell type         Alzheimer disease             10
heart   hepatocyte               cardiac ventricle disorder    10
kidney  T cell                   chronic kidney disease        10
liver   hematopoietic stem cell  liver lymphoma                10
dtype: int64

Processing the dataset#

To track our data transformation, we create a new Transform of type “pipeline”:

transform = ln.Transform(
    name="Subset to T-cells and liver lymphoma", version="0.1.0", type="pipeline"
)

Point tracking at the new transform, which also starts a new Run linked to it:

ln.track(transform)
💡 Transform(uid='eK6wXyT78bxgZ4', name='Subset to T-cells and liver lymphoma', version='0.1.0', type='pipeline', updated_at=2023-10-23 17:45:35, created_by_id=1)
💡 Run(uid='kLXAPMH9o6n0emnOBshW', run_at=2023-10-23 17:45:35, transform_id=2, created_by_id=1)

Get a backed AnnData object#

file = ln.File.filter(key="mini_anndata_with_obs.h5ad").one()
adata = file.backed()
adata
AnnDataAccessor object with n_obs × n_vars = 40 × 100
  constructed for the AnnData object mini_anndata_with_obs.h5ad
    obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
    var: ['_index']
adata.obs[["cell_type", "disease"]].value_counts()
cell_type                disease                   
T cell                   chronic kidney disease        10
hematopoietic stem cell  liver lymphoma                10
hepatocyte               cardiac ventricle disorder    10
my new cell type         Alzheimer disease             10
dtype: int64

Subset dataset to specific cell types and diseases#

Create the subset:

subset_obs = adata.obs.cell_type.isin(["T cell", "hematopoietic stem cell"]) & (
    adata.obs.disease.isin(["liver lymphoma", "chronic kidney disease"])
)
adata_subset = adata[subset_obs]
adata_subset
AnnDataAccessorSubset object with n_obs × n_vars = 20 × 100
  obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
  var: ['_index']
adata_subset.obs[["cell_type", "disease"]].value_counts()
cell_type                disease               
T cell                   chronic kidney disease    10
hematopoietic stem cell  liver lymphoma            10
dtype: int64
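
The mask above combines two membership tests with a logical AND, so an observation is kept only if both its cell type and its disease are in the allowed lists. The following is a dependency-free sketch of that logic, not the pandas or LaminDB implementation. The toy `obs` rows are rebuilt from the value counts shown earlier, and `keep_row` is a hypothetical helper introduced only for illustration.

```python
from collections import Counter

# Rebuild the 40 observation annotations from the value counts above:
# four (cell_type, disease) pairs with 10 observations each.
pairs = [
    ("T cell", "chronic kidney disease"),
    ("hematopoietic stem cell", "liver lymphoma"),
    ("hepatocyte", "cardiac ventricle disorder"),
    ("my new cell type", "Alzheimer disease"),
]
obs = [{"cell_type": c, "disease": d} for c, d in pairs for _ in range(10)]

allowed_cell_types = {"T cell", "hematopoietic stem cell"}
allowed_diseases = {"liver lymphoma", "chronic kidney disease"}


def keep_row(row):
    """Hypothetical helper: True if the row passes both membership tests."""
    return (
        row["cell_type"] in allowed_cell_types
        and row["disease"] in allowed_diseases
    )


# Equivalent in spirit to isin(...) & isin(...) followed by indexing.
subset = [row for row in obs if keep_row(row)]
counts = Counter((row["cell_type"], row["disease"]) for row in subset)

print(len(subset))
for pair, n in sorted(counts.items()):
    print(pair, n)
```

Because the two tests are independent, a row pairing “T cell” with “liver lymphoma” would also pass. The example dataset contains no such row, which is why exactly the two groups of 10 remain.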

This subset can now be registered:

file_subset = ln.File.from_anndata(
    adata_subset.to_memory(),
    key="subset/mini_anndata_with_obs.h5ad",
    field=lb.Gene.ensembl_gene_id,
)
/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/anndata/_core/anndata.py:1900: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
❗    received 99 unique terms, 1 empty/duplicated term is ignored
99 terms (100.00%) are not validated for ensembl_gene_id: ENSG00000000003, ENSG00000000005, ENSG00000000419, ENSG00000000457, ENSG00000000460, ENSG00000000938, ENSG00000000971, ENSG00000001036, ENSG00000001084, ENSG00000001167, ENSG00000001460, ENSG00000001461, ENSG00000001497, ENSG00000001561, ENSG00000001617, ENSG00000001626, ENSG00000001629, ENSG00000001630, ENSG00000001631, ENSG00000002016, ...
❗    no validated features, skip creating feature set
1 term (25.00%) is not validated for name: cell_type_id
file_subset.save()
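
The log above reports 99 unique terms among 100 variable names, with one duplicated term ignored, and then splits the unique terms into validated and non-validated. As a rough sketch of how such counts arise (this is not LaminDB's internal validation logic), the snippet below deduplicates a short list of names and checks them against a reference set. The duplicate entry and the `reference` set are made up for illustration.

```python
from collections import Counter

# Illustrative variable names: the first five IDs from the dataset above,
# plus one deliberate duplicate to mimic the "not unique" warning.
var_names = [
    "ENSG00000000003",
    "ENSG00000000005",
    "ENSG00000000419",
    "ENSG00000000457",
    "ENSG00000000460",
    "ENSG00000000003",  # deliberate duplicate (assumption)
]

# Hypothetical reference vocabulary standing in for a registry of known IDs.
reference = {"ENSG00000000003", "ENSG00000000419"}

occurrences = Counter(var_names)
unique_terms = list(occurrences)  # Counter preserves first-seen order
duplicated = [name for name, n in occurrences.items() if n > 1]

# Split the unique terms by membership in the reference vocabulary.
validated = [name for name in unique_terms if name in reference]
not_validated = [name for name in unique_terms if name not in reference]

n_ignored = len(var_names) - len(unique_terms)
share = len(not_validated) / len(unique_terms)
print(f"received {len(unique_terms)} unique terms, {n_ignored} duplicated term(s) ignored")
print(f"{len(not_validated)} terms ({share:.2%}) are not validated")
```

In the run above, none of the 99 unique gene IDs validated against `lb.Gene.ensembl_gene_id`, so no feature set was created for the variables.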

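Add labels to features. Every value validates except 'my new cell type', for which no CellType record is created. The labels are built from the full adata.obs, so they also include values outside the subset: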

# look up the registered features by name
features = ln.Feature.lookup()

# create label records from the observation annotations
cell_types = lb.CellType.from_values(adata.obs.cell_type, lb.CellType.name)
tissues = lb.Tissue.from_values(adata.obs.tissue, lb.Tissue.name)
diseases = lb.Disease.from_values(adata.obs.disease, lb.Disease.name)

# link each set of labels to its feature
file_subset.labels.add(cell_types, feature=features.cell_type)
file_subset.labels.add(tissues, feature=features.tissue)
file_subset.labels.add(diseases, feature=features.disease)
did not create CellType record for 1 non-validated name: 'my new cell type'
file_subset.describe()
File(uid='zirlXUW0rXsHfMsGEhoF', key='subset/mini_anndata_with_obs.h5ad', suffix='.h5ad', accessor='AnnData', size=38992, hash='RgGUx7ndRplZZSmalTAWiw', hash_type='md5', updated_at=2023-10-23 17:45:35)

Provenance:
  🗃️ storage: Storage(uid='XmX5Jbqn', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase', type='local', updated_at=2023-10-23 17:45:30, created_by_id=1)
  🧩 transform: Transform(uid='eK6wXyT78bxgZ4', name='Subset to T-cells and liver lymphoma', version='0.1.0', type='pipeline', updated_at=2023-10-23 17:45:35, created_by_id=1)
  👣 run: Run(uid='kLXAPMH9o6n0emnOBshW', run_at=2023-10-23 17:45:35, transform_id=2, created_by_id=1)
  👤 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-10-23 17:45:30)
Features:
  obs: FeatureSet(uid='lygziRfkBfrJa34d2uUA', n=3, registry='core.Feature', hash='wlYmDAPUN_HwomBnSH6w', updated_at=2023-10-23 17:45:35, modality_id=1, created_by_id=1)
    🔗 cell_type (3, bionty.CellType): 'T cell', 'hematopoietic stem cell', 'hepatocyte'
    🔗 tissue (4, bionty.Tissue): 'kidney', 'liver', 'heart', 'brain'
    🔗 disease (4, bionty.Disease): 'chronic kidney disease', 'liver lymphoma', 'cardiac ventricle disorder', 'Alzheimer disease'
Labels:
  🏷️ tissues (4, bionty.Tissue): 'kidney', 'liver', 'heart', 'brain'
  🏷️ cell_types (3, bionty.CellType): 'T cell', 'hematopoietic stem cell', 'hepatocyte'
  🏷️ diseases (4, bionty.Disease): 'chronic kidney disease', 'liver lymphoma', 'cardiac ventricle disorder', 'Alzheimer disease'

Examine data flow#

Common questions that might arise are:

  • Which h5ad file is in the subset subfolder?

  • Which notebook ingested this file?

  • By whom?

  • And which file is its parent?

Let’s answer these using LaminDB.

To find out which h5ad file is in the subset subfolder, query for an .h5ad file whose key starts with “subset” and that is labeled with either “hematopoietic stem cell” or “T cell”, then view its data flow:

cell_types_bt_lookup = lb.CellType.lookup()
my_subset = ln.File.filter(
    suffix=".h5ad",
    key__startswith="subset",
    cell_types__in=[
        cell_types_bt_lookup.hematopoietic_stem_cell,
        cell_types_bt_lookup.t_cell,
    ],
).first()
my_subset.view_flow()
[Figure: data flow graph rendered by my_subset.view_flow()]
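
The four questions above all amount to following links from a file to the run that created it, and from there to the transform, the user, and the input files. The sketch below uses a toy in-memory model to make those links explicit. It is not LaminDB's schema or API: the `Transform`, `Run`, and `File` dataclasses are simplified stand-ins, and treating the original file as an input of the subsetting run is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Transform:
    name: str
    type: str


@dataclass
class Run:
    transform: Transform
    created_by: str
    inputs: List["File"] = field(default_factory=list)


@dataclass
class File:
    key: str
    run: Optional[Run] = None  # the run that created this file, if known


# Toy records reusing the names from this guide.
pipeline = Transform(name="Subset to T-cells and liver lymphoma", type="pipeline")

# How the parent file was created is not shown here, so its run stays None.
parent = File(key="mini_anndata_with_obs.h5ad")
subset_run = Run(transform=pipeline, created_by="testuser1", inputs=[parent])
subset_file = File(key="subset/mini_anndata_with_obs.h5ad", run=subset_run)

files = [parent, subset_file]

# Which h5ad file is in the subset subfolder?
hit = next(
    f for f in files if f.key.startswith("subset") and f.key.endswith(".h5ad")
)
print(hit.key)

# Which transform ingested this file, and by whom?
print(hit.run.transform.name, hit.run.created_by)

# Which file is its parent?
parent_keys = [f.key for f in hit.run.inputs]
print(parent_keys)
```

In the real instance, the corresponding information appears under “Provenance” in the `describe()` output shown earlier (transform, run, created_by).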
!lamin delete --force analysis-usecase
!rm -r ./analysis-usecase
💡 deleting instance testuser1/analysis-usecase
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--analysis-usecase.env
✅     instance cache deleted
✅     deleted '.lndb' sqlite file
❗     consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase