scATAnno: An Automated Cell Type annotation Pipeline for scATAC-seq

scATAnno is a Python-based workflow that automates the annotation of single-cell ATAC-seq (scATAC-seq) data using reference atlases. It directly uses chromatin accessibility peaks as input features. scATAnno utilizes large-scale reference atlases constructed from publicly available datasets and various scATAC-seq technologies to annotate cell types in an unsupervised manner. It incorporates uncertainty score metrics to assess the confidence of cell-type assignments. The workflow has been demonstrated to accurately annotate cell types in different case studies, including PBMC, TNBC, and BCC, and can identify unknown cell populations. Overall, scATAnno expedites our understanding of cellular heterogeneity based on chromatin accessibility profiles in complex biological systems.

  • Integration of newly generated scATAC-seq data with reference atlases

  • Celltype annotation of single-cell chromatin accessibility profile

  • Uncertainty score distribution to assess cell-type assignment

Workflow of scATAC Annotation

If you use this pipeline, please cite the following reference:

Jiang et al., scATAnno: Automated Cell Type Annotation for single-cell ATAC-seq Data

BioRxiv Preprint (https://www.biorxiv.org/content/10.1101/2023.06.01.543296v2)

The documentation is organized into the following sections:

About scATAnno

scATAnno is a Python-based workflow designed to annotate cell types in single-cell ATAC-seq (scATAC-seq) data based on scATAC-seq reference atlases. It addresses the challenges presented by scATAC-seq data, such as data sparsity, high dimensionality of epigenomic features, and the lack of marker enhancers for annotation. Unlike existing methods, scATAnno directly utilizes peaks or cis-regulatory element (CRE) genomic regions as input features without converting them into gene activity scores. It overcomes the high dimensionality by employing spectral embedding to transform the data into a low dimensional space. To create catalogs of CREs at the single-cell level, scATAnno uses the chromatin state profile of large-scale reference atlases to generate peak signals and reference peaks.

By integrating query data with reference atlases in an unsupervised manner, scATAnno enables accurate cell type annotation. It provides uncertainty score metrics to assess the confidence of cell-type assignment, including a K-Nearest Neighbor (KNN) uncertainty score and a weighted distance-based uncertainty score in a low dimensional space. The workflow has been demonstrated to be effective in multiple case studies, including peripheral blood mononuclear cells (PBMC), Triple Negative Breast Cancer (TNBC), and basal cell carcinoma (BCC). It accurately annotates cell types and identifies unknown cell populations not represented in the reference atlas. Additionally, scATAnno enables efficient cell type annotation across tumor conditions and accurately annotates tumor-infiltrating lymphocytes (TIL) in TNBC using a BCC TIL atlas.

Installation

Instruction of Installation

Install through pip

pip install scATAnno

Install through github

Clone the source code repository and go to the scATAnno-main directory, use `pip install . ` to install the package.

git clone https://github.com/Yijia-Jiang/scATAnno-main.git
cd scATAnno-main
pip install .

Reference Atlas

Below are Reference Altases in the original publication. If you want to find processed and ready-to-use reference atlases, please check Prepare Reference Atlas

Human scATAC Reference: Zhang, Kai, et al. Cell (2021)

Dataset: 1.3 million cells, 1.2 million peaks; 30 major cell types and 222 minor cell types

Human scATAC Reference Atlas

PBMC scATAC Reference: Satpathy, Ansuman et al. Nature Biotechnology (2019)

Dataset: 61,806 peripheral blood and bone marrow cells, 571,400 peaks

Mouse scATAC Reference Atlas

BCC Tumor-Infiltrating Lymphocytes (TILs) scATAC Reference:

Satpathy, Ansuman et al. Nature Biotechnology (2019)

Dataset: 37,818 cells from BCC clinical biopsies

Mouse scATAC Reference Atlas

Prepare Reference Atlas

Healthy Adult Reference Atlas

  • Select deep-sequenced 100K adult cells

  • Select adult specific peaks (~ 890K peaks)

  • Downloaded Healthy Adult Reference atlas

    UMAP of Human scATAC Reference Atlas

    Perform dimensionality reduction using spectral embedding, visualize annotation on UMAP

PBMC Reference Atlas

  • Select 39441 PBMC cells

  • Generate 196K peaks by MACS2 Peak-Calling

  • Downloaded PBMC atlas

    UMAP of PBMC scATAC Reference Atlas

    Perform dimensionality reduction using spectral embedding, visualize annotation on UMAP

BCC TIL Reference Atlas

  • Select 22008 TIL cells

  • Generate 340K peaks by MACS2 Peak-Calling

  • Downloaded TIL atlas

    UMAP of Mouse scATAC Reference Atlas

    Perform dimensionality reduction using spectral embedding, visualize annotation on UMAP

Pipeline Overview

The conceptual idea and schematic of scATAnno is illustrated here.

Workflow of scATAC Annotation

Original Input

The following files are needed to run Celltype Annotation on your own experiment:

  • fragments.tsv.gz fragment file for each scATAC data

  • barcodes.tsv cell barcodes for each scATAC data

  • reference.bed reference peaks with chromosome regions from the selected reference atlas

  • Optionally: UMAP or tSNE projection coordinates and Cluster cluster numbers of cells can be provided by users

Currently, this package only supports hg38 reference mapping

Intermediate Output

The following files are intermediate outputs of scATAnno in order to generate a peak-by-cell matrix for query data:

  • matrix.mtx Sparse matrix files with fragment reads

  • features.tsv Reference peaks/cis-Regulatory Elements

  • barcodes.tsv Cell barcodes of high quality cells

Final Output

The following files are final outputs of scATAnno using the annotation tool:

  • 1.Merged_query_reference.h5ad Anndata of integrated query and reference cells

  • X_spectral_harmony.csv Harmozied spectral embeddings of integrated data

  • query.h5ad Anndata of query cells which stores annotation results. This AnnData should include essential prediction results in AnnData.obs

    • column cluster_annotation stores cell type assignment at cluster-level

    • column uncertainty_score stores final uncertainty score, which takes the maximum of KNN-based uncertainty and weighted distance-based uncertainty of query cells

Dimension reduction

Cited from https://kzhang.org/SnapATAC2/algorithms/dimension_reduction.html#spectral-embedding

[Fang_2021]: Fang, R., Preissl, S., Li, Y. et al. Comprehensive analysis of single cell ATAC-seq data with SnapATAC. Nat Commun 12, 1337 (2021). https://doi.org/10.1038/s41467-021-21583-9

The dimension reduction method is a pairwise-similarity based method, which requires defining and computing similarity between each pair of cells in the data.

In order to address the issue of high dimensionality in scATAC-seq data, we integrated SnapATAC2 into our workflow. SnapATAC2 utilizes spectral embedding techniques based on the work of Zhang et al. (2021) and Fang et al. (2021). We combined the reference and query data, creating a unified peak-by-cell matrix. SnapATAC2 transformed the count matrix into a cell-cell similarity matrix using the Jaccard Similarity Score, which measures the similarity between cells based on the ratio of their intersection and union. To mitigate sequencing depth bias, the Jaccard similarity matrix was normalized. The normalized matrix was then subjected to eigenvector decomposition to obtain low-dimensional components. To handle the computational complexity of high-dimensional data, SnapATAC2 first sampled “landmark” cells that capture the overall data distribution, and obtained the low-dimensional components for these informative cells. Finally, the remaining cells were projected into the same low-dimensional space, allowing us to obtain low-dimensional components for all cells in the dataset.

Cell-type Assignment

K-Nearest Neighorbors

In the initial cell-type assignment, each query cell is assigned a cell label using its closest K-nearest neighbors (KNN) in the reference atlas, based on the low dimensional embeddings.

To evaluate the confidence of the KNN assignment, we employ two distinct uncertainty score metrics. The first metric is the KNN-based uncertainty score. This score is calculated by considering the closest neighbors of a query cell in the reference atlas. Query cells that have predominantly nearest neighbors belonging to a single cell type are assigned low uncertainty scores, while those with neighbors consisting of a mixture of cell types receive higher uncertainty scores.

The second metric is a novel computation of a weighted distance-based uncertainty score. This approach is based on the concept that query cells located far from the centroid of their assigned cell type represent an unknown population. The computation involves three steps. First, we identify centroids of reference cell types based on low-dimensional components. Second, we assign different weights to the components for each cell type, considering the closeness of reference cells to their respective cell type’s centroid along each dimension.

Case Study Datasets

Three case studies below gives you a taste of different capabilities of our Annotation workflow. The datasets you will be exploring are:

  1. PBMC 10K Multiome from 10X

  2. scATAC of Triple Negative Breast Cancer (TNBC)

    Citation: Zhang, Yuanyuan, Hongyan Chen, Hongnan Mo, Xueda Hu, Ranran Gao, Yahui Zhao, Baolin Liu, et al. 2021. “Single-Cell Analyses Reveal Key Immune Cell Subsets Associated with Response to PD-L1 Blockade in Triple-Negative Breast Cancer.” Cancer Cell 39 (12): 1578–93.e8.

    • We selected one scATAC sample (9,935 cells) collected from TNBC patients and applied the TIL reference atlas to annotate TIL cells from the TNBC query data

    • Processed TNBC data download

    UMAP of TNBC cells
  3. scATAC of two hold-out Basal cell carcinoma samples

    Citation: Satpathy, Ansuman T et al. “Massively parallel single-cell chromatin landscapes of human immune cell development and intratumoral T cell exhaustion.” Nature biotechnology (2019)

    • We selected two BCC samples held out from the BCC TIL reference building as query data. One sample is pre-treatment and the other is post-treatment by immunotherapy. The original information of BCC cells is shown below:

    • Processed BCC data download

    UMAP of BCC cells

Tutorials

PBMC10K Multiome

[16]:
import scanpy as sc
import pandas as pd
from scipy.sparse import csr_matrix
from scipy import sparse
import scipy.io
import os
import anndata as ad # Anndata version must > 0.8
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='white')

import scATAnno
from scATAnno.SnapATAC2_spectral import *
from scATAnno.SnapATAC2_tools import *
from scATAnno.SnapATAC2_utils import *
from scATAnno import scATAnno_preprocess
from scATAnno import scATAnno_assignment
from scATAnno import scATAnno_integration
from scATAnno import scATAnno_plotting

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)

default_28 = scATAnno_plotting.get_palettes("default28")
default_102 = scATAnno_plotting.get_palettes("default102")

In this tutorial, we apply PBMC reference atlas to PBMC 10K Multiome dataset. The input files are 1) reference atlas in h5ad format; 2) query data in mtx and tsv formats, whose features are aligned with PBMC reference peak.

Download reference data

[13]:
# !wget -O PBMC_reference_atlas_final.h5ad "https://www.dropbox.com/s/y9wc6h5mmydj7gf/PBMC_reference_atlas_final.h5ad?dl=1"

Download query data, including 1) scATAC (matrix.mtx, features.tsv and barcodes.tsv), features are the reference peak of PBMC reference atlas 2) scRNA (filtered_feature_bc_matrix.h5)

[14]:
# !wget -O PBMC10k_multiome_vPBMCatlas.zip "https://www.dropbox.com/s/3g63m832mbeec4s/PBMC10k_multiome_vPBMCatlas.zip?dl=1"

We load reference atlas data and query data. Reference atlas is stored as h5ad AnnData in the data directory; query data is imported from QuickATAC output.

[3]:
os.chdir("scATAnno-main")

output_name = "PBMC_10K_Multiome"
out_dir = os.path.join("case_study", output_name)


reference_data_path = "data/Reference/PBMC_reference_atlas_final.h5ad"
reference_data = scATAnno_preprocess.load_reference_data(reference_data_path)
query_data = scATAnno_preprocess.import_query_data(path = 'data/PBMC10k_multiome_vPBMCatlas/scATAC/',
                                    mtx_file = 'matrix.mtx.gz',
                                    cells_file = 'barcodes.tsv.gz',
                                    features_file = 'features.tsv.gz',
                                    variable_prefix = "pbmc10k_query",
                                    celltype_col = "celltypes",
                                    add_metrics=False)

Before performing integration, we check feature dimensions of the reference and query.

[5]:
print(reference_data)
print(query_data)
assert reference_data.var.shape[0] == query_data.var.shape[0]
AnnData object with n_obs × n_vars = 39441 × 196618
    obs: 'celltypes', 'tissue', 'dataset'
View of AnnData object with n_obs × n_vars = 11909 × 196618
    obs: 'celltypes', 'tissue', 'dataset'

We integrate the reference atlas and query data using SnapATAC2 by 1) concatenating two datasets and selecting highly informative “landmark” reference cells to compute similarity matrix, normalize the matrix and eigen decompositionand and 2) performing Nystrom extension to project the remaining reference and query cells onto the same spectral embedding space

[6]:
# Integrate reference and query data
integrated_adata = scATAnno_assignment.scATAnno_integrate(reference_data, query_data, variable_prefix = "pbmc10k_query", sample_size = 25000)
/Users/jiang/Dropbox (Partners HealthCare)/Software/ATAC_Annotation_V3/ATAC_Annotation_Package/scATAnno/SnapATAC2_spectral.py:159: ImplicitModificationWarning: Trying to modify attribute `.var` of view, initializing view as actual.
  adata.var["selected"] = selected_features
Compute similarity matrix
Normalization
Perform decomposition
Perform Nystrom extension
100%|██████████████████████████████████████████| 3/3 [16:56<00:00, 338.90s/it]

Next, apply harmony to spectral embeddings of integrated data to regress out batch effects indicated in batch_col.

[7]:
# Apply harmony to remove batch effects
integrated_adata_harmony = scATAnno_assignment.scATAnno_harmony(integrated_adata, batch_col = "dataset")
2023-05-23 15:52:26,525 - harmonypy - INFO - Iteration 1 of 20
2023-05-23 15:52:33,706 - harmonypy - INFO - Iteration 2 of 20
2023-05-23 15:52:40,648 - harmonypy - INFO - Iteration 3 of 20
2023-05-23 15:52:47,785 - harmonypy - INFO - Iteration 4 of 20
2023-05-23 15:52:54,813 - harmonypy - INFO - Iteration 5 of 20
2023-05-23 15:53:01,840 - harmonypy - INFO - Iteration 6 of 20
2023-05-23 15:53:08,845 - harmonypy - INFO - Iteration 7 of 20
2023-05-23 15:53:15,864 - harmonypy - INFO - Converged after 7 iterations
[9]:
integrated_adata_harmony
[9]:
AnnData object with n_obs × n_vars = 51350 × 196618
    obs: 'celltypes', 'tissue', 'dataset'
    obsm: 'X_spectral', 'X_spectral_harmony'

we then plot UMAP of the integrated data using harmonized spectral embeddings. Optionally, we can save the AnnData object in an output directory.

[10]:
# Plot UMAP using spectral embeddings
integrated_adata = scATAnno_assignment.scATAnno_umap(integrated_adata_harmony, out_dir, use_rep = "X_spectral_harmony", save = True)
integrated_adata
[10]:
AnnData object with n_obs × n_vars = 51350 × 196618
    obs: 'celltypes', 'tissue', 'dataset'
    obsm: 'X_spectral', 'X_spectral_harmony', 'X_umap'
[27]:
scATAnno_plotting.defaultPlotting_umap()
sc.pl.umap(integrated_adata, color="dataset", palette = ['#228833', '#cccccc'], show=True, title = "Integration")
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
_images/tutorials_demo_PBMC10K_Multiome_18_1.png
[25]:
import pickle
with open(os.path.join(out_dir, 'reference_palette.pickle'), 'rb') as handle:
    celltype_palette = pickle.load(handle)

scATAnno_plotting.defaultPlotting_umap()
sc.pl.umap(reference, color="celltypes", palette = celltype_palette, title = "PBMC Reference Atlas")
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
_images/tutorials_demo_PBMC10K_Multiome_19_1.png

We investigated cell types in the PBMC reference atlas and re-annotated gamma delta T cell cluster as MAIT cell cluster by showing chromatin accessibility of gene signatures in this cluster.

After integration, we want to assign cell types to the query data by taking three steps. First, query cells are assigned labels using K-Nearest neighbors along with KNN-based uncertainty score. Second, query cell labels are corrected using weighted-distance-based uncertainty score. Third, query cells are clustered and annotated at the cluster level.

[28]:
uncertainty_threshold = 0.2
distance_threshold = 90
reference_label_col = "celltypes"
use_rep = "X_spectral_harmony"
atlas = "PBMC"

query = integrated_adata[integrated_adata.obs['dataset'] != "Atlas",:].copy()
# Perform KNN assignment
query_KNN = scATAnno_assignment.scATAnno_KNN_assign(reference, query, reference_label_col=reference_label_col, low_dim_col=use_rep)
[29]:
# Perform weighted-distance based assignment
query_distance = scATAnno_assignment.scATAnno_distance_assign(reference, query_KNN, reference_label_col=reference_label_col, distance_threshold=distance_threshold, atlas=atlas, uncertainty_threshold=uncertainty_threshold, use_rep = use_rep)
[31]:
# Perform cluster-level assignment
query_annotated = scATAnno_assignment.scATAnno_cluster_assign(query_distance, cluster_col = None, use_rep=use_rep)
query_annotated
[31]:
AnnData object with n_obs × n_vars = 11909 × 196618
    obs: 'celltypes', 'tissue', 'dataset', 'uncertainty_score_step1', 'pred_y', 'uncertainty_score_step2', 'Uncertainty_Score', '1.knn-based_celltype', '2.corrected_celltype', 'leiden', 'cluster_annotation'
    uns: 'dataset_colors', 'neighbors', 'umap', 'leiden'
    obsm: 'X_spectral', 'X_spectral_harmony', 'X_umap', 'kernel_distance', 'distance', 'indices', 'neighbors_labels'
    obsp: 'distances', 'connectivities'
[35]:
sc.pl.umap(query_annotated, color = "cluster_annotation",palette=celltype_palette, title = "scATAnno Annotation")
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
_images/tutorials_demo_PBMC10K_Multiome_25_1.png

We benchmark scATAnno results with seurat annotation. Seurat annotation is stored in “Seurat_RNA_annotation.csv”

[34]:
seurat_df = pd.read_csv(os.path.join("data/PBMC10k_multiome_vPBMCatlas/","Seurat_RNA_annotation.csv"), index_col=0)
query_annotated = query_annotated[query_annotated.obs.index.isin(seurat_df.index)]
query_annotated.obs = pd.merge(query_annotated.obs, seurat_df, left_index=True, right_index=True)
query_annotated
[34]:
AnnData object with n_obs × n_vars = 10412 × 196618
    obs: 'celltypes', 'tissue', 'dataset', 'uncertainty_score_step1', 'pred_y', 'uncertainty_score_step2', 'Uncertainty_Score', '1.knn-based_celltype', '2.corrected_celltype', 'leiden', 'cluster_annotation', 'seurat_rna_annotations', 'seurat_new_annotation'
    uns: 'dataset_colors', 'neighbors', 'umap', 'leiden'
    obsm: 'X_spectral', 'X_spectral_harmony', 'X_umap', 'kernel_distance', 'distance', 'indices', 'neighbors_labels'
    obsp: 'distances', 'connectivities'
[37]:
sc.pl.umap(query_annotated, color = "seurat_new_annotation", title = "Seurat Annotation", palette = default_28)
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
_images/tutorials_demo_PBMC10K_Multiome_28_1.png

Lastly, we compare the annotation results between scATAnno and Seurat.

[40]:
outcomes2 = []
for i in query_annotated.obs["cluster_annotation"].index:
    if query_annotated.obs.loc[i,"cluster_annotation"] == query_annotated.obs.loc[i,"seurat_new_annotation"]:
        outcomes2.append("Consistent")
    elif query_annotated.obs.loc[i,"cluster_annotation"] == "unknown":
        outcomes2.append("Unknown")
    else: outcomes2.append("Inconsistent")
query_annotated.obs["outcomes2"] = outcomes2
sc.pl.umap(query_annotated, color = "outcomes2", palette= ['#1f77b4','#ff7f0e','#279e68'], title = "Comparison")
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
_images/tutorials_demo_PBMC10K_Multiome_30_1.png
[42]:
outcomes = []
for i in query_annotated.obs["cluster_annotation"].index:
    if query_annotated.obs.loc[i,"cluster_annotation"] == query_annotated.obs.loc[i,"seurat_new_annotation"]:
        outcomes.append("Consistent")
    else: outcomes.append("Inconsistent")
query_annotated.obs["outcomes"] = outcomes

adata_query = query_annotated.copy()
variable_col = "cluster_annotation"
outcome_col = "outcomes"
listdict = []
for ct in adata_query.obs[variable_col].unique():
    # print(len(adata_query.obs[adata_query.obs['tissue'] == ct]))
    tmp_dict = {}
    for pt in np.unique(adata_query.obs[outcome_col]):
        tmp_dict[pt] = (
            np.sum(
                adata_query.obs[adata_query.obs[variable_col] == ct][outcome_col] == pt
            )
            # / len(adata_query.obs[adata_query.obs[variable_col] == ct])
        )
    l = len(adata_query.obs[adata_query.obs[variable_col] == ct])
    tmp_dict['ct'] = f'{ct} ({l})'
    tmp_dict['n_cells'] = len(adata_query.obs[adata_query.obs[variable_col] == ct])
    listdict.append(tmp_dict)


df = pd.DataFrame(listdict).set_index('ct').sort_values(by='ct')
df = df.iloc[0:df.shape[0]-1].sort_values("n_cells", ascending=True)
print(df)
                             Consistent  Inconsistent  n_cells
ct
pDC (89)                             89             0       89
MAIT (118)                          112             6      118
Treg (148)                          113            35      148
cDC (199)                           183            16      199
Naive B (349)                         3           346      349
NK (383)                            377             6      383
Memory B (414)                        6           408      414
Effector memory CD8 T (736)         582           154      736
Memory CD4 T (1227)                1142            85     1227
Naive CD4 T (1237)                 1178            59     1237
Naive CD8 T (1384)                 1320            64     1384
Monocyte (3369)                    3274            95     3369
[44]:
fig, ax = plt.subplots(1, 1, figsize=(2, 6))
df[df.columns[0:len(df.columns)-1]].plot(kind='barh', stacked=True, ax=ax, width=1.0)
ax.set_xticklabels(ax.get_xticklabels(), fontsize=15)
ax.set_xlabel('Cell Numbers')
ax.set_ylabel('')
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., frameon=False)
ax.grid(False)
sns.despine()
plt.show()
/var/folders/79/yz2sgggd11n9c7kj2nws4w700000gp/T/ipykernel_70711/2726220942.py:3: UserWarning: FixedFormatter should only be used together with FixedLocator
  ax.set_xticklabels(ax.get_xticklabels(), fontsize=15)
_images/tutorials_demo_PBMC10K_Multiome_32_1.png
[47]:
query_annotated.obs
[47]:
celltypes tissue dataset uncertainty_score_step1 pred_y uncertainty_score_step2 Uncertainty_Score 1.knn-based_celltype 2.corrected_celltype leiden cluster_annotation seurat_rna_annotations seurat_new_annotation outcomes2 outcomes
TCAAGAACAGTAATAG-1_pbmc10k_query pbmc10k_query pbmc10k_query pbmc10k_query 4.000000e-01 Treg 1.0 1.000000e+00 unknown unknown 32 Treg CD4 Naive Naive CD4 T Inconsistent Inconsistent
AACCCGCAGGTAGCTT-1_pbmc10k_query pbmc10k_query pbmc10k_query pbmc10k_query 5.000000e-01 Central memory CD8 T 0.0 5.000000e-01 unknown unknown 6 Effector memory CD8 T gdT Gamma delta T Inconsistent Inconsistent
CGCATATAGGTTACGT-1_pbmc10k_query pbmc10k_query pbmc10k_query pbmc10k_query -2.220446e-16 Memory CD4 T 0.0 0.000000e+00 Memory CD4 T Memory CD4 T 2 Memory CD4 T CD4 TCM Memory CD4 T Consistent Consistent
TATTCGTTCCGCCTCA-1_pbmc10k_query pbmc10k_query pbmc10k_query pbmc10k_query -2.220446e-16 Monocyte 0.0 0.000000e+00 Monocyte Monocyte 10 Monocyte CD14 Mono Monocyte Consistent Consistent
CGGTTTGAGTTTGGTA-1_pbmc10k_query pbmc10k_query pbmc10k_query pbmc10k_query 0.000000e+00 Effector memory CD8 T 0.0 0.000000e+00 Effector memory CD8 T Effector memory CD8 T 12 Effector memory CD8 T gdT Gamma delta T Inconsistent Inconsistent
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
GGTCAAGCACCTACTT-1_pbmc10k_query pbmc10k_query pbmc10k_query pbmc10k_query 2.220446e-16 Monocyte 0.0 2.220446e-16 Monocyte Monocyte 17 Monocyte CD14 Mono Monocyte Consistent Consistent
CTTGCTCAGACAAACG-1_pbmc10k_query pbmc10k_query pbmc10k_query pbmc10k_query -2.220446e-16 Monocyte 1.0 1.000000e+00 Monocyte unknown 0 Monocyte CD16 Mono Monocyte Consistent Consistent
GAAAGCCAGCAGGTGG-1_pbmc10k_query pbmc10k_query pbmc10k_query pbmc10k_query 0.000000e+00 Monocyte 1.0 1.000000e+00 Monocyte unknown 14 Monocyte CD14 Mono Monocyte Consistent Consistent
ACGGGAAGTGTTAGCA-1_pbmc10k_query pbmc10k_query pbmc10k_query pbmc10k_query 2.220446e-16 Naive CD8 T 0.0 2.220446e-16 Naive CD8 T Naive CD8 T 4 Naive CD8 T CD8 Naive Naive CD8 T Consistent Consistent
CTATGATCACCATATG-1_pbmc10k_query pbmc10k_query pbmc10k_query pbmc10k_query -2.220446e-16 Monocyte 1.0 1.000000e+00 Monocyte unknown 7 Monocyte CD14 Mono Monocyte Consistent Consistent

10412 rows × 15 columns

[45]:
query_annotated.write(os.path.join(out_dir, "query_annotated.h5ad"))

To examine B cell subtypes, we leverage scRNA-profile to show signature genes

[61]:
# Read in scRNA profile
RNA_anndata = sc.read_10x_h5("data/PBMC10k_multiome_vPBMCatlas/scRNA/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5")
RNA_anndata.obs.index = [ i+ "_pbmc10k_query" for i in RNA_anndata.obs.index]
RNA_anndata = RNA_anndata[RNA_anndata.obs.index.isin(query_annotated.obs.index)]
/opt/anaconda3/lib/python3.9/site-packages/anndata/_core/anndata.py:1830: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
[62]:
RNA_anndata.obs = pd.merge(RNA_anndata.obs, query_annotated.obs[["cluster_annotation", "seurat_new_annotation"]], left_index=True, right_index=True)
UMAP_df = pd.DataFrame(query_annotated.obsm["X_umap"])
UMAP_df.index = query_annotated.obs.index
UMAP_df.columns = ["UMAP_1", "UMAP_2"]
UMAP_df = UMAP_df.loc[RNA_anndata.obs.index]
RNA_anndata.obsm["X_umap"] = np.array(UMAP_df)
RNA_anndata.obs["cluster_annotation"].cat.reorder_categories(np.unique(query_annotated.obs["cluster_annotation"]), inplace=True)
/opt/anaconda3/lib/python3.9/site-packages/anndata/_core/anndata.py:1830: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
[63]:
RNA_anndata.obs
[63]:
cluster_annotation seurat_new_annotation
AAACAGCCAAGGAATC-1_pbmc10k_query unknown Naive CD4 T
AAACAGCCAATCCCTT-1_pbmc10k_query Memory CD4 T Memory CD4 T
AAACAGCCAATGCGCT-1_pbmc10k_query Naive CD4 T Naive CD4 T
AAACAGCCACCAACCG-1_pbmc10k_query Naive CD8 T Naive CD8 T
AAACAGCCAGGATAAC-1_pbmc10k_query Naive CD4 T Naive CD4 T
... ... ...
TTTGTTGGTGACATGC-1_pbmc10k_query Naive CD8 T Naive CD8 T
TTTGTTGGTGTTAAAC-1_pbmc10k_query Naive CD8 T Naive CD8 T
TTTGTTGGTTAGGATT-1_pbmc10k_query NK NK
TTTGTTGGTTGGTTAG-1_pbmc10k_query Memory CD4 T Memory CD4 T
TTTGTTGGTTTGCAGA-1_pbmc10k_query Effector memory CD8 T Effector memory CD8 T

10412 rows × 2 columns

[64]:
sc.pp.filter_genes(RNA_anndata, min_cells=3)
sc.pp.normalize_total(RNA_anndata, target_sum=1e4)
sc.pp.log1p(RNA_anndata)
sc.pp.highly_variable_genes(RNA_anndata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pp.scale(RNA_anndata, max_value=10)
RNA_anndata.var_names_make_unique()
/opt/anaconda3/lib/python3.9/site-packages/anndata/_core/anndata.py:1830: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
[67]:
N_col = 2
N_row = 2

markergenes = ["CD27","TNFRSF13B", # memoryB
              "IGHM", "TCL1A", # naive B
              ]
fig, axes=plt.subplots(nrows= N_row , ncols= N_col , figsize=(N_col*2,N_row*2),
                       sharex=True,sharey=True, )
for i in range(len(markergenes) ):
    i_row = int(i/N_col)
    i_col = i%N_col
    ax = axes[i_row, i_col]
    sc.pl.umap(RNA_anndata, color= markergenes[i],
                    title =  markergenes[i],
               # title = "",
               show=False, ax = ax, legend_loc=None,color_map='plasma', colorbar_loc = None, )
    ax.set_xlabel('')
    ax.set_ylabel('')
fig.tight_layout()
plt.show()
_images/tutorials_demo_PBMC10K_Multiome_40_0.png
[6]:
import scanpy as sc
import pandas as pd
from scipy.sparse import csr_matrix
from scipy import sparse
import scipy.io
import os
import anndata as ad # Anndata version must > 0.8
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='white')

from scATAnno.SnapATAC2_spectral import *
from scATAnno.SnapATAC2_tools import *
from scATAnno.SnapATAC2_utils import *
from scATAnno import scATAnno_preprocess
from scATAnno import scATAnno_assignment
from scATAnno import scATAnno_integration
from scATAnno import scATAnno_plotting

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)

default_28 = scATAnno_plotting.get_palettes("default28")
default_102 = scATAnno_plotting.get_palettes("default102")

TNBC sample

Download TIL reference data

[1]:
# !wget -O BCC_TIL_reference_atlas_final.h5ad "https://www.dropbox.com/s/ky4jezsj3pf2qwi/BCC_TIL_reference_atlas_final.h5ad?dl=1"

Download TNBC query data, whose features are aligned with BCC TIL reference peak

[2]:
# !wget -O TNBC_Pre_P012_t.zip "https://www.dropbox.com/s/7vyk08xrv81b1q2/TNBC_Pre_P012_t.zip?dl=1"
[8]:
os.chdir("scATAnno-main")

output_name = "TNBC"
out_dir = os.path.join("case_study", output_name)


[3]:

reference_data_path = "data/Reference/BCC_TIL_reference_atlas_final.h5ad" reference_data = scATAnno_preprocess.load_reference_data(reference_data_path) query_data = scATAnno_preprocess.import_query_data(path = 'data/TNBC_Pre_P012_t/', mtx_file = 'matrix.mtx.gz', cells_file = 'barcodes.tsv.gz', features_file = 'features.tsv.gz', variable_prefix = "TNBC_Pre_P012_t", celltype_col = "celltypes", add_metrics=False)
[4]:
print(reference_data)
print(query_data)
assert reference_data.var.shape[0] == query_data.var.shape[0]
AnnData object with n_obs × n_vars = 22008 × 344492
    obs: 'celltypes', 'tissue', 'dataset'
    var: 'selected'
    uns: 'Group_colors', 'celltypes_colors', 'spectral_eigenvalue', 'true_label_colors'
    obsm: 'X_spectral', 'X_umap'
View of AnnData object with n_obs × n_vars = 9935 × 344492
    obs: 'celltypes', 'tissue', 'dataset'
[5]:
# Integrate reference and query data
integrated_adata = scATAnno_assignment.scATAnno_integrate(reference_data, query_data, variable_prefix = "TNBC_Pre_P012_t", sample_size = 25000)
/Users/jiang/Dropbox (Partners HealthCare)/Software/ATAC_Annotation_V3/ATAC_Annotation_Package/scATAnno/SnapATAC2_spectral.py:159: ImplicitModificationWarning: Trying to modify attribute `.var` of view, initializing view as actual.
  adata.var["selected"] = selected_features
Compute similarity matrix
Normalization
Perform decomposition
Perform Nystrom extension
100%|████████████████████████████████████████████| 2/2 [04:20<00:00, 130.26s/it]
[6]:
# Apply harmony to remove batch effects
integrated_adata_harmony = scATAnno_assignment.scATAnno_harmony(integrated_adata, batch_col = "dataset")
2023-05-25 17:42:14,786 - harmonypy - INFO - Iteration 1 of 20
2023-05-25 17:42:21,018 - harmonypy - INFO - Iteration 2 of 20
2023-05-25 17:42:26,180 - harmonypy - INFO - Converged after 2 iterations
[7]:
# Plot UMAP using spectral embeddings
integrated_adata = scATAnno_assignment.scATAnno_umap(integrated_adata_harmony, out_dir, use_rep = "X_spectral_harmony", save = True)
integrated_adata
[7]:
AnnData object with n_obs × n_vars = 31943 × 344492
    obs: 'celltypes', 'tissue', 'dataset'
    obsm: 'X_spectral', 'X_spectral_harmony', 'X_umap'
[8]:
scATAnno_plotting.defaultPlotting_umap()
sc.pl.umap(integrated_adata, color="dataset", palette = ['#228833', '#cccccc'], show=True, title = "Integration")
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
_images/tutorials_demo_TNBC_12_1.png
[17]:
integrated_adata = sc.read_h5ad(os.path.join(out_dir, "1.Merged_query_reference.h5ad"))
integrated_adata = scATAnno_preprocess.add_variable(os.path.join(out_dir,"X_spectral_harmony.csv"), integrated_adata)
[18]:
reference = integrated_adata[integrated_adata.obs['dataset'] == "Atlas",:].copy()
reference.obs["celltypes"] =  scATAnno_assignment.curate_celltype_names(reference.obs["celltypes"], atlas = "TIL")
[11]:
import pickle
with open(os.path.join(out_dir, 'TIL_reference_palette.pickle'), 'rb') as handle:
    celltype_palette = pickle.load(handle)

scATAnno_plotting.defaultPlotting_umap()
sc.pl.umap(reference, color="celltypes", palette = celltype_palette, title = "TIL Reference Atlas")
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
_images/tutorials_demo_TNBC_15_1.png
[19]:
query = integrated_adata[integrated_adata.obs['dataset'] != "Atlas",:].copy()
[20]:
projecton = pd.read_csv(os.path.join("data/TNBC_Pre_P012_t/", "query_metadata.csv"), index_col=0)
query.obs = pd.merge(query.obs, projecton, left_index=True, right_index=True)

We first assign cell types based on KNN and compute KNN-based uncertainty score

[21]:
# Perform KNN assignment
reference_label_col = "celltypes"
use_rep = "X_spectral_harmony"
query_KNN = scATAnno_assignment.scATAnno_KNN_assign(reference, query, reference_label_col=reference_label_col, low_dim_col=use_rep)

Because we are annotating TIL subtypes, we adopt more loose threshold cutoffs to annotate query cells. Query cells are assigned high weighted distance-based uncertainty score if their distances to the assigned reference cell type centroid are greater than 95 percentile. Query cells with uncertainty score greater than 0.5 are annotated as unknown.

[22]:
# Perform weighted-distance based assignment
distance_threshold = 95
uncertainty_threshold = 0.5

atlas = "TIL"
query_distance = scATAnno_assignment.scATAnno_distance_assign(reference, query_KNN, reference_label_col=reference_label_col, distance_threshold=distance_threshold, atlas=atlas, uncertainty_threshold=uncertainty_threshold, use_rep = use_rep)

Finally, we perform cluster-level assignment of query cells.

[23]:
query_annotated = scATAnno_assignment.scATAnno_cluster_assign(query_distance, use_rep=use_rep, cluster_col = "Clusters")
[24]:
import pickle
with open(os.path.join(out_dir, 'all_palette_dictionary.pkl'), 'rb') as handle:
    annotation_palette = pickle.load(handle)
sc.pl.umap(query_annotated, color = ["paper_annotation_curated", "cluster_annotation", 'Uncertainty_Score'], palette=annotation_palette,title = ["Annotation Published in Zhang et al.","scATAnno Annotation"],  legend_loc = "on data")
WARNING: The title list is shorter than the number of panels. Using 'color' value instead for some plots.
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
_images/tutorials_demo_TNBC_24_2.png

Compare the annotation results between scATAnno and original publication.

[25]:
from pySankey.sankey import sankey

pred_col = 'cluster_annotation'
true_col = 'paper_annotation_curated'

sankey(
    query_annotated.obs[pred_col], query_annotated.obs[true_col], aspect=10, colorDict=annotation_palette,
    fontsize=10, figure_name=None
    )

# plt.gcf().set_size_inches(6, 10)
# plt.savefig(os.path.join(out_dir,"{}.png".format("Sankey_CellType_Annotation_paper_cluster")), bbox_inches='tight', dpi=300)

_images/tutorials_demo_TNBC_26_0.png

BCC Multiple Samples

[1]:
import scanpy as sc
import pandas as pd
from scipy.sparse import csr_matrix
from scipy import sparse
import scipy.io
import os
import anndata as ad # Anndata version must > 0.8
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='white')

import scATAnno
from scATAnno.SnapATAC2_spectral import *
from scATAnno.SnapATAC2_tools import *
from scATAnno.SnapATAC2_utils import *
from scATAnno import scATAnno_preprocess
from scATAnno import scATAnno_assignment
from scATAnno import scATAnno_integration
from scATAnno import scATAnno_plotting

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)

default_28 = scATAnno_plotting.get_palettes("default28")
default_102 = scATAnno_plotting.get_palettes("default102")

Part I: First round annotation - Apply Healthy Adult reference atlas to BCC tumor samples

We first apply healthy adult reference atlas to two BCC samples. We load reference atlas data and BCC query data with healthy adult reference peak as feature input. Reference atlas is stored as h5ad AnnData in the data directory; query data is imported as MTX and TSV files from QuickATAC output.

Download reference data

[ ]:
# !wget -O Healthy_Adult_reference_atlas.h5ad "https://www.dropbox.com/s/3ezp2t6gw6hw21v/Healthy_Adult_reference_atlas.h5ad?dl=1"

Download query data. Each BCC sample has two features, one aligned with Healthy adult reference peak, and the other aligned with BCC TIL reference peak.

[1]:
# !wget -O BCC_samples.zip "https://www.dropbox.com/s/eqkmt5trp3350cj/BCC_samples.zip?dl=1"
[3]:
os.chdir("scATAnno-main")

output_name = "BCC_samples"
out_dir = os.path.join("case_study", output_name)

reference_data_path = "data/Reference/Healthy_Adult_reference_atlas.h5ad"
reference_data = scATAnno_preprocess.load_reference_data(reference_data_path)
SU007_Total_Post = scATAnno_preprocess.import_query_data(path = 'data/BCC_samples/SU007_Total_Post_vHealthyAdult',
                                    mtx_file = 'matrix.mtx.gz',
                                    cells_file = 'barcodes.tsv.gz',
                                    features_file = 'features.tsv.gz',
                                    variable_prefix = "SU007_Total_Post",
                                    celltype_col = "celltypes",
                                    add_metrics=False)
SU010_Total_Pre = scATAnno_preprocess.import_query_data(path = 'data/BCC_samples/SU010_Total_Pre_vHealthyAdult',
                                    mtx_file = 'matrix.mtx.gz',
                                    cells_file = 'barcodes.tsv',
                                    features_file = 'features.tsv.gz',
                                    variable_prefix = "SU010_Total_Pre",
                                    celltype_col = "celltypes",
                                    add_metrics=False)
[35]:
print(SU010_Total_Pre)
print(SU007_Total_Post)
print(reference_data)

assert reference_data.var.shape[0] == SU010_Total_Pre.var.shape[0]
assert reference_data.var.shape[0] == SU007_Total_Post.var.shape[0]
AnnData object with n_obs × n_vars = 1071 × 890130
    obs: 'celltypes', 'tissue', 'dataset'
    var: 'selected'
AnnData object with n_obs × n_vars = 666 × 890130
    obs: 'celltypes', 'tissue', 'dataset'
    var: 'selected'
AnnData object with n_obs × n_vars = 100158 × 890130
    obs: 'celltypes', 'tissue', 'dataset'
    var: 'selected'
    uns: 'cell type_colors', 'major cell type_colors', 'spectral_eigenvalue', 'tissue_colors'
    obsm: 'X_umap'

We first create a dictionary of reference and query datasets

[6]:
data_list = [SU007_Total_Post, SU010_Total_Pre]
print("total number of query cells: {}".format(np.sum([i.obs.shape[0] for i in data_list])))
total number of query cells: 1737

Before integration, we need to select features of datasets and prepare for integration

[7]:
select_features(reference_data)
for anndata in data_list:
    select_features(anndata)

datasets={}
datasets["Atlas"] = reference_data
for anndata in data_list:
    key = (anndata.obs["dataset"])[0]
    datasets[key] = anndata
datasets
/Users/jiang/Dropbox (Partners HealthCare)/Software/ATAC_Annotation_V3/ATAC_Annotation_Package/scATAnno/SnapATAC2_spectral.py:159: ImplicitModificationWarning: Trying to modify attribute `.var` of view, initializing view as actual.
  adata.var["selected"] = selected_features
/Users/jiang/Dropbox (Partners HealthCare)/Software/ATAC_Annotation_V3/ATAC_Annotation_Package/scATAnno/SnapATAC2_spectral.py:159: ImplicitModificationWarning: Trying to modify attribute `.var` of view, initializing view as actual.
  adata.var["selected"] = selected_features
[7]:
{'Atlas': AnnData object with n_obs × n_vars = 100158 × 890130
     obs: 'Unnamed: 0', 'sample', 'replicate', 'logUMI', 'tsse', 'tissue', 'cell type', 'Life stage', 'major cell type', 'dataset'
     var: 'selected'
     uns: 'cell type_colors', 'major cell type_colors', 'spectral_eigenvalue', 'tissue_colors'
     obsm: 'X_umap',
 'SU007_Total_Post': AnnData object with n_obs × n_vars = 666 × 890130
     obs: 'celltypes', 'tissue', 'dataset'
     var: 'selected',
 'SU010_Total_Pre': AnnData object with n_obs × n_vars = 1071 × 890130
     obs: 'celltypes', 'tissue', 'dataset'
     var: 'selected'}

We use SnapATAC2 to integrate the reference atlas and multiple query data

[8]:
# Integrate reference and query data
integrated_adata = scATAnno_assignment.scATAnno_integrate_multiple(datasets, sample_size = 25000)
Compute similarity matrix
Normalization
Perform decomposition
Perform Nystrom extension
100%|█████████████████████████████████████████████| 6/6 [05:55<00:00, 59.27s/it]

Apply harmony to remove batch effects of datasets

[10]:
# Apply harmony to remove batch effects
integrated_adata_harmony = scATAnno_assignment.scATAnno_harmony(integrated_adata, batch_col = "dataset")
2023-05-25 14:11:10,406 - harmonypy - INFO - Iteration 1 of 20
2023-05-25 14:11:31,432 - harmonypy - INFO - Iteration 2 of 20
2023-05-25 14:11:52,638 - harmonypy - INFO - Converged after 2 iterations

Plot UMAP using spectral embeddings

[47]:
# Plot UMAP using spectral embeddings
integrated_adata = scATAnno_assignment.scATAnno_umap(integrated_adata_harmony, out_dir, use_rep = "X_spectral_harmony", save = True)
[16]:
# Set plotting parameters
scATAnno_plotting.defaultPlotting_umap()
sc.pl.umap(integrated_adata, color="dataset", palette = ['#cccccc', '#228833','#8c564b'], size=10, show=True, title = "Integration")
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
_images/tutorials_demo_BCC_multiple_samples_20_1.png
[46]:
integrated_adata.obs
[46]:
tissue dataset celltypes
adipose_omentum_SM-CSSD4_1+AAGTCCTTAGGGCTGATATGCT_Atlas adipose_omentum_SM-CSSD4 Atlas Mesenchymal
adipose_omentum_SM-CSSD4_1+AACGACCAAAGGCGTACTTCTA_Atlas adipose_omentum_SM-CSSD4 Atlas Mesenchymal
adipose_omentum_SM-IOBHJ_1+ATACTCGCTTAGCGGCTGAAAC_Atlas adipose_omentum_SM-IOBHJ Atlas Mesenchymal
adipose_omentum_SM-CSSD4_1+CAGGAAAGCAAAAGGGATGCCA_Atlas adipose_omentum_SM-CSSD4 Atlas Mesenchymal
adipose_omentum_SM-IOBHJ_1+ATGCAGGTAAATAGCTGTGTCT_Atlas adipose_omentum_SM-IOBHJ Atlas Mesenchymal
... ... ... ...
SU010_Total_Pre#TGGGTTATCCCAGCAG-1_SU010_Total_Pre SU010_Total_Pre SU010_Total_Pre SU010_Total_Pre
SU010_Total_Pre#GGGTGTCTCGGGACAA-1_SU010_Total_Pre SU010_Total_Pre SU010_Total_Pre SU010_Total_Pre
SU010_Total_Pre#ACCGGGTCATCCCAAA-1_SU010_Total_Pre SU010_Total_Pre SU010_Total_Pre SU010_Total_Pre
SU010_Total_Pre#CCCGTTATCTAGCAGT-1_SU010_Total_Pre SU010_Total_Pre SU010_Total_Pre SU010_Total_Pre
SU010_Total_Pre#GAACGTTGTTAAGTCC-1_SU010_Total_Pre SU010_Total_Pre SU010_Total_Pre SU010_Total_Pre

101895 rows × 3 columns

[51]:
reference = integrated_adata[integrated_adata.obs['dataset'] == "Atlas",:].copy()
reference.obs["celltypes"] =  scATAnno_assignment.curate_celltype_names(reference.obs["celltypes"], atlas = "HealthyAdult")
[61]:
import pickle
with open(os.path.join(out_dir, 'HealthyAdult_reference_palette.pickle'), 'rb') as handle:
    celltype_palette = pickle.load(handle)

scATAnno_plotting.defaultPlotting_umap()
sc.pl.umap(reference, color="celltypes", palette = celltype_palette, title = "Healthy Adult Reference Atlas")
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
_images/tutorials_demo_BCC_multiple_samples_23_1.png

After integration, we want to assign cell types to the query data. First, query cells are assigned labels using K-Nearest neighbors along with KNN-based uncertainty score. Second, query cell labels are corrected using weighted-distance-based uncertainty score. Third, query cells are clustered and annotated at the cluster level.

[62]:
query = integrated_adata[integrated_adata.obs['dataset'] != "Atlas",:].copy()

Obtain UMAP coordinates and clusters provided by the original publication. Then Query cells are clustered and annotated at the cluster-level

[71]:
projection_allcells = pd.read_table("data/BCC_samples/BCC_study_metadata.csv", index_col = 0, sep = ",")
projection_allcells.index = [projection_allcells.loc[i,"Group"] + "#" + i for i in projection_allcells.index]
sample_names = ["SU010_Total_Pre", "SU007_Total_Post"]
projection = projection_allcells[projection_allcells["Group"].isin(sample_names)]
query.obs = pd.merge(query.obs, projection, left_index=True, right_index=True)
projection.head()
[71]:
UMAP_1 UMAP_2 Clusters Group depth FRIP Barcodes Internal_Name Group_Barcode true_label
SU007_Total_Post#GGACACCAGCGCACAA-1_SU007_Total_Post 1.320694 12.943564 Cluster14 SU007_Total_Post 50427 0.676443 GGACACCAGCGCACAA-1 SU007_Total_Post_57 SU007_Total_Post#GGACACCAGCGCACAA-1 Myeloid
SU007_Total_Post#ACTAGGTGTACCCATA-1_SU007_Total_Post 5.631482 13.093065 Cluster15 SU007_Total_Post 45889 0.580553 ACTAGGTGTACCCATA-1 SU007_Total_Post_59 SU007_Total_Post#ACTAGGTGTACCCATA-1 Endothelial
SU007_Total_Post#AGGACGATCGGTCTCT-1_SU007_Total_Post 1.471412 -7.473051 Cluster4 SU007_Total_Post 21568 0.688590 AGGACGATCGGTCTCT-1 SU007_Total_Post_60 SU007_Total_Post#AGGACGATCGGTCTCT-1 Treg
SU007_Total_Post#GACCAATCATGGGTGA-1_SU007_Total_Post -4.932662 2.046299 Cluster7 SU007_Total_Post 25402 0.714097 GACCAATCATGGGTGA-1 SU007_Total_Post_61 SU007_Total_Post#GACCAATCATGGGTGA-1 Memory CD8 T
SU007_Total_Post#GCTCGAGAGTAATGTG-1_SU007_Total_Post -4.171803 3.467478 Cluster7 SU007_Total_Post 6544 0.714318 GCTCGAGAGTAATGTG-1 SU007_Total_Post_62 SU007_Total_Post#GCTCGAGAGTAATGTG-1 Memory CD8 T

We first assign cell types based on KNN and compute KNN-based uncertainty score

[72]:
# Perform KNN assignment
query_KNN = scATAnno_assignment.scATAnno_KNN_assign(reference, query, reference_label_col=reference_label_col, low_dim_col=use_rep)

Query cells are assigned high weighted distance-based uncertainty score if their distances to the assigned reference cell type centroid are greater than 90 percentile. Query cells with uncertainty score greater than 0.2 are annotated as unknown.

[73]:
# Perform weighted-distance based assignment
distance_threshold = 90
uncertainty_threshold = 0.2
reference_label_col = "celltypes"
use_rep = "X_spectral_harmony"
atlas = "HealthyAdult"
query_distance = scATAnno_assignment.scATAnno_distance_assign(reference, query_KNN, reference_label_col=reference_label_col, distance_threshold=distance_threshold, atlas=atlas, uncertainty_threshold=uncertainty_threshold, use_rep = use_rep)

We then project query cells into the same UMAP coordinates as original publication for better comparisons, and perform cluster-level assignment.

[85]:
query_distance.obsm["X_umap"] = np.array(query_distance.obs[["UMAP_1", "UMAP_2"]])
query_annotated = scATAnno_assignment.scATAnno_cluster_assign(query_distance, use_rep=use_rep, cluster_col = "Clusters", UMAP=False)
query_annotated
[85]:
AnnData object with n_obs × n_vars = 1737 × 890130
    obs: 'tissue', 'dataset', 'celltypes', 'uncertainty_score_step1', 'pred_y', 'uncertainty_score_step2', 'Uncertainty_Score', '1.knn-based_celltype', '2.corrected_celltype', 'UMAP_1', 'UMAP_2', 'Clusters', 'Group', 'depth', 'FRIP', 'Barcodes', 'Internal_Name', 'Group_Barcode', 'true_label', 'cluster_annotation'
    uns: 'dataset_colors'
    obsm: 'X_spectral', 'X_spectral_harmony', 'X_umap', 'kernel_distance', 'distance', 'indices', 'neighbors_labels'

Compare annotation results from scATAnno and the origional publication

[95]:
sc.pl.umap(query_annotated, color = ['true_label', "cluster_annotation", 'Uncertainty_Score'], title = ["Annotation Published in Satpathy et al.","scATAnno Annotation"], legend_loc = "on data")
WARNING: The title list is shorter than the number of panels. Using 'color' value instead for some plots.
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
_images/tutorials_demo_BCC_multiple_samples_35_2.png
[129]:
immune_cells = pd.DataFrame(query_annotated[query_annotated.obs["cluster_annotation"] == "Immune Cells"].obs.index)
print(immune_cells.head())
                                                   0
0  SU007_Total_Post#TCGTTCGAGGTTCTCA-1_SU007_Tota...
1  SU007_Total_Post#GCAACCGAGAAGAGTG-1_SU007_Tota...
2  SU007_Total_Post#GTGTCAAGTTAGTAGA-1_SU007_Tota...
3  SU007_Total_Post#CTTGTCGAGCGAGCTA-1_SU007_Tota...
4  SU007_Total_Post#GCAGCTGCATTCGTCC-1_SU007_Tota...

Part II: Second round annotation - Apply BCC TIL reference atlas to BCC immune cells

Download TIL reference data

[2]:
# !wget "https://www.dropbox.com/s/ky4jezsj3pf2qwi/BCC_TIL_reference_atlas_final.h5ad?dl=1"

In the second round annotation, we apply BCC TIL reference atlas to the immune cells identified from the first round. Repeat the integration process. This time we use BCC TIL as reference, and use BCC query data with TIL reference peak as feature input. Reference atlas is stored as h5ad AnnData in the data directory; query data is imported as MTX and TSV files from QuickATAC output.

[108]:
reference_data_path = "data/Reference/BCC_TIL_reference_atlas_final.h5ad"
reference_data = scATAnno_preprocess.load_reference_data(reference_data_path)
SU007_Total_Post = scATAnno_preprocess.import_query_data(path = 'data/BCC_samples/SU007_Total_Post_vBCC',
                                    mtx_file = 'matrix.mtx.gz',
                                    cells_file = 'barcodes.tsv.gz',
                                    features_file = 'features.tsv.gz',
                                    variable_prefix = "SU007_Total_Post",
                                    celltype_col = "celltypes",
                                    add_metrics=False)
SU010_Total_Pre = scATAnno_preprocess.import_query_data(path = 'data/BCC_samples/SU010_Total_Pre_vBCC',
                                    mtx_file = 'matrix.mtx.gz',
                                    cells_file = 'barcodes.tsv',
                                    features_file = 'features.tsv.gz',
                                    variable_prefix = "SU010_Total_Pre",
                                    celltype_col = "celltypes",
                                    add_metrics=False)
[132]:
data_list = [SU007_Total_Post, SU010_Total_Pre]
print("Total number of query cells: {}".format(np.sum([i.obs.shape[0] for i in data_list])))
Total number of query cells: 1737

Select the immune cells from first round annotation

[130]:
immune_cells.index = immune_cells[0]
immune_cells.index = [i.split("_SU")[0] for i in immune_cells.index]
immune_cells.head()
[130]:
0
SU007_Total_Post#TCGTTCGAGGTTCTCA-1 SU007_Total_Post#TCGTTCGAGGTTCTCA-1_SU007_Tota...
SU007_Total_Post#GCAACCGAGAAGAGTG-1 SU007_Total_Post#GCAACCGAGAAGAGTG-1_SU007_Tota...
SU007_Total_Post#GTGTCAAGTTAGTAGA-1 SU007_Total_Post#GTGTCAAGTTAGTAGA-1_SU007_Tota...
SU007_Total_Post#CTTGTCGAGCGAGCTA-1 SU007_Total_Post#CTTGTCGAGCGAGCTA-1_SU007_Tota...
SU007_Total_Post#GCAGCTGCATTCGTCC-1 SU007_Total_Post#GCAGCTGCATTCGTCC-1_SU007_Tota...
[133]:
data_list_use = data_list.copy()
for i in range(len(data_list_use)):
    print(data_list_use[i].obs["dataset"][0])
    data_list_use[i] = data_list_use[i][data_list_use[i].obs.index.isin(immune_cells.index)]
    print(data_list_use[i])

print("Total immune cells: {}".format(np.sum([i.obs.shape[0] for i in data_list_use])))
SU007_Total_Post
View of AnnData object with n_obs × n_vars = 618 × 344492
    obs: 'celltypes', 'tissue', 'dataset'
    var: 'selected'
SU010_Total_Pre
View of AnnData object with n_obs × n_vars = 188 × 344492
    obs: 'celltypes', 'tissue', 'dataset'
    var: 'selected'
Total immune cells: 806

We use SnapATAC2 to integrate the reference atlas and multiple query data

[134]:
for anndata in data_list_use:
    select_features(anndata)
select_features(reference_data)

datasets={}
datasets["Atlas"] = reference_data
for anndata in data_list_use:
    key = (anndata.obs["dataset"])[0]
    datasets[key] = anndata
datasets
/Users/jiang/Dropbox (Partners HealthCare)/Software/ATAC_Annotation_V3/ATAC_Annotation_Package/scATAnno/SnapATAC2_spectral.py:159: ImplicitModificationWarning: Trying to modify attribute `.var` of view, initializing view as actual.
  adata.var["selected"] = selected_features
[134]:
{'Atlas': AnnData object with n_obs × n_vars = 22008 × 344492
     obs: 'celltypes', 'tissue', 'dataset'
     var: 'selected'
     uns: 'Group_colors', 'celltypes_colors', 'spectral_eigenvalue', 'true_label_colors'
     obsm: 'X_spectral', 'X_umap',
 'SU007_Total_Post': AnnData object with n_obs × n_vars = 618 × 344492
     obs: 'celltypes', 'tissue', 'dataset'
     var: 'selected',
 'SU010_Total_Pre': AnnData object with n_obs × n_vars = 188 × 344492
     obs: 'celltypes', 'tissue', 'dataset'
     var: 'selected'}
[135]:
# Integrate reference and query data
integrated_adata = scATAnno_assignment.scATAnno_integrate_multiple(datasets, sample_size = 25000)
Compute similarity matrix
Normalization
Perform decomposition
[136]:
# Apply harmony to remove batch effects
integrated_adata_harmony = scATAnno_assignment.scATAnno_harmony(integrated_adata, batch_col = "dataset")
2023-05-25 15:23:46,290 - harmonypy - INFO - Iteration 1 of 20
2023-05-25 15:23:49,706 - harmonypy - INFO - Iteration 2 of 20
2023-05-25 15:23:53,264 - harmonypy - INFO - Iteration 3 of 20
2023-05-25 15:23:56,800 - harmonypy - INFO - Iteration 4 of 20
2023-05-25 15:24:00,334 - harmonypy - INFO - Iteration 5 of 20
2023-05-25 15:24:03,925 - harmonypy - INFO - Iteration 6 of 20
2023-05-25 15:24:07,365 - harmonypy - INFO - Iteration 7 of 20
2023-05-25 15:24:11,058 - harmonypy - INFO - Converged after 7 iterations
[137]:
# Plot UMAP using spectral embeddings
integrated_adata = scATAnno_assignment.scATAnno_umap(integrated_adata_harmony, out_dir, use_rep = "X_spectral_harmony", save = True)
[138]:
# Set plotting parameters
scATAnno_plotting.defaultPlotting_umap()
sc.pl.umap(integrated_adata, color="dataset", palette = ['#cccccc', '#228833','#8c564b'], size=10, show=True, title = "Integration")
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
_images/tutorials_demo_BCC_multiple_samples_51_1.png
[145]:
reference = integrated_adata[integrated_adata.obs['dataset'] == "Atlas",:].copy()
reference.obs["celltypes"] =  scATAnno_assignment.curate_celltype_names(reference.obs["celltypes"], atlas = "TIL")
[144]:
import pickle
with open(os.path.join(out_dir, 'TIL_reference_palette.pickle'), 'rb') as handle:
    celltype_palette = pickle.load(handle)

scATAnno_plotting.defaultPlotting_umap()
sc.pl.umap(reference, color="celltypes", palette = celltype_palette, title = "TIL Reference Atlas")
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
_images/tutorials_demo_BCC_multiple_samples_53_1.png
[146]:
query = integrated_adata[integrated_adata.obs['dataset'] != "Atlas",:].copy()
[147]:
projection_allcells = pd.read_table("data/BCC_samples/BCC_study_metadata.csv", index_col = 0, sep = ",")
projection_allcells.index = [projection_allcells.loc[i,"Group"] + "#" + i for i in projection_allcells.index]
projection_allcells["true_label"] = [i.strip() for i in projection_allcells["true_label"]]
sample_names = ["SU010_Total_Pre", "SU007_Total_Post"]
projection = projection_allcells[projection_allcells["Group"].isin(sample_names)]
query.obs = pd.merge(query.obs, projection, left_index=True, right_index=True)
query.obs
[147]:
celltypes tissue dataset UMAP_1 UMAP_2 Clusters Group depth FRIP Barcodes Internal_Name Group_Barcode true_label
SU007_Total_Post#TCGTTCGAGGTTCTCA-1_SU007_Total_Post SU007_Total_Post SU007_Total_Post SU007_Total_Post 2.118399 -6.020474 Cluster4 SU007_Total_Post 10159 0.722758 TCGTTCGAGGTTCTCA-1 SU007_Total_Post_312 SU007_Total_Post#TCGTTCGAGGTTCTCA-1 Treg
SU007_Total_Post#GCAACCGAGAAGAGTG-1_SU007_Total_Post SU007_Total_Post SU007_Total_Post SU007_Total_Post 2.138090 -8.092027 Cluster4 SU007_Total_Post 12455 0.715857 GCAACCGAGAAGAGTG-1 SU007_Total_Post_1 SU007_Total_Post#GCAACCGAGAAGAGTG-1 Treg
SU007_Total_Post#GTGTCAAGTTAGTAGA-1_SU007_Total_Post SU007_Total_Post SU007_Total_Post SU007_Total_Post 0.018025 -1.757976 Cluster1 SU007_Total_Post 9101 0.621910 GTGTCAAGTTAGTAGA-1 SU007_Total_Post_468 SU007_Total_Post#GTGTCAAGTTAGTAGA-1 Naive CD4 T
SU007_Total_Post#CTTGTCGAGCGAGCTA-1_SU007_Total_Post SU007_Total_Post SU007_Total_Post SU007_Total_Post -1.315312 1.582670 Cluster5 SU007_Total_Post 22287 0.669695 CTTGTCGAGCGAGCTA-1 SU007_Total_Post_3 SU007_Total_Post#CTTGTCGAGCGAGCTA-1 Naive CD8 T
SU007_Total_Post#GCAGCTGCATTCGTCC-1_SU007_Total_Post SU007_Total_Post SU007_Total_Post SU007_Total_Post 0.688227 -1.430185 Cluster1 SU007_Total_Post 22482 0.693866 GCAGCTGCATTCGTCC-1 SU007_Total_Post_4 SU007_Total_Post#GCAGCTGCATTCGTCC-1 Naive CD4 T
... ... ... ... ... ... ... ... ... ... ... ... ... ...
SU010_Total_Pre#CCAGATAAGTACGCGA-1_SU010_Total_Pre SU010_Total_Pre SU010_Total_Pre SU010_Total_Pre -3.657552 2.379094 Cluster7 SU010_Total_Pre 6428 0.583852 CCAGATAAGTACGCGA-1 SU010_Total_Pre_1057 SU010_Total_Pre#CCAGATAAGTACGCGA-1 Memory CD8 T
SU010_Total_Pre#ACAGGCCTCAAGAGAT-1_SU010_Total_Pre SU010_Total_Pre SU010_Total_Pre SU010_Total_Pre -3.117948 1.990111 Cluster5 SU010_Total_Pre 3820 0.529188 ACAGGCCTCAAGAGAT-1 SU010_Total_Pre_1058 SU010_Total_Pre#ACAGGCCTCAAGAGAT-1 Naive CD8 T
SU010_Total_Pre#TGGGTTATCCCAGCAG-1_SU010_Total_Pre SU010_Total_Pre SU010_Total_Pre SU010_Total_Pre -1.875685 0.338385 Cluster5 SU010_Total_Pre 4741 0.469099 TGGGTTATCCCAGCAG-1 SU010_Total_Pre_1061 SU010_Total_Pre#TGGGTTATCCCAGCAG-1 Naive CD8 T
SU010_Total_Pre#GGGTGTCTCGGGACAA-1_SU010_Total_Pre SU010_Total_Pre SU010_Total_Pre SU010_Total_Pre -4.008999 16.443800 Cluster13 SU010_Total_Pre 6619 0.599940 GGGTGTCTCGGGACAA-1 SU010_Total_Pre_1064 SU010_Total_Pre#GGGTGTCTCGGGACAA-1 Plasma B
SU010_Total_Pre#CCCGTTATCTAGCAGT-1_SU010_Total_Pre SU010_Total_Pre SU010_Total_Pre SU010_Total_Pre -4.014687 1.344591 Cluster10 SU010_Total_Pre 32230 0.209680 CCCGTTATCTAGCAGT-1 SU010_Total_Pre_1068 SU010_Total_Pre#CCCGTTATCTAGCAGT-1 NK1

806 rows × 13 columns

We first assign cell types based on KNN and compute KNN-based uncertainty score

[149]:
# Perform KNN assignment
reference_label_col = "celltypes"
use_rep = "X_spectral_harmony"
query_KNN = scATAnno_assignment.scATAnno_KNN_assign(reference, query, reference_label_col=reference_label_col, low_dim_col=use_rep)

Because we are annotating TIL subtypes, we adopt more loose threshold cutoffs to annotate query cells. Query cells are assigned high weighted distance-based uncertainty score if their distances to the assigned reference cell type centroid are greater than 95 percentile. Query cells with uncertainty score greater than 0.5 are annotated as unknown.

[150]:
# Perform weighted-distance based assignment
distance_threshold = 95
uncertainty_threshold = 0.5

atlas = "HealthyAdult"
query_distance = scATAnno_assignment.scATAnno_distance_assign(reference, query_KNN, reference_label_col=reference_label_col, distance_threshold=distance_threshold, atlas=atlas, uncertainty_threshold=uncertainty_threshold, use_rep = use_rep)

We then project query cells into the same UMAP coordinates as original publication for better comparisons, and perform cluster-level assignment.

[151]:
query_distance.obsm["X_umap"] = np.array(query_distance.obs[["UMAP_1", "UMAP_2"]])
query_annotated = scATAnno_assignment.scATAnno_cluster_assign(query_distance, use_rep=use_rep, cluster_col = "Clusters", UMAP=False)
query_annotated
[151]:
AnnData object with n_obs × n_vars = 806 × 344492
    obs: 'celltypes', 'tissue', 'dataset', 'UMAP_1', 'UMAP_2', 'Clusters', 'Group', 'depth', 'FRIP', 'Barcodes', 'Internal_Name', 'Group_Barcode', 'true_label', 'uncertainty_score_step1', 'pred_y', 'uncertainty_score_step2', 'Uncertainty_Score', '1.knn-based_celltype', '2.corrected_celltype', 'cluster_annotation'
    uns: 'dataset_colors'
    obsm: 'X_spectral', 'X_spectral_harmony', 'X_umap', 'kernel_distance', 'distance', 'indices', 'neighbors_labels'
[152]:
sc.pl.umap(query_annotated, color = ['true_label', "cluster_annotation", 'Uncertainty_Score'], title = ["Annotation Published in Satpathy et al.","scATAnno Annotation"], legend_loc = "on data")
WARNING: The title list is shorter than the number of panels. Using 'color' value instead for some plots.
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
_images/tutorials_demo_BCC_multiple_samples_62_2.png

Finally, compare the annotation results between scATAnno and original publication.

[175]:
outcomes2 = []
for i in query_annotated.obs["cluster_annotation"].index:
    if query_annotated.obs.loc[i,"cluster_annotation"] == query_annotated.obs.loc[i,"true_label"]:
        outcomes2.append("Consistent")
    elif query_annotated.obs.loc[i,"cluster_annotation"] == "unknown":
        outcomes2.append("Unknown")
    else:
        outcomes2.append("Inconsistent")

query_annotated.obs["outcomes2"] = outcomes2
sc.pl.umap(query_annotated, color = "outcomes2", palette = ['#ff7f0e','#279e68'], title = "Comparison")
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
_images/tutorials_demo_BCC_multiple_samples_64_1.png
[180]:
adata_query = query_annotated.copy()
variable_col = "true_label"
outcome_col = "outcomes2"
listdict = []
for ct in adata_query.obs[variable_col].unique():
    tmp_dict = {}
    for pt in np.unique(adata_query.obs[outcome_col]):
        tmp_dict[pt] = (
            np.sum(
                adata_query.obs[adata_query.obs[variable_col] == ct][outcome_col] == pt
            )
        )
    l = len(adata_query.obs[adata_query.obs[variable_col] == ct])
    tmp_dict['ct'] = f'{ct} ({l})'
    tmp_dict['n_cells'] = len(adata_query.obs[adata_query.obs[variable_col] == ct])
    listdict.append(tmp_dict)


df = pd.DataFrame(listdict).set_index('ct').sort_values(by='ct')
df = df.sort_values("n_cells", ascending=True)

fig, ax = plt.subplots(1, 1, figsize=(2, 6))
df[df.columns[0:len(df.columns)-1]].plot(kind='barh', stacked=True, ax=ax, width=1.0, color = ['#ff7f0e','#279e68'])
ax.set_xticklabels(ax.get_xticklabels(), fontsize=15)
ax.set_xlabel('cell numbers')
ax.set_ylabel('')
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., frameon=False)
ax.grid(False)
sns.despine()
plt.show()
/var/folders/79/yz2sgggd11n9c7kj2nws4w700000gp/T/ipykernel_5654/2520740504.py:24: UserWarning: FixedFormatter should only be used together with FixedLocator
  ax.set_xticklabels(ax.get_xticklabels(), fontsize=15)
_images/tutorials_demo_BCC_multiple_samples_65_1.png

Compute accuracy and F1 scores

[183]:
query = query_annotated.copy()
paper_anno_col="true_label"
cluster_anno_col = "cluster_annotation"
from sklearn.metrics import f1_score, confusion_matrix, classification_report
report = classification_report(y_true=query.obs[paper_anno_col],
                                       y_pred=query.obs[cluster_anno_col],
                                       output_dict=True)
#Also generate a confusion matrix
cm = confusion_matrix(y_true=query.obs[paper_anno_col],
                                       y_pred=query.obs[cluster_anno_col],)

FP = cm.sum(axis=0)-np.diag(cm)
FN = cm.sum(axis=1) - np.diag(cm)
TP = np.diag(cm)
TN = cm.sum() - (FP + FN + TP)
class_spec = TN/(TN+FP)

perfdf = pd.DataFrame(columns = ['class',
                                     'precision',
                                     'recall',
                                     'f1_score',
                                     'support',
                                     'specificity'])

i=0
for key, val in report.items():
    #we don't use accuracy so skip it
    if key =='accuracy':
        continue
    perflist = []
    perflist.append(key)
    for key2, val2 in val.items():
        perflist.append(val2)
    if i>=len(class_spec):
        perflist.append(0)
    else:
        perflist.append(class_spec[i])
    perfdf=perfdf.append(pd.Series(perflist, index=perfdf.columns), ignore_index=True)
    i+=1
/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
[184]:
perfdf
[184]:
class precision recall f1_score support specificity
0 B 1.000000 1.000000 1.000000 25 1.000000
1 CD8 TEx 1.000000 1.000000 1.000000 23 1.000000
2 Effector CD8 T 1.000000 1.000000 1.000000 48 1.000000
3 Memory CD8 T 1.000000 1.000000 1.000000 147 1.000000
4 Myeloid 1.000000 1.000000 1.000000 21 1.000000
5 NK1 1.000000 1.000000 1.000000 29 1.000000
6 NK2 1.000000 1.000000 1.000000 16 1.000000
7 Naive CD4 T 1.000000 1.000000 1.000000 140 1.000000
8 Naive CD8 T 1.000000 1.000000 1.000000 51 1.000000
9 Plasma B 1.000000 1.000000 1.000000 118 1.000000
10 Tfh 1.000000 1.000000 1.000000 44 1.000000
11 Th1 0.000000 0.000000 0.000000 23 1.000000
12 Th17 1.000000 1.000000 1.000000 19 1.000000
13 Treg 1.000000 1.000000 1.000000 102 1.000000
14 unknown 0.000000 0.000000 0.000000 0 0.971464
15 macro avg 0.866667 0.866667 0.866667 806 0.000000
16 weighted avg 0.971464 0.971464 0.971464 806 0.000000
[186]:
# Accuracy
from scATAnno import scATAnno_evaluation
df = query.obs.copy()
cell_accr = {}
for celltype_of_interest in np.unique(df[paper_anno_col]):
    true_cells = df[df[paper_anno_col]==celltype_of_interest].index
    accr = scATAnno_evaluation.compute_accuracy(df.loc[true_cells,paper_anno_col], df.loc[true_cells,cluster_anno_col])
    cell_accr[celltype_of_interest] = accr

cell_level_df = pd.DataFrame([cell_accr], columns=cell_accr.keys())
cell_level_df = cell_level_df.T
cell_level_df.columns = ["accuracy"]
cell_level_df
[186]:
accuracy
B 1.0
CD8 TEx 1.0
Effector CD8 T 1.0
Memory CD8 T 1.0
Myeloid 1.0
NK1 1.0
NK2 1.0
Naive CD4 T 1.0
Naive CD8 T 1.0
Plasma B 1.0
Tfh 1.0
Th1 0.0
Th17 1.0
Treg 1.0
[187]:
data = np.array([np.float(cell_level_df.mean()),
                            np.float(perfdf[perfdf['class'] == "weighted avg"]["f1_score"])])

overallscore = pd.DataFrame(data.T, index = ['Accuracy',
                                     'Weighted F1',], columns=["value"]
                             )

overallscore
[187]:
value
Accuracy 0.928571
Weighted F1 0.971464
[189]:
ax = sns.barplot(x = overallscore.index, y= overallscore["value"], capsize=.2, dodge=True, palette="Blues",
)
ax.set_title("Classification Performance")
ax.set_ylim(0,1)
ax.set_ylabel("")
fig = ax.get_figure()
fig.tight_layout()
plt.show()
_images/tutorials_demo_BCC_multiple_samples_71_0.png