scATAnno: An Automated Cell Type annotation Pipeline for scATAC-seq
scATAnno is a Python-based workflow that automates the annotation of single-cell ATAC-seq (scATAC-seq) data using reference atlases. It directly uses chromatin accessibility peaks as input features. scATAnno utilizes large-scale reference atlases constructed from publicly available datasets and various scATAC-seq technologies to annotate cell types in an unsupervised manner. It incorporates uncertainty score metrics to assess the confidence of cell-type assignments. The workflow has been demonstrated to accurately annotate cell types in different case studies, including PBMC, TNBC, and BCC, and can identify unknown cell populations. Overall, scATAnno expedites our understanding of cellular heterogeneity based on chromatin accessibility profiles in complex biological systems.
Integration of newly generated scATAC-seq data with reference atlases
Celltype annotation of single-cell chromatin accessibility profile
Uncertainty score distribution to assess cell-type assignment

If you use this pipeline, please cite the following reference:
Jiang et al., scATAnno: Automated Cell Type Annotation for single-cell ATAC-seq Data
BioRxiv Preprint (https://www.biorxiv.org/content/10.1101/2023.06.01.543296v2)
The documentation is organized into the following sections:
About scATAnno
scATAnno is a Python-based workflow designed to annotate cell types in single-cell ATAC-seq (scATAC-seq) data based on scATAC-seq reference atlases. It addresses the challenges presented by scATAC-seq data, such as data sparsity, high dimensionality of epigenomic features, and the lack of marker enhancers for annotation. Unlike existing methods, scATAnno directly utilizes peaks or cis-regulatory element (CRE) genomic regions as input features without converting them into gene activity scores. It overcomes the high dimensionality by employing spectral embedding to transform the data into a low dimensional space. To create catalogs of CREs at the single-cell level, scATAnno uses the chromatin state profile of large-scale reference atlases to generate peak signals and reference peaks.
By integrating query data with reference atlases in an unsupervised manner, scATAnno enables accurate cell type annotation. It provides uncertainty score metrics to assess the confidence of cell-type assignment, including a K-Nearest Neighbor (KNN) uncertainty score and a weighted distance-based uncertainty score in a low dimensional space. The workflow has been demonstrated to be effective in multiple case studies, including peripheral blood mononuclear cells (PBMC), Triple Negative Breast Cancer (TNBC), and basal cell carcinoma (BCC). It accurately annotates cell types and identifies unknown cell populations not represented in the reference atlas. Additionally, scATAnno enables efficient cell type annotation across tumor conditions and accurately annotates tumor-infiltrating lymphocytes (TIL) in TNBC using a BCC TIL atlas.
Installation
Instruction of Installation
Install through pip
pip install scATAnno
Install through github
Clone the source code repository and go to the scATAnno-main directory, use `pip install . ` to install the package.
git clone https://github.com/Yijia-Jiang/scATAnno-main.git
cd scATAnno-main
pip install .
Reference Atlas
Below are Reference Altases in the original publication. If you want to find processed and ready-to-use reference atlases, please check Prepare Reference Atlas
Human scATAC Reference: Zhang, Kai, et al. Cell (2021)
Dataset: 1.3 million cells, 1.2 million peaks; 30 major cell types and 222 minor cell types
PBMC scATAC Reference: Satpathy, Ansuman et al. Nature Biotechnology (2019)
Dataset: 61,806 peripheral blood and bone marrow cells, 571,400 peaks

BCC Tumor-Infiltrating Lymphocytes (TILs) scATAC Reference:
Satpathy, Ansuman et al. Nature Biotechnology (2019)
Dataset: 37,818 cells from BCC clinical biopsies

Prepare Reference Atlas
Healthy Adult Reference Atlas
Select deep-sequenced 100K adult cells
Select adult specific peaks (~ 890K peaks)
Downloaded Healthy Adult Reference atlas
Perform dimensionality reduction using spectral embedding, visualize annotation on UMAP
PBMC Reference Atlas
Select 39441 PBMC cells
Generate 196K peaks by MACS2 Peak-Calling
-
Perform dimensionality reduction using spectral embedding, visualize annotation on UMAP
BCC TIL Reference Atlas
Select 22008 TIL cells
Generate 340K peaks by MACS2 Peak-Calling
-
Perform dimensionality reduction using spectral embedding, visualize annotation on UMAP
Pipeline Overview
The conceptual idea and schematic of scATAnno is illustrated here.

Original Input
The following files are needed to run Celltype Annotation on your own experiment:
fragments.tsv.gz fragment file for each scATAC data
barcodes.tsv cell barcodes for each scATAC data
reference.bed reference peaks with chromosome regions from the selected reference atlas
Optionally: UMAP or tSNE projection coordinates and Cluster cluster numbers of cells can be provided by users
Currently, this package only supports hg38 reference mapping
Intermediate Output
The following files are intermediate outputs of scATAnno in order to generate a peak-by-cell matrix for query data:
matrix.mtx Sparse matrix files with fragment reads
features.tsv Reference peaks/cis-Regulatory Elements
barcodes.tsv Cell barcodes of high quality cells
Final Output
The following files are final outputs of scATAnno using the annotation tool:
1.Merged_query_reference.h5ad Anndata of integrated query and reference cells
X_spectral_harmony.csv Harmozied spectral embeddings of integrated data
query.h5ad Anndata of query cells which stores annotation results. This AnnData should include essential prediction results in AnnData.obs
column cluster_annotation stores cell type assignment at cluster-level
column uncertainty_score stores final uncertainty score, which takes the maximum of KNN-based uncertainty and weighted distance-based uncertainty of query cells
Dimension reduction
Cited from https://kzhang.org/SnapATAC2/algorithms/dimension_reduction.html#spectral-embedding
[Fang_2021]: Fang, R., Preissl, S., Li, Y. et al. Comprehensive analysis of single cell ATAC-seq data with SnapATAC. Nat Commun 12, 1337 (2021). https://doi.org/10.1038/s41467-021-21583-9
The dimension reduction method is a pairwise-similarity based method, which requires defining and computing similarity between each pair of cells in the data.
In order to address the issue of high dimensionality in scATAC-seq data, we integrated SnapATAC2 into our workflow. SnapATAC2 utilizes spectral embedding techniques based on the work of Zhang et al. (2021) and Fang et al. (2021). We combined the reference and query data, creating a unified peak-by-cell matrix. SnapATAC2 transformed the count matrix into a cell-cell similarity matrix using the Jaccard Similarity Score, which measures the similarity between cells based on the ratio of their intersection and union. To mitigate sequencing depth bias, the Jaccard similarity matrix was normalized. The normalized matrix was then subjected to eigenvector decomposition to obtain low-dimensional components. To handle the computational complexity of high-dimensional data, SnapATAC2 first sampled “landmark” cells that capture the overall data distribution, and obtained the low-dimensional components for these informative cells. Finally, the remaining cells were projected into the same low-dimensional space, allowing us to obtain low-dimensional components for all cells in the dataset.
Cell-type Assignment
K-Nearest Neighorbors
In the initial cell-type assignment, each query cell is assigned a cell label using its closest K-nearest neighbors (KNN) in the reference atlas, based on the low dimensional embeddings.
To evaluate the confidence of the KNN assignment, we employ two distinct uncertainty score metrics. The first metric is the KNN-based uncertainty score. This score is calculated by considering the closest neighbors of a query cell in the reference atlas. Query cells that have predominantly nearest neighbors belonging to a single cell type are assigned low uncertainty scores, while those with neighbors consisting of a mixture of cell types receive higher uncertainty scores.
The second metric is a novel computation of a weighted distance-based uncertainty score. This approach is based on the concept that query cells located far from the centroid of their assigned cell type represent an unknown population. The computation involves three steps. First, we identify centroids of reference cell types based on low-dimensional components. Second, we assign different weights to the components for each cell type, considering the closeness of reference cells to their respective cell type’s centroid along each dimension.
Case Study Datasets
Three case studies below gives you a taste of different capabilities of our Annotation workflow. The datasets you will be exploring are:
PBMC 10K Multiome from 10X
Raw data can be directly downloaded from 10X webpage: https://support.10xgenomics.com/single-cell-multiome-atac-gex/datasets/1.0.0/pbmc_granulocyte_sorted_10k
scATAC of Triple Negative Breast Cancer (TNBC)
Citation: Zhang, Yuanyuan, Hongyan Chen, Hongnan Mo, Xueda Hu, Ranran Gao, Yahui Zhao, Baolin Liu, et al. 2021. “Single-Cell Analyses Reveal Key Immune Cell Subsets Associated with Response to PD-L1 Blockade in Triple-Negative Breast Cancer.” Cancer Cell 39 (12): 1578–93.e8.
We selected one scATAC sample (9,935 cells) collected from TNBC patients and applied the TIL reference atlas to annotate TIL cells from the TNBC query data
scATAC of two hold-out Basal cell carcinoma samples
Citation: Satpathy, Ansuman T et al. “Massively parallel single-cell chromatin landscapes of human immune cell development and intratumoral T cell exhaustion.” Nature biotechnology (2019)
We selected two BCC samples held out from the BCC TIL reference building as query data. One sample is pre-treatment and the other is post-treatment by immunotherapy. The original information of BCC cells is shown below:
Tutorials
PBMC10K Multiome
[16]:
import scanpy as sc
import pandas as pd
from scipy.sparse import csr_matrix
from scipy import sparse
import scipy.io
import os
import anndata as ad # Anndata version must > 0.8
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='white')
import scATAnno
from scATAnno.SnapATAC2_spectral import *
from scATAnno.SnapATAC2_tools import *
from scATAnno.SnapATAC2_utils import *
from scATAnno import scATAnno_preprocess
from scATAnno import scATAnno_assignment
from scATAnno import scATAnno_integration
from scATAnno import scATAnno_plotting
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)
default_28 = scATAnno_plotting.get_palettes("default28")
default_102 = scATAnno_plotting.get_palettes("default102")
In this tutorial, we apply PBMC reference atlas to PBMC 10K Multiome dataset. The input files are 1) reference atlas in h5ad format; 2) query data in mtx and tsv formats, whose features are aligned with PBMC reference peak.
Download reference data
[13]:
# !wget -O PBMC_reference_atlas_final.h5ad "https://www.dropbox.com/s/y9wc6h5mmydj7gf/PBMC_reference_atlas_final.h5ad?dl=1"
Download query data, including 1) scATAC (matrix.mtx, features.tsv and barcodes.tsv), features are the reference peak of PBMC reference atlas 2) scRNA (filtered_feature_bc_matrix.h5)
[14]:
# !wget -O PBMC10k_multiome_vPBMCatlas.zip "https://www.dropbox.com/s/3g63m832mbeec4s/PBMC10k_multiome_vPBMCatlas.zip?dl=1"
We load reference atlas data and query data. Reference atlas is stored as h5ad AnnData in the data directory; query data is imported from QuickATAC output.
[3]:
os.chdir("scATAnno-main")
output_name = "PBMC_10K_Multiome"
out_dir = os.path.join("case_study", output_name)
reference_data_path = "data/Reference/PBMC_reference_atlas_final.h5ad"
reference_data = scATAnno_preprocess.load_reference_data(reference_data_path)
query_data = scATAnno_preprocess.import_query_data(path = 'data/PBMC10k_multiome_vPBMCatlas/scATAC/',
mtx_file = 'matrix.mtx.gz',
cells_file = 'barcodes.tsv.gz',
features_file = 'features.tsv.gz',
variable_prefix = "pbmc10k_query",
celltype_col = "celltypes",
add_metrics=False)
Before performing integration, we check feature dimensions of the reference and query.
[5]:
print(reference_data)
print(query_data)
assert reference_data.var.shape[0] == query_data.var.shape[0]
AnnData object with n_obs × n_vars = 39441 × 196618
obs: 'celltypes', 'tissue', 'dataset'
View of AnnData object with n_obs × n_vars = 11909 × 196618
obs: 'celltypes', 'tissue', 'dataset'
We integrate the reference atlas and query data using SnapATAC2 by 1) concatenating two datasets and selecting highly informative “landmark” reference cells to compute similarity matrix, normalize the matrix and eigen decompositionand and 2) performing Nystrom extension to project the remaining reference and query cells onto the same spectral embedding space
[6]:
# Integrate reference and query data
integrated_adata = scATAnno_assignment.scATAnno_integrate(reference_data, query_data, variable_prefix = "pbmc10k_query", sample_size = 25000)
/Users/jiang/Dropbox (Partners HealthCare)/Software/ATAC_Annotation_V3/ATAC_Annotation_Package/scATAnno/SnapATAC2_spectral.py:159: ImplicitModificationWarning: Trying to modify attribute `.var` of view, initializing view as actual.
adata.var["selected"] = selected_features
Compute similarity matrix
Normalization
Perform decomposition
Perform Nystrom extension
100%|██████████████████████████████████████████| 3/3 [16:56<00:00, 338.90s/it]
Next, apply harmony to spectral embeddings of integrated data to regress out batch effects indicated in batch_col.
[7]:
# Apply harmony to remove batch effects
integrated_adata_harmony = scATAnno_assignment.scATAnno_harmony(integrated_adata, batch_col = "dataset")
2023-05-23 15:52:26,525 - harmonypy - INFO - Iteration 1 of 20
2023-05-23 15:52:33,706 - harmonypy - INFO - Iteration 2 of 20
2023-05-23 15:52:40,648 - harmonypy - INFO - Iteration 3 of 20
2023-05-23 15:52:47,785 - harmonypy - INFO - Iteration 4 of 20
2023-05-23 15:52:54,813 - harmonypy - INFO - Iteration 5 of 20
2023-05-23 15:53:01,840 - harmonypy - INFO - Iteration 6 of 20
2023-05-23 15:53:08,845 - harmonypy - INFO - Iteration 7 of 20
2023-05-23 15:53:15,864 - harmonypy - INFO - Converged after 7 iterations
[9]:
integrated_adata_harmony
[9]:
AnnData object with n_obs × n_vars = 51350 × 196618
obs: 'celltypes', 'tissue', 'dataset'
obsm: 'X_spectral', 'X_spectral_harmony'
we then plot UMAP of the integrated data using harmonized spectral embeddings. Optionally, we can save the AnnData object in an output directory.
[10]:
# Plot UMAP using spectral embeddings
integrated_adata = scATAnno_assignment.scATAnno_umap(integrated_adata_harmony, out_dir, use_rep = "X_spectral_harmony", save = True)
integrated_adata
[10]:
AnnData object with n_obs × n_vars = 51350 × 196618
obs: 'celltypes', 'tissue', 'dataset'
obsm: 'X_spectral', 'X_spectral_harmony', 'X_umap'
[27]:
scATAnno_plotting.defaultPlotting_umap()
sc.pl.umap(integrated_adata, color="dataset", palette = ['#228833', '#cccccc'], show=True, title = "Integration")
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
cax = scatter(

[25]:
import pickle
with open(os.path.join(out_dir, 'reference_palette.pickle'), 'rb') as handle:
celltype_palette = pickle.load(handle)
scATAnno_plotting.defaultPlotting_umap()
sc.pl.umap(reference, color="celltypes", palette = celltype_palette, title = "PBMC Reference Atlas")
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
cax = scatter(

We investigated cell types in the PBMC reference atlas and re-annotated gamma delta T cell cluster as MAIT cell cluster by showing chromatin accessibility of gene signatures in this cluster.
After integration, we want to assign cell types to the query data by taking three steps. First, query cells are assigned labels using K-Nearest neighbors along with KNN-based uncertainty score. Second, query cell labels are corrected using weighted-distance-based uncertainty score. Third, query cells are clustered and annotated at the cluster level.
[28]:
uncertainty_threshold = 0.2
distance_threshold = 90
reference_label_col = "celltypes"
use_rep = "X_spectral_harmony"
atlas = "PBMC"
query = integrated_adata[integrated_adata.obs['dataset'] != "Atlas",:].copy()
# Perform KNN assignment
query_KNN = scATAnno_assignment.scATAnno_KNN_assign(reference, query, reference_label_col=reference_label_col, low_dim_col=use_rep)
[29]:
# Perform weighted-distance based assignment
query_distance = scATAnno_assignment.scATAnno_distance_assign(reference, query_KNN, reference_label_col=reference_label_col, distance_threshold=distance_threshold, atlas=atlas, uncertainty_threshold=uncertainty_threshold, use_rep = use_rep)
[31]:
# Perform cluster-level assignment
query_annotated = scATAnno_assignment.scATAnno_cluster_assign(query_distance, cluster_col = None, use_rep=use_rep)
query_annotated
[31]:
AnnData object with n_obs × n_vars = 11909 × 196618
obs: 'celltypes', 'tissue', 'dataset', 'uncertainty_score_step1', 'pred_y', 'uncertainty_score_step2', 'Uncertainty_Score', '1.knn-based_celltype', '2.corrected_celltype', 'leiden', 'cluster_annotation'
uns: 'dataset_colors', 'neighbors', 'umap', 'leiden'
obsm: 'X_spectral', 'X_spectral_harmony', 'X_umap', 'kernel_distance', 'distance', 'indices', 'neighbors_labels'
obsp: 'distances', 'connectivities'
[35]:
sc.pl.umap(query_annotated, color = "cluster_annotation",palette=celltype_palette, title = "scATAnno Annotation")
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
cax = scatter(

We benchmark scATAnno results with seurat annotation. Seurat annotation is stored in “Seurat_RNA_annotation.csv”
[34]:
seurat_df = pd.read_csv(os.path.join("data/PBMC10k_multiome_vPBMCatlas/","Seurat_RNA_annotation.csv"), index_col=0)
query_annotated = query_annotated[query_annotated.obs.index.isin(seurat_df.index)]
query_annotated.obs = pd.merge(query_annotated.obs, seurat_df, left_index=True, right_index=True)
query_annotated
[34]:
AnnData object with n_obs × n_vars = 10412 × 196618
obs: 'celltypes', 'tissue', 'dataset', 'uncertainty_score_step1', 'pred_y', 'uncertainty_score_step2', 'Uncertainty_Score', '1.knn-based_celltype', '2.corrected_celltype', 'leiden', 'cluster_annotation', 'seurat_rna_annotations', 'seurat_new_annotation'
uns: 'dataset_colors', 'neighbors', 'umap', 'leiden'
obsm: 'X_spectral', 'X_spectral_harmony', 'X_umap', 'kernel_distance', 'distance', 'indices', 'neighbors_labels'
obsp: 'distances', 'connectivities'
[37]:
sc.pl.umap(query_annotated, color = "seurat_new_annotation", title = "Seurat Annotation", palette = default_28)
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
cax = scatter(

Lastly, we compare the annotation results between scATAnno and Seurat.
[40]:
outcomes2 = []
for i in query_annotated.obs["cluster_annotation"].index:
if query_annotated.obs.loc[i,"cluster_annotation"] == query_annotated.obs.loc[i,"seurat_new_annotation"]:
outcomes2.append("Consistent")
elif query_annotated.obs.loc[i,"cluster_annotation"] == "unknown":
outcomes2.append("Unknown")
else: outcomes2.append("Inconsistent")
query_annotated.obs["outcomes2"] = outcomes2
sc.pl.umap(query_annotated, color = "outcomes2", palette= ['#1f77b4','#ff7f0e','#279e68'], title = "Comparison")
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
cax = scatter(

[42]:
outcomes = []
for i in query_annotated.obs["cluster_annotation"].index:
if query_annotated.obs.loc[i,"cluster_annotation"] == query_annotated.obs.loc[i,"seurat_new_annotation"]:
outcomes.append("Consistent")
else: outcomes.append("Inconsistent")
query_annotated.obs["outcomes"] = outcomes
adata_query = query_annotated.copy()
variable_col = "cluster_annotation"
outcome_col = "outcomes"
listdict = []
for ct in adata_query.obs[variable_col].unique():
# print(len(adata_query.obs[adata_query.obs['tissue'] == ct]))
tmp_dict = {}
for pt in np.unique(adata_query.obs[outcome_col]):
tmp_dict[pt] = (
np.sum(
adata_query.obs[adata_query.obs[variable_col] == ct][outcome_col] == pt
)
# / len(adata_query.obs[adata_query.obs[variable_col] == ct])
)
l = len(adata_query.obs[adata_query.obs[variable_col] == ct])
tmp_dict['ct'] = f'{ct} ({l})'
tmp_dict['n_cells'] = len(adata_query.obs[adata_query.obs[variable_col] == ct])
listdict.append(tmp_dict)
df = pd.DataFrame(listdict).set_index('ct').sort_values(by='ct')
df = df.iloc[0:df.shape[0]-1].sort_values("n_cells", ascending=True)
print(df)
Consistent Inconsistent n_cells
ct
pDC (89) 89 0 89
MAIT (118) 112 6 118
Treg (148) 113 35 148
cDC (199) 183 16 199
Naive B (349) 3 346 349
NK (383) 377 6 383
Memory B (414) 6 408 414
Effector memory CD8 T (736) 582 154 736
Memory CD4 T (1227) 1142 85 1227
Naive CD4 T (1237) 1178 59 1237
Naive CD8 T (1384) 1320 64 1384
Monocyte (3369) 3274 95 3369
[44]:
fig, ax = plt.subplots(1, 1, figsize=(2, 6))
df[df.columns[0:len(df.columns)-1]].plot(kind='barh', stacked=True, ax=ax, width=1.0)
ax.set_xticklabels(ax.get_xticklabels(), fontsize=15)
ax.set_xlabel('Cell Numbers')
ax.set_ylabel('')
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., frameon=False)
ax.grid(False)
sns.despine()
plt.show()
/var/folders/79/yz2sgggd11n9c7kj2nws4w700000gp/T/ipykernel_70711/2726220942.py:3: UserWarning: FixedFormatter should only be used together with FixedLocator
ax.set_xticklabels(ax.get_xticklabels(), fontsize=15)

[47]:
query_annotated.obs
[47]:
celltypes | tissue | dataset | uncertainty_score_step1 | pred_y | uncertainty_score_step2 | Uncertainty_Score | 1.knn-based_celltype | 2.corrected_celltype | leiden | cluster_annotation | seurat_rna_annotations | seurat_new_annotation | outcomes2 | outcomes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TCAAGAACAGTAATAG-1_pbmc10k_query | pbmc10k_query | pbmc10k_query | pbmc10k_query | 4.000000e-01 | Treg | 1.0 | 1.000000e+00 | unknown | unknown | 32 | Treg | CD4 Naive | Naive CD4 T | Inconsistent | Inconsistent |
AACCCGCAGGTAGCTT-1_pbmc10k_query | pbmc10k_query | pbmc10k_query | pbmc10k_query | 5.000000e-01 | Central memory CD8 T | 0.0 | 5.000000e-01 | unknown | unknown | 6 | Effector memory CD8 T | gdT | Gamma delta T | Inconsistent | Inconsistent |
CGCATATAGGTTACGT-1_pbmc10k_query | pbmc10k_query | pbmc10k_query | pbmc10k_query | -2.220446e-16 | Memory CD4 T | 0.0 | 0.000000e+00 | Memory CD4 T | Memory CD4 T | 2 | Memory CD4 T | CD4 TCM | Memory CD4 T | Consistent | Consistent |
TATTCGTTCCGCCTCA-1_pbmc10k_query | pbmc10k_query | pbmc10k_query | pbmc10k_query | -2.220446e-16 | Monocyte | 0.0 | 0.000000e+00 | Monocyte | Monocyte | 10 | Monocyte | CD14 Mono | Monocyte | Consistent | Consistent |
CGGTTTGAGTTTGGTA-1_pbmc10k_query | pbmc10k_query | pbmc10k_query | pbmc10k_query | 0.000000e+00 | Effector memory CD8 T | 0.0 | 0.000000e+00 | Effector memory CD8 T | Effector memory CD8 T | 12 | Effector memory CD8 T | gdT | Gamma delta T | Inconsistent | Inconsistent |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
GGTCAAGCACCTACTT-1_pbmc10k_query | pbmc10k_query | pbmc10k_query | pbmc10k_query | 2.220446e-16 | Monocyte | 0.0 | 2.220446e-16 | Monocyte | Monocyte | 17 | Monocyte | CD14 Mono | Monocyte | Consistent | Consistent |
CTTGCTCAGACAAACG-1_pbmc10k_query | pbmc10k_query | pbmc10k_query | pbmc10k_query | -2.220446e-16 | Monocyte | 1.0 | 1.000000e+00 | Monocyte | unknown | 0 | Monocyte | CD16 Mono | Monocyte | Consistent | Consistent |
GAAAGCCAGCAGGTGG-1_pbmc10k_query | pbmc10k_query | pbmc10k_query | pbmc10k_query | 0.000000e+00 | Monocyte | 1.0 | 1.000000e+00 | Monocyte | unknown | 14 | Monocyte | CD14 Mono | Monocyte | Consistent | Consistent |
ACGGGAAGTGTTAGCA-1_pbmc10k_query | pbmc10k_query | pbmc10k_query | pbmc10k_query | 2.220446e-16 | Naive CD8 T | 0.0 | 2.220446e-16 | Naive CD8 T | Naive CD8 T | 4 | Naive CD8 T | CD8 Naive | Naive CD8 T | Consistent | Consistent |
CTATGATCACCATATG-1_pbmc10k_query | pbmc10k_query | pbmc10k_query | pbmc10k_query | -2.220446e-16 | Monocyte | 1.0 | 1.000000e+00 | Monocyte | unknown | 7 | Monocyte | CD14 Mono | Monocyte | Consistent | Consistent |
10412 rows × 15 columns
[45]:
query_annotated.write(os.path.join(out_dir, "query_annotated.h5ad"))
To examine B cell subtypes, we leverage scRNA-profile to show signature genes
[61]:
# Read in scRNA profile
RNA_anndata = sc.read_10x_h5("data/PBMC10k_multiome_vPBMCatlas/scRNA/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5")
RNA_anndata.obs.index = [ i+ "_pbmc10k_query" for i in RNA_anndata.obs.index]
RNA_anndata = RNA_anndata[RNA_anndata.obs.index.isin(query_annotated.obs.index)]
/opt/anaconda3/lib/python3.9/site-packages/anndata/_core/anndata.py:1830: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
[62]:
RNA_anndata.obs = pd.merge(RNA_anndata.obs, query_annotated.obs[["cluster_annotation", "seurat_new_annotation"]], left_index=True, right_index=True)
UMAP_df = pd.DataFrame(query_annotated.obsm["X_umap"])
UMAP_df.index = query_annotated.obs.index
UMAP_df.columns = ["UMAP_1", "UMAP_2"]
UMAP_df = UMAP_df.loc[RNA_anndata.obs.index]
RNA_anndata.obsm["X_umap"] = np.array(UMAP_df)
RNA_anndata.obs["cluster_annotation"].cat.reorder_categories(np.unique(query_annotated.obs["cluster_annotation"]), inplace=True)
/opt/anaconda3/lib/python3.9/site-packages/anndata/_core/anndata.py:1830: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
[63]:
RNA_anndata.obs
[63]:
cluster_annotation | seurat_new_annotation | |
---|---|---|
AAACAGCCAAGGAATC-1_pbmc10k_query | unknown | Naive CD4 T |
AAACAGCCAATCCCTT-1_pbmc10k_query | Memory CD4 T | Memory CD4 T |
AAACAGCCAATGCGCT-1_pbmc10k_query | Naive CD4 T | Naive CD4 T |
AAACAGCCACCAACCG-1_pbmc10k_query | Naive CD8 T | Naive CD8 T |
AAACAGCCAGGATAAC-1_pbmc10k_query | Naive CD4 T | Naive CD4 T |
... | ... | ... |
TTTGTTGGTGACATGC-1_pbmc10k_query | Naive CD8 T | Naive CD8 T |
TTTGTTGGTGTTAAAC-1_pbmc10k_query | Naive CD8 T | Naive CD8 T |
TTTGTTGGTTAGGATT-1_pbmc10k_query | NK | NK |
TTTGTTGGTTGGTTAG-1_pbmc10k_query | Memory CD4 T | Memory CD4 T |
TTTGTTGGTTTGCAGA-1_pbmc10k_query | Effector memory CD8 T | Effector memory CD8 T |
10412 rows × 2 columns
[64]:
sc.pp.filter_genes(RNA_anndata, min_cells=3)
sc.pp.normalize_total(RNA_anndata, target_sum=1e4)
sc.pp.log1p(RNA_anndata)
sc.pp.highly_variable_genes(RNA_anndata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pp.scale(RNA_anndata, max_value=10)
RNA_anndata.var_names_make_unique()
/opt/anaconda3/lib/python3.9/site-packages/anndata/_core/anndata.py:1830: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
[67]:
N_col = 2
N_row = 2
markergenes = ["CD27","TNFRSF13B", # memoryB
"IGHM", "TCL1A", # naive B
]
fig, axes=plt.subplots(nrows= N_row , ncols= N_col , figsize=(N_col*2,N_row*2),
sharex=True,sharey=True, )
for i in range(len(markergenes) ):
i_row = int(i/N_col)
i_col = i%N_col
ax = axes[i_row, i_col]
sc.pl.umap(RNA_anndata, color= markergenes[i],
title = markergenes[i],
# title = "",
show=False, ax = ax, legend_loc=None,color_map='plasma', colorbar_loc = None, )
ax.set_xlabel('')
ax.set_ylabel('')
fig.tight_layout()
plt.show()

[6]:
import scanpy as sc
import pandas as pd
from scipy.sparse import csr_matrix
from scipy import sparse
import scipy.io
import os
import anndata as ad # Anndata version must > 0.8
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='white')
from scATAnno.SnapATAC2_spectral import *
from scATAnno.SnapATAC2_tools import *
from scATAnno.SnapATAC2_utils import *
from scATAnno import scATAnno_preprocess
from scATAnno import scATAnno_assignment
from scATAnno import scATAnno_integration
from scATAnno import scATAnno_plotting
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)
default_28 = scATAnno_plotting.get_palettes("default28")
default_102 = scATAnno_plotting.get_palettes("default102")
TNBC sample
Download TIL reference data
[1]:
# !wget -O BCC_TIL_reference_atlas_final.h5ad "https://www.dropbox.com/s/ky4jezsj3pf2qwi/BCC_TIL_reference_atlas_final.h5ad?dl=1"
Download TNBC query data, whose features are aligned with BCC TIL reference peak
[2]:
# !wget -O TNBC_Pre_P012_t.zip "https://www.dropbox.com/s/7vyk08xrv81b1q2/TNBC_Pre_P012_t.zip?dl=1"
[8]:
os.chdir("scATAnno-main")
output_name = "TNBC"
out_dir = os.path.join("case_study", output_name)
[3]:
reference_data_path = "data/Reference/BCC_TIL_reference_atlas_final.h5ad"
reference_data = scATAnno_preprocess.load_reference_data(reference_data_path)
query_data = scATAnno_preprocess.import_query_data(path = 'data/TNBC_Pre_P012_t/',
mtx_file = 'matrix.mtx.gz',
cells_file = 'barcodes.tsv.gz',
features_file = 'features.tsv.gz',
variable_prefix = "TNBC_Pre_P012_t",
celltype_col = "celltypes",
add_metrics=False)
[4]:
print(reference_data)
print(query_data)
assert reference_data.var.shape[0] == query_data.var.shape[0]
AnnData object with n_obs × n_vars = 22008 × 344492
obs: 'celltypes', 'tissue', 'dataset'
var: 'selected'
uns: 'Group_colors', 'celltypes_colors', 'spectral_eigenvalue', 'true_label_colors'
obsm: 'X_spectral', 'X_umap'
View of AnnData object with n_obs × n_vars = 9935 × 344492
obs: 'celltypes', 'tissue', 'dataset'
[5]:
# Integrate reference and query data
integrated_adata = scATAnno_assignment.scATAnno_integrate(reference_data, query_data, variable_prefix = "TNBC_Pre_P012_t", sample_size = 25000)
/Users/jiang/Dropbox (Partners HealthCare)/Software/ATAC_Annotation_V3/ATAC_Annotation_Package/scATAnno/SnapATAC2_spectral.py:159: ImplicitModificationWarning: Trying to modify attribute `.var` of view, initializing view as actual.
adata.var["selected"] = selected_features
Compute similarity matrix
Normalization
Perform decomposition
Perform Nystrom extension
100%|████████████████████████████████████████████| 2/2 [04:20<00:00, 130.26s/it]
[6]:
# Apply harmony to remove batch effects
integrated_adata_harmony = scATAnno_assignment.scATAnno_harmony(integrated_adata, batch_col = "dataset")
2023-05-25 17:42:14,786 - harmonypy - INFO - Iteration 1 of 20
2023-05-25 17:42:21,018 - harmonypy - INFO - Iteration 2 of 20
2023-05-25 17:42:26,180 - harmonypy - INFO - Converged after 2 iterations
[7]:
# Plot UMAP using spectral embeddings
integrated_adata = scATAnno_assignment.scATAnno_umap(integrated_adata_harmony, out_dir, use_rep = "X_spectral_harmony", save = True)
integrated_adata
[7]:
AnnData object with n_obs × n_vars = 31943 × 344492
obs: 'celltypes', 'tissue', 'dataset'
obsm: 'X_spectral', 'X_spectral_harmony', 'X_umap'
[8]:
scATAnno_plotting.defaultPlotting_umap()
sc.pl.umap(integrated_adata, color="dataset", palette = ['#228833', '#cccccc'], show=True, title = "Integration")
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
cax = scatter(

[17]:
integrated_adata = sc.read_h5ad(os.path.join(out_dir, "1.Merged_query_reference.h5ad"))
integrated_adata = scATAnno_preprocess.add_variable(os.path.join(out_dir,"X_spectral_harmony.csv"), integrated_adata)
[18]:
reference = integrated_adata[integrated_adata.obs['dataset'] == "Atlas",:].copy()
reference.obs["celltypes"] = scATAnno_assignment.curate_celltype_names(reference.obs["celltypes"], atlas = "TIL")
[11]:
import pickle
with open(os.path.join(out_dir, 'TIL_reference_palette.pickle'), 'rb') as handle:
celltype_palette = pickle.load(handle)
scATAnno_plotting.defaultPlotting_umap()
sc.pl.umap(reference, color="celltypes", palette = celltype_palette, title = "TIL Reference Atlas")
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
cax = scatter(

[19]:
query = integrated_adata[integrated_adata.obs['dataset'] != "Atlas",:].copy()
[20]:
projecton = pd.read_csv(os.path.join("data/TNBC_Pre_P012_t/", "query_metadata.csv"), index_col=0)
query.obs = pd.merge(query.obs, projecton, left_index=True, right_index=True)
We first assign cell types based on KNN and compute KNN-based uncertainty score
[21]:
# Perform KNN assignment
reference_label_col = "celltypes"
use_rep = "X_spectral_harmony"
query_KNN = scATAnno_assignment.scATAnno_KNN_assign(reference, query, reference_label_col=reference_label_col, low_dim_col=use_rep)
Because we are annotating TIL subtypes, we adopt more loose threshold cutoffs to annotate query cells. Query cells are assigned high weighted distance-based uncertainty score if their distances to the assigned reference cell type centroid are greater than 95 percentile. Query cells with uncertainty score greater than 0.5 are annotated as unknown.
[22]:
# Perform weighted-distance based assignment
distance_threshold = 95
uncertainty_threshold = 0.5
atlas = "TIL"
query_distance = scATAnno_assignment.scATAnno_distance_assign(reference, query_KNN, reference_label_col=reference_label_col, distance_threshold=distance_threshold, atlas=atlas, uncertainty_threshold=uncertainty_threshold, use_rep = use_rep)
Finally, we perform cluster-level assignment of query cells.
[23]:
query_annotated = scATAnno_assignment.scATAnno_cluster_assign(query_distance, use_rep=use_rep, cluster_col = "Clusters")
[24]:
import pickle
with open(os.path.join(out_dir, 'all_palette_dictionary.pkl'), 'rb') as handle:
annotation_palette = pickle.load(handle)
sc.pl.umap(query_annotated, color = ["paper_annotation_curated", "cluster_annotation", 'Uncertainty_Score'], palette=annotation_palette,title = ["Annotation Published in Zhang et al.","scATAnno Annotation"], legend_loc = "on data")
WARNING: The title list is shorter than the number of panels. Using 'color' value instead for some plots.
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
cax = scatter(
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
cax = scatter(

Compare the annotation results between scATAnno and original publication.
[25]:
from pySankey.sankey import sankey
pred_col = 'cluster_annotation'
true_col = 'paper_annotation_curated'
sankey(
query_annotated.obs[pred_col], query_annotated.obs[true_col], aspect=10, colorDict=annotation_palette,
fontsize=10, figure_name=None
)
# plt.gcf().set_size_inches(6, 10)
# plt.savefig(os.path.join(out_dir,"{}.png".format("Sankey_CellType_Annotation_paper_cluster")), bbox_inches='tight', dpi=300)

BCC Multiple Samples
[1]:
import scanpy as sc
import pandas as pd
from scipy.sparse import csr_matrix
from scipy import sparse
import scipy.io
import os
import anndata as ad # Anndata version must > 0.8
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='white')
import scATAnno
from scATAnno.SnapATAC2_spectral import *
from scATAnno.SnapATAC2_tools import *
from scATAnno.SnapATAC2_utils import *
from scATAnno import scATAnno_preprocess
from scATAnno import scATAnno_assignment
from scATAnno import scATAnno_integration
from scATAnno import scATAnno_plotting
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)
default_28 = scATAnno_plotting.get_palettes("default28")
default_102 = scATAnno_plotting.get_palettes("default102")
Part I: First round annotation - Apply Healthy Adult reference atlas to BCC tumor samples
We first apply healthy adult reference atlas to two BCC samples. We load reference atlas data and BCC query data with healthy adult reference peak as feature input. Reference atlas is stored as h5ad AnnData in the data directory; query data is imported as MTX and TSV files from QuickATAC output.
Download reference data
[ ]:
# !wget -O Healthy_Adult_reference_atlas.h5ad "https://www.dropbox.com/s/3ezp2t6gw6hw21v/Healthy_Adult_reference_atlas.h5ad?dl=1"
Download query data. Each BCC sample has two features, one aligned with Healthy adult reference peak, and the other aligned with BCC TIL reference peak.
[1]:
# !wget -O BCC_samples.zip "https://www.dropbox.com/s/eqkmt5trp3350cj/BCC_samples.zip?dl=1"
[3]:
os.chdir("scATAnno-main")
output_name = "BCC_samples"
out_dir = os.path.join("case_study", output_name)
reference_data_path = "data/Reference/Healthy_Adult_reference_atlas.h5ad"
reference_data = scATAnno_preprocess.load_reference_data(reference_data_path)
SU007_Total_Post = scATAnno_preprocess.import_query_data(path = 'data/BCC_samples/SU007_Total_Post_vHealthyAdult',
mtx_file = 'matrix.mtx.gz',
cells_file = 'barcodes.tsv.gz',
features_file = 'features.tsv.gz',
variable_prefix = "SU007_Total_Post",
celltype_col = "celltypes",
add_metrics=False)
SU010_Total_Pre = scATAnno_preprocess.import_query_data(path = 'data/BCC_samples/SU010_Total_Pre_vHealthyAdult',
mtx_file = 'matrix.mtx.gz',
cells_file = 'barcodes.tsv',
features_file = 'features.tsv.gz',
variable_prefix = "SU010_Total_Pre",
celltype_col = "celltypes",
add_metrics=False)
[35]:
print(SU010_Total_Pre)
print(SU007_Total_Post)
print(reference_data)
assert reference_data.var.shape[0] == SU010_Total_Pre.var.shape[0]
assert reference_data.var.shape[0] == SU007_Total_Post.var.shape[0]
AnnData object with n_obs × n_vars = 1071 × 890130
obs: 'celltypes', 'tissue', 'dataset'
var: 'selected'
AnnData object with n_obs × n_vars = 666 × 890130
obs: 'celltypes', 'tissue', 'dataset'
var: 'selected'
AnnData object with n_obs × n_vars = 100158 × 890130
obs: 'celltypes', 'tissue', 'dataset'
var: 'selected'
uns: 'cell type_colors', 'major cell type_colors', 'spectral_eigenvalue', 'tissue_colors'
obsm: 'X_umap'
We first create a dictionary of reference and query datasets
[6]:
data_list = [SU007_Total_Post, SU010_Total_Pre]
print("total number of query cells: {}".format(np.sum([i.obs.shape[0] for i in data_list])))
total number of query cells: 1737
Before integration, we need to select features of datasets and prepare for integration
[7]:
select_features(reference_data)
for anndata in data_list:
select_features(anndata)
datasets={}
datasets["Atlas"] = reference_data
for anndata in data_list:
key = (anndata.obs["dataset"])[0]
datasets[key] = anndata
datasets
/Users/jiang/Dropbox (Partners HealthCare)/Software/ATAC_Annotation_V3/ATAC_Annotation_Package/scATAnno/SnapATAC2_spectral.py:159: ImplicitModificationWarning: Trying to modify attribute `.var` of view, initializing view as actual.
adata.var["selected"] = selected_features
/Users/jiang/Dropbox (Partners HealthCare)/Software/ATAC_Annotation_V3/ATAC_Annotation_Package/scATAnno/SnapATAC2_spectral.py:159: ImplicitModificationWarning: Trying to modify attribute `.var` of view, initializing view as actual.
adata.var["selected"] = selected_features
[7]:
{'Atlas': AnnData object with n_obs × n_vars = 100158 × 890130
obs: 'Unnamed: 0', 'sample', 'replicate', 'logUMI', 'tsse', 'tissue', 'cell type', 'Life stage', 'major cell type', 'dataset'
var: 'selected'
uns: 'cell type_colors', 'major cell type_colors', 'spectral_eigenvalue', 'tissue_colors'
obsm: 'X_umap',
'SU007_Total_Post': AnnData object with n_obs × n_vars = 666 × 890130
obs: 'celltypes', 'tissue', 'dataset'
var: 'selected',
'SU010_Total_Pre': AnnData object with n_obs × n_vars = 1071 × 890130
obs: 'celltypes', 'tissue', 'dataset'
var: 'selected'}
We use SnapATAC2 to integrate the reference atlas and multiple query data
[8]:
# Integrate reference and query data
integrated_adata = scATAnno_assignment.scATAnno_integrate_multiple(datasets, sample_size = 25000)
Compute similarity matrix
Normalization
Perform decomposition
Perform Nystrom extension
100%|█████████████████████████████████████████████| 6/6 [05:55<00:00, 59.27s/it]
Apply harmony to remove batch effects of datasets
[10]:
# Apply harmony to remove batch effects
integrated_adata_harmony = scATAnno_assignment.scATAnno_harmony(integrated_adata, batch_col = "dataset")
2023-05-25 14:11:10,406 - harmonypy - INFO - Iteration 1 of 20
2023-05-25 14:11:31,432 - harmonypy - INFO - Iteration 2 of 20
2023-05-25 14:11:52,638 - harmonypy - INFO - Converged after 2 iterations
Plot UMAP using spectral embeddings
[47]:
# Plot UMAP using spectral embeddings
integrated_adata = scATAnno_assignment.scATAnno_umap(integrated_adata_harmony, out_dir, use_rep = "X_spectral_harmony", save = True)
[16]:
# Set plotting parameters
scATAnno_plotting.defaultPlotting_umap()
sc.pl.umap(integrated_adata, color="dataset", palette = ['#cccccc', '#228833','#8c564b'], size=10, show=True, title = "Integration")
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
cax = scatter(

[46]:
integrated_adata.obs
[46]:
tissue | dataset | celltypes | |
---|---|---|---|
adipose_omentum_SM-CSSD4_1+AAGTCCTTAGGGCTGATATGCT_Atlas | adipose_omentum_SM-CSSD4 | Atlas | Mesenchymal |
adipose_omentum_SM-CSSD4_1+AACGACCAAAGGCGTACTTCTA_Atlas | adipose_omentum_SM-CSSD4 | Atlas | Mesenchymal |
adipose_omentum_SM-IOBHJ_1+ATACTCGCTTAGCGGCTGAAAC_Atlas | adipose_omentum_SM-IOBHJ | Atlas | Mesenchymal |
adipose_omentum_SM-CSSD4_1+CAGGAAAGCAAAAGGGATGCCA_Atlas | adipose_omentum_SM-CSSD4 | Atlas | Mesenchymal |
adipose_omentum_SM-IOBHJ_1+ATGCAGGTAAATAGCTGTGTCT_Atlas | adipose_omentum_SM-IOBHJ | Atlas | Mesenchymal |
... | ... | ... | ... |
SU010_Total_Pre#TGGGTTATCCCAGCAG-1_SU010_Total_Pre | SU010_Total_Pre | SU010_Total_Pre | SU010_Total_Pre |
SU010_Total_Pre#GGGTGTCTCGGGACAA-1_SU010_Total_Pre | SU010_Total_Pre | SU010_Total_Pre | SU010_Total_Pre |
SU010_Total_Pre#ACCGGGTCATCCCAAA-1_SU010_Total_Pre | SU010_Total_Pre | SU010_Total_Pre | SU010_Total_Pre |
SU010_Total_Pre#CCCGTTATCTAGCAGT-1_SU010_Total_Pre | SU010_Total_Pre | SU010_Total_Pre | SU010_Total_Pre |
SU010_Total_Pre#GAACGTTGTTAAGTCC-1_SU010_Total_Pre | SU010_Total_Pre | SU010_Total_Pre | SU010_Total_Pre |
101895 rows × 3 columns
[51]:
reference = integrated_adata[integrated_adata.obs['dataset'] == "Atlas",:].copy()
reference.obs["celltypes"] = scATAnno_assignment.curate_celltype_names(reference.obs["celltypes"], atlas = "HealthyAdult")
[61]:
import pickle
with open(os.path.join(out_dir, 'HealthyAdult_reference_palette.pickle'), 'rb') as handle:
celltype_palette = pickle.load(handle)
scATAnno_plotting.defaultPlotting_umap()
sc.pl.umap(reference, color="celltypes", palette = celltype_palette, title = "Healthy Adult Reference Atlas")
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
cax = scatter(

After integration, we want to assign cell types to the query data. First, query cells are assigned labels using K-Nearest neighbors along with KNN-based uncertainty score. Second, query cell labels are corrected using weighted-distance-based uncertainty score. Third, query cells are clustered and annotated at the cluster level.
[62]:
query = integrated_adata[integrated_adata.obs['dataset'] != "Atlas",:].copy()
Obtain UMAP coordinates and clusters provided by the original publication. Then Query cells are clustered and annotated at the cluster-level
[71]:
projection_allcells = pd.read_table("data/BCC_samples/BCC_study_metadata.csv", index_col = 0, sep = ",")
projection_allcells.index = [projection_allcells.loc[i,"Group"] + "#" + i for i in projection_allcells.index]
sample_names = ["SU010_Total_Pre", "SU007_Total_Post"]
projection = projection_allcells[projection_allcells["Group"].isin(sample_names)]
query.obs = pd.merge(query.obs, projection, left_index=True, right_index=True)
projection.head()
[71]:
UMAP_1 | UMAP_2 | Clusters | Group | depth | FRIP | Barcodes | Internal_Name | Group_Barcode | true_label | |
---|---|---|---|---|---|---|---|---|---|---|
SU007_Total_Post#GGACACCAGCGCACAA-1_SU007_Total_Post | 1.320694 | 12.943564 | Cluster14 | SU007_Total_Post | 50427 | 0.676443 | GGACACCAGCGCACAA-1 | SU007_Total_Post_57 | SU007_Total_Post#GGACACCAGCGCACAA-1 | Myeloid |
SU007_Total_Post#ACTAGGTGTACCCATA-1_SU007_Total_Post | 5.631482 | 13.093065 | Cluster15 | SU007_Total_Post | 45889 | 0.580553 | ACTAGGTGTACCCATA-1 | SU007_Total_Post_59 | SU007_Total_Post#ACTAGGTGTACCCATA-1 | Endothelial |
SU007_Total_Post#AGGACGATCGGTCTCT-1_SU007_Total_Post | 1.471412 | -7.473051 | Cluster4 | SU007_Total_Post | 21568 | 0.688590 | AGGACGATCGGTCTCT-1 | SU007_Total_Post_60 | SU007_Total_Post#AGGACGATCGGTCTCT-1 | Treg |
SU007_Total_Post#GACCAATCATGGGTGA-1_SU007_Total_Post | -4.932662 | 2.046299 | Cluster7 | SU007_Total_Post | 25402 | 0.714097 | GACCAATCATGGGTGA-1 | SU007_Total_Post_61 | SU007_Total_Post#GACCAATCATGGGTGA-1 | Memory CD8 T |
SU007_Total_Post#GCTCGAGAGTAATGTG-1_SU007_Total_Post | -4.171803 | 3.467478 | Cluster7 | SU007_Total_Post | 6544 | 0.714318 | GCTCGAGAGTAATGTG-1 | SU007_Total_Post_62 | SU007_Total_Post#GCTCGAGAGTAATGTG-1 | Memory CD8 T |
We first assign cell types based on KNN and compute KNN-based uncertainty score
[72]:
# Perform KNN assignment
query_KNN = scATAnno_assignment.scATAnno_KNN_assign(reference, query, reference_label_col=reference_label_col, low_dim_col=use_rep)
Query cells are assigned high weighted distance-based uncertainty score if their distances to the assigned reference cell type centroid are greater than 90 percentile. Query cells with uncertainty score greater than 0.2 are annotated as unknown.
[73]:
# Perform weighted-distance based assignment
distance_threshold = 90
uncertainty_threshold = 0.2
reference_label_col = "celltypes"
use_rep = "X_spectral_harmony"
atlas = "HealthyAdult"
query_distance = scATAnno_assignment.scATAnno_distance_assign(reference, query_KNN, reference_label_col=reference_label_col, distance_threshold=distance_threshold, atlas=atlas, uncertainty_threshold=uncertainty_threshold, use_rep = use_rep)
We then project query cells into the same UMAP coordinates as original publication for better comparisons, and perform cluster-level assignment.
[85]:
query_distance.obsm["X_umap"] = np.array(query_distance.obs[["UMAP_1", "UMAP_2"]])
query_annotated = scATAnno_assignment.scATAnno_cluster_assign(query_distance, use_rep=use_rep, cluster_col = "Clusters", UMAP=False)
query_annotated
[85]:
AnnData object with n_obs × n_vars = 1737 × 890130
obs: 'tissue', 'dataset', 'celltypes', 'uncertainty_score_step1', 'pred_y', 'uncertainty_score_step2', 'Uncertainty_Score', '1.knn-based_celltype', '2.corrected_celltype', 'UMAP_1', 'UMAP_2', 'Clusters', 'Group', 'depth', 'FRIP', 'Barcodes', 'Internal_Name', 'Group_Barcode', 'true_label', 'cluster_annotation'
uns: 'dataset_colors'
obsm: 'X_spectral', 'X_spectral_harmony', 'X_umap', 'kernel_distance', 'distance', 'indices', 'neighbors_labels'
Compare annotation results from scATAnno and the origional publication
[95]:
sc.pl.umap(query_annotated, color = ['true_label', "cluster_annotation", 'Uncertainty_Score'], title = ["Annotation Published in Satpathy et al.","scATAnno Annotation"], legend_loc = "on data")
WARNING: The title list is shorter than the number of panels. Using 'color' value instead for some plots.
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
cax = scatter(
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
cax = scatter(

[129]:
immune_cells = pd.DataFrame(query_annotated[query_annotated.obs["cluster_annotation"] == "Immune Cells"].obs.index)
print(immune_cells.head())
0
0 SU007_Total_Post#TCGTTCGAGGTTCTCA-1_SU007_Tota...
1 SU007_Total_Post#GCAACCGAGAAGAGTG-1_SU007_Tota...
2 SU007_Total_Post#GTGTCAAGTTAGTAGA-1_SU007_Tota...
3 SU007_Total_Post#CTTGTCGAGCGAGCTA-1_SU007_Tota...
4 SU007_Total_Post#GCAGCTGCATTCGTCC-1_SU007_Tota...
Part II: Second round annotation - Apply BCC TIL reference atlas to BCC immune cells
Download TIL reference data
[2]:
# !wget "https://www.dropbox.com/s/ky4jezsj3pf2qwi/BCC_TIL_reference_atlas_final.h5ad?dl=1"
In the second round annotation, we apply BCC TIL reference atlas to the immune cells identified from the first round. Repeat the integration process. This time we use BCC TIL as reference, and use BCC query data with TIL reference peak as feature input. Reference atlas is stored as h5ad AnnData in the data directory; query data is imported as MTX and TSV files from QuickATAC output.
[108]:
reference_data_path = "data/Reference/BCC_TIL_reference_atlas_final.h5ad"
reference_data = scATAnno_preprocess.load_reference_data(reference_data_path)
SU007_Total_Post = scATAnno_preprocess.import_query_data(path = 'data/BCC_samples/SU007_Total_Post_vBCC',
mtx_file = 'matrix.mtx.gz',
cells_file = 'barcodes.tsv.gz',
features_file = 'features.tsv.gz',
variable_prefix = "SU007_Total_Post",
celltype_col = "celltypes",
add_metrics=False)
SU010_Total_Pre = scATAnno_preprocess.import_query_data(path = 'data/BCC_samples/SU010_Total_Pre_vBCC',
mtx_file = 'matrix.mtx.gz',
cells_file = 'barcodes.tsv',
features_file = 'features.tsv.gz',
variable_prefix = "SU010_Total_Pre",
celltype_col = "celltypes",
add_metrics=False)
[132]:
data_list = [SU007_Total_Post, SU010_Total_Pre]
print("Total number of query cells: {}".format(np.sum([i.obs.shape[0] for i in data_list])))
Total number of query cells: 1737
Select the immune cells from first round annotation
[130]:
immune_cells.index = immune_cells[0]
immune_cells.index = [i.split("_SU")[0] for i in immune_cells.index]
immune_cells.head()
[130]:
0 | |
---|---|
SU007_Total_Post#TCGTTCGAGGTTCTCA-1 | SU007_Total_Post#TCGTTCGAGGTTCTCA-1_SU007_Tota... |
SU007_Total_Post#GCAACCGAGAAGAGTG-1 | SU007_Total_Post#GCAACCGAGAAGAGTG-1_SU007_Tota... |
SU007_Total_Post#GTGTCAAGTTAGTAGA-1 | SU007_Total_Post#GTGTCAAGTTAGTAGA-1_SU007_Tota... |
SU007_Total_Post#CTTGTCGAGCGAGCTA-1 | SU007_Total_Post#CTTGTCGAGCGAGCTA-1_SU007_Tota... |
SU007_Total_Post#GCAGCTGCATTCGTCC-1 | SU007_Total_Post#GCAGCTGCATTCGTCC-1_SU007_Tota... |
[133]:
data_list_use = data_list.copy()
for i in range(len(data_list_use)):
print(data_list_use[i].obs["dataset"][0])
data_list_use[i] = data_list_use[i][data_list_use[i].obs.index.isin(immune_cells.index)]
print(data_list_use[i])
print("Total immune cells: {}".format(np.sum([i.obs.shape[0] for i in data_list_use])))
SU007_Total_Post
View of AnnData object with n_obs × n_vars = 618 × 344492
obs: 'celltypes', 'tissue', 'dataset'
var: 'selected'
SU010_Total_Pre
View of AnnData object with n_obs × n_vars = 188 × 344492
obs: 'celltypes', 'tissue', 'dataset'
var: 'selected'
Total immune cells: 806
We use SnapATAC2 to integrate the reference atlas and multiple query data
[134]:
for anndata in data_list_use:
select_features(anndata)
select_features(reference_data)
datasets={}
datasets["Atlas"] = reference_data
for anndata in data_list_use:
key = (anndata.obs["dataset"])[0]
datasets[key] = anndata
datasets
/Users/jiang/Dropbox (Partners HealthCare)/Software/ATAC_Annotation_V3/ATAC_Annotation_Package/scATAnno/SnapATAC2_spectral.py:159: ImplicitModificationWarning: Trying to modify attribute `.var` of view, initializing view as actual.
adata.var["selected"] = selected_features
[134]:
{'Atlas': AnnData object with n_obs × n_vars = 22008 × 344492
obs: 'celltypes', 'tissue', 'dataset'
var: 'selected'
uns: 'Group_colors', 'celltypes_colors', 'spectral_eigenvalue', 'true_label_colors'
obsm: 'X_spectral', 'X_umap',
'SU007_Total_Post': AnnData object with n_obs × n_vars = 618 × 344492
obs: 'celltypes', 'tissue', 'dataset'
var: 'selected',
'SU010_Total_Pre': AnnData object with n_obs × n_vars = 188 × 344492
obs: 'celltypes', 'tissue', 'dataset'
var: 'selected'}
[135]:
# Integrate reference and query data
integrated_adata = scATAnno_assignment.scATAnno_integrate_multiple(datasets, sample_size = 25000)
Compute similarity matrix
Normalization
Perform decomposition
[136]:
# Apply harmony to remove batch effects
integrated_adata_harmony = scATAnno_assignment.scATAnno_harmony(integrated_adata, batch_col = "dataset")
2023-05-25 15:23:46,290 - harmonypy - INFO - Iteration 1 of 20
2023-05-25 15:23:49,706 - harmonypy - INFO - Iteration 2 of 20
2023-05-25 15:23:53,264 - harmonypy - INFO - Iteration 3 of 20
2023-05-25 15:23:56,800 - harmonypy - INFO - Iteration 4 of 20
2023-05-25 15:24:00,334 - harmonypy - INFO - Iteration 5 of 20
2023-05-25 15:24:03,925 - harmonypy - INFO - Iteration 6 of 20
2023-05-25 15:24:07,365 - harmonypy - INFO - Iteration 7 of 20
2023-05-25 15:24:11,058 - harmonypy - INFO - Converged after 7 iterations
[137]:
# Plot UMAP using spectral embeddings
integrated_adata = scATAnno_assignment.scATAnno_umap(integrated_adata_harmony, out_dir, use_rep = "X_spectral_harmony", save = True)
[138]:
# Set plotting parameters
scATAnno_plotting.defaultPlotting_umap()
sc.pl.umap(integrated_adata, color="dataset", palette = ['#cccccc', '#228833','#8c564b'], size=10, show=True, title = "Integration")
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
cax = scatter(

[145]:
reference = integrated_adata[integrated_adata.obs['dataset'] == "Atlas",:].copy()
reference.obs["celltypes"] = scATAnno_assignment.curate_celltype_names(reference.obs["celltypes"], atlas = "TIL")
[144]:
import pickle
with open(os.path.join(out_dir, 'TIL_reference_palette.pickle'), 'rb') as handle:
celltype_palette = pickle.load(handle)
scATAnno_plotting.defaultPlotting_umap()
sc.pl.umap(reference, color="celltypes", palette = celltype_palette, title = "TIL Reference Atlas")
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
cax = scatter(

[146]:
query = integrated_adata[integrated_adata.obs['dataset'] != "Atlas",:].copy()
[147]:
projection_allcells = pd.read_table("data/BCC_samples/BCC_study_metadata.csv", index_col = 0, sep = ",")
projection_allcells.index = [projection_allcells.loc[i,"Group"] + "#" + i for i in projection_allcells.index]
projection_allcells["true_label"] = [i.strip() for i in projection_allcells["true_label"]]
sample_names = ["SU010_Total_Pre", "SU007_Total_Post"]
projection = projection_allcells[projection_allcells["Group"].isin(sample_names)]
query.obs = pd.merge(query.obs, projection, left_index=True, right_index=True)
query.obs
[147]:
celltypes | tissue | dataset | UMAP_1 | UMAP_2 | Clusters | Group | depth | FRIP | Barcodes | Internal_Name | Group_Barcode | true_label | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SU007_Total_Post#TCGTTCGAGGTTCTCA-1_SU007_Total_Post | SU007_Total_Post | SU007_Total_Post | SU007_Total_Post | 2.118399 | -6.020474 | Cluster4 | SU007_Total_Post | 10159 | 0.722758 | TCGTTCGAGGTTCTCA-1 | SU007_Total_Post_312 | SU007_Total_Post#TCGTTCGAGGTTCTCA-1 | Treg |
SU007_Total_Post#GCAACCGAGAAGAGTG-1_SU007_Total_Post | SU007_Total_Post | SU007_Total_Post | SU007_Total_Post | 2.138090 | -8.092027 | Cluster4 | SU007_Total_Post | 12455 | 0.715857 | GCAACCGAGAAGAGTG-1 | SU007_Total_Post_1 | SU007_Total_Post#GCAACCGAGAAGAGTG-1 | Treg |
SU007_Total_Post#GTGTCAAGTTAGTAGA-1_SU007_Total_Post | SU007_Total_Post | SU007_Total_Post | SU007_Total_Post | 0.018025 | -1.757976 | Cluster1 | SU007_Total_Post | 9101 | 0.621910 | GTGTCAAGTTAGTAGA-1 | SU007_Total_Post_468 | SU007_Total_Post#GTGTCAAGTTAGTAGA-1 | Naive CD4 T |
SU007_Total_Post#CTTGTCGAGCGAGCTA-1_SU007_Total_Post | SU007_Total_Post | SU007_Total_Post | SU007_Total_Post | -1.315312 | 1.582670 | Cluster5 | SU007_Total_Post | 22287 | 0.669695 | CTTGTCGAGCGAGCTA-1 | SU007_Total_Post_3 | SU007_Total_Post#CTTGTCGAGCGAGCTA-1 | Naive CD8 T |
SU007_Total_Post#GCAGCTGCATTCGTCC-1_SU007_Total_Post | SU007_Total_Post | SU007_Total_Post | SU007_Total_Post | 0.688227 | -1.430185 | Cluster1 | SU007_Total_Post | 22482 | 0.693866 | GCAGCTGCATTCGTCC-1 | SU007_Total_Post_4 | SU007_Total_Post#GCAGCTGCATTCGTCC-1 | Naive CD4 T |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
SU010_Total_Pre#CCAGATAAGTACGCGA-1_SU010_Total_Pre | SU010_Total_Pre | SU010_Total_Pre | SU010_Total_Pre | -3.657552 | 2.379094 | Cluster7 | SU010_Total_Pre | 6428 | 0.583852 | CCAGATAAGTACGCGA-1 | SU010_Total_Pre_1057 | SU010_Total_Pre#CCAGATAAGTACGCGA-1 | Memory CD8 T |
SU010_Total_Pre#ACAGGCCTCAAGAGAT-1_SU010_Total_Pre | SU010_Total_Pre | SU010_Total_Pre | SU010_Total_Pre | -3.117948 | 1.990111 | Cluster5 | SU010_Total_Pre | 3820 | 0.529188 | ACAGGCCTCAAGAGAT-1 | SU010_Total_Pre_1058 | SU010_Total_Pre#ACAGGCCTCAAGAGAT-1 | Naive CD8 T |
SU010_Total_Pre#TGGGTTATCCCAGCAG-1_SU010_Total_Pre | SU010_Total_Pre | SU010_Total_Pre | SU010_Total_Pre | -1.875685 | 0.338385 | Cluster5 | SU010_Total_Pre | 4741 | 0.469099 | TGGGTTATCCCAGCAG-1 | SU010_Total_Pre_1061 | SU010_Total_Pre#TGGGTTATCCCAGCAG-1 | Naive CD8 T |
SU010_Total_Pre#GGGTGTCTCGGGACAA-1_SU010_Total_Pre | SU010_Total_Pre | SU010_Total_Pre | SU010_Total_Pre | -4.008999 | 16.443800 | Cluster13 | SU010_Total_Pre | 6619 | 0.599940 | GGGTGTCTCGGGACAA-1 | SU010_Total_Pre_1064 | SU010_Total_Pre#GGGTGTCTCGGGACAA-1 | Plasma B |
SU010_Total_Pre#CCCGTTATCTAGCAGT-1_SU010_Total_Pre | SU010_Total_Pre | SU010_Total_Pre | SU010_Total_Pre | -4.014687 | 1.344591 | Cluster10 | SU010_Total_Pre | 32230 | 0.209680 | CCCGTTATCTAGCAGT-1 | SU010_Total_Pre_1068 | SU010_Total_Pre#CCCGTTATCTAGCAGT-1 | NK1 |
806 rows × 13 columns
We first assign cell types based on KNN and compute KNN-based uncertainty score
[149]:
# Perform KNN assignment
reference_label_col = "celltypes"
use_rep = "X_spectral_harmony"
query_KNN = scATAnno_assignment.scATAnno_KNN_assign(reference, query, reference_label_col=reference_label_col, low_dim_col=use_rep)
Because we are annotating TIL subtypes, we adopt more loose threshold cutoffs to annotate query cells. Query cells are assigned high weighted distance-based uncertainty score if their distances to the assigned reference cell type centroid are greater than 95 percentile. Query cells with uncertainty score greater than 0.5 are annotated as unknown.
[150]:
# Perform weighted-distance based assignment
distance_threshold = 95
uncertainty_threshold = 0.5
atlas = "HealthyAdult"
query_distance = scATAnno_assignment.scATAnno_distance_assign(reference, query_KNN, reference_label_col=reference_label_col, distance_threshold=distance_threshold, atlas=atlas, uncertainty_threshold=uncertainty_threshold, use_rep = use_rep)
We then project query cells into the same UMAP coordinates as original publication for better comparisons, and perform cluster-level assignment.
[151]:
query_distance.obsm["X_umap"] = np.array(query_distance.obs[["UMAP_1", "UMAP_2"]])
query_annotated = scATAnno_assignment.scATAnno_cluster_assign(query_distance, use_rep=use_rep, cluster_col = "Clusters", UMAP=False)
query_annotated
[151]:
AnnData object with n_obs × n_vars = 806 × 344492
obs: 'celltypes', 'tissue', 'dataset', 'UMAP_1', 'UMAP_2', 'Clusters', 'Group', 'depth', 'FRIP', 'Barcodes', 'Internal_Name', 'Group_Barcode', 'true_label', 'uncertainty_score_step1', 'pred_y', 'uncertainty_score_step2', 'Uncertainty_Score', '1.knn-based_celltype', '2.corrected_celltype', 'cluster_annotation'
uns: 'dataset_colors'
obsm: 'X_spectral', 'X_spectral_harmony', 'X_umap', 'kernel_distance', 'distance', 'indices', 'neighbors_labels'
[152]:
sc.pl.umap(query_annotated, color = ['true_label', "cluster_annotation", 'Uncertainty_Score'], title = ["Annotation Published in Satpathy et al.","scATAnno Annotation"], legend_loc = "on data")
WARNING: The title list is shorter than the number of panels. Using 'color' value instead for some plots.
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
cax = scatter(
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
cax = scatter(

Finally, compare the annotation results between scATAnno and original publication.
[175]:
outcomes2 = []
for i in query_annotated.obs["cluster_annotation"].index:
if query_annotated.obs.loc[i,"cluster_annotation"] == query_annotated.obs.loc[i,"true_label"]:
outcomes2.append("Consistent")
elif query_annotated.obs.loc[i,"cluster_annotation"] == "unknown":
outcomes2.append("Unknown")
else:
outcomes2.append("Inconsistent")
query_annotated.obs["outcomes2"] = outcomes2
sc.pl.umap(query_annotated, color = "outcomes2", palette = ['#ff7f0e','#279e68'], title = "Comparison")
/opt/anaconda3/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
cax = scatter(

[180]:
adata_query = query_annotated.copy()
variable_col = "true_label"
outcome_col = "outcomes2"
listdict = []
for ct in adata_query.obs[variable_col].unique():
tmp_dict = {}
for pt in np.unique(adata_query.obs[outcome_col]):
tmp_dict[pt] = (
np.sum(
adata_query.obs[adata_query.obs[variable_col] == ct][outcome_col] == pt
)
)
l = len(adata_query.obs[adata_query.obs[variable_col] == ct])
tmp_dict['ct'] = f'{ct} ({l})'
tmp_dict['n_cells'] = len(adata_query.obs[adata_query.obs[variable_col] == ct])
listdict.append(tmp_dict)
df = pd.DataFrame(listdict).set_index('ct').sort_values(by='ct')
df = df.sort_values("n_cells", ascending=True)
fig, ax = plt.subplots(1, 1, figsize=(2, 6))
df[df.columns[0:len(df.columns)-1]].plot(kind='barh', stacked=True, ax=ax, width=1.0, color = ['#ff7f0e','#279e68'])
ax.set_xticklabels(ax.get_xticklabels(), fontsize=15)
ax.set_xlabel('cell numbers')
ax.set_ylabel('')
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., frameon=False)
ax.grid(False)
sns.despine()
plt.show()
/var/folders/79/yz2sgggd11n9c7kj2nws4w700000gp/T/ipykernel_5654/2520740504.py:24: UserWarning: FixedFormatter should only be used together with FixedLocator
ax.set_xticklabels(ax.get_xticklabels(), fontsize=15)

Compute accuracy and F1 scores
[183]:
query = query_annotated.copy()
paper_anno_col="true_label"
cluster_anno_col = "cluster_annotation"
from sklearn.metrics import f1_score, confusion_matrix, classification_report
report = classification_report(y_true=query.obs[paper_anno_col],
y_pred=query.obs[cluster_anno_col],
output_dict=True)
#Also generate a confusion matrix
cm = confusion_matrix(y_true=query.obs[paper_anno_col],
y_pred=query.obs[cluster_anno_col],)
FP = cm.sum(axis=0)-np.diag(cm)
FN = cm.sum(axis=1) - np.diag(cm)
TP = np.diag(cm)
TN = cm.sum() - (FP + FN + TP)
class_spec = TN/(TN+FP)
perfdf = pd.DataFrame(columns = ['class',
'precision',
'recall',
'f1_score',
'support',
'specificity'])
i=0
for key, val in report.items():
#we don't use accuracy so skip it
if key =='accuracy':
continue
perflist = []
perflist.append(key)
for key2, val2 in val.items():
perflist.append(val2)
if i>=len(class_spec):
perflist.append(0)
else:
perflist.append(class_spec[i])
perfdf=perfdf.append(pd.Series(perflist, index=perfdf.columns), ignore_index=True)
i+=1
/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
[184]:
perfdf
[184]:
class | precision | recall | f1_score | support | specificity | |
---|---|---|---|---|---|---|
0 | B | 1.000000 | 1.000000 | 1.000000 | 25 | 1.000000 |
1 | CD8 TEx | 1.000000 | 1.000000 | 1.000000 | 23 | 1.000000 |
2 | Effector CD8 T | 1.000000 | 1.000000 | 1.000000 | 48 | 1.000000 |
3 | Memory CD8 T | 1.000000 | 1.000000 | 1.000000 | 147 | 1.000000 |
4 | Myeloid | 1.000000 | 1.000000 | 1.000000 | 21 | 1.000000 |
5 | NK1 | 1.000000 | 1.000000 | 1.000000 | 29 | 1.000000 |
6 | NK2 | 1.000000 | 1.000000 | 1.000000 | 16 | 1.000000 |
7 | Naive CD4 T | 1.000000 | 1.000000 | 1.000000 | 140 | 1.000000 |
8 | Naive CD8 T | 1.000000 | 1.000000 | 1.000000 | 51 | 1.000000 |
9 | Plasma B | 1.000000 | 1.000000 | 1.000000 | 118 | 1.000000 |
10 | Tfh | 1.000000 | 1.000000 | 1.000000 | 44 | 1.000000 |
11 | Th1 | 0.000000 | 0.000000 | 0.000000 | 23 | 1.000000 |
12 | Th17 | 1.000000 | 1.000000 | 1.000000 | 19 | 1.000000 |
13 | Treg | 1.000000 | 1.000000 | 1.000000 | 102 | 1.000000 |
14 | unknown | 0.000000 | 0.000000 | 0.000000 | 0 | 0.971464 |
15 | macro avg | 0.866667 | 0.866667 | 0.866667 | 806 | 0.000000 |
16 | weighted avg | 0.971464 | 0.971464 | 0.971464 | 806 | 0.000000 |
[186]:
# Accuracy
from scATAnno import scATAnno_evaluation
df = query.obs.copy()
cell_accr = {}
for celltype_of_interest in np.unique(df[paper_anno_col]):
true_cells = df[df[paper_anno_col]==celltype_of_interest].index
accr = scATAnno_evaluation.compute_accuracy(df.loc[true_cells,paper_anno_col], df.loc[true_cells,cluster_anno_col])
cell_accr[celltype_of_interest] = accr
cell_level_df = pd.DataFrame([cell_accr], columns=cell_accr.keys())
cell_level_df = cell_level_df.T
cell_level_df.columns = ["accuracy"]
cell_level_df
[186]:
accuracy | |
---|---|
B | 1.0 |
CD8 TEx | 1.0 |
Effector CD8 T | 1.0 |
Memory CD8 T | 1.0 |
Myeloid | 1.0 |
NK1 | 1.0 |
NK2 | 1.0 |
Naive CD4 T | 1.0 |
Naive CD8 T | 1.0 |
Plasma B | 1.0 |
Tfh | 1.0 |
Th1 | 0.0 |
Th17 | 1.0 |
Treg | 1.0 |
[187]:
data = np.array([np.float(cell_level_df.mean()),
np.float(perfdf[perfdf['class'] == "weighted avg"]["f1_score"])])
overallscore = pd.DataFrame(data.T, index = ['Accuracy',
'Weighted F1',], columns=["value"]
)
overallscore
[187]:
value | |
---|---|
Accuracy | 0.928571 |
Weighted F1 | 0.971464 |
[189]:
ax = sns.barplot(x = overallscore.index, y= overallscore["value"], capsize=.2, dodge=True, palette="Blues",
)
ax.set_title("Classification Performance")
ax.set_ylim(0,1)
ax.set_ylabel("")
fig = ax.get_figure()
fig.tight_layout()
plt.show()
