PBMC10K Multiome

[1]:
import numpy as np
import polars as pl
import snapatac2 as snap
import pandas as pd
import scanpy as sc
from scipy.sparse import csr_matrix
from scipy import sparse
import scipy.io
import os
import anndata as ad # Anndata version must > 0.8
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='white')

import scATAnno
from scATAnno.SnapATAC2_spectral import *
from scATAnno.SnapATAC2_tools import *
from scATAnno.SnapATAC2_utils import *
from scATAnno import scATAnno_preprocess
from scATAnno import scATAnno_assignment
from scATAnno import scATAnno_integration
from scATAnno import scATAnno_plotting

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)

default_28 = scATAnno_plotting.get_palettes("default28")
default_102 = scATAnno_plotting.get_palettes("default102")

In this tutorial, we apply PBMC reference atlas to PBMC 10K Multiome dataset. The input files are 1) reference atlas in h5ad format; 2) query data in mtx and tsv formats, whose features are aligned with PBMC reference peak.

Download reference data

[2]:
# !wget -O PBMC_reference_atlas_final.h5ad "https://www.dropbox.com/s/y9wc6h5mmydj7gf/PBMC_reference_atlas_final.h5ad?dl=1"

Download query data, including 1) scATAC (matrix.mtx, features.tsv and barcodes.tsv), features are the reference peak of PBMC reference atlas 2) scRNA (filtered_feature_bc_matrix.h5)

[3]:
# !wget -O PBMC10k_multiome_vPBMCatlas.zip "https://www.dropbox.com/s/3g63m832mbeec4s/PBMC10k_multiome_vPBMCatlas.zip?dl=1"

Donwload palette file

[4]:
# !wget -O reference_palette.pickle "https://www.dropbox.com/scl/fi/h9gryg9gwdegs8jaob8ga/reference_palette.pickle?rlkey=41zarytshwcowko0r7ndxkiih&st=bxgkyno9&dl=0"

We load reference atlas data and query data. Reference atlas is stored as h5ad AnnData in the data directory; query data is imported from QuickATAC output.

[5]:
os.chdir("scATAnno-main")

output_name = "PBMC_10K_Multiome"
out_dir = os.path.join("scATAnno-main/case_study", output_name)
os.makedirs(out_dir, exist_ok=True)

reference_data_path = "PBMC_reference_atlas_final.h5ad"
reference_data = scATAnno_preprocess.load_reference_data(reference_data_path)
query_data = scATAnno_preprocess.import_query_data(path = 'PBMC10k_multiome_vPBMCatlas/scATAC/',
                                    mtx_file = 'matrix.mtx.gz',
                                    cells_file = 'barcodes.tsv.gz',
                                    features_file = 'features.tsv.gz',
                                    variable_prefix = "pbmc10k_query",
                                    celltype_col = "celltypes",
                                    add_metrics=False)

Before performing integration, we check feature dimensions of the reference and query.

[6]:
print(reference_data)
print(query_data)
assert reference_data.var.shape[0] == query_data.var.shape[0]
AnnData object with n_obs × n_vars = 39441 × 196618
    obs: 'celltypes', 'tissue', 'dataset'
View of AnnData object with n_obs × n_vars = 11909 × 196618
    obs: 'celltypes', 'tissue', 'dataset'

We integrate the reference atlas and query data using SnapATAC2 by 1) concatenating two datasets and selecting highly informative “landmark” reference cells to compute similarity matrix, normalize the matrix and eigen decompositionand and 2) performing Nystrom extension to project the remaining reference and query cells onto the same spectral embedding space

[7]:
# Integrate reference and query data
integrated_adata = scATAnno_assignment.scATAnno_integrate(reference_data, query_data, variable_prefix = "pbmc10k_query", sample_size = 25000)
/Users/francis/Desktop/scATAnno-main/scATAnno/SnapATAC2_spectral.py:162: ImplicitModificationWarning: Trying to modify attribute `.var` of view, initializing view as actual.
  adata.var["selected"] = selected_features
Compute similarity matrix
Normalization
Perform decomposition
Perform Nystrom extension
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [13:23<00:00, 267.99s/it]

The following method provides an alternative way to obtain “integrated_adata” (the box above). This method uses all reference samples. The time to calculate “integrated_adata” drops significantly (as only the “Perform Nystrom extension” step is performed) while the accuracy only have a minimal decrease.

[8]:
# First we need to specify the paths for features and models:
# feature_file = "reference_embedding_features.csv"
# model_file ="reference_embedding_model.model"

# You can build the model from prep_data/scATAnno-reference-building-example, or download these files from:
# !wget -O reference_embedding_model.model "https://www.dropbox.com/scl/fi/fgh2vw5jk7g4q68n9k9m7/reference_embedding_model.model?rlkey=6q6n5p5hntqaje7oeqtwezdn4&st=pewaytgz&dl=0"
# !wget -O reference_embedding_features.csv "https://www.dropbox.com/scl/fi/xa6cdy95oqktpy95bw6h8/reference_embedding_features.csv?rlkey=u1of7llfxvdz736bj3txzn0lx&st=e8la05zx&dl=0"

#with both feature_file and model_file for our reference, we can project the data from query_data to the reference data in the spectral space:
#integrated_adata = scATAnno_assignment.scATAnno_query_projection(reference_data, query_data, feature_file, model_file)

Next, apply harmony to spectral embeddings of integrated data to regress out batch effects indicated in batch_col.

[9]:
# Apply harmony to remove batch effects
integrated_adata_harmony = scATAnno_assignment.scATAnno_harmony(integrated_adata, batch_col = "dataset")
2025-01-10 01:01:07,599 - harmonypy - INFO - Computing initial centroids with sklearn.KMeans...
2025-01-10 01:01:07 - INFO - Computing initial centroids with sklearn.KMeans...
2025-01-10 01:01:14,439 - harmonypy - INFO - sklearn.KMeans initialization complete.
2025-01-10 01:01:14 - INFO - sklearn.KMeans initialization complete.
2025-01-10 01:01:14,566 - harmonypy - INFO - Iteration 1 of 20
2025-01-10 01:01:14 - INFO - Iteration 1 of 20
2025-01-10 01:01:25,173 - harmonypy - INFO - Iteration 2 of 20
2025-01-10 01:01:25 - INFO - Iteration 2 of 20
2025-01-10 01:01:35,250 - harmonypy - INFO - Iteration 3 of 20
2025-01-10 01:01:35 - INFO - Iteration 3 of 20
2025-01-10 01:01:45,520 - harmonypy - INFO - Iteration 4 of 20
2025-01-10 01:01:45 - INFO - Iteration 4 of 20
2025-01-10 01:01:55,620 - harmonypy - INFO - Iteration 5 of 20
2025-01-10 01:01:55 - INFO - Iteration 5 of 20
2025-01-10 01:02:07,263 - harmonypy - INFO - Converged after 5 iterations
2025-01-10 01:02:07 - INFO - Converged after 5 iterations
[10]:
integrated_adata_harmony
[10]:
AnnData object with n_obs × n_vars = 51350 × 196618
    obs: 'celltypes', 'tissue', 'dataset'
    obsm: 'X_spectral', 'X_spectral_harmony'

we then plot UMAP of the integrated data using harmonized spectral embeddings. Optionally, we can save the AnnData object in an output directory.

[11]:
# Plot UMAP using spectral embeddings
integrated_adata = scATAnno_assignment.scATAnno_umap(integrated_adata_harmony, out_dir, use_rep = "X_spectral_harmony", save = True)
integrated_adata
/opt/anaconda3/envs/scATAnno/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
/opt/anaconda3/envs/scATAnno/lib/python3.11/site-packages/umap/umap_.py:1952: UserWarning: n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.
  warn(
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
... storing 'celltypes' as categorical
... storing 'tissue' as categorical
[11]:
AnnData object with n_obs × n_vars = 51350 × 196618
    obs: 'celltypes', 'tissue', 'dataset'
    obsm: 'X_spectral', 'X_spectral_harmony', 'X_umap'
[12]:
scATAnno_plotting.defaultPlotting_umap()
sc.pl.umap(integrated_adata, color="dataset", palette = ['#228833', '#cccccc'], show=True, title = "Integration")
... storing 'celltypes' as categorical
... storing 'tissue' as categorical
../_images/tutorials_demo_PBMC10K_Multiome_22_1.png

We investigated cell types in the PBMC reference atlas and re-annotated gamma delta T cell cluster as MAIT cell cluster by showing chromatin accessibility of gene signatures in this cluster.

After integration, we want to assign cell types to the query data by taking three steps. First, query cells are assigned labels using K-Nearest neighbors along with KNN-based uncertainty score. Second, query cell labels are corrected using weighted-distance-based uncertainty score. Third, query cells are clustered and annotated at the cluster level.

[13]:
uncertainty_threshold = 0.2
distance_threshold = 90
reference_label_col = "celltypes"
use_rep = "X_spectral_harmony"
atlas = "PBMC"

integrated_adata = sc.read_h5ad(os.path.join(out_dir, "1.Merged_query_reference.h5ad"))
integrated_adata = scATAnno_preprocess.add_variable(os.path.join(out_dir, "X_spectral_harmony.csv"), integrated_adata)
reference = integrated_adata[integrated_adata.obs['dataset'] == "Atlas", :].copy()
query = integrated_adata[integrated_adata.obs['dataset'] != "Atlas",:].copy()
# Perform KNN assignment
query_KNN = scATAnno_assignment.scATAnno_KNN_assign(reference, query, reference_label_col=reference_label_col, low_dim_col=use_rep)
[14]:
import pickle
with open(('reference_palette.pickle'), 'rb') as handle:
    celltype_palette = pickle.load(handle)

scATAnno_plotting.defaultPlotting_umap()
sc.pl.umap(reference, color="celltypes", palette = celltype_palette, title = "PBMC Reference Atlas")
../_images/tutorials_demo_PBMC10K_Multiome_26_0.png
[15]:
# Perform weighted-distance based assignment
query_distance = scATAnno_assignment.scATAnno_distance_assign(reference, query_KNN, reference_label_col=reference_label_col, distance_threshold=distance_threshold, atlas=atlas, uncertainty_threshold=uncertainty_threshold, use_rep = use_rep)
Embedding shape: (1477, 30)
Centroid shape: (30,)
[16]:
# Perform cluster-level assignment
query_annotated = scATAnno_assignment.scATAnno_cluster_assign(query_distance, cluster_col = None, use_rep=use_rep)
query_annotated
[16]:
AnnData object with n_obs × n_vars = 11909 × 196618
    obs: 'celltypes', 'tissue', 'dataset', 'uncertainty_score_step1', 'pred_y', 'uncertainty_score_step2', 'Uncertainty_Score', '1.knn-based_celltype', '2.corrected_celltype', 'leiden', 'cluster_annotation'
    uns: 'neighbors', 'umap', 'leiden'
    obsm: 'X_umap', 'X_spectral_harmony', 'kernel_distance', 'distance', 'indices', 'neighbors_labels'
    obsp: 'distances', 'connectivities'
[17]:
sc.pl.umap(query_annotated, color = "cluster_annotation",palette=celltype_palette, title = "scATAnno Annotation")
... storing 'pred_y' as categorical
... storing '1.knn-based_celltype' as categorical
... storing '2.corrected_celltype' as categorical
... storing 'cluster_annotation' as categorical
../_images/tutorials_demo_PBMC10K_Multiome_29_1.png

We benchmark scATAnno results with seurat annotation. Seurat annotation is stored in “Seurat_RNA_annotation.csv”

[18]:
seurat_df = pd.read_csv(os.path.join("PBMC10k_multiome_vPBMCatlas/","Seurat_RNA_annotation.csv"), index_col=0)
query_annotated = query_annotated[query_annotated.obs.index.isin(seurat_df.index)]
query_annotated.obs = pd.merge(query_annotated.obs, seurat_df, left_index=True, right_index=True)
query_annotated
[18]:
AnnData object with n_obs × n_vars = 10412 × 196618
    obs: 'celltypes', 'tissue', 'dataset', 'uncertainty_score_step1', 'pred_y', 'uncertainty_score_step2', 'Uncertainty_Score', '1.knn-based_celltype', '2.corrected_celltype', 'leiden', 'cluster_annotation', 'seurat_rna_annotations', 'seurat_new_annotation'
    uns: 'neighbors', 'umap', 'leiden', 'cluster_annotation_colors'
    obsm: 'X_umap', 'X_spectral_harmony', 'kernel_distance', 'distance', 'indices', 'neighbors_labels'
    obsp: 'distances', 'connectivities'
[19]:
sc.pl.umap(query_annotated, color = "seurat_new_annotation", title = "Seurat Annotation", palette = default_28)
... storing 'seurat_rna_annotations' as categorical
... storing 'seurat_new_annotation' as categorical
../_images/tutorials_demo_PBMC10K_Multiome_32_1.png

Lastly, we compare the annotation results between scATAnno and Seurat.

[20]:
outcomes2 = []
for i in query_annotated.obs["cluster_annotation"].index:
    if query_annotated.obs.loc[i,"cluster_annotation"] == query_annotated.obs.loc[i,"seurat_new_annotation"]:
        outcomes2.append("Consistent")
    elif query_annotated.obs.loc[i,"cluster_annotation"] == "unknown":
        outcomes2.append("Unknown")
    else: outcomes2.append("Inconsistent")
query_annotated.obs["outcomes2"] = outcomes2
sc.pl.umap(query_annotated, color = "outcomes2", palette= ['#1f77b4','#ff7f0e','#279e68'], title = "Comparison")
... storing 'outcomes2' as categorical
../_images/tutorials_demo_PBMC10K_Multiome_34_1.png
[21]:
outcomes = []
for i in query_annotated.obs["cluster_annotation"].index:
    if query_annotated.obs.loc[i,"cluster_annotation"] == query_annotated.obs.loc[i,"seurat_new_annotation"]:
        outcomes.append("Consistent")
    else: outcomes.append("Inconsistent")
query_annotated.obs["outcomes"] = outcomes

adata_query = query_annotated.copy()
variable_col = "cluster_annotation"
outcome_col = "outcomes"
listdict = []
for ct in adata_query.obs[variable_col].unique():
    # print(len(adata_query.obs[adata_query.obs['tissue'] == ct]))
    tmp_dict = {}
    for pt in np.unique(adata_query.obs[outcome_col]):
        tmp_dict[pt] = (
            np.sum(
                adata_query.obs[adata_query.obs[variable_col] == ct][outcome_col] == pt
            )
            # / len(adata_query.obs[adata_query.obs[variable_col] == ct])
        )
    l = len(adata_query.obs[adata_query.obs[variable_col] == ct])
    tmp_dict['ct'] = f'{ct} ({l})'
    tmp_dict['n_cells'] = len(adata_query.obs[adata_query.obs[variable_col] == ct])
    listdict.append(tmp_dict)


df = pd.DataFrame(listdict).set_index('ct').sort_values(by='ct')
df = df.iloc[0:df.shape[0]-1].sort_values("n_cells", ascending=True)
print(df)
                             Consistent  Inconsistent  n_cells
ct
pDC (89)                             89             0       89
MAIT (123)                          111            12      123
Treg (146)                          114            32      146
cDC (213)                           185            28      213
Naive B (351)                         3           348      351
NK (389)                            383             6      389
Memory B (492)                        7           485      492
Effector memory CD8 T (732)         581           151      732
Memory CD4 T (1185)                1118            67     1185
Naive CD4 T (1315)                 1203           112     1315
Naive CD8 T (1386)                 1320            66     1386
Monocyte (2788)                    2746            42     2788
[22]:
fig, ax = plt.subplots(1, 1, figsize=(2, 6))
df[df.columns[0:len(df.columns)-1]].plot(kind='barh', stacked=True, ax=ax, width=1.0)
ax.set_xticklabels(ax.get_xticklabels(), fontsize=15)
ax.set_xlabel('Cell Numbers')
ax.set_ylabel('')
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., frameon=False)
ax.grid(False)
sns.despine()
plt.show()
/var/folders/dq/79h3g1bj0rjb4wvj5r5c3w1h0000gn/T/ipykernel_60690/2726220942.py:3: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator.
  ax.set_xticklabels(ax.get_xticklabels(), fontsize=15)
../_images/tutorials_demo_PBMC10K_Multiome_36_1.png
[23]:
query_annotated.obs
[23]:
celltypes tissue dataset uncertainty_score_step1 pred_y uncertainty_score_step2 Uncertainty_Score 1.knn-based_celltype 2.corrected_celltype leiden cluster_annotation seurat_rna_annotations seurat_new_annotation outcomes2 outcomes
TCAAGAACAGTAATAG-1_pbmc10k_query pbmc10k_query pbmc10k_query pbmc10k_query 3.666667e-01 Treg 1.0 1.000000e+00 unknown unknown 32 Treg CD4 Naive Naive CD4 T Inconsistent Inconsistent
AACCCGCAGGTAGCTT-1_pbmc10k_query pbmc10k_query pbmc10k_query pbmc10k_query 3.666667e-01 Central memory CD8 T 0.0 3.666667e-01 unknown unknown 10 Effector memory CD8 T gdT Gamma delta T Inconsistent Inconsistent
CGCATATAGGTTACGT-1_pbmc10k_query pbmc10k_query pbmc10k_query pbmc10k_query 1.110223e-16 Memory CD4 T 0.0 1.110223e-16 Memory CD4 T Memory CD4 T 3 Memory CD4 T CD4 TCM Memory CD4 T Consistent Consistent
TATTCGTTCCGCCTCA-1_pbmc10k_query pbmc10k_query pbmc10k_query pbmc10k_query 1.110223e-16 Monocyte 0.0 1.110223e-16 Monocyte Monocyte 0 Monocyte CD14 Mono Monocyte Consistent Consistent
CGGTTTGAGTTTGGTA-1_pbmc10k_query pbmc10k_query pbmc10k_query pbmc10k_query 0.000000e+00 Effector memory CD8 T 0.0 0.000000e+00 Effector memory CD8 T Effector memory CD8 T 8 Effector memory CD8 T gdT Gamma delta T Inconsistent Inconsistent
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
GGTCAAGCACCTACTT-1_pbmc10k_query pbmc10k_query pbmc10k_query pbmc10k_query 0.000000e+00 Monocyte 0.0 0.000000e+00 Monocyte Monocyte 0 Monocyte CD14 Mono Monocyte Consistent Consistent
CTTGCTCAGACAAACG-1_pbmc10k_query pbmc10k_query pbmc10k_query pbmc10k_query 1.110223e-16 Monocyte 1.0 1.000000e+00 Monocyte unknown 15 unknown CD16 Mono Monocyte Unknown Inconsistent
GAAAGCCAGCAGGTGG-1_pbmc10k_query pbmc10k_query pbmc10k_query pbmc10k_query -2.220446e-16 Monocyte 0.0 0.000000e+00 Monocyte Monocyte 25 Monocyte CD14 Mono Monocyte Consistent Consistent
ACGGGAAGTGTTAGCA-1_pbmc10k_query pbmc10k_query pbmc10k_query pbmc10k_query 0.000000e+00 Naive CD8 T 0.0 0.000000e+00 Naive CD8 T Naive CD8 T 6 Naive CD8 T CD8 Naive Naive CD8 T Consistent Consistent
CTATGATCACCATATG-1_pbmc10k_query pbmc10k_query pbmc10k_query pbmc10k_query -2.220446e-16 Monocyte 0.0 0.000000e+00 Monocyte Monocyte 11 Monocyte CD14 Mono Monocyte Consistent Consistent

10412 rows × 15 columns

[24]:
query_annotated.write(os.path.join(out_dir, "query_annotated.h5ad"))
... storing 'outcomes' as categorical

To examine B cell subtypes, we leverage scRNA-profile to show signature genes

[25]:
# Read in scRNA profile
RNA_anndata = sc.read_10x_h5("PBMC10k_multiome_vPBMCatlas/scRNA/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5")
RNA_anndata.obs.index = [ i+ "_pbmc10k_query" for i in RNA_anndata.obs.index]
RNA_anndata = RNA_anndata[RNA_anndata.obs.index.isin(query_annotated.obs.index)]
/opt/anaconda3/envs/scATAnno/lib/python3.11/site-packages/anndata/_core/anndata.py:1820: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
/opt/anaconda3/envs/scATAnno/lib/python3.11/site-packages/anndata/_core/anndata.py:1820: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
[26]:
RNA_anndata
[26]:
View of AnnData object with n_obs × n_vars = 10412 × 36601
    var: 'gene_ids', 'feature_types', 'genome', 'interval'
[27]:
RNA_anndata.obs = pd.merge(RNA_anndata.obs, query_annotated.obs[["cluster_annotation", "seurat_new_annotation"]], left_index=True, right_index=True)
UMAP_df = pd.DataFrame(query_annotated.obsm["X_umap"])
UMAP_df.index = query_annotated.obs.index
UMAP_df.columns = ["UMAP_1", "UMAP_2"]
UMAP_df = UMAP_df.loc[RNA_anndata.obs.index]
RNA_anndata.obsm["X_umap"] = np.array(UMAP_df)
new_categories = np.unique(query_annotated.obs["cluster_annotation"])
RNA_anndata.obs["cluster_annotation"] = RNA_anndata.obs["cluster_annotation"].cat.set_categories(new_categories)
/opt/anaconda3/envs/scATAnno/lib/python3.11/site-packages/anndata/_core/anndata.py:1820: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
/opt/anaconda3/envs/scATAnno/lib/python3.11/site-packages/anndata/_core/anndata.py:1820: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
[28]:
RNA_anndata.obs
[28]:
cluster_annotation seurat_new_annotation
AAACAGCCAAGGAATC-1_pbmc10k_query unknown Naive CD4 T
AAACAGCCAATCCCTT-1_pbmc10k_query Memory CD4 T Memory CD4 T
AAACAGCCAATGCGCT-1_pbmc10k_query Naive CD4 T Naive CD4 T
AAACAGCCACCAACCG-1_pbmc10k_query Naive CD8 T Naive CD8 T
AAACAGCCAGGATAAC-1_pbmc10k_query Naive CD4 T Naive CD4 T
... ... ...
TTTGTTGGTGACATGC-1_pbmc10k_query Naive CD8 T Naive CD8 T
TTTGTTGGTGTTAAAC-1_pbmc10k_query Naive CD8 T Naive CD8 T
TTTGTTGGTTAGGATT-1_pbmc10k_query NK NK
TTTGTTGGTTGGTTAG-1_pbmc10k_query Memory CD4 T Memory CD4 T
TTTGTTGGTTTGCAGA-1_pbmc10k_query Effector memory CD8 T Effector memory CD8 T

10412 rows × 2 columns

[29]:
sc.pp.filter_genes(RNA_anndata, min_cells=3)
sc.pp.normalize_total(RNA_anndata, target_sum=1e4)
sc.pp.log1p(RNA_anndata)
sc.pp.highly_variable_genes(RNA_anndata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pp.scale(RNA_anndata, max_value=10)
RNA_anndata.var_names_make_unique()
/opt/anaconda3/envs/scATAnno/lib/python3.11/site-packages/anndata/_core/anndata.py:1820: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
[30]:
N_col = 2
N_row = 2

markergenes = ["CD27","TNFRSF13B", # memoryB
              "IGHM", "TCL1A", # naive B
              ]
fig, axes=plt.subplots(nrows= N_row , ncols= N_col , figsize=(N_col*2,N_row*2),
                       sharex=True,sharey=True, )
for i in range(len(markergenes) ):
    i_row = int(i/N_col)
    i_col = i%N_col
    ax = axes[i_row, i_col]
    sc.pl.umap(RNA_anndata, color= markergenes[i],
                    title =  markergenes[i],
               # title = "",
               show=False, ax = ax, legend_loc=None,color_map='plasma', colorbar_loc = None, )
    ax.set_xlabel('')
    ax.set_ylabel('')
fig.tight_layout()
plt.show()
... storing 'feature_types' as categorical
... storing 'genome' as categorical
... storing 'interval' as categorical
../_images/tutorials_demo_PBMC10K_Multiome_45_1.png

We have integrated a new module into scATAnno specifically designed to identify cell type-specific peaks. This module utilizes the marker_regions function from SnapATAC2, with an average runtime of just 5 seconds for 5,000 cells.

[31]:
marker_peaks = snap.tl.marker_regions(query_annotated, groupby='cluster_annotation', pvalue=0.001)

Plot cell-type specific peaks identified

[32]:
snap.pl.regions(query_annotated, groupby='cluster_annotation', peaks=marker_peaks, interactive=False)
[32]:
../_images/tutorials_demo_PBMC10K_Multiome_49_0.png

The following outputs cell type specific peaks into separate documents.

[33]:
for i in range(len(marker_peaks)):
    key = list(marker_peaks.keys())[i]
    data = marker_peaks[key]
    print(f"Key: {key}, Data size: {len(data)}")
    df = pd.DataFrame(data, columns=['peak'])
    df.to_csv(f'{key}_peak.csv', index=False)
Key: Effector memory CD8 T, Data size: 244
Key: MAIT, Data size: 241
Key: Memory B, Data size: 412
Key: Memory CD4 T, Data size: 260
Key: Monocyte, Data size: 1401
Key: NK, Data size: 366
Key: Naive B, Data size: 237
Key: Naive CD4 T, Data size: 261
Key: Naive CD8 T, Data size: 316
Key: Treg, Data size: 250
Key: cDC, Data size: 608
Key: pDC, Data size: 2050
Key: unknown, Data size: 921