Multiome Workflow¶
Below is the workflow for SCRIPro’s multi-omics data input, which requires paired/unpaired scRNA-seq and scATAC-seq data as input.
To demonstrate SCRIP’s ability to be applied to Multi-omic data, we applied SCRIP to 10X Multiome-seq lymphoma sequencing data.Data are available on https://www.10xgenomics.com/datasets/fresh-frozen-lymph-node-with-b-cell-lymphoma-14-k-sorted-nuclei-1-standard-2-0 .
Using Shell:¶
Transcription factor enrichment scores can be obtained by SCRIPro using the following shell statement:
scripro enrich_multiome -i ./data/rna/rna.h5ad -n 50 -s hg38 -a matrix -b 0 -f ./data/atac/atac.h5ad -g./gencode.v43.chr_patch_hapl_scaff.annotation.gtf.gz -p multiome -t 12
Using Python for custom analysis:¶
import anndata as ad
import networkx as nx
import scanpy as sc
import scglue
from matplotlib import rcParams
import numpy as np
import scripro
import pandas as pd
import warnings
import seaborn as sns
from itertools import chain
from scipy.spatial.distance import cdist
warnings.filterwarnings("ignore")
Load data¶
Respectively loaded ATAC-seq and RNA-seq data.
scglue.plot.set_publication_params()
rcParams["figure.figsize"] = (4, 4)
rna = sc.read_h5ad("/fs/home/xuyunfan/data/10x/lymph2/rna/rna.h5ad")
rna.var_names_make_unique()
rna.raw = rna
atac = sc.read_h5ad('/fs/home/xuyunfan/data/10x/lymph2/atac/atac.h5ad')
Use GLUE to compute scATAC-matched scRNA barcodes¶
Here, we use GLUE to align the two omics data,here is the normal GLUE workflow:
atac.var_names_make_unique()
rna.layers["counts"] = rna.X.copy()
sc.pp.normalize_total(rna, target_sum=1e4)
sc.pp.log1p(rna)
sc.pp.highly_variable_genes(rna, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pp.scale(rna)
sc.tl.pca(rna, n_comps=100, svd_solver="auto")
sc.pp.neighbors(rna, metric="cosine")
sc.tl.umap(rna)
sc.pl.umap(rna)
scglue.data.lsi(atac, n_components=5, n_iter=15)
sc.pp.neighbors(atac, use_rep="X_lsi", metric="cosine")
sc.tl.umap(atac)
sc.pl.umap(atac)
scglue.data.get_gene_annotation(
rna, gtf="../data/gencode.v43.chr_patch_hapl_scaff.annotation.gtf.gz",
gtf_by="gene_name"
)
rna.var.loc[:, ["chrom", "chromStart", "chromEnd"]]
| chrom | chromStart | chromEnd | |
|---|---|---|---|
| MIR1302-2HG | chr1 | 29553.0 | 31109.0 |
| FAM138A | chr1 | 34553.0 | 36081.0 |
| OR4F5 | chr1 | 65418.0 | 71585.0 |
| AL627309.1 | NaN | NaN | NaN |
| AL627309.3 | NaN | NaN | NaN |
| ... | ... | ... | ... |
| AC141272.1 | NaN | NaN | NaN |
| AC023491.2 | NaN | NaN | NaN |
| AC007325.1 | NaN | NaN | NaN |
| AC007325.4 | NaN | NaN | NaN |
| AC007325.2 | NaN | NaN | NaN |
36621 rows × 3 columns
genes_to_remove = rna.var[~(rna.var.loc[:,"chromStart"]>0)].index
rna = rna[:, ~rna.var.index.isin(genes_to_remove)]
atac.var_names[:5]
Index(['chr1_9795_10696', 'chr1_17061_17939', 'chr1_180997_181703',
'chr1_183968_184757', 'chr1_186502_187406'],
dtype='object')
split = atac.var_names.str.split(r"[_]")
atac.var["chrom"] = split.map(lambda x: x[0])
atac.var["chromStart"] = split.map(lambda x: x[1]).astype(int)
atac.var["chromEnd"] = split.map(lambda x: x[2]).astype(int)
atac.var.head()
| chrom | chromStart | chromEnd | |
|---|---|---|---|
| chr1_9795_10696 | chr1 | 9795 | 10696 |
| chr1_17061_17939 | chr1 | 17061 | 17939 |
| chr1_180997_181703 | chr1 | 180997 | 181703 |
| chr1_183968_184757 | chr1 | 183968 | 184757 |
| chr1_186502_187406 | chr1 | 186502 | 187406 |
rna.var
| highly_variable | means | dispersions | dispersions_norm | mean | std | chrom | chromStart | chromEnd | name | ... | itemRgb | blockCount | blockSizes | blockStarts | gene_id | gene_type | tag | hgnc_id | havana_gene | artif_dupl | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MIR1302-2HG | False | 1.000000e-12 | NaN | 0.000000 | 0.000000 | 1.000000 | chr1 | 29553.0 | 31109.0 | MIR1302-2HG | ... | . | . | . | . | ENSG00000243485.5 | lncRNA | ncRNA_host | HGNC:52482 | OTTHUMG00000000959.2 | NaN |
| FAM138A | False | 1.000000e-12 | NaN | 0.000000 | 0.000000 | 1.000000 | chr1 | 34553.0 | 36081.0 | FAM138A | ... | . | . | . | . | ENSG00000237613.2 | lncRNA | NaN | HGNC:32334 | OTTHUMG00000000960.1 | NaN |
| OR4F5 | False | 5.497313e-03 | 1.040101 | -1.097506 | 0.002889 | 0.056353 | chr1 | 65418.0 | 71585.0 | OR4F5 | ... | . | . | . | . | ENSG00000186092.7 | protein_coding | NaN | HGNC:14825 | OTTHUMG00000001094.4 | NaN |
| OR4F29 | False | 1.000000e-12 | NaN | 0.000000 | 0.000000 | 1.000000 | chr1 | 450739.0 | 451678.0 | OR4F29 | ... | . | . | . | . | ENSG00000284733.2 | protein_coding | NaN | HGNC:31275 | OTTHUMG00000002860.3 | NaN |
| OR4F16 | False | 1.000000e-12 | NaN | 0.000000 | 0.000000 | 1.000000 | chr1 | 685715.0 | 686654.0 | OR4F16 | ... | . | . | . | . | ENSG00000284662.2 | protein_coding | NaN | HGNC:15079 | OTTHUMG00000002581.3 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| MT-ND4 | True | 2.037123e+00 | 4.331923 | 3.915107 | 0.950185 | 1.276617 | chrM | 10759.0 | 12137.0 | MT-ND4 | ... | . | . | . | . | ENSG00000198886.2 | protein_coding | NaN | HGNC:7459 | NaN | NaN |
| MT-ND5 | True | 6.776105e-01 | 3.704260 | 5.513758 | 0.219016 | 0.636512 | chrM | 12336.0 | 14148.0 | MT-ND5 | ... | . | . | . | . | ENSG00000198786.2 | protein_coding | NaN | HGNC:7461 | NaN | NaN |
| MT-ND6 | True | 2.098734e-01 | 3.157219 | 1.393144 | 0.062790 | 0.337936 | chrM | 14148.0 | 14673.0 | MT-ND6 | ... | . | . | . | . | ENSG00000198695.2 | protein_coding | NaN | HGNC:7462 | NaN | NaN |
| MT-CYB | True | 1.438881e+00 | 4.125400 | 4.742718 | 0.554779 | 1.015753 | chrM | 14746.0 | 15887.0 | MT-CYB | ... | . | . | . | . | ENSG00000198727.2 | protein_coding | NaN | HGNC:7427 | NaN | NaN |
| MAFIP | False | 2.067433e-02 | 1.815702 | -0.185062 | 0.008163 | 0.110646 | GL000194.1 | 53593.0 | 115055.0 | MAFIP | ... | . | . | . | . | ENSG00000274847.1 | protein_coding | NaN | HGNC:31102 | NaN | NaN |
23469 rows × 24 columns
guidance = scglue.genomics.rna_anchored_guidance_graph(rna, atac)
guidance
scglue.graph.check_graph(guidance, [rna, atac])
scglue.models.configure_dataset(
rna, "NB", use_highly_variable=True,
use_layer="counts", use_rep="X_pca"
)
scglue.models.configure_dataset(
atac, "NB", use_highly_variable=True,
use_rep="X_lsi"
)
guidance_hvf = guidance.subgraph(chain(
rna.var.query("highly_variable").index,
atac.var.query("highly_variable").index
)).copy()
glue = scglue.models.fit_SCGLUE(
{"rna": rna, "atac": atac}, guidance_hvf,
fit_kws={"directory": "glue"}
)
dx = scglue.models.integration_consistency(
glue, {"rna": rna, "atac": atac}, guidance_hvf
)
dx
| n_meta | consistency | |
|---|---|---|
| 0 | 10 | 0.320081 |
| 1 | 20 | 0.281343 |
| 2 | 50 | 0.216881 |
| 3 | 100 | 0.169161 |
| 4 | 200 | 0.136142 |
_ = sns.lineplot(x="n_meta", y="consistency", data=dx).axhline(y=0.05, c="darkred", ls="--")
rna.obsm["X_glue"] = glue.encode_data("rna", rna)
atac.obsm["X_glue"] = glue.encode_data("atac", atac)
rna
AnnData object with n_obs × n_vars = 14566 × 23469
obs: 'balancing_weight'
var: 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std', 'chrom', 'chromStart', 'chromEnd', 'name', 'score', 'strand', 'thickStart', 'thickEnd', 'itemRgb', 'blockCount', 'blockSizes', 'blockStarts', 'gene_id', 'gene_type', 'tag', 'hgnc_id', 'havana_gene', 'artif_dupl'
uns: 'log1p', 'hvg', 'pca', 'neighbors', 'umap', '__scglue__'
obsm: 'X_pca', 'X_umap', 'X_glue'
varm: 'PCs'
layers: 'counts'
obsp: 'distances', 'connectivities'
atac
AnnData object with n_obs × n_vars = 14566 × 109789
obs: 'balancing_weight'
var: 'chrom', 'chromStart', 'chromEnd', 'highly_variable'
uns: 'neighbors', 'umap', '__scglue__'
obsm: 'X_lsi', 'X_umap', 'X_glue'
obsp: 'distances', 'connectivities'
rna.obs['feature']='rna'
atac.obs['feature']='atac'
combined = ad.concat([rna, atac])
combined
AnnData object with n_obs × n_vars = 29132 × 0
obs: 'balancing_weight', 'feature'
obsm: 'X_umap', 'X_glue'
sc.pp.neighbors(combined, use_rep="X_glue", metric="cosine")
sc.tl.umap(combined)
sc.pl.umap(combined)
sc.tl.leiden(combined,resolution=0.8)
sc.pl.umap(combined,color='leiden')
combined
AnnData object with n_obs × n_vars = 29132 × 0
obs: 'balancing_weight', 'feature', 'leiden'
uns: 'neighbors', 'umap', 'leiden', 'leiden_colors'
obsm: 'X_umap', 'X_glue'
obsp: 'distances', 'connectivities'
combined_rna = combined[combined.obs.feature == 'rna']
combined_rna
View of AnnData object with n_obs × n_vars = 14566 × 0
obs: 'balancing_weight', 'feature', 'leiden'
uns: 'neighbors', 'umap', 'leiden', 'leiden_colors'
obsm: 'X_umap', 'X_glue'
obsp: 'distances', 'connectivities'
combined_rna.obs
| balancing_weight | feature | leiden | |
|---|---|---|---|
| AAACAGCCAGAACCGA-1 | 0.766128 | rna | 0 |
| AAACAGCCAGCCTAAC-1 | 3.160764 | rna | 1 |
| AAACAGCCATATTGAC-1 | 1.073027 | rna | 4 |
| AAACATGCAAATTCGT-1 | 1.073027 | rna | 4 |
| AAACATGCAACCTAAT-1 | 0.660823 | rna | 9 |
| ... | ... | ... | ... |
| TTTGTTGGTATGGTGC-1 | 0.432174 | rna | 12 |
| TTTGTTGGTCAATACG-1 | 1.097787 | rna | 3 |
| TTTGTTGGTCAGGAAG-1 | 0.687277 | rna | 1 |
| TTTGTTGGTTCAAGAT-1 | 1.884496 | rna | 6 |
| TTTGTTGGTTTACTTG-1 | 0.687277 | rna | 1 |
14566 rows × 3 columns
combined_rna.obs.loc[:,'new_leiden'] = np.nan
scripro.glue_metacell(combined_rna,50)
rna_leiden_clusters = combined_rna.obs['new_leiden']
rna_leiden_clusters
AAACAGCCAGAACCGA-1 0_1
AAACAGCCAGCCTAAC-1 1_4
AAACAGCCATATTGAC-1 4_0
AAACATGCAAATTCGT-1 4_0
AAACATGCAACCTAAT-1 9_0
...
TTTGTTGGTATGGTGC-1 12_0
TTTGTTGGTCAATACG-1 3_6
TTTGTTGGTCAGGAAG-1 1_25
TTTGTTGGTTCAAGAT-1 6_0
TTTGTTGGTTTACTTG-1 1_29
Name: new_leiden, Length: 14566, dtype: object
The RNA-Seq and ATAC-seq omics data are combined to generate a new dataset Combined, then divide metacell using the RNA-seq data region, and assign the corresponding metacell to the corresponding ATAC-seq data.
combined_atac = combined[combined.obs.feature == 'atac']
distance_matrix = cdist(combined_atac.obsm['X_umap'], combined_rna.obsm['X_umap'], metric='euclidean')
nearest_rna = np.argmin(distance_matrix, axis=1)
nearest_rna
array([ 1836, 9072, 1954, ..., 13302, 8738, 12567])
atac_leiden_clusters = rna_leiden_clusters[nearest_rna]
atac_leiden_clusters.index = combined_atac.obs.index
rna.obs = combined_rna.obs
cellgroup = pd.DataFrame(atac_leiden_clusters)
cellgroup
| new_leiden | |
|---|---|
| AAACAGCCAGAACCGA-1 | 0_4 |
| AAACAGCCAGCCTAAC-1 | 8_0 |
| AAACAGCCATATTGAC-1 | 15_0 |
| AAACATGCAAATTCGT-1 | 4_1 |
| AAACATGCAACCTAAT-1 | 9_0 |
| ... | ... |
| TTTGTTGGTATGGTGC-1 | 12_0 |
| TTTGTTGGTCAATACG-1 | 1_17 |
| TTTGTTGGTCAGGAAG-1 | 18_0 |
| TTTGTTGGTTCAAGAT-1 | 6_0 |
| TTTGTTGGTTTACTTG-1 | 1_19 |
14566 rows × 1 columns
Calculate metacell and markergene¶
test_data = scripro.Ori_Data(rna,Cell_num=50,use_glue = True)
test_data.get_glue_cluster(rna_leiden_clusters)
test_data.get_positive_marker_gene_parallel()
The data from ATAC-seq is used to generate the corresponding chromatin landscape, that is the bigwig file corresponding to metacell of the same name, which is stored in the folder ‘./bigwig’.
scripro.dataframe_to_sparse_tsv(atac.to_df(), 'test.tsv')
scripro.get_metacell_fragment(cellgroup,'.','./test.tsv',chunksize = 10000000)
scripro.process_tsv('./metacell_fragment/', 'hg38')
share_seq_data = scripro.SCRIPro_Multiome(8,'hg38',test_data)
Calculate the TF activity score¶
share_seq_data.cal_ISD_parallel('./bigwig/')
share_seq_data.get_tf_score()
sns.clustermap(share_seq_data.tf_score)