Persad et al & Littman et al Comparison

To test SCRIPro’s ability to explore the intensity of transcription factor regulation, we applied it to a CRISPRa activated T cell single-cell sequencing data set to calculate AUROC and AUPRC using the activated Label per cell as the gold standard,and this data can be downloaded from http://www.perturbase.cn/download (PRJNA787633).

In this notebook, we use the cell clustering results obtained by Persad et al to replace the Littman et al clustering results and test Persad et al ability to extract heterogeneous cells.

Using Shell:

Transcription factor enrichment scores can be obtained by SCRIPro using the following shell statement:

SCRIPro enrich -i ./data/raw.h5ad -n 50 -s hs -p rna_workflow -t 32

Using Python for custom analysis:

import numpy as np
import pandas as pd
import scanpy as sc
import h5py
import scripro
import anndata
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
import scanpy as sc
import warnings
warnings.filterwarnings("ignore")
from sklearn.metrics import roc_curve, auc

Load and process data

Load data and raw data. Since the processed data downloaded does not contain raw data, we download raw data for re-processing

rna = sc.read_h5ad('./data/perturb_data.h5ad')
rna_raw = sc.read_h5ad('./data/raw.h5ad')
rna_raw=rna_raw[rna.obs.index]
sc.pp.normalize_total(rna_raw, target_sum=1e4)
sc.pp.log1p(rna_raw)
rna.raw = rna_raw
test_data = scripro.Ori_Data(rna,Cell_num=50)
rna.obs
gene n_genes n_genes_by_counts total_counts total_counts_mt pct_counts_mt leiden mixscape_class_p_ko mixscape_class mixscape_class_global pertclass hdbscan
Cell_barcodes
TAACCAGAGTAGAATC-8 TRIM21 3467 3467 10422.0 755.0 7.244291 26 1.0 TRIM21 KO KO strong 9
CATAGACCAACACGAG-8 CBY1 2003 2003 4621.0 392.0 8.483012 24 1.0 CBY1 KO KO strong 10
CTGTGAATCCGGTAAT-2 LAT2 4344 4344 16784.0 1412.0 8.412774 9 1.0 LAT2 KO KO strong 1
GAGCTGCAGGTAGATT-8 RELA 2361 2360 6086.0 380.0 6.243838 16 1.0 RELA KO KO strong 4
AAGTACCCAACTTCTT-3 WT1 2198 2198 5469.0 545.0 9.965259 12 1.0 WT1 KO KO strong 0
... ... ... ... ... ... ... ... ... ... ... ... ...
GGCTGTGAGGGCTAAC-5 APOL2 2503 2503 6126.0 553.0 9.027098 30 1.0 APOL2 KO KO strong 20
ATATCCTCATCATTTC-8 TNFRSF1B 4380 4379 14271.0 1052.0 7.371593 11 1.0 TNFRSF1B KO KO strong 7
CTAGGTAGTTGAGGAC-1 CD27 2385 2385 6731.0 336.0 4.991829 19 1.0 CD27 KO KO strong 17
TGGGAAGGTGAGTTTC-6 CTRL 2988 2988 7500.0 541.0 7.213333 3 0.0 CTRL CTRL CTRL 30
ATGATCGGTATCGTTG-7 RELA 2739 2739 7606.0 311.0 4.088877 16 1.0 RELA KO KO strong 4

16707 rows × 12 columns

replace Littman et al data with Persad et al data

Load the Persad et al data calculated by Persad et al, then replace Littman et al data with Persad et al data (new_leiden column)

rawmeta = pd.read_csv('./rawmeta.csv')
rawmeta
Cell_barcodes gene excluded_umis metacell dissolved metacell_level cells_rare_gene_module rare_cell metacell_name
0 TAACCAGAGTAGAATC-8 TRIM21 849.0 110 False 1 -1 False M110.11
1 CATAGACCAACACGAG-8 CBY1 459.0 91 False 1 -1 False M91.30
2 CTGTGAATCCGGTAAT-2 LAT2 1571.0 484 False 2 -1 False M484.07
3 GAGCTGCAGGTAGATT-8 RELA 408.0 131 False 1 -1 False M131.58
4 AAGTACCCAACTTCTT-3 WT1 779.0 412 False 1 -1 False M412.96
... ... ... ... ... ... ... ... ... ...
16702 GGCTGTGAGGGCTAAC-5 APOL2 712.0 219 False 1 -1 False M219.04
16703 ATATCCTCATCATTTC-8 TNFRSF1B 1107.0 5 False 1 -1 False M5.31
16704 CTAGGTAGTTGAGGAC-1 CD27 445.0 235 False 1 -1 False M235.05
16705 TGGGAAGGTGAGTTTC-6 CTRL 650.0 100 False 1 -1 False M100.73
16706 ATGATCGGTATCGTTG-7 RELA 332.0 470 False 2 -1 False M470.82

16707 rows × 9 columns

test_data.adata.obs.new_leiden=list(rawmeta.metacell)
test_data.adata.obs
gene n_genes n_genes_by_counts total_counts total_counts_mt pct_counts_mt leiden mixscape_class_p_ko mixscape_class mixscape_class_global pertclass hdbscan new_leiden
Cell_barcodes
TAACCAGAGTAGAATC-8 TRIM21 3467 3467 10422.0 755.0 7.244291 26 1.0 TRIM21 KO KO strong 9 110
CATAGACCAACACGAG-8 CBY1 2003 2003 4621.0 392.0 8.483012 24 1.0 CBY1 KO KO strong 10 91
CTGTGAATCCGGTAAT-2 LAT2 4344 4344 16784.0 1412.0 8.412774 9 1.0 LAT2 KO KO strong 1 484
GAGCTGCAGGTAGATT-8 RELA 2361 2360 6086.0 380.0 6.243838 16 1.0 RELA KO KO strong 4 131
AAGTACCCAACTTCTT-3 WT1 2198 2198 5469.0 545.0 9.965259 12 1.0 WT1 KO KO strong 0 412
... ... ... ... ... ... ... ... ... ... ... ... ... ...
GGCTGTGAGGGCTAAC-5 APOL2 2503 2503 6126.0 553.0 9.027098 30 1.0 APOL2 KO KO strong 20 219
ATATCCTCATCATTTC-8 TNFRSF1B 4380 4379 14271.0 1052.0 7.371593 11 1.0 TNFRSF1B KO KO strong 7 5
CTAGGTAGTTGAGGAC-1 CD27 2385 2385 6731.0 336.0 4.991829 19 1.0 CD27 KO KO strong 17 235
TGGGAAGGTGAGTTTC-6 CTRL 2988 2988 7500.0 541.0 7.213333 3 0.0 CTRL CTRL CTRL 30 100
ATGATCGGTATCGTTG-7 RELA 2739 2739 7606.0 311.0 4.088877 16 1.0 RELA KO KO strong 4 470

16707 rows × 13 columns

test_data.adata.obs['new_leiden'] = test_data.adata.obs['new_leiden'].astype(str)
test_data.get_positive_marker_gene_parallel()
rna_seq_data = scripro.SCRIPro_RNA(12,'hg38',test_data,assays=['Direct','DNase','H3K27ac'])

Calculating ISD

rna_seq_data.cal_ISD_cistrome()
rna_seq_data.P_value_matrix
factor NELFA SUPT5H POLR2A TAF1 E2F1 MYC JMJD6 TFDP1 PHF8 BRD4 ... ESCO2 SOX8 WWTR1 ELF5 ZIC3 SOX6 HOXA1 TOP1 FOXE3 ETV2
54 1.000000 0.879108 0.790975 0.787378 0.787086 0.759230 0.759144 0.744869 0.735694 0.734767 ... 3.128389e-11 2.321808e-11 1.636090e-11 9.684857e-12 6.359623e-12 4.229628e-12 3.758939e-12 9.764622e-13 7.729002e-13 0.000000e+00
568 0.596106 0.658273 0.878588 0.637187 0.542015 0.772881 0.482539 0.328786 0.485381 0.840747 ... 5.529090e-02 2.201403e-02 6.034951e-02 1.250646e-01 9.094418e-02 2.176096e-05 1.033788e-01 1.759315e-02 8.293918e-02 1.189246e-01
171 0.956295 0.948642 0.807200 0.779516 0.645101 1.000000 0.747362 0.417036 0.685183 0.772588 ... 0.000000e+00 9.707026e-07 4.524251e-08 1.419998e-10 2.228434e-07 2.616552e-10 5.366166e-07 0.000000e+00 1.553591e-07 1.790557e-10
106 1.000000 0.938330 0.849668 0.820421 0.634343 0.915763 0.775252 0.401171 0.737218 0.841678 ... 0.000000e+00 1.369131e-06 2.459007e-08 6.081526e-07 2.266960e-05 6.717899e-10 2.948945e-04 1.449065e-13 1.059525e-07 4.508395e-09
79 0.037224 0.413154 1.000000 0.553018 0.133872 0.551101 0.076116 0.042632 0.125038 0.670272 ... 6.291131e-04 1.136219e-01 7.055724e-02 1.315009e-01 5.778941e-02 7.625599e-02 1.515579e-01 1.035683e-02 1.189053e-01 2.294130e-01
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
337 0.628819 0.884519 0.881672 1.000000 1.000000 0.738973 0.544195 0.935860 0.466437 0.784920 ... 2.697677e-06 5.354145e-03 1.452867e-02 1.063817e-03 2.141516e-05 5.709512e-02 2.659353e-02 3.680871e-05 5.829138e-02 4.284606e-02
222 0.912849 0.800940 0.908895 0.974336 0.590447 0.717522 0.606849 0.280193 0.566149 0.913928 ... 2.674771e-03 1.304035e-02 1.189110e-01 6.780617e-02 3.949164e-03 5.605106e-02 3.067985e-02 5.394981e-02 1.673135e-01 3.413787e-02
304 0.839100 0.804796 0.861108 0.937081 1.000000 0.718710 0.672186 0.722735 0.624386 0.913212 ... 2.928011e-02 1.334770e-03 2.150550e-02 1.245458e-02 1.675397e-04 7.373146e-03 6.442231e-03 1.534435e-02 5.141218e-02 1.081799e-02
375 0.145227 0.599807 1.000000 0.570559 0.201262 0.653092 0.182292 0.047675 0.259976 0.752675 ... 5.605888e-04 3.816154e-02 3.498265e-02 1.408408e-01 2.847721e-02 7.229822e-02 9.771098e-02 4.832158e-03 5.258527e-02 2.155378e-01
220 0.991618 0.938332 0.838142 0.710450 0.641620 1.000000 0.719913 0.371703 0.608496 0.759212 ... 1.202406e-14 3.684495e-06 4.245327e-07 5.762894e-07 4.323174e-08 3.048350e-11 7.881787e-09 6.787328e-10 6.295932e-06 1.545003e-10

592 rows × 1252 columns

rna_seq_data.get_tf_score()
tem_exp = rna_raw.to_df().merge(test_data.adata.obs.loc[:,'new_leiden'],left_index=True,right_index=True)
grouped = tem_exp.groupby('new_leiden').mean()
grouped
MIR1302-2HG FAM138A OR4F5 AL627309.1 AL627309.3 AL627309.2 AL627309.5 AL627309.4 AP006222.2 AL732372.1 ... TNFRSF9-1 TNFRSF9-2 TRAF3IP2-1 TRAF3IP2-2 TRIM21-1 TRIM21-2 VAV1-1 VAV1-2 WT1-1 WT1-2
new_leiden
0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.000000 0.000000 0.0 0.0 ... 0.022986 0.020883 0.141023 0.019764 0.070329 0.045294 0.000000 0.000000 5.971172 0.025028
1 0.0 0.0 0.0 0.000000 0.000000 0.0 0.028059 0.000000 0.0 0.0 ... 0.307047 0.000000 0.175104 0.078949 0.000000 0.000000 0.000000 0.031357 3.475705 0.000000
2 0.0 0.0 0.0 0.029139 0.000000 0.0 0.061611 0.029139 0.0 0.0 ... 0.000000 0.000000 0.145290 0.000000 0.025556 0.000000 0.000000 0.000000 0.082464 0.000000
3 0.0 0.0 0.0 0.000000 0.000000 0.0 0.030975 0.000000 0.0 0.0 ... 0.025993 0.000000 0.171809 0.099170 0.072312 0.000000 0.039627 0.000000 0.352380 0.000000
4 0.0 0.0 0.0 0.000000 0.000000 0.0 0.035995 0.000000 0.0 0.0 ... 0.032857 0.000000 0.288488 0.000000 0.043905 0.044607 0.000000 0.035995 0.072219 0.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
587 0.0 0.0 0.0 0.000000 0.037087 0.0 0.000000 0.000000 0.0 0.0 ... 0.052179 0.000000 0.828584 0.035307 0.000000 0.000000 0.000000 0.000000 0.094569 0.000000
588 0.0 0.0 0.0 0.035962 0.000000 0.0 0.000000 0.000000 0.0 0.0 ... 0.039119 0.000000 1.642812 0.000000 0.000000 0.074946 0.000000 0.000000 0.000000 0.000000
589 0.0 0.0 0.0 0.000000 0.000000 0.0 0.000000 0.000000 0.0 0.0 ... 0.141428 0.029187 0.287361 0.202327 0.059597 0.000000 0.000000 0.024091 0.094148 0.000000
590 0.0 0.0 0.0 0.000000 0.000000 0.0 0.000000 0.000000 0.0 0.0 ... 0.031027 0.000000 0.155326 0.051923 0.000000 0.105978 0.000000 0.000000 0.032125 0.000000
-1 0.0 0.0 0.0 0.003845 0.000000 0.0 0.009864 0.000000 0.0 0.0 ... 0.070795 0.030611 0.236199 0.057267 0.053324 0.301895 0.023147 0.022444 0.200147 0.008683

592 rows × 36755 columns

rna_seq_data.Ori_Data.ad_all = grouped
rna_seq_data.Ori_Data.super_gene_exp = grouped
super_gene_exp = rna_seq_data.Ori_Data.super_gene_exp
super_gene_mean = rna_seq_data.Ori_Data.super_gene_mean
super_gene_std = rna_seq_data.Ori_Data.super_gene_std
rna_seq_data.Ori_Data.super_gene_mean = rna_seq_data.Ori_Data.super_gene_exp.mean()
rna_seq_data.Ori_Data.super_gene_std = rna_seq_data.Ori_Data.super_gene_exp.std()
rna_seq_data.P_value_matrix
ADNP AFF1 AFF4 AGO1 AHR AIRE ALX1 ALX3 ALX4 ANHX ... ZSCAN22 ZSCAN23 ZSCAN29 ZSCAN30 ZSCAN31 ZSCAN4 ZSCAN5A ZSCAN5C ZXDB ZXDC
row
-1 8.563566e-04 0.360750 0.559371 0.182508 0.018380 0.000006 0.000029 1.898788e-05 0.000016 0.000330 ... 0.305053 0.001561 0.114077 0.000631 1.361497e-03 3.585991e-06 0.006103 2.655798e-05 0.261396 1.215443e-01
0 3.446237e-12 0.497215 0.500529 0.163227 0.000100 0.000003 0.000129 1.646497e-07 0.000002 0.000044 ... 0.349755 0.002214 0.028432 0.000275 1.026750e-07 3.986286e-06 0.034384 1.379630e-05 0.219834 1.525143e-12
1 7.096883e-02 0.480607 0.591523 0.193952 0.028921 0.051201 0.000002 4.352112e-04 0.000222 0.000107 ... 0.309516 0.000656 0.068821 0.001493 5.858211e-06 8.907871e-07 0.008626 1.288200e-08 0.212142 2.165690e-01
10 5.333758e-02 0.544322 0.600351 0.262902 0.029763 0.057495 0.056367 7.002899e-04 0.000222 0.029509 ... 0.309205 0.004910 0.059540 0.001900 4.214787e-02 3.509326e-07 0.021020 1.045370e-04 0.176866 2.342755e-01
100 7.107864e-02 0.399857 0.764365 0.176755 0.121364 0.063864 0.032153 3.981877e-02 0.026263 0.010673 ... 0.298497 0.011864 0.116632 0.019865 7.758212e-02 9.608021e-03 0.008497 1.111107e-04 0.092073 1.953459e-01
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
95 1.291925e-01 0.487603 0.791613 0.203145 0.080655 0.063231 0.036989 3.843538e-02 0.045159 0.022616 ... 0.353437 0.004247 0.114041 0.020601 2.575347e-02 5.994760e-05 0.023438 1.904101e-02 0.140363 9.592408e-02
96 1.031648e-01 0.318737 0.647392 0.119693 0.108568 0.019088 0.034342 2.822314e-02 0.049918 0.056430 ... 0.194852 0.011646 0.117273 0.023761 2.218386e-02 3.685638e-03 0.013456 6.609630e-03 0.204090 1.425852e-01
97 1.062549e-01 0.340830 0.789539 0.114368 0.136252 0.097897 0.059248 8.784248e-02 0.156568 0.042775 ... 0.191674 0.002716 0.095484 0.016832 5.122700e-02 8.143724e-02 0.000837 2.209640e-02 0.089183 1.023146e-01
98 6.693196e-02 0.558301 0.756792 0.281248 0.047730 0.104862 0.089017 1.252955e-01 0.083650 0.035471 ... 0.302286 0.002517 0.147804 0.007596 2.793163e-02 4.735611e-03 0.019666 2.356419e-02 0.289258 1.428439e-01
99 6.862545e-02 0.496669 0.600869 0.187158 0.096891 0.030484 0.006576 2.917082e-03 0.001573 0.015501 ... 0.390762 0.018814 0.083292 0.019330 3.054598e-02 4.523016e-06 0.027676 5.116583e-04 0.212204 2.528735e-01

592 rows × 1226 columns

rna_seq_data.tf_score
ADNP AFF1 AFF4 AGO1 AHR AIRE ALX1 ALX3 ALX4 ANHX ... ZSCAN22 ZSCAN23 ZSCAN29 ZSCAN30 ZSCAN31 ZSCAN4 ZSCAN5A ZSCAN5C ZXDB ZXDC
row
-1 3.672389e-04 0.178091 0.239622 0.070631 0.007116 1.357711e-06 0.0 2.845156e-06 2.377675e-06 0.0 ... 0.063907 0.000153 0.028755 0.000139 3.560237e-04 0.0 0.001316 0.0 0.052378 4.111837e-02
0 2.414882e-12 0.261679 0.205148 0.078429 0.000027 5.387451e-07 0.0 2.461573e-08 2.656785e-07 0.0 ... 0.126825 0.000271 0.014292 0.000077 1.620604e-08 0.0 0.003788 0.0 0.063796 1.009363e-12
1 5.021123e-02 0.086393 0.228981 0.160712 0.007152 8.023564e-03 0.0 7.339054e-05 3.607554e-05 0.0 ... 0.040119 0.000085 0.005075 0.000094 9.613452e-07 0.0 0.000946 0.0 0.085753 6.809414e-02
10 3.808361e-02 0.175952 0.306938 0.168117 0.009945 1.062668e-02 0.0 1.275641e-04 4.347075e-05 0.0 ... 0.037240 0.004910 0.033682 0.000281 6.428773e-03 0.0 0.002387 0.0 0.081476 8.746815e-02
100 3.513059e-02 0.293045 0.449179 0.129605 0.061245 8.614222e-03 0.0 5.562580e-03 3.426428e-03 0.0 ... 0.019991 0.001214 0.053916 0.005434 9.358260e-03 0.0 0.000642 0.0 0.008968 1.053210e-01
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
95 1.177092e-01 0.209708 0.538176 0.134788 0.020214 1.010271e-02 0.0 6.525869e-03 7.493207e-03 0.0 ... 0.028318 0.000345 0.100466 0.007886 3.624425e-03 0.0 0.001972 0.0 0.081238 4.317646e-02
96 2.874359e-02 0.148875 0.243910 0.046832 0.029400 2.040194e-03 0.0 2.741001e-03 4.072542e-03 0.0 ... 0.058285 0.000712 0.037024 0.010316 2.618379e-03 0.0 0.000991 0.0 0.020238 5.766247e-02
97 1.906883e-02 0.174282 0.399717 0.026488 0.035957 1.174490e-02 0.0 1.196944e-02 1.987612e-02 0.0 ... 0.010206 0.000062 0.016045 0.006168 6.485161e-03 0.0 0.000341 0.0 0.004761 2.665132e-03
98 1.754907e-02 0.471482 0.416241 0.115289 0.022126 1.411043e-02 0.0 1.252955e-01 1.405428e-02 0.0 ... 0.021562 0.000198 0.069066 0.005385 4.197360e-03 0.0 0.010347 0.0 0.024765 8.091507e-02
99 2.922981e-02 0.146265 0.232052 0.074721 0.056472 3.580166e-03 0.0 4.096966e-04 2.107131e-04 0.0 ... 0.029510 0.001427 0.026408 0.005165 3.396306e-03 0.0 0.001820 0.0 0.108117 1.003711e-01

592 rows × 1226 columns

Calculate the AUPRC and AUROC

scripro_score = test_data.adata.obs.merge(rna_seq_data.tf_score,left_on='new_leiden',right_index=True).iloc[:,13:]
commontf = set(test_data.adata.obs['gene']).intersection(set(scripro_score.columns))
scripro_auroc_dic = {}
for k in commontf:
    y_true = []
    for i in scripro_score.index:
        if test_data.adata.obs.loc[i,'gene'] == k:
            y_true.append(1)
        else:
            y_true.append(0)
    y_scores = list(scripro_score.loc[:,k])
    fpr, tpr, thresholds = roc_curve(y_true, y_scores)
    roc_auc = auc(fpr, tpr)
    scripro_auroc_dic[k]=roc_auc

scripro_auroc_score = pd.DataFrame([scripro_auroc_dic]).T.sort_values(ascending = False,by = 0)
scripro_auroc_score.columns = ['auroc']
scripro_auroc_score
auroc
EOMES 0.993938
GATA3 0.951226
RELA 0.941175
FOXD2 0.916245
PRDM1 0.915120
TBX21 0.863753
LHX4 0.818389
FOXQ1 0.743893
LHX6 0.735591
WT1 0.729839
JMJD1C 0.691111
ALX4 0.662308
NOTCH1 0.583613
IKZF3 0.564741
FOSB 0.413174
import pandas as pd
from sklearn.metrics import precision_recall_curve, auc

scripro_auprc_dic = {}
for k in commontf:
    y_true = []
    for i in scripro_score.index:
        if test_data.adata.obs.loc[i, 'gene'] == k:
            y_true.append(1)
        else:
            y_true.append(0)
    y_scores = list(scripro_score.loc[:,k])
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
    auprc = auc(recall, precision)
    scripro_auprc_dic[k] = auprc
scripro_auprc_score = pd.DataFrame([scripro_auprc_dic]).T.sort_values(ascending=False, by=0)
scripro_auprc_score.columns = ['auprc']
scripro_auprc_score
auprc
EOMES 0.871598
RELA 0.760414
GATA3 0.731483
WT1 0.598684
LHX6 0.495087
FOXD2 0.280925
TBX21 0.097302
ALX4 0.061300
FOXQ1 0.003539
JMJD1C 0.000962
PRDM1 0.000350
FOSB 0.000324
NOTCH1 0.000202
LHX4 0.000163
IKZF3 0.000128