{ "cells": [
{ "cell_type": "markdown", "metadata": {}, "source": [ "We need two things to generate coexpression proxies for a new species: single-cell data and an orthology relationship. This template helps generate a coexpression network, but the orthology relationship will need to come from another existing tool. " ] },
{ "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [
"import pandas as pd\n",
"import scanpy as sc\n",
"import h5py\n",
"import CococoNet_reader\n",
"import numpy as np\n",
"import anndata\n",
"\n",
"\n",
"import Go_annotations\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"sc.settings.verbosity = 3 \n",
"sc.set_figure_params(facecolor = 'white', figsize = (10,8))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "First, read in your single-cell data below. We will use it to build a coexpression network. " ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "single_cell_arabidopsis_root = sc.read_h5ad('your_file_name_here')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Here, we do basic filtering. This dataset is older, so higher thresholds are likely more appropriate for your data. " ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "filtered out 428 genes that are detected in less than 3 cells\n" ] } ], "source": [
"sc.pp.filter_cells(single_cell_arabidopsis_root, min_genes=200)\n",
"sc.pp.filter_genes(single_cell_arabidopsis_root, min_cells=3)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Here, we identify highly variable genes and plot them to check the selection. Refer to the Scanpy tutorial for more info on this. " ] },
{ "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "extracting highly variable genes\n", " finished (0:00:03)\n", "--> added\n", " 'highly_variable', boolean vector (adata.var)\n", " 'means', float vector (adata.var)\n", " 'dispersions', float vector (adata.var)\n", " 'dispersions_norm', float vector (adata.var)\n" ] } ], "source": [ "sc.pp.highly_variable_genes(single_cell_arabidopsis_root, min_mean=0.125, max_mean=4, min_disp=0.5)\n" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "sc.pl.highly_variable_genes(single_cell_arabidopsis_root)\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Next, we do standard preprocessing for clustering." ] },
{ "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "computing PCA\n", " on highly variable genes\n", " with n_comps=50\n", " finished (0:00:03)\n" ] } ], "source": [ "sc.tl.pca(single_cell_arabidopsis_root, svd_solver='arpack', random_state=303)\n" ] },
{ "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "" ] }, "metadata": { "image/png": { "height": 570, "width": 693 } }, "output_type": "display_data" } ], "source": [ "sc.pl.pca_variance_ratio(single_cell_arabidopsis_root, log=True)\n" ] },
{ "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "computing neighbors\n", " using 'X_pca' with n_pcs = 50\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2023-08-29 03:00:09.270126: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory\n", "2023-08-29 03:00:09.270198: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " finished: added to `.uns['neighbors']`\n", " `.obsp['distances']`, distances for each pair of neighbors\n", " `.obsp['connectivities']`, weighted adjacency matrix (0:00:14)\n" ] } ], "source": [ "sc.pp.neighbors(single_cell_arabidopsis_root, n_neighbors=12, n_pcs=50)\n" ] },
{ "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "computing UMAP\n" ] } ], "source": [ "sc.tl.umap(single_cell_arabidopsis_root, random_state = 233)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sc.pl.umap(single_cell_arabidopsis_root,color = 'Meta Cluster String', s = 20)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Here, we diverge sharply from a standard clustering pipeline. We want to pick a resolution that yields several hundred clusters, most of them containing a low double-digit number of cells. This resolution will probably be far higher than usual, in the 50-200 range. " ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sc.tl.leiden(single_cell_arabidopsis_root,resolution = 50, random_state = 203)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Check how many cells are in the biggest and smallest clusters." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "single_cell_arabidopsis_root.obs['leiden'].value_counts().head(20)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "single_cell_arabidopsis_root.obs['leiden'].value_counts().tail(20)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "single_cell_arabidopsis_root" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now, we are going to pseudobulk our samples, averaging expression within each tiny cluster." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"psuedobulk_df = pd.DataFrame(index = single_cell_arabidopsis_root.var_names) ## Make a base dataframe index we will add columns to later\n",
"all_samples = list(single_cell_arabidopsis_root.obs.leiden.unique()) ## Get list of clusters to loop through\n",
"psuedobulk_df" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(all_samples)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Below we actually do the pseudobulking."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"for batch_type in all_samples:\n",
"\n",
"    ## Read in the names so our code is easy to understand\n",
"    current_cluster = batch_type\n",
"\n",
"    ## Calculate the pseudobulked mean\n",
"    cells_matching_batch_and_cluster = single_cell_arabidopsis_root[single_cell_arabidopsis_root.obs['leiden'] == current_cluster ]\n",
"    ## np.asarray(...).ravel() yields a flat vector whether .X is dense or sparse\n",
"    mean_of_genes = np.asarray(cells_matching_batch_and_cluster.X.mean(axis = 0)).ravel()\n",
"\n",
"\n",
"    name_of_combo = current_cluster\n",
"    psuedobulk_df[name_of_combo] = mean_of_genes" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "psuedobulk_df" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "exp_data = psuedobulk_df" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Next, we efficiently generate the Spearman coexpression matrix. This should be faster than .corr." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"import numpy as np\n",
"import scipy.stats as sci\n",
"\n",
"rank_test_py_exp = sci.rankdata(exp_data, method = 'average', axis = 1) #Row ranks\n",
"rank_test_py_exp = rank_test_py_exp - rank_test_py_exp.mean(axis = 1)[:,None] #Center each gene, subtract its own mean rank\n",
"rank_test_py_exp_2 = np.square(rank_test_py_exp) #Square\n",
"rank_test_py_exp = rank_test_py_exp /np.sqrt(rank_test_py_exp_2.sum(axis = 1))[:,None] #Divide by sqrt(rowSums)\n",
"cr_python = np.dot(rank_test_py_exp, rank_test_py_exp.T) #Get correlations" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cr_python" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Put the results in a labeled dataframe. This is your coexpression network!"
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"corr_results = pd.DataFrame(columns = psuedobulk_df.index, index = psuedobulk_df.index, data = cr_python)\n",
"corr_results" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Next, we need an orthology mapping of our data. It should list all many-to-many gene pairs, formatted into 5 columns. \n", "\n", "\n", "Species_1-OrthoDB Gene ID, Species_2-OrthoDB Gene ID, Orthogroup, Species_1 Gene ID Used in your single cell data, Species_2 Gene ID Used in your single cell data" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Next, use the following two functions to generate the list. Pass your orthology map and the generated coexpression network or networks to the first function." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"def Calculate_Score_list_for_thresholding(orthology_map,species_1_coexpression_network,species_2_coexpression_network):\n",
"    import itertools\n",
"    import numpy as np\n",
"    import pandas as pd\n",
"    #Get Species Names in Common form\n",
"    # common_name_1 and common_name_2 are not arguments: they are read as globals and must be defined before calling.\n",
"    # The orthology map CSV is expected to have '<common_name_1> Symbol' and '<common_name_2> Symbol' columns.\n",
"\n",
"    cross_species_n_m_genes = pd.read_csv(orthology_map)\n",
"    orig_column_common_name_1 = common_name_1 + ' Symbol'\n",
"    orig_column_common_name_2 = common_name_2 + ' Symbol'\n",
"    cross_species_n_m_genes = cross_species_n_m_genes.rename(columns = {orig_column_common_name_1:common_name_1,orig_column_common_name_2:common_name_2})\n",
"    ### Get one to ones\n",
"    cross_species_map_one_to_one = cross_species_n_m_genes.drop_duplicates(subset=common_name_1, keep= False,)\n",
"    cross_species_map_one_to_one = cross_species_map_one_to_one.drop_duplicates(subset= common_name_2, keep= False)\n",
"\n",
"    ## Convert to Dictionary\n",
"    dictionary_mapper_one_to_two = cross_species_map_one_to_one.set_index(common_name_1).to_dict()[common_name_2]\n",
"    dictionary_mapper_dos_to_uno = cross_species_map_one_to_one.set_index(common_name_2).to_dict()[common_name_1]\n",
"\n",
"    ## Read In Cococonets\n",
"    coconet_species_one = species_1_coexpression_network\n",
"    coconet_species_two = species_2_coexpression_network\n",
"\n",
"    cross_species_n_m_genes['Group ID'] = 'Unassigned'\n",
"\n",
"\n",
"    ## Assign Genes to Groups\n",
"    id_indexer = 0\n",
"    for gene_pair in cross_species_n_m_genes.iterrows():\n",
"\n",
"        if gene_pair[1]['Group ID'] == 'Unassigned':\n",
"            current_species_1_gene = gene_pair[1][common_name_1]\n",
"            current_species_2_gene = gene_pair[1][common_name_2]\n",
"            # .loc[mask, column] replaces chained indexing so the assignment reliably reaches the dataframe\n",
"            cross_species_n_m_genes.loc[(cross_species_n_m_genes[common_name_1] == current_species_1_gene) & (cross_species_n_m_genes['Group ID'] == 'Unassigned'), 'Group ID'] = id_indexer\n",
"            cross_species_n_m_genes.loc[(cross_species_n_m_genes[common_name_2] == current_species_2_gene) & (cross_species_n_m_genes['Group ID'] == 'Unassigned'), 'Group ID'] = id_indexer\n",
"\n",
"            all_labeled_groups = cross_species_n_m_genes.loc[cross_species_n_m_genes['Group ID'] == id_indexer]\n",
"\n",
"            all_labeled_groups_species_1_genes = all_labeled_groups[common_name_1].to_list()\n",
"            all_labeled_groups_species_2_genes = all_labeled_groups[common_name_2].to_list()\n",
"\n",
"            cross_species_n_m_genes.loc[cross_species_n_m_genes[common_name_1].isin(all_labeled_groups_species_1_genes), 'Group ID'] = id_indexer\n",
"            cross_species_n_m_genes.loc[cross_species_n_m_genes[common_name_2].isin(all_labeled_groups_species_2_genes), 'Group ID'] = id_indexer\n",
"\n",
"            id_indexer += 1\n",
"\n",
"\n",
"\n",
"    #Identify Pairs for evaluation\n",
"    all_pairs_to_evaluate_for_functional_conservation = pd.DataFrame(columns = [common_name_1,common_name_2,'Group Number'])\n",
"    list_of_pair_dataframes = []\n",
"    for group_number in list(set(cross_species_n_m_genes['Group ID'].to_list())):\n",
"        current_gene_map = cross_species_n_m_genes.loc[cross_species_n_m_genes['Group ID'] == group_number]\n",
"        list_of_species_1_genes_in_group = list(set(current_gene_map[common_name_1].to_list()))\n",
"        list_of_species_2_genes_in_group = list(set(current_gene_map[common_name_2].to_list()))\n",
"        all_combo_list_current_genes = itertools.product(list_of_species_1_genes_in_group,list_of_species_2_genes_in_group)\n",
"        all_combo_list_current_genes = list(map(list,all_combo_list_current_genes))\n",
"        current_list_of_pairs = pd.DataFrame(all_combo_list_current_genes,columns = [common_name_1,common_name_2])\n",
"        current_list_of_pairs['Group Number'] = group_number\n",
"        list_of_pair_dataframes.append(current_list_of_pairs)\n",
"    # DataFrame.append was removed in pandas 2.0, so concatenate the per-group frames once\n",
"    if list_of_pair_dataframes:\n",
"        all_pairs_to_evaluate_for_functional_conservation = pd.concat(list_of_pair_dataframes)\n",
"\n",
"\n",
"\n",
"    all_pairs_to_evaluate_for_functional_conservation['Species 1 Score'] = np.nan\n",
"    all_pairs_to_evaluate_for_functional_conservation['Species 2 Score'] = np.nan\n",
"\n",
"\n",
"    ## Trim cococonets to match\n",
"    ## (values of exactly 1, the self-correlations, are replaced with 0 below)\n",
"\n",
"    trimmed_species_1_cococonet = coconet_species_one[coconet_species_one.columns.intersection(cross_species_n_m_genes[common_name_1].to_list())]\n",
"    trimmed_species_1_cococonet = trimmed_species_1_cococonet[trimmed_species_1_cococonet.index.isin(cross_species_n_m_genes[common_name_1].to_list())]\n",
"    double_species_1_trimmed_cococonet = trimmed_species_1_cococonet[trimmed_species_1_cococonet.columns.intersection(cross_species_map_one_to_one[common_name_1].to_list())]\n",
"    double_species_1_trimmed_cococonet = double_species_1_trimmed_cococonet.replace(1,0)\n",
"\n",
"    trimmed_species_2_cococonet = coconet_species_two[coconet_species_two.columns.intersection(cross_species_n_m_genes[common_name_2].to_list())]\n",
"    trimmed_species_2_cococonet = trimmed_species_2_cococonet[trimmed_species_2_cococonet.index.isin(cross_species_n_m_genes[common_name_2].to_list())]\n",
"    double_species_2_trimmed_cococonet = trimmed_species_2_cococonet[trimmed_species_2_cococonet.columns.intersection(cross_species_map_one_to_one[common_name_2].to_list())]\n",
"    double_species_2_trimmed_cococonet = double_species_2_trimmed_cococonet.replace(1,0)\n",
"\n",
"\n",
"    ## Rank\n",
"    species_1_cococonet_ranked = trimmed_species_1_cococonet.rank()\n",
"    species_2_cococonet_ranked = trimmed_species_2_cococonet.rank()\n",
"\n",
"    #Do top 10 Genes\n",
"    top_10_species_1_genes = np.array(\n",
"        [double_species_1_trimmed_cococonet.T[c].nlargest(10).index.values for c in double_species_1_trimmed_cococonet.T]\n",
"    ) # using pair list above, cut down top 10 list to relevant genes, probably by adding list as a column in panda and then filtering panda to index of pair list\n",
"    top_10_species_1_genes_dataframe = pd.DataFrame(\n",
"        data=top_10_species_1_genes,\n",
"        index=double_species_1_trimmed_cococonet.index,\n",
"        columns=[\n",
"            \"One\",\n",
"            \"Two\",\n",
"            \"Three\",\n",
"            \"Four\",\n",
"            \"Five\",\n",
"            \"Six\",\n",
"            \"Seven\",\n",
"            \"Eight\",\n",
"            \"Nine\",\n",
"            \"Ten\",\n",
"        ],\n",
"    )\n",
"\n",
"    #Convert\n",
"    top_10_species_1_genes_as_species_2 = top_10_species_1_genes_dataframe.replace(to_replace=dictionary_mapper_one_to_two)\n",
"\n",
"    # Get genes for checking\n",
"    have_species_1_pairs = all_pairs_to_evaluate_for_functional_conservation.loc[all_pairs_to_evaluate_for_functional_conservation[common_name_1].isin(top_10_species_1_genes_as_species_2.index)]\n",
"    trimmed_all_gene_pairs_for_fc = have_species_1_pairs.loc[have_species_1_pairs[common_name_2].isin(trimmed_species_2_cococonet.index)]\n",
"    trimmed_all_gene_pairs_for_fc = trimmed_all_gene_pairs_for_fc.reset_index(drop = True)\n",
"\n",
"    # Get values in species 2\n",
"    for two_genes in trimmed_all_gene_pairs_for_fc.iterrows():\n",
"        current_species_1_gene = two_genes[1][common_name_1]\n",
"        current_species_2_gene = two_genes[1][common_name_2]\n",
"        finger_print_genes = top_10_species_1_genes_as_species_2.loc[current_species_1_gene].to_list()\n",
"        gene_ranks_in_species_2 = species_2_cococonet_ranked.loc[species_2_cococonet_ranked.index.isin(finger_print_genes), current_species_2_gene]\n",
"        avg_rank_in_species_2 = gene_ranks_in_species_2.mean()\n",
"        index_from_pairs = two_genes[0]\n",
"        trimmed_all_gene_pairs_for_fc.at[index_from_pairs, 'Species 1 Score'] = avg_rank_in_species_2\n",
"\n",
"    #Repeat for Species 2\n",
"\n",
"    top_10_species_2_genes = np.array(\n",
"        [double_species_2_trimmed_cococonet.T[c].nlargest(10).index.values for c in double_species_2_trimmed_cococonet.T]\n",
"    ) # using pair list above, cut down top 10 list to relevant genes, probably by adding list as a column in panda and then filtering panda to index of pair list\n",
"    top_10_species_2_genes_dataframe = pd.DataFrame(\n",
"        data=top_10_species_2_genes,\n",
"        index=double_species_2_trimmed_cococonet.index,\n",
"        columns=[\n",
"            \"One\",\n",
"            \"Two\",\n",
"            \"Three\",\n",
"            \"Four\",\n",
"            \"Five\",\n",
"            \"Six\",\n",
"            \"Seven\",\n",
"            \"Eight\",\n",
"            \"Nine\",\n",
"            \"Ten\",\n",
"        ],\n",
"    )\n",
"\n",
"\n",
"    #Convert\n",
"    top_10_species_2_genes_as_species_1 = top_10_species_2_genes_dataframe.replace(to_replace=dictionary_mapper_dos_to_uno)\n",
"\n",
"\n",
"    # Get values in species 1\n",
"    for two_genes in trimmed_all_gene_pairs_for_fc.iterrows():\n",
"        current_species_1_gene = two_genes[1][common_name_1]\n",
"        current_species_2_gene = two_genes[1][common_name_2]\n",
"        finger_print_genes = top_10_species_2_genes_as_species_1.loc[current_species_2_gene].to_list()\n",
"        gene_ranks_in_species_1 = species_1_cococonet_ranked.loc[species_1_cococonet_ranked.index.isin(finger_print_genes), current_species_1_gene]\n",
"        avg_rank_in_species_1 = gene_ranks_in_species_1.mean()\n",
"        index_from_pairs = two_genes[0]\n",
"        trimmed_all_gene_pairs_for_fc.loc[index_from_pairs, 'Species 2 Score'] = avg_rank_in_species_1\n",
"\n",
"    #Calculate Divisors\n",
"    Number_of_species_1_genes = len(top_10_species_1_genes_as_species_2)\n",
"    Number_of_species_2_genes = len(top_10_species_2_genes_as_species_1)\n",
"\n",
"\n",
"    species_1_score_divisor = Number_of_species_2_genes - 4.5\n",
"    species_2_score_divisor = Number_of_species_1_genes-4.5\n",
"\n",
"    #Divide and Average\n",
"    trimmed_all_gene_pairs_for_fc['Species 1 Score'] = trimmed_all_gene_pairs_for_fc['Species 1 Score']/species_1_score_divisor\n",
"    trimmed_all_gene_pairs_for_fc['Species 2 Score'] = trimmed_all_gene_pairs_for_fc['Species 2 Score']/species_2_score_divisor\n",
"    trimmed_all_gene_pairs_for_fc['Total Score'] = trimmed_all_gene_pairs_for_fc[['Species 1 Score','Species 2 Score']].mean(axis = 1)\n",
"\n",
"    return trimmed_all_gene_pairs_for_fc\n",
"    " ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "For the second function, pass in the results of the first function along with whatever thresholds you'd like to use. Below are the ones we recommend. \n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# Each tuple appears to hold (name, single_pair_junk_threshold, many_to_many_junk_threshold, difference_between_many_to_drop);\n",
"# the 'moderate' values match the defaults of the function in the next code cell.\n",
"lenient_threshold = ('lenient',0.7,0.8,0.02)\n",
"moderate_threshold = ('moderate',0.8,0.85,0.03)\n",
"stringent_threshold = ('stringent',0.85,0.9,0.035)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "This function will return the coexpression proxies." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"def Threshold_and_generate_coexpressalog_list(trimmed_all_gene_pairs_for_fc, single_pair_junk_threshold = .8, many_to_many_junk_threshold = .85,difference_between_many_to_drop = .03):\n",
"\n",
"\n",
"    list_of_true_pairs = []\n",
"    list_of_genes_to_average_and_set_to_be_equal = []\n",
"\n",
"\n",
"    common_name_1 = trimmed_all_gene_pairs_for_fc.columns[0]\n",
"    common_name_2 = trimmed_all_gene_pairs_for_fc.columns[1]\n",
"\n",
"    for current_group in list(set(trimmed_all_gene_pairs_for_fc['Group Number'].to_list())):\n",
"        dataframe_of_group = trimmed_all_gene_pairs_for_fc.loc[trimmed_all_gene_pairs_for_fc['Group Number'] == current_group]\n",
"        dataframe_of_group = dataframe_of_group[[common_name_1,common_name_2,'Total Score']]\n",
"        wide_format = dataframe_of_group.pivot(index = common_name_1,columns= common_name_2,values= 'Total Score')\n",
"        # ASSUMPTION: the source text of this first test was lost between '(wide_format' and '1 and len(wide_format.columns) ==1:'.\n",
"        # The two lines below are a guess that uses the otherwise unused single_pair_junk_threshold; any handling of\n",
"        # one-to-one groups that the lost text may have contained is not restored. Check against the original notebook.\n",
"        if (wide_format < single_pair_junk_threshold).all(axis = None):\n",
"            continue\n",
"        elif len(wide_format) > 1 and len(wide_format.columns) ==1:\n",
"            one_true_pair = [wide_format.idxmax(axis =0).item(),wide_format.columns.item()]\n",
"            list_of_true_pairs.append(one_true_pair)\n",
"        elif len(wide_format) == 1 and len(wide_format.columns)>1:\n",
"            one_true_pair = [wide_format.index.item(), wide_format.idxmax(axis = 1).item()]\n",
"            list_of_true_pairs.append(one_true_pair)\n",
"        else:\n",
"\n",
"            #Drop Low Quality Columns and Rows\n",
"            # ASSUMPTION: the source text after 'wide_format.max()' was lost up to '1 and len(wide_format.columns) ==1:'.\n",
"            # The four drop lines and the opening 'if' below are a guess based on the comment above and the\n",
"            # otherwise unused many_to_many_junk_threshold.\n",
"            cols_to_drop = wide_format.columns[wide_format.max() < many_to_many_junk_threshold]\n",
"            wide_format = wide_format.drop(columns = cols_to_drop)\n",
"            rows_to_drop = wide_format.index[wide_format.max(axis = 1) < many_to_many_junk_threshold]\n",
"            wide_format = wide_format.drop(index = rows_to_drop)\n",
"            if len(wide_format) > 1 and len(wide_format.columns) ==1:\n",
"                one_true_pair = [wide_format.idxmax(axis =0).item(),wide_format.columns.item()]\n",
"                list_of_true_pairs.append(one_true_pair)\n",
"            elif len(wide_format) == 1 and len(wide_format.columns)>1:\n",
"                one_true_pair = [wide_format.index.item(), wide_format.idxmax(axis = 1).item()]\n",
"                list_of_true_pairs.append(one_true_pair)\n",
"            elif (wide_format>.9).all(axis = None):\n",
"                ### Put in retention code here\n",
"                both_gene_lists_to_average = [wide_format.index.to_list(), wide_format.columns.to_list()]\n",
"                list_of_genes_to_average_and_set_to_be_equal.append(both_gene_lists_to_average)\n",
"\n",
"            else:\n",
"                for cur_row in wide_format.iterrows():\n",
"                    cur_row_max = cur_row[1].max()\n",
"                    cur_row[1][cur_row[1]< cur_row_max - difference_between_many_to_drop] = np.nan\n",
"                    wide_format.loc[cur_row[0]] = cur_row[1]\n",
"                wide_format = wide_format.dropna(axis = 1, how = 'all')\n",
"                for cur_col in wide_format.columns:\n",
"                    cur_col_max = wide_format[cur_col].max()\n",
"                    wide_format.loc[wide_format[cur_col]< cur_col_max-difference_between_many_to_drop, cur_col] = np.nan\n",
"                wide_format = wide_format.dropna(axis = 0, how = 'all')\n",
"                col_count = wide_format.count() == 1\n",
"                wide_format = wide_format.loc[:,col_count]\n",
"                row_count = wide_format.count(axis = 1) ==1\n",
"                wide_format = wide_format.loc[row_count,:]\n",
"                wide_format = wide_format.dropna(axis = 1, how = 'all')\n",
"                wide_format = wide_format.dropna(axis = 0, how = 'all')\n",
"                for label,content in wide_format.items():\n",
"                    cur_species_2_label = label\n",
"                    cur_species_1_label = content.idxmax()\n",
"                    if isinstance(cur_species_1_label, str):\n",
"                        one_true_pair = [cur_species_1_label,cur_species_2_label]\n",
"                        list_of_true_pairs.append(one_true_pair)\n",
"\n",
"\n",
"    true_pair_dataframe = pd.DataFrame(columns= [f'{common_name_1} gene',f'{common_name_2} gene'], data = list_of_true_pairs)\n",
"    true_pair_dataframe = true_pair_dataframe.drop_duplicates(subset = f'{common_name_1} gene')\n",
"    true_pair_dataframe = true_pair_dataframe.drop_duplicates(subset = f'{common_name_2} gene')\n",
"\n",
"    return true_pair_dataframe\n",
"    " ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Below is an example of what that workflow would look like. " ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"score_list_to_threshold = Calculate_Score_list_for_thresholding(orthology_map,species_1_coexpression_network,species_2_coexpression_network)\n",
"final_coexpression_proxies = Threshold_and_generate_coexpressalog_list(score_list_to_threshold)\n",
"print(final_coexpression_proxies)\n",
"final_coexpression_proxies.to_csv('save_where_you_want')" ] }
], "metadata": { "kernelspec": { "display_name": "Single_cell_data_fix", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.12" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }