Posted in .loc function, python

My personal experience with the .loc function for gene queries

Gene query can be made easier with programming tools such as Python and SQL

With more scientific journals mandating the publication of datasets into the public data repository, data scientists can now consider analysing these deposited datasets to derive new biological insights, or leverage on these datasets to cross-check their own research findings. However, searching a long list of genes across multiple datasets can be challenging, especially if you are querying hundreds to thousands of genes.

Imagine another situation where 100 genes are involved in a biological pathway. You are interested to find out whether your gene expression changes are just influenced by a small subset of genes involved in the pathway or if the majority of genes in the pathway are affected. At this point, you may think that “Control-F” is your solution, but you will have to click 100 times to find the expression levels of all the 100 genes involved. While achievable with pure resilience, such manual methods are often slow and error-prone.

Here, I recommend using python to streamline these query processes, allowing you to quickly query gene expression levels by using just a few lines of codes. To provide a relevant example, we will use the dataset published by Zak DE et al., PNAS, where I have previously described the research design and research outcomes. In short, the authors investigated the gene expression profile of seropositive and seronegative subjects receiving the MRKAd5/HIV vaccine.

I will describe my personal encounter that eventually derived the final code for my own applications. First, I assigned the variable Ad5_seroneg as the data containing the fold-change (FC), p-value (pval) and adjusted p-value for the seronegative subjects using the below command:

import csv
import numpy as np
import pandas as pd
Ad5_seroneg = pd.read_csv('/Users/kuanrongchan/Desktop/Ad5_days_seroneg_Daniel_2012.csv')
Ad5_seroneg.head(2)

Output file is as follows:

Gene_SymbolFC_6h_vs_0pval_6h_vs_0adj_pval_6h_vs_0FC_1d_vs_0pval_1d_vs_0adj_pval_1d_vs_0FC_3d_vs_0pval_3d_vs_0adj_pval_3d_vs_0FC_7d_vs_0pval_7d_vs_0adj_pval_7d_vs_0
0A1BG-1.134080.1577810.50037-1.037420.6099120.804213-1.139440.1430990.678669-1.048680.5899130.997579
1A1CF-1.041990.3837500.638881.074750.0655790.1701181.019200.6860740.929992-1.013690.7725000.997579

Now that we are sure that the data is properly loaded, we will use the .loc command to find the expression of interferon-related genes, by setting the Gene_symbol column as index keys using the “set_index” command. In this case, I used IFNA14, IFNB1, IFNG as a proof of concept and saved these genes under IFN_keys. The IFN_keys are then used to query against the dataset previously saved under Ad5_seroneg.

IFN_keys = ["IFNA14", "IFNB1", "IFNG"]
Ad5_seroneg.set_index('Gene_Symbol').loc[IFN_keys]

Output file is as follows:

Gene_SymbolFC_6h_vs_0pval_6h_vs_0adj_pval_6h_vs_0FC_1d_vs_0pval_1d_vs_0adj_pval_1d_vs_0FC_3d_vs_0pval_3d_vs_0adj_pval_3d_vs_0FC_7d_vs_0pval_7d_vs_0adj_pval_7d_vs_0
IFNA141.139020.2446500.542742-1.037400.6857290.8522021.057650.6141470.909833-1.035690.7522600.997579
IFNB1-1.041200.6601080.8216551.040450.5970050.7948621.017440.8504860.970073-1.133910.1750990.997579
IFNG1.018090.9273400.968282-1.185670.2916960.5181501.001240.9949820.9984321.166440.4350680.997579

Fantastic! The .loc function solved the problem! You could theoretically replace the gene names saved under IFN_keys to any query gene list you desire.

At this point, you may think that everything is solved and we can proceed with using this command freely. However, I was curious to find out what would happen if you replace the query key with a gene that is not found in the microarray? Would this command still work? To illustrate this point, I replaced IFNA14 with zzz, since we know there is no gene called zzz.

IFN_keys2 = ["zzz", "IFNB1", "IFNG"]
Ad5_seroneg.set_index('Gene_Symbol').loc[IFN_keys]

Output file is as follows:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-54-01a2a359e521> in <module>
      1 IFN_keys = ["zzz", "IFNB1", "IFNG"]
----> 2 Ad5_seroneg.set_index('Gene_Symbol').loc[IFN_keys]

/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexing.py in __getitem__(self, key)
    893 
    894             maybe_callable = com.apply_if_callable(key, self.obj)
--> 895             return self._getitem_axis(maybe_callable, axis=axis)
    896 
    897     def _is_scalar_access(self, key: Tuple):

/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1111                     raise ValueError("Cannot index with multidimensional key")
   1112 
-> 1113                 return self._getitem_iterable(key, axis=axis)
   1114 
   1115             # nested tuple slicing

/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexing.py in _getitem_iterable(self, key, axis)
   1051 
   1052         # A collection of keys
-> 1053         keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
   1054         return self.obj._reindex_with_indexers(
   1055             {axis: [keyarr, indexer]}, copy=True, allow_dups=True

/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
   1264             keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
   1265 
-> 1266         self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
   1267         return keyarr, indexer
   1268 

/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
   1319 
   1320             with option_context("display.max_seq_items", 10, "display.width", 80):
-> 1321                 raise KeyError(
   1322                     "Passing list-likes to .loc or [] with any missing labels "
   1323                     "is no longer supported. "

KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: Index(['zzz'], dtype='object', name='Gene_Symbol'). See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"

To me, this is a disaster! By entering a gene that was not found in the dataset means that I will not be able to have any useful output from python. This will also mean that querying genes across various datasets that are using different platforms will be potentially challenging. If I wanted to make this code work for me, I knew I needed to find a solution.

I took some time to find the solution, only to realise that the reindex function in python can be used to circumvent this issue. When I tried this code instead:

IFN_keys2 = ["zzz", "IFNB1", "IFNG"]
Ad5_seroneg.set_index('Gene_Symbol').reindex(IFN_keys2)

Output is as follows:

FC_6h_vs_0pval_6h_vs_0adj_pval_6h_vs_0FC_1d_vs_0pval_1d_vs_0adj_pval_1d_vs_0FC_3d_vs_0pval_3d_vs_0adj_pval_3d_vs_0FC_7d_vs_0pval_7d_vs_0adj_pval_7d_vs_0
Gene_Symbol
zzzNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
IFNB1-1.041200.6601080.8216551.040450.5970050.7948621.017440.8504860.970073-1.133910.1750990.997579
IFNG1.018090.9273400.968282-1.185670.2916960.5181501.001240.9949820.9984321.166440.4350680.997579

Amazing outcome! Now, genes that are not tested in the gene expression dataset will be considered as NaN, which can allow me to execute ‘dropna’ function later to drop these columns. Most importantly, I am able to get the output for the remaining genes that I am interested to query. With this command, querying genes or gene sets across different datasets is much simpler, faster and reproducible.