
My personal experience with the .loc function for gene queries

Gene queries can be made easier with programming tools such as Python and SQL

With more scientific journals mandating the deposition of datasets in public data repositories, data scientists can now analyse these deposited datasets to derive new biological insights, or use them to cross-check their own research findings. However, searching for a long list of genes across multiple datasets can be challenging, especially if you are querying hundreds to thousands of genes.

Imagine another situation where 100 genes are involved in a biological pathway. You are interested in finding out whether your gene expression changes are driven by just a small subset of genes in the pathway, or whether the majority of genes in the pathway are affected. At this point, you may think that “Control-F” is your solution, but you would have to search 100 times to find the expression levels of all 100 genes involved. While achievable with sheer persistence, such manual methods are slow and error-prone.

Here, I recommend using Python to streamline these queries, allowing you to look up gene expression levels with just a few lines of code. As a relevant example, we will use the dataset published by Zak DE et al., PNAS, whose research design and outcomes I have described previously. In short, the authors investigated the gene expression profiles of seropositive and seronegative subjects receiving the MRKAd5/HIV vaccine.

I will describe the process that eventually led to the final code I use for my own applications. First, I loaded the data containing the fold-change (FC), p-value (pval) and adjusted p-value for the seronegative subjects into the variable Ad5_seroneg using the command below:

import pandas as pd

# Load the fold-change and p-value table for the seronegative subjects
Ad5_seroneg = pd.read_csv('/Users/kuanrongchan/Desktop/Ad5_days_seroneg_Daniel_2012.csv')

The output is as follows:


Now that we are sure the data has loaded properly, we will use .loc to look up the expression of interferon-related genes. Because .loc selects rows by index label, the Gene_Symbol column is first set as the index with the “set_index” command. As a proof of concept, I saved IFNA14, IFNB1 and IFNG under IFN_keys, and then used this list to query the dataset stored in Ad5_seroneg.

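IFN_keys = ["IFNA14", "IFNB1", "IFNG"]
# Index the table by gene symbol, then select the rows for the query genes
Ad5_seroneg.set_index('Gene_Symbol').loc[IFN_keys]

The output is as follows: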


Fantastic! The .loc function solved the problem! In principle, you could replace the gene names saved under IFN_keys with any query gene list you like.
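For readers who do not have the original CSV at hand, here is a minimal, self-contained sketch of the same pattern on a small made-up table. The column names FC and pval and every value in it are illustrative assumptions, not data from Zak et al.

```python
import pandas as pd

# Toy stand-in for the real table; all values are made up for illustration.
toy = pd.DataFrame(
    {
        "Gene_Symbol": ["IFNA14", "IFNB1", "IFNG", "GAPDH"],
        "FC": [1.2, 2.5, 3.1, 1.0],
        "pval": [0.20, 0.01, 0.001, 0.90],
    }
)

IFN_keys = ["IFNA14", "IFNB1", "IFNG"]

# Index by gene symbol, then select the rows whose labels are listed in IFN_keys.
# The rows come back in the order given in IFN_keys.
ifn_hits = toy.set_index("Gene_Symbol").loc[IFN_keys]
print(ifn_hits)
```

Because the gene symbols become the index, the FC and pval columns are all that remain in the result.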

At this point, you may think that everything is solved and we can use this command freely. However, I was curious to find out what would happen if one of the query keys is a gene that is not found in the microarray. Would the command still work? To illustrate this point, I replaced IFNA14 with zzz, since we know there is no gene called zzz.

IFN_keys2 = ["zzz", "IFNB1", "IFNG"]
Ad5_seroneg.set_index('Gene_Symbol').loc[IFN_keys2]

This time, the command fails with the following error:

KeyError                                  Traceback (most recent call last)
<ipython-input-54-01a2a359e521> in <module>
      1 IFN_keys = ["zzz", "IFNB1", "IFNG"]
----> 2 Ad5_seroneg.set_index('Gene_Symbol').loc[IFN_keys]

/opt/anaconda3/lib/python3.8/site-packages/pandas/core/ in __getitem__(self, key)
    894             maybe_callable = com.apply_if_callable(key, self.obj)
--> 895             return self._getitem_axis(maybe_callable, axis=axis)
    897     def _is_scalar_access(self, key: Tuple):

/opt/anaconda3/lib/python3.8/site-packages/pandas/core/ in _getitem_axis(self, key, axis)
   1111                     raise ValueError("Cannot index with multidimensional key")
-> 1113                 return self._getitem_iterable(key, axis=axis)
   1115             # nested tuple slicing

/opt/anaconda3/lib/python3.8/site-packages/pandas/core/ in _getitem_iterable(self, key, axis)
   1052         # A collection of keys
-> 1053         keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
   1054         return self.obj._reindex_with_indexers(
   1055             {axis: [keyarr, indexer]}, copy=True, allow_dups=True

/opt/anaconda3/lib/python3.8/site-packages/pandas/core/ in _get_listlike_indexer(self, key, axis, raise_missing)
   1264             keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
-> 1266         self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
   1267         return keyarr, indexer

/opt/anaconda3/lib/python3.8/site-packages/pandas/core/ in _validate_read_indexer(self, key, indexer, axis, raise_missing)
   1320             with option_context("display.max_seq_items", 10, "display.width", 80):
-> 1321                 raise KeyError(
   1322                     "Passing list-likes to .loc or [] with any missing labels "
   1323                     "is no longer supported. "

KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: Index(['zzz'], dtype='object', name='Gene_Symbol'). See"

To me, this is a disaster! Entering a single gene that is not found in the dataset means that I get no useful output at all, even for the genes that are present. This also means that querying genes across datasets generated on different platforms will be challenging, since the platforms may not cover the same genes. If I wanted to make this code work for me, I knew I needed to find a solution.
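Before looking for a fix, it can help to see exactly which query genes are absent. The sketch below reproduces the failure on a made-up table (the values and the FC and pval column names are assumptions for illustration) and then splits the query list into missing and present genes with a plain membership test on the index.

```python
import pandas as pd

# Toy table indexed by gene symbol; values are made up for illustration.
toy = pd.DataFrame(
    {
        "Gene_Symbol": ["IFNA14", "IFNB1", "IFNG"],
        "FC": [1.2, 2.5, 3.1],
        "pval": [0.20, 0.01, 0.001],
    }
).set_index("Gene_Symbol")

query = ["zzz", "IFNB1", "IFNG"]

# .loc raises KeyError as soon as any label in the list is missing.
try:
    toy.loc[query]
    raised = False
except KeyError:
    raised = True

# Split the query into genes that are absent from and present in the index.
missing = [gene for gene in query if gene not in toy.index]
present = [gene for gene in query if gene in toy.index]

# Querying only the present genes succeeds.
found = toy.loc[present]
print("Missing genes:", missing)
print(found)
```

This works, but it needs an extra filtering step for every dataset, which is why a single call that tolerates missing labels is more convenient.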

It took me some time to find a solution: the reindex method in pandas can be used to circumvent this issue. I tried this code instead:

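IFN_keys2 = ["zzz", "IFNB1", "IFNG"]
# reindex returns a row of NaN for any label missing from the index instead of raising a KeyError
Ad5_seroneg.set_index('Gene_Symbol').reindex(IFN_keys2)

The output is as follows: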


Amazing outcome! Genes that are not measured in the gene expression dataset now appear as rows of NaN, which I can remove later with the ‘dropna’ function. Most importantly, I still get the output for the remaining genes that I am interested in querying. With this command, querying genes or gene sets across different datasets is much simpler, faster and more reproducible.
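To make the reindex-then-dropna idea concrete, here is a minimal sketch that wraps both steps in a small helper and applies it to two made-up datasets. The function name query_genes, the FC and pval columns, and all values are hypothetical and only serve as an illustration; the sketch also assumes that each gene symbol appears only once per dataset, because reindex raises an error when the index contains duplicate labels.

```python
import pandas as pd


def query_genes(df, genes, symbol_col="Gene_Symbol"):
    """Hypothetical helper: rows for `genes`, skipping genes absent from `df`."""
    indexed = df.set_index(symbol_col)
    # reindex inserts an all-NaN row for every label not found in the index.
    reindexed = indexed.reindex(genes)
    # how="all" drops only the rows in which every column is NaN.
    return reindexed.dropna(how="all")


# Two toy datasets covering different genes; values are made up.
dataset_a = pd.DataFrame(
    {"Gene_Symbol": ["IFNB1", "IFNG", "GAPDH"], "FC": [2.5, 3.1, 1.0], "pval": [0.01, 0.001, 0.90]}
)
dataset_b = pd.DataFrame(
    {"Gene_Symbol": ["IFNA14", "IFNG"], "FC": [1.4, 2.2], "pval": [0.30, 0.02]}
)

query = ["zzz", "IFNB1", "IFNG"]

result_a = query_genes(dataset_a, query)
result_b = query_genes(dataset_b, query)
print(result_a)
print(result_b)
```

One caveat of this design: dropna(how="all") would also remove a gene that is present in the dataset but has no measured values in any column, so the NaN rows are worth inspecting before dropping them if that distinction matters.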
