Posted in python

Executing Excel functions in Jupyter Notebooks using Mito


Data scientists often need to stratify their data using “IF”-style functions. For instance, how do immune responses after vaccination differ between the young and the elderly? In this case, you would filter for the young using a pre-defined age cut-off (less than 65 years old). Thereafter, the data can be sorted in descending order of fold change, so as to determine the genes that are most differentially regulated. Another round of filtering, based on pre-determined fold-change and p-value cut-offs, allows identification of differentially expressed genes (DEGs). Then another possibility may arise: you may wonder whether the dataset is influenced by gender differences, and how gender interacts with age to influence vaccine responses. This requires even more filtering of the dataset to probe these interesting biological questions…

As you can imagine, there are many filtering and comparison steps to perform, especially when analysing a human dataset, owing to the large variation between individuals. Microsoft Excel is a useful tool for many of these filtering and aggregation functions. However, a major disadvantage is that this generates many Excel files and sheets, which can become difficult to manage. Python can perform these filtering functions easily, but this usually requires typing long blocks of code, which can be time-consuming. Hence, I recommend a Python package named Mito that can accelerate execution of these filtering functions.
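For context, the filtering workflow described above looks something like this in plain pandas (the gene names, cut-offs and column names below are purely illustrative):

```python
import pandas as pd

# hypothetical differential-expression results table
results = pd.DataFrame({
    "gene": ["IFNG", "IL6", "TNF", "ACTB"],
    "fold_change": [2.5, 0.4, 3.1, 1.0],
    "p_value": [0.001, 0.2, 0.0005, 0.9],
})

# keep genes passing pre-defined fold-change and p-value cut-offs,
# sorted in descending order of fold change
degs = (results[(results["fold_change"] > 2) & (results["p_value"] < 0.05)]
        .sort_values("fold_change", ascending=False))
print(degs["gene"].tolist())  # ['TNF', 'IFNG']
```

Mito's selling point is that each of these steps can instead be carried out through a spreadsheet interface, with the corresponding code written for you.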

To install the package, execute the following command in the terminal to download the installer:

python -m pip install mitoinstaller

Ensure that you have Python 3.6 or above to utilise this package. Next, run the installer in the terminal:

python -m mitoinstaller install

Finally, launch JupyterLab and restart your Notebook Kernel:

python -m jupyter lab

To run the package in Jupyter Lab:

import mitosheet
mitosheet.sheet()

The output is an interactive spreadsheet interface:

The features of a mitosheet include:

1. Importing datasets
2. Exporting datasets
3. Adding columns (you can insert formulas here)
4. Deleting unnecessary columns
5. Pivot tables, which allow you to transpose and query the dataframe differently
6. Merging datasets, possible if you have two Excel files with a shared column
7. Basic graphing features, using Plotly
8. Filtering data
9. Managing multiple sheets and inserting new ones

Most excitingly, Mito also generates the equivalent annotated Python code for each of these edits, meaning that you can easily reuse the commands for further analysis in Python. Overall, Mito is a user-friendly tool that lets you perform Excel functions within Python. This makes the workflow much more efficient, as we do not have to import multiple Excel files into Jupyter notebooks. Critically, typing long blocks of code is not required, as the Python code is automatically generated as you plough through the data.

Posted in Introduction, python

Five reasons why Python is better than Excel in processing big omics datasets

Big data analysis with Python's pandas package can be easier than Excel. Picture source: https://images.app.goo.gl/wq34AegEWn73BcCQ8

My self-learning with Dataquest has allowed me to use Python to make pandas DataFrames that allow basic data queries, creating columns based on pre-defined formulas, and plotting graphs. Despite my basic knowledge of Python, I have begun to appreciate that Python allows more efficient data processing than Excel. Some of the main advantages of using Python are:

1. Processed DataFrames are assigned to variables in code, mitigating the need to store multiple files or spreadsheets.

For instance, a data scientist will usually want to convert their values to Z-scores before doing principal component analysis, hierarchical clustering or plotting heatmaps. The conversion in Excel is easy if you are analysing a small dataset. However, when analysing a dataset of 20 variables by 20,000 genes (400,000 numerical values in total), the user will usually have to save the result as a separate Excel file to prevent lagging issues. In Python, the converted values can simply be assigned to a variable, as in the following example:

from sklearn.preprocessing import StandardScaler
import numpy as np

data = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])

# standardise each column to zero mean and unit variance (Z-score)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

In this case, the Z-score-transformed data is assigned to scaled_data, and you can subsequently run a PCA or draw a heatmap from this scaled_data.
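As a continuation of this sketch, the scaled data can be passed straight to scikit-learn's PCA (the toy array is the same as above):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
scaled_data = StandardScaler().fit_transform(data)

# project the standardised data onto its principal components
pca = PCA(n_components=2)
components = pca.fit_transform(scaled_data)
print(components.shape)  # (4, 2)
```

The components array can then be plotted directly as a PCA scatter plot.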

2. Python is less prone to errors than Excel

Using Excel to analyse small datasets is convenient and my preferred choice. However, with large datasets, the drag-to-fill function is not perfect. For instance, a lot of time and effort is needed to drag formulas across cells when you have a large number of variables. Double-clicking to fill down also fails when your dataset contains empty cells. Finally, combining “filter” or “sort” functions with formulas is cumbersome in Excel, especially for large datasets. In Python, the results of a formula can easily be stored in a new column. For instance, to add log2FC and -logP columns to a dataset (assigned to the variable dataset in this case), we just need the commands below:

# assuming `dataset` is a pandas DataFrame with 'FC' and 'p_value' columns
import numpy as np

# generate log2FC and -log10(p-value) columns
dataset['log2FC'] = np.log2(dataset['FC'])
dataset['negative_log_pvalue'] = -np.log10(dataset['p_value'])

We can then use these new columns to plot graphs, such as volcano plots and p-value histograms.

3. Python is more time- and space-efficient in processing large datasets

Excel's maximum worksheet size is 1,048,576 rows by 16,384 columns, which means you have to store all your genes from microarray or RNA-seq experiments in rows. Even then, performing Excel functions on large datasets can take several minutes. In some scenarios, owing to the long processing time, Excel may shut down, causing you to lose your updated file. This means you have to save your files regularly, and each save takes a few minutes, so a lot of time is wasted waiting. In Python, particularly with JupyterLab, your code is saved automatically, and DataFrame dimensions are limited only by memory. This also means you can reshape and transpose your data whenever you like, within seconds.
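To illustrate, a matrix wider than Excel's column limit can be built and transposed in a couple of lines (the dimensions below mirror the 20,000-gene example mentioned earlier):

```python
import numpy as np
import pandas as pd

# 20,000 genes (rows) by 20 samples (columns) of random expression values
df = pd.DataFrame(np.random.rand(20000, 20))

# transpose: samples become rows, genes become columns
transposed = df.T
print(transposed.shape)  # (20, 20000)
```

Note that the transposed frame has 20,000 columns, beyond Excel's 16,384-column cap.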

Another point to note is that all your code can be saved and shared, which means the same code can quickly be applied to other large datasets. This is unlike Excel, which requires you to repeat the drag-to-fill for every dataset you analyse.

4. Searching for variables or genes across different datasets is quicker in Python than in Excel

Within one dataset, Excel can perform multiple gene queries with the INDEX function. However, across multiple large datasets, this is not efficient. First, you have to open multiple spreadsheets, and then use the INDEX function to query each and every one. Overall, this takes a very long time if you have many spreadsheets. In Python, once you assign the gene names as the index column, the same code performs the query on every dataset. All the files can be stored neatly in JupyterLab or Jupyter Notebook, which means they are easily accessible from Python code.
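As a minimal sketch, with gene symbols as the index, a single loop queries every dataset (the datasets and values here are made up):

```python
import pandas as pd

# two hypothetical datasets indexed by gene symbol
study1 = pd.DataFrame({"log2FC": [2.1, -0.3]}, index=["IFNG", "ACTB"])
study2 = pd.DataFrame({"log2FC": [1.8, 0.1]}, index=["IFNG", "GAPDH"])

# query the same gene in every dataset with one loop
for name, df in {"study1": study1, "study2": study2}.items():
    print(name, df.loc["IFNG", "log2FC"])
```

The same loop scales to any number of datasets without opening a single spreadsheet.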

5. Python can draw more sophisticated graphs than Excel

With more people sharing their code and developing packages, Python can draw the more sophisticated graphs that are important for data visualisation of large datasets. For instance, hierarchical clustering and heatmaps are easier to produce in Python than in Excel, and the complex computations required for principal component analysis are also easier to perform in Python.
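As one sketch of this, SciPy performs the hierarchical clustering that underlies such heatmaps in a few lines (the toy matrix below contains two obvious groups; in practice you would pass in your Z-scored expression matrix):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy expression matrix: 6 samples forming two well-separated groups
data = np.array([[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
                 [5.0, 5.1], [4.9, 5.0], [5.1, 4.9]])

# agglomerative clustering with Ward linkage
Z = linkage(data, method="ward")

# cut the dendrogram into two clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

seaborn's clustermap function wraps this same linkage computation and draws the clustered heatmap directly.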

Overall, based on these advantages, I believe that Python is an essential programming language for data scientists. I would hence strongly recommend learning Python if you are considering a career in data science.

Posted in Research tools

Beware of Excel's autocorrect feature when analysing gene lists

Excel is a great way of handling big datasets, but beware of this problem: the software autocorrects some gene symbols into dates (for example, MARCH1 becomes 1-Mar). Without a gene description column in your datasheet, these conversions are irreversible, as there is no way to recover the gene name from the date alone.

One way to circumvent this issue is to always open a blank Excel sheet and load the .txt or .csv file via “File -> Open”. At the last step of the import wizard (step 3; see figure at top), you have to assign the appropriate columns as text. If your gene names are not in the first column, you will have to manually click on the column that contains them.

If you have accidentally forgotten to open your Excel file in this manner, the other option is to reassign the gene symbols based on the gene descriptions. To do this, sort your gene symbols in ascending order, which will quickly surface the gene names that have been converted to dates. Then format the entire column as text. Finally, based on the table below, assign each gene symbol back to the correct one.
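If the damaged file later ends up in Python, the same reassignment can be scripted with a small mapping dictionary (the entries below cover only a few well-known conversions, and the exact date strings depend on your Excel locale; extend the dictionary to match your own table):

```python
import pandas as pd

# partial mapping from Excel date strings back to gene symbols
# (illustrative only; date format varies with Excel locale settings)
date_to_gene = {"1-Mar": "MARCH1", "2-Mar": "MARCH2",
                "1-Sep": "SEPT1", "1-Dec": "DEC1"}

genes = pd.Series(["TP53", "1-Mar", "1-Sep", "ACTB"])
fixed = genes.replace(date_to_gene)
print(fixed.tolist())  # ['TP53', 'MARCH1', 'SEPT1', 'ACTB']
```

This repairs the column in one pass instead of correcting each cell by hand.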