Big data analysis with Python using the pandas package can be easier than Excel. Picture source: https://images.app.goo.gl/wq34AegEWn73BcCQ8
My self-learning with Dataquest has allowed me to use Python to build pandas DataFrames that support basic data queries, creating columns from pre-defined formulas, and plotting graphs. Despite my basic knowledge of Python, I have begun to appreciate that Python allows more efficient data processing than Excel. Some of the main advantages of using Python are:
1. Processed DataFrames are assigned to variables in code, removing the need to store multiple files or spreadsheets
For instance, a data scientist will usually want to convert their values to Z-scores before doing principal component analysis, hierarchical clustering or plotting heatmaps. The conversion in Excel is easy if you are analysing a small dataset. However, when analysing a dataset with 20 variables across 20,000 genes (400,000 numerical values in total), the user will usually have to save the result in a separate Excel file to prevent lagging issues. In Python, the converted values can be stored with a few lines of code, for example:
from sklearn.preprocessing import StandardScaler
import numpy as np

# Z-score transform: each column is scaled to mean 0 and standard deviation 1
data = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
In this case, the Z-score transformed data is assigned to ‘scaled_data’, and you can subsequently do a PCA or plot a heatmap with this ‘scaled_data’.
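As a minimal sketch of that follow-up step (assuming scikit-learn is installed; the toy array is the same one used above), the scaled array can be passed straight into a PCA without ever saving an intermediate file:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import numpy as np

data = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
scaled_data = StandardScaler().fit_transform(data)

# Reduce the Z-scored data to two principal components
pca = PCA(n_components=2)
components = pca.fit_transform(scaled_data)
print(components.shape)  # one row of component scores per sample: (4, 2)
```

Both `scaled_data` and `components` live only in memory, so no extra spreadsheets accumulate along the way.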
2. Python is less prone to errors than Excel
Using Excel files to analyse small datasets is convenient and my preferred choice. However, with large datasets, the drag-and-fill function is not perfect. For instance, a lot of time and effort is needed to drag formulas across cells if you have a large number of variables, and the fill function also breaks down if you have empty cells in your dataset. Finally, combining “filter” or “sort” functions with formulas is cumbersome in Excel, especially for large datasets. In Python, the result of a formula can simply be stored in a new column. For instance, to add new columns of log2FC and -logP values to a dataset (assigned as ‘dataset’ in this case), we just need the commands below:
import numpy as np

# Commands generate the log2FC and -logP columns
dataset['log2FC'] = np.log2(dataset['FC'])
dataset['negative_log_pvalue'] = -np.log10(dataset['p_value'])
We can then use these new columns to plot graphs, such as volcano plots and p-value histograms.
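As a sketch of that plotting step (assuming pandas and matplotlib are installed; the column names match the snippet above, but the values are invented for illustration), a basic volcano plot takes only a few lines:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs anywhere
import matplotlib.pyplot as plt

# Toy results table with the same columns as above (illustrative values)
dataset = pd.DataFrame({'FC': [0.25, 0.5, 1.0, 2.0, 4.0],
                        'p_value': [0.001, 0.2, 0.9, 0.04, 0.0005]})
dataset['log2FC'] = np.log2(dataset['FC'])
dataset['negative_log_pvalue'] = -np.log10(dataset['p_value'])

# Volcano plot: effect size on x, significance on y
fig, ax = plt.subplots()
ax.scatter(dataset['log2FC'], dataset['negative_log_pvalue'])
ax.set_xlabel('log2 fold change')
ax.set_ylabel('-log10 p-value')
fig.savefig('volcano.png')
```

Because the derived columns already sit in the DataFrame, the same two columns can feed a p-value histogram with `ax.hist` just as easily.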
3. Python is more time- and space-efficient when processing large datasets
The maximum size of an Excel worksheet is 1,048,576 rows by 16,384 columns, which means that you will have to store all your genes from microarray or RNA-seq experiments in rows. Even then, performing Excel functions on large datasets can take a long time (sometimes several minutes). In some scenarios, the long processing time can cause Excel to shut down, losing your updated file. This means you have to save your files regularly, and each time you press “save” it takes a few minutes, so a lot of time is wasted on waiting. In Python, particularly with JupyterLab, the code is saved automatically and DataFrame dimensions are limited only by your machine’s memory. This also means you can reshape and transpose the data whenever you like, within seconds.
Another point to note is that all the code can be saved and shared, which means the same code can be quickly applied to any other large dataset. This is unlike Excel, which needs you to drag and fill for every dataset you analyse.
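To sketch that reusability (the DataFrames here are toy stand-ins for real result files), the transformation from point 2 can be wrapped in a function and applied to any number of datasets in one line:

```python
import numpy as np
import pandas as pd

def add_stats_columns(df):
    """Add log2FC and -logP columns to a results DataFrame."""
    df = df.copy()  # leave the original table untouched
    df['log2FC'] = np.log2(df['FC'])
    df['negative_log_pvalue'] = -np.log10(df['p_value'])
    return df

# The same function works on every dataset with these columns
dataset_a = pd.DataFrame({'FC': [2.0, 0.5], 'p_value': [0.01, 0.5]})
dataset_b = pd.DataFrame({'FC': [4.0, 1.0], 'p_value': [0.001, 0.9]})
results = [add_stats_columns(df) for df in (dataset_a, dataset_b)]
```

Sharing this one function replaces repeating the same drag-and-fill on every new spreadsheet.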
4. Searching for variables or genes across different datasets is quicker in Python than Excel
Within one dataset, Excel can perform multiple gene queries with the INDEX function. However, for multiple large datasets this is not efficient: you first have to open multiple spreadsheets, and then use the INDEX function to query each and every one, which takes a really long time if you have many spreadsheets. In Python, once you assign your gene names as the index column, a few lines of code can run the same query on every dataset. All the files can be stored together in JupyterLab or Jupyter Notebook, where they are easily accessible from Python code.
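A minimal sketch of such a query (the gene names and expression values are invented for illustration; real tables would be loaded with `pd.read_csv` or `pd.read_excel`):

```python
import pandas as pd

# Two toy expression tables indexed by gene name (illustrative values)
study_a = pd.DataFrame({'expression': [5.1, 2.3]}, index=['TP53', 'BRCA1'])
study_b = pd.DataFrame({'expression': [4.8, 7.7]}, index=['TP53', 'EGFR'])

# Look up the same gene across every dataset in one loop
for name, df in {'study_a': study_a, 'study_b': study_b}.items():
    if 'TP53' in df.index:
        print(name, df.loc['TP53', 'expression'])
```

With the gene names as the index, `df.loc` replaces a separate INDEX lookup in each open spreadsheet.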
5. Python can draw more sophisticated graphs than Excel
With more people sharing their code and developing packages, Python can draw the more sophisticated graphs that are important for data visualisation of large datasets. For instance, hierarchical clustering and heatmaps are easier to produce in Python than in Excel, and the complex calculations required for principal component analysis are also easier to perform in Python.
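As one sketch of such a graph (assuming the seaborn package is installed; the expression matrix is randomly generated for illustration), a hierarchically clustered heatmap with Z-scored rows is a single function call:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs anywhere
import seaborn as sns

# Toy expression matrix: rows are genes, columns are samples
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(10, 4)),
                    index=[f'gene_{i}' for i in range(10)],
                    columns=['s1', 's2', 's3', 's4'])

# Clustered heatmap: z_score=0 standardises each gene (row) before clustering
grid = sns.clustermap(data, z_score=0, cmap='vlag')
grid.savefig('heatmap.png')
```

Producing the equivalent figure in Excel would require computing the Z-scores, the clustering, and the dendrogram layout by hand.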
Overall, based on these advantages, I believe that Python is an essential programming language for data scientists. I would hence strongly recommend learning Python if you are considering a career in data science.