Posted in python, Research tools

Introducing BENEATH as a data warehouse platform

Data warehousing makes multiple datasets accessible from a single location, increasing the speed and efficiency of data analysis. By storing and cataloguing processed datasets in one specified place, a data warehouse allows data scientists to quickly query and analyse several datasets at once. It also facilitates advanced analyses such as predictive modelling and machine learning.

Here, I introduce BENEATH as a data warehouse platform for storing multiple datasets. There are several features that make this platform stand out:

  1. A user-friendly interface that makes it easy to deposit dataframes
  2. The ability to use SQL to run various kinds of data queries
  3. Python commands are sufficient to import and export files, so no background in other programming languages is required
  4. Permissions are easy to set: a single command allows other users to create or view dataframes, and you can also choose to make your database public
  5. Free of charge (unless your datasets are huge)
  6. A responsive community that accelerates troubleshooting

I will break down the individual steps involved in depositing a dataframe in BENEATH:

First, install BENEATH by executing the command below in a Jupyter Notebook cell (you only need to do this once; the leading ! runs the command in the shell):

!pip install --upgrade beneath

Next, set up an account at https://beneath.dev/. Thereafter, in the command line (Terminal), execute the command below and follow the subsequent instructions to authenticate your account.

beneath auth

Thus far, we have installed the required packages and completed the authentication procedure. We are now ready to deposit a dataframe into BENEATH. In this example, we import a .csv file (named test) into Python using the commands below in a Jupyter Notebook, saving the dataframe under the variable “df”:

import pandas as pd

df = pd.read_csv('/Users/kuanrongchan/Desktop/test.csv')
df.head(2)

The output is as follows:

   subject     ab30d     ab60d     ab90d   tcell7d
0        1  0.284720  0.080611  0.997133  0.151514
1        2  0.163844  0.818704  0.459975  0.143680

Next, we can create a project (folder) under your username (mine is kuanrongchan) to store this dataframe. In this example, we create a project named vaccine_repository. Execute the command below in the command line (Terminal):

beneath project create kuanrongchan/vaccine_repository

Finally, to deposit your dataset, execute the command below. The table_path directs the dataframe to the project created above. You can then set the index with the “key” argument; in this case, I have assigned subject as the index key, which will allow you to query subject IDs quickly in future. You can also add a description to include any other details of this dataset.

import beneath

await beneath.write_full(
    table_path="kuanrongchan/vaccine_repository/test",
    records=df,
    key=["subject"],
    description="my first test file"
)

Now, for the cool part: You can execute a simple command to quickly import this dataframe into Jupyter Notebook for data analysis:

df = await beneath.load_full("kuanrongchan/vaccine_repository/test")
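Once loaded, the dataframe behaves like any other pandas dataframe. As a quick sketch (using the test table above, and assuming the subject IDs are stored as integers), you could look up individual subjects or filter on the antibody readings:

subjects = df.set_index("subject")   # index by the key column
print(subjects.loc[1])               # readings for subject 1
print(df[df["ab30d"] > 0.2])         # subjects with an ab30d reading above 0.2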

To share this project with a friend, you can execute the command below. In this case, assume my friend's BENEATH username is abc.

beneath project update-permissions kuanrongchan/vaccine_repository abc --view --create

Multiple datasets can easily be deposited in the platform using the commands described above. The current limitation of BENEATH, however, is the lack of built-in data visualisation tools, which means that graphs have to be generated in Jupyter Notebooks. The developers are working on this aspect, which should make BENEATH an even better data warehouse platform for data scientists.
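In the meantime, plotting can be done in the notebook straight after loading the dataframe. Below is a minimal sketch (assuming matplotlib is installed, and assuming the ab30d/ab60d/ab90d columns of the test table are antibody readings at days 30, 60 and 90):

import matplotlib.pyplot as plt

df = await beneath.load_full("kuanrongchan/vaccine_repository/test")

# Plot the three antibody time points for each subject
ax = df.set_index("subject")[["ab30d", "ab60d", "ab90d"]].T.plot(marker="o")
ax.set_xlabel("Time point")
ax.set_ylabel("Antibody reading")
plt.show()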

Posted in python, Research tools

Introducing JupyterLab for efficient bioinformatic workflows

In future blog entries, I will be showing how the Python programming language can be used for data visualisation. To run some of the commands and code, I recommend downloading JupyterLab, a web-based interactive development environment for Jupyter notebooks, code, and data. The advantages of JupyterLab are that it is free to download, open-source, flexible, and supports multiple programming languages, including Julia, Python, and R. If needed, you can even download the SQL plug-in to execute SQL commands in JupyterLab.

The process of downloading is simple:

  1. Visit the Anaconda website and click on the download icon. Download the 64-bit graphical installer for your operating system.
  2. Open the package after the download completes and follow the instructions to install Anaconda on your computer.
  3. Launch Anaconda Navigator by clicking its icon. For Mac users, the icon appears under the “Applications” folder.
  4. Launch JupyterLab and choose a Python 3 notebook, which will direct you to the notebook server’s URL.

You can import your .csv or .txt datafiles directly into JupyterLab to start analysing your dataset in Python. You can also export your notebook as a Jupyter Interactive Notebook (.ipynb file format) if you would like to share the code with someone else. I believe that JupyterLab will enable more efficient workflows, regardless of tool or language.
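As a small illustration (the file names here are only placeholders), loading a dataset into a notebook takes just a couple of lines of pandas:

import pandas as pd

df_csv = pd.read_csv("my_data.csv")             # comma-separated file
df_txt = pd.read_csv("my_data.txt", sep="\t")   # tab-delimited text file
df_csv.head()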

Posted in Research tools

Beware of excel autocorrect feature when analysing gene lists

Excel is convenient for handling big datasets, but beware of this problem: the software autocorrects some gene symbols into dates (for example, SEPT1 and MARCH1 become 1-Sep and 1-Mar). Without a gene description column in your datasheet, these conversions are irreversible, as there is no way to recover the gene name from the date alone.

One way to circumvent this issue is to always open a blank Excel sheet and open the .txt or .csv file via “File -> Open.” At the last step of the import wizard (step 3; see figure at top), assign the appropriate columns as text. If your gene names are not in the first column, you will have to manually click on the column that contains them.

If you have accidentally opened your Excel file without doing this, the other option is to assign the gene symbols based on the gene description. To do this, sort your gene symbols in ascending order, which will quickly surface the gene names that have been converted to dates. Then format the entire column as text and, based on the table below, assign each converted date back to the correct gene symbol.
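Alternatively, the raw .txt or .csv file can be handled in Python, which does not autocorrect text to dates. The sketch below is only illustrative (the file name, the gene_symbol column name, and the handful of date-to-gene mappings are assumptions; the full mapping depends on your annotation) and shows how a file that has already been mangled by Excel could be repaired:

import pandas as pd

# pandas reads the gene symbols as plain text and leaves them untouched
genes = pd.read_csv("gene_list.csv")

# Illustrative mapping from Excel-converted dates back to gene symbols
date_to_gene = {"1-Sep": "SEPT1", "1-Mar": "MARCH1", "1-Dec": "DEC1"}
genes["gene_symbol"] = genes["gene_symbol"].replace(date_to_gene)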

Posted in Research tools

JNRLclub: A quick way to keep up with the scientific literature

Introducing JNRLclub, one of the best platforms for summarising scientific publications into ~10-minute science talk videos. In 2019, I was very privileged to present the findings of my research paper, Metabolic perturbations and cellular stress underpin susceptibility to symptomatic live-attenuated yellow fever infection, published in Nature Medicine (see video on top). Since then, I have been on a constant lookout for science talks published by JNRLclub, particularly those related to infectious diseases and omics research. I would hence recommend JNRLclub as one of the quickest ways to keep up with the scientific literature.