I have spent my last week building my lab webpage, so that people can better understand what our lab is doing. The webpage is built with Zyro, and hosted at kuanrongchan.com. If interested to work with us, please do not hesitate to contact us! đź™‚

# Tag: datascience

## Building interactive dashboards with Streamlit (II) – Plotting pairplots and scatterplots for numeric variables

In my previous blog entry, I covered the basics of using Streamlit to inspect basic attributes of the dataframe, including numeric and categorical variables.

In this blog entry, we will cover how we can use data visualisation tools in Streamlit for data dashboarding. The advantage of using Streamlit is that we can use Python graph packages, such as Plotly and Altair to plot interactive charts that allow users to hover over the data points to query specific data point attributes. Moreover, users can also utilise widgets or radio buttons in Streamlit plot interactive scatterplots for data dashboarding. Specifically, we will focus on how we can plot pair plots and scatterplots in Streamlit. The codes are publicly available in GitHub (saved under iris.py) and the website instance is hosted at: https://share.streamlit.io/kuanrongchan/iris_dataset/main/iris.py.

To plot pairplots, we will import seaborn. The rationale has been described previously, and we can use the codes we optimised from Jupyter Notebooks into Streamlit. We first need to load the additional packages:

```
import seaborn as sns
```

Then we plot the pairplot using the commands (with the header and subheader descriptions) below:

```
st.header('Visualising relationship between numeric variables')
st.subheader('Pairplot analysis')
g = sns.pairplot(df, vars = ["sepal_length", "sepal_width", "petal_length", "petal_width"], dropna = True, hue = 'species', diag_kind="kde")
g.map_lower(sns.regplot)
st.pyplot(g)
```

Output file is as follows:

After plotting pair plots, we may want to have a close-up analysis using scatterplots. The advantage of using Streamlit is that we can use widgets to allow users plot a scatterplot of any 2 selected variables.

```
st.subheader('Scatterplot analysis')
selected_x_var = st.selectbox('What do you want the x variable to be?', df.columns)
selected_y_var = st.selectbox('What about the y?', df.columns)
fig = px.scatter(df, x = df[selected_x_var], y = df[selected_y_var], color="species")
st.plotly_chart(fig)
```

Output is as shown below. I have annotated the widgets, which allows users to select and click on any variables they desire for scatterplot analysis. You may hover over the data points to zoom into the data point attributes, as this graph was plotted using Plotly.

We can also show the correlation coefficients and the associated p-values for the scatterplots to indicate if their relationship is linearly correlated. We first add and load the required packages:

```
from scipy.stats import pearsonr
from sklearn import linear_model, metrics
from sklearn.metrics import r2_score
from scipy import stats
```

We can then calculate the Pearson and Spearman coefficients, together with the associated p-values using the following commands:

```
#Correlation calculations (Pearson)
st.subheader("Pearson Correlation")
def calc_corr(selected_x_var, selected_y_var):
corr, p_val = stats.pearsonr(selected_x_var, selected_y_var)
return corr, p_val
x = df[selected_x_var].to_numpy()
y = df[selected_y_var].to_numpy()
correlation, corr_p_val = calc_corr(x, y)
st.write('Pearson correlation coefficient: %.3f' % correlation)
st.write('p value: %.3f' % corr_p_val)
#Correlation calculations (Spearman)
st.subheader("Spearman Correlation")
def calc_corr(selected_x_var, selected_y_var):
corr, p_val = stats.spearmanr(selected_x_var, selected_y_var)
return corr, p_val
x = df[selected_x_var].to_numpy()
y = df[selected_y_var].to_numpy()
correlation, corr_p_val = calc_corr(x, y)
st.write('Spearman correlation coefficient: %.3f' % correlation)
st.write('p value: %.3f' % corr_p_val)
```

Output is as follows:

### Pearson Correlation

Pearson correlation coefficient: -0.109

p value: 0.183

### Spearman Correlation

Spearman correlation coefficient: -0.159

p value: 0.051

Hence, sepal length and sepal width are not linearly correlated.

As you may appreciate, the Streamlit commands allow you to organise the data by adding headers and subheaders. In this addition, they provide the tools that allow users to interactively explore their dataset, preventing the need to plot every pairwise permutations for scatterplots. The outcome is a neat and professional dashboard that can be readily deployed and shared with others.

## My experience with VoilĂ and Streamlit for building data dashboards

The role of data scientist is clear: To analyse the data, plot visualisation graphs and consolidate the findings into a report. However, with greater interest in deeper understanding of big data and urgent need for more novel tools to gain insights from biological datasets, there is a growing interest in employing data engineers. Their roles and responsibilities can include app development, constructing pipelines, data testing and maintaining architectures such as databases and large-scale processing systems. This is unlike data scientists, who are mostly involved in data cleaning, data visualisation and big data organisation.

One aspect is to implement strategies to improve reliability, efficiency and quality. To ensure consistency, implementing data dashboards is important. This will require moving out of the comfort zone of just reporting data within Jupyter Notebooks. To build data dashboards, Javascript is often used. However, recently, there are packages that can be implemented in Python (which means that you don’t have to learn another language). These packages include VoilĂ , Panel, Dash and Streamlit. On my end, I have tried VoilĂ and Streamlit as they are both easier to implement as compared to Panel and Dash. This blog will hence compare my experience with VoilĂ and Streamlit.

The installation of VoilĂ , and the associated templates is relatively straight-forward. You just need to execute these codes to download the packages:

```
pip install voila
pip install voila-gridstack
pip install voila-vuetify
```

Once the packages are installed in your environment, you should be able to see the extensions in the Jupyter notebook (indicated in arrow). Clicking on them will execute the output files from python codes.

With the gridstack or the vuetify templates, you can further manipulate and reorder your output files to display your graphs in your dashboard. The dashboard can then be deployed using Heroku or deposited in GitHub for deployment in mybinder.org.

As you can imagine, if you enjoy working within Jupyter Notebooks, VoilĂ can be a simple and convenient tool to make data dashboards. You can also make the dashboard interactive by using iPywidgets, Plotly, Altair or Bokeh. However, a severe limitation is that it is difficult to do multi-pages. This can be an issue if you are developing multiple dashboards, or multiple graphs from different studies.

My delay in this blog post is because I have spent much of my time in finding alternatives for building multi-pages. This is where I learnt about Streamlit. I was very impressed at how we can use simple python codes to develop beautiful dashboards, and I was able to build a simple webpage/dashboard with a few hours of reading online tutorials. With more readings, I was even able to make some simple apps! Using Streamlit is as simple as:

- Open terminal window
- Install Streamlit
- Create a .py file using text editors such as sublime text (my preferred choice), atom or visual code
- And then execute file by typing the code in terminal: streamlit run filename.py

You can install streamlit by using:

```
pip install streamlit
```

In addition to these cool features, Streamlit is able to do multi-pages, which means you can create multiple data dashboards or multiple apps within a single website. Finally, the deployment is also relatively simple with Streamlit teams, which is attractive. However, if you prefer to work within Jupyter Notebooks, this may not be a great option for you as the commands are mostly executed via terminal or in .py files. The other limitation which I havenâ€™t found a solution is related to security, where I do not know how to configure in such a way that only allows registered users to use the website.

Overall, deciding on which platform to use will depend on your personal preferences and applications. I prefer Streamlit as it is more versatile and flexible, which may explain why itâ€™s increasing popularity in these recent years!

## Five reasons why python is better than excel in processing big omics datasets

Big data analysis with python from pandas package can be easier than excel. Picture source from https://images.app.goo.gl/wq34AegEWn73BcCQ8

My self-learning with dataquest has allowed me to use python to make panda dataframes that allows basic data query, creating columns based on pre-defined formulas and graph plotting. Despite my basic knowledge of python, I have begun to appreciate that python could allow more efficient data processing compared to excel. Some of the main advantages of using python are:

1. Processed DataFrames are assigned in codes, mitigating the need for storing multiple files or spreadsheets.

For instance, a data scientist will usually want to convert their values to Z-score before doing principal component analysis, hierarchical clustering or plotting heatmaps. The conversion in excel is easy if you are analysing a small dataset. However, if analysing a dataset with dimensions of 20 variables with 20,000 genes (with total of 400,000 numerical values), the user will usually have to save the file in a different excel file to prevent lagging issues. In python, the conversion can be easily stored using the following code as an example:

```
from sklearn.preprocessing import StandardScaler
import numpy as np
data = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
```

In this case, the Z-score transformed data is assigned under ‘scaled data’, and you can subsequently do a PCA or heatmap with this ‘scaled_data’.

2. Python is less prone to errors compared to excel

Using excel files to analyse small datasets is convenient and my preferred choice. However, with large datasets, the drag and drop function is not perfect. For instance, a lot of time and effort is needed to drag the cells if you have a large number of variables. The drop function is also not useful if you have empty cells in your dataset. Finally, combining â€śfilterâ€ť or â€śsortâ€ť functions together with formulas is cumbersome in excel, especially for large datasets. In python, the formulas can be easily stored in a new column. For instance, to add new columns of log2FC and -logP values into a dataset (assigned as dataset in this case), we just need to use the below command:

```
# Command generates log2FC and -logP column
dataset['log2FC'] = np.log2(dataset['FC'])
dataset['negative_log_pvalue'] = np.log10(dataset['p_value'])*-1
```

We can then use these new columns to plot graphs, such as volcano plots and p-value histograms.

3. Python is more time and space efficient in processing of large datasets

The maximum limit of excel is 1,048,576 rows by 16,384 columns, which means that you will have to store all your genes from microarray or RNAseq in rows. Even if you stored all your genes in rows, performing excel functions in large datasets can take a long time (can take several minutes). In some scenarios, due to the long processing time, the excel program may shutdown, causing you to lose your updated file. This means that you will have to save your files regularly and each time you press â€śsaveâ€ť, it takes a few minutes, which means a lot of time wasted on waiting. In python, particularly with Jupyterlab, the codes are automatically saved and the DataFrame dimensions are limitless. This also means you can manage and transpose the files whenever you like, within seconds.

Another point to note is that all the codes can be saved and shared, which means that the same codes can be quickly applied to any other large datasets. This is unlike excel which will need you to drag and drop for every dataset you are analysing.

4. Searching for variables or genes across different datasets are quicker in python than excel

Within one dataset, excel is able to do multiple gene queries with the INDEX function. However, for multiple large datasets, this is not efficient. First, you will have to open multiple spreadsheets, and then use the INDEX function to query each and every spreadsheet. Overall, this takes a really long time if you have many spreadsheets. In python, when you assign your gene names to be the index column, you can use the codes to perform the specific query for every dataset. All the files can be nicely stored in Jupyterlab or Jupyternotebook, which means that the files can be easily accessible by python codes.

5. Python can draw more sophisticated graphs than excel

With more people sharing their codes and developing packages, Python can draw more sophisticated graphs which is important for data visualisation of large datasets. For instance, python can do hierarchical clustering and heatmaps easier than excels. Complex functions required for principal component analyses is easier to perform in python compared to excel.

Overall, based on these advantages, I believe that python is an essential programming language for data scientists. I would hence strongly recommend learning python if you are considering a career in data science.

## Introducing Jupyterlab for efficient bioinformatic workflows

In my future blog entries, I will be using and explaining how the *Python Programming* Language can be used for data visualisation. To run some of the commands and codes, I recommend downloading JupyterLab, which is a web-based interactive development environment for storing Jupyter notebooks, code, and data. The advantage of using JupyterLab is that it is free to download, open-source, flexible and supports multiple computer languages, including Julia, Python, and R. If needed, you can even download the SQL plug-in to execute SQL commands in Jupyterlab.

The process of downloading is simple:

- Visit the Anaconda website, and click on the download icon. Download the 64-bit Graphical Installer based on your computer OS.
- Open the package after download, and follow the instructions to download Anaconda into your computer.
- Launch the Anaconda-Navigator by clicking the icon. For Mac users, the icon should appear under the “Applications” tab
- Launch JupyterLab, choose Python3 notebook, which will eventually direct you to the notebook serverâ€™s URL.

You can import your .csv or..txt datafiles directly into JupyterLab to start analysing your dataset in Python. You can also export your notebook as a Jupyter Interactive Notebook (.ipynb file format) if you’d like to share the codes with another person. I believe that JupyterLab will enable more efficient workflows, regardless of tool or language.