Posted in Data visualisation, python

Building interactive dashboards with Streamlit (II) – Plotting pairplots and scatterplots for numeric variables

In my previous blog entry, I covered the basics of using Streamlit to inspect basic attributes of the dataframe, including numeric and categorical variables.

In this blog entry, we will cover how we can use data visualisation tools in Streamlit for data dashboarding. The advantage of using Streamlit is that we can use Python graph packages, such as Plotly and Altair to plot interactive charts that allow users to hover over the data points to query specific data point attributes. Moreover, users can also utilise widgets or radio buttons in Streamlit plot interactive scatterplots for data dashboarding. Specifically, we will focus on how we can plot pair plots and scatterplots in Streamlit. The codes are publicly available in GitHub (saved under iris.py) and the website instance is hosted at: https://share.streamlit.io/kuanrongchan/iris_dataset/main/iris.py.

To plot pairplots, we will import seaborn. The rationale has been described previously, and we can use the codes we optimised from Jupyter Notebooks into Streamlit. We first need to load the additional packages:

import seaborn as sns

Then we plot the pairplot using the commands (with the header and subheader descriptions) below:

st.header('Visualising relationship between numeric variables')
st.subheader('Pairplot analysis')
g = sns.pairplot(df, vars = ["sepal_length", "sepal_width", "petal_length", "petal_width"], dropna = True, hue = 'species', diag_kind="kde")
g.map_lower(sns.regplot)
st.pyplot(g)

Output file is as follows:

After plotting pair plots, we may want to have a close-up analysis using scatterplots. The advantage of using Streamlit is that we can use widgets to allow users plot a scatterplot of any 2 selected variables.

st.subheader('Scatterplot analysis')
selected_x_var = st.selectbox('What do you want the x variable to be?', df.columns)
selected_y_var = st.selectbox('What about the y?', df.columns)
fig = px.scatter(df, x = df[selected_x_var], y = df[selected_y_var], color="species")
st.plotly_chart(fig)

Output is as shown below. I have annotated the widgets, which allows users to select and click on any variables they desire for scatterplot analysis. You may hover over the data points to zoom into the data point attributes, as this graph was plotted using Plotly.

We can also show the correlation coefficients and the associated p-values for the scatterplots to indicate if their relationship is linearly correlated. We first add and load the required packages:

from scipy.stats import pearsonr
from sklearn import linear_model, metrics
from sklearn.metrics import r2_score
from scipy import stats

We can then calculate the Pearson and Spearman coefficients, together with the associated p-values using the following commands:

#Correlation calculations (Pearson)
st.subheader("Pearson Correlation")
def calc_corr(selected_x_var, selected_y_var):
    corr, p_val = stats.pearsonr(selected_x_var, selected_y_var)
    return corr, p_val
x = df[selected_x_var].to_numpy()
y = df[selected_y_var].to_numpy()
correlation, corr_p_val = calc_corr(x, y)
st.write('Pearson correlation coefficient: %.3f' % correlation)
st.write('p value: %.3f' % corr_p_val)
#Correlation calculations (Spearman)
st.subheader("Spearman Correlation")
def calc_corr(selected_x_var, selected_y_var):
    corr, p_val = stats.spearmanr(selected_x_var, selected_y_var)
    return corr, p_val
x = df[selected_x_var].to_numpy()
y = df[selected_y_var].to_numpy()
correlation, corr_p_val = calc_corr(x, y)
st.write('Spearman correlation coefficient: %.3f' % correlation)
st.write('p value: %.3f' % corr_p_val)

Output is as follows:

Pearson Correlation

Pearson correlation coefficient: -0.109

p value: 0.183

Spearman Correlation

Spearman correlation coefficient: -0.159

p value: 0.051

Hence, sepal length and sepal width are not linearly correlated.

As you may appreciate, the Streamlit commands allow you to organise the data by adding headers and subheaders. In this addition, they provide the tools that allow users to interactively explore their dataset, preventing the need to plot every pairwise permutations for scatterplots. The outcome is a neat and professional dashboard that can be readily deployed and shared with others.

Posted in python

Building interactive dashboards with Streamlit (Part I)

Growing Irises – Planting & Caring for Iris Flowers | Garden Design
Iris flower dataset used to illustrate how we can use Streamlit to build data dashboards

Python is often used in back-end programming, which builds functionality in web applications. For instance, Python can be used to connect the web to a database, so that users can query information. By adding the backend components, Python turns a static webpage into a dynamic web application, where users can interact with the webpage based on an updated and connected database. If data provided is big, developers can even use the machine learning tools in Python to make data predictions. However, front-end development, which is the part of making website beautiful and interactive webpages, is often done with other programming languages such as HTML, CSS and JavaScript. The question remains, do I need to be a full-stack developer to master both front-end and back-end web development? This can be time-consuming as this means you have to learn multiple languages.

This is where Streamlit, Django and Flask comes into the rescue! These are built on the Python programming language, and allows building of websites with minimal knowledge on CSS and JavaScript. Among these frameworks, I personally find Flask and Streamlit easier to learn and implement. However, with more experience, I decided to focus on Streamlit as the syntax are easier to understand and the deployment is more rapid.

Rather than going through Streamlit commands one-by-one, I will instead illustrate the functionality of the different commands using my favourite iris dataset which is readily available in PlotlyExpress. The GitHub repository is publicly available, and the output of the codes will be hosted here.

First, we will create a file called iris.py using Sublime text or VisualStudio. In this blog entry, we will focus on acquiring the basic statistics of the individual columns, and present this information in a data dashboard for project showcasing. As with all Python commands, we need to import the required packages:

import streamlit as st
import numpy as np
import pandas as pd
import plotly.express as px
from wordcloud import WordCloud
from typing import Any, List, Tuple

After that, we type in the commands needed to read the iris dataframe. The data is available in PlotlyExpress and can be loaded into Streamlit with the following commands. The title of the dataframe is also included for data dashboarding:

st.title('Data analysis of iris dataset from PlotlyExpress')
df = px.data.iris()

Next, the basic stats such as counts, mean, standard deviation, min, max and quantiles can be displayed using the command df.describe(). However, the advantage of Streamlit is the ability to add widgets, thus preventing the dashboard from looking too cluttered. In this example, we will look into creating widgets for the variables in the header columns for users to determine the basic stats in the specific columns:

col1, col2, col3, col4, col5 = st.columns(5)
col1.metric("Mean", column.mean())
col2.metric("Max", column.max())
col3.metric("Min", column.min())
col4.metric("Std", column.std())
col5.metric("Count", int(column.count()))

The output will generate 5 columns, showing the mean, max, min, standard deviation and counts in each respective columns. The widget can be visualised here.

These statistics are appropriate for numeric or continuous variables. However, to visualise the categorical variables (in this case, the flower species), we can use a word cloud or a table to know the exact counts for each species using the following commands:

# Build wordcloud
non_num_cols = df.select_dtypes(include=object).columns
column = st.selectbox("Select column to generate wordcloud", non_num_cols)
column = df[column]
wc = WordCloud(max_font_size=25, background_color="white", repeat=True, height=500, width=800).generate(' '.join(column.unique()))
st.image(wc.to_image())

# Build table
st.markdown('***Counts in selected column for generating WordCloud***')
unique_values = column.value_counts()
st.write(unique_values)

The aim for part I is to provide various solutions to inspect the distribution of your categorical and continuous variables within your dataset. We will slowly cover the other topics including data visualisation, machine learning and interactive graphical plots in my subsequent posts!

Posted in python

Streamlit: My recommended tool for data science app development

Streamlit is a web application framework that helps in the building and development of Python-based web applications to share data, build data dashboards, app development and even build machine learning models. In addition, developing and deploying Streamlit apps is amazingly quick and versatile, allowing the development of simple apps in a few hours. In fact, I was able to make a simple app within a day after reading this book:

Getting Started with Streamlit for Data Science | Packt
My favourite book for Streamlit

This book summarises the basics of Streamlit, and provides realistic solutions in executing python codes and simple app development. Streamlit is quite newly developed so there are very few books available to learn from. Nonetheless, from the examples provided in the book, I was able to make a simple scatterplot app. We can first load the dataset to make sure we are analysing the correct dataset:

The commands that make the above section are as follows:

st.title() # Gives the header title
st.markdown() # Can provide descriptions of the title in smaller fonts
st.file_uploader() # Creates a file uploader for users to drag and drop
st.write() # Loads the details of the dataset 

Next, we want to create widgets that allow users to select their x and y-axis variables:

To make the above section, the relevant commands are as follows:

st.selectbox() #To create widgets for users to select
px.scatter() # For plotting scatterplot with plotly express
plotly_chart() # To plot the figure out

Why use Plotly? This is because Plotly allows you to interact with the graph, including zoom in and zoom out functions, mousing over data points to determine data point attributes and selection functions to crop the range of x- and y-axis.

Deploying in Streamlit is fast, but the initial steps of setting up can be time-consuming, especially if this is the first time you are trying out. The essential steps are as follows:
1. Create a GitHub account
2. Contact the Streamlit Team to allow the developers to connect Streamlit to your GitHub account
3. Create the GitHub repository. You can choose to make a public or private repository. To make the requirements.txt file, make sure you download by typing the following command: pip install pipreqs.
4. Create apps within the Streamlit account by adding the Github repository address and specifying the Python file to execute.

And that’s it! Your web address will start with: https://share.streamlit.io and the deployed website can be shared publicly to anyone. Furthermore, any changes you make within the GitHub repository can be immediately updated into the deployed website. I appreciate the responsiveness and the speed of deploying the website once everything is set up. Finally, for an icing in the cake, you can even convert the weblink into a desktop app with these specific instructions! If you are into app development and you want to stick to the Python language, I would strongly recommend Streamlit. It’s simplicity from coding to execution and deployment is just so attractive!

Posted in python

My experience with Voilà and Streamlit for building data dashboards

Data engineer vs. Data scientist- What does your company need?
Differences between data engineer vs scientist. Source

The role of data scientist is clear: To analyse the data, plot visualisation graphs and consolidate the findings into a report. However, with greater interest in deeper understanding of big data and urgent need for more novel tools to gain insights from biological datasets, there is a growing interest in employing data engineers. Their roles and responsibilities can include app development, constructing pipelines, data testing and maintaining architectures such as databases and large-scale processing systems. This is unlike data scientists, who are mostly involved in data cleaning, data visualisation and big data organisation.

One aspect is to implement strategies to improve reliability, efficiency and quality. To ensure consistency, implementing data dashboards is important. This will require moving out of the comfort zone of just reporting data within Jupyter Notebooks. To build data dashboards, Javascript is often used. However, recently, there are packages that can be implemented in Python (which means that you don’t have to learn another language). These packages include Voilà, Panel, Dash and Streamlit. On my end, I have tried Voilà and Streamlit as they are both easier to implement as compared to Panel and Dash. This blog will hence compare my experience with Voilà and Streamlit.

The installation of Voilà, and the associated templates is relatively straight-forward. You just need to execute these codes to download the packages:

pip install voila
pip install voila-gridstack
pip install voila-vuetify

Once the packages are installed in your environment, you should be able to see the extensions in the Jupyter notebook (indicated in arrow). Clicking on them will execute the output files from python codes.

With the gridstack or the vuetify templates, you can further manipulate and reorder your output files to display your graphs in your dashboard. The dashboard can then be deployed using Heroku or deposited in GitHub for deployment in mybinder.org.

As you can imagine, if you enjoy working within Jupyter Notebooks, Voilà can be a simple and convenient tool to make data dashboards. You can also make the dashboard interactive by using iPywidgets, Plotly, Altair or Bokeh. However, a severe limitation is that it is difficult to do multi-pages. This can be an issue if you are developing multiple dashboards, or multiple graphs from different studies.

My delay in this blog post is because I have spent much of my time in finding alternatives for building multi-pages. This is where I learnt about Streamlit. I was very impressed at how we can use simple python codes to develop beautiful dashboards, and I was able to build a simple webpage/dashboard with a few hours of reading online tutorials. With more readings, I was even able to make some simple apps! Using Streamlit is as simple as:

  1. Open terminal window
  2. Install Streamlit
  3. Create a .py file using text editors such as sublime text (my preferred choice), atom or visual code
  4. And then execute file by typing the code in terminal: streamlit run filename.py

You can install streamlit by using:

pip install streamlit

In addition to these cool features, Streamlit is able to do multi-pages, which means you can create multiple data dashboards or multiple apps within a single website. Finally, the deployment is also relatively simple with Streamlit teams, which is attractive. However, if you prefer to work within Jupyter Notebooks, this may not be a great option for you as the commands are mostly executed via terminal or in .py files. The other limitation which I haven’t found a solution is related to security, where I do not know how to configure in such a way that only allows registered users to use the website.

Overall, deciding on which platform to use will depend on your personal preferences and applications. I prefer Streamlit as it is more versatile and flexible, which may explain why it’s increasing popularity in these recent years!

Streamlit vs Dash vs Voilà vs Panel — Battle of The Python Dashboarding  Giants | by Stephen Kilcommins | DataDrivenInvestor