Rather than going through Streamlit commands one by one, I will illustrate what the different commands do using my favourite dataset, iris, which is readily available in Plotly Express. The GitHub repository is publicly available, and the output of the code will be hosted here.
First, we will create a file called iris.py using Sublime Text or Visual Studio. In this blog entry, we will focus on computing the basic statistics of individual columns and presenting them in a data dashboard for project showcasing. As with any Python script, we first need to import the required packages:
    import streamlit as st
    import numpy as np
    import pandas as pd
    import plotly.express as px
    from wordcloud import WordCloud
    from typing import Any, List, Tuple
After that, we add the commands needed to read the iris dataframe. The data ships with Plotly Express and can be loaded with px.data.iris(). We also give the dashboard a title:
    st.title('Data analysis of iris dataset from PlotlyExpress')
    df = px.data.iris()
Next, basic stats such as count, mean, standard deviation, min, max and quantiles can be displayed for every numeric column at once using df.describe(). However, the advantage of Streamlit is the ability to add widgets, which keeps the dashboard from looking too cluttered. In this example, we will create a widget that lets users pick a column by its header and see the basic stats for that column only:
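    # Let the user pick which numeric column to summarise (widget label is illustrative)
    num_cols = df.select_dtypes(include="number").columns
    column = st.selectbox("Select column to summarise", num_cols)
    column = df[column]

    # Show the five summary statistics side by side
    col1, col2, col3, col4, col5 = st.columns(5)
    col1.metric("Mean", column.mean())
    col2.metric("Max", column.max())
    col3.metric("Min", column.min())
    col4.metric("Std", column.std())
    col5.metric("Count", int(column.count()))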
The output is a row of 5 columns showing the mean, max, min, standard deviation and count for the selected column. The widget can be visualised here.
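To make it clearer what each metric actually holds, here is a minimal sketch of the same five statistics computed with plain pandas, outside of Streamlit. The helper name column_stats and the toy series are my own assumptions for illustration; they are not part of the repository, and the numbers are not real iris values.

```python
import pandas as pd


def column_stats(column: pd.Series) -> dict:
    """Return the five summary values shown in the metric row (hypothetical helper)."""
    return {
        "Mean": round(float(column.mean()), 2),
        "Max": float(column.max()),
        "Min": float(column.min()),
        "Std": round(float(column.std()), 2),  # pandas default: sample std (ddof=1)
        "Count": int(column.count()),          # counts non-null values only
    }


# Toy stand-in for a numeric column; these are NOT real iris measurements.
toy = pd.Series([1.0, 2.0, 3.0, 4.0], name="toy_length")
stats = column_stats(toy)
print(stats)
```

Rounding the mean and standard deviation is a design choice on my part so that the metric cards stay short. In the dashboard itself, a dictionary like this could be fed to the cards with something along the lines of `for col, (label, value) in zip(st.columns(5), stats.items()): col.metric(label, value)`.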
These statistics are appropriate for numeric or continuous variables. To visualise the categorical variables (in this case, the flower species), we can use a word cloud, together with a table giving the exact counts for each species, using the following commands:
    # Build wordcloud
    non_num_cols = df.select_dtypes(include=object).columns
    column = st.selectbox("Select column to generate wordcloud", non_num_cols)
    column = df[column]
    wc = WordCloud(
        max_font_size=25,
        background_color="white",
        repeat=True,
        height=500,
        width=800,
    ).generate(' '.join(column.unique()))
    st.image(wc.to_image())

    # Build table
    st.markdown('***Counts in selected column for generating WordCloud***')
    unique_values = column.value_counts()
    st.write(unique_values)
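Because the word cloud above is built from column.unique(), every species appears exactly once, so the font sizes do not reflect how often each value occurs; that is why the table of counts is shown alongside it. As a rough sketch of how the counts could be turned into a plain dictionary, assuming a hypothetical helper called category_frequencies and a made-up toy column:

```python
import pandas as pd


def category_frequencies(column: pd.Series) -> dict:
    """Map each category label to its row count (hypothetical helper)."""
    counts = column.value_counts()  # sorted by count, largest first
    return {str(label): int(n) for label, n in counts.items()}


# Toy stand-in for a categorical column; the counts are made up.
toy = pd.Series(["setosa", "setosa", "versicolor", "virginica", "setosa"])
freqs = category_frequencies(toy)
print(freqs)
```

If you would like the word cloud itself to carry the counts, a dictionary like this can be passed to WordCloud's generate_from_frequencies method instead of calling generate on the joined unique values. I have not done this in the dashboard, so treat it as an option rather than what the repository does.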
The aim of part I is to provide various ways to inspect the distribution of the categorical and continuous variables in your dataset. I will gradually cover other topics, including data visualisation, machine learning and interactive graphical plots, in my subsequent posts!