Data warehousing makes multiple datasets easier to access, which increases the speed and efficiency of data analysis. Storing and categorising processed datasets in one specified location allows data scientists to query and analyse them quickly. It also facilitates advanced analyses such as predictive modelling and machine learning.
Here, I introduce BENEATH, a data warehouse platform for storing multiple datasets. Several features make this platform cool:
- A user-friendly interface makes it easy to deposit dataframes
- SQL can be used to run various kinds of data queries
- Python commands are enough to import and export files, so no background in other programming languages is required
- Permissions are easy to set. One command allows other users to create or view dataframes. You also have the option to make your database public.
- Free of charge (unless your datasets are huge)
- Responsive community to accelerate troubleshooting
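To give a flavour of the SQL point above, here is a minimal sketch of the kind of query I have in mind. It runs against an in-memory SQLite database from the Python standard library as a local stand-in, so it is not BENEATH's own query interface. The table name test mirrors the example used later, while the titer column and the values are made up purely for illustration.

```python
import sqlite3

# Local stand-in only: an in-memory SQLite database, not BENEATH itself.
conn = sqlite3.connect(":memory:")

# Hypothetical table: one row per subject, with a made-up "titer" column.
conn.execute("CREATE TABLE test (subject TEXT PRIMARY KEY, titer INTEGER)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?)",
    [("S01", 120), ("S02", 340), ("S03", 80)],
)

# A typical filter-and-sort query, with the threshold passed as a parameter.
rows = conn.execute(
    "SELECT subject, titer FROM test WHERE titer > ? ORDER BY subject",
    (100,),
).fetchall()
print(rows)

conn.close()
```

The same SELECT ... WHERE pattern is what I would expect to reuse on a real table, although the exact SQL dialect supported by the platform may differ.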
I will break down the individual steps involved in depositing a dataframe in BENEATH:
First, install BENEATH by running the command below in a Jupyter Notebook (you only need to do this once):
pip install --upgrade beneath
Next, set up an account at https://beneath.dev/. Then, at the command line (Terminal), execute the command below and follow the subsequent instructions to authenticate your account.
So far, we have installed the required package and completed the authentication procedure. We are now ready to deposit any dataframe into BENEATH. In this example, we will import a .csv file (named test) into Python using the command below in a Jupyter Notebook. The dataframe is saved under the variable “df”.
import csv
import numpy as np
import pandas as pd
df = pd.read_csv('/Users/kuanrongchan/Desktop/test.csv')
df.head(2)
Output file is as follows:
Using your username (kuanrongchan in my case), we can create a folder to store this dataframe. In this example, we will create a folder named vaccine_repository. You will have to execute the command below at the command line (Terminal):
beneath project create kuanrongchan/vaccine_repository
Finally, to deposit your dataset, execute the command below. The table_path directs the dataframe to the assigned folder. You can then set the index with the “key” argument. In this case, I have assigned subject as the index key, which will allow you to query subject IDs quickly in future. You can also add a description to record any other details of this dataset.
import beneath
await beneath.write_full(
    table_path="kuanrongchan/vaccine_repository/test",
    records=df,
    key=["subject"],
    description="my first test file"
)
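Before running the upload above, it may be worth checking that the column chosen as the key is actually suitable. The sketch below assumes, as is typical for an index key, that each value should be present and should identify exactly one row. The helper check_key_column and the titer column are hypothetical names of my own, not part of BENEATH.

```python
import pandas as pd

def check_key_column(df, key):
    """Return a list of problems found in the proposed key column (empty if none)."""
    if key not in df.columns:
        return [f"column '{key}' not found"]
    problems = []
    # Missing values cannot identify a row.
    n_missing = int(df[key].isna().sum())
    if n_missing:
        problems.append(f"{n_missing} missing value(s) in '{key}'")
    # Repeated values mean the key does not single out one row.
    n_dupes = int(df[key].dropna().duplicated().sum())
    if n_dupes:
        problems.append(f"{n_dupes} duplicated value(s) in '{key}'")
    return problems

# Made-up example data with one missing and one repeated subject ID.
example = pd.DataFrame(
    {"subject": ["S01", "S02", "S02", None], "titer": [120, 340, 360, 80]}
)
print(check_key_column(example, "subject"))
```

An empty list suggests the column is safe to pass as key=["subject"]; anything else is a hint to clean the dataframe first.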
Now, for the cool part: You can execute a simple command to quickly import this dataframe into Jupyter Notebook for data analysis:
df = await beneath.load_full("kuanrongchan/vaccine_repository/test")
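One thing to keep in mind: a bare await like the one above works in a Jupyter Notebook, but not at the top level of an ordinary Python script. A minimal sketch of one way around this, using asyncio.run from the standard library, is shown below. The helper run_blocking and the stand-in coroutine fake_load_full are hypothetical and only mimic the shape of an async loader so that the example runs without a BENEATH account.

```python
import asyncio

def run_blocking(async_fn, *args, **kwargs):
    """Run an async function to completion from ordinary (non-async) code."""
    # asyncio.run creates an event loop, runs the coroutine, then closes the loop.
    return asyncio.run(async_fn(*args, **kwargs))

async def fake_load_full(table_path):
    # Hypothetical stand-in for an async loader; returns dummy metadata only.
    await asyncio.sleep(0)
    return {"table_path": table_path, "rows": 2}

result = run_blocking(fake_load_full, "kuanrongchan/vaccine_repository/test")
print(result)
```

In a script, the idea would be to pass the real async function in place of fake_load_full. Inside Jupyter, stick with plain await, because asyncio.run raises an error when an event loop is already running.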
To share this folder with a friend, you can execute the command below. In this case, assume my friend's BENEATH username is abc.
beneath project update-permissions kuanrongchan/vaccine_repository abc --view --create
Multiple datasets can easily be deposited in the platform using the commands described above. However, a current limitation of BENEATH is the lack of data visualisation tools, which means that graphs have to be produced in Jupyter Notebooks. The developers are currently working on this aspect, which should make BENEATH a great data warehouse platform for data scientists.