It’s a great sense of achievement to finish analysing your dataset. However, it is almost always a good habit to check through your data analysis and, at times, even re-analyse the data in a different way, both to understand the data better and to strengthen the analysis. In this blog entry, I will highlight some key considerations to keep in mind when checking through your data analysis.
1) Could your variables be categorised differently, or expressed as continuous variables for data analysis?
In some instances, an explanatory variable can be expressed either as a categorical variable or as a continuous variable. If that is the case, I recommend analysing the data both ways. For example, consider a research question on the effect of age on vaccine antibody response. Based on the literature, you may have classified subjects into two groups: (I) elderly subjects, aged 65 years and above; (II) young subjects, aged below 65 years. The antibody responses are then compared between these two groups. However, the analysis can also be done differently: you can instead plot antibody response against age and evaluate whether the correlation is significant. The two approaches ask the research question in slightly different ways (see figure above for better clarity): the former investigates differences between young and elderly subjects, while the latter investigates whether antibody responses are correlated with age. It is usually worth doing both analyses to gain a deeper understanding of your data.
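As a minimal sketch of the two approaches, using entirely synthetic age and antibody data (the numbers here are illustrative, not from any real study), you could run both a group comparison and a correlation test side by side:

```python
import numpy as np
from scipy import stats

# Synthetic example: 20 subjects with ages 20-77 and an antibody
# response that (by construction) declines with age.
rng = np.random.default_rng(0)
age = np.arange(20, 80, 3)  # 20 ages: 20, 23, ..., 77
antibody = 100 - 0.5 * age + rng.normal(0, 5, size=age.size)

# Approach 1: dichotomise age at 65 and compare the two groups.
elderly = antibody[age >= 65]
young = antibody[age < 65]
t_stat, p_groups = stats.ttest_ind(elderly, young)

# Approach 2: keep age continuous and test the Pearson correlation.
r, p_corr = stats.pearsonr(age, antibody)

print(f"group comparison: p = {p_groups:.3g}")
print(f"correlation: r = {r:.2f}, p = {p_corr:.3g}")
```

Both tests should flag the age effect here, but on real data they can disagree, which is exactly why running both is informative.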
2) Are the cutoff values in your data analysis justified?
We often choose default values as cutoffs for assigning categories or filtering data. However, the defaults may not be appropriate for every dataset. This is where a deeper knowledge of the analysis tools helps, as you will better appreciate the assumptions behind the analyses. For instance, consider a scenario where you are interested in finding the molecular pathways regulated by virus infection. You decide on a 1.5 fold-change and a 0.05 p-value cutoff to identify differentially expressed genes for pathway analysis. But why was a 1.5 fold-change chosen? What happens if you pick a cutoff of 2 fold-change instead? To understand how cutoff values affect the analysis, it is usually wise to analyse the results using several different cutoffs and, eventually, justify why the final chosen cutoff is the most suitable for your analysis.
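One lightweight way to run this sensitivity check is to parameterise the cutoffs and compare the resulting gene lists. The sketch below uses a made-up five-gene table (gene names and values are hypothetical) rather than real differential-expression output:

```python
import pandas as pd

# Hypothetical differential-expression results: one row per gene.
df = pd.DataFrame({
    "gene": ["A", "B", "C", "D", "E"],
    "fold_change": [1.6, 2.3, 1.2, 3.1, 1.8],
    "p_value": [0.01, 0.04, 0.001, 0.2, 0.03],
})

def significant_genes(df, fc_cutoff, p_cutoff=0.05):
    """Return the set of genes passing both cutoffs."""
    mask = (df["fold_change"] >= fc_cutoff) & (df["p_value"] < p_cutoff)
    return set(df.loc[mask, "gene"])

# Re-run the filter under both candidate cutoffs before committing.
loose = significant_genes(df, fc_cutoff=1.5)
strict = significant_genes(df, fc_cutoff=2.0)
print(sorted(loose), sorted(strict), sorted(loose - strict))
```

The genes that appear only in the looser list (here A and E) are exactly the ones whose inclusion you would need to justify when defending the final cutoff.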
3) How did you manage the missing values and outliers? Can we do better?
Different analysts may handle missing values and outliers differently, which can affect the interpretation of the data. Hence, it is good practice to handle missing values and outliers in more than one way and see how the different methods impact the analysis. Consider a situation where at least one explanatory variable is missing in 50% of the samples. If your strategy is to remove any sample with missing data, then half of the dataset is discarded. Compared with a method that imputes the mean value instead, the interpretation of the data could be very different. If unsure, it is usually a good habit to try these different strategies and then evaluate which one best represents the data.
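The contrast between the two strategies can be sketched in a few lines of pandas. The toy dataset below is hypothetical, with a missing marker value in half of the samples, mirroring the 50% scenario above:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: half the samples are missing the marker value.
df = pd.DataFrame({
    "sample": ["s1", "s2", "s3", "s4"],
    "marker": [10.0, np.nan, 14.0, np.nan],
    "outcome": [1.0, 2.0, 3.0, 4.0],
})

# Strategy 1: drop any sample with a missing value (loses 50% here).
dropped = df.dropna()

# Strategy 2: impute the column mean instead of discarding samples.
imputed = df.fillna({"marker": df["marker"].mean()})

print(len(dropped), "samples after dropping;",
      len(imputed), "samples after imputation")
print(imputed["marker"].tolist())
```

Dropping keeps only two samples, while mean imputation keeps all four at the cost of shrinking the marker's variance, a trade-off worth checking in any downstream model.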
4) Besides the primary research goal, what more can we learn from the data?
As data scientists, we are often engrossed in our own research interests, so our interpretation of the data can be biased towards what we want to “see.” While this may be a reasonable approach for directly answering the research objectives, it is also important to look out for other trends within the data. Some of these trends could yield new biological insights, or reveal potential confounding variables. For instance, in a dataset you may find that BMI is associated with increased disease severity. With more exploratory analysis, however, you may also find that the high-BMI subjects are mostly elderly individuals. Given this additional finding, the interpretation is no longer simply that BMI influences disease severity; disease severity could be influenced by age as well. To separate the effects of BMI and age, further statistical analysis that reduces the impact of confounding variables is required.
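One standard way to reduce the impact of a confounder is to add it as a covariate in a regression. The sketch below simulates the BMI/age scenario (all data is synthetic, constructed so that age drives both BMI and severity) and compares the crude BMI coefficient with the age-adjusted one:

```python
import numpy as np

# Synthetic data: age drives both BMI and disease severity, so a
# crude BMI-severity association is confounded by age.
rng = np.random.default_rng(1)
age = rng.uniform(20, 80, size=200)
bmi = 18 + 0.1 * age + rng.normal(0, 1, size=200)
severity = 0.05 * age + rng.normal(0, 0.5, size=200)  # age only

# Crude model: severity ~ intercept + BMI.
X_crude = np.column_stack([np.ones_like(bmi), bmi])
crude_bmi = np.linalg.lstsq(X_crude, severity, rcond=None)[0][1]

# Adjusted model: severity ~ intercept + BMI + age.
X_adj = np.column_stack([np.ones_like(bmi), bmi, age])
adj_bmi = np.linalg.lstsq(X_adj, severity, rcond=None)[0][1]

print(f"crude BMI coefficient: {crude_bmi:.3f}")
print(f"age-adjusted BMI coefficient: {adj_bmi:.3f}")
```

The crude coefficient is clearly positive, but once age is in the model the BMI coefficient collapses towards zero, exposing age as the real driver in this simulated example.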
5) Look beyond statistical significance for biological meaning
As scientists, we are trained to look at p-values: a p-value below 0.05 is deemed significant and lets us reject the null hypothesis. In biology, however, it is sometimes pertinent to look beyond statistical significance; the magnitude of the change also matters. Consider a study in which drug X significantly reduces weight by a mean of 1 kg. Consider another drug, Y, which causes a mean 10 kg loss but does not reach significance (p > 0.05). Of the two, drug Y could be the more biologically meaningful weight-loss drug, despite not reaching statistical significance.
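The drug X versus drug Y contrast can be simulated directly. In the synthetic setup below (all sample sizes and effect sizes are invented for illustration), drug X has a tiny 1 kg effect measured in a huge trial, while drug Y has a large 10 kg effect measured in a tiny, noisy one:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Drug X: small 1 kg effect, large sample -> very small p-value.
placebo_x = rng.normal(0, 2, size=500)
drug_x = rng.normal(1, 2, size=500)

# Drug Y: large 10 kg effect, only 5 subjects per arm and high
# variability -> may well miss p < 0.05.
placebo_y = rng.normal(0, 12, size=5)
drug_y = rng.normal(10, 12, size=5)

p_x = stats.ttest_ind(drug_x, placebo_x).pvalue
p_y = stats.ttest_ind(drug_y, placebo_y).pvalue
diff_x = drug_x.mean() - placebo_x.mean()
diff_y = drug_y.mean() - placebo_y.mean()

print(f"Drug X: mean loss {diff_x:.1f} kg, p = {p_x:.2g}")
print(f"Drug Y: mean loss {diff_y:.1f} kg, p = {p_y:.2g}")
```

Reporting the effect size alongside the p-value, as this sketch does, is what lets a reader see that drug Y's larger mean loss might be more biologically meaningful even when its p-value is unimpressive.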
In summary, inspecting and cleaning data should take up the largest share of time in data analysis, but the effort is worth it because it makes the downstream interpretation of the data more complete. This is particularly true for machine learning, where the quality of the input data directly shapes the resulting models. From the above five points, I hope I have convinced you of the importance of checking and revisiting your data before declaring your analysis “done.” Give it a shot; I am confident you will find the time well spent.