Solve blog

Introducing Bowser – an automated, intelligent data imputation tool.

Mark Grujic

This is the first in a series of posts about recent applications that we’ve worked on since useR! 2018, so you’ll see a number of links to packages that we learnt about at that conference.

The first example uses naniar (data structure and visualisation of missing data) and visdat (visualisation of entire dataframes), two of our favourite packages we came across at the conference. This is the first entry on the blog from the newest Solve team member, Mark Grujic.

Bowser (think fuel, not Mario – you’ll see why later) is a web service that lets you work with and impute missing values in your data frames. Missing values are a particularly common problem in geochemistry.

Sometimes, you have an incomplete or nearly-complete dataset that you want to use in some advanced modelling application, but the application might do one of several things to your data:

  • Delete all rows with missing data.
  • Replace all missing values with the mean/median/something else of each column.
  • Nothing, because the code just broke and crashed the program.

Bowser aims to make it easier for you to work with your data and convert it into a format that other modelling applications can use directly!

Why do this?

There are several reasons why you would want to use an artificially complete dataset, even though you will be using data that wasn’t directly observed. One of the main reasons is that some implementations of algorithms such as Self Organising Maps will impute the missing values with the mean of the data, without necessarily informing you. This can be dangerous, as you can see in the following simple example:

Different imputation methods compared. Imputing y with the mean value, as some algorithm implementations do, gives a spurious result.

Here, some known data (black) has a clear trend between the x and y variables, and “missing” data has been simulated at regular intervals across the range of x. The value of y has been imputed using several methods. You can see that using the mean is a poor approximation that destroys any relationship between the variables. The linear interpolation method gives a much better result. However, for larger datasets with more variables, there are more sophisticated algorithms that can give more robust imputation models.
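The difference between the two simple methods can be reproduced with a few lines of code. The following is a minimal Python sketch on made-up data (y = 2x + 1 with two values removed); the function names are hypothetical and this is not the code behind the figure.

```python
from statistics import mean

def impute_mean(ys):
    """Replace each missing value (None) with the mean of the observed values."""
    fill = mean(y for y in ys if y is not None)
    return [fill if y is None else y for y in ys]

def impute_linear(xs, ys):
    """Replace each missing value by interpolating between its nearest
    observed neighbours in x.

    Assumes every missing point has an observed point on both sides.
    """
    known = [(x, y) for x, y in zip(xs, ys) if y is not None]
    filled = []
    for x, y in zip(xs, ys):
        if y is not None:
            filled.append(y)
            continue
        x0, y0 = max((p for p in known if p[0] < x), key=lambda p: p[0])
        x1, y1 = min((p for p in known if p[0] > x), key=lambda p: p[0])
        filled.append(y0 + (y1 - y0) * (x - x0) / (x1 - x0))
    return filled

def max_error(filled, truth):
    """Largest absolute difference between filled and true values."""
    return max(abs(f - t) for f, t in zip(filled, truth))

# Made-up data with a clear trend; the values at x = 3 and x = 6 are "missing".
xs = list(range(10))
truth = [2 * x + 1 for x in xs]
ys = [None if x in (3, 6) else t for x, t in zip(xs, truth)]

mean_filled = impute_mean(ys)
linear_filled = impute_linear(xs, ys)

print("mean imputation, worst error:  ", max_error(mean_filled, truth))
print("linear imputation, worst error:", max_error(linear_filled, truth))
```

On this toy data the mean-filled values both land on 10 regardless of x, which is off by 3 in each case, while linear interpolation recovers the removed values exactly because the underlying trend is a straight line.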

Imputation in Bowser

Bowser uses a multiple-imputation, non-parametric technique based on the Random Forest algorithm to fill in your missing data values. This method is robust to significant proportions of missing data and works with mixed data; it will impute both numeric and categorical missing values. To do this, Bowser uses the fantastic missForest R package.
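To give a feel for how this family of methods works, here is a rough Python sketch of an iterative imputation loop: start from a crude guess, then repeatedly predict each variable's missing values from the other variable until the filled-in values stop changing. This is only an illustration under loud assumptions: a straight-line least-squares fit is swapped in for the Random Forest, only two numeric columns are handled, the stopping rule is a simple tolerance, and every name is hypothetical. It is not missForest and not Bowser's implementation.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def iterative_impute(cols, max_iter=20, tol=1e-9):
    """Iteratively impute missing values (None) in a two-column table.

    cols maps two column names to equal-length lists.
    A linear fit stands in for the Random Forest used by missForest.
    """
    names = list(cols)
    missing = {n: [i for i, v in enumerate(cols[n]) if v is None] for n in names}

    # Step 1: initial guess, fill every gap with the column mean.
    filled = {}
    for n in names:
        col_mean = mean(v for v in cols[n] if v is not None)
        filled[n] = [col_mean if v is None else v for v in cols[n]]

    # Step 2: refine each column in turn using the current state of the other.
    for _ in range(max_iter):
        change = 0.0
        for target in names:
            other = names[1 - names.index(target)]
            observed = [i for i, v in enumerate(cols[target]) if v is not None]
            slope, intercept = fit_line(
                [filled[other][i] for i in observed],
                [cols[target][i] for i in observed],
            )
            for i in missing[target]:
                new_value = slope * filled[other][i] + intercept
                change += abs(new_value - filled[target][i])
                filled[target][i] = new_value
        if change < tol:  # imputed values have stopped moving
            break
    return filled

# Made-up data where b = 3a + 2, with one value removed from each column.
table = {
    "a": [1.0, None, 3.0, 4.0, 5.0, 6.0],
    "b": [5.0, 8.0, 11.0, 14.0, None, 20.0],
}
result = iterative_impute(table)
print(result["a"][1], result["b"][4])
```

The point of the loop is that each pass uses better estimates than the last: the first fit is distorted by the mean-filled guesses, but the distortion shrinks on every iteration. A Random Forest plays the same role as the line here while also coping with non-linear relationships and categorical variables.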

Additionally, visdat and naniar are used to help visualise the missing data before doing the imputation. The following shows an overview of where the missing data is in relation to the row order:

An example of the vis_miss() visualisation in visdat with the data ordered by row number, allowing for an easier understanding of what data is missing and where that’s occurring.
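The numbers behind this kind of overview are simple to compute. As a minimal sketch, the Python below builds a row-by-column grid marking which cells are missing and reports the fraction missing per column. The table, column names and function names are made up for illustration; this is not how visdat is implemented.

```python
def missing_mask(rows, columns):
    """Row-by-column grid of booleans, True where a value is missing (None)."""
    return [[row.get(col) is None for col in columns] for row in rows]

def missing_summary(rows, columns):
    """Fraction of missing values per column, and across the whole table."""
    mask = missing_mask(rows, columns)
    n_rows = len(rows)
    per_column = {
        col: sum(line[j] for line in mask) / n_rows
        for j, col in enumerate(columns)
    }
    overall = sum(sum(line) for line in mask) / (n_rows * len(columns))
    return per_column, overall

# Hypothetical assay table with mixed numeric and categorical columns.
columns = ["Cu_ppm", "Au_ppb", "lithology"]
rows = [
    {"Cu_ppm": 120.0, "Au_ppb": 5.0, "lithology": "granite"},
    {"Cu_ppm": None, "Au_ppb": 7.0, "lithology": "granite"},
    {"Cu_ppm": 95.0, "Au_ppb": None, "lithology": None},
    {"Cu_ppm": 110.0, "Au_ppb": None, "lithology": "basalt"},
]

mask = missing_mask(rows, columns)
per_column, overall = missing_summary(rows, columns)

# Crude text version of the plot: one line per row, '#' marks a missing cell.
for line in mask:
    print("".join("#" if is_missing else "." for is_missing in line))
print(per_column, overall)
```

Keeping the grid in row order is what lets you spot whether the gaps are scattered or clustered in particular stretches of the dataset.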

Once you infill the missing values, you are presented with information including how the imputed values fit within the distribution of the known data, for each variable:

A custom visualisation created by Mark to compare the imputed data (light blue) and the original data (pink), allowing for an easy overview of the different distributions of both types of data.

Here, we can see that there are no imputed values that seem unrealistic.

Label renamer

Often, labelled data can be ambiguous in how an algorithm will interpret it. Most algorithms don’t recognise that “granite” and “GRANITE” are really the same thing.

Whether it comes from a spelling mistake, a change in convention, or simply the letter casing, not accounting for similar labels has a flow-on effect for the rest of your data process: an algorithm treats the character strings “granite” and “Granite” as being just as different as “granite” and “Sedimentary cover”.

Bowser lets you quickly find similar labels and replace all instances with something new. You can also just rename individual labels, including missing data.

An example of Bowser in action. The renaming tool gives you detailed control over how you change your labels.
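The idea behind finding and merging similar labels can be sketched in a few lines. The Python below is an illustration only, using the standard library's difflib for fuzzy matching; the function names and the 0.8 similarity cutoff are assumptions, and Bowser's own matching logic may differ.

```python
from difflib import get_close_matches

def find_similar(label, labels, cutoff=0.8):
    """Return the distinct labels that resemble `label`, ignoring letter
    case and surrounding whitespace. Missing labels (None) are skipped."""
    key = label.strip().lower()
    by_normalised = {}
    for other in labels:
        if other is not None:
            by_normalised.setdefault(other.strip().lower(), []).append(other)
    matches = get_close_matches(
        key, list(by_normalised), n=max(1, len(by_normalised)), cutoff=cutoff
    )
    return sorted({original for m in matches for original in by_normalised[m]})

def rename(labels, old, new):
    """Replace every label found in `old` with `new`.
    Include None in `old` to rename missing labels."""
    old = set(old)
    return [new if label in old else label for label in labels]

# Made-up lithology column with casing differences, a typo and a missing value.
labels = ["granite", "GRANITE", "Granite", "granit", "Sedimentary cover", None, "granite"]

similar = find_similar("granite", labels)
cleaned = rename(labels, similar, "granite")
cleaned = rename(cleaned, [None], "unknown")
print(similar)
print(cleaned)
```

Separating the "find" step from the "rename" step mirrors the workflow described above: you review the suggested matches first, then decide what they should all become.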

Bowser is the first in an upcoming set of tools that Solve will be releasing to help with all parts of the predictive modelling workflow for mining and exploration. Bowser sits at the very start of this workflow, with the ability to complete and begin cleaning your dataset before any advanced processing is done.

If you’re interested in a trial, or you’d like more information about how the imputation is done, please contact information@solvegeosolutions.com or mark.grujic@solvegeosolutions.com.