Introducing Bowser – an automated, intelligent data imputation tool.
July 30, 2018
This is the first in a series of posts on applications we’ve worked on since useR! 2018, so you’ll see a number of links to packages that we learnt about from that conference.
The first example uses naniar (data structure and visualisation of missing data) and visdat (visualisation of entire dataframes), two of our favourite packages we came across at the conference. This is the first entry on the blog from the newest Solve team member, Mark Grujic.
Bowser (think fuel, not Mario – you’ll see why later) is a web service that lets you work with and impute missing values in your data frames, a particularly common problem in geochemistry.
Sometimes, you have an incomplete or nearly-complete dataset that you want to use in some advanced modelling application, but the application might do one of several things to your data:
- Delete all rows with missing data.
- Replace all missing values with the mean/median/something else of each column.
- Nothing, because the code simply breaks and crashes the program.
Bowser aims to make it easier to work with your data and convert it into a format that’s ready for use in other modelling applications!
Why do this?
There are several reasons why you would want to use an artificially complete dataset, even though you will be using data that wasn’t directly observed. One of the main reasons is that many algorithms, such as Self Organising Maps, will silently impute the missing values with the mean of the data, without necessarily informing you. This can be dangerous, as you can see in the following simple example:
Here, some known data (black) has a clear trend between the x and y variables, and “missing” data has been simulated at regular intervals across the range of x. The value of y has been imputed using several methods. You can see that using the mean is a poor approximation that destroys any relationship between variables. The linear interpolation method gives a much better result. For larger datasets with more variables, more complex algorithms can give more robust imputations still.
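The example above can be reproduced in a few lines of base R. This is a minimal sketch with simulated data, not Bowser’s internals: mean imputation ignores the x–y trend entirely, while linear interpolation follows it.

```r
# Simulated known data with a clear linear trend between x and y
set.seed(42)
x_known <- seq(0, 10, length.out = 50)
y_known <- 2 * x_known + rnorm(50)

# "Missing" points at regular intervals across the range of x
x_missing <- seq(1, 9, by = 2)

# Method 1: mean imputation, a flat line that destroys the trend
y_mean <- rep(mean(y_known), length(x_missing))

# Method 2: linear interpolation between the known points
y_interp <- approx(x_known, y_known, xout = x_missing)$y

# Compare both against the true underlying values (2 * x)
print(round(cbind(x = x_missing, mean = y_mean, interp = y_interp), 2))
```

The interpolated values track the true trend closely, while the mean is only accurate near the centre of the range.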
Imputation in Bowser
Bowser uses a multiple-imputation, non-parametric technique based on the Random Forest algorithm to fill in your missing data values. This method is robust for significant proportions of missing data and works with mixed data: it will impute both numeric and categorical missing values. To do this, the fantastic missForest R package is utilised.
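If you want to try this yourself, the missForest package can be used directly on a data frame with mixed numeric and categorical columns. The sketch below uses invented column names for illustration and assumes missForest is installed:

```r
# A minimal sketch of Random Forest imputation with missForest;
# the column names here (au_ppm, cu_pct, lith) are invented examples.
library(missForest)

set.seed(1)
df <- data.frame(
  au_ppm = rnorm(100, mean = 2, sd = 0.5),   # numeric assay values
  cu_pct = rnorm(100, mean = 1, sd = 0.2),
  lith   = factor(sample(c("granite", "basalt"), 100, replace = TRUE))
)

# Knock out ~10% of values at random to simulate missing data
df_miss <- prodNA(df, noNA = 0.1)   # prodNA() ships with missForest

# Impute: numeric and categorical columns are handled together
imp <- missForest(df_miss)
complete_df <- imp$ximp             # the imputed, complete data frame
imp$OOBerror                        # out-of-bag error estimate
```

The out-of-bag error gives you a built-in sense of how trustworthy the imputation is for each data type.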
Additionally, visdat and naniar are used to help visualise the missing data before doing the imputation. The following shows an overview of where the missing data is in relation to the row order:
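For readers who want the same kind of pre-imputation overview in their own R session, visdat and naniar provide it in a couple of calls. A minimal sketch, assuming both packages are installed and `df_miss` is any data frame containing NAs:

```r
# Visualise missingness before imputing, using visdat and naniar
library(visdat)
library(naniar)

# A toy data frame with some missing values
df_miss <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))

vis_miss(df_miss)          # heatmap of missing vs present values by row order
gg_miss_var(df_miss)       # count of missing values per variable
miss_var_summary(df_miss)  # tabular summary: n and % missing per column
```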
Once you infill the missing values, you are presented with information including how the imputed values fit within the distribution of the known data, for each variable:
Here, we can see that there are no imputed values that seem unrealistic.
Often, labelled data can be ambiguous in how an algorithm will interpret it. Most algorithms don’t recognise that “granite” and “GRANITE” are really the same thing.
Be it from a spelling mistake, a change in convention, or simply the letter casing, not accounting for similar labels has a flow-on effect for the rest of your data process: the algorithm treats “granite” and “Granite” as being just as different as “granite” and “Sedimentary cover”.
Bowser lets you quickly find similar labels and replace all instances with something new. You can also just rename individual labels, including missing data.
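The same kind of label cleanup can be sketched in base R. This is not Bowser’s internals, just an illustration with made-up labels: normalise the casing and whitespace first, then fuzzy-match the remaining near-duplicates.

```r
# Harmonising similar labels in base R (illustrative sketch only)
labels <- c("granite", "GRANITE", " Granite ", "granit",
            "Sedimentary cover", NA)

# Step 1: trim whitespace and lower-case everything
clean <- tolower(trimws(labels))

# Step 2: find labels within one edit of "granite" (catches the typo)
hits <- agrep("granite", clean, max.distance = 1)

# Step 3: replace all matched variants with a single canonical label
clean[hits] <- "granite"

# Missing labels can be renamed too, e.g. to an explicit category
clean[is.na(clean)] <- "unknown"
print(clean)
```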
Bowser is the first in an upcoming set of tools that Solve will be releasing to help with all parts of the predictive modelling workflow for mining and exploration. Bowser sits at the very start of this workflow, letting you complete and begin cleaning your dataset before any advanced processing is done.
If you’re interested in a trial, or you’d like more information about how the imputation is done, please contact firstname.lastname@example.org or email@example.com.