Solve blog

Ruff – Clustering image textures

Mark Grujic

Solve Geosolutions is releasing another free web app that aims to help geoscientists (and others!) get the most out of their data.

Ruff is an app that takes your image and returns the same image split into groups that have similar textural characteristics. To do this, it uses the grey-level co-occurrence matrix (GLCM) algorithm to compute a set of numbers that describe the texture around each pixel in the image.

For most of the demonstration of the app in this post, we’ll use the following seismic reflection image from here.

Seismic reflection data from the Gabon South Basin

Let’s look at some of the textural components that are calculated using Ruff. Each parameter helps define things like roughness, slope, etc.

Some of the textural parameters that Ruff extracts from images
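To give a feel for what one of these textural numbers is, here is a minimal Python sketch of a GLCM and three common features derived from it. This is not Ruff's code: the function names, the single horizontal pixel offset and the choice of contrast, homogeneity and angular second moment are assumptions made for illustration, and Ruff may well compute a different set of parameters.

```python
from collections import Counter


def glcm(window, offset=(0, 1)):
    """Normalised, symmetric grey-level co-occurrence matrix for one window.

    `window` is a list of rows of integer grey levels.
    `offset` is the (row, column) step between the two pixels of a pair.
    Returns a dict mapping (level_a, level_b) to a probability.
    """
    dr, dc = offset
    rows, cols = len(window), len(window[0])
    counts = Counter()
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                a, b = window[r][c], window[r2][c2]
                counts[(a, b)] += 1
                counts[(b, a)] += 1  # count both directions (symmetric GLCM)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def texture_features(p):
    """A few standard GLCM features; high contrast means a rough texture."""
    return {
        "contrast": sum(prob * (i - j) ** 2 for (i, j), prob in p.items()),
        "homogeneity": sum(prob / (1 + (i - j) ** 2) for (i, j), prob in p.items()),
        "asm": sum(prob ** 2 for prob in p.values()),  # angular second moment
    }


# Two toy 3 x 3 windows: a flat patch and a checkerboard.
smooth = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
rough = [[0, 3, 0], [3, 0, 3], [0, 3, 0]]

print(texture_features(glcm(smooth)))
print(texture_features(glcm(rough)))
```

The flat patch has zero contrast and the checkerboard has a high one. In an app like Ruff, the window would slide across the image so that every pixel gets its own set of values.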

After computing the textures and applying some scaling, we can group the pixels based on how similar their textural parameters are. In Ruff, this is done using k-means clustering, with the number of clusters chosen by the user. Below, we see the result of using a 5 x 5 GLCM window and 6 clusters on the seismic reflection data.

The seismic reflection data with and without cluster colouring. A 5 x 5 GLCM window was used and 6 clusters were defined
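The scale-then-cluster step can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than Ruff's implementation: the per-pixel feature values are made up, the scaling is assumed to be a zero-mean, unit-variance standardisation, and the k-means start (the first k points) is deliberately naive so the example is repeatable. Real implementations usually use random or k-means++ starts.

```python
def standardise(rows):
    """Scale every column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [(sum((v - m) ** 2 for v in c) / len(c)) ** 0.5 or 1.0
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def nearest(point, centres):
    """Index of the centre closest to `point` (squared Euclidean distance)."""
    return min(range(len(centres)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centres[j])))


def kmeans(points, k, iterations=20):
    """Plain k-means; returns one cluster label per point."""
    centres = [list(p) for p in points[:k]]  # naive, deterministic start
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [nearest(p, centres) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster empties
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical (contrast, homogeneity) values for six pixels.
pixel_features = [
    [0.1, 0.90], [9.0, 0.10], [0.2, 0.80],
    [8.5, 0.15], [0.0, 1.00], [9.2, 0.12],
]

labels = kmeans(standardise(pixel_features), k=2)
print(labels)
```

Scaling matters here because contrast and homogeneity live on very different ranges; without it, the parameter with the largest numbers would dominate the distance calculation.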

It looks good! Nothing too out of place and you can see that adding colours to the image really does help distinguish areas with different textures.

What if we increased our window size? In the image below, the window size changes from 5 x 5 to 7 x 7, then 11 x 11. You will notice the clusters get coarser, as expected.

The GLCM window was changed to 7 x 7 then 11 x 11

Another example, this time using an airborne magnetics image. The same principle was applied, resulting in another layer of data to interpret.

The mag data (black/white), the clustered texture domains (colour), and the two overlapped

OK, last example, this time with SRTM elevation data from the Pilbara, Western Australia. Here, we’ve applied the same methodology and identified the clusters that appear to correlate well with interesting topographic features. We can isolate these clusters and see if they also occur in previously underappreciated areas.

SRTM elevation data over the Pilbara (greyscale) with selected textural clusters (colour)

In summary, Ruff provides the means to analyse the textural characteristics of images and gives you an understanding of where similar textures appear in the image. Due to the computational requirements, Ruff only accepts images smaller than 1MB. It’s not hard to imagine how this process can be applied to a wide range of other data applications!

We haven’t really discussed the nuts and bolts of the GLCM algorithm that Ruff uses, but it is generally considered to be a good first step in textural characterisation. Likewise, k-means clustering is an OK first step, but results could be improved with more involved clustering routines.

Please try out Ruff by clicking here and let us know if you have feedback by contacting us at information@solvegeosolutions.com or mark.grujic@solvegeosolutions.com

For more advanced analysis, Solve Geosolutions offers a host of more robust, neural-network based services that we would be happy to discuss with you!

MICA – Mineral Identification and Compositional Analysis

Mark Grujic

Minerals are confusing, at least to most of us! What’s worse, it is hard to find out exactly which minerals we are confusing with each other.

 
Fortunately, the good people at webmineral.com have compiled libraries of the elemental composition of a large number of known minerals. Using this data, we can create a measure of mineral similarity to see how different and similar minerals are to each other.
 
We have built a web app, MICA (Mineral Identification and Compositional Analysis), that does a lot of the hard work!
 
MICA uses a database of 4722 minerals and 85 elements. The proportion of each element in each mineral is recorded in the database:
Overview of the composition of the minerals in the database
 
We then take that information and reduce the dimensionality of the data so that the relationships between the compositions of the 85 elements can be displayed on a 2-dimensional scatter plot. This means that minerals with similar composition will plot close to each other:
A plot of the minerals in 2D. The minerals are coloured by their sulfur proportion and sized by their copper proportion
 
MICA allows you to zoom in and out of the map, colouring the minerals by whichever element you choose along the way. Selecting some minerals using the lasso tool lets you see the selections and their chemical formulae:
Manually selected minerals and their chemical formulae
Selecting several minerals also lets you identify the important elements that make them similar in the first place. MICA ranks the importance of each element to the mineral similarity measure. This rank is found by running an unsupervised Random Forest model on the selected minerals and extracting the elemental importance using the mean decrease in Gini impurity:
 
Key elements for the selected minerals
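A full Random Forest is more than fits here, but the quantity it averages is easy to sketch. The Python below computes the decrease in Gini impurity for a single split on a single element. Everything specific in it is an assumption for illustration: the element proportions are made up, the thresholds are picked by hand, and labelling minerals as "selected" (1) or "not selected" (0) is our guess at how the comparison is framed, not a description of MICA's internals.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def gini_decrease(values, labels, threshold):
    """Impurity removed by splitting the samples at `threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    after = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - after


# Hypothetical data: 1 = mineral in the lasso selection, 0 = not selected.
selected = [1, 1, 1, 0, 0, 0]
copper = [0.35, 0.66, 0.40, 0.00, 0.00, 0.05]
sulfur = [0.35, 0.34, 0.30, 0.53, 0.00, 0.33]

print("Cu:", gini_decrease(copper, selected, threshold=0.20))
print("S: ", gini_decrease(sulfur, selected, threshold=0.32))
```

In this toy case the copper split separates the two groups perfectly and the sulfur split does not help at all, so copper would rank as the more important element. A Random Forest repeats this over many trees and many splits and averages the decreases per element.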
 
Let’s get back to answering the main issue with mineral discrimination:
 
What else could this mineral be?
 
MICA lets you pick a mineral and then look at the most similar minerals and compare their elemental compositions:
The composition of Serandite and the closest 5 minerals. Hovering over each section gives details.
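The idea of "closest minerals" can be sketched with a handful of well-known minerals. The Python below derives each mineral's elemental weight fractions from its ideal formula and ranks the others by Euclidean distance. The distance measure, the function names and the four-mineral list are assumptions for illustration; this post does not say which similarity measure MICA uses, and its database holds measured compositions rather than values derived from formulae.

```python
# Standard atomic weights, rounded.
ATOMIC_MASS = {"O": 15.999, "Si": 28.085, "S": 32.06, "Fe": 55.845, "Cu": 63.546}

# Ideal formulae as element -> number of atoms.
FORMULAE = {
    "Quartz": {"Si": 1, "O": 2},
    "Pyrite": {"Fe": 1, "S": 2},
    "Chalcopyrite": {"Cu": 1, "Fe": 1, "S": 2},
    "Covellite": {"Cu": 1, "S": 1},
}


def weight_fractions(formula):
    """Proportion of each element by weight for one mineral."""
    masses = {el: n * ATOMIC_MASS[el] for el, n in formula.items()}
    total = sum(masses.values())
    return {el: m / total for el, m in masses.items()}


def distance(a, b):
    """Euclidean distance between two compositions (absent elements are 0)."""
    elements = set(a) | set(b)
    return sum((a.get(el, 0.0) - b.get(el, 0.0)) ** 2 for el in elements) ** 0.5


def most_similar(name, n=2):
    """Names of the `n` minerals whose composition is closest to `name`."""
    comps = {m: weight_fractions(f) for m, f in FORMULAE.items()}
    others = [m for m in comps if m != name]
    return sorted(others, key=lambda m: distance(comps[name], comps[m]))[:n]


print(most_similar("Chalcopyrite"))
```

With 85 elements instead of five, the same calculation works unchanged; only the length of each composition vector grows.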
 
Here is a quick demo of some key features of MICA:
Demo of MICA being used to identify similar minerals to Yimengite, then select a different group of minerals and determine what makes them unique, compared to the entire mineral database.

Please try out MICA and let us know if you have feedback by contacting us at information@solvegeosolutions.com or mark.grujic@solvegeosolutions.com

Introducing Bowser – an automated, intelligent data imputation tool.

Mark Grujic

The following will be the first in a series on recent applications that we’ve worked on since useR! 2018, so you’ll see a number of links to packages that we learnt about at that conference.

The first example uses naniar (data structure and visualisation of missing data) and visdat (visualisation of entire dataframes), two of our favourite packages we came across at the conference. This is the first entry on the blog from the newest Solve team member, Mark Grujic.

Bowser (think fuel, not Mario – you’ll see why later) is a web service that lets you work with and impute missing values in your data frames, a particularly common problem in geochemistry.

Sometimes, you have an incomplete or nearly-complete dataset that you want to use in some advanced modelling application, but the application might do one of several things to your data:

  • Delete all rows with missing data.
  • Replace all missing values with the mean/median/something else of each column.
  • Nothing because the code just broke and crashed the program.

Bowser aims to make it easier for you to work with your data and convert it into a format that is ready to use in other modelling applications!

Why do this?

There are several reasons why you would want to use an artificially complete dataset, even though you will be using data that wasn’t directly observed. One of the main reasons is that a bunch of algorithms like Self Organising Maps will go and impute the missing values with the mean of the data, without necessarily informing you. This can be dangerous, as you can see in this following simple example:

Different imputation methods compared. Imputing y with the mean value, as some algorithm implementations do, gives a spurious result.

Here, some known data (black) has a clear trend between the x and y variables, and “missing” data has been simulated at regular intervals across the range of x. The value of y has been imputed using several methods. You can see that using the mean is a poor approximation that destroys any relationship between the variables. The linear interpolation method gives a much better result. However, for larger datasets that have more variables, there are much more complex algorithms that can give more robust imputation models.
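The same comparison can be reproduced with a few lines of Python. This is a toy sketch, not the code behind the figure: the data (y = 2x with two gaps) and the function names are made up for the example, and the interpolation assumes x is sorted and that the first and last y values are known.

```python
def impute_mean(ys):
    """Replace every missing value (None) with the mean of the known values."""
    known = [y for y in ys if y is not None]
    mean = sum(known) / len(known)
    return [mean if y is None else y for y in ys]


def impute_linear(xs, ys):
    """Fill gaps by interpolating between the nearest known neighbours in x.

    Assumes xs is sorted and the first and last y values are known.
    """
    known = [(x, y) for x, y in zip(xs, ys) if y is not None]
    filled = []
    for x, y in zip(xs, ys):
        if y is None:
            x0, y0 = max((k for k in known if k[0] < x), key=lambda k: k[0])
            x1, y1 = min((k for k in known if k[0] > x), key=lambda k: k[0])
            y = y0 + (y1 - y0) * (x - x0) / (x1 - x0)
        filled.append(y)
    return filled


# Made-up data following y = 2x, with two values removed.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [0.0, 2.0, None, 6.0, 8.0, None, 12.0]

print("mean:  ", impute_mean(ys))
print("linear:", impute_linear(xs, ys))
```

The mean puts the same value in both gaps regardless of x, while the interpolated values sit on the trend line.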

Imputation in Bowser

Bowser uses a multiple-imputation, non-parametric technique based on the Random Forest algorithm to fill in your missing data values. This method is robust for significant proportions of missing data and works with mixed data; it will impute both numeric and categorical missing values. To do this, the fantastic missForest R package is utilised.

Additionally, visdat and naniar are used to help visualise the missing data before doing the imputation. The following shows an overview of where the missing data is in relation to the row order:

An example of the vis_miss() visualisation in visdat with the data ordered by row number, allowing for an easier understanding of what data is missing and where that’s occurring.
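Bowser relies on the R packages named above for this view. Purely as an illustration of the information such a plot carries, here is a small text-only sketch in Python: it reports the percentage of missing values per column and draws one row of symbols per data row. The table, its column names and the function names are hypothetical.

```python
def missing_summary(table):
    """Percentage of missing (None) values in each column."""
    n = len(table)
    return {col: 100 * sum(row[col] is None for row in table) / n
            for col in table[0]}


def missing_map(table):
    """One string per row: '#' where a value is present, '.' where missing."""
    cols = list(table[0])
    return ["".join("." if row[c] is None else "#" for c in cols)
            for row in table]


# Hypothetical geochemistry table with gaps.
table = [
    {"Cu_ppm": 120.0, "Au_ppb": None, "lithology": "granite"},
    {"Cu_ppm": None, "Au_ppb": None, "lithology": "basalt"},
    {"Cu_ppm": 95.0, "Au_ppb": 4.0, "lithology": None},
    {"Cu_ppm": 110.0, "Au_ppb": 6.0, "lithology": "granite"},
]

print(missing_summary(table))
print("\n".join(missing_map(table)))
```

Keeping the rows in their original order, as the plot does, makes it easier to spot runs of missing values that belong to a particular batch or sampling campaign.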

Once you infill the missing values, you are presented with information including how the imputed values fit within the distribution of the known data, for each variable:

A custom visualisation created by Mark to compare the imputed data (light blue) and the original data (pink), allowing for an easy overview of the different distributions of both types of data.

Here, we can see that there are no imputed values that seem unrealistic.

Label renamer

Often, labelled data can be ambiguous in how an algorithm will interpret it. Most algorithms don’t recognise that “granite” and “GRANITE” are really the same thing.

Be it from a spelling mistake, a change in convention, or simply the letter casing, not accounting for similar labels has a flow-on effect for the rest of your data process: an algorithm treats the character strings “granite” and “Granite” as being just as different as “granite” and “Sedimentary cover”.

Bowser lets you quickly find similar labels and replace all instances with something new. You can also just rename individual labels, including missing data.

An example of Bowser in action. The renaming tool gives you detailed control over how you change your labels.
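For a sense of how similar labels can be found, here is a minimal Python sketch using only the standard library. It is not Bowser's code: the function names, the 0.8 similarity cutoff and the example labels are assumptions for illustration. It groups labels that differ only by case or surrounding whitespace, flags near-identical spellings with difflib's SequenceMatcher, and then replaces the chosen labels.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def group_by_case(labels):
    """Labels that differ only by letter case or surrounding whitespace."""
    groups = defaultdict(set)
    for label in labels:
        groups[label.strip().casefold()].add(label)
    return {key: sorted(vals) for key, vals in groups.items() if len(vals) > 1}


def similar_pairs(labels, cutoff=0.8):
    """Pairs of case-folded labels whose spelling is nearly the same."""
    unique = sorted({label.strip().casefold() for label in labels})
    return [(a, b)
            for i, a in enumerate(unique)
            for b in unique[i + 1:]
            if SequenceMatcher(None, a, b).ratio() >= cutoff]


def rename(labels, old, new):
    """Replace every label listed in `old` with `new`."""
    old = set(old)
    return [new if label in old else label for label in labels]


labels = ["granite", "GRANITE", "Granite", "granit", "Sedimentary cover"]

print(group_by_case(labels))
print(similar_pairs(labels))
print(rename(labels, ["GRANITE", "Granite", "granit"], "granite"))
```

The final decision is left to a person on purpose: two labels with similar spelling are not always the same thing, so the sketch only suggests candidates and renames what you tell it to.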

Bowser is the first in an upcoming set of tools that Solve will be releasing to help with all parts of the predictive modelling workflow for mining and exploration. Bowser sits at the very start of this workflow, with the ability to complete and begin cleaning your dataset before any advanced processing is done.

If you’re interested in a trial, or you’d like more information about how the imputation is done, please contact information@solvegeosolutions.com or mark.grujic@solvegeosolutions.com