Solve blog

Taipan (Tool for Annotating Images in Preparation for ANalysis)

Tom Carmichael

The following is the second in a series of posts about applications we’ve worked on since useR! 2018, so you’ll see a number of links to packages that we learnt about at that conference.

This example uses the Taipan (Tool for Annotating Images in Preparation for ANalysis) package, which was presented by Stephanie Kobakian from Monash University in a talk called Taipan: Woman faces machine.

We’ve recently started a number of jobs with a deep learning flavour that require a training data set to be built from numerous images across a large number of samples. As a result, we’ve spent a (large) number of hours annotating images and identifying certain features in them. Unfortunately, most off-the-shelf options for this are limited and don’t provide enough customisation to make sure the job is done accurately and, equally importantly, in a timely fashion. For these reasons, Taipan was a perfect base on which to build an image annotation tool. Taipan allows users to easily define a set of questions, then have an annotator mark up an image and answer those questions about it.
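To give a sense of how little is needed, a minimal Taipan app can be put together along these lines. This is only a sketch based on the package’s documented taipanQuestions()/buildTaipan() workflow; the question wording and the image folder are hypothetical.

  library(taipan)
  library(shiny)

  # Questions asked once per image (scene) and once per drawn selection
  questions <- taipanQuestions(
    scene = div(
      radioButtons("suitable", "Is this image suitable for analysis?",
                   choices = c("Yes", "No"))
    ),
    selection = div(
      radioButtons("contains_core", "Does this selection contain core?",
                   choices = c("Yes", "No"))
    )
  )

  # Build and launch a Shiny annotation app for a folder of core photos
  buildTaipan(
    questions = questions,
    images = list.files("core_photos", full.names = TRUE),  # hypothetical folder
    appdir = "core_annotation_app"
  )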

In its base form it’s well suited to annotating static images with polygons (ideal if you’re marking where a face sits in an image), but for our purpose (identifying the presence or absence of rock in an image) we needed to make some slight modifications.

Image of core from the Estonian Core repository

Take this photo for example: if we want to build a training set of where core sits in wooden core boxes, we can’t just look for the edges of the core boxes, because the rows vary in thickness.

We’ve modified Taipan so that it supports four-click polygon creation, which can be used to quickly annotate the image like so:

In addition to visually annotating the image, it’s often critically important to record where the image is located downhole, so that any parameter calculated on the image can be tied back to other important parameters (assay intervals, hardness measurements, etc.). A second stage is required to input the Hole ID, the depth from and the depth to. This is done on the Image Information tab.

When you’re adding information to an image in Taipan, you enter the Hole ID, Depth From and Depth To into the required fields. You can also flag whether the image is unsuitable for analysis (blurry, doesn’t contain the feature of interest, etc.). Taipan also autofills the next image for you, carrying over the previous image’s Hole ID and assigning the previous image’s Depth To as the new image’s Depth From. A minimum threshold can be set for the size of the box: in this case we set each box to be a minimum of 4 metres, so if the depths you enter span less than that you get a warning in red; greater than that, and the depth is returned in green.
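The interval check behind the form is simple; a hypothetical helper along these lines (with the 4 metre minimum above as its default) is all that’s required:

  # Hypothetical validation helper: flag intervals shorter than the minimum box length
  check_interval <- function(depth_from, depth_to, min_length = 4) {
    if (depth_to - depth_from < min_length) "red" else "green"
  }

  check_interval(102.0, 104.5)  # "red"   - interval is shorter than 4 m
  check_interval(102.0, 106.5)  # "green"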

Subset image – named with Hole ID and depth.

Using our digitised polygon we can now crop the image to include only core. These photos can be further analysed or subset for a feature of interest, used as a cropping tool for easier storage of core photos, or joined together as a strip log.
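The cropping step itself can be scripted. Here’s a minimal sketch using the magick package; the file names and bounding-box coordinates are hypothetical, and in practice the digitised polygon is first reduced to a bounding box (or applied as a mask):

  library(magick)

  # Read a core box photo and crop it to a rectangular region
  # (width x height + x offset + y offset, in pixels)
  photo <- image_read("core_photos/BH001_102.0_106.5.jpg")
  core  <- image_crop(photo, geometry_area(1800, 250, 60, 120))

  # Save the subset image, named with Hole ID and depth interval
  image_write(core, "subsets/BH001_102.0_106.5_row1.jpg")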

Taipan is flexible enough that you can achieve a number of different tasks (here we’ve just shown the most basic functionality) depending on the training set you’re trying to create and the deep learning problem you’re trying to solve. Importantly, we’ve seen a significant speed-up in how long it takes us to annotate images compared to an off-the-shelf option.

If you’re interested in a custom solution for image annotation or how you can extend your core photography to get more information from it, please contact information@solvegeosolutions.com.

Solve blog

Solve at useR! 2018

Tom Carmichael

Going to an open source, coding-based conference with a grand total of one geology-themed talk over four days might have seemed like a dicey proposition for Solve Geosolutions, but useR! 2018 was worth the trip to Brisbane many times over (not least as a respite from Melbourne’s weather).

The conference proper was preceded by a day of tutorials, covering everything from best practices when imputing values (from Julie Josse, author of the excellent reference R for Statistics) through to the easiest ways to extend xgboost and mxnet (run by Tong He, one of the original authors of the xgboost algorithm). The conference itself was excellent and covered just about every aspect of how R is being used to solve interesting problems, through to the strengths of the R community in general and what the future of R will look like.

In absolutely no order whatsoever, a few of the favourite talks attended by the Solve team across the week were:

Enabling Analysts: Embracing R in a National Statistics Office – Chris Hansen, from Stats NZ, spoke about making the business case for moving from Stata and SPSS to R and the practicalities of making that move (including the idea of a central R server that everyone uses instead of local instances), the challenges they faced when implementing R into existing workflows, and how they addressed them.

clustree: a package for producing clustering trees using ggraph – Luke Zappia presented a novel way to integrate the results from several different clustering solutions and use this to try to identify the optimum number of clusters for a given dataset.

DALEX will help you to understand your predictive model – Przemyslaw Biecek, from the Warsaw University of Technology, has produced a new package for interrogating predictive models called DALEX. This package is algorithm agnostic and aims to remove some of the mystery behind black box algorithms such as Random Forest and XGBoost. DALEX is particularly useful for investigating where a model struggles and where it excels.

Speeding up computations in R with parallel programming in the cloud – David Smith. foreach is an amazing R package that lets anything you can write as a simple for loop run in parallel, and Microsoft demonstrated an R package that makes it (almost) as simple to send large jobs off to an Azure cluster (see the short foreach sketch after this list).

Maxcovr: Find the best locations for facilities using the maximal covering location problem – Nicholas Tierney. Where should you put the next 50 WiFi hotspots to best service the public transport system? What about installing a few more AED resuscitation kits around a sporting complex? These are the questions that Nicholas Tierney’s fantastic maxcovr package can answer, and it’s not too much of a stretch to see how these principles could be applied to mining and exploration.
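For anyone who hasn’t used foreach, a minimal local sketch looks something like this (the Azure backend demonstrated in the talk simply swaps in a different registration step):

  library(foreach)
  library(doParallel)

  # Register a local parallel backend with two worker processes
  registerDoParallel(cores = 2)

  # Anything expressible as a simple for loop can be farmed out with %dopar%
  results <- foreach(i = 1:8, .combine = c) %dopar% {
    sqrt(i)  # stand-in for a slow, independent computation
  }

  results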

And in an odd coincidence, in the middle of the last day of the conference, this article in the Sydney Morning Herald quoted Rio Tinto CEO Jean-Sebastien Jacques as saying:


“It is absolutely clear that technology, automation, artificial intelligence and digitalisation will play a more important role across industry, and it’s fair to say, in Australia today, it is difficult to find data scientists”


That may or may not be true, but having attended a conference where a cognitive psychologist has become a senior data scientist at Booking.com, we have to ask: should this be a model for the mining industry going forward?

If you’re interested in any of the talks that were given during useR! 2018, the talks list is here, and the vast majority have been made available to watch here.

Tom, Mark and Liam at #user2018.

Solve blog

Orange: entry level data mining

Tom Carmichael

Data mining has a strange relationship with geology. When it is described as "the process of finding patterns in large datasets, with the goal of extracting understandable structure, relationships and knowledge", most geologists agree that it is something that we should, at the very least, be considering.

The problem is that data mining typically assumes extensive background knowledge from a number of different fields, including statistics, machine learning and artificial intelligence.

We are starting to see these disciplines creep into geology in different ways. In the academic sphere, Matthew Cracknell’s excellent PhD thesis looked at several ways machine learning can be used for lithological classification, while UWA, typically a leader in applied research, has shown the potential of large-scale prospectivity studies for iron ore exploration.

Within industry, the Integra GoldCorp challenge was won by SGS Geostat, who utilised "a prospectivity scoring system that harnessed both geological knowledge and machine learning". At the Tropicana mine in WA, the mill uses a combination of XRF, hyperspectral and mill processing information to determine the likely power load before material enters the mill.

All of this might seem like a very specialised job – and it certainly is at this level. However, there are entry-level, open-source products which can be used to process your data in a robust way and allow key learnings to be identified quickly and easily.

Orange is a data mining tool developed at the University of Ljubljana. What distinguishes it from typical statistical software is that it is a visual programming language, operated through a series of widgets. A drawback of this approach is that it doesn’t allow you to drill down very far into your analysis.

A useful product that we’re often asked to create for clients is a feature selection: an indication of which inputs (typically geochemistry or hyperspectral data) are the most important in defining a certain classification (lithology, alteration, stratigraphy or some other scheme).

In Orange, it’s simply a matter of dragging in the File widget (to import your original .csv or text file, with both the classification and the inputs merged together), connecting it to a Select Columns widget to define the target variable and possible determinants, and then finally adding a Rank widget.

All of these can be found in the Data tab of Orange. The example data you’ll see is from a survey of the Loongana 250K map sheet region in southern Western Australia, which has three different rock classifications (Foliated Metagranite, Metagranitic Rock and Migmatic Gneiss) with lab XRF and basic geochemistry attached. It is provided by the Department of Mines and Petroleum through their West Australian Resource Information and Map Services (WARIMS) and can be found here.
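If you’d rather script this step, a roughly equivalent ranking can be sketched in R using random forest variable importance in place of Orange’s Rank widget. The file name and the Lithology column below are hypothetical stand-ins for the exported table:

  library(randomForest)

  # Hypothetical export: one row per sample, a lithology class plus numeric geochemical inputs
  loongana <- read.csv("loongana_geochem.csv")
  loongana$Lithology <- factor(loongana$Lithology)

  # Fit a random forest purely to score variable importance
  rf <- randomForest(Lithology ~ ., data = loongana,
                     importance = TRUE, na.action = na.omit)

  # Rank the inputs by mean decrease in accuracy, largest first
  imp <- importance(rf, type = 1)
  head(imp[order(imp, decreasing = TRUE), , drop = FALSE], 10)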

This is a quick way to identify feature importance for your lithologies. There are several layers of complexity that separate good feature selection from bad: a thorough analysis of the distribution of the geochemical data, the number of samples in each lithology class and some idea of the overall data quality should all be considered before conclusions are drawn from these data.

How might we test the overall veracity of this information? A good approach is to compare an analysis using all of the original input variables (63 inputs) with one using only the variables of high importance (in this case, the top 5). If we scale the data and run k-means clustering with 3 clusters on both data sets, the results are starkly different.
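That comparison can be sketched in R as follows, reusing the hypothetical loongana table from above and an invented vector of the five top-ranked column names:

  # Hypothetical: the five highest-ranked inputs from the ranking step
  top5 <- c("SiO2", "Al2O3", "Fe2O3", "K2O", "MgO")

  complete <- na.omit(loongana)                        # k-means can't handle missing values
  inputs   <- complete[, setdiff(names(complete), "Lithology")]

  set.seed(42)
  km_all  <- kmeans(scale(inputs),         centers = 3, nstart = 25)
  km_top5 <- kmeans(scale(inputs[, top5]), centers = 3, nstart = 25)

  # Cross-tabulate each clustering against the logged lithologies
  table(km_all$cluster,  complete$Lithology)
  table(km_top5$cluster, complete$Lithology)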

Solve blog

Clustering Geoscience data: Part 1 – Choosing the right tool for the job.

Tom Carmichael

As geologists, metallurgists and engineers in the mining industry, most of us spend a significant amount of time categorising rock into classes. Geological classification schemes range from lithology and alteration logging through to grade control and processing. These classes are at the centre of how we mine a resource, and they are typically derived from a model constructed by an expert.

But what if we want to understand the domains or structure of our data in the absence of an expert-driven model? This is where clustering comes in. The most common form of unsupervised learning is clustering, where we want to divide our data into its natural or data-driven groups. Data that belong to the same cluster should be similar with respect to the similarity measure you have chosen (e.g. Euclidean distance, correlation, etc.), and clusters should be as distinct and compact as possible. Clustering is easy to execute but complex to perform in an optimal and meaningful way. We can quickly cluster data in numerous software packages with the click of a button, but the real complexity in clustering can be divided into the following themes:

  1. Algorithm selection: what is the most appropriate clustering method for your task, and what parameters will be set?
  2. Choosing and preparing appropriate input data: of all the available data, what will be put into the clustering algorithm, what will be kept out, and how will it be pre-processed?
  3. Interpreting the resultant clusters: what are they made of? Are they optimal and/or meaningful?

In this blog post we will work through problem 1 – what clustering method to use for your particular task. We will address parts 2 and 3 in future posts.

Clustering geoscience/geological data (or other data derived from natural phenomena) has its own unique set of challenges, as these data often don’t follow a well-behaved statistical distribution (e.g. normal, exponential), which most clustering algorithms assume (e.g. k-means).

Clustering Corescan Mineralogy

One of the key downhole datasets that we routinely cluster is Corescan hyperspectral mineralogy. This is a dense source of mineral abundance and chemistry data that responds well to clustering and produces mineralogical domains that are highly informative to geologists, metallurgists and engineers.

The first and extremely important step (which we will discuss further in future posts) is understanding which input data are appropriate to include in the clustering. As clustering is an unsupervised method, this is one of the few ways we can impose some domain knowledge on the process. However, the strength of unsupervised workflows such as clustering is that they explore the natural variation and structure in our data, so we don’t want to impose too many of our own rules.

When clustering a high dimensional dataset such as Corescan, we can use various dimensionality reduction techniques (common methods include PCA, t-SNE and high correlation filters) to reduce the 30+ variables to a lower dimensional space and remove highly correlated variables.

In the example dataset below, we’ve transformed our high dimensional data into two dimensions. (The axes have no real meaning apart from illustrating relative distance/similarity between samples.)
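A transformation like this can be sketched in R with the Rtsne package; mineral_abundance is a hypothetical matrix of Corescan abundance variables, one row per interval:

  library(Rtsne)

  # mineral_abundance: hypothetical matrix, one row per 25 cm interval,
  # one column per mineral abundance/chemistry variable (30+ columns)
  set.seed(1)
  tsne_out <- Rtsne(scale(mineral_abundance), dims = 2, perplexity = 30,
                    check_duplicates = FALSE)

  # Two-dimensional embedding used in the plots that follow
  embedding <- data.frame(x = tsne_out$Y[, 1], y = tsne_out$Y[, 2])
  plot(embedding, pch = 16, cex = 0.5)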

Each data point in the above plot is a high dimensional hyperspectral abundance vector for a sample of scanned drill core (in this instance, a 25 cm interval of 500-micron data from the Corescan HCI-3 instrument).

Each point represents a 25cm interval of Corescan mineral abundance.

The above plot was generated using the t-SNE algorithm, which groups the points by n-dimensional similarity. Just by looking at the data, there are some obvious (and some not so obvious) clusters that can be pulled out. This is real data, so our clusters have different sizes and densities, along with some points that don’t appear to fit into any cluster. The next part of the process is to cluster these data into meaningful groups (meaningful in both a geological and a mathematical context).


k-means

The first algorithm we will use is the ubiquitous and much-maligned k-means algorithm. There are several parameters to set for k-means (e.g. initialisation), but we’ll keep the defaults and use the silhouette method to determine the optimum number of clusters (in this instance, 6).
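A minimal version of that step, reusing the hypothetical embedding from the t-SNE sketch above (base kmeans() plus average silhouette width from the cluster package):

  library(cluster)

  d <- dist(embedding)

  # Average silhouette width for k = 2..10
  avg_sil <- sapply(2:10, function(k) {
    km <- kmeans(embedding, centers = k, nstart = 25)
    summary(silhouette(km$cluster, d))$avg.width
  })

  best_k <- (2:10)[which.max(avg_sil)]   # 6 in this example
  km <- kmeans(embedding, centers = best_k, nstart = 25)
  plot(embedding, col = km$cluster, pch = 16, cex = 0.5)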

Points coloured by 6 k-means cluster values

As you can see, the results are not optimal. There appear to be too few clusters, and some cluster boundaries dissect dense regions of data.

The k-means algorithm has a few obvious flaws: you must choose the number of clusters (k), which can be very difficult to determine; k-means is sensitive to the initial conditions of the algorithm (see k-means++); and an algorithm based around the arithmetic mean is likely to be adversely affected by clusters of different size and density. k-means is also not a true clustering method in the sense that it partitions the data into k clusters but does not allow for outliers to exist; every sample must belong to a cluster. It is more fitting to say we are partitioning data when we use k-means, not clustering.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

For this dataset we are better off using a density-based clustering method, which clusters points based on the density of their distribution: an island of densely distributed points forms a cluster, while sparsely distributed points are classed as outliers. We also don’t have to choose the number of clusters we want; that number is borne out by the data and a few parameters of the algorithm (e.g. minimum points and epsilon). The results of a DBSCAN analysis on the same data are presented below.
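The call itself is short with the dbscan package; the eps and minPts values below are illustrative only and would need tuning against the real data:

  library(dbscan)

  # Density-based clustering of the 2-D embedding; points that never reach
  # the density threshold are labelled 0 (noise/outliers)
  db <- dbscan(embedding, eps = 2, minPts = 10)
  table(db$cluster)

  # Outliers (cluster 0) in dark blue, clusters in other colours
  cols <- c("darkblue", rainbow(max(db$cluster)))[db$cluster + 1]
  plot(embedding, col = cols, pch = 16, cex = 0.5)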

Points coloured by 8 DBSCAN clusters with outliers coloured dark blue

The dark blue values are outliers and haven’t been assigned to any cluster. In geology, outliers are often of key interest, so this gives us another way to look at our data. Even with this marked improvement over k-means, we still see some odd behaviour.

OPTICS (Ordering Points To Identify the Clustering Structure)

One weakness of the DBSCAN algorithm is that it can struggle with varying densities of points (again, a common phenomenon in geological data). This can be overcome by using the OPTICS algorithm, which allows for clusters of differing densities.
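A sketch of that step with the same dbscan package, using the xi method to extract clusters of varying density (parameter values again illustrative):

  library(dbscan)

  # Build the OPTICS reachability ordering, then extract variable-density
  # clusters; cluster 0 again marks outliers
  opt <- optics(embedding, minPts = 10)
  opt <- extractXi(opt, xi = 0.05)
  table(opt$cluster)

  cols <- c("darkblue", rainbow(max(opt$cluster)))[opt$cluster + 1]
  plot(embedding, col = cols, pch = 16, cex = 0.5)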

Points coloured by 8 OPTICS cluster values with outliers coloured dark blue.

The OPTICS algorithm gives a similar result to DBSCAN; however, there are some small but potentially important differences, mainly in the way outliers have been defined. The clusters are more spatially distinct, with more outliers being defined in the boundary regions of clusters.

There’s still no guarantee that these results provide a geologically reasonable solution; however, in our experience a density-based method provides a more useful, mathematically optimal solution than traditional k-means (or other algorithms that use elements of k-means). We’ll show you what each of these results looks like downhole in a future post and address some additional issues typically encountered when clustering data.

Many geologists have seen unsupervised methods and been unsatisfied with the results. It’s important to understand why those results might have been poor and, more importantly, what can be done to improve them.