Clustering Geoscience data: Part 1 – Choosing the right tool for the job.
May 23, 2017
As Geologists, Metallurgists and Engineers in the mining industry, most of us spend a significant amount of time categorising rock into classes. Geological classification schemes range from lithology and alteration logging, through to grade control and processing. These classes are at the centre of how we mine a resource. These rock classifications are typically derived from a model constructed by an expert.
But what if we want to understand the domains or structure of our data in the absence of an expert-driven model? This is where clustering comes in. The most common form of unsupervised learning is clustering, where we want to divide our data into its natural or data-driven groups. Data that belong to the same cluster should be similar with respect to the similarity measure you have chosen (e.g. Euclidean distance or correlation), and clusters should be as distinct and compact as possible. Clustering is easy to execute but complex to perform in an optimal and meaningful way. We can quickly cluster data in numerous software packages with the click of a button, but the real complexity in clustering can be divided into the following themes:
- Algorithm selection – what is the most appropriate clustering method for your task, and what parameters will be set?
- Choosing/preparing appropriate input data – of all the available data, what will be put into the clustering algorithm, what will be kept out, and how will it be pre-processed?
- Interpreting the resultant clustered solutions – what are the clusters made of? Are they optimal and/or meaningful?
In this blog post we will work through problem 1 – what clustering method to use for your particular task. We will address parts 2 and 3 in future posts.
Clustering geoscience/geological data (or other data derived from natural phenomena) has its own unique set of challenges, as these data often do not follow a standard statistical distribution (e.g. normal, exponential), which many clustering algorithms implicitly assume (e.g. k-means).
Clustering Corescan Mineralogy
One of the key downhole data-sets that we routinely cluster is Corescan hyperspectral mineralogy. This is a dense source of mineral abundance and chemistry data that responds well to clustering and produces mineralogical domains that are highly informative to geologists, metallurgists and engineers.
The first and extremely important step (which we will discuss further in future posts) is understanding what input data is appropriate to be included in the clustering. As clustering is an unsupervised method, this is one of the few ways we can impose some domain knowledge on the process. However, the strength of unsupervised learning workflows such as clustering is that we want to explore the natural variation and structure in our data, therefore we don’t want to impose too many of our own rules.
When clustering a high dimensional dataset such as Corescan, we can use various dimensionality reduction techniques (common methods include PCA, t-SNE, high correlation filter etc) to reduce (30+) variables to a lower dimensional space and remove highly correlated variables.
In the example dataset below, we’ve transformed our high dimensional data into two dimensions. (The axes have no real meaning apart from to illustrate relative distance/similarity between samples.)
Each data point in the above plot is a high-dimensional hyperspectral abundance vector for a sample of scanned drill core. (In this instance, a 25 cm interval of 500-micron data from the Corescan HCI-3 instrument.)
The above plot was generated using the t-SNE algorithm and is grouping the points by n-dimensional similarity. Just by looking at the data, there are some obvious (and some not so obvious) clusters that can be pulled out. This is real data, so our clusters have different sizes and densities along with some points which don’t appear to fit into any cluster. The next part of the process is to cluster these data into meaningful groups (meaningful in a geological context and a mathematical context).
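A minimal sketch of how such an embedding can be produced with scikit-learn's t-SNE implementation. The random matrix here is only a stand-in for a real (samples × minerals) Corescan abundance table, and the perplexity value is an illustrative assumption:

```python
# Sketch: embedding a high-dimensional abundance matrix in 2-D with t-SNE.
# The random matrix stands in for real Corescan mineral-abundance data.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.random((300, 30))  # 300 samples x 30 mineral-abundance variables (illustrative)

# perplexity controls the effective neighbourhood size considered per point
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(embedding.shape)  # one 2-D coordinate per sample
```

The two output axes carry no physical meaning; only the relative distances between points are interpretable.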
The first algorithm we will use is the ubiquitous and much-maligned k-means algorithm. There are several parameters to set for k-means (e.g. initialisation), but we’ll keep them default and use the silhouette method to determine the optimum number of clusters (in this instance, 6).
As you can see, the results are not optimal. There appear to be too few clusters, and some cluster boundaries dissect dense regions of data.
The k-means algorithm has a few obvious flaws: you must choose the number of clusters (k), which can be very difficult to determine; k-means is sensitive to the initial conditions when the algorithm runs (see k-means++); and an algorithm built around the arithmetic mean is likely to be affected adversely by clusters of different size and density. k-means is also not a true clustering method in the sense that it partitions the data into k clusters but does not allow for outliers to exist. Every sample must belong to a cluster. It is more fitting to say we are partitioning data when we use k-means, not clustering.
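The k-means-plus-silhouette selection described above can be sketched as follows. Synthetic blobs stand in for the 2-D embedding, and the candidate range of k is an assumption for illustration:

```python
# Sketch of the silhouette method for choosing k with k-means.
# make_blobs is a stand-in for the real 2-D t-SNE embedding.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=400, centers=6, cluster_std=0.7, random_state=1)

scores = {}
for k in range(2, 11):  # candidate cluster counts to evaluate
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean silhouette, in [-1, 1]

best_k = max(scores, key=scores.get)  # k with the highest mean silhouette
print(best_k)
```

Higher mean silhouette indicates clusters that are internally compact and well separated from their neighbours, which is why the peak is used to pick k.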
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
For this dataset we are better off using a density-based clustering method. A density-based method clusters points according to the density of their distribution: an island of densely distributed points forms a cluster, while sparsely distributed points are classed as outliers. We also don't have to choose the number of clusters we want; that number emerges from the data and a few parameters of the algorithm (e.g. minimum points and epsilon). The results from a DBSCAN analysis on the same data are presented below.
The dark blue values are outliers and they haven’t been assigned to any cluster. In geology outliers are often of key interest, so this gives us another way to look at our data. Even with this marked improvement from k-means, we still see some odd behaviour.
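A short sketch of DBSCAN in scikit-learn, showing how outliers fall out of the labelling. The synthetic blobs stand in for the embedded Corescan data, and the `eps`/`min_samples` values are illustrative assumptions that would need tuning on real data:

```python
# Sketch of DBSCAN on synthetic 2-D data (a stand-in for the t-SNE embedding).
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.5, random_state=0)

# eps = neighbourhood radius; min_samples = points needed to form a dense core
labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)

# DBSCAN assigns the label -1 to points it considers outliers (noise)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_outliers = int((labels == -1).sum())
print(f"clusters: {n_clusters}, outliers: {n_outliers}")
```

Unlike k-means, no cluster count is supplied; it emerges from the density parameters, and noise points remain unassigned rather than being forced into a cluster.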
OPTICS (Ordering Points To Identify the Clustering Structure)
One weakness in the DBSCAN algorithm is that it can struggle with different densities of points (again, a common phenomenon in geological data). This can be overcome by using the OPTICS algorithm, which allows for clusters of differing densities.
The OPTICS algorithm gives a similar result to DBSCAN; however, there are some small but potentially important variations, mainly in the way outliers have been defined. We see that the clusters are more spatially distinct, with more outliers defined in the boundary regions of clusters.
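OPTICS can be run in much the same way. The sketch below builds two synthetic clusters of deliberately different densities (the scenario where DBSCAN struggles); the parameter values are assumptions for illustration only:

```python
# Sketch of OPTICS on synthetic clusters of differing densities.
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
dense = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(200, 2))   # tight cluster
sparse = rng.normal(loc=(6.0, 6.0), scale=1.5, size=(200, 2))  # diffuse cluster
X = np.vstack([dense, sparse])

# xi controls how steep a drop in reachability is treated as a cluster boundary
labels = OPTICS(min_samples=15, xi=0.05).fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 = outliers
print(n_clusters)
```

Because OPTICS orders points by reachability rather than applying a single global radius, clusters of quite different densities can be extracted from one run.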
There’s still no guarantee that these results provide a geologically reasonable solution, however in our experience a density-based method provides a more useful, mathematically optimal solution than the traditional k-means (or other algorithms that use elements of k-means). We’ll show you what each of these results look like down hole in a future post and address some additional issues typically encountered when clustering data.
Many geologists have seen unsupervised methods and have been unsatisfied with the results. It’s important to understand why the results might have been poor and, more importantly, what can be done to improve this.