Clustering Geoscience Data: Part 2 – Selecting and preparing the inputs for clustering
August 22, 2017
In our previous post – Clustering Geoscience data: Part 1 – Choosing the right tool for the job – we talked about algorithm selection. In this post we continue with this theme and discuss how to choose and prepare appropriate input data for clustering, and how these choices can impact the end results.
Clustering is an unsupervised method of exploring the structure of data by grouping together samples that are similar (with respect to some distance/similarity measure). Even though the clustering itself is unsupervised, we can – and often should – apply some supervision and guidance by being selective about what variables are used and how they are prepared. We also need to consider the most appropriate clustering method and what type of distance/similarity metric will be used.
Different sets of inputs will produce clusters that can have very different meanings. It is tempting to use every available variable, but doing so often produces clusters that are difficult to interpret, particularly for the spatial data common in geoscientific datasets.
Using the examples below, we will try to illustrate why we need to understand the question we are asking when we apply clustering, why we need to use our domain knowledge to choose and prepare the input data correctly, and why it is often not good practice to throw everything in and hope for the best!
Most importantly, we need to use our domain knowledge to choose appropriate input variables to cluster. For instance, if you want to understand surface properties, only use variables that sense the surface, or filter deeper geophysics to emphasise the surface signal. If you want to investigate the structure of alteration domains, only select alteration mineralogy in your Corescan data.
In the example below we use common and freely available surface datasets (Radiometrics, Landsat 8 and SRTM), Euclidean distance and k-means to produce a surface cluster map that should help identify coherent domains within the regolith and outcropping geology. These datasets are appropriate for such a task as they all measure properties either at or very close to the surface.
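For readers who want to see the mechanics, here is a minimal sketch of k-means with Euclidean distance in plain Python. It is not the workflow used to produce the map; the six "pixels" are made-up, already-standardised values for three hypothetical surface variables, and the starting centroids are fixed by hand so the run is repeatable.

```python
import math

def kmeans(points, centroids, n_iter=20):
    """Basic Lloyd's k-means using Euclidean distance.

    points    : list of equal-length tuples (one tuple per sample)
    centroids : initial cluster centres, one tuple per cluster
    """
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each sample joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centre
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return labels, centroids

# Hypothetical standardised values: (radiometric, Landsat band, elevation)
pixels = [
    (-1.0, -0.9, -1.1), (-0.8, -1.1, -0.9), (-1.2, -1.0, -1.0),
    (1.0, 0.9, 1.1), (0.9, 1.1, 1.0), (1.1, 1.0, 0.9),
]
labels, centres = kmeans(pixels, centroids=[pixels[0], pixels[3]])
```

In practice a library implementation with several random restarts would normally be preferred; the point here is only that every assignment depends on the Euclidean distance, so whatever goes into that distance shapes the clusters.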
If the input data contain sets of variables that are highly correlated, the information contained within these variables will be over-represented in the resultant clustered solutions (when using the commonly applied Euclidean distance). This is particularly relevant for geological datasets, which often contain highly correlated variables (e.g. geochemistry and multispectral data); see Figure 2.
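A toy calculation shows the effect. The sketch below uses two made-up samples that differ by the same amount in two signals, then repeats one signal to mimic several perfectly correlated measurements of the same property. All names and values are hypothetical.

```python
# Two hypothetical samples described by (signal_a, signal_b), same scale.
sample_1 = (0.0, 0.0)
sample_2 = (1.0, 1.0)

def with_copies(sample, n_copies):
    """Repeat signal_a n_copies times, mimicking n_copies perfectly
    correlated variables that all carry the same information."""
    a, b = sample
    return (a,) * n_copies + (b,)

def share_of_a(n_copies):
    """Fraction of the squared Euclidean distance that comes from signal_a."""
    p = with_copies(sample_1, n_copies)
    q = with_copies(sample_2, n_copies)
    squared = [(x - y) ** 2 for x, y in zip(p, q)]
    return sum(squared[:n_copies]) / sum(squared)

share_single = share_of_a(1)  # signal_a included once
share_triple = share_of_a(3)  # signal_a included three times
```

With one copy, signal_a contributes half of the squared distance; with three copies it contributes three quarters, even though no new information has been added.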
This can be dealt with by omitting some of the highly correlated variables, or by using methods such as Principal Components Analysis (PCA) to reduce many correlated variables down to a smaller number of uncorrelated principal components. Replacing Euclidean distance with another similarity/distance measure such as the Mahalanobis distance can also deal with this issue.
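As a rough illustration of the last option, the sketch below computes the Mahalanobis distance for two variables by hand. The Mahalanobis distance rescales differences by the covariance of the data, so a step along a strong shared trend counts for less than a step of the same Euclidean length across it. The data values and function names are invented for the example, and the closed-form 2x2 inverse only works for the two-variable case.

```python
import math
from statistics import mean

def covariance_2d(xs, ys):
    """Sample variances and covariance of two variables."""
    mx, my = mean(xs), mean(ys)
    n = len(xs)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return sxx, sxy, syy

def mahalanobis_2d(p, q, cov):
    """Mahalanobis distance between points p and q for a 2x2 covariance."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2  # must be > 0 (not perfectly correlated)
    dx, dy = p[0] - q[0], p[1] - q[1]
    # d^T C^-1 d, using the closed-form inverse of a 2x2 matrix
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)

# Hypothetical, strongly correlated pair of variables
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.0]
cov = covariance_2d(xs, ys)

origin = (0.0, 0.0)
along_trend = (1.0, 1.0)    # both variables increase together
across_trend = (1.0, -1.0)  # variables move in opposite directions

euclid_along = math.dist(origin, along_trend)
euclid_across = math.dist(origin, across_trend)
maha_along = mahalanobis_2d(origin, along_trend, cov)
maha_across = mahalanobis_2d(origin, across_trend, cov)
```

The two steps are identical in Euclidean terms, but the Mahalanobis distance treats the step across the trend as far more unusual than the step along it.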
Distance measures such as Euclidean distance are sensitive to the magnitude of the variables. Inputs with large measurement scales will dominate the calculation of Euclidean distance (e.g. Landsat 8 DN values range from 5000 to 20000, whilst SRTM values commonly fall between 100 and 500).
If we want all variables to be equally weighted we need to standardise the input data. One commonly applied technique is the z-score: subtract each variable's mean and divide by its standard deviation. The resulting data have a mean of zero and a standard deviation of 1, with every sample expressed as the number of standard deviations from the mean.
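The short sketch below shows the z-score applied to two made-up variables on very different scales, loosely based on the ranges quoted above. The values are illustrative only, and the population standard deviation is used here as an assumption; the sample standard deviation would work just as well.

```python
from statistics import mean, pstdev

def zscore(values):
    """Standardise values to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

def share_of_first(p, q):
    """Fraction of the squared Euclidean distance from the first variable."""
    squared = [(a - b) ** 2 for a, b in zip(p, q)]
    return squared[0] / sum(squared)

# Hypothetical values for four samples
landsat_dn = [5000.0, 10000.0, 15000.0, 20000.0]
srtm = [500.0, 100.0, 300.0, 200.0]

raw = list(zip(landsat_dn, srtm))
scaled = list(zip(zscore(landsat_dn), zscore(srtm)))

# How much of the distance between the first two samples comes from Landsat?
raw_share = share_of_first(raw[0], raw[1])
scaled_share = share_of_first(scaled[0], scaled[1])
```

Before standardising, almost all of the distance between the first two samples comes from the Landsat values, simply because the numbers are bigger. After standardising, the large relative difference in SRTM is no longer swamped.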
In the above example, all the data was processed and clustered the same way, with the only difference being which hyperspectral input variables were included. The white mica clustering included six inputs that describe the variation in chemistry and crystallinity of white mica. The general mineralogy clustering (right) included all hyperspectral minerals that were abundant over this interval.
The considerable difference between the two generated logs shows the advantage of tailoring an unsupervised method by carefully selecting what goes into it (hence our mantra, Expert driven, Computer assisted) and interpreting the results accordingly.
In this instance, there are some similarities in both downhole clustering results (cluster from ~70 to ~75m and cluster boundaries at ~120m and ~260m), but there are notable differences too (a single cluster in the model defined by white mica between ~75m and ~120m, while the general mineralogy has a more complex signature with a number of different clusters at each point).
Each of these results serves a different, but potentially equally important purpose. Clustering the abundant minerals will most likely show where consistent domains of similar mineralogy occur. These domains will represent a combination of lithology and alteration. Clustering only variables associated with white mica (composition, crystallinity) will tell us something about weathering, temperature and alteration chemistry.
The input variables, the processing and the clustering algorithm that are chosen all depend on one thing: what patterns you are looking to extract from the data. Throwing all the data in with no pre-processing is fraught with danger and will often lead to more questions than answers. Hopefully this blog has helped to shed some light on the clustering process and some of the pitfalls that can occur along the way.
If you would like to know more, or have any comments or questions, please contact us at firstname.lastname@example.org