Solve blog

Orange: entry level data mining

Tom Carmichael

Data mining has a strange relationship with geology. When it is described as “the process of finding patterns in large datasets, with the goal of extracting understandable structure, relationships and knowledge”, most geologists agree that it is something that we should, at the very least, be considering.

The problem is, typically, data mining assumes extensive background knowledge from a number of different fields, not limited to statistics, machine learning and artificial intelligence.

We start to see these disciplines creeping into geology in different ways. In the academic sphere, Matthew Cracknell’s excellent PhD thesis looked at several different ways machine learning could be used for lithological classification, while UWA, typically a leader in applied research, has shown the potential for large-scale prospectivity studies for iron ore exploration.

Within industry, the Integra GoldCorp challenge was won by SGS Geostat, who utilised “A prospectivity scoring system that harnessed both geological knowledge and machine learning”. At the Tropicana mine in WA, the mill uses a combination of XRF, hyperspectral and mill processing information to predict the likely power load before material enters the mill.

All of this might seem like a very specialised job – and it certainly is at this level. However, there are entry level, open source products which can be used to process your data in a robust way and allow for key learnings to be identified quickly and easily.

Orange is a data mining tool developed at the University of Ljubljana. What distinguishes it from typical statistical software is that it is a visual programming environment: analyses are built by connecting a series of widgets rather than by writing code. A drawback of this approach is that it doesn’t let you drill down into your analysis as far as a scripted tool would.

A useful product that we’re typically asked to create for clients is a feature selection: an indication of which inputs (typically geochemistry or hyperspectral data) are the most important in defining a certain classification (lithology, alteration, stratigraphy or some other grouping).

In Orange, it’s simply a matter of dragging in the File widget (to import your original .csv or text file, with both the classification and the inputs merged together), connecting it to a Select Columns widget to define the target variable and possible determinants, and then finally adding a Rank widget.
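For those who prefer scripting, the same File → Select Columns → Rank workflow can be sketched outside Orange. The snippet below uses scikit-learn’s mutual information score as a stand-in for the Rank widget (which offers several scores, including information gain and FCBF); the column names and values are illustrative placeholders, not the real survey data.

```python
# Scripted analogue of the File -> Select Columns -> Rank workflow.
# Mutual information here stands in for Orange's Rank widget scores;
# the table is a toy example, not the WARIMS dataset.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Toy stand-in for a merged geochemistry + lithology table
df = pd.DataFrame({
    "Cr":   [120, 130, 45, 50, 300, 310],
    "MnO":  [0.10, 0.12, 0.30, 0.28, 0.05, 0.06],
    "SiO2": [65, 66, 64, 65, 66, 64],
    "lithology": ["A", "A", "B", "B", "C", "C"],
})

# Select Columns: split the target variable from the candidate inputs
X = df.drop(columns="lithology")
y = df["lithology"]

# Rank: score each input against the classification and sort
scores = mutual_info_classif(X, y, random_state=0)
ranking = pd.Series(scores, index=X.columns).sort_values(ascending=False)
print(ranking)
```

The highest-scoring columns are the candidates to carry forward into later analyses, just as with the Rank widget’s output.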

All of these widgets can be found in the Data tab of Orange. The example data that you’ll see is from a survey of the Loongana 250K map sheet region in southern Western Australia, which has three different rock classifications, Foliated Metagranite, Metagranitic Rock and Migmatic Gneiss, with lab XRF and basic geochemistry attached. The data is provided by the Department of Mines and Petroleum, through their West Australian Resource Information and Map Services (WARIMS), and can be found here.

This is a quick way to identify feature importance for your lithologies. Several layers of complexity separate good feature selection from bad, however: a thorough analysis of the distribution of the geochemical data, the number of samples of each lithology type and some idea of the overall data quality should all be in hand before conclusions are drawn from these data.
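The checks mentioned above can be run in a few lines before trusting any ranking. A minimal sketch, using placeholder column names rather than the real survey schema:

```python
# Quick sanity checks worth running before trusting a feature ranking:
# class balance, missing assays, and basic distribution statistics.
# Column names and values are illustrative placeholders.
import pandas as pd

df = pd.DataFrame({
    "Cr":  [120, None, 45, 50, 300, 310],
    "MnO": [0.10, 0.12, 0.30, 0.28, 0.05, 0.06],
    "lithology": ["Foliated Metagranite", "Foliated Metagranite",
                  "Metagranitic Rock", "Metagranitic Rock",
                  "Migmatic Gneiss", "Migmatic Gneiss"],
})

class_counts = df["lithology"].value_counts()  # is any lithology under-sampled?
missing = df.isna().sum()                      # gaps in the assays?
summary = df.describe()                        # skewed distributions, outliers?

print(class_counts)
print(missing)
```

A badly unbalanced class count or a column that is mostly below detection limit will distort any importance score, whichever tool computes it.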

How might we test the overall veracity of this ranking? A good check is to compare an analysis using all of the original input variables (63 inputs) against one using only those of high importance (in this case, the top 5). If we scale all the data and run K-means clustering with 3 clusters on both data sets, the results are starkly different.

Figure 1 - An example of a simple workflow in Orange, with a file being imported, transformed, and the importance of each variable ranked. (Widgets have been renamed to better match their purpose.)
Figure 2 - A feature selection graph for the three different rock types in this dataset, sorted by their FCBF value, with the top 5, Cr, MnO, Mo, Na2O and Tm, highlighted.
Table 1 - Comparison of two unsupervised K-means clustering analyses with 3 clusters, one using all 63 input variables and one using only the 5 chosen for their high variable importance as defined in Figure 2. Using only the high-importance variables, the unsupervised clusters correspond well with the initial rock type classification, while the clustering without feature selection bears little resemblance to it.
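The comparison behind Table 1 can be sketched in scikit-learn: scale the data, run K-means with 3 clusters on all inputs and again on only the top-ranked inputs, and cross-tabulate each clustering against the known rock types. The data below is synthetic, built so that 5 of 63 columns carry the lithological signal; it is an illustration of the method, not the WARIMS survey.

```python
# Scale -> K-means (k=3) on all inputs vs. top-ranked inputs only,
# then cross-tabulate clusters against rock type.
# Synthetic data: 5 informative columns plus 58 noise columns.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n = 60
lith = np.repeat(["Granite", "Gneiss", "Metagranite"], n // 3)

# Five columns whose means differ by rock type, plus pure-noise columns
informative = np.column_stack([
    np.where(lith == "Granite", 1.0,
             np.where(lith == "Gneiss", 4.0, 8.0)) + rng.normal(0, 0.3, n)
    for _ in range(5)
])
noise = rng.normal(0, 1, (n, 58))
X_all = np.hstack([informative, noise])  # all 63 inputs
X_top = informative                      # top-5 inputs only

def cluster(X):
    """Scale the inputs, then assign each sample to one of 3 clusters."""
    Xs = StandardScaler().fit_transform(X)
    return KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Xs)

print(pd.crosstab(lith, cluster(X_all)))  # cross-tab with all 63 inputs
print(pd.crosstab(lith, cluster(X_top)))  # cross-tab with top-5 inputs only
```

With only the informative columns, each cluster lines up with a single rock type; with the noise columns included, the distance calculations are dominated by irrelevant variables and the correspondence degrades, which is the same effect Table 1 reports.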