6 Prediction models

You are reading the work-in-progress of the SoilSpec4GG manual. This chapter is currently draft version, a peer-review publication is pending.

We have received important feedback since the release of the first models and web services (Dec. 2021). The feedback was that the previous model versions were doing a great job, but for others not that much.

Variable model performance may have happened due to the specific soil types not being well represented or due to the spectra not being well aligned with the OSSL instruments. This led us to improve the current outputs to include a flag that indicates if the new samples to be predicted are represented by the calibration set. In addition to that, we revised our uncertainty estimation method by switching to conformal predictions, a simple and robust method for delivering uncertainty bands.

In addition to that, we conducted a systematic analysis of learning algorithms, compression strategies, and preprocessing using the OSSL database and external test sets, and the insights from ring trial experiment - a separate project that was developed to understand the dissimilarity across multiple soil spectroscopy laboratories.

We also removed spatial covariates and even not fused spectral regions compared to the first models, in order to make them simpler and generally applicable. These further combinations require more time to verify the potential performance improvements, although recent literature have been supporting it.

The following sections describe the steps used to produce the OSSL models, which are used in the OSSL Engine and API, and also public available from Google Cloud Storage.

It is important to note that the current models are not free of error and may still provide variable results. Please, let us know about any inconsistency you might find to help us improve even more the current models.

For an in-depth inspection of the modeling steps, please visit the ossl-models GitHub repository.

6.1 Modeling framework

We have used the MLR3 framework (Lang et al., 2019) for fitting machine learning models, specifically with the Cubist algorithm (Quinlan, 1992, 1993).

The R-mlr folder on ossl-models GitHub presents the code used to calibrate the models.

In summary, we have provided 5 different model types depending on the availability of samples in for each soil property, which was fitted without the use of ancillary information (na code), i.e., site information or environmental layers are not used as predictors, only the spectral variation.

The model types are composed of two different subsets, i.e. using the KSSL soil spectral library alone (kssl code) or the full OSSL database (ossl code), in combination with three spectral types: VisNIR (visnir code), NIR from the Neospectra instrument (nir.neospectra code), and MIR (mir code).

spectra_type subset geo model_name
mir kssl na mir_cubist_kssl_na_v1.2
mir ossl na mir_cubist_ossl_na_v1.2
nir.neospectra ossl na nir.neospectra_cubist_ossl_na_v1.2
visnir ossl na visnir_cubist_ossl_na_v1.2
visnir kssl na visnir_cubist_kssl_na_v1.2

We highly recommend using the OSSL model types. The models fitted exclusively with the KSSL can be used when the spectra to be predicted has the same instrument manufacturer/model as the spectrometers used to build the KSSL VisNIR and MIR libraries, and the KSSL library is representative for the new spectra based both on spectral similarity and range of soil properties of interest.

The machine learning algorithm Cubist (coding name cubist) takes advantage of a decision-tree splitting method but fits linear regression models at each terminal leaf. It also uses a boosting mechanism (sequential trees adjusted by weights) that allows the growth of a forest by tuning the number of committees. We haven’t used the correction of final predictions by the nearest neighbors’ influence due to the lack of this feature in the MLR3 framework.

As predictors, we have used the first 120 PCs of the compressed spectra, a threshold that considers the trade-off between spectral representation and compression magnitude (Chang, Laird, Mausbach, & Hurburgh Jr, 2001). The remaining farther components were used for the trustworthiness test, i.e., checking if a sample is underrepresented in respect of the feature space from the calibration set because of some unique features presented in those farther components (Jackson & Mudholkar, 1979; Santana et al., 2023).

Before PCA compression, spectra was preprocessed with Standard Normal Variate (SNV) to reduce some of the variability caused by contrasting instruments and SOPs, besides the minimization of particle size, light scattering, and multicollinearity known issues of diffuse reflectance (Barnes, Dhanoa, & Lister, 1989).

Hyperparameter optimization was done with internal resampling (inner) using 5-fold cross-validation and a smaller subset for speeding up this operation (Yang & Shami, 2020). This task was performed with a grid search of the hyperparameter space testing up to 5 configurations of Comitees to find the lowest Root Mean Squared Error (RMSE). The final model with the best optimal hyperparameter was fitted at the end with the full train data.

Some highly-skewed soil properties were natural-log transformed (with offset = 1, that is why we use the log1p function) to improve the prediction performance (Dangal, Sanderman, Wills, & Ramirez-Lopez, 2019). They were back-transformed only at the end after running all modeling steps, including performance estimation and the definition of the uncertainty intervals.

Final fitted models are listed on GitHub.

6.2 Model performance

Model evaluation was performed with external (outer) 10-fold cross validation of the tuned models using RMSE (rmse), mean error (bias), R squared (rsq), Lin’s concordance correlation coefficient (ccc), and the ratio of performance to the interquartile range (rpiq) (Wadoux et al., 2021).

The final fitted models along with their performance metrics are listed on GitHub.

All validation scatterplots are available on GitHub. Example of a validation plot.

Validation scatterplot for `log..c.tot_usda.a622_w.pct` using `mir_cubist_ossl_na_v1.2`.

Figure 6.1: Validation scatterplot for log..c.tot_usda.a622_w.pct using mir_cubist_ossl_na_v1.2.

6.3 Uncertainty and trustworthiness

The cross-validated predictions were used to estimate the unbiased absolute error, which were employed to calibrate uncertainty models via conformal prediction intervals.The conformal prediction is built upon past experience, i.e. the error model is calibrated using the same fine-tuned structure of the respective response model. Non-conformity of the calibration set were estimated from the predicted error with a defined confidence level of 68% to approximate one standard deviation. The confidence interval (CI) can be derived with the prediction of the response and error values corrected by the non-conformity score of the calibration set (Cortés-Ciriano et al., 2015; Norinder, Carlsson, Boyer, & Eklund, 2014). The main advantage of conformal prediction is that the intervals are always valid, i.e. a confidence of 80% means that the predicted confidence regions may contain the observed value in at least 80% of the cases. Another big advantage of conformal prediction methods is that we take rid of resampling mechanisms such as bootstrapping or Monte Carlo simulation which significantly requires a huge computing time for large sample sizes.

The trustworthiness test was built using the PCA model employed for compression (Jackson & Mudholkar, 1979; Santana et al., 2023). Considering that only the first 120 PCs are used to decompose spectra for modeling fitting, the remaining variation from farther components are denominated residual, which is calculated by the difference between the original spectra and back-transformed with the first 120 PCs, which makes possible to calculate the Q-statistics (sum of squared residual across the spectrum). The Q statistic of new samples are compared against a Q critical value estimated across the calibration set with a 99% confidence level.

6.4 Using the OSSL models

To load the complete analysis-ready models, train data, cross-validated predictions, validation performance metrics, and validation plot in R, please use the public URLs described GitHub.

qs is a serialized and compressed file format that is faster than native R rds. You need to have qs package version >= 0.25.5 to load files direct from the URLs.

NOTE: For using the trained models with new spectra, the supplemented data must be formatted following the defined spectral ranges:
- VisNIR: 400 - 2500 nm
- NIR Neospectra: 1350 - 2550 nm
- MIR: 600 - 4000 cm-1

Shorter intervals will raise an error! Users can impute the edges and fill the missing columns with the last available value of the border wavelengths.

We provided on GitHub both examples of datasets and a prediction bundle that incorporates all processing operations to get the outputs.

Please, check the example datasets for formatting your spectra to the minimum required level of the prediction bundle You can provide either csv files or directly asd or opus (.0) for VisNIR and MIR scans, respectively.

The results table has the prediction value (already back-transformed if log transformation was used) for the soil property of interest, 1-standard-deviation uncertainty band, and a flag column for potential underrepresented samples given the calibration data.

The prediction function requires the tidyverse, mlr3 and mlr3extralearners, Cubist. qs, asdreader, opusreader2, and matrixStats packages.

Parameters:

  • target: the soil property of interest. Log-transformed properties must have log.. appended at the beginning of the name as indicated in the export_name column.
  • spectra.file: the path for the spectral measurements, either a csv table (following the sample-data specifications), asd, or opus (.0) file.
  • spectra.type: the spectral range of interest. Either visnir, nir.neospectra, or mir.
  • subset.type: the subset of interest, either the whole ossl or the kssl alone.
  • geo.type: only available for na.
  • models.dir: the path for the ossl_models folder. Should follow the same structure and naming code as represented in fitted_models_access.csv.

All files that represent the ossl_models directory tree for local run or online access are described in ossl_models_directory_tree.csv.

Please note that the soil properties indication follows the export name. In addition, check fitted_models_performance_v1.2.csv for the complete list of models as some spectra types are not completely available. More importantly, for natural-log soil properties, the upper and lower bands were estimated before the back-transformation.

source("R-mlr/OSSL_functions.R")

list.files("sample-data")

clay.visnir.ossl <- predict.ossl(target = "clay.tot_usda.a334_w.pct",
                                 spectra.file = "sample-data/101453MD01.asd",
                                 spectra.type = "visnir",
                                 subset.type = "ossl",
                                 geo.type = "na",
                                 models.dir = "~/mnt-ossl/ossl_models/")

6.4.1 Registering and hosting models

If you plan to contribute with your own models to the Soil Spectroscopy for Global Good project, please follow these steps:

  1. We recommend registering and sharing new models via S3 file service system (Amazon or Google Cloud) with RDS files produced using Dockerized solution for full scientific reproducibility.
  2. For each model the following metadata need to be supplied:
    • Registered docker image; which should specify in detail: software in use, R package versions etc;
    • DOI of the zenodo repository with a back-up for the data (i.e. how to cite your models);
    • URL for accessing the RDS file via S3;

We can host your models on our infrastructure at no additional costs. Please contact us and apply for registering your models via soilspectroscopy.org.