1 Introduction
Predictive soil spectroscopy (PSS) is a technique that uses the interaction of electromagnetic radiation with soils to estimate several soil properties via machine learning or chemometric models. It is a powerful tool for characterizing, measuring, and monitoring soils, as it can provide rapid, cost-effective, and environmentally benign measurements of a number of soil properties from a single scan. However, some aspects require careful consideration during project development, such as:
- Fitness-for-purpose: the choice of the methodology and spectral range of interest should fit a specific objective and be suitable for the study or application in consideration.
- Good predictions flow from good data: the quality and representativeness of the reference soil spectral library (SSL), with analytical reference data, are crucial for the accuracy and reliability of the predictions.
- Good practices for model building: the selection and evaluation of the multivariate calibration or machine learning algorithm, and the validation and interpretation of the PSS models and predictions are essential to ensure the robustness and applicability of results.
Predictive soil spectroscopy requires a reference SSL, which is a collection of soil spectra and corresponding soil properties measured by conventional methods. The SSL serves as the training dataset for building and testing the predictive models. The performance of the predictive models depends largely on how well the SSL represents the variability and diversity of the soil samples that we want to predict. Therefore, we have two options: we can either create a new SSL that is tailored to a specific goal, or we can use an existing public SSL that is sufficiently representative of the samples in our analysis. Examples of such libraries are the USDA NRCS Kellogg Soil Survey Laboratory (KSSL) Soil Spectral Library and the Open Soil Spectral Library.
1.1 Fitness-for-purpose
Soil spectroscopy is a fit-for-purpose technology, which means that the predictive solution must be suitable for a specific goal and context. Using soil spectroscopy without considering its limitations may lead to unsatisfactory results. Therefore, it is helpful to ask some questions before designing and implementing a soil spectroscopy project.
By answering the questions in the following table, one can assess the feasibility and suitability of soil spectroscopy for a specific project and make informed decisions about the best practices and methods to use.
Question | Answer |
---|---|
What soil properties will be predicted? | Some soil properties are easier to predict than others, depending on the association of spectral features with the soil property of interest. For example, soil organic carbon has a strong correlation with absorbance across the infrared range, while extractable nutrients have a weaker correlation. |
What accuracy/precision is required for the project? | The accuracy and precision of soil spectroscopy estimates depend on the quality and representativeness of the calibration dataset, the spectral range and resolution, and the machine learning method used. Soil spectroscopy may not be suitable for projects that require very high precision, such as detecting differences in controlled and replicated field trials. However, it may be adequate for projects that aim to classify soils into distinct classes or values, or to estimate the average values from many samples across a defined spatial boundary. |
What is the budget for the project? | Soil spectroscopy can significantly reduce the cost of soil analysis as compared to traditional methods, especially when dealing with large numbers of samples. However, the cost of soil spectroscopy also varies depending on the type and quality of the instrument used. Research-grade and bench-top laboratory spectrometers may cost hundreds of thousands of dollars, while lower-cost and portable FTIR instruments may cost a few thousand. Additionally, the cost of soil spectroscopy may include other expenses, such as sample preparation, data processing, and model development and validation. |
What instrumentation is available or accessible for the project? | The choice of instrument for soil spectroscopy depends on the spectral region, resolution, and range that are most appropriate for the application. Different instruments have different advantages and disadvantages in terms of performance, portability, and usability. For example, VisNIR instruments are more accessible and easier to use than MIR instruments, but they may have lower accuracy and precision for some soil properties. |
1.1.1 Example 1: Cover crop impact on soil organic carbon stocks across Iowa
Cover crops are plants grown on agricultural land to improve soil health and control erosion. They can also affect soil organic carbon (SOC) stocks, which are important for mitigating climate change and enhancing soil fertility. However, measuring SOC stocks is resource-intensive, time-consuming, and expensive, especially at large scales. In this example, we use soil spectroscopy to assess the impacts of cover crops on SOC stocks across 20 commercial farms in Iowa.
We would design our study as follows:
- Requirement: We need an unbiased estimate of field-level (~20 ha) mean SOC stocks, as well as soil texture components, and pH, for each farm.
- Study design: We collect 1 soil core per hectare at 3 depths (0-10 cm, 10-20 cm, and 20-30 cm), for each farm, and we divide the farms into two treatments: with cover crops and without cover crops. This results in 2400 soil samples in total.
- Methodology:
  - We scan all samples with a VisNIR spectrometer to obtain their spectra.
  - We select 25% of the samples for traditional soil analysis, which involves measuring their SOC levels, bulk density, texture, and pH using conventional methods. We use the spectra to subset the most diverse samples, to ensure a representative calibration dataset.
  - We train a multivariate calibration model with the 25% of the samples, using their reference values and spectra as inputs.
  - We predict the SOC, texture, and pH of the remaining 75% of the samples, using only their spectra as inputs and the calibration model as the predictor.
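The subset selection in the methodology above could be sketched as follows. This is a minimal illustration under assumptions that are not part of the study design: the spectra are synthetic random curves, the band count is arbitrary, and the Kennard-Stone algorithm (one common way of picking spectrally diverse samples) stands in for whichever selection method is actually used. The function name `kennard_stone` is hypothetical.

```python
import numpy as np

def kennard_stone(spectra, n_select):
    """Return indices of n_select spectrally diverse rows (Kennard-Stone)."""
    # Pairwise Euclidean distances between all spectra
    sq = np.sum(spectra**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * spectra @ spectra.T
    dist = np.sqrt(np.maximum(d2, 0.0))
    # Seed with the two most distant samples
    first, second = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(first), int(second)]
    # Distance from each sample to its closest already-selected sample
    nearest = np.minimum(dist[first], dist[second])
    while len(selected) < n_select:
        nearest[selected] = -1.0  # exclude samples already chosen
        candidate = int(np.argmax(nearest))
        selected.append(candidate)
        nearest = np.minimum(nearest, dist[candidate])
    return np.array(selected)

# Synthetic stand-in for 2400 VisNIR spectra with 100 bands (assumed sizes)
rng = np.random.default_rng(0)
spectra = rng.normal(size=(2400, 100)).cumsum(axis=1)

n_cal = int(0.25 * len(spectra))         # 25% go to the reference laboratory
cal_idx = kennard_stone(spectra, n_cal)  # calibration samples
pred_idx = np.setdiff1d(np.arange(len(spectra)), cal_idx)  # predicted from spectra only

print(len(cal_idx), len(pred_idx))  # 600 1800
```

In practice the selection is often run on preprocessed spectra or on a few principal component scores rather than on raw spectra, which reduces both noise and computation.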
1.1.2 Example 2: Soil classification using the KSSL MIR Soil Spectral Library
Soil classification is the process of grouping soils into categories based on their physical, chemical, and biological characteristics. It is useful for understanding soil genesis, morphology, distribution, and management. However, obtaining the physicochemical data needed for soil classification is laborious and costly, especially for large numbers of samples. In this example, soil surveys are conducted at state survey offices equipped with compact spectral instruments that cover the same spectral range as the library. Therefore, we can leverage a large soil spectral library, together with calibration transfer, to obtain the physicochemical characterization data for soil classification.
We may use the following method:
- We scan all new soil samples with the compact MIR spectrometer to obtain their spectra.
- The KSSL MIR library already represents samples from soils surveyed across the US.
- We standardize the instruments using a small set of shared standards to ensure compatibility between the KSSL and the state survey offices’ instruments, which accounts for small differences in instrument settings and operational procedures. This can be done with regular preprocessing or spectral standardization, depending on the dissimilarity level.
- We assess the representativeness of the KSSL MIR library to the new samples using spectral similarity measures. This helps us to identify if the new local/regional samples are well-represented by the existing library. If not, localization, transfer learning, or spiking may be necessary.
- We apply a pretrained model from the KSSL MIR onto local sample spectra, with modeling adjustments depending on the previous representation check.
- We predict several soil properties of interest with uncertainty bounds.
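The standardization step in the list above can take several forms. One classical option, shown here purely as an illustration, is direct standardization: a transfer matrix is estimated from standards scanned on both instruments and then applied to new spectra from the secondary instrument. The sketch assumes noise-free synthetic spectra and a simple gain-plus-offset distortion; all names are hypothetical, and real instrument differences are rarely this well behaved.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bands, n_latent = 60, 3
basis = rng.normal(size=(n_latent, n_bands))  # synthetic spectral components

def secondary_response(primary):
    """Hypothetical distortion of the compact instrument: band-wise gain plus offset."""
    gain = 1.0 + 0.1 * np.sin(np.linspace(0.0, np.pi, n_bands))
    return primary * gain + 0.05

# Shared standards scanned on both instruments
std_primary = rng.normal(size=(20, n_latent)) @ basis
std_secondary = secondary_response(std_primary)

def with_intercept(spectra):
    """Prepend a column of ones so the transfer can absorb a constant offset."""
    return np.hstack([np.ones((spectra.shape[0], 1)), spectra])

# Transfer matrix F such that [1, secondary] @ F approximates primary.
# rcond discards near-zero singular values of the rank-deficient standards.
F = np.linalg.pinv(with_intercept(std_secondary), rcond=1e-8) @ std_primary

# New local samples, scanned only on the compact instrument
new_primary = rng.normal(size=(5, n_latent)) @ basis  # unknown in practice
new_secondary = secondary_response(new_primary)
new_standardized = with_intercept(new_secondary) @ F

error_before = np.abs(new_secondary - new_primary).max()
error_after = np.abs(new_standardized - new_primary).max()
print(f"max error before: {error_before:.3f}, after: {error_after:.2e}")
```

With noisy real spectra, the truncation level (here `rcond`) and the number and diversity of the shared standards largely determine how well the transfer generalizes.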
If many samples turn out to be underrepresented, we consider sending a portion of them for traditional laboratory analysis and incorporating them into the predictive modeling workflow with localization, transfer learning, or spiking strategies. This helps to reduce the prediction bias for those new locations.
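One simple way to flag underrepresented samples, sketched below, is to compare each new spectrum's distance to its nearest neighbor in the library against the nearest-neighbor distances found within the library itself. The Euclidean metric, the 95th-percentile threshold, and the synthetic spectra are assumptions made for illustration; they are not the KSSL procedure.

```python
import numpy as np

def pairwise_distances(a, b):
    """Euclidean distances between every row of a and every row of b."""
    d2 = (a**2).sum(axis=1)[:, None] + (b**2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.sqrt(np.maximum(d2, 0.0))

rng = np.random.default_rng(1)
loadings = rng.normal(size=(4, 80))                    # synthetic spectral components
library = rng.normal(size=(500, 4)) @ loadings         # stand-in for the reference library
similar = rng.normal(size=(20, 4)) @ loadings          # new samples like the library
unusual = (rng.normal(size=(20, 4)) + 8.0) @ loadings  # new samples unlike the library
new_samples = np.vstack([similar, unusual])

# Nearest-neighbor distance of each library sample to the rest of the library
within = pairwise_distances(library, library)
np.fill_diagonal(within, np.inf)  # ignore self-matches
threshold = np.percentile(within.min(axis=1), 95)

# Nearest-neighbor distance of each new sample to the library
nn_dist = pairwise_distances(new_samples, library).min(axis=1)
underrepresented = nn_dist > threshold

print("flagged among similar samples:", int(underrepresented[:20].sum()))
print("flagged among unusual samples:", int(underrepresented[20:].sum()))
```

Samples flagged this way are natural candidates for reference analysis and spiking, since they lie in regions of spectral space that the library covers poorly.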
1.2 Good predictions flow from good data
Predictive soil spectroscopy relies on the quality and consistency of both the spectral and the reference analytical data. There are many factors that can affect the spectral acquisition or reference analytical quality, and thus impact the performance of the predictive models.
From a spectral acquisition standpoint, some of the factors that need to be considered are:
- Instrument setup and calibration: The spectral measurements should be made with a well-calibrated and stable instrument, following the manufacturer’s instructions and recommendations. The instrument settings, such as spectral range, resolution, and integration time, should be optimized for proper soil characterization and analysis of soil properties of interest.
- Operating environment: The spectral measurements should be made under controlled and consistent environmental conditions, such as temperature, humidity, lighting, and in some cases, under controlled air conditions with chamber purging. The environmental factors can influence the spectral response of the soil samples and introduce noise or variability in the data.
- Sample preparation: The soil samples should be prepared in a consistent, homogeneous way, including sieving, drying, grinding (optional), and surface homogenization. Poor sample preparation can alter the spectral response and degrade the estimation of both physical properties (such as grain size) and chemical properties (such as organic carbon or mineral occlusion).
- Instrument calibration and standard samples: Spectral instruments should first be calibrated, and each measurement session should include a consistent reference or standard material (a white reference plate for VisNIR and NIR, or roughened gold or aluminum for MIR) to properly quantify the amount of light reflected by soils. Additional standard soil samples may be included for quality assurance and control to help ensure repeatability and reproducibility of the results.
The soil spectroscopy community has started to promote the use of standard protocols and reference samples for quality control of laboratory measurements. Examples are provided in the KSSL Manual Method 7A7, and a new international protocol is emerging from the IEEE SA P4005 working group. These reference materials provide guidance for proper spectral acquisition.
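To illustrate why the reference material matters, the sketch below converts raw detector signals into reflectance by dividing by the reference scan, and then into apparent absorbance as log10(1/R). The signal values are made up, and the dark-signal correction is an optional step assumed here for completeness; instrument software usually performs these conversions internally.

```python
import numpy as np

def to_reflectance(sample_signal, reference_signal, dark_signal=0.0):
    """Reflectance relative to the reference material (white plate or gold)."""
    return (sample_signal - dark_signal) / (reference_signal - dark_signal)

def to_absorbance(reflectance):
    """Apparent absorbance, log10(1/R)."""
    return np.log10(1.0 / reflectance)

# Made-up raw detector counts for three bands
sample = np.array([1200.0, 2100.0, 2600.0])
reference = np.array([4100.0, 4100.0, 5100.0])
dark = 100.0

reflectance = to_reflectance(sample, reference, dark)
absorbance = to_absorbance(reflectance)
print(reflectance)  # fractions of the reference signal
print(absorbance)
```

Because every soil spectrum is expressed relative to the reference scan, a drifting or contaminated reference propagates directly into all spectra of that session.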
1.3 Good practices for model building
We need to build and evaluate multivariate or machine learning models that relate the soil spectra to the soil properties of interest. However, building good models requires careful consideration of several factors, such as spectral preprocessing, exploratory data analysis, data splitting, dimensionality reduction, algorithm selection, model interpretation and inspection, and estimation of prediction uncertainty. These factors affect the quality, accuracy, and reliability of the models, as well as their interpretability and validity. Therefore, it is recommended to follow some best practices and assess whether the performance meets the requirements of the project's goal and context:
Spectral preprocessing: This step involves removing noise, enhancing the signal, and making the scale comparable across different spectral measurements.
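As a minimal sketch of this step, the code below smooths each spectrum with a moving average to reduce noise and then applies the standard normal variate (SNV) transform, so that spectra differing only by an additive offset or a multiplicative scale become comparable. The moving average is a simple stand-in for the Savitzky-Golay filters often used in practice, and the spectra are synthetic.

```python
import numpy as np

def moving_average(spectra, window=5):
    """Smooth each row; the output loses (window - 1) bands at the edges."""
    kernel = np.ones(window) / window
    return np.array([np.convolve(row, kernel, mode="valid") for row in spectra])

def snv(spectra):
    """Standard normal variate: center and scale each spectrum individually."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std

# Synthetic spectra: one shape observed with different gains, offsets, and noise
rng = np.random.default_rng(0)
bands = np.linspace(0.0, 1.0, 200)
shape = np.exp(-((bands - 0.4) ** 2) / 0.01) + 0.5 * bands
gains = np.array([0.8, 1.0, 1.3])[:, None]
offsets = np.array([0.05, 0.0, -0.02])[:, None]
raw = gains * shape + offsets + rng.normal(scale=0.01, size=(3, 200))

processed = snv(moving_average(raw, window=5))

print(raw.std(axis=1).round(2))        # spread differs between raw spectra
print(processed.std(axis=1).round(2))  # each processed spectrum has unit spread
```

The same preprocessing, with the same parameters, has to be applied to the calibration spectra and to any new spectra that the model will later predict.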
Data visualization and exploratory data analysis: This step aims to answer questions such as: are there any clusters in the data? What factors are driving the spread in the data? Do a few samples have too much influence on the results? Are there any outliers? Data visualization is critical for understanding the structure and patterns of the data.
Splitting data for calibration and validation: This step depends on the goal of the analysis. There are two main approaches: cross-validation and train-test split. For a train-test split, some algorithms, such as Kennard-Stone or conditioned Latin hypercube sampling, can be used to select a representative subset of the data. Cross-validation is suitable when the goal is to use all of the data for internal validation and model optimization. A train-test split is more conservative and can provide a better estimate of how the model will perform on new data.
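The two approaches can also be combined, as in the hedged sketch below: a test set is held out first, k-fold cross-validation on the remaining data is used to tune the model, and the held-out samples are used once for the final performance estimate. Ridge regression, the random hold-out, the candidate penalties, and the synthetic data are all assumptions chosen to keep the example short; they stand in for whatever model and splitting algorithm a project actually uses.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_bands = 300, 40
X = rng.normal(size=(n_samples, n_bands))  # synthetic "spectra"
true_coef = np.zeros(n_bands)
true_coef[:5] = [1.0, -0.5, 0.8, 0.3, -1.2]
y = X @ true_coef + rng.normal(scale=0.5, size=n_samples)  # synthetic soil property

def fit_ridge(X, y, alpha):
    """Ridge regression on centered data; returns coefficients and intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return coef, y_mean - x_mean @ coef

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def cv_rmse(X, y, alpha, k=5):
    """Mean RMSE over k folds, each fold predicted by a model fit on the others."""
    errors = []
    for fold in np.array_split(np.arange(len(y)), k):
        mask = np.ones(len(y), dtype=bool)
        mask[fold] = False
        coef, intercept = fit_ridge(X[mask], y[mask], alpha)
        errors.append(rmse(y[fold], X[fold] @ coef + intercept))
    return float(np.mean(errors))

# Hold out 20% of the samples as an independent test set
perm = rng.permutation(n_samples)
test_idx, train_idx = perm[:60], perm[60:]

# Tune the penalty with cross-validation on the training samples only
alphas = [0.01, 1.0, 100.0]
cv_scores = {a: cv_rmse(X[train_idx], y[train_idx], a) for a in alphas}
best_alpha = min(cv_scores, key=cv_scores.get)

# Refit on all training samples and evaluate once on the test set
coef, intercept = fit_ridge(X[train_idx], y[train_idx], best_alpha)
test_rmse = rmse(y[test_idx], X[test_idx] @ coef + intercept)
print(best_alpha, round(cv_scores[best_alpha], 3), round(test_rmse, 3))
```

Keeping the test samples out of every tuning decision is what makes the final RMSE a fair estimate for new data.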
Dimensionality reduction and feature selection: This step involves reducing the number of spectral features to speed up computation, avoid redundancy and potential overfitting, and improve interpretability, as spectra are high-dimensional and collinear. Some model types, such as partial least squares regression (PLSR), perform dimensionality reduction (compression) during model building, while others do not. There are many methods for selecting the most important features, such as variable importance in projection (VIP), principal component analysis, or modern variable selection/compression algorithms developed by the machine learning community.
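The effect of collinearity can be illustrated with principal component analysis, as in the following sketch: a hundred correlated bands are compressed into the few components needed to retain 99% of the variance. The synthetic spectra (three underlying components plus noise) and the 99% threshold are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_bands = 200, 100

# Synthetic collinear spectra: three underlying components plus small noise
components = rng.normal(size=(3, n_bands))
spectra = rng.normal(size=(n_samples, 3)) @ components
spectra += rng.normal(scale=0.05, size=spectra.shape)

# PCA via singular value decomposition of the mean-centered spectra
centered = spectra - spectra.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
explained = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained)

# Keep the smallest number of components that explains at least 99% of the variance
n_keep = int(np.searchsorted(cumulative, 0.99) + 1)
scores = centered @ vt[:n_keep].T  # compressed representation of the spectra

print(spectra.shape, "->", scores.shape)
```

The scores can then replace the full spectra as model inputs, although components chosen only for explained variance are not guaranteed to be the ones most related to the soil property.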
Model algorithm choice: This step is linked to the previous one, as different model types have different assumptions and properties. There are many model forms that may be applied to spectral data, such as PLSR, Cubist, and neural networks. The choice of the model type should be based on experimentation, literature recommendation, performance (uncertainty tolerance), and parsimony.
Model inspection and interpretation: This step is essential for evaluating the quality and validity of the model. It involves examining the factors that drive the predictions, such as the model coefficients and residuals. Some models are easier to interpret than others, depending on how transparent they are. For example, PLSR can reveal the correlation between the spectral features and the soil variable of interest, while deep learning models are more like a black box and require some workarounds. Model inspection should also identify any samples that are poorly predicted or that show a large difference between their calibration and validation performance.
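As a hedged illustration of coefficient and residual inspection, the sketch below fits a single-response PLS regression (PLS1) with the NIPALS algorithm, written in plain NumPy, on synthetic spectra in which the property of interest drives an absorption feature centered on band 30. The function name `pls1_fit`, the data, and the choice of two components are all hypothetical.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit PLS1 (single response) with NIPALS; return coefficients and intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    weights, loadings, y_loadings = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)     # weight vector
        t = Xk @ w                 # scores
        p = Xk.T @ t / (t @ t)     # spectral loadings
        q = (yk @ t) / (t @ t)     # response loading
        Xk = Xk - np.outer(t, p)   # deflate spectra
        yk = yk - q * t            # deflate response
        weights.append(w)
        loadings.append(p)
        y_loadings.append(q)
    W, P = np.array(weights).T, np.array(loadings).T
    coef = W @ np.linalg.solve(P.T @ W, np.array(y_loadings))
    return coef, y_mean - x_mean @ coef

# Synthetic spectra: the property scales a peak at band 30, a nuisance factor tilts the baseline
rng = np.random.default_rng(0)
n_samples, n_bands = 200, 100
band = np.arange(n_bands)
peak = np.exp(-((band - 30) ** 2) / (2 * 4.0**2))
slope = np.linspace(0.0, 1.0, n_bands)
soil_property = rng.normal(size=n_samples)  # hypothetical, standardized values
nuisance = rng.normal(size=n_samples)
spectra = np.outer(soil_property, peak) + np.outer(nuisance, slope)
spectra += rng.normal(scale=0.02, size=spectra.shape)

coef, intercept = pls1_fit(spectra, soil_property, n_components=2)
residuals = soil_property - (spectra @ coef + intercept)

r2 = 1.0 - residuals.var() / soil_property.var()
top_band = int(np.argmax(np.abs(coef)))        # band with the largest coefficient
worst = np.argsort(np.abs(residuals))[-5:]     # five most poorly fitted samples

print(f"R2 = {r2:.3f}, most influential band = {top_band}")
print("samples to inspect:", worst)
```

On real spectra, the bands with the largest coefficients can be compared with known absorption features to judge whether the model relies on plausible chemistry.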
Outliers and uncertainty: The final step is to apply the model to new data and assess the accuracy and precision of the predictions. A good practice would be to run a small percentage (5-10%) of the new samples through the reference analysis method to truly evaluate the performance (if this was not properly done during data splitting), but this may not always be feasible. Alternatively, there are some ways to increase the confidence in the predictions: i) checking how well the new samples are represented by the training set using extrapolation metrics, such as Hotelling's T², inlier distances, or Q statistics; ii) checking whether the predictions fall within the range of the measured soil properties; iii) calculating the uncertainty (imprecision) around each prediction point using methods such as quantile regression or conformal prediction.
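Option iii) can be sketched with split conformal prediction: a calibration set that was not used for model fitting provides the absolute residuals, and an upper quantile of those residuals becomes the half-width of the prediction interval. The ordinary least-squares model, the synthetic data, and the 90% target coverage are assumptions for illustration; the reference values of the new samples are used here only to check the coverage and would be unknown in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 2000, 10
X = rng.normal(size=(n_samples, n_features))  # e.g. compressed spectra
y = X @ rng.normal(size=n_features) + rng.normal(scale=0.5, size=n_samples)

# Three disjoint sets: model fitting, conformal calibration, and "new" samples
train, calib, new = np.split(rng.permutation(n_samples), [500, 1000])

def add_intercept(X):
    return np.hstack([np.ones((len(X), 1)), X])

beta, *_ = np.linalg.lstsq(add_intercept(X[train]), y[train], rcond=None)

def predict(X):
    return add_intercept(X) @ beta

# Nonconformity scores: absolute residuals on the calibration set
alpha = 0.1  # target 90% coverage
scores = np.sort(np.abs(y[calib] - predict(X[calib])))
k = int(np.ceil((len(calib) + 1) * (1 - alpha)))  # finite-sample corrected rank
half_width = scores[k - 1]

# Prediction intervals for the new samples
pred_new = predict(X[new])
lower, upper = pred_new - half_width, pred_new + half_width

coverage = np.mean((y[new] >= lower) & (y[new] <= upper))
print(f"interval half-width: {half_width:.2f}, empirical coverage: {coverage:.2f}")
```

This basic variant gives every sample the same interval width; the coverage guarantee also assumes that new samples resemble the calibration samples, which is why the representativeness check in i) remains important.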