Supported by the National Key Research and Development Program (No. 2021YFD1700900)
【Objective】Against the background of high-intensity soil resource utilization, digital soil mapping has become an effective way to obtain and characterize soil information quickly, efficiently and accurately. The accuracy and reliability of soil spatial prediction and digital mapping are constrained by multiple factors, such as soil sample size, sampling strategy, prediction model, the complexity of the geomorphology and soil-forming environment in the target region, and the quality of covariate data. 【Method】Taking Henan Province as the study region, we applied five of the most representative machine learning (ML) algorithms to spatially predict and digitally map the topsoil pH of croplands. We then compared the impact of different sample sizes and sampling methods on the performance of the chosen ML models and on the prediction accuracy of topsoil pH. 【Result】The results showed that: (1) As the soil sample size increased from 200 to 2 000, the performance of all ML models and the prediction accuracy of topsoil pH generally improved rapidly, regardless of the sampling method. Once the sample size reached or exceeded 2 000, the performance of most ML models tended to stabilize and the improvement in prediction accuracy slowed markedly, suggesting that a sample size of 2 000 might be the threshold for these ML models to predict the topsoil pH of croplands in this area. (2) The five ML models differed significantly in performance and in topsoil pH prediction accuracy. The tree-based ML models, namely Random Forest (RF) and Cubist, performed best. Regardless of the sampling method, when the sample size exceeded 2 000, the coefficient of determination (R²) achieved by the two models remained stable between 0.75 and 0.80, and the root mean square error (RMSE) remained below 0.50. (3) When the soil sample size was large enough, the sampling method had little impact on ML model performance.
However, the influence of the sampling method on model performance and topsoil pH prediction accuracy became increasingly apparent when the soil sample size was less than 2 000. Comparatively, conditioned Latin hypercube sampling (clhs) had advantages when the sample size was small. When the sample size was 1 000, the clhs method could still keep the R² of Random Forest and Cubist predictions at about 0.80. Even when the sample size was as small as 200, the R² achieved by all five ML models under clhs was above 0.55. (4) The uncertainty analysis showed that 73.9% of the observed topsoil pH values of the validation samples fell within the 90% prediction interval (PI) of the Random Forest model, indicating that the model was slightly overconfident but within the acceptable range. In addition, the uncertainty of model prediction was not significantly correlated with sample size. 【Conclusion】The tree-structured machine learning models Random Forest and Cubist stood out in this case. Improving the accuracy of spatial prediction and digital mapping of soil target variables cannot be achieved simply by enlarging the sample and increasing sampling density; model prediction performance and covariate data quality must be improved at the same time. When the sample size is large enough, the sampling strategy has little effect on ML model performance and the prediction accuracy of topsoil pH; when the sample size falls below a certain threshold, the sampling method has a significant impact on model performance and prediction results.
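The accuracy and uncertainty measures reported in the results (R², RMSE, and the share of validation observations falling inside the 90% prediction interval) can be illustrated with a minimal sketch. It assumes R² is defined as one minus the ratio of residual to total sum of squares, which the abstract does not state, and the function names and all numbers in the example are hypothetical illustration values, not data from the study.

```python
import numpy as np


def r2(observed, predicted):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))


def picp(observed, lower, upper):
    """Percentage of observations that fall inside [lower, upper]."""
    observed = np.asarray(observed, dtype=float)
    inside = (observed >= np.asarray(lower)) & (observed <= np.asarray(upper))
    return float(100.0 * inside.mean())


# Hypothetical validation data (topsoil pH), for illustration only.
observed = np.array([6.0, 7.0, 8.0, 7.5])
predicted = np.array([6.5, 7.0, 7.5, 7.5])

# Hypothetical prediction-interval limits around each prediction.
lower = predicted - 0.4
upper = predicted + 0.4

print("R2:", round(r2(observed, predicted), 3))
print("RMSE:", round(rmse(observed, predicted), 3))
print("PICP (%):", picp(observed, lower, upper))
```

For a well-calibrated 90% prediction interval, the coverage returned by `picp` would be close to 90; a value clearly below the nominal level, such as the 73.9% reported above, means the intervals are too narrow, which is what "overconfident" refers to.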
SUN Yueqi, SUN Xiaomei, WU Zhenfu, YAN Junying, ZHAO Yanfeng, CHEN Jie. Impact of Sample Size and Sampling Method on Accuracy of Topsoil pH Prediction on a Regional Scale[J]. Acta Pedologica Sinica, 2023, 60(6): 1595-1609.