1. School of Agricultural Sciences, Zhengzhou University; 2. Henan Provincial Station of Soil and Fertilizer; 3. School of Public Administration, Zhengzhou University
Supported by the National Key Research and Development Program (No. 2021YFD1700900)
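The accuracy and quality of soil spatial prediction and digital soil mapping are jointly constrained by many factors, including soil sample size, sampling strategy, choice of prediction model, the complexity of the landforms and soil-forming environment in the target region, and the quality of covariate data. Taking Henan Province as the study region, and using nine soil sample sizes and five sampling methods, we applied five of the most representative machine learning (ML) algorithms to spatially predict and digitally map the topsoil pH of cropland, in order to compare how sample size and sampling method affect ML model performance and the accuracy of soil pH prediction. The results show that: (1) As the soil sample size in the study region rose from 200 through 400, 800, 1 200 and 1 600 to 2 000, the performance and prediction accuracy of all ML models improved rapidly overall, whichever sampling method was used; once the sample size reached and exceeded 2 000, the performance of most ML models stabilized and gains in prediction accuracy slowed markedly, suggesting that 2 000 soil samples may be the sample size threshold for these ML models when predicting cropland topsoil pH in the study region. (2) The five ML models differed clearly in performance and in soil pH prediction accuracy. The tree-based models, Random forests (RF) and Cubist, performed best: whichever sampling method was used, the coefficient of determination (R2) of their predictions stayed between 0.75 and 0.80 and the RMSE remained below 0.50. (3) When the soil sample size was large enough, the sampling method had very little effect on ML model performance and soil pH prediction accuracy, and the five sampling methods performed similarly. When the sample size was below 2 000, the effect of the sampling method became increasingly apparent. By comparison, conditioned Latin hypercube sampling had an advantage at small sample sizes. At a sample size of 1 000, it still kept the R2 of random forest and Cubist predictions at about 0.80; at a sample size as small as 200, the R2 of all five ML models under conditioned Latin hypercube sampling was above 0.55. (4) The uncertainty analysis showed that, on average, 73.9% of the observed topsoil pH values at the validation sites fell within the 90% prediction interval of the random forest model, indicating that the model's reliability was slightly overestimated but within an acceptable range. In addition, the data showed no clear association between the uncertainty of model predictions and sample size.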
【Objective】With soil resources under increasingly intensive use, digital soil mapping has become an effective way to obtain and characterize soil information quickly, efficiently and accurately. The accuracy and reliability of soil spatial prediction and digital mapping are constrained by multiple factors, such as soil sample size, sampling strategy, prediction model, the complexity of geomorphology and soil-forming environment in the target region, and the quality of covariate data. 【Method】Taking Henan Province as the study region, we applied five of the most representative machine learning (ML) algorithms, with nine sample sizes and five sampling methods, to spatially predict and digitally map the topsoil pH of croplands. We then compared the impact of sample size and sampling method on the performance of the chosen ML models and on the prediction accuracy of topsoil pH. 【Result】The results showed that: (1) When the soil sample size increased from 200 to 2 000, the performance of all ML models and the prediction accuracy of topsoil pH improved rapidly overall, regardless of the sampling method. Once the sample size reached and exceeded 2 000, the performance of most ML models stabilized and gains in prediction accuracy slowed markedly, suggesting that 2 000 might be the sample size threshold for these ML models to predict the topsoil pH of croplands in this area. (2) The five ML models differed markedly in performance and in topsoil pH prediction accuracy. The tree-based ML models, namely random forest (RF) and Cubist, performed best. Whichever sampling method was used, when the sample size was more than 2 000, the coefficient of determination (R2) achieved by the two models stayed between 0.75 and 0.80, and the RMSE remained below 0.50. (3) When the soil sample size was large enough, the sampling method had little impact on ML model performance and topsoil pH prediction accuracy, whereas its impact became increasingly apparent when the sample size was less than 2 000. Comparatively, conditioned Latin hypercube sampling (clhs) had advantages when the sample size was small. When the sample size was 1 000, clhs could still keep the R2 of random forest and Cubist predictions at about 0.80. Even when the sample size was as small as 200, the R2 achieved by all five ML models under clhs was above 0.54. (4) The uncertainty analysis showed that 73.9% of the observed topsoil pH values of the validation samples fell within the 90% prediction interval (PI) of the random forest model, indicating that the model was slightly overconfident about its reliability, although this was within the acceptable range. In addition, the data indicated that the uncertainty of model prediction was not clearly related to sample size. 【Conclusion】The tree-structured machine learning models random forest and Cubist stand out in this case. Improving the accuracy of spatial prediction and digital mapping of soil target variables cannot be achieved simply by enlarging the sample and increasing sampling density; model prediction performance and covariate data quality must be improved at the same time. When the sample size is large enough, the sampling strategy has little effect on the performance of the ML model and the prediction accuracy of topsoil pH; when the sample size is smaller than a certain threshold, the sampling method has a significant impact on model performance and prediction results.
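The abstract reports three validation diagnostics: R2, RMSE, and the share of validation observations that fall inside the 90% prediction interval. The sketch below shows how these quantities are conventionally computed. It is an illustration only, not the authors' code: the function names and the four pH values are hypothetical. For a well-calibrated 90% interval, the coverage would be close to 90%; a lower value, such as the 73.9% reported above, means the intervals are narrower than they should be.

```python
import math


def r2_score(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root mean square error between observations and predictions."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


def interval_coverage(observed, lower, upper):
    """Percentage of observations lying inside [lower, upper]."""
    inside = sum(1 for o, lo, hi in zip(observed, lower, upper) if lo <= o <= hi)
    return 100.0 * inside / len(observed)


# Hypothetical topsoil pH values for four validation sites (not study data).
observed = [6.0, 7.0, 8.0, 5.0]
predicted = [6.5, 7.0, 7.5, 5.0]
# Hypothetical lower and upper limits of a 90% prediction interval.
lower = [5.8, 6.5, 7.0, 5.2]
upper = [7.0, 7.5, 7.9, 6.0]

print(f"R2 = {r2_score(observed, predicted):.2f}")
print(f"RMSE = {rmse(observed, predicted):.3f}")
print(f"Coverage = {interval_coverage(observed, lower, upper):.1f}%")
```

In practice, the same three functions would be applied to the independent validation set for each combination of sample size, sampling method and model, so that the resulting curves can be compared.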
SUN Yueqi, SUN Xiaomei, WU Zhenfu, YAN Junying, ZHAO Yanfeng, CHEN Jie. Impact of Sample Size and Sampling Method on Accuracy of Topsoil pH Prediction on A Regional Scale[J]. Acta Pedologica Sinica, DOI: 10.11766/trxb202112010651, [In Press]