YUAN Yuqi, CHEN Hanyue, ZHANG Liming, et al. Prediction of Spatial Distribution of Soil Organic Carbon in Farmland Based on Multi-Variables and Random Forest Algorithm—A Case Study of a Subtropical Complex Geomorphic Region in Fujian as an Example. Acta Pedologica Sinica, 2021, 58(4): 887-899.

Prediction of Spatial Distribution of Soil Organic Carbon in Farmland Based on Multi-Variables and Random Forest Algorithm—A Case Study of a Subtropical Complex Geomorphic Region in Fujian as an Example
YUAN Yuqi, CHEN Hanyue, ZHANG Liming, REN Biwu, XING Shihe, TONG Junyue
University Key Lab of Soil Ecosystem Health and Regulation in Fujian, College of Resource and Environment, Fujian Agriculture and Forestry University, Fuzhou 350002, China
Abstract: 【Objective】Soil organic carbon (SOC) plays an important role in soil fertility and the terrestrial ecosystem carbon cycle. A detailed understanding of the spatial distribution of SOC is vital to managing soil resources and mitigating global climate change. With the development of 3S technology, models that predict soil properties from environmental variables are becoming increasingly popular. This study aims to simulate the complex, nonlinear relationship between SOC and environmental variables, and to evaluate how much soil attributes contribute to the accuracy of SOC mapping.【Method】A random forest (RF) model, a machine learning method, was applied to map the spatial distribution of topsoil organic carbon content in farmland in the high-yield agricultural areas of Southeast Fujian. A set of environmental variables, comprising 5 hard-to-obtain quantitative soil attributes (hydrolysable nitrogen, available phosphorus, pH, etc.) and 11 easy-to-obtain variables (topographic factors, vegetation indices and climate factors), was acquired through analysis of a large number of soil samples collected from the region and then processed with the RF algorithm to predict the spatial distribution of topsoil SOC content in the farmland. Two combinations of these variables were used as inputs to two models: the RF-S model used only the easy-to-obtain variables, whereas the RF-A model used all variables, both easy and hard to obtain. Root mean square error (RMSE), mean absolute error (MAE), Pearson correlation coefficient (r), coefficient of variation (CV), relative error (RE) and relative root mean square error (RRMSE) were calculated for the two models to evaluate their prediction accuracy and to select the optimal RF model for mapping SOC in the study area based on the raster datasets of all variables. Cross-validation was then performed to compare the optimal RF model with the ordinary kriging (OK) interpolation model.【Result】Of the two models, which differed in their input environmental variables, the RF-A model, based on remote sensing variables, climate factors and soil attributes, performed much better and explained most of the spatial heterogeneity of SOC. Compared with the RF-S model, the RF-A model improved significantly in fitting and prediction (r increased by 7.95% and RMSE decreased by 45.13%). The SOC content of farmland in the region predicted by the RF-A model averaged 14.70±2.95 g·kg–1 and showed a spatial pattern similar to that obtained with the OK model, i.e. an increasing trend from the eastern coast to the western inland of the study area. Regardless of sampling percentage, the RF-A model was generally more accurate than the OK model and better at capturing spatial heterogeneity, and it is especially preferable when sampling sites are relatively few. Among the variables, hydrolysable nitrogen (N) was the most important for the RF-A model, followed by elevation (DEM). Both significantly affected the spatial heterogeneity of SOC and were positively related to SOC.【Conclusion】The random forest model based on remote sensing variables, climate factors and soil attributes is therefore a promising approach to predicting the spatial distribution of SOC in Southeast Fujian. In addition, soil attribute variables such as N and P should be taken into account to improve the accuracy of SOC mapping in regions with complex geomorphology.
Key words: Soil organic carbon    Random forest    Combination of variables    Spatial distribution    Accuracy evaluation

1 Materials and methods
1.1 Overview of the study area

 Fig. 1 Location of the study area and distribution of soil sampling sites and meteorological stations
1.2 Data sources

1.3 Acquisition, combination and screening of environmental variables

1.4 RF model construction and validation

1.5 Data processing methods

 $\frac{NIR - red}{NIR + red}$ (1)
 $\left( \frac{NIR - red}{NIR + red} + 0.5 \right)^{\frac{1}{2}} \times 100$ (2)
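Expression (1) has the form of the normalized difference vegetation index (NDVI), and expression (2) adds 0.5 to that ratio, takes the square root and scales the result by 100. The following is a minimal sketch of both calculations for a single pixel. The function names `ndvi` and `transformed_index` and the reflectance values are hypothetical and chosen only for illustration.

```python
import math


def ndvi(nir: float, red: float) -> float:
    """Expression (1): (NIR - red) / (NIR + red)."""
    total = nir + red
    if total == 0:
        raise ValueError("NIR + red must be non-zero")
    return (nir - red) / total


def transformed_index(nir: float, red: float) -> float:
    """Expression (2): sqrt(expression (1) + 0.5) * 100."""
    shifted = ndvi(nir, red) + 0.5
    if shifted < 0:
        # The square root is undefined when expression (1) is below -0.5.
        raise ValueError("expression (1) + 0.5 must be non-negative")
    return math.sqrt(shifted) * 100.0


# Illustrative reflectance values for one pixel (assumed, not from the study).
nir_value, red_value = 0.5, 0.1
print(ndvi(nir_value, red_value))
print(transformed_index(nir_value, red_value))
```

In practice the same arithmetic would be applied band-wise to whole raster arrays rather than to single values.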

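RF model construction and prediction were both implemented with the RandomForestRegressor class in the Python scikit-learn library. The relative importance ranking of the variables can be obtained directly from the feature_importances_ attribute of the fitted model.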
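As a hedged illustration of this workflow, the sketch below fits a scikit-learn `RandomForestRegressor` on synthetic data, computes RMSE, MAE and Pearson's r on a held-out subset, and ranks the predictors through `feature_importances_`. The predictor names, the synthetic response, the 80/20 split and the hyperparameters are all assumptions made for illustration. They are not the study's data or settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for three predictors (names are illustrative only).
names = ["N", "DEM", "NDVI"]
X = rng.uniform(0.0, 1.0, size=(400, 3))
# Synthetic response: strong dependence on column 0, weaker on column 1,
# none on column 2, plus Gaussian noise. Not real SOC data.
y = 10.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(0.0, 0.5, size=400)

# Hold out 20% of the samples for validation (assumed split ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)

# Accuracy metrics on the held-out samples.
rmse = float(np.sqrt(np.mean((pred - y_test) ** 2)))
mae = float(np.mean(np.abs(pred - y_test)))
r = float(np.corrcoef(pred, y_test)[0, 1])
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  r={r:.3f}")

# Rank predictors by impurity-based importance, largest first.
ranking = sorted(
    zip(names, model.feature_importances_), key=lambda pair: pair[1], reverse=True
)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```

The values in `feature_importances_` are impurity-based and normalized to sum to 1, so they describe relative rather than absolute influence.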

2 Results
2.1 Comparison of RF model prediction accuracy under different variable combinations

 Fig. 2 Cumulative distribution of measured SOC values and of values predicted by the models with two different combinations of variables
2.2 Importance of environmental variables in the RF model

2.3 Accuracy test based on different sampling percentages

2.4 Spatial distribution of SOC content in farmland

 Fig. 3 Spatial distribution of farmland SOC in Southeast Fujian estimated by the RF-S (a), RF-A (b) and OK (c) models

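The RF-A model yielded a mean SOC content of 14.70±2.95 g·kg–1 for Southeast Fujian, with values ranging from 3.63 to 25.51 g·kg–1. The 13 to 19 g·kg–1 class covered the largest area, more than 65% of the total farmland in the study area, and was mainly distributed in the western inland, on the southeastern side of the Daiyun Mountain-Boping Ridge section of the central Fujian mountain belt. The classes below 10 g·kg–1 and above 19 g·kg–1 covered small areas, less than 10%, and were distributed in the three major plains of Southeast Fujian (the Zhangzhou, Quanzhou and Puxian Plains) and in the highest-elevation areas in the west, respectively. The 10 to 13 g·kg–1 class covered about 19% of the area and lay in the transition zone between the high and low values.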
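Area shares of this kind can be derived by counting predicted raster cells per SOC class. The sketch below assumes equal-area cells and uses made-up cell values. The function name `class_shares` and the handling of values that fall exactly on a class boundary are assumptions, since the text does not specify them.

```python
from typing import Dict, List

# Class labels follow the intervals named in the text (g·kg–1).
LABELS = ["<10", "10-13", "13-19", ">19"]


def soc_class(value: float) -> str:
    """Assign a predicted SOC value to one of the four classes.

    Boundary handling (10 and 13 go to the upper class, 19 to the lower)
    is an assumption made for this sketch.
    """
    if value < 10.0:
        return "<10"
    if value < 13.0:
        return "10-13"
    if value <= 19.0:
        return "13-19"
    return ">19"


def class_shares(values: List[float]) -> Dict[str, float]:
    """Return the percentage of cells in each class, assuming equal-area cells."""
    if not values:
        raise ValueError("values must not be empty")
    counts = {label: 0 for label in LABELS}
    for value in values:
        counts[soc_class(value)] += 1
    return {label: 100.0 * count / len(values) for label, count in counts.items()}


# Made-up predicted cell values, for illustration only.
cells = [8.0, 11.0, 12.5, 14.0, 15.0, 16.0, 18.0, 19.0, 20.0, 9.5]
shares = class_shares(cells)
for label in LABELS:
    print(f"{label}: {shares[label]:.1f}%")
```

If cells differ in area, for example in an unprojected raster, each count would need to be weighted by cell area instead.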

 Fig. 4 Relationship between SOC content predicted by the RF-A model and representative factors
3 Discussion
3.1 Spatial prediction of SOC in Southeast Fujian and the influence of the main environmental variables

3.2 Accuracy of the RF-A model

4 Conclusions
