Partial Least Squares Optimization Method and Path Analysis Integration for Chinese Medicine Data

Partial least squares (PLS) is widely used in multivariate statistical analysis, but in both linear and nonlinear PLS models, variable selection is based on the selection of principal components. It does not account for the interactions among the variables, which may adversely affect prediction accuracy. In this study, we design a tailor-built temperature control system to monitor and control temperature settings during experiments on traditional Chinese medicine (TCM). We combine results from path analysis with the variables' covariance and correlation matrices, and propose a PLS optimization method that integrates path analysis (PLS-PA). To verify the validity of PLS-PA, we use the measured coefficient (R²) and the residuals as evaluation indicators. We test the performance of PLS-PA using two TCM dose datasets and one dataset from the University of California, Irvine (UCI). For the three datasets, the measured coefficients obtained with PLS-PA are 11.8, 4.7, and 8.5% higher than those obtained with traditional PLS, which supports the validity of the method. We conclude that PLS-PA can optimize the screening of variables and improve the PLS regression analysis of TCM experimental data without hampering model accuracy.


Introduction
The process of decocting traditional Chinese medicine (TCM) involves careful consideration of the concentration of reactants, the ratio of ingredients, the reaction time, the decocting temperature, and the pressure, all of which must be controlled for the reaction to proceed as intended. Among these factors, temperature directly affects the reaction rate, so controlling it accurately is crucial in the TCM experimental process. The effect of temperature on TCM experiments is complex, and the following behaviors are mainly observed: (1) The reaction rate increases exponentially with temperature. (2) Within a certain temperature range, the reaction rate increases with temperature, but once the temperature exceeds a certain level, a further increase decreases the reaction rate. (3) The reaction rate decreases as the temperature rises.
In addition, minor differences in boiling temperature also affect the content of chemical components. For example, Zhu et al. (1) showed that 115 ℃ is the best boiling temperature for a Dachengqi decoction and that increasing or decreasing the temperature decreases the content of some chemical components. Therefore, it is necessary to monitor the temperature in real time during a TCM experiment. Conducting experiments at a constant, optimal boiling temperature avoids differences in chemical content due to temperature.
In multivariate regression, when the number of independent variables is much larger than the number of sample points, the ordinary least squares method does not solve the problem well. Partial least squares (PLS), however, integrates the basic functions of principal component, canonical correlation, and linear regression analyses, (2) and effectively combines model-based data analysis used for prediction with non-model-based data cognition analysis. The data analysis finds the functional relationship between dependent and independent variables so that a prediction model can be established. (3) PLS thus simplifies the data structure and finds the relationships between variables. Compared with linear regression and ridge regression analyses, PLS is therefore better suited to regression with multiple independent variables linked to multiple dependent variables.
The composition and mechanisms of TCM are complicated. (4) Most clinical and experimental data are noisy, multicollinear, and nonlinear, (5) which are conditions that PLS regression can handle. However, the number of variables in TCM experimental data tends to be large, and some observation data are difficult to manage. Some independent variables may have little or no effect on some dependent variables. If the PLS regression model incorporates these unrelated variables, the amount of computation increases and the model becomes less accurate. If too many independent variables are selected, they are likely to be collinear. Conversely, if some important variables are missed, the regression results are affected, and unpredictable parameter estimates are sometimes generated. Therefore, when using TCM experimental data to establish a PLS multiple regression model, (6) screening the variables in the regression model is necessary to effectively improve its accuracy.

Related Research
At present, the quality of temperature detection systems is generally reflected in the instrument's level of sophistication, measurement range, measurement precision, and power consumption. (7) However, in China, the accuracy of temperature detection is inadequate because the majority of control systems are single-parameter, single-loop systems controlled by a single-chip microcomputer. Multiparameter comprehensive control systems do not exist, resulting in a large gap in the sophistication of temperature control systems. Also, for the same type of TCM experiment, temperature detection accuracy can differ greatly between laboratories. Despite the rapid development of computing, some laboratories have yet to fully realize the importance of accurate temperature detection and control during a TCM experiment, and this may severely affect experimental results.
Temperature control systems in China, such as those studied by Wang et al., (8) used a single-chip microcomputer for temperature control. This method can ensure relatively accurate temperature control, but temperature transmission is subject to delay. Li et al.'s control system (9) incorporated the proportional-integral-derivative (PID) segmentation theory to make the system respond quickly to temperature changes, but the entire system lacked stability. Kong et al.'s remote temperature control system (10) greatly reduced labor costs, but its fault maintenance and operation costs were relatively high. The advantages and disadvantages of the system developed in this study and of the other systems are shown in Table 1.
Typical variable screening methods such as stepwise regression, subset extraction, and optimal subset variable screening have their own advantages and disadvantages. Stepwise regression and subset extraction methods are very commonly used, but they neglect the random errors in the variable screening process, so it is difficult to study their theoretical properties systematically. Analysis with the optimal subset variable screening rule lacks stability. The well-known Akaike information criterion (AIC) and Bayesian information criterion (BIC) (11) are selection criteria based on the Kullback-Leibler information distance and the maximized Bayesian posterior probability, respectively. These methods depend statistically on the likelihood of the model, and in general, the study of likelihood functions requires knowledge of the type of distribution, with only some parameters unknown. These conditions increase the difficulty of using the aforementioned variable screening methods.
Some variable screening methods apply a penalty function in their statistical analysis. The central idea of this type of method is to replace the least squares loss with a penalized version, minimizing the sum of the loss and penalty functions. The classical penalty functions include the least absolute shrinkage and selection operator (Lasso), (12) smoothly clipped absolute deviation (SCAD) penalty, (13) Adaptive Lasso, (14) and Elastic Net. (15) However, for variable screening methods that use a penalty function, the calculation is difficult and sometimes impossible for p >> n. In response to this problem, Cai and Lv (16) proposed a Dantzig screening method. This variable screening method involves L1 regularization under the condition that the design matrix satisfies the uniform uncertainty principle (UUP). This method has some desirable mathematical properties; for example, because the problem can be transformed into a linear programming problem, it is easy to solve. However, when the number of dimensions increases and the UUP condition is not easily satisfied, there is no guarantee that the correct model can be selected.
To overcome the limitation of the Dantzig screening method, Song et al. (17) and Fan and Lv (18) proposed the sure independence screening (SIS) method, which can reduce the number of dimensions from p to m, where m < n. Yuan and Lin (19) and Kim et al. (20) carried out similar studies in which variables are grouped and a penalty function is applied. However, in practical problems, the following phenomena may occur: some unimportant variables may be preferentially selected because of their collinearity with important variables, and some independent variables may not be highly correlated with the dependent variables individually but may be strongly correlated with them when combined with other independent variables. Therefore, Fan and Lv (21) proposed an iterative SIS (ISIS) method to overcome these problems.
In addition, there are many methods for the selection of linear model variables, but they are generally based on the assumption that the model error is normally distributed, and are established by the least squares, penalized least squares, wider Lq, (22) or penalty Lq method. In particular, for a large p and a small n, that is, when the dimension p is larger than the number of samples n, these methods are inaccurate and slow, and the results are sometimes inconsistent owing to the use of different penalty functions. These drawbacks make the screening methods very difficult to apply. When linear variable selection cannot meet the requirements of nonlinear variable selection, it must be combined with nonlinear regression methods, (23) such as artificial neural networks, kernel methods, (24) support vector machines, (25) and PLS. Wang (26) used the kernel function as the transformation basis function to study the nonlinear structural features of data by PLS regression based on a kernel function transformation.
Path analysis is a statistical method that combines a path graph with multiple linear regression. (27) It can visualize the relationship between independent and dependent variables. Through the path coefficients and the residual path coefficient, it can also quantify the direct effect of each causal factor on the outcome and its indirect effects on the output variable through other factors. At the same time, a path map can visually indicate relationships that are difficult to express in multivariate analysis and can help indicate the importance of the different variables in relation to the output variables. As an extension of the linear regression model, path analysis helps overcome the limitations of linear regression models.
In this paper, to tackle the complexity of the TCM dose-effect problem, we use TCM experiments and information from the literature to construct a complete path map with the independent and dependent variables as nodes and the direct and indirect path coefficients as weights. Through the weight analysis of this directed weighted graph, the comprehensive weights of the various paths are obtained, and the groups of independent variables with large direct and indirect effects on the dependent variable are selected according to weight. At the same time, the principal components of PLS and the principal component of the dependent variable path coefficient are calculated. While retaining the suitability of PLS for modeling when the sample size is smaller than the number of variables, we propose a new variable screening method that integrates path analysis and use it to optimize the PLS regression model.

Methods
On the basis of multiple regression and correlation coefficients, path analysis decomposes the effect of an independent variable on the dependent variable into a direct path coefficient (the direct effect of the independent variable on the dependent variable) and an indirect path coefficient (the sum of the indirect effects of the independent variable on the dependent variable through other independent variables).
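For general multivariate linear regression, we set the independent variables $X_1, X_2, \ldots, X_n$ and the dependent variable $Y$, with the regression equation
\[ Y = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_n X_n + \varepsilon. \tag{1} \]
Taking the sample mean of both sides, with the residuals averaging to zero, gives
\[ \bar{Y} = b_0 + b_1 \bar{X}_1 + b_2 \bar{X}_2 + \cdots + b_n \bar{X}_n. \tag{2} \]
Subtracting Eq. (2) from Eq. (1) gives
\[ Y - \bar{Y} = b_1 (X_1 - \bar{X}_1) + b_2 (X_2 - \bar{X}_2) + \cdots + b_n (X_n - \bar{X}_n) + \varepsilon. \tag{3} \]
Dividing both sides of Eq. (3) by the standard deviation $\sigma_y$ of $Y$ gives
\[ \frac{Y - \bar{Y}}{\sigma_y} = b_1 \frac{\sigma_{x_1}}{\sigma_y} \cdot \frac{X_1 - \bar{X}_1}{\sigma_{x_1}} + b_2 \frac{\sigma_{x_2}}{\sigma_y} \cdot \frac{X_2 - \bar{X}_2}{\sigma_{x_2}} + \cdots + b_n \frac{\sigma_{x_n}}{\sigma_y} \cdot \frac{X_n - \bar{X}_n}{\sigma_{x_n}} + \frac{\varepsilon}{\sigma_y}. \tag{4} \]
The standardized regression coefficients of the respective variables in Eq. (4), $P_{iY} = b_i \sigma_{x_i} / \sigma_y$, are obtained by the least squares method. From this equation, the decomposition of each simple correlation coefficient can be obtained by algebraic transformation:
\[
\begin{cases}
P_{1Y} + r_{12} P_{2Y} + \cdots + r_{1n} P_{nY} = r_{1Y} \\
r_{21} P_{1Y} + P_{2Y} + \cdots + r_{2n} P_{nY} = r_{2Y} \\
\quad \vdots \\
r_{n1} P_{1Y} + r_{n2} P_{2Y} + \cdots + P_{nY} = r_{nY}
\end{cases}
\tag{5}
\]
Equation (5) is the basic path analysis model, in which $r_{ij}$ is the simple correlation coefficient of $X_i$ and $X_j$, and $r_{iY}$ is the simple correlation coefficient of $X_i$ and $Y$. $P_{iY}$ is the standardized partial regression coefficient of $Y$ on $X_i$, that is, the direct path coefficient, which represents the direct effect of $X_i$ on $Y$. $r_{ij} P_{jY}$ is the indirect path coefficient, which represents the indirect effect of $X_i$ on $Y$ through $X_j$.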
In practice, the linear regression is fitted in statistical software, and the resulting standardized coefficients are the direct path coefficients that we require. Each direct path coefficient is then multiplied by the corresponding correlation coefficient to obtain the indirect path coefficients.
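The calculation described above can be sketched in a few lines. The following is a minimal illustration in standard-library Python, not the authors' MATLAB implementation: it builds the correlation matrix, solves the linear system of the basic path analysis model for the direct path coefficients, and multiplies them by the correlations to get the indirect path coefficients. The function names (`pearson`, `solve_linear`, `path_coefficients`) and the small data arrays are hypothetical and chosen only for the example.

```python
from statistics import mean


def pearson(a, b):
    """Simple (Pearson) correlation coefficient of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def path_coefficients(columns, y):
    """Return (direct, indirect, r_y) for independent variables given as columns.

    direct[i]      : direct path coefficient P_iY
    indirect[i][j] : indirect effect r_ij * P_jY of X_i on Y through X_j
    r_y[i]         : simple correlation r_iY
    """
    n = len(columns)
    R = [[pearson(columns[i], columns[j]) for j in range(n)] for i in range(n)]
    r_y = [pearson(c, y) for c in columns]
    direct = solve_linear(R, r_y)
    indirect = [[R[i][j] * direct[j] if i != j else 0.0 for j in range(n)]
                for i in range(n)]
    return direct, indirect, r_y


# Made-up illustrative data: three independent variables, six samples.
x_cols = [[1, 2, 3, 4, 5, 6], [2, 1, 4, 3, 6, 5], [5, 3, 6, 2, 4, 1]]
y_obs = [3.1, 2.9, 7.2, 6.8, 11.1, 10.7]
direct, indirect, r_y = path_coefficients(x_cols, y_obs)
```

For every variable, the direct coefficient plus the sum of its indirect coefficients reproduces the simple correlation with Y, which is a convenient check on any implementation.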
Before PLS regression is carried out, path analysis is used as an auxiliary step for selecting the independent variables. That is, path analysis is used to calculate the effect of each independent variable X j on the explained variable Y, and the information irrelevant to the explained variable is eliminated. Regression modeling is then performed using methods such as canonical correlation and multiple linear regression analyses, and cross-validation is used to verify the predictive power of the model. The entire experimental process is shown in Fig. 1.
Final step of the procedure in Fig. 1: PLS regression is used to calculate the multiple linear regression equation of Y on X.
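To make the final PLS regression step concrete, the following is a minimal sketch of PLS regression for a single dependent variable (the NIPALS-style PLS1 algorithm) in standard-library Python. It is an illustration under stated assumptions, not the authors' MATLAB code: the function names `pls1_fit` and `pls1_predict` and the demo data are hypothetical, the data are only centered (not scaled), and the number of components is passed in by hand rather than chosen by cross-validation.

```python
def pls1_fit(X, y, n_components):
    """Fit PLS1 on rows X (list of lists) and responses y (list)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centered X
    f = [v - y_mean for v in y]                                # centered y
    components = []
    for _ in range(n_components):
        # Weight vector: direction in X-space with maximal covariance with y.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = sum(v * v for v in w) ** 0.5
        if norm < 1e-12:  # nothing left to explain
            break
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate X and y by the part explained by this component.
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        components.append((w, load, q))
    return x_mean, y_mean, components


def pls1_predict(model, x):
    """Predict the response for one new sample x."""
    x_mean, y_mean, components = model
    e = [x[j] - x_mean[j] for j in range(len(x))]
    y_hat = y_mean
    for w, load, q in components:
        t = sum(e[j] * w[j] for j in range(len(e)))
        y_hat += q * t
        e = [e[j] - t * load[j] for j in range(len(e))]
    return y_hat


# Made-up demo data generated from y = 1 + 2*x1 + 3*x2 (no noise).
X_demo = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 7.0]]
y_demo = [1 + 2 * a + 3 * b for a, b in X_demo]
model = pls1_fit(X_demo, y_demo, n_components=2)
prediction = pls1_predict(model, [6.0, 2.0])
```

With as many components as there are linearly independent variables, PLS1 coincides with ordinary least squares, so the demo recovers the generating equation. With fewer components it gives a regularized fit, which is the regime of interest when the sample size is small relative to the number of variables.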

Temperature control system
Our system is a self-developed temperature control system, and each boiling process in an experiment is carried out at a constant and optimal temperature. After decocting, TCM data are collected at 60 ± 2 ℃. The main modules of the system include the medicine information input and medicine information output. The system monitors the temperature in real time and alerts experimenters if the temperature is abnormal. The modules used are shown in Table 2.
The temperature control module includes a digital-to-analog conversion module (1), a refrigeration module (2), and a high-current drive module (7). The digital-to-analog conversion module (1) is connected to a processing module (4), the high-current drive module (7) is connected to the digital-to-analog conversion module (1), and the refrigeration module (2) is connected to the high-current drive module (7). The specific module design is shown in Figs. 2 and 3.
The temperature is intelligently controlled within ranges suitable for a TCM experiment through the main control chip, the real-time temperature detection and display module, the temperature control module, and the function keyboard module. In this way, we try to reduce the number of environmental factors that may change the experimental results. The module that debugs the received data is shown in Fig. 4.
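The monitoring-and-alert behavior described in this section can be illustrated with a small check. This is only a sketch in Python of the tolerance rule for data collection at 60 ± 2 ℃, not the firmware of the actual system; the function name `temperature_status` and its return labels are hypothetical.

```python
def temperature_status(reading_c, setpoint_c=60.0, tolerance_c=2.0):
    """Classify a temperature reading against a setpoint with a tolerance band.

    Returns "ok" inside the band, otherwise "too_high" or "too_low",
    which a monitoring loop could use to alert the experimenter.
    """
    deviation = reading_c - setpoint_c
    if abs(deviation) <= tolerance_c:
        return "ok"
    return "too_high" if deviation > 0 else "too_low"


# Hypothetical readings polled from a sensor during data collection.
statuses = [temperature_status(r) for r in (61.5, 62.5, 57.0)]
```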

Experimental platform and sample
We use the temperature control system to monitor the temperature and collect the TCM data within the same temperature range, and then test the fitting performance of the PLS optimization method integrating path analysis (PLS-PA) on the TCM experimental data. The algorithm is implemented in MATLAB. Datasets A and B in the experiment are from the Key Laboratory of Modern Chinese Medicine Preparations, Ministry of Education, and the third dataset is the Concrete Compressive Strength dataset from the UCI repository. The first experimental dataset, A, is used to study the changes in the physiological index of superoxide dismutase (SOD) under different dosages of rhubarb, magnolia, citrus aurantium, and mirabilite. The independent variables are rhubarb, magnolia, citrus aurantium, and mirabilite (set to x 1 -x 4 ), and the dependent variable is SOD (set to y). There are nine groups of experimental samples. The data are shown in Table 3.
The second experimental dataset B is used to study the changes in the active components in plasma in the treatment of intestinal obstruction with different proportions of a Dachengqi decoction. There are nine independent variables in the dataset and one dependent variable (small intestinal circumference). There are twelve groups of data in total. The data table is shown in Table 4.
In the third experimental dataset, there are eight independent variables and one dependent variable, the compressive strength of concrete. The total number of samples is 1030. A detailed description of this dataset can be found at http://archive.ics.uci.edu/ml/.
To verify the improved performance of the PLS-PA method, the results obtained on the above experimental data are compared with those obtained by the traditional PLS and PLS-variable importance in projection (PLS-VIP) value optimization methods. In our comparison, all raw data are randomly divided in a ratio of 7:3, with the former used as training samples and the latter used as test data.

Evaluation index
(2) Measured coefficient R²
R² indicates the explained variation as a percentage of the total variation in the data, and thus the goodness of fit of the regression. It also indicates the degree of correlation between the dependent variable y and the fitted values. From this viewpoint, the greater the correlation between the fitted values and the original variable y, the better the fit. The calculation formula for the measured coefficient R² is given below:
\[ R^2 = \frac{SSR}{SST} = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2}, \]
where SST represents the total sum of squares of the original data y i , and SSR is the explained (regression) sum of squares, with a degree of freedom of 1.
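As a minimal sketch of the evaluation procedure, the following standard-library Python code performs a random 7:3 split and computes the residuals and the measured coefficient R² as SSR/SST. The function names and the fixed random seed are assumptions made for the illustration; the authors' own evaluation was carried out in MATLAB.

```python
import random


def split_7_3(rows, seed=0):
    """Randomly split rows into training (70%) and test (30%) parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def residuals(y, y_hat):
    """Residuals e_i = y_i - y_hat_i."""
    return [a - b for a, b in zip(y, y_hat)]


def r_squared(y, y_hat):
    """Measured coefficient R^2 = SSR / SST.

    SST is the total sum of squares of y about its mean, and SSR is the
    sum of squares of the fitted values about the same mean. For a least
    squares fit with an intercept, evaluated on the training data, this
    equals 1 - SSE / SST.
    """
    y_bar = sum(y) / len(y)
    sst = sum((v - y_bar) ** 2 for v in y)
    ssr = sum((v - y_bar) ** 2 for v in y_hat)
    return ssr / sst


# Illustrative use with made-up values.
train_rows, test_rows = split_7_3(list(range(10)))
perfect_fit = r_squared([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])
errors = residuals([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
```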

Experimental results and analysis
Using MATLAB, the direct and indirect path coefficients of all the independent variables in relation to the dependent variables are calculated, giving the path coefficient tables shown in Tables 5-7. In the tables, the underlined values are the direct path coefficients, and the other values are indirect path coefficients, that is, the indirect effect of X i on Y by affecting X j . As can be seen from Table 5, for experimental dataset A, the effect of x 3 on y is minimal, and the other independent variables x i (i = 1, 2, 4) have only a small effect on y by affecting x 3 , so we delete the variable x 3 . Similarly, for experimental dataset B, x 6 has the least effect on y, so we delete this variable. For the UCI dataset, the excluded variable is x 3 .
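The deletion rule described above can be sketched as a simple scoring step. The paper does not state an exact numerical rule, so the score used here (the absolute direct path coefficient of a variable plus the absolute indirect effects that the other variables exert through it) is an assumption made for illustration. The function name `variable_to_drop` is hypothetical, and the coefficient values are made up rather than taken from Table 5.

```python
def variable_to_drop(direct, indirect):
    """Pick the variable with the smallest combined path effect.

    direct[j]      : direct path coefficient of x_j on y
    indirect[i][j] : indirect effect of x_i on y through x_j
    Returns (index of the variable to delete, list of scores).
    """
    n = len(direct)
    scores = []
    for j in range(n):
        via_j = sum(abs(indirect[i][j]) for i in range(n) if i != j)
        scores.append(abs(direct[j]) + via_j)
    drop = min(range(n), key=scores.__getitem__)
    return drop, scores


# Made-up path coefficients for four variables x1..x4 (indices 0..3).
direct_demo = [0.62, -0.35, 0.04, 0.48]
indirect_demo = [
    [0.0, -0.10, 0.010, 0.12],
    [0.18, 0.0, 0.020, 0.09],
    [0.15, -0.07, 0.0, 0.05],
    [0.16, -0.06, 0.004, 0.0],
]
drop_index, drop_scores = variable_to_drop(direct_demo, indirect_demo)
```

In this made-up example the third variable (index 2) has both the smallest direct coefficient and the smallest indirect effects routed through it, so it is the one selected for deletion.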
At the same time, to verify the feasibility and effectiveness of our PLS-PA method, we analyze the importance of the variables through the PLS-VIP method and compare the results with those of PLS-PA. The results are shown in Figs. 5-7. It can be observed that the variables deleted by the two variable screening methods are not the same. For example, for experimental dataset A, the variable deleted by PLS-PA is x 3 , and the variable deleted by the PLS-VIP method is x 4 . This difference also shows that, because of the strong correlation and redundancy between the variables in Chinese medicine data, there is a strong mutual effect among the variables.
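For reference, the VIP score used by the PLS-VIP method is commonly defined as follows; the notation here is an assumption for illustration and is not taken from the paper. For a PLS model with $A$ components and $p$ independent variables,

\[
\mathrm{VIP}_j = \sqrt{ \frac{ p \sum_{a=1}^{A} SS_a \left( w_{ja} / \lVert w_a \rVert \right)^2 }{ \sum_{a=1}^{A} SS_a } },
\]

where $w_{ja}$ is the weight of variable $x_j$ in component $a$, and $SS_a$ is the sum of squares of $y$ explained by component $a$. Because the squared VIP scores average to 1 over the $p$ variables, a variable with a VIP score well below 1 is usually regarded as a candidate for deletion. Unlike the path coefficients, this score does not separate the direct effect of a variable from the effects it exerts through correlated variables.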
The two screening methods are each applied to the three datasets, giving new samples from which the screened-out independent variables have been removed. Using MATLAB, we establish a PLS regression model on these samples and obtain the residuals (e i ) and measured coefficients (R²) of the regression model. The comparison of the residuals obtained before and after optimization is shown in Table 8, and the measured coefficients are shown in Table 9. To express the structure of the data in the tables more intuitively, the corresponding line graphs are drawn in Figs. 8 and 9.
Table 6 Path coefficient table of experimental dataset B.
Table 7 Path coefficient table of UCI dataset.
(1) Residual analysis
From Table 8, it is concluded that for all three experimental datasets, the PLS-PA method significantly reduces the ratio of residuals compared with the traditional PLS method. The explanatory ability of the PLS-PA model is enhanced and the fitting effect is improved.
(2) Comparative analysis of R² between PLS and PLS-PA methods
From Table 9 and Fig. 8, after path analysis is used to improve the PLS regression, the measured coefficient R² of dataset A is 0.0691 higher than that of the PLS method, an increase of 11.8%. For dataset B, the variable x 3 is removed by path analysis and R² increases by 0.0391, an increase of 4.7%. R² of the UCI dataset increases by 0.051, an increase of 8.5%. The results for these three datasets show that the PLS regression method integrating path analysis is more effective than the traditional PLS method.
(3) Comparative analysis of R² values of the two improved methods
The improvement of the PLS-VIP method compared with PLS is shown in Table 9. The comparison of the data shows that the PLS-VIP method has no clear effect on the TCM data and may even have a negative effect. According to Fig. 9, the PLS-PA method proposed in this paper is superior to the PLS-VIP method, except that dataset A is excluded from the same independent variable. This shows that the PLS-PA method has clear advantages in the field of TCM data.

Conclusion
In this study, an independent temperature control system was developed to overcome the lack of temperature monitoring in TCM experiments. The system is mainly used for temperature detection during TCM experiments and temperature control during data acquisition. It can help a user avoid differences in data collection due to temperature variations between laboratories and enhance the credibility of experiments. To deal with the complexity of the variables in TCM data, a PLS optimization method incorporating path analysis was proposed to enhance the predictive ability of the model. PLS regression can measure the importance of features on the basis of regression coefficients, and the top k features are then selected to find multiple correlations between variables. However, the feature subset obtained in this way does not usually fit the data well. When the relationship between variables is complicated, or when some explanatory variables indirectly affect the response variables through other explanatory variables, there is often a major impact on modeling.
The instability of the model, which affects its parameter estimation, also increases the error of the model. PLS includes a VIP value variable screening method that has an effect on the screening of independent variables, but when the independent variables are highly important to the dependent variables and are also highly correlated with other independent variables, the effect of using VIP to filter variables is not optimal. Thus, PLS itself needs a variable screening method to solve this problem.
In light of these problems, we chose to perform path analysis to analyze the linear relationship between explanatory and response variables, thereby assisting the PLS model and increasing its accuracy. Through theoretical and experimental analyses of the dose-effect TCM data, we obtain the direct path coefficient and the indirect path coefficient as important measures for determining the selection of variables. The experimental results for three datasets show that using path analysis to filter the independent variables can improve the measured coefficient (R²) when applying PLS. The following conclusions can be drawn: (1) Temperature monitoring and control during an experiment are helpful for improving the quality of data and reducing noise in experimental results. (2) The improved regression model can eliminate interfering elements in sample TCM data; thus, the data structure can be simplified and accuracy can be improved. The effect of this improved regression model is better than that of PLS. (3) When there is a strong correlation between redundant features in a data sample, the value of the direct path coefficient can effectively indicate the importance of an explanatory variable to a response variable.