Classification of Esophageal Adenocarcinoma, Esophageal Squamous Cell Carcinoma, and Stomach Adenocarcinoma Based on Machine Learning Algorithms

Esophageal and gastric cancers are common malignant tumors. Clinically, it is difficult to differentiate esophageal adenocarcinoma (EAC), esophageal squamous cell carcinoma (ESCC), and stomach adenocarcinoma (SAC) by their symptoms. In particular, the molecular characteristics of EAC and SAC are very similar, which makes them difficult to distinguish. Information collected by sensors can be analyzed by machine learning. In this study, we used cancer data published in Nature in 2017, which were downloaded from cBioPortal, to classify the three types of cancer with five machine learning algorithms, and we compared the classification performance of the models by calculating confusion matrices. For the data studied in this paper, the random forest (RF) model gave the best overall classification of the three types of cancer among the five machine learning models. More specifically, this model classified EAC best, whereas its classification of ESCC was not ideal. Classification based on the RF model can effectively enhance the differentiation between EAC, SAC, and ESCC, enabling cancer patients to receive more accurate treatment and have an improved prognosis.


Introduction
Esophageal cancer is a common malignant tumor whose morbidity and mortality rank eighth and fifth among all malignant tumors, respectively, whereas the morbidity and mortality of gastric cancer rank fifth and third, respectively. Esophageal carcinoma is histologically divided into esophageal adenocarcinoma (EAC) and esophageal squamous cell carcinoma (ESCC). (1) In recent decades, the morbidity of esophageal cancer in Western countries has increased severalfold, the five-year survival rate is in the range of 12-20%, (2,3) and esophageal cancer causes more than 400,000 deaths worldwide every year. (4) EAC mainly occurs in the lower esophagus and is associated with obesity, gastric reflux, and Barrett's esophagus. Analyses of the molecular characteristics of patients with esophageal and gastric cancers have shown that EAC and stomach adenocarcinoma (SAC) have very similar unstable chromosomal variations, indicating that these cancers can be considered a single disease entity. (5) The increases in the morbidities of EAC and proximal stomach cancer are synchronous. (6) The boundaries between SAC and EAC and the classification of adenocarcinoma that crosses the gastroesophageal junction are still indistinct, and there are also many disputes about the practicability of histological features. (7-9) Given the uncertain boundary between EAC and SAC, analyses of EAC, ESCC, head and neck squamous cell carcinoma (HNSCC), and SAC have found that the symptoms of ESCC are similar to those of HNSCC and that the symptoms of EAC are similar to those of SAC. ESCC and EAC are distinguished not only by known histopathological and epidemiological characteristics but also by known molecular characteristics.
Many machine learning methods can report the importance of the independent variables in a classification and their influence on the classified dependent variable, and can therefore be used to evaluate the relationship between the independent variables and the dependent variable. Such results are more objective and reasonable than the interpretation of coefficients in a logistic regression model. Machine learning can also combine different competing models to produce more accurate predictions than a single model. At present, many sensors are collecting data, and these data contain useful information. By combining data processing with model training in machine learning, complex tasks can be solved. Until now, very few studies have investigated the use of machine learning classification to distinguish the symptoms of SAC, ESCC, and EAC. Thus, we investigated the use of machine learning algorithms to classify the different cancer types, and confusion matrices were used to measure and compare the classification performance of the different models. The effects of important variables on the different cancer types were identified, which could promote better classification of these cancers and the emergence of new therapies.

Study subjects
In this study, all the cancer data used, which were published in Nature in 2017, were downloaded from cBioPortal. The data include the clinicopathological and molecular characteristics of 90 cases of ESCC, 79 cases of EAC, 388 cases of SAC, and two cases of esophageal gastric cancer. These cancer data were obtained by processing fresh frozen tumor samples, which were collected in multiple countries with informed consent and approval by the local institutional review boards.

Clinical measurements and genetic assessments
Germline deoxyribonucleic acid (DNA) was extracted from blood or nonmalignant esophageal mucosa in these data samples, and whole-exome sequencing, single-nucleotide polymorphism (SNP) array analysis, evaluation of somatic copy-number alterations (SCNAs), analysis of DNA methylation, and mRNA and microRNA sequencing were conducted.

Statistical analysis
In this paper, there were 559 samples with 103 variables in total, covering clinical pathology, histopathology, and molecular characteristics. After the variables with missing rates greater than 60% were deleted, the remaining 75 variables were imputed. Because the variables were of various types and had missing values, five imputation methods were compared: missForest, k-nearest neighbors, central-value imputation, classification and regression trees, and random forest (RF). Finally, we chose missForest because it yielded the best results. Although many packages can be used to impute missing values, they usually do not handle categorical variables, whereas missForest can impute missing values in both continuous and categorical variables.
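The variable-screening step described above can be illustrated with a minimal Python sketch. It is not the authors' R code: the function name `drop_sparse_variables`, the use of `None` to mark a missing value, and the toy records (including the variable `Marker_X`) are assumptions made only for illustration.

```python
def drop_sparse_variables(rows, max_missing_rate=0.6):
    """Keep only variables whose share of missing values (None) is at most max_missing_rate."""
    n = len(rows)
    variables = list(rows[0].keys())
    kept = []
    for var in variables:
        missing = sum(1 for row in rows if row.get(var) is None)
        if missing / n <= max_missing_rate:
            kept.append(var)
    filtered = [{var: row.get(var) for var in kept} for row in rows]
    return filtered, kept


# Hypothetical toy records, not values from the study data.
toy_rows = [
    {"Diagnosis_Age": 61, "Country": "A", "Marker_X": None},
    {"Diagnosis_Age": None, "Country": "B", "Marker_X": None},
    {"Diagnosis_Age": 70, "Country": "A", "Marker_X": None},
    {"Diagnosis_Age": 55, "Country": None, "Marker_X": 1.2},
]
filtered_rows, kept_variables = drop_sparse_variables(toy_rows)
```

In this toy input, `Marker_X` is missing in three of four records (75%) and is therefore dropped, while the other two variables are kept and would be passed on to imputation.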
All the classifications were performed using the imputed values. Because there were only two cases of esophageal gastric cancer, this category was not suitable for classification and was deleted. The classification of ESCC, EAC, and SAC was performed on the remaining 557 samples. The data were divided into a training set and a test set, with 70% of the data randomly selected as the training set and 30% as the test set. Cancers were classified using five machine learning methods: traditional decision trees, conditional inference trees, bagging, AdaBoost, and RF. The misclassification rate, accuracy rate, precision rate, and recall rate of each classification method were calculated from its confusion matrix to evaluate the different classification methods. All statistical analyses were performed with R software (version 3.5.3).
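A random 70/30 split of this kind might be sketched as follows in Python. The function name `train_test_split`, the fixed seed, and the rounding rule for the training-set size are assumptions for illustration; the paper does not state how the split was implemented in R.

```python
import random


def train_test_split(samples, train_fraction=0.7, seed=0):
    """Randomly assign about train_fraction of the samples to training and the rest to testing."""
    rng = random.Random(seed)          # fixed seed (an assumption) for reproducibility
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    n_train = round(train_fraction * len(samples))
    train = [samples[i] for i in indices[:n_train]]
    test = [samples[i] for i in indices[n_train:]]
    return train, test


# 557 placeholder sample identifiers, matching the number of samples kept for classification.
sample_ids = list(range(557))
train_ids, test_ids = train_test_split(sample_ids)
```

Every sample ends up in exactly one of the two sets, so the test set is never seen during training.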

Decision tree model
The decision tree model is an easy-to-use, nonparametric classifier that classifies instances based on variable characteristics. Its structure is tree-shaped, composed of nodes and directed edges, and it does not require any a priori assumptions about the data. The decision tree model is fast to compute, its results are easy to interpret, and it is robust. Based on the ID3 and C4.5 algorithms, the main steps of decision tree learning are feature selection, decision tree generation, and pruning. During learning, the decision tree model is constructed from the training set samples by minimizing a loss function, and a set of test data can then be classified with the decision tree model. An important concept in the decision tree algorithm is entropy, which is a measure of the uncertainty of a random variable. If we let X be a discrete random variable with a finite number of values, its probability distribution can be expressed as
$$P(X = x_i) = p_i, \quad i = 1, 2, \ldots, n. \tag{1}$$
Then, the entropy of X is defined as
$$H(X) = -\sum_{i=1}^{n} p_i \log p_i. \tag{2}$$
The greater the entropy is, the greater the uncertainty of the random variable. The conditional entropy H(Y|X) represents the uncertainty of random variable Y under the condition that random variable X is known. It is the mathematical expectation, with respect to X, of the entropy of the conditional probability distribution of Y given X, i.e.,
$$H(Y|X) = \sum_{i=1}^{n} p_i \, H(Y \mid X = x_i). \tag{3}$$
The information gain represents the degree to which knowing feature X reduces the uncertainty of Y. The information gain of feature A for the training data set D is g(D, A), which is the difference between the empirical entropy H(D) of D and the empirical conditional entropy H(D|A) of D given feature A, that is,
$$g(D, A) = H(D) - H(D|A). \tag{4}$$
The decision tree model applies the information gain criterion to select features: it computes the information gain of each candidate feature, compares the values, and splits on the feature with the largest gain. Features with large information gains have stronger classification capabilities.
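The entropy and information gain described above can be computed directly from label counts. The following Python sketch is illustrative only: it assumes base-2 logarithms (the paper does not state the base), and the toy labels and feature values are hypothetical rather than taken from the study data.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Empirical entropy H(D) of a list of class labels (base-2 logarithm, an assumption)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def conditional_entropy(feature_values, labels):
    """Empirical conditional entropy H(D|A): entropy within each feature value, weighted by its share."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    return sum(len(group) / n * entropy(group) for group in groups.values())


def information_gain(feature_values, labels):
    """Information gain g(D, A) = H(D) - H(D|A)."""
    return entropy(labels) - conditional_entropy(feature_values, labels)


# Hypothetical toy data: one feature separates the classes perfectly, the other not at all.
toy_labels = ["EAC", "EAC", "SAC", "SAC"]
toy_site = ["esophagus", "esophagus", "stomach", "stomach"]
toy_grade = ["G2", "G3", "G2", "G3"]
gain_site = information_gain(toy_site, toy_labels)
gain_grade = information_gain(toy_grade, toy_labels)
```

In this toy case the perfectly separating feature recovers the full entropy of the labels as gain, whereas the uninformative feature has zero gain, so a tree grown with this criterion would split on the former.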

Bagging model
Bagging, which is also known as bootstrap aggregating, is an ensemble technique that trains classifiers on S new datasets drawn from the original dataset, where the observations in each new dataset are selected with replacement. The trained classifiers are used to classify new samples; the classification results of all the classifiers are then counted, and the most frequent category is the final label.
The input is the sample set D = {(x_1, y_1), (x_2, y_2), …, (x_m, y_m)} and the number of weak-classifier iterations T; the output is the final strong classifier f(x).
(1) For t = 1, 2, ..., T:
(a) Draw m observations at random, with replacement, from the training set to obtain a sampling set D_t containing m samples.
(b) Train the tth weak learner on the sampling set D_t.
(2) The category that receives the most votes from the T weak classifiers is the final output.
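The bagging procedure can be sketched in a few lines of Python. This is a simplified illustration, not the implementation used in the study: the weak learner is a hypothetical one-feature threshold classifier (`train_stump`) standing in for a decision tree, and the toy data, number of learners, and seed are assumptions.

```python
import random
from collections import Counter


def train_stump(sample):
    """Fit a one-feature threshold classifier (a hypothetical stand-in for a decision tree)."""
    best = None
    for threshold in sorted({x for x, _ in sample}):
        left = [y for x, y in sample if x <= threshold]
        right = [y for x, y in sample if x > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label


def bagging_fit(data, n_learners=25, seed=0):
    """Train n_learners weak learners, each on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    m = len(data)
    learners = []
    for _ in range(n_learners):
        bootstrap = [data[rng.randrange(m)] for _ in range(m)]  # m draws with replacement
        learners.append(train_stump(bootstrap))
    return learners


def bagging_predict(learners, x):
    """Return the category with the most votes among the weak learners."""
    votes = Counter(learner(x) for learner in learners)
    return votes.most_common(1)[0][0]


# Hypothetical one-dimensional toy data with two well-separated classes.
toy_data = [(1, "A"), (2, "A"), (3, "A"), (7, "B"), (8, "B"), (9, "B")]
ensemble = bagging_fit(toy_data)
low_prediction = bagging_predict(ensemble, 0)
high_prediction = bagging_predict(ensemble, 10)
```

Because each weak learner sees a different bootstrap sample, individual thresholds vary, but the majority vote is stable for points far from the class boundary.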
Bagging is particularly effective when the learner is unstable and tends to overfit, i.e., when small changes in the training data lead to significant changes in the predicted output. Models prone to overfitting do not generalize well outside the training data. Bagging works well with high-variance models, such as decision trees, whereas with low-variance models, such as linear regression, it does not significantly affect the learning process. It effectively reduces the variance by aggregating individual learners trained on samples with different statistical attributes (such as different means and standard deviations). The number of base learners to be selected depends on the characteristics of the data set. Because the base learners are trained independently, bagging can be executed in parallel, which keeps the demand on computing resources in check and is a major advantage; it is commonly used to improve algorithms in various fields.

Adaptive boosting model
Adaptive boosting (abbreviated as AdaBoost) is a common iterative boosting algorithm whose base learner is the classification tree. Each iteration generates a new classifier on the training set, and this classifier is then used to classify all the samples to assess the importance of each sample. Specifically, the algorithm assigns a weight to each training sample, and each sample is labeled by the new classifier after training. If a sample has been classified correctly, its weight is reduced; if it has not been classified correctly, its weight is increased. The larger the weight of a sample is, the greater its influence in the next training iteration; that is, samples that are frequently misclassified receive more attention in subsequent training iterations. The iteration process continues until the error rate is small enough or a certain number of iterations is reached.
Assume that the sample is (X_1, Y_1), (X_2, Y_2), …, (X_n, Y_n); to simplify the description, the dependent variable is assumed to be a binary variable, $Y_i \in \{-1, 1\}$. The concrete steps of the AdaBoost algorithm are as follows. (1) Select the initial bootstrap sampling weights as $w_i = 1/n$, $i = 1, 2, \ldots, n$. (2) Fit a classifier to the weighted sample and compute its weighted error rate. (3) Update the weights, increasing those of the misclassified samples and decreasing those of the correctly classified samples. Steps (2) and (3) are repeated until the error rate is small enough or the set number of iterations is reached. AdaBoost can be used to improve the performance of any machine learning algorithm, and it is most suitable for weak learners. To improve the detection accuracy, AdaBoost requires a large set of training samples, every weak classifier must be trained on these samples, and each sample has many features. Therefore, the amount of computation required to obtain optimal weak classifiers from a large number of features through training is huge.
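The weight-update idea can be made concrete with a small Python sketch. It follows the standard discrete AdaBoost formulation with labels in {-1, 1} and classifier coefficient alpha = 0.5 ln((1 - err)/err), which may differ in detail from the variant used in the study; the one-feature threshold learner `fit_weighted_stump` and the toy data are assumptions for illustration.

```python
from math import exp, log


def stump_predict(x, threshold, polarity):
    """Predict polarity for x <= threshold and -polarity otherwise."""
    return polarity if x <= threshold else -polarity


def fit_weighted_stump(xs, ys, weights):
    """Return (weighted error, threshold, polarity) of the best one-feature threshold classifier."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(x, threshold, polarity) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost_fit(xs, ys, n_rounds=3):
    """Standard discrete AdaBoost (an assumed formulation) with threshold classifiers."""
    n = len(xs)
    weights = [1.0 / n] * n                          # step (1): uniform initial weights
    ensemble = []
    for _ in range(n_rounds):
        error, threshold, polarity = fit_weighted_stump(xs, ys, weights)  # step (2)
        error = min(max(error, 1e-10), 1 - 1e-10)    # guard against log(0) and division by zero
        alpha = 0.5 * log((1 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # step (3): raise the weights of misclassified samples, lower the others, renormalize
        weights = [w * exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def adaboost_predict(ensemble, x):
    """Sign of the alpha-weighted vote of all weak classifiers."""
    score = sum(alpha * stump_predict(x, threshold, polarity)
                for alpha, threshold, polarity in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical toy data that no single threshold can classify correctly.
toy_x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
toy_y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
boosted = adaboost_fit(toy_x, toy_y, n_rounds=3)
toy_predictions = [adaboost_predict(boosted, x) for x in toy_x]
```

No single threshold separates this toy pattern, but the weighted combination of three threshold classifiers reproduces every training label, which illustrates how reweighting directs later classifiers toward previously misclassified samples.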

RF model
An RF classifier contains multiple decision trees, and its output category is the mode of the categories output by the individual trees. The trees in the forest are grown independently of one another. When a classification problem is handled, each decision tree in the forest gives a category for the test sample; the outputs of all the trees are then considered together, and the category of the test sample is determined by voting.
To evaluate the role of each variable in the classification model, an RF classifier gives an importance score for each variable. In an RF classifier, each node is split using the best split among a randomly selected subset of predictors at that node. This somewhat counterintuitive strategy performs very well compared with many other classifiers and is robust to overfitting. In addition, an RF classifier is very user-friendly because it has only two parameters (the number of variables in the random subset at each node and the number of trees in the forest) and is usually not very sensitive to their values.

Classification results for different classification models
In total, 34 clinical pathological variables and molecular variables were selected for the exploration of cancer classification models, and the cancer classification results for different classification models were analyzed. The data were divided into training and test sets; 70% of samples were randomly selected as the training set and the remaining 30% of samples were used as the test set. The classification results obtained by the training set for different models were analyzed.

Classic decision tree classification results
The classic decision tree algorithm usually produces an oversized tree, which leads to overfitting and poor classification performance for units outside the training set. Therefore, tenfold cross-validation is used to select the tree with the smallest prediction error. Table 1 shows the complexity parameter (CP) values, which are used to help set the size of the final tree by imposing a penalty on an oversized tree. The size of the tree is given by the number of splits (nsplit), and a tree with n splits has n + 1 terminal nodes. The rel error is the error of each tree on the training set, the cross-validation error (xerror) is the tenfold cross-validation error based on the training sample, and xstd is the standard deviation of the cross-validation error. Figure 1 shows the relationship between the cross-validation error and the CP value. Among all trees whose cross-validation error is within one standard deviation of the minimum cross-validation error, the smallest tree is taken as the best tree. Figure 1 shows that the optimal tree corresponds to three splits. Table 1 shows that the CP value corresponding to three splits was 0.0714, and the least important branches were cut off according to this CP value by using the prune function. Figure 2 shows the pruned classic decision tree with the ideal size used to predict cancer types. When Mutation_Count was larger than 158, the type of cancer was EAC, indicating that Mutation_Count can be used as a significant indicator to distinguish between SAC and EAC. In addition, Country and Diagnosis_Age can be used as indicators for distinguishing between ESCC and SAC, which could help in the prediction of cancer types and facilitate subsequent targeted treatments.
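The tree-selection rule described above (take the smallest tree whose cross-validation error lies within one standard deviation of the minimum) can be written as a short Python sketch. The function name `select_tree_one_se` and the CP table below are hypothetical; the values are not those of Table 1.

```python
def select_tree_one_se(cp_table):
    """Pick the smallest tree whose xerror is within one xstd of the minimum xerror."""
    best = min(cp_table, key=lambda row: row["xerror"])
    threshold = best["xerror"] + best["xstd"]
    candidates = [row for row in cp_table if row["xerror"] <= threshold]
    return min(candidates, key=lambda row: row["nsplit"])


# Hypothetical CP table (illustrative values only, not the study's Table 1).
toy_cp_table = [
    {"cp": 0.30, "nsplit": 0, "xerror": 1.00, "xstd": 0.06},
    {"cp": 0.10, "nsplit": 1, "xerror": 0.72, "xstd": 0.05},
    {"cp": 0.05, "nsplit": 3, "xerror": 0.50, "xstd": 0.05},
    {"cp": 0.01, "nsplit": 5, "xerror": 0.48, "xstd": 0.05},
]
chosen_tree = select_tree_one_se(toy_cp_table)
```

In this toy table the tree with five splits has the lowest cross-validation error, but the tree with three splits is within one standard deviation of it and is preferred because it is smaller; its CP value is the one that would be passed to the pruning step.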

Classification results for conditional inference tree
A variant of the traditional decision tree is the conditional inference tree, which is similar to a traditional decision tree except that the selection of variables and splits is based on significance testing, pruning is not necessary, and the generation process is more automated. Figure 3 shows a conditional inference tree in which the shaded area in each node represents (from left to right) the proportions of ESCC, EAC, and SAC. The splitting variables in the conditional inference tree were Country, Anatomic_Site, TP53_Mutate, Lymphocyte_Infiltration, Diagnosis_Age, and Histologic_Grade.

Classification results for bagging model
It can be seen from Fig. 4 that Anatomic_Site, Country, Genome_Altered, Diagnosis_Age, Mutation_Count, and Histologic_Grade were the most important variables in the bagging model, similar to the results obtained from the decision tree (Anatomic_Site, Country, Diagnosis_Age, and Histologic_Grade).

Classification results for an AdaBoost model
As shown in Fig. 5, Anatomic_Site, Country, Diagnosis_Age, Genome_Altered, Lymphocyte_Infiltration, and Mutation_Count were the most important variables for the classification of the AdaBoost model, similar to the results obtained for the decision tree model (Anatomic_Site, Country, Diagnosis_Age, and Lymphocyte_Infiltration).

Classification results for RF model
It can be seen from Fig. 6 that Anatomic_Site, Country, Histologic_Grade, Lymphocyte_Infiltration, Genome_Altered, and Mutation_Rate were important variables in the RF model, which largely coincide with the most important variables in the decision tree, bagging, and AdaBoost models. This finding shows that Anatomic_Site, Country, and Genome_Altered were the most important indicators across the classification models and are an important basis for classifying cancer types.

Computation of confusion matrix and comparison of classification results for different classification models
There are several methods for evaluating classification models, including confusion matrices, gain charts, lift charts, Kolmogorov-Smirnov (KS) charts, and receiver operating characteristic curves. The data were divided into a training set and a test set, with 70% of the data randomly selected as the training set and 30% selected as the test set for comparison. We used the confusion matrix of the training set to determine the best classification method, and the accuracy, precision, and recall rates were calculated using the confusion matrix. The accuracy rate is the proportion of all correct predictions (positive and negative), the precision rate is the proportion of correct positive predictions among all positive predictions, and the recall rate is the proportion of actual positives that are correctly predicted. The confusion matrices of the classification models are shown in Table 2. From each confusion matrix, we can calculate the accuracy rates, precision rates, and recall rates for classification models for the test set, and the results are shown in Table 3.
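For a three-class problem, these rates can be derived from the confusion matrix by treating each cancer type in turn as the positive class. The Python sketch below illustrates the calculation; the function name `classification_metrics` and the toy labels are hypothetical and do not reproduce Tables 2 and 3.

```python
def classification_metrics(actual, predicted, classes):
    """Build a confusion matrix and derive accuracy, error rate, and per-class precision and recall."""
    # matrix[a][p] counts samples whose actual class is a and predicted class is p
    matrix = {a: {p: 0 for p in classes} for a in classes}
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    accuracy = sum(matrix[c][c] for c in classes) / len(actual)
    per_class = {}
    for c in classes:
        predicted_c = sum(matrix[a][c] for a in classes)   # column total
        actual_c = sum(matrix[c][p] for p in classes)      # row total
        per_class[c] = {
            "precision": matrix[c][c] / predicted_c if predicted_c else 0.0,
            "recall": matrix[c][c] / actual_c if actual_c else 0.0,
        }
    return matrix, accuracy, 1 - accuracy, per_class


# Hypothetical labels for eight test samples (illustrative only).
toy_actual = ["EAC", "EAC", "ESCC", "ESCC", "SAC", "SAC", "SAC", "SAC"]
toy_predicted = ["EAC", "SAC", "ESCC", "EAC", "SAC", "SAC", "SAC", "EAC"]
toy_matrix, toy_accuracy, toy_error, toy_rates = classification_metrics(
    toy_actual, toy_predicted, ["EAC", "ESCC", "SAC"]
)
```

The error rate reported for a model is simply one minus its accuracy, whereas precision and recall are reported separately for each cancer type.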
According to Table 2, the error rates of the classical decision tree model, conditional inference tree model, bagging model, AdaBoost model, and RF model were 21.40, 19.00, 15.40, 12.50, and 11.30%, respectively. From these results, the RF model has the best classification performance, followed by the AdaBoost model, bagging model, and conditional inference tree model, while the decision tree model has the poorest classification performance. As can be seen from Table 3, both the precision and recall rates of all classification models for ESCC were low, and the classification of ESCC by the models was inadequate, whereas the classification of EAC was the best. However, for the RF model, the classification of SAC was the best, and the recall rates for ESCC and EAC were the lowest at 78.79 and 62.50%, respectively. The comprehensive comparison showed that the RF model had the best overall performance.
For the data set we studied, by analyzing the confusion matrices for the different classification methods, we found that the RF model performs best in predicting EAC because it has the highest accuracy, precision, and recall rates. The other classification methods also discriminate EAC most accurately, indicating that it is easily distinguished from the other two cancers.
In this paper, the variables used to classify the cancers include clinicopathological characteristics, demographic characteristics, and molecular characteristics, and the variables are mainly discrete, although some are continuous. The above models can deal with both continuous and discrete data effectively. Overfitting easily occurs in the decision tree model, but we avoid this problem by pruning. However, the classification accuracy of the decision tree model is inferior to those of the other models. AdaBoost improves the performance through boosting, in which feature screening is unnecessary and overfitting seldom occurs, making AdaBoost suitable for the more complex data types in this paper. RF is a modification of bagging and is not prone to overfitting, because each tree is trained on a sample that does not contain all the samples. When dealing with class imbalance, RF can also provide an effective method to balance the error of the data set, giving it better classification performance than the other classification algorithms.

Classification of cancers for RF model
By analyzing the confusion matrix, it is concluded that the RF model can classify the cancer data better than the other models. Next, we use the RF model to classify cancer types and analyze the classification results, and the confusion matrix and error for the RF model are presented in Table 4. For the RF classification, the error rate in classifying the three cancers is very low, especially for SAC.
The Gini index (Gini impurity) indicates the probability that a randomly selected sample in the sample set will be misclassified. The smaller the Gini index is, the smaller the probability that the selected sample will be misclassified, that is, the higher the purity of the set; the higher the Gini index is, the less pure the set. The Gini index is equal to the sum, over the categories, of the probability of a sample being selected multiplied by the probability of that sample being misclassified:
$$\mathrm{Gini}(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2,$$
where $p_k$ is the proportion of samples in the set that belong to category k and K is the number of categories. Figure 7 shows the importance of each variable, measured for each of the three levels of the dependent variable (cancer types), together with the effect of the variables on the prediction accuracy for all cancer types and on the Gini index. The more mixed the categories contained in the set, the larger the Gini index will be (similar to the concept of entropy). For a given node, the lower the entropy is, the purer the node will be, and likewise the smaller the Gini index is, the purer the node will be. Thus, the purer the node is, the more clearly it indicates which type its samples belong to, and the more ideal the result is. As shown in Fig. 7, for the RF model, the variables with the highest importance for EAC were Anatomic_Site, Country,

Discussion
In a previous study, the molecular characteristics of the histological subtypes EAC and ESCC were found to differ across all detection platforms. (5) The similarity between ESCC and HNSCC is greater than that between ESCC and EAC. Therefore, in classification by machine learning, ESCC and EAC are the easiest to distinguish. Previous studies found that the similarity between EAC and SAC is higher than that between EAC and ESCC. However, according to the model in which EAC is derived from Barrett's esophagus rather than the stomach, Barrett's esophagus and EAC may be derived from proximal gastric cells or the embryonic residual cell population of the gastroesophageal junction, and EAC is considered to be separate from SAC. (10,11) Nevertheless, since the molecular characteristics of EAC and chromosomal instability (CIN) gastric cancer are similar, we may not be able to completely distinguish EAC from CIN gastric cancer by relying solely on molecular analysis. Therefore, it is necessary to analyze the clinical and pathological characteristics together with the molecular characteristics and to use a machine learning method to classify these cancers.

Conclusions
The information collected by sensors can be analyzed by machine learning. Machine learning has the advantages of high classification accuracy, fast calculation, and strong learning ability. By analyzing the data, reliable conclusions can be obtained. The combination of sensors and machine learning has certain practicability in many fields; the more data collected by the sensors in the future, the more machine learning algorithms can be optimized, thereby improving the accuracy of the analysis results.
In this paper, the decision tree, conditional inference tree, bagging, AdaBoost, and RF machine learning classification algorithms were used to classify three types of cancer, and the importance of variables for the different classification models was analyzed. The classification results showed that all models were the least effective and had the lowest precision in the classification of ESCC. Country and Anatomic_Site were the most important variables for the different classification methods, indicating that they are very important for differentiating between cancer types. To verify the classification performance of the different models, confusion matrices were used to evaluate the models, and the classification results of the RF model were found to be the best. In this paper, we studied an unbalanced data set, for which the RF model can provide an effective method to balance errors in the data set. If a large proportion of the feature values is missing, the RF algorithm can still maintain accuracy. The RF algorithm has strong anti-interference and anti-overfitting abilities, resulting in its high classification performance in this study. However, in this paper, we only classified EAC, ESCC, and SAC, and the differentiation of other cancer types using machine learning classification algorithms was not addressed. Because ESCC is not easy to distinguish by general medical means from other squamous cell carcinomas that are highly similar to it, such as those of the head and neck, further studies are necessary. Classification based on the RF model can effectively improve the differentiation between EAC and SAC, enabling cancer patients to receive more accurate treatments and have an improved prognosis.