Classification of Hepatocellular Carcinoma and Liver Abscess by Applying Neural Network to Ultrasound Images

1 Graduate Institute of Automation and Control, National Taiwan University of Science and Technology, Da’an District, Taipei 10607, Taiwan
2 Division of Gastroenterology and Hepatology, Department of Internal Medicine, Taipei Medical University Hospital, Xinyi District, Taipei 11031, Taiwan
3 Division of Gastroenterology and Hepatology, Department of Internal Medicine, School of Medicine, College of Medicine, Taipei Medical University, Xinyi District, Taipei 11031, Taiwan
4 School of Public Health, College of Public Health, Taipei Medical University, Xinyi District, Taipei 11031, Taiwan
5 Department of Family Medicine, Taipei Medical University Hospital, Xinyi District, Taipei 11031, Taiwan
6 Department of Electrical Engineering, National Taiwan University of Science and Technology, Da’an District, Taipei 10607, Taiwan


Introduction
There are many liver diseases, for example, liver cancer (hepatocellular carcinoma, HCC) and liver abscess. Liver cancer has a high mortality rate. Depending on the stage of the disease, treatment options include surgery, radiation therapy, chemotherapy, tumor ablation, embolization therapy, targeted therapy, and many others. Even though liver biopsy is effective for obtaining a correct diagnosis, it may cause side effects in patients such as pain, infection, or injuries affecting the subsequent treatment. Because of these risks and undesired effects, many other approaches are used to help diagnose liver disease. Ultrasound imaging is a feasible approach, and a computer-aided diagnosis (CAD) system can help an inexperienced clinician in diagnostic evaluation.
Medical ultrasound imaging is based on the pulse-echo principle. An ultrasound transducer converts an electrical signal into an ultrasound pulse, which enters the tissue from the body surface. At the surface, an echo appears. The probe senses and receives the echo, and all the echoes are converted back to signals and graphics, which can be seen by medical staff. In medical ultrasound, a coupling gel is used as a universal medium to avoid excessive reflection caused by the tiny amount of air between the probe and the skin. (1,2)
On the other hand, neural networks (NNs) are a powerful technique for solving research problems. For example, a radial basis function NN is used to design a control law for a derived mathematical kinematic model of mobile robots. (3,4) Chien et al. applied a multilayer perceptron (MLP) NN to an impulse noise detector for power-line-based sensor networks. (5) Moreover, an MLP NN has been applied to classification for biomedical image processing. (6) Here, we propose NN-based classification for the CAD of images obtained from ultrasound imaging, which has several advantages over liver biopsy such as no radiation risk, low cost, easy operation, and non-invasiveness. We applied the gray-level co-occurrence matrix (GLCM) (7-9) and the gray-level run-length matrix (GLRLM) (7,10) as textural features with three feature selection models: sequential forward selection (SFS), (7,9,11) sequential backward selection (SBS), (7,9,12) and F-score. (7,13)

Feature Extraction
We retrieved the images for analysis from the Medical University Hospital in Taipei. They consisted of 44 cases of HCC and 35 cases of liver abscess, i.e., 79 cases of liver disease in total. For each case, we selected a 32 × 32-pixel region of interest (ROI) inside marked boundaries and converted it to a BMP file with 256 gray levels using MATLAB for convenient processing, as shown in Fig. 1. We sampled 400 ROIs from the images for each disease, which were used for training and testing. From each of the ROIs, we extracted 96 features (52 GLCM features and 44 GLRLM features). The GLCM (7-9) is a matrix depicting how often different combinations of gray levels occur in an image.
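As an illustration of the co-occurrence idea, the following is a minimal Python sketch, not the MATLAB code used in this study. The function names `glcm` and `glcm_features`, the 4 × 4 toy image with four gray levels, and the choice of contrast and energy as example descriptors are all assumptions made for the example.

```python
import numpy as np

def glcm(image, dx, dy, levels):
    """Count how often gray level i is followed by gray level j at offset (dx, dy)."""
    counts = np.zeros((levels, levels), dtype=np.float64)
    rows, cols = image.shape
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r, c], image[r2, c2]] += 1
    return counts

def glcm_features(counts):
    """Two texture descriptors computed from the normalized co-occurrence matrix."""
    p = counts / counts.sum()
    i, j = np.indices(p.shape)
    contrast = float(np.sum(p * (i - j) ** 2))  # large when neighbors differ strongly
    energy = float(np.sum(p ** 2))              # large when few gray-level pairs dominate
    return contrast, energy

# Toy image (hypothetical); a real ROI would be 32 x 32 pixels with 256 gray levels.
roi = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 2, 2, 2],
                [2, 2, 3, 3]])
counts = glcm(roi, dx=1, dy=0, levels=4)  # horizontal neighbor, i.e., the 0 degree direction
contrast, energy = glcm_features(counts)
```

Repeating the calculation for other offsets, such as (1, -1), (0, -1), and (-1, -1), would give matrices for the remaining directions.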
The GLCM feature extraction method consists of two steps: (1) co-occurrence matrix calculation and (2) the computation of texture features from the co-occurrence matrix. The resulting GLCM features, also called the Haralick features, are extracted from each image and shown in Table 1. The other method employed in this research is the GLRLM, (7,9) for which a run-length matrix is computed in each of four directions (horizontal, vertical, and the two diagonals, i.e., 0, 45, 90, and 135°). The results of the calculation are called texture descriptors, and each descriptor is unique for each texture. We extracted the 11 most often used features from the run-length matrices, which are shown in Table 2.
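The run-length calculation can be sketched in the same way. The Python example below is a minimal illustration under stated assumptions: it handles only the 0° direction, the names `glrlm_horizontal`, `short_run_emphasis`, and `long_run_emphasis` are hypothetical, and the two emphasis measures are taken as representative run-length descriptors rather than as the exact entries of Table 2.

```python
import numpy as np

def glrlm_horizontal(image, levels):
    """Run-length matrix for the 0 degree direction.

    Entry [g, n - 1] counts the runs of gray level g that are n pixels long.
    """
    rows, cols = image.shape
    rlm = np.zeros((levels, cols), dtype=np.float64)
    for r in range(rows):
        run_value, run_length = image[r, 0], 1
        for c in range(1, cols):
            if image[r, c] == run_value:
                run_length += 1
            else:
                rlm[run_value, run_length - 1] += 1
                run_value, run_length = image[r, c], 1
        rlm[run_value, run_length - 1] += 1  # close the last run of the row
    return rlm

def short_run_emphasis(rlm):
    """Weights each run by 1 / length^2, so fine textures score high."""
    lengths = np.arange(1, rlm.shape[1] + 1)
    return float(np.sum(rlm / lengths ** 2) / rlm.sum())

def long_run_emphasis(rlm):
    """Weights each run by length^2, so coarse textures score high."""
    lengths = np.arange(1, rlm.shape[1] + 1)
    return float(np.sum(rlm * lengths ** 2) / rlm.sum())

# Toy image (hypothetical) with four gray levels.
roi = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 2, 2, 2],
                [2, 2, 3, 3]])
rlm = glrlm_horizontal(roi, levels=4)
sre = short_run_emphasis(rlm)
lre = long_run_emphasis(rlm)
```

Applying the same function to the transposed image would give the 90° direction; the diagonal directions require traversing the image along its diagonals.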

Feature Selection
The main idea of our feature selection methods is to keep the useful features while eliminating those that contain little or no predictive information. The advantages of using feature selection are reduced computation and cost, improved accuracy, and greater understanding of the difference between HCC and liver abscess. We used three feature selection methods: SFS, SBS, and F-score.
The SFS method starts with an empty set and, at each step, adds the feature that gives the highest value of the objective function when combined with the features already selected. This is repeated until a predefined number of features are selected. SBS works the opposite way: it starts from the full set of features and repeatedly removes the worst feature until a predefined number of features are left.
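The two greedy searches can be sketched as follows. This is a minimal Python illustration, not the implementation used in this study: the `score` callback stands in for the objective function (for example, classification accuracy), and the toy objective with integer weights and a redundancy penalty is purely hypothetical.

```python
def sequential_forward_selection(n_features, n_select, score):
    """Greedy SFS: `score` maps a list of feature indices to a number (higher is better)."""
    selected = []
    remaining = list(range(n_features))
    while len(selected) < n_select and remaining:
        # Add the feature that gives the best score together with those already chosen.
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

def sequential_backward_selection(n_features, n_keep, score):
    """Greedy SBS: start from all features and drop the least useful one each step."""
    selected = list(range(n_features))
    while len(selected) > n_keep:
        # The worst feature is the one whose removal leaves the best-scoring subset.
        worst = max(selected, key=lambda f: score([g for g in selected if g != f]))
        selected.remove(worst)
    return selected

# Toy objective (hypothetical): each feature has a fixed value, but features 1 and 3
# carry overlapping information, so using both is penalized.
weights = [1, 9, 4, 8, 2]

def toy_score(subset):
    total = sum(weights[f] for f in subset)
    if 1 in subset and 3 in subset:
        total -= 5
    return total

sfs_result = sequential_forward_selection(5, 2, toy_score)
sbs_result = sequential_backward_selection(5, 2, toy_score)
```

In this toy case both searches skip feature 3 even though it has the second-highest individual value, because it adds little once feature 1 is present.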
Given the training vectors, the F-score measures how well a feature discriminates between the positive and negative sets. The higher the F-score, the more discriminative the feature is. The disadvantage of the F-score is that it cannot reveal shared information between features. We used two F-score methods. The first method, called the search-all method, evaluated all the features. The second method, called the threshold method, selected four thresholds for each feature and six thresholds for all features (discussed in Sect. 6) where the gap between the low and high F-scores is considerable.
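A per-feature F-score can be sketched as below. The example assumes the commonly used definition in which the squared distances of the two class means from the overall mean are divided by the sum of the two within-class sample variances; whether this matches the exact formula of Ref. 13 is an assumption, and the function name `f_scores` and the toy data are hypothetical.

```python
import numpy as np

def f_scores(X, y):
    """F-score of every column of X for binary labels y (1 = positive, 0 = negative)."""
    pos, neg = X[y == 1], X[y == 0]
    mean_all = X.mean(axis=0)
    # Between-class separation of the class means from the overall mean.
    numerator = (pos.mean(axis=0) - mean_all) ** 2 + (neg.mean(axis=0) - mean_all) ** 2
    # Within-class spread (sample variances); a constant feature would give a zero here.
    denominator = pos.var(axis=0, ddof=1) + neg.var(axis=0, ddof=1)
    return numerator / denominator

# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
X = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 4.0],
              [7.0, 2.0], [8.0, 5.0], [9.0, 1.0]])
y = np.array([0, 0, 0, 1, 1, 1])
scores = f_scores(X, y)
ranking = np.argsort(scores)[::-1]  # feature indices in descending order of F-score
```

Because each feature is scored on its own, two highly correlated features receive similar scores, which illustrates why the F-score cannot reveal shared information between features.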

NN
Recently, numerous research studies have been carried out on NNs because they are powerful for performing complex tasks in a wide range of fields, such as system control, (3,4) communication, (5) and medical diagnosis. (6) We used a feedforward neural network (FFNN) based on a backpropagation (BP) learning algorithm with one hidden layer of 10 nodes. (14,15) When a sample $x_p = (x_{p1}, x_{p2}, \ldots, x_{pR})^T$ is input into the FFNN, it is distributed among the nodes of the hidden layer, as shown by the structure in Fig. 2, in which each output node k has a bias and a transfer function f. At the beginning, we assigned the initial weights and thresholds randomly, updating them at every iteration to minimize the mean square error E between the output and the target. The weights between the layers were updated in each iteration by the gradient-descent rule

$$w \leftarrow w - \eta \frac{\partial E}{\partial w},$$

where η is a step size in the range [0.01, 1] and w stands for either a weight from the ith input node to the jth hidden node or a weight from the jth hidden node to the kth output node. We used 0.1 as the step size for 1000 iterations.
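A network of this kind can be sketched in a few lines of NumPy. The example below is a minimal illustration, not the implementation used in this study: the sigmoid transfer function, the full-batch update, the single output node, the initialization scale, and the synthetic two-cluster data are all assumptions; only the hidden-layer size (10), the step size (0.1), and the number of iterations (1000) are taken from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_ffnn(X, t, hidden=10, eta=0.1, iterations=1000, seed=0):
    """Train a one-hidden-layer network by full-batch gradient descent on the MSE."""
    rng = np.random.default_rng(seed)
    n_samples, n_inputs = X.shape
    # Random initial weights and biases (thresholds).
    W1 = rng.normal(0.0, 0.5, (n_inputs, hidden))
    b1 = rng.normal(0.0, 0.5, hidden)
    W2 = rng.normal(0.0, 0.5, (hidden, 1))
    b2 = rng.normal(0.0, 0.5, 1)
    errors = []
    for _ in range(iterations):
        # Forward pass through the hidden and output layers.
        h = sigmoid(X @ W1 + b1)
        y = sigmoid(h @ W2 + b2)
        errors.append(float(np.mean((y - t) ** 2)))
        # Backward pass: gradients of E = mean((y - t)^2) via the chain rule.
        delta_out = 2.0 * (y - t) / n_samples * y * (1.0 - y)
        delta_hid = (delta_out @ W2.T) * h * (1.0 - h)
        # Gradient-descent update: w <- w - eta * dE/dw.
        W2 -= eta * (h.T @ delta_out)
        b2 -= eta * delta_out.sum(axis=0)
        W1 -= eta * (X.T @ delta_hid)
        b1 -= eta * delta_hid.sum(axis=0)
    return (W1, b1, W2, b2), errors

# Synthetic two-class data (hypothetical), standing in for selected texture features.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 0.5, (20, 2)), rng.normal(1.0, 0.5, (20, 2))])
t = np.vstack([np.zeros((20, 1)), np.ones((20, 1))])
params, errors = train_ffnn(X, t)
```

The list `errors` records the mean square error before each update, so plotting it would show how E falls over the iterations.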

Performance Evaluation
One of the most popular methods of evaluating a model's prediction performance is cross-validation. There are two commonly used cross-validation methods, leave-one-out cross-validation (LOOCV) and k-fold cross-validation. We used k-fold cross-validation, more specifically, 10-fold cross-validation, because it has the advantage of using all samples in both training and validation. We partitioned all the samples randomly into 10 groups of the same size and used one group for testing and the others for training. We repeated the process until all the groups were tested, then all the results were averaged to a single estimation, which is called the true accuracy, defined as

$$\text{True accuracy} = \frac{1}{10} \sum_{i=1}^{10} \text{Accuracy}_i,$$

where $\text{Accuracy}_i$ is the accuracy obtained when the ith group is used for testing. The accuracy, which represents the performance of the classifier, was estimated as

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

in which TP, TN, FP, and FN denote true positive, true negative, false positive, and false negative, respectively.
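The evaluation procedure can be sketched as follows. This Python example is a minimal illustration under stated assumptions: the nearest-class-mean classifier is only a simple stand-in for the NN, the function names are hypothetical, and the one-dimensional data are synthetic.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN) for binary labels."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return (tp + tn) / (tp + tn + fp + fn)

def k_fold_accuracy(X, y, fit_predict, k=10, seed=0):
    """Average the test accuracy over k random folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        y_pred = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        scores.append(accuracy(y[test_idx], y_pred))
    return float(np.mean(scores))

def nearest_mean_classifier(X_train, y_train, X_test):
    """Stand-in classifier (hypothetical): assign each sample to the closer class mean."""
    mean0 = X_train[y_train == 0].mean(axis=0)
    mean1 = X_train[y_train == 1].mean(axis=0)
    d0 = np.linalg.norm(X_test - mean0, axis=1)
    d1 = np.linalg.norm(X_test - mean1, axis=1)
    return (d1 < d0).astype(int)

# Synthetic, clearly separable data (hypothetical): 50 samples per class.
X = np.concatenate([np.linspace(0.0, 1.0, 50), np.linspace(9.0, 10.0, 50)]).reshape(-1, 1)
y = np.concatenate([np.zeros(50, dtype=int), np.ones(50, dtype=int)])
true_accuracy = k_fold_accuracy(X, y, nearest_mean_classifier, k=10)
```

Each sample appears in exactly one test fold, so every sample contributes to both training and validation over the 10 rounds.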

Results and Discussion
The NN-based classification system we built was trained using several different sets of features: only GLCM features, only GLRLM features, and both GLCM and GLRLM features. The classification results of training with GLCM features, GLRLM features, and both GLCM and GLRLM features obtained with the F-score feature selection method are shown in Figs. 3-5, respectively. There are a total of 52 GLCM features (Feature 1 to Feature 52) and a total of 44 GLRLM features (Feature 53 to Feature 96). We calculated and added features based on their F-score in descending order to train the network and compute the accuracy. Details of the feature extraction are given in Ref. 7. The F-score of the GLCM feature extraction method was as high as 0.225. When Feature 52 was added, the accuracy reached 80.75% (Fig. 3). For GLRLM, the highest F-score was 0.5 and the accuracy increased to 81.5% (Fig. 4) when Feature 96 was added, then decreased. By combining GLCM and GLRLM features, we can obtain an accuracy of up to 88.375% based on their F-scores up to Feature 76 (Fig. 5). Table 3 shows the classification results obtained using different feature selection methods. It can be seen that using feature selection models generally gives better results, except for SFS. The best result was obtained using the F-score search with the search-all method, which had an accuracy of 88.375%.

Conclusion
Medical ultrasound is a diagnostic imaging technique with several advantages, such as no radiation risk, low cost, easy operation, and non-invasiveness. In addition, a CAD system can help an inexperienced clinician in diagnostic evaluation. The novelty of this paper lies in introducing an NN-based classification system for ultrasound images with textural features to distinguish between HCC and liver abscess. We calculated GLCM and GLRLM features and selected them by the SFS, SBS, and F-score feature selection methods before using an NN to classify the images, which verified the feasibility of the approach. The proposed method can provide diagnostic help in distinguishing HCC from liver abscess with an accuracy of up to 88.375%. A limitation of this study was the lack of a large amount of data for training and validation. As future research, an extended scheme for use with big data, such as one based on deep learning, can be considered.