Comparison of machine learning methods for classifying mediastinal lymph node metastasis of non-small cell lung cancer from 18F-FDG PET/CT images


Method comparison

The comparison of the classical methods shows that AdaBoost, RF, and SVM outperformed ANN in terms of both AUC and ACC. From a methodological perspective, AdaBoost and RF are both ensembles of decision trees. Other comparison studies [36, 37] have likewise demonstrated that ensemble methods outperform other classifiers. A decision tree can exploit different features that compensate for one another, and an ensemble combines many weak tree classifiers into a strong one. As a result, ensembles of decision trees can yield good classification results even from a weak feature set. As shown in Table 3 and Fig. 1, when only T82 was used as input, AdaBoost and RF achieved better AUC and ACC than SVM and BP-ANN. SVM belongs to the family of kernel-based classifiers, which implicitly map the input features into a higher-dimensional space using a kernel function that measures the similarity between feature points in the mapped space. Through this kernel-based mapping, SVM can achieve much better classification performance than conventional linear methods. The choice of kernel function strongly affects SVM performance, and the nonlinear kernel used in this study helped SVM maintain good performance even with suboptimal input features (A95 and S6).
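As a concrete illustration, the comparison between the tree ensembles and a kernel SVM can be sketched as follows in scikit-learn. The feature matrix and labels are random placeholders standing in for a radiomic feature set such as T82, and the hyperparameters are illustrative, not those used in this study.

```python
# Minimal sketch (not the study's exact pipeline): tree ensembles vs. a
# nonlinear-kernel SVM on a placeholder feature matrix.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 82))          # stand-in for 82 texture features
y = rng.integers(0, 2, size=200)        # stand-in for benign/malignant labels

classifiers = {
    "AdaBoost": AdaBoostClassifier(n_estimators=100),
    "RF": RandomForestClassifier(n_estimators=100),
    "SVM (RBF)": SVC(kernel="rbf", gamma="scale"),  # nonlinear kernel
}
for name, clf in classifiers.items():
    auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean CV AUC = {auc:.3f}")
```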

ANN and CNN both belong to the neural network family, yet ANN performed worse than CNN. Since we used only two hidden layers for the ANN, its imperfect performance might seem to stem from an insufficient number of layers. However, when we tested different depths (from one to seven hidden layers), the best number was two, not seven. Although deeper networks are generally assumed to outperform shallower ones, this holds only when there is enough training data and the training method can effectively learn deep networks. In this study, the training data were not abundant enough to support a deeper BP-ANN, and back-propagation alone is not well suited to training deep networks [33]. In contrast, CNN is specifically designed for learning deep networks, and it also uses data augmentation to enlarge the training set.
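The layer-depth experiment described above can be sketched as follows; the data are random placeholders, and the layer width (64 units) is an illustrative assumption rather than the configuration used in the study.

```python
# Hedged sketch of the hidden-layer sweep: train a BP-ANN (MLP) with one
# to seven hidden layers and compare cross-validated accuracy.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 82))    # placeholder features
y = rng.integers(0, 2, size=200)  # placeholder labels

for n_layers in range(1, 8):
    ann = MLPClassifier(hidden_layer_sizes=(64,) * n_layers,
                        max_iter=2000, random_state=0)
    acc = cross_val_score(ann, X, y, cv=5, scoring="accuracy").mean()
    print(f"{n_layers} hidden layer(s): mean CV ACC = {acc:.3f}")
```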

Compared with the human doctors from our institute, all five machine learning methods had higher sensitivity but lower specificity. The doctors tended to underestimate malignancy because most of the lymph nodes in this study were small. In other words, the machine learning methods gained sensitivity at the cost of specificity. When ACC was used as a criterion more balanced than sensitivity or specificity alone, RF, AdaBoost, and CNN outperformed the human doctors, but the difference was not significant after Bonferroni and FDR corrections.
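For reference, the multiple-comparison step can be sketched as below; the p-values are hypothetical placeholders, one per method, and do not reproduce the study's actual test results.

```python
# Sketch of Bonferroni and Benjamini-Hochberg (FDR) corrections applied to
# per-method p-values from comparing each classifier's ACC to the doctors'.
from statsmodels.stats.multitest import multipletests

pvals = [0.03, 0.04, 0.02, 0.20, 0.15]   # hypothetical, one per method
for method in ("bonferroni", "fdr_bh"):
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(method, p_adj.round(3), reject)
```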

In many recent publications on medical image analysis, CNN has been reported to outperform classical methods for imaging modalities other than PET/CT. In this study, CNN was not significantly better than RF, AdaBoost, or SVM, because it did not fully exploit the functional nature of PET. Before the image patches are fed to the CNN, the pixel intensities are normalized to the range [−1, 1], so the discriminative power of SUV is lost in the normalization. It was surprising that, without this important SUV feature, the difference between CNN and the best classical methods was not evident. CNN instead exploits the image appearance pattern around the lymph node, which includes information on local contrast, nearby tissues, boundary sharpness, etc. Such information is different from, but as powerful as, diagnostic features like SUV, tumor size, and local heterogeneity. To illustrate the discriminative power of the appearance pattern, we extracted the intermediate feature vector produced by the internal flattening layer of the CNN. This vector of 512 features can be considered a sparse representation of the image patch's appearance. We used these 512 features as input to the classical methods. For RF, they yielded an AUC of 0.89 and an ACC of 80.8%; for SVM, an AUC of 0.89 and an ACC of 80.6%. These results were much higher than the AUC and ACC of T82, and even close to the results of D13. Unlike texture features, the CNN appearance patterns are not affected by the size of the lymph node, because they are computed from the entire image patch, which includes both the lymph node and its surrounding tissues. Therefore, the image appearance pattern is a promising substitute for the texture features, as well as a good complement to the diagnostic features.
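A minimal sketch of this feature-extraction experiment is given below. The network is a small stand-in whose flattening layer happens to output 512 values, not the reduced AlexNet of the study, and the patch data are random placeholders already scaled to [−1, 1].

```python
# Hedged sketch: read out the 512-dimensional flatten-layer activations of a
# trained CNN and feed them to a classical classifier.
import numpy as np
from tensorflow.keras import layers, models
from sklearn.ensemble import RandomForestClassifier

# Placeholder patches normalized to [-1, 1]; absolute SUV is lost here.
patches = np.random.rand(50, 32, 32, 1).astype("float32") * 2 - 1
labels = np.random.randint(0, 2, 50)

# Stand-in CNN whose flatten layer yields exactly 8 x 8 x 8 = 512 values.
cnn = models.Sequential([
    layers.Conv2D(16, 3, padding="same", activation="relu",
                  input_shape=(32, 32, 1)),
    layers.MaxPooling2D(),                       # 16 x 16 x 16
    layers.Conv2D(8, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(),                       # 8 x 8 x 8
    layers.Flatten(name="flatten"),              # 512-dimensional vector
    layers.Dense(1, activation="sigmoid"),
])
cnn.compile(optimizer="adam", loss="binary_crossentropy")
cnn.fit(patches, labels, epochs=1, verbose=0)

# Truncate the trained CNN at its flattening layer to expose the
# appearance representation, then train a classical method on it.
extractor = models.Model(cnn.input, cnn.get_layer("flatten").output)
features = extractor.predict(patches)            # shape: (n, 512)
rf = RandomForestClassifier(n_estimators=100).fit(features, labels)
```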

This study used the AlexNet architecture for the CNN, but with a reduced number of layers. Fewer layers were used to avoid overfitting to the training data. Although we applied 729-fold data augmentation, the total amount of training data in each cross-validation fold was still small compared with many other deep learning applications. For the same reason, we did not use more advanced CNN architectures such as VGGNet [38], GoogLeNet [39], and ResNet [40], which were designed for much larger training sets. In future work, if we can collect more data from multiple centers, deeper CNN architectures will be explored. Recently, some studies have fine-tuned deep networks learned from large natural image sets using small sets of medical images, in order to overcome the shortage of medical training data [41]. However, it remains to be investigated whether this approach performs well on PET images, since the appearance of PET differs considerably from that of natural images.
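As a sketch of the fine-tuning idea in [41], one could freeze a network pre-trained on natural images and retrain only a small classification head on medical patches. The choice of VGG16, the input size, and the layer sizes below are illustrative assumptions, and single-channel PET patches would need to be replicated to three channels before use.

```python
# Speculative sketch of transfer learning from natural images to PET.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(64, 64, 3))
base.trainable = False                       # freeze natural-image features

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # benign vs. malignant
])
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit(pet_patches_rgb, labels, ...)    # small PET training set
```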

In this study, image patches of both modalities (PET and CT) were fed into the same network. Such a mixed setting may limit the performance of the CNN, because the PET and CT patches carry different types of diagnostic information. It would be more appropriate to process the PET and CT patches with separate subnetworks and to combine their outputs at the decision layer, as sketched below. However, since no such architecture currently exists for dual-modality PET/CT images, we leave this issue for future research.
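A speculative sketch of such a two-branch design follows, with all architectural details (branch depth, filter counts, fusion layer size) as illustrative assumptions.

```python
# Speculative sketch: modality-specific subnetworks for PET and CT patches,
# fused only at the decision stage.
from tensorflow.keras import layers, models, Input

def branch(name):
    """One modality-specific convolutional subnetwork."""
    inp = Input(shape=(32, 32, 1), name=f"{name}_patch")
    x = layers.Conv2D(16, 3, padding="same", activation="relu")(inp)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)
    return inp, x

pet_in, pet_feat = branch("pet")
ct_in, ct_feat = branch("ct")

# Combine the modality-specific features just before the output layer.
merged = layers.concatenate([pet_feat, ct_feat])
merged = layers.Dense(256, activation="relu")(merged)
out = layers.Dense(1, activation="sigmoid")(merged)

model = models.Model(inputs=[pet_in, ct_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```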