Breast tumor classification in ultrasound images using support vector machines and neural networks

Introduction: The use of tools for computer-aided diagnosis (CAD) has been proposed for the detection and classification of breast cancer. Concerning breast cancer diagnosis with ultrasound images, some results found in the literature show that morphological features perform better than texture features for lesion differentiation, and indicate that a reduced set of features performs better than a larger one. Methods: This study evaluated the performance of support vector machines (SVM) with different kernel combinations, and of neural networks with different stop criteria, for classifying breast cancer nodules. Twenty-two morphological features from the contour of 100 BUS images were used as input for the classifiers, and a scalar feature selection technique with correlation was then used to reduce the feature set. Results: The best results obtained for accuracy and area under the ROC curve were 96.98% and 0.980, respectively, both with neural networks using the whole set of features. Conclusion: The performance obtained with neural networks with the selected stop criterion was better than that obtained with SVM. Whereas the neural network results were better with all 22 features, SVM classifiers performed better with a reduced set of 6 features.


Introduction
Breast cancer remains the leading cause of death among women in developed and developing countries. It is the most common cancer in women worldwide, representing about 12% of all new cases and 25% of all cancers in women (World…, 2014). In Brazil, it ranks first in incidence in the Northeast, South and Southwest, with proportions of 22.84%, 24.14% and 23.83%, respectively. In the North and Midwest, its incidence is second only to cervical cancer (Sociedade…, 2015).
In the world population, the survival rate five years after diagnosis is 61%; in Brazil, however, the breast cancer mortality rate remains high, mainly because in most cases the disease is diagnosed at an advanced stage (Instituto…, 2015).
Early detection is the main strategy for breast cancer prevention and control. However, early detection requires an accurate and reliable diagnosis. A good detection approach should produce both a low false positive (FP) rate and a low false negative (FN) rate (Cheng et al., 2010).
Due to its high resolution, which enables nodule detection, the mammographic exam is one of the most important tools used for breast cancer screening. Typically, breast cancer appears in mammographic images as microcalcification clusters with irregular shapes. Microcalcifications are small calcium deposits inside breast tissue that are sometimes associated with active processes of tumor cells (Kramer and Aghdasi, 1999).
Mammography remains the procedure of choice in screening asymptomatic women for breast cancer, and has a major impact on the effectiveness of therapy. However, a large number of doubtful solid masses are usually forwarded to surgical biopsy, an invasive and painful procedure, although only 10-30% of them are malignant. This limitation increases the cost and stress imposed on the patient (Dennis et al., 2001).
With the aim of increasing the specificity of breast cancer image diagnostics, breast ultrasound (BUS) emerged as an important complement to mammography. Ultrasound is more sensitive for detecting invasive cancer in dense breasts (Skaane, 1999). However, it is an operator-dependent modality, and the interpretation of its images requires expertise from the radiologist.
To minimize operator dependency and improve diagnostic accuracy, computer-aided diagnosis (CAD) systems have been proposed to detect and classify breast cancer nodules (Uniyal et al., 2013).
CAD systems provide important information based on computer analysis of the BUS images, assisting health professionals in locating lesions and classifying them as benign or malignant.
Regarding lesions in BUS images, the literature shows that features related to morphology and texture are used for differentiating between malignant and benign lesions. Flores et al. (2015), in a literature review, listed 26 morphological features and 1465 texture features used for this task.
Some results found in the literature show a better performance of morphological features in differentiating breast lesions. Alvarenga et al. (2007) obtained a poorer performance with texture features than with a morphological feature set (Alvarenga et al., 2010), using a Fisher linear discriminant ratio classifier. With a combined set of features, the same authors did not surpass the previous results obtained with the morphological feature set.
With the purpose of finding the smallest set of morphological features producing an effective improvement in classification performance, Flores et al. (2015) evaluated a set of morphological and textural features proposed in the literature. As a result, the authors suggest using only five morphological features to classify breast lesions.
In this paper, we aim to investigate the results of new methods for improving neural network generalization and the results of SVM classifiers with different kernels on the classification performance of breast cancer lesions in ultrasound images. Using the set of features compiled by Flores et al. (2015), we also tested the effect of dimensionality reduction of the input data on both neural network and SVM performance, using scalar feature ranking with a correlation technique. For training and testing the classifiers, the 4-fold cross-validation method was used (Leisch et al., 1998).
Table 1 shows characteristics of three published studies using a neural network and three published studies using SVM for breast cancer classification. As shown, the accuracy in the first three studies varies between 90% and 95%, and in the last three studies it varies between 82% and 92%, suggesting a better performance of neural network classifiers over SVM classifiers.

Methods
The methodology for breast tumor classification comprises the following steps: dataset acquisition, feature selection, dataset division and classification. In the classification step, two techniques were investigated: SVM and neural network classifiers. The scalar feature selection technique was used to identify the best characteristics. These steps are described below.

Dataset acquisition
In this retrospective study, 100 US breast tumor images were acquired from 100 patients of the National Cancer Institute (INCa, Rio de Janeiro, Brazil) using a 7.5-MHz linear array B-mode 40-mm ultrasound probe (Siemens Sonoline Sienna) with axial resolutions of 0.45 and 0.49 mm respectively. It is worth clarifying that this study was carried out according to INCa's diagnosis routine. Hence, the US images were obtained after the patients' clinical examination and mammography, and it was then decided whether the patient should be submitted to biopsy.
Only BUS images from patients with histopathological diagnosis were selected, resulting in an image set of 50 malignant and 50 benign tumors.
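Figure 1 shows examples of the BUS images available in the dataset.
For each image, a rectangular ROI including the tumor and its neighboring area was determined by a radiologist (a medical doctor with 30 years of experience in mammography and breast ultrasound interpretation). The radiologist was requested to select the portion of the image background surrounding the lesion that includes essential information for the routine sonographic diagnosis. In addition, each ROI was segmented using the semiautomatic contour (SAC) procedure, based on morphological operators (Alvarenga et al., 2012).
A set of 22 features was extracted from each lesion. These features are divided into five classes: one class related to the morphological skeleton; one class related to the radial normalized length; one class related to the lesion convex polygon; one class related to circularity; and one class related to the equivalent ellipse. Two of them are related to the morphological skeleton: elliptic normalized skeleton (ENS) and skeleton end (S#). Six of them are related to the radial normalized length (NRL): NRL area ratio (dA), NRL mean (dμ), NRL standard deviation (dσ), NRL entropy (dE), NRL roughness (dR) and NRL crossing (dZ). Nine of them are related to the lesion convex polygon: overlap ratio (RS), number of protuberances and depressions (NSPD), lobulation index (LI), normalized residual value (NRV), proportional distance (PD), convexity (Cnvx), elliptic normalized circumference (ENC), Hausdorff distance (HD) and average distance (MD). Four of them are related to circularity: orientation (OE), circularity A (Ca), circularity B (Cb) and circularity C (Cc). One of them is related to the equivalent ellipse: depth-to-width ratio (DWR). All of them are described in Flores (2009).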

Feature selection
The scalar feature selection technique with correlation (Theodoridis and Koutroumbas, 2008) was used to identify the most important features, reducing the dimensionality of the feature vector while retaining as much of its class-discriminatory information as possible, and to evaluate the effect of a reduced set of variables on classification performance.
With the aim of selecting features leading to a large between-class distance and a small within-class variance in the feature vector space, the class separation measure used in this study was Fisher's Discriminant Ratio (FDR), described in Equation 1:

$$\mathrm{FDR}_k = \frac{(\mu_{k1} - \mu_{k2})^2}{\sigma_{k1}^2 + \sigma_{k2}^2} \qquad (1)$$

where $\mu_{k1}$, $\sigma_{k1}$, $\mu_{k2}$, $\sigma_{k2}$ are the mean values and standard deviations of characteristic $x_k$ in classes $w_1$ and $w_2$, respectively. Classes $w_1$ and $w_2$ represent malignant and benign tumors. The value of $\mathrm{FDR}_k$ is calculated for each feature $x_k$, $k = 1, \ldots, m$. The characteristic $x_k$ with the highest $\mathrm{FDR}_k$ is selected; this is the $x_{S_1}$ characteristic. To select the second characteristic, $x_{S_2}$, the cross-correlation coefficient between two characteristics $x_i$ and $x_j$, defined in Equation 2, is used:

$$\rho_{ij} = \frac{\sum_{n=1}^{N} x_{ni}\, x_{nj}}{\sqrt{\sum_{n=1}^{N} x_{ni}^2 \, \sum_{n=1}^{N} x_{nj}^2}} \qquad (2)$$

where $N$ is the total number of patterns, and $x_{ni}$ and $x_{nj}$ are the values of the $i$th and $j$th characteristics of pattern $n$, with $i, j = 1, \ldots, m$. The second characteristic, $x_{S_2}$, is the one that maximizes Equation 3:

$$S_2 = \arg\max_{j \neq S_1} \left\{ \alpha_1\, \mathrm{FDR}_j - \alpha_2\, \lvert \rho_{S_1 j} \rvert \right\} \qquad (3)$$

where $\alpha_1$ and $\alpha_2$ express the importance of the first and second terms, respectively, in choosing the second best characteristic. In this work $\alpha_1 = \alpha_2 = 0.5$. The other selected characteristics, $x_{S_k}$, $k = 3, \ldots, m$, are those that maximize Equation 4:

$$S_k = \arg\max_{j \neq S_r} \left\{ \alpha_1\, \mathrm{FDR}_j - \frac{\alpha_2}{k-1} \sum_{r=1}^{k-1} \lvert \rho_{S_r j} \rvert \right\}, \quad r = 1, \ldots, k-1 \qquad (4)$$

Sets with 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 and 12 features were produced as a result.
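The selection procedure can be illustrated with a short sketch. It is not the implementation used in this study: it assumes the standard form of scalar feature selection given by Theodoridis and Koutroumbas (2008), in which each candidate is scored by its FDR minus the mean absolute cross-correlation with the features already chosen, and it assumes population variances and features that were normalized beforehand. The names `fdr`, `cross_correlation` and `scalar_selection` are hypothetical.

```python
import math


def fdr(values, labels):
    """Fisher's Discriminant Ratio of one feature for two classes (labels 0 and 1)."""
    groups = {0: [], 1: []}
    for value, label in zip(values, labels):
        groups[label].append(value)
    means = {c: sum(g) / len(g) for c, g in groups.items()}
    # Population variance (divide by the class size); this is an assumption.
    variances = {
        c: sum((v - means[c]) ** 2 for v in g) / len(g) for c, g in groups.items()
    }
    return (means[0] - means[1]) ** 2 / (variances[0] + variances[1])


def cross_correlation(a, b):
    """Uncentered cross-correlation coefficient between two feature columns."""
    numerator = sum(x * y for x, y in zip(a, b))
    denominator = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return numerator / denominator


def scalar_selection(columns, labels, n_select, alpha1=0.5, alpha2=0.5):
    """Return the indices of the selected feature columns, in selection order."""
    scores = [fdr(column, labels) for column in columns]
    selected = [max(range(len(columns)), key=lambda j: scores[j])]
    while len(selected) < n_select:
        def criterion(j):
            # Mean absolute correlation with the features chosen so far.
            penalty = sum(
                abs(cross_correlation(columns[r], columns[j])) for r in selected
            ) / len(selected)
            return alpha1 * scores[j] - alpha2 * penalty

        candidates = [j for j in range(len(columns)) if j not in selected]
        selected.append(max(candidates, key=criterion))
    return selected
```

With a feature, a rescaled copy of it and a weaker but nearly uncorrelated third feature, the sketch skips the redundant copy when choosing the second feature, which is the behavior the correlation term is intended to produce.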

K-fold Cross Validation
In k-fold cross-validation, the dataset is randomly split into k mutually exclusive subsets, the folds, of approximately equal size. One fold is excluded and the classifier is trained with the k-1 remaining folds; the classifier is then tested with the excluded fold. This process is repeated k times, until all the folds have been used to test the classifier. The cross-validation estimate of accuracy is the overall number of correct classifications divided by the number of instances in the dataset (Kohavi, 1995).
In this study, the dataset was divided into four folds, each with twenty-five patients. The first and second folds each contain images of 12 malignant and 13 benign tumors. The third and fourth folds each contain images of 13 malignant and 12 benign tumors. These folds were used to train and test the SVM classifiers and the neural networks with the mean square error and regularization criteria.
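A minimal sketch of such a class-balanced split and of the cross-validation accuracy estimate is given below. It is an illustration only: the way the folds were actually drawn in this study is not described, so the shuffling, the seed and the callback `train_and_predict` are assumptions, with labels coded as 0 for benign and 1 for malignant.

```python
import random


def stratified_folds(labels, k=4, seed=0):
    """Split sample indices into k folds with near-equal class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    ordered = []
    for label in sorted(by_class):
        members = by_class[label][:]
        rng.shuffle(members)
        ordered.extend(members)
    folds = [[] for _ in range(k)]
    # Deal the class-ordered indices round-robin across the folds.
    for position, index in enumerate(ordered):
        folds[position % k].append(index)
    return folds


def cross_validated_accuracy(folds, labels, train_and_predict):
    """Correct classifications over all test folds divided by the dataset size.

    train_and_predict(train_indices, test_indices) is a hypothetical callback
    that trains a classifier and returns one predicted label per test index.
    """
    correct = 0
    for i, test_indices in enumerate(folds):
        train_indices = [j for f, fold in enumerate(folds) if f != i for j in fold]
        predictions = train_and_predict(train_indices, test_indices)
        correct += sum(p == labels[j] for p, j in zip(predictions, test_indices))
    return correct / len(labels)
```

For 50 benign and 50 malignant cases and k = 4, this dealing scheme yields four folds of 25 cases, two with 12 malignant and 13 benign tumors and two with 13 malignant and 12 benign tumors.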

Support vector machines
SVM separates patterns belonging to two classes by defining the hyperplane that maximizes the separating margin between these two classes (Haykin, 2009). According to Theodoridis and Koutroumbas (2008), the hyperplane parameters that maximize the separating margin are the weight vector $w$ and the bias $w_0$ that minimize Equation 5 and satisfy Equation 6:

$$J(w, w_0) = \frac{1}{2} \lVert w \rVert^2 \qquad (5)$$

$$y_i \left( w^T x_i + w_0 \right) \geq 1, \quad i = 1, 2, \ldots, N \qquad (6)$$

where $N$ is the number of patterns to be classified, $x_i$ is the feature vector of pattern $i$ and $y_i \in \{-1, +1\}$ is its class label.
For non-separable classes, the same parameters can be determined by minimizing Equation 7, in which new variables $\xi_i$, known as slack variables, are introduced. The optimization task becomes more complex: the goal now is to make the margin as large as possible while keeping the number of points with $\xi_i > 0$ as small as possible.

$$J(w, w_0, \xi) = \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{N} \xi_i \qquad (7)$$

subject to $y_i \left( w^T x_i + w_0 \right) \geq 1 - \xi_i$ and $\xi_i \geq 0$ for $i = 1, \ldots, N$. The C parameter in Equation 7 is a positive constant that controls the relative influence of the two competing terms.
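To make the role of the slack variables and of C concrete, the sketch below evaluates the soft-margin cost for a given hyperplane. It assumes the standard formulation, in which labels are coded as -1 and +1 and the smallest admissible slack of a pattern is max(0, 1 - y(w·x + w0)). It is not a solver, and the function name is hypothetical.

```python
def soft_margin_objective(w, w0, X, y, C):
    """Return (cost, slacks) for a hyperplane (w, w0) on patterns X with labels y.

    Labels are assumed to be -1 or +1. Each slack is the smallest value that
    satisfies the relaxed margin constraint for that pattern.
    """
    slacks = []
    for x_i, y_i in zip(X, y):
        decision = sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + w0
        slacks.append(max(0.0, 1.0 - y_i * decision))
    margin_term = 0.5 * sum(w_j * w_j for w_j in w)
    return margin_term + C * sum(slacks), slacks


# Two patterns outside the margin and one inside it (slack 0.5).
cost, slacks = soft_margin_objective(
    w=[1.0, 0.0], w0=0.0,
    X=[[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]],
    y=[1, -1, 1],
    C=1.0,
)
```

A larger C makes each margin violation more expensive relative to the margin term, so the optimizer favors fewer violations at the price of a narrower margin.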
SVMs use kernels to map feature vectors to a higher-dimensional space in which the classes can be separated by hyperplanes. In this work, polynomial and Gaussian radial basis function (GRBF) kernels were used with the SVM classifier.
Simulations were carried out with each subset of features obtained in the feature selection step and with the original set, which includes all features, using the kernels mentioned above and varying their degrees from 1 to 5. The values of C used to select the best classifier varied from 0.03 to 8.
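The two kernel families can be sketched as follows. The exact parameterization used in this study is not stated, so the inhomogeneous polynomial form (x·z + 1)^d and the Gaussian form exp(-γ‖x - z‖²) with a hypothetical width parameter `gamma` are assumptions chosen because they are common textbook definitions.

```python
import math


def polynomial_kernel(x, z, degree):
    """Assumed inhomogeneous polynomial kernel: (x . z + 1) ** degree."""
    return (sum(a * b for a, b in zip(x, z)) + 1.0) ** degree


def gaussian_rbf_kernel(x, z, gamma=1.0):
    """Assumed Gaussian RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def gram_matrix(X, kernel):
    """Kernel values for every pair of patterns, as used by a kernel SVM."""
    return [[kernel(a, b) for b in X] for a in X]


patterns = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]]
K_poly = gram_matrix(patterns, lambda a, b: polynomial_kernel(a, b, degree=2))
K_rbf = gram_matrix(patterns, gaussian_rbf_kernel)
```

The Gram matrix is symmetric, and for the Gaussian kernel its diagonal is always 1, since each pattern is at distance zero from itself.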

Neural networks
Single-layer neural networks are not able to learn and generalize complex problems. Multilayer neural networks with nonlinear transfer functions, in contrast, can learn nonlinear relationships between input and output vectors, which increases the space of hypotheses that they can represent and provides great computing power (Duda et al., 2000).
The number of artificial neurons per layer, as well as the number of layers, greatly influences the prediction abilities of the neural network. Too few of them hinder the learning process, and too many can degrade the generalization ability of the neural network through overfitting or memorization of the training data set. In this work, four-layer neural networks with an i-n-n-1 architecture, with n = 5, 8 and 10 and i equal to the number of input variables, were employed in breast lesion classification.
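A quick count of adjustable parameters helps to relate the architecture size to the risk of overfitting. The sketch below assumes fully connected layers with one bias per neuron, which the text does not state explicitly; `count_parameters` is a hypothetical helper.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected feed-forward network.

    layer_sizes lists the number of units per layer, inputs first,
    for example [22, 5, 5, 1] for a 22-5-5-1 architecture.
    """
    return sum(
        (n_in + 1) * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


sizes = {n: count_parameters([22, n, n, 1]) for n in (5, 8, 10)}
```

Under these assumptions, even the smallest 22-5-5-1 network has more adjustable parameters than there are images in the dataset, which is one reason why criteria that improve generalization are of interest here.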
Many different learning algorithms can be used to train neural networks. The training algorithm used in this work was Levenberg-Marquardt (Moré, 1978). This algorithm approximates the network error with a second-order expression, in contrast to algorithms such as gradient descent, which rely on a first-order expression.
The prediction error is minimized across many training cycles, known as epochs, until the network reaches a specified level of accuracy. If a network is left to train for too long, however, it will be overtrained and will lose the ability to generalize. Three stop criteria were employed for neural network training: mean square error, regularization (Doan and Liong, 2004) and early stop (Hagan et al., 2016).
With the mean square error criterion, training was finished when the error reached $10^{-6}$ or after 1000 epochs. With the regularization criterion, which aims at more stable neural networks, a new term, proportional to the sum of the squared network weights, is added to the mean square error, according to Expression 8:

$$\mathrm{msereg} = \gamma\, \mathrm{mse} + (1 - \gamma)\, \mathrm{msw} \qquad (8)$$

where $\gamma$ is the performance factor, a number between 0 and 1, and mse is the mean square error. In this work $\gamma = 0.5$, and msw is described in Equation 9:

$$\mathrm{msw} = \frac{1}{n} \sum_{j=1}^{n} w_j^2 \qquad (9)$$

where $n$ is the number of network weights $w_j$. The regularization criterion in Expression 8 leads to smaller neural network weights, enforcing a smooth network response and improving the generalization power of the neural network.
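The regularized performance function can be sketched in a few lines. The sketch assumes the common form in which the regularized error is γ·mse + (1 - γ)·msw, with msw taken as the mean of the squared weights; the function names are hypothetical and the numbers are made up for illustration.

```python
def mean_square_error(targets, outputs):
    """Mean of the squared differences between targets and network outputs."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)


def mean_square_weights(weights):
    """Mean of the squared network weights (assumed definition of msw)."""
    return sum(w * w for w in weights) / len(weights)


def regularized_error(targets, outputs, weights, gamma=0.5):
    """Weighted sum of the data error and the weight penalty."""
    return gamma * mean_square_error(targets, outputs) + (
        1.0 - gamma
    ) * mean_square_weights(weights)


# Illustrative values only: mse = 0.25, msw = 2.0, gamma = 0.5.
value = regularized_error([1.0, 0.0], [0.5, 0.5], [1.0, -1.0, 2.0])
```

With γ = 0.5 the data error and the weight penalty contribute equally, so a solution with large weights is penalized even if it fits the training patterns well.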
With the early stop training criterion, the data set is divided into three groups: training, validation and testing. In this study, each of these groups consisted of 33 patients. The main characteristic of this method is that, although the validation group is not used for training, the mean square error is evaluated on it during training. When the mean square error grows in this group, the neural network training is stopped. The neural network performance is calculated as the mean performance of the validation and test groups.
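The stopping rule can be sketched independently of the network itself. In the sketch below, `validation_errors` stands for the validation mean square error measured after each epoch, and `patience`, the number of consecutive non-improving epochs tolerated before stopping, is a hypothetical parameter; the text only states that training stops when the validation error grows.

```python
def early_stopping_epoch(validation_errors, patience=1):
    """Return (best_epoch, best_error) under a simple early stop rule.

    Training is considered stopped once the validation error has failed to
    improve for `patience` consecutive epochs; the weights from the best
    epoch are the ones that would be kept.
    """
    best_error = float("inf")
    best_epoch = -1
    failures = 0
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_error, best_epoch, failures = error, epoch, 0
        else:
            failures += 1
            if failures >= patience:
                break
    return best_epoch, best_error


history = [0.50, 0.40, 0.30, 0.35, 0.20]
stop_now = early_stopping_epoch(history, patience=1)
stop_later = early_stopping_epoch(history, patience=2)
```

With `patience=1` the first increase in validation error ends training, whereas a larger value tolerates temporary fluctuations at the cost of extra epochs.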

Results

Reduced set of features
Using the feature selection technique previously described, sets with the best 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 and 12 characteristics were generated. Table 2 shows the selected features for each of these subsets.
A major difference is noticed between the subset of five features obtained in this paper, namely {Cnvx, LI, ENS, PD, ENC}, and the one obtained by Flores et al. (2015), namely {ENS, NSPD, DWR, RS, OE}. Comparing these two sets, we notice an overlap of only one feature: elliptic normalized skeleton.

Support vector machines
For each simulation using SVM classifiers, a different combination of feature set, kernel, kernel order and C was employed, and the accuracy, sensitivity, specificity and area under the ROC curve (AUC) were calculated. Because the SVM classifier outputs are 0 or 1, the ROC consists of a single operating point rather than a curve; as a result, the AUC value is similar to the accuracy. Table 3 shows the best results obtained using SVM classifiers with two different kernels, GRBF and polynomial, varying their orders from one to five and varying C from 0.03 to 8. All 22 features were used as input variables. The best accuracy value obtained is 90%, when the polynomial kernel is used.
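The relation between the single-point AUC and the accuracy can be checked with a small sketch. It assumes that the AUC of a classifier with binary outputs is computed by the trapezoidal rule through the single operating point, which gives (sensitivity + specificity) / 2; the labels below are made up for illustration, with 1 for malignant and 0 for benign.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and single-point AUC for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Trapezoidal area through (0, 0), (1 - specificity, sensitivity), (1, 1).
        "auc": (sensitivity + specificity) / 2.0,
    }


metrics = binary_metrics(
    y_true=[1, 1, 1, 1, 0, 0, 0, 0],
    y_pred=[1, 1, 1, 0, 0, 0, 1, 1],
)
```

When the two classes have the same number of cases, as in the 50 malignant and 50 benign images of this dataset, the single-point AUC coincides with the accuracy; over folds that are only approximately balanced the two values are close but not identical.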
Table 4 shows the best classification results obtained using SVM classifiers with GRBF and polynomial kernels and using the best 4, 5, 6, 7 and 8 features as input variables, varying the kernels' orders from one to five. All subsets shown in Table 2 were used as input to the SVM classifiers but, as seen in Table 4, the classifier performance does not vary significantly when new features are inserted.

Neural networks
The performance of the three stop criteria mentioned (mean square error, regularization and early stop), with all 22 features used as input variables and different architectures (22-5-5-1, 22-8-8-1 and 22-10-10-1), is shown in Table 5, which lists the accuracy, sensitivity, specificity and AUC for each of these combinations.
Table 6 shows the accuracy, sensitivity, specificity and AUC for the best 4, 5, 6, 7 and 8 input variables selected with the scalar selection technique with correlation, for the mean square error, early stop and regularization stop criteria, and for the 4-5-5-1, 5-5-5-1, 6-5-5-1, 7-5-5-1 and 8-5-5-1 architectures. Table 7, adapted from the study of Flores et al. (2015), shows the performance of some previously published studies on breast cancer classification. This table shows: the category of the study, M, T or C (M: studies that use morphological characteristics; T: studies that use texture features; C: studies that use both types of features); the mean value of the area under the ROC curve (AUC); the standard deviation of the AUC (SD); and the coefficient of variation (CV = SD/AUC).

Discussion
Analyzing the results shown in Table 3, when the whole feature set was used, the best mean accuracy of the SVM classifier, 87.40%, was obtained with the RBF kernel. The mean AUC of the RBF kernel was 0.875, while the mean AUC of the polynomial kernel was 0.817. Assessing the significance of the difference between the areas under these two ROC curves (Hanley and McNeil, 1982), we found P = 0.149 > 0.05, so the null hypothesis should not be rejected (i.e., the SVM classifier with the RBF kernel was not superior to the SVM classifier with the polynomial kernel at the 5% significance level).
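The comparison of two AUC values can be sketched with the standard error formula of Hanley and McNeil (1982). The details of the test applied in this study are not given, so the sketch assumes that the two areas are treated as independent, that the test is one-sided, and that each AUC is based on 50 malignant and 50 benign cases; the function names are hypothetical.

```python
import math


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of an AUC according to Hanley and McNeil (1982)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc * auc / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_pos - 1) * (q1 - auc * auc)
        + (n_neg - 1) * (q2 - auc * auc)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)


def compare_aucs(auc_a, auc_b, n_pos, n_neg):
    """Return (z, one-sided P) for the hypothesis that auc_a exceeds auc_b.

    The two areas are treated as independent (an assumption); for areas
    measured on the same cases a correlation term would reduce the variance.
    """
    se_a = hanley_mcneil_se(auc_a, n_pos, n_neg)
    se_b = hanley_mcneil_se(auc_b, n_pos, n_neg)
    z = (auc_a - auc_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    p_one_sided = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_one_sided


z, p = compare_aucs(0.875, 0.817, n_pos=50, n_neg=50)
```

Under these assumptions, the sketch gives a P value close to the 0.149 reported above for the RBF and polynomial kernels.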
As shown in Table 4, for SVM classifiers with the polynomial kernel, the scalar feature selection technique with correlation does not improve the AUC relative to the use of the 22 features. With the RBF kernel, the same technique slightly improves the AUC when 6 or 7 characteristics are used, relative to the 22 features. The best accuracy value obtained with SVM classifiers, 91%, was achieved using the RBF kernel and the best 6 features. It corresponds to an AUC of 0.911. With the whole set of features, a mean AUC of 0.875 was obtained. Assessing the significance of the difference between the areas under these two ROC curves (Hanley and McNeil, 1982), we found P = 0.221 > 0.05, so the null hypothesis should not be rejected (i.e., the SVM classifier with 6 features was not superior to the classifier with the whole set of features at the 5% significance level).
Although we tested many different values of C in order to improve the classification performance, Table 3 shows that the best results were obtained with C varying from 0.031 to 2.8.
Regarding neural network performance in terms of AUC, Table 5 shows that, when the 22 characteristics are used, the regularization and early stop criteria performed better than the mean square error criterion. This behavior is due to the fact that the first two criteria are used to improve neural network generalization. The best mean value of AUC, 0.980, was obtained with the 22-5-5-1 architecture and the early stop criterion.
Comparing the results in Tables 5 and 6, the following can be seen: for the mean square error criterion, both the AUC and the accuracy obtained with 8 features are equal to the ones obtained with 22 features; for the early stop criterion, the best performance is obtained with the best 7 selected features, and the AUC value is equal to the one obtained with 22 features. The best mean value of AUC obtained with neural networks, 0.980, is superior to the best mean value obtained with the SVM classifier, 0.875. Assessing the significance of the difference between the areas under these two ROC curves, we found P = 0.003 < 0.05, so the null hypothesis should be rejected (i.e., the neural network classifier with the early stop criterion is superior to the SVM classifier at the 5% significance level).
Although, in terms of AUC, there are no statistical differences between the results obtained with a lower number of features and with the whole set of features, we understand that feature selection is an important stage of the classification process: it reduces the dimensionality of the feature vector, removing possible redundancy and filtering noise, and therefore helps to improve the classifiers and to reduce the computational effort, as can be noticed with the SVM classifier.
Using the minimal-redundancy maximal-relevance (mRMR) criterion, based on the mutual information (MI) technique, Flores et al. (2015) proposed the use of a reduced set of 5 morphological features for malignant lesion detection in BUS images, namely elliptic normalized skeleton, orientation, number of protuberances and depressions, depth-to-width ratio and overlap ratio. In this study, however, the best results were obtained with a reduced set of 7 features: convexity, lobulation index, elliptic normalized skeleton, proportional distance, elliptic normalized circumference, depth-to-width ratio, average distance and normalized residual value. Comparing these two subsets, we notice that they have a different number of features and that there is an overlap of only two variables: elliptic normalized skeleton and depth-to-width ratio. This difference suggests that the classification results depend both on the feature selection method and on the classifiers used.
Table 7. Statistical results of distinct feature sets in terms of AUC value, where SD and CV are the standard deviation and coefficient of variation attained by each set. The sets are ordered from the best to the worst classification performance (adapted from Flores et al., 2015).

The results presented in Table 7 show that the improvement in accuracy and AUC over time is incremental. A direct comparison of the values is not possible, because the image databases used in these studies were extracted from different populations and the image quality is different, inducing a different behavior of the classifiers. The studies of Alvarenga et al. (2007, 2010, 2012), Flores et al. (2015) and this study, nevertheless, used different samples of the same population, and the image databases have the same quality. In the sequence, we compare the best result of these studies, the one reported by Flores et al. (2015), with the result obtained in this study.

The best mean value of AUC obtained in this study, 0.980, is better than the value of 0.942 obtained by Flores et al. (2015) (see Table 7) using a different sample of the same population (413 benign and 228 malignant lesions). Assessing the significance of the difference between the areas under these two ROC curves, we found P = 0.011 < 0.05, so the null hypothesis should be rejected (i.e., the AUC obtained in this work is superior to the AUC obtained in the work of Flores et al. (2015) at the 5% significance level).

Figure 1. BUS image examples available in the dataset.

Table 1. A brief review of breast cancer classification using neural networks and SVM.

Table 2. Feature subsets resulting from the application of the scalar feature selection technique.

Table 3. Best values of accuracy, sensitivity, specificity and AUC obtained with SVM classifiers using two different kernels, GRBF and polynomial, varying their orders from one to five and the C parameter from 0.03 to 8. All 22 features were used as input variables.

Table 4. Best accuracy, sensitivity, specificity and AUC of the selected 4, 5, 6, 7 and 8 variables using SVM classifiers with GRBF and polynomial kernels. The pair (order, C) used is the one that achieved the best results using the whole set of features.

Table 5. Accuracy, sensitivity, specificity and AUC of three neural network architectures, for the mean square error, early stop and regularization stop criteria. All 22 features were used as input.

Table 6. Accuracy, sensitivity, specificity and AUC of a neural network with the best 4, 5, 6, 7 and 8 input variables, for mean square error,