A new algorithm to extract hidden rules of gastric cancer data based on ontology

Large data sets often sit idle in the databases of companies, universities, and other organizations. Exploiting the information hidden in these databases depends on managing them efficiently. Data mining searches for hidden relationships in databases. The process goes beyond simple data retrieval and lets researchers derive new information from data (Alizadehsani et al. 2013).

Cancer is the leading cause of death in economically developed countries and the second leading cause of death in developing countries (Jemal et al. 2011). Gastric cancer is the fourth most common cancer (Price et al. 2012) and the second leading cause of cancer deaths worldwide (Ferlay et al. 2010). According to research conducted in Iran in 2008, stomach cancer accounts for 9.3 % of common cancers and is the third most common cancer among men and women in the country (Etemad et al. 2012). Given the prevalence and high mortality rate of gastric cancer, the factors affecting its incidence need further examination.

Today, the medical data available on patients' symptoms across various diseases, and on ways to support the diagnosis of these diseases, are very extensive. It is usually difficult for one person to analyze and weigh all the factors involved. Hence there is a clear need for an automated system that helps discover rules and patterns and predict future events. Data mining provides such a system and has supported many medical advances, especially in diagnosing various diseases and in finding useful relations among the factors available in the data (Chou et al. 2004).

Kirshners et al. (2012) used a data set of 819 samples (24 positive and 795 negative) with 31 features, together with three algorithms (CN2 Rules, C4.5, and Naive Bayes), to diagnose stomach cancer. The results showed that sex and the protein HER-1 are important factors in the diagnosis and classification of gastric cancer. Experimental results showed an average sensitivity of 50 % and a maximum of 86–100 %, with classification accuracy and specificity close to 65–70 %. Silvera et al. (2014) used classification tree analysis to analyze data from a population-based case–control study (1095 cases, 687 controls) conducted in Connecticut, New Jersey, and Western Washington State. The frequency of reported gastroesophageal reflux disease symptoms was the most important risk stratification factor for esophageal adenocarcinoma, gastric cardia, and noncardia gastric, with dietary factors (esophageal adenocarcinoma, noncardia gastric), smoking (esophageal adenocarcinoma, gastric cardia), wine intake (gastric cardia, noncardia gastric), age (noncardia gastric), and income (noncardia gastric) appearing to modify the risk at these cancer sites. For esophageal squamous cell carcinoma, smoking was the most important risk stratification factor, with gastroesophageal reflux disease, income, race, noncitrus fruit, and energy intakes further modifying risk. Wang et al. (2012) applied hierarchical clustering to 14 available clinical factors from three categories: the clinical background, immunohistochemistry data, and the cancer's stage information. The results showed that two clinical factors, Her-1 and gender, can clearly characterize and differentiate these three groups.

Classification and clustering methods both derive patterns from a data set; in clustering, for example, a pattern of similarity is defined and assessed. The main focus of this study, however, is on discovering the hidden and influential factors in this dangerous disease by means of association rules, an approach for which no research has yet been reported. Most studies that use association rule discovery have applied it to data sets of other cancers.
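To make the notion of an association rule concrete, the following minimal Python sketch computes the two standard measures, support and confidence, for a rule of the form antecedent -> consequent. The records and item names (smoking, wine_intake, gastric_cancer) are hypothetical toy values chosen only for illustration; they are not taken from the data set used in this study.

```python
def support(records, itemset):
    """Fraction of records that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= record for record in records) / len(records)


def confidence(records, antecedent, consequent):
    """Estimate P(consequent | antecedent) as support(A and C) / support(A)."""
    antecedent_support = support(records, antecedent)
    if antecedent_support == 0:
        return 0.0  # the rule never applies, so its confidence is set to 0
    joint = frozenset(antecedent) | frozenset(consequent)
    return support(records, joint) / antecedent_support


# Hypothetical toy records: each record is the set of items observed for one patient.
records = [
    frozenset({"smoking", "wine_intake", "gastric_cancer"}),
    frozenset({"smoking", "gastric_cancer"}),
    frozenset({"wine_intake"}),
    frozenset({"smoking"}),
]

rule_support = support(records, {"smoking", "gastric_cancer"})
rule_confidence = confidence(records, {"smoking"}, {"gastric_cancer"})
```

A rule is usually reported only when both measures exceed user-chosen thresholds; on realistic data this can still leave a very large number of rules.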

Since primary prevention, through control of modifiable risk factors and increased surveillance of persons at increased risk, is important in decreasing the morbidity and mortality of this harmful disease (Bathaie and Mohagheghi 2012), this study investigates the effect of some new features that were not considered in previous studies. Although traditional data mining techniques such as association rules help to extract knowledge from large data sets, the number of rules obtained from a data set is sometimes so large that it becomes a major problem in itself. Indeed, one disadvantage of this technique is that it produces many meaningless and redundant rules, because it pays no attention to the concept and meaning of the items or samples. This paper presents a new method that uses an ontology to discover association rules and thereby address these problems. Experiments with the proposed algorithm gave results that are more concise and precise than those obtained with traditional data mining techniques. The quantitative comparison was based on the number of pruned rules; the qualitative comparison was performed by human evaluation.
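As a rough illustration of how an ontology can reduce redundancy among rules, the sketch below drops a rule when another rule with the same consequent has a strictly more general antecedent according to an is-a hierarchy. This is only one possible pruning criterion, written under stated assumptions; it is not the algorithm proposed in this paper. The hierarchy, the item names, and the helper functions (ancestors_or_self, generalizes, prune_redundant) are all hypothetical.

```python
def ancestors_or_self(item, parent):
    """Return `item` together with all its ancestors in the is-a hierarchy."""
    chain = [item]
    # Follow child -> parent links; stop at a root or if a cycle is detected.
    while chain[-1] in parent and parent[chain[-1]] not in chain:
        chain.append(parent[chain[-1]])
    return set(chain)


def generalizes(general, specific, parent):
    """True if every item of `general` is an ancestor-or-self of some item of `specific`."""
    return all(
        any(g in ancestors_or_self(s, parent) for s in specific)
        for g in general
    )


def prune_redundant(rules, parent):
    """Keep only rules that are not strictly generalized by another rule
    with the same consequent."""
    kept = []
    for antecedent, consequent in rules:
        redundant = any(
            other_consequent == consequent
            and other_antecedent != antecedent
            and generalizes(other_antecedent, antecedent, parent)
            and not generalizes(antecedent, other_antecedent, parent)
            for other_antecedent, other_consequent in rules
        )
        if not redundant:
            kept.append((antecedent, consequent))
    return kept


# Hypothetical is-a hierarchy (child -> parent) and hypothetical mined rules.
parent = {"red_wine": "wine_intake", "white_wine": "wine_intake"}
rules = [
    (frozenset({"red_wine"}), "gastric_cancer"),
    (frozenset({"white_wine"}), "gastric_cancer"),
    (frozenset({"wine_intake"}), "gastric_cancer"),
    (frozenset({"smoking"}), "gastric_cancer"),
]

kept_rules = prune_redundant(rules, parent)
pruned_count = len(rules) - len(kept_rules)
```

In this toy case the two wine-specific rules are subsumed by the rule on the parent concept, so the count of pruned rules, the quantity used for the quantitative comparison, would be 2. A practical criterion would probably also compare the support and confidence of the general and specific rules before discarding either.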