Classification techniques for clinical diagnosis of lung cancer: A Case of Uganda Cancer Institute, Mulago
Abstract
This study evaluated classification techniques for predicting lung cancer from signs, symptoms and risk factors. It used the records of 354 patients from the Uganda Cancer Institute. Entropy and information gain were used to identify high-risk factors for lung cancer (such as persistent cough, dry cough, difficulty in breathing and family history of cancer). Three data mining techniques, Decision Trees (DT), Naïve Bayes (NB) and Classification Rules (CR), were then applied to detect lung cancer using the WEKA data mining tool. The reliability of each technique for clinical diagnosis of lung cancer was assessed from its confusion matrix and its number of correctly classified instances. The high-risk factors identified, with their respective information gains, were chest pain (0.4), persistent cough (0.3), pleural effusion (0.3), diabetes (0.3), allergy (0.2), difficulty in breathing (0.2), family history of cancer (0.2) and night sweats (0.2). All three techniques performed well, though Naïve Bayes registered the best performance, correctly classifying 97% of instances (accuracy 0.97), compared with 96% (0.96) for Decision Trees and 95% (0.95) for Classification Rules. The study therefore recommends the Naïve Bayes technique for clinical diagnosis of lung cancer. In future, this technique could be built into a computerized system for clinical diagnosis of lung cancer. Further research could examine how to merge the three techniques into a single robust algorithm for comparison purposes.
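To make the two measures named above concrete, the following is a minimal Python sketch of how information gain can be computed for a single binary symptom, and how an accuracy rate can be read off a confusion matrix. It is an illustration only, not the WEKA procedure used in the study: the function names, the toy records and the toy confusion matrix are all hypothetical and are not drawn from the Uganda Cancer Institute data.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def accuracy(confusion):
    """Fraction of correctly classified instances: diagonal sum over grand total."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Hypothetical toy records (NOT the study's data): 1 = present/positive, 0 = absent/negative.
chest_pain = [1, 1, 1, 0, 0, 0, 1, 0]
lung_cancer = [1, 1, 1, 0, 0, 0, 0, 1]

gain = information_gain(chest_pain, lung_cancer)

# Hypothetical 2x2 confusion matrix: rows are actual classes, columns are predicted classes.
toy_accuracy = accuracy([[45, 5], [3, 47]])

print(f"information gain: {gain:.3f}")
print(f"accuracy: {toy_accuracy:.2f}")
```

Under this sketch, attributes would be ranked by their gain, with higher values indicating that a symptom or risk factor separates positive from negative cases more cleanly, and each classifier's accuracy would be the share of records on the diagonal of its confusion matrix.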