Cancer Prediction Using Data Mining
INTRODUCTION
1.1 BACKGROUND
Medical databases today can range in size into hundreds of millions of terabytes. Within these masses of data lies hidden information of strategic importance. Such vast amounts of data raise the question, “How do you draw meaningful conclusions about this data?” Data mining answers this question.
Although computational, data mining algorithms can be used as a qualitative tool to analyze quantitative data, particularly the large, complex databases being created by the health informatics community (Young, 2012). According to Lloyd-Williams (2013), data stored in hospital warehouses range from quantitative to analog to qualitative data; however well structured, these data conceal implicit patterns of information which cannot readily be detected by conventional analysis techniques. The formats of data warehouses also vary, amounting to an information explosion within the health care field. The problem, however, is finding the right methodological tools to mine this new data given its enormous variety, size, and complexity.
The advancement of information technology, software development, and system integration techniques has produced a new generation of complex computer systems. These systems have presented challenges to information technology researchers. The major challenge is how to benefit from the existing resources and data.
These complex systems include the healthcare system. In recent times, there has been increased interest in using and advancing data mining and communication technologies in healthcare systems. In this respect, many countries are adopting a global healthcare system by standardizing healthcare communication and building electronic health records (EHR).
According to Gunter (2005), an EHR is a systematic collection of electronic health data about individual patients or populations that can be shared across healthcare providers in a given state or country. Health records may include a range of data such as general medical records, patient examinations, patient treatments, medical history, allergies, immunization status, laboratory results, radiology images, and other information useful for examination. This rich information may help researchers in examining and diagnosing diseases using computer techniques.
The rapid shift of many countries toward electronic healthcare information systems has produced huge EHRs of health-related information. This data can be a valuable asset for populations and healthcare providers. In this respect, the aim of this research is to investigate how health data can be utilized for the benefit of humans by using novel data mining techniques.
The current research focuses on diagnosing cancer based on machine intelligence and previous history. The approach develops a new technique known as Information Gain Artificial Neuro Inference System (IG-ANFIS). This uses a combination of an Adaptive Network based Fuzzy Inference System (ANFIS) and the Information Gain (IG) method. The purpose of ANFIS is to build an input-output mapping using both human knowledge and machine learning ability, and the purpose of the IG method is to reduce the number of input features to ANFIS. The IG method estimates the quality of each attribute using entropy, as the difference between the prior entropy and the post entropy, that is, the entropy of the class before and after the attribute's value is known. IG is one of the attribute ranking methods often applied in text categorization, where it measures the number of bits of information obtained for category prediction by knowing the presence or absence of a term in a document.
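The entropy difference described above can be illustrated with a short Python sketch. It assumes discrete attribute values and uses a made-up toy example in the text-categorization setting (whether a term is present in a document, and the document's category); the function names `entropy` and `information_gain` are illustrative and are not taken from the thesis.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Prior entropy of the labels minus the expected entropy after
    splitting the labels by the value of one discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    post_entropy = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - post_entropy


# Toy data (made up for illustration): 1 = term present, 0 = term absent.
term_present = [1, 1, 1, 0, 0, 0, 1, 0]
category = ["a", "a", "a", "b", "b", "b", "b", "a"]

# Bits of information about the category gained by knowing term presence.
print(round(information_gain(term_present, category), 3))
```

An attribute that separates the classes perfectly recovers the full prior entropy, while an attribute that is unrelated to the class yields a gain of zero.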
1.1 Statement Of The Problem
Data mining methods used for diagnosing diseases based on previous data and information have been improving over the years. The data mining methods currently used for disease diagnosis rely on various feature selection techniques, which include Correlation based Feature Selection (CFS), Relief (R), Principal Component Analysis (PCA), Consistency based Subset Evaluation (CSE), Information Gain (IG), and Symmetrical Uncertainty (SU). These techniques have no doubt improved disease diagnosis. However, several problems stand in the way of effectively utilizing previously acquired patient data, and they can make any electronic healthcare system problematic and less efficient: missing values and how to process them; the large number of features and attributes and how to select the most beneficial ones; and the extraction of accurate diagnostic markers that can predict the early onset of the disease and monitor its different stages.
Based on the power of the current data mining methods and the previous evidence or data, this research investigates feature selection techniques and develops a novel hybrid method (IG-ANFIS) for diagnosing diseases, in this case cancer. IG-ANFIS combines the IG method and the ANFIS method for cancer diagnosis. IG will be used to assess the quality of attributes. The output of applying the IG method is a set of features with high ranking values. These high-ranking features will constitute the input for ANFIS and Odeh (2008).
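As a rough illustration of the selection stage described above, the following Python sketch ranks features by information gain and keeps the top k as the reduced input for a downstream model. It is a sketch under stated assumptions, not the thesis implementation: the feature values are assumed to be already discretized, the data and the feature names (`clump_thickness`, `cell_size`, `mitoses`) are made up for illustration, the helper `select_top_features` and the cut-off `k` are hypothetical, and the ANFIS stage itself is not shown.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def info_gain(column, labels):
    """Entropy of the labels minus the expected entropy after splitting on the column."""
    groups = defaultdict(list)
    for value, label in zip(column, labels):
        groups[value].append(label)
    post = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - post


def select_top_features(rows, labels, feature_names, k):
    """Rank features by information gain and keep the k highest-ranked ones.

    Returns the kept feature names (best first), the rows reduced to those
    features, and the score of every feature.
    """
    columns = list(zip(*rows))
    scores = {name: info_gain(col, labels) for name, col in zip(feature_names, columns)}
    ranked = sorted(feature_names, key=lambda name: scores[name], reverse=True)
    keep = ranked[:k]
    keep_idx = [feature_names.index(name) for name in keep]
    reduced = [[row[i] for i in keep_idx] for row in rows]
    return keep, reduced, scores


# Toy, made-up data: three discretized features per sample.
feature_names = ["clump_thickness", "cell_size", "mitoses"]
rows = [
    ["low", "low", "low"],
    ["high", "high", "low"],
    ["low", "low", "low"],
    ["high", "high", "low"],
    ["low", "low", "high"],
    ["high", "high", "high"],
    ["high", "low", "high"],
    ["low", "high", "high"],
]
labels = ["benign", "malignant"] * 4

keep, reduced, scores = select_top_features(rows, labels, feature_names, k=2)
print(keep)        # feature names that would be passed on to the next stage
print(reduced[0])  # first sample, restricted to the kept features
```

In this toy data `mitoses` carries no information about the label, so it is the feature that gets dropped; with real data the cut-off `k` would need to be chosen empirically.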
1.2 Objectives
The general objective of this research thesis is:-
To develop a data mining approach that will combine information gain algorithm and adaptive neural fuzzy inference system to analyze large data obtained from healthcare databases.
1.2.2 SPECIFIC OBJECTIVES
The specific objectives of this research thesis were:-
- To identify the current data mining algorithms used in the healthcare sector for cancer diagnosis.
- To identify the significance of diagnostic features that best describe cancer data using data mining techniques.
- To describe how handling missing feature values improves the prediction performance achieved by data mining algorithms.
- To develop a hybrid data mining model from the existing techniques that can improve classification accuracy and the handling of missing values.
- To test the developed hybrid data mining model for classification accuracy and the handling of missing values.
1.3 Research Questions
The main goal of this study is to answer the following research questions:-
- What are the data mining algorithms used currently in the healthcare sector for cancer diagnosis?
- How can the diagnostic features that best describe data for the purpose of differentiating malignant and benign forms of cancer be identified using data mining techniques?
- How does handling missing feature values improve the prediction performance achieved by data mining algorithms?
- Does a hybrid model of the existing data mining algorithms produce better approaches for cancer in terms of classification accuracy and the handling of missing values?
- How can the developed hybrid data mining model be tested for classification accuracy and missing values?
1.4 Justification Of The Study
The medical industry has been slow to adopt new, efficient and timely data mining techniques, which ideally lower the cost of information and accelerate information access. These are the things that healthcare practitioners want: integrated historical data and easy, fast information access.
From a global perspective, limited medical resources and long waiting times to receive medical services have magnified people’s suffering. The World Health Organization (WHO) ranks Kenya at 140 out of 190 countries in its report of the year 2000. A study shows that all African countries, including Kenya, had fewer practicing physicians and fewer care beds per one thousand people than the median of some countries, according to the Organisation for Economic Co-operation and Development (OECD) (Source: OECD Health Data, 2010).
The available medical resources and infrastructure force health organizations and state governments to set procedures and plans to manage and cope with the challenges of medical personnel and equipment. This helps them deliver decent healthcare services for residents; however, there is still a shortage of innovative e-Health technologies. IG-ANFIS could be a solution for this suffering.
1.5 SCOPE OF THE STUDY
In this thesis, EHRs have been used as data sources for developing automatic data mining techniques, so as to produce useful patterns and decision support logic for automatic computer aided diagnosis. The study has used the Wisconsin Breast Cancer (WBC) datasets from the University of California, Irvine (UCI) Machine Learning Repository, which is publicly available for research purposes. The research will combine Naïve Bayes and k-NN as one classifier for constructing missing feature values, in order to find the most suitable feature values that satisfy classification accuracy.
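The text does not specify how Naïve Bayes and k-NN are combined, so the following Python sketch shows only one possible reading, under loudly stated assumptions: the missing feature is treated as the class to be predicted, each method produces a probability distribution over candidate values taken from the complete records, and the two distributions are averaged with equal weight. The equal weighting, the Manhattan distance, the Laplace smoothing, the function names, and the toy records are all assumptions made for illustration.

```python
from collections import Counter


def knn_distribution(target_row, complete_rows, missing_idx, k):
    """Vote share of each candidate value among the k nearest complete rows."""
    def distance(row):
        # Manhattan distance over the features observed in target_row.
        return sum(abs(row[i] - v) for i, v in enumerate(target_row)
                   if i != missing_idx and v is not None)

    neighbours = sorted(complete_rows, key=distance)[:k]
    votes = Counter(row[missing_idx] for row in neighbours)
    return {value: count / len(neighbours) for value, count in votes.items()}


def naive_bayes_distribution(target_row, complete_rows, missing_idx):
    """Posterior over candidate values, treating the missing feature as the class."""
    candidates = Counter(row[missing_idx] for row in complete_rows)
    scores = {}
    for value, count in candidates.items():
        subset = [row for row in complete_rows if row[missing_idx] == value]
        score = count / len(complete_rows)  # prior probability of this value
        for i, observed in enumerate(target_row):
            if i == missing_idx or observed is None:
                continue
            matches = sum(1 for row in subset if row[i] == observed)
            distinct = len({row[i] for row in complete_rows})
            score *= (matches + 1) / (count + distinct)  # Laplace smoothing
        scores[value] = score
    total = sum(scores.values())
    return {value: s / total for value, s in scores.items()}


def impute(target_row, complete_rows, missing_idx, k=3):
    """Pick the candidate value with the highest averaged k-NN / Naive Bayes score."""
    knn = knn_distribution(target_row, complete_rows, missing_idx, k)
    nb = naive_bayes_distribution(target_row, complete_rows, missing_idx)
    # Assumption: equal weight for both methods.
    combined = {value: 0.5 * knn.get(value, 0.0) + 0.5 * nb[value] for value in nb}
    return max(combined, key=combined.get)


# Toy, made-up records with three integer-valued features; None marks a missing value.
complete_rows = [
    [1, 1, 1],
    [2, 1, 1],
    [1, 2, 1],
    [8, 9, 7],
    [9, 8, 7],
    [8, 8, 6],
]
incomplete = [9, 9, None]
filled = list(incomplete)
filled[2] = impute(incomplete, complete_rows, missing_idx=2)
print(filled)
```

Whether such a combination actually improves downstream classification accuracy would have to be checked empirically, for example by hiding known values and comparing the imputed values against them.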