On the Scalability of Machine-Learning Algorithms for Breast Cancer Prediction in Big Data Context


Recent advances in information technology have induced an explosive growth of data, creating a new era of big data. Unfortunately, traditional machine-learning algorithms cannot cope with the new characteristics of big data. In this paper, we address the problem of breast cancer prediction in the big data context. We consider two varieties of data, namely gene expression (GE) and DNA methylation (DM). The objective of our work is to scale up the machine-learning algorithms used for classification by applying them to each dataset separately and to both jointly. For this purpose, we chose Apache Spark as a platform. We selected three classification algorithms, namely support vector machine (SVM), decision tree, and random forest, and created nine models that help in predicting breast cancer. We conducted a comprehensive comparative study using three scenarios (GE alone, DM alone, and GE and DM combined) in order to show which of the three types of data produces the best result in terms of accuracy and error rate. Moreover, we performed an experimental comparison between two platforms (Spark and Weka) in order to show their behavior when dealing with large sets of data. The experimental results showed that the scaled SVM classifier in the Spark environment outperforms the other classifiers, as it achieved the highest accuracy and the lowest error rate with the GE dataset.
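
The experimental design summarized above (three classifiers crossed with three data scenarios, each scored by accuracy and error rate) can be sketched in plain Python. This is only an illustrative sketch, not the Spark code used in the study: the names `ALGORITHMS`, `SCENARIOS`, `model_grid`, and `accuracy_and_error` are hypothetical, the sample labels are made up, and the error rate is assumed here to be one minus the accuracy.

```python
from itertools import product

# Hypothetical identifiers for the three classifiers and the three data
# scenarios named in the abstract; the strings themselves are illustrative.
ALGORITHMS = ["svm", "decision_tree", "random_forest"]
SCENARIOS = ["GE", "DM", "GE+DM"]


def model_grid():
    """Return every (algorithm, scenario) pair: 3 x 3 = 9 models."""
    return list(product(ALGORITHMS, SCENARIOS))


def accuracy_and_error(y_true, y_pred):
    """Return (accuracy, error rate); error rate is assumed to be 1 - accuracy."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    return accuracy, 1.0 - accuracy


if __name__ == "__main__":
    for algorithm, scenario in model_grid():
        print(algorithm, scenario)
    # Made-up labels, used only to exercise the metric function.
    print(accuracy_and_error([1, 0, 1, 1], [1, 0, 0, 1]))
```

In a real run, each pair in the grid would be trained and evaluated on the corresponding dataset, and the resulting accuracy and error rate would be compared across all nine models.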

Existing System:

In the medical field, large amounts of data about patients with different diseases are collected every day. Processing these datasets to discover valuable knowledge and hidden patterns will improve medical services and healthcare. Moreover, it will lower the cost of fighting or curing diseases. The fast development of computer science and algorithms has enabled novel approaches, such as classical machine-learning techniques, to harness data and gain insight for competitive advantage.



The data samples used with machine-learning methods are described in terms of features or attributes, which may be of different types and values. The nature of the data determines which machine-learning techniques should be used to obtain valuable information. Analyzing large sets of data is challenging when the aim is to obtain more powerful patterns and information that enable enhanced insight, decision making, and process automation. Unfortunately, the traditional ways of using machine-learning algorithms cannot cope with the new challenges of big data, especially scalability.



Proposed System:

In the GE process, the word expression refers to the ability of a gene to convert the genetic information stored in the DNA molecule into a gene product, such as a protein. GE encompasses several steps, which can be categorized into transcription and translation. In the transcription step, the DNA's biological information is copied into messenger RNA (mRNA). In the translation step, the mRNA is translated into a gene product such as a protein, which performs cellular functions. The transcription step is what is called GE, and it indicates the approximate number of RNA copies of a gene that are produced in a cell. It is correlated with the amount of the corresponding protein that the process generates.
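
The two steps described above can be illustrated with a toy Python sketch. It is a deliberate simplification and not part of the study: the function names, the partial codon table, the example sequence, and the transcript list are all assumptions made for illustration. Transcription is modeled as copying the coding strand with T replaced by U, and the expression level is modeled as a simple count of mRNA copies per gene.

```python
from collections import Counter

# Partial codon table taken from the standard genetic code; it covers only
# a handful of codons, which is enough for this toy example.
CODON_TABLE = {
    "AUG": "Met", "GCC": "Ala", "UUU": "Phe",
    "GGC": "Gly", "AAA": "Lys", "UGG": "Trp",
}
STOP_CODONS = {"UAA", "UAG", "UGA"}


def transcribe(coding_strand):
    """Toy transcription: the mRNA matches the coding strand with T -> U."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Toy translation: read codons from the first AUG up to a stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return []
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:
            break
        # Codons missing from the partial table are marked as unknown ("Xaa").
        protein.append(CODON_TABLE.get(codon, "Xaa"))
    return protein


def expression_levels(transcripts):
    """Count mRNA copies per gene, a stand-in for a GE measurement."""
    return Counter(transcripts)


if __name__ == "__main__":
    mrna = transcribe("ATGGCCTGA")  # made-up sequence
    print(mrna, translate(mrna))
    # Made-up transcript list: one entry per observed mRNA copy.
    print(expression_levels(["BRCA1", "TP53", "BRCA1"]))
```

In a GE dataset, counts of this kind (one value per gene and per sample) are the features on which a classifier is trained.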