Chronic kidney disease (CKD) is a condition characterized by a gradual loss of kidney function over time. It carries an increased risk of cardiovascular disease and end-stage renal disease. The approximate prevalence of chronic kidney disease is 800 per million population (pmp) [1]. In this article we use a machine learning approach to predict CKD and present a comparative analysis of seven different machine learning algorithms. The study starts with 24 parameters in addition to the class attribute, and 25% of the dataset is used to test predictions. The data is evaluated using five-fold cross-validation, and system performance is assessed using classification accuracy, the confusion matrix, specificity, and sensitivity.

Chronic kidney disease (CKD) is a permanent reduction in kidney function that can progress to end-stage renal disease (ESRD), which requires continuous dialysis or a kidney transplant to sustain life. Chronic kidney disease also affects the rate at which drugs are eliminated from the body [2]. In routine practice, a laboratory serum creatinine value is used to estimate kidney function by incorporating it into a formula that estimates the glomerular filtration rate and determines whether a patient has chronic kidney disease. CKD is becoming a major threat in developing and underdeveloped countries. The main causes of its occurrence are diseases such as diabetes and hypertension; other risk factors include heart disease, obesity, and a family history of chronic kidney disease. Its treatments, such as dialysis or kidney transplantation, are very expensive, so early diagnosis is essential. In the United States (US) [3], approximately 117,000 patients developed end-stage renal disease (ESRD) requiring dialysis in 2013, while more than 663,000 prevalent patients were on dialysis. In 2012, 5.6% of the total medical budget was spent on ESRD, amounting to approximately 28 billion dollars. In India, the prevalence of chronic kidney disease is about 800 per million population and that of ESRD is 150-200 per million population.

We consider seven machine learning classifiers, namely Logistic Regression, Support Vector Machine, K-Nearest Neighbors, Naïve Bayes, Stochastic Gradient Descent classifier, Decision Tree, and Random Forest. Finally, a set of standard performance metrics is used to estimate the performance of each classifier in the computer-aided diagnosis system. The metrics we use include the confusion matrix, classification accuracy, specificity, and sensitivity.

Methodology

Dataset

Our research uses a CKD dataset [4], which is openly accessible from the UCI Machine Learning Repository. The CKD dataset consists of 24 attributes (i.e., predictors) in addition to the binary class attribute. Of the 24 attributes, 11 are numeric, two are categorical with five levels, and the remaining parameters are binary, coded as zero for anomalous cases and one for normal cases. In the class attribute, one is coded for the presence of CKD and zero indicates that CKD is not present. The dataset contains 400 instances: 150 samples without kidney disease (not present) and 250 samples with kidney disease (present). Of these 400 instances, 300 are used for training the classification algorithms and 100 are used to test the resulting models. The attributes in the dataset are age, blood pressure, specific gravity, albumin, sugar, red blood cells, pus cells, pus cell clumps, bacteria, random blood sugar, blood urea, serum creatinine, sodium, potassium, hemoglobin, packed cell volume, white blood cell count, red blood cell count, hypertension, diabetes mellitus, appetite, pedal edema, anemia, and class.
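As an illustration of how this split could be produced in practice, the following Python sketch loads the CKD data and creates the 300/100 stratified train/test split with scikit-learn. The file name ckd.csv, the column name class, and the assumption that the "?" markers in the raw UCI file have already been converted to empty cells are illustrative assumptions, not details taken from the original study.

```python
# Minimal data-preparation sketch for the UCI CKD dataset, assuming it has been
# exported to a local CSV file named "ckd.csv" (hypothetical name) with a
# "class" column coded 1 = CKD, 0 = not CKD, and missing values left blank.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("ckd.csv")

# Dummy-encode the categorical predictors and fill remaining missing numeric
# values with the column median so every classifier can consume the matrix.
X = pd.get_dummies(df.drop(columns=["class"]))
X = X.fillna(X.median())
y = df["class"]

# 75/25 split: 300 instances for training, 100 held out for testing,
# stratified so both splits keep the 250/150 CKD / not-CKD proportion.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)  # expected: (300, n_features) (100, n_features)
```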
Machine Learning Techniques

Logistic Regression

Logistic regression (LR) is a linear model for classification [ref-1]. LR models the conditional distribution P(Y|X) between an example X and its Boolean class label Y. Logistic regression classifies the Boolean class label Y as follows:

P(Y=1|X) = 1 / (1 + exp(w_0 + Σ_{i=1}^{n} w_i X_i))

P(Y=0|X) = exp(w_0 + Σ_{i=1}^{n} w_i X_i) / (1 + exp(w_0 + Σ_{i=1}^{n} w_i X_i))

Support Vector Machine

The support vector machine (SVM) is a data mining method used to predict the category of the data [ref-2]. The main idea of SVM is to find the optimal separating hyperplane between the data of two classes in the training set. SVM finds this hyperplane by solving a constrained optimization problem.

K-Nearest Neighbors

K-nearest neighbors (KNN) is a classification method that classifies unknown examples by searching for the closest data points in the model space [ref-3]. KNN predicts the class using the Euclidean distance, defined as follows:

d(x, y) = √( Σ_{i=1}^{k} (x_i - y_i)^2 )

The Euclidean distance d(x, y) is used to find the k closest examples in the model space, and the class of the unknown example is determined by a majority vote of its neighbors.

Naïve Bayes

Naïve Bayes classifiers are probabilistic classifiers based on Bayes' theorem [ref-4]. In Naïve Bayes, each feature is treated as independent of the other features, so each feature contributes independently to the class probability; the higher the probability, the greater the chance that the example belongs to that class or category. The Naïve Bayes algorithm uses the concept of maximum likelihood for prediction. The algorithm is fast and can be used for real-time predictions such as sentiment analysis.

SGD Classifier

The SGD classifier is a logistic regression classifier trained with stochastic gradient descent optimization. Stochastic gradient descent (SGD) performs a parameter update for each training example x^(i) and label y^(i) [ref-5]:

θ = θ - η · ∇_θ J(θ; x^(i); y^(i))

Decision Tree

The decision tree is a classification method frequently used in data mining [ref-6]. A decision tree is a structure consisting of a root node, branches, and leaf nodes. It divides the data into classes based on the attribute values found in the training samples.

Random Forest

Random Forest (RF) is a variant of the ensemble classifier consisting of a collection of tree-structured classifiers h(x, Θ_k), where each tree is grown from an independently sampled random vector Θ_k with the same distribution for all trees in the forest. Randomization is introduced by randomly selecting input attributes to produce the individual base decision trees [ref-7]. Random forests differ from other ensemble methods in that a modified tree learning algorithm is used which, at each candidate split in the learning procedure, selects a random subset of the features. The reason for this is the correlation between trees grown from an ordinary bootstrap sample: for example, if a few features are very strong predictors of the class, they would be selected in many of the trees, causing the trees to become correlated.
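To tie the seven classifiers to the evaluation parameters described above (five-fold cross-validation, classification accuracy, confusion matrix, sensitivity, and specificity), a minimal scikit-learn sketch is shown below. It reuses X_train, X_test, y_train, and y_test from the data-preparation sketch; the specific hyperparameters (e.g., k = 5 for KNN, 100 trees for the random forest) are illustrative defaults, not necessarily those used in the original study.

```python
# Sketch: compare the seven classifiers with five-fold cross-validation and
# derive accuracy, sensitivity, and specificity from the confusion matrix.
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    # logistic loss ("log" in older scikit-learn releases) gives LR trained by SGD
    "SGD Classifier": SGDClassifier(loss="log_loss"),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=100),
}

for name, model in models.items():
    # Five-fold cross-validation accuracy on the 300 training instances.
    cv_acc = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy").mean()

    # Fit on the training portion and evaluate on the 100 held-out instances.
    model.fit(X_train, y_train)
    tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # true positive rate: CKD cases correctly detected
    specificity = tn / (tn + fp)   # true negative rate: non-CKD cases correctly detected

    print(f"{name}: cv={cv_acc:.3f} acc={accuracy:.3f} "
          f"sens={sensitivity:.3f} spec={specificity:.3f}")
```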