Data Science Portfolio

How Personality affects Heroin Consumption

Setting up the dataset for binary classification

For each drug, an individual falls under one of 7 classes—CL0, CL1, CL2, CL3, CL4, CL5, and CL6. CL0 means that the individual never used the drug, CL1 means the individual used the drug over a decade ago, CL2 means the individual used the drug within the last decade; and CL3, CL4, CL5, and CL6 mean the individual used the drug in the last year, in the last month, in the last week, and in the last day, respectively.

First, I took the .data file and converted it into a .csv file. I restricted the dataset to the drug heroin by deleting all the columns for the other drugs. I then added a new column “User” that has value 0 if the individual falls under CL0 or CL1—the individual is not considered a User. Otherwise, the individual falls under CL2, CL3, CL4, CL5, or CL6; in this case, the value is 1—the individual is considered a User. This gives a binary classification problem for classifying individuals as User or Not User.
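If you prefer to do this step in pandas rather than by hand, a minimal sketch might look like the following (the file name and the "Heroin" column name are assumptions about how the CSV is laid out):

import pandas as pd

dataset = pd.read_csv("heroin.csv")  # assumed file name
#User = 0 for CL0/CL1 (never used, or used over a decade ago), 1 otherwise
dataset["User"] = (~dataset["Heroin"].isin(["CL0", "CL1"])).astype(int)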

I took a random sample of the heroin dataset using Excel. Then, I split the dataset into two separate CSV files: one containing 80% of the data, the other containing 20% of the data. The set containing 80% is our training set, and the set containing 20% is our test set.
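The same 80/20 split can be reproduced in pandas (continuing the sketch above; file names are assumptions):

train = dataset.sample(frac=0.8, random_state=0)  # 80% training set
test = dataset.drop(train.index)                  # remaining 20% test set
train.to_csv("heroin_train.csv", index=False)
test.to_csv("heroin_test.csv", index=False)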

Visualizing the data

For this section, I’m going to use the entire dataset for heroin consumption. Let’s look at heroin consumption by gender. Here is a graph for the number of users for each gender.

As you can see, there are about the same number of males as females in the population. For both genders, only a small fraction are heroin users, and there are about twice as many male heroin users as female heroin users. We can test whether gender is statistically significant using the chi-squared test:

Chi-Squared Test (Gender)

Observed
             Not User         User          Total
Male         797              146           943
Female       876              66            942
Total        1673             212           1885

Expected
             Not User         User          Total
Male         836.9437666      106.056233    943
Female       836.0562334      105.943767    942
Total        1673             212           1885

P-value: 5.74672E-09
Significance level: 0.05

Since the p-value is less than the significance level, gender is statistically significant.
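The test above was done in Excel, but it can be reproduced in Python; here is a minimal sketch using scipy, with the observed counts taken from the table above (correction=False matches the uncorrected test used here):

import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[797, 146],   # Male: Not User, User
                     [876,  66]])  # Female: Not User, User
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(p_value)   # on the order of 1e-09, matching the table above
print(expected)  # expected counts under independence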

Let’s look at the percentages of users by country.

As you can see, the USA and UK have the largest representation in the sample. Most of the heroin users in the sample are from the USA. Applying the chi-squared test, we can check that country is statistically significant:

Chi-Squared Test (Country)

Observed
Country       Not User         User           Total
-0.57009      412              145            557
-0.46841      3                2              5
-0.28519      107              11             118
-0.09765      48               6              54
 0.21128      18               2              20
 0.24923      75               12             87
 0.96082      1010             34             1044
Total         1673             212            1885

Expected
Country       Not User         User           Total
-0.57009      494.3559682      62.6440318     557
-0.46841      4.437665782      0.56233422     5
-0.28519      104.7289125      13.2710875     118
-0.09765      47.92679045      6.07320955     54
 0.21128      17.75066313      2.24933687     20
 0.24923      77.21538462      9.78461538     87
 0.96082      926.5846154      117.415385     1044
Total         1673             212            1885

P-value: 3.70058E-39
Significance level: 0.05

Since the p-value is less than 0.05, country is statistically significant.

Let’s see how Impulsivity affects the percentage of users:

As the Impulsive score increases, the percentage of users increases. 

We can apply the Chi-Squared test to check whether Impulsivity is statistically significant:

Chi-Squared Test (Impulsivity)

Observed
Impulsive     Not User         User           Total
-2.55524      20               0              20
-1.37983      269              7              276
-0.71126      282              25             307
-0.21712      329              26             355
 0.19268      231              26             257
 0.52975      185              31             216
 0.88113      155              40             195
 1.29221      119              29             148
 1.86203      79               25             104
 2.90161      4                3              7
Total         1673             212            1885

Expected
Impulsive     Not User         User           Total
-2.55524      17.75066313      2.24933687     20
-1.37983      244.9591512      31.0408488     276
-0.71126      272.472679       34.527321      307
-0.21712      315.0742706      39.9257294     355
 0.19268      228.0960212      28.9039788     257
 0.52975      191.7071618      24.2928382     216
 0.88113      173.0689655      21.9310345     195
 1.29221      131.3549072      16.6450928     148
 1.86203      92.30344828      11.6965517     104
 2.90161      6.212732095      0.7872679      7
Total         1673             212            1885

P-value: 1.28149E-14
Signif Level: 0.05

Since the p-value is less than the significance level, Impulsivity is statistically significant.  

As noted above, the percentage of users rises with the Impulsive score. A similar effect can be seen with Nscore:

The same thing can be seen with SS:

The opposite effect is seen with Ascore:

The percentage of users is higher for lower Ascores. We can see this more clearly by normalizing the bars:

Similarly, the percentage of users is higher for lower levels of education:

We can see this better by normalizing the bars:

The Chi-squared test was applied to Nscore, SS, Ascore, and Education; all were found to be statistically significant. In contrast, applying the chi-squared test to Escore gives a p-value of 0.156615315, which is greater than 0.05. Therefore, Escore is not statistically significant. Ethnicity also turns out to be statistically insignificant. Interestingly, Cscore and Age turn out to be statistically significant although we will eliminate them in backward elimination when building the model. There does seem to be a relationship between Cscore and whether or not the individual is a user:

The percentage of users is higher for lower Cscores. Similarly, the percentage of users is higher for younger individuals:

Building a Logistic Regression Model

Using gretl, I applied logistic regression to the training set. In logistic regression, our model will give us a probability for each individual—namely, the probability that the individual is a User.

First, I used backward elimination to eliminate some of the independent variables using the threshold of 0.05 for p-value. In this way, I was able to eliminate Escore, Cscore, Age, and Ethnicity. At this point, here is what our logistic regression model in gretl looks like:

Model 32: Logit, using observations 1-1508

Dependent variable: User

Standard errors based on Hessian

  Coefficient Std. Error z p-value  
const −2.35671 0.118925 −19.82 <0.0001 ***
Gender −0.491813 0.203406 −2.418 0.0156 **
Nscore 0.279726 0.0952334 2.937 0.0033 ***
Ascore −0.233985 0.0926942 −2.524 0.0116 **
Impulsive 0.300076 0.118098 2.541 0.0111 **
SS 0.236009 0.127718 1.848 0.0646 *
Education −0.193636 0.107405 −1.803 0.0714 *
Oscore 0.181861 0.101459 1.792 0.0731 *
Country −1.00028 0.151742 −6.592 <0.0001 ***
Mean dependent var  0.112732   S.D. dependent var  0.316370
McFadden R-squared  0.187395   Adjusted R-squared  0.170449
Log-likelihood −431.5763   Akaike criterion  881.1526
Schwarz criterion  929.0194   Hannan-Quinn  898.9800

Number of cases ‘correctly predicted’ = 1334 (88.5%)

f(beta’x) at mean of independent vars = 0.316

Likelihood ratio test: Chi-square(8) = 199.051 [0.0000]

As you can see, the p-value for Oscore is greater than 0.05; however, removing Oscore from the model decreases the Adjusted R-squared value to 0.169289, so I decided to keep Oscore as well as Education and SS. I checked for collinearity and found no signs of it. The accuracy rate of the model is 88.5%.
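The model above was fit in gretl; a rough Python equivalent, assuming a DataFrame named train with the listed predictor columns and the binary User target, might look like this:

import statsmodels.api as sm

predictors = ["Gender", "Nscore", "Ascore", "Impulsive",
              "SS", "Education", "Oscore", "Country"]
X = sm.add_constant(train[predictors])
logit_model = sm.Logit(train["User"], X).fit()
print(logit_model.summary())  # coefficients, standard errors, z values, p-values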

Assessing the model

Here is what the CAP curve looks like:

Our training set is ordered by our logistic regression model from highest predicted probability to lowest predicted probability. The horizontal axis measures the percentage of the population of individuals considered. The vertical axis measures the percentage of the population of users in our training set. The red curve, at x% on the horizontal axis, measures the number of users identified out of the first x% of our training set (as ordered by our model) divided by the total number of users in our training set. The blue line is the performance of the random model. As you can see from the CAP curve, our logistic regression model is good. At x=50%, our CAP curve is approximately 90%.
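Here is a minimal sketch of how such a CAP curve can be drawn in Python, assuming y_true holds the 0/1 User labels and y_prob the predicted probabilities, both as NumPy arrays for the same individuals:

import numpy as np
import matplotlib.pyplot as plt

order = np.argsort(y_prob)[::-1]                         # highest predicted probability first
cum_users = np.cumsum(y_true[order]) / y_true.sum()      # fraction of users identified so far
frac_pop = np.arange(1, len(y_true) + 1) / len(y_true)   # fraction of individuals considered

plt.plot(frac_pop, cum_users, color="red", label="logistic regression model")
plt.plot([0, 1], [0, 1], color="blue", label="random model")
plt.xlabel("Percentage of individuals considered")
plt.ylabel("Percentage of users identified")
plt.legend()
plt.show()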

Next, we applied our model to the test set and created the CAP curve for our test set. The test set is ordered by our logistic regression model from highest predicted probability to lowest predicted probability. Here is the CAP curve for our test set:

At x=50%, our CAP curve is approximately 95%. So our model is very good. What this CAP curve tells us is that, by ordering the list of individuals in our test set according to highest probability given by our model, we would have identified 95% of the users by considering the first 50% of the individuals in the list. If we were to consider a new set of individuals, we could potentially identify 95% of the users by considering only 50% of the individuals. The accuracy rate of our model on the test set is 89.9%.

Let’s take a look at the odds-ratios for the predictors:

             Coefficient      p-value         Odds-ratio
const        -2.3567137       2.13E-87
Gender       -0.491812885     0.015610872     0.6115
Nscore        0.279725709     0.003311199     1.3228
Ascore       -0.233985148     0.011593878     0.7914
Impulsive     0.300075998     0.011056415     1.3500
SS            0.236008533     0.064617594     1.2662
Education    -0.193636214     0.071409777     0.8240
Oscore        0.181861329     0.073059336     1.1994
Country      -1.000278262     4.34E-11        0.3678

As you can see, Impulsive, Nscore, SS, and Oscore have positive coefficients, and hence odds-ratios greater than 1, while Country, Gender, Ascore, and Education have negative coefficients and odds-ratios less than 1. This suggests that higher Impulsive, Nscore, SS, and Oscore values correlate with a higher probability that the individual is a user, while higher Ascore and Education values correlate with a lower probability that the individual is a user.
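The odds-ratios are simply the exponentiated coefficients; for example, using the coefficient values reported above:

import numpy as np

np.exp(-0.491813)  # Gender    -> about 0.61
np.exp(0.300076)   # Impulsive -> about 1.35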

Recall that Cscore and Age were found to be statistically significant although they were not included in the model. I trained a new model by including them to see if it improves the model. Here is the new model in gretl:

Model 1: Logit, using observations 1-1508

Dependent variable: User

Standard errors based on Hessian

  Coefficient Std. Error z p-value  
const −2.35386 0.119036 −19.77 <0.0001 ***
Gender −0.501279 0.204678 −2.449 0.0143 **
Nscore 0.289359 0.101232 2.858 0.0043 ***
Ascore −0.234622 0.0928943 −2.526 0.0115 **
Impulsive 0.308060 0.122780 2.509 0.0121 **
SS 0.240555 0.129383 1.859 0.0630 *
Education −0.204355 0.109989 −1.858 0.0632 *
Oscore 0.183988 0.102186 1.801 0.0718 *
Country −1.01506 0.156544 −6.484 <0.0001 ***
Cscore 0.0259174 0.102757 0.2522 0.8009  
Age 0.0396074 0.119620 0.3311 0.7406  
Mean dependent var  0.112732   S.D. dependent var  0.316370
McFadden R-squared  0.187567   Adjusted R-squared  0.166856
Log-likelihood −431.4846   Akaike criterion  884.9692
Schwarz criterion  943.4731   Hannan-Quinn  906.7582

Number of cases ‘correctly predicted’ = 1337 (88.7%)

f(beta’x) at mean of independent vars = 0.316

Likelihood ratio test: Chi-square(10) = 199.235 [0.0000]

As you can see, the accuracy rate is 88.7%, not much better than the accuracy rate of our original model.

Here is the CAP curve for our new model applied to the training set:

At x=50%, the CAP curve is approximately 90%. The accuracy rate of the model on the training set is 88.66%. Thus, our new model is not much of an improvement over the original model.

I applied the new model to the test set and got the following CAP curve.

At x=50%, the curve is approximately 95%. The accuracy rate of the model on the test set is 89.9%. Thus, including Cscore and Age in the model does not improve the model by much.

Here are the odds-ratios for the predictors Cscore and Age:

          Coefficient    p-value      Odds-ratio
Cscore    0.025917       0.80087      1.0263
Age       0.039607       0.740562     1.0404

Since the coefficients for Cscore and Age are close to 0, they don’t contribute much to the probability that an individual is a user. It’s strange that Cscore and Age are statistically significant according to the chi-squared test and yet not necessary in the model. What could explain this is that Cscore is negatively correlated with Nscore and Impulsive and that Age is negatively correlated with SS. Here is a heatmap of the correlation matrix:
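For reference, a correlation heatmap like the one shown can be produced with seaborn, assuming df is a DataFrame containing the relevant columns:

import matplotlib.pyplot as plt
import seaborn as sns

corr = df[["Age", "Nscore", "Cscore", "Impulsive", "SS"]].corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", center=0)
plt.show()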

The dataset can be found here: http://archive.ics.uci.edu/ml/datasets/Drug+consumption+%28quantified%29

E. Fehrman, A. K. Muhammad, E. M. Mirkes, V. Egan and A. N. Gorban, “The Five Factor Model of personality and evaluation of drug consumption risk,” arXiv, 2015.

Online News Popularity of Mashable articles

We have a dataset of articles published by Mashable, and we want to predict the popularity of a given article.  This is a classification problem.  The features include quantitative information about the article (such as the number of images and the number of videos) and qualitative information about the article (such as which day it was published and which topic the article falls under).

Comparing the number of images with the number of shares, we get the following bar graph:

As you can see, articles that have 1 image do better overall.  Then, articles that have no images are second-best; and, articles that have 2 images come in third place.  The number of shares for articles that have more than 2 images is negligible.  Given this insight, it would be wise to include either 0 or 1 image in an article.

Comparing the number of videos with the number of shares, we get the following image:

Notice that articles that have no videos tend to do best, with articles that have 1 video doing second-best and articles that have 2 videos doing third-best.  The articles that have more than 2 videos are negligible in comparison. 

We can also see which days of the week have the highest number of shares.  Here is a bar chart for days of the week:

As you can see, the weekdays have the highest number of shares.  Wednesday, Monday, Tuesday, and Thursday have the highest number of shares, with Friday having a significant drop. 

We can also see which category or topic does the best.  Here is a pie chart for the six categories Tech, Entertainment, World, Business, Social Media, and Lifestyle:

The best performing category is Tech, followed by Entertainment, World, and Business.  The least popular categories are Social Media and Lifestyle. 

Using Tableau, we can create the above visualizations and do some basic data mining.  The insights we’ve derived can tell us how many images and videos to include in an article, on which day to publish the article, and which category the article should be about. 

Next, using Python, I applied some machine learning to the dataset.  First, to formulate a classification problem, I used the threshold of 1400 shares to create two classes: if the number of shares is greater than 1400, then the article is classified as popular; if the number of shares is less than or equal to 1400, then the article is classified as unpopular. 

To prepare the CSV file, I created a new column (a popularity column) after the shares column using Excel's IF function: if the number of shares is greater than 1400, then the class is 4 (popular); otherwise, the class is 2 (unpopular).
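The same column can be added in pandas instead of Excel; here is a minimal sketch, assuming the shares column is named "shares":

import pandas as pd

dataset = pd.read_csv("OnlineNewsPopularity.csv")  # assumed file name
dataset["popularity"] = (dataset["shares"] > 1400).map({True: 4, False: 2})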

Our goal is to create a machine learning model that will classify articles as popular or unpopular.  To do this, I used the gradient boosting classifier.

First, I split the dataset into a training set and a test set—taking 80% of the dataset as the training set and 20% of the dataset as the test set.

We fit the gradient boosting classifier to our training set.  Then, we apply the fitted gradient boosting classifier to our test set.

According to the confusion matrix obtained by comparing the predicted classes to the test classes, our model got 5277 correct classifications out of 7929 test classifications.  This gives an accuracy of about 67%.

Here is the python code:

#Classification for Online News Popularity

#Importing the Libraries

import pandas as pd

#Importing the dataset

dataset = pd.read_csv('OnlineNewsPopularity(classification with nominal).csv')

X = dataset.iloc[:,2:-2].values

Y = dataset.iloc[:, -1].values

#Splitting the dataset into the Training set and Test set

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)

#Fitting Gradient Boosting to Training set

from sklearn.ensemble import GradientBoostingClassifier

classifier = GradientBoostingClassifier()

classifier.fit(X_train, Y_train)

#Predicting the Test set results

Y_pred = classifier.predict(X_test)

#Making the Confusion Matrix

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(Y_test, Y_pred)
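The accuracy quoted above can be read off the confusion matrix directly:

accuracy = cm.trace() / cm.sum()  # correct classifications / total test classifications
print(accuracy)                   # roughly 0.67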

The dataset can be found here: http://archive.ics.uci.edu/ml/datasets/Online+News+Popularity

K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 – Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal.

High School Math Performance

Introduction

In the world of math education, one of the major issues that universities and educators have is that students do not succeed at mathematics at a satisfactory level and at a rate that is satisfactory. Universities and educators complain of the high failure, drop, and withdrawal rates of their students. This is a problem for students because low performance in math prevents them from pursuing their degrees and careers. It is a problem for universities and educators because it means that the university or educator is not successfully teaching students, not retaining their students, and not satisfying the needs of their students — these problems hurt the profitability and attractiveness of the university and educator.

If we can gain some insights into what factors most contribute to or hurt student performance in math, we have the potential to solve the above-mentioned problems. If we can produce predictive models that can predict whether a student will pass or fail, that can predict the numerical score of students on math assessments, and that can predict the overall strength and promise of a student, then universities and educators will be able to use these models to better place students at the appropriate level of competence, to better select students for admission, and to better understand the factors that can be improved upon to help students be successful.

In this paper, we will apply data science and machine learning techniques to a dataset representing the math performance of students from two Portuguese high schools. The dataset can be found at the link at the end of this article.

The data file was separated by semicolons rather than commas. I replaced the semicolons with commas, copied and pasted everything into Notepad, and then converted the text file to a .csv file using the steps from the following link:

https://knowledgebase.constantcontact.com/articles/KnowledgeBase/6269-convert-a-text-file-to-an-excel-file?lang=en_US

Now, I have a nice csv file.

There are 30 attributes that include things like student age, parent’s education, parent’s job, weekly study time, number of absences, number of past class failures, etc. There are grades for years 1, 2, and 3; these are denoted by G1, G2, and G3. The grades range from 0–20. G1 and G2 can be used as input features, and G3 will be the main target output.

Some of the attributes are ordinal, some are binary yes-no, some are numeric, and some are nominal. We do need to do some data preprocessing. For the binary yes-no attributes, I will encode them using 0’s and 1’s. I did this for schoolsup, famsup, paid, activities, nursery, higher, internet, and romantic. The attributes famrel, freetime, goout, Dalc, Walc, and health are ordinal; the values for these range from 1 to 5. The attributes Medu, Fedu, traveltime, studytime, failures are also ordinal; the values range from 0 to 4 or 1 to 4. The attribute absences is a count attribute; the values range from 0 to 93. The attributes sex, school, address, Pstatus, Mjob, Fjob, guardian, famsize, reason are nominal. For nominal attributes, we can use one-hot encoding. The attributes age, G1, G2, and G3 can be thought of as interval attributes.

I one-hot encoded each nominal attribute, one at a time. I exported the dataframe as a csv file each time, relabeling the columns as I go. Finally, I reordered the columns.

Here is the python code:

import numpy as np
import pandas as pd
dataset = pd.read_csv(r'C:\Users\ricky\Downloads\studentmath.csv')
X = dataset.iloc[:,:-1].values
Y = dataset.iloc[:,32].values
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
#Encoding binary yes-no attributes
X[:,15] = labelencoder_X.fit_transform(X[:,15])
X[:,16] = labelencoder_X.fit_transform(X[:,16])
X[:,17] = labelencoder_X.fit_transform(X[:,17])
X[:,18] = labelencoder_X.fit_transform(X[:,18])
X[:,19] = labelencoder_X.fit_transform(X[:,19])
X[:,20] = labelencoder_X.fit_transform(X[:,20])
X[:,21] = labelencoder_X.fit_transform(X[:,21])
X[:,22] = labelencoder_X.fit_transform(X[:,22])
#Encoding nominal attributes (label encoding before one-hot encoding)
X[:,0] = labelencoder_X.fit_transform(X[:,0])
X[:,1] = labelencoder_X.fit_transform(X[:,1])
X[:,3] = labelencoder_X.fit_transform(X[:,3])
X[:,4] = labelencoder_X.fit_transform(X[:,4])
X[:,5] = labelencoder_X.fit_transform(X[:,5])
X[:,8] = labelencoder_X.fit_transform(X[:,8])
X[:,9] = labelencoder_X.fit_transform(X[:,9])
X[:,10] = labelencoder_X.fit_transform(X[:,10])
X[:,11] = labelencoder_X.fit_transform(X[:,11])
#One-hot encoding (this was repeated for each nominal column, re-exporting the csv each time)
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
from pandas import DataFrame
df = DataFrame(X)
export_csv = df.to_csv(r'C:\Users\Ricky\Downloads\highschoolmath.csv', index = None, header=True)

Using seaborn, we can take a look at some visualizations. Here’s a histogram for “absences”.

Note that most students have very few absences and that, as the number of absences increases, the number of students with that many absences decreases.

Let’s look at the distributions for grades G1, G2, and G3:

For G1, the first year grade, the scores look normally distributed. For G2 and G3, the distribution curve has two peaks; a certain number of students have low scores near 0.

One might wonder how age affects the grades G1, G2, and G3. Here are some boxplots for age against G1:

The median score appears to have a local maximum at age 16, decreasing until it reaches age 19, then increasing sharply at age 20. The same thing can be seen with age versus G2:

In the third year, the local maximum at age 16 disappears, but the sharp increase at age 20 remains:

Note also that the range in scores gets tighter at ages 19 and 20.

We might also wonder whether males or females perform better. In all three years, the median score for males is higher than for females. For instance, here are the boxplots for Male vs G3:

If we apply linear regression to model G3 as a function of G1, we get the following:

As G1 score increases, so does G3 score. Similarly, we can see a linear relationship between G3 and G2:

Here is the python code:

import numpy as np
import pandas as pd
dataset = pd.read_csv(r'C:\Users\ricky\Downloads\studentmathdummified.csv')
X = dataset.iloc[:,:-1].values
Y = dataset.iloc[:,50].values
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.rcParams['figure.figsize'] = 8,4
import warnings
warnings.filterwarnings('ignore')
#Distribution
vis1 = sns.distplot(dataset["Fedu"])
#Boxplots
vis2 = sns.boxplot(data=dataset, x="Male", y="G3")
#Linear regression model
vis3 = sns.lmplot(x="G2", y="G3", data=dataset)

Simple Linear Regression

Now, let’s apply some machine learning to our dataset. It’s a good guess that G3 depends on G1 in a linear fashion. We can see this more clearly by applying simple linear regression with G3 as the dependent variable and G1 as the independent variable.

First, we let X be the matrix consisting of the first 50 columns of our dataset studentmathdummified.csv. Then, let Y be the last column of the dataset — namely, G3.

We then split our dataset into a training set and a test set. We apply linear regression to our training set to train our simple linear regression model; we then apply the model to our test set, and we can compare the predicted Y values with the actual Y values of our test set.

Here is the python code:

#Simple Linear Regression
#Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#Importing the dataset
dataset = pd.read_csv("studentmathdummified.csv")
X = dataset.iloc[:,:-1].values
Y = dataset.iloc[:,-1].values
#Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)
#Fitting Simple Linear Regression to the Training set (column 48 of X is G1)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train[:,48:49], Y_train)
#Predicting the Test set results
Y_pred = regressor.predict(X_test[:,48:49])

Let’s take a look at the linear model attained from training it on our training set. Here is what it looks like:

In red is the scatter plot of our training set G3 values versus the training set G1 values. The blue line is our linear regression model.

Here is the python code used to generate the graph:

#Visualizing the Training set results
plt.scatter(X_train[:,48:49], Y_train, color = 'red')
plt.plot(X_train[:,48:49], regressor.predict(X_train[:,48:49]), color = 'blue')
plt.title('G3 vs G1 (Training set)')
plt.xlabel('G1')
plt.ylabel('G3')
plt.show()

Now, let’s see how the linear regression model performs on the test set. Here is the scatter plot of the test set G3 values versus the test set G1 values in red and the linear model in blue.

As you can see, the linear regression model performs extremely well on the test set.

Here is the python code to generate the graph:

#Visualizing the Test set results
plt.scatter(X_test[:,48:49], Y_test, color = 'red')
plt.plot(X_train[:,48:49], regressor.predict(X_train[:,48:49]), color = 'blue')
plt.title('G3 vs G1 (Test set)')
plt.xlabel('G1')
plt.ylabel('G3')
plt.show()

We can see a very similar relationship between G3 and G2. We can apply simple linear regression with G3 as the dependent variable and G2 as the independent variable. Here are the results:

Multiple Linear Regression

So far, we’ve applied linear regression using a single variable, either G1 or G2. Perhaps the other independent variables have an effect on G3. To see, we can apply multiple linear regression where we take into account all of the independent variables.

First, in order to avoid the dummy variable trap, I deleted the columns for GP, Male, urban, LE3, Apart, mother_at_home, father_at_home, reason_course, and guardian_other. I named the new dataset 'dataset_trap'. Then, I defined X and Y using dataset_trap, split the dataset into a training set and a test set, trained the multiple linear regression model on the training set, and applied the model to X_test.

Here is the python code:

#Multiple Linear Regression
#Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#Importing the dataset
dataset = pd.read_csv("studentmathdummified.csv")
#Avoiding the dummy variable trap
#Dropping GP, Male, urban, LE3, Apart, mother_at_home, father_at_home, reason_course, guardian_other
dataset_trap = dataset.drop(dataset.columns[[0,2,4,6,8,10,15,20,26]], axis=1)
#Define X and Y using dataset_trap
X = dataset_trap.iloc[:,:-1].values
Y = dataset_trap.iloc[:,-1].values
#Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)
#Fitting Multiple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
#Predicting the Test set results
Y_pred = regressor.predict(X_test)

Comparing the predicted Y values with the test set Y values, the model did a pretty good job but not an excellent job. Perhaps we can get better performance by only including the attributes that have a significant effect on G3. We can do this by performing backward elimination. If we do this, using a threshold of 0.05 for the p-value, we end up with the attributes Age, famrel, absences, G1, and G2 as our optimal set of attributes. The age of the student, the quality of family relationships, the number of absences, and the grades in the first and in the second years are found to be the most significant attributes.

Here is the python code for performing backward elimination.

#Performing backward elimination
import statsmodels.api as sm
X = np.append(arr = np.ones((395,1)).astype(int), values = X, axis = 1)
X_opt = X[:, [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41]]
regressor_OLS = sm.OLS(endog = Y, exog = X_opt).fit()
regressor_OLS.summary()

We eliminate the column in X_opt that corresponds to the independent variable with highest p-value over 0.05. Then, we perform the following code again with the new X_opt:

X_opt = X[:, [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41]]
 regressor_OLS = sm.OLS(endog = Y, exog = X_opt).fit()
 regressor_OLS.summary()

We repeat this process until all independent variables have p-value below 0.05. We end up with:

 X_opt = X[:, [19,33,39,40,41]]
 regressor_OLS = sm.OLS(endog = Y, exog = X_opt).fit()
 regressor_OLS.summary()

Columns 19, 33, 39, 40, 41 correspond to the attributes Age, famrel, absences, G1, and G2.

Instead of performing backward elimination manually, we can also use the following code to perform it automatically:

import statsmodels.api as sm
def backwardElimination(x, sl):
    numVars = len(x[0])
    for i in range(0, numVars):
        regressor_OLS = sm.OLS(Y, x).fit()
        maxVar = max(regressor_OLS.pvalues).astype(float)
        if maxVar > sl:
            for j in range(0, numVars - i):
                if (regressor_OLS.pvalues[j].astype(float) == maxVar):
                    x = np.delete(x, j, 1)
    regressor_OLS.summary()
    return x

SL = 0.05
X_opt = X[:, [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41]]
X_Modeled = backwardElimination(X_opt, SL)
regressor_OLS = sm.OLS(endog = Y, exog = X_Modeled).fit()
regressor_OLS.summary()

SVR Regression

In this section, we will perform support vector regression (SVR) using the Gaussian (RBF) kernel. Given our earlier insight that the most significant attributes are Age, famrel, absences, G1, and G2, I trained an SVR model using these attributes on a training set. I performed feature scaling on X_train, X_test, and Y_train and then compared Y_test with Y_pred. The performance was very impressive.

Here is the python code:

#SVR Regression
#Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#Importing the dataset
dataset = pd.read_csv("studentmathdummified.csv")
#Avoiding the dummy variable trap
#Dropping GP, Male, urban, LE3, Apart, mother_at_home, father_at_home, reason_course, guardian_other
dataset_trap = dataset.drop(dataset.columns[[0,2,4,6,8,10,15,20,26]], axis=1)
#Define X and Y using dataset_trap
X = dataset_trap.iloc[:,:-1].values
Y = dataset_trap.iloc[:,-1].values
#Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)
#Feature Scaling (restricting to the columns for Age, famrel, absences, G1, and G2)
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train[:, [18,32,38,39,40]])
X_test = sc_X.transform(X_test[:, [18,32,38,39,40]])  # use the scaler fitted on the training set
sc_Y = StandardScaler()
Y_train = sc_Y.fit_transform(Y_train.reshape(-1,1))
#Fitting SVR Regression to the Training set
from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
regressor.fit(X_train, Y_train)
#Predicting the Test set results (inverse-transforming back to the original grade scale)
Y_pred = sc_Y.inverse_transform(regressor.predict(X_test))

Decision Tree Regression

I performed decision tree regression on the training set, without removing any of the attributes. The performance was very good, even though we didn’t use only the attributes Age, famrel, absences, G1, and G2. Here is the python code:

#Decision Tree Regression
#Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#Importing the dataset
dataset = pd.read_csv("studentmathdummified.csv")
#Avoiding the dummy variable trap
#Dropping GP, Male, urban, LE3, Apart, mother_at_home, father_at_home, reason_course, guardian_other
dataset_trap = dataset.drop(dataset.columns[[0,2,4,6,8,10,15,20,26]], axis=1)
#Define X and Y using dataset_trap
X = dataset_trap.iloc[:,:-1].values
Y = dataset_trap.iloc[:,-1].values
#Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)
#Feature Scaling (not needed for decision trees, so it is left commented out)
"""from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train[:, [18,32,38,39,40]])
X_test = sc_X.transform(X_test[:, [18,32,38,39,40]])
sc_Y = StandardScaler()
Y_train = sc_Y.fit_transform(Y_train.reshape(-1,1))"""
#Fitting Decision Tree Regression to the Training set
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state = 0)
regressor.fit(X_train, Y_train)
#Predicting the Test set results
Y_pred = regressor.predict(X_test)

Random Forest Regression

I also applied random forest regression using 10, 100, and 500 trees. In random forests, a bunch of trees are grown and the average of the predicted values is taken to be the prediction. In the python code, it’s similar to the decision tree regression code except we replace the section on fitting decision tree regression to the training set with the following:

 #Fitting Random Forest Regression to the Training set
 from sklearn.ensemble import RandomForestRegressor
 regressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
 regressor.fit(X_train,Y_train)

Assessing Model Performance

In order to determine which model is the best, we will perform k-fold cross validation (k=10) for each model and pick the one that has the best score. (For regressors, cross_val_score returns the R-squared score by default; the percentages below refer to this score.)

For multiple linear regression using all attributes, I got an accuracy of 80%. For multiple linear regression using only the five attributes Age, famrel, absences, G1, and G2, I got an accuracy of 83%.

For SVR regression using all attributes, I got an accuracy of 73%. For SVR regression using only the five attributes Age, famrel, absences, G1, and G2, I got an accuracy of 83%.

For decision tree regression, I got an accuracy of 77%. For decision tree regression using only the five attributes Age, famrel, absences, G1, and G2, the accuracy is 83%.

For random forest regression, using 10 trees, I got an accuracy of 85%. For 100 trees, I got an accuracy of 86%. Using 500 trees, I got an accuracy of 87%. I’ve tried increasing the number of trees, but the accuracy doesn’t go beyond 87%.

For random forest regression using only the five attributes Age, famrel, absences, G1, and G2, I got the following: 86% for 10 trees, 88% for 100 trees, 88% for 500 trees. I tried increasing the number of trees, but the accuracy doesn’t go beyond 88%.

What’s interesting about the above results is that by limiting the attributes to only the five attributes Age, famrel, absences, G1, and G2, the accuracy went up for each model.

The best performing model appears to be random forest regression using 500 trees. The performance is even better if we limit the attributes to the five attributes Age, famrel, absences, G1, and G2.

Here is the python code:

 #Applying k-fold cross validation
 from sklearn.model_selection import cross_val_score
 accuracies = cross_val_score(estimator=regressor,X=X_train, y=Y_train, cv=10)
 accuracies.mean()
 accuracies.std()

Conclusion

In our regression analysis of the dataset, we found that some of the most significant attributes were grades in years 1 and 2, quality of family relationships, age, and the number of absences. The random forest regression with 500 trees turned out to be one of the best performing models with 87–88% accuracy. We also saw a strong linear relationship between the grade in year 3 with the grades in years 1 and 2.

The dataset can be found here:

https://archive.ics.uci.edu/ml/datasets/student+performance

P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5–12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978–9077381–39–7.

High School Math Performance Prediction (Classification)

Introduction

In the world of math education, one of the major issues that universities and educators have is that students do not succeed at mathematics at a satisfactory level and at a rate that is satisfactory.  Universities and educators complain of the high failure, drop, and withdrawal rates of their students.  This is a problem for students because low performance in math prevents them from pursuing their degrees and careers.  It is a problem for universities and educators because it means that the university or educator is not successfully teaching students, not retaining their students, and not satisfying the needs of their students—these problems hurt the profitability and attractiveness of the university and educator.

If we can gain some insights into what factors most contribute to or hurt student performance in math, we have the potential to solve the above-mentioned problems.  If we can produce predictive models that can predict whether a student will pass or fail, that can predict the numerical score of students on math assessments, and that can predict the overall strength and promise of a student, then universities and educators will be able to use these models to better place students at the appropriate level of competence, to better select students for admission, and to better understand the factors that can be improved upon to help students be successful.

In this paper, we will apply data science and machine learning techniques to a dataset representing the math performance of students from two Portuguese high schools.

In a previous article, which can be found at High School Math Performance Regression, I applied regression methods to the dataset to predict the value of G3.  In the present paper, I would like to separate the G3 scores into five classes and try to classify a student as falling into one of five classes depending on their G3 score.  This becomes a 5-class classification problem, and we can apply machine learning classification methods to this problem. 

Data Preparation

The data file was separated by semicolons rather than commas.  I replaced the semicolons with commas, copied and pasted everything into Notepad, and then converted the text file to a .csv file using the steps from the following link:

https://knowledgebase.constantcontact.com/articles/KnowledgeBase/6269-convert-a-text-file-to-an-excel-file?lang=en_US

Now, I have a nice csv file.

There are 30 attributes that include things like student age, parent’s education, parent’s job, weekly study time, number of absences, number of past class failures, etc.  There are grades for years 1, 2, and 3; these are denoted by G1, G2, and G3.  The grades range from 0-20.  G1 and G2 can be used as input features, and G3 will be the main target output. 

Some of the attributes are ordinal, some are binary yes-no, some are numeric, and some are nominal.  We do need to do some data preprocessing.  For the binary yes-no attributes, I will encode them using 0’s and 1’s.  I did this for schoolsup, famsup, paid, activities, nursery, higher, internet, and romantic.  The attributes famrel, freetime, goout, Dalc, Walc, and health are ordinal; the values for these range from 1 to 5.  The attributes Medu, Fedu, traveltime, studytime, failures are also ordinal; the values range from 0 to 4 or 1 to 4.  The attribute absences is a count attribute; the values range from 0 to 93.  The attributes sex, school, address, Pstatus, Mjob, Fjob, guardian, famsize, reason are nominal.  For nominal attributes, we can use one-hot encoding.  The attributes age, G1, G2, and G3 can be thought of as interval attributes. I one-hot encoded each nominal attribute, one at a time.  I exported the dataframe as a csv file each time, relabeling the columns as I go.  Finally, I reordered the columns.

Here is the python code:

import numpy as np
import pandas as pd
dataset = pd.read_csv('C:\\Users\\ricky\\Downloads\\studentmath.csv') 
X = dataset.iloc[:,:-1].values
Y = dataset.iloc[:,32].values
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
# Encoding binary yes-no attributes
X[:,15] = labelencoder_X.fit_transform(X[:,15])
X[:,16] = labelencoder_X.fit_transform(X[:,16])
X[:,17] = labelencoder_X.fit_transform(X[:,17])
X[:,18] = labelencoder_X.fit_transform(X[:,18])
X[:,19] = labelencoder_X.fit_transform(X[:,19])
X[:,20] = labelencoder_X.fit_transform(X[:,20])
X[:,21] = labelencoder_X.fit_transform(X[:,21])
X[:,22] = labelencoder_X.fit_transform(X[:,22])
# Encoding nominal attributes
X[:,0] = labelencoder_X.fit_transform(X[:,0])
X[:,1] = labelencoder_X.fit_transform(X[:,1])
X[:,3] = labelencoder_X.fit_transform(X[:,3])
X[:,4] = labelencoder_X.fit_transform(X[:,4])
X[:,5] = labelencoder_X.fit_transform(X[:,5])
X[:,8] = labelencoder_X.fit_transform(X[:,8])
X[:,9] = labelencoder_X.fit_transform(X[:,9])
X[:,10] = labelencoder_X.fit_transform(X[:,10])
X[:,11] = labelencoder_X.fit_transform(X[:,11])
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
from pandas import DataFrame
df = DataFrame(X)
export_csv = df.to_csv (r'C:\Users\Ricky\Downloads\highschoolmath.csv', index = None, header=True)

At this point, the final column of our dataset consists of integers for G3.  Scores 16-20 will form class 1, scores 14-15 will form class 2, scores 12-13 will form class 3, scores 10-11 will form class 4, and scores 0-9 will form class 5.  We can create a final column of classes 1-5 by converting each score to one of the classes. 

Here is the python code for doing it:

#Defining a function that converts G3 to one of the five classes
def filter_class(score):
    if score<10:
        return 5
    elif score<12:
        return 4
    elif score<14:
        return 3
    elif score<16:
        return 2
    else:
        return 1

#defining a new column called 'class' and dropping column 'G3'
dataset_trap['class'] = dataset_trap['G3'].apply(filter_class)
dataset_trap = dataset_trap.drop(['G3'], axis=1)

Now, our dataset is ready for us to apply classification methods on it.

Logistic Regression

We define X and Y using dataset_trap.  Then, we split the dataset into a training set and a test set, apply feature scaling to X_train and X_test, fit logistic regression to the training set, and predict the test set results. 

Here is the python code:

#Define X and Y using dataset_trap
X = dataset_trap.iloc[:,:-1].values
Y = dataset_trap.iloc[:,-1].values

#Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size = 0.2, random_state = 0)

#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)  # use the scaler fitted on the training set

#Fitting Logistic Regression to the Training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train,Y_train)

#Predicting the Test set results
Y_pred = classifier.predict(X_test)

Now, we have the predicted Y values for the test set Y values.  We can see how accurate our model is by looking at the confusion matrix:

The numbers in the diagonal of the confusion matrix count the number of correct classifications.  So, to find how accurate our model is, we would add the diagonal entries and divide by the total number of test set results, which is 79. 

Here is the python code for creating the confusion matrix:

#Making the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test, Y_pred)
cm.trace()/79

The accuracy of the model can be measured by the number of correct predictions divided by the total number of test set results.  In this case, the accuracy is 50%.  This is not very impressive.

K Nearest Neighbors

The k-nearest neighbors model was trained on our training set using the Euclidean distance and k=5 neighbors.  The python code is pretty much the same as the one for logistic regression except we replace the logistic regression model with the k nearest neighbors model.  Here is the full python code:

#Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#Importing the dataset
dataset = pd.read_csv("studentmathdummified.csv")

#Avoiding the dummy variable trap
#Dropping GP, Male, urban,LE3, Apart,mother_at_home, father_at_home, reason_course, guardian_other
dataset_trap = dataset.drop(dataset.columns[[0,2,4,6,8,10,15,20,26]],axis=1)

#Defining a function that converts G3 to one of the five classes
def filter_class(score):
    if score<10:
        return 5
    elif score<12:
        return 4
    elif score<14:
        return 3
    elif score<16:
        return 2
    else:
        return 1

#defining a new column called 'class' and dropping column 'G3'
dataset_trap['class'] = dataset_trap['G3'].apply(filter_class)
dataset_trap = dataset_trap.drop(['G3'], axis=1)

#Define X and Y using dataset_trap
X = dataset_trap.iloc[:,:-1].values
Y = dataset_trap.iloc[:,-1].values

#Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size = 0.2, random_state = 0)

#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)  # use the scaler fitted on the training set

#Fitting K nearest neighbors to the Training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p =2)
classifier.fit(X_train,Y_train)

#Predicting the Test set results
Y_pred = classifier.predict(X_test)

#Making the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test, Y_pred)
cm.trace()/79

The accuracy of our model is 28%, which is really bad.

Support Vector Machines

Support vector machines are designed to perform classification when there are only two classes.  However, there are ways to use SVMs when there are more than 2 classes, as in our case with five classes.  One way is the one-versus-one scheme: we build an SVM model for every possible pair of classes.  For K-class classification, there are K choose 2 pairs, so in our case there are 10 pairs of classes.  We build 10 support vector classifiers, apply all 10 to a given test point, and predict the class that receives the most votes.  The python code is the same as for previous classifiers except we replace the classifier with the support vector classifier:

#Fitting support vector classifier to the Training set using one-versus-one
from sklearn.svm import SVC
classifier = SVC(kernel='linear')
classifier.fit(X_train,Y_train)

The accuracy of our model is 62%.

Another way to use svm’s when there are more than 2 classes is to use one-versus-all scheme.  In this scheme, K models are built by pairing each class with the rest.  In our case 5 models are built.  To predict the class for a given test point, we apply all 5 models to the test point and choose the class for which the perpendicular distance of the test point from the maximal margin hyperplane of the class’s corresponding model is largest.  In other words, we choose the class for which the corresponding model most confidently classifies the test point as of that class.  Here’s the python code:

#Fitting support vector classifier to the Training set using one-versus-rest
from sklearn.svm import LinearSVC
classifier = LinearSVC()
classifier.fit(X_train,Y_train)

The accuracy of our model is 56%.

I tried using the rbf kernel and got 44% accuracy. 

#Fitting support vector classifier to the Training set using one-versus-one and rbf kernel
from sklearn.svm import SVC
classifier = SVC(kernel='rbf', random_state=0)
classifier.fit(X_train,Y_train)

The default regularization parameter C is 1.  I raised the regularization parameter to 4 and got an accuracy of 45.5%.  Raising the regularization parameter to 10 gives an accuracy of 48%.

#Fitting support vector classifier to the Training set using one-versus-one and rbf kernel and C=10
from sklearn.svm import SVC
classifier = SVC(kernel='rbf', C=10, random_state=0)
classifier.fit(X_train,Y_train)

Increasing the value of C means we’re being more strict about how many points are misclassified; it corresponds to having a smaller-margin separating hyperplane.  Put another way, by increasing C, we’re decreasing the leeway for violations of the margin.  Lowering the value of C corresponds to being more lenient with misclassifications; it corresponds to having a larger-margin separating hyperplane.  As we increased the value of C, we saw that the accuracy went up.
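As a quick (hypothetical) check, the different C values tried above can be compared in a loop on the same train/test split:

from sklearn.svm import SVC

for C in [1, 4, 10, 30]:
    clf = SVC(kernel='rbf', C=C, random_state=0).fit(X_train, Y_train)
    print(C, clf.score(X_test, Y_test))  # test-set accuracy for each value of C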

I also tried the sigmoid kernel, with C=30, and got a 62% accuracy.  Lowering the value of C to less than 30 or raising the value of C to higher than 30 gives poorer accuracy.

#Fitting support vector classifier to the Training set using one-versus-one and sigmoid kernel and C=30
from sklearn.svm import SVC
classifier = SVC(kernel='sigmoid', C=30, random_state=0)
classifier.fit(X_train,Y_train)

Decision Trees

In this method, we’re going to grow one tree.  The splitting criterion is chosen to be entropy, and feature scaling is not necessary.  When I applied the decision tree classifier to the test set, I got an accuracy of 72%.

#Fitting decision tree classifier to the Training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy')
classifier.fit(X_train,Y_train)

I set the minimum number of samples required to split an internal node to 10 and the minimum number of samples required to be at a leaf node to 5.  This improved the accuracy to 77%.

#Fitting decision tree classifier to the Training set with minimum number of samples required to split 10 and minimum number of leaf samples 5
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', min_samples_split=10, min_samples_leaf=5)
classifier.fit(X_train,Y_train)

Random Forests

In this method, we’re going to grow a bunch of trees.  The splitting criterion is chosen to be entropy, and feature scaling is not used.  When I applied the random forest classifier to the test set, using 10 trees, I got an accuracy of 62%:

#Fitting random forest classifier to the Training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(criterion='entropy', n_estimators=10)
classifier.fit(X_train,Y_train)

I set the minimum number of samples required to split an internal node to 10 and the minimum number of samples required to be at a leaf node to 5.  I also increased the number of trees to 100.  This improved the accuracy to 74%:

#Fitting random forest classifier to the Training set with 100 trees and minimum samples required to split 10 and minimum samples at a leaf 5
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(criterion='entropy', n_estimators=100, min_samples_split=10, min_samples_leaf=5)
classifier.fit(X_train,Y_train)

To improve accuracy even more, I set the max_features to None:

#Fitting random forest classifier to the Training set with 100 trees, min_samples_split=10, min_samples_leaf=5, and max_features=None
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(criterion='entropy', n_estimators=100, max_features=None, min_samples_split=10, min_samples_leaf=5)
classifier.fit(X_train,Y_train)

I got an accuracy of 78%.

Model Selection

In order to determine which model is the best, we will perform k-fold cross validation (k=10) for each model and pick the one that has the best accuracy.

For logistic regression, I got an accuracy of 57%.

For k-nearest neighbors with k=5, I got an accuracy of 43%.

For support vector classifier, one-versus-one, I got an accuracy of 64%.

For support vector classifier, one-versus-rest, I got an accuracy of 57%.

For support vector classifier, one-versus-one with rbf kernel, I got an accuracy of 53%.

For support vector classifier, one-versus-one with rbf kernel and C=10, I got an accuracy of 57%.

For support vector classifier, one-versus-one with sigmoid kernel and C=30, I got an accuracy of 60%.

For a single decision tree, I got an accuracy of 66%.

For a single decision tree with min_samples_split=10 and min_samples_leaf=5, I got an accuracy of 69%.

For random forest with 10 trees, I got an accuracy of 64%.

For random forest with 100 trees, min_samples_split=10, and min_samples_leaf=5, I got an accuracy of 70%.

For random forest with 100 trees, min_samples_split=10, min_samples_leaf=5, max_features=None, I got an accuracy of 76%.

Here is the python code for applying k-fold cross validation:

#Applying k-fold cross validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=classifier,X=X_train, y=Y_train, cv=10)
accuracies.mean()

Comparing the accuracies of each model, we see that random forests with 100 trees, min_samples_split=10, min_samples_leaf=5, max_features=None has the highest accuracy.  We might wonder whether entropy is the best criterion for splitting and whether 100 trees is the best number of trees to use; we might also wonder about what the best value for max_features is.  I performed a grid search for criterion among entropy and gini, for n_estimators among 10,100,500, for max_features among ‘auto’, None, ‘log2’, and 1.

#grid search
from sklearn.model_selection import GridSearchCV
parameters = [{'criterion':['entropy'],'n_estimators':[10,100,500],'max_features':['auto',None,'log2',1]},{'criterion':['gini'],'n_estimators':[10,100,500],'max_features':['auto',None,'log2',1]}]
grid_search=GridSearchCV(estimator = classifier, param_grid=parameters, scoring='accuracy',cv=10)
grid_search=grid_search.fit(X_train, Y_train)
best_accuracy=grid_search.best_score_
best_parameters=grid_search.best_params_

The result is a best accuracy of 76.9% and best parameters criterion=’gini’, max_features=None, and n_estimators=500.

For random forest with criterion=’gini’, 500 trees, min_samples_split=10, min_samples_leaf=5, max_features=None, I got an accuracy of 77.5%.  I increased min_samples_split to 50 and got an accuracy of 78.2%.
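For completeness, that final model could be refit as follows (a sketch using the reported best parameters):

from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(criterion='gini', n_estimators=500,
                                    max_features=None, min_samples_split=50,
                                    min_samples_leaf=5)
classifier.fit(X_train, Y_train)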

Conclusion

In this paper, we applied logistic regression, k-nearest neighbors, support vector classifiers, decision trees, and random forests to the 5-class classification problem of predicting which of five classes each student’s third year score would fall under. We found that the best performing model, among the ones we examined, is the random forest classifier with criterion=’gini’, 500 trees, min_samples_split=50, min_samples_leaf=5, max_features=None.  The accuracy achieved was 78.2%. 

In our regression analysis of the dataset in a previous paper, we found that some of the most significant attributes were grades in years 1 and 2, quality of family relationships, age, and the number of absences.  The random forest regression with 500 trees turned out to be one of the best performing models with 87-88% accuracy (R squared).  We also saw a strong linear relationship between the grade in year 3 with the grades in years 1 and 2. 

Whether or not the attributes G1, G2, quality of family relationships, age, and number of absences are always significant in every school, in every time period, and in every country is an open question.  Can the insights gathered here be generalized beyond the two Portuguese high schools we considered?  What other attributes, beside the ones we considered, might be significant in determining math performance?  These are open questions worth pursuing to further understand and resolve the issue of poor math performance.

The dataset can be found here:

https://archive.ics.uci.edu/ml/datasets/student+performance

P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.

Pneumonia Diagnosis using CNNs

Introduction

In healthcare, one of the major issues medical professionals face is correctly diagnosing patients' conditions and diseases.  A misdiagnosis is a problem for both the patient and the doctor.  The doctor fails to help the patient in the appropriate way, which can lead to malpractice lawsuits and hurt the doctor's practice.  The patient suffers by not receiving the proper treatment and risks greater harm from the condition that goes undetected; further, the patient may undergo unnecessary treatment and take unnecessary medications, costing time and money.

If we can correctly diagnose a patient’s condition, we have the potential to solve the above-mentioned problems.  If we can produce deep learning models that can classify whether a patient has a condition or not, that can determine which particular condition the patient has, and that can determine the severity of the condition, then medical professionals will be able to use these models to better diagnose their patients.  Accurate diagnosis can also be useful by allowing for timely treatment of a patient; being misdiagnosed can cause a delay in receiving the proper treatment.

In this paper, we will apply deep learning to a dataset of chest x-rays of pediatric patients from Guangzhou Women and Children's Medical Center, Guangzhou.

I would like to apply a convolutional neural network (CNN) to classify a patient as either having pneumonia or not having pneumonia.  This is a binary classification problem.  I would also like to apply CNNs to classify a patient as having bacterial pneumonia, viral pneumonia, or no pneumonia.  This is a 3-class classification problem.

[The normal chest X-ray (left panel) depicts clear lungs without any areas of abnormal opacification in the image. Bacterial pneumonia (middle) typically exhibits a focal lobar consolidation, in this case in the right upper lobe (white arrows), whereas viral pneumonia (right) manifests with a more diffuse “interstitial” pattern in both lungs. (Kermany et al, 2018)]

Apparently, bacterial pneumonia has areas of opacity concentrated in one lobe, whereas viral pneumonia has opaque areas spread more diffusely over both lungs.  (The right lung is divided into three lobes, and the left lung into two.)  It's certainly not obvious to me how to tell the difference; hopefully, deep learning can help us tell them apart.

Data Preparation

The dataset can be found here: https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia#IM-0007-0001.jpeg

The data comes in two folders, one for the training set and one for the test set.  The training set folder contains a folder of images for pneumonia cases and a folder of images for normal cases.  The training set consists of 5216 images total.  The test set folder contains a folder of images for pneumonia cases and a folder of images for normal cases.  The test set consists of 624 images total, approximately 10.68% of the total set of images.  Unlike the case in classical machine learning, we don’t have to worry about the various attributes of the dataset; in the case of convolutional neural networks, we just have a set of images. 
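As a quick sanity check on these counts, here is a minimal sketch that tallies the images in each folder; the directory layout ('chest_xraybinary/train' and 'chest_xraybinary/test', each with 'NORMAL' and 'PNEUMONIA' subfolders) is an assumption that mirrors the paths used in the code later on.

#Counting images per class in the training and test folders (sketch; paths are assumptions)
import os

for split in ['train', 'test']:
    for label in ['NORMAL', 'PNEUMONIA']:
        folder = os.path.join('chest_xraybinary', split, label)
        print(split, label, len(os.listdir(folder)))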

Convolutional Neural Networks

There is, however, some preparation of the images that is necessary before applying an artificial neural network.  The images need to be prepared using convolutional layers in a process called convolution.  There are several stages in this process—convolution operation, ReLU operation, pooling, and flattening; the end result is a vector that we can feed into an artificial neural network. 

Here is an image of a general CNN architecture:

[By Aphex34 – Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45679374]

During the convolution operation, various feature detectors are applied to the image, creating a stack of feature maps; this stack of feature maps is called a convolutional layer.  ReLU is applied to each feature map to enhance non-linearity.  During the pooling stage, also known as subsampling, we apply max-pooling (or some other type of pooling) to each feature map, creating smaller feature maps that preserve the relevant features of the image.  The resulting stack of pooled feature maps forms the pooling layer.

Once we get to the pooling layer, consisting of pooled feature maps, each pooled feature map is flattened into a vector and the resulting vectors are combined sequentially into one vector.  The entries of this vector are fed into the input units of the artificial neural network.  Thus, the entries of the flattened vector corresponding to one image are fed into the input units of the ANN.  (This is in contrast to ANNs used on a classical dataset, where the attributes of a single instance are fed into the input units of the ANN.)  The artificial neural network is then trained on the training set and tested on the test set.  Here is an image of a general ANN:

The ANN begins where it says ‘Fully connected’ in the diagram for the CNN architecture.  As you can see, a convolutional neural network is the combination of convolution and an ANN.

Building a CNN with Python

In order to build the CNN, we import Keras libraries and packages:

#Importing the Keras libraries and packages
from keras.models import Sequential
from keras.layers import Convolution2D
from keras.layers import MaxPooling2D
from keras.layers import Flatten
from keras.layers import Dense

The Sequential package is used to initialise the CNN.  The Convolution2D package is used to create the convolutional layers.  The MaxPooling2D package is used to create the pooled feature maps.  The Flatten package is used to flatten the stack of pooled feature maps into one vector that can be fed into the ANN.  The Dense package is used to add layers to the ANN.

Next, we initialize the CNN by creating an object of the Sequential class.  This object we will call ‘classifier’:

#Initialising the CNN
classifier = Sequential()

We’re going to add one convolutional layer of 32 feature maps by applying 32 filters (feature detectors) of dimension 3 by 3 to the input image.  We want our input images to have dimension 64 by 64 and to be treated as color images with 3 channels.  We also apply ReLU to each feature map using the activation function 'relu':

#Step 1 - Convolution
classifier.add(Convolution2D(32,(3,3),input_shape=(64,64,3),activation = 'relu'))

Now that we have our feature maps in the convolutional layer, we apply max-pooling to each feature map using a 2 by 2 grid.

#Step 2 - Pooling
classifier.add(MaxPooling2D((2,2)))

Now that we have a pooling layer consisting of pooled feature maps, we flatten each pooled feature map into a vector and combine all the resulting vectors sequentially into one giant vector.

#Step 3 - Flattening
classifier.add(Flatten())

Next, we’re going to add our artificial neural network.  First, we add a hidden layer of 128 units and use the activation function ‘relu’.  Second, we add the output layer consisting of one output unit and use the sigmoid function as the activation function; we use one output unit because our output is binary (either normal or pneumonia).

#Step 4 - Full connection
classifier.add(Dense(units=128,activation='relu'))
classifier.add(Dense(units=1, activation='sigmoid'))

Now, we need to compile the CNN.  We’re going to use ‘adam’ as the optimizer in stochastic gradient descent, binary cross-entropy for the loss function, and accuracy as the performance metric.

#Compiling the CNN
classifier.compile(optimizer = 'adam',loss='binary_crossentropy',metrics=['accuracy'])

Our training set and test set combined have a total of 5840 images; so, we're going to apply image augmentation to effectively increase the size of our training set while reducing overfitting (the test images are only rescaled).  We then fit the CNN to our augmented training set and evaluate it on the test set:

#Fitting the CNN to the images
from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
        rescale=1./255,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True)

test_datagen = ImageDataGenerator(rescale=1./255)

training_set = train_datagen.flow_from_directory(
        'chest_xraybinary/train',
        target_size=(64, 64),
        batch_size=32,
        class_mode='binary')

test_set = test_datagen.flow_from_directory(
        'chest_xraybinary/test',
        target_size=(64,64),
        batch_size=32,
        class_mode='binary')

classifier.fit_generator(
        training_set,
        epochs=25,
        validation_data=test_set)

After 25 epochs, I got an accuracy of 95% on the training set and 89% on the test set.

Evaluating, Improving, and Tuning the CNN

Previously, we built a CNN with one convolutional layer and one hidden layer.  This time, we’re going to add a second convolutional layer and see if it improves performance.  We simply add the following code after step 2—pooling:

#Adding a second convolutional layer
classifier.add(Convolution2D(32,(3,3),activation = 'relu'))
classifier.add(MaxPooling2D((2,2)))

After 25 epochs, we get an accuracy of 96% on the training set and 91.5% on the test set.

Next, in addition to having a second convolutional layer, we’re going to add a second hidden layer and see if it improves performance.  To add a second hidden layer, we simply duplicate the code for adding one hidden layer:

#Step 4 - Full connection
classifier.add(Dense(units=128,activation='relu'))
classifier.add(Dense(units=128,activation='relu'))
classifier.add(Dense(units=1, activation='sigmoid'))

After 25 epochs, we get an accuracy of 96% on the training set and 91.5% on the test set.  Adding a second hidden layer did not improve performance.

Distinguishing between Bacterial and Viral Pneumonia

Not only do we want to distinguish between normal and pneumonia x-rays, but we also want to distinguish between bacterial and viral pneumonia x-rays.  To do this, we split up the folder containing pneumonia cases into two folders, one for bacteria cases and one for virus cases.  Now we have a three-class classification problem where the classes are normal, bacteria, and virus.  Just as we used a CNN to solve the binary classification problem, we can use a CNN to solve the three-class classification problem.  The code stays the same with a few exceptions.  In the artificial neural network phase of the CNN, we change the number of output units from 1 to 3 and the output activation function from 'sigmoid' to 'softmax'.  When compiling the CNN, the loss function is changed from 'binary_crossentropy' to 'categorical_crossentropy'.  When fitting the CNN to the images, instead of using the folder 'chest_xraybinary', we use the folder 'chest_xray', which contains training and test set folders that each have three subfolders corresponding to the three classes.  The class_mode is changed from 'binary' to 'categorical'.
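Concretely, here is a minimal sketch of the lines that change for the three-class problem; the folder names under 'chest_xray' are assumed to mirror the binary case, and everything else in the CNN stays the same.

#Output layer for the 3-class problem: 3 units with softmax instead of 1 unit with sigmoid
classifier.add(Dense(units=3, activation='softmax'))

#Compiling with categorical cross-entropy
classifier.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

#Reading the images from the 3-class folders with categorical class mode (paths are assumptions)
training_set = train_datagen.flow_from_directory(
        'chest_xray/train',
        target_size=(64, 64),
        batch_size=32,
        class_mode='categorical')

test_set = test_datagen.flow_from_directory(
        'chest_xray/test',
        target_size=(64, 64),
        batch_size=32,
        class_mode='categorical')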

After 25 epochs, I got an accuracy of 80.64% on the training set and 83.33% on the test set.

A second convolutional layer was added; and, after 25 epochs, I got an accuracy of 81.33% on the training set and 85.9% on the test set.  This is a slight improvement in performance.

In addition to the second convolutional layer, a second hidden layer was added; and, after 25 epochs, I got an accuracy of 80.25% on the training set and 86.7% on the test set.  This is a slight improvement in performance on the test set.

In addition to the second convolutional layer and the second hidden layer, I changed the number of feature detectors in the second convolutional layer from 32 to 64.  After 25 epochs, I got an accuracy of 81% on the training set and 87% on the test set.  This is a slight improvement in test set performance from the CNN with two convolutional layers and two hidden layers.  However, it’s a decent improvement in test set performance from the CNN with one convolutional layer and one hidden layer.  I then increased the dimensions of the images fed into the CNN from 64 by 64 to 128 by 128; this resulted in an accuracy of 80.85% on the training set and 85.74% on the test set, a worse performance than what we’ve reached so far.

I brought the dimensions of the input images back down to 64 by 64 and added a third convolutional layer of 128 feature detectors.  This resulted in an accuracy of 81.54% on the training set and 87.34% on the test set, a very small improvement over what we had so far.  Keeping these settings, I ran the CNN for 200 epochs; the training set accuracy steadily increased to 96.32% while the test set accuracy fluctuated between the low and mid 80s, ending at 86.54%.
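For reference, here is a minimal sketch of the best-performing architecture described up to this point: three convolutional layers of 32, 64, and 128 feature detectors on 64-by-64 inputs, followed by two hidden layers of 128 units and a 3-unit softmax output.

#Sketch of the three-convolutional-layer architecture described above
classifier = Sequential()
classifier.add(Convolution2D(32, (3, 3), input_shape=(64, 64, 3), activation='relu'))
classifier.add(MaxPooling2D((2, 2)))
classifier.add(Convolution2D(64, (3, 3), activation='relu'))
classifier.add(MaxPooling2D((2, 2)))
classifier.add(Convolution2D(128, (3, 3), activation='relu'))
classifier.add(MaxPooling2D((2, 2)))
classifier.add(Flatten())
classifier.add(Dense(units=128, activation='relu'))
classifier.add(Dense(units=128, activation='relu'))
classifier.add(Dense(units=3, activation='softmax'))
classifier.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])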

Keeping three convolutional layers, I added more hidden layers for a total of 10 hidden layers; after 25 epochs, I got a training set accuracy of 79.54% and a test set accuracy of 83.33%.  So, adding more hidden layers didn’t lead to an improvement.

Conclusion

In this paper, we applied convolutional neural networks to the binary classification problem of determining which of two classes—normal or pneumonia—a chest x-ray falls under.  We found that a CNN with one convolutional layer and one hidden layer achieved an accuracy of 95% on the training set and an accuracy of 89% on the test set.  We then added a second convolutional layer to the CNN and achieved an accuracy of 96% on the training set and an accuracy of 91.5% on the test set; this improved performance by a little bit.  Next, in addition to having a second convolutional layer, we added a second hidden layer and achieved an accuracy of 96% on the training set and an accuracy of 91.5% on the test set; this did not improve performance.

We also applied convolutional neural networks to the three-class classification problem of determining which of three classes a chest x-ray falls under: normal, bacterial, or viral.  We found that a CNN with one convolutional layer and one hidden layer achieved an accuracy of 80.64% on the training set and 83.33% on the test set.  Adding a second convolutional layer raised this to 81.33% on the training set and 85.9% on the test set, a slight improvement.  Adding a second hidden layer as well gave 80.25% on the training set and 86.7% on the test set, a slight improvement on the test set but a slight decline on the training set.  On top of the second convolutional layer and the second hidden layer, changing the number of feature detectors in the second convolutional layer from 32 to 64 resulted, after 25 epochs, in an accuracy of 81% on the training set and 87% on the test set.  Adding a third convolutional layer of 128 feature detectors resulted in an accuracy of 81.54% on the training set and 87.34% on the test set.

Because we had a limited number of images in our total dataset, it was important to use image augmentation to provide more images to train our CNN; this lack of sufficient numbers of medical images seems to be a common problem for other medical issues besides pneumonia.  Therefore, image augmentation promises to be a useful tool in any case in which there aren’t sufficient numbers of images.

Given that, in certain parts of the world, there may be a shortage of trained professionals who can read and interpret x-rays, there is the potential for automated diagnosis based on a patient’s x-ray.

In the case of classical machine learning techniques, sometimes we are able to identify particular attributes that are significant in determining the output of the machine learning model.  In the case of convolutional neural networks, on the other hand, there is no set of attributes that we can identify as significant in determining the output of the CNN model; all we have are the images and their pixels.  Further, it’s difficult to understand, in an intuitive way, how the CNN model is making its classifications.

The CNN we’ve trained used chest x-rays of pediatric patients aged 1-5 from Guangzhou, China.  Can our CNN be applied to children of other ages, to children outside of China, or even to adults?  How might deep learning be used to detect the severity of a given case of pneumonia?  These are open questions worth pursuing to further understand how image classification can be used to make medical diagnoses.

The dataset can be found here:

https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia#IM-0007-0001.jpeg

Acknowledgements

Data: https://data.mendeley.com/datasets/rscbjbr9sj/2

License: CC BY 4.0

Citation: http://www.cell.com/cell/fulltext/S0092-8674(18)30154-5

References

Kermany et al. "Illustrative Examples of Chest X-Rays in Patients with Pneumonia." Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning, Cell 172, 1122-1131, 22 Feb. 2018, Elsevier Inc., https://doi.org/10.1016/j.cell.2018.02.010