Hello, Machine Learning

Introduction

This is a Hello, World type of project for getting acquainted with machine learning. It examines the iris dataset.

References:

  1. Your First Machine Learning Project in R Step-By-Step.
  2. The caret package. Package home page by Max Kuhn.

Approach

I began with this outline.

  • Define problem.
  • Prepare data.
  • Evaluate algorithms.
  • Improve results.
  • Present results.

Structured write-up

I revised the presentation per these questions, suggested by Data Science Weekly.

  1. What question are you looking to answer?
  2. Why does this question matter?
  3. What data did you use?
  4. Where did you get the data?
  5. How was the data sampled?
  6. How was the data obtained?
  7. How did you explore the data?
  8. How did you model the data?
  9. Why did you choose to model it that way?
  10. What code did you write or use?
  11. How did you fit the model?
  12. How did you validate the model?
  13. How do you know the results make sense?
  14. How did you visualize the results?
  15. How would you communicate the results to others?
  16. What did you learn?
  17. What would you do differently if you did this project again?
  18. If you were going to continue this work, what next steps would you take with this project?
  19. How would you explain what you did to a data scientist?
  20. How would you explain what you did to a non-data scientist?

Reference: How You Should Create A Data Science Portfolio That Will Get You Hired.

Dependencies

caret needs these suggested packages for this analysis. They do not need to be loaded as libraries, but caret calls them, so they must be installed.

  • ellipse. Plotting ellipses in featurePlot().
  • randomForest. Fitting the random forest model (method="rf").
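
If any of these are missing, a minimal install sketch (using the package names listed above) is:

# Install caret and the suggested packages it calls, if they are not already present.
for (pkg in c("caret", "ellipse", "randomForest")) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
}

caret itself does need to be loaded:
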
library(caret)

Objectives

In this project, I examine measures of parts of the iris flower and use those measures to predict its species. The motivation for the analysis is to demonstrate supervised classification, in which machine learning models learn to map predictor variables to known classes. I code the workflow in a design on which I can base other such analyses.

Data

Edgar Anderson’s iris dataset gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of three species of iris. The species are iris setosa, versicolor, and virginica. Anderson collected the data in 1935.

The iris dataset appears in the package datasets in the base installation of R. It is readily accessible and it is clean, so no preparation or pre-processing is required, making it convenient for demonstrating data analyses.

  • Predictor variables in the iris data are numeric and in a consistent unit and scale (centimeters), requiring no conversion.
  • Iris observations can be classified by species, so we can run classification models on it.
  • Species is given for each iris observation. Therefore, this dataset is labelled, enabling us to split the dataset for machine learning whereby one partition serves to train the models while holding out the other partition for testing predictions and recording their accuracy against the known label.
  • There are multiple species in the dataset, so it is a multi-class, or multinomial, classification problem.

Citation:

Anderson, Edgar (1935). The irises of the Gaspé Peninsula. Bulletin of the American Iris Society, 59, 2–5.

Modeling

The caret package (which stands for Classification and Regression Training) provides a consistent interface to hundreds of machine learning algorithms, along with convenience methods for data visualization, data resampling, model tuning, and model comparison, among other features. It is a must-have tool for machine learning projects in R.

In this analysis, I evaluate 5 different algorithms:

  • Linear Discriminant Analysis (LDA).
  • Classification and Regression Trees (CART).
  • k-Nearest Neighbors (kNN).
  • Support Vector Machines (SVM) with a radial basis function (RBF) kernel.
  • Random Forest (RF).

This is a good mixture of simple linear (LDA), nonlinear (CART, kNN) and complex nonlinear methods (SVM, RF).

Load data

I use a generic name for the dataset so I can reuse my code.

data(iris)
dataset <- iris
rm(iris)
gc()
##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 1896473 101.3    3792455 202.6  2468118 131.9
## Vcells 3189048  24.4    8388608  64.0  5239337  40.0

Create validation dataset

I split the dataset into partitions for training the models and validating the predictions. The training set contains 80% of the original dataset; I hold out the remaining 20% for validation. Since the partitioning is a random sample, I set the random seed so that these results can be reproduced on each execution.

# Create a list of 80% of the rows in the original dataset we can
# use for training.
set.seed(7)
training_index <- createDataPartition(dataset$Species, p=0.80, list=FALSE)
# Select 20% of the data for validation.
validation <- dataset[-training_index,]
# Use the remaining 80% of the data for training and testing the models.
dataset <- dataset[training_index,]

Exploratory analysis

Summarize the data and examine it.

  • Dimensions of the dataset.
  • Types of the attributes.
  • Peek at the data itself.
  • Levels of the class attribute.
  • Breakdown of the instances in each class.
  • Statistical summary of all attributes.
# Dimensions of the dataset.
dim(dataset)
## [1] 120   5
# List types for each attribute.
sapply(dataset, class)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##    "numeric"    "numeric"    "numeric"    "numeric"     "factor"
str(dataset)
## 'data.frame':    120 obs. of  5 variables:
##  $ Sepal.Length: num  4.9 4.7 4.6 5 5.4 5 4.4 5.4 4.8 4.8 ...
##  $ Sepal.Width : num  3 3.2 3.1 3.6 3.9 3.4 2.9 3.7 3.4 3 ...
##  $ Petal.Length: num  1.4 1.3 1.5 1.4 1.7 1.5 1.4 1.5 1.6 1.4 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.4 0.2 0.2 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# Take a peek at the first 5 rows of the data.
head(dataset)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
## 8          5.0         3.4          1.5         0.2  setosa
# List the levels for the class factor.
levels(dataset$Species)
## [1] "setosa"     "versicolor" "virginica"
# Summarize the class distribution.
percentage <- prop.table(table(dataset$Species)) * 100
cbind(freq=table(dataset$Species), percentage=percentage)
##            freq percentage
## setosa       40   33.33333
## versicolor   40   33.33333
## virginica    40   33.33333
# Summarize attribute distributions.
summary(dataset)
##   Sepal.Length    Sepal.Width    Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.00   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.80   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.00   Median :4.250   Median :1.300  
##  Mean   :5.845   Mean   :3.04   Mean   :3.756   Mean   :1.192  
##  3rd Qu.:6.425   3rd Qu.:3.30   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.40   Max.   :6.700   Max.   :2.500  
##        Species  
##  setosa    :40  
##  versicolor:40  
##  virginica :40  
##                 
##                 
## 

Visualize

We are going to look at two types of plots:

  1. Univariate plots, to better understand each attribute.
  2. Multivariate plots, to better understand the relationships between attributes.

Univariate

It is helpful with visualization to have a way to refer to just the input attributes and just the output attribute. Let’s set that up and call the input attributes x and the output attribute (or class) y.

# Split input and output.
x <- dataset[,1:4]
y <- dataset[,5]

Given that the input variables are numeric, we can create box-and-whisker plots of each. We also present the distributions as histograms.

# Save the plotting environment for restoring it later.
# Note that we only return parameters we can reset.
par.old <- par(no.readonly = TRUE)

# Boxplot for each attribute on one image.
par(mfrow = c(1, 4))
for (i in 1:4) {
  boxplot(x[, i], main = names(x)[i])
}

# Histogram for each attribute.
# Identify numeric variables.
is_col_numeric <- sapply(dataset, class) == "numeric"
# Reset the plotting environment.
par(par.old)
# Plot.
for (i in 1:sum(is_col_numeric)) {
    hist(x[, i], main = names(x)[i], xlab = "Units")
}

We can also create a barplot of the Species class variable to get a graphical representation of the class distribution (generally uninteresting in this case because the classes are evenly distributed).

# Barplot for class breakdown.
plot(y)

Multivariate

Now we can look at the interactions between the variables.

First let’s look at scatterplots of all pairs of attributes and color the points by class. In addition, because the scatterplots show that points for each class are generally separate, we can draw ellipses around them.

featurePlot() comes from caret and it is a wrapper for lattice plotting of predictor variables.

# scatterplot matrix
featurePlot(x=x, y=y, plot="ellipse")

We can also look at box and whiskers plots of each input variable again, but this time broken down into separate plots for each class. This can help to tease out obvious linear separations between the classes.

This is useful to see that there are clearly different distributions of the attributes for each class value.

# Box and whiskers plots for each attribute.
featurePlot(x=x, y=y, plot="box")

Next we can get an idea of the distribution of each attribute, again like the box and whiskers plots, broken down by class value. Sometimes histograms are good for this, but in this case we will use some probability density plots to give nice smooth lines for each distribution.

# Density plots for each attribute by class value.
scales <- list(x=list(relation="free"), y=list(relation="free"))
featurePlot(x=x, y=y, plot="density", scales=scales)

Evaluate algorithms

Now it is time to create some models of the data and estimate their accuracy on unseen data.

Here is what we are going to cover in this step:

  • Set up the test harness to use 10-fold cross-validation.
  • Build 5 different models to predict species from flower measurements.
  • Select the best model.

Test harness

We will run 10-fold cross-validation to estimate accuracy.

This splits the dataset into 10 parts; each model is trained on 9 parts and tested on the remaining 1, repeating over all combinations of train-test splits. Repeating the whole procedure with different splits would give an even more stable estimate, but this analysis uses a single pass of 10-fold cross-validation, matching the trainControl() call below.

# Run algorithms using 10-fold cross validation.
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
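
If a more stable estimate were wanted, caret can also repeat the whole cross-validation. A sketch of that alternative (not used in this analysis; control_repeated is just an illustrative name):

# Optional: repeat the 10-fold cross-validation 3 times with different splits.
control_repeated <- trainControl(method="repeatedcv", number=10, repeats=3)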

We are using the “Accuracy” metric to evaluate models: the number of correctly predicted instances divided by the total number of instances, multiplied by 100 to give a percentage (e.g. 95% accurate). We will use the metric variable when we build and evaluate each model next.
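
As a small worked illustration with made-up labels (not output from the models below):

# Accuracy by hand: the share of predictions that match the true labels.
predicted <- c("setosa", "setosa", "virginica", "virginica")
actual    <- c("setosa", "versicolor", "virginica", "virginica")
mean(predicted == actual) * 100  # 75, i.e. 75% accurate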

Build models

We don’t know which algorithms would be good on this problem or what configurations to use. We get an idea from the plots that some of the classes are partially linearly separable in some dimensions, so we are expecting generally good results.

Our five algorithms:

  1. Linear Discriminant Analysis (LDA).
  2. Classification and Regression Trees (CART).
  3. k-Nearest Neighbors (kNN).
  4. Support Vector Machines (SVM) with a radial basis function (RBF) kernel.
  5. Random Forest (RF).

This is our mixture of simple linear (LDA), nonlinear (CART, kNN) and complex nonlinear methods (SVM, RF). We reset the random number seed before each run to ensure that the evaluation of each algorithm is performed using exactly the same data splits, which makes the results directly comparable.

Build our five models.

# Linear algorithm.
set.seed(7)
fit.lda <- train(Species~., data=dataset, method="lda", metric=metric, trControl=control)
# Nonlinear algorithms.
# CART.
set.seed(7)
fit.cart <- train(Species~., data=dataset, method="rpart", metric=metric, trControl=control)
# kNN.
set.seed(7)
fit.knn <- train(Species~., data=dataset, method="knn", metric=metric, trControl=control)
# Advanced algorithms.
# SVM.
set.seed(7)
fit.svm <- train(Species~., data=dataset, method="svmRadial", metric=metric, trControl=control)
# Random Forest.
set.seed(7)
fit.rf <- train(Species~., data=dataset, method="rf", metric=metric, trControl=control)

Select best model

We now have five models and accuracy estimations for each. We need to compare the models to each other and select the most accurate.

We can report the accuracy of each model by first collecting the fitted models in a list, passing them to resamples(), and calling summary() on the result.

We can see the accuracy of each classifier and also other metrics like Kappa.

# summarize accuracy of models
results <- resamples(list(lda=fit.lda, cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: lda, cart, knn, svm, rf 
## Number of resamples: 10 
## 
## Accuracy 
##           Min.   1st Qu.    Median      Mean   3rd Qu. Max. NA's
## lda  0.8333333 1.0000000 1.0000000 0.9750000 1.0000000    1    0
## cart 0.8333333 0.8541667 0.9166667 0.9166667 0.9791667    1    0
## knn  0.8333333 0.9375000 1.0000000 0.9666667 1.0000000    1    0
## svm  0.8333333 0.9166667 0.9166667 0.9333333 1.0000000    1    0
## rf   0.8333333 0.9166667 1.0000000 0.9500000 1.0000000    1    0
## 
## Kappa 
##      Min. 1st Qu. Median   Mean 3rd Qu. Max. NA's
## lda  0.75 1.00000  1.000 0.9625 1.00000    1    0
## cart 0.75 0.78125  0.875 0.8750 0.96875    1    0
## knn  0.75 0.90625  1.000 0.9500 1.00000    1    0
## svm  0.75 0.87500  0.875 0.9000 1.00000    1    0
## rf   0.75 0.87500  1.000 0.9250 1.00000    1    0

We can also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (10-fold cross-validation).

# compare accuracy of models
dotplot(results)

Summarize the most accurate model, LDA.

# summarize Best Model
print(fit.lda)
## Linear Discriminant Analysis 
## 
## 120 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ... 
## Resampling results:
## 
##   Accuracy  Kappa 
##   0.975     0.9625

Predict

Linear Discriminant Analysis (LDA) was the most accurate model. Now we want to get an idea of the accuracy of the model on our validation set.

This gives us an independent final check on the accuracy of the best model. It is valuable to keep a validation set in case we made a slip during training, such as overfitting to the training set or a data leak, either of which would produce an overly optimistic result.

We can run the model directly on the validation set and summarize the results in a confusion matrix.

# estimate skill of LDA on the validation dataset
set.seed(7)
predictions <- predict(fit.lda, validation)
confusionMatrix(predictions, validation$Species)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         10         0
##   virginica       0          0        10
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.8843, 1)
##     No Information Rate : 0.3333     
##     P-Value [Acc > NIR] : 4.857e-15  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           1.0000
## Specificity                 1.0000            1.0000           1.0000
## Pos Pred Value              1.0000            1.0000           1.0000
## Neg Pred Value              1.0000            1.0000           1.0000
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.3333
## Detection Prevalence        0.3333            0.3333           0.3333
## Balanced Accuracy           1.0000            1.0000           1.0000

We can see that the accuracy is 100%. The validation dataset was small (20% of the data), but this result is within our expected margin of 97% +/- 4%, suggesting the model is both accurate and reliable.
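
As a rough check on that margin, we can read the per-fold cross-validation accuracies that caret stores on the fitted model (a sketch, assuming the fit.lda object from the training step):

# Mean and spread of the per-fold cross-validation accuracy for LDA.
mean(fit.lda$resample$Accuracy)
sd(fit.lda$resample$Accuracy)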

Conclusions

Non-technical

Given key measures of an iris, we are able to predict its species with an estimated accuracy of 97.5%.

The measures we consider are:

  • Sepal length.
  • Sepal width.
  • Petal length.
  • Petal width.

The species predicted by this model are:

  • Setosa.
  • Versicolor.
  • Virginica.

If prediction were required for other species, a new dataset including those species would be necessary for training another predictive model.

Technical

I evaluated multiple predictive models for the species of an iris. This was a supervised, multi-class classification machine learning process. I arrived at a model whose estimated accuracy is 97.5%.

The predictors are:

  • Sepal length.
  • Sepal width.
  • Petal length.
  • Petal width.

The species predicted by this model are:

  • Setosa.
  • Versicolor.
  • Virginica.

I evaluated the accuracy of five machine learning models, spanning simple linear, nonlinear, and complex nonlinear methods. The superior predictive model for this dataset, established through cross-validation, was Linear Discriminant Analysis.

Retrospective

Lessons learned.

  • Produced a machine learning project template.
  • Saw an example of a dataset well suited to machine learning. It gives me a picture of a clean dataset and of what to aim for when working with a less ideal one.
  • Reminded me of the distinction between models that are simple linear, nonlinear, and complex nonlinear.
  • Introduced cross-validation, evaluated by accuracy.
  • Demonstrated validation on a hold-out set for a final check on accuracy.
  • You can get all the data types of a data frame with sapply().
  • Demystified the caret package and started using it.
    • createDataPartition(). Easy way to hold out test data.
    • featurePlot(). Wrapper for lattice plotting of predictor variables; includes ellipse, box, and density plots.
    • trainControl(). Control computational nuances of the train() function. Easy for setting up cross validation.
    • resamples(). Collects resampling results for analysis and visualization.
  • Visualization.
    • Create variables for input and output attributes to make plotting convenient.
    • An approach for univariate and multivariate visualizations. Boxplots and histograms in the first case, scatterplots otherwise. caret::featurePlot() helps.
  • Plotting.
    • Save parameter settings and restore them.
    • lattice::dotplot(). A Cleveland dot plot, a bivariate trellis plot.

Further research

If I were to continue this project, further research could include:

  • Tuning of the models.
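
As one illustrative direction, caret’s tuneLength argument widens the tuning grid searched during cross-validation. A sketch for kNN (not run in this project; fit.knn.tuned is just an illustrative name):

# Sketch: search a larger grid of k values for kNN.
set.seed(7)
fit.knn.tuned <- train(Species~., data=dataset, method="knn",
                       metric=metric, trControl=control, tuneLength=10)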

Improvements

These are some ideas I would consider if I were to do this project again, or another similar one.

Study

  • The caret package.

Models

  • Review the models used here in Data Science for Business.