Logistic Regression on Titanic Dataset
We will use the Titanic dataset, which is included in the MLDatasets package. It describes the survival status of individual passengers on the Titanic. The model will be approached as a logistic regression problem, although a Classifier model could also have been used (see the Classification - Iris tutorial).
Getting started
To begin, we will load the required packages and the dataset:
using EvoTrees
using MLDatasets
using DataFrames
using Statistics: mean, median
using CategoricalArrays
using Random
df = MLDatasets.Titanic().dataframe
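To get a first look at the data, we can inspect a few rows and per-column summaries; describe in particular reports element types and missing-value counts (output omitted here):
# quick look at the raw data
first(df, 5)
describe(df)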
Preprocessing
A first step in data processing is to prepare the input features in a model-compatible format. EvoTrees' Tables API supports inputs that are either Real (incl. Bool) or Categorical. Bool variables are treated as unordered, two-level categorical variables. A recommended approach for String features such as Sex is to convert them into an unordered Categorical.
For features with missing values such as Age, a common approach is to first create a Bool indicator variable capturing whether a value is missing. The missing values can then be imputed (replaced by a default value such as the mean or median, or by a more sophisticated approach such as predictions from another model).
# convert string feature to Categorical
transform!(df, :Sex => categorical => :Sex)
# flag missing Age values, then impute them with the median
transform!(df, :Age => ByRow(ismissing) => :Age_ismissing)
transform!(df, :Age => (x -> coalesce.(x, median(skipmissing(x)))) => :Age);
# remove unneeded variables
df = df[:, Not([:PassengerId, :Name, :Embarked, :Cabin, :Ticket])]
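As a quick sanity check (not part of the original pipeline), we can confirm that the imputation left no missing Age values and that Sex is now a two-level categorical:
# sanity check: no missing Age values remain; Sex has two categorical levels
@assert !any(ismissing, df.Age)
levels(df.Sex)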
The full dataset can now be split into train and eval subsets. The target and feature names are also set.
Random.seed!(123)
train_ratio = 0.8
train_indices = randperm(nrow(df))[1:Int(round(train_ratio * nrow(df)))]
dtrain = df[train_indices, :]
deval = df[setdiff(1:nrow(df), train_indices), :]
target_name = "Survived"
fnames = setdiff(names(df), [target_name])
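Optionally, we can verify that the split keeps both subsets at a reasonable size and with a similar survival rate (a quick check, not required for training):
# check split sizes and class balance of the target
nrow(dtrain), nrow(deval)
mean(dtrain.Survived), mean(deval.Survived)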
Training
Now we are ready to train our model. We first define a model configuration using the EvoTreeRegressor constructor. Then, we use fit_evotree to train a boosted tree model. We pass the optional deval argument, which enables tracking of an evaluation metric and early stopping.
config = EvoTreeRegressor(
loss=:logistic,
nrounds=200,
eta=0.05,
nbins=128,
max_depth=5,
rowsample=0.5,
colsample=0.9)
model = fit_evotree(
config, dtrain;
deval,
target_name,
fnames,
metric = :logloss,
early_stopping_rounds=10,
print_every_n=10)
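If the fitted model needs to be reused later, EvoTrees provides BSON-based save and load helpers; a minimal sketch (the file path is illustrative):
# persist the fitted model and reload it (path is illustrative)
EvoTrees.save(model, "titanic_model.bson")
model = EvoTrees.load("titanic_model.bson");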
Diagnosis
We can get predictions by passing the training and eval data to our model. We can then evaluate the accuracy of our model, which should be around 85%.
pred_train = model(dtrain)
pred_eval = model(deval)
julia> mean((pred_train .> 0.5) .== dtrain[!, target_name])
0.8821879382889201
julia> mean((pred_eval .> 0.5) .== deval[!, target_name])
0.8426966292134831
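As an additional check, the eval logloss tracked during training can be recomputed by hand from the predicted probabilities, using the standard binary logloss formula (a minimal sketch):
# manual binary logloss on the eval set
y = deval[!, target_name]
p = pred_eval
mean(-(y .* log.(p) .+ (1 .- y) .* log.(1 .- p)))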
Finally, feature importance can be inspected using EvoTrees.importance.
julia> EvoTrees.importance(model)
7-element Vector{Pair{String, Float64}}:
"Sex" => 0.29612654189959403
"Age" => 0.25487324307720827
"Fare" => 0.2530947969323613
"Pclass" => 0.11354283043193575
"SibSp" => 0.05129209383816148
"Parch" => 0.017385183317069588
"Age_ismissing" => 0.013685310503669728