EvoTrees.jl

A Julia implementation of boosted trees with CPU and GPU support. It provides efficient histogram-based algorithms and supports multiple loss functions, including various regression losses, multi-class classification and Gaussian maximum likelihood.

See the examples-API section to get started with the internal API, or examples-MLJ to use EvoTrees within the MLJ framework.

Complete details about hyper-parameters are found in the Models section.

An R binding is available.

Installation

Latest:

julia> using Pkg
julia> Pkg.add(url="https://github.com/Evovest/EvoTrees.jl")

From General Registry:

julia> Pkg.add("EvoTrees")

Quick start

A model configuration must first be defined using one of the model constructors, such as EvoTreeRegressor or EvoTreeClassifier.

Fitting is then performed with fit_evotree. Two broad types of input are supported: Matrix-based and Tables-based. Optional kwargs can be used to specify evaluation data on which to track an evaluation metric and perform early stopping. See the docs for more details on the hyper-parameters available for each of the above constructors and on other training options.

Predictions are obtained by passing features data to the model. The model acts as a functor, i.e. it is a struct that contains the fitted model and can also be called as a function to generate predictions for the features passed as argument.
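To make the functor idea concrete, here is a minimal sketch in plain Julia. The `ToyMeanModel` type and the `fit_toy` function are hypothetical names introduced only for illustration; they are not part of EvoTrees and say nothing about its internals.

```julia
# Toy illustration of a functor: a struct that stores fitted state
# and is also callable. `ToyMeanModel` is hypothetical, not an EvoTrees type.
struct ToyMeanModel
    mean::Float64   # the "fitted" parameter
end

# "Fitting": compute the stored state from the training target
fit_toy(y::AbstractVector{<:Real}) = ToyMeanModel(sum(y) / length(y))

# Make the struct callable: one prediction per row of the feature matrix
(m::ToyMeanModel)(x::AbstractMatrix) = fill(m.mean, size(x, 1))

y = [1.0, 2.0, 3.0]
x = rand(3, 2)
m = fit_toy(y)
preds = m(x)   # 3-element Vector{Float64}, each entry equal to 2.0
```

The same calling convention, `m(features)`, is what the EvoTrees examples in this document use to get predictions.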

Tables and DataFrames input

When using a Tables-compatible input such as a DataFrame, columns with element type Real (incl. Bool) and Categorical are automatically recognized as input features. Alternatively, the feature_names kwarg can be used to specify the features explicitly.

Categorical features are treated accordingly by the algorithm. Ordered variables are treated as numerical features and use a ≤ split rule, while unordered variables use ==. Support is currently limited to a maximum of 255 levels. Bool variables are treated as unordered categorical variables with 2 levels.

using EvoTrees
using DataFrames

config = EvoTreeRegressor(
    loss=:mse, 
    nrounds=100, 
    max_depth=6,
    nbins=32,
    eta=0.1)

x_train, y_train = rand(1_000, 10), rand(1_000)
dtrain = DataFrame(x_train, :auto)
dtrain.y .= y_train
m = fit_evotree(config, dtrain; target_name="y");
m = fit_evotree(config, dtrain; target_name="y", feature_names=["x1", "x3"]); # to only use specified features
preds = m(dtrain)
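The column types described in this section can be prepared with DataFrames and CategoricalArrays. The following is a small sketch of the data preparation only; the column names (`x_num`, `x_bool`, `x_cat`, `x_ord`, `y`) are made up for the example.

```julia
using DataFrames
using CategoricalArrays

df = DataFrame(
    x_num = [0.1, 0.5, 0.9, 0.3, 0.7, 0.2],                 # Real: numerical feature
    x_bool = [true, false, true, false, true, false],       # Bool: unordered, 2 levels
    x_cat = categorical(["a", "b", "c", "a", "b", "c"]),     # unordered categorical
    x_ord = categorical(
        ["low", "mid", "high", "low", "mid", "high"];
        levels=["low", "mid", "high"], ordered=true),       # ordered categorical
    y = [1.0, 2.0, 3.0, 1.5, 2.5, 3.5],
)

isordered(df.x_cat)                # false
isordered(df.x_ord)                # true
levels(df.x_ord)                   # ["low", "mid", "high"]
length(levels(df.x_cat)) <= 255    # true: within the supported number of levels
```

A table built this way can then be passed to fit_evotree with target_name="y", in the same way as dtrain in the example above.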

Matrix features input

using EvoTrees

config = EvoTreeRegressor(
    loss=:mse, 
    nrounds=100, 
    max_depth=6,
    nbins=32,
    eta=0.1)

x_train, y_train = rand(1_000, 10), rand(1_000)
m = fit_evotree(config; x_train, y_train)
preds = m(x_train)
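The optional evaluation data and early stopping mentioned in the quick start can be sketched as follows for Matrix inputs. The keyword names `x_eval`, `y_eval`, `metric`, `early_stopping_rounds` and `print_every_n` are assumptions based on the fit_evotree documentation for the versions that expose this function; check the docs of your installed version before relying on them.

```julia
using EvoTrees
using Random

Random.seed!(123)
x_train, y_train = rand(1_000, 10), rand(1_000)
x_eval, y_eval = rand(200, 10), rand(200)

config = EvoTreeRegressor(
    loss=:mse,
    nrounds=500,      # upper bound on rounds, early stopping may stop sooner
    max_depth=6,
    nbins=32,
    eta=0.1)

# Assumed kwargs: track :mse on the eval data and stop once it has not
# improved for 10 consecutive rounds
m = fit_evotree(config;
    x_train, y_train,
    x_eval, y_eval,
    metric=:mse,
    early_stopping_rounds=10,
    print_every_n=25)

preds_eval = m(x_eval)
```

Since both features and target are pure noise here, training is expected to stop early; with real data the number of rounds kept depends on how long the evaluation metric keeps improving.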

GPU Acceleration

EvoTrees supports training and inference on NVIDIA GPUs with CUDA.jl. Note that on Julia ≥ 1.9, CUDA support is only enabled when CUDA.jl is installed and loaded, either by another package or explicitly with e.g.

using CUDA

If running on a CUDA-enabled machine, training and inference on GPU can be triggered through the device kwarg passed to the learner's constructor:

config = EvoTreeRegressor(
    loss=:mse, 
    device=:gpu
)

m = fit_evotree(config, dtrain; target_name="y");
p = m(dtrain; device=:gpu)
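When the same script has to run on machines with and without a usable GPU, the device can be picked at runtime. This is a sketch that relies on `CUDA.functional()` from CUDA.jl and assumes that `:cpu` is accepted as the CPU counterpart of `:gpu` for the device kwarg.

```julia
using CUDA       # must be loaded for GPU support to be enabled
using EvoTrees
using DataFrames

# Assumption: `:cpu` is a valid value for `device`, mirroring `:gpu`
device = CUDA.functional() ? :gpu : :cpu

x_train, y_train = rand(1_000, 10), rand(1_000)
dtrain = DataFrame(x_train, :auto)
dtrain.y = y_train

config = EvoTreeRegressor(loss=:mse, device=device)
m = fit_evotree(config, dtrain; target_name="y")
p = m(dtrain; device=device)
```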

Reproducibility

EvoTrees models trained on CPU can be fully reproducible.

Models of the gradient boosting family typically involve some stochasticity. In EvoTrees, this primarily concerns the 2 subsampling parameters rowsample and colsample. The other stochastic operation happens at model initialisation, when the features are discretized into bins to allow for fast histogram construction: a random subsample of 1_000 * nbins observations is used to compute the breaking points.
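The binning step can be pictured with a short sketch based on quantiles of a random subsample. This is only an illustration of the idea: `sketch_bin_edges` is a hypothetical helper, and the use of evenly spaced quantiles is an assumption, not a description of the actual EvoTrees code.

```julia
using Random, Statistics

# Hypothetical helper: quantile-based bin edges computed on a random subsample
function sketch_bin_edges(rng::AbstractRNG, x::AbstractVector{<:Real}, nbins::Int)
    nsample = min(length(x), 1_000 * nbins)           # subsample size described above
    idx = randperm(rng, length(x))[1:nsample]         # random subsample without replacement
    probs = range(0, 1; length=nbins + 1)[2:end-1]    # nbins - 1 interior quantile levels
    return quantile(x[idx], probs)
end

x = randn(Xoshiro(1), 100_000)
edges_a = sketch_bin_edges(Xoshiro(123), x, 32)
edges_b = sketch_bin_edges(Xoshiro(123), x, 32)

length(edges_a)       # 31 breaking points for 32 bins
edges_a == edges_b    # true: same seed, same subsample, same edges
```

Because the subsample is drawn at random, the breaking points depend on the state of the random generator, which is why this step matters for reproducibility.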

These random parts of the algorithm can be deterministically reproduced on CPU by passing an rng to the model constructor. rng can be an integer (e.g. 123) or a random generator (e.g. Random.Xoshiro(123)). If no rng is specified, 123 is used by default. When an integer rng is used, a Random.MersenneTwister generator is created by the EvoTrees constructor. Otherwise, the provided random generator is used.

Consequently, the following m1 and m2 models will be identical:

config = EvoTreeRegressor(rowsample=0.5, rng=123)
m1 = fit_evotree(config, dtrain; target_name="y");
config = EvoTreeRegressor(rowsample=0.5, rng=123)
m2 = fit_evotree(config, dtrain; target_name="y");

However, the following m1 and m2 models won't be identical, because rowsample introduces stochasticity and the random generator held in the config isn't reset between the two fits:

config = EvoTreeRegressor(rowsample=0.5, rng=123)
m1 = fit_evotree(config, dtrain; target_name="y");
m2 = fit_evotree(config, dtrain; target_name="y");
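The difference between the two cases above comes down to the state of the random generator. The sketch below reproduces it with the Random standard library only; `subsample_rows` is a hypothetical stand-in for the row subsampling done during a fit, not an EvoTrees function.

```julia
using Random

# Hypothetical stand-in for a stochastic fit: draws a row subsample from `rng`
subsample_rows(rng::AbstractRNG, nobs::Int, rowsample::Float64) =
    sort(randperm(rng, nobs)[1:round(Int, rowsample * nobs)])

# A fresh generator with the same seed for each "fit": identical subsamples
rows_1 = subsample_rows(MersenneTwister(123), 1_000, 0.5)
rows_2 = subsample_rows(MersenneTwister(123), 1_000, 0.5)
rows_1 == rows_2   # true

# One generator shared by two consecutive "fits"
rng = MersenneTwister(123)
rows_3 = subsample_rows(rng, 1_000, 0.5)   # starts from the same state as rows_1
rows_4 = subsample_rows(rng, 1_000, 0.5)   # starts from an advanced state
rows_3 == rows_1   # true
```

`rows_4` is drawn after the generator has already been consumed once, so in practice it differs from `rows_3`. Re-creating the config with the same rng before each fit, as in the first example, avoids this.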

Note that in the presence of multiple identical or very highly correlated features, a model may not be reproducible if the features are permuted: when 2 features provide identical gains, the first one is selected. Therefore, if the identity relationship doesn't hold on new data, models trained on different feature orders will return different predictions.
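The effect of feature order on ties can be shown with a toy example. The gain values and feature names below are made up, and `argmax` is used only to mimic a "first best wins" rule; it is not taken from the EvoTrees source.

```julia
# Hypothetical split gains: "x2" and "x2_copy" are identical features
gains = Dict("x1" => 0.2, "x2" => 0.7, "x2_copy" => 0.7)

order_a = ["x1", "x2", "x2_copy"]
order_b = ["x1", "x2_copy", "x2"]   # same features, permuted

# argmax returns the index of the first maximal element
best_a = order_a[argmax([gains[f] for f in order_a])]   # "x2"
best_b = order_b[argmax([gains[f] for f in order_b])]   # "x2_copy"
```

Both choices are equally good on the training data, but the two models split on different columns, so their predictions diverge as soon as `x2` and `x2_copy` stop being identical.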

At the moment, there's no reproducibility guarantee on GPU, although this may change in the future.

Missing values

Features

EvoTrees does not handle features having missing values. Proper preprocessing of the data is therefore needed (and a general good practice regardless of the ML model used).

This includes situations where values may be all non-missing, but where the eltype is Union{Missing,Float64} or Any for example. A conversion using identity is then recommended:

julia> x = Vector{Union{Missing, Float64}}([1, 2])
2-element Vector{Union{Missing, Float64}}:
 1.0
 2.0

julia> identity.(x)
2-element Vector{Float64}:
 1.0
 2.0
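For a whole table, the same narrowing can be applied column by column. A short sketch using `disallowmissing!` from DataFrames for `Union{Missing,T}` columns and `identity` for an `Any` column; the column names are made up for the example.

```julia
using DataFrames

df = DataFrame(
    a = Union{Missing,Float64}[1.0, 2.0],   # no missing values, but a Union eltype
    b = Any[1, 2],                          # concrete values stored as Any
)

disallowmissing!(df)        # narrows Union{Missing,T} columns that hold no missing
df.b = identity.(df.b)      # narrows the Any column to its concrete type

eltype(df.a)   # Float64
eltype(df.b)   # Int64
```

Note that `disallowmissing!` throws an error if a column still contains missing values, which makes it a convenient check before training.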

For dealing with numerical or ordered categorical features containing missing values, a common approach is to first create a Bool variable capturing whether a value is missing:

using DataFrames
transform!(df, :my_feat => ByRow(ismissing) => :my_feat_ismissing)

Then, the missing values can be imputed (replaced by some default values such as mean or median, or using a more sophisticated approach such as predictions from another model):

using Statistics # provides median
transform!(df, :my_feat => (x -> coalesce.(x, median(skipmissing(x)))) => :my_feat)
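Put together on a small table, the two steps look like this. The data is made up for the example; the median of the non-missing values 1.0, 3.0 and 5.0 is 3.0.

```julia
using DataFrames
using Statistics

df = DataFrame(my_feat = [1.0, missing, 3.0, 5.0])

# Step 1: keep track of which values were missing
transform!(df, :my_feat => ByRow(ismissing) => :my_feat_ismissing)

# Step 2: impute missing values with the median of the observed ones
transform!(df, :my_feat => (x -> coalesce.(x, median(skipmissing(x)))) => :my_feat)

df.my_feat             # [1.0, 3.0, 3.0, 5.0]
df.my_feat_ismissing   # [false, true, false, false]
eltype(df.my_feat)     # Float64
```

The indicator must be created before the imputation, otherwise the information about which rows were missing is lost.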

For unordered categorical variables, recoding missing into a non-missing level is sufficient:

julia> using CategoricalArrays

julia> x = categorical(["a", "b", missing])
3-element CategoricalArray{Union{Missing, String},1,UInt32}:
 "a"
 "b"
 missing

julia> x = recode(x, missing => "missing value")
3-element CategoricalArray{String,1,UInt32}:
 "a"
 "b"
 "missing value"

Target

The target variable must have an element type <:Real. The only exception is EvoTreeClassifier, for which CategoricalValue, Integer, String and Char are supported.
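As an illustration of the classifier exception, here is a minimal sketch with a String target and Matrix features. The data is random and the hyper-parameter values are arbitrary; it assumes fit_evotree accepts an EvoTreeClassifier config in the same way as the regression examples in this document.

```julia
using EvoTrees

x_train = rand(300, 5)
y_train = rand(["a", "b", "c"], 300)     # String target, 3 classes

config = EvoTreeClassifier(nrounds=20, max_depth=4)
m = fit_evotree(config; x_train, y_train)
p = m(x_train)    # predictions for the 300 training rows
```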

Save/Load

EvoTrees.save(m, "data/model.bson")
m = EvoTrees.load("data/model.bson");