17 Deep learning model tuning
Configurations to run this chapter
# load packages "torch" and "luz"
library(torch)
library(luz)
torch::install_torch()
# load packages "sits" and "sitsdata"
library(sits)
library(sitsdata)
# set tempdir if it does not exist
tempdir_r <- "~/sitsbook/tempdir/R/cl_tuning"
dir.create(tempdir_r, showWarnings = FALSE, recursive = TRUE)
# load "pysits" library
from pysits import *
from pathlib import Path
# set tempdir if it does not exist
tempdir_py = Path.home() / "sitsbook/tempdir/Python/cl_tuning"
tempdir_py.mkdir(parents=True, exist_ok=True)
17.1 Introduction
Model tuning is the process of selecting the best set of hyperparameters for a specific application. When using deep learning models for image classification, it is a highly recommended step to enable a better fit of the algorithm to the training data. Hyperparameters are parameters of the model that are not learned during training but instead are set prior to training and affect the behavior of the model during training. Examples include the learning rate, batch size, number of epochs, number of hidden layers, number of neurons in each layer, activation functions, regularization parameters, and optimization algorithms.
Deep learning model tuning involves selecting the best combination of hyperparameters that results in the optimal performance of the model on a given task. This is done by training and evaluating the model with different sets of hyperparameters to select the set that gives the best performance.
Deep learning algorithms try to find the optimal point representing the best value of the prediction function that, given an input \(X\) of data points, predicts the result \(Y\). In our case, \(X\) is a multidimensional time series, and \(Y\) is a vector of probabilities for the possible output classes. For complex situations, the best prediction function is time-consuming to estimate. For this reason, deep learning methods rely on gradient descent to estimate the prediction function efficiently, converging much faster than an exhaustive search [1]. All gradient descent methods use an optimization algorithm adjusted with hyperparameters such as the learning and regularization rates [2]. The learning rate controls the numerical step of the gradient descent function, and the regularization rate controls model overfitting. Adjusting these values to an optimal setting requires using model tuning methods.
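To make the role of these hyperparameters concrete, the short sketch below (an illustration using the torch package directly, not part of sits) performs one gradient descent step with the AdamW optimizer; lr scales the size of each update, while weight_decay penalizes large weights to reduce overfitting.
# Illustration only: one gradient descent step with AdamW in torch
library(torch)
# a parameter tensor to be optimized
w <- torch_randn(10, requires_grad = TRUE)
# AdamW optimizer with explicit learning rate and weight decay
opt <- optim_adamw(params = list(w), lr = 0.005, weight_decay = 1e-06)
# a simple quadratic loss
loss <- torch_sum((w - 1)^2)
loss$backward()   # compute gradients
opt$step()        # update w using the gradients, scaled by lr
opt$zero_grad()   # reset gradients before the next step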
17.2 How SITS performs model tuning
To reduce the learning curve, sits provides default values for all machine learning and deep learning methods, ensuring a reasonable baseline performance. However, refining model hyperparameters might be necessary, especially for more complex models such as sits_lighttae() or sits_tempcnn(). To that end, the package provides the sits_tuning() function.
The most straightforward approach to model tuning is to run a grid search; this involves defining a range for each hyperparameter and then testing all possible combinations. This approach leads to a combinatorial explosion and thus is not recommended. Instead, Bergstra and Bengio propose randomly chosen trials [3]. Their paper shows that randomized trials are more efficient than grid search trials, selecting adequate hyperparameters at a fraction of the computational cost. The sits_tuning() function follows Bergstra and Bengio by using a random search on the chosen hyperparameters.
Experiments with image time series show that other optimizers may have better performance for the specific problem of land classification. For this reason, the authors developed the torchopt R package, which includes several recently proposed optimizers, including Madgrad [4] and Yogi [5]. Using the sits_tuning() function allows testing these and other optimizers available in the torch and torchopt packages.
The sits_tuning() function takes the following parameters:
- samples: Training dataset to be used by the model.
- samples_validation: Optional dataset containing time series to be used for validation. If missing, the next parameter will be used.
- validation_split: If samples_validation is not used, this parameter defines the proportion of time series in the training dataset to be used for validation (default is 20%).
- ml_method(): Deep learning method (either sits_mlp(), sits_tempcnn(), sits_tae(), or sits_lighttae()).
- params: Defines the optimizer and its hyperparameters by calling sits_tuning_hparams(), as shown in the example below.
- trials: Number of trials to run the random search.
- multicores: Number of cores to be used for the procedure.
- progress: Show a progress bar?
The sits_tuning_hparams() function inside sits_tuning() allows defining optimizers and their hyperparameters, including lr (learning rate), eps (controls numerical stability), and weight_decay (controls overfitting). The default values for eps and weight_decay in all sits deep learning functions are 1e-08 and 1e-06, respectively. The default lr for sits_lighttae() and sits_tempcnn() is 0.005.
Users have different ways to randomize the hyperparameters, including:
- choice() (a list of options);
- uniform (a uniform distribution);
- randint (random integers from a uniform distribution);
- normal(mean, sd) (normal distribution);
- beta(shape1, shape2) (beta distribution);
- loguniform(max, min) (log-uniform distribution).
We suggest using the log-uniform distribution to search over a wide range of values that span several orders of magnitude. This is common for hyperparameters like learning rates, which can vary from very small values (e.g., 0.0001) to larger values (e.g., 1.0) in a logarithmic manner. By default, sits_tuning() uses a log-uniform distribution between 10^-2 and 10^-4 for the learning rate and the same distribution between 10^-2 and 10^-8 for the weight decay.
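For intuition, the short snippet below (an illustration only, not how sits implements its random search internally) draws five candidate learning rates log-uniformly between 10^-4 and 10^-2; on a logarithmic scale, every order of magnitude is equally likely to be sampled.
# Illustration only: log-uniform sampling of learning rate candidates
set.seed(42)
lr_candidates <- 10^runif(5, min = -4, max = -2)
lr_candidates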
17.3 Tuning a LightTAE model
Our first example is tuning a Lightweight Temporal Attention Encoder model [6] on the MOD13Q1 dataset for the state of Mato Grosso. To recall, this dataset contains time series samples from the Brazilian state of Mato Grosso obtained from the MODIS MOD13Q1 product. It has 1,892 samples and nine classes, including Cerrado, Forest, Pasture, Soy_Corn, Soy_Cotton, Soy_Fallow, and Soy_Millet [7], and is available in the R package sitsdata.
# Tuning ``sits_lighttae`` model
tuned_mt <- sits_tuning(
samples = samples_matogrosso_mod13q1,
ml_method = sits_lighttae(),
params = sits_tuning_hparams(
optimizer = torch::optim_adamw,
opt_hparams = list(
lr = loguniform(10^-2, 10^-4),
weight_decay = loguniform(10^-2, 10^-8)
)
),
trials = 40,
multicores = 6,
progress = FALSE
)
# Load samples
samples_matogrosso_mod13q1 = load_samples_dataset(
    name="samples_matogrosso_mod13q1",
    package="sitsdata"
)
# Tuning ``sits_lighttae`` model
tuned_mt = sits_tuning(
    samples=samples_matogrosso_mod13q1,
    ml_method=sits_lighttae,
    params=sits_tuning_hparams(
        optimizer="torch::optim_adamw",
        opt_hparams=dict(
            lr=hparam("loguniform", 10**-2, 10**-4),
            weight_decay=hparam("loguniform", 10**-2, 10**-8)
        )
    ),
    trials=40,
    multicores=6,
    progress=False
)
The result is a tibble with different values of accuracy, kappa, decision matrix, and hyperparameters. The five best trials achieve accuracy values between 0.970 and 0.978, as shown below. The best result is obtained with a learning rate of 0.0013 and a weight decay of 3.73e-07. The worst result has an accuracy of 0.891, which shows the importance of the tuning procedure.
# Obtain accuracy, kappa, lr, and weight decay for the 5 best results
# Hyperparameters are organized as a list
hparams_5 <- tuned_mt[1:5,]$opt_hparams
# Extract learning rate and weight decay from the list
lr_5 <- purrr::map_dbl(hparams_5, function(h) h$lr)
wd_5 <- purrr::map_dbl(hparams_5, function(h) h$weight_decay)
# Create a tibble to display the results
best_5 <- tibble::tibble(
accuracy = tuned_mt[1:5,]$accuracy,
kappa = tuned_mt[1:5,]$kappa,
lr = lr_5,
weight_decay = wd_5)
# Print the best five combinations of hyperparameters
best_5
# A tibble: 5 × 4
accuracy kappa lr weight_decay
<dbl> <dbl> <dbl> <dbl>
1 0.978 0.974 0.00136 0.000000373
2 0.975 0.970 0.00269 0.0000000861
3 0.973 0.967 0.00162 0.00218
4 0.970 0.964 0.000378 0.00000868
5 0.970 0.964 0.00198 0.00000275
# Import pandas
import pandas as pd
# Obtain accuracy, kappa, lr, and weight decay for the 5 best results
# Hyperparameters are organized as a list
hparams_5 = tuned_mt.opt_hparams[0:5]
# Extract learning rate and weight decay from the list
lr_5 = [x for x_ in hparams_5 for x in x_["lr"]]
wd_5 = [x for x_ in hparams_5 for x in x_["weight_decay"]]
# Create a data frame to display the results
best_5 = pd.DataFrame(dict(
    accuracy=tuned_mt.accuracy[0:5],
    kappa=tuned_mt.kappa[0:5],
    lr=lr_5,
    weight_decay=wd_5
))
# Print the best five combinations of hyperparameters
best_5
accuracy kappa lr weight_decay
0 0.978202 0.973719 0.001364 3.730247e-07
1 0.975477 0.970402 0.002686 8.605873e-08
2 0.972752 0.967170 0.001623 2.179615e-03
3 0.970027 0.963927 0.000378 8.683021e-06
4 0.970027 0.963900 0.001981 2.753714e-06
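Once tuning is complete, the selected hyperparameters can be passed back to the model constructor to train the final classifier. The sketch below (a minimal example reusing the R objects created above) retrains a LightTAE model with the best learning rate and weight decay found by sits_tuning().
# Retrain the LightTAE model using the best hyperparameters found above
best_hparams <- tuned_mt[1, ]$opt_hparams[[1]]
lighttae_model <- sits_train(
    samples = samples_matogrosso_mod13q1,
    ml_method = sits_lighttae(
        optimizer = torch::optim_adamw,
        opt_hparams = list(
            lr = best_hparams$lr,
            weight_decay = best_hparams$weight_decay
        )
    )
)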
17.4 Tuning a TempCNN model
In this example, we use sits_tuning() to find good hyperparameters for training the sits_tempcnn() algorithm on a dataset for measuring deforestation in Rondonia (samples_deforestation_rondonia), available in the package sitsdata. This dataset consists of 6,007 samples collected from Sentinel-2 images covering the state of Rondonia. There are nine classes: Clear_Cut_Bare_Soil, Clear_Cut_Burned_Area, Mountainside_Forest, Forest, Riparian_Forest, Clear_Cut_Vegetation, Water, Wetland, and Seasonally_Flooded. Each time series contains values from the Sentinel-2/2A bands B02, B03, B04, B05, B06, B07, B8A, B08, B11, and B12, from 2022-01-05 to 2022-12-23 in 16-day intervals. The samples are intended to detect deforestation events and have been collected by remote sensing experts using visual interpretation.
The hyperparameters for the sits_tempcnn() method include the size of the layers, convolution kernels, dropout rates, learning rate, and weight decay. Please refer to the description of the Temporal CNN algorithm in Chapter Machine learning for data cubes.
# Tuning ``sits_tempcnn`` model
tuned_tempcnn <- sits_tuning(
samples = samples_deforestation_rondonia,
ml_method = sits_tempcnn(),
params = sits_tuning_hparams(
cnn_layers = choice(c(256, 256, 256), c(128, 128, 128), c(64, 64, 64)),
cnn_kernels = choice(c(3, 3, 3), c(5, 5, 5), c(7, 7, 7)),
cnn_dropout_rates = choice(c(0.15, 0.15, 0.15), c(0.2, 0.2, 0.2),
c(0.3, 0.3, 0.3), c(0.4, 0.4, 0.4)),
optimizer = torch::optim_adamw,
opt_hparams = list(
lr = loguniform(10^-2, 10^-4),
weight_decay = loguniform(10^-2, 10^-8)
)
),
trials = 50,
multicores = 4
)
# Load samples
samples_deforestation_rondonia = load_samples_dataset(
    name="samples_deforestation_rondonia",
    package="sitsdata"
)
# Tuning ``sits_tempcnn`` model
tuned_tempcnn = sits_tuning(
    samples=samples_deforestation_rondonia,
    ml_method=sits_tempcnn,
    params=sits_tuning_hparams(
        cnn_layers=hparam(
            "choice", (256, 256, 256), (128, 128, 128), (64, 64, 64)
        ),
        cnn_kernels=hparam(
            "choice", (3, 3, 3), (5, 5, 5), (7, 7, 7)
        ),
        cnn_dropout_rates=hparam(
            "choice", (0.15, 0.15, 0.15), (0.2, 0.2, 0.2),
            (0.3, 0.3, 0.3), (0.4, 0.4, 0.4)
        ),
        optimizer="torch::optim_adamw",
        opt_hparams=dict(
            lr=hparam("loguniform", 10**-2, 10**-4),
            weight_decay=hparam("loguniform", 10**-2, 10**-8)
        )
    ),
    trials=50,
    multicores=4
)
The result of sits_tuning() is a tibble with different values of accuracy, kappa, decision matrix, and hyperparameters. The five best results achieve accuracy values between 0.908 and 0.939. The best result, shown below, is obtained with a learning rate of 3.76e-04, a weight decay of 1.53e-04, three CNN layers of size 256, a kernel size of 5, and dropout rates of 0.2.
# Obtain accuracy, kappa, cnn_layers, cnn_kernels, and cnn_dropout_rates for the best result
cnn_params <- tuned_tempcnn[1, c("accuracy", "kappa", "cnn_layers", "cnn_kernels", "cnn_dropout_rates")]
# Learning rates and weight decay are organized as a list
hparams_best <- tuned_tempcnn[1,]$opt_hparams[[1]]
# Extract learning rate and weight decay
lr_wd <- tibble::tibble(lr_best = hparams_best$lr,
wd_best = hparams_best$weight_decay)
# Print the best parameters
dplyr::bind_cols(cnn_params, lr_wd)
# A tibble: 1 × 7
accuracy kappa cnn_layers cnn_kernels cnn_dropout_rates lr_best wd_best
<dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 0.939 0.929 c(256, 256, 256) c(5, 5, 5) c(0.2, 0.2, 0.2) 0.000376 1.53e-4
# Import pandas
import pandas as pd
# Learning rates and weight decay are organized as a list
hparams_best = tuned_tempcnn.opt_hparams[0]
# Obtain accuracy, kappa, cnn_layers, cnn_kernels, and cnn_dropout_rates for the best result
pd.DataFrame(dict(
    accuracy=tuned_tempcnn.accuracy[0:1],
    kappa=tuned_tempcnn.kappa[0:1],
    cnn_layers=tuned_tempcnn.cnn_layers[0:1],
    cnn_kernels=tuned_tempcnn.cnn_kernels[0:1],
    cnn_dropout_rates=tuned_tempcnn.cnn_dropout_rates[0:1],
    lr_best=hparams_best['lr'],
    wd_best=hparams_best['weight_decay'],
))
accuracy kappa cnn_layers cnn_kernels cnn_dropout_rates lr_best wd_best
0 0.939268 0.928595 c(256, 256, 256) c(5, 5, 5) c(0.2, 0.2, 0.2) 0.000376 0.000153
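As in the previous section, the selected configuration can be used to train the final model. The sketch below (a minimal example with the best values from the results above typed in directly) retrains a TempCNN model with the chosen architecture and optimizer settings.
# Retrain the TempCNN model with the best configuration found by tuning
tempcnn_model <- sits_train(
    samples = samples_deforestation_rondonia,
    ml_method = sits_tempcnn(
        cnn_layers = c(256, 256, 256),
        cnn_kernels = c(5, 5, 5),
        cnn_dropout_rates = c(0.2, 0.2, 0.2),
        optimizer = torch::optim_adamw,
        opt_hparams = list(lr = 3.76e-04, weight_decay = 1.53e-04)
    )
)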
17.5 Summary
For large datasets, the tuning process is time-consuming. Despite this cost, it is recommended in order to achieve the best performance. In general, tuning hyperparameters for models such as sits_tempcnn() and sits_lighttae() results in a slight improvement in overall accuracy over the default parameters. The performance gain is stronger in the less well-represented classes, where significant gains in producer's and user's accuracies are possible. When detecting change in less frequent classes, tuning can make a substantial difference in the results.