17  Deep learning model tuning

Configurations to run this chapter

# load packages "torch" and "luz"
library(torch)
library(luz)
torch::install_torch()
# load packages "sits" and "sitsdata"
library(sits)
library(sitsdata)
# set tempdir if it does not exist 
tempdir_r <- "~/sitsbook/tempdir/R/cl_tuning"
dir.create(tempdir_r, showWarnings = FALSE, recursive = TRUE)
# load "pysits" library
from pysits import *
from pathlib import Path
# set tempdir if it does not exist 
tempdir_py = Path.home() / "sitsbook/tempdir/Python/cl_tuning"
tempdir_py.mkdir(parents=True, exist_ok=True)

17.1 Introduction

Model tuning is the process of selecting the best set of hyperparameters for a specific application. When using deep learning models for image classification, it is a highly recommended step to enable a better fit of the algorithm to the training data. Hyperparameters are model parameters that are not learned during training; they are set beforehand and control how the model learns. Examples include the learning rate, batch size, number of epochs, number of hidden layers, number of neurons in each layer, activation functions, regularization parameters, and optimization algorithms.

Deep learning model tuning involves selecting the best combination of hyperparameters that results in the optimal performance of the model on a given task. This is done by training and evaluating the model with different sets of hyperparameters to select the set that gives the best performance.

Deep learning algorithms try to find the optimal point representing the best value of the prediction function that, given an input \(X\) of data points, predicts the result \(Y\). In our case, \(X\) is a multidimensional time series, and \(Y\) is a vector of probabilities for the possible output classes. For complex situations, the best prediction function is time-consuming to estimate. For this reason, deep learning methods rely on gradient descent, which converges much faster than an exhaustive search [1]. All gradient descent methods use an optimization algorithm adjusted with hyperparameters such as the learning and regularization rates [2]. The learning rate controls the numerical step of the gradient descent function, and the regularization rate controls model overfitting. Adjusting these values to an optimal setting requires model tuning methods.
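To make the role of these two hyperparameters concrete, the sketch below performs a single gradient descent update for a toy linear model in plain R: the learning rate scales the size of the step, and the weight decay term penalizes large weights. This is an illustrative example only; sits delegates optimization to torch.

# Toy example: one gradient descent step with learning rate and weight decay
# (illustration only; sits delegates optimization to torch)
set.seed(42)
x <- matrix(rnorm(100 * 3), ncol = 3)      # 100 samples with 3 features
y <- x %*% c(1, -2, 0.5) + rnorm(100, 0, 0.1)
w <- rep(0, 3)                             # initial weights
lr <- 0.005                                # learning rate: step size
weight_decay <- 1e-06                      # regularization rate
# gradient of the mean squared error plus the L2 penalty
grad <- -2 * t(x) %*% (y - x %*% w) / nrow(x) + 2 * weight_decay * w
# update: move against the gradient, scaled by the learning rate
w <- w - lr * grad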

17.2 How SITS performs model tuning

To reduce the learning curve, sits provides default values for all machine learning and deep learning methods, ensuring a reasonable baseline performance. However, refining model hyperparameters might be necessary, especially for more complex models such as sits_lighttae() or sits_tempcnn(). To that end, the package provides the sits_tuning() function.

The most straightforward approach to model tuning is to run a grid search; this involves defining a range for each hyperparameter and then testing all possible combinations. This approach leads to a combinatorial explosion and thus is not recommended. Instead, Bergstra and Bengio propose randomly chosen trials [3]. Their paper shows that randomized trials are more efficient than grid search trials, selecting adequate hyperparameters at a fraction of the computational cost. The sits_tuning() function follows Bergstra and Bengio by using a random search on the chosen hyperparameters.
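To see why grid search becomes expensive, consider a hypothetical search space with three hyperparameters; the sketch below compares the size of the full grid with a fixed budget of random trials. The ranges are arbitrary and used only for illustration.

# Hypothetical search space (values chosen only for illustration)
lr_values <- c(0.0001, 0.0005, 0.001, 0.005, 0.01)
wd_values <- c(1e-08, 1e-06, 1e-04, 1e-02)
batch_sizes <- c(32, 64, 128)
# Grid search tests every combination: 5 x 4 x 3 = 60 model fits
grid_size <- length(lr_values) * length(wd_values) * length(batch_sizes)
# Random search draws a fixed number of trials, whatever the grid size
trials <- 20
random_trials <- data.frame(
    lr = sample(lr_values, trials, replace = TRUE),
    weight_decay = sample(wd_values, trials, replace = TRUE),
    batch_size = sample(batch_sizes, trials, replace = TRUE)
)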

Experiments with image time series show that other optimizers may have better performance for the specific problem of land classification. For this reason, the authors developed the torchopt R package, which includes several recently proposed optimizers, including Madgrad [4] and Yogi [5]. Using the sits_tuning() function allows testing these and other optimizers available in the torch and torchopt packages.
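As a hedged sketch of how one of these optimizers could be tested, the call below replaces the default optimizer with Madgrad from torchopt. It assumes that the torchopt package is installed and exports optim_madgrad, and it uses the Mato Grosso samples discussed later in this chapter.

# Sketch: testing the Madgrad optimizer from the "torchopt" package
# (assumes torchopt::optim_madgrad is available in your installation)
tuned_madgrad <- sits_tuning(
    samples = samples_matogrosso_mod13q1,
    ml_method = sits_lighttae(),
    params = sits_tuning_hparams(
        optimizer = torchopt::optim_madgrad,
        opt_hparams = list(
            lr = loguniform(10^-2, 10^-4)
        )
    ),
    trials = 20,
    multicores = 4,
    progress = FALSE
)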

The sits_tuning() function takes the following parameters:

  • samples: Training dataset to be used by the model.
  • samples_validation: Optional dataset containing time series to be used for validation. If missing, the next parameter will be used.
  • validation_split: If samples_validation is not used, this parameter defines the proportion of time series in the training dataset to be used for validation (default is 20%).
  • ml_method: Deep learning method (sits_mlp(), sits_tempcnn(), sits_tae(), or sits_lighttae()).
  • params: Defines the optimizer and its hyperparameters by calling sits_tuning_hparams(), as shown in the example below.
  • trials: Number of trials to run the random search.
  • multicores: Number of cores to be used for the procedure.
  • progress: Show a progress bar?

The sits_tuning_hparams() function inside sits_tuning() allows defining optimizers and their hyperparameters, including lr (learning rate), eps (controls numerical stability), and weight_decay (controls overfitting). The default values for eps and weight_decay in all sits deep learning functions are 1e-08 and 1e-06, respectively. The default lr for sits_lighttae() and sits_tempcnn() is 0.005.
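These defaults can also be overridden without tuning, by passing an opt_hparams list directly to the training function. The sketch below is a minimal example, assuming sits_lighttae() accepts opt_hparams as described in its documentation.

# Sketch: training a single model with explicit optimizer hyperparameters
# (assumes sits_lighttae() accepts an opt_hparams list, as in its docs)
lighttae_model <- sits_train(
    samples = samples_matogrosso_mod13q1,
    ml_method = sits_lighttae(
        opt_hparams = list(
            lr = 0.001,
            eps = 1e-08,
            weight_decay = 1e-06
        )
    )
)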

Users have different ways to randomize the hyperparameters, including:

  • choice() (a list of options);
  • uniform() (a uniform distribution);
  • randint() (random integers from a uniform distribution);
  • normal(mean, sd) (a normal distribution);
  • beta(shape1, shape2) (a beta distribution);
  • loguniform(max, min) (a log-uniform distribution).

We suggest using the log-uniform distribution to search over a wide range of values that span several orders of magnitude. This is common for hyperparameters like learning rates, which can vary from very small values (e.g., 0.0001) to larger values (e.g., 1.0) in a logarithmic manner. By default, sits_tuning() uses a log-uniform distribution between 10^-2 and 10^-4 for the learning rate and the same distribution between 10^-2 and 10^-8 for the weight decay.
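For illustration, the sketch below combines a discrete choice with log-uniform ranges in a single sits_tuning_hparams() call. The batch_size parameter is only an example of a tunable model argument; the ranges are not recommendations.

# Sketch: combining randomization helpers in sits_tuning_hparams()
# (batch_size and the ranges are illustrative assumptions only)
params <- sits_tuning_hparams(
    batch_size = choice(64, 128, 256),            # discrete options
    optimizer = torch::optim_adamw,
    opt_hparams = list(
        lr = loguniform(10^-2, 10^-4),            # spans two orders of magnitude
        weight_decay = loguniform(10^-2, 10^-8)   # spans six orders of magnitude
    )
)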

17.3 Tuning a LightTAE model

Our first example is tuning a Lightweight Temporal Attention Encoder (LightTAE) model [6] on the MOD13Q1 dataset for the state of Mato Grosso. To recall, this dataset contains time series samples from the Brazilian state of Mato Grosso obtained from the MODIS MOD13Q1 product. It has 1,892 samples and nine classes, including Cerrado, Forest, Pasture, Soy_Corn, Soy_Cotton, Soy_Fallow, and Soy_Millet [7], and is available in the R package sitsdata.
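Before tuning, it is useful to inspect the class distribution of the training set. A short check, assuming the usual sits helpers for sample tibbles:

# List the labels present in the training samples
sits_labels(samples_matogrosso_mod13q1)
# Count the number of samples per class
summary(samples_matogrosso_mod13q1)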

# Tuning ``sits_lighttae`` model
tuned_mt <- sits_tuning(
     samples = samples_matogrosso_mod13q1,
     ml_method = sits_lighttae(),
     params = sits_tuning_hparams(
         optimizer = torch::optim_adamw,
         opt_hparams = list(
             lr = loguniform(10^-2, 10^-4),
             weight_decay = loguniform(10^-2, 10^-8)
             )
         ),
     trials = 40,
     multicores = 6,
     progress = FALSE
)
# Load samples
samples_matogrosso_mod13q1 = load_samples_dataset(
    name = "samples_matogrosso_mod13q1", 
    package = "sitsdata"
)

# Tuning ``sits_lighttae`` model
tuned_mt = sits_tuning(
     samples = samples_matogrosso_mod13q1,
     ml_method = sits_lighttae,
     params = sits_tuning_hparams(
         optimizer = "torch::optim_adamw",
         opt_hparams = dict(
             lr = hparam("loguniform", 10**-2, 10**-4),
             weight_decay = hparam("loguniform", 10**-2, 10**-8)
             )
         ),
     trials = 40,
     multicores = 6,
     progress = False
)

The result is a tibble containing the accuracy, kappa, confusion matrix, and hyperparameters of each trial. The five best results have accuracy values between 0.970 and 0.978, as shown below. The best result is obtained with a learning rate of 0.0013 and a weight decay of 3.73e-07. The worst result has an accuracy of 0.891, which shows the importance of the tuning procedure.

# Obtain accuracy, kappa, lr, and weight decay for the 5 best results
# Hyperparameters are organized as a list
hparams_5 <- tuned_mt[1:5,]$opt_hparams

# Extract learning rate and weight decay from the list
lr_5 <- purrr::map_dbl(hparams_5, function(h) h$lr)
wd_5 <- purrr::map_dbl(hparams_5, function(h) h$weight_decay)

# Create a tibble to display the results
best_5 <- tibble::tibble(
    accuracy = tuned_mt[1:5,]$accuracy,
    kappa = tuned_mt[1:5,]$kappa,
    lr    = lr_5,
    weight_decay = wd_5)

# Print the five best combinations of hyperparameters
best_5
# A tibble: 5 × 4
  accuracy kappa       lr weight_decay
     <dbl> <dbl>    <dbl>        <dbl>
1    0.978 0.974 0.00136  0.000000373 
2    0.975 0.970 0.00269  0.0000000861
3    0.973 0.967 0.00162  0.00218     
4    0.970 0.964 0.000378 0.00000868  
5    0.970 0.964 0.00198  0.00000275  
# Import pandas
import pandas as pd

# Obtain accuracy, kappa, lr, and weight decay for the 5 best results
# Hyperparameters are organized as a list
hparams_5 = tuned_mt.opt_hparams[0:5]

# Extract learning rate and weight decay from the list
lr_5 = [x for x_ in hparams_5 for x in x_["lr"]]
wd_5 = [x for x_ in hparams_5 for x in x_["weight_decay"]]

# Create a DataFrame to display the results
best_5 = pd.DataFrame(dict(
    accuracy = tuned_mt.accuracy[0:5],
    kappa = tuned_mt.kappa[0:5],
    lr = lr_5,
    weight_decay = wd_5
))

# Print the five best combinations of hyperparameters
best_5
   accuracy     kappa        lr  weight_decay
0  0.978202  0.973719  0.001364  3.730247e-07
1  0.975477  0.970402  0.002686  8.605873e-08
2  0.972752  0.967170  0.001623  2.179615e-03
3  0.970027  0.963927  0.000378  8.683021e-06
4  0.970027  0.963900  0.001981  2.753714e-06

17.4 Tuning a TempCNN model

In this example, we use sits_tuning() to find good hyperparameters to train the sits_tempcnn() algorithm on a dataset for measuring deforestation in Rondonia (samples_deforestation_rondonia), available in the package sitsdata. This dataset consists of 6,007 samples collected from Sentinel-2 images covering the state of Rondonia. There are nine classes: Clear_Cut_Bare_Soil, Clear_Cut_Burned_Area, Mountainside_Forest, Forest, Riparian_Forest, Clear_Cut_Vegetation, Water, Wetland, and Seasonally_Flooded. Each time series contains values from the Sentinel-2/2A bands B02, B03, B04, B05, B06, B07, B8A, B08, B11, and B12, from 2022-01-05 to 2022-12-23 at 16-day intervals. The samples are intended to detect deforestation events and have been collected by remote sensing experts using visual interpretation.

The hyperparameters for the sits_tempcnn() method include the size of the layers, convolution kernels, dropout rates, learning rate, and weight decay. Please refer to the description of the Temporal CNN algorithm in Chapter Machine learning for data cubes.

# Tuning ``sits_tempcnn`` model
tuned_tempcnn <- sits_tuning(
  samples = samples_deforestation_rondonia,
  ml_method = sits_tempcnn(),
  params = sits_tuning_hparams(
    cnn_layers = choice(c(256, 256, 256), c(128, 128, 128), c(64, 64, 64)),
    cnn_kernels = choice(c(3, 3, 3), c(5, 5, 5), c(7, 7, 7)),
    cnn_dropout_rates = choice(c(0.15, 0.15, 0.15), c(0.2, 0.2, 0.2),
                               c(0.3, 0.3, 0.3), c(0.4, 0.4, 0.4)),
    optimizer = torch::optim_adamw,
    opt_hparams = list(
      lr = loguniform(10^-2, 10^-4),
      weight_decay = loguniform(10^-2, 10^-8)
    )
  ),
  trials = 50,
  multicores = 4
)
# Load samples
samples_deforestation_rondonia = load_samples_dataset(
    name = "samples_deforestation_rondonia", 
    package = "sitsdata"
)

# Tuning ``sits_tempcnn`` model
tuned_tempcnn = sits_tuning(
  samples = samples_deforestation_rondonia,
  ml_method = sits_tempcnn,
  params = sits_tuning_hparams(
    cnn_layers = hparam(
        "choice", (256, 256, 256), (128, 128, 128), (64, 64, 64)
    ),
    cnn_kernels = hparam(
        "choice", (3, 3, 3), (5, 5, 5), (7, 7, 7)
    ),
    cnn_dropout_rates = hparam(
        "choice", (0.15, 0.15, 0.15), (0.2, 0.2, 0.2), 
                  (0.3, 0.3, 0.3), (0.4, 0.4, 0.4)
    ),
    optimizer = "torch::optim_adamw",
    opt_hparams = dict(
      lr = hparam("loguniform", 10**-2, 10**-4),
      weight_decay = hparam("loguniform", 10**-2, 10**-8)
    )
  ),
  trials = 50,
  multicores = 4
)

The result of sits_tuning() is a tibble containing the accuracy, kappa, confusion matrix, and hyperparameters of each trial. The five best results have accuracy values between 0.908 and 0.939, as shown below. The best result is obtained with a learning rate of 3.76e-04, a weight decay of 1.53e-04, three CNN layers of size 256, kernels of size 5, and dropout rates of 0.2.

# Obtain accuracy, kappa, cnn_layers, cnn_kernels, and cnn_dropout_rates for the best result
cnn_params <- tuned_tempcnn[1, c("accuracy", "kappa", "cnn_layers", "cnn_kernels", "cnn_dropout_rates")]

# Learning rates and weight decay are organized as a list
hparams_best <- tuned_tempcnn[1,]$opt_hparams[[1]]

# Extract learning rate and weight decay 
lr_wd <- tibble::tibble(lr_best = hparams_best$lr, 
                        wd_best = hparams_best$weight_decay)

# Print the best parameters 
dplyr::bind_cols(cnn_params, lr_wd)
# A tibble: 1 × 7
  accuracy kappa cnn_layers       cnn_kernels cnn_dropout_rates  lr_best wd_best
     <dbl> <dbl> <chr>            <chr>       <chr>                <dbl>   <dbl>
1    0.939 0.929 c(256, 256, 256) c(5, 5, 5)  c(0.2, 0.2, 0.2)  0.000376 1.53e-4
# Import pandas
import pandas as pd

# Learning rates and weight decay are organized as a list
hparams_best = tuned_tempcnn.opt_hparams[0]

# Obtain accuracy, kappa, cnn_layers, cnn_kernels, and cnn_dropout_rates for the best result
pd.DataFrame(dict(
    accuracy = tuned_tempcnn.accuracy[0:1],
    kappa = tuned_tempcnn.kappa[0:1],
    cnn_layers = tuned_tempcnn.cnn_layers[0:1],
    cnn_kernels = tuned_tempcnn.cnn_kernels[0:1],
    cnn_dropout_rates = tuned_tempcnn.cnn_dropout_rates[0:1],
    lr_best = hparams_best['lr'],
    wd_best = hparams_best['weight_decay'],
))
   accuracy     kappa        cnn_layers cnn_kernels cnn_dropout_rates   lr_best   wd_best
0  0.939268  0.928595  c(256, 256, 256)  c(5, 5, 5)  c(0.2, 0.2, 0.2)  0.000376  0.000153
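Having selected the best hyperparameters, the final model can be trained with them. The sketch below is a minimal example, assuming the tuning result is ordered with the best trial in the first row and that sits_tempcnn() accepts the same architecture and opt_hparams arguments used in sits_tuning_hparams() above; the layer, kernel, and dropout values are those reported for the best trial.

# Sketch: training the final model with the best tuned hyperparameters
# (assumes the best trial is the first row of the tuning result)
best_hparams <- tuned_tempcnn[1, ]$opt_hparams[[1]]
tempcnn_model <- sits_train(
    samples = samples_deforestation_rondonia,
    ml_method = sits_tempcnn(
        cnn_layers = c(256, 256, 256),
        cnn_kernels = c(5, 5, 5),
        cnn_dropout_rates = c(0.2, 0.2, 0.2),
        opt_hparams = list(
            lr = best_hparams$lr,
            weight_decay = best_hparams$weight_decay
        )
    )
)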

17.5 Summary

For large datasets, the tuning process is time-consuming. Despite this cost, it is recommended for achieving the best performance. In general, tuning the hyperparameters of models such as sits_tempcnn() and sits_lighttae() yields a slight improvement in overall accuracy over the default parameters. The performance gain is stronger for less well-represented classes, where significant gains in producer's and user's accuracies are possible. When detecting change in less frequent classes, tuning can make a substantial difference in the results.

References

[1]
Y. Bengio, “Practical recommendations for gradient-based training of deep architectures,” arXiv:1206.5533 [cs], 2012, [Online]. Available: http://arxiv.org/abs/1206.5533.
[2]
R. M. Schmidt, F. Schneider, and P. Hennig, “Descending through a Crowded Valley - Benchmarking Deep Learning Optimizers,” in Proceedings of the 38th International Conference on Machine Learning, 2021, pp. 9367–9376, [Online]. Available: https://proceedings.mlr.press/v139/schmidt21a.html.
[3]
J. Bergstra and Y. Bengio, “Random Search for Hyper-Parameter Optimization,” Journal of Machine Learning Research, vol. 13, no. 10, pp. 281–305, 2012, [Online]. Available: http://jmlr.org/papers/v13/bergstra12a.html.
[4]
A. Defazio and S. Jelassi, “Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization.” arXiv, 2021, doi: 10.48550/arXiv.2101.11075.
[5]
M. Zaheer, S. Reddi, D. Sachan, S. Kale, and S. Kumar, “Adaptive Methods for Nonconvex Optimization,” in Advances in Neural Information Processing Systems, 2018, vol. 31, [Online]. Available: https://proceedings.neurips.cc/paper/2018/hash/90365351ccc7437a1309dc64e4db32a3-Abstract.html.
[6]
V. S. F. Garnot and L. Landrieu, “Lightweight Temporal Self-attention for Classifying Satellite Images Time Series,” in Advanced Analytics and Learning on Temporal Data, 2020, pp. 171–181.
[7]
M. Picoli et al., “Big earth observation time series analysis for monitoring Brazilian agriculture,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 145, pp. 328–339, 2018, doi: 10.1016/j.isprsjprs.2018.08.007.