13  Hierarchical clustering of time series

Configurations to run this chapter

# load package "tibble"
library(tibble)
# load packages "sits" and "sitsdata"
library(sits)
library(sitsdata)
# load "pysits" library
from pysits import *

13.1 Introduction

Given a set of training samples, experts should first cross-validate the training set to assess its inherent prediction error. The results show whether the data is internally consistent. However, since cross-validation does not predict a model's actual performance when classifying new data, this chapter provides additional tools for improving the quality of training sets. More detailed information is available in the chapter Validation and accuracy measurements.

13.2 Dataset used in this chapter

The examples in this chapter use the cerrado_2classes dataset, a set of time series from the Cerrado region of Brazil, the second largest biome in South America, with an area of more than 2 million \(km^2\). The data contains 746 samples divided into two classes (Cerrado and Pasture). Each time series covers 12 months (23 data points) from the MOD13Q1 product and has two bands (EVI and NDVI).

# Show the summary of the cerrado_2classes dataset
summary(cerrado_2classes)
# A tibble: 2 × 3
  label   count  prop
  <chr>   <int> <dbl>
1 Cerrado   400 0.536
2 Pasture   346 0.464
# Show the summary of the cerrado_2classes dataset
summary(cerrado_2classes)
     label  count      prop
1  Cerrado    400  0.536193
2  Pasture    346  0.463807

13.3 Hierarchical clustering for sample quality control

The package provides two clustering methods to assess sample quality: Agglomerative Hierarchical Clustering (AHC) and Self-organizing Maps (SOM). These methods have different computational complexities. AHC has a computational complexity of \(\mathcal{O}(n^2)\), given the number of time series \(n\), whereas SOM complexity is linear. For large data, AHC requires substantial memory and running time; in these cases, SOM is recommended. This section describes how to run AHC in sits. The SOM-based technique is presented in the next section.
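The memory footprint behind this quadratic complexity is easy to check: AHC must hold the full matrix of pairwise distances. A back-of-the-envelope calculation in plain Python (illustrative only; it assumes one 8-byte double per unique pair of series):

```python
# Memory needed to store the pairwise distance matrix of n time series,
# counting only the lower triangle (n * (n - 1) / 2 unique pairs).
def distance_matrix_gib(n):
    n_pairs = n * (n - 1) // 2       # unique pairs of series
    return n_pairs * 8 / 1024 ** 3   # 8-byte doubles, converted to GiB

# The 746 samples of cerrado_2classes fit easily in memory...
print(f"{distance_matrix_gib(746):.4f} GiB for 746 samples")
# ...but a large training set quickly becomes impractical for AHC.
print(f"{distance_matrix_gib(100_000):.1f} GiB for 100,000 samples")
```

This growth, together with the time needed to compute each DTW distance, is why SOM is the better choice for large training sets.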

AHC computes the dissimilarity between any two elements from a dataset. Depending on the distance functions and linkage criteria, the algorithm decides which two clusters are merged at each iteration. This approach is helpful for exploring samples due to its visualization power and ease of use [1]. In sits, AHC is implemented using sits_cluster_dendro().
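The merge loop at the heart of AHC can be sketched in a few lines of plain Python, using toy scalar points with Euclidean distance instead of time series with DTW (the function names here are illustrative, not part of sits):

```python
# Minimal agglomerative clustering sketch with complete linkage:
# the distance between two clusters is the maximum pairwise
# distance between their members.
def complete_linkage(c1, c2):
    return max(abs(a - b) for a in c1 for b in c2)

def ahc(points, n_clusters):
    clusters = [[p] for p in points]  # start with one cluster per point
    while len(clusters) > n_clusters:
        # find the closest pair of clusters; this full pairwise scan
        # is what makes the algorithm quadratic in the sample count
        i, j = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ab: complete_linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # merge the pair
        del clusters[j]
    return clusters

# Two well-separated groups of 1-D values are recovered as two clusters
print(ahc([1.0, 1.2, 1.1, 9.0, 9.3, 9.1], 2))
```

Real implementations track the merge heights as well, which is what produces the dendrogram that sits_cluster_dendro() plots and cuts.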

# Take a set of patterns for 2 classes
# Create a dendrogram, plot, and get the optimal cluster based on ARI index
clusters <- sits_cluster_dendro(
    samples = cerrado_2classes, 
    bands = c("NDVI", "EVI"),
    dist_method = "dtw_basic",
    linkage = "ward.D2")
Figure 13.1: Example of hierarchical clustering for a two-class set.
# Take a set of patterns for 2 classes
# Create a dendrogram, plot, and get the optimal cluster based on ARI index
clusters = sits_cluster_dendro(
    samples = cerrado_2classes,
    bands = ["NDVI", "EVI"],
    dist_method = "dtw_basic",
    linkage = "ward.D2")
Figure 13.2: Example of hierarchical clustering for a two-class set.

The sits_cluster_dendro() function has one mandatory parameter (samples), the samples to be evaluated. Optional parameters include bands, dist_method, and linkage. The dist_method parameter specifies how to calculate the distance between two time series. We recommend a metric based on dynamic time warping (DTW) [2], since DTW is a reliable method for measuring differences between satellite image time series [3]. The options available in sits are based on those provided by the dtwclust package, which include dtw_basic, dtw_lb, and dtw2. Please check ?dtwclust::tsclust for more information on DTW distances.
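DTW aligns two series by allowing small temporal shifts before comparing values. The classic recurrence behind it can be sketched in plain Python (an illustration of the idea, not the dtwclust implementation):

```python
# Minimal dynamic time warping distance between two numeric sequences.
# Cell d[i][j] holds the cheapest cumulative cost of aligning the
# first i points of x with the first j points of y.
def dtw(x, y):
    n, m = len(x), len(y)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # step pattern: match, insertion, or deletion
            d[i][j] = cost + min(d[i - 1][j - 1], d[i - 1][j], d[i][j - 1])
    return d[n][m]

# A time-shifted copy of a series is close under DTW even though
# it is far point-by-point, which suits vegetation phenology curves.
print(dtw([0, 0, 1, 2, 1, 0], [0, 1, 2, 1, 0, 0]))  # -> 0.0
```

This tolerance to phase shifts is why DTW-based distances handle inter-annual variations in vegetation cycles better than plain Euclidean distance.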

The linkage parameter defines the distance metric between clusters. The recommended linkage criteria are complete and ward.D2. Complete linkage prioritizes within-cluster dissimilarities, producing compact clusters of closely spaced samples, but the results are sensitive to outliers. As an alternative, Ward proposed using the sum-of-squares error to minimize data variance [4]; this method is available as the ward.D2 option of the linkage parameter. To cut the dendrogram, sits_cluster_dendro() computes the adjusted Rand index (ARI) [5] and returns the height at which the cut maximizes the index. In this example, the ARI indicates that there are six clusters. The result of sits_cluster_dendro() is a time series tibble with one additional column called cluster. The function sits_cluster_frequency() provides information on the composition of each cluster.
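The ARI measures pairwise agreement between two partitions of the same samples (here, cluster assignments versus original labels), corrected for chance agreement. A minimal sketch in plain Python (illustrative only; sits computes the index internally):

```python
from collections import Counter
from math import comb

# Adjusted Rand index between two labelings of the same samples.
def ari(labels_a, labels_b):
    n = len(labels_a)
    # contingency table: how many samples share each (a, b) label pair
    contingency = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_rows * sum_cols / comb(n, 2)   # chance agreement
    max_index = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (max_index - expected)

# Identical partitions agree perfectly, regardless of label names
print(ari(["Cerrado"] * 3 + ["Pasture"] * 3, [1, 1, 1, 2, 2, 2]))  # -> 1.0
```

An ARI of 1.0 means the cut reproduces the labels exactly; values near zero mean the clusters agree with the labels no better than chance.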

# Show clusters samples frequency
sits_cluster_frequency(clusters)
         
            1   2   3   4   5   6 Total
  Cerrado 203  13  23  80   1  80   400
  Pasture   2 176  28   0 140   0   346
  Total   205 189  51  80 141  80   746
# Show clusters samples frequency
sits_cluster_frequency(clusters)
             1      2     3     4      5     6  Total
Cerrado  203.0   13.0  23.0  80.0    1.0  80.0  400.0
Pasture    2.0  176.0  28.0   0.0  140.0   0.0  346.0
Total    205.0  189.0  51.0  80.0  141.0  80.0  746.0

The cluster frequency table shows that each cluster has a predominance of either Cerrado or Pasture labels, except for cluster 3, which mixes samples from both labels. Such confusion may result from incorrect labeling, inadequacy of the selected bands and spatial resolution, or natural confusion due to the variability of the land classes. To remove cluster 3, use dplyr::filter(). The remaining clusters still contain a minority of mixed labels, possibly caused by outliers. In this case, sits_cluster_clean() removes the outliers, leaving only the most frequent label in each cluster. After cleaning, the resulting set of samples is likely to improve the classification results.
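The effect of this cleaning rule can be sketched in plain Python: for each cluster, keep only the samples whose label matches the cluster's majority label. The function name and data layout below are illustrative, not the sits_cluster_clean() implementation:

```python
from collections import Counter

# Keep only samples whose label matches the majority label of their
# cluster, mimicking the effect of sits_cluster_clean().
def clean_clusters(samples):
    # samples: list of (label, cluster) pairs
    majority = {}
    for cluster in {c for _, c in samples}:
        labels = [lab for lab, c in samples if c == cluster]
        majority[cluster] = Counter(labels).most_common(1)[0][0]
    return [(lab, c) for lab, c in samples if lab == majority[c]]

# Cluster 1 is mostly Cerrado with one Pasture outlier; cluster 2 is pure
samples = [("Cerrado", 1)] * 5 + [("Pasture", 1)] + [("Pasture", 2)] * 4
print(len(clean_clusters(samples)))  # -> 9 (the outlier is dropped)
```

Applied to the example above, this rule is what turns the mixed counts of clusters 1, 2, and 5 into the pure columns shown in the cleaned frequency table.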

# Remove cluster 3 from the samples
clusters_new <- dplyr::filter(clusters, cluster != 3)

# Clear clusters, leaving only the majority label
clean <- sits_cluster_clean(clusters_new)

# Show clusters samples frequency
sits_cluster_frequency(clean)
         
            1   2   4   5   6 Total
  Cerrado 203   0  80   0  80   363
  Pasture   0 176   0 140   0   316
  Total   203 176  80 140  80   679
# Remove cluster 3 from the samples
clusters_new = clusters.query("cluster != 3")

# Clear clusters, leaving only the majority label
clean = sits_cluster_clean(clusters_new)

# Show clusters samples frequency
sits_cluster_frequency(clean)
             1      2     4      5     6  Total
Cerrado  203.0    0.0  80.0    0.0  80.0  363.0
Pasture    0.0  176.0   0.0  140.0   0.0  316.0
Total    203.0  176.0  80.0  140.0  80.0  679.0

13.4 Summary

In this chapter, we presented hierarchical clustering as a tool to improve the quality of training data. The method works well for sets with up to four classes. Because of its quadratic computational complexity, it is not practical for datasets with many samples. In such cases, we suggest using self-organizing maps (SOM), as shown in the next chapter.

References

[1]
E. Keogh, J. Lin, and W. Truppel, “Clustering of time series subsequences is meaningless: Implications for previous and future research,” in Data Mining, 2003. ICDM 2003. Third IEEE International Conference on, 2003, pp. 115–122.
[2]
F. Petitjean, J. Inglada, and P. Gancarski, “Satellite Image Time Series Analysis Under Time Warping,” IEEE Transactions on Geoscience and Remote Sensing, vol. 50, no. 8, pp. 3081–3095, 2012, doi: 10.1109/TGRS.2011.2179050.
[3]
V. Maus, G. Camara, R. Cartaxo, A. Sanchez, F. M. Ramos, and G. R. Queiroz, “A Time-Weighted Dynamic Time Warping Method for Land-Use and Land-Cover Mapping,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 9, no. 8, pp. 3729–3739, 2016, doi: 10.1109/JSTARS.2016.2517118.
[4]
J. H. Ward, “Hierarchical grouping to optimize an objective function,” Journal of the American Statistical Association, vol. 58, no. 301, pp. 236–244, 1963.
[5]
W. M. Rand, “Objective Criteria for the Evaluation of Clustering Methods,” Journal of the American Statistical Association, vol. 66, no. 336, pp. 846–850, 1971, doi: 10.1080/01621459.1971.10482356.