Introduction

Who is this book for?

This book is a practical guide for land use change experts and researchers who need to analyze big Earth observation (EO) data sets. It shows how to produce high-quality maps of land use and land cover, guiding readers through all the steps needed to achieve good results. Given the natural world’s complexity and the huge variation in human-nature interactions, only local experts who know their countries and ecosystems can extract the full information from big EO data.

One group of readers that we are keen to engage with is the national authorities on forest, agriculture, and statistics in developing countries. We aim to foster a collaborative environment where they can use EO data to enhance their national land use and cover estimates, supporting sustainable development policies. To achieve this goal, sits has strong backing from the FAO Expert Group on the Use of Earth Observation data (FAO-EOSTAT, https://www.fao.org/in-action/eostat). FAO-EOSTAT is at the forefront of using advanced EO data analysis methods for agricultural statistics in developing countries [1, 2].

Why work with satellite image time series?

Satellite imagery provides the most extensive data on our environment. By encompassing vast areas of the Earth’s surface, images enable researchers to analyze local and worldwide transformations. By observing the same location multiple times, satellites provide data on environmental changes and survey areas that are difficult to observe from the ground. Given these unique features, images offer essential information for many applications, including deforestation, crop production, food security, urban footprints, water scarcity, and land degradation. Using time series, experts improve their understanding of ecological patterns and processes. Instead of selecting individual images from specific dates and comparing them, researchers can track change continuously [3].

Time-first, space-later

“Time-first, space-later” is a concept in satellite image classification that takes time series analysis as the first step in analyzing remote sensing data, with spatial information being considered only after all time series are classified. The time-first part brings a better understanding of changes in landscapes: it becomes feasible to detect and track seasonal and long-term trends, and to identify anomalous events or patterns in the data, such as wildfires, floods, or droughts. Each pixel in a data cube is treated as a time series, using the information available in its temporal instances. Time series classification is pixel-based, producing a set of labeled pixels. This result is then used as input to the space-later part of the method, in which a smoothing algorithm improves the results of the time-first classification by considering the spatial neighborhood of each pixel. The resulting map thus combines both spatial and temporal information.

Land use and land cover

The UN Food and Agriculture Organization defines land cover as “the observed biophysical cover on the Earth’s surface” [4]. Land cover can be observed and mapped directly through remote sensing images. In FAO’s guidelines and reports, land use is described as “the human activities or purposes for which land is managed or exploited”. Although land cover and land use denote different approaches for describing the Earth’s landscape, in practice there is considerable overlap between these concepts [5]. When classifying remote sensing images, natural areas are classified using land cover types (e.g., forest), while human-modified areas are described with land use classes (e.g., pasture).

One of the advantages of using image time series for land classification is its capacity of measuring changes in the landscape related to agricultural practices. For example, the time series of a vegetation index in an area of crop production will show a pattern of minima (planting and sowing stages) and maxima (flowering stage). Thus, classification schemas based on image time series data can be richer and more detailed than those associated only with land cover. In what follows, we use the term “land classification” to refer to image classification representing both land cover and land use classes.
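The planting-to-harvest pattern can be illustrated with a small base R sketch (the index values below are synthetic, not taken from any satellite product):

```r
# Synthetic vegetation index values for one cropping season,
# at 16-day intervals starting at the beginning of the agricultural year
ndvi <- c(0.35, 0.30, 0.28, 0.40, 0.55, 0.70, 0.82, 0.85,
          0.80, 0.65, 0.45, 0.32)
dates <- seq(as.Date("2013-09-14"), by = 16, length.out = length(ndvi))

# The minimum marks the planting/sowing stage;
# the maximum marks peak vegetative growth (flowering)
dates[which.min(ndvi)]
dates[which.max(ndvi)]
```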

How sits works

The sits package uses satellite image time series for land classification, using a time-first, space-later approach. In the data preparation part, collections of big Earth observation images are organized as data cubes. Each spatial location of a data cube is associated with a time series. Locations with known labels train a machine learning algorithm, which classifies all time series of a data cube, as shown in Figure 1.


Figure 1: Using time series for land classification (source: authors).

The package provides tools for analysis, visualization, and classification of satellite image time series. Users follow a typical workflow for a pixel-based classification:

  1. Select an analysis-ready data image collection from a cloud provider such as AWS, Microsoft Planetary Computer, Digital Earth Africa, or Brazil Data Cube.
  2. Build a regular data cube using the chosen image collection.
  3. Obtain new bands and indices with operations on data cubes.
  4. Extract time series samples from the data cube to be used as training data.
  5. Perform quality control and filtering on the time series samples.
  6. Train a machine learning model using the time series samples.
  7. Classify the data cube using the model to get class probabilities for each pixel.
  8. Post-process the probability cube to remove outliers.
  9. Produce a labeled map from the post-processed probability cube.
  10. Evaluate the accuracy of the classification using best practices.
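The ten steps above can be sketched as a single script. The sketch below uses the sits functions named in this chapter, but the collection, tile, dates, file names, and directories are illustrative placeholders only:

```r
library(sits)
# 1-2. Select an ARD image collection and build a regular data cube
cube <- sits_cube(
  source     = "MPC",
  collection = "SENTINEL-2-L2A",
  tiles      = "20LKP",
  bands      = c("B04", "B08"),
  start_date = "2020-01-01",
  end_date   = "2020-12-31"
)
reg_cube <- sits_regularize(
  cube = cube, period = "P16D", res = 20,
  output_dir = "./cubes"
)
# 3. Obtain a new index from existing bands
reg_cube <- sits_apply(reg_cube,
  NDVI = (B08 - B04) / (B08 + B04),
  output_dir = "./cubes"
)
# 4-5. Extract training time series at labeled locations
#      (quality control of the samples is omitted in this sketch)
samples <- sits_get_data(reg_cube, samples = "training_points.csv")
# 6. Train a machine learning model
model <- sits_train(samples, ml_method = sits_rfor())
# 7-9. Classify the cube, smooth the probabilities, label the map
probs <- sits_classify(reg_cube, ml_model = model, output_dir = "./out")
bayes <- sits_smooth(probs, output_dir = "./out")
map   <- sits_label_classification(bayes, output_dir = "./out")
# 10. Assess accuracy against independent validation samples
sits_accuracy(map, validation = "validation_points.csv")
```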

Each workflow step corresponds to a function of the sits API, as shown in the Table below and Figure 2. These functions have convenient default parameters and behaviors. A single function builds machine learning (ML) models. The classification function processes big data cubes with efficient parallel processing. Since the sits API is simple to learn, achieving good results does not require in-depth knowledge about machine learning and parallel processing.

Table 1: The sits API workflow for land classification.
API function | Inputs | Output
sits_cube() | ARD image collection | Irregular data cube
sits_regularize() | Irregular data cube | Regular data cube
sits_apply() | Regular data cube | Regular data cube with new bands and indices
sits_get_data() | Data cube and sample locations | Time series
sits_train() | Time series and ML method | ML classification model
sits_classify() | ML classification model and regular data cube | Probability cube
sits_smooth() | Probability cube | Post-processed probability cube
sits_uncertainty() | Post-processed probability cube | Uncertainty cube
sits_label_classification() | Post-processed probability cube | Classified map
sits_accuracy() | Classified map and validation samples | Accuracy assessment

Figure 2: Main functions of the sits API (source: authors).

Additionally, experts can perform object-based image analysis (OBIA) with sits. In this case, before classifying the time series, one can use sits_segments() to create a set of closed polygons. These polygons are classified using a subset of the time series contained inside each segment. For details, see Chapter Object-based time series image analysis.

Creating a data cube

There are two kinds of data cubes in sits: (a) irregular data cubes generated by selecting image collections on cloud providers such as AWS and Planetary Computer; (b) regular data cubes with images fully covering a chosen area, where each image has the same spectral bands and spatial resolution, and images follow a set of adjacent and regular time intervals. Machine learning applications need regular data cubes. Please refer to Chapter Earth observation data cubes for further details.

The first steps in using sits are: (a) select an analysis-ready data image collection available in a cloud provider or stored locally using sits_cube(); (b) if the collection is not regular, use sits_regularize() to build a regular data cube.

This section shows how to build a data cube from local images already organized as a regular data cube. The data cube is composed of MODIS MOD13Q1 images for the region close to the city of Sinop in Mato Grosso, Brazil. This region is one of the world’s largest producers of soybeans. All images contain the NDVI and EVI indices and cover a one-year period from 2013-09-14 to 2014-08-29 (we use “year-month-day” for dates). There are 23 time instances, each covering a 16-day period. This data is available in the package sitsdata.

To build a data cube from local files, users must provide information about the original source from which the data was obtained. In this case, sits_cube() needs the parameters:

  1. source, the cloud provider from which the data has been obtained (in this case, the Brazil Data Cube “BDC”);
  2. collection, the collection of the cloud provider from which the images have been extracted. In this case, data comes from the MOD13Q1 collection, version 6.1;
  3. data_dir, the local directory where the image files are stored;
  4. parse_info, a vector of strings stating how file names store information on “tile”, “band”, and “date”. In this case, local images are stored in files whose names are similar to TERRA_MODIS_012010_EVI_2014-07-28.tif. This file represents an image obtained by the MODIS sensor onboard the TERRA satellite, covering part of tile 012010 in the EVI band for date 2014-07-28.
# load package "tibble"
library(tibble)
# load packages "sits" and "sitsdata"
library(sits)
library(sitsdata)
# Create a data cube using local files
sinop_cube <- sits_cube(
  source = "BDC",
  collection = "MOD13Q1-6.1",
  bands = c("NDVI", "EVI"),
  data_dir = system.file("extdata/sinop", package = "sitsdata"),
  parse_info = c("satellite", "sensor", "tile", "band", "date")
)
# Plot the NDVI for the first date (2013-09-14)
plot(sinop_cube,
  band = "NDVI",
  dates = "2013-09-14",
  palette = "RdYlGn"
)

Figure 3: False color MODIS image for NDVI band in 2013-09-14 from sinop data cube (source: Brazil Data Cube).

The aim of the parse_info parameter is to extract tile, band, and date information from the file name. Given the large variation in image file names generated by different producers, parse_info may also include designators such as X1 and X2; these are placeholders for parts of the file name that are not relevant to sits_cube().
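To see what parse_info does, the field extraction can be mimicked with base R (a simplified sketch, not the internal sits implementation):

```r
# Example file name from the sinop data cube
file_name <- "TERRA_MODIS_012010_EVI_2014-07-28.tif"

# Field names in the same order as the parse_info argument
parse_info <- c("satellite", "sensor", "tile", "band", "date")

# Drop the extension and split the name on "_"
parts <- strsplit(tools::file_path_sans_ext(file_name), "_")[[1]]
names(parts) <- parse_info

# Only "tile", "band", and "date" are used by sits_cube();
# the other fields act as placeholders
parts[c("tile", "band", "date")]
```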

The R object returned by sits_cube() contains the metadata describing the contents of the data cube. It includes data source and collection, satellite, sensor, tile in the collection, bounding box, projection, and list of files. Each file refers to one band of an image at one of the temporal instances of the cube.

# Show the description of the data cube
sinop_cube
#> # A tibble: 1 × 11
#>   source collection satellite sensor tile     xmin    xmax    ymin    ymax crs  
#>   <chr>  <chr>      <chr>     <chr>  <chr>   <dbl>   <dbl>   <dbl>   <dbl> <chr>
#> 1 BDC    MOD13Q1-6… TERRA     MODIS  0120… -6.18e6 -5.96e6 -1.35e6 -1.23e6 "PRO…
#> # ℹ 1 more variable: file_info <list>

The list of image files which make up the data cube is stored as a data frame in the column file_info. For each file, sits stores information about its spectral band, reference date, size, spatial resolution, coordinate reference system, bounding box, path to the file location, and cloud cover information (when available).

# Show information on the images files which are part of a data cube
sinop_cube$file_info[[1]]
#> # A tibble: 46 × 13
#>    fid   band  date       nrows ncols  xres  yres      xmin      ymin      xmax
#>    <chr> <chr> <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>     <dbl>     <dbl>
#>  1 1     EVI   2013-09-14   551   944  232.  232. -6181982. -1353336. -5963298.
#>  2 1     NDVI  2013-09-14   551   944  232.  232. -6181982. -1353336. -5963298.
#>  3 2     EVI   2013-09-30   551   944  232.  232. -6181982. -1353336. -5963298.
#>  4 2     NDVI  2013-09-30   551   944  232.  232. -6181982. -1353336. -5963298.
#>  5 3     EVI   2013-10-16   551   944  232.  232. -6181982. -1353336. -5963298.
#>  6 3     NDVI  2013-10-16   551   944  232.  232. -6181982. -1353336. -5963298.
#>  7 4     EVI   2013-11-01   551   944  232.  232. -6181982. -1353336. -5963298.
#>  8 4     NDVI  2013-11-01   551   944  232.  232. -6181982. -1353336. -5963298.
#>  9 5     EVI   2013-11-17   551   944  232.  232. -6181982. -1353336. -5963298.
#> 10 5     NDVI  2013-11-17   551   944  232.  232. -6181982. -1353336. -5963298.
#> # ℹ 36 more rows
#> # ℹ 3 more variables: ymax <dbl>, crs <chr>, path <chr>

A key attribute of a data cube is its timeline, as shown below. The command sits_timeline() lists the temporal references associated with sits objects, including samples, data cubes, and models.

# Show the timeline of the data cube
sits_timeline(sinop_cube)
#>  [1] "2013-09-14" "2013-09-30" "2013-10-16" "2013-11-01" "2013-11-17"
#>  [6] "2013-12-03" "2013-12-19" "2014-01-01" "2014-01-17" "2014-02-02"
#> [11] "2014-02-18" "2014-03-06" "2014-03-22" "2014-04-07" "2014-04-23"
#> [16] "2014-05-09" "2014-05-25" "2014-06-10" "2014-06-26" "2014-07-12"
#> [21] "2014-07-28" "2014-08-13" "2014-08-29"

The timeline of the sinop_cube data cube has 23 time instances, spaced approximately 16 days apart. The chosen dates capture the agricultural calendar in Mato Grosso, Brazil. The agricultural year starts in September-October with the sowing of the summer crop (usually soybeans), which is harvested in February-March. Then the winter crop (mostly corn, cotton, or millet) is planted in March and harvested in June-July. For LULC classification, the training samples and the data cube should share a timeline with the same number of intervals and similar start and end dates.
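The spacing of the timeline can be verified directly with base R, using the first dates from the output above:

```r
# First dates of the sinop_cube timeline shown above
timeline <- as.Date(c(
  "2013-09-14", "2013-09-30", "2013-10-16",
  "2013-11-01", "2013-11-17", "2013-12-03"
))
# Differences between consecutive instances: 16 days each
diff(timeline)
```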

The time series tibble

To handle time series information, sits uses a tibble. Tibbles are an extension of the data.frame tabular data structure provided by the tidyverse set of packages. The example below shows a tibble with 1,837 time series obtained from MODIS MOD13Q1 images. Each series has four attributes: two bands (NIR and MIR) and two indices (NDVI and EVI). This dataset is available in package sitsdata.

The time series tibble contains data and metadata. The first six columns contain the metadata: spatial and temporal information, the label assigned to the sample, and the data cube from where the data has been extracted. The time_series column contains the time series data for each spatiotemporal location. This data is also organized as a tibble, with a column with the dates and the other columns with the values for each spectral band.

# Load the MODIS samples for Mato Grosso from the "sitsdata" package
library(tibble)
library(sitsdata)
data("samples_matogrosso_mod13q1", package = "sitsdata")
samples_matogrosso_mod13q1
#> # A tibble: 1,837 × 7
#>    longitude latitude start_date end_date   label   cube     time_series      
#>        <dbl>    <dbl> <date>     <date>     <chr>   <chr>    <list>           
#>  1     -57.8    -9.76 2006-09-14 2007-08-29 Pasture bdc_cube <tibble [23 × 5]>
#>  2     -59.4    -9.31 2014-09-14 2015-08-29 Pasture bdc_cube <tibble [23 × 5]>
#>  3     -59.4    -9.31 2013-09-14 2014-08-29 Pasture bdc_cube <tibble [23 × 5]>
#>  4     -57.8    -9.76 2006-09-14 2007-08-29 Pasture bdc_cube <tibble [23 × 5]>
#>  5     -55.2   -10.8  2013-09-14 2014-08-29 Pasture bdc_cube <tibble [23 × 5]>
#>  6     -51.9   -13.4  2014-09-14 2015-08-29 Pasture bdc_cube <tibble [23 × 5]>
#>  7     -56.0   -10.1  2005-09-14 2006-08-29 Pasture bdc_cube <tibble [23 × 5]>
#>  8     -54.6   -10.4  2013-09-14 2014-08-29 Pasture bdc_cube <tibble [23 × 5]>
#>  9     -52.5   -11.0  2013-09-14 2014-08-29 Pasture bdc_cube <tibble [23 × 5]>
#> 10     -52.1   -14.0  2013-09-14 2014-08-29 Pasture bdc_cube <tibble [23 × 5]>
#> # ℹ 1,827 more rows

The timeline for all time series associated with the samples follows the same agricultural calendar, starting on September 14th and ending on August 29th. All samples contain 23 values, corresponding to the same temporal interval as the sinop data cube. Notice that although the years of the samples differ, the samples for a given year follow the same agricultural calendar.

The time series can be displayed by showing the time_series column.

# Show the time series of the first sample
samples_matogrosso_mod13q1[1, ]$time_series[[1]]
#> # A tibble: 23 × 5
#>    Index       NDVI   EVI   NIR    MIR
#>    <date>     <dbl> <dbl> <dbl>  <dbl>
#>  1 2006-09-14 0.500 0.263 0.230 0.139 
#>  2 2006-09-30 0.485 0.330 0.359 0.161 
#>  3 2006-10-16 0.716 0.397 0.264 0.0757
#>  4 2006-11-01 0.654 0.415 0.332 0.124 
#>  5 2006-11-17 0.591 0.433 0.400 0.172 
#>  6 2006-12-03 0.662 0.439 0.348 0.125 
#>  7 2006-12-19 0.734 0.444 0.295 0.0784
#>  8 2007-01-01 0.739 0.502 0.348 0.0887
#>  9 2007-01-17 0.768 0.526 0.351 0.0761
#> 10 2007-02-02 0.797 0.550 0.355 0.0634
#> # ℹ 13 more rows

The distribution of samples per class can be obtained using the summary() command. The classification schema uses seven labels: four associated with crops (Soy_Corn, Soy_Cotton, Soy_Fallow, Soy_Millet), two with natural vegetation (Cerrado, Forest), and one with Pasture.

# Show the distribution of samples per class
summary(samples_matogrosso_mod13q1)
#> # A tibble: 7 × 3
#>   label      count   prop
#>   <chr>      <int>  <dbl>
#> 1 Cerrado      379 0.206 
#> 2 Forest       131 0.0713
#> 3 Pasture      344 0.187 
#> 4 Soy_Corn     364 0.198 
#> 5 Soy_Cotton   352 0.192 
#> 6 Soy_Fallow    87 0.0474
#> 7 Soy_Millet   180 0.0980

It is helpful to plot the dispersion of the time series. In what follows, for brevity, we will filter only one label (Forest) and select one index (NDVI). Note that to filter the label we use a function from the dplyr package, while to select the index we use sits_select(). Two different functions are needed because of the way metadata is stored in a samples file. The labels are listed in the label column of the samples tibble, as shown above, so functions from the dplyr package can extract subsets; in particular, dplyr::filter() retains all rows that satisfy a given condition, and in the example below its result is the set of samples associated with the “Forest” label. The second selection obtains only the values for the NDVI band. This operation requires access to the time_series column, which is stored as a list; in this case, selection with dplyr::filter() will not work. To handle such cases, sits provides sits_select() to select subsets inside the time_series list.

# select all samples with label "Forest"
samples_forest <- dplyr::filter(
  samples_matogrosso_mod13q1,
  label == "Forest"
)
# select the NDVI band for all samples with label "Forest"
samples_forest_ndvi <- sits_select(
  samples_forest,
  band = "NDVI"
)
plot(samples_forest_ndvi)

Figure 4: Joint plot of all samples in band NDVI for label Forest (source: authors).

The above figure shows all the time series associated with label Forest and band NDVI (in light blue), highlighting the median (shown in dark red) and the first and third quartiles (shown in brown). The spikes are noise caused by the presence of clouds.

Training a machine learning model

The next step is to train a machine learning (ML) model using sits_train(). It takes two inputs, samples (a time series tibble) and ml_method (a function that implements a machine learning algorithm). The result is a model that is used for classification. Each ML algorithm requires specific parameters that are user-controllable. For novice users, sits provides default parameters that produce good results. Please see Chapter Machine learning for data cubes for more details.

Since the time series data has four attributes (EVI, NDVI, NIR, and MIR) and the data cube images have only two, we select the NDVI and EVI values and use the resulting data for training. To build the classification model, we use a random forest method, invoked via sits_rfor(). Results from the random forest model can vary between different runs due to the stochastic nature of the algorithm. For this reason, in the code fragment below, we set the seed of R’s pseudo-random number generator explicitly, to ensure the same results are produced for documentation purposes.

set.seed(03022024)
# Select the bands NDVI and EVI
samples_2bands <- sits_select(
  data = samples_matogrosso_mod13q1,
  bands = c("NDVI", "EVI")
)
# Train a random forest model
rf_model <- sits_train(
  samples = samples_2bands,
  ml_method = sits_rfor()
)
# Plot the most important variables of the model
plot(rf_model)

Figure 5: Most relevant variables of trained random forest model (source: authors).

Data cube classification

After training the machine learning model, the next step is to classify the data cube using sits_classify(). This function produces a set of raster probability maps, one for each class. For each of these maps, the value of a pixel is proportional to the probability that it belongs to the class. This function has two mandatory parameters: data, the data cube or time series tibble to be classified; and ml_model, the trained ML model. Optional parameters include: (a) multicores, number of cores to be used; (b) memsize, RAM used in the classification; (c) output_dir, the directory where the classified raster files will be written. Details of the classification process are available in “Image classification in data cubes”.

# Classify the raster image
sinop_probs <- sits_classify(
  data = sinop_cube,
  ml_model = rf_model,
  multicores = 2,
  memsize = 8,
  output_dir = "./tempdir/chp3"
)
# Plot the probability cube for class Forest
plot(sinop_probs, labels = "Forest", palette = "BuGn")

Figure 6: Probability map for class Forest (source: authors).

After completing the classification, we plot the probability maps for class Forest. Probability maps are helpful to visualize the degree of confidence the classifier assigns to the labels for each pixel. They can be used to produce uncertainty information and support active learning, as described in Chapter Image classification in data cubes.

Spatial smoothing

When working with big Earth observation data, there is much variability in each class. As a result, some pixels will be misclassified. These errors are more likely to occur in transition areas between classes. To address these problems, sits_smooth() takes a probability cube as input and uses the class probabilities of each pixel’s neighborhood to reduce labeling uncertainty. Plotting the smoothed probability map for class Forest shows that most outliers have been removed.
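The intuition behind smoothing can be shown with a toy example in base R. The sketch below uses a plain neighborhood mean; this is not the Bayesian method that sits_smooth() actually implements, but it shows how an isolated outlier is pulled towards its neighbors:

```r
# Toy 3x3 window of per-pixel probabilities for one class;
# the center pixel (0.1) is an outlier in a high-probability area
probs <- matrix(c(
  0.9, 0.9, 0.9,
  0.9, 0.1, 0.9,
  0.9, 0.9, 0.9
), nrow = 3, byrow = TRUE)

# A simple neighborhood mean pulls the center value
# towards its neighbors (about 0.81)
smoothed_center <- mean(probs)
smoothed_center
```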

# Perform spatial smoothing
sinop_bayes <- sits_smooth(
  cube = sinop_probs,
  multicores = 2,
  memsize = 8,
  output_dir = "./tempdir/chp3"
)
plot(sinop_bayes, labels = "Forest", palette = "BuGn")

Figure 7: Smoothed probability map for class Forest (source: authors).

Labeling a probability data cube

After removing outliers using local smoothing, the final classification map can be obtained using sits_label_classification(). This function assigns each pixel to the class with the highest probability.
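The labeling rule itself is simple: for each pixel, choose the class with the highest probability. A base R sketch with hypothetical values:

```r
# Hypothetical smoothed probabilities for a single pixel
pixel_probs <- c(Forest = 0.15, Pasture = 0.70, Soy_Corn = 0.15)

# sits_label_classification() assigns the most probable class
names(pixel_probs)[which.max(pixel_probs)]
```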

# Label the probability file
sinop_map <- sits_label_classification(
  cube = sinop_bayes,
  output_dir = "./tempdir/chp3"
)
plot(sinop_map)

Figure 8: Classification map for Sinop (source: authors).

The resulting classification files can be read by QGIS. Links to the associated files are available in the sinop_map object in the nested table file_info.

# Show the location of the classification file
sinop_map$file_info[[1]]
#> # A tibble: 1 × 12
#>   band  start_date end_date   ncols nrows  xres  yres      xmin     xmax    ymin
#>   <chr> <date>     <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>   <dbl>
#> 1 class 2013-09-14 2014-08-29   944   551  232.  232. -6181982.  -5.96e6 -1.35e6
#> # ℹ 2 more variables: ymax <dbl>, path <chr>