Introduction
Who is this book for?
This book, tailored for land use change experts and researchers, is a practical guide that enables them to analyze big Earth observation data sets. It provides readers with the means of producing high-quality maps of land use and land cover, guiding them through all the steps to achieve good results. Given the natural world’s complexity and huge variations in human-nature interactions, only local experts who know their countries and ecosystems can extract full information from big EO data.
One group of readers that we are keen to engage with is the national authorities on forest, agriculture, and statistics in developing countries. We aim to foster a collaborative environment where they can use EO data to enhance their national land use and cover estimates, supporting sustainable development policies. To achieve this goal, sits has strong backing from the [FAO Expert Group on the Use of Earth Observation data (FAO-EOSTAT)](https://www.fao.org/in-action/eostat). FAO-EOSTAT is at the forefront of using advanced EO data analysis methods for agricultural statistics in developing countries [1, 2].
Why work with satellite image time series?
Satellite imagery provides the most extensive data on our environment. By encompassing vast areas of the Earth’s surface, images enable researchers to analyze local and worldwide transformations. By observing the same location multiple times, satellites provide data on environmental changes and survey areas that are difficult to observe from the ground. Given their unique features, images offer essential information for many applications, including deforestation, crop production, food security, urban footprints, water scarcity, and land degradation. Using time series, experts improve their understanding of ecological patterns and processes. Instead of selecting individual images from specific dates and comparing them, researchers track change continuously [3].
Time-first, space-later
“Time-first, space-later” is a concept in satellite image classification that takes time series analysis as the first step for analyzing remote sensing data, with spatial information being considered after all time series are classified. The time-first part brings a better understanding of changes in landscapes. Detecting and tracking seasonal and long-term trends becomes feasible, as well as identifying anomalous events or patterns in the data, such as wildfires, floods, or droughts. Each pixel in a data cube is treated as a time series, using the information available in its temporal instances. Time series classification is pixel-based, producing a set of labeled pixels. This result is then used as input to the space-later part of the method. In this phase, a smoothing algorithm improves the results of the time-first classification by considering the spatial neighborhood of each pixel. The resulting map thus combines both spatial and temporal information.
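The two phases can be illustrated with a toy sketch in base R (this is not the sits API, just a conceptual illustration): each pixel of a small grid is first labeled from its own time series, and a majority filter over the 3×3 neighborhood then plays the role of the space-later smoothing.

```r
# Toy "time-first, space-later" illustration (not part of the sits API)
set.seed(42)
grid <- array(runif(5 * 5 * 23), dim = c(5, 5, 23)) # 5x5 pixels, 23 dates

# Time-first: label each pixel using only its own time series
labels <- apply(grid, c(1, 2), function(ts) {
  if (mean(ts) > 0.5) "Forest" else "Pasture"
})

# Space-later: relabel each pixel with the most frequent label
# among its 3x3 spatial neighborhood
smoothed <- labels
for (i in seq_len(nrow(labels))) {
  for (j in seq_len(ncol(labels))) {
    ni <- max(1, i - 1):min(nrow(labels), i + 1)
    nj <- max(1, j - 1):min(ncol(labels), j + 1)
    smoothed[i, j] <- names(which.max(table(labels[ni, nj])))
  }
}
```

In sits, the time-first step is performed by a machine learning model and the space-later step by Bayesian smoothing, but the division of labor is the same.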
Land use and land cover
The UN Food and Agriculture Organization defines land cover as “the observed biophysical cover on the Earth’s surface” [4]. Land cover can be observed and mapped directly through remote sensing images. In FAO’s guidelines and reports, land use is described as “the human activities or purposes for which land is managed or exploited”. Although land cover and land use denote different approaches for describing the Earth’s landscape, in practice there is considerable overlap between these concepts [5]. When classifying remote sensing images, natural areas are classified using land cover types (e.g., forest), while human-modified areas are described with land use classes (e.g., pasture).
One of the advantages of using image time series for land classification is their capacity to measure changes in the landscape related to agricultural practices. For example, the time series of a vegetation index in an area of crop production will show a pattern of minima (planting and sowing stages) and maxima (flowering stage). Thus, classification schemas based on image time series data can be richer and more detailed than those associated only with land cover. In what follows, we use the term “land classification” to refer to image classification representing both land cover and land use classes.
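As a toy illustration, the annual profile of a double-cropping pixel shows the troughs and peaks described above; the NDVI values below are hypothetical, chosen only to mimic a soybean-plus-winter-crop cycle.

```r
# Hypothetical NDVI values for one agricultural year (23 x 16-day steps)
ndvi <- c(0.35, 0.30, 0.28, 0.45, 0.62, 0.78, 0.85, 0.80, 0.60, 0.40,
          0.32, 0.50, 0.68, 0.75, 0.70, 0.55, 0.40, 0.35, 0.33, 0.31,
          0.30, 0.29, 0.28)

# Sign changes in the first difference locate the turning points:
# local maxima (flowering) and local minima (planting/sowing)
d <- diff(sign(diff(ndvi)))
peaks   <- which(d == -2) + 1  # indices of local maxima
troughs <- which(d ==  2) + 1  # indices of local minima
```

Two peaks separated by a trough are a typical double-cropping signature; a stable cover such as forest would show a much flatter profile.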
How sits works

The sits package uses satellite image time series for land classification, using a time-first, space-later approach. In the data preparation part, collections of big Earth observation images are organized as data cubes. Each spatial location of a data cube is associated with a time series. Locations with known labels train a machine learning algorithm, which classifies all time series of a data cube, as shown in Figure 1.
The package provides tools for analysis, visualization, and classification of satellite image time series. Users follow a typical workflow for a pixel-based classification:
- Select an analysis-ready data image collection from a cloud provider such as AWS, Microsoft Planetary Computer, Digital Earth Africa, or Brazil Data Cube.
- Build a regular data cube using the chosen image collection.
- Obtain new bands and indices with operations on data cubes.
- Extract time series samples from the data cube to be used as training data.
- Perform quality control and filtering on the time series samples.
- Train a machine learning model using the time series samples.
- Classify the data cube using the model to get class probabilities for each pixel.
- Post-process the probability cube to remove outliers.
- Produce a labeled map from the post-processed probability cube.
- Evaluate the accuracy of the classification using best practices.
Each workflow step corresponds to a function of the sits API, as shown in the Table below and Figure 2. These functions have convenient default parameters and behaviors. A single function builds machine learning (ML) models. The classification function processes big data cubes with efficient parallel processing. Since the sits API is simple to learn, achieving good results does not require in-depth knowledge of machine learning and parallel processing.
API function | Inputs | Output |
---|---|---|
sits_cube() | ARD image collection | Irregular data cube |
sits_regularize() | Irregular data cube | Regular data cube |
sits_apply() | Regular data cube | Regular data cube with new bands and indices |
sits_get_data() | Data cube and sample locations | Time series |
sits_train() | Time series and ML method | ML classification model |
sits_classify() | ML classification model and regular data cube | Probability cube |
sits_smooth() | Probability cube | Post-processed probability cube |
sits_uncertainty() | Post-processed probability cube | Uncertainty cube |
sits_label_classification() | Post-processed probability cube | Classified map |
sits_accuracy() | Classified map and validation samples | Accuracy assessment |
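Strung together, the functions in the table form a complete pipeline. The sketch below is illustrative only: the provider, collection, region of interest, dates, resolution, and the sample file name are placeholder assumptions, not a tested recipe.

```r
library(sits)
# Illustrative end-to-end sketch of the sits workflow.
# Collection, ROI, dates, and "training_points.csv" are placeholders.
cube <- sits_cube(
  source     = "MPC",
  collection = "SENTINEL-2-L2A",
  roi        = c(lon_min = -55.5, lat_min = -12.0,
                 lon_max = -55.0, lat_max = -11.5),
  start_date = "2020-09-01",
  end_date   = "2021-08-31"
)
reg_cube <- sits_regularize(cube, period = "P16D", res = 60,
                            output_dir = tempdir())
samples  <- sits_get_data(reg_cube, samples = "training_points.csv")
model    <- sits_train(samples, ml_method = sits_rfor())
probs    <- sits_classify(reg_cube, ml_model = model,
                          output_dir = tempdir())
bayes    <- sits_smooth(probs, output_dir = tempdir())
map      <- sits_label_classification(bayes, output_dir = tempdir())
```

Each call consumes the output of the previous one, mirroring the inputs and outputs listed in the table.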
Additionally, experts can perform object-based image analysis (OBIA) with sits. In this case, before classifying the time series, one can use sits_segments() to create a set of closed polygons. These polygons are classified using a subset of the time series contained inside each segment. For details, see Chapter Object-based time series image analysis.
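A minimal sketch of this OBIA path follows, using a regular data cube and a trained model such as those built later in this chapter. The segmentation function and parameter names (sits_segment(), sits_slic()) are taken from recent sits documentation and may differ between package versions; treat the values as placeholders.

```r
# Sketch of the OBIA path; names follow recent sits documentation
# (sits_segment()/sits_slic()) and may vary across versions.
segments <- sits_segment(
  cube       = sinop_cube,
  seg_fn     = sits_slic(step = 20, compactness = 1, iter = 20),
  output_dir = tempdir()
)
# Segments are then classified from the time series inside each polygon
seg_probs <- sits_classify(segments, ml_model = rf_model,
                           output_dir = tempdir())
seg_map   <- sits_label_classification(seg_probs, output_dir = tempdir())
```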
Creating a data cube
There are two kinds of data cubes in sits: (a) irregular data cubes, generated by selecting image collections on cloud providers such as AWS and Planetary Computer; (b) regular data cubes, with images fully covering a chosen area, where each image has the same spectral bands and spatial resolution, and images follow a set of adjacent and regular time intervals. Machine learning applications need regular data cubes. Please refer to Chapter Earth observation data cubes for further details.
The first steps in using sits are: (a) select an analysis-ready data image collection available in a cloud provider or stored locally using sits_cube(); (b) if the collection is not regular, use sits_regularize() to build a regular data cube.
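As an example of step (b), the fragment below builds an irregular Sentinel-2 cube from a cloud provider and regularizes it to a 16-day, 20 m grid. The tile, dates, resolution, and period are illustrative assumptions, not values used elsewhere in this chapter.

```r
library(sits)
# Hypothetical example: build an irregular Sentinel-2 cube and
# regularize it; tile, dates, period, and resolution are placeholders.
s2_cube <- sits_cube(
  source     = "AWS",
  collection = "SENTINEL-2-L2A",
  tiles      = "20LKP",
  start_date = "2020-09-01",
  end_date   = "2021-08-31"
)
s2_reg <- sits_regularize(
  cube       = s2_cube,
  period     = "P16D",  # 16-day intervals
  res        = 20,      # 20 m spatial resolution
  output_dir = tempdir(),
  multicores = 2
)
```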
This section shows how to build a data cube from local images already organized as a regular data cube. The data cube is composed of MODIS MOD13Q1 images for the region close to the city of Sinop in Mato Grosso, Brazil. This region is one of the world’s largest producers of soybeans. All images have the indices NDVI and EVI covering a one-year period from 2013-09-14 to 2014-08-29 (we use “year-month-day” for dates). There are 23 time instances, each covering a 16-day period. This data is available in the package sitsdata.
To build a data cube from local files, users must provide information about the original source from which the data was obtained. In this case, sits_cube() needs the parameters:

- source, the cloud provider from which the data has been obtained (in this case, the Brazil Data Cube “BDC”);
- collection, the collection of the cloud provider from which the images have been extracted (in this case, the MOD13Q1 collection, version 6.1);
- data_dir, the local directory where the image files are stored;
- parse_info, a vector of strings stating how file names store information on “tile”, “band”, and “date”. In this case, local images are stored in files whose names are similar to TERRA_MODIS_012010_EVI_2014-07-28.tif. This file represents an image obtained by the MODIS sensor onboard the TERRA satellite, covering part of tile 012010 in the EVI band for date 2014-07-28.
# load package "tibble"
library(tibble)
# load packages "sits" and "sitsdata"
library(sits)
library(sitsdata)
# Create a data cube using local files
sinop_cube <- sits_cube(
source = "BDC",
collection = "MOD13Q1-6.1",
bands = c("NDVI", "EVI"),
data_dir = system.file("extdata/sinop", package = "sitsdata"),
parse_info = c("satellite", "sensor", "tile", "band", "date")
)
# Plot the NDVI for the first date (2013-09-14)
plot(sinop_cube,
band = "NDVI",
dates = "2013-09-14",
palette = "RdYlGn"
)
The aim of the parse_info parameter is to extract tile, band, and date information from the file name. Given the large variation in image file names generated by different producers, it includes designators such as X1 and X2; these are placeholders for parts of the file name that are not relevant to sits_cube().
The R object returned by sits_cube() contains the metadata describing the contents of the data cube. It includes the data source and collection, satellite, sensor, tile in the collection, bounding box, projection, and list of files. Each file refers to one band of an image at one of the temporal instances of the cube.
# Show the description of the data cube
sinop_cube
#> # A tibble: 1 × 11
#> source collection satellite sensor tile xmin xmax ymin ymax crs
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 BDC MOD13Q1-6… TERRA MODIS 0120… -6.18e6 -5.96e6 -1.35e6 -1.23e6 "PRO…
#> # ℹ 1 more variable: file_info <list>
The list of image files that make up the data cube is stored as a data frame in the column file_info. For each file, sits stores information about the spectral band, reference date, size, spatial resolution, coordinate reference system, bounding box, path to the file location, and cloud cover information (when available).
# Show information on the images files which are part of a data cube
sinop_cube$file_info[[1]]
#> # A tibble: 46 × 13
#> fid band date nrows ncols xres yres xmin ymin xmax
#> <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 EVI 2013-09-14 551 944 232. 232. -6181982. -1353336. -5963298.
#> 2 1 NDVI 2013-09-14 551 944 232. 232. -6181982. -1353336. -5963298.
#> 3 2 EVI 2013-09-30 551 944 232. 232. -6181982. -1353336. -5963298.
#> 4 2 NDVI 2013-09-30 551 944 232. 232. -6181982. -1353336. -5963298.
#> 5 3 EVI 2013-10-16 551 944 232. 232. -6181982. -1353336. -5963298.
#> 6 3 NDVI 2013-10-16 551 944 232. 232. -6181982. -1353336. -5963298.
#> 7 4 EVI 2013-11-01 551 944 232. 232. -6181982. -1353336. -5963298.
#> 8 4 NDVI 2013-11-01 551 944 232. 232. -6181982. -1353336. -5963298.
#> 9 5 EVI 2013-11-17 551 944 232. 232. -6181982. -1353336. -5963298.
#> 10 5 NDVI 2013-11-17 551 944 232. 232. -6181982. -1353336. -5963298.
#> # ℹ 36 more rows
#> # ℹ 3 more variables: ymax <dbl>, crs <chr>, path <chr>
A key attribute of a data cube is its timeline, as shown below. The command sits_timeline() lists the temporal references associated with sits objects, including samples, data cubes, and models.
# Show the timeline of the data cube
sits_timeline(sinop_cube)
#> [1] "2013-09-14" "2013-09-30" "2013-10-16" "2013-11-01" "2013-11-17"
#> [6] "2013-12-03" "2013-12-19" "2014-01-01" "2014-01-17" "2014-02-02"
#> [11] "2014-02-18" "2014-03-06" "2014-03-22" "2014-04-07" "2014-04-23"
#> [16] "2014-05-09" "2014-05-25" "2014-06-10" "2014-06-26" "2014-07-12"
#> [21] "2014-07-28" "2014-08-13" "2014-08-29"
The timeline of the sinop_cube data cube has 23 intervals with a temporal difference of 16 days. The chosen dates capture the agricultural calendar in Mato Grosso, Brazil. The agricultural year starts in September-October with the sowing of the summer crop (usually soybeans), which is harvested in February-March. Then the winter crop (mostly corn, cotton, or millet) is planted in March and harvested in June-July. For LULC classification, the training samples and the data cube should share a timeline with the same number of intervals and similar start and end dates.
The time series tibble
To handle time series information, sits uses a tibble. Tibbles are extensions of the data.frame tabular data structure provided by the tidyverse set of packages. The example below shows a tibble with 1,837 time series obtained from MODIS MOD13Q1 images. Each series has four attributes: two bands (NIR and MIR) and two indices (NDVI and EVI). This dataset is available in the package sitsdata.
The time series tibble contains data and metadata. The first six columns contain the metadata: spatial and temporal information, the label assigned to the sample, and the data cube from which the data has been extracted. The time_series column contains the time series data for each spatiotemporal location. This data is also organized as a tibble, with a column for the dates and the other columns holding the values for each spectral band.
# Load the MODIS samples for Mato Grosso from the "sitsdata" package
library(tibble)
library(sitsdata)
data("samples_matogrosso_mod13q1", package = "sitsdata")
samples_matogrosso_mod13q1
#> # A tibble: 1,837 × 7
#> longitude latitude start_date end_date label cube time_series
#> <dbl> <dbl> <date> <date> <chr> <chr> <list>
#> 1 -57.8 -9.76 2006-09-14 2007-08-29 Pasture bdc_cube <tibble [23 × 5]>
#> 2 -59.4 -9.31 2014-09-14 2015-08-29 Pasture bdc_cube <tibble [23 × 5]>
#> 3 -59.4 -9.31 2013-09-14 2014-08-29 Pasture bdc_cube <tibble [23 × 5]>
#> 4 -57.8 -9.76 2006-09-14 2007-08-29 Pasture bdc_cube <tibble [23 × 5]>
#> 5 -55.2 -10.8 2013-09-14 2014-08-29 Pasture bdc_cube <tibble [23 × 5]>
#> 6 -51.9 -13.4 2014-09-14 2015-08-29 Pasture bdc_cube <tibble [23 × 5]>
#> 7 -56.0 -10.1 2005-09-14 2006-08-29 Pasture bdc_cube <tibble [23 × 5]>
#> 8 -54.6 -10.4 2013-09-14 2014-08-29 Pasture bdc_cube <tibble [23 × 5]>
#> 9 -52.5 -11.0 2013-09-14 2014-08-29 Pasture bdc_cube <tibble [23 × 5]>
#> 10 -52.1 -14.0 2013-09-14 2014-08-29 Pasture bdc_cube <tibble [23 × 5]>
#> # ℹ 1,827 more rows
The timeline for all time series associated with the samples follows the same agricultural calendar, starting on September 14th and ending on August 29th. All samples contain 23 values, corresponding to the same temporal interval as that of the sinop data cube. Notice that although the years of the samples differ, the samples for a given year follow the same agricultural calendar.
The time series can be displayed by showing the time_series column.
# Load the time series for MODIS samples for Mato Grosso
samples_matogrosso_mod13q1[1, ]$time_series[[1]]
#> # A tibble: 23 × 5
#> Index NDVI EVI NIR MIR
#> <date> <dbl> <dbl> <dbl> <dbl>
#> 1 2006-09-14 0.500 0.263 0.230 0.139
#> 2 2006-09-30 0.485 0.330 0.359 0.161
#> 3 2006-10-16 0.716 0.397 0.264 0.0757
#> 4 2006-11-01 0.654 0.415 0.332 0.124
#> 5 2006-11-17 0.591 0.433 0.400 0.172
#> 6 2006-12-03 0.662 0.439 0.348 0.125
#> 7 2006-12-19 0.734 0.444 0.295 0.0784
#> 8 2007-01-01 0.739 0.502 0.348 0.0887
#> 9 2007-01-17 0.768 0.526 0.351 0.0761
#> 10 2007-02-02 0.797 0.550 0.355 0.0634
#> # ℹ 13 more rows
The distribution of samples per class can be obtained using the summary() command. The classification schema uses seven labels: four associated with crops (Soy_Corn, Soy_Cotton, Soy_Fallow, Soy_Millet), two with natural vegetation (Cerrado, Forest), and one with Pasture.
# Load the MODIS samples for Mato Grosso from the "sitsdata" package
summary(samples_matogrosso_mod13q1)
#> # A tibble: 7 × 3
#> label count prop
#> <chr> <int> <dbl>
#> 1 Cerrado 379 0.206
#> 2 Forest 131 0.0713
#> 3 Pasture 344 0.187
#> 4 Soy_Corn 364 0.198
#> 5 Soy_Cotton 352 0.192
#> 6 Soy_Fallow 87 0.0474
#> 7 Soy_Millet 180 0.0980
It is helpful to plot the dispersion of the time series. In what follows, for brevity, we will filter only one label (Forest) and select one index (NDVI). Note that for filtering the label we use a function from the dplyr package, while for selecting the index we use sits_select(). We use two different functions because of the way metadata is stored in the samples files. The labels for the samples are listed in the column label of the samples tibble, as shown above. In this case, one can use functions from the dplyr package to extract subsets. In particular, the function dplyr::filter() retains all rows that satisfy a given condition. In the example below, the result of dplyr::filter() is the set of samples associated with the “Forest” label. The second selection involves obtaining only the values for the NDVI band. This operation requires access to the time_series column, which is stored as a list. In this case, selection with dplyr::filter() will not work. To handle such cases, sits provides sits_select() to select subsets inside the time_series list.
# select all samples with label "Forest"
samples_forest <- dplyr::filter(
samples_matogrosso_mod13q1,
label == "Forest"
)
# select the NDVI band for all samples with label "Forest"
samples_forest_ndvi <- sits_select(
samples_forest,
band = "NDVI"
)
plot(samples_forest_ndvi)
The above figure shows all the time series associated with the label Forest and band NDVI (in light blue), highlighting the median (shown in dark red) and the first and third quartiles (shown in brown). The spikes are noise caused by the presence of clouds.
Training a machine learning model
The next step is to train a machine learning (ML) model using sits_train(). It takes two inputs: samples (a time series tibble) and ml_method (a function that implements a machine learning algorithm). The result is a model that is used for classification. Each ML algorithm requires specific parameters that are user-controllable. For novice users, sits provides default parameters that produce good results. Please see Chapter Machine learning for data cubes for more details.
Since the time series data has four attributes (EVI, NDVI, NIR, and MIR) and the data cube images have only two, we select the NDVI and EVI values and use the resulting data for training. To build the classification model, we use a random forest model called by sits_rfor(). Results from the random forest model can vary between different runs due to the stochastic nature of the algorithm. For this reason, in the code fragment below, we set the seed of R’s pseudo-random number generator explicitly to ensure the same results are produced for documentation purposes.
set.seed(03022024)
# Select the bands NDVI and EVI
samples_2bands <- sits_select(
data = samples_matogrosso_mod13q1,
bands = c("NDVI", "EVI")
)
# Train a random forest model
rf_model <- sits_train(
samples = samples_2bands,
ml_method = sits_rfor()
)
# Plot the most important variables of the model
plot(rf_model)
Data cube classification
After training the machine learning model, the next step is to classify the data cube using sits_classify(). This function produces a set of raster probability maps, one for each class. For each of these maps, the value of a pixel is proportional to the probability that it belongs to the class. The function has two mandatory parameters: data, the data cube or time series tibble to be classified; and ml_model, the trained ML model. Optional parameters include: (a) multicores, the number of cores to be used; (b) memsize, the amount of RAM used in the classification; (c) output_dir, the directory where the classified raster files will be written. Details of the classification process are available in Chapter Image classification in data cubes.
# Classify the raster image
sinop_probs <- sits_classify(
data = sinop_cube,
ml_model = rf_model,
multicores = 2,
memsize = 8,
output_dir = "./tempdir/chp3"
)
# Plot the probability cube for class Forest
plot(sinop_probs, labels = "Forest", palette = "BuGn")
After completing the classification, we plot the probability maps for class Forest. Probability maps are helpful to visualize the degree of confidence the classifier assigns to the labels for each pixel. They can be used to produce uncertainty information and support active learning, as described in Chapter Image classification in data cubes.
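For example, an uncertainty cube can be derived from the probabilities with sits_uncertainty(), listed in the API table above. The “entropy” measure used below is one of the options described in the sits documentation; parameter names may vary between package versions.

```r
# Sketch: compute per-pixel uncertainty from the probability cube;
# "entropy" is one of the measures offered by sits_uncertainty()
sinop_uncert <- sits_uncertainty(
  cube       = sinop_probs,
  type       = "entropy",
  output_dir = "./tempdir/chp3"
)
plot(sinop_uncert)
```

High-uncertainty regions are good candidates for collecting additional training samples in an active learning loop.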
Spatial smoothing
When working with big Earth observation data, there is much variability in each class. As a result, some pixels will be misclassified. These errors are more likely to occur in transition areas between classes. To address these problems, sits_smooth() takes a probability cube as input and uses the class probabilities of each pixel’s neighborhood to reduce labeling uncertainty. Plotting the smoothed probability map for class Forest shows that most outliers have been removed.
# Perform spatial smoothing
sinop_bayes <- sits_smooth(
cube = sinop_probs,
multicores = 2,
memsize = 8,
output_dir = "./tempdir/chp3"
)
plot(sinop_bayes, labels = "Forest", palette = "BuGn")
Labeling a probability data cube
After removing outliers using local smoothing, the final classification map can be obtained using sits_label_classification(). This function assigns each pixel to the class with the highest probability.
# Label the probability file
sinop_map <- sits_label_classification(
cube = sinop_bayes,
output_dir = "./tempdir/chp3"
)
plot(sinop_map)
The resulting classification files can be read by QGIS. Links to the associated files are available in the sinop_map object in the nested table file_info.
# Show the location of the classification file
sinop_map$file_info[[1]]
#> # A tibble: 1 × 12
#> band start_date end_date ncols nrows xres yres xmin xmax ymin
#> <chr> <date> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 class 2013-09-14 2014-08-29 944 551 232. 232. -6181982. -5.96e6 -1.35e6
#> # ℹ 2 more variables: ymax <dbl>, path <chr>