Earth observation data cubes
Analysis-ready data image collections
Analysis-ready data (ARD) are images that are ready for analysis without the need for further preprocessing or transformation. They simplify and accelerate the analysis of Earth observation data by providing consistent and high-quality data that are standardized across different sensors and platforms. ARD data is typically provided as a collection of files, where each pixel contains a single value for each spectral band for a given date.
ARD collections are available in cloud services such as Amazon Web Service, Brazil Data Cube, Digital Earth Africa, Swiss Data Cube, and Microsoft’s Planetary Computer. These collections have been processed to improve multidate comparability. Radiance measures at the top of the atmosphere were converted to ground reflectance measures. In general, the timelines of the images of an ARD collection are different. Images still contain cloudy or missing pixels; bands for the images in the collection may have different resolutions. Figure 9 shows an example of the Landsat ARD image collection.
ARD image collections are organized in spatial partitions. Sentinel-2/2A images follow the Military Grid Reference System (MGRS) tiling system, which divides the world into 60 UTM zones of 8 degrees of longitude. Each zone has blocks of 6 degrees of latitude. Blocks are split into tiles of 110 \(\times\) 110 km\(^2\) with a 10 km overlap. Figure 10 shows the MGRS tiling system for a part of the Northeastern coast of Brazil, contained in UTM zone 24, block M.
The Landsat-4/5/7/8/9 satellites use the Worldwide Reference System (WRS-2), which breaks the coverage of Landsat satellites into images identified by path and row (see Figure 11). The path is the descending orbit of the satellite; the WRS-2 system has 233 paths per orbit, and each path has 119 rows, where each row refers to a latitudinal center line of a frame of imagery. Images in WRS-2 are geometrically corrected to the UTM projection.
ARD image collections handled by sits
Package sits
supports access to the following ARD image collections:
- Amazon Web Services (AWS): Open data Sentinel-2/2A level 2A collections for the Earth’s land surface.
- Brazil Data Cube (BDC): Open data collections of Sentinel-2/2A, Landsat-8, CBERS-4/4A, and MODIS images for Brazil. These collections are organized as regular data cubes.
- Digital Earth Africa (DEAFRICA): Open data collections of Sentinel-2/2A and Landsat-8 for Africa.
- Microsoft Planetary Computer (MPC): Open data collections of Sentinel-2/2A and Landsat-4/5/7/8/9 for the Earth’s land areas.
- USGS: Landsat-4/5/7/8/9 collections available in AWS, which require access payment.
- Swiss Data Cube (SDC): Open data collection of Sentinel-2/2A and Landsat-8 images for Switzerland.
- Harmonized Landsat-Sentinel (HLS): HLS, provided by NASA, is a collection that processes Landsat 8 and Sentinel-2 imagery to a common standard.
Regular image data cubes
Machine learning and deep learning (ML/DL) classification algorithms require the input data to be consistent. The dimensionality of the data used for training the model has to be the same as that of the data to be classified. There should be no gaps and no missing values. Thus, to use ML/DL algorithms for remote sensing data, ARD image collections should be converted to regular data cubes. Adapting a previous definition by Appel and Pebesma [4], we consider a regular data cube has the following definition and properties:
- A regular data cube is a four-dimensional structure with dimensions x (longitude or easting), y (latitude or northing), time, and bands. The spatial, temporal, and attribute dimensions are independent and not interchangeable.
- The spatial dimensions refer to a coordinate system, such as the grids defined by UTM (Universal Transverse Mercator) or MGRS (Military Grid Reference System). A grid (or tile) of the grid corresponds to a unique zone of the coordinate system.
- The temporal dimension is a set of continuous and equally-spaced intervals.
- For every combination of dimensions, a cell has a single value.
All cells of a data cube have the same spatiotemporal extent. The spatial resolution of each cell is the same in X and Y dimensions. All temporal intervals are the same. Each cell contains a valid set of measures. Each pixel is associated to a unique coordinate in a zone of the coordinate system. For each position in space, the data cube should provide a set of valid time series. For each time interval, the regular data cube should provide a valid 2D image (see Figure 12.
Currently, the only cloud service that provides regular data cubes by default is the Brazil Data Cube (BDC). ARD collections available in AWS, MPC, USGS, and DEAFRICA are not regular in space and time. Bands may have different resolutions, images may not cover the entire time, and time intervals may be irregular. For this reason, subsets of these collections need to be converted to regular data cubes before further processing. To produce data cubes for machine-learning data analysis, users should first create an irregular data cube from an ARD collection and then use sits_regularize()
, as described below.
Creating data cubes
To obtain information on ARD image collection from cloud providers, sits
uses the SpatioTemporal Asset Catalogue (STAC) protocol, a specification of geospatial information which many large image collection providers have adopted. A ‘spatiotemporal asset’ is any file that represents information about the Earth captured in a specific space and time. To access STAC endpoints, sits
uses the rstac R package.
The function sits_cube()
supports access to image collections in cloud services; it has the following parameters:
-
source
: Name of the provider. -
collection
: A collection available in the provider and supported bysits
. To find out which collections are supported bysits
, seesits_list_collections()
. -
platform
: Optional parameter specifying the platform in collections with multiple satellites. -
tiles
: Set of tiles of image collection reference system. Eithertiles
orroi
should be specified. -
roi
: A region of interest. Either: (a) a named vector (lon_min
,lon_max
,lat_min
,lat_max
) in WGS 84 coordinates; or (b) ansf
object. All images intersecting the convex hull of theroi
are selected. -
bands
: Optional parameter with the bands to be used. If missing, all bands from the collection are used. -
start_date
: The initial date for the temporal interval containing the time series of images. -
end_date
: The final date for the temporal interval containing the time series of images.
The result of sits_cube()
is a tibble with a description of the selected images required for further processing. It does not contain the actual data, but only pointers to the images. The attributes of individual image files can be assessed by listing the file_info
column of the tibble.
Assessing Amazon Web Services
Amazon Web Services (AWS) holds two kinds of collections: open-data and requester-pays. Open data collections can be accessed without cost. Requester-pays collections require payment from an AWS account. Currently, sits
supports collection SENTINEL-2-L2A
which is open data. The bands in 10 m resolution are B02, B03, B04, and B08. The 20 m bands are B05, B06, B07, B8A, B11, and B12. Bands B01 and B09 are available at 60 m resolution. A CLOUD band is also available. The example below shows how to access one tile of the open data SENTINEL-2-L2A
collection. The tiles
parameter allows selecting the desired area according to the MGRS reference system.
# Create a data cube covering an area in Brazil
s2_23MMU_cube <- sits_cube(
source = "AWS",
collection = "SENTINEL-2-L2A",
tiles = "23MMU",
bands = c("B02", "B8A", "B11", "CLOUD"),
start_date = "2018-07-12",
end_date = "2019-07-28"
)
plot(s2_23MMU_cube,
red = "B11",
blue = "B02",
green = "B8A",
date = "2018-10-05"
)
Assessing Microsoft’s Planetary Computer
Microsoft’s Planetary Computer (MPC) hosts two open data collections: SENTINEL-2-L2A
and LANDSAT-C2-L2
. The first collection contains SENTINEL-2/2A ARD images, with the same bands and resolutions as those available in AWS (see above). The example below shows how to access the SENTINEL-2-L2A
collection.
# Create a data cube covering an area in the Brazilian Amazon
s2_20LKP_cube_MPC <- sits_cube(
source = "MPC",
collection = "SENTINEL-2-L2A",
tiles = "20LKP",
bands = c("B02", "B8A", "B11", "CLOUD"),
start_date = "2019-07-01",
end_date = "2019-07-28"
)
# Plot a color composite of one date of the cube
plot(s2_20LKP_cube_MPC,
red = "B11", blue = "B02", green = "B8A",
date = "2019-07-18"
)
The LANDSAT-C2-L2
collection provides access to data from Landsat-4/5/7/8/9 satellites. Images from these satellites have been intercalibrated to ensure data consistency. For compatibility between the different Landsat sensors, the band names are BLUE, GREEN, RED, NIR08, SWIR16, and SWIR22. All images have 30 m resolution. For this collection, tile search is not supported; the roi
parameter should be used. The example below shows how to retrieve data from a region of interest covering the city of Brasilia in Brazil.
# Read a ROI that covers part of the Northeastern coast of Brazil
roi <- c(
lon_min = -43.5526, lat_min = -2.9644,
lon_max = -42.5124, lat_max = -2.1671
)
# Select the cube
s2_L8_cube_MPC <- sits_cube(
source = "MPC",
collection = "LANDSAT-C2-L2",
bands = c("BLUE", "RED", "GREEN", "NIR08", "SWIR16", "CLOUD"),
roi = roi,
start_date = "2019-06-01",
end_date = "2019-09-01"
)
# Plot the tile that covers the Lencois Maranhenses
plot(s2_L8_cube_MPC,
red = "RED", green = "GREEN", blue = "BLUE",
date = "2019-06-30"
)
Assessing Digital Earth Africa
Digital Earth Africa (DEAFRICA) is a cloud service that provides open-access Earth observation data for the African continent. The ARD image collections in sits
are S2_L2A
(Sentinel-2 level 2A) and LS8_SR
(Landsat-8). Since the STAC interface for DEAFRICA does not implement the concept of tiles, users need to specify their area of interest using the roi
parameter. The requested roi
produces a cube that contains one MGRS tiles (“35LPH”) covering an area of Madagascar that includes the Betsiboka Estuary.
dea_s2_cube <- sits_cube(
source = "DEAFRICA",
collection = "S2_L2A",
roi = c(
lon_min = 46.1, lat_min = -15.6,
lon_max = 46.6, lat_max = -16.1
),
bands = c(
"B02", "B03", "B04",
"B05", "B06", "B07",
"B08", "B8A", "B11",
"B12", "CLOUD"
),
start_date = "2019-04-01",
end_date = "2019-11-30"
)
plot(dea_s2_cube, red = "B04", blue = "B02", green = "B08")
Assessing the Brazil Data Cube
The Brazil Data Cube (BDC) is built by Brazil’s National Institute for Space Research (INPE), to provide regular EO data cubes from CBERS, LANDSAT, SENTINEL-2, and TERRA/MODIS satellites for environmental applications. The collections available in the BDC are: LANDSAT-OLI-16D
(Landsat-8 OLI, 30 m resolution, 16-day intervals), SENTINEL-2-16D
(Sentinel-2A and 2B MSI images at 10 m resolution, 16-day intervals), CBERS-WFI-16D
(CBERS 4 WFI, 64 m resolution, 16-day intervals), CBERS-WFI-8D
(CBERS 4 and 4A WFI images, 64m resolution, 8-day intervals) CBERS-MUX-2M
(CBERS 4/4A MUX, 20 m resolution, two-month intervals), and MOD13Q1-6
(MODIS MOD13SQ1 product, collection 6, 250 m resolution, 16-day intervals). For more details, use sits_list_collections(source = "BDC")
.
The BDC uses three hierarchical grids based on the Albers Equal Area projection and SIRGAS 2000 datum. The large grid has tiles of 4224.4 \(\times4\) 224.4 km2 and is used for CBERS-4 AWFI collections at 64 m resolution; each CBERS-4 AWFI tile contains images of 6600 \(\times\) 6600 pixels. The medium grid is used for Landsat-8 OLI collections at 30 m resolution; tiles have an extension of 211.2 \(\times\) 211.2 km2, and each image has 7040 \(\times\) 7040 pixels. The small grid covers 105.6 \(\times\) 105.6 km2 and is used for Sentinel-2 MSI collections at 10 m resolutions; each image has 10560 \(\times\) 10560 pixels. The data cubes in the BDC are regularly spaced in time and cloud-corrected [5].
To access the BDC, users must provide their credentials using environment variables, as shown below. Obtaining a BDC access key is free. Users must register at the BDC site to obtain the key.
In the example below, the data cube is defined as one tile (“005004”) of CBERS-WFI-16D
collection, which holds CBERS AWFI images at 16 days resolution.
# Define a tile from the CBERS-4/4A AWFI collection
cbers_tile <- sits_cube(
source = "BDC",
collection = "CBERS-WFI-16D",
tiles = "005004",
bands = c("B13", "B14", "B15", "B16", "CLOUD"),
start_date = "2021-05-01",
end_date = "2021-09-01"
)
# Plot one time instance
plot(cbers_tile,
red = "B15",
green = "B16",
blue = "B13",
date = "2021-05-09"
)
Accessing Harmonized Landsat-Sentinel collections
Harmonized Landsat Sentinel (HLS) is a NASA initiative that processes and harmonizes Landsat 8 and Sentinel-2 imagery to a common standard, including atmospheric correction, alignment, resampling, and corrections for BRDF (bidirectional reflectance distribution function). The purpose of the HLS project is to create a unified and consistent dataset that integrates the advantages of both systems, making it easier for users to work with the data.
The NASA Harmonized Landsat and Sentinel (HLS) service provides two image collections:
Landsat 8 OLI Surface Reflectance HLS (HLSL30) – The HLSL30 product includes atmospherically corrected surface reflectance from the Landsat 8 OLI sensors at 30 m resolution. The dataset includes 11 spectral bands.
Sentinel-2 MultiSpectral Instrument Surface Reflectance HLS (HLSS30) – The HLSS30 product includes atmospherically corrected surface reflectance from the Sentinel-2 MSI sensors at 30 m resolution. The dataset includes 12 spectral bands.
The HLS tiling system is identical as the one used for Sentinel-2 (MGRS). The tiles dimension is 109.8 km and there is an overlap of 4,900 m on each side.
To access NASA HLS, users need to registed at NASA EarthData, and save their login and password in a ~/.netrc plain text file in Unix (or %HOME%_netrc in Windows). The file must contain the following fields:
machine urs.earthdata.nasa.gov
login <username>
password <password>
Access to images in NASA HLS is done by region of interest. The following example shows an HLS Sentinel-2 image over the Brazilian coast.
# define a region of interest
roi <- c(
lon_min = -45.6422, lat_min = -24.0335,
lon_max = -45.0840, lat_max = -23.6178
)
# create a cube from the HLSS30 collection
hls_cube_s2 <- sits_cube(
source = "HLS",
collection = "HLSS30",
roi = roi,
bands = c("BLUE", "GREEN", "RED", "CLOUD"),
start_date = as.Date("2020-06-01"),
end_date = as.Date("2020-09-01"),
progress = FALSE
)
# plot the cube
plot(hls_cube_s2, red = "RED", green = "GREEN", blue = "BLUE", date = "2020-06-20")
Images from the HLS Landsat and Sentinel-2 collections are accessed separately and can be combined with sits_merge()
. The script below creates an HLS Landsat cube over the same area as the Sentinel-2 cube above bands. The two cubes are then merged.
# define a region of interest
roi <- c(
lon_min = -45.6422, lat_min = -24.0335,
lon_max = -45.0840, lat_max = -23.6178
)
# create a cube from the HLSS30 collection
hls_cube_l8 <- sits_cube(
source = "HLS",
collection = "HLSL30",
roi = roi,
bands = c("BLUE", "GREEN", "RED", "CLOUD"),
start_date = as.Date("2020-06-01"),
end_date = as.Date("2020-09-01"),
progress = FALSE
)
# merge the Sentinel-2 and Landsat-8 cubes
hls_cube_merged <- sits_merge(hls_cube_s2, hls_cube_l8)
Comparing the timelines of the original cubes and the merged one, one can see the benefits of the merged collection for time series data analysis.
# Timeline of the Sentinel-2 cube
sits_timeline(hls_cube_s2)
#> [1] "2020-06-15" "2020-06-20" "2020-06-25" "2020-06-30" "2020-07-05" "2020-07-10" "2020-07-20"
#> [8] "2020-07-25" "2020-08-04" "2020-08-09" "2020-08-14" "2020-08-19" "2020-08-24" "2020-08-29"
# Timeline of the Landsat-8 cube
sits_timeline(hls_cube_l8)
#> [1] "2020-06-09" "2020-06-25" "2020-07-11" "2020-07-27" "2020-08-12" "2020-08-28"
# Timeline of the Landsat-8 cube
sits_timeline(hls_cube_merged)
#> [1] "2020-06-09" "2020-06-15" "2020-06-20" "2020-06-25" "2020-06-30" "2020-07-05" "2020-07-10"
#> [8] "2020-07-11" "2020-07-20" "2020-07-25" "2020-07-27" "2020-08-04" "2020-08-09" "2020-08-12"
#> [15] "2020-08-14" "2020-08-19" "2020-08-24" "2020-08-28" "2020-08-29"
# plotting a harmonized Landsat image from the merged dataset
# plot the cube
plot(hls_cube_merged,
red = "RED",
green = "GREEN",
blue = "BLUE",
date = "2020-07-11"
)
Defining a data cube using ARD local files
ARD images downloaded from cloud collections to a local computer are not associated with a STAC endpoint that describes them. They must be organized and named to allow sits
to create a data cube from them. All local files have to be in the same directory and have the same spatial resolution and projection. Each file must contain a single image band for a single date. Each file name needs to include tile, date, and band information. Users must provide information about the original data source to allow sits
to retrieve information about image attributes such as band names, missing values, etc. When working with local cubes, sits_cube()
needs the following parameters:
-
source
: Name of the original data provider; for a list of providers and collections, usesits_list_collections()
. -
collection
: Collection from where the data was extracted. -
data_dir
: Local directory for images. -
bands
: Optional parameter to describe the bands to be retrieved. -
parse_info
: Information to parse the file names. File names need to contain information on tile, date, and band, separated by a delimiter (usually"_"
). -
delim
: Separator character between descriptors in the file name (default is"_"
).
The example shows how to define a data cube using files from the sitsdata
package. The dataset contains part of tile “20LMR” of Sentinel-2 images for the period 2022-01-05 to 2021-08-26, with bands B02, B8A, and B11. Data is extracted from collection “SENTINEL-2-L2A” on Microsoft Planetary Computer (“MPC”).
In general, sits
users need to match the local file names to the values provided by the parse_info
parameter. Since the file names of this dataset use the format SENTINEL-2_MSI_20LMR_B8A_2022-12-23.tif
, which fits the default value for parse_info
which is c("X1", "X2", "tile", "band", "date")
and for delim
which is “_“, it is not necessary to set these values when creating a data cube from the local files.
library(sits)
# Create a cube based on a stack of CBERS data
data_dir <- system.file("extdata/Rondonia-20LKP", package = "sitsdata")
# List the first file
list.files(data_dir)[1]
#> [1] "SENTINEL-2_MSI_20LKP_B02_2020-06-04.tif"
# Create a data cube from local files
s2_cube_20LKP <- sits_cube(
source = "MPC",
collection = "SENTINEL-2-L2A",
data_dir = data_dir
)
# Plot the band B8A in the first time instance
plot(s2_cube_20LKP,
red = "B11", green = "B8A", blue = "B02",
date = "2020-07-22"
)
Defining a data cube using classified images
It is also possible to create local cubes based on results that have been produced by classification or post-classification algorithms. In this case, more parameters are required, and the parameter parse_info
is specified differently, as follows:
-
source
: Name of the original data provider. -
collection
: Name of the collection from where the data was extracted. -
data_dir
: Local directory for the classified images. -
band
: Band name associated with the type of result. Use: (a)probs
for probability cubes produced bysits_classify()
; (b)bayes
, for cubes produced bysits_smooth()
; (c)entropy
,least
,ratio
ormargin
, according to the method selected when usingsits_uncertainty()
; and (d)class
for classified cubes. -
labels
: Labels associated with the names of the classes (not required for cubes produced bysits_uncertainty()
). -
version
: Version of the result (default =v1
). -
parse_info
: File name parsing information to allowsits
to deduce the values oftile
,start_date
,end_date
,band
, andversion
from the file name. Unlike non-classified image files, cubes produced by classification and post-classification have bothstart_date
andend_date
.
The following code creates a results cube based on the classification of deforestation in Brazil. This classified cube was obtained by a large data cube of Sentinel-2 images, covering the state of Rondonia, Brazil comprising 40 tiles, 10 spectral bands, and covering the period from 2020-06-01 to 2021-09-11. Samples of four classes were trained by a random forest classifier. Internally, classified images use integers to represent classes. Thus, labels have to be associated to the integers that represent each class name.
# Create a cube based on a classified image
data_dir <- system.file("extdata/Rondonia-20LLP",
package = "sitsdata"
)
# File name "SENTINEL-2_MSI_20LLP_2020-06-04_2021-08-26_class_v1.tif"
Rondonia_class_cube <- sits_cube(
source = "AWS",
collection = "SENTINEL-S2-L2A-COGS",
bands = "class",
labels = c(
"1" = "Burned_Area", "2" = "Cleared_Area",
"3" = "Highly_Degraded", "4" = "Forest"
),
data_dir = data_dir,
parse_info = c(
"satellite", "sensor", "tile", "start_date", "end_date",
"band", "version"
)
)
# Plot the classified cube
plot(Rondonia_class_cube)
Regularizing data cubes
ARD collections available in AWS, MPC, USGS, and DEAFRICA are not regular in space and time. Bands may have different resolutions, images may not cover the entire tile, and time intervals are irregular. For this reason, data from these collections need to be converted to regular data cubes by calling sits_regularize()
, which uses the gdalcubes package [4].
In the following example, the user has created an irregular data cube from the Sentinel-2 collection available in Microsoft’s Planetary Computer (MPC) for tiles 20LKP
and 20LLP
in the state of Rondonia, Brazil. We first build an irregular data cube using sits_cube()
.
# Creating an irregular data cube from MPC
s2_cube_rondonia <- sits_cube(
source = "MPC",
collection = "SENTINEL-2-L2A",
tiles = c("20LKP", "20LLP"),
bands = c("B05", "B8A", "B12", "CLOUD"),
start_date = as.Date("2018-07-01"),
end_date = as.Date("2018-08-31")
)
# Show the different timelines of the cube tiles
sits_timeline(s2_cube_rondonia)
#> $`20LKP`
#> [1] "2018-07-03" "2018-07-08" "2018-07-13" "2018-07-18" "2018-07-23" "2018-07-28" "2018-08-02"
#> [8] "2018-08-07" "2018-08-12" "2018-08-17" "2018-08-22" "2018-08-27"
#>
#> $`20LLP`
#> [1] "2018-07-03" "2018-07-05" "2018-07-08" "2018-07-10" "2018-07-13" "2018-07-15" "2018-07-18"
#> [8] "2018-07-20" "2018-07-23" "2018-07-25" "2018-07-28" "2018-07-30" "2018-08-02" "2018-08-04"
#> [15] "2018-08-07" "2018-08-09" "2018-08-12" "2018-08-14" "2018-08-17" "2018-08-19" "2018-08-22"
#> [22] "2018-08-24" "2018-08-27" "2018-08-29"
# plot the first image of the irregular cube
s2_cube_rondonia |>
dplyr::filter(tile == "20LLP") |>
plot(red = "B12", green = "B8A", blue = "B05", date = "2018-07-03")
Because of the different acquisition orbits of the Sentinel-2 and Sentinel-2A satellites, the two tiles also have different timelines. Tile 20LKP
has 12 instances, while tile 20LLP
has 24 instances for the chosen period. The function sits_regularize()
builds a data cube with a regular timeline and a best estimate of a valid pixel for each interval. The period
parameter sets the time interval between two images. Values of period
use the ISO8601 time period specification, which defines time intervals as P[n]Y[n]M[n]D
, where “Y” stands for years, “M” for months, and “D” for days. Thus, P1M
stands for a one-month period, P15D
for a fifteen-day period. When joining different images to get the best image for a period, sits_regularize()
uses an aggregation method that organizes the images for the chosen interval in order of increasing cloud cover and then selects the first cloud-free pixel.
# Regularize the cube to 15 day intervals
reg_cube_rondonia <- sits_regularize(
cube = s2_cube_rondonia,
output_dir = "./tempdir/chp4",
res = 120,
period = "P15D",
multicores = 4
)
# Plot the first image of the tile 20LLP of the regularized cube
# The pixels of the regular data cube cover the full MGRS tile
reg_cube_rondonia |>
dplyr::filter(tile == "20LLP") |>
plot(red = "B12", green = "B8A", blue = "B05")
After obtaining a regular data cube, users can perform data analysis and classification operations, as shown in the following chapters.