Creating a dataset#
This guide will walk through creating a new dataset for processing in PCTasks, using an existing dataset as an example, which can be found in the datasets/chesapeake_lulc folder.
dataset.yaml#
In the datasets/chesapeake_lulc folder, you’ll see a file called dataset.yaml. This is the YAML configuration that
tells PCTasks everything it needs to know to create workflows for ingesting this dataset. It includes information about
the STAC Collections for this dataset, where the assets are, what tokens are needed, what code files are needed to
process the dataset, etc. It also includes sections that are the same as in a PCTasks workflow and task configuration,
such as image, code, and environment. These sections are forwarded directly into the workflows that are generated by
running pctasks dataset commands against the dataset configuration file.
The file in full is here; we will walk through the sections below:
name: chesapeake_lulc
image: ${{ args.registry }}/pctasks-task-base:latest

args:
  - registry

code:
  src: ${{ local.path(./chesapeake_lulc.py) }}
  requirements: ${{ local.path(./requirements.txt) }}

environment:
  AZURE_TENANT_ID: ${{ secrets.task-tenant-id }}
  AZURE_CLIENT_ID: ${{ secrets.task-client-id }}
  AZURE_CLIENT_SECRET: ${{ secrets.task-client-secret }}

collections:
  - id: chesapeake-lc-7
    template: ${{ local.path(./collection/chesapeake-lc-7) }}
    class: chesapeake_lulc:ChesapeakeCollection
    asset_storage:
      - uri: blob://landcoverdata/chesapeake
        token: ${{ pc.get_token(landcoverdata, chesapeake) }}
        chunks:
          options:
            name_starts_with: lc-7/
            chunk_length: 1000
    chunk_storage:
      uri: blob://landcoverdata/chesapeake-etl-data/pctasks-chunks/lc-7/
  - id: chesapeake-lc-13
    template: ${{ local.path(./collection/chesapeake-lc-13) }}
    class: chesapeake_lulc:ChesapeakeCollection
    asset_storage:
      - uri: blob://landcoverdata/chesapeake
        token: ${{ pc.get_token(landcoverdata, chesapeake) }}
        chunks:
          options:
            name_starts_with: lc-13/
            chunk_length: 1000
    chunk_storage:
      uri: blob://landcoverdata/chesapeake-etl-data/pctasks-chunks/lc-13/
  - id: chesapeake-lu
    template: ${{ local.path(./collection/chesapeake-lu) }}
    class: chesapeake_lulc:ChesapeakeCollection
    asset_storage:
      - uri: blob://landcoverdata/chesapeake
        token: ${{ pc.get_token(landcoverdata, chesapeake) }}
        chunks:
          options:
            name_starts_with: lu/
            chunk_length: 1000
    chunk_storage:
      uri: blob://landcoverdata/chesapeake-etl-data/pctasks-chunks/lu/
Templating#
You’ll notice the usage of ${{ ... }} in the dataset YAML for various values. This represents a templated value that is
dynamically computed by PCTasks, either on the client or server side. See the Templating user guide for more information about templating.
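To get a feel for which values in a dataset.yaml are templated, the following minimal sketch lists the expressions found inside ${{ ... }} markers. It is only an illustration built on a regular expression; it is not how PCTasks itself parses or evaluates templates, and the names TEMPLATE_PATTERN and find_template_expressions are hypothetical.

```python
import re
from typing import List

# Matches ${{ ... }} and captures the inner expression (non-greedy).
# Illustrative only: PCTasks' real template parser may behave differently.
TEMPLATE_PATTERN = re.compile(r"\$\{\{\s*(.*?)\s*\}\}")


def find_template_expressions(text: str) -> List[str]:
    """Return the inner expression of every ${{ ... }} occurrence in text."""
    return TEMPLATE_PATTERN.findall(text)


if __name__ == "__main__":
    line = "image: ${{ args.registry }}/pctasks-task-base:latest"
    print(find_template_expressions(line))
```

Running this over each line of the file gives a quick inventory of the args, secrets, local, and pc template groups a dataset depends on.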
name#
name: chesapeake_lulc
This section simply gives the dataset a name. Use an ID that can be put into file paths, etc. (no spaces).
image and args#
image: ${{ args.registry }}/pctasks-task-base:latest
args:
- registry
This section describes the Docker image used to run tasks. The main requirement is that pctasks.dataset and
pctasks.ingest_task are installed in the environment of the Docker image. Otherwise, you can use any image to run the
task. It’s recommended to use pctasks-task-base when possible, or an image derived from that base image.
code#
code:
  src: ${{ local.path(./chesapeake_lulc.py) }}
  requirements: ${{ local.path(./requirements.txt) }}
The code section allows you to specify a local code file or package that should be uploaded and made available to the task runner when executing
tasks. You can also supply a local requirements.txt file that lists dependencies that should be installed before running tasks. If installing
dependencies will take a significant amount of time, it is recommended that you instead create and publish a Docker image with those dependencies
installed and use that image to speed things up. See Runtime Environment for more details.
environment#
environment:
  AZURE_TENANT_ID: ${{ secrets.task-tenant-id }}
  AZURE_CLIENT_ID: ${{ secrets.task-client-id }}
  AZURE_CLIENT_SECRET: ${{ secrets.task-client-secret }}
The environment section lets you inject environment variables into each task that is issued for the dataset. In this case, we’re injecting the Azure SDK credentials for tasks. These environment variables will be provided to every task, regardless
of whether a specific task uses them. Here, the variable values use the ${{ secrets.* }} template group to retrieve secret values. See secrets for more details about secrets.
task_config#
Although not included in the example being reviewed here, custom task-level configurations specific to one of the collections defined in the dataset.yaml file can be defined in a task_config section. Currently, only the assignment of tags to tasks is supported. For example, to specify that the high memory Azure Batch pool should be used for the create-items task for the chesapeake-lc-13 collection, we can define an appropriate tag on the create-items task for that collection by adding the following section to the dataset.yaml file:
task_config:
  chesapeake-lc-13:
    create-items:
      tags:
        batch_pool_id: high_memory_pool
Note that should a conflict occur between a tag generated in code and one defined in the task_config section of the dataset.yaml file, the task_config tag value will take precedence.
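The precedence rule can be pictured as a dictionary merge in which task_config values are applied last. This is a sketch of the rule as stated above, not PCTasks' actual implementation; the function name resolve_tags and the default_pool value are hypothetical.

```python
from typing import Dict


def resolve_tags(code_tags: Dict[str, str], config_tags: Dict[str, str]) -> Dict[str, str]:
    """Merge tags so that task_config values win on conflicting keys."""
    # Later entries in a dict literal override earlier ones with the same key.
    return {**code_tags, **config_tags}


if __name__ == "__main__":
    generated_in_code = {"batch_pool_id": "default_pool"}  # hypothetical value
    from_task_config = {"batch_pool_id": "high_memory_pool"}
    print(resolve_tags(generated_in_code, from_task_config))
```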
collections#
The collections element is a list of collection configurations. If your dataset only has one collection,
there will be a single object listed here. We’ll look at the first collection object as an example; the
rest are similar:
- id: chesapeake-lc-7
  template: ${{ local.path(./collection/chesapeake-lc-7) }}
  class: chesapeake_lulc:ChesapeakeCollection
  asset_storage:
    - uri: blob://landcoverdata/chesapeake
      token: ${{ pc.get_token(landcoverdata, chesapeake) }}
      chunks:
        options:
          name_starts_with: lc-7/
          chunk_length: 1000
  chunk_storage:
    uri: blob://landcoverdata/chesapeake-etl-data/pctasks-chunks/lc-7/
id#
id: chesapeake-lc-7
This is the STAC Collection ID. For any STAC Item that is processed, this value is set as the Item’s collection
property if none is set; if the Item’s collection is already set and does not match this value, an error is raised.
This ID must also match the ID in the collection template.
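The set-or-validate behavior described above can be sketched on a plain dictionary standing in for a STAC Item. This is an assumption-laden illustration, not PCTasks code: the function name apply_collection_id and the choice of ValueError are hypothetical.

```python
from typing import Any, Dict


def apply_collection_id(item: Dict[str, Any], collection_id: str) -> Dict[str, Any]:
    """Set the item's collection if missing; raise if it conflicts."""
    existing = item.get("collection")
    if existing is None:
        # No collection on the item yet: fill it in from the configuration.
        item["collection"] = collection_id
    elif existing != collection_id:
        # The exception type is an assumption; PCTasks may raise something else.
        raise ValueError(
            f"Item collection {existing!r} does not match configured ID {collection_id!r}"
        )
    return item


if __name__ == "__main__":
    print(apply_collection_id({"id": "example-item"}, "chesapeake-lc-7"))
```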
template#
template: ${{ local.path(./collection/chesapeake-lc-7) }}
This is the path to the directory containing the Collection template. The collection template is described in the Collection templates section.
Note that here we use the PCTasks template function local.path to specify the path to the template directory relative to the location of the dataset.yaml file.
class#
class: chesapeake_lulc:ChesapeakeCollection
The class property points PCTasks to the pctasks.dataset.collection.Collection subclass that will be used to process STAC Items.
This must be a class accessible in the Python path of the task execution environment, either through the code or packages described
in the code configuration block described above, packages installed in the Docker image, or core PCTasks implementations such as pctasks.dataset.collection:PremadeItemCollection.
The implementation of the class is described below.
asset_storage#
There can be multiple asset storage configurations, each describing where assets for the dataset exist. They are specified as the list elements of the asset_storage
property. We’ll look at the single asset_storage configuration here:
uri: blob://landcoverdata/chesapeake
token: ${{ pc.get_token(landcoverdata, chesapeake) }}
chunks:
  options:
    name_starts_with: lc-7/
    chunk_length: 1000
The uri is the PC URI to the assets (currently must be a blob://storage_account/container(/prefix) type URI).
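As an illustration of the blob://storage_account/container(/prefix) shape, the sketch below splits such a URI into its parts using only the standard library. The BlobUri type and parse_blob_uri function are hypothetical helpers for this guide, not part of the PCTasks API.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlparse


class BlobUri(NamedTuple):
    storage_account: str
    container: str
    prefix: Optional[str]


def parse_blob_uri(uri: str) -> BlobUri:
    """Split blob://storage_account/container(/prefix) into its components."""
    parsed = urlparse(uri)
    if parsed.scheme != "blob":
        raise ValueError(f"Expected a blob:// URI, got {uri!r}")
    # The first path segment is the container; anything after it is the prefix.
    parts = parsed.path.lstrip("/").split("/", 1)
    container = parts[0]
    if not parsed.netloc or not container:
        raise ValueError(f"URI must include a storage account and container: {uri!r}")
    prefix = parts[1] if len(parts) > 1 and parts[1] else None
    return BlobUri(parsed.netloc, container, prefix)


if __name__ == "__main__":
    print(parse_blob_uri("blob://landcoverdata/chesapeake"))
```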
The token provides a SAS token to access the assets. Any token provided with asset storage will be available to tasks through the StorageFactory mechanism.
The chunks section describes how the dataset assets get translated into “chunk files”. Chunk files are simple lists of asset URIs
that are used to break data processing work into groups that can be processed in parallel. This section would also
contain information defining the “splits” that would parallelize the creation of chunk files, though this particular dataset
does not utilize splits and so none are defined. See Chunking for more information about splits and chunk files.
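The idea behind the name_starts_with and chunk_length options can be sketched as a filter followed by fixed-size grouping. This is a simplified illustration under the assumption that the options behave as their names suggest; it is not the PCTasks chunking implementation, and create_chunks and the sample blob names are hypothetical.

```python
from typing import Iterable, List


def create_chunks(
    blob_names: Iterable[str], name_starts_with: str, chunk_length: int
) -> List[List[str]]:
    """Keep names with the given prefix, then group them into fixed-size chunks."""
    if chunk_length < 1:
        raise ValueError("chunk_length must be at least 1")
    matching = [name for name in blob_names if name.startswith(name_starts_with)]
    # Each slice is one "chunk file" worth of asset names.
    return [
        matching[start : start + chunk_length]
        for start in range(0, len(matching), chunk_length)
    ]


if __name__ == "__main__":
    names = ["lc-7/a.tif", "lc-7/b.tif", "lc-13/c.tif", "lc-7/d.tif"]  # made-up names
    for chunk in create_chunks(names, name_starts_with="lc-7/", chunk_length=2):
        print(chunk)
```

Each resulting group can then be handed to a separate task, which is what makes the processing parallelizable.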
chunk_storage#
chunk_storage:
  uri: blob://landcoverdata/chesapeake-etl-data/pctasks-chunks/lc-7/
This section defines where chunk files will be stored. You can supply a read/write SAS token in this configuration, but in this example we only specify the URI, which requires that the service principal whose credentials are set in the environment
has read/write access to this container. See Chunking for more information about chunk files.
Collection templates#
A Collection template is a directory with two files: template.json, which
is a STAC Collection JSON with a templated description value, and description.md, which contains the text that will be
templated into the Collection JSON.
For example, running
> pctasks dataset ingest-collection -d datasets/chesapeake_lulc/dataset.yaml -c chesapeake-lc-7 --submit
will submit a task to write the collection to the database. Note that if your dataset only has a single collection, you
do not have to supply the -c option. Also, if you are in the dataset directory and your dataset file is named dataset.yaml,
you do not have to supply the -d option.
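To make the two-file Collection template layout concrete, here is a rough sketch that reads template.json and description.md from a directory and substitutes the Markdown text into the description field. The placeholder string used below is an assumption made for illustration, as is the function name render_collection; check an existing template directory for the exact form PCTasks expects.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict

# Assumed placeholder; the real template.json may use a different marker.
PLACEHOLDER = "{{ collection.description }}"


def render_collection(template_dir: Path) -> Dict[str, Any]:
    """Load template.json and fill its description from description.md."""
    collection = json.loads((template_dir / "template.json").read_text(encoding="utf-8"))
    description = (template_dir / "description.md").read_text(encoding="utf-8")
    if collection.get("description") == PLACEHOLDER:
        collection["description"] = description
    return collection


# Build a throwaway template directory to demonstrate the layout.
with tempfile.TemporaryDirectory() as tmp:
    template_dir = Path(tmp)
    (template_dir / "template.json").write_text(
        json.dumps({"type": "Collection", "id": "chesapeake-lc-7", "description": PLACEHOLDER}),
        encoding="utf-8",
    )
    (template_dir / "description.md").write_text("Example description text.", encoding="utf-8")
    rendered = render_collection(template_dir)

print(rendered["description"])
```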
chesapeake_lulc.py#
This is the code file that contains the subclass of pctasks.dataset.collection.Collection. This dataset uses a stactools package for Item creation, so the code simply calls out to that stactools package to create an Item from an asset:
from typing import List, Union

import pystac
from stactools.chesapeake_lulc.stac import create_item

from pctasks.core.models.task import WaitTaskResult
from pctasks.core.storage import StorageFactory
from pctasks.dataset.collection import Collection


class ChesapeakeCollection(Collection):
    @classmethod
    def create_item(
        cls, asset_uri: str, storage_factory: StorageFactory
    ) -> Union[List[pystac.Item], WaitTaskResult]:
        storage, asset_path = storage_factory.get_storage_for_file(asset_uri)
        href = storage.get_url(asset_path)
        item = create_item(href, read_href_modifier=storage.sign)
        return [item]
See the chesapeake-lulc stactools package for an example of how to create a stactools package. It’s recommended that any public dataset ingestion starts with a stactools package, which allows community involvement in the generation of STAC for public datasets.
requirements.txt#
This file contains the requirements that are needed to run the code in the chesapeake_lulc.py file. Since this
is declared in the code: block of the dataset.yaml file, this requirements file, along with the code file, will be
uploaded and then transferred to task runners, which will install the dependencies before running any task.
Because installation of dependencies can be time consuming, you can speed up the running of dataset tasks by
creating and publishing a Docker image that already contains the requirements and code. The image must be available
for the task runner to docker pull. In that scenario, you would not need to supply a code: block in the
dataset configuration.
Running the dataset#
With the above configuration file, code files, and collection templates, you can ingest the dataset into the development or deployed PCTasks system.
With the appropriate profile set, use:
> pctasks dataset ingest-collection -d dataset.yaml -c chesapeake-lc-7 --submit
> pctasks dataset process-items -d dataset.yaml -c chesapeake-lc-7 test-ingest --limit 1 --submit
The above process-items command will limit the number of Items processed for testing.
Also note that, because our dataset has multiple collections, you need to pass in the
-c argument with the collection ID. If your dataset only has a single collection, this
option is not required. The --submit option will submit the generated workflow to PCTasks;
if it is not supplied, the workflow will be printed to stdout, and it can be submitted later through
the pctasks workflow submit command.
Dynamic updates#
You can generate a workflow from a dataset.yaml for use in dynamic updates by passing the
--is-update-workflow flag to pctasks dataset. This will append a --since argument
to the workflow’s args.
You can register the generated workflow with pctasks:
> pctasks workflow update ...
And then set up a cron job or some other system to call it, providing --since as an argument:
> pctasks workflow submit ... --arg since $(date -d '-1 day' '+%Y-%m-%dT%H:%M:%S')
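If the scheduler calling the workflow is written in Python rather than shell, the same timestamp can be computed with the standard library. The sketch below mirrors the date invocation above (local time, one day back, same format); the function name since_argument is hypothetical, and passing the result to pctasks is left to whatever mechanism you use to invoke the CLI.

```python
from datetime import datetime, timedelta
from typing import Optional


def since_argument(now: Optional[datetime] = None, days: int = 1) -> str:
    """Return a timestamp `days` before `now`, formatted as YYYY-MM-DDTHH:MM:SS."""
    # Default to the current local time, matching the behavior of `date`.
    now = now or datetime.now()
    return (now - timedelta(days=days)).strftime("%Y-%m-%dT%H:%M:%S")


if __name__ == "__main__":
    print(since_argument())
```

Accepting now as a parameter keeps the function easy to test with a fixed date.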