# Creating a dataset

This guide walks through creating a new dataset for processing in PCTasks, using an existing dataset as an example. The example can be found in the [datasets/chesapeake_lulc](https://github.com/microsoft/planetary-computer-tasks/tree/main/datasets/chesapeake_lulc) folder.

## dataset.yaml

In the `datasets/chesapeake_lulc` folder, you'll see a file called `dataset.yaml`. This is the YAML configuration that tells PCTasks everything it needs to know to create workflows for ingesting this dataset. It includes information about the STAC Collections for this dataset, where the assets are, what tokens are needed, what code files are needed to process the dataset, etc. It also includes sections that are the same as in a PCTasks workflow and task configuration, such as `image`, `code`, and `environment`. These sections are forwarded directly into the workflows that are generated by running `pctasks dataset` commands against the dataset configuration file.

The file in full is here; we will walk through the sections below:

```yaml
name: chesapeake_lulc
image: ${{ args.registry }}/pctasks-task-base:latest

args:
  - registry

code:
  src: ${{ local.path(./chesapeake_lulc.py) }}
  requirements: ${{ local.path(./requirements.txt) }}

environment:
  AZURE_TENANT_ID: ${{ secrets.task-tenant-id }}
  AZURE_CLIENT_ID: ${{ secrets.task-client-id }}
  AZURE_CLIENT_SECRET: ${{ secrets.task-client-secret }}

collections:
  - id: chesapeake-lc-7
    template: ${{ local.path(./collection/chesapeake-lc-7) }}
    class: chesapeake_lulc:ChesapeakeCollection
    asset_storage:
      - uri: blob://landcoverdata/chesapeake
        token: ${{ pc.get_token(landcoverdata, chesapeake) }}
        chunks:
          options:
            name_starts_with: lc-7/
            chunk_length: 1000
    chunk_storage:
      uri: blob://landcoverdata/chesapeake-etl-data/pctasks-chunks/lc-7/
  - id: chesapeake-lc-13
    template: ${{ local.path(./collection/chesapeake-lc-13) }}
    class: chesapeake_lulc:ChesapeakeCollection
    asset_storage:
      - uri: blob://landcoverdata/chesapeake
        token: ${{ pc.get_token(landcoverdata, chesapeake) }}
        chunks:
          options:
            name_starts_with: lc-13/
            chunk_length: 1000
    chunk_storage:
      uri: blob://landcoverdata/chesapeake-etl-data/pctasks-chunks/lc-13/
  - id: chesapeake-lu
    template: ${{ local.path(./collection/chesapeake-lu) }}
    class: chesapeake_lulc:ChesapeakeCollection
    asset_storage:
      - uri: blob://landcoverdata/chesapeake
        token: ${{ pc.get_token(landcoverdata, chesapeake) }}
        chunks:
          options:
            name_starts_with: lu/
            chunk_length: 1000
    chunk_storage:
      uri: blob://landcoverdata/chesapeake-etl-data/pctasks-chunks/lu/
```

### Templating

You'll notice the usage of `${{ ... }}` in the dataset YAML for various values. This represents a templated value that is dynamically computed by PCTasks, either on the client or server side. See the [](../user_guide/templating) user guide for more information about templating.

### name

```yaml
name: chesapeake_lulc
```

This section simply gives the dataset a name. Use an ID that can safely be used in file paths (for example, no spaces).

### image and args

```yaml
image: ${{ args.registry }}/pctasks-task-base:latest

args:
  - registry
```

This section describes the Docker image used to run tasks. The registry portion of the image name is templated from the `registry` argument declared under `args`. The main requirement is that `pctasks.dataset` and `pctasks.ingest_task` are installed in the environment of the Docker image. Otherwise, you can use any image to run the task. It's recommended to use `pctasks-task-base` when possible, or an image derived from that base image.

### code

```yaml
code:
  src: ${{ local.path(./chesapeake_lulc.py) }}
  requirements: ${{ local.path(./requirements.txt) }}
```

The `code` section allows you to specify a local code file or package that should be uploaded and made available to the task runner when executing tasks. You can also supply a local `requirements.txt` file that lists dependencies that should be installed before running tasks.
If installing dependencies will take a significant amount of time, it is recommended that you instead create and publish a Docker image with those dependencies installed and use that image to speed things up. See [](../user_guide/runtime) for more details.

### environment

```yaml
environment:
  AZURE_TENANT_ID: ${{ secrets.task-tenant-id }}
  AZURE_CLIENT_ID: ${{ secrets.task-client-id }}
  AZURE_CLIENT_SECRET: ${{ secrets.task-client-secret }}
```

The `environment` section lets you inject environment variables into each task that is issued for the dataset. In this case, we're injecting the Azure SDK credentials for tasks. These environment variables are provided to every task, regardless of whether a specific task uses them.

Here, the variable values use the `${{ secrets.* }}` template group to retrieve secret values. See [](../user_guide/templating.md#secrets) for more details about secrets.

### task_config

Although not included in the example being reviewed here, custom task-level configuration specific to one of the collections defined in the `dataset.yaml` file can be defined in a `task_config` section. Currently, only the assignment of tags to tasks is supported. For example, to specify that the high memory Azure Batch pool should be used for the `create-items` task of the `chesapeake-lc-13` collection, we can define an appropriate tag on the `create-items` task for that collection by adding the following section to the `dataset.yaml` file:

```yaml
task_config:
  chesapeake-lc-13:
    create-items:
      tags:
        batch_pool_id: high_memory_pool
```

Note that if a tag generated in code conflicts with one defined in the `task_config` section of the `dataset.yaml` file, the `task_config` tag value takes precedence.

### collections

The `collections` element is a list of collection configurations. If your dataset only has one collection, there will be a single object listed here.
We'll look at the first collection object as an example; the rest are similar:

```yaml
- id: chesapeake-lc-7
  template: ${{ local.path(./collection/chesapeake-lc-7) }}
  class: chesapeake_lulc:ChesapeakeCollection
  asset_storage:
    - uri: blob://landcoverdata/chesapeake
      token: ${{ pc.get_token(landcoverdata, chesapeake) }}
      chunks:
        options:
          name_starts_with: lc-7/
          chunk_length: 1000
  chunk_storage:
    uri: blob://landcoverdata/chesapeake-etl-data/pctasks-chunks/lc-7/
```

#### id

```yaml
id: chesapeake-lc-7
```

This is the STAC Collection ID. For each STAC Item that is processed, this value is set as the Item's `collection` property if none is set; if the Item's `collection` is already set and does not match this value, an error is raised. This must also match the ID in the collection template.

#### template

```yaml
template: ${{ local.path(./collection/chesapeake-lc-7) }}
```

This is the path to the directory containing the Collection template; the contents of that directory are described in the "Collection templates" section below. Note that here we use the PCTasks template function `local.path` to specify the path to the template directory relative to the location of the `dataset.yaml` file.

#### class

```yaml
class: chesapeake_lulc:ChesapeakeCollection
```

The `class` property points PCTasks to the `pctasks.dataset.collection.Collection` subclass that will be used to process STAC Items. This must be a class accessible in the Python path of the task execution environment, either through the code or packages described in the `code` configuration block above, packages installed in the Docker image, or core PCTasks implementations such as `pctasks.dataset.collection:PremadeItemCollection`. The implementation of the class is described below.

#### asset_storage

There can be multiple asset storage configurations, each describing where assets for the dataset exist. They are specified as list elements of the `asset_storage` property.
We'll look at the single `asset_storage` configuration here:

```yaml
uri: blob://landcoverdata/chesapeake
token: ${{ pc.get_token(landcoverdata, chesapeake) }}
chunks:
  options:
    name_starts_with: lc-7/
    chunk_length: 1000
```

The `uri` is the PC URI to the assets (currently, this must be a URI of the form `blob://storage_account/container(/prefix)`).

The `token` provides a SAS token to access the assets. Any token provided with asset storage will be available to tasks through the {ref}`StorageFactory` mechanism.

The `chunks` section describes how the dataset assets get translated into "chunk files". Chunk files are simple lists of asset URIs that are used to break data processing work into groups that can be processed in parallel. This section would also contain information defining the "splits" that parallelize the creation of chunk files, though this particular dataset does not use splits, so no split options are defined. See [](../user_guide/chunking) for more information about splits and chunk files.

#### chunk_storage

```yaml
chunk_storage:
  uri: blob://landcoverdata/chesapeake-etl-data/pctasks-chunks/lc-7/
```

This section defines where chunk files will be stored. You can supply a read/write SAS token in this configuration, but in this example we only specify the URI, which requires that the service principal whose credentials are set in the `environment` section has read/write access to this container. See [](../user_guide/chunking) for more information about chunk files.

## Collection templates

A Collection template is a directory that has two files: `template.json`, which is a STAC Collection JSON with a templated description value, and `description.md`, which contains the text that will be templated into the Collection JSON. For example, using

```shell
> pctasks dataset ingest-collection -d datasets/chesapeake_lulc/dataset.yaml -c chesapeake-lc-7 --submit
```

will submit a task to write the collection to the database.
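
To make the relationship between the two files in a template directory concrete, here is a minimal sketch of how a description could be merged into the Collection JSON. It is not the PCTasks implementation: the placeholder string `{{ collection.description }}` and the helper names `render_collection` and `render_collection_dir` are assumptions made for illustration.

```python
import json
from pathlib import Path

# Assumption: the placeholder text used for the description in template.json.
DESCRIPTION_PLACEHOLDER = "{{ collection.description }}"


def render_collection(template_json: str, description_md: str) -> dict:
    """Return the Collection as a dict with the description text filled in."""
    collection = json.loads(template_json)
    # Substitute after parsing so that quotes and newlines in the Markdown
    # cannot produce invalid JSON.
    collection["description"] = collection["description"].replace(
        DESCRIPTION_PLACEHOLDER, description_md.strip()
    )
    return collection


def render_collection_dir(template_dir: str) -> dict:
    """Read template.json and description.md from a template directory."""
    root = Path(template_dir)
    return render_collection(
        (root / "template.json").read_text(encoding="utf-8"),
        (root / "description.md").read_text(encoding="utf-8"),
    )
```

Under these assumptions, calling `render_collection_dir("collection/chesapeake-lc-7")` from the dataset directory would return the Collection dict with its `description` taken from `description.md`.
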
Note that if your dataset only has a single collection, you do not have to supply the `-c` option. Also, if you are in the dataset directory and your dataset file is named `dataset.yaml`, you do not have to supply the `-d` option.

## chesapeake_lulc.py

This is the code file that contains the subclass of [pctasks.dataset.collection.Collection](../reference/generated/pctasks.dataset.collection.Collection). This dataset uses a [stactools package](https://stactools-packages.github.io/) for Item creation, so the code simply calls out to that stactools package to create an Item from an asset:

```python
from typing import List, Union

import pystac
from stactools.chesapeake_lulc.stac import create_item

from pctasks.core.models.task import WaitTaskResult
from pctasks.core.storage import StorageFactory
from pctasks.dataset.collection import Collection


class ChesapeakeCollection(Collection):
    @classmethod
    def create_item(
        cls, asset_uri: str, storage_factory: StorageFactory
    ) -> Union[List[pystac.Item], WaitTaskResult]:
        # Resolve the asset URI to a storage object and a path within it.
        storage, asset_path = storage_factory.get_storage_for_file(asset_uri)
        href = storage.get_url(asset_path)
        # Delegate Item creation to the stactools package; storage.sign is
        # passed as the modifier applied to the href when reading the asset.
        item = create_item(href, read_href_modifier=storage.sign)
        return [item]
```

See the [chesapeake-lulc stactools package](https://github.com/stactools-packages/chesapeake-lulc) for an example of how to create a stactools package. It's recommended that any public dataset ingestion starts with a stactools package, which allows community involvement in the generation of STAC for public datasets.

## requirements.txt

This file contains the requirements needed to run the code in the `chesapeake_lulc.py` file. Since it is declared in the `code:` block of the `dataset.yaml` file, this requirements file, along with the code file, will be uploaded and then transferred to task runners, which will install the dependencies before running any task.
Because installation of dependencies can be time consuming, you can speed up the running of dataset tasks by creating and publishing a Docker image that already contains the requirements and code. The task runner must be able to `docker pull` the image. In that scenario, you would not need to supply a `code:` block in the dataset configuration.

## Running the dataset

With the above configuration file, code files, and collection templates, you can ingest the dataset into the development or deployed PCTasks system. With the appropriate profile set, use:

```shell
> pctasks dataset ingest-collection -d dataset.yaml -c chesapeake-lc-7 --submit
> pctasks dataset process-items -d dataset.yaml -c chesapeake-lc-7 test-ingest --limit 1 --submit
```

The above `process-items` command limits the number of Items processed, which is useful for testing. Also note that, because our dataset has multiple collections, you need to pass in the `-c` argument with the collection name. If your dataset only has a single collection, this option is not required. The `--submit` option submits the generated workflow to PCTasks; if it is not supplied, the workflow is printed to stdout, and it can be submitted later through the `pctasks workflow submit` command.

## Dynamic updates

You can generate a workflow from a `dataset.yaml` for use in dynamic updates by passing the `--is-update-workflow` flag to `pctasks dataset`. This appends a `since` argument to the workflow's `args`.

You can register the generated workflow with pctasks:

```shell
> pctasks workflow update ...
```

And then set up a cron job or some other system to call it, providing `since` as an argument:

```shell
> pctasks workflow submit ... --arg since $(date -d '-1 day' '+%Y-%m-%dT%H:%M:%S')
```
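
If the scheduler that triggers the update is written in Python rather than shell, the `since` value can be computed without the `date` utility. The following is a minimal sketch of that idea; the helper names `since_timestamp` and `since_cli_args` are hypothetical and not part of PCTasks, and the elided portion of the submit command is intentionally left out.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def since_timestamp(days: int = 1, now: Optional[datetime] = None) -> str:
    """Format the time `days` before `now` like date '+%Y-%m-%dT%H:%M:%S'."""
    # Like `date -d '-1 day'`, default to the current local time.
    reference = now if now is not None else datetime.now()
    return (reference - timedelta(days=days)).strftime("%Y-%m-%dT%H:%M:%S")


def since_cli_args(days: int = 1, now: Optional[datetime] = None) -> List[str]:
    """Build only the `--arg since <value>` fragment of the submit command."""
    return ["--arg", "since", since_timestamp(days, now)]


if __name__ == "__main__":
    # Print the fragment to append to `pctasks workflow submit ...`.
    print(" ".join(since_cli_args()))
```

Passing `now` explicitly keeps the helper deterministic, which makes it easier to test the lookback window independently of the scheduler.
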