Note
This technote is a historical artifact that does not represent an active proposal. It was a useful thought experiment, and some of the ideas may be revived in the future, but overall we have decided to move in a very different direction for the use cases this design was intended to support.
The pipetask tool included in the Gen3 middleware for running PipelineTasks is often opaque and hard to use. This is partly the result of history; the original interface was much more straightforward, but it assumed a very different collections model than what Butler has now, and the changes made to adapt it to the new collections system did not go deep enough.
But it also suffers from the fact that many of its steps are “leaky”; information provided in one step (e.g. input collections used for QuantumGraph) cannot be straightforwardly passed to later steps (e.g. execution of that QuantumGraph), forcing it to rely on the user to provide consistent arguments and limiting the guarantees we can make about behavior in the presence of user error.
Finally, the pipetask interface is currently only really usable from the command-line, making it hard to write unit tests for concrete PipelineTasks; even if we were to keep the command-line interface the same, we need to refactor the implementation to provide a usable Python API.
This technote will provide a detailed design sketch for a new pipetask interface that attempts to address these problems, centered around a git-like “campaign directory” concept that will be used to save state between different pipetask invocations.
Preliminaries¶
The reader is assumed to be familiar with the Gen3 butler’s main organizational concepts (dimensions, data IDs, datasets, dataset types, and collections), the types of collections, and the basic functionality of the current pipetask tool.
The “campaign” terminology used here is very much open to discussion; it’s entirely possible that the concept it describes here does not match the unit of processing that “campaign” evokes for those who have used that term in the past.
On a similar note, while I have not proposed a different name for the pipetask tool itself, this is a good time to consider whether we want one; I certainly think the best way to implement something like the design described here would be to do it alongside the current pipetask and deprecate that only when the new one is relatively complete.
This document will probably be hard to understand in a single-pass read-through, though I can make some recommendations:
- The next section, Conceptual Overview, should definitely be read first.
- Campaign State should be read next, though this contains a few forward-references for details (which should mostly be ignored on first read). Expect to go back to this section frequently when reading later sections.
- The Campaign Class and Subcommands contains numerous references to Common Option Groups that are often necessary to really understand what a method can and cannot do; readers should probably skim both of these sections before trying to understand either of them in detail. When reading them in detail, it will probably be helpful to have views of both Common Option Groups and The Campaign Class and Subcommands open at the same time.
Unfinished Business¶
While this document represents a fairly complete snapshot of my thinking at one point, some discussions since the last substantive changes have raised some possibilities I’d want to explore further before considering even myself happy with this proposal. But it’s also past time for me to get this out for review to others, especially considering that there may be other new big ideas out there that should be considered before making any major changes here. So I’ll just leave most of the content as-is, and list here the big ideas I’d want to explore further:
- Creating a SQLite staging registry (even for local execution), by transferring input datasets needed from an “upstream” shared registry after
QuantumGraphgeneration. Datasets would only be pushed back to the upstream repo if explicitly requested. This could be a one-step way to provide a lot more insulation between users, and hence make the shared repo access model easier to implement (maybe the only non-admin write operations we need to support are ingest and import?) while also giving users more power to play in their own sandboxes. It also opens up a lot of intriguing possibilities for campaign features (more on that in later bullets). I think the big complications this raises are about how to handle multiple transfers from upstream (if needed), when the staging repo doesn’t start out empty and there’s potential for conflicts, especially whenQuantumGraphgeneration is run multiple times over the life of the campaign. - I don’t think I’ve tried quite hard enough to remove functionality here that isn’t obviously needed in the name of simplification. One concrete example is that I don’t think anyone really needs to be able to change the input collections after a campaign is created, though it’s also impossible to rigorously guard against the possibility that input collection content may have changed (unless we only care about changes to a local staging repo!), so that may not be as much of a simplification as I hope.
- We could at least sometimes initialize a git repository in the campaign directory and commit every time we run
pipetask. That would provide a really nice history-tracking and rollback feature almost for free, though it’s not quite as useful if we don’t also go in the direction of having a local SQLite registry that we could also git-control, and it doesn’t seem like something that’s viable for large campaigns (unless we can define a sound subset of the campaign content that’s still worth putting in git). - Michelle Gower brought up the possibility of putting the campaign metadata into the registry database, as that’s desirable for long-term control and management of that at least in the BPS context.
I’m a bit wary of expanding
butler/pipetaskto centrally control even more things (developer campaigns may be quite different from production campaigns in that respect - sandboxes vs. data releases). But combined with a local SQLite repo, so the metadata for sandbox campaigns doesn’t get pushed upstream until/unless a run is pushed back, it’s very intriguing. - The possibility of putting log files in the Registry as well was also raised by Michelle.
I think we’d all agree that relating logs to their processing runs in the registry database makes sense, but I think it’s also pretty clear that
getandputare not ideal interfaces for them (we could work with them, I imagine, but it seems like forcing/encouraging log access through those is a negative-value proposition). - K-T suggested looking at third-party tooling for more general data science workflows for ideas, such as MLFlow. I didn’t have a chance to get to this before we decided to accelerate getting the document out for review.
Conceptual Overview¶
Campaigns¶
A “campaign” is a directory with (at least) a .campaign.json file that tracks the what will be run, and how.
As processing proceeds, this will also be the (default) directory where pipeline definitions, QuantumGraphs, and logs are written.
The same campaign is generally used for multiple “runs”, each of which usually corresponds to its own output RUN-type collection.
An important aspect of this design is that the .campaign.json information does not attempt to track what has been run; it may frequently do so incidentally, because our best guess at what to run next is often what we just ran, but actual provenance will always be stored in the data repository and managed by butler.
A campaign directory may be in a non-POSIX location (reads and writes will use lsst.daf.butler.ButlerURI), and ideally we would make it possible (but not necessary) for this to be the same as the directory that maps to the common prefix of the RUN collections that hold butler-managed output datasets, even though many of the files we’ll write to it (certainly .campaign.json and logs) are not butler datasets.
The Python API will center around a Campaign class whose instances each represent a campaign directory.
Campaign methods mostly correspond one-to-one with pipetask subcommands, and these are described together in The Campaign Class and Subcommands.
pipetask interacts with a campaign in the same way that git interacts with a local repository: one special command, Campaign.init, is first invoked to set up the campaign, and all others are then invoked from within the campaign directory and then do not need to repeatedly provide the information stored within it by previous invocations.
Unlike git, however, commands other than Campaign.init can also be passed options that create or fundamentally modify a campaign on-the-fly, in order to allow simple processing to be performed with a single (albeit verbose) command.
The analogy also does not extend to the term “repository”; a git repository is analogous to a campaign, not a butler data repository.
The .campaign.json file is a hidden file (and JSON rather than YAML) to reflect the fact that it should be generally be manipulated by the pipetask tool, not humans running their favorite text editor.
Campaign Flow and Lifecycle¶
The big-picture steps involved in executing processing pipelines are shown in the figure below:
The dependencies in this figure only show the simple case of running a pipeline once, however, and much of the complexity of the problem comes from the fact that users usually want to run the same pipeline (or many closely-related pipelines) many times, for different reasons:[1]
- to fix a problem, or just get something running at all for the first time;
- to run the same pipeline on more data IDs;
- to run additional tasks;
- any combination of the above.
There are no general rules about what happens when the user revisits one of the previous steps after performing a later one; each case is different and needs to be thought through carefully.
In some cases, we may need to rely on the user for extra information: for example, if the user changes a configuration option after generating the QuantumGraph, do we need to regenerate it?
Or can we just re-run the existing graph?
At present, there’s no way for the software to tell whether a configuration (or software change, for that matter) would affect the graph; we must rely on the user.
There is also at least one case where users have good reasons to prefer different orders of operations, even if starting from the beginning:
- Users who just want to get something working will generally want to build a
QuantumGraphbefore creating an output collection and writing/checking provenance, to fail as early as possible (and avoiding writing anything to the repo). - Users who expect to run multiple
QuantumGraphsin a campaign while writing results to the same output collection (especially in batch contexts) will often want to create that collection up front to avoid race condition.
Finally, users in at least some contexts have a strong expectation that they will be able to perform all of these steps (or, rather, arbitrary subsets!) with a single command-line invocation.
This design mostly attempts to meet that expection, by mapping steps to keyword arguments/command-line options as well as methods/subcommands.
For example, one can use Campaign.edit to set the input collections (collections.inputs) without updating anything downstream, but also use the same --input option in Campaign.run to change them at that stage (or set them for the first time if starting from scratch).
It’s worth questioning whether the right design decision is to instead [try to] push back on that single-invocation expectation in the name of simplicity; that’s just not something I’ve done here (and I suspect at least a couple of single-invocation use cases are really quite well-motivated).
| [1] | The use case of running similar-but-not-identical pipelines on the same data IDs in order to compare their outputs is intentionally not included here, because that isn’t something that should be done within one campaign; this is a use case best handled by using a different campaign for each pipeline (and possibly importing a QuantumGraph). |
The Collection Stack¶
The collections associated with a campaign are organized largely as a stack - in the first-in, last-out data structure sense.
This idea is already lurking behind the current pipetask, but one of the goals in this redesign proposal is to make it more explicit in both the terminology (e.g. --push and Campaign.pop) and the documentation as a way to give users a better mental model of what is going on.[2]
The top of the collection stack is what’s searched first for input datasets, and it starts with the current output RUN-type collection, if there is one (see collections.current_run, below).
It proceeds to past RUN-type collections produced as part of the same campaign (collections.past_runs), and ends with the pure-input collections (collections.inputs).
When the campaign is configured to create a CHAINED-type collection, the definition of the collection is exactly that sequence.
When we do processing as part of a campaign, we’ll often push a new RUN-type collection to the top of the stack (I imagine this being the most common operation when extending the pipeline to new PipelineTasks).
We can instead add more datasets to the collection that is already at the top of the stack (this is more common when adding new data IDs only).
And, finally, we can pop the top collection and push a new one (--replace) or even pop all of them (--restart), which is the mode I expect developers to use when first getting something working or debugging problems.
| [2] | I’m not actually sure that most Science Pipelines developers or external science users are super familiar with stacks in the data structure sense, because many of us have only informal programming backgrounds, but it’s a sufficiently ubiquitous and simple concept that I still think it’s worth asking people to learn about it in order to understand pipetask in detail. |
Future Extensions¶
This proposal does not include any kind of BPS integration, just because it’s a big proposal already.
I do still hope that we can integrate the BPS command-line interface with this one, e.g. via some kind of subcommand-extension system that would add batch-submission subcommands for different batch systems, with roughly the same prerequisites and options as the Campaign.run method/subcommand described later.
In Python, I am vaguely imagining an ABC for per-campaign, per-batch-system state, and that a Campaign object would have a container of concrete instances of these.
It is also possible that adding quantum-level provenance to processing will have a bigger impact on this design than I am anticipating.
That would allow us to write per-quantum configuration or even software versions, rather than per-RUN.
I suspect we will want to at least write per-RUN provenance datasets as well, and I think that means the impact will be small.
Campaign State¶
The schema for the .campaign.json file is presented as a flat list below; .-separated names indicate hierarchies in the actual JSON form.
Options are str unless marked as some other type.
It is expected that the Campaign class will have nearly identical state, but the detailed form it will take (dict? nested dataclasses?) is unspecified.
-
version¶ version triplet for the campaign format. Always present. Cannot be changed after the campaign is created.
-
name¶ Name of the campaign. Always present; defaults to the directory name if that is a valid name (e.g. not
.). Cannot be changed after the campaign is created.
-
doc¶ Documentation for the campaign. Always present; defaults to
"".
-
repo¶ URI to the data repository. Always present, no default, never
null. Cannot be changed after the campaign is created.
-
collections.inputs¶ Type: list[str]List of input collections. May be absent, but is required to be present (or populated on-the-fly) by some subcommands.
-
collections.chain¶ Name of the
CHAINEDinput/output collection.Always present; defaulted to
nameif not provided when campaign is created. May be set tonull, but does not default tonull. Setting it tonulldoes not automatically delete the collection if it has already been created, butCampaign.cleanwill delete it.The child collections are set to the sequence
(current_run, *past_runs, *inputs)whenevercurrent_runis updated.
-
collections.current_run¶ Name for a current
RUN-type output collection that already exists and should generally be used by the next step that writes datasets. This entry is often absent ornull(these are equivalent), to indicate that steps that write datasets should create a newRUN-type collection instead.
-
collections.next_run¶ Name or name pattern used to set
collections.current_runwhen needed. May contain placeholders, including%tto insert a timestamp,%nto insert a per-campaign counter value, and%cto insert the campaign name. Always present; defaults to%c/%t.
-
collections.past_runs¶ Type: list[str]Previous RUN-type collections created as part of this campaign, ordered from the most recent to the oldest. Always present; defaults to an empty list.
-
collections.counter¶ Type: intInteger counter to insert into output run names with the
%nplaceholder. Always present; defaults to0.
-
pipeline¶ URI to a pipeline YAML definition. May be absent, but is required to be present (or populated on-the-fly) by some subcommands.
-
quantum_graph.uri¶ URI to a saved
QuantumGraphobject. May be absent, but is required to be present (or populated on-the-fly) by some subcommands.
-
quantum_graph.collections¶ Type: list[str]Snapshot of the input collections (both
collections.past_runsandcollections.inputs, concatenated) used to build or refresh theQuantumGraph.This is
nullif the graph was imported instead of built, and is used to test whether the graph needs to be refreshed or rebuilt prior to execution.
-
quantum_graph.pipeline_fingerprint¶ Hash or checksum of the pipeline (including software versions and configuration) used to build or refresh the graph.
This is
nullif the graph was imported instead of built, and is used to test whether the graph needs to be refreshed or rebuilt prior to execution.
The Campaign Class and Subcommands¶
The Campaign class is used to represent a campaign directory; instances can be contructed from an existing campaign directory and written out to create or modify a campaign directory.
At least in most cases, Campaign methods correspond directly to pipetask subcommands, and both are described together in the method documentation below.
Most subcommands are expected to be implemented in 2-3 lines, aside from the translation of command-line options to keyword function arguments:
- a call to
Campaign.initor (usually)Campaign.loadto construct theCampaigninstance; - a call to the method that corresponds directly to the subcommand;
- a call to
Campaign.saveto write the updated campaign to disk.
Because most parameters are common to multiple methods/subcommands, these are described in detail later in Common Option Groups, using command-line option syntax instead of method parameter syntax to flesh out the command-line interface further.
Keyword argument names are just the long option names with - replaced by _.
-
class
Campaign¶ -
static
init(repo: ButlerURI, **kwargs) → Campaign¶ Create and return a new
Campaigninstance.This method corresponds directly to the
initsubcommand:pipetask init REPO URI <OPTIONS>
That should be implemented simply as:
campaign = Campaign.init(REPO, **kwargs) campaign.save(URI)
(with
**kwargsgenerated from command-line options).Option groups:
- Campaign Definition: The
--repoand--campaign-diroptions are replaced by theREPOandURIpositional arguments for this subcommand (only), but the others are still valid here as-is. TheURIargument is not relevant for the Python method call, because the campaign is not actually written untilsaveis called. - Pipeline Definition: Optional; if not provided, no pipeline information will be present in the campaign (yet).
- Collections: Optional; if not provided, no input collections will be present in the campaign (yet) and output collection names will be set to their default values.
Sequencing:
This operation can only be run when the campaign does not yet exist, and hence before all other operations.
- Campaign Definition: The
-
static
load(uri: ButlerURI) → Campaign¶ Create a
Campaigninstance corresponding to an existing campaign directory.This method has no direct subcommand equivalent, and does not use any of the common option groups.
-
save(uri: ButlerURI)¶ Save the campaign to the given directory URI.
This method has no direct subcommand equivalent, and does not use any of the common option groups.
-
edit(**kwargs)¶ Modify an existing campaign in-place.
This method corresponds directly to the
editsubcommand:pipetask edit <OPTIONS>
This method can be used to set all campaign information that can be specified in
init, but it can be used on existing campaigns.Option groups:
Sequencing:
Can be run at any time, and can create a new campaign if one does not exist.
-
status(**kwargs)¶ Print information about the current state of the pipeline to STDOUT.
This method corresponds directly to the
statussubcommand:pipetask status <OPTIONS>
Option Groups:
- Pipeline Definition:
--pipeline-dotonly, and only if the campaign already contains a pipeline. - QuantumGraphs:
--qg-dotonly, and only if the campaign already contains aQuantumGraph.
Other Options: TODO
- Pipeline Definition:
-
build_quantum_graph(**kwargs)¶ Build a
QuantumGraphfor the campaign.This method corresponds directly to the
qg buildsubcommand:pipetask qg build <OPTIONS>
Option groups:
QuantumGraphs, except:
--allow-pruning(pruning is a fundamental part of building a graph and cannot be disabled)--refresh(a graph is implicitly refreshed when it is built, so other options normally enabled by--refreshare allowed).
Sequencing:
Can be run at any time, can create a new campaign if one does not exist, and can edit the campaign’s collections and/or pipeline.
A pipeline and input collections must be provided here or already present in the campaign.
If
collections.current_runis set, it is ignored; onlycollections.past_runsandcollections.inputsare used as inputs tolsst.pipe.base.QuantumGraphgeneration.Downstream operations can be passed
--rebuildto perform this operation on-the-fly.
-
import_quantum_graph(uri: ButlerURI, **kwargs)¶ Import an existing
QuantumGraphinto the campaign.This method corresponds directly to the
qg importsubcommand:pipetask qg import <URI> <OPTIONS>
Option groups:
QuantumGraphs, except
--data-query- Passing
--refreshto this method/subcommand performs the refresh after the import, not before.
- Passing
Sequencing:
Can be run at any time, and can create a new campaign if one does not exist. Cannot be used to modify the campaign’s pipeline or input collections (because an imported graph essentially supersedes both of these).
This operation sets
quantum_graph.collectionsandquantum_graph.pipeline_fingerprinttonull, which means that later steps will require a resolution option ifcollections.inputsorpipelineare also set.
-
refresh_quantum_graph(**kwargs)¶ Refresh the campaign’s
QuantumGraphby querying again for its input and intermediate datasets.This method corresponds directly to the
qg refreshsubcommand:pipetask qg refresh <OPTIONS>
Refreshing a
QuantumGraphensures that any embeddedDatasetRefobjects are resolved if and only if they can be found in thecollections.inputs,collections.past_runs, andcollections.current_runcollections.A campaign’s
QuantumGraphshould always be (at least) refreshed whenever the collections used to build it are changed. Refreshing the graph can never add new quanta, however; that requires a full rebuild.When an overall-input (i.e. non-intermediate) dataset cannot be resolved (by definition, these datasets must have been resolved when the
QuantumGraphwas originally built) some aspects of the graph generation logic must be re-run, which can result in some quanta being dropped. The--trim-existing-inoption can also be used to drop quanta whose outputs already exist.Option groups:
Campaign Definition:
--campaign-dironly, and only to find an existing campaign.QuantumGraphs, except:
--data-query--extend-qg--refresh(implied, so all options normally enabled by--refreshare allowed).
Sequencing:
Can only be run on an existing campaign that already has a pipeline or a
QuantumGraph.Cannot be used to perform any other operations.
Downstream operations can be passed
--rebuildto perform this operation on-the-fly.
-
register_dataset_types(**kwargs)¶ Register all intermediate and output dataset types that would be written by a pipeline, and check that all input dataset types are consistent with the definitions in the pipeline.
This method corresponds directly to the
register-dataset-typessubcommand:pipetask register-dataset-types <OPTIONS>
The action of this method intentionally cannot be performed by providing options to any other method; registering dataset types is something that should be done only rarely, when they are first defined, and attempting to register them with every
pipetaskinvocation (as is all too easy to do now) is an antipattern that can lead to incorrectly-defined or typo’d dataset types that are hard to clean up.Option groups:
- Campaign Definition:
--campaign-dironly, and only to find an existing campaign. - Discrepancy Resolution Options
Sequencing:
Can only be run on an existing campaign that already has a pipeline or a
QuantumGraph.If both of these are present and are potentially discrepant, a discrepancy resolution option must be provided or another method must be called first to resolve the discrepancy.
Cannot be used to perform any other operations.
- Campaign Definition:
-
prep(**kwargs)¶ Register a new output
RUNcollection, write all “init output” datasets to it, including software versions and configuration for all tasks.This method corresponds directly to the
prepsubcommand:pipetask prep <OPTIONS>
This method creates the
collections.current_runcampaign entry if it does not exist and does not clear it when finished, indicating that the next dataset-writing step should write to that same collection. Ifcollections.current_rundoes already exist, it writes init output datasets if they do not exist and checks them for consistency if they do. Ifcollections.chainis notnull, it also [re]registers and [re]defines that collection.Option groups:
Execution, except:
-j,--processes: irrelevant, because no quanta are executed.--finish,--no-finish:--no-finishis implied.
Sequencing:
Can be run at any time, can create a new campaign if one does not exist, and can edit the campaign’s collections and/or pipeline.
Either a pipeline and input collections or a
QuantumGraphmust be provided here or already present in the campaign, but aQuantumGraphis never built by this operation.If both of these are present and are potentially discrepant, a discrepancy resolution option must be provided or another method must be called first to resolve the discrepancy.
-
run(**kwargs)¶ Run the campaign’s
QuantumGraph, creating it if needed.This method corresponds directly to the
RUNsubcommand:pipetask run <OPTIONS>
This operation will create a
QuantumGraphif one does not exist, but does not require the campaign to have a pipeline if it has aQuantumGraph(which thus must have been imported).High-level interfaces like this method and subcommand should always invoke
prepbefore actually running any quanta (but after creating theQuantumGraph, if one does not exist). This ensures that the outputRUN-type collection exists and that any provenance datasets it holds are consistent with the current configuration and environment. We also need a lower-level interface (at least in Python; maybe on the command-line, too, perhaps as a completely different executable) that instead assumes thatcollections.current_runexists and holds the right provenance datasets, for use by e.g. batch jobs that just want to run some already-exising quanta, but it’s important that those interfaces are only called by higher-level code that itself ensures thatprepis called appropriately.Running a pipeline does not re-resolve any resolved overall-input
DatasetRefsembedded in theQuantumGraph, and hence it ignorescollections.inputsentirely unless the graph is being [re]built or explicitly refreshed. Embedded inputsDatasetRefsthat correspond to intermediates that are being regenerated (i.e. their quanta are not being skipped) are re-resolved before that quantum is executed, in order to pick up datasets produced since execution began.Option groups:
Sequencing:
Can be run at any time, can create a new campaign if one does not exist, and can edit the campaign’s collections and/or pipeline. Will create a
QuantumGraphif one does not exist.If both of these are present and are potentially discrepant,
--refresh,--rebuild, or a discrepancy resolution option must be provided.
-
pop(n: int = 0, **kwargs)¶ Drop existing
RUN-type collections from the campaign and redefine itsCHAINEDcollection (if one exists) accordingly.This method corresponds directly to the
popsubcommand:pipetask pop [INT] <OPTIONS>
If
n == 0(default),collections.current_runis cleared if it is set. Ifn > 0, the firstncollections incollections.past_runsare also removed.If
collections.chainis notnull, theCHAINED-type collection for this campaign is updated.Option groups:
- Campaign Definition:
--campaign-dironly, and only to find an existing campaign. - Collections:,:
--unstoreand--purgeonly.
Sequencing:
Can only be run on an existing campaign that already has
collections.current_runand/orcollections.past_runsset (any collection that would be dropped by this operation must exist; anything else is an error that does not affect the repo at all).- Campaign Definition:
-
clean(purge: bool = False)¶ Remove datasets and possibly collections were created by this campaign but have since been dropped.
This method corresponds directly to the
cleansubcommand:pipetask clean <OPTIONS>
This operation computes the “dropped” collections as those that are in
collections.createdbut not (currently) in any ofcollections.chain,collections.past_runs,collections.inputs, orcollections.current_run.If possible, we should make this remove directories that correspond to unstored
RUN-type collections, especially if those are in the campaign directory themselves.Option groups:
- Campaign Definition,
--campaign-dironly, and only to find an existing campaign. - Collections,
--purgeonly (--unstoreis the implied default behavior).
Sequencing:
Can only be run on an existing campaign.
It is not an error to run this when it would do nothing.
- Campaign Definition,
-
static
Common Option Groups¶
Campaign Definition¶
These options are used to provide the core campaign definition information.
-
--repo<URI>¶ Data repository URI; sets
repoin.campaign.json. Required whenever creating a new campaign.
-
--campaign-dir<URI>¶ Campaign directory.
Except where otherwise noted, this option is optional if and only if the current working directory is a campaign directory.
-
--campaign-name<NAME>¶ Name of the campaign; sets
namein.campaign.jsonIf used with an existing campaign, its name is modified. If the campaign does not exist and this option is not provided, a name is inferred from its directory. Must be provided if creating a new campaign with a directory that includes.or...
Pipeline Definition¶
These options are used to define, modify, or inspect the pipeline.
The behavior of options that modify the pipeline is specified such that repeated invocations with the same set of options are idempotent.
-
-p<URI>,--pipeline<URI>¶ URI to a pipeline definition file. If the campaign already has a local pipeline, this new pipeline will be added to its imports. If the campaign already has a URI to an external pipeline other than this one, a local pipeline will be created that imports both.
-
-t<LABEL>:<TASK>¶ PipelineTaskto add to the pipeline. This creates a local pipeline if one does not exist. If a URI to an external pipeline exists, it will be imported in the new local pipeline.
-
-c<LABEL>:<PARAMETER>=<VALUE>,--config<LABEL>:<PARAMETER>=<VALUE>¶ Override a
pex_configparameter value. This creates a local pipeline if one does not exist. If a URI to an external pipeline exists, it will be imported in the new local pipeline. If a local pipeline does exist, this is added as a (YAML) config override to it, replacing an existing override for the same option if it exists and creating a section for the label if necessary.
-
-C<LABEL>:<URI>,--config-file<LABEL>:<URI>¶ Apply a
pex_configconfig override file. Affects new and existing pipelines the same way as-c.
-
--instrument<NAME>¶ Set an instrument whose
obs-package config overrides should be loaded. This creates a local pipeline if one does not exist, unless a URI to an external pipeline exists and it already has the same instrument.
-
--pipeline-dot<URI>¶ Write a GraphViz dot diagram for the pipeline graph to the given file.
-
--write-pipeline[<URI>]¶ Write the pipeline YAML file to the given URI, and update the
pipelineentry in.campaign.jsonto point to it. If invoked with no argument, or if not provided but other options require a local pipeline to be created, a default filename (pipeline.yaml) within the campaign directory is used.
Collections¶
These options control the input and output collections.
-
-i<COLLECTION>,--input<COLLECTION>¶ Collections to search for input datasets; sets
collections.inputsin.campaign.json. May be passed multiple times (arguments are concatenated), and multiple collections may be passed together by separating them with commas. Order matters. If a collection that is already incollections.past_runsis included, it is automatically removed fromcollections.past_runs.
-
--prepend-inputs¶ Instead of replacing
collections.inputswith the values given by all-iarguments, prepend them if they are not already included in the existing inputs, and move them to the front if they are already included.
-
--chain<NAME>¶ Name of the
CHAINEDcollection that combines input collections and all output collections; setscollections.chainin.campaign.json.
-
--no-chain¶ Disable creation of the
CHAINEDcollection by settingcollections.chaintonullin.campaign.json.
-
--next-run<NAME>¶ Name for the RUN collection that will directly hold the outputs of the next
RUN-type collection created. Setscollections.next_runin.campaign.json; see that for documentation on placeholders and defaults.
-
--set-counter<INT>¶ Manually set
collections.counterin.campaign.json.
QuantumGraphs¶
-
--qg-dot<URI>¶ Write a GraphViz dot diagram for the QuantumGraph to the given file.
-
--write-qg[<URI>]¶ Write the
QuantumGraphfile to the given URI, and update thequantum_graph.urientry in.campaign.jsonto point to it. If invoked with no argument, or if not provided but other options require a localQuantumGraphto be created, a default filename (using the campaign name) within the campaign directory is used.
-
-d<QUERY>,--data-query<QUERY>¶ Provide a SQL-like query expression that constrains the data IDs of the
QuantumGraph.
-
--extend-qg¶ If the campaign is already associated with a
QuantumGraph, extend it when building or importing a new one, instead of replacing it.
-
--refresh¶ Equivalent to running
Campaign.refresh_quantum_graphimmediately before (usually) or after (where noted) some other method.This can be used to address errors that would otherwise occur because the pipeline (including code edits to local setups) or input collections have changed since the
QuantumGraphwas built, essentially asserting that these changes can trim quanta and changeDatasetRefresolutions, but would not otherwise modify the graph.
-
--trim-existing-in[INPUTS|CAMPAIGN|RUN]¶ Remove quanta from the
QuantumGraphwhen all of their outputs already exist in the given collection category:RUN- Trim a quantum if all of its outputs exist in
collections.current_run; do nothing ifcollections.current_runis not set. CAMPAIGN- Trim a quantum if all of its outputs exist in either
collections.current_runorcollections.past_runs, i.e. anyRUN-type collection produced by this campaign that has not been discarded from it; INPUTS- Trim a quantum if all of its outputs exist in any of
collections.current_run,collections.past_runs, orcollections.inputs.
Except where otherwise noted,
--refreshmust also be passed for this option to be valid.
-
--allow-pruning¶ When refreshing a
QuantumGraph, allow a quantum to be removed if one or more of its input datasets cannot be resolved and thePipelineTaskindicates that the quantum is not viable without them.When this option is not given and an nonviable quantum is found, the refresh operation fails but the campaign and its
QuantumGraphare not modified.Except where otherwise noted,
--refreshmust also be passed for this option to be valid.
-
--allow-empty¶ When building a
QuantumGraphor refreshing one with--allow-pruningor--trim-existing-in, allow the graph to end up with no quanta. When this option is not given, an empty graph is treated as an error condition, and the campaign and itsQuantumGraphare not modified.Except where otherwise noted,
--refreshmust also be passed for this option to be valid.
Execution¶
These options control how quanta are executed and how RUN-type collections are created and manipulated.
Note that many existing pipetask options that are primarily about running individual quanta as part of a larger batch job are not present here; I’m currently thinking that we should really have a separate lower-level command-line tool (and associated Python class) for that simpler user case.
-
-j<INT>,--processes<INT>¶ Number of processes used for local (single-node) execution. Batch-execution extensions are encouraged to use this to control the total number of processes if they have a mode in which that is all that is provided.
-
--finish,--no-finish¶ Controls whether or not to clear
collections.current_runafter all requested quanta are executed successfully, and hence whether the next invocationpipetaskthat writes to aRUN-type collection will use the same one. The default behavior depends on other options and the previous state ofcollections.current_run:- If
collections.current_runwas previously set and is being used (e.g.--pushwas not passed), or if the fullQuantumGraphwas not run, the default is to leavecollections.current_runin place for the next invocation. - If
collections.current_runwas not previously set, or if other options (e.g.--push) were used to create a newRUN-type collection anyway, the default is to clearcollections.current_runso the next invocation will create a newRUN-type collection as well.
- If
-
--push¶ Create a new
RUN-type collection for output datasets created by this method/subcommand. Ifcollections.current_runis not set, this is the default behavior. If it is set, the value ofcollections.current_runis inserted at the front ofcollections.past_runs.
-
--replace¶ Create a new
RUN-type collection for output datasets created by this method/subcommand., droppingcollections.current_run. It is an error to pass this option ifcollections.current_runis not set.
-
--continue¶ If
collections.current_runis not set, remove the first entry fromcollections.past_runs(which must not be empty) and setcollections.current_runto that. Does nothing ifcollections.current_runis already set.
-
--restart¶ Drop all runs in
collections.past_runsandcollections.current_run(if it exists), and create and prep a new one to contain all outputs.
-
--unstore¶ If an output collection is dropped by this action (via
--replace,--restart, orCampaign.pop), remove its dataset artifacts from the datastore only. Not valid if no collections can be dropped by this operation.
-
--purge¶ If an output collection is dropped by this action (via
--replace,--restart, orCampaign.pop), remove the collection and its datasets entirely from both the registry and the datastore. Supersedes--unstore. Not valid if no collections can be dropped by this operation.
-
--skip-existing-in[INPUTS|CAMPAIGN|RUN]¶ Do not execute quanta for which all outputs already exist in the given collection category.
Unlike
--trim-existing-in, this does not modify theQuantumGraph, but the argument choices have the same definition.
-
--rebuild¶ Rebuild the
QuantumGraphbefore running. This option may be passed even when there is no graph (it is ignored).
Discrepancy Resolution Options¶
Some options are used to resolve potential discrepancies between the QuantumGraph and the pipeline and input collections from which it is typically built, when these are set out-of-order or the graph is imported.
These include:
-
--use-task-configs¶ Do not rebuild or refresh the
QuantumGraphbefore running, but use the pipeline’s tasks and configuration instead of those in theQuantumGraph, matching them by label.Note that changes to
collections.inputssince the graph was generated are also ignored when this option is used; those cannot be used unless the graph is at least refreshed.
-
--use-qg-configs¶ Do not rebuild or refresh the
QuantumGraphbefore running, but use as-is, ignoring the tasks in the pipeline completely.Note that changes to
collections.inputssince the graph was generated are also ignored when this option is used; those cannot be used unless the graph is at least refreshed.
Two previously-mentioned options can also be used in some contexts to resolve these discrepancies on-the-fly: