Pipelines

Apache Beam Pipelines #

Pipelines encapsulate the entire data processing task, from reading input data to writing output.

Pipelines are made up of PCollections (the data) and PTransforms (the operations applied to that data).

The runner specifies where and how the pipeline is executed, for example locally or on a managed service such as Dataflow.

Design #

The typical Beam program is structured as follows:

  1. Create a Pipeline object, including execution details such as the pipeline runner.

  2. Create an initial PCollection, either by reading data with an IO connector or by using a Create transform.

  3. Apply PTransforms to each PCollection to transform its elements.

Pseudo-code #

Pseudo-code of a pipeline might look like:

output = (pipeline
    | step1
    | step2
    | step3
)

Each step can be given a descriptive label with the >> operator:

output = (pipeline
    | 'first step' >> step1
    | 'second step' >> step2
    | 'third step' >> step3
)
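
As a concrete illustration, a minimal labeled pipeline using the Python SDK might look like the sketch below (the element values, labels, and lambda are purely illustrative):

import apache_beam as beam

with beam.Pipeline() as pipeline:
    (pipeline
        | 'create elements' >> beam.Create([1, 2, 3])  # initial PCollection
        | 'double them' >> beam.Map(lambda x: x * 2)   # apply a PTransform to each element
        | 'print results' >> beam.Map(print))          # another PTransform to show the output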

Creating #

A Beam program starts by creating an instance of the Beam SDK class Pipeline, typically in the main() function. This is also where any necessary configuration options are set.

import apache_beam as beam

with beam.Pipeline() as p:
    pass # this is where the pipeline goes

Configuring options #

Beam pipelines are configured through PipelineOptions, which can be built from command-line arguments or set directly in code.

from apache_beam.options.pipeline_options import PipelineOptions

beam_options = PipelineOptions()  # with no arguments, options are parsed from the command line
...

For instance, to specify GCP details:

beam_options = PipelineOptions(
    runner='DataflowRunner', # for GCP Dataflow
    project='my-project-id',
    job_name='unique-job-name',
    temp_location='gs://my-bucket/temp', 
)
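
The options object is then passed to the Pipeline constructor; a minimal sketch:

import apache_beam as beam

with beam.Pipeline(options=beam_options) as pipeline:
    pass  # transforms go here; the pipeline runs on the configured runner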

Custom options #

Custom options can be added in addition to the standard `PipelineOptions`.

To add custom input and output options (adapted from the Tour of Beam):

from apache_beam.options.pipeline_options import PipelineOptions

class MyOptions(PipelineOptions):
  @classmethod
  def _add_argparse_args(cls, parser):
    # parser is a standard argparse parser, so defaults, types, and help text can be added as usual
    parser.add_argument('--input')
    parser.add_argument('--output')

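Once defined, the custom options are parsed together with the standard options, and their values are read through a view of the options object. A minimal sketch of the usual pattern (the ReadFromText/WriteToText usage is just an illustration):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

beam_options = PipelineOptions()              # parses --input and --output from the command line
my_options = beam_options.view_as(MyOptions)  # typed view exposing the custom options

with beam.Pipeline(options=beam_options) as pipeline:
    (pipeline
        | beam.io.ReadFromText(my_options.input)
        | beam.io.WriteToText(my_options.output))
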
Resources #