Apache Beam Pipelines #
A pipeline encapsulates the entire data processing task.
Pipelines are made up of PCollections (the data) and PTransforms (the operations applied to the data).
In the context of a pipeline, the runner specifies where and how the pipeline code is executed.
Design #
The typical Beam program is structured as follows (see the sketch after this list):
- Create a Pipeline object, including details of the pipeline runner.
- Create an initial PCollection, either by reading data with one of the IO connectors or by using a Create transform.
- Apply PTransforms to each PCollection.
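A minimal sketch of these three steps, assuming an in-memory Create source and a simple Map transform (the element values are purely illustrative):
import apache_beam as beam

with beam.Pipeline() as pipeline:  # 1. create the Pipeline object
    words = pipeline | beam.Create(['hello', 'beam'])  # 2. initial PCollection
    shouted = words | beam.Map(str.upper)  # 3. apply a PTransform
    shouted | beam.Map(print)  # print each element, just to see the output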
Pseudo-code #
Pseudo-code of a pipeline might look like:
output = (pipeline
    | step1
    | step2
    | step3
)
This can be extended with labels:
output = (pipeline
    | 'first step' >> step1
    | 'second step' >> step2
    | 'third step' >> step3
)
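For illustration, the labelled pseudo-code might map onto real transforms like this (the transforms and values are only examples). Labels must be unique within a pipeline, and they make steps easier to identify in a runner's monitoring UI:
import apache_beam as beam

with beam.Pipeline() as pipeline:
    output = (pipeline
        | 'create words' >> beam.Create(['a', 'b', 'a'])
        | 'pair with one' >> beam.Map(lambda word: (word, 1))
        | 'count per word' >> beam.CombinePerKey(sum)
    )
    output | 'print counts' >> beam.Map(print)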
Creating #
A Beam program starts by creating an instance of the Beam SDK class Pipeline, often in the main()
function. This is also where the necessary configuration options are set.
import apache_beam as beam
with beam.Pipeline() as p:
    pass  # this is where the pipeline goes
Configuring options #
Beam pipelines are configured through PipelineOptions, which holds the runner and other execution settings.
from apache_beam.options.pipeline_options import PipelineOptions
beam_options = PipelineOptions()
...
For instance, to specify the details needed to run on Google Cloud Dataflow:
beam_options = PipelineOptions(
    runner='DataflowRunner',  # for GCP Dataflow
    project='my-project-id',
    job_name='unique-job-name',
    temp_location='gs://my-bucket/temp',
)
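The options object is then passed to the pipeline when it is constructed; a minimal sketch, assuming the beam_options defined above:
import apache_beam as beam

with beam.Pipeline(options=beam_options) as p:
    pass  # pipeline steps go here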
Custom options #
Custom options can be added in addition to the standard PipelineOptions.
To add custom input and output options (adapted from the Tour of Beam):
from apache_beam.options.pipeline_options import PipelineOptions
class MyOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument('--input')
        parser.add_argument('--output')
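A sketch of how the custom options might be read back, assuming the MyOptions class above and illustrative command-line values:
from apache_beam.options.pipeline_options import PipelineOptions

# Parse illustrative command-line style flags into an options object.
beam_options = PipelineOptions(['--input', 'input.txt', '--output', 'output.txt'])
my_options = beam_options.view_as(MyOptions)  # view the same options as MyOptions
print(my_options.input, my_options.output)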
Resources #
- For official pipeline examples, see the Apache Beam documentation.