Runners

Apache Beam Runners #

Runners == the execution engines on which Beam pipelines run.

Below are some of the available runners.

Direct runner #

Executes pipelines locally on your machine.

This is a good way to unit test pipelines before making the bigger leap to a remote cluster.

Example:

python -m apache_beam.examples.wordcount --input YOUR_INPUT_FILE --output counts
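Since the DirectRunner is Beam's default, a pipeline can also be exercised directly in a unit test. A minimal sketch (the input strings and expected counts below are made up for illustration):

```python
import apache_beam as beam
from apache_beam.testing.util import assert_that, equal_to

# Runs entirely locally on the DirectRunner (the default when no
# --runner is specified) and asserts on the pipeline's output.
with beam.Pipeline() as p:
    counts = (
        p
        | beam.Create(["to be", "or not to be"])  # toy input
        | beam.FlatMap(str.split)
        | beam.combiners.Count.PerElement()
    )
    assert_that(counts, equal_to([("to", 2), ("be", 2), ("or", 1), ("not", 1)]))
```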

Google Cloud Dataflow runner #

As the name implies, this runner executes pipelines on GCP Dataflow. It works by:

  1. uploading the code to a Google Cloud Storage bucket

  2. creating a Dataflow job

  3. executing the pipeline

To use this runner, first install the Beam GCP extras:

pip3 install 'apache-beam[gcp]'

Then run something like:

python -m apache_beam.examples.wordcount --input gs://dataflow-samples/shakespeare/kinglear.txt \
                                         --output gs://YOUR_GCS_BUCKET/counts \
                                         --runner DataflowRunner \
                                         --project YOUR_GCP_PROJECT \
                                         --region YOUR_GCP_REGION \
                                         --temp_location gs://YOUR_GCS_BUCKET/tmp/
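The same options can also be set from Python instead of the command line. A rough sketch of an equivalent word-count pipeline, where the project, region, and bucket values are placeholders to substitute with your own:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project/region/bucket values; substitute your own.
options = PipelineOptions(
    runner="DataflowRunner",
    project="YOUR_GCP_PROJECT",
    region="YOUR_GCP_REGION",
    temp_location="gs://YOUR_GCS_BUCKET/tmp/",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | beam.io.ReadFromText("gs://dataflow-samples/shakespeare/kinglear.txt")
        | beam.FlatMap(str.split)
        | beam.combiners.Count.PerElement()
        | beam.MapTuple(lambda word, count: f"{word}: {count}")
        | beam.io.WriteToText("gs://YOUR_GCS_BUCKET/counts")
    )
```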