Apache Beam Runners #
Runners are the execution engines on which Beam pipelines run.
Below are some of the available runners.
Direct runner #
The direct runner executes pipelines locally on your machine.
This is a good way to unit test a pipeline before making the bigger leap to a remote cluster.
Example:

```shell
python -m apache_beam.examples.wordcount --input YOUR_INPUT_FILE --output counts
```
Google Cloud Dataflow runner #
As the name implies, this runner executes pipelines on GCP Dataflow. It works by:

- uploading the pipeline code to a Google Cloud Storage bucket
- creating a Dataflow job
- executing the pipeline
To use it, first install Beam with the GCP extras (the quotes keep shells like zsh from expanding the brackets):

```shell
pip3 install 'apache-beam[gcp]'
```
Then run something like:

```shell
python -m apache_beam.examples.wordcount \
  --input gs://dataflow-samples/shakespeare/kinglear.txt \
  --output gs://YOUR_GCS_BUCKET/counts \
  --runner DataflowRunner \
  --project YOUR_GCP_PROJECT \
  --region YOUR_GCP_REGION \
  --temp_location gs://YOUR_GCS_BUCKET/tmp/
```