Beam Minimum Viable Chunk
This is a minimum viable chunk of Apache Beam code, a pipeline that reads text files and writes them back out:
import apache_beam as beam
pipeline = beam.Pipeline()
outputs = (
pipeline
| 'Read in dataset' >> beam.io.ReadFromText('data/*.txt')
| 'Write out' >> beam.io.WriteToText(...)
)
pipeline.run()
The same pipeline, with comments:
# !pip install --quiet apache-beam
import apache_beam as beam
# initialize a pipeline
pipeline = beam.Pipeline()
# | applies a transform to whatever is on its left (the pipeline or a PCollection)
# 'label' >> transform attaches a label to that step
outputs = (
# start the pipeline
pipeline
# read from a bunch of text files
# also add a label
| 'Read in dataset' >> beam.io.ReadFromText('data/*.txt') # or some other data path
# need to write out before we can inspect, because the data might be distributed across workers
# ... is a file path prefix; can also add file_name_suffix=..., usually a file extension
| 'Write out' >> beam.io.WriteToText(...)
# for instance: | 'Write out' >> beam.io.WriteToText(output, file_name_suffix='.txt')
)
# run the pipeline
pipeline.run()
# view outputs
# outputs are written as sharded files, for instance `output-00000-of-00001.txt`
# ! head output*.txt
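The last comments rely on the sharded file names that `WriteToText` produces. As a rough sketch, assuming the default `-SSSSS-of-NNNNN` shard naming, the plain-Python snippet below builds such names and previews the first lines of every matching file, much like `head output*.txt` would. The helpers `shard_name`, `head`, and `demo` are hypothetical names made up for illustration, not Beam APIs, and the shard files are faked by hand so the snippet runs without Beam installed.

```python
import glob
import itertools
import os
import tempfile


def shard_name(prefix, shard, num_shards, suffix=""):
    """Build a file name in the -SSSSS-of-NNNNN pattern (illustrative helper, not a Beam API)."""
    return f"{prefix}-{shard:05d}-of-{num_shards:05d}{suffix}"


def head(pattern, n=5):
    """Return up to the first n lines of each file matching the glob pattern."""
    preview = {}
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            preview[os.path.basename(path)] = [
                line.rstrip("\n") for line in itertools.islice(handle, n)
            ]
    return preview


def demo():
    """Fake two output shards in a temp directory, then preview them."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = os.path.join(tmp, "output")
        # Stand-in for what a pipeline run would write: one file per shard.
        for shard in range(2):
            path = shard_name(prefix, shard, 2, ".txt")
            with open(path, "w", encoding="utf-8") as handle:
                handle.write(f"line a from shard {shard}\nline b from shard {shard}\n")
        # Same glob as `head output*.txt`, one line per file.
        return head(prefix + "*.txt", n=1)


if __name__ == "__main__":
    for name, lines in demo().items():
        print(name, lines)
```

After a real pipeline run, calling `head('output*.txt')` from the working directory would give the same kind of preview. Globbing matters because the number of shards is chosen by the runner unless it is pinned explicitly.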