Minimum Viable Chunk

Beam Minimum Viable Chunk #

This is an example of a minimum viable chunk of Beam code:

import apache_beam as beam

pipeline = beam.Pipeline()

outputs = (
    pipeline
    | 'Read in dataset' >> beam.io.ReadFromText('data/*.txt') 
    | 'Write out' >> beam.io.WriteToText(...) 
)

pipeline.run()

Same thing, with comments:

# !pip install --quiet apache-beam
import apache_beam as beam

# initialize a pipeline
pipeline = beam.Pipeline()


# | is the apply function
# >> is a label operator

outputs = (
    # start the pipeline
    pipeline
    # read from a bunch of text files
    # also add a label
    | 'Read in dataset' >> beam.io.ReadFromText('data/*.txt') # or some other data path
    # need to write out before we can inspect b/c data might be distributed across workers
    # where ... is a file path prefix. can add 'file_name_suffix = ...' usually with an extension
    | 'Write out' >> beam.io.WriteToText(...) 
    # for instance: | 'Write out' >> beam.io.WriteToText(output, file_name_suffix = .txt) 
)

# run the pipeline
pipeline.run()

# view outputs
# outputs would be in file form `output-0000-of-00001.txt` for instance
# ! head output*.txt