PCollection

PCollection #

Distributed data set that Beam pipeline operates on.

It can be:

  • Bounded - from a fixed source

  • Unbounded - a source that continuously keep growing and changing

PCollection from In-Memory #

Create an initial PCollection from in-memory data (i.e., stuff you input).

with beam.Pipeline() as p:
    # numerical PCollection
    (p | beam.Create(range(1, 11)))

    # PCollection of strings
    (p | beam.Create(['This', 'is', 'a', 'string']))

PCollection from Text Files #

PCollction from csv files #