Dataflow Template That Reads Input and Schema from GCS as Runtime Arguments
I am trying to create a custom Dataflow template that takes 3 runtime arguments: an input file location and a schema file location in GCS, and a BigQuery sink table. The input file seems…
Solution 1:
If the schema file is in a known location in GCS, you can add a ParDo to your pipeline that reads it directly from GCS. For example, this can be done in a start_bundle() [1] implementation of your ParseLine DoFn so that it is invoked only once per bundle (not once per element). You can use Beam's FileSystems abstraction [2] if you need to abstract away the file system used to store the schema file (not just GCS).
[1] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/core.py#L504
[2] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystems.py
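Here is a minimal sketch of that approach. It assumes the schema is stored as a BigQuery-style JSON file ({"fields": [{"name": ...}, ...]}) and the input lines are CSV; the ParseLine name comes from the answer, while schema_path and the field-mapping logic are illustrative:

```python
import json

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems


class ParseLine(beam.DoFn):
    """Parses CSV lines using a schema file read from GCS at runtime."""

    def __init__(self, schema_path):
        # schema_path may be a plain string or a ValueProvider
        # (the latter when it arrives as a template runtime argument).
        self.schema_path = schema_path
        self.schema = None

    def start_bundle(self):
        # Runs once per bundle, not once per element, so the schema
        # file is read at most a few times per worker.
        if self.schema is None:
            path = self.schema_path
            if hasattr(path, 'get'):  # unwrap a ValueProvider
                path = path.get()
            # FileSystems picks the backend (GCS, local, ...) from the
            # path scheme, e.g. gs://bucket/schema.json.
            with FileSystems.open(path) as f:
                self.schema = json.loads(f.read().decode('utf-8'))

    def process(self, line):
        values = line.split(',')
        names = [f['name'] for f in self.schema['fields']]
        yield dict(zip(names, values))
```

If the schema path is a template runtime argument, declare it with parser.add_value_provider_argument(...) in your pipeline options and pass the resulting ValueProvider straight into the DoFn; the hasattr(path, 'get') check above handles both plain strings and ValueProviders.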