
Dataflow Template That Reads Input and Schema from GCS as Runtime Arguments

I am trying to create a custom Dataflow template that takes three runtime arguments: an input file location and a schema file location in GCS, and a BigQuery sink table. The input file seems

Solution 1:

If the schema file is in a known location in GCS, you can add a ParDo to your pipeline that reads it directly from GCS. For example, this read can be done in a start_bundle() [1] implementation of your ParseLineDoFn so that it gets invoked once per bundle rather than once per element. You can use Beam's FileSystems abstraction [2] if you need to abstract away the file system that stores the schema file (not just GCS).

[1] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/core.py#L504
[2] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystems.py
