
How To Use Gzip To Compress Json Data In Python Program?

I have an AWS Kinesis Python program, a producer that sends data to my stream. But my JSON file is 5 MB. I would like to compress the data using GZIP or any other suitable method. My produ

Solution 1:

There are two ways to compress the data:

1. Enable GZIP/Snappy compression on the Firehose stream - this can be done from the console itself

Firehose buffers the data and, once the threshold is reached, takes all the buffered data and compresses it together to create the .gz object.

Pros :

  • Minimal Effort Required on Producer side - Just change the setting in console.
  • Minimal Effort Required on Consumer Side - Firehose creates .gz objects in S3 and sets the metadata on the objects to reflect the compression type. Hence, if you read the data via AWS SDK itself, the SDK will do the decompression for you.

Cons :

  • Since Firehose charges by the size of the data ingested, you will not save on Firehose cost. You will save on S3 cost (because the objects are smaller).
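
For readers who prefer code over the console, the same setting can be sketched as an S3 destination configuration. This is only an illustration under assumptions: the helper name `gzip_s3_destination`, the ARNs, and the stream name are hypothetical placeholders, and the buffering values are just example numbers. The block only builds the dictionary; the commented lines show where it would go if you create the stream with boto3.

```python
def gzip_s3_destination(role_arn: str, bucket_arn: str) -> dict:
    """Build an S3 destination config that asks Firehose to gzip delivered objects."""
    return {
        "RoleARN": role_arn,
        "BucketARN": bucket_arn,
        # Firehose compresses each buffered batch before writing it to S3.
        "CompressionFormat": "GZIP",
        # Example buffering thresholds (assumed values, tune for your workload).
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 300},
    }


# Placeholder ARNs, for illustration only.
config = gzip_s3_destination(
    "arn:aws:iam::123456789012:role/example-firehose-role",
    "arn:aws:s3:::example-bucket",
)

# With boto3 installed and credentials configured, the dict could be passed as:
# boto3.client("firehose").create_delivery_stream(
#     DeliveryStreamName="example-stream",
#     ExtendedS3DestinationConfiguration=config,
# )
```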

2. Compression in the producer code - you need to write the code yourself

I implemented this in Java a few days back. We were ingesting over 100 Petabytes of data into Firehose (from where it gets written to S3). This was a massive cost for us.

So, we decided to do the compression on the producer side. The compressed data then flows to Kinesis Firehose (KF), which writes it to S3 as is. Please note that since KF is not doing the compression, it has no idea what kind of data it is handling. As a result, the objects created in S3 are not marked as ".gz" compressed. Hence, the consumers are none the wiser as to what data is in the objects. We therefore wrote a wrapper on top of the AWS Java SDK for S3 that reads the object and decompresses it.
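
Since the question asks about Python, here is a minimal per-record sketch of the same idea using only the standard library. It is not the Java implementation described above: the function names `compress_record` and `decompress_record` and the sample payload are hypothetical, and the magic-byte check is just one assumed way for a consumer to cope with data that carries no ".gz" hint.

```python
import gzip
import json

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def compress_record(record: dict) -> bytes:
    """Producer side: serialize a dict to compact JSON and gzip it."""
    raw = json.dumps(record, separators=(",", ":")).encode("utf-8")
    return gzip.compress(raw)


def decompress_record(blob: bytes) -> dict:
    """Consumer side: reverse compress_record.

    Falls back to plain JSON when the blob does not start with the gzip
    magic bytes, so uncompressed data stays readable.
    """
    if blob[:2] == GZIP_MAGIC:
        blob = gzip.decompress(blob)
    return json.loads(blob.decode("utf-8"))


# Hypothetical payload, for illustration only.
sample = {"device": "sensor-1", "readings": [20.5] * 1000}
payload = compress_record(sample)
restored = decompress_record(payload)
```

The `payload` bytes are what you would hand to the producer call, for example as the `Data` value of a boto3 `put_record` request. The consumer then runs `decompress_record` on whatever bytes it reads back.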

Pros :

  • Our data shrank by close to 90% after compression. That directly resulted in roughly 90% savings on Firehose cost, plus the additional S3 savings described in approach 1.

Cons :

  • Not exactly a con, but more development effort is required: creating a wrapper on top of the AWS SDK, testing effort, etc.
  • Compression and decompression are CPU intensive. On average, the two together increased our CPU utilization by 22%.
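
Both the size savings and the CPU cost depend heavily on your own data, so it is worth measuring before committing to this approach. The following is a rough sketch, assuming a made-up and highly repetitive payload; the `measure` helper is hypothetical, and the numbers it prints will not match the 90% and 22% figures quoted above.

```python
import gzip
import json
import time
from typing import Tuple

# Hypothetical, highly repetitive payload; real data will compress differently.
records = [{"id": i, "status": "ok", "region": "us-east-1"} for i in range(20000)]
raw = json.dumps(records).encode("utf-8")


def measure(level: int) -> Tuple[float, float]:
    """Return (fraction of bytes saved, CPU seconds spent) for one gzip level."""
    start = time.process_time()
    compressed = gzip.compress(raw, compresslevel=level)
    cpu_seconds = time.process_time() - start
    saved = 1 - len(compressed) / len(raw)
    return saved, cpu_seconds


# Compare the fastest level, a middle setting, and the strongest level.
results = {level: measure(level) for level in (1, 6, 9)}
for level, (saved, cpu_seconds) in results.items():
    print(f"level={level} saved={saved:.1%} cpu={cpu_seconds:.4f}s")
```

A lower `compresslevel` usually trades some size savings for less CPU time, which may matter if the producer is already CPU bound.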
