Most Efficient Way To Store List Of Integers
Solution 1:
One stdlib solution you could use is the array type from the array module. From the docs:
This module defines an object type which can compactly represent an array of basic values: characters, integers, floating point numbers. Arrays are sequence types and behave very much like lists, except that the type of objects stored in them is constrained.
This generally sheds a fair amount of memory off large lists. For example, with a 10-million-element list, the pickled array comes out about 9 MB smaller than the pickled list:
import pickle
from array import array

l = [i for i in range(10000000)]
a = array('i', l)

# array.tofile could also be used instead of pickle.
with open('arrfile', 'wb') as f:
    pickle.dump(a, f)

with open('lstfile', 'wb') as f:
    pickle.dump(l, f)
Sizes:
!du -sh ./*
39M arrfile
48M lstfile
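The saving is visible in memory as well as on disk. A rough sketch using sys.getsizeof (note that for a list it counts only the pointer table, not the int objects the list references, so the list's real footprint is even larger than shown):

import sys
from array import array

l = list(range(10000000))
a = array('i', l)

# The list holds 8-byte pointers to separately allocated int objects;
# the array packs the values as raw 4-byte C ints.
print(sys.getsizeof(l))  # container only; excludes the int objects
print(sys.getsizeof(a))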
Solution 2:
Here is a small demo using the Pandas module:
import numpy as np
import pandas as pd
import feather
# let's generate an array of 1M int64 elements...
df = pd.DataFrame({'num_col':np.random.randint(0, 10**9, 10**6)}, dtype=np.int64)
df.info()
%timeit -n 1 -r 1 df.to_pickle('d:/temp/a.pickle')
%timeit -n 1 -r 1 df.to_hdf('d:/temp/a.h5', 'df_key', complib='blosc', complevel=5)
%timeit -n 1 -r 1 df.to_hdf('d:/temp/a_blosc.h5', 'df_key', complib='blosc', complevel=5)
%timeit -n 1 -r 1 df.to_hdf('d:/temp/a_zlib.h5', 'df_key', complib='zlib', complevel=5)
%timeit -n 1 -r 1 df.to_hdf('d:/temp/a_bzip2.h5', 'df_key', complib='bzip2', complevel=5)
%timeit -n 1 -r 1 df.to_hdf('d:/temp/a_lzo.h5', 'df_key', complib='lzo', complevel=5)
%timeit -n 1 -r 1 feather.write_dataframe(df, 'd:/temp/a.feather')
DataFrame info:
In [56]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 1 columns):
num_col 1000000 non-null int64
dtypes: int64(1)
memory usage: 7.6 MB
Results (speed):
In [49]: %timeit -n 1 -r 1 df.to_pickle('d:/temp/a.pickle')
1 loop, best of 1: 16.2 ms per loop
In [50]: %timeit -n 1 -r 1 df.to_hdf('d:/temp/a.h5', 'df_key', complib='blosc', complevel=5)
1 loop, best of 1: 39.7 ms per loop
In [51]: %timeit -n 1 -r 1 df.to_hdf('d:/temp/a_blosc.h5', 'df_key', complib='blosc', complevel=5)
1 loop, best of 1: 40.6 ms per loop
In [52]: %timeit -n 1 -r 1 df.to_hdf('d:/temp/a_zlib.h5', 'df_key', complib='zlib', complevel=5)
1 loop, best of 1: 213 ms per loop
In [53]: %timeit -n 1 -r 1 df.to_hdf('d:/temp/a_bzip2.h5', 'df_key', complib='bzip2', complevel=5)
1 loop, best of 1: 1.09 s per loop
In [54]: %timeit -n 1 -r 1 df.to_hdf('d:/temp/a_lzo.h5', 'df_key', complib='lzo', complevel=5)
1 loop, best of 1: 32.1 ms per loop
In [55]: %timeit -n 1 -r 1 feather.write_dataframe(df, 'd:/temp/a.feather')
1 loop, best of 1: 3.49 ms per loop
Results (size):
{ temp } » ls -lh a* /d/temp
-rw-r--r-- 1 Max None 7.7M Sep 20 23:15 a.feather
-rw-r--r-- 1 Max None 4.1M Sep 20 23:15 a.h5
-rw-r--r-- 1 Max None 7.7M Sep 20 23:15 a.pickle
-rw-r--r-- 1 Max None 4.1M Sep 20 23:15 a_blosc.h5
-rw-r--r-- 1 Max None 4.0M Sep 20 23:15 a_bzip2.h5
-rw-r--r-- 1 Max None 4.1M Sep 20 23:15 a_lzo.h5
-rw-r--r-- 1 Max None 3.9M Sep 20 23:15 a_zlib.h5
Conclusion: look at HDF5 (plus blosc or lzo compression) if you need both speed and a reasonable file size, or at the Feather format if you only care about speed - it's over four times faster than Pickle here (3.49 ms vs. 16.2 ms)!
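For completeness, reading the data back mirrors the writing side. A minimal sketch, assuming the same paths and the standalone feather package imported above:

import pandas as pd
import feather

# The key must match the one passed to to_hdf above.
df_hdf = pd.read_hdf('d:/temp/a_blosc.h5', 'df_key')

df_feather = feather.read_dataframe('d:/temp/a.feather')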
Solution 3:
I like Jim's suggestion of using the array module. If your numeric values are small enough to fit into the machine's native int type, then this is a fine solution. (I'd prefer to serialize the array with the array.tofile method instead of using pickle, though.) If an int is 32 bits, then this uses 4 bytes per number.
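A minimal sketch of that tofile route (the filename and element count here are illustrative):

from array import array

a = array('i', range(333000))

# tofile writes the raw machine values; no pickle framing at all.
with open('numbers.bin', 'wb') as f:
    a.tofile(f)

# fromfile needs the typecode and the element count to read it back.
b = array('i')
with open('numbers.bin', 'rb') as f:
    b.fromfile(f, 333000)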
I would like to question how you produced your text file, though. If I create a file with 333 000 integers in the range [0, 8 000] with one number per line,
import random
with open('numbers.txt', 'w') as ostr:
    for i in range(333000):
        r = random.randint(0, 8000)
        print(r, file=ostr)
it comes out to a size of only 1.6 MiB, which isn't all that bad compared to the 1.3 MiB that the binary representation would use (333 000 numbers at 4 bytes each is about 1.27 MiB). And if you do happen to have a value outside the range of the native int type one day, the text file will handle it happily without overflow.
Furthermore, if I compress the file using gzip, the file size shrinks down to 686 KiB. That's better than gzipping the binary data! With bzip2, the file size is only 562 KiB. Python's standard library has support for both gzip and bz2, so you might want to give the plain-text format plus compression another try.
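Compressing on the fly is a one-line change, since gzip.open (and likewise bz2.open) supports text mode. A minimal sketch of the same generator writing straight into a gzip file:

import gzip
import random

# Same one-number-per-line format, gzip-compressed as it is written.
with gzip.open('numbers.txt.gz', 'wt') as ostr:
    for i in range(333000):
        print(random.randint(0, 8000), file=ostr)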