Most Efficient Way To Store List Of Integers
Solution 1:
One stdlib solution you could use is the array type from the array module. From the docs:
This module defines an object type which can compactly represent an array of basic values: characters, integers, floating point numbers. Arrays are sequence types and behave very much like lists, except that the type of objects stored in them is constrained.
This generally sheds a fair amount of memory off large lists. For example, with a 10-million-element list, the pickled array comes out about 9 MB smaller than the pickled list:
import pickle
from array import array

l = [i for i in range(10000000)]
a = array('i', l)

# array.tofile could also be used instead of pickle.
with open('arrfile', 'wb') as f:
    pickle.dump(a, f)

with open('lstfile', 'wb') as f:
    pickle.dump(l, f)
Sizes:
!du -sh ./*
39M arrfile
48M lstfile
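The saving is visible in memory as well as on disk. A rough sketch using sys.getsizeof (note that for a list it counts only the pointer table, not the int objects the list references, so the list's real footprint is even larger than shown):

import sys
from array import array

l = list(range(10000000))
a = array('i', l)

# The list holds 8-byte pointers to separately allocated int objects;
# the array packs the values as raw 4-byte C ints.
print(sys.getsizeof(l))  # container only; excludes the int objects
print(sys.getsizeof(a))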
Solution 2:
Here is a small demo using the Pandas module:
import numpy as np
import pandas as pd
import feather
# let's generate an array of 1M int64 elements...
df = pd.DataFrame({'num_col':np.random.randint(0, 10**9, 10**6)}, dtype=np.int64)
df.info()
%timeit -n 1 -r 1 df.to_pickle('d:/temp/a.pickle')
%timeit -n 1 -r 1 df.to_hdf('d:/temp/a.h5', 'df_key', complib='blosc', complevel=5)
%timeit -n 1 -r 1 df.to_hdf('d:/temp/a_blosc.h5', 'df_key', complib='blosc', complevel=5)
%timeit -n 1 -r 1 df.to_hdf('d:/temp/a_zlib.h5', 'df_key', complib='zlib', complevel=5)
%timeit -n 1 -r 1 df.to_hdf('d:/temp/a_bzip2.h5', 'df_key', complib='bzip2', complevel=5)
%timeit -n 1 -r 1 df.to_hdf('d:/temp/a_lzo.h5', 'df_key', complib='lzo', complevel=5)
%timeit -n 1 -r 1 feather.write_dataframe(df, 'd:/temp/a.feather')
DataFrame info:
In [56]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 1 columns):
num_col 1000000 non-null int64
dtypes: int64(1)
memory usage: 7.6 MB
Results (speed):
In [49]: %timeit -n 1 -r 1 df.to_pickle('d:/temp/a.pickle')
1 loop, best of 1: 16.2 ms per loop
In [50]: %timeit -n 1 -r 1 df.to_hdf('d:/temp/a.h5', 'df_key', complib='blosc', complevel=5)
1 loop, best of 1: 39.7 ms per loop
In [51]: %timeit -n 1 -r 1 df.to_hdf('d:/temp/a_blosc.h5', 'df_key', complib='blosc', complevel=5)
1 loop, best of 1: 40.6 ms per loop
In [52]: %timeit -n 1 -r 1 df.to_hdf('d:/temp/a_zlib.h5', 'df_key', complib='zlib', complevel=5)
1 loop, best of 1: 213 ms per loop
In [53]: %timeit -n 1 -r 1 df.to_hdf('d:/temp/a_bzip2.h5', 'df_key', complib='bzip2', complevel=5)
1 loop, best of 1: 1.09 s per loop
In [54]: %timeit -n 1 -r 1 df.to_hdf('d:/temp/a_lzo.h5', 'df_key', complib='lzo', complevel=5)
1 loop, best of 1: 32.1 ms per loop
In [55]: %timeit -n 1 -r 1 feather.write_dataframe(df, 'd:/temp/a.feather')
1 loop, best of 1: 3.49 ms per loop
Results (size):
{ temp } » ls -lh a* /d/temp
-rw-r--r-- 1 Max None 7.7M Sep 20 23:15 a.feather
-rw-r--r-- 1 Max None 4.1M Sep 20 23:15 a.h5
-rw-r--r-- 1 Max None 7.7M Sep 20 23:15 a.pickle
-rw-r--r-- 1 Max None 4.1M Sep 20 23:15 a_blosc.h5
-rw-r--r-- 1 Max None 4.0M Sep 20 23:15 a_bzip2.h5
-rw-r--r-- 1 Max None 4.1M Sep 20 23:15 a_lzo.h5
-rw-r--r-- 1 Max None 3.9M Sep 20 23:15 a_zlib.h5
Conclusion: look at HDF5 (plus blosc or lzo compression) if you need both speed and a reasonable file size, or at the Feather format if you only care about speed - it's over four times faster than Pickle here (3.49 ms vs. 16.2 ms)!
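For completeness, reading the data back mirrors the writing side. A minimal sketch, assuming the same paths and the standalone feather package imported above:

import pandas as pd
import feather

# The key must match the one passed to to_hdf above.
df_hdf = pd.read_hdf('d:/temp/a_blosc.h5', 'df_key')

df_feather = feather.read_dataframe('d:/temp/a.feather')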
Solution 3:
I like Jim's suggestion of using the array module. If your numeric values are small enough to fit into the machine's native int type, then this is a fine solution. (I'd prefer to serialize the array with the array.tofile method instead of using pickle, though.) If an int is 32 bits, then this uses 4 bytes per number.
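A minimal sketch of that tofile route (the filename and element count here are illustrative):

from array import array

a = array('i', range(333000))

# tofile writes the raw machine values; no pickle framing at all.
with open('numbers.bin', 'wb') as f:
    a.tofile(f)

# fromfile needs the typecode and the element count to read it back.
b = array('i')
with open('numbers.bin', 'rb') as f:
    b.fromfile(f, 333000)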
I would like to question how you produced your text file, though. If I create a file with 333 000 integers in the range [0, 8 000] with one number per line,
import random
with open('numbers.txt', 'w') as ostr:
    for i in range(333000):
        r = random.randint(0, 8000)
        print(r, file=ostr)
it comes out to a size of only 1.6 MiB, which isn't all that bad compared to the 1.3 MiB that the binary representation would use (333 000 numbers at 4 bytes each is about 1.27 MiB). And if you do happen to have a value outside the range of the native int type one day, the text file will handle it happily without overflow.
Furthermore, if I compress the file using gzip, the file size shrinks down to 686 KiB. That's better than gzipping the binary data! With bzip2, the file size is only 562 KiB. Python's standard library has support for both gzip and bz2, so you might want to give the plain-text format plus compression another try.
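Compressing on the fly is a one-line change, since gzip.open (and likewise bz2.open) supports text mode. A minimal sketch of the same generator writing straight into a gzip file:

import gzip
import random

# Same one-number-per-line format, gzip-compressed as it is written.
with gzip.open('numbers.txt.gz', 'wt') as ostr:
    for i in range(333000):
        print(random.randint(0, 8000), file=ostr)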