Reasons For Slowness In The Numpy.dot() Function, And How To Mitigate Them When Custom Classes Are Used?
Solution 1:
Pretty much anything you do with object arrays is going to be slow. None of the reasons NumPy is usually fast apply to object arrays.
- Object arrays cannot store their elements contiguously. They must store and dereference pointers.
- They don't know how much space they would have to allocate for their elements.
- Their elements may not all be the same size.
- The elements you insert into an object array have already been allocated outside the array, and they cannot be copied.
- Object arrays must perform dynamic dispatch on all element operations. Every time they add or multiply two elements, they have to figure out how to do that all over again.
- Object arrays have no way to accelerate the implementation of their elements, such as your slow, interpreted `__add__` and `__mul__`.
- Object arrays cannot avoid the memory allocation associated with their element operations, such as the allocation of a new `PseudoBinary` object and a new `__dict__` for that object on every element `__add__` or `__mul__`.
- Object arrays cannot parallelize operations, as all operations on their elements require the GIL to be held.
- Object arrays cannot use LAPACK or BLAS, as there are no LAPACK or BLAS functions for arbitrary Python datatypes.
- Etc.
Basically, every reason doing Python math without NumPy is slow also applies to doing anything with object arrays.
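To illustrate the dynamic-dispatch point concretely, the sketch below uses a hypothetical `Flag` class (a minimal stand-in for the question's `PseudoBinary`, whose actual definition is not shown here) and counts how many times NumPy re-enters the Python interpreter for a single element-wise multiply:

```python
import numpy as np

class Flag:
    """Hypothetical stand-in for the question's PseudoBinary:
    a wrapper whose * and + behave like logical OR."""
    calls = 0  # count Python-level dispatches

    def __init__(self, v):
        self.v = v

    def __mul__(self, other):
        Flag.calls += 1  # every element op re-enters the interpreter
        return Flag(self.v | other.v)

    def __add__(self, other):
        Flag.calls += 1
        return Flag(self.v | other.v)

a = np.array([[Flag(1), Flag(0)],
              [Flag(0), Flag(1)]], dtype=object)
b = a * a          # NumPy loops over the array, dispatching per element
print(Flag.calls)  # 4 Python calls for a 2x2 array
```

Every one of those calls also allocates a fresh `Flag` and its `__dict__`, which is exactly the overhead the bullets above describe.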
As for how to improve your performance? Don't use object arrays. Use regular arrays, and either find a way to implement the thing you want in terms of the operations NumPy provides, or write out the loops explicitly and use something like Numba or Cython to compile your code.
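For instance, if the custom class only wraps a truth value and its `*` behaves like logical OR (as the question's `PseudoBinary` appears to), the same "dot" can be expressed directly on a plain integer array. A sketch, with the sample data made up for illustration:

```python
import numpy as np

# Keep the raw truth values in an ordinary integer array instead of
# an object array of wrappers; the data here is illustrative.
base = np.array([[1, 0, 0],
                 [1, 0, 1],
                 [1, 0, 0]])

def pseudo_dot(a, b):
    """dot-like product where element 'multiplication' is logical OR."""
    return (a[:, :, None] | b[None, :, :]).sum(axis=1)

print(pseudo_dot(base, base))
```

This stays entirely in compiled NumPy loops, so none of the per-element dispatch or allocation costs listed above apply.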
Solution 2:
The `dot` is an outer product followed by a sum on one axis. For your `pseudo`, `dot` is marginally faster than the sum-of-products equivalent:
In [18]: timeit (pseudo[:,:,None]*pseudo[None,:,:]).sum(axis=1)
75.7 µs ± 3.14 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [19]: timeit np.dot(pseudo, pseudo)
63.9 µs ± 1.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
For `base`, `dot` is substantially faster than the equivalent:
In [20]: timeit (base[:,:,None]*base[None,:,:]).sum(axis=1)
13.9 µs ± 24.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [21]: timeit np.dot(base,base)
1.58 µs ± 53.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
So with a numeric array, `dot` can pass the whole task to optimized compiled code (BLAS or other).
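The timings above rest on `dot` being equivalent to the broadcasted outer product followed by a sum on axis 1; a quick sanity check of that identity on small random numeric arrays:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(0, 2, (3, 3))
b = rng.integers(0, 2, (3, 3))

# np.dot(a, b)[i, j] == sum over k of a[i, k] * b[k, j];
# the broadcasted product has element [i, k, j] = a[i, k] * b[k, j].
expanded = (a[:, :, None] * b[None, :, :]).sum(axis=1)
print((expanded == np.dot(a, b)).all())  # True
```

The two computations give identical results; `dot` is just the version that never materializes the intermediate (n, n, n) array.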
We could get a further idea of how object dtype affects the speed by creating a numeric object array, and comparing the simple element product:
In [28]: baso = base.astype(object)
In [29]: timeit base*base
766 ns ± 48.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [30]: timeit baso*baso
2.45 µs ± 73.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [31]: timeit pseudo*pseudo
13.7 µs ± 41.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Using 'or' (`|`) instead of `*`, we can calculate the same thing as `pseudo` but with `base`:
In [34]: (base[:,:,None] | base[None,:,:]).sum(axis=1)
Out[34]:
array([[3, 2, 2],
[3, 1, 2],
[3, 2, 2]], dtype=int32)
In [35]: timeit (base[:,:,None] | base[None,:,:]).sum(axis=1)
15.1 µs ± 492 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)