Efficiently Get Indices Of Histogram Bins In Python

May 29, 2023 Post a Comment

Short Question I have a large 10000x10000 elements image, which I bin into a few hundred different sectors/bins. I then need to perform some iterative calculation on the values con

Solution 1:

I found that a particular sparse matrix constructor can achieve the desired result very efficiently. It's a bit obscure but we can abuse it for this purpose. The function below can be used in nearly the same way as scipy.stats.binned_statistic but can be orders of magnitude faster

import numpy as np
from scipy.sparse import csr_matrix

def binned_statistic(x, values, func, nbins, range):
    '''The usage is nearly the same as scipy.stats.binned_statistic''' 

    N = len(values)
    r0, r1 = range

    digitized = (float(nbins)/(r1 - r0)*(x - r0)).astype(int)
    S = csr_matrix((values, [digitized, np.arange(N)]), shape=(nbins, N))

    return [func(group) for group in np.split(S.data, S.indptr[1:-1])]

I avoided np.digitize because it doesn't use the fact that all bins are equal width and hence is slow, but the method I used instead may not handle all edge cases perfectly.

Solution 2:

I assume that the binning, done in the example with digitize, cannot be changed. This is one way to go, where you do the sorting once and for all.

vals = np.random.random(1e4)
nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)

new_order = argsort(ind)
ind = ind[new_order]
ordered_vals = vals[new_order]
# slower way of calculating first_hit (first version of this post)
# _,first_hit = unique(ind,return_index=True)
# faster way:
first_hit = searchsorted(ind,arange(1,nbins-1))
first_hit.sort()

#example of using the data:
for j in range(nbins-1):
    #I am using a plotting function for your f, to show that they cluster
    plot(ordered_vals[first_hit[j]:first_hit[j+1]],'o')

The figure shows that the bins are actually clusters as expected: enter image description here

Solution 3:

You can halve the computation time by sorting the array first, then use np.searchsorted.

vals = np.random.random(1e8)
vals.sort()

nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)

results = [func(vals[np.searchsorted(ind,j,side='left'):
                     np.searchsorted(ind,j,side='right')])
           for j in range(1,nbins)]

Using 1e8 as my test case, I go from 34 seconds of computation to about 17.

Solution 4:

One efficient solution is using the numpy_indexed package (disclaimer: I am its author):

import numpy_indexed as npi
npi.group_by(ind).split(vals)

Solution 5:

Pandas has a very fast grouping code (I think it's written in C), so if you don't mind loading the library you could do that :

import pandas as pd

pdata=pd.DataFrame({'vals':vals,'ind':ind})
resultsp = pdata.groupby('ind').sum().values

or more generally :

pdata=pd.DataFrame({'vals':vals,'ind':ind})
resultsp = pdata.groupby('ind').agg(func).values

Although the latter is slower for standard aggregation functions (like sum, mean, etc)

Getting Started with Python