Skip to content Skip to sidebar Skip to footer

Fill Matrix Of Occurences From Column/row Arrays Of Indexes

I'm searching for an efficient way to create a matrix of occurrences from two arrays that contains indexes, one represents the row indexes in this matrix, the other, the column one

Solution 1:

Use np.add.at, specifying a tuple of indices:

>>> np.add.at(new_matrix, (rows, columns), 1)
>>> new_matrix
array([[ 1.,  0.,  0.],
       [ 0.,  2.,  0.],
       [ 0.,  1.,  2.],
       [ 2.,  1.,  5.]])

np.add.at operates on the array in-place, adding 1 as many times to the indices as specified by the (row, columns) tuple.

Solution 2:

Approach #1

We can convert those pairs to linear indices and then use np.bincount -

defbincount_app(rows, columns, n_rows, n_columns):
    # Get linear index equivalent
    lidx = (columns.max()+1)*rows + columns

    # Use binned count on the linear indicesreturn np.bincount(lidx, minlength=n_rows*n_columns).reshape(n_rows,n_columns)

Sample run -

In [242]: n_rows    = 4
     ...: n_columns = 3
     ...: 
     ...: rows    = np.array([0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3])
     ...: columns = np.array([0, 1, 1, 1, 2, 2, 0, 1, 2, 0, 2, 2, 2, 2])

In [243]: bincount_app(rows, columns, n_rows, n_columns)
Out[243]: 
array([[1, 0, 0],
       [0, 2, 0],
       [0, 1, 2],
       [2, 1, 5]])

Approach #2

Alternatively, we can sort the linear indices and get the counts using slicing to have our second approach, like so -

def mask_diff_app(rows, columns, n_rows, n_columns):
    lidx = (columns.max()+1)*rows + columns
    lidx.sort()
    mask = np.concatenate(([True],lidx[1:] != lidx[:-1],[True]))
    count = np.diff(np.flatnonzero(mask))
    new_matrix = np.zeros([n_rows, n_columns],dtype=int)
    new_matrix.flat[lidx[mask[:-1]]] = count
    return new_matrix

Approach #3

This seems like a straight-forward one with sparse matrix csr_matrix as well, as it does accumulation on its own for repeated indices. The benefit is the memory efficiency, given that it's a sparse matrix, which would be noticeable if you are filling a small number of places in the output and a sparse matrix output is okay.

The implementation would look something like this -

from scipy.sparse import csr_matrix

defsparse_matrix_app(rows, columns, n_rows, n_columns):
    out_shp = (n_rows, n_columns)
    data = np.ones(len(rows),dtype=int)
    return csr_matrix((data, (rows, columns)), shape=out_shp)

If you need a regular/dense array, simply do -

sparse_matrix_app(rows, columns, n_rows, n_columns).toarray()

Sample output -

In [319]: sparse_matrix_app(rows, columns, n_rows, n_columns).toarray()
Out[319]: 
array([[1, 0, 0],
       [0, 2, 0],
       [0, 1, 2],
       [2, 1, 5]])

Benchmarking

Other approach(es) -

# @cᴏʟᴅsᴘᴇᴇᴅ's solndef add_at_app(rows, columns, n_rows, n_columns):
    new_matrix = np.zeros([n_rows, n_columns],dtype=int)
    np.add.at(new_matrix, (rows, columns), 1)

Timings

Case #1 : Output array of shape (1000, 1000) and no. of indices = 10k

In [307]:# Setup...:n_rows=1000...:n_columns=1000...:rows=np.random.randint(0,1000,(10000))...:columns=np.random.randint(0,1000,(10000))In [308]:%timeitadd_at_app(rows,columns,n_rows,n_columns)...:%timeitbincount_app(rows,columns,n_rows,n_columns)...:%timeitmask_diff_app(rows,columns,n_rows,n_columns)...:%timeitsparse_matrix_app(rows,columns,n_rows,n_columns)1000 loops,best of 3:1.05msperloop1000 loops,best of 3:424µsperloop1000 loops,best of 3:1.05msperloop1000 loops,best of 3:1.41msperloop

Case #2 : Output array of shape (1000, 1000) and no. of indices = 100k

In [309]: # Setup
     ...: n_rows =1000
     ...: n_columns =1000
     ...: rows= np.random.randint(0,1000,(100000))
     ...: columns = np.random.randint(0,1000,(100000))

In [310]: %timeit add_at_app(rows, columns, n_rows, n_columns)
     ...: %timeit bincount_app(rows, columns, n_rows, n_columns)
     ...: %timeit mask_diff_app(rows, columns, n_rows, n_columns)
     ...: %timeit sparse_matrix_app(rows, columns, n_rows, n_columns)
100 loops, best of3: 11.4 ms per loop
1000 loops, best of3: 1.27 ms per loop
100 loops, best of3: 7.44 ms per loop
10 loops, best of3: 20.4 ms per loop

Case #3 : Sparse-ness in output

As stated earlier, for the sparse method to work better, we would need sparse-ness. Such a case would be like this -

In [314]: # Setup
     ...: n_rows =5000
     ...: n_columns =5000
     ...: rows= np.random.randint(0,5000,(1000))
     ...: columns = np.random.randint(0,5000,(1000))

In [315]: %timeit add_at_app(rows, columns, n_rows, n_columns)
     ...: %timeit bincount_app(rows, columns, n_rows, n_columns)
     ...: %timeit mask_diff_app(rows, columns, n_rows, n_columns)
     ...: %timeit sparse_matrix_app(rows, columns, n_rows, n_columns)
100 loops, best of3: 11.7 ms per loop
100 loops, best of3: 11.1 ms per loop
100 loops, best of3: 11.1 ms per loop
1000 loops, best of3: 269 µs per loop

If you need a dense array, we lose the memory efficiency and hence performance one as well -

In [317]: %timeit sparse_matrix_app(rows, columns, n_rows, n_columns).toarray()
100 loops, best of 3: 11.7 ms per loop

Post a Comment for "Fill Matrix Of Occurences From Column/row Arrays Of Indexes"