Skip to content Skip to sidebar Skip to footer

Finding A Set Of Indices That Maps The Rows Of One Numpy Ndarray To Another

I have two structured 2D numpy arrays which are equal in principle, meaning A = numpy.array([[a1,b1,c1], [a2,b2,c2], [a3,b3,c3],

Solution 1:

Based on your example, it seems that you have shuffled all of the columns simultaneously, such that there is a vector of row indices that maps A→B. Here's a toy example:

A = np.random.permutation(12).reshape(4, 3)
idx = np.random.permutation(4)
B = A[idx]

print(repr(A))
# array([[ 7, 11,  6],
#        [ 4, 10,  8],
#        [ 9,  2,  0],
#        [ 1,  3,  5]])

print(repr(B))
# array([[ 1,  3,  5],
#        [ 4, 10,  8],
#        [ 7, 11,  6],
#        [ 9,  2,  0]])

We want to recover a set of indices, idx, such that A[idx] == B. This will be a unique mapping if and only if A and B contain no repeated rows.


One efficient* approach would be to find the indices that would lexically sort the rows in A, then find where each row in B would fall within the sorted version of A. A useful trick is to view A and B as 1D arrays using an np.void dtype that treats each row as a single element:

rowtype = np.dtype((np.void, A.dtype.itemsize * A.size / A.shape[0]))
# A and B must be C-contiguous, might need to force a copy herea = np.ascontiguousarray(A).view(rowtype).ravel()
b = np.ascontiguousarray(B).view(rowtype).ravel()

a_to_as = np.argsort(a)     # indices that sort the rows of A in lexical order

Now we can use np.searchsorted to perform a binary search for where each row in B would fall within the sorted version of A:

# using the `sorter=` argument rather than `a[a_to_as]` avoids making a copy of `a`as_to_b = a.searchsorted(b, sorter=a_to_as)

The mapping from A→B can be expressed as a composite of A→As→B

a_to_b = a_to_as.take(as_to_b)
print(np.all(A[a_to_b] == B))
# True

If A and B contain no repeated rows, the inverse mapping from B→A can also be obtained using

b_to_a = np.argsort(a_to_b)
print(np.all(B[b_to_a] == A))
# True

As a single function:

deffind_row_mapping(A, B):
    """
    Given A and B, where B is a copy of A permuted over the first dimension, find
    a set of indices idx such that A[idx] == B.
    This is a unique mapping if and only if there are no repeated rows in A and B.

    Arguments:
        A, B:   n-dimensional arrays with same shape and dtype
    Returns:
        idx:    vector of indices into the rows of A
    """ifnot (A.shape == B.shape):
        raise ValueError('A and B must have the same shape')
    ifnot (A.dtype == B.dtype):
        raise TypeError('A and B must have the same dtype')

    rowtype = np.dtype((np.void, A.dtype.itemsize * A.size / A.shape[0]))
    a = np.ascontiguousarray(A).view(rowtype).ravel()
    b = np.ascontiguousarray(B).view(rowtype).ravel()
    a_to_as = np.argsort(a)
    as_to_b = a.searchsorted(b, sorter=a_to_as)

    return a_to_as.take(as_to_b)

Benchmark:

In [1]: gen = np.random.RandomState(0)
In [2]: %%timeit A = gen.rand(1000000, 100); B = A.copy(); gen.shuffle(B)
....: find_row_mapping(A, B)
1loop, best of 3: 2.76 s per loop

*The most costly step would be the quicksort over rows which is O(n log n) on average. I'm not sure it's possible to do any better than this.

Solution 2:

Since either one of the arrays could be shuffled to match the other, no one has stopped us from re-arranging both. Using Jaime's Answer, we can vstack both the arrays and find the unique rows. Then the inverse indices returned by unique is essentially the desired mapping (as the arrays don't contain duplicate rows).

Let's first define a unique2d function for convenience:

defunique2d(arr,consider_sort=False,return_index=False,return_inverse=False): 
    """Get unique values along an axis for 2D arrays.

        input:
            arr:
                2D array
            consider_sort:
                Does permutation of the values within the axis matter? 
                Two rows can contain the same values but with 
                different arrangements. If consider_sort 
                is True then those rows would be considered equal
            return_index:
                Similar to numpy unique
            return_inverse:
                Similar to numpy unique
        returns:
            2D array of unique rows
            If return_index is True also returns indices
            If return_inverse is True also returns the inverse array 
            """if consider_sort isTrue:
        a = np.sort(arr,axis=1)
    else:
        a = arr
    b = np.ascontiguousarray(a).view(np.dtype((np.void, 
            a.dtype.itemsize * a.shape[1])))

    if return_inverse isFalse:
        _, idx = np.unique(b, return_index=True)
    else:
        _, idx, inv = np.unique(b, return_index=True, return_inverse=True)

    if return_index == Falseand return_inverse == False:
        return arr[idx]
    elif return_index == Trueand return_inverse == False:
        return arr[idx], idx
    elif return_index == Falseand return_inverse == True:
        return arr[idx], inv
    else:
        return arr[idx], idx, inv

We can now define our mapping as follows

defrow_mapper(a,b,consider_sort=False):
    """Given two 2D numpy arrays returns mappers idx_a and idx_b 
        such that a[idx_a] = b[idx_b] """assert a.dtype == b.dtype
    assert a.shape == b.shape

    c = np.concatenate((a,b))
    _, inv = unique2d(c, consider_sort=consider_sort, return_inverse=True)
    mapper_a = inv[:b.shape[0]]
    mapper_b = inv[b.shape[0]:]

    return np.argsort(mapper_a), np.argsort(mapper_b) 

Verify:

n = 100000
A = np.arange(n).reshape(n//4,4)
B = A[::-1,:]

idx_a, idx_b  = row_mapper(A,B)
print np.all(A[idx_a]==B[idx_b])
# True

Benchmark: benchmark against @ali_m's solution

%timeitfind_row_mapping(A,B)# ali_m's solution%timeitrow_mapper(A,B)# current solution# n = 100100000loops,best of 3:12.2µsperloop10000loops,best of 3:47.3µsperloop# n = 100010000loops,best of 3:49.1µsperloop10000loops,best of 3:148µsperloop# n = 100001000 loops,best of 3:548µsperloop1000 loops,best of 3:1.6msperloop# n = 100000100loops,best of 3:6.96msperloop100loops,best of 3:19.3msperloop# n = 100000010loops,best of 3:160msperloop1loops,best of 3:372msperloop# n = 100000001loops,best of 3:2.54sperloop1loops,best of 3:5.92sperloop

Although maybe there is room for improvement, the current solution is 2-3 times slower than ali_m's solution and perhaps a little messier as well as both arrays need to be mapped. Just thought this could be an alternate solution.

Post a Comment for "Finding A Set Of Indices That Maps The Rows Of One Numpy Ndarray To Another"