Looking For Fast Way To Compute Pair Wise Distances Of Many Strings
Solution 1:
If you only need the nearest neighbor for each bitarry (ignoring itself), and you could get away with a tiny chance of only getting an approximate nearest neighbor, you might consider implementing the "Bit Sampling" Locality Sensitive Hash for Hamming distance. In a nutshell, create three hash tables. From each 128-bit input, sample 16 bits, 3 times, using those 16 bit samples as keys. The values of your hash tables should be a list of all 128-bit inputs that had that sampled key. Once you place all million of your inputs into the LSH index, simply:
- Iterate over the million points
- For each input, perform the above 3-sampling
- Find the nearest neighbor in each of the three lists (with distance > 0), keep the best overall
Both loading and testing are ludicrously quick. I might recommend the excellent bitarray library for underpinning this.
Solution 2:
I tried to use numpy. Here is my code:
#!/usr/bin/env pythonimport numpy as np
import time
defgen_data(n):
arr = np.empty(shape=(n, 16))
for i inrange(n):
arr[i] = np.random.randint(ord('A'), ord('Z')+1, 16)
return arr
defdistance_from_array(i, arr):
r = arr[i] != arr
r[i,:] = True
min_i = np.argmin(np.sum(r, axis=1))
return min_i
data = gen_data(1000000)
distances = []
start = time.time()
for i inrange(200):
distances.append(distance_from_array(i, data))
end = time.time()
print end - start
You can convert your list of strings into an array of numbers. Then you can use numpy function for working with array, such as sum and argmin. I think you don't want to find only distances larger than 1, if it's possible that one string will appear twice.
I tested it on my computer and it takes about 10 seconds to process 200 strings. For each one you have to go through all 1 000 000 other strings, so we can compute the time it would take to process all of them fairly easily. It should be around 13 hours. However, don't forget that we are using only one core at the moment. If you split indexes and use http://docs.python.org/2/library/multiprocessing.html#module-multiprocessing.pool you can get your results quite quickly.
Post a Comment for "Looking For Fast Way To Compute Pair Wise Distances Of Many Strings"