Python Efficient Way To Check If Very Large String Contains A Substring
Solution 1:
Is it really slow? You're talking about a 30 MB string; let's try it with an even bigger one:
    In [12]: string = "agu82934u" * 50 * 1024 * 1024 + "string to be found"

    In [13]: len(string)
    Out[13]: 471859218

    In [14]: %timeit "string to be found" in string
    1 loops, best of 3: 335 ms per loop

    In [15]: %timeit "string not to be found" in string
    1 loops, best of 3: 200 ms per loop

I wouldn't say that 335 ms is much time to look for a substring in a 450 MB string.
Solution 2:
How slow is too slow? I just did an a in b test for a unique string at the end of a 170 MB string. It finished before my finger left the Enter key.
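If you want to put a number on "finished before my finger left the Enter key", you can time the plain `in` test yourself with the standard-library timeit module; the 170 MB string and the marker value below are made up for illustration:

```python
import timeit

# Hypothetical setup: a ~170 MB string with a unique marker at the end,
# mirroring the worst case (match found only after scanning everything).
big = "x" * (170 * 1024 * 1024) + "unique-marker"

# Time a single plain `a in b` membership test.
elapsed = timeit.timeit(lambda: "unique-marker" in big, number=1)
print(f"{elapsed:.3f} s")
```

On any recent machine this typically comes out well under a second, which matches the answer's experience.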
Solution 3:
You can use one of these algorithms:
Knuth–Morris–Pratt algorithm (aka KMP); see an implementation here
Although I believe KMP is more efficient, it's more complicated to implement. The first link includes some pseudo-code that should make it very easy to implement in Python.
You can look for alternatives here: http://en.wikipedia.org/wiki/String_searching_algorithm
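For reference, here is a minimal KMP sketch along the lines of that pseudo-code (the function name `kmp_search` is my own, not from the linked page):

```python
def kmp_search(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    # Build the failure table: fail[i] is the length of the longest proper
    # prefix of pattern[:i+1] that is also a suffix of it.
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan the text once, using the table to avoid re-comparing characters.
    k = 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - len(pattern) + 1
    return -1
```

Bear in mind that for a one-off search, CPython's built-in `in` operator runs an optimized search in C, so it will almost always beat a pure-Python KMP in practice.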
Solution 4:
To be honest, I do not see how to make the comparison itself faster. But you can use less memory and lose less time on I/O if you read the file line by line:
    has_small_string = False
    for line in large_file:
        if small_string in line:
            has_small_string = True
            break
    if has_small_string:
        # ... Your stuff here ...

This is no revolutionary improvement, and can be even less useful if you really need the large string in memory, but it may be helpful, so I am posting it here :)
Solution 5:
If you just want to check if that substring exists,
    import sys

    for line in open("file"):
        if substring in line:
            print("exists")
            sys.exit()  # or `break` to do some other stuff