Python Efficient Way To Check If Very Large String Contains A Substring

May 31, 2023 Post a Comment

Python is not my best language, and so I'm not all that good at finding the most efficient solutions to some of my problems. I have a very large string (coming from a 30 MB file) a

Solution 1:

Is it really slow? You're talking about 30MB string; let's try it with even bigger string:

In [12]: string="agu82934u"*50*1024*1024+"string to be found"In [13]: len(string)
Out[13]: 471859218In [14]: %timeit "string to be found"instring1 loops, best of3: 335 ms per loopIn [15]: %timeit "string not to be found"instring1 loops, best of3: 200 ms per loop

I wouldn't say that 335 ms is much time looking for substring in 450MB string.

Solution 2:

How slow is too slow? I just did an a in b test for a unique string at the end of a 170 MB string. It finished before my finger left the Enter key.

Solution 3:

You can use one of these algorithms:

Rabin–Karp string search algorithm
Knuth–Morris–Pratt algorithm (aka KMP) see an implementation here

Although I believe KMP is more efficient, it's more complicated to implement.The first link includes some pseudo-code that should make it very easy to implement in python.

you can look for alternatives here: http://en.wikipedia.org/wiki/String_searching_algorithm

Solution 4:

I do not see how to make it more optimal on the comparison, to be honest. But you can use less memory and lose less time with I/O if you read the file line by line:

has_small_string=Falsefor line in large_file:if small_string in line:has_small_string=Truebreakif has_small_string:# ... Your stuff here ...

This is no revolutionary improvement, and can be even less useful if you really need the large string in the memory, but it may be helpful, so I am posting here :)

Solution 5:

If you just want to check if that substring exists,

for line inopen("file"):
    if substring in line:
         print"exists"
         sys.exit() # orbreak to do some other stuff

Getting Started with Python