
Hashtable/dictionary/map Lookup With Regular Expressions

I'm trying to figure out if there's a reasonably efficient way to perform a lookup in a dictionary (or a hash, or a map, or whatever your favorite language calls it) where the keys are regular expressions and a lookup string is matched against them.

Solution 1:

What you want to do is very similar to what is supported by xrdb, although xrdb supports only a fairly minimal notion of globbing.

Internally you can implement a larger family of regular languages than theirs by storing your regular expressions as a character trie.

  • Single characters simply become trie nodes.
  • .'s become wildcard insertions covering all children of the current trie node.
  • *'s become back links in the trie to the node at the start of the previous item.
  • [a-z] ranges insert the same subsequent child nodes repeatedly under each character in the range. With care, while inserts/updates may be somewhat expensive, the search can be linear in the size of the string. With some placeholder tricks, the common combinatorial-explosion cases can be kept under control.
  • (foo)|(bar) alternations become multiple insertions.

This doesn't handle regexes that occur at arbitrary points in the string, but that can be modeled by wrapping your regex with .* on either side.
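The trie idea above can be sketched in a few lines. This is a minimal illustration, not the answer's full scheme: it supports only literal characters and the '.' wildcard (stored as a special branch that matches any single character), and all names here are invented for the example.

```python
# Minimal character trie supporting literals and the '.' wildcard.
# A sketch of the answer's idea; names and structure are illustrative.

_WILDCARD = object()  # sentinel branch that matches any single character

class RegexTrie:
    def __init__(self):
        self.root = {}

    def insert(self, pattern, value):
        """Insert a pattern made of literals and '.' wildcards."""
        node = self.root
        for ch in pattern:
            key = _WILDCARD if ch == '.' else ch
            node = node.setdefault(key, {})
        node[None] = value  # None marks an accepting node

    def lookup(self, s):
        """Return the value of a pattern matching all of s, else None."""
        def walk(node, i):
            if i == len(s):
                return node.get(None)
            # Try the literal branch first, then the wildcard branch.
            for key in (s[i], _WILDCARD):
                child = node.get(key)
                if child is not None:
                    found = walk(child, i + 1)
                    if found is not None:
                        return found
            return None
        return walk(self.root, 0)

trie = RegexTrie()
trie.insert('f.o', 5)
trie.insert('bar', 7)
print(trie.lookup('foo'))  # → 5
print(trie.lookup('bar'))  # → 7
```

Extending this to `*`, ranges, and alternation follows the bullet points above, at the cost of a considerably more involved insert step.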

Perl has a couple of Text::Trie-like modules you can raid for ideas. (Heck, I think I even wrote one of them way back when.)


Solution 2:

This is not possible to do with a regular hash table in any language. You'll either have to iterate through the entire keyset, attempting to match the key to your regex, or use a different data structure.

You should choose a data structure that is appropriate to the problem you're trying to solve. If you have to match against any arbitrary regular expression, I don't know of a good solution. If the class of regular expressions you'll be using is more restrictive, you might be able to use a data structure such as a trie or suffix tree.


Solution 3:

In the general case, what you need is a lexer generator. It takes a bunch of regular expressions and compiles them into a recognizer. "lex" will work if you are using C. I have never used a lexer generator in Python, but there seem to be a few to choose from. Google shows PLY, PyGgy and PyLexer.

If the regular expressions all resemble each other in some way, then you may be able to take some shortcuts. We would need to know more about the ultimate problem that you are trying to solve in order to come up with any suggestions. Can you share some sample regular expressions and some sample data?

Also, how many regular expressions are you dealing with here? Are you sure that the naive approach won't work? As Rob Pike once said, "Fancy algorithms are slow when n is small, and n is usually small." Unless you have thousands of regular expressions, and thousands of things to match against them, and this is an interactive application where a user is waiting for you, you may be best off just doing it the easy way and looping through the regular expressions.
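The easy way described above takes only a few lines. This is a sketch of that naive approach, with illustrative names; it compiles each pattern once up front and scans the list on every lookup, returning the first full match:

```python
import re

class NaiveRegexMap:
    """Sketch of the naive approach: a list of (compiled pattern, value)
    pairs, scanned linearly on each lookup. First match wins."""

    def __init__(self, items):
        self._items = [(re.compile(p), v) for p, v in items]

    def __getitem__(self, key):
        for pattern, value in self._items:
            if pattern.fullmatch(key):
                return value
        raise KeyError(key)

m = NaiveRegexMap([(r'foo.*', 5), (r'f.*', 6)])
print(m['food'])  # → 5 (first pattern in the list that matches)
```

For a handful of patterns this is hard to beat, and the first-match-wins order makes the ambiguity question from Solution 4 deterministic.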


Solution 4:

What happens if you have a dictionary such as

regex_dict = { re.compile("foo.*"): 5, re.compile("f.*"): 6 }

In this case regex_dict["food"] could legitimately return either 5 or 6.

Even ignoring that problem, there's probably no way to do this efficiently with the regex module. Instead, what you'd need is an internal directed graph or tree structure.
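To make the ambiguity concrete, here is a small sketch that finds all matching keys and resolves ties with one arbitrary but deterministic policy (longest pattern string wins); the `lookup` helper is invented for this illustration:

```python
import re

regex_dict = {"foo.*": 5, "f.*": 6}

def lookup(d, key):
    """Return the value whose pattern fully matches key,
    preferring the longest pattern string on ties."""
    matches = [(p, v) for p, v in d.items() if re.fullmatch(p, key)]
    if not matches:
        raise KeyError(key)
    return max(matches, key=lambda pv: len(pv[0]))[1]

print(lookup(regex_dict, "food"))  # → 5: both patterns match, "foo.*" is longer
```

Whatever policy you pick (first inserted, longest pattern, longest literal prefix), the point stands: the data structure alone cannot decide between 5 and 6 for you.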


Solution 5:

Here's an efficient way to do it by combining the keys into a single compiled regexp, so no looping over key patterns is required. It abuses match.lastindex to find out which key matched. (It's a shame regexp libraries don't let you tag the terminal state of the DFA that a regexp is compiled to, or this would be less of a hack.)

The expression is compiled once, and will produce a fast matcher that doesn't have to search sequentially. Common prefixes are compiled together in the DFA, so each character in the key is matched once, not many times, unlike some of the other suggested solutions. You're effectively compiling a mini lexer for your keyspace.

This map isn't extensible (can't define new keys) without recompiling the regexp, but it can be handy for some situations.

# Regular expression map
# Abuses match.lastindex to figure out which key was matched
# (i.e. to emulate extracting the terminal state of the DFA of the regexp engine)
# Mostly for amusement.
# Richard Brooksby, Ravenbrook Limited, 2013-06-01

import re

class ReMap(object):

    def __init__(self, items):
        if not items:
            items = [(r'epsilon^', None)] # Match nothing
        key_patterns = []
        self.lookup = {}
        index = 1
        for key, value in items:
            # Ensure there are no capturing parens in the key, because
            # that would mess up match.lastindex
            key_patterns.append('(' + re.sub(r'\((?!\?:)', '(?:', key) + ')')
            self.lookup[index] = value
            index += 1
        self.keys_re = re.compile('|'.join(key_patterns))

    def __getitem__(self, key):
        m = self.keys_re.match(key)
        if m:
            return self.lookup[m.lastindex]
        raise KeyError(key)

if __name__ == '__main__':
    remap = ReMap([(r'foo.', 12), (r'FileN.*', 35)])
    print(remap['food'])
    print(remap['foot in my mouth'])
    print(remap['FileNotFoundException: file.x does not exist'])
