
Regex Find List Values Within Multiple Multi-line Strings Python

I'm looking for help searching a multi-line string for the values in a list. The string holds several subqueries that follow a similar pattern: each one opens with "as (" and closes with "),", so the markers are space, "as", space, "(" at the start and ")," at the end.

Solution 1:

You can find the subquery name and associated fields, and then build the desired dictionary:

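import re, collections
qry = '\nwith\n\nqry_1 as ( select some code, varas var_1 from apple where code.. and code..\n),\nqry_2 as ( select some code, wherevaras var_2 from pear where code.. and code..\n),\nqry_3 as ( select some code from strawberry join some code, from apple wherevaras var_3, )\n)\n'
# Names to search for. The snippet uses sub without defining it, so this list is assumed from the output below.
sub = ['apple', 'pear', 'strawberry']
d, d1 = collections.defaultdict(list), {}
# Split right after each "),\n" so that every piece holds one subquery
for i in re.split(r'(?<=\),)\n', qry):
    # First match: the subquery name before " as ("; remaining matches: words right after "from "
    a, *_b = re.findall(r'\w+(?=\sas\s\()|(?<=from\s)\w+', i)
    b = [j for j in _b if j in sub]
    for k in b:
        d[k].append(a)
    d1[a] = b

print(dict(d))
print(dict(d1))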

Output:

{'apple': ['qry_1', 'qry_3'], 'pear': ['qry_2'], 'strawberry': ['qry_3']}
{'qry_1': ['apple'], 'qry_2': ['pear'], 'qry_3': ['strawberry', 'apple']}
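
To see how the two regular expressions divide the work, it may help to print the intermediate values. The sketch below reuses the sample query from above; the names pieces, pattern and parsed are only illustrative and are not part of the original answer.

```python
import re

# Same sample query as in the solution above.
qry = '\nwith\n\nqry_1 as ( select some code, varas var_1 from apple where code.. and code..\n),\nqry_2 as ( select some code, wherevaras var_2 from pear where code.. and code..\n),\nqry_3 as ( select some code from strawberry join some code, from apple wherevaras var_3, )\n)\n'

# Step 1: the lookbehind splits on a newline only when "),"
# comes directly before it, so each piece holds one subquery.
pieces = re.split(r'(?<=\),)\n', qry)

# Step 2: the first alternative matches the word before " as (" (the subquery name),
# the second matches the word directly after "from " (a table name).
pattern = r'\w+(?=\sas\s\()|(?<=from\s)\w+'

parsed = []
for piece in pieces:
    name, *tables = re.findall(pattern, piece)
    parsed.append((name, tables))
    print(name, tables)
```

This prints qry_1 ['apple'], qry_2 ['pear'] and qry_3 ['strawberry', 'apple'], one per line. Note that only names following from are picked up, so a table that appears only after join would be missed by this pattern.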

Edit: Due to the complexity of your queries, I suggest using the sqlparse package. sqlparse builds a navigable token structure that can be traversed to grab the desired info.

First, install sqlparse:

pip3 install sqlparse

Then, parse and traverse the query. The function get_fields searches for identifiers that come after from or join keywords. These identifiers can be table names or subqueries. The parameter all_identifiers makes it grab every identifier, regardless of whether or not it follows a from or join. In the context of your parsing problem, setting this parameter to True will search for the fields chosen by the select block, as well as identifiers after from or join:

import re, collections
import sqlparse
from sqlparse import tokens as T
sub = ['apple.apple','event.pear','strawberry']
qry = """
with qry_1 as (
   select a.* from apple.apple a
),
with qry_2 as (
   select a.* from apple a join strawberry s on a.id = s.id
),
with qry_3 as (
   select a.* from (select k.* from event.pear p) l join apple.apple a on l.id = a.id join (select x.* s from strawberry s where s.m = (select max(l) from ignore_field where l.id = s.id)) k3 on k3 = a.id
)
"""

def get_fields(block, all_identifiers = False):
   seen_id = all_identifiers
   for i in getattr(block, 'tokens', []):
      if i.ttype == T.Keyword and i.value.lower() in {'from', 'join'}:
         seen_id = True
      if seen_id and isinstance(i, sqlparse.sql.Identifier):
         yield i.get_alias()
         if any(isinstance(k, sqlparse.sql.Parenthesis) for k in getattr(i, 'tokens', [])):
            yield from get_fields(i, all_identifiers = seen_id)
         else:
            yield from re.findall(r'^[\w+\.]+|\w+', str(i))
      elif seen_id:
         yield from get_fields(i, all_identifiers = seen_id)

p = sqlparse.parse(qry)
k = {i.tokens[0].value:list(get_fields(i.tokens[-1])) for j in p for i in j.tokens if isinstance(i, sqlparse.sql.Identifier)}
d1, d2 = collections.defaultdict(list), {}
for a, _b in k.items():
    for i in (b:=[j for j in _b if j in sub]):
       d1[i].append(a)
    d2[a] = b

print(dict(d1))
print(dict(d2))

Output:

{'apple.apple': ['qry_1', 'qry_3'], 'strawberry': ['qry_2', 'qry_3'], 'event.pear': ['qry_3']}
{'qry_1': ['apple.apple'], 'qry_2': ['strawberry'], 'qry_3': ['event.pear', 'apple.apple', 'strawberry']}

Notes:

  1. This currently only searches for identifiers after from/join keywords. To search for field names chosen just after the select keyword, use list(get_fields(i.tokens[-1], True)).
  2. get_fields will also yield subquery/table aliases, i.e., if apple.apple a exists, then a will be yielded along with apple.apple. If you don't want this behavior, just comment out yield i.get_alias().
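
As a rough illustration of note 2, the sketch below shows the filtering step on its own, without sqlparse. The dictionary fields_by_query is made-up sample data standing in for the output of get_fields (table names mixed with aliases and None, which get_alias() returns when an identifier has no alias); the variable names are hypothetical as well.

```python
import collections

# Made-up stand-in for the per-subquery output of get_fields:
# table names, aliases such as 'a', and None for identifiers without an alias.
fields_by_query = {
    'qry_1': ['a', 'apple.apple', 'a'],
    'qry_2': [None, 'strawberry'],
}
sub = ['apple.apple', 'event.pear', 'strawberry']

wanted = set(sub)
tables_by_query = {}
queries_by_table = collections.defaultdict(list)
for name, fields in fields_by_query.items():
    # Keep only the names being searched for; dict.fromkeys drops
    # duplicates while preserving the order of first appearance.
    kept = list(dict.fromkeys(f for f in fields if f in wanted))
    tables_by_query[name] = kept
    for table in kept:
        queries_by_table[table].append(name)

print(dict(queries_by_table))
print(tables_by_query)
```

Because the membership test against sub discards anything not in the list, extra aliases are usually harmless here. They would only matter if an alias happened to be spelled exactly like one of the names in sub.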
