Regex Find List Values Within Multiple Multi-line Strings Python
Solution 1:
You can find the subquery name and associated fields, and then build the desired dictionary:
import re, collections
qry = '\nwith\n\nqry_1 as ( select some code, varas var_1 from apple where code.. and code..\n),\nqry_2 as ( select some code, wherevaras var_2 from pear where code.. and code..\n),\nqry_3 as ( select some code from strawberry join some code, from apple wherevaras var_3, )\n)\n'
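sub = ['apple', 'pear', 'strawberry']  # names to look for (the keys in the output below)
d, d1 = collections.defaultdict(list), {}
for i in re.split(r'(?<=\),)\n', qry):
    # first match is the subquery name, the rest are names that follow "from"
    a, *_b = re.findall(r'\w+(?=\sas\s\()|(?<=from\s)\w+', i)
    b = [j for j in _b if j in sub]
    for k in b:
        d[k].append(a)
    d1[a] = b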
print(dict(d))
print(dict(d1))
Output:
{'apple': ['qry_1', 'qry_3'], 'pear': ['qry_2'], 'strawberry': ['qry_3']}
{'qry_1': ['apple'], 'qry_2': ['pear'], 'qry_3': ['strawberry', 'apple']}
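To see what each regex contributes, here is a minimal sketch that runs the same split and findall on a shorter query. The query text and the names sample, chunks, pattern and found are made up for illustration and are not part of the original answer. Note that the pattern only looks behind for "from", so a table that appears after "join" (plum here) is not picked up:

```python
import re

# Hypothetical two-subquery input, shaped like the query in the answer.
sample = (
    "with\n"
    "q1 as ( select x from apple\n"
    "),\n"
    "q2 as ( select y from pear join plum on pear.id = plum.id\n"
    ")\n"
)

# Split only on a newline that directly follows "),", i.e. the end of a subquery.
chunks = re.split(r'(?<=\),)\n', sample)

# Alternative 1: a word followed by " as (" (the subquery name).
# Alternative 2: a word directly preceded by "from " (a table name).
pattern = r'\w+(?=\sas\s\()|(?<=from\s)\w+'
found = [re.findall(pattern, chunk) for chunk in chunks]

print(len(chunks))  # 2
print(found)        # [['q1', 'apple'], ['q2', 'pear']]
```

If tables after join matter for your data, the lookbehind would need a second alternative for "join", or you can use the parser-based approach.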
Edit: Due to the complexity of your queries, I suggest using the sqlparse package. sqlparse builds a navigable token structure that can be traversed to grab the desired info.
First, install sqlparse:
pip3 install sqlparse
Then, parse and traverse the query. The function get_fields searches for identifiers that come after from or join keywords. These identifiers can be table names or subqueries. The parameter all_identifiers makes it grab every identifier, regardless of whether or not it follows a from or join. In the context of your parsing problem, setting this parameter to True will return the fields chosen by the select block, as well as the identifiers after from or join:
import re, collections
import sqlparse
from sqlparse import tokens as T
sub = ['apple.apple','event.pear','strawberry']
qry = """
with qry_1 as (
select a.* from apple.apple a
),
with qry_2 as (
select a.* from apple a join strawberry s on a.id = s.id
),
with qry_3 as (
select a.* from (select k.* from event.pear p) l join apple.apple a on l.id = a.id join (select x.* s from strawberry s where s.m = (select max(l) from ignore_field where l.id = s.id)) k3 on k3 = a.id
)
"""

def get_fields(block, all_identifiers=False):
    # seen_id becomes True once a from/join keyword has been passed
    seen_id = all_identifiers
    for i in getattr(block, 'tokens', []):
        if i.ttype == T.Keyword and i.value.lower() in {'from', 'join'}:
            seen_id = True
        if seen_id and isinstance(i, sqlparse.sql.Identifier):
            yield i.get_alias()
            if any(isinstance(k, sqlparse.sql.Parenthesis) for k in getattr(i, 'tokens', [])):
                # the identifier wraps a subquery: recurse into it
                yield from get_fields(i, all_identifiers=seen_id)
            else:
                yield from re.findall(r'^[\w+\.]+|\w+', str(i))
        elif seen_id:
            yield from get_fields(i, all_identifiers=seen_id)
p = sqlparse.parse(qry)
k = {i.tokens[0].value: list(get_fields(i.tokens[-1])) for j in p for i in j.tokens if isinstance(i, sqlparse.sql.Identifier)}
d1, d2 = collections.defaultdict(list), {}
for a, _b in k.items():
    for i in (b := [j for j in _b if j in sub]):
        d1[i].append(a)
    d2[a] = b
print(dict(d1))
print(dict(d2))
Output:
{'apple.apple': ['qry_1', 'qry_3'], 'strawberry': ['qry_2', 'qry_3'], 'event.pear': ['qry_3']}
{'qry_1': ['apple.apple'], 'qry_2': ['strawberry'], 'qry_3': ['event.pear', 'apple.apple', 'strawberry']}
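Both solutions end with the same bookkeeping step: filter each query's identifiers against sub and record the relation in both directions. The version above uses the := operator, which needs Python 3.8 or newer. Below is a sketch of that step as a standalone function without it. The function name build_maps and the sample input are hypothetical, and unlike the original it also drops repeated identifiers:

```python
import collections

def build_maps(found, wanted):
    """found: {query_name: [identifier, ...]}; wanted: names to keep."""
    wanted = set(wanted)
    by_table = collections.defaultdict(list)  # table -> queries that use it
    by_query = {}                             # query -> tables it uses
    for query, identifiers in found.items():
        # dict.fromkeys drops duplicates while keeping first-seen order
        tables = [t for t in dict.fromkeys(identifiers) if t in wanted]
        for t in tables:
            by_table[t].append(query)
        by_query[query] = tables
    return dict(by_table), by_query

# Hypothetical input in the shape of the dictionary k from the answer.
found = {
    'qry_1': ['a', 'apple.apple'],
    'qry_2': ['a', 'apple', 's', 'strawberry'],
}
by_table, by_query = build_maps(found, ['apple.apple', 'event.pear', 'strawberry'])
print(by_table)  # {'apple.apple': ['qry_1'], 'strawberry': ['qry_2']}
print(by_query)  # {'qry_1': ['apple.apple'], 'qry_2': ['strawberry']}
```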
Notes:
- This currently only searches for identifiers after from/join keywords. To search for field names chosen just after the select keyword, use list(get_fields(i.tokens[-1], True)).
- get_fields will also yield the subquery/table alias, i.e. if apple.apple a exists then a will also be yielded, along with apple.apple. If you don't want this behavior, just comment out yield i.get_alias().