Regex Find List Values Within Multiple Multi-line Strings Python
Solution 1:
You can find the subquery name and associated fields, and then build the desired dictionary:
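import re, collections
# Table names of interest (missing from the original snippet; inferred from the output below)
sub = ['apple', 'pear', 'strawberry']
qry = '\nwith\n\nqry_1 as ( select some code, varas var_1 from apple where code.. and code..\n),\nqry_2 as ( select some code, wherevaras var_2 from pear where code.. and code..\n),\nqry_3 as ( select some code from strawberry join some code, from apple wherevaras var_3, )\n)\n'
d, d1 = collections.defaultdict(list), {}
# Split into one chunk per subquery: break at a newline that follows "),"
for i in re.split(r'(?<=\),)\n', qry):
    # First match is the subquery name (the word before " as ("); the rest are words after "from "
    a, *_b = re.findall(r'\w+(?=\sas\s\()|(?<=from\s)\w+', i)
    b = [j for j in _b if j in sub]
    for k in b:
        d[k].append(a)
    d1[a] = b
print(dict(d))
print(dict(d1))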
Output:
{'apple': ['qry_1', 'qry_3'], 'pear': ['qry_2'], 'strawberry': ['qry_3']}
{'qry_1': ['apple'], 'qry_2': ['pear'], 'qry_3': ['strawberry', 'apple']}
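To see what each of the two patterns above contributes, here is a small sketch that applies them separately to a shortened query. The sample string and the variable names are made up for illustration and do not come from the original question.

```python
import re

# Hypothetical, shortened query used only for illustration.
sample = "qry_1 as ( select x from apple\n),\nqry_2 as ( select y from pear join z\n)\n"

# Split on a newline only when it directly follows "),",
# i.e. at the boundary between two subqueries.
chunks = re.split(r'(?<=\),)\n', sample)
print(chunks)

# First alternative: a word followed by " as (" (the subquery name).
# Second alternative: a word preceded by "from " (a table name).
pattern = r'\w+(?=\sas\s\()|(?<=from\s)\w+'
for chunk in chunks:
    name, *tables = re.findall(pattern, chunk)
    print(name, tables)
```

Note that the second alternative only looks behind for "from ", so in this sketch the name after join is not picked up. That limitation is one reason a parser is a better fit for more complex queries.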
Edit: Due to the complexity of your queries, I suggest using the sqlparse package. sqlparse will create a navigable structure that can be traversed to grab the desired info.
First, install sqlparse:
pip3 install sqlparse
Then, parse and traverse the query. The function get_fields searches for identifiers that come after from or join keywords. These identifiers can be table names or subqueries. The parameter all_identifiers makes it grab any identifier, regardless of whether or not it follows a from or join. In the context of your parsing problem, setting this parameter to True will search for the fields chosen by the select block, as well as identifiers after from or join:
import re, collections
import sqlparse
from sqlparse import tokens as T
sub = ['apple.apple','event.pear','strawberry']
qry = """
with qry_1 as (
select a.* from apple.apple a
),
with qry_2 as (
select a.* from apple a join strawberry s on a.id = s.id
),
with qry_3 as (
select a.* from (select k.* from event.pear p) l join apple.apple a on l.id = a.id join (select x.* s from strawberry s where s.m = (select max(l) from ignore_field where l.id = s.id)) k3 on k3 = a.id
)
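"""
def get_fields(block, all_identifiers = False):
    # seen_id becomes True once a from/join keyword has been passed in this block
    seen_id = all_identifiers
    for i in getattr(block, 'tokens', []):
        if i.ttype == T.Keyword and i.value.lower() in {'from', 'join'}:
            seen_id = True
        if seen_id and isinstance(i, sqlparse.sql.Identifier):
            yield i.get_alias()
            if any(isinstance(k, sqlparse.sql.Parenthesis) for k in getattr(i, 'tokens', [])):
                # The identifier wraps a parenthesized subquery: recurse into it
                yield from get_fields(i, all_identifiers = seen_id)
            else:
                # Plain reference: yield the leading dotted name, then any remaining words
                yield from re.findall(r'^[\w+\.]+|\w+', str(i))
        elif seen_id:
            yield from get_fields(i, all_identifiers = seen_id)

p = sqlparse.parse(qry)
# Key: first token of each top-level Identifier (the query name); value: fields found in its last token
k = {i.tokens[0].value:list(get_fields(i.tokens[-1])) for j in p for i in j.tokens if isinstance(i, sqlparse.sql.Identifier)}
d1, d2 = collections.defaultdict(list), {}
for a, _b in k.items():
    # Keep only the names listed in sub (the walrus operator needs Python 3.8+)
    for i in (b:=[j for j in _b if j in sub]):
        d1[i].append(a)
    d2[a] = b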
print(dict(d1))
print(dict(d2))
Output:
{'apple.apple': ['qry_1', 'qry_3'], 'strawberry': ['qry_2', 'qry_3'], 'event.pear': ['qry_3']}
{'qry_1': ['apple.apple'], 'qry_2': ['strawberry'], 'qry_3': ['event.pear', 'apple.apple', 'strawberry']}
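The bookkeeping at the end of the script, which filters each query's identifiers against sub and records the mapping in both directions, does not depend on sqlparse. Below is a minimal sketch of that step as a standalone helper. The function name build_maps and the sample input are hypothetical, and unlike the loop above it also drops names repeated within one query, which is an addition of this sketch.

```python
from collections import defaultdict

def build_maps(fields_by_query, wanted):
    """Return (table -> queries, query -> tables), keeping only wanted names."""
    wanted = set(wanted)
    by_table, by_query = defaultdict(list), {}
    for query, fields in fields_by_query.items():
        # dict.fromkeys drops repeats while preserving first-seen order
        tables = [f for f in dict.fromkeys(fields) if f in wanted]
        for table in tables:
            by_table[table].append(query)
        by_query[query] = tables
    return dict(by_table), by_query

# Hypothetical input: raw names per query, including aliases and a None
# such as get_alias() may return when no alias is present.
found = {
    'qry_1': [None, 'a', 'apple.apple'],
    'qry_2': ['a', 'apple', 's', 'strawberry', 'strawberry'],
}
by_table, by_query = build_maps(found, ['apple.apple', 'event.pear', 'strawberry'])
print(by_table)
print(by_query)
```

Because aliases and other stray names are not in the wanted list, they fall away in the filter without any special handling.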
Notes:
- This currently only searches for identifiers after from/join keywords. To search for field names chosen just after the select keyword, use list(get_fields(i.tokens[-1], True)).
- get_fields will also yield the subquery/table alias, i.e. if apple.apple a exists then a will also be yielded, along with apple.apple. If you don't want this behavior, just comment out yield i.get_alias().