
Extract Specific Letters From Text Using Regex And Compare With Dictionary

I have a list of texts, about 90% of which are in the format AABBB-CCCDDD001. A few texts in this list may instead look like AABBBICS-CCCDDD001 or AABBBIGW-CCCDDD001 or AA

Solution 1:

We can use .find to locate the '-' and slice out the code word, if it exists, and then use the dictionary to map the code word to its code number. The dictionary's .get method returns the null code for missing or unknown code words. This version returns None if it encounters bad data: a name that doesn't contain '-', or a name that doesn't have either 8 or 5 characters before the '-'.

env_code = {
    'ICS': '1',
    'IGW': '2',
    'RTL': '3',
    'TDZ': '4',
}

null_code = '9'

def get_env_code(name):
    idx = name.find('-')
    if idx == 8:
        # code may be valid
        code = name[idx-3:idx]
    elif idx == 5:
        # code is missing
        code = ''
    else:
        # Bad name
        return None
    return env_code.get(code, null_code)

# test

data = [
    'AABBBICS-CCCDDD001',
    'AABBBIGW-CCCDDD001',
    'AABBBRTL-CCCDDD001',
    'AABBBTDZ-CCCDDD001',
    'USNYCRTL-LANDCE001',
    'AABBBXYZ-CCCDDD001',
    'AABBB-CCCDDD001',
    'BADDATA',
]

for s in data:
    print(s, get_env_code(s))

output

AABBBICS-CCCDDD001 1
AABBBIGW-CCCDDD001 2
AABBBRTL-CCCDDD001 3
AABBBTDZ-CCCDDD001 4
USNYCRTL-LANDCE001 3
AABBBXYZ-CCCDDD001 9
AABBB-CCCDDD001 9
BADDATA None

Here's a simpler version that returns the null code instead of None for bad data.

def get_env_code(name):
    idx = name.find('-')
    code = name[idx-3:idx] if idx == 8 else ''
    return env_code.get(code, null_code)
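Since the question asks about regex, here is a minimal sketch of the same idea with the re module. It assumes the part before the '-' is always five uppercase letters, optionally followed by a three-letter code word; that is stricter than the .find version, which accepts any characters there. The names ENV_CODE, NULL_CODE, NAME_RE and get_env_code_re are only illustrative.

```python
import re

ENV_CODE = {'ICS': '1', 'IGW': '2', 'RTL': '3', 'TDZ': '4'}
NULL_CODE = '9'

# Assumption: five uppercase letters, an optional three-letter code word, then '-'.
NAME_RE = re.compile(r'[A-Z]{5}([A-Z]{3})?-')

def get_env_code_re(name):
    m = NAME_RE.match(name)
    if m is None:
        # Bad name: no '-' or the wrong number of letters before it
        return None
    # group(1) is None when the code word is absent, so .get falls back to NULL_CODE
    return ENV_CODE.get(m.group(1), NULL_CODE)
```

For the test data above this should give the same results as the first version, including None for 'BADDATA'.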

Solution 2:

If you're just checking whether a member of ENVIRONMENTCODE is found within each test string, then a regex is not necessary. You can just use the Python keyword in, e.g.

ENVIRONMENTCODE = {
    'ICS': '1',
    'IGW': '2',
    'RTL': '3',
    'TDZ': '4'
}

NULLCODE = {
    'NULL': '9'
}

def environment_code(test_string, code_dict):
    if '-' not in test_string:
        return 'no dash'
    for code, value in code_dict.items():
        if code in test_string:
            return value
    return NULLCODE['NULL']


to_test = ['AABBBICS-CCCDDD001',
           'AABBBIGW-CCCDDD001',
           'AABBBRTL-CCCDDD001',
           'AABBBTDZ-CCCDDD001']
for test_str in to_test:
    print(environment_code(test_str, ENVIRONMENTCODE))

The problem with your original code was that you were trying to do

test_string in code_dict

which only checks for exact matches between the string under test and the keys within the dictionary.
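To make the difference concrete, here is a small sketch of the two meanings of in. One caveat with a plain substring test is that it searches the whole string, so a name like 'AABBB-ICSDDD001' would also report code 1. The helper environment_code_prefix below is a hypothetical variant, not part of the answer above, that only looks at the end of the part before the first dash.

```python
ENVIRONMENTCODE = {'ICS': '1', 'IGW': '2', 'RTL': '3', 'TDZ': '4'}
name = 'AABBBICS-CCCDDD001'

# `in` on a dict asks: is the left operand exactly one of the keys?
whole_name_is_key = name in ENVIRONMENTCODE   # False
# `in` on a str asks: is the left operand a substring?
code_is_substring = 'ICS' in name             # True

def environment_code_prefix(test_string, code_dict, null_code='9'):
    """Hypothetical variant: match the code only at the end of the pre-dash part."""
    prefix, dash, _rest = test_string.partition('-')
    if not dash:
        return 'no dash'
    for code, value in code_dict.items():
        if prefix.endswith(code):
            return value
    return null_code
```

This still assumes the part before the dash is either the bare five-letter prefix or that prefix plus a code word; a five-letter prefix that happens to end in a code word would be misread.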

Solution 3:

My proposal (it relies on the ENVIRONMENTCODE dictionary from the question):

def environmentcode(s):
    if "-" not in s:  #(**)
        return None   #(**)
    h,t=s.split("-")
    code=h.strip()[5:]
    return ENVIRONMENTCODE.get(code,9)   

data="AABBBICS-CCCDDD001 AABBBIGW-CCCDDD001 AABBBRTL-CCCDDD001 AABBBTDZ-CCCDDD001 USNYCRTL-LANDCE001 AABBB-CCCDDD001 something"
for s in data.split():
    print(s,"-->",environmentcode(s))

Output:
AABBBICS-CCCDDD001 --> 1
AABBBIGW-CCCDDD001 --> 2
AABBBRTL-CCCDDD001 --> 3
AABBBTDZ-CCCDDD001 --> 4
USNYCRTL-LANDCE001 --> 3
AABBB-CCCDDD001 --> 9
something --> None

#---------------------------------------------------------
# Filtering text with regex. In this case, (**) not needed.
text="""AABBBICS-CCCDDD001 Alice was beginning to get very tired of sitting by her sister on the bank... AABBBIGW-CCCDDD001 AABBBRTL-CCCDDD001 AABBBTDZ-CCCDDD001 USNYCRTL-LANDCE001 AABBB-CCCDDD001 AABBBXYZ-CCCDDD001 something"""

import re

data= re.findall(r"\b[A-Z]{5,8}-[A-Z]{6}001\b",text)
for s in data:
    print(s,"-->",environmentcode(s))
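As a possible refinement, the regex can capture the code word itself, so no slicing is needed afterwards. This is only a sketch: the group names, NAME_RE and codes_in_text are made up for illustration, the pattern allows exactly 5 or 8 letters before the dash (the pattern above also lets 6 or 7 through), and the null code is the string '9' rather than the integer 9.

```python
import re

ENVIRONMENTCODE = {'ICS': '1', 'IGW': '2', 'RTL': '3', 'TDZ': '4'}

# Assumption: five letters, an optional three-letter code word, '-', six letters, '001'.
NAME_RE = re.compile(r"\b(?P<name>[A-Z]{5}(?P<code>[A-Z]{3})?-[A-Z]{6}001)\b")

def codes_in_text(text):
    """Return (name, code number) pairs for every name found in text."""
    return [
        # group("code") is None when the code word is missing, so .get returns '9'
        (m.group("name"), ENVIRONMENTCODE.get(m.group("code"), '9'))
        for m in NAME_RE.finditer(text)
    ]

sample = "AABBBICS-CCCDDD001 Alice was tired AABBB-CCCDDD001 AABBBXYZ-CCCDDD001 something"
pairs = codes_in_text(sample)
```

Using finditer instead of findall keeps access to both named groups on each match object.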
