Reduce States To Abbreviations
I'm trying to clean a dataset that has the states either as abbreviations or fully spelled out. I need to make them all into abbreviations. Any cheats to do this? This is what I've
Solution 1:
Here is an approach.
import re
"""Table to map states to abbreviations, courtesy of https://gist.github.com/Quenty/74156dcc4e21d341ce52da14a701c40c"""
statename_to_abbr = {
# Other
'District of Columbia': 'DC',
# States
'Alabama': 'AL',
'Montana': 'MT',
'Alaska': 'AK',
'Nebraska': 'NE',
'Arizona': 'AZ',
'Nevada': 'NV',
'Arkansas': 'AR',
'New Hampshire': 'NH',
'California': 'CA',
'New Jersey': 'NJ',
'Colorado': 'CO',
'New Mexico': 'NM',
'Connecticut': 'CT',
'New York': 'NY',
'Delaware': 'DE',
'North Carolina': 'NC',
'Florida': 'FL',
'North Dakota': 'ND',
'Georgia': 'GA',
'Ohio': 'OH',
'Hawaii': 'HI',
'Oklahoma': 'OK',
'Idaho': 'ID',
'Oregon': 'OR',
'Illinois': 'IL',
'Pennsylvania': 'PA',
'Indiana': 'IN',
'Rhode Island': 'RI',
'Iowa': 'IA',
'South Carolina': 'SC',
'Kansas': 'KS',
'South Dakota': 'SD',
'Kentucky': 'KY',
'Tennessee': 'TN',
'Louisiana': 'LA',
'Texas': 'TX',
'Maine': 'ME',
'Utah': 'UT',
'Maryland': 'MD',
'Vermont': 'VT',
'Massachusetts': 'MA',
'Virginia': 'VA',
'Michigan': 'MI',
'Washington': 'WA',
'Minnesota': 'MN',
'West Virginia': 'WV',
'Mississippi': 'MS',
'Wisconsin': 'WI',
'Missouri': 'MO',
'Wyoming': 'WY',
}
def multiple_replace(lookup, text):
    """Perform substitutions that map strings in the lookup table to values (modified from https://stackoverflow.com/questions/15175142/how-can-i-do-multiple-substitutions-using-regex-in-python)"""
    # Lowercase the keys so that names such as 'District of Columbia' are found whatever their capitalization
    lowered = {name.lower(): code for name, code in lookup.items()}
    # The re.IGNORECASE flag provides case insensitivity (i.e. matches New York, new york, NEW YORK, etc.)
    regex = re.compile(r'\b(' + '|'.join(map(re.escape, lookup)) + r')\b', re.IGNORECASE)
    # For each match, look up the corresponding value in the dictionary and perform the substitution
    return regex.sub(lambda mo: lowered[mo.group(0).lower()], text)
if __name__ == "__main__":
    text = """United States Census Regions are:
Region 1: Northeast
Division 1: New England (Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island, and Vermont)
Division 2: Mid-Atlantic (New Jersey, New York, and Pennsylvania)
Region 2: Midwest (Prior to June 1984, the Midwest Region was designated as the North Central Region.)[7]
Division 3: East North Central (Illinois, Indiana, Michigan, Ohio, and Wisconsin)
Division 4: West North Central (Iowa, Kansas, Minnesota, Missouri, Nebraska, North Dakota, and South Dakota)"""
    print(multiple_replace(statename_to_abbr, text))
Output Example
United States Census Regions are:
Region 1: Northeast
Division 1: New England (CT, ME, MA, NH, RI, and VT)
Division 2: Mid-Atlantic (NJ, NY, and PA)
Region 2: Midwest (Prior to June 1984, the Midwest Region was designated as the North Central Region.)[7]
Division 3: East North Central (IL, IN, MI, OH, and WI)
Division 4: West North Central (IA, KS, MN, MO, NE, ND, and SD)
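Since the question is about a dataset where each cell holds a single state, a per-value lookup may be simpler than regex substitution over free text. Here is a minimal sketch of that idea. The helper name `to_abbr` and the shortened table are hypothetical, used only for illustration; in practice the full `statename_to_abbr` dictionary would take the table's place.

```python
# Shortened table for illustration only; the full statename_to_abbr
# dictionary would be used in practice.
STATE_TO_ABBR = {
    "New York": "NY",
    "Texas": "TX",
    "District of Columbia": "DC",
}

# Lowercased keys so that lookups ignore capitalization.
_LOOKUP = {name.lower(): code for name, code in STATE_TO_ABBR.items()}
_CODES = set(STATE_TO_ABBR.values())


def to_abbr(value):
    """Return the two-letter code for a state name or abbreviation.

    Hypothetical helper: values that are not recognized are returned
    with surrounding whitespace and trailing periods removed, but are
    otherwise left unchanged.
    """
    cleaned = value.strip().rstrip(".")
    upper = cleaned.upper()
    if upper in _CODES:  # already an abbreviation, e.g. "tx" or "TX."
        return upper
    return _LOOKUP.get(cleaned.lower(), cleaned)
```

Leaving unrecognized values untouched is a design choice: it makes it easy to spot the entries that still need manual cleanup afterwards.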
Solution 2:
Thanks for the help. I've finally found all the answers for my use case, so here is what I needed in case anyone else needs it.
# abbr is assumed to be a dict mapping full state names to abbreviations
# Create a new dataframe with two columns and remove all the NaN values
# (.copy() avoids pandas' SettingWithCopyWarning on the assignments below)
by_state = sales[['order state', 'total']].dropna().copy()
# Map the dictionary of abbreviations onto the column, keeping values that have no match
by_state['order state'] = by_state['order state'].map(abbr).fillna(by_state['order state'])
# Map values that were not capitalized correctly
by_state['order state'] = by_state['order state'].str.title().map(abbr).fillna(by_state['order state'])
# Convert all abbreviations to uppercase
by_state['order state'] = by_state['order state'].str.upper()
# Remove a period after an abbreviation
by_state['order state'] = by_state['order state'].str.split('.').str[0]
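For anyone who wants to try these steps end to end, here is a condensed, self-contained sketch. The sample `sales` data and the shortened `abbr` dictionary are made up for illustration, and matching on lowercased names is a swapped-in variation on the title-casing above (it also covers names like "District of Columbia", which `title()` would turn into "District Of Columbia").

```python
import pandas as pd

# Hypothetical sample data and a shortened mapping, for illustration only.
abbr = {"New York": "NY", "Texas": "TX", "Ohio": "OH"}
sales = pd.DataFrame({
    "order state": ["New York", "texas", "oh.", "TX", None],
    "total": [10.0, 20.0, 5.0, 7.5, 3.0],
})

# Keep the two columns of interest and drop rows with missing values.
by_state = sales[["order state", "total"]].dropna().copy()

# Trim whitespace and trailing periods before looking anything up.
cleaned = by_state["order state"].str.strip().str.rstrip(".")

# Lowercased keys make the lookup independent of capitalization.
lookup = {name.lower(): code for name, code in abbr.items()}

# Values with no match come back as NaN from map(); fall back to the
# uppercased original, which covers entries that were already abbreviations.
by_state["order state"] = cleaned.str.lower().map(lookup).fillna(cleaned.str.upper())
```

Cleaning the periods first means each value only has to be mapped once, instead of once per capitalization variant.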