Reduce States To Abbreviations

I'm trying to clean a dataset that has the states either as abbreviations or fully spelled out. I need to make them all into abbreviations. Any cheats to do this? This is what I've

Solution 1:

Here is an approach.

import re

"""Table to Map States to Abbreviations Courtesy https://gist.github.com/Quenty/74156dcc4e21d341ce52da14a701c40c"""
statename_to_abbr = {
    # Other
    'District of Columbia': 'DC',

    # States
    'Alabama': 'AL',
    'Montana': 'MT',
    'Alaska': 'AK',
    'Nebraska': 'NE',
    'Arizona': 'AZ',
    'Nevada': 'NV',
    'Arkansas': 'AR',
    'New Hampshire': 'NH',
    'California': 'CA',
    'New Jersey': 'NJ',
    'Colorado': 'CO',
    'New Mexico': 'NM',
    'Connecticut': 'CT',
    'New York': 'NY',
    'Delaware': 'DE',
    'North Carolina': 'NC',
    'Florida': 'FL',
    'North Dakota': 'ND',
    'Georgia': 'GA',
    'Ohio': 'OH',
    'Hawaii': 'HI',
    'Oklahoma': 'OK',
    'Idaho': 'ID',
    'Oregon': 'OR',
    'Illinois': 'IL',
    'Pennsylvania': 'PA',
    'Indiana': 'IN',
    'Rhode Island': 'RI',
    'Iowa': 'IA',
    'South Carolina': 'SC',
    'Kansas': 'KS',
    'South Dakota': 'SD',
    'Kentucky': 'KY',
    'Tennessee': 'TN',
    'Louisiana': 'LA',
    'Texas': 'TX',
    'Maine': 'ME',
    'Utah': 'UT',
    'Maryland': 'MD',
    'Vermont': 'VT',
    'Massachusetts': 'MA',
    'Virginia': 'VA',
    'Michigan': 'MI',
    'Washington': 'WA',
    'Minnesota': 'MN',
    'West Virginia': 'WV',
    'Mississippi': 'MS',
    'Wisconsin': 'WI',
    'Missouri': 'MO',
    'Wyoming': 'WY',
}


def multiple_replace(lookup, text):
  """Perform substitutions that map strings in the lookup table to values (modification from https://stackoverflow.com/questions/15175142/how-can-i-do-multiple-substitutions-using-regex-in-python)"""
  # The re.IGNORECASE flag provides case insensitivity (i.e. matches New York, new york, NEW YORK, etc.)
  regex = re.compile(r'\b(' + '|'.join(map(re.escape, lookup)) + r')\b', re.IGNORECASE)

  # Lowercase the keys so a match in any capitalization can be looked up; str.title() would turn
  # "district of columbia" into "District Of Columbia", which is not a key in the table
  lowered = {name.lower(): abbr for name, abbr in lookup.items()}

  # For each match, look up the corresponding value in the dictionary and perform the substitution
  return regex.sub(lambda mo: lowered[mo.group(0).lower()], text)

if __name__ == "__main__": 

  text = """United States Census Regions are:
Region 1: Northeast
Division 1: New England (Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island, and Vermont)
Division 2: Mid-Atlantic (New Jersey, New York, and Pennsylvania)
Region 2: Midwest (Prior to June 1984, the Midwest Region was designated as the North Central Region.)[7]
Division 3: East North Central (Illinois, Indiana, Michigan, Ohio, and Wisconsin)
Division 4: West North Central (Iowa, Kansas, Minnesota, Missouri, Nebraska, North Dakota, and South Dakota)"""
  print(multiple_replace(statename_to_abbr, text))

Output Example

United States Census Regions are:
Region 1: Northeast
Division 1: New England (CT, ME, MA, NH, RI, and VT)
Division 2: Mid-Atlantic (NJ, NY, and PA)
Region 2: Midwest (Prior to June 1984, the Midwest Region was designated as the North Central Region.)[7]
Division 3: East North Central (IL, IN, MI, OH, and WI)
Division 4: West North Central (IA, KS, MN, MO, NE, ND, and SD)
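Some state names contain others, so it is worth checking how the pattern treats them. The word boundaries (\b) keep "Kansas" from matching inside "Arkansas", and "West Virginia" is handled because a regex always takes the leftmost match, so the scan reaches "West" before it reaches "Virginia". The following is a minimal sketch with a small, hypothetical lookup table; the names small_lookup and abbreviate are illustrative and not part of the answer above.

```python
import re

# Hypothetical four-entry table, chosen because some names contain others.
small_lookup = {
    'Kansas': 'KS',
    'Arkansas': 'AR',
    'Virginia': 'VA',
    'West Virginia': 'WV',
}

# Same pattern shape as in the answer: \b(name1|name2|...)\b, case-insensitive.
pattern = re.compile(
    r'\b(' + '|'.join(re.escape(name) for name in small_lookup) + r')\b',
    re.IGNORECASE,
)

# Lowercased keys so a match in any capitalization can be looked up.
lowered = {name.lower(): abbr for name, abbr in small_lookup.items()}


def abbreviate(text):
    """Replace every state name found in text with its abbreviation."""
    return pattern.sub(lambda mo: lowered[mo.group(0).lower()], text)


print(abbreviate('Arkansas, kansas, West Virginia and VIRGINIA'))
# prints: AR, KS, WV and VA
```

This approach suits free text. When each cell of a column holds exactly one state, a plain dictionary lookup per cell is usually simpler than a regex.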

Solution 2:

Thanks for the help. I've finally found all the answers for my use case, so here is what I needed in case anyone else needs it.

# Create a new dataframe with two columns and drop rows containing NaN values
by_state = sales[['order state', 'total']].dropna()

# Map the dictionary of abbreviations onto the column; values not in the dictionary are kept as-is
by_state['order state'] = by_state['order state'].map(abbr).fillna(by_state['order state'])

# Map values that were not capitalized correctly
by_state['order state'] = by_state['order state'].apply(lambda x: x.title()).map(abbr).fillna(by_state['order state'])

# Convert all abbreviations to uppercase
by_state['order state'] = by_state['order state'].apply(lambda x: x.upper())

# Remove a period after an abbreviation
by_state['order state'] = by_state['order state'].apply(lambda x: x.split('.')[0])
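The same steps can be collapsed into one function and applied to the column in a single pass. This is a minimal sketch rather than the poster's code: normalize_state is a hypothetical helper, the three-entry abbr dictionary stands in for the full name-to-abbreviation table, and stripping surrounding whitespace is an extra step not present above. It uses only the standard library, so it can be checked without pandas.

```python
# Stand-in for the full table of state names to abbreviations (assumption).
abbr = {'California': 'CA', 'New York': 'NY', 'Texas': 'TX'}


def normalize_state(value, table=abbr):
    """Return the uppercase abbreviation for a state name or abbreviation."""
    # Drop surrounding whitespace and anything from the first period on ("Ca." -> "Ca").
    cleaned = value.strip().split('.')[0]
    # Title-case so "new york" and "NEW YORK" both match the key "New York".
    # Values not in the table are assumed to be abbreviations already.
    return table.get(cleaned.title(), cleaned).upper()


for raw in ['California', 'new york', 'tx', 'Ca.', ' TEXAS ']:
    print(repr(raw), '->', normalize_state(raw))
```

With pandas, this could be applied as by_state['order state'].map(normalize_state), assuming every value in the column is a string.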
