Python Split String By Spaces Except When In Quotes, But Keep The Quotes

I want to split the following string: Quantity [*,'EXTRA 05',*] with the desired result being: ['Quantity', "[*,'EXTRA 05',*]"]. The closest I have found is using shlex.spl

Solution 1:

The basic tool for processing strings is the regular expression module (re).

Given the information you provided (which may be incomplete), the following code does the job:

import re

# Two alternatives: text in front of a bracket, or a bracketed group
r = re.compile(r"(?! )[^[]+?(?= *\[)|\[.+?\]")

s1 = "Quantity [*,'EXTRA 05',*] [*,'EXTRA 09',*]"
print(r.findall(s1))

s2 = ("'zug hug'Quantity boondoggle 'fish face monkey "
      "dung' [*,'EXTRA 05',*] [*,'EXTRA 09',*]")
print(r.findall(s2))


['Quantity', "[*,'EXTRA 05',*]", "[*,'EXTRA 09',*]"]  
["'zug hug'Quantity boondoggle 'fish face monkey dung'", "[*,'EXTRA 05',*]", "[*,'EXTRA 09',*]"]

The regular expression pattern must be understood as follows:

'|' means OR

So the regex pattern is made of two partial REs: (?! )[^[]+?(?= *\[) and \[.+?\]

The first partial RE :

The core is [^[]+. Brackets define a set of characters. Because the symbol ^ comes right after the opening bracket [, the set is negated: it contains every character except those that follow the ^. Here, [^[] means any character that is not an opening bracket [, and since a + follows the set, [^[]+ means a sequence of one or more characters, none of which is an opening bracket.
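As a quick illustration of the negated set (a minimal sketch; the sample string is made up for the demonstration), [^[]+ on its own picks out each run of characters that contains no opening bracket:

```python
import re

sample = "abc[def[ghi"  # hypothetical sample text, not from the question

# Each match is a maximal run of characters other than '['
runs = re.findall(r"[^[]+", sample)
print(runs)  # ['abc', 'def', 'ghi']
```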

Now, there is a question mark after [^[]+ : it makes the + non-greedy (lazy), so the matched sequence is as short as possible and stops as soon as what follows the question mark can match. Here, what follows the ? is (?= *\[), which is a lookahead assertion, composed of (?=....), which signals a positive lookahead assertion, and of " *\[", the text that must come right after the matched sequence without being part of it. " *\[" means: zero or more blanks followed by an opening bracket (the backslash \ is needed to remove the special meaning of [ as the start of a set of characters).

There's also (?! ) in front of the core. It's a negative lookahead assertion: it is needed so that this partial RE matches only sequences that do NOT begin with a blank, which avoids matching blanks on their own. Remove this (?! ) and you'll see the effect.
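To see that effect concretely, here is a small sketch that runs the pattern with and without the (?! ) guard on the first test string (the variable names with_guard and without_guard are only for this illustration):

```python
import re

s1 = "Quantity [*,'EXTRA 05',*] [*,'EXTRA 09',*]"

# Original pattern: the first alternative may not start on a blank
with_guard = re.findall(r"(?! )[^[]+?(?= *\[)|\[.+?\]", s1)

# Same pattern without (?! ): the blank before each '[' is matched on its own
without_guard = re.findall(r"[^[]+?(?= *\[)|\[.+?\]", s1)

print(with_guard)     # ['Quantity', "[*,'EXTRA 05',*]", "[*,'EXTRA 09',*]"]
print(without_guard)  # ['Quantity', ' ', "[*,'EXTRA 05',*]", ' ', "[*,'EXTRA 09',*]"]
```

Without the guard, each blank that sits directly in front of a bracket satisfies [^[]+? followed by the lookahead, so it shows up as a separate one-character match.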

The second partial RE :

\[.+?\] means: the opening bracket character [ , then a sequence of characters matched by .+? (the dot matches any character except \n); this sequence stops as early as possible in front of the closing bracket character ] , which is the last character matched.
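The role of the ? in .+? is easiest to see by comparing it with the greedy form. A minimal sketch (the names lazy and greedy are just labels for this comparison):

```python
import re

s1 = "Quantity [*,'EXTRA 05',*] [*,'EXTRA 09',*]"

# Lazy: stop at the first closing bracket after each opening bracket
lazy = re.findall(r"\[.+?\]", s1)

# Greedy: run on to the last closing bracket in the string
greedy = re.findall(r"\[.+\]", s1)

print(lazy)    # ["[*,'EXTRA 05',*]", "[*,'EXTRA 09',*]"]
print(greedy)  # ["[*,'EXTRA 05',*] [*,'EXTRA 09',*]"]
```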



import re

string = "Quantity [*,'EXTRA 05',*] [*,'EXTRA 09',*]"
# Split on a space only when it is immediately followed by '['
print(re.split(r' (?=\[)', string))


['Quantity', "[*,'EXTRA 05',*]", "[*,'EXTRA 09',*]"]
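This works because the lookahead (?=\[) only checks that a bracket follows the space; it does not consume the bracket, so the bracket stays in the result. As a hedged variation, the sketch below wraps the idea in a helper (split_before_brackets is a hypothetical name) and uses \s+ so that several blanks or a tab in front of a bracket are also treated as one separator. It assumes, like the original, that no '[' preceded by whitespace appears inside a group.

```python
import re

def split_before_brackets(text):
    """Split on whitespace runs that are immediately followed by '['."""
    # The lookahead checks for '[' without consuming it
    return re.split(r"\s+(?=\[)", text)

print(split_before_brackets("Quantity [*,'EXTRA 05',*]"))
# ['Quantity', "[*,'EXTRA 05',*]"]
print(split_before_brackets("Quantity   [*,'EXTRA 05',*]\t[*,'EXTRA 09',*]"))
# ['Quantity', "[*,'EXTRA 05',*]", "[*,'EXTRA 09',*]"]
```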


Solution 2:

A warning for picky people: this algorithm WON'T correctly split every string you pass to it, only strings like:

"Quantity [*,'EXTRA 05',*] [*,'EXTRA 09',*]"

"Quantity [*,'EXTRA 05',*]"

"Quantity [*,'EXTRA 05',*] [*,'EXTRA 10',*] [*,'EXTRA 07',*] [*,'EXTRA 09',*]"

string = "Quantity [*,'EXTRA 05',*] [*,'EXTRA 09',*]"
splitted_string = []

# This adds "Quantity" at position 0 of splitted_string
splitted_string.append(string.split(" ")[0])

# The for loop goes from 1 to the length of string.split(" "), increasing x by 2
# On the first iteration x is 1 and x+1 is 2, on the second x=3 and x+1=4, etc.
# The first iteration concatenates "[*,'EXTRA" and "05',*]" into one string
# The second iteration concatenates "[*,'EXTRA" and "09',*]" into one string
# If the string were longer, it would still work
for x in range(1, len(string.split(" ")), 2):
    splitted_string.append("%s %s" % (string.split(" ")[x], string.split(" ")[x+1]))

When I execute the code, splitted_string contains the following at the end:

['Quantity', "[*,'EXTRA 05',*]", "[*,'EXTRA 09',*]"]
splitted_string[0] = 'Quantity'
splitted_string[1] = "[*,'EXTRA 05',*]"
splitted_string[2] = "[*,'EXTRA 09',*]"

I think that is exactly what you're looking for. If I'm wrong let me know, or if you need some explanation of the code. I hope it helps
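For reuse, the same pairing logic could be wrapped in a function. This is only a sketch: split_pairs is a hypothetical name, and it keeps the assumption stated above that every bracketed group contains exactly one space. Other inputs may be paired wrongly or raise an IndexError.

```python
def split_pairs(text):
    """Join the space-separated pieces two by two after the first word."""
    parts = text.split(" ")   # split once instead of on every access
    result = [parts[0]]       # the leading word, e.g. "Quantity"
    for x in range(1, len(parts), 2):
        # Glue the two halves of a bracketed group back together
        result.append("%s %s" % (parts[x], parts[x + 1]))
    return result

print(split_pairs("Quantity [*,'EXTRA 05',*]"))
# ['Quantity', "[*,'EXTRA 05',*]"]
print(split_pairs("Quantity [*,'EXTRA 05',*] [*,'EXTRA 09',*]"))
# ['Quantity', "[*,'EXTRA 05',*]", "[*,'EXTRA 09',*]"]
```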

Solution 3:

Assuming you want a general solution for splitting at spaces but not at spaces inside quotations: I don't know of any Python library that does this, but that doesn't mean there isn't one.

In the absence of a known pre-rolled solution I would simply roll my own. It's relatively easy to scan a string looking for spaces and then use Python slicing to divide the string into the parts you want. To ignore spaces inside quotes, you can keep a flag that toggles each time a quote symbol is encountered, switching the space detection off and on.

This is some code I knocked up to do this. It is not extensively tested:

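def spaceSplit(string):
  last = 0          # index where the current piece starts
  splits = []
  inQuote = None    # the quote character we are inside, or None
  for i, letter in enumerate(string):
    if inQuote:
      if letter == inQuote:
        inQuote = None
    else:
      if letter == '"' or letter == "'":
        inQuote = letter

    # A space outside quotes ends the current piece
    if not inQuote and letter == ' ':
      splits.append(string[last:i])
      last = i+1

  # Add whatever remains after the last space
  if last < len(string):
    splits.append(string[last:])

  return splits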
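Applied to the string from the question, the flag-based approach described above keeps the quoted space intact. The sketch below is a self-contained variant written for illustration (split_outside_quotes is a hypothetical name, not the function above); unlike a plain slice-based scan, it also skips the empty pieces that consecutive spaces would produce.

```python
def split_outside_quotes(text):
    """Split on spaces that are not inside single or double quotes."""
    pieces = []
    current = []
    in_quote = None  # the active quote character, or None
    for ch in text:
        if in_quote:
            if ch == in_quote:
                in_quote = None      # closing quote found
        elif ch in ('"', "'"):
            in_quote = ch            # opening quote found
        if ch == ' ' and not in_quote:
            if current:              # ignore runs of spaces
                pieces.append(''.join(current))
                current = []
        else:
            current.append(ch)       # quotes are kept in the piece
    if current:
        pieces.append(''.join(current))
    return pieces

print(split_outside_quotes("Quantity [*,'EXTRA 05',*]"))
# ['Quantity', "[*,'EXTRA 05',*]"]
print(split_outside_quotes('say  "hello world" now'))
# ['say', '"hello world"', 'now']
```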

Solution 4:

Try this

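def parseString(inputString):
    output = inputString.split()
    res = []
    count = 0   # quote boundaries seen so far; odd means we are inside quotes
    temp = []   # words of the quoted phrase being collected
    for word in output:
        if (word.startswith('"')) and count % 2 == 0:
            # opening word of a quoted phrase
            temp.append(word)
            count += 1
        elif count % 2 == 1 and not word.endswith('"'):
            # word in the middle of a quoted phrase
            temp.append(word)
        elif word.endswith('"'):
            # closing word: join the collected words back together
            temp.append(word)
            count += 1
            tempWord = ' '.join(temp)
            res.append(tempWord)
            temp = []
        else:
            res.append(word)
    return res

print(parseString('This is "a test" to your split "string with quotes"'))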

Output: ['This', 'is', '"a test"', 'to', 'your', 'split', '"string with quotes"']

