
Why Does String Concatenation Matter In Python?

I've seen a good number of posts (examples here and here) speaking about concatenation in Python, how best to do it ('+' vs ','), which is faster, etc. But I can't seem to find out

Solution 1:

Because, in general, concatenating n strings with + results in n-1 allocations, each of O(n) memory for the result. Concatenation via adjacency (writing string literals next to each other) is done in the parser and performs a single allocation. Concatenation via, for instance, ''.join(iter(s)) will perform O(log(n)) allocations/copies totalling about 2n memory.

> a = ['a'] * 100000
> def concat(strings):
      c = ''
      for s in strings:
          c += s
      return c
> %timeit ''.join(a)       # precalculates necessary buffer size
1000 loops, best of 3: 1.07 ms per loop
> %timeit ''.join(iter(a)) # allocates exponentially larger buffers
1000 loops, best of 3: 1.94 ms per loop
> %timeit concat(a)        # allocates a new buffer n-1 times
100 loops, best of 3: 7.15 ms per loop
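
For anyone running this outside IPython, the comparison above can be approximated with the standard timeit module. This is only a minimal sketch: the list a and the helper concat come from the answer, the repeat count of 20 is an arbitrary assumption, and absolute timings will differ by machine and Python version.

```python
import timeit

a = ['a'] * 100000

def concat(strings):
    # Repeated +=: each step may allocate a new, longer string.
    c = ''
    for s in strings:
        c += s
    return c

# All three approaches must build the same string.
expected = 'a' * len(a)
assert ''.join(a) == expected
assert ''.join(iter(a)) == expected
assert concat(a) == expected

# number=20 is an arbitrary choice to keep the run short.
runs = 20
for label, func in [
    ("''.join(a)", lambda: ''.join(a)),
    ("''.join(iter(a))", lambda: ''.join(iter(a))),
    ("concat(a)", lambda: concat(a)),
]:
    seconds = timeit.timeit(func, number=runs)
    print(f"{label:20s} {seconds / runs * 1000:.3f} ms per call")
```

The equality checks matter more than the numbers: they confirm that the three variants are interchangeable in result, so only their cost differs.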

Solution 2:

Strings are immutable objects in Python, so you cannot modify existing strings. That means that every concatenation results in a new string object being created, while the two source objects are thrown away if nothing else references them. Memory allocation and copying are expensive enough to make this matter a lot.

So when you know you need to concatenate multiple strings, store them in a list. And then at the end, just once, join that list using ''.join(list_of_strings). That way, a new string will only be created once.
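
The collect-then-join pattern described here might look like the following. It is a small sketch, not anything from the original post: the function name build_report and the sample rows are hypothetical and exist only for illustration.

```python
def build_report(rows):
    # Collect the pieces in a list; append stores a reference
    # and does not copy any string data.
    parts = []
    for name, value in rows:
        parts.append(name)
        parts.append('=')
        parts.append(str(value))
        parts.append('\n')
    # A single join at the end builds the final string.
    return ''.join(parts)

report = build_report([('a', 1), ('b', 2)])
print(report)
```

The list holds only references while the loop runs, so the character data is copied once, at the join.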

Note that this also applies in other languages. For example, Java and C# both have a StringBuilder type which does essentially the same thing. As long as you keep appending new string parts, the builder just adds them to an internal buffer, and only when you convert the builder into a real string does the final string get created, and again just once.

Also note that this memory allocation overhead already occurs when you just append a few strings in a single line. For example, a + b + c + d will create three new strings: two intermediary ones and the final result. You can see that if you look at the bytecode for that expression:

>>> import dis
>>> dis.dis('a + b + c + d')
  1           0 LOAD_NAME                0 (a)
              3 LOAD_NAME                1 (b) 
              6 BINARY_ADD           
              7 LOAD_NAME                2 (c) 
             10 BINARY_ADD           
             11 LOAD_NAME                3 (d) 
             14 BINARY_ADD           
             15 RETURN_VALUE         

Each BINARY_ADD concatenates the two values on top of the stack and pushes a new object for the result. Note that for constant string literals, the compiler is smart enough to notice that you are adding constants:

>>> dis.dis('"foo" + "bar" + "baz"')
  1           0 LOAD_CONST               4 ('foobarbaz')
              3 RETURN_VALUE         

If you do have some variable parts within that, though (for example, if you want to produce nicely formatted output), then you are back to creating intermediary string objects. In that case, using str.format is a good idea, e.g. 'foo {} bar {} baz'.format(a, b).
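
To make the str.format suggestion concrete, here is a short comparison. The values bound to a and b are made up for the example; only the template string comes from the answer.

```python
a = 'left'
b = 42

# Chained +: builds intermediate strings along the way,
# and non-string values need an explicit str() call.
with_plus = 'foo ' + a + ' bar ' + str(b) + ' baz'

# str.format: fills the template in a single call and
# converts the arguments itself.
with_format = 'foo {} bar {} baz'.format(a, b)

assert with_plus == with_format
print(with_format)  # foo left bar 42 baz
```

Besides avoiding the intermediates, the template keeps the shape of the output visible in one place.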
