# Arrays and Lists

I often get questions about the difference between the Python `list` and the NumPy `ndarray`, and we talk about this a lot in *Python Foundations for Scientists & Engineers*. But recently I was asked about the `array` object that ships with Python itself: why don’t we use *it* for computations? Why is NumPy necessary if Python has an `array` data type? Well, this is a *great* question!

First of all, let’s take a quick look at the Python `array`. It’s found in the `array` module in the standard library and usually imported as `arr`:

```
>>> import array as arr
```

Like a NumPy `ndarray`, it has a data type, and every element of the array has to conform to that type. Set the data type in the first argument when creating the array. `'l'` in this case specifies a signed long integer (at least 4 bytes):

```
>>> a = arr.array('l', range(5))
>>> a
array('l', [0, 1, 2, 3, 4])
```
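
Since every element must conform to the typecode, incompatible assignments fail loudly rather than coercing; a quick illustration:

```
import array as arr

a = arr.array('l', range(5))

# The typecode is enforced on assignment: storing a float into an
# integer array raises TypeError instead of silently converting.
try:
    a[0] = 1.5
except TypeError as exc:
    print(f'rejected: {exc}')

# Compatible integer values are fine:
a[0] = 42
print(a)  # array('l', [42, 1, 2, 3, 4])
```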

It has the same indexing and slicing behavior as a Python `list`, and we even see that it has the same “multiplication” behavior as lists: *i.e.* “multiplying” means *repetition*, not element-wise arithmetic multiplication as with the NumPy `ndarray`.
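
A quick demonstration of those behaviors, contrasting repetition with NumPy’s element-wise multiplication:

```
import array as arr
import numpy as np

a = arr.array('l', range(5))

# Indexing and slicing work just like a list:
print(a[0], a[-1])   # 0 4
print(a[1:3])        # array('l', [1, 2])

# "Multiplication" repeats, exactly as with a list...
print(a * 2)         # array('l', [0, 1, 2, 3, 4, 0, 1, 2, 3, 4])

# ...whereas a NumPy ndarray multiplies element-wise:
print(np.arange(5) * 2)  # [0 2 4 6 8]
```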

With that in mind, computations look the same for `array`s as for `list`s:

```
>>> b = arr.array(a.typecode, (v + 1 for v in a))
>>> b
array('l', [1, 2, 3, 4, 5])
```

Let’s use IPython’s `%timeit` magic to see how `array` computations fare in comparison with Python `list`s and NumPy `ndarray`s. For the `list` and `array`, we’ve used list comprehensions as the most efficient way to do the computations, and a simple vectorized computation for the NumPy `ndarray`; our test is to increment a sequence of 10 integers:

```
In [1]: import array as arr
...: import numpy as np
...:
...: %timeit [val + 1 for val in list(range(10))]
...: %timeit arr.array('q', [val + 1 for val in arr.array('q', range(10))])
...: %timeit np.arange(10) + 1
279 ns ± 45.8 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
825 ns ± 5.22 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
1.03 μs ± 98.7 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
```

It’s surprising to note that the `list` comprehension is fastest, beating the `array` computation by a factor of 3 or so, and beating the `ndarray` computation by a factor of 3.7! Suspecting that a 10-integer sequence may not be representative, let’s bump it up to 1000:

```
In [2]: %timeit [val + 1 for val in list(range(1000))]
...: %timeit arr.array('q', [val + 1 for val in arr.array('q', range(1000))])
...: %timeit np.arange(1000) + 1
28.4 μs ± 3.47 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
82.4 μs ± 9.05 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
1.82 μs ± 175 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
```

With more numbers to chew on, NumPy clearly pulls ahead, suggesting that the overhead of creating an `ndarray` dominates for small arrays, while the actual computation time is relatively small. The Python `array` is still relatively inefficient compared to the Python `list`. Let’s make some helper functions to compare how long it takes to increment each value in `list`s, `array`s, and NumPy `ndarray`s across a broad range of sizes. Each function will accept a `list`, `array`, or `ndarray` and return the time (in ms) required to increment each value by 1.

```
In [3]: from datetime import datetime
   ...:
   ...: def inc_list_by_one(l):
   ...:     start = datetime.now()
   ...:     res = [val + 1 for val in l]
   ...:     return (datetime.now() - start).total_seconds() * 1000
   ...:
   ...: def inc_array_by_one(a):
   ...:     start = datetime.now()
   ...:     res = arr.array(a.typecode, [val + 1 for val in a])
   ...:     return (datetime.now() - start).total_seconds() * 1000
   ...:
   ...: def inc_numpy_array_by_one(nda):  # named to avoid shadowing the `arr` module alias
   ...:     start = datetime.now()
   ...:     res = nda + 1
   ...:     return (datetime.now() - start).total_seconds() * 1000
```

Now let’s record the computation times for a range of array sizes and plot them using `matplotlib` and `seaborn`:

```
In [14]: import matplotlib.pyplot as plt
    ...: import seaborn as sns
    ...:
    ...: times = []
    ...: for size in range(1, 10_000_001, 200_000):
    ...:     times.append((
    ...:         size,
    ...:         inc_list_by_one(list(range(size))),
    ...:         inc_array_by_one(arr.array('q', range(size))),
    ...:         inc_numpy_array_by_one(np.arange(size)),
    ...:     ))
    ...: times_arr = np.array(times)
    ...:
    ...: sns.regplot(x=times_arr[:, 0], y=times_arr[:, 1], label='Python list')
    ...: sns.regplot(x=times_arr[:, 0], y=times_arr[:, 2], label='Python array')
    ...: sns.regplot(x=times_arr[:, 0], y=times_arr[:, -1], label='Numpy array')
    ...: plt.xlabel('Array Size (num elements)')
    ...: plt.ylabel('Time to Increment by 1 (ms)')
    ...: plt.legend()
    ...: plt.show()
```

…and now it’s clear that computation times for Python `list`s and `array`s scale with the size of the object much faster than NumPy `ndarray`s do. And it’s also clear that the Python `array` is no improvement for computations, static typing notwithstanding. Let’s quantify:

```
In [26]: print(f'Python list {np.polyfit(times_arr[:, 0], times_arr[:, 1], 1)[0]:.2e} (ms/element)')
...: print(f'Python array {np.polyfit(times_arr[:, 0], times_arr[:, 2], 1)[0]:.2e} (ms/element)')
...: print(f'NumPy ndarray {np.polyfit(times_arr[:, 0], times_arr[:, -1], 1)[0]:.2e} (ms/element)')
Python list 4.02e-05 (ms/element)
Python array 6.50e-05 (ms/element)
NumPy ndarray 2.01e-06 (ms/element)
```

So if we take the slope of those curves as a gauge of performance, Python `array`s are roughly 60% slower in computation than `list`s (6.50e-05 vs. 4.02e-05 ms/element), and the NumPy `ndarray` computations run in about 95% *less* time, on average, for arrays that are not trivially small. This is because NumPy takes advantage of knowing the data type of each element and setting up efficient *strides* in memory for computations, which can be optimized at the operating system and hardware levels.
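
To peek at that machinery, here’s a short sketch: every `ndarray` carries a fixed-size `dtype` and a `strides` tuple (the number of bytes to step per dimension), which is what lets NumPy hand a homogeneous memory buffer to optimized loops:

```
import numpy as np

# Every ndarray records a dtype (fixed element size) and strides
# (bytes to step to the next element along each axis).
a = np.arange(10, dtype=np.int64)
print(a.dtype)     # int64
print(a.itemsize)  # 8 -- bytes per element
print(a.strides)   # (8,) -- step 8 bytes to reach the next element

# A transposed 2-D view changes the strides, not the data:
m = np.arange(6, dtype=np.int64).reshape(2, 3)
print(m.strides)   # (24, 8)
print(m.T.strides) # (8, 24)
```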

What, then, is the advantage of the Python `array` over a Python `list`? With helper functions similar to the ones above, we can record `sys.getsizeof()` for sample `list` and `array` objects over the same range of sizes. Plotting the results, we see something interesting:
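
Those memory helpers aren’t shown here, but a possible sketch (the function names are my own invention) looks like:

```
import array as arr
import sys

import numpy as np

# Hypothetical size helpers, analogous to the timing helpers above;
# each builds an object with n elements and returns its size in bytes.
def list_size(n):
    return sys.getsizeof(list(range(n)))

def array_size(n):
    return sys.getsizeof(arr.array('q', range(n)))

def ndarray_size(n):
    # For an ndarray that owns its data, sys.getsizeof includes the
    # buffer; for a view it would count only the header.
    return sys.getsizeof(np.arange(n))

sizes = [(n, list_size(n), array_size(n), ndarray_size(n))
         for n in range(1, 1_000_001, 100_000)]
```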

Memory consumption for both Python `array`s and NumPy `ndarray`s scales exactly linearly with the number of elements (the `q` typecode specifies a signed integer of at least 8 bytes, which corresponds to the default `int64` dtype for integer NumPy `ndarray`s). But notice how the Python `list` memory consumption increases stepwise. This is an implementation detail of the Python `list` type that allows it to quickly add elements without shifting memory after each `append` operation. And this is likely the explanation for why `list`s can be faster and more efficient than `array`s: when a new value is appended to an `array`, it may often happen that some or all of the object needs to be moved in memory to find contiguous locations, but a `list` is optimized for appending, so a certain number of additions can be made before any relocation happens.
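
We can watch that over-allocation directly with `sys.getsizeof` (exact byte counts vary by CPython version, but the plateau-then-jump pattern is the point):

```
import sys

# Append one element at a time and record the list's reported size:
# it stays flat while spare capacity is consumed, then jumps when
# append triggers a resize.
sizes = []
lst = []
for _ in range(20):
    lst.append(0)
    sizes.append(sys.getsizeof(lst))

print(sizes)  # flat runs punctuated by occasional jumps
```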

To test this, we can compare the computation times for an `array` built with a list comprehension against an `array` that is copied and overwritten in place:

```
from copy import copy

def inc_array_by_one_with_copy(a):
    start = datetime.now()
    res = copy(a)
    for i, val in enumerate(a):
        res[i] = val + 1
    return (datetime.now() - start).total_seconds() * 1000
```

Plotted, we see that the copy-fill approach is marginally faster, and there is less variation, likely due to fewer memory relocations.

In conclusion, it helps to understand that the `array` type is really intended as a Python wrapper around a C array, and some functions and libraries use it to efficiently return a Python object. But as for computations, I hope I’ve convinced you that **the NumPy ndarray is the right type for computations** in terms of both memory and computational efficiency. As for Python `array`s, now you know what they are, and you can safely ignore them in favor of the `ndarray` for your scientific computations.