Deep Python #3: Generating yields
Generators, iterators, coroutines, single-threaded concurrency, and how they help you write more readable and concise code.
Python generators and iterators are one implementation of the idea of coroutines. Coroutines are “special” functions whose execution can be suspended and resumed at a later point. They introduce the idea of concurrency without parallelism: several functions cooperate in solving a given task (this is why they are called coroutines) while keeping their execution conceptually independent, and they may or may not run in separate execution threads at the same time. The famous programmer Rob Pike explains that the Go language (which he helped create) is designed for concurrency, and that concurrency is a tool for designing code.
In this article, we start with a refresher on generators and iterators and then move on to some interesting use cases.
Introduction
Generators, the most common kind of iterator in Python, are defined by using the yield keyword inside a function. When the execution reaches yield, the corresponding value is returned to the caller and the function is suspended.
When a generator function is called, its body is not actually executed; instead, the call returns an iterator object. Values can be extracted from the iterator by using the next function.
def iter_example():
    print("First step")
    yield 1
    print("Second step")
    yield 2
    print("Third step")
    yield 3

it = iter_example()
print(next(it))
print(next(it))
print(next(it))
print(next(it))
When we execute the code above we get
First step
1
Second step
2
Third step
3
Traceback (most recent call last):
  File "/path/to/file.py", line 10, in <module>
    print(next(it))
StopIteration
From the output we can observe some important points:
- the execution is actually interleaved between the two functions, and the caller can also do other operations between two successive next calls, so as to use the values only when they are needed;
- an iterator behaves differently from a normal function, with two separate moments: one to initialize it and one to retrieve the values;
- calling next on a terminated iterator raises a StopIteration exception, which is handled implicitly in a for loop.
for x in iter_example():
    print(x)
In this case next is called implicitly and the loop stops automatically when StopIteration is raised.
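To make this explicit, here is a rough sketch of what the for loop does behind the scenes (a simplified equivalent, not the exact interpreter logic):

it = iter(iter_example())   # for first calls iter() on the iterable
while True:
    try:
        x = next(it)        # each iteration asks the iterator for the next value
    except StopIteration:   # raised when the iterator is exhausted
        break
    print(x)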
Concurrency as Design
Let us see how iterators can help us design code that is conceptually simple and modular. We can start with a common problem to illustrate the idea: retrieving data for a machine learning training loop.
We will not go into the full code here because it would be too long and out of scope for this article, but the high-level idea should suffice. We have our data stored in a file on disk, and we want to extract a minibatch to feed to our model at each training step. Without coroutines the process would look something like
def train(...):
    shuffled_path = shuffle(train_data)  # shuffle the training data and save them on a new file
    with open(shuffled_path) as fd:
        num_samples = 0
        batch = []
        for line in fd:
            batch.append(line)
            num_samples += 1
            if num_samples == batch_size:
                train_step(model, batch)
                etc...
Setting aside the fact that the file object itself is already an iterator over the lines of our file, what we can notice is that the batching logic is interleaved with the training logic, and this makes changing the code more complicated because different concerns are not localized. The way the problem can be solved with iterators is to separate batch creation and training steps:
def batches(data_path, batch_size):
    with open(data_path) as fd:
        num_samples = 0
        batch = []
        for line in fd:
            batch.append(line)
            num_samples += 1
            if num_samples == batch_size:
                yield batch
                batch = []
                num_samples = 0

def train(...):
    shuffled_path = shuffle(train_data)  # shuffle the training data and save them on a new file
    data_batches = batches(shuffled_path, batch_size)
    for batch in data_batches:
        train_step(model, batch)
and here we are not just moving complexity from one function to another. What we are doing with iterators could not be achieved with normal functions. The batches are yielded to the calling function as soon as they are generated, while a normal function would first create a list with all the batches and then iterate over that list, which may not even be feasible if the data do not fit in memory. This code may look familiar to you, and that is exactly because iterators are the best way to implement it (on a single thread).
One important thing to notice is that we use with open and the context manager stays active between successive yields. This is a useful pattern to remember for the cases when we want to clearly separate logic and resource management.
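To see when the resource is actually acquired and released, here is a minimal toy sketch (the resource and produce names are made up for illustration):

from contextlib import contextmanager

@contextmanager
def resource():
    print("acquire")      # runs when the with block is entered
    try:
        yield "handle"
    finally:
        print("release")  # runs when the generator finishes or is closed

def produce():
    with resource() as handle:
        for i in range(3):
            yield f"{handle}-{i}"

gen = produce()
print(next(gen))  # "acquire" is printed here, on the first next()
print(next(gen))  # the resource stays open between yields
gen.close()       # closing the generator triggers "release"

The file opened in batches behaves the same way: it is opened on the first request for a batch and closed when the generator is exhausted or closed.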
Laziness
Laziness is strongly condemned by most religions and societies, but it turns out to be a useful feature in programming languages. The Haskell programming language is famous for being lazy by default!
For those readers who do not know the computing meaning of laziness: it means that operations are performed only when their results are needed, not earlier. This can be done in several ways in Python, but iterators are arguably the most famous.
We have already seen it in the previous example: a new batch is computed only when it is needed, not earlier. This can help us save resources by occupying less memory and performing less computation. Indeed, if we need a data transformation for only a part of a sequence, we do not need to apply it to the whole sequence.
Let us see some code for comparison.
l1 = [x * 2 for x in range(100)]
l2 = (x * 2 for x in range(100))
l3 = map(lambda x: x * 2, range(100))

for i, x in enumerate(l1):
    if i == 5:
        break
    print(x)

from itertools import takewhile

for _, x in takewhile(lambda i: i[0] < 5, enumerate(l2)):
    print(x)

from more_itertools import take

for x in take(5, l3):
    print(x)
l1, l2, and l3 are quite similar, but l1 is a list while l2 and l3 are iterators. Then we take from each of our iterables only the first 5 items. For l1 we have computed all 100 elements anyway with the list comprehension, while both the generator expression (l2) and map (l3) will compute only the first 5 elements, since they are the only ones requested. l2 and l3 are essentially the same, with generator expressions being a quite Pythonic feature while map is a staple of functional languages.
We also see some differences in how we can iterate over our iterables. The “imperative” way is the first example: keep an index for the iteration and break the loop when the index reaches the maximum value. Then we have takewhile from itertools. itertools is a standard-library module with many functions for working with iterators (and iterables). A deep look into it is outside the scope of this article, but I recommend giving it a look because there is useful stuff in there. In this case we are using takewhile, which keeps taking values from an iterable as long as the predicate passed as its first argument evaluates to true, and stops as soon as it becomes false. For simply counting values it is not really well suited, and we need to use enumerate and then select the first or the second value from each tuple. The library more-itertools is not part of the standard library and needs to be installed, but it offers additional tools, like the simple take, which we could write ourselves, though it is nice to have a reference implementation. Instead of giving it a predicate we just pass the number of items to take from the iterable and we get them. The results of the 3 loops are exactly the same, but l1 requires storing the full list in memory, while l2 and l3 can be seen as pure computation that produces a single value as it is needed.
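Since take is simple enough to write ourselves, here is a minimal sketch of such a helper (a simplified version, not necessarily the actual more-itertools implementation):

from itertools import islice

def my_take(n, iterable):
    # Return the first n items of the iterable as a list,
    # consuming only what is needed from the underlying iterator.
    return list(islice(iterable, n))

print(my_take(5, (x * 2 for x in range(100))))  # [0, 2, 4, 6, 8]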
Another difference between lists and iterators is what happens when we iterate over them more than once. If we run:
from more_itertools import take

l1 = [x * 2 for x in range(100)]
l2 = (x * 2 for x in range(100))

for x in take(5, l1):
    print(x)
for x in take(5, l1):
    print(x)

for x in take(5, l2):
    print(x)
for x in take(5, l2):
    print(x)
The first two loops will start over each time because l1 does not change, and will print
0
2
4
6
8
0
2
4
6
8
while the iterator can only keep producing values and will resume from where it left off:
0
2
4
6
8
10
12
14
16
18
This is something to keep in mind when deciding for or against laziness. If we need the computed values more than once, then we use an in-memory data structure such as a list, but if we need them only once we can use iterators.
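As a quick sketch of this trade-off, when the values are needed more than once we can materialize the iterator up front, or split it with itertools.tee (list and tee are standard Python; the rest is illustrative):

from itertools import tee

gen = (x * 2 for x in range(100))
values = list(gen)             # materialize once, then iterate as many times as needed
print(values[:5], values[:5])

# tee splits one iterator into independent ones, buffering only
# the values that the slowest consumer has not seen yet.
a, b = tee(x * 2 for x in range(100))
print(list(zip(a, b))[:3])     # [(0, 0), (2, 2), (4, 4)]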
Coroutines
The last point I want to mention here is that coroutines are not unidirectional. So far we have only seen cases with a coroutine acting as an iterator that generates data, and another one that requests and uses it. But a fully-fledged generator can also receive data from the outside and perform computation based on it. The full data type of a generator is Generator[YieldType, SendType, ReturnType], where YieldType is the type of the data that the generator yields, SendType is the type of the data that the generator receives, and ReturnType is the type of the return value. When SendType and ReturnType are both None, this is equivalent to Iterator[YieldType].
A generator can then be of the following form:
def generator():
    item = 1
    while item:
        item = yield item * 2
    return "finished"
After the generator yields, it expects a value from the external function, which will be used when the generator is resumed. This value can be passed using its send method. For instance:
gen = generator()
x = next(gen)
y = gen.send(3)
z = gen.send(4)
try:
    next(gen)
except StopIteration as e:
    no = e.value
print(x, y, z, no)
which outputs
2 6 8 finished
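Tying this back to the Generator[YieldType, SendType, ReturnType] annotation, the same generator could be typed roughly like this (a sketch; the annotation does not change the runtime behaviour):

from typing import Generator

def generator() -> Generator[int, int, str]:
    # yields ints, accepts ints via send(), and returns a str
    item = 1
    while item:
        item = yield item * 2
    return "finished"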
Some things to notice about generators:
- The first iteration must always occur using next, or the equivalent gen.send(None). Since it is not possible to send a value before the first yield statement, the first iteration is meant to reach it.
- After that, the value passed with send is assigned to the variable on the left of yield, and the execution resumes from there until the next yield, where it stops again.
- When the generator has no more values to yield and hits the return statement, it raises StopIteration.
In this example, StopIteration must be handled explicitly, but it is handled implicitly when we use generators in for loops as in the previous examples. StopIteration.value contains the returned value, and from the awkward syntax needed to extract it you can easily tell that it is not supposed to be used in normal code.
Working with send is also awkward, and it is best used under the hood rather than as the public API of your code.
Despite the disadvantages, it is useful to know the full power of generators because they can be used as small local “servers” that allow a better separation of concerns between functions. It is not among the most used patterns, but still something to be aware of.
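As a toy example of such a small local “server”, here is a minimal sketch of a running-average coroutine (the names are made up for this illustration):

def running_average():
    # Receives numbers via send() and yields the current average.
    total = 0.0
    count = 0
    average = None
    while True:
        value = yield average
        total += value
        count += 1
        average = total / count

avg = running_average()
next(avg)            # prime the coroutine up to the first yield
print(avg.send(10))  # 10.0
print(avg.send(20))  # 15.0
print(avg.send(6))   # 12.0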
Yield from
Generators can also cooperate by chaining yield statements across nested generators. If, in the mini-batch example, we want an outer loop that shuffles the dataset several times and produces mini batches from each shuffle, we can separate the logic into two generators:
def batches(shuffled_path, batch_size):
    with open(shuffled_path) as fd:
        num_samples = 0
        batch = []
        for line in fd:
            batch.append(line)
            num_samples += 1
            if num_samples == batch_size:
                yield batch
                batch = []
                num_samples = 0

def data_iter(data_path, n_epochs, batch_size):
    shuffled_path = ...  # copy and shuffle the dataset on disk
    for i in range(n_epochs):
        mini_batches = batches(shuffled_path, batch_size)
        for batch in mini_batches:
            yield batch
This function separates the iteration over the number of epochs from the iteration over the batches. Nothing new here, except that the chain of yields works exactly as we would expect, and the value yielded by the inner function (batches) is also yielded by the outer function (data_iter). This pattern is so common that there is a dedicated keyword for it, yield from:
def data_iter(data_path, n_epochs, batch_size):
    shuffled_path = ...  # copy and shuffle the dataset on disk
    for i in range(n_epochs):
        yield from batches(shuffled_path, batch_size)
which is equivalent to the previous code. yield from works with iterators but also with all other kinds of iterables, so you can yield from lists, sets, dictionaries and tuples. Also, we are not limited to a single yield from in a generator:
def multiplier():
    for i in range(1_000_000):
        yield from ((x + 1) * i for x in range(10))
and this will yield all the sequences one after another. Because of the lazy evaluation, only the values that are actually used are computed, so we do not really care about the potentially huge number of values: no full sequence is ever created in memory. And of course we can combine send and yield from to create powerful interactive generators.
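As a small sketch of that combination, send passes through yield from to the inner generator (the names here are purely illustrative):

def echo_doubler():
    # Inner coroutine: receives a number and yields its double.
    value = 0
    while True:
        value = yield value * 2

def wrapper():
    # yield from forwards both the yielded values and the values passed with send().
    yield from echo_doubler()

gen = wrapper()
next(gen)            # prime: runs up to the first yield inside echo_doubler
print(gen.send(5))   # 10: the value is delivered to the inner generator
print(gen.send(21))  # 42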
Another advantage of yield from is that exceptions are propagated from the inside to the outside, so that we can have centralized exception handling outside all the generators.
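A minimal sketch of this behaviour, with illustrative names: the inner generator fails, and the error is handled once at the top level.

def inner():
    yield 1
    raise ValueError("something went wrong inside")

def outer():
    yield from inner()  # the exception propagates through yield from

try:
    for x in outer():
        print(x)
except ValueError as e:
    print("handled at the top level:", e)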
Finally, yield from can also capture the return value of another generator:
def coro1(n: int):
    yield from range(n)
    if n < 1:
        return "It was empty"
    else:
        return "ok"

def coro2(n: int):
    g = coro1(n)
    out = yield from g
    if out == "It was empty":
        yield from range(1000)
The line out = yield from g still yields the values from g to the calling function, but the return value of g is captured in out, and we can do something with it.
Conclusions
Generators and iterators are among the most useful and widely used features of Python. They help write clear and simple code that separates loosely related parts of the overall logic while still allowing everything to be interleaved together. In many cases they are simply more convenient than normal functions, besides the potential savings in memory and computation.
Knowing them is necessary to read existing code and write better and more concise code.
There is much more to know about generators and iterators, like other ways to define them, how they can be used to define context managers or, going a step further, asynchronous coroutines. This is a lot of material, and I invite you to read the numerous sources available on the web if you are interested.