Pypey
Pypey is a library for concisely composing data transformation primitives. It support lazily evaluated collection pipelines with standard operations like map, reduce , filter and several others others. It is fashioned after libraries like Java Streams , Immutable.js and C++ Streams. It has been inspired by, and leans on, the excellent itertools and more-itertools
Motivation
Many operations on data, like reading lines from a file, filtering or aggregating items, appear repeatedly across many domains and so they benefit from encapsulating to avoid duplication and enable code reuse. This encapsulation also has a second benefit: the abstracted operations now map 1-to-1 to the high-level description, ie, the intent of the coder. A third advantage is that it allows the different operations in a pipeline to be disentangled from each other.
Let’s illustrate all this with an example: a frequent data processing routine where you need to build a word-to-id dictionary from a file containing words of text. The high-level recipe would be something like this:
1. read lines from file
2. split lines into words
3. keep only unique words across all lines
4. assign a unique number id to each word
5. put words and their ids in a dictionary
A typical implementation using only python built-ins would look like this:
with open('text.txt') as file: # 1. open file for reading
idx = 0 # 2. make id counter
word_to_id = {} # 3. make empty dict
for line in file: # 4. loop through lines
for word in line.split(): # 5. split line and loop through words
word = word.rstrip() # 6. strip line terminator
if word not in word_to_id: # 7. check to see if it's in dictionary
word_to_id[word] = idx # 8. insert word with a new id if it's not
idx += 1 # 9. update id counter
Notice how there are steps 2.
, 3.
, 6.
and 9.
do not correspond to anything in the high level recipe.
Notice also, how operations 4.
to 9.
interleave with each other, happening once per loop iteration. Let’s see how
this could be implemented with itertools
:
from itertools import chain, count
with open('text.txt') as file: # 1. open file for reading
lines = file.readlines() # 2. read lines from file
stripped = map(str.rstrip, lines) # 3. strip line terminator
words = map(str.split, stripped) # 4. split lines into words
all_words = chain.from_iterable(words) # 5. concatenate all lines
unique = set(all_words) # 6. keep only unique words across all lines
words_ids = zip(unique, count()) # 7. assign a unique id to each word
word_to_id = dict(words_ids) # 8. put in dictionary
Now the steps match the original intent better because they operate at higher level, ie, they work at the level of whole collections of items, and do not concern themselves with the data and syntactic structures needed when the algorithm is specified at the level of the individual items, as in the first implementation. The next implementation is more concise and leverages the ability to pipe the collections together:
from itertools import chain, count
with open('text.txt') as file: # 1.
# 2. + 3. + 4. + 5. + 6. + 7. + 8.
word_to_id = dict(zip(set(chain.from_iterable(map(str.split, map(str.rstrip, file.readlines())))), count()))
However, the sequence of steps is now laid out in reverse order or “inside-out”. With Pypey, the code is still concise and the steps always flow right:
from pypey import pype
word_to_id = (pype.file('text.txt') # 1. read lines from file, strip line terminator by default
.map(str.split) # 2. split lines into words
.flat() # 3. concatenate all lines (by "flattening" them)
.uniq() # 4. keep only unique words across all lines
.enum(swap=True) # 5. assign a unique id to each word
.to(dict)) # 6. put in a dictionary
This implementation matches the original intent best and removes the need for the coder to write boiler-plate that is not domain-specific. A more terse implementation helps when using the Python interpreter’s interactive mode (REPL):
>>> from pypey import pype
>>> # 1. + 2. + 3. + 4. + 5. + 6.
>>> word_to_id = pype.file('text.txt').map(str.split).flat().uniq().enum(swap=True).to(dict)
Lazy and Deferred Evaluation
Both itertools
-‘s and Pypey’s implementation would incur a performance penalty if each step created an intermediate
collection. However by piping through lazy collections, ie, those that are evaluated incrementally only one item at
a time as they are iterated through (based on generators), the performance is similar to a loop-based implementation.
Furthermore, just as the loop-based approach, items are only read one at a time into memory, avoiding unnecessary
allocation.
Not all operations can be implemented lazily, for instance, sorting is necessarily “eager” as it entails traversing the whole collection before being able to retrieve the first sorted item. Pypey still makes these eager operations deferred to allow delaying the consumption of the lazy collection until it’s actually needed:
>>> p = pype(['a', 'fun', 'day']).sort()
>>> p
<pypey.pype.Pype object at 0x7f58edaf4970>
>>> list(p)
['a', 'day', 'fun']
Argument Unpacking
PEP 3113 removed Python 2’s ability to unpack function arguments from Python 3. This made using higher-order functions (functions taking or returning other functions) harder when applied to iterable items in a collection, especially so when lambdas are passed in, as it’s impossible to use unpacking assignments in them. Pypey brings back a limited form of argument unpacking that works only at the top level of nesting. For instance:
>>> pype.dict({'a':1, 'fun':2, 'day':3}).map(lambda kv: (kv[0], kv[1] + 1)).to(list)
[('a', 2), ('fun', 3), ('day', 4)]
can also be written more clearly as:
>>> pype.dict({'a':1, 'fun':2, 'day':3}).map(lambda k, v: (k, v + 1)).to(list)
[('a', 2), ('fun', 3), ('day', 4)]
Getting Started
To get started, install the library with pip:
pip install pypey
Then use as:
>>> from pypey import pype
>>> pype(range(-2, 3)).map(abs).print()
2
1
0
1
2
<pypey.pype.Pype object at 0x7f56401e0f40>
To run tests install pytest
:
pip install pytest
then run:
pytest