Lazy data structures have many advantages over classically eager data structures. Strictly lazy structures are not widely used in imperative code due to difficulties involved with composition of exception handling and mutable state. Despite these difficulties, they offer a compelling tool for solving a large class of problem spaces.
This post will attempt to highlight two of those benefits: memory footprint and overall programmatic computational cost. It will begin by first defining a common, basic data structure, a Tree. After implementing two methods on a Tree, map and exists, it will use those methods to solve a problem. Finally, using a lazy version of a tree, the differences in terms of cost and benefit will be shown.
Instead of using a tree as a container for other objects, like in many algorithms and structures classes, this tree will express a hierarchy of values. Let us suppose that all hierarchical trees have the following interface:
which exposes just two methods: map and exists.
We can implement an arbitrarily sized hierarchical tree structure by allowing each level to vary on the number of children it has by using a List:
Where for each action performed, the evaluation of that action is called directly upon evocation.
A Problem Statement
Suppose that we want to represent an extended family whose fortunes have swelled but net worth is tied up in a complex set of independent and interwoven investment vehicles. Each member of the family is entitled to a certain number of votes based upon some ownership of shares. The board of one of the companies has called for a vote where a large outside investor is attempting to influence the outcome.
Being an efficient and careful activist, this investor would like to first determine if there is a single family member who has enough voting power such that he could win the majority. If such a member does exist, the investor will spend all of his energy lobbying that person. If not, the investor will divide his time based on the number of votes:
Modeled like this, using the eager tree defined above, in the worst case the algorithm would require an evaluation of every single family member's voting rights twice; once for the expensive calculation and one time for the predicate condition. The best case with eager evaluation would also require every single family member to be evaluated at least once, N expensive calculations but just once for the predicate.
To improve upon the bounds and more accurately model the outside investor as an efficient actor, we would need another implementation. We could rewrite the function to use some form of short circuiting accumulation. This route would lead to increased implementation complexity and decreased maintainability. Or we could examine using lazy data structures and keep the function as written above completely unchanged and intact.
A Lazy Tree
The relationship between an eagerly evaluated tree and a lazy tree is much that same as that between map and exists (on an eager data structure.) Exists is a short circuiting function, executing the function the least number of times necessary until a positive result is found while map will call it N times for N objects. On a lazy data structure, map will be delayed until the first time a result is requested.
We can define a lazy tree as follows:
wherein the call to head is preferable to repeated calculations of value, if it is to be requested multiple times. Coincidentally, the definition of LazyTree is nearly identical to EagerTree save for the implementation detail of nodes as a Stream vs a List and the use of Thunks for member values.
While small in difference cosmetically, programmatically it is a powerful distinction. The actual calculation of value is delayed until it is explicitly called and so too is any calculation using nodes. To understand how this is so, we need but study the implementation of the cons cell of Stream and the map member function, copied with some liberty below:
So defined, calling map on a Stream of N items creates a Stream of 1 element with a tail wrapped in a Thunk. Iterating over each element creates yet another single cons cell with a delayed realization of the tail, hence the concept of lazy evaluation.
Like the eager tree, the worst case scenario for the problem statement would be the evaluation of each family member twice. The best case scenario, on the other hand, would be the evaluation of only a single family member once. Additionally, in the best case scenario the memory footprint would be greatly reduced to 2 (the value and the nodes Thunks) instead of the entire tree of weighted votes, saving not only time but also space.
Efficiency is defined as doing no more than the minimal effort required. Clearly the lazy tree accomplishes this by putting off as much as possible until explicitly asked. As shown in this post, a direct, easy to understand algorithm was made both cost and space efficient by switching from an eager data structure to a lazy data structure without destroying readability.
The key to the improvement of this algorithm was the potential need to retain intermediate values for some but not all branches of logic. In situations like that, lazy structures shine. However, it would be remiss to not caution you, the reader, to understand that lazy structures are not always ideal. With anything there are trade-offs. Lazy structures can induce hard to understand performance profiles and in multi-threaded applications, if coded with persistence in mind, lock based overheads (look into the how of Scala's lazy val.)
That said, knowing about lazy structures and understanding situations where they can help arms you with another tool in the tool box. In the next post, another use case of lazy structures will be talked about. One which leads to patterns that would be impossible to write with eager structures.