Stream processing

document updated Sep 18, 2012

When processing large amounts of data (e.g. calculating statistics for reports, doing GROUP-BY operations, etc.), the simplest way to implement it is usually to collect all of the data in RAM, and then iterate over the resulting array.

When implemented this way, the code is relatively small and easy to comprehend.
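
For comparison, here's a minimal sketch of that collect-everything-first approach in Perl; the file name (big_data.log) and record layout (whitespace-separated, numeric value in the second column) are made up for illustration:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Slurp every record into RAM at once...
    open my $fh, '<', 'big_data.log' or die "can't open: $!";
    my @records = <$fh>;
    close $fh;

    # ...then iterate over the array to compute a statistic.
    my $total = 0;
    for my $line (@records) {
        my (undef, $value) = split ' ', $line;   # hypothetical layout: value in column 2
        $total += $value;
    }
    print "total: $total\n";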

However, there are several downsides when working with very large datasets, or datasets that take time to retrieve:

- memory usage grows with the size of the input, and can exhaust available RAM
- no results can be produced until every record has been loaded, so there is a long delay before any output appears
- if the data arrives slowly (e.g. over a network or from a database), the program sits idle during retrieval instead of processing records as they arrive

Stream processing means handling the input in smaller chunks, and keeping data in RAM only if you really need to.
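
Continuing the made-up example above, the same total can be computed as a stream, reading one line at a time; RAM usage then stays constant no matter how large the file is, and processing begins as soon as the first record arrives:

    #!/usr/bin/perl
    use strict;
    use warnings;

    open my $fh, '<', 'big_data.log' or die "can't open: $!";

    # Process each record as it's read; only the running total is kept in RAM.
    my $total = 0;
    while (my $line = <$fh>) {
        my (undef, $value) = split ' ', $line;
        $total += $value;
    }
    close $fh;
    print "total: $total\n";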

Texts

Wikipedia articles

Implementations

From the user's standpoint, it's a relatively simple concept. However, it can be implemented in various ways, and the resulting code isn't always as easy to understand as the collect-everything version (one common style is sketched after the list below):

Perl modules
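
To illustrate one such implementation style (a hand-rolled sketch, not the API of any particular CPAN module): a closure-based iterator that hands back one parsed record per call. The file name and record layout are again hypothetical:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Returns a closure that yields one parsed record per call,
    # and undef once the stream is exhausted.
    sub make_record_iterator {
        my ($path) = @_;
        open my $fh, '<', $path or die "can't open $path: $!";
        return sub {
            my $line = <$fh>;
            return unless defined $line;    # end of stream
            chomp $line;
            my ($key, $value) = split ' ', $line;
            return { key => $key, value => $value };
        };
    }

    # A streaming GROUP-BY: only the per-key totals are held in RAM,
    # never the full dataset.
    my $next_record = make_record_iterator('big_data.log');
    my %totals;
    while ( my $rec = $next_record->() ) {
        $totals{ $rec->{key} } += $rec->{value};
    }
    print "$_: $totals{$_}\n" for sort keys %totals;

The closure hides the filehandle and parsing details, so the consuming loop stays about as small as the slurp-everything version; other implementations typically expose the same idea through objects, callbacks, or tied filehandles, which is where the readability differences come in.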