Lazy Pipelines Guide¶

This guide covers lazy evaluation in functional_list, including when to use it, how to build pipelines, and advanced patterns.

What is Lazy Evaluation?¶

Lazy evaluation defers computation until results are needed. Instead of executing each operation immediately, lazy pipelines record operations and execute them all at once when you materialize the results.

Key Principle: All transformation methods (map, filter, distinct, sort, etc.) just record operations. Execution happens only when you call an action method like to_list(), collect(), or foreach().
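The record-then-execute idea can be made concrete with a small pure-Python sketch. `TinyLazy` below is a hypothetical class written only for illustration. It is not how functional_list is implemented; it only mirrors the behaviour described above, where transformations are recorded and an action runs them.

```python
from typing import Any, Callable, Iterable, Iterator, List


class TinyLazy:
    """Hypothetical illustration class; not part of functional_list."""

    def __init__(self, source: Iterable[Any], steps: tuple = ()) -> None:
        self._source = source
        self._steps = steps  # recorded operations; nothing has run yet

    def map(self, fn: Callable[[Any], Any]) -> "TinyLazy":
        # Record the step and return a new pipeline; fn is not called here.
        return TinyLazy(self._source, self._steps + (("map", fn),))

    def filter(self, pred: Callable[[Any], bool]) -> "TinyLazy":
        # Same idea: remember the predicate, do not evaluate it.
        return TinyLazy(self._source, self._steps + (("filter", pred),))

    def _run(self) -> Iterator[Any]:
        stream: Iterator[Any] = iter(self._source)
        for kind, fn in self._steps:
            # Wrap the stream in one built-in lazy iterator per recorded step.
            stream = map(fn, stream) if kind == "map" else filter(fn, stream)
        return stream

    def to_list(self) -> List[Any]:
        # Action: the only place where the recorded steps actually execute.
        return list(self._run())


calls: List[int] = []


def square(x: int) -> int:
    calls.append(x)  # side effect so we can observe when execution happens
    return x * x


pipeline = TinyLazy([1, 2, 3, 4]).map(square).filter(lambda x: x > 5)
calls_before = len(calls)    # 0: building the pipeline ran nothing
result = pipeline.to_list()  # [9, 16]
```

Because `square` logs every call, `calls_before` staying at zero shows that chaining `map` and `filter` did no work until `to_list()` ran.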

Create a Lazy Pipeline¶

from functional_list import ListMapper

lazy = (
    ListMapper[int](1, 2, 3, 4)
    .lazy()                      # Switch to lazy mode
    .map(lambda x: x * x)        # Recorded, not executed ✅
    .filter(lambda x: x > 5)     # Recorded, not executed ✅
)

# Nothing is executed until you materialize
print(lazy.to_list())  # NOW executes: [9, 16]

Truly Lazy Operations¶

All transformation operations in lazy mode are truly lazy: each call only records the operation.

Recording vs Execution¶

lazy = ListMapper(1, 2, 3).lazy()
print("Created lazy mapper - no execution")

mapped = lazy.map(lambda x: x * 2)
print("Called map() - still no execution!")

filtered = mapped.filter(lambda x: x > 3)
print("Called filter() - still no execution!")

sorted_lazy = filtered.sort()
print("Called sort() - still no execution!")

# NOW everything executes
result = sorted_lazy.to_list()
print(f"to_list() called - execution happens now: {result}")
# Output: [4, 6]

Zero-Cost Pipeline Building¶

# Build complex pipelines with zero execution cost
pipeline = data.lazy()

if condition1:
    pipeline = pipeline.filter(predicate1)  # Recorded

if condition2:
    pipeline = pipeline.map(transform)       # Recorded

if condition3:
    pipeline = pipeline.sort(key=sort_key)   # Recorded

# Execute once at the end
result = pipeline.to_list()  # All operations execute together
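The snippet above relies on names (`data`, `condition1`, `predicate1`, and so on) that are not defined in this guide. As a self-contained sketch of the same idea, the standard-library version below collects steps conditionally and runs them once at the end. The helper names `build_steps` and `run` are hypothetical and are not part of functional_list.

```python
from typing import Callable, Iterable, List

# A step takes an iterable and returns a (possibly lazy) iterable.
Step = Callable[[Iterable[int]], Iterable[int]]


def build_steps(only_even: bool, double: bool, sort_desc: bool) -> List[Step]:
    """Collect steps based on flags; no data is touched while building."""
    steps: List[Step] = []
    if only_even:
        steps.append(lambda xs: (x for x in xs if x % 2 == 0))
    if double:
        steps.append(lambda xs: (x * 2 for x in xs))
    if sort_desc:
        steps.append(lambda xs: sorted(xs, reverse=True))
    return steps


def run(data: Iterable[int], steps: List[Step]) -> List[int]:
    """Apply the recorded steps in order and materialize the result."""
    stream: Iterable[int] = data
    for step in steps:
        stream = step(stream)
    return list(stream)


steps = build_steps(only_even=True, double=True, sort_desc=True)
result = run(range(1, 7), steps)  # [12, 8, 4]
```

Building the list of steps costs only a few function objects, which is the sense in which conditional pipeline construction is cheap.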

Lazy Transformations¶

All standard transformations work in lazy mode:

Map¶

lazy = ListMapper[int](1, 2, 3).lazy().map(lambda x: x * 2)
# Recorded - not executed
result = lazy.to_list()  # Executes: [2, 4, 6]

Filter¶

lazy = ListMapper[int](1, 2, 3, 4, 5).lazy().filter(lambda x: x % 2 == 0)
# Recorded - not executed
result = lazy.to_list()  # Executes: [2, 4]

FlatMap¶

lazy = (
    ListMapper[str]("hello world", "foo bar")
    .lazy()
    .flat_map(lambda s: s.split())
)
# Recorded - not executed
result = lazy.to_list()  # Executes: ['hello', 'world', 'foo', 'bar']
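For readers who know the standard library, the flat-map behaviour shown above corresponds roughly to `itertools.chain.from_iterable`. This is an analogy under that assumption, not a statement about how `flat_map` is implemented.

```python
from itertools import chain

lines = ["hello world", "foo bar"]

# chain.from_iterable flattens one level lazily: each line is split
# only when the consumer reaches it.
words = chain.from_iterable(s.split() for s in lines)

result = list(words)  # ['hello', 'world', 'foo', 'bar']
```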

Sort (Truly Lazy!)¶

Sort is recorded and executed only on action methods:

lazy = ListMapper(3, 1, 4, 1, 5, 9).lazy()
sorted_lazy = lazy.sort()  # Just recorded - NO execution!
print("Sort operation recorded")

# Can add more operations
mapped = sorted_lazy.map(lambda x: x * 2)  # Still no execution
print("Map operation recorded")

# NOW executes: sort first, then map
result = mapped.to_list()
# Output: [2, 2, 6, 8, 10, 18]

With key function:

users = ListMapper(
    {"name": "Alice", "score": 85},
    {"name": "Bob", "score": 92}
).lazy()

sorted_lazy = users.sort(key=lambda x: x["score"])  # Recorded
names = sorted_lazy.map(lambda x: x["name"])        # Recorded

result = names.to_list()  # Executes all
# Output: ['Alice', 'Bob']

Order By Key (Truly Lazy!)¶

lazy = ListMapper(3, 1, 4, 1, 5, 9).lazy()
sorted_lazy = lazy.order_by_key()  # Recorded, not executed

result = sorted_lazy.map(lambda x: x * 2).to_list()  # Executes
# Output: [2, 2, 6, 8, 10, 18]
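A deferred sort costs nothing to record, but once execution starts it has to read every upstream element before it can emit the first one. The plain-Python sketch below illustrates that trade-off with generators. `source` and `lazy_sorted` are hypothetical helpers, and the library's own sort may be implemented differently.

```python
from typing import Iterable, Iterator, List

pulled: List[int] = []


def source() -> Iterator[int]:
    for x in (3, 1, 4, 1, 5, 9):
        pulled.append(x)  # record each element as it leaves the source
        yield x


def lazy_sorted(items: Iterable[int]) -> Iterator[int]:
    # A generator body does not run until the first next() call, so
    # creating it is free; running it buffers the whole input.
    yield from sorted(items)


stream = (x * 2 for x in lazy_sorted(source()))
pulled_before = list(pulled)       # []: nothing consumed yet
first = next(stream)               # 2: smallest element, doubled
pulled_after_first = list(pulled)  # all six items were needed for it
```

So a lazy sort defers work but does not stream, which is worth keeping in mind when a sort sits in front of a `take()`.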

Distinct¶

Remove duplicates efficiently in a streaming fashion:

lazy = (
    ListMapper[int](1, 2, 2, 3, 1, 4, 3, 5)
    .lazy()
    .distinct()  # Removes duplicates as it streams
)
result = lazy.to_list()  # [1, 2, 3, 4, 5]

Advanced Example:

# Deduplication in a complex pipeline
lazy = (
    ListMapper[int](*range(100))
    .lazy()
    .filter(lambda x: x % 2 == 0)   # [0, 2, 4, 6, ...]
    .map(lambda x: x % 10)           # [0, 2, 4, 6, 8, 0, 2, ...]
    .distinct()                      # [0, 2, 4, 6, 8]
    .map(lambda x: x + 10)           # [10, 12, 14, 16, 18]
)
result = lazy.to_list()

Memory Efficiency

distinct() in lazy mode deduplicates as it streams. It keeps track only of the elements it has already seen rather than building intermediate lists, so memory grows with the number of unique elements, not with the size of the input.
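Streaming deduplication can be sketched in a few lines of plain Python. The version below assumes a set-based approach with hashable elements, and `streaming_distinct` is a hypothetical helper, so treat it as an illustration rather than the library's code.

```python
from typing import Hashable, Iterable, Iterator, Set


def streaming_distinct(items: Iterable[Hashable]) -> Iterator[Hashable]:
    """Yield each element the first time it appears, preserving order."""
    seen: Set[Hashable] = set()  # the only state kept between elements
    for item in items:
        if item not in seen:
            seen.add(item)
            yield item


simple = list(streaming_distinct([1, 2, 2, 3, 1, 4, 3, 5]))  # [1, 2, 3, 4, 5]

# Same shape as the advanced example: filter, map, distinct, map.
evens = (x for x in range(100) if x % 2 == 0)
last_digits = (x % 10 for x in evens)
advanced = [x + 10 for x in streaming_distinct(last_digits)]  # [10, 12, 14, 16, 18]
```

The `seen` set holds at most one entry per unique value, which is five entries in the second example even though fifty elements flow through.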

Action Methods (Trigger Execution)¶

Action methods are the only operations that trigger execution. All transformation methods just record operations.

to_list()¶

Execute the pipeline and return a Python list:

lazy = ListMapper(1, 2, 3).lazy().map(lambda x: x * 2)
result = lazy.to_list()  # Executes NOW - returns [2, 4, 6]

collect()¶

Execute the pipeline and return an eager ListMapper:

lazy = ListMapper(1, 2, 3).lazy().map(lambda x: x * 2)
eager = lazy.collect()  # Executes NOW - returns ListMapper[2, 4, 6]

# Can now use eager operations
result = eager.sort().take(2)

foreach()¶

Execute the pipeline for side effects:

lazy = ListMapper(1, 2, 3).lazy().map(lambda x: x * 2)
lazy.foreach(lambda x: print(x))  # Executes NOW
# Prints: 2, 4, 6

reduce()¶

Execute and reduce to a single value:

lazy = ListMapper(1, 2, 3, 4).lazy().map(lambda x: x * 2)
total = lazy.reduce(lambda x, y: x + y)  # Executes NOW - returns 20

Direct Iteration¶

Iterating over a lazy mapper triggers execution:

lazy = ListMapper(1, 2, 3).lazy().map(lambda x: x * 2)

for item in lazy:  # Executes NOW
    print(item)
# Prints: 2, 4, 6

Execution Only on Action

# NO execution - just recording
lazy = (
    data.lazy()
    .map(...)      # Recorded
    .filter(...)   # Recorded
    .sort(...)     # Recorded
    .distinct()    # Recorded
)

# NOW execution happens
result = lazy.to_list()  # All operations execute together

Performance Benefit

Building a lazy pipeline has virtually zero cost. You can build complex conditional pipelines without any performance penalty until you materialize:

pipeline = data.lazy()

# Add operations conditionally - zero cost!
if filter_needed:
    pipeline = pipeline.filter(predicate)
if transform_needed:
    pipeline = pipeline.map(transform)
if sort_needed:
    pipeline = pipeline.sort(key=sort_key)

# Execute once
result = pipeline.to_list()

Union¶

Combine two lazy pipelines in a streaming fashion:

lazy1 = ListMapper[int](1, 2, 3).lazy()
lazy2 = ListMapper[int](4, 5, 6).lazy()
result = lazy1.union(lazy2).to_list()  # [1, 2, 3, 4, 5, 6]

With Transformations:

# Union after map operations
lazy1 = ListMapper[int](1, 2, 3).lazy().map(lambda x: x * 2)
lazy2 = ListMapper[int](4, 5, 6).lazy().map(lambda x: x * 3)
result = lazy1.union(lazy2).to_list()
# Result: [2, 4, 6, 12, 15, 18]

# Map after union
lazy1 = ListMapper[int](1, 2, 3).lazy()
lazy2 = ListMapper[int](4, 5, 6).lazy()
result = lazy1.union(lazy2).map(lambda x: x * 2).to_list()
# Result: [2, 4, 6, 8, 10, 12]

Deduplication Pattern:

# Union preserves duplicates - use distinct() to remove them
lazy1 = ListMapper[int](1, 2, 3).lazy()
lazy2 = ListMapper[int](3, 4, 5).lazy()

# With duplicates
result = lazy1.union(lazy2).to_list()
# Result: [1, 2, 3, 3, 4, 5]

# Remove duplicates
result = lazy1.union(lazy2).distinct().to_list()
# Result: [1, 2, 3, 4, 5]

Streaming Efficiency

Lazy union iterates through both sequences sequentially without creating intermediate lists, making it perfect for combining large datasets:

# Combine large files without loading everything into memory
source1 = ListMapper.from_parquet("large1.parquet").lazy()
source2 = ListMapper.from_parquet("large2.parquet").lazy()

result = (
    source1
    .union(source2)
    .filter(lambda row: row["active"])
    .take(1000)  # Only processes what's needed!
)
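The sequential, streaming behaviour of a union can be illustrated with `itertools.chain`. In this sketch `read_source` is a hypothetical stand-in for a file reader and the rows are synthetic. It shows that the second source is never started when the first one already satisfies the request.

```python
from itertools import chain, islice
from typing import Dict, Iterator, List

opened: List[str] = []


def read_source(name: str, rows: int) -> Iterator[Dict[str, object]]:
    opened.append(name)  # runs only when the first row is requested
    for i in range(rows):
        yield {"id": f"{name}-{i}", "active": i % 2 == 0}


# chain walks the first source to the end before touching the second.
combined = chain(read_source("a", 5), read_source("b", 5))
active = (row for row in combined if row["active"])

first_two = list(islice(active, 2))
ids = [row["id"] for row in first_two]  # ['a-0', 'a-2']
```

After this runs, `opened` contains only `"a"`: source `"b"` was combined into the pipeline but never read.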

Materialization Methods¶

to_list()¶

Execute and return a Python list:

lazy = ListMapper[int](1, 2, 3).lazy().map(lambda x: x * 2)
result = lazy.to_list()  # [2, 4, 6]

collect()¶

Execute and return an eager ListMapper:

lazy = ListMapper[int](1, 2, 3).lazy().map(lambda x: x * 2)
eager = lazy.collect()  # ListMapper[2, 4, 6]

take()¶

Execute only what's needed for the first n elements:

# Very efficient - only processes first 10 items
lazy = (
    ListMapper[int](*range(1_000_000))
    .lazy()
    .map(expensive_function)
)
first_10 = lazy.take(10)  # Only computes 10 items!
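The claim that only ten items are computed can be checked with a standard-library analogue. Here `expensive_function` is a stand-in defined for the example, and `itertools.islice` plays the role of `take()`; this is a sketch of the principle, not of the library's internals.

```python
from itertools import islice
from typing import Dict, List

counter: Dict[str, int] = {"calls": 0}


def expensive_function(x: int) -> int:
    counter["calls"] += 1  # count how many inputs are actually processed
    return x * x


# map() is lazy in Python 3: no element is computed at this point.
mapped = map(expensive_function, range(1_000_000))

# islice stops pulling from the pipeline after ten results.
first_10: List[int] = list(islice(mapped, 10))
```

`counter["calls"]` ends at 10, even though the source range holds a million elements.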

Operations That Force Materialization¶

Some operations require seeing all data and return ListMapper:

Sorting¶

lazy = ListMapper[int](3, 1, 2).lazy().map(lambda x: x * 2)
sorted_result = lazy.order_by_key()  # Forces materialization
# Returns: ListMapper[2, 4, 6]

Distinct (when materialized)¶

While distinct() can be lazy, calling collect() materializes it:

lazy = ListMapper[int](1, 2, 2, 3).lazy().distinct()
eager = lazy.collect()  # Materializes to ListMapper[1, 2, 3]

Terminal Operations¶

Operations that consume the pipeline and either return a single value or run side effects:

Reduce¶

lazy = ListMapper[int](1, 2, 3, 4).lazy()
total = lazy.reduce(lambda x, y: x + y)  # 10

Foreach¶

lazy = ListMapper[int](1, 2, 3).lazy().map(lambda x: x * 2)
lazy.foreach(lambda x: print(f"Value: {x}"))
# Prints: Value: 2, Value: 4, Value: 6

Advanced Patterns¶

Pattern 1: Large Dataset Filtering¶

# Process millions of records efficiently
lazy = (
    ListMapper.from_parquet("huge_file.parquet")
    .lazy()
    .filter(lambda row: row["active"])
    .map(transform_row)
    .distinct()  # Remove duplicates
    .take(1000)  # Keep only the first 1000 results
)

Pattern 2: Chained Deduplication¶

# Remove duplicates at multiple stages
lazy = (
    ListMapper[str].from_text("large_log.txt")
    .lazy()
    .map(parse_log_line)
    .distinct()  # Remove duplicate log lines
    .map(lambda e: e["user_id"])
    .distinct()  # Remove duplicate user IDs
)
unique_users = lazy.to_list()

Pattern 3: Lazy → Eager → Lazy¶

# Use eager for small results, lazy for large processing
result = (
    large_dataset
    .lazy()
    .filter(expensive_filter)  # Lazy: reduces data
    .distinct()                 # Lazy: removes duplicates
    .collect()                  # Materialize (now smaller)
    .sort()                     # Eager: sort efficiently
    .lazy()                     # Back to lazy
    .take(100)                  # Efficient top-100
)

Pattern 4: Union Multiple Sources¶

# Combine data from multiple sources efficiently
csv_data = ListMapper.from_csv("source1.csv").lazy()
json_data = ListMapper.from_json("source2.json").lazy()
parquet_data = ListMapper.from_parquet("source3.parquet").lazy()

# Merge and process
result = (
    csv_data
    .union(json_data)
    .union(parquet_data)
    .filter(lambda row: row["status"] == "active")
    .distinct()
    .map(transform)
    .take(10000)
)

Pattern 5: With Backends¶

from functional_list.backend import LocalBackend

# Execute lazy pipeline with specific backend
lazy = (
    ListMapper[int](*range(10000))
    .lazy()
    .map(expensive_computation)
    .filter(lambda x: x > 0)
    .distinct()
)

# Materialize using threading
result = lazy.collect(backend=LocalBackend(mode="threads", workers=8))
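What a thread-based backend does internally is not described in this guide. As a rough mental model only, the sketch below runs the expensive map step on a `concurrent.futures.ThreadPoolExecutor` and then applies the filter and deduplication in order. `collect_with_threads` and the body of `expensive_computation` are hypothetical, and `LocalBackend` may work quite differently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List, Set


def expensive_computation(x: int) -> int:
    return x * x - 10  # stand-in for real work


def collect_with_threads(items: Iterable[int], workers: int) -> List[int]:
    # Executor.map preserves input order even though calls run concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(expensive_computation, items))

    # Filter and deduplicate sequentially, keeping first occurrences.
    seen: Set[int] = set()
    out: List[int] = []
    for value in mapped:
        if value > 0 and value not in seen:
            seen.add(value)
            out.append(value)
    return out


result = collect_with_threads(range(-5, 6), workers=8)  # [15, 6]
```

Threads help most when the mapped function waits on I/O; for CPU-bound pure-Python work the interpreter lock usually limits the gain.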

Best Practices¶

Use Lazy for Large Data

Always use lazy mode when processing large datasets or when you only need a subset of results.

Combine Operations

Chain multiple lazy operations before materializing to minimize overhead.

Early Filtering

Put filters early in the pipeline to reduce data volume as soon as possible.

Single Consumption

Lazy pipelines backed by generators can only be consumed once. If you need to iterate multiple times, call collect() first.
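The single-consumption caveat is ordinary Python generator behaviour, which the short sketch below demonstrates without any library code.

```python
# A generator yields its items once and is then exhausted.
gen = (x * 2 for x in [1, 2, 3])
first_pass = list(gen)   # [2, 4, 6]
second_pass = list(gen)  # []: nothing left to yield

# Materializing first gives a reusable sequence.
materialized = [x * 2 for x in [1, 2, 3]]
reuse_a = list(materialized)  # [2, 4, 6]
reuse_b = list(materialized)  # [2, 4, 6]
```

Materializing once and reusing the result avoids silently getting an empty second pass.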

Performance Comparison¶

# Eager - creates intermediate lists
eager = (
    ListMapper[int](*range(1_000_000))
    .map(lambda x: x * 2)        # Creates 1M list
    .filter(lambda x: x > 10)     # Creates another list
    .distinct()                   # Creates another list
)

# Lazy - streams through data
lazy = (
    ListMapper[int](*range(1_000_000))
    .lazy()
    .map(lambda x: x * 2)        # No intermediate list
    .filter(lambda x: x > 10)     # No intermediate list
    .distinct()                   # Streaming deduplication
    .take(100)                    # Only processes what's needed!
)
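The difference can be measured by counting how many elements each style actually processes. The sketch below uses plain lists and generators as stand-ins for the eager and lazy modes, with a smaller input so it runs quickly. `double` and `distinct` are hypothetical helpers, and the numbers describe this sketch, not a benchmark of functional_list.

```python
from itertools import islice
from typing import Dict, Hashable, Iterable, Iterator, Set

N = 100_000
calls: Dict[str, int] = {"eager": 0, "lazy": 0}


def double(x: int, mode: str) -> int:
    calls[mode] += 1  # count elements processed in each style
    return x * 2


def distinct(items: Iterable[Hashable]) -> Iterator[Hashable]:
    seen: Set[Hashable] = set()
    for item in items:
        if item not in seen:
            seen.add(item)
            yield item


# Eager style: every stage builds a full intermediate list.
doubled = [double(x, "eager") for x in range(N)]
filtered = [x for x in doubled if x > 10]
eager_result = list(dict.fromkeys(filtered))[:100]

# Lazy style: stages are chained generators, cut off after 100 results.
stream = (double(x, "lazy") for x in range(N))
stream = (x for x in stream if x > 10)
lazy_result = list(islice(distinct(stream), 100))
```

Both styles produce the same 100 values, but the eager version calls `double` 100,000 times while the lazy version stops after 106 calls: six inputs are filtered out before the first hundred results are found.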
