
Top strategies for optimising Python notebooks

We discuss strategies for optimizing Python notebooks to address slow performance and high memory usage, which can hinder productivity and efficiency when working with large datasets or complex computations.

Diego Garcia · 5 min read

Introduction

Python notebooks have gained immense popularity among data scientists, researchers, and programmers for their interactive, user-friendly, and collaborative nature. However, they can become sluggish when the tasks they run are computationally heavy or memory-intensive. In this blog post, we’ll explore some of the most effective strategies to speed up your Python notebooks and enhance your overall experience.

Embrace Vectorized Operations

Firstly, you should always prefer vectorized operations over loops when working with large datasets. A vectorized operation is a technique that applies a single operation simultaneously to multiple elements in an array or a collection of data, rather than iterating through each element individually. This approach enables faster computations by taking advantage of low-level optimizations, hardware capabilities, and parallelism.

Libraries like NumPy and Pandas provide vectorized operations that reduce the number of iterations and enable faster computations by exploiting low-level optimizations. However, it’s essential to understand the trade-offs when using vectorization, such as increased memory usage, which can be an issue for memory-constrained systems.

Example: Loop vs Vectorized Operations

import numpy as np
import time

# Create a large array
data = np.random.rand(1000000)

# Slow approach: Using Python loops
start_time = time.time()
result_loop = []
for x in data:
    result_loop.append(x ** 2 + 2 * x + 1)
loop_time = time.time() - start_time

# Fast approach: Using vectorized operations
start_time = time.time()
result_vectorized = data ** 2 + 2 * data + 1
vectorized_time = time.time() - start_time

print(f"Loop time: {loop_time:.4f} seconds")
print(f"Vectorized time: {vectorized_time:.4f} seconds")
print(f"Speedup: {loop_time / vectorized_time:.1f}x faster")

Pandas Vectorization Example

import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({
    'values': np.random.rand(100000),
    'multiplier': np.random.randint(1, 10, 100000)
})

# Slow approach: Using apply with lambda
%timeit df['result_slow'] = df.apply(lambda row: row['values'] * row['multiplier'], axis=1)

# Fast approach: Using vectorized operations
%timeit df['result_fast'] = df['values'] * df['multiplier']

Choose Efficient Data Structures

Choosing the right data structures can have a significant impact on your notebook’s performance. For instance, using Python’s built-in set and dict types can improve the speed of membership tests and lookups, compared to using lists. However, keep in mind that these data structures might have slightly higher memory overheads.

Understanding how different data structures work under the hood is crucial to writing optimized code.

Example: Efficient Data Structure Choices

import time

# Sample data
data = list(range(100000))
search_items = [999, 50000, 99999]

# Using list (inefficient for membership testing)
start_time = time.time()
for item in search_items:
    result = item in data  # O(n) operation
list_time = time.time() - start_time

# Using set (efficient for membership testing)
data_set = set(data)
start_time = time.time()
for item in search_items:
    result = item in data_set  # O(1) operation
set_time = time.time() - start_time

print(f"List search time: {list_time:.6f} seconds")
print(f"Set search time: {set_time:.6f} seconds")
print(f"Speedup: {list_time / set_time:.1f}x faster")

Dictionary vs List for Lookups

# Creating sample data
names = [f"user_{i}" for i in range(10000)]
ids = list(range(10000))

# Using list of tuples (inefficient)
user_list = list(zip(names, ids))

def find_user_id_list(username):
    for name, user_id in user_list:
        if name == username:
            return user_id
    return None

# Using dictionary (efficient)
user_dict = dict(zip(names, ids))

def find_user_id_dict(username):
    return user_dict.get(username)

# Performance comparison
test_user = "user_8888"

%timeit find_user_id_list(test_user)
%timeit find_user_id_dict(test_user)

Optimize Memory Usage

Working with large datasets can consume a lot of memory, slowing down your notebook. To counter this, use memory-efficient data structures and libraries like Dask, which allows you to work with larger-than-memory datasets by breaking them into smaller, manageable chunks. Keep in mind that Dask’s performance is heavily dependent on available system resources, so ensure your hardware meets its requirements.
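
Here is a minimal sketch of the Dask approach, assuming a CSV file named large_dataset.csv with 'category' and 'value' columns (both the file name and the column names are placeholders):

Example: Processing a Larger-than-Memory CSV with Dask

import dask.dataframe as dd

# Read the CSV lazily, in partitions, instead of loading it all into memory
ddf = dd.read_csv('large_dataset.csv')

# Operations only build a task graph at this point; nothing is computed yet
mean_by_category = ddf.groupby('category')['value'].mean()

# .compute() executes the graph, processing the partitions chunk by chunk
print(mean_by_category.compute())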

Another strategy that works very well is specifying data types when reading DataFrames. By explicitly defining the data type of each column, you can reduce memory consumption and ensure that your DataFrame operations are executed efficiently.

In Pandas, you can specify data types when reading a DataFrame from a file (e.g., CSV) by using the dtype parameter of the read_csv function. The dtype parameter accepts a dictionary that maps column names to their respective data types; datetime columns should be parsed with the parse_dates parameter rather than dtype.

Here’s an example of how to specify data types when reading a CSV file using Pandas:

Example: Setting Data Types When Reading DataFrames

import pandas as pd
import numpy as np

# Define data types for each column
column_types = {
    'user_id': 'int32',          # Instead of default int64
    'age': 'int8',               # Ages typically 0-120
    'score': 'float32',          # Instead of default float64
    'category': 'category',      # For repeated strings
    'is_active': 'bool'          # For boolean values
}

# Read CSV with specified data types
# (datetime columns are parsed via parse_dates rather than dtype)
df = pd.read_csv('large_dataset.csv', dtype=column_types, parse_dates=['timestamp'])

# Compare memory usage
print("Memory usage with optimized dtypes:")
print(df.memory_usage(deep=True).sum() / 1024**2, "MB")

# Compare with default dtypes (for demonstration)
df_default = pd.read_csv('large_dataset.csv')
print("Memory usage with default dtypes:")
print(df_default.memory_usage(deep=True).sum() / 1024**2, "MB")

# Show the difference
memory_saved = df_default.memory_usage(deep=True).sum() - df.memory_usage(deep=True).sum()
print(f"Memory saved: {memory_saved / 1024**2:.2f} MB")

Example: Converting Existing DataFrame Types

# If you already have a DataFrame, you can optimize it
def optimize_dataframe(df):
    """Optimize DataFrame memory usage by converting data types"""
    
    # Convert integer columns
    for col in df.select_dtypes(include=['int64']):
        if df[col].min() >= 0:
            if df[col].max() < 255:
                df[col] = df[col].astype('uint8')
            elif df[col].max() < 65535:
                df[col] = df[col].astype('uint16')
            elif df[col].max() < 4294967295:
                df[col] = df[col].astype('uint32')
        else:
            if df[col].min() > -128 and df[col].max() < 127:
                df[col] = df[col].astype('int8')
            elif df[col].min() > -32768 and df[col].max() < 32767:
                df[col] = df[col].astype('int16')
            elif df[col].min() > -2147483648 and df[col].max() < 2147483647:
                df[col] = df[col].astype('int32')
    
    # Convert float columns
    for col in df.select_dtypes(include=['float64']):
        df[col] = df[col].astype('float32')
    
    # Convert object columns with few unique values to category
    for col in df.select_dtypes(include=['object']):
        num_unique_values = len(df[col].unique())
        num_total_values = len(df[col])
        if num_unique_values / num_total_values < 0.5:  # Less than 50% unique
            df[col] = df[col].astype('category')
    
    return df

# Usage
optimized_df = optimize_dataframe(df.copy())

By specifying data types when reading DataFrames, you can reduce memory usage and improve the performance of your data processing tasks in Python notebooks.

Leverage Parallelization

Take advantage of multiple cores on your machine by parallelizing your code using libraries like concurrent.futures or multiprocessing. This can significantly improve the performance of your notebook for CPU-bound tasks. Because Python’s Global Interpreter Lock (GIL) prevents threads from executing Python bytecode in parallel, CPU-bound work generally needs process-based parallelism, while threads mainly help with I/O-bound work. Keep in mind that parallelization introduces complexity and might not be suitable for all problems.

Example: Parallelizing with concurrent.futures

import concurrent.futures
import time
import math

def cpu_intensive_task(n):
    """Simulate a CPU-intensive task"""
    result = 0
    for i in range(n):
        result += math.sqrt(i)
    return result

# Sequential execution
numbers = [1000000, 1000000, 1000000, 1000000]

start_time = time.time()
sequential_results = [cpu_intensive_task(n) for n in numbers]
sequential_time = time.time() - start_time

# Parallel execution with ThreadPoolExecutor (for I/O-bound tasks)
start_time = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    thread_results = list(executor.map(cpu_intensive_task, numbers))
thread_time = time.time() - start_time

# Parallel execution with ProcessPoolExecutor (for CPU-bound tasks)
start_time = time.time()
with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
    process_results = list(executor.map(cpu_intensive_task, numbers))
process_time = time.time() - start_time

print(f"Sequential time: {sequential_time:.2f} seconds")
print(f"Thread parallel time: {thread_time:.2f} seconds")
print(f"Process parallel time: {process_time:.2f} seconds")
print(f"Process speedup: {sequential_time / process_time:.1f}x")

Example: Parallel DataFrame Processing

import pandas as pd
import numpy as np
from concurrent.futures import ProcessPoolExecutor
import multiprocessing as mp

def process_chunk(chunk):
    """Process a chunk of the DataFrame"""
    # Example: Calculate some complex statistics
    result = {
        'mean': chunk.mean(),
        'std': chunk.std(),
        'quantiles': chunk.quantile([0.25, 0.5, 0.75])
    }
    return result

# Create sample data
df = pd.DataFrame(np.random.rand(1000000, 5), columns=['A', 'B', 'C', 'D', 'E'])

# Split DataFrame into chunks
num_cores = mp.cpu_count()
chunk_size = len(df) // num_cores
chunks = [df[i:i + chunk_size] for i in range(0, len(df), chunk_size)]

# Process chunks in parallel
with ProcessPoolExecutor(max_workers=num_cores) as executor:
    results = list(executor.map(process_chunk, chunks))

print(f"Processed {len(chunks)} chunks using {num_cores} cores")

Consider Polars as an alternative to Pandas

Polars is a relatively new DataFrame library that can offer better performance than Pandas in certain scenarios. It is designed to be faster and more memory-efficient, making it an attractive alternative for large-scale data processing tasks. However, as a newer library, Polars may not have the same level of community support and extensive documentation that Pandas offers, so be prepared to invest some time in learning and adapting to this library.

Example: Polars vs Pandas Performance Comparison

import pandas as pd
import polars as pl
import numpy as np
import time

# Create sample data
n_rows = 1_000_000
data = {
    'id': range(n_rows),
    'value': np.random.rand(n_rows),
    'category': np.random.choice(['A', 'B', 'C', 'D'], n_rows),
    'timestamp': pd.date_range('2023-01-01', periods=n_rows, freq='1min')
}

# Create DataFrames
df_pandas = pd.DataFrame(data)
df_polars = pl.DataFrame(data)

print(f"Dataset size: {n_rows:,} rows")

# Test 1: Group by and aggregation
start_time = time.time()
pandas_result = df_pandas.groupby('category')['value'].agg(['mean', 'sum', 'count'])
pandas_time = time.time() - start_time

start_time = time.time()
polars_result = df_polars.group_by('category').agg([
    pl.col('value').mean().alias('mean'),
    pl.col('value').sum().alias('sum'),
    pl.col('value').count().alias('count')
])
polars_time = time.time() - start_time

print(f"\nGroup by aggregation:")
print(f"Pandas time: {pandas_time:.3f} seconds")
print(f"Polars time: {polars_time:.3f} seconds")
print(f"Polars speedup: {pandas_time / polars_time:.1f}x")

# Test 2: Filtering and sorting
start_time = time.time()
pandas_filtered = df_pandas[df_pandas['value'] > 0.5].sort_values('timestamp')
pandas_filter_time = time.time() - start_time

start_time = time.time()
polars_filtered = df_polars.filter(pl.col('value') > 0.5).sort('timestamp')
polars_filter_time = time.time() - start_time

print(f"\nFiltering and sorting:")
print(f"Pandas time: {pandas_filter_time:.3f} seconds")
print(f"Polars time: {polars_filter_time:.3f} seconds")
print(f"Polars speedup: {pandas_filter_time / polars_filter_time:.1f}x")

Example: Polars Lazy Evaluation

import polars as pl

# Polars supports lazy evaluation for better performance
df = pl.scan_csv("large_dataset.csv")  # Lazy - doesn't load data yet

# Chain operations lazily
result = (
    df
    .filter(pl.col("age") > 25)
    .group_by("department")
    .agg([
        pl.col("salary").mean().alias("avg_salary"),
        pl.col("employee_id").count().alias("employee_count")
    ])
    .sort("avg_salary", descending=True)
    .collect()  # Only now does the computation happen
)

print("Lazy evaluation allows Polars to optimize the entire query plan")
print("This can result in significant performance improvements")

By considering Polars as an alternative to Pandas, you can potentially speed up your Python notebooks, especially for large-scale data processing tasks.

Profile Your Code

Identify performance bottlenecks using Python’s built-in profiling tools such as %timeit and %prun IPython magic commands. This will help you focus your optimization efforts on the most time-consuming parts of your code.

Example: Using %timeit for Performance Measurement

import numpy as np
import pandas as pd

# Create sample data
data = np.random.rand(100000)
df = pd.DataFrame({'values': data})

# Compare different approaches using %timeit
# Method 1: Using numpy
%timeit result1 = np.sqrt(data)

# Method 2: Using pandas
%timeit result2 = df['values'].apply(np.sqrt)

# Method 3: Using pandas vectorized operation
%timeit result3 = np.sqrt(df['values'])

# In a separate cell, you can also time multi-line code blocks with the
# %%timeit cell magic (it must be the first line of the cell)
%%timeit
temp_data = data.copy()
result = []
for x in temp_data:
    result.append(x ** 0.5)

Example: Using %prun for Detailed Profiling

def slow_function():
    """A function with multiple performance bottlenecks"""
    # Bottleneck 1: Inefficient loop
    result1 = []
    for i in range(100000):
        result1.append(i ** 2)
    
    # Bottleneck 2: Inefficient string concatenation
    text = ""
    for i in range(10000):
        text += str(i)
    
    # Bottleneck 3: Inefficient data structure usage
    data = []
    for i in range(50000):
        if i in data:  # O(n) operation
            continue
        data.append(i)
    
    return len(result1), len(text), len(data)

# Profile the function
%prun slow_function()

# This will show output like:
#      50003 function calls in 2.234 seconds
#
# Ordered by: internal time
#
# ncalls  tottime  percall  cumtime  percall filename:lineno(function)
#      1    1.234    1.234    2.234    2.234 <ipython>:15(slow_function)
#  50000    0.543    0.000    0.543    0.000 {method 'append' of 'list'}
#  10000    0.234    0.000    0.234    0.000 {built-in method builtins.str}

Example: Using line_profiler for Line-by-Line Analysis

# First install line_profiler: !pip install line_profiler

%load_ext line_profiler

def optimize_me(data):
    # Line 1: Create empty list
    results = []
    
    # Line 2: Loop through data (potential bottleneck)
    for item in data:
        # Line 3: Expensive computation
        processed = item ** 2 + 2 * item + 1
        # Line 4: Append to list
        results.append(processed)
    
    # Line 5: Return results
    return results

# Profile line by line
%lprun -f optimize_me optimize_me(list(range(100000)))

# This shows time spent on each line:
# Line #      Hits         Time  Per Hit   % Time
#      1         1          1.0      1.0      0.0
#      2    100001      45234.0      0.5     23.4
#      3    100000     123456.0      1.2     64.2
#      4    100000      23456.0      0.2     12.2
#      5         1          1.0      1.0      0.0

Example: Memory Profiling with memory_profiler

# Install memory profiler: !pip install memory-profiler

%load_ext memory_profiler

def memory_intensive_function():
    # Create large data structures
    big_list = [i for i in range(1000000)]  # ~40MB
    big_dict = {i: i**2 for i in range(500000)}  # ~20MB
    
    # Process data
    processed = [x * 2 for x in big_list]
    
    return len(processed)

# Monitor memory usage
%memit memory_intensive_function()

# Line-by-line memory profiling
%mprun -f memory_intensive_function memory_intensive_function()
# Note: %mprun typically requires the profiled function to be defined in a
# .py file and imported into the notebook, rather than defined in a cell

  • %timeit: The %timeit magic command is used to measure the execution time of a single statement or expression. It runs the given code multiple times, calculates the average execution time, and provides the results in a human-readable format. %timeit is helpful for quickly assessing the performance of different solutions or algorithms, allowing you to compare their efficiency.

  • %prun: The %prun magic command is used to profile your Python code by collecting detailed statistics about the execution time and the number of calls for each function. This information helps you identify performance bottlenecks, enabling you to focus your optimization efforts on the most time-consuming parts of your code.

    %prun provides a comprehensive report that includes the total execution time, the time spent on each function call, and the number of times each function was called. This command is particularly useful for analyzing complex code with multiple function calls.

Pre-generate Docker Images to avoid repeated Package Installation

Installing Python packages can be time-consuming, particularly when working with large and complex libraries like TensorFlow or PyTorch. Pre-generating Docker images can help you circumvent this issue by bundling all the necessary packages and dependencies in a single, reusable container image. By doing so, you can avoid installing packages each time you run your notebook, thus speeding up the execution process.

To create a custom Docker image, follow these steps (a minimal example follows the list):

1. Create a Dockerfile in your project directory.

2. Build the Docker image.

3. Run your Python notebook using the custom Docker image.
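
As an illustration, here is a minimal sketch of such a setup, assuming a JupyterLab-based workflow; the base image, requirements.txt contents, notebooks/ directory, and image tag (my-notebook-env) are placeholders for your own project:

Example: A Minimal Dockerfile for a Notebook Environment

# Dockerfile
FROM python:3.11-slim
WORKDIR /app

# Install the heavy dependencies once, at image build time
# (requirements.txt should list jupyterlab plus your other libraries)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy your notebooks into the image
COPY notebooks/ ./notebooks/

# Start JupyterLab when the container runs
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]

Then build the image and run your notebook server from it:

docker build -t my-notebook-env .
docker run -p 8888:8888 my-notebook-env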

By using pre-generated Docker images, you can significantly reduce the time spent on package installation and create a consistent environment for your Python notebooks. This approach ensures that your notebooks run smoothly and quickly, allowing you to focus on your work without worrying about package management or dependency issues.

Conclusion

In conclusion, optimizing Python notebooks is crucial for enhancing productivity, reducing execution time, and improving the overall user experience.

Each of these strategies addresses different aspects of notebook performance, such as computational speed, memory efficiency, and hardware utilization. By applying these techniques, you can significantly improve your Python notebook’s performance and tackle the challenges associated with large datasets and complex computations.

It’s important to remember that no single strategy will be a silver bullet for all scenarios. Each situation requires a critical evaluation of the specific challenges and trade-offs, and a tailored approach to optimization. By understanding the underlying principles and techniques presented in this blog post, you’ll be well-equipped to make informed decisions and optimize your Python notebooks for a more streamlined and efficient workflow.

Happy coding!