refactor: Speed up function _serialize_dataframe by 123% in PR #6044 (refactor-serialization) #6078

Merged
24 commits merged into main from codeflash/optimize-pr6044-2025-02-03T12.09.54
Feb 3, 2025

Conversation

codeflash-ai[bot]
Contributor

@codeflash-ai codeflash-ai bot commented Feb 3, 2025

⚡️ This pull request contains optimizations for PR #6044

If you approve this dependent PR, these changes will be merged into the original PR branch refactor-serialization.

This PR will be automatically closed if the original PR is merged.


📄 123% (1.23x) speedup for _serialize_dataframe in src/backend/base/langflow/serialization/serialization.py

⏱️ Runtime: 23.9 milliseconds → 10.7 milliseconds (best of 141 runs)
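The two runtimes are consistent with the reported 123% figure if the speedup is computed as time saved divided by the optimized runtime. The sketch below assumes that formula (it is not quoted from this page) and uses a hypothetical helper name, `speedup_percent`. It also shows that the optimized code is roughly 2.23 times as fast, which is the same result expressed as a ratio of runtimes.

```python
def speedup_percent(original_ms: float, optimized_ms: float) -> float:
    """Relative speedup in percent: time saved divided by the optimized runtime.

    Assumed formula; it reproduces the 123% reported above.
    """
    return (original_ms - optimized_ms) / optimized_ms * 100.0


original_ms, optimized_ms = 23.9, 10.7  # runtimes reported in this PR
pct = speedup_percent(original_ms, optimized_ms)
ratio = original_ms / optimized_ms  # how many times as fast the new code runs

print(f"{pct:.0f}% speedup, {ratio:.2f}x as fast")
```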

📝 Explanation and details

The primary optimization removes the redundant .apply() call and truncates values directly, which is faster.

Changes made:

  1. Removed redundant apply calls: the original code used nested apply calls, which can be very slow on larger DataFrames. The new implementation first converts the DataFrame to a list of dictionaries and then truncates values where needed.
  2. Optimized truncation logic: truncation is applied while iterating over the dictionaries produced by the conversion, which reduces overhead and improves readability.

These changes should improve the runtime of the serialization process, especially for larger DataFrames.
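The diff itself is not shown in this conversation, so the following is only a minimal sketch of the two shapes described above, not the merged implementation. The function names `serialize_with_apply` and `serialize_with_records` are hypothetical, and `_truncate_value` is modeled on the helper that appears in the generated tests.

```python
import pandas as pd


def _truncate_value(value, max_length):
    """Truncate strings longer than max_length; leave other values untouched."""
    if max_length is not None and isinstance(value, str) and len(value) > max_length:
        return value[:max_length]
    return value


def serialize_with_apply(df, max_length, max_items):
    """Hypothetical 'before' shape: nested apply over rows, then over cells."""
    if max_items is not None:
        df = df.head(max_items)
    truncated = df.apply(
        lambda row: row.apply(lambda v: _truncate_value(v, max_length)), axis=1
    )
    return truncated.to_dict(orient="records")


def serialize_with_records(df, max_length, max_items):
    """Hypothetical 'after' shape: convert once, then truncate plain Python values."""
    if max_items is not None:
        df = df.head(max_items)
    records = df.to_dict(orient="records")
    if max_length is None:
        return records  # nothing to truncate
    return [
        {key: _truncate_value(value, max_length) for key, value in row.items()}
        for row in records
    ]


df = pd.DataFrame({"A": [1, 2], "B": ["abcdef", "ghijkl"]})
print(serialize_with_records(df, 3, None))
```

The second version avoids building an intermediate Series for every row, which is where most of the overhead of the nested `apply` pattern tends to come from.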

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 38 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | undefined |
🌀 Generated Regression Tests Details
import numpy as np
import pandas as pd
# imports
import pytest  # used for our unit tests
from langflow.serialization.serialization import _serialize_dataframe


# function to test
def _truncate_value(value, max_length, max_items):
    """Helper function to truncate values based on max_length."""
    if max_length is not None and isinstance(value, str) and len(value) > max_length:
        return value[:max_length]
    return value

# unit tests

def test_basic_functionality_no_limits():
    """Test simple DataFrame without truncation or row limits."""
    df = pd.DataFrame({'A': [1, 2], 'B': ['abc', 'def']})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1, 'B': 'abc'}, {'A': 2, 'B': 'def'}]

def test_basic_functionality_max_items():
    """Test DataFrame with max_items specified."""
    df = pd.DataFrame({'A': [1, 2, 3], 'B': ['abc', 'def', 'ghi']})
    codeflash_output = _serialize_dataframe(df, None, 2)
    expected = [{'A': 1, 'B': 'abc'}, {'A': 2, 'B': 'def'}]

def test_basic_functionality_max_length():
    """Test DataFrame with max_length specified."""
    df = pd.DataFrame({'A': [1, 2], 'B': ['abcdef', 'ghijkl']})
    codeflash_output = _serialize_dataframe(df, 3, None)
    expected = [{'A': 1, 'B': 'abc'}, {'A': 2, 'B': 'ghi'}]

def test_empty_dataframe():
    """Test empty DataFrame."""
    df = pd.DataFrame()
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = []

def test_single_row_dataframe():
    """Test single row DataFrame."""
    df = pd.DataFrame({'A': [1], 'B': ['abc']})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1, 'B': 'abc'}]

def test_single_column_dataframe():
    """Test single column DataFrame."""
    df = pd.DataFrame({'A': [1, 2, 3]})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1}, {'A': 2}, {'A': 3}]

def test_mixed_data_types():
    """Test DataFrame with mixed data types."""
    df = pd.DataFrame({'A': [1, 2], 'B': [3.0, 4.5], 'C': ['abc', 'def']})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1, 'B': 3.0, 'C': 'abc'}, {'A': 2, 'B': 4.5, 'C': 'def'}]

def test_none_values():
    """Test DataFrame with None values."""
    df = pd.DataFrame({'A': [1, None], 'B': ['abc', None]})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1, 'B': 'abc'}, {'A': None, 'B': None}]

def test_nan_values():
    """Test DataFrame with NaN values."""
    df = pd.DataFrame({'A': [1, np.nan], 'B': ['abc', np.nan]})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1, 'B': 'abc'}, {'A': np.nan, 'B': np.nan}]

def test_special_characters():
    """Test DataFrame with special characters in strings."""
    df = pd.DataFrame({'A': [1, 2], 'B': ['a\nb\tc', 'd\te\nf']})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1, 'B': 'a\nb\tc'}, {'A': 2, 'B': 'd\te\nf'}]

def test_large_dataframe():
    """Test large DataFrame for performance and scalability."""
    df = pd.DataFrame({'A': range(1000), 'B': ['x' * 1000] * 1000})
    codeflash_output = _serialize_dataframe(df, None, None)

def test_large_string_values():
    """Test DataFrame with very long string values."""
    df = pd.DataFrame({'A': [1, 2], 'B': ['x' * 1000, 'y' * 1000]})
    codeflash_output = _serialize_dataframe(df, 10, None)
    expected = [{'A': 1, 'B': 'xxxxxxxxxx'}, {'A': 2, 'B': 'yyyyyyyyyy'}]

def test_boundary_max_length_zero():
    """Test boundary value for max_length=0."""
    df = pd.DataFrame({'A': [1, 2], 'B': ['abc', 'def']})
    codeflash_output = _serialize_dataframe(df, 0, None)
    expected = [{'A': 1, 'B': ''}, {'A': 2, 'B': ''}]

def test_boundary_max_items_zero():
    """Test boundary value for max_items=0."""
    df = pd.DataFrame({'A': [1, 2], 'B': ['abc', 'def']})
    codeflash_output = _serialize_dataframe(df, None, 0)
    expected = []

def test_non_dataframe_input():
    """Test non-DataFrame input."""
    with pytest.raises(AttributeError):
        _serialize_dataframe([1, 2, 3], None, None)

def test_negative_max_length():
    """Test negative value for max_length."""
    df = pd.DataFrame({'A': [1, 2], 'B': ['abc', 'def']})
    codeflash_output = _serialize_dataframe(df, -1, None)
    expected = [{'A': 1, 'B': 'abc'}, {'A': 2, 'B': 'def'}]

def test_negative_max_items():
    """Test negative value for max_items."""
    df = pd.DataFrame({'A': [1, 2], 'B': ['abc', 'def']})
    codeflash_output = _serialize_dataframe(df, None, -1)
    expected = [{'A': 1, 'B': 'abc'}, {'A': 2, 'B': 'def'}]

def test_multiindex_dataframe():
    """Test DataFrame with MultiIndex."""
    index = pd.MultiIndex.from_tuples([('a', 1), ('b', 2)], names=['first', 'second'])
    df = pd.DataFrame({'A': [1, 2]}, index=index)
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'first': 'a', 'second': 1, 'A': 1}, {'first': 'b', 'second': 2, 'A': 2}]

def test_datetime_dataframe():
    """Test DataFrame with datetime objects."""
    df = pd.DataFrame({'A': [pd.Timestamp('20230101'), pd.Timestamp('20230102')]})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': pd.Timestamp('20230101')}, {'A': pd.Timestamp('20230102')}]
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

import pandas as pd
# imports
import pytest  # used for our unit tests
from langflow.serialization.serialization import _serialize_dataframe


# function to test
def _truncate_value(value, max_length, max_items):
    if isinstance(value, str) and max_length is not None:
        return value[:max_length]
    return value

# unit tests

# Basic Functionality
def test_serialize_basic():
    df = pd.DataFrame({'A': [1, 2], 'B': ['foo', 'bar']})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1, 'B': 'foo'}, {'A': 2, 'B': 'bar'}]

def test_serialize_max_items():
    df = pd.DataFrame({'A': list(range(10)), 'B': ['foo']*10})
    codeflash_output = _serialize_dataframe(df, None, 5)
    expected = [{'A': i, 'B': 'foo'} for i in range(5)]

def test_serialize_max_length():
    df = pd.DataFrame({'A': [1, 2], 'B': ['foobarbaz', 'quxquux']})
    codeflash_output = _serialize_dataframe(df, 3, None)
    expected = [{'A': 1, 'B': 'foo'}, {'A': 2, 'B': 'qux'}]

# Edge Cases
def test_serialize_empty_dataframe():
    df = pd.DataFrame()
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = []

def test_serialize_single_row():
    df = pd.DataFrame({'A': [1], 'B': ['foo']})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1, 'B': 'foo'}]

def test_serialize_single_column():
    df = pd.DataFrame({'A': [1, 2, 3]})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1}, {'A': 2}, {'A': 3}]

def test_serialize_mixed_data_types():
    df = pd.DataFrame({'A': [1, 2.5, 'three'], 'B': [None, 'foo', 3.14]})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1, 'B': None}, {'A': 2.5, 'B': 'foo'}, {'A': 'three', 'B': 3.14}]

# Large DataFrames
def test_serialize_large_dataframe():
    df = pd.DataFrame({'A': list(range(1000)), 'B': ['foo']*1000})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': i, 'B': 'foo'} for i in range(1000)]

def test_serialize_large_string_values():
    df = pd.DataFrame({'A': [1, 2], 'B': ['a'*1000, 'b'*1000]})
    codeflash_output = _serialize_dataframe(df, 10, None)
    expected = [{'A': 1, 'B': 'a'*10}, {'A': 2, 'B': 'b'*10}]

# Special Characters and Unicode
def test_serialize_special_characters():
    df = pd.DataFrame({'A': [1, 2], 'B': ['@foo', '#bar#']})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1, 'B': '@foo'}, {'A': 2, 'B': '#bar#'}]

def test_serialize_unicode_characters():
    df = pd.DataFrame({'A': [1, 2], 'B': ['😊', '🚀']})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1, 'B': '😊'}, {'A': 2, 'B': '🚀'}]

# Null and NaN Values
def test_serialize_nan_values():
    df = pd.DataFrame({'A': [1, 2, float('nan')], 'B': [float('nan'), 'foo', 'bar']})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1, 'B': float('nan')}, {'A': 2, 'B': 'foo'}, {'A': float('nan'), 'B': 'bar'}]

def test_serialize_none_values():
    df = pd.DataFrame({'A': [1, 2, None], 'B': [None, 'foo', 'bar']})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1, 'B': None}, {'A': 2, 'B': 'foo'}, {'A': None, 'B': 'bar'}]

# Boundary Conditions
def test_serialize_exact_max_items():
    df = pd.DataFrame({'A': list(range(5)), 'B': ['foo']*5})
    codeflash_output = _serialize_dataframe(df, None, 5)
    expected = [{'A': i, 'B': 'foo'} for i in range(5)]

def test_serialize_exact_max_length():
    df = pd.DataFrame({'A': [1, 2], 'B': ['foo', 'bar']})
    codeflash_output = _serialize_dataframe(df, 3, None)
    expected = [{'A': 1, 'B': 'foo'}, {'A': 2, 'B': 'bar'}]

# Nested DataFrames
def test_serialize_nested_values():
    df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], {'key': 'value'}]})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1, 'B': [1, 2]}, {'A': 2, 'B': {'key': 'value'}}]

# DataFrame with Custom Index
def test_serialize_custom_index():
    df = pd.DataFrame({'A': [1, 2], 'B': ['foo', 'bar']}, index=['row1', 'row2'])
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1, 'B': 'foo'}, {'A': 2, 'B': 'bar'}]

# DataFrame with Duplicates
def test_serialize_duplicate_rows():
    df = pd.DataFrame({'A': [1, 1], 'B': ['foo', 'foo']})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1, 'B': 'foo'}, {'A': 1, 'B': 'foo'}]

# DataFrame with Date and Time
def test_serialize_datetime_values():
    df = pd.DataFrame({'A': [1, 2], 'B': [pd.Timestamp('2023-01-01'), pd.Timestamp('2023-01-02')]})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1, 'B': pd.Timestamp('2023-01-01')}, {'A': 2, 'B': pd.Timestamp('2023-01-02')}]
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
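The generated tests contain no assertions because equivalence is checked by comparing the output of the original and the optimized function. A minimal sketch of such a comparison follows. It is not Codeflash's actual comparator, and the names `values_equal`, `outputs_match`, `make_original`, and `make_optimized` are hypothetical. It treats two NaN values as equal, which matters for the NaN test cases above because plain `==` reports distinct NaN objects as different.

```python
import math


def values_equal(a, b):
    """Recursive equality that treats two float NaNs as equal."""
    if isinstance(a, float) and isinstance(b, float) and math.isnan(a) and math.isnan(b):
        return True
    if isinstance(a, dict) and isinstance(b, dict):
        return a.keys() == b.keys() and all(values_equal(a[k], b[k]) for k in a)
    if isinstance(a, list) and isinstance(b, list):
        return len(a) == len(b) and all(values_equal(x, y) for x, y in zip(a, b))
    return a == b


def outputs_match(original_fn, optimized_fn, *args):
    """Run both implementations on the same arguments and compare the results."""
    return values_equal(original_fn(*args), optimized_fn(*args))


# Stand-ins for the original and optimized functions; each call builds a fresh NaN.
def make_original():
    return [{"A": 1, "B": "abc"}, {"A": float("nan"), "B": "def"}]


def make_optimized():
    return [{"A": 1, "B": "abc"}, {"A": float("nan"), "B": "def"}]


print(make_original() == make_optimized())           # plain equality
print(outputs_match(make_original, make_optimized))  # NaN-aware equality
```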

ogabrielluiz and others added 23 commits January 31, 2025 12:03
…lize method for improved clarity and maintainability
…unction for improved consistency and clarity
…tems_length for improved handling of outputs, logs, messages, and artifacts
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
…actor-serialization`)

@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Feb 3, 2025
@dosubot dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. enhancement New feature or request labels Feb 3, 2025
Base automatically changed from refactor-serialization to main February 3, 2025 15:22
@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Feb 3, 2025
@dosubot dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Feb 3, 2025
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Feb 3, 2025
@ogabrielluiz ogabrielluiz changed the title ⚡️ Speed up function _serialize_dataframe by 123% in PR #6044 (refactor-serialization) refactor: Speed up function _serialize_dataframe by 123% in PR #6044 (refactor-serialization) Feb 3, 2025
@ogabrielluiz ogabrielluiz added this pull request to the merge queue Feb 3, 2025
Merged via the queue into main with commit d676aef Feb 3, 2025
28 of 36 checks passed
@ogabrielluiz ogabrielluiz deleted the codeflash/optimize-pr6044-2025-02-03T12.09.54 branch February 3, 2025 16:03