@Chiwendaiyue
Contributor

  • closes BUG: DataFrame.astype leave the dataframe extremely fragmented (one block per column) #63433
    This PR fixes the issue by consolidating blocks of the same dtype after a dictionary-based astype operation. The fix is minimal and I think it is safe. The alternative I considered was to determine and apply the correct block partitioning during the initial conversion.
    It adds a call to _consolidate_inplace() on the result's BlockManager when dtype is a dict.
    The consolidation is wrapped in a try-except block that emits a warning on failure, so it never breaks the core functionality of astype. A failure does not raise, which keeps the change backward compatible.
    I tried the Reproducible Example and it worked well. If there is any problem, I'm happy to fix it.
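
For readers who want to observe the fragmentation described above, here is a minimal sketch rather than the code from this PR. It counts blocks through `df._mgr.nblocks`, which is a private pandas attribute and may change between versions; the helper name `n_blocks` is hypothetical. It relies on `DataFrame.copy()` returning a consolidated frame, which is the workaround pandas suggests in its fragmentation `PerformanceWarning`. The block count after the dict-based `astype` depends on the pandas version, so it is only printed.

```python
import numpy as np
import pandas as pd


def n_blocks(df):
    """Return the number of internal blocks backing ``df``.

    Hypothetical helper: ``_mgr`` and ``nblocks`` are private pandas
    internals and are used here for inspection only.
    """
    return df._mgr.nblocks


# A frame built from one 2D float64 array starts out as a single block.
df = pd.DataFrame(np.zeros((4, 8)), columns=[f"c{i}" for i in range(8)])

# Dict-based astype: the case discussed in the linked issue.
casted = df.astype({col: "float32" for col in df.columns})

# A deep copy consolidates blocks of the same dtype.
defragmented = casted.copy()

print("blocks before astype:    ", n_blocks(df))
print("blocks after dict astype:", n_blocks(casted))
print("blocks after copy:       ", n_blocks(defragmented))
```

The copy is exactly the extra cost raised later in the review: the data is converted once and then copied a second time to merge the blocks.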

@Chiwendaiyue
Contributor Author

I've implemented a fix that consolidates blocks only when the block count explodes (currently, when the number of blocks equals the number of columns). I'm unsure whether this threshold is optimal; it feels somewhat subjective. Could a maintainer suggest a better criterion? Thanks!

Member

@rhshadrach rhshadrach left a comment

Thanks for the PR!

Comment on lines +6530 to +6534
    warnings.warn(
        f"astype block consolidation failed: {type(e).__name__}",
        UserWarning,
        stacklevel=2,
    )
Member

In what case can this fail?

    total_cols = len(self.columns)
    # only when the number of blocks explode do this
    if current_blocks == total_cols and total_cols > 5:
        mgr._consolidate_inplace()
Member

This is still creating a very fragmented DataFrame and then performing a copy. We would prefer not fragmenting the DataFrame at all in the first place (I think this should be possible).
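
One possible reading of this suggestion, sketched at the user level rather than inside pandas internals: group the columns by their target dtype and cast each group in a single call, instead of converting one column at a time. The helper name `astype_grouped` is hypothetical and is not part of pandas or of this PR. The sketch assumes unique column labels, and whether the result ends up with fewer blocks depends on pandas internals, so it only checks that the values and dtypes match a plain dict-based `astype`.

```python
from collections import defaultdict

import numpy as np
import pandas as pd


def astype_grouped(df, dtype_map):
    """Cast columns group by group, one group per target dtype.

    Hypothetical helper for illustration; assumes unique column labels.
    """
    # Collect the columns that share a target dtype.
    groups = defaultdict(list)
    for col in df.columns:
        if col in dtype_map:
            groups[pd.api.types.pandas_dtype(dtype_map[col])].append(col)

    # Cast each group with a single astype call.
    pieces = [df[cols].astype(dtype) for dtype, cols in groups.items()]

    # Columns missing from the mapping are carried over unchanged.
    untouched = [col for col in df.columns if col not in dtype_map]
    if untouched:
        pieces.append(df[untouched])

    if not pieces:
        return df.copy()

    # Reassemble the pieces and restore the original column order.
    out = pd.concat(pieces, axis=1)
    return out[list(df.columns)]


df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("abcd"))
mapping = {"a": "float32", "c": "float32", "b": "int16"}
result = astype_grouped(df, mapping)
print(result.dtypes)
```

A fix inside `astype` itself would presumably apply the same grouping at the block level, which is the direction the comment above points to.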

Development

Successfully merging this pull request may close these issues.

BUG: DataFrame.astype leave the dataframe extremely fragmented (one block per column)