Description
The feature_names property in xrocket/block.py incorrectly pairs patterns with channels and thresholds. This causes features to be mislabeled with the wrong channel information, making it impossible to correctly interpret which input channels contribute to each feature.
Impact
- Features are labeled with incorrect channel information
- Features that should use only one channel are labeled as using a different channel
- This breaks any downstream analysis that relies on knowing which channels each feature uses
- In my case, features that were labeled as using an all-zero channel actually had variance, which revealed the bug
Root Cause
The bug is in xrocket/block.py in the feature_names property (around lines 142-159):
Current (incorrect) implementation:
for pattern, channels, threshold in zip(
    self.conv.patterns * self.num_combinations * self.num_thresholds,
    self.mix.combinations * self.num_thresholds,
    self.thresholds.thresholds,
)

Problem: This uses zip() over repeated lists, which creates incorrect pairings. The features are actually generated in nested order (pattern → channels → threshold), but multiplying a list tiles the whole list rather than repeating each element in place, so zip() pairs the entries linearly and drifts out of sync with that nesting.
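A toy example, independent of XRocket and using made-up two-element lists, illustrates how zip() over tiled lists diverges from the nested generation order (the real list lengths in XRocket differ, but the pairing problem is the same):

patterns = ["p0", "p1"]
channels = ["c0", "c1"]
thresholds = [0.1, 0.2]

# Order in which features are actually generated: pattern -> channels -> threshold
nested = [(p, c, t) for p in patterns for c in channels for t in thresholds]

# zip over tiled lists, mimicking the pairing logic in feature_names
zipped = list(zip(
    patterns * len(channels) * len(thresholds),
    channels * len(thresholds),
    thresholds,
))

print(nested[:4])  # [('p0', 'c0', 0.1), ('p0', 'c0', 0.2), ('p0', 'c1', 0.1), ('p0', 'c1', 0.2)]
print(zipped)      # [('p0', 'c0', 0.1), ('p1', 'c1', 0.2)] - wrong pairings, wrong length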
Minimal Reproducible Example
import torch
import numpy as np
from xrocket import XRocket

# Create data with 2 channels: one random, one all zeros
np.random.seed(42)
data = []
for _ in range(5):
    sample = np.zeros((2, 100))
    sample[0, :] = np.random.randn(100)  # Channel 0: random
    sample[1, :] = 0.0                   # Channel 1: zeros
    data.append(torch.FloatTensor(sample))

# Initialize and fit XRocket
rocket = XRocket(in_channels=2, max_kernel_span=100, combination_order=1,
                 feature_cap=100, kernel_length=3, max_dilations=2)
rocket.fit(data[0].unsqueeze(0))

# Generate embeddings
embeddings = np.array([rocket(x.unsqueeze(0)).numpy().squeeze() for x in data])

# Check features labeled as using "only channel 1" (the zero channel)
for i in range(min(20, embeddings.shape[1])):
    feature_name = rocket.feature_names[i]
    channels_str = feature_name[2]  # String like "[1.0, 0.0]" or "[0.0, 1.0]"
    if "[0.0, 1.0]" in channels_str:  # Labeled as using only channel 1
        values = embeddings[:, i]
        variance = np.var(values)
        print(f"Feature {i}: channels={channels_str}, variance={variance:.6f}")
        if variance > 1e-9:
            print(f"  ❌ Has variance despite zero-channel label - feature_names is WRONG!")

Expected: Features labeled as using only the zero channel should have variance = 0.0
Actual: These features have non-zero variance, proving they don't actually use the channel they're labeled with.
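A broader check than the first 20 features, reusing the rocket and embeddings variables from the example above (the 1e-9 tolerance is an arbitrary choice), counts how many zero-channel-labeled features still vary across samples:

# Count features that claim to use only the all-zero channel yet vary across samples
suspicious = 0
for i in range(embeddings.shape[1]):
    if "[0.0, 1.0]" in rocket.feature_names[i][2]:  # labeled as channel-1-only
        if np.var(embeddings[:, i]) > 1e-9:
            suspicious += 1
print(f"{suspicious} features are labeled as zero-channel-only but have non-zero variance")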
Proposed Fix
Replace the zip-based approach with proper nested loops in the feature_names property:
@property
def feature_names(self) -> list[tuple]:
    """(pattern, dilation, channels, threshold) tuples to identify features."""
    assert self.is_fitted, "module needs to be fitted for thresholds to be named"
    feature_names = []
    for pattern in self.conv.patterns:
        for channels in self.mix.combinations:
            for threshold in self.thresholds.thresholds:
                feature_names.append((
                    str(pattern),
                    self.dilation,
                    str(channels),
                    f"{threshold:.4f}",
                ))
    return feature_names

Verification
After applying the fix:
- Run the minimal example above
- Features labeled as using only the zero channel should now have variance = 0.0 (see the check sketched below)
- The feature_names should correctly match the actual feature generation order
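A complementary sanity check, sketched here under the same setup (it reuses torch, np, data, and rocket from the reproducible example), perturbs only channel 0 and asserts that features labeled as channel-1-only do not react:

# Perturbing channel 0 must not change any feature labeled as using only channel 1
x = data[0].unsqueeze(0)
base = rocket(x).numpy().squeeze()
x_perturbed = x.clone()
x_perturbed[0, 0, :] += torch.randn(100)  # modify channel 0 only
perturbed = rocket(x_perturbed).numpy().squeeze()

for i, name in enumerate(rocket.feature_names[:len(base)]):
    if "[0.0, 1.0]" in name[2]:  # labeled as channel-1-only
        assert np.isclose(base[i], perturbed[i]), f"feature {i} reacts to channel 0"
print("All channel-1-only features are invariant to channel 0 - labels look correct.")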
Environment
- XRocket version: Commit 1511e81
- Python version: 3.11.9
- PyTorch version: 2.2.2