
Why choosing with size (len(dataset) * portion) / K / round #8

Open
JessePrince opened this issue Dec 27, 2024 · 0 comments
JessePrince commented Dec 27, 2024

Hi, thanks for the interesting work!

I'm reading the code and there is a detail I couldn't understand.

When selecting data from each cluster, the corresponding code is:

size = (len(dataset) * portion) / K / round
exp_reward_diff = merged_df["exp_reward_diff"]
# random select from K clusters, with p as the weight.
select_new_iter = np.random.choice(
    K, size=int(size), p=exp_reward_diff, replace=True
)
# Count how many times a cluster is chosen
selected_clusters_size = Counter(select_new_iter)

remaining_dataset = dataset.select(set(range(len(dataset))) - set(selected_indices))
remaining_dataset_df = remaining_dataset.to_pandas()

new_indices = []
for i in range(K):
    # get current indices in the remaining dataset
    indices = remaining_dataset_df[remaining_dataset_df["cluster"] == i]["index"]
    # adjust size if the selected size exceeds the remaining size
    size = min(selected_clusters_size[i], len(indices))
    # pick real samples from each cluster
    indices = np.random.choice(indices, size=size, replace=False)
    new_indices.extend(indices)
new_indices = np.array(new_indices)
# update the selected samples
new_indices = np.concatenate([selected_indices, new_indices])

If I understand correctly, in each iteration the chosen size is (len(dataset) * portion) / K / round. The code then draws cluster indices with the given weights, uses Counter to count how many samples should come from each cluster, and the subsequent for loop picks the actual samples from the K clusters. This yields (len(dataset) * portion) / K / round samples in total. But in the paper, the size for each iteration should be $b_{it} = \frac{b}{N}$, so I would guess there should be no division by K?
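To make the budget arithmetic concrete, here is a minimal sketch (with made-up numbers for the dataset size, portion, K, and round; the weights are also hypothetical) showing that the weighted draw produces int(size) samples in total across all clusters, not int(size) per cluster:

```python
import numpy as np
from collections import Counter

# Toy numbers, not the repo's actual data.
dataset_len, portion, K, n_round = 1000, 0.2, 4, 2
size = (dataset_len * portion) / K / n_round  # 25.0 draws this iteration

rng = np.random.default_rng(0)
# Cluster weights must sum to 1 for the `p` argument of choice().
p = np.array([0.4, 0.3, 0.2, 0.1])

# Draw cluster indices with weights, then count draws per cluster.
select_new_iter = rng.choice(K, size=int(size), p=p, replace=True)
selected_clusters_size = Counter(select_new_iter)

# The total across all K clusters is int(size) = 25, i.e. the budget
# has already been divided by K (and by round) before sampling.
assert sum(selected_clusters_size.values()) == int(size)
print(dict(selected_clusters_size))
```

So if the per-iteration budget in the paper is b/N, the extra division by K in the code would shrink each iteration's selection by a factor of K.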

It would be great if you could help me understand this detail.
