
Why choosing with size (len(dataset) * portion) / K / round #8

Open
JessePrince opened this issue Dec 27, 2024 · 0 comments
JessePrince commented Dec 27, 2024

Hi, thanks for the interesting work!

I'm reading the code and there is a detail I couldn't understand.

When selecting data from each cluster, the corresponding code is:

size = (len(dataset) * portion) / K / round
exp_reward_diff = merged_df["exp_reward_diff"]
# random select from K clusters, with p as the weight.
select_new_iter = np.random.choice(
    K, size=int(size), p=exp_reward_diff, replace=True
)
# Count how many times a cluster is chosen
selected_clusters_size = Counter(select_new_iter)

remaining_dataset = dataset.select(set(range(len(dataset))) - set(selected_indices))
remaining_dataset_df = remaining_dataset.to_pandas()

new_indices = []
for i in range(K):
    # get current indices in the remaining dataset
    indices = remaining_dataset_df[remaining_dataset_df["cluster"] == i]["index"]
    # adjust size if the selected size exceeds the remaining size
    size = min(selected_clusters_size[i], len(indices))
    # pick real samples from each cluster
    indices = np.random.choice(indices, size=size, replace=False)
    new_indices.extend(indices)
new_indices = np.array(new_indices)
# update the selected samples
new_indices = np.concatenate([selected_indices, new_indices])

If I understand correctly, in each iteration the chosen size is (len(dataset) * portion) / K / round. The code then draws cluster indices with the given weights, uses Counter to count how many samples should come from each cluster, and the subsequent for loop picks the actual samples from the K clusters. This yields (len(dataset) * portion) / K / round samples in total. But in the paper, the size for each iteration should be $b_{it} = \frac{b}{N}$, so I would guess there should be no division by K?
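To make the budget arithmetic concrete, here is a minimal sketch (with made-up numbers for the dataset size, portion, K, and round; the weights are also hypothetical) showing that the weighted draw produces int(size) samples in total across all clusters, not int(size) per cluster:

```python
import numpy as np
from collections import Counter

# Toy numbers, not the repo's actual data.
dataset_len, portion, K, n_round = 1000, 0.2, 4, 2
size = (dataset_len * portion) / K / n_round  # 25.0 draws this iteration

rng = np.random.default_rng(0)
# Cluster weights must sum to 1 for the `p` argument of choice().
p = np.array([0.4, 0.3, 0.2, 0.1])

# Draw cluster indices with weights, then count draws per cluster.
select_new_iter = rng.choice(K, size=int(size), p=p, replace=True)
selected_clusters_size = Counter(select_new_iter)

# The total across all K clusters is int(size) = 25, i.e. the budget
# has already been divided by K (and by round) before sampling.
assert sum(selected_clusters_size.values()) == int(size)
print(dict(selected_clusters_size))
```

So if the per-iteration budget in the paper is b/N, the extra division by K in the code would shrink each iteration's selection by a factor of K.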

It would be great if you could help me understand this detail.
