Can you provide more details about the dataset dict #5

Open
ascetic-monk opened this issue Jan 8, 2025 · 6 comments

@ascetic-monk

Thanks for your work. I found that the results are quite sensitive to the randomly selected order of the novel classes. Could you provide more details about the `online_dataset_dict.txt` of the 4 datasets for better alignment? Additionally, I would also like to confirm that the results shown in your paper are the "soft" ones rather than the "hard" ones.

@mashijie1028
Owner

Hi!

For generic datasets (e.g., C100, IN100, Tiny-ImageNet), the class order follows the natural order. For fine-grained datasets (e.g., CUB and Aircraft), we use ssb_splits (https://github.com/sgvaze/SSB).

Specifically, for C100 and IN100, the stage-0 labeled classes are 0-49, and the model then continually learns classes 50-99. For Tiny, the labeled classes are 0-99, while the unlabeled ones are 100-199. Importantly, we use np.random.seed(0) to shuffle the unlabeled classes for all datasets, but not the labeled classes (for example, here).
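
For reference, here is a minimal sketch of what that split looks like (class counts follow the comment above; the variable names are illustrative, not the repo's exact code):

```python
import numpy as np

# Toy sketch (not the repo's exact code): labeled classes keep their natural
# order, while the unlabeled (novel) classes are shuffled with a fixed seed.
# Class counts follow the comment above (C100 / IN100: 50 labeled + 50 unlabeled).
num_labeled, num_total = 50, 100

labeled_classes = list(range(num_labeled))               # 0..49, natural order
unlabeled_classes = np.arange(num_labeled, num_total)    # 50..99

np.random.seed(0)                     # fixed seed, as mentioned above
np.random.shuffle(unlabeled_classes)  # only the unlabeled class order is shuffled
print(unlabeled_classes.tolist())     # class order across the incremental stages
```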

Additionally, for IN100 we first subsample 100 classes from IN-1k, also with np.random.seed(0), as shown here.
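
The IN100 subsampling step could be sketched as follows (illustrative only; the repo's exact code is at the link above):

```python
import numpy as np

# Illustrative sketch of the IN100 construction: pick 100 of the 1000
# ImageNet-1k classes with a fixed seed (the repo's exact code may differ).
np.random.seed(0)
in100_classes = np.random.choice(range(1000), size=(100,), replace=False)
print(sorted(in100_classes.tolist()))
```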

In short, you can reproduce the reported results for C100, IN100, and Tiny using the released code. For CUB, please refer to this issue; we have released the logs there. Note that we use ssb_splits for CUB, so please adjust the released code in get_datasets.py accordingly.

@mashijie1028
Owner

By the way, what do 'soft' and 'hard' results mean?

@ascetic-monk
Author

Thank you for your reply. We are concerned that the generated random numbers may differ across hardware, so we hope to align the splits exactly. Additionally, we have run multiple reproductions and found that the fluctuations in the "all" metric are relatively small, while those in the "unseen" metric are larger. We will continue trying to adjust the random seed in PyTorch to reproduce the results under the current settings (sketched below).
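
For reference, this is the kind of seeding we are experimenting with (a sketch based on common PyTorch practice, not the repo's code):

```python
import random

import numpy as np
import torch


def set_seed(seed: int = 0) -> None:
    """Fix the common sources of randomness (a sketch, not the repo's code)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Full determinism may additionally require:
    #   torch.backends.cudnn.deterministic = True
    #   torch.backends.cudnn.benchmark = False
    # and results can still differ across hardware / CUDA versions.
```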

In line 297 of `train_happy.py`, you listed two metrics: "hard" and "soft". According to our understanding, these correspond to calculations under two different partitions. We would like to know whether the main experimental results you reported are based on the "soft" metric.

@mashijie1028
Owner

Hi! Thanks for the clarification!

Yes, in our paper we report the "soft" metric. "Hard" treats only the initially labeled (Stage-0) classes as old. "Soft" dynamically treats the classes first seen at the current stage as new, and all previously seen classes (the initial classes plus the previously discovered new classes) as old. As a result, "soft" better reflects the dynamic nature of continual learning and is thus more reasonable.
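
A toy illustration of the two partitions at a given stage (class indices and per-stage counts are illustrative only):

```python
# Toy illustration (assumed C100-style setup: 50 labeled classes, 10 new per stage).
initial_classes = set(range(50))          # Stage-0 labeled classes
discovered = {1: set(range(50, 60)),      # classes first seen at stage 1
              2: set(range(60, 70))}      # classes first seen at stage 2

stage = 2
# "hard": only the initially labeled classes count as old.
hard_old = initial_classes
hard_new = discovered[1] | discovered[2]

# "soft": everything seen before the current stage counts as old.
soft_old = initial_classes | discovered[1]
soft_new = discovered[stage]
```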

As for the fluctuations in the "unseen" metric, we attribute the unstable results to the evaluation protocol of category discovery (clustering). Conventionally, GCD runs the Hungarian algorithm once over all classes to obtain the best All Acc. At different learning steps, the optimal correspondence produced by the Hungarian algorithm might differ, leading to fluctuations in "unseen". I think this is an inherent issue of GCD evaluation. You could run the experiments several times and report the average results.
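
For concreteness, the usual clustering-accuracy evaluation looks roughly like the sketch below (the common GCD protocol, not necessarily the exact code in this repo):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cluster_acc(y_true, y_pred):
    """All Acc via a single Hungarian matching over all classes
    (sketch of the common GCD evaluation, not necessarily this repo's code)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    num = max(y_pred.max(), y_true.max()) + 1
    w = np.zeros((num, num), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        w[p, t] += 1                                   # count cluster p vs. class t
    rows, cols = linear_sum_assignment(w.max() - w)    # maximize matched samples
    # The induced cluster -> class mapping can change between learning stages,
    # which is what makes per-subset accuracies (e.g., "unseen") fluctuate.
    return w[rows, cols].sum() / y_pred.size
```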

@mashijie1028
Owner

As for the detailed dataset dicts, there are currently some issues with the GPU server.

I will upload them later; please stay tuned.

@ascetic-monk
Author

> As for the detailed dataset dicts, there are currently some issues with the GPU server. I will upload them later; please stay tuned.

Thanks a lot. By the way, from my experiments over the past few days I have found another reason for the instability of the unseen results. As you noted, the reported results are selected according to the best "all" accuracy, but the unseen score contributes little to the "all" accuracy. Therefore, the "best" checkpoint often occurs early in training, when the new classes have not yet been sufficiently learned.
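
A toy calculation (the numbers are made up, just to illustrate the effect):

```python
# Toy numbers (illustrative only): suppose 80% of the evaluation samples
# belong to seen classes and 20% to unseen classes.
frac_seen, frac_unseen = 0.8, 0.2

epoch_a = dict(seen=0.90, unseen=0.30)   # an early epoch
epoch_b = dict(seen=0.82, unseen=0.60)   # a later epoch

all_a = frac_seen * epoch_a["seen"] + frac_unseen * epoch_a["unseen"]  # 0.780
all_b = frac_seen * epoch_b["seen"] + frac_unseen * epoch_b["unseen"]  # 0.776
# Selecting by the best "all" accuracy picks the early epoch A,
# even though its unseen accuracy is much lower.
```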
