I can't reach the benchmark accuracy, could somebody help? #38

Open
GranMin opened this issue Aug 12, 2021 · 18 comments

Comments

@GranMin

GranMin commented Aug 12, 2021

[Image: test results at loss 19.42]
I use the same training and test datasets as you proposed, but the best result I have got so far is the one shown in the image above.
I used the SGD optimizer with lr = 0.1, 0.05, 0.01, 0.0001, 0.00001, one epoch per learning rate. When I saw the loss increasing rather than decreasing, I stopped training. The test results at loss 19.42 are shown in the image above.
In addition, the test results at a training loss of 21.15 are shown in the image below.
[Image: test results at training loss 21.15]

@Androsimus

@GranMin which backbone do you use?
I tried MobileNetV2 and got results similar to the author's.
I used a constant learning rate of 0.01 for about 10 epochs and reached a loss of about 10.
I think you should first try a schedule similar to mine and decrease the lr only after that (and maybe not so fast). Otherwise your model doesn't have enough time to use relatively large gradient steps to decrease the loss.
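
The schedule suggested here (hold the learning rate constant first, decay only later) could be sketched as below. Only lr = 0.01 and the roughly 10-epoch hold come from this thread; `hold_epochs`, `decay`, `step`, and `min_lr` are illustrative assumptions, and the function is framework-agnostic, not the repository's own schedule.

```python
def learning_rate(epoch, base_lr=0.01, hold_epochs=10, decay=0.1, step=5, min_lr=1e-5):
    """Hold base_lr for hold_epochs, then multiply by `decay` every `step` epochs.

    All defaults except base_lr and hold_epochs are assumptions for illustration.
    """
    if epoch < hold_epochs:
        # Constant phase: keep the relatively large learning rate.
        return base_lr
    # Number of decay steps applied so far (the first one at epoch == hold_epochs).
    n_drops = 1 + (epoch - hold_epochs) // step
    # Never go below min_lr.
    return max(base_lr * decay ** n_drops, min_lr)


for epoch in range(0, 25, 5):
    print(epoch, learning_rate(epoch))
```

Compared with dropping the rate after every epoch, this keeps the large rate for the whole hold phase, which is the point made in the comment above.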

@GranMin
Author

GranMin commented Aug 13, 2021

@Androsimus I used ResNet50. I will try it and reply soon. Thanks for your advice.

@GranMin
Author

GranMin commented Aug 18, 2021

Latest update: I use ResNet50 with the SGD optimizer and lr = 0.01. After 30 epochs I reached 99.08 on LFW and 92 on AgeDB-30, as shown below.
[Image: evaluation results after 30 epochs]

@Androsimus

@GranMin Maybe 30 epochs is too many and you got some overfitting?
Do you have earlier checkpoints from epochs 10-15? If so, try evaluating them on the validation datasets.
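
One way to run the check suggested here is to score every saved checkpoint and keep the best one. The sketch below is only an illustration: `select_checkpoint` and the `evaluate` callback are hypothetical names, and the scores are placeholders, not results from this thread.

```python
def select_checkpoint(checkpoints, evaluate):
    """Score each checkpoint with `evaluate` and return (best_name, scores).

    `evaluate` is a hypothetical callback that maps a checkpoint name to a
    validation accuracy (for example, accuracy on LFW).
    """
    scores = {name: evaluate(name) for name in checkpoints}
    best = max(scores, key=scores.get)
    return best, scores


# Placeholder scores for illustration only, NOT measured results.
placeholder_scores = {"epoch_10": 0.5, "epoch_20": 0.75, "epoch_30": 0.625}

best, scores = select_checkpoint(placeholder_scores, placeholder_scores.get)
print("best checkpoint:", best)
```

If an earlier checkpoint scores higher than the last one, that would support the overfitting explanation.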

@GranMin
Author

GranMin commented Aug 19, 2021

@Androsimus I also feel that it is too many epochs. But at epochs 10-15 the loss is about 20 and the accuracy on LFW is about 98.

@Androsimus

@GranMin This is very strange. Maybe you changed some other parameters, such as the ArcFace margin or scale?
There are other issues where people reported good results using ResNet50 on this repository's original dataset.

@GranMin
Author

GranMin commented Aug 19, 2021

@Androsimus I didn't change any other parameters. I also tried NormHead for one epoch and then switched to ArcHead. Surprisingly, after just one epoch with ArcHead the loss dropped to about 11. But then the same thing happened: the loss increases a little at the beginning of each epoch and then decreases, but very slowly. Like this:
[Image: training loss curve]

@Androsimus

Androsimus commented Aug 19, 2021

@GranMin I'm not sure how NormHead is supposed to be used. Maybe as a warmup.
But NormHead and ArcHead are completely different. As far as I understand, NormHead is for ordinary classification. If so, the classification problem is much easier, which is why you reach a low loss much faster. With ArcHead, with its margin and scale parameters and its different concept, the classification problem becomes harder.
But I don't understand your situation: on the one hand you reported a loss of ~20 at epochs 10-15, on the other hand after the first epoch you had a loss <11, which then decreased slowly...

To sum up: I used only ArcHead, and I suppose other people did too. So I suggest trying ArcHead only.

@GranMin
Author

GranMin commented Aug 19, 2021

@Androsimus In fact, I have trained the model twice recently.
The first time, I used only ArcHead and trained at lr = 0.01 with SGD; after 10-15 epochs the loss was ~20. In the end I trained for about 30 epochs to get a loss of ~6.5 and an accuracy of 99.06 on LFW.
The second time, I tried NormHead for one epoch and then changed to ArcHead. After one epoch with ArcHead the loss was down to 10. But as I described in my last comment, the decrease then slowed down.
As for the mathematical difference between the two heads: NormHead just uses softmax to ensure correct classification, and it does not work so well at the boundary between classes. ArcHead enforces an angular margin (theta) between classes so that two classes do not border each other.
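
The difference described here can be illustrated with a minimal, pure-Python sketch of an additive angular margin. It is not the repository's NormHead or ArcHead: the function names are hypothetical, `margin=0.5` and `scale=64.0` are assumed values, and the handling real implementations add for angles where theta + margin exceeds pi is left out.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def margin_logits(embedding, class_weights, label, margin=0.5, scale=64.0):
    """Return scale * cos(theta + margin) for the target class, scale * cos(theta) otherwise."""
    e = normalize(embedding)
    logits = []
    for i, w in enumerate(class_weights):
        cos_t = sum(a * b for a, b in zip(e, normalize(w)))
        if i == label:
            # Penalize the target class by widening its angle to the embedding.
            theta = math.acos(max(-1.0, min(1.0, cos_t)))
            cos_t = math.cos(theta + margin)
        logits.append(scale * cos_t)
    return logits


def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for one sample."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


weights = [[1.0, 0.0], [0.0, 1.0]]  # two toy class centers
embedding = [1.0, 0.8]              # closer to class 0, but near the boundary

plain_loss = cross_entropy(margin_logits(embedding, weights, 0, margin=0.0), 0)
margin_loss = cross_entropy(margin_logits(embedding, weights, 0, margin=0.5), 0)
print("loss without margin:", plain_loss)
print("loss with margin:   ", margin_loss)
```

For the same embedding the loss with the margin is much larger, which is consistent with the observation in this thread that the loss falls faster with a plain softmax head than with ArcHead.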

@Androsimus

@GranMin
If you look at this post and #4 (comment) and its thread, you will see big differences from your results. Strange.
Nevertheless, if you still have your old logs and checkpoints from training with only ArcHead, try to find the checkpoint that corresponds to a loss of about 8-9 and evaluate it on the validation datasets.

@GranMin
Author

GranMin commented Aug 19, 2021

Results:
The first image is for loss 9.16 and the second for loss 8.00.
[Image 1: evaluation results at loss 9.16]
[Image 2: evaluation results at loss 8.00]

@Androsimus

@GranMin this is some mystery )

@Androsimus

Androsimus commented Aug 20, 2021

@GranMin could you post your *.yaml config file?
Anyway, if you figure out the reason for this strange training behavior, please write about it.

@Androsimus

Androsimus commented Aug 27, 2021

@GranMin Here is one idea. For correct inference the model must be called as
model(input, training=False)
The author didn't do that for some reason.
So you can try adding training=False to the model call in the perform_val function in modules/evaluations.py.
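
Why such a flag can matter is easiest to see with a layer like batch normalization, which normalizes with batch statistics in training mode and with stored moving averages in inference mode. The toy class below is a pure-Python illustration of that switch under assumed defaults (`momentum`, `eps`); it is not Keras code and not taken from the repository.

```python
class ToyBatchNorm:
    """Scalar batch normalization that mimics the training/inference switch."""

    def __init__(self, momentum=0.9, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.moving_mean = 0.0
        self.moving_var = 1.0

    def __call__(self, batch, training=False):
        if training:
            # Training mode: normalize with the statistics of the current batch
            # and update the moving averages.
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            self.moving_mean = self.momentum * self.moving_mean + (1 - self.momentum) * mean
            self.moving_var = self.momentum * self.moving_var + (1 - self.momentum) * var
        else:
            # Inference mode: use the stored moving averages, leave them unchanged.
            mean, var = self.moving_mean, self.moving_var
        return [(x - mean) / (var + self.eps) ** 0.5 for x in batch]


bn = ToyBatchNorm()
batch = [10.0, 12.0, 14.0]
train_out = bn(batch, training=True)   # depends on the other samples in the batch
eval_out = bn(batch, training=False)   # depends only on the stored statistics
print(train_out)
print(eval_out)
```

The same input gives different outputs in the two modes, so embeddings computed in the wrong mode may not match what the model produces at inference time.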

@GranMin
Author

GranMin commented Aug 30, 2021

@Androsimus Sorry for the long wait. Haha... I was on vacation at the Jiuzhai Gou nature reserve last week.
I talked to my teacher and learned that they had extended my dataset with some Asian faces.
I re-downloaded the dataset the author provided and got this result after just 2 epochs, using only the ArcFace head.
[Image: evaluation results after 2 epochs]
Next I plan to train ResNet152 from scratch. The experience of using softmax first may also help.
Again, thank you for your advice. Best wishes, my friend!

@GranMin
Author

GranMin commented Aug 30, 2021

And after 5 epochs, this is the final result:
[Image: evaluation results after 5 epochs]

@Androsimus

@GranMin glad you got nice results :)
Best wishes!

@xalbertoisorna

@Androsimus In fact, I have trained the model twice recently. The first time, I used only ArcHead and trained at lr = 0.01 with SGD; after 10-15 epochs the loss was ~20. In the end I trained for about 30 epochs to get a loss of ~6.5 and an accuracy of 99.06 on LFW. The second time, I tried NormHead for one epoch and then changed to ArcHead. After one epoch with ArcHead the loss was down to 10. But as I described in my last comment, the decrease then slowed down. As for the mathematical difference between the two heads: NormHead just uses softmax to ensure correct classification, and it does not work so well at the boundary between classes. ArcHead enforces an angular margin (theta) between classes so that two classes do not border each other.

How do you switch heads (NormHead to ArcHead)? I tried it but got this error:
raise ValueError(
ValueError: Cannot assign value to variable ' conv2d/bias:0': Shape mismatch.The variable shape (24,), and the assigned value shape (32,) are incompatible.
