<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>Harshit Gaur</title>
<link>https://harshit2000.github.io/</link>
<atom:link href="https://harshit2000.github.io/index.xml" rel="self" type="application/rss+xml" />
<description>Harshit Gaur</description>
<generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><copyright>Harshit Gaur, 2020 ©</copyright><lastBuildDate>Tue, 25 Aug 2020 19:19:35 +0530</lastBuildDate>
<image>
<url>https://harshit2000.github.io/images/icon_hu616effff6bc497e1f3ccd40e4a444d66_14554_512x512_fill_lanczos_center_2.png</url>
<title>Harshit Gaur</title>
<link>https://harshit2000.github.io/</link>
</image>
<item>
<title>How Google bowled me over with a Googly</title>
<link>https://archana1998.github.io/post/summer-school-sumup/</link>
<pubDate>Tue, 25 Aug 2020 19:19:35 +0530</pubDate>
<guid>https://archana1998.github.io/post/summer-school-sumup/</guid>
<description><p>I recently got the opportunity to attend the AI Summer School conducted by Google Research India. I was one of the 150 people selected to attend it, out of over 75,000 applications. Probably one of my most noteworthy achievements till date, if not the most (?). I remember screaming after getting the acceptance mail for over two hours, it was the happiest I have been in a while. I was selected as part of the Computer Vision track, and I was elated as I knew absolutely nothing about the other two tracks (Natural Language Understanding and AI for Social Good)</p>
<p>The summer school happened over three days, between August 20 and 22, 2020. Due to the pandemic that taught us that we can do everything over a computer screen, the summer school was held in a virtual mode. The people at Google made us feel very welcome, and sent out a batch of goodies from Google to all the participants (side note: I&rsquo;m a little salty about this as I haven&rsquo;t gotten mine yet, it got lost on the way). Saying I loved the experience would be an understatement of sorts, I was constantly elated after each and every event.</p>
<p>Day 1 started off with a keynote by Jeff Dean, the head of Google AI Research, at 9 am. Waking up so early was a huge achievement for me in a quarantine-home restricted environment where I sleep late and wake up late. Working remotely at a lab in a different country provides insane flexibility, I am my most productive in the afternoons and evenings. I sat in front of my computer and tuned into the YouTube live stream which was engaging and amazing (see my <a href="https://archana1998.github.io/post/opening-keynote/"> post</a>)
<figure id="figure-opening-keynote">
<img data-src="https://archana1998.github.io/post/summer-school-sumup/1_hu44f229a9c87ce6272342d7409ef1f45d_159695_2000x2000_fit_lanczos_2.png" class="lazyload" alt="" width="1478" height="1108">
<figcaption data-pre="Figure " data-post=":" class="numbered">
Opening Keynote
</figcaption>
</figure>
</p>
<p>After a lunch break, we had our first lecture by Jean-Phillipe Vert, which had so much rigorous math that we were slightly intimidated, however it was a pleasure being taught by someone so amazing all the same. (shameless plug to <a href="https://archana1998.github.io/post/jean-vert/"> post</a> again).</p>
<p>We had an amazing panel discussion that was titled <b>Why Choose a Career in Research</b>. The panel consisted of eminent names from Google Research. We had a &ldquo;virtual social&rdquo; after that on GatherTown, which was not the easiest to use on Day 1, but it was quite an experience. We had a second lecture after that by Neil Houlsby, finally on computer vision (I loved it, here&rsquo;s my <a href="https://archana1998.github.io/post/neil-houlsby/"> post</a>). And just like that, I was done for the day and had learnt more in these 6 hours than I did in the last semester.</p>
<p>Day 2 started off well with a lovely talk by Vineet Gupta, on math again :(. But this was nice math, easy to understand and follow and talked about very interesting theoretical math for machine learning that provided very promising results in optimization (once again, here&rsquo;s my <a href="https://archana1998.github.io/post/vineet-gupta/"> post</a>). We had a social before lunch once again, and I got to meet and greet with a lot of people this time, having finally understood how to use the GatherTown UI. I interacted with a lot of my fellow attendees and the Google Lab members, it was super fun.</p>
<p>After lunch, we had our first computer vision-centric lecture by Cristian Sminchisescu that was BEAUTIFUL. The fact that it perfectly aligned to my research interests was a cherry on top of the cake. (<a href="https://archana1998.github.io/post/cristian-sminchisescu/"> post</a> again). We had a panel discussion titled &ldquo;AI For India&rdquo; after that, which was insightful as well. I was done with my second day of the school, and had learned more than I did in half of my math degree.</p>
<p>Day 3 had lovely lectures, by Rahul Sukthankar and Arsha Nagrani who were so, so, good at presenting their work! Rahul&rsquo;s lecture was simple but beautifully presented, and I loved it! (here&rsquo;s my <a href="https://archana1998.github.io/post/rahul-sukthankar/"> post</a>). Arsha&rsquo;s talk was about some very interesting research that&rsquo;s probably going to revolutionize multimodal learning (last time, here&rsquo;s my <a href ="https://archana1998.github.io/post/arsha-nagrani/"> post </a>)
The summer school concluded with a closing keynote delivered by Manish Gupta, the director of Google Research India, who talked about opportunities in Google Research for us. We then had socials that lasted two hours (last day, woohoo) and I, who had mastered navigating GatherTown by then was a proper social butterfly, talking to everyone and anyone and sending connection requests on LinkedIn to stay in touch.</p>
<p>That was it! <em>curtain closes</em> The experience was CRAZY, and I never knew I could learn so much in just three days. More than learning new concepts, I got an insight into how these amazing people conduct cutting edge research, and the fact that we have to learn so much to get there was a little inspirational too. I was jumping with happiness and rambled nonstop about how much fun I had to my family and my friends, thankfully for me, they shared my enthusiasm :)
I can&rsquo;t wait to experience more things like this in the future!</p>
<p>PS: Still waiting for my goodies @Google. Thanks again for a lovely time.</p>
</description>
</item>
<item>
<title>Summer School Series: Lecture 6 by Arsha Nagrani</title>
<link>https://archana1998.github.io/post/arsha-nagrani/</link>
<pubDate>Tue, 25 Aug 2020 18:36:00 +0530</pubDate>
<guid>https://archana1998.github.io/post/arsha-nagrani/</guid>
<description><p>This final lecture was delivered by <a href ="http://www.robots.ox.ac.uk/~arsha/">Arsha Nagrani</a>, a recent Ph.D. graduate from Oxford University&rsquo;s VGG group, and an incoming research scientist at Google Research. Her talk was called <b>Multimodality for Video Understanding</b>.</p>
<h3 id="video-understanding">Video Understanding</h3>
<p>Videos provide us with far more information than images. Multimodal refers to many mediums for learning, here it can be time, sound and speech. Videos are all around us (30k newly created content videos are uploaded to YouTube every <b>hour</b>).
However, these have high dimensionality and are difficult to process and annotate.</p>
<h4 id="complementarity-among-signals">Complementarity among signals</h4>
<ul>
<li>Vision (scene)</li>
<li>Sound (content of speech)</li>
</ul>
<h4 id="redundancy-between-signals">Redundancy between signals</h4>
<p>Redundancy helps recognize a person (face + voice), and can thus be a useful form of weak supervision. The redundant information comes from background sounds, foreground audio, signals identified from speech and the content of speech.</p>
<p>Thus, the best way to exploit the multimodal nature of videos is to work with both the complementarity and the redundancy.</p>
<h4 id="suitable-tasks">Suitable tasks</h4>
<p>Suitable tasks for video understanding are:</p>
<ol>
<li>Video classification</li>
</ol>
<ul>
<li>single label</li>
<li>infinite number of possible classes</li>
<li>ambiguity in the label space</li>
</ul>
<ol start="2">
<li>Action recognition: more fine grained, the motion is important, human centric</li>
</ol>
<p>It is important to note that labelling actions in videos is extremely expensive and existing models do not generalize well to new domains.</p>
<p>In this context, can we use speech as a form of supervision? For example, narrated video clips and lifestyle Vlogs.</p>
<h3 id="movies">Movies</h3>
<p>General domain of movies: people speak about their actions. However, sometimes speech is completely unrelated, giving us noise. We need to learn when speech matches action. An example of work in this field is <a href ="https://arxiv.org/abs/1912.06430">End-to-End Learning of Visual Representations from Uncurated Instructional Videos</a>. This work reduces noise by using the MIL-NCE loss.</p>
<p>Can we first train a model to recognize actions and then see if it should be used for supervision? An interesting discovery Arsha made was using Movie Screenplays, that contain both speech segments and scene directions with actions. Using this:</p>
<ul>
<li>We can obtain speech-action pairs</li>
<li>Retrieve speech segments with verbs</li>
<li>Train the <a href="https://www.robots.ox.ac.uk/~vgg/research/speech2action/">Speech2Action</a> model to predict action, with a BERT-Backbone (movie scripts scraped from IMSDB)</li>
<li>Apply to closed captions of unlabelled videos</li>
<li>Apply to large movie corpus</li>
</ul>
<figure id="figure-speech2action-model">
<img data-src="https://archana1998.github.io/post/arsha-nagrani/1_hu3dfb93f108aa0e1c42c1039d245e09c2_102757_2000x2000_fit_lanczos_2.png" class="lazyload" alt="" width="784" height="358">
<figcaption data-pre="Figure " data-post=":" class="numbered">
Speech2Action model
</figcaption>
</figure>
<p>The Speech2Action model recognizes rare actions, and is a visual classifier on weakly labelled data (S3D-G model with cross-entropy loss)</p>
<p>Evaluation is done on the AVA and HMDB-51 (transfer learning) datasets. It gets abstract actions like <b>count</b> and <b>follow</b> too.</p>
<h3 id="multimodal-complementarity">Multimodal Complementarity</h3>
<p>This refers to fusing info from multiple modalities for video text retrieval, like:</p>
<ul>
<li>Finding video corresponding to text queries</li>
<li>More to videos than just actions like object, scene etc.</li>
</ul>
<p>Supervisions:</p>
<ul>
<li>It&rsquo;s not easy to get the complete combination of captions, this is a very subjective task</li>
<li>Need extremely large datasets</li>
</ul>
<p>What Arsha does is rely on expert models trained for different tasks like object detection, face detection, action recognition, OCR etc. These are all applied to the video and features are extracted. The framework is a joint video text embedding, with the video encoder + text query encoder = joint embedding space (similarity should be really high if related). It is necessary for the video encoder to be discriminative and retain specific information.</p>
<h3 id="collaborative-gating">Collaborative Gating</h3>
<p>For each expert, generate attention mask by looking at the other experts <a href = "https://bmvc2019.org/wp-content/uploads/papers/0363-paper.pdf"> (Use What You Have: Video Retrieval Using Representations From Collaborative Experts, BMVC 2019)</a></p>
<ul>
<li>Trained using bi-directional max margin ranking loss</li>
<li>Adding in more experts massively increases performance</li>
<li>Main boost is from the object embeddings</li>
</ul>
<figure id="figure-collaborative-gating">
<img data-src="https://archana1998.github.io/post/arsha-nagrani/2_huc44033e910d0af756e6885ccfb6b6932_14850_2000x2000_fit_lanczos_2.png" class="lazyload" alt="" width="150" height="197">
<figcaption data-pre="Figure " data-post=":" class="numbered">
Collaborative Gating
</figcaption>
</figure>
<p>Another paper that Arsha discussed was <a href ="https://arxiv.org/abs/2007.10639"> Multi-modal Transformer for Video Retrieval, ECCV 2020 </a>. This takes features extracted at different timestamps for each expert and aggregates them into the embeddings. The expert and temporal embeddings are summed.</p>
<h3 id="conclusion">Conclusion</h3>
<ul>
<li>More modalities is better (because more complementarity)</li>
<li>Time (modelling time along with modalities is interesting, some modalities train faster than the others)</li>
<li>Mid fusion is better than late (Attention truly is what you need)</li>
<li>Our world is multimodal, it doesn&rsquo;t make sense to work with modalities in isolation</li>
<li>Use the redundant and complementary information from vision, audio and speech to massively reduce annotations</li>
</ul>
<p><b>Open Research Questions:</b></p>
<ol>
<li>Extended Temporal Sequences (beyond 10s):</li>
</ol>
<ul>
<li>Backprop + memory restricts current video architectures to 64 frames</li>
<li>For longer sequences we rely on pre-extracted features</li>
<li>Need new datasets to drive innovation</li>
</ul>
<ol start="2">
<li>Moving away from supervision: is an upper bound on self-supervision being approached?</li>
<li>The world is multimodal: how do we design good fusion architectures?</li>
</ol>
<p>Arsha thus concluded a fantastic talk that described the cutting-edge research that her team at Oxford and Google is conducting. It was tremendously insightful and inspirational.</p>
</description>
</item>
<item>
<title>Summer School Series: Lecture 5 by Rahul Sukthankar</title>
<link>https://archana1998.github.io/post/rahul-sukthankar/</link>
<pubDate>Tue, 25 Aug 2020 17:19:29 +0530</pubDate>
<guid>https://archana1998.github.io/post/rahul-sukthankar/</guid>
<description><p>This Lecture was presented by <a href="https://research.google/people/RahulSukthankar/">Rahul Sukthankar</a>, a research scientist at Google Research and an Adjunct Professor at Carnegie Mellon University. It was titled <b>Deep Learning in Computer Vision</b>.</p>
<h3 id="popular-computer-vision-tasks">Popular Computer Vision tasks</h3>
<p>Some popular tasks in the domain of computer vision include:</p>
<ul>
<li>Image Classification (assign to one class)</li>
<li>Image Labelling/Object Recognition (multiple classes)</li>
<li>Object Detection/Localization (predicts bounding box+label, works well for objects but not for fuzzy concepts)</li>
<li>Semantic Segmentation (Pixel level dense labelling)</li>
<li>Image Captioning (Description of image in text)</li>
<li>Human Body Part Segmentation</li>
<li>Human Pose Estimation (predicting 2D pose keypoints)</li>
<li>Generating 3D Human Pose and Body Models from an image</li>
<li>Depth Prediction from a single image (foreground and background semantic segmentation based on a heatmap)</li>
<li>3D Scene Understanding</li>
<li>Autonomous navigation</li>
</ul>
<p>While thinking about a particular problem statement:</p>
<ul>
<li>What task are we considering? (semantic segmentation, classification, object detection, etc.)</li>
<li>What is the output? (binary yes/no, bounding box, label/pixel etc)</li>
<li>How is the training data labelled? (Fully Supervised/Weakly or Cross-Modal/Self-supervised)</li>
<li>Architecture: Usually a Convolutional Neural Network, but what is the final layer?</li>
<li>What loss function do we use?</li>
</ul>
<p>CMU Navlabs (30 years ago) built a self steering car only with an artificial neural network, in the pre-CNN era (<a href= "https://papers.nips.cc/paper/95-alvinn-an-autonomous-land-vehicle-in-a-neural-network.pdf"> ALVINN: AN AUTONOMOUS LAND VEHICLE IN A NEURAL NETWORK</a>)</p>
<h3 id="convolutional-neural-networks">Convolutional Neural Networks</h3>
<p>A convolutional neural network is structured as input + conv, ReLU and pooling layers (the hidden layers), followed by flatten, fully connected and softmax layers (for classification). Key concepts behind CNNs are:</p>
<ul>
<li>Local connectivity (not connected to every pixel, but just a few)</li>
<li>Shared weights (translational invariance)</li>
<li>Pooling (reduces dimensions; the local patch each filter sees becomes effectively bigger)</li>
<li>Filter stride (cuts down weights, reduces computations)</li>
<li>Multiple feature maps</li>
</ul>
<p>It is essential to choose the right conv layer, pooling layer, activation function, loss function, optimization and regularization methods, etc.</p>
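<p>To make the structure above concrete, here is a minimal sketch of such a network in PyTorch (not from the lecture; the layer sizes and the 3x32x32 input are arbitrary choices for illustration):</p>
<pre><code class="language-python">import torch
import torch.nn as nn

# Minimal sketch of the classic structure: conv + ReLU + pooling blocks,
# then flatten + fully connected + softmax for classification.
# Sizes are arbitrary and assume a 3-channel 32x32 input image.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # local connectivity, shared weights
    nn.ReLU(),
    nn.MaxPool2d(2),                             # pooling reduces spatial dimensions
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                   # final layer: one logit per class
    nn.Softmax(dim=1),                           # class probabilities
)

x = torch.randn(1, 3, 32, 32)
print(model(x).shape)  # torch.Size([1, 10])
</code></pre>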
<h4 id="convolutions">Convolutions</h4>
<ul>
<li>2D vs 3D convolutions: 3D convolutions are used to capture patterns across 3 dimensions, for example Video Understanding and Medical Imaging.</li>
<li>1x1 convolution: weighted average across the channel axis, a feature pooling technique to reduce dimensions</li>
<li>Other types of convolutions are dilated convolutions, regular vs depth wise separable convolutions, grouped convolutions (AlexNet uses it, it reduces computation)</li>
</ul>
<h3 id="famous-architectures">Famous architectures</h3>
<ol>
<li>
<p>Inceptionv1 (2014):
<figure id="figure-inception-v1">
<img data-src="https://archana1998.github.io/post/rahul-sukthankar/1_hu0095bbad7d2e55015cc682d2a4670f59_127543_2000x2000_fit_lanczos_2.png" class="lazyload" alt="" width="762" height="294">
<figcaption data-pre="Figure " data-post=":" class="numbered">
Inception v1
</figcaption>
</figure>
</p>
</li>
<li>
<p>ResNet:</p>
</li>
</ol>
<p>ResNet uses skip connections with residual blocks; the added paths help solve vanishing gradient problems and give a shorter route for backpropagation</p>
<figure id="figure-residual-blocks">
<img data-src="https://archana1998.github.io/post/rahul-sukthankar/2_hub70c3df55df6f4f2b8fb68ff07d9a5f0_58255_2000x2000_fit_lanczos_2.png" class="lazyload" alt="" width="557" height="332">
<figcaption data-pre="Figure " data-post=":" class="numbered">
Residual blocks
</figcaption>
</figure>
<h3 id="object-detection-in-images">Object Detection in Images</h3>
<ul>
<li>Object Classification: Task of identifying a picture is a dog</li>
<li>Object Localization: Involves finding class labels as well as a bounding box to show where an object is located</li>
<li>Object Detection: Localizing with box</li>
<li>Semantic Segmentation: Dense pixel labelling</li>
</ul>
<p>There are two ways to do detection:</p>
<ol>
<li>Sliding window approach: computationally expensive and unbalanced</li>
<li>Selective search: guessing promising bounding boxes and selecting the best out of them</li>
</ol>
<ul>
<li>RCNN did this when they extracted region proposals</li>
<li>Fast RCNN did class labelling + bounding box prediction at the same time (softmax + bounding box regression)</li>
</ul>
<p>Bounding box evaluation is commonly done by the Intersection over Union Metric
$$ \text{Intersection over Union} = \frac{\text{Area of Overlap}}{\text{Area of Union}}$$
(Ground truth bounding box and predicted bounding box)</p>
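<p>As a small illustration of the metric (not part of the lecture), here is a sketch that computes IoU for two axis-aligned boxes given as (x1, y1, x2, y2) corners:</p>
<pre><code class="language-python">def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)          # area of overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)               # overlap / union

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, roughly 0.143
</code></pre>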
<h3 id="classic-cnn-vs-fully-convolutional-net">Classic CNN vs Fully Convolutional Net</h3>
<p>A classic CNN comprises conv + fully connected layers, whereas a fully convolutional net contains only convolutional blocks, which lets it keep the same number of weights no matter what the input image size is. An example of a fully convolutional net is the U-Net, which is used extensively for semantic segmentation.</p>
<p>Other applications of a fully convolutional net are: Residual Encoding-Decoding, Dense Prediction, Super-resolution, Colorization (self-supervised)</p>
<h3 id="last-layer-activation-function-and-loss-function-summary">Last-Layer Activation function and Loss Function Summary</h3>
<figure id="figure-functions-to-use">
<img data-src="https://archana1998.github.io/post/rahul-sukthankar/3_hu44f5fe12bf2e14613e9c883ade35e838_85577_2000x2000_fit_lanczos_2.png" class="lazyload" alt="" width="712" height="219">
<figcaption data-pre="Figure " data-post=":" class="numbered">
Functions to use
</figcaption>
</figure>
<p>Any differentiable function can be used as a loss function: even another neural net! (perceptual loss, GAN loss, differentiable renderer etc)</p>
<p>Rahul concluded this introductory lecture focused on Computer Vision using fully supervised deep learning, with key concepts on CNNs and their extensions and the importance of choosing the right loss function. It was a wonderful lecture with all the concepts beautifully explained.</p>
</description>
</item>
<item>
<title>Summer School Series: Lecture 4 by Cristian Sminchisescu</title>
<link>https://archana1998.github.io/post/cristian-sminchisescu/</link>
<pubDate>Tue, 25 Aug 2020 16:11:56 +0530</pubDate>
<guid>https://archana1998.github.io/post/cristian-sminchisescu/</guid>
<description><p>This talk was presented by <a href ="https://research.google/people/CristianSminchisescu/">Cristian Sminchisescu</a>, who is a Research Scientist leading a team at Google, and a Professor at Lund University. His talk was titled <b>End-to-end Generative 3D Human Shape and Pose models, and active human sensing</b></p>
<p>3D Human Sensing has many applications in the fields of animation, sports motion, AR/VR, the medical industry etc. Humans are very complex: the body has 600 muscles, 200 bones and 200 joints. The clothing that humans wear has folds and wrinkles, and there are many different types of garments and cloth-body interactions.</p>
<h3 id="challenges">Challenges</h3>
<p>Typical challenges in 3D human sensing include:</p>
<ul>
<li>High dimensionality, articulation and deformation</li>
<li>Complex appearance variations, clothing and multiple people</li>
<li>Self occlusion or occlusion by scene objects</li>
<li>Observation (depth) uncertainty (especially in monocular images)</li>
<li>Difficult to obtain accurate supervision of humans</li>
</ul>
<p>This is where we can exploit the power of machine and deep learning, we aim to come up with a learning model that:</p>
<ol>
<li>Understands large volumes of data</li>
<li>Connects between images and 3D models</li>
</ol>
<h3 id="problems-that-need-to-be-solved">Problems that need to be solved</h3>
<p>It is imperative to <b>FIND THE PEOPLE</b>. We then need to infer their pose, body shape and clothing. The next step would be to recognize actions, behavioral states and social signals that they make, followed by recognizing what objects they use.</p>
<h3 id="visual-human-models">Visual Human Models</h3>
<p>Different Data types we take into consideration are:</p>
<ul>
<li>Multiple Subjects</li>
<li>Soft Tissue Dynamics</li>
<li>Clothing
This is all fed into the learning model</li>
</ul>
<h3 id="generative-human-modeling">Generative Human Modeling</h3>
<p>Dynamic Human Scans $\mathbf{\xrightarrow[\text{deep learning}]{\text{end to end}}}$ Full Body articulated generative human models. The Dynamic Human Scans are in the form of very dense 3D Point Clouds.</p>
<h3 id="ghum-and-ghuml">GHUM and GHUML</h3>
<p>Cristian then talked about his paper <a href = "https://openaccess.thecvf.com/content_CVPR_2020/papers/Xu_GHUM__GHUML_Generative_3D_Human_Shape_and_Articulated_Pose_CVPR_2020_paper.pdf">GHUM &amp; GHUML: Generative 3D Human Shape and Articulated Pose Models</a>
GHUM is the moderate-resolution generative model with 10168 vertices and GHUML is the light version with 3190 vertices; both share a skeleton that has minimal parameterization and anatomical joint limits.</p>
<p>The model facilitates automatic 3D landmark detection with multiview renderings, 2D landmark detection and 3D landmark triangulation. Automatic registration is able to calculate deformations.</p>
<h3 id="end-to-end-training-pipeline">End to End Training Pipeline</h3>
<figure id="figure-end-to-end-training-pipeline">
<img data-src="https://archana1998.github.io/post/cristian-sminchisescu/1_hu5ce619d08456037865eced6552c42e7e_329781_2000x2000_fit_lanczos_2.png" class="lazyload" alt="" width="500" height="500">
<figcaption data-pre="Figure " data-post=":" class="numbered">
End To End Training Pipeline
</figcaption>
</figure>
<ul>
<li>Once data is mapped to meshes and put into registered format, next step is to encode and decode static shapes (using VAE)</li>
<li>Kinematics is learned using Normalizing Flow model</li>
<li>Mesh filter (mask): to integrate close up scans with models, fed into the optimization step</li>
<li>To train landmarks, we use annotated image data</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>For the variational shape and expression autoencoder, the VAE works better than PCA, with reconstruction error lying between 0-20mm. Motion retargeting and kinematic priors are built by retargeting the models to 2.8M CMU and 2.2M Human3.6M motion capture frames.</p>
<h3 id="normalizing-flows-for-kinematic-priors">Normalizing Flows for Kinematic Priors</h3>
<figure id="figure-normalizing-flows-for-kinematic-priors">
<img data-src="https://archana1998.github.io/post/cristian-sminchisescu/2_huf2e22827a723675b2a79947a570691ce_140050_2000x2000_fit_lanczos_2.png" class="lazyload" alt="" width="869" height="244">
<figcaption data-pre="Figure " data-post=":" class="numbered">
Normalizing Flows for Kinematic Priors
</figcaption>
</figure>
<ul>
<li>A normalizing flow is a sequence of invertible transformations applied to an original distribution</li>
<li>Use a dataset $\mathcal{D}$ of human kinematic poses $\theta$ as statistics for natural human movements</li>
<li>Use normalizing flow to warp the distribution of poses into a simple and tractable density function e.g. $\mathbf{z} \sim \mathcal{N}(0 ; \mathbf{I})$</li>
<li>The flow is bijective, trained by maximizing the data log-likelihood
$$\max _{\phi} \sum _{\theta \in \mathcal{D}} \log p _{\phi}(\theta)$$ (a tiny worked example follows this list)</li>
</ul>
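<p>As a tiny illustration of that training principle (not the GHUM pipeline itself), here is a one-dimensional affine &ldquo;flow&rdquo; $z = (\theta - \mu)/s$ fitted by gradient ascent on the data log-likelihood; the data, step size and iteration count are made up:</p>
<pre><code class="language-python">import numpy as np

# Toy 1-D "flow": z = (theta - mu) / s warps the data towards N(0, 1).
# Change of variables: log p(theta) = log N(z; 0, 1) - log s.
rng = np.random.default_rng(0)
theta = rng.normal(loc=3.0, scale=2.0, size=5000)  # stand-in for pose parameters

mu, log_s = 0.0, 0.0
lr = 0.05
for _ in range(2000):
    s = np.exp(log_s)
    z = (theta - mu) / s
    d_mu = np.mean(z) / s          # gradient of the mean log-likelihood w.r.t. mu
    d_log_s = np.mean(z ** 2) - 1  # ... and w.r.t. log s
    mu += lr * d_mu                # gradient *ascent*: maximize the log-likelihood
    log_s += lr * d_log_s

print(round(mu, 2), round(np.exp(log_s), 2))  # roughly 3.0 and 2.0
</code></pre>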
<h3 id="ghum-and-smpl">GHUM and SMPL</h3>
<figure id="figure-ghum-vs-smpl">
<img data-src="https://archana1998.github.io/post/cristian-sminchisescu/3_hu125e7c529aa96451d645ff796805a88a_150708_2000x2000_fit_lanczos_2.png" class="lazyload" alt="" width="500" height="500">
<figcaption data-pre="Figure " data-post=":" class="numbered">
GHUM vs SMPL
</figcaption>
</figure>
<ul>
<li>GHUM is close (slightly better) to SMPL in skinning visual quality</li>
<li>The vertex point-to-plane error (body-only) is GHUM: 4.23mm and SMPL: 4.96mm</li>
</ul>
<h3 id="conclusions">Conclusions</h3>
<ul>
<li>An effective Deep Learning Pipeline to build generative, articulated 3D human shape models</li>
<li>GHUM and GHUML are two full body human models that are available for research: <a href="https://github.com/google-research/google-research/tree/master/ghum">https://github.com/google-research/google-research/tree/master/ghum</a>.</li>
<li>We can jointly sample shape, facial expressions (VAEs) and pose (normalizing flows)</li>
<li>We have low-res and high-res models that are non-linear (with linear as a special case)</li>
</ul>
<h3 id="other-work">Other Work</h3>
<p>Some other interesting papers that Cristian pointed out were <a href="https://arxiv.org/abs/2003.10350"> Weakly Supervised 3D Human Pose and Shape Reconstruction with Normalizing Flows</a> (ECCV 2020) that works on Full Body Reconstruction in Monocular Images, and <a href ="https://arxiv.org/abs/2008.06910">Neural Descent for Visual 3D Human Pose and Shape </a> (submitted to NeurIPS 2020) that talks about Self-Supervised 3D Human Shape and Pose Estimation.</p>
<h3 id="human-interactions">Human Interactions</h3>
<p>A problem that many 3D deep learning practitioners face is dealing with human interactions during estimation and reconstruction. Contacts are difficult to estimate correctly because of:</p>
<ul>
<li>Uncertainty in 3D monocular depth prediction</li>
<li>Reduced evidence of contact due to occlusion</li>
</ul>
<p>Cristian then talked about his paper <a href="https://openaccess.thecvf.com/content_CVPR_2020/papers/Fieraru_Three-Dimensional_Reconstruction_of_Human_Interactions_CVPR_2020_paper.pdf"> Three-dimensional Reconstruction of Human Interactions </a> and to move towards accurate reconstruction of interactions we need to:</p>
<ul>
<li>Detect contact</li>
<li>Predict contact interaction signatures</li>
<li>3D reconstruction under contact constraints
<figure id="figure-modelling-interactions">
<img data-src="https://archana1998.github.io/post/cristian-sminchisescu/4_hud88f9b9f04f1033960e9f73ca89377df_196249_2000x2000_fit_lanczos_2.png" class="lazyload" alt="" width="930" height="315">
<figcaption data-pre="Figure " data-post=":" class="numbered">
Modelling interactions
</figcaption>
</figure>
</li>
</ul>
<h3 id="conclusion-interactions">Conclusion (Interactions)</h3>
<ul>
<li>New models and datasets for contact detection, contact surface signature prediction, and 3d reconstruction under contact constraints</li>
<li>Annotation has an underlying contact ground truth, but it is not always easy to identify precisely from a single image</li>
<li>Humans are reasonably consistent at identifying contacts at 9 and 17 region granularity, and contact can be predicted with reasonable accuracy too</li>
<li>Contact-constrained 3D human reconstruction produces considerably better and more meaningful estimates, compared to non-contact methods</li>
</ul>
<p>Cristian then concluded his wonderful lecture that talked about the most recent advances in Computer Vision in the 3D Deep learning field. It was a very informative and engaging lecture.</p>
</description>
</item>
<item>
<title>Summer School Series: Lecture 3 by Vineet Gupta</title>
<link>https://archana1998.github.io/post/vineet-gupta/</link>
<pubDate>Sun, 23 Aug 2020 15:53:16 +0530</pubDate>
<guid>https://archana1998.github.io/post/vineet-gupta/</guid>
<description><p>This talk was delivered by <a href="http://www-cs-students.stanford.edu/~vgupta/"> Vineet Gupta </a>, a research scientist at Google Brain, Mountain View California.
His talk was titled <b>Adaptive Optimization</b>.</p>
<h3 id="the-optimization-problem">The optimization problem</h3>
<p>The optimization problem aims to learn the best function from a class of functions.
$$\operatorname{Class} : \{ \hat{y} = M(x | w), \text{ for } w \in \mathbb{R}^{n} \} $$</p>
<p>A class is most often specified as a neural network, parameterized by $w$. If the class is too large, overfitting happens. If the class is too small, we end up with bad results (underfitting).</p>
<p>The most common approach to finding the best function is supervised learning.</p>
<p>Training examples: input output pairs such as (x<sub>1</sub>, y<sub>1</sub>),&hellip;.(x<sub>n</sub>, y<sub>n</sub>)</p>
<p>Learning rule: Estimating $w$ such that $\hat{y_{i}} = M(x_{i}|w) \approx y_{i}$, and $w$ approximately minimizes $ F(w) = \sum_{i=1}^{n} l(\hat{y_{i}},y_{i})$ (the loss function)</p>
<p>In a feed-forward deep neural network, computing the gradient over the entire training set is expensive. For this reason, we sample points and compute the gradient on them.</p>
<h3 id="stochastic-optimization">Stochastic Optimization</h3>
<p>The optimizer starts with the network denoted as $M(x|w)$.</p>
<p>At each round t: (the goal is to minimize $F(w)$)</p>
<ul>
<li>Optimizer has decided upon $w_{t}$</li>
<li>Optimizer receives the input $ [x_{i} ]_{i=1}^{k}$</li>
<li>Optimizer makes prediction $[\hat{y_{i}}= M(x_{i}|w_{t})]_{i=1}^{k}$</li>
<li>Optimizer receives the true outcome</li>
<li>Optimizer computes the loss $l_{t} = \sum_{i} l(y_{i},\hat{y_{i}})$ and gradient $g_{t} = \frac{\partial }{\partial w} \sum_{i} l(y_{i},\hat{y_{i}})$</li>
<li>Optimizer uses $g_{t}$ to update $w_{t}$ to get $w_{t+1}$</li>
</ul>
<p>We stop when the gradients vanish or we run out of time (or epochs).</p>
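<p>A minimal numpy sketch of this loop (not from the lecture) for a linear model with squared loss; the data, batch size and step size are made up for illustration:</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                  # inputs x_i
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=1000)    # outputs y_i

w = np.zeros(5)                                  # current parameters w_t
eta = 0.1
for t in range(500):
    idx = rng.integers(0, len(X), size=32)       # sample a minibatch of k points
    xb, yb = X[idx], y[idx]
    y_hat = xb @ w                               # predictions M(x_i | w_t)
    g = 2 * xb.T @ (y_hat - yb) / len(idx)       # gradient of the minibatch squared loss
    w = w - eta * g                              # update w_t to get w_{t+1}

print(np.abs(w - w_true).max())  # small: w approximately recovers w_true
</code></pre>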
<h3 id="regret">Regret</h3>
<p>Convergence can be measured by the regret: the average loss compared to the optimum $w^{*}$</p>
<p>$$ R_{T} = \frac{1}{T} \sum_{t=1}^{T} l_{t} (w_{t}) - \frac{1}{T} \sum_{t=1}^{T} l_{t}(w^{*})$$</p>
<p>We have convergence when $R_{T} \rightarrow 0 \text{ as } T \rightarrow \infty$. This is a very <b>strong</b> requirement: the average regret tending to 0.</p>
<p>In convex optimization, the regret is $R_{T} = O(\frac{1}{\sqrt{T}})$. For convex problems, SGD converges faster when the condition number is better.</p>
<h3 id="momentum">Momentum</h3>
<p>What happens when the gradients become very noisy? To solve this, we can take a running average of the gradients.
$$ \bar{g_{t}} = \gamma \bar{g_{t}} + (1-\gamma)g_{t}$$
Thus the momentum step becomes:
$$w_{t+1} = w_{t}-\eta_{t} \bar{g_{t}}$$
The momentum approach works very well and remains extremely popular to date.</p>
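<p>As a sketch of the update above (not any particular library&rsquo;s implementation), here is momentum on a noisy toy quadratic $F(w) = \frac{1}{2} ||w||^{2}$, with made-up values of $\gamma$ and $\eta$:</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
w = np.ones(10)
g_bar = np.zeros_like(w)
gamma, eta = 0.9, 0.05

for t in range(2000):
    g = w + rng.normal(size=w.shape)             # noisy gradient of F(w) = 0.5 * ||w||^2
    g_bar = gamma * g_bar + (1 - gamma) * g      # running average of the gradients
    w = w - eta * g_bar                          # momentum step

print(np.linalg.norm(w))  # small compared to the initial norm of about 3.16
</code></pre>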
<p>Another way to solve the problem is by using second order methods.
To minimize $F(w)$,</p>
<p>$$F(w) \approx F(w_{t}) + (w - w_{t})^{T} \nabla F(w_{t}) + \frac{1}{2} (w - w_{t})^{T} \nabla^{2} F(w_{t}) (w - w_{t}) $$ (the Taylor expansion up to second order).</p>
<p>The minimum is at: $w_{t+1} = w_{t} - \nabla^{2} F(w_{t})^{-1} \nabla F(w_{t})$</p>
<p>The biggest problem with this is that computing the Hessian $\nabla^{2} F(w_{t})$ is very expensive, as it is an $n \times n$ matrix for $n$ parameters.</p>
<h3 id="adagrad">AdaGrad</h3>
<p>For gradients $g_{s}$,
$$ H_{t} = \Big(\sum_{s\leq{t}} g_{s} g_{s}^{T}\Big)^{\frac{1}{2}} $$</p>
<p>This is used as the matrix for the Mahalanobis metric:
$$ \therefore w_{t+1} = \operatorname{argmin}_{w} \frac{1}{2\eta} ||w - w _{t}|| _{H _{t}}^{2} +\hat{l _{t}}(w) $$</p>
<p>The AdaGrad update rule is: $w_{t+1} = w_{t} - \eta H_{t}^{-1} g_{t}$.
This is again very expensive, $O(n^{2})$ storage and $O(n^{3})$ time complexity per step.</p>
<h4 id="the-solution">The solution</h4>
<p>One way to solve this is the diagonal approximation: take only the diagonal of $H_{t}$ instead of the full matrix. $$H_{t} = \operatorname{diag}{(\sum_{s\leq{t}}g_{s}g_{s}^{T}+\epsilon\operatorname{I})}^{\frac{1}{2}}$$</p>
<p>This takes $O(n)$ space and $O(n)$ time per step.</p>
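<p>A minimal sketch of the diagonal variant (toy quadratic objective; the step size and noise level are made up):</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
w = np.ones(10)
accum = np.zeros(10)            # running sum of squared gradients: diag(sum_s g_s g_s^T)
eta, eps = 0.5, 1e-8

for t in range(2000):
    g = w + 0.1 * rng.normal(size=w.shape)       # noisy gradient of 0.5 * ||w||^2
    accum += g ** 2                              # O(n) storage, O(n) time per step
    w = w - eta * g / np.sqrt(accum + eps)       # per-coordinate adaptive step

print(np.linalg.norm(w))  # small
</code></pre>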
<p>AdaGrad has been so successful that there have been plenty of variants like AdaDelta/RMS Prop and Adam.</p>
<h3 id="full-matrix-preconditioning">Full-matrix Preconditioning</h3>
<h4 id="adagrad-preconditioner">AdaGrad Preconditioner</h4>
<p>For $w_{t}$ of size $100 \times 200$, $g_{t}$ flattens to a 20,000-dimensional vector, and the full preconditioner then becomes 20k x 20k in size.</p>
<h4 id="the-kronecker-product">The Kronecker Product</h4>
<p>Given an $m \times n$ matrix $A$ and a $p \times q$ matrix $B$, their <b>Kronecker Product</b> $C$ is defined as $$C = A \bigotimes B $$
This is also called the matrix direct product, and is an $(mp) \times (nq)$ matrix (every element of $A$ multiplied with $B$). It commutes with the standard matrix product, $(A \otimes B)(A' \otimes B') = (A A') \otimes (B B')$, and with matrix powers.</p>
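<p>A quick numpy check of the dimension claim and the mixed-product property (an illustration only, using <code>np.kron</code>):</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))    # m x n
B = rng.normal(size=(4, 5))    # p x q
K = np.kron(A, B)
print(K.shape)                 # (8, 15), i.e. (m*p) x (n*q)

# Mixed-product property: (A kron B)(D kron E) = (A D) kron (B E)
D = rng.normal(size=(3, 2))
E = rng.normal(size=(5, 4))
print(np.allclose(np.kron(A, B) @ np.kron(D, E), np.kron(A @ D, B @ E)))  # True
</code></pre>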
<h3 id="the-shampoo-preconditioner">The Shampoo Preconditioner</h3>
<figure id="figure-decomposed-matrix">
<img data-src="https://archana1998.github.io/post/vineet-gupta/1_hu82cf2efea7539d0f27dd878118672ff1_8584_2000x2000_fit_lanczos_2.png" class="lazyload" alt="" width="500" height="500">
<figcaption data-pre="Figure " data-post=":" class="numbered">
Decomposed Matrix
</figcaption>
</figure>
<h3 id="the-shampoo-update">The Shampoo Update:</h3>
<p><b>Adagrad update</b>: ${w} _{t+1} ={w} _{t}-\eta H _{t}^{-1} {g} _{t}$</p>
<p><b>Shampoo factorization</b>: $w_{t+1}=w_{t}-\eta\left(L_{t}^{\frac{1}{4}} \otimes R_{t}^{\frac{1}{4}}\right)^{-1} g_{t}$</p>
<p><b>Shampoo update</b>: $W_{t+1}=W_{t}-\eta L_{t}^{-\frac{1}{4}} G_{t} R_{t}^{-\frac{1}{4}}$
<b>Theorem (convergence)</b>:
If ${G} _{1}, \mathrm{G} _{2}, \ldots, \mathrm{G} _{\mathrm{T}}$ are of rank $\leq \mathrm{r},$ then the rate of convergence is:</p>
<p>$$\frac{\sqrt{\mathrm{r}}}{\mathrm{T}} \operatorname{Tr}\left(\mathrm{L} _{\mathrm{T}}^{\frac{1}{4}}\right) \operatorname{Tr}\left(\mathrm{R} _{\mathrm{T}}^{\frac{1}{4}}\right)=\mathrm{O}\left(\frac{1}{\sqrt{\mathrm{T}}}\right)$$</p>
<p>where $R_{t}=\sum_{s \leq t} G_{s}^{\top} G_{s}$ and $L_{t}=\sum_{s \leq t} G_{s} G_{s}^{T}$.</p>
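<p>To connect the factored matrix update with the Kronecker-factored preconditioner above, here is a small numpy check on a randomly generated gradient (a sketch only; the eigendecomposition-based inverse fourth root is just for illustration, not the iterative method used in practice):</p>
<pre><code class="language-python">import numpy as np

def inv_fourth_root(M, eps=1e-6):
    """(M + eps*I)^(-1/4) for a symmetric PSD matrix, via eigendecomposition."""
    vals, vecs = np.linalg.eigh(M + eps * np.eye(M.shape[0]))
    return (vecs * vals ** -0.25) @ vecs.T

rng = np.random.default_rng(0)
G = rng.normal(size=(3, 4))    # gradient G_t for a 3x4 parameter matrix W_t
L = G @ G.T                    # L_t = sum_s G_s G_s^T (a single step here)
R = G.T @ G                    # R_t = sum_s G_s^T G_s

# Shampoo update direction in matrix form: L_t^{-1/4} G_t R_t^{-1/4}
step_matrix = inv_fourth_root(L) @ G @ inv_fourth_root(R)

# The same direction via the Kronecker-factored preconditioner applied
# to the flattened gradient g_t.
precond = np.kron(inv_fourth_root(L), inv_fourth_root(R))
step_flat = precond @ G.reshape(-1)

print(np.allclose(step_matrix.reshape(-1), step_flat))  # True
</code></pre>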
<h3 id="implementing-shampoo">Implementing Shampoo</h3>
<p>The training system can be of two types:</p>
<ul>
<li>Asynchronous (accelerators don&rsquo;t need to talk to each other, however it is hard for the parameter servers to handle)</li>
<li>Synchronous (accelerator sends gradients to all the other accelerators, for them to average and update)</li>
</ul>
<h3 id="challenges">Challenges</h3>
<ul>
<li>Tensorflow and PyTorch focus on 1<sup>st</sup> order optimizations</li>
<li>Computing $L_{t}^{-\frac{1}{4}}$ and $R_{t}^{-\frac{1}{4}}$ is expensive</li>
<li>L, R have large condition numbers (up to the order of 10<sup>13</sup>).</li>
<li>SVD is very expensive: $O(n^{3})$ in largest dimension</li>
<li>Large layers are still impossible to precondition</li>
</ul>
<h3 id="solutions">Solutions</h3>
<ul>
<li>Using high precision arithmetic (float 64), not performing the computations on a TPU.</li>
<li>Computing preconditioners every 1000 steps is alright.</li>
<li>Replace SVD with an iterative method</li>
<li>Only matrix multiplications needed
<ul>
<li>Warm start: use previous preconditioner</li>
<li>Reduce condition number, remove top singular values</li>
</ul>
</li>
<li>Optimization for large layers
<ul>
<li>Precondition only one dimension</li>
<li>Block partitioning the layer works better</li>
</ul>
</li>
</ul>
<h3 id="shampoo-implementation-and-conclusion">Shampoo implementation and conclusion</h3>
<p>Shampoo is implemented on a TPU+CPU. It is a little more expensive per step than AdaGrad but much faster overall (it saves 40% of the training time with 1.95 times fewer steps). Shampoo works well in the language and speech domains, but it isn&rsquo;t suitable for image classification yet (for this, Adam and AdaGrad work much better).</p>
<p>The Shampoo paper can be found <a href ="https://arxiv.org/abs/1802.09568"> here </a></p>
</description>
</item>
<item>
<title>Summer School Series: Lecture 2 by Neil Houlsby</title>
<link>https://archana1998.github.io/post/neil-houlsby/</link>
<pubDate>Fri, 21 Aug 2020 23:11:51 +0530</pubDate>
<guid>https://archana1998.github.io/post/neil-houlsby/</guid>
<description><p><a href="https://research.google/people/NeilHoulsby/">Neil Houlsby</a> presented a great talk on Large Scale Visual Representation Learning and how Google has come up with solutions to some classical problems.</p>
<h3 id="evaluation-of-parameters">Evaluation of parameters</h3>
<p>There are two main ways of evaluating the parameters (representations) extracted by a network. They are:</p>
<ul>
<li>Linear Evaluation: We freeze the weights and retrain the head</li>
<li>Transfer Evaluation: We retrain end to end with new head</li>
</ul>
<h3 id="visual-task-adaptation-benchmark-vtab">Visual Task Adaptation Benchmark (VTAB)</h3>
<p><a href="https://ai.googleblog.com/2019/11/the-visual-task-adaptation-benchmark.html">VTAB</a> is an evaluation protocal designed to measure progress towards general and useful visual representations, and consists of a suite of evaluation vision tasks that a learning algorithm must solve. We mainly have three types of tasks, <b> Natural tasks, Specialized tasks and Structured Datasets. </b></p>
<p>A query that was posed was how useful ImageNet labels would be for pretrained models to work on these three tasks. It has been seen that ImageNet labels work well for Natural images, and not well for the other two tasks.</p>
<p>Representation learners pre-trained on ImageNet can be of three forms:</p>
<ul>
<li>GANs and autoencoders</li>
<li>Self-supervised</li>
<li>Semi-supervised / Supervised approach</li>
</ul>
<p>It has been seen that for natural tasks, representations prove to be more important than obtaining more data, and the supervised approach is far better than the unsupervised approach. For structured tasks, a combination of supervised and self-supervised learning works the best.</p>
<p>It was also mentioned that, by modern standards, ImageNet is incredibly small-scale, thus scaling models on ImageNet was not proven to be effective.</p>
<p>Something to specifically keep in mind is that upstream can be expensive, but downstream should be cheap (in terms of both data and compute). For the upstream, examples of suitable large datasets are ImageNet-21k for supervised learning, and YouTube-8M for self-supervised learning.</p>
<h3 id="bit-l">BiT-L</h3>
<p>Neil introduced the <a href="https://blog.tensorflow.org/2020/05/bigtransfer-bit-state-of-art-transfer-learning-computer-vision.html">Big Transfer Learning (BiT-L)</a> algorithm and talked about it in detail.
The first thing he mentioned about BiT-L was that batch normalization was replaced with <b> group normalization </b> for ultra-large data. Advantages of this were having no train/test discrepancy, and no state which made it easier to co-train with multiple steps.</p>
<p>It was highlighted that optimization at scale implies that schedule is crucial and not obvious. Also, early results of models can be misleading.</p>
<p>To perform cheap transfer, we need low compute, few/no validation data and diverse tasks. For doing few-shot transfer, pretraining on ImageNet-21k and JFT-300M helps.</p>
<h4 id="robustness">Robustness</h4>
<p>Models trained with ImageNet aren&rsquo;t necessarily robust most of the time. To test OOD robustness (Out-Of-Distribution), we use datasets like ImageNet C, ImageNet R and ObjectNet.</p>
<h4 id="modern-transfer-learning">Modern Transfer Learning</h4>
<p>Modern transfer learning calls for a big, labelled dataset, a big model and careful training (using about 10 optimization recipes).
When testing on OOD data, increasing the dataset size with a fixed model leads to an increase in performance, especially in the case of very large models.</p>
<p>To summarize, Bigger transfer $\rightarrow$ Better Accuracy $\rightarrow$ Better Robustness</p>
<ul>
<li>For checking impact on object <b> location </b> invariance, we see accuracy improves and becomes more uniform across location</li>
<li>This proves to be the same in the case of impact on object <b>size</b> invariance</li>
<li>However, in the case of object rotation invariance for ResNet50, it does not become more uniform across rotation angles, but ResNet101x3 maintains uniformity</li>
</ul>
<h3 id="conclusion">Conclusion</h3>
<p>Main takeaways from the talk and BiT-L were:</p>
<ul>
<li>Scale is one of the key drivers of representation learning performance</li>
<li>Especially effective for few-shot learning and OOD Robustness</li>
<li>Also seen and mirrored in language domain</li>
</ul>
<p>Links to the GitHub repositories are: <a href ="https://github.com/google-research/big_transfer"> Big Transfer </a> and <a href="https://github.com/google-research/task_adaptation"> Visual Task Adaptation Benchmark (VTAB)</a></p>
</description>
</item>
<item>
<title>Summer School Series: Lecture 1 by Jean-Philippe Vert</title>
<link>https://archana1998.github.io/post/jean-vert/</link>
<pubDate>Fri, 21 Aug 2020 19:07:17 +0530</pubDate>
<guid>https://archana1998.github.io/post/jean-vert/</guid>
<description><p>This is an article about what <a href ="http://members.cbio.mines-paristech.fr/~jvert/">Jean-Philippe Vert</a> talked about at the Google Research India-AI Summer School 2020. The lecture was titled <b> Differentiable Ranking and Sorting </b> and lasted about 2 hours.</p>
<h3 id="differentiable-programming">Differentiable Programming</h3>
<p>What is machine learning and deep learning?</p>
<p>Machine learning is, roughly, giving training data to a program so that it produces better results on complex problems. For example:</p>
<figure id="figure-a-neural-network-to-recognize-cats-and-dogs">
<img data-src="https://archana1998.github.io/post/jean-vert/fig1_hu2361eef24ba1250aaf0d087e444736ee_322268_2000x2000_fit_lanczos_2.png" class="lazyload" alt="" width="1166" height="626">
<figcaption data-pre="Figure " data-post=":" class="numbered">
A neural network to recognize cats and dogs
</figcaption>
</figure>
<p>These networks usually use <b>vectors</b> to do the computations within the network, however in recent research models are getting extended to non-vector objects (strings, graphs etc.)</p>
<p>Jean then gave an introduction to permutations and rankings and what he aspired to do, informally. Permutations are not vectors/graphs, but something else entirely. Some data are permutations (input, output etc) and some operations may involve ranking (histogram equalization, quantile normalization)</p>
<p>What do these operations aspire to do?</p>
<ul>
<li>Rank pixels</li>
<li>Extract a permutation and assign values to pixels only based on rankings</li>
</ul>
<h3 id="permutations">Permutations</h3>
<p>A permutation is formally defined as a bijection, that is:</p>
<p>$$\sigma:[1, N] \rightarrow[1, N]$$</p>
<ul>
<li>
<p>Over here, $\sigma(i)=$ rank of item $i$</p>
</li>
<li>
<p>The composition property is defined as: $\left(\sigma_{1} \sigma_{2}\right)(i)=\sigma_{1}\left(\sigma_{2}(i)\right)$</p>
</li>
<li>
<p>$\mathrm{S}_{N}$ is the symmetric group and</p>
</li>
<li>
<p>$\left|\mathbb{S}_{N}\right|=N !$</p>
</li>
</ul>
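<p>A small numpy illustration of these definitions (not from the lecture): ranks via a double argsort, and composition of two permutations:</p>
<pre><code class="language-python">import numpy as np

x = np.array([2.1, -0.4, 5.8])
order = np.argsort(x)           # indices of the items in increasing order
ranks = np.argsort(order) + 1   # sigma(i) = rank of item i (1-indexed)
print(ranks)                    # [2 1 3]

# Composition: (sigma1 sigma2)(i) = sigma1(sigma2(i)), here 0-indexed
sigma1 = np.array([2, 0, 1])
sigma2 = np.array([2, 1, 0])
print(sigma1[sigma2])           # [1 0 2], another permutation
</code></pre>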
<h3 id="goal">Goal</h3>
<p>Our primary goal is:</p>
<figure id="figure-moving-between-spaces">
<img data-src="https://archana1998.github.io/post/jean-vert/2_huacbb429b45f880db3bfef78880895f1e_33174_2000x2000_fit_lanczos_2.png" class="lazyload" alt="" width="1021" height="294">
<figcaption data-pre="Figure " data-post=":" class="numbered">
Moving between spaces
</figcaption>
</figure>
<p>Some definitions here are:</p>
<ol>
<li>Embed:</li>
</ol>
<ul>
<li>To define/optimize $f_{\theta}(\sigma)=g_{\theta}($embed$(\sigma))$ for $\sigma \in \mathbb{S}_{N}$</li>
<li>E.g., $\sigma$ given as input or output</li>
</ul>
<ol start="2">
<li>Differentiate:</li>
</ol>
<ul>
<li>To define/optimize $h_{\theta}(x)=f_{\theta}($argsort$(x))$ for $x \in \mathbb{R}^{n}$</li>
<li>E.g., normalization layer or rank-based loss</li>
</ul>
<h3 id="argmax">Argmax</h3>
<p>To put it in simple words, the argmax function identifies the dimension in a vector with the largest value. For example, $\operatorname{argmax}(2.1, -0.4, 5.8) = 3$</p>
<p>It is not differentiable because:</p>
<ul>
<li>As a function, $\mathbb{R}^{n} \rightarrow[1,n]$, the output space is <b> not continuous </b></li>
<li>It is <b>piecewise constant</b> (i.e, gradient = 0 almost everywhere even if the output space was continuous)</li>
</ul>
<h3 id="softmax">Softmax</h3>
<p>It is a <b>differentiable</b> function that maps from $\mathbb{R}^{n} \rightarrow \mathbb{R}^{n}$, where</p>
<p>$$\operatorname{softmax}_ {\epsilon} (x)_ {i} =\frac{e^{x_{i} / \epsilon}}{\sum_{j=1}^{n} e^{x_{j} / \epsilon}}$$</p>
<p>For example, $\operatorname{softmax}(2.1, -0.4, 5.8) = (0.024, 0.002, 0.974)$</p>
<h3 id="moving-from-softmax-to-argmax">Moving from Softmax to Argmax</h3>
<p>$$\lim _ {\epsilon \rightarrow 0} \operatorname{softmax}_{\epsilon}(2.1,-0.4, 5.8)=(0,0,1)=\Psi(3)$$</p>
<p>where $\psi:[1, n] \rightarrow \mathbb{R}^{n}$ is the one-hot encoding. More generally,
$$
\forall x \in \mathbb{R}^{n}, \quad \lim_ {\epsilon \rightarrow 0} \operatorname{softmax}_{\epsilon}(x)=\Psi(\operatorname{argmax}(x))
$$</p>
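<p>A short numerical check of this limit (an illustration only, with $\epsilon$ playing the role of a temperature):</p>
<pre><code class="language-python">import numpy as np

def softmax_eps(x, eps):
    """softmax_eps(x)_i = exp(x_i / eps) / sum_j exp(x_j / eps)."""
    z = np.exp((x - np.max(x)) / eps)   # subtract the max for numerical stability
    return z / z.sum()

x = np.array([2.1, -0.4, 5.8])
for eps in (1.0, 0.1, 0.01):
    print(eps, softmax_eps(x, eps).round(3))
# As eps -> 0 the output approaches the one-hot encoding (0, 0, 1) = Psi(argmax(x)).
</code></pre>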
<h3 id="moving-from-argmax-to-softmax">Moving from Argmax to Softmax</h3>
<h4 id="1-embedding">1. Embedding</h4>
<p>Let the simplex
$$
\Delta_{n-1}=\operatorname{conv}(\{\Psi(y): y \in[1, n]\})
$$
Then we have a variational characterization (exercise left to us):
$$
\Psi(\operatorname{argmax}(x))=\underset{z \in \Delta_{n-1}}{\operatorname{argmax}}\left(x^{\top} z\right)
$$</p>
<figure id="figure-simplex-representation">