<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>Harshit Gaur</title>
<link>https://harshit2000.github.io/</link>
<atom:link href="https://harshit2000.github.io/index.xml" rel="self" type="application/rss+xml" />
<description>Harshit Gaur</description>
<generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><copyright>Harshit Gaur, 2020 ©</copyright><lastBuildDate>Tue, 25 Aug 2020 19:19:35 +0530</lastBuildDate>
<image>
<url>https://harshit2000.github.io/images/icon_hu616effff6bc497e1f3ccd40e4a444d66_14554_512x512_fill_lanczos_center_2.png</url>
<title>Harshit Gaur</title>
<link>https://harshit2000.github.io/</link>
</image>
<item>
<title>How Google bowled me over with a Googly</title>
<link>https://archana1998.github.io/post/summer-school-sumup/</link>
<pubDate>Tue, 25 Aug 2020 19:19:35 +0530</pubDate>
<guid>https://archana1998.github.io/post/summer-school-sumup/</guid>
<description><p>I recently got the opportunity to attend the AI Summer School conducted by Google Research India. I was one of the 150 people selected to attend it, out of over 75,000 applications. Probably one of my most noteworthy achievements till date, if not the most (?). I remember screaming after getting the acceptance mail for over two hours, it was the happiest I have been in a while. I was selected as part of the Computer Vision track, and I was elated as I knew absolutely nothing about the other two tracks (Natural Language Understanding and AI for Social Good)</p>
<p>The summer school happened over three days, between August 20 and 22, 2020. Due to the pandemic that taught us that we can do everything over a computer screen, the summer school was held in a virtual mode. The people at Google made us feel very welcome, and sent out a batch of goodies from Google to all the participants (side note: I&rsquo;m a little salty about this as I haven&rsquo;t gotten mine yet, it got lost on the way). Saying I loved the experience would be an understatement of sorts, I was constantly elated after each and every event.</p>
<p>Day 1 started off with a keynote by Jeff Dean, the head of Google AI Research, at 9 am. Waking up so early was a huge achievement for me in a quarantine-home restricted environment where I sleep late and wake up late. Working remotely at a lab in a different country provides insane flexibility, I am my most productive in the afternoons and evenings. I sat in front of my computer and tuned into the YouTube live stream which was engaging and amazing (see my <a href="https://archana1998.github.io/post/opening-keynote/"> post</a>)
<figure id="figure-opening-keynote">
<img data-src="https://archana1998.github.io/post/summer-school-sumup/1_hu44f229a9c87ce6272342d7409ef1f45d_159695_2000x2000_fit_lanczos_2.png" class="lazyload" alt="" width="1478" height="1108">
<figcaption data-pre="Figure " data-post=":" class="numbered">
Opening Keynote
</figcaption>
</figure>
</p>
<p>After a lunch break, we had our first lecture by Jean-Phillipe Vert, which had so much rigorous math that we were slightly intimidated, however it was a pleasure being taught by someone so amazing all the same. (shameless plug to <a href="https://archana1998.github.io/post/jean-vert/"> post</a> again).</p>
<p>We had an amazing panel discussion that was titled <b>Why Choose a Career in Research</b>. The panel consisted of eminent names from Google Research. We had a &ldquo;virtual social&rdquo; after that on GatherTown, which was not the easiest to use on Day 1, but it was quite an experience. We had a second lecture after that by Neil Houlsby, finally on computer vision (I loved it, here&rsquo;s my <a href="https://archana1998.github.io/post/neil-houlsby/"> post</a>). And just like that, I was done for the day and had learnt more in these 6 hours than I did in the last semester.</p>
<p>Day 2 started off well with a lovely talk by Vineet Gupta, on math again :(. But this was nice math, easy to understand and follow and talked about very interesting theoretical math for machine learning that provided very promising results in optimization (once again, here&rsquo;s my <a href="https://archana1998.github.io/post/vineet-gupta/"> post</a>). We had a social before lunch once again, and I got to meet and greet with a lot of people this time, having finally understood how to use the GatherTown UI. I interacted with a lot of my fellow attendees and the Google Lab members, it was super fun.</p>
<p>After lunch, we had our first computer vision-centric lecture by Cristian Sminchisescu that was BEAUTIFUL. The fact that it perfectly aligned to my research interests was a cherry on top of the cake. (<a href="https://archana1998.github.io/post/cristian-sminchisescu/"> post</a> again). We had a panel discussion titled &ldquo;AI For India&rdquo; after that, which was insightful as well. I was done with my second day of the school, and had learned more than I did in half of my math degree.</p>
<p>Day 3 had lovely lectures, by Rahul Sukthankar and Arsha Nagrani who were so, so, good at presenting their work! Rahul&rsquo;s lecture was simple but beautifully presented, and I loved it! (here&rsquo;s my <a href="https://archana1998.github.io/post/rahul-sukthankar/"> post</a>). Arsha&rsquo;s talk was about some very interesting research that&rsquo;s probably going to revolutionize multimodal learning (last time, here&rsquo;s my <a href ="https://archana1998.github.io/post/arsha-nagrani/"> post </a>)
The summer school concluded with a closing keynote delivered by Manish Gupta, the director of Google Research India, who talked about opportunities in Google Research for us. We then had socials that lasted two hours (last day, woohoo) and I, who had mastered navigating GatherTown by then was a proper social butterfly, talking to everyone and anyone and sending connection requests on LinkedIn to stay in touch.</p>
<p>That was it! <em>curtain closes</em> The experience was CRAZY, and I never knew I could learn so much in just three days. More than learning new concepts, I got an insight into how these amazing people conduct cutting edge research, and the fact that we have to learn so much to get there was a little inspirational too. I was jumping with happiness and rambled nonstop about how much fun I had to my family and my friends, thankfully for me, they shared my enthusiasm :)
I can&rsquo;t wait to experience more things like this in the future!</p>
<p>PS: Still waiting for my goodies @Google. Thanks again for a lovely time.</p>
</description>
</item>
<item>
<title>Summer School Series: Lecture 6 by Arsha Nagrani</title>
<link>https://archana1998.github.io/post/arsha-nagrani/</link>
<pubDate>Tue, 25 Aug 2020 18:36:00 +0530</pubDate>
<guid>https://archana1998.github.io/post/arsha-nagrani/</guid>
<description><p>This final lecture was delivered by <a href ="http://www.robots.ox.ac.uk/~arsha/">Arsha Nagrani</a>, a recent Ph.D. graduate from Oxford University&rsquo;s VGG group, and an incoming research scientist at Google Research. Her talk was called <b>Multimodality for Video Understanding</b>.</p>
<h3 id="video-understanding">Video Understanding</h3>
<p>Videos provide us with far more information than images. Multimodal refers to many mediums for learning, here it can be time, sound and speech. Videos are all around us (30k newly created content videos are uploaded to YouTube every <b>hour</b>).
However, these have high dimensionality and are difficult to process and annotate.</p>
<h4 id="complementarity-among-signals">Complementarity among signals</h4>
<ul>
<li>Vision (scene)</li>
<li>Sound (content of speech)</li>
</ul>
<h4 id="redundancy-between-signals">Redundancy between signals</h4>
<p>Redundancy helps recognize a person (face + voice), and can thus be a useful form of weak supervision. The redundant information comes from background sounds, foreground audio, signals identified from speech and the content of speech.</p>
<p>Thus, the best way to exploit the multimodal nature of videos is to work with both the complementarity and the redundancy.</p>
<h4 id="suitable-tasks">Suitable tasks</h4>
<p>Suitable tasks for video understanding are:</p>
<ol>
<li>Video classification</li>
</ol>
<ul>
<li>single label</li>
<li>infinite number of possible classes</li>
<li>ambiguity in the label space</li>
</ul>
<ol start="2">
<li>Action recognition: more fine grained, the motion is important, human centric</li>
</ol>
<p>It is important to note that labelling actions in videos is extremely expensive and existing models do not generalize well to new domains.</p>
<p>In this context, can we use speech as a form of supervision? For example, narrated video clips and lifestyle Vlogs.</p>
<h3 id="movies">Movies</h3>
<p>General domain of movies: people speak about their actions. However, sometimes speech is completely unrelated, giving us noise. We need to learn when speech matches action. An example of work in this field is <a href ="https://arxiv.org/abs/1912.06430">End-to-End Learning of Visual Representations from Uncurated Instructional Videos</a>. This work reduces noise by using the MIL-NCE loss.</p>
<p>Can we first train a model to recognize actions and then see if it should be used for supervision? An interesting discovery Arsha made was using Movie Screenplays, that contain both speech segments and scene directions with actions. Using this:</p>
<ul>
<li>We can obtain speech-action pairs</li>
<li>Retrieve speech segments with verbs</li>
<li>Train the <a href="https://www.robots.ox.ac.uk/~vgg/research/speech2action/">Speech2Action</a> model to predict action, with a BERT-Backbone (movie scripts scraped from IMSDB)</li>
<li>Apply to closed captions of unlabelled videos</li>
<li>Apply to large movie corpus</li>
</ul>
<figure id="figure-speech2action-model">
<img data-src="https://archana1998.github.io/post/arsha-nagrani/1_hu3dfb93f108aa0e1c42c1039d245e09c2_102757_2000x2000_fit_lanczos_2.png" class="lazyload" alt="" width="784" height="358">
<figcaption data-pre="Figure " data-post=":" class="numbered">
Speech2Action model
</figcaption>
</figure>
<p>The Speech2Action model recognizes rare actions, and is a visual classifier on weakly labelled data (S3D-G model with cross-entropy loss)</p>
<p>Evaluation is done on the AVA and HMDB-51 (transfer learning) datasets. It gets abstract actions like <b>count</b> and <b>follow</b> too.</p>
<h3 id="multimodal-complementarity">Multimodal Complementarity</h3>
<p>This refers to fusing info from multiple modalities for video text retrieval, like:</p>
<ul>
<li>Finding video corresponding to text queries</li>
<li>More to videos than just actions like object, scene etc.</li>
</ul>
<p>Supervisions:</p>
<ul>
<li>It&rsquo;s not easy to get the complete combination of captions, this is a very subjective task</li>
<li>Need extremely large datasets</li>
</ul>
<p>What Arsha does is rely on expert models trained for different tasks like object detection, face detection, action recognition, OCR etc. These are all applied to the video and features are extracted. The framework is a joint video text embedding, with the video encoder + text query encoder = joint embedding space (similarity should be really high if related). It is necessary for the video encoder to be discriminative and retain specific information.</p>
<h3 id="collaborative-gating">Collaborative Gating</h3>
<p>For each expert, generate attention mask by looking at the other experts <a href = "https://bmvc2019.org/wp-content/uploads/papers/0363-paper.pdf"> (Use What You Have: Video Retrieval Using Representations From Collaborative Experts, BMVC 2019)</a></p>
<ul>
<li>Trained using bi-directional max margin ranking loss</li>
<li>Adding in more experts massively increases performance</li>
<li>Main boost is from the object embeddings</li>
</ul>
<figure id="figure-collaborative-gating">
<img data-src="https://archana1998.github.io/post/arsha-nagrani/2_huc44033e910d0af756e6885ccfb6b6932_14850_2000x2000_fit_lanczos_2.png" class="lazyload" alt="" width="150" height="197">
<figcaption data-pre="Figure " data-post=":" class="numbered">
Collaborative Gating
</figcaption>
</figure>
<p>Another paper that Arsha discussed was <a href ="https://arxiv.org/abs/2007.10639"> Multi-modal Transformer for Video Retrieval, ECCV 2020 </a>. This takes features extracted at different timestamps for each expert and aggregates them into the embeddings. The expert and temporal embeddings are summed.</p>
<h3 id="conclusion">Conclusion</h3>
<ul>
<li>More modalities is better (because more complementarity)</li>
<li>Time (modelling time along with modalities is interesting, some modalities train faster than the others)</li>
<li>Mid fusion is better than late (Attention truly is what you need)</li>
<li>Our world is multimodal, it doesn&rsquo;t make sense to work with modalities in isolation</li>
<li>Use the redundant and complementary information from vision, audio and speech to massively reduce annotations</li>
</ul>
<p><b>Open Research Questions:</b></p>
<ol>
<li>Extended Temporal Sequences (beyond 10s):</li>
</ol>
<ul>
<li>Backprop + memory restricts current video architectures to 64 frames</li>
<li>For longer sequences we rely on pre-extracted features</li>
<li>Need new datasets to drive innovation</li>
</ul>
<ol start="2">
<li>Moving away from supervision: is an upper bound on self-supervision being approached?</li>
<li>The world is multimodal: how do we design good fusion architectures?</li>
</ol>
<p>Arsha thus concluded a fantastic talk that described the cutting-edge research that her team at Oxford and Google is conducting. It was tremendously insightful and inspirational.</p>
</description>
</item>
<item>
<title>Summer School Series: Lecture 5 by Rahul Sukthankar</title>
<link>https://archana1998.github.io/post/rahul-sukthankar/</link>
<pubDate>Tue, 25 Aug 2020 17:19:29 +0530</pubDate>
<guid>https://archana1998.github.io/post/rahul-sukthankar/</guid>
<description><p>This Lecture was presented by <a href="https://research.google/people/RahulSukthankar/">Rahul Sukthankar</a>, a research scientist at Google Research and an Adjunct Professor at Carnegie Mellon University. It was titled <b>Deep Learning in Computer Vision</b>.</p>
<h3 id="popular-computer-vision-tasks">Popular Computer Vision tasks</h3>
<p>Some popular tasks in the domain of computer vision include:</p>
<ul>
<li>Image Classification (assign to one class)</li>
<li>Image Labelling/Object Recognition (multiple classes)</li>
<li>Object Detection/Localization (predicts bounding box+label, works well for objects but not for fuzzy concepts)</li>
<li>Semantic Segmentation (Pixel level dense labelling)</li>
<li>Image Captioning (Description of image in text)</li>
<li>Human Body Part Segmentation</li>
<li>Human Pose Estimation (predicting 2D pose keypoints)</li>
<li>Generating 3D Human Pose and Body Models from an image</li>
<li>Depth Prediction from a single image (foreground and background semantic segmentation based on a heatmap)</li>
<li>3D Scene Understanding</li>
<li>Autonomous navigation</li>
</ul>
<p>While thinking about a particular problem statement:</p>
<ul>
<li>What task are we considering? (semantic segmentation, classification, object detection, etc.)</li>
<li>What is the output? (binary yes/no, bounding box, label/pixel etc)</li>
<li>How is the training data labelled? (Fully Supervised/Weakly or Cross-Modal/Self-supervised)</li>
<li>Architecture: Usually a Convolutional Neural Network, but what is the final layer?</li>
<li>What loss function do we use?</li>
</ul>
<p>CMU Navlabs (30 years ago) built a self steering car only with an artificial neural network, in the pre-CNN era (<a href= "https://papers.nips.cc/paper/95-alvinn-an-autonomous-land-vehicle-in-a-neural-network.pdf"> ALVINN: AN AUTONOMOUS LAND VEHICLE IN A NEURAL NETWORK</a>)</p>
<h3 id="convolutional-neural-networks">Convolutional Neural Networks</h3>
<p>A convolutional neural network is structured as input + conv, ReLU and pooling layers (the hidden layers), followed by flatten, fully connected and softmax layers (for classification). Key concepts behind CNNs are:</p>
<ul>
<li>Local connectivity (not connected to every pixel, but just a few)</li>
<li>Shared weights (translational invariance)</li>
<li>Pooling (reduces dimensions; the local patch each filter sees becomes effectively bigger)</li>
<li>Filter stride (cuts down weights, reduces computations)</li>
<li>Multiple feature maps</li>
</ul>
<p>It is essential to choose the right conv layer, pooling layer, activation function, loss function, optimization and regularization methods, etc.</p>
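<p>To make the structure above concrete, here is a minimal sketch of such a network in PyTorch (not from the lecture; the layer sizes and the 3x32x32 input are arbitrary choices for illustration):</p>
<pre><code class="language-python">import torch
import torch.nn as nn

# Minimal sketch of the classic structure: conv + ReLU + pooling blocks,
# then flatten + fully connected + softmax for classification.
# Sizes are arbitrary and assume a 3-channel 32x32 input image.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # local connectivity, shared weights
    nn.ReLU(),
    nn.MaxPool2d(2),                             # pooling reduces spatial dimensions
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                   # final layer: one logit per class
    nn.Softmax(dim=1),                           # class probabilities
)

x = torch.randn(1, 3, 32, 32)
print(model(x).shape)  # torch.Size([1, 10])
</code></pre>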
<h4 id="convolutions">Convolutions</h4>
<ul>
<li>2D vs 3D convolutions: 3D convolutions are used to capture patterns across 3 dimensions, for example Video Understanding and Medical Imaging.</li>
<li>1x1 convolution: weighted average across the channel axis, a feature pooling technique to reduce dimensions</li>
<li>Other types of convolutions are dilated convolutions, regular vs depth wise separable convolutions, grouped convolutions (AlexNet uses it, it reduces computation)</li>
</ul>
<h3 id="famous-architectures">Famous architectures</h3>
<ol>
<li>
<p>Inceptionv1 (2014):
<figure id="figure-inception-v1">
<img data-src="https://archana1998.github.io/post/rahul-sukthankar/1_hu0095bbad7d2e55015cc682d2a4670f59_127543_2000x2000_fit_lanczos_2.png" class="lazyload" alt="" width="762" height="294">
<figcaption data-pre="Figure " data-post=":" class="numbered">
Inception v1
</figcaption>
</figure>
</p>
</li>
<li>
<p>ResNet:</p>
</li>
</ol>
<p>ResNet uses skip connections with residual blocks; the added paths help solve vanishing gradient problems and give a shorter route for backpropagation</p>
<figure id="figure-residual-blocks">
<img data-src="https://archana1998.github.io/post/rahul-sukthankar/2_hub70c3df55df6f4f2b8fb68ff07d9a5f0_58255_2000x2000_fit_lanczos_2.png" class="lazyload" alt="" width="557" height="332">
<figcaption data-pre="Figure " data-post=":" class="numbered">
Residual blocks
</figcaption>
</figure>
<h3 id="object-detection-in-images">Object Detection in Images</h3>
<ul>
<li>Object Classification: Task of identifying a picture is a dog</li>
<li>Object Localization: Involves finding class labels as well as a bounding box to show where an object is located</li>
<li>Object Detection: Localizing with box</li>
<li>Semantic Segmentation: Dense pixel labelling</li>
</ul>
<p>There are two ways to do detection:</p>
<ol>
<li>Sliding window approach: computationally expensive and unbalanced</li>
<li>Selective search: guessing promising bounding boxes and selecting the best out of them</li>
</ol>
<ul>
<li>RCNN did this when they extracted region proposals</li>
<li>Fast RCNN did class labelling + bounding box prediction at the same time (softmax + bounding box regression)</li>
</ul>
<p>Bounding box evaluation is commonly done by the Intersection over Union Metric
$$ \text{Intersection over Union} = \frac{\text{Area of Overlap}}{\text{Area of Union}}$$
(Ground truth bounding box and predicted bounding box)</p>
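<p>As a small illustration of the metric (not part of the lecture), here is a sketch that computes IoU for two axis-aligned boxes given as (x1, y1, x2, y2) corners:</p>
<pre><code class="language-python">def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)          # area of overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)               # overlap / union

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, roughly 0.143
</code></pre>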
<h3 id="classic-cnn-vs-fully-convolutional-net">Classic CNN vs Fully Convolutional Net</h3>
<p>A classic CNN comprises conv + fully connected layers, whereas a fully convolutional net contains only convolutional blocks, which lets it keep the same number of weights no matter what the input image size is. An example of a fully convolutional net is the U-Net, which is used extensively for semantic segmentation.</p>
<p>Other applications of a fully convolutional net are: Residual Encoding-Decoding, Dense Prediction, Super-resolution, Colorization (self-supervised)</p>
<h3 id="last-layer-activation-function-and-loss-function-summary">Last-Layer Activation function and Loss Function Summary</h3>
<figure id="figure-functions-to-use">
<img data-src="https://archana1998.github.io/post/rahul-sukthankar/3_hu44f5fe12bf2e14613e9c883ade35e838_85577_2000x2000_fit_lanczos_2.png" class="lazyload" alt="" width="712" height="219">
<figcaption data-pre="Figure " data-post=":" class="numbered">
Functions to use
</figcaption>
</figure>
<p>Any differentiable function can be used as a loss function: even another neural net! (perceptual loss, GAN loss, differentiable renderer etc)</p>
<p>Rahul concluded this introductory lecture focused on Computer Vision using fully supervised deep learning, with key concepts on CNNs and their extensions and the importance of choosing the right loss function. It was a wonderful lecture with all the concepts beautifully explained.</p>
</description>
</item>
<item>
<title>Summer School Series: Lecture 4 by Cristian Sminchisescu</title>
<link>https://archana1998.github.io/post/cristian-sminchisescu/</link>
<pubDate>Tue, 25 Aug 2020 16:11:56 +0530</pubDate>
<guid>https://archana1998.github.io/post/cristian-sminchisescu/</guid>
<description><p>This talk was presented by <a href ="https://research.google/people/CristianSminchisescu/">Cristian Sminchisescu</a>, who is a Research Scientist leading a team at Google, and a Professor at Lund University. His talk was titled <b>End-to-end Generative 3D Human Shape and Pose models, and active human sensing</b></p>
<p>3D Human Sensing has many applications in the fields of animation, sports motion, AR/VR, the medical industry etc. Humans are very complex: the body has 600 muscles, 200 bones and 200 joints. The clothing that humans wear has folds and wrinkles, and there are many different types of garments and cloth-body interactions.</p>
<h3 id="challenges">Challenges</h3>
<p>Typical challenges in 3D human sensing include:</p>
<ul>
<li>High dimensionality, articulation and deformation</li>
<li>Complex appearance variations, clothing and multiple people</li>
<li>Self occlusion or occlusion by scene objects</li>
<li>Observation (depth) uncertainty (especially in monocular images)</li>
<li>Difficult to obtain accurate supervision of humans</li>
</ul>
<p>This is where we can exploit the power of machine and deep learning, we aim to come up with a learning model that:</p>
<ol>
<li>Understands large volumes of data</li>
<li>Connects between images and 3D models</li>
</ol>
<h3 id="problems-that-need-to-be-solved">Problems that need to be solved</h3>
<p>It is imperative to <b>FIND THE PEOPLE</b>. We then need to infer their pose, body shape and clothing. The next step would be to recognize actions, behavioral states and social signals that they make, followed by recognizing what objects they use.</p>
<h3 id="visual-human-models">Visual Human Models</h3>
<p>Different Data types we take into consideration are:</p>
<ul>
<li>Multiple Subjects</li>
<li>Soft Tissue Dynamics</li>
<li>Clothing
This is all fed into the learning model</li>
</ul>
<h3 id="generative-human-modeling">Generative Human Modeling</h3>
<p>Dynamic Human Scans $\mathbf{\xrightarrow[\text{deep learning}]{\text{end to end}}}$ Full Body articulated generative human models. The Dynamic Human Scans are in the form of very dense 3D Point Clouds.</p>
<h3 id="ghum-and-ghuml">GHUM and GHUML</h3>
<p>Cristian then talked about his paper <a href = "https://openaccess.thecvf.com/content_CVPR_2020/papers/Xu_GHUM__GHUML_Generative_3D_Human_Shape_and_Articulated_Pose_CVPR_2020_paper.pdf">GHUM &amp; GHUML: Generative 3D Human Shape and Articulated Pose Models</a>
GHUM is the moderate-resolution generative model with 10168 vertices and GHUML is the light version with 3190 vertices; both share a skeleton that has minimal parameterization and anatomical joint limits.</p>
<p>The model facilitates automatic 3D landmark detection with multiview renderings, 2D landmark detection and 3D landmark triangulation. Automatic registration is able to calculate deformations.</p>
<h3 id="end-to-end-training-pipeline">End to End Training Pipeline</h3>
<figure id="figure-end-to-end-training-pipeline">
<img data-src="https://archana1998.github.io/post/cristian-sminchisescu/1_hu5ce619d08456037865eced6552c42e7e_329781_2000x2000_fit_lanczos_2.png" class="lazyload" alt="" width="500" height="500">
<figcaption data-pre="Figure " data-post=":" class="numbered">
End To End Training Pipeline
</figcaption>
</figure>
<ul>
<li>Once data is mapped to meshes and put into registered format, next step is to encode and decode static shapes (using VAE)</li>
<li>Kinematics is learned using Normalizing Flow model</li>
<li>Mesh filter (mask): to integrate close up scans with models, fed into the optimization step</li>
<li>To train landmarks, we use annotated image data</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>For the variational shape and expression autoencoder, the VAE works better than PCA, with reconstruction error lying between 0-20mm. Motion retargeting and kinematic priors are built by retargeting the models to 2.8M CMU and 2.2M Human3.6M motion capture frames.</p>
<h3 id="normalizing-flows-for-kinematic-priors">Normalizing Flows for Kinematic Priors</h3>
<figure id="figure-normalizing-flows-for-kinematic-priors">
<img data-src="https://archana1998.github.io/post/cristian-sminchisescu/2_huf2e22827a723675b2a79947a570691ce_140050_2000x2000_fit_lanczos_2.png" class="lazyload" alt="" width="869" height="244">
<figcaption data-pre="Figure " data-post=":" class="numbered">
Normalizing Flows for Kinematic Priors
</figcaption>
</figure>
<ul>
<li>A normalizing flow is a sequence of invertible transformations applied to an original distribution</li>
<li>Use a dataset $\mathcal{D}$ of human kinematic poses $\theta$ as statistics for natural human movements</li>
<li>Use normalizing flow to warp the distribution of poses into a simple and tractable density function e.g. $\mathbf{z} \sim \mathcal{N}(0 ; \mathbf{I})$</li>
<li>The flow is bijective, trained by maximizing the data log-likelihood
$$\max _{\phi} \sum _{\theta \in \mathcal{D}} \log p _{\phi}(\theta)$$ (a tiny worked example follows this list)</li>
</ul>
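<p>As a tiny illustration of that training principle (not the GHUM pipeline itself), here is a one-dimensional affine &ldquo;flow&rdquo; $z = (\theta - \mu)/s$ fitted by gradient ascent on the data log-likelihood; the data, step size and iteration count are made up:</p>
<pre><code class="language-python">import numpy as np

# Toy 1-D "flow": z = (theta - mu) / s warps the data towards N(0, 1).
# Change of variables: log p(theta) = log N(z; 0, 1) - log s.
rng = np.random.default_rng(0)
theta = rng.normal(loc=3.0, scale=2.0, size=5000)  # stand-in for pose parameters

mu, log_s = 0.0, 0.0
lr = 0.05
for _ in range(2000):
    s = np.exp(log_s)
    z = (theta - mu) / s
    d_mu = np.mean(z) / s          # gradient of the mean log-likelihood w.r.t. mu
    d_log_s = np.mean(z ** 2) - 1  # ... and w.r.t. log s
    mu += lr * d_mu                # gradient *ascent*: maximize the log-likelihood
    log_s += lr * d_log_s

print(round(mu, 2), round(np.exp(log_s), 2))  # roughly 3.0 and 2.0
</code></pre>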
<h3 id="ghum-and-smpl">GHUM and SMPL</h3>
<figure id="figure-ghum-vs-smpl">
<img data-src="https://archana1998.github.io/post/cristian-sminchisescu/3_hu125e7c529aa96451d645ff796805a88a_150708_2000x2000_fit_lanczos_2.png" class="lazyload" alt="" width="500" height="500">
<figcaption data-pre="Figure " data-post=":" class="numbered">
GHUM vs SMPL
</figcaption>
</figure>
<ul>
<li>GHUM is close (slightly better) to SMPL in skinning visual quality</li>
<li>The vertex point-to-plane error (body-only) is GHUM: 4.23mm and SMPL: 4.96mm</li>
</ul>
<h3 id="conclusions">Conclusions</h3>
<ul>
<li>An effective Deep Learning Pipeline to build generative, articulated 3D human shape models</li>
<li>GHUM and GHUML are two full body human models that are available for research: <a href="https://github.com/google-research/google-research/tree/master/ghum">https://github.com/google-research/google-research/tree/master/ghum</a>.</li>
<li>We can jointly sample shape, facial expressions (VAEs) and pose (normalizing flows)</li>
<li>We have low-res and high-res models that are non-linear (with linear as a special case)</li>
</ul>
<h3 id="other-work">Other Work</h3>
<p>Some other interesting papers that Cristian pointed out were <a href="https://arxiv.org/abs/2003.10350"> Weakly Supervised 3D Human Pose and Shape Reconstruction with Normalizing Flows</a> (ECCV 2020) that works on Full Body Reconstruction in Monocular Images, and <a href ="https://arxiv.org/abs/2008.06910">Neural Descent for Visual 3D Human Pose and Shape </a> (submitted to NeurIPS 2020) that talks about Self-Supervised 3D Human Shape and Pose Estimation.</p>
<h3 id="human-interactions">Human Interactions</h3>
<p>A problem that many 3D deep learning practitioners face is dealing with human interactions during estimation and reconstruction. Contacts are difficult to estimate correctly because of:</p>
<ul>
<li>Uncertainty in 3D monocular depth prediction</li>
<li>Reduced evidence of contact due to occlusion</li>
</ul>
<p>Cristian then talked about his paper <a href="https://openaccess.thecvf.com/content_CVPR_2020/papers/Fieraru_Three-Dimensional_Reconstruction_of_Human_Interactions_CVPR_2020_paper.pdf"> Three-dimensional Reconstruction of Human Interactions </a> and to move towards accurate reconstruction of interactions we need to:</p>
<ul>
<li>Detect contact</li>
<li>Predict contact interaction signatures</li>
<li>3D reconstruction under contact constraints
<figure id="figure-modelling-interactions">
<img data-src="https://archana1998.github.io/post/cristian-sminchisescu/4_hud88f9b9f04f1033960e9f73ca89377df_196249_2000x2000_fit_lanczos_2.png" class="lazyload" alt="" width="930" height="315">
<figcaption data-pre="Figure " data-post=":" class="numbered">
Modelling interactions
</figcaption>
</figure>
</li>
</ul>
<h3 id="conclusion-interactions">Conclusion (Interactions)</h3>
<ul>
<li>New models and datasets for contact detection, contact surface signature prediction, and 3d reconstruction under contact constraints</li>
<li>Annotation has an underlying contact ground truth, but it is not always easy to identify precisely from a single image</li>
<li>Humans are reasonably consistent at identifying contacts at 9 and 17 region granularity, and contact can be predicted with reasonable accuracy too</li>
<li>Contact-constrained 3D human reconstruction produces considerably better and more meaningful estimates, compared to non-contact methods</li>
</ul>
<p>Cristian then concluded his wonderful lecture that talked about the most recent advances in Computer Vision in the 3D Deep learning field. It was a very informative and engaging lecture.</p>
</description>
</item>
<item>
<title>Summer School Series: Lecture 3 by Vineet Gupta</title>
<link>https://archana1998.github.io/post/vineet-gupta/</link>
<pubDate>Sun, 23 Aug 2020 15:53:16 +0530</pubDate>
<guid>https://archana1998.github.io/post/vineet-gupta/</guid>
<description><p>This talk was delivered by <a href="http://www-cs-students.stanford.edu/~vgupta/"> Vineet Gupta </a>, a research scientist at Google Brain, Mountain View California.
His talk was titled <b>Adaptive Optimization</b>.</p>
<h3 id="the-optimization-problem">The optimization problem</h3>
<p>The optimization problem aims to learn the best function from a class of functions.
$$\operatorname{Class} : \{ \hat{y} = M(x | w), \text{ for } w \in \mathbb{R}^{n} \} $$</p>
<p>A class is most often specified as a neural network, parameterized by $w$. If the class is too large, overfitting happens. If the class is too small, we end up with bad results (underfitting).</p>
<p>The most common approach to finding the best function is supervised learning.</p>
<p>Training examples: input output pairs such as (x<sub>1</sub>, y<sub>1</sub>),&hellip;.(x<sub>n</sub>, y<sub>n</sub>)</p>
<p>Learning rule: Estimating $w$ such that $\hat{y_{i}} = M(x_{i}|w) \approx y_{i}$, and $w$ approximately minimizes $ F(w) = \sum_{i=1}^{n} l(\hat{y_{i}},y_{i})$ (the loss function)</p>
<p>In a feed-forward deep neural network, computing the gradient over the entire training set is expensive. For this reason, we sample points and compute the gradient on them.</p>
<h3 id="stochastic-optimization">Stochastic Optimization</h3>
<p>The optimizer starts with the network denoted as $M(x|w)$.</p>
<p>At each round t: (the goal is to minimize $F(w)$)</p>
<ul>
<li>Optimizer has decided upon $w_{t}$</li>
<li>Optimizer receives the input $ [x_{i} ]_{i=1}^{k}$</li>
<li>Optimizer makes prediction $[\hat{y_{i}}= M(x_{i}|w_{t})]_{i=1}^{k}$</li>
<li>Optimizer receives the true outcome</li>
<li>Optimizer computes the loss $l_{t} = \sum_{i} l(y_{i},\hat{y_{i}})$ and gradient $g_{t} = \frac{\partial }{\partial w} \sum_{i} l(y_{i},\hat{y_{i}})$</li>
<li>Optimizer uses $g_{t}$ to update $w_{t}$ to get $w_{t+1}$</li>
</ul>
<p>We stop when the gradients vanish or we run out of time (or epochs).</p>
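<p>A minimal numpy sketch of this loop (not from the lecture) for a linear model with squared loss; the data, batch size and step size are made up for illustration:</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                  # inputs x_i
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=1000)    # outputs y_i

w = np.zeros(5)                                  # current parameters w_t
eta = 0.1
for t in range(500):
    idx = rng.integers(0, len(X), size=32)       # sample a minibatch of k points
    xb, yb = X[idx], y[idx]
    y_hat = xb @ w                               # predictions M(x_i | w_t)
    g = 2 * xb.T @ (y_hat - yb) / len(idx)       # gradient of the minibatch squared loss
    w = w - eta * g                              # update w_t to get w_{t+1}

print(np.abs(w - w_true).max())  # small: w approximately recovers w_true
</code></pre>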
<h3 id="regret">Regret</h3>
<p>Convergence can be measured by the regret: the average loss compared to the optimum $w^{*}$</p>
<p>$$ R_{T} = \frac{1}{T} \sum_{t=1}^{T} l_{t} (w_{t}) - \frac{1}{T} \sum_{t=1}^{T} l_{t}(w^{*})$$</p>
<p>We have convergence when $R_{T} \rightarrow 0 \text{ as } T \rightarrow \infty$. This is a very <b>strong</b> requirement: the average regret tending to 0.</p>
<p>In convex optimization, the regret is $R_{T} = O(\frac{1}{\sqrt{T}})$. For convex problems, SGD converges faster when the condition number is better.</p>
<h3 id="momentum">Momentum</h3>
<p>What happens when the gradients become very noisy? To solve this, we can take a running average of the gradients.
$$ \bar{g_{t}} = \gamma \bar{g_{t}} + (1-\gamma)g_{t}$$
Thus the momentum step becomes:
$$w_{t+1} = w_{t}-\eta_{t} \bar{g_{t}}$$
The momentum approach works very well and remains extremely popular to date.</p>
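<p>As a sketch of the update above (not any particular library&rsquo;s implementation), here is momentum on a noisy toy quadratic $F(w) = \frac{1}{2} ||w||^{2}$, with made-up values of $\gamma$ and $\eta$:</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
w = np.ones(10)
g_bar = np.zeros_like(w)
gamma, eta = 0.9, 0.05

for t in range(2000):
    g = w + rng.normal(size=w.shape)             # noisy gradient of F(w) = 0.5 * ||w||^2
    g_bar = gamma * g_bar + (1 - gamma) * g      # running average of the gradients
    w = w - eta * g_bar                          # momentum step

print(np.linalg.norm(w))  # small compared to the initial norm of about 3.16
</code></pre>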
<p>Another way to solve the problem is by using second order methods.
To minimize $F(w)$,</p>
<p>$$F(w) \approx F(w_{t}) + (w - w_{t})^{T} \nabla F(w_{t}) + \frac{1}{2} (w - w_{t})^{T} \nabla^{2} F(w_{t}) (w - w_{t}) $$ (the Taylor expansion up to second order).</p>
<p>The minimum is at: $w_{t+1} = w_{t} - \nabla^{2} F(w_{t})^{-1} \nabla F(w_{t})$</p>
<p>The biggest problem with this is that computing the Hessian $\nabla^{2} F(w_{t})$ is very expensive, as it is an $n \times n$ matrix for $n$ parameters.</p>
<h3 id="adagrad">AdaGrad</h3>
<p>For gradients $g_{s}$,
$$ H_{t} = \Big(\sum_{s\leq{t}} g_{s} g_{s}^{T}\Big)^{\frac{1}{2}} $$</p>
<p>This is used as the matrix for the Mahalanobis metric:
$$ \therefore w_{t+1} = \operatorname{argmin}_{w} \frac{1}{2\eta} ||w - w _{t}|| _{H _{t}}^{2} +\hat{l _{t}}(w) $$</p>
<p>The AdaGrad update rule is: $w_{t+1} = w_{t} - \eta H_{t}^{-1} g_{t}$.
This is again very expensive, $O(n^{2})$ storage and $O(n^{3})$ time complexity per step.</p>
<h4 id="the-solution">The solution</h4>
<p>One way to solve this is the diagonal approximation: take only the diagonal of $H_{t}$ instead of the full matrix. $$H_{t} = \operatorname{diag}{(\sum_{s\leq{t}}g_{s}g_{s}^{T}+\epsilon\operatorname{I})}^{\frac{1}{2}}$$</p>
<p>This takes $O(n)$ space and $O(n)$ time per step.</p>
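<p>A minimal sketch of the diagonal variant (toy quadratic objective; the step size and noise level are made up):</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
w = np.ones(10)
accum = np.zeros(10)            # running sum of squared gradients: diag(sum_s g_s g_s^T)
eta, eps = 0.5, 1e-8

for t in range(2000):
    g = w + 0.1 * rng.normal(size=w.shape)       # noisy gradient of 0.5 * ||w||^2
    accum += g ** 2                              # O(n) storage, O(n) time per step
    w = w - eta * g / np.sqrt(accum + eps)       # per-coordinate adaptive step

print(np.linalg.norm(w))  # small
</code></pre>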
<p>AdaGrad has been so successful that there have been plenty of variants like AdaDelta/RMS Prop and Adam.</p>
<h3 id="full-matrix-preconditioning">Full-matrix Preconditioning</h3>
<h4 id="adagrad-preconditioner">AdaGrad Preconditioner</h4>
<p>For $w_{t}$ of size $100 \times 200$, $g_{t}$ flattens to a 20,000-dimensional vector, and the full preconditioner then becomes 20k x 20k in size.</p>
<h4 id="the-kronecker-product">The Kronecker Product</h4>
<p>Given an $m \times n$ matrix $A$ and a $p \times q$ matrix $B$, their <b>Kronecker Product</b> $C$ is defined as $$C = A \bigotimes B $$
This is also called the matrix direct product, and is an $(mp) \times (nq)$ matrix (every element of $A$ multiplied with $B$). It commutes with the standard matrix product, $(A \otimes B)(A' \otimes B') = (A A') \otimes (B B')$, and with matrix powers.</p>
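<p>A quick numpy check of the dimension claim and the mixed-product property (an illustration only, using <code>np.kron</code>):</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))    # m x n
B = rng.normal(size=(4, 5))    # p x q
K = np.kron(A, B)
print(K.shape)                 # (8, 15), i.e. (m*p) x (n*q)

# Mixed-product property: (A kron B)(D kron E) = (A D) kron (B E)
D = rng.normal(size=(3, 2))
E = rng.normal(size=(5, 4))
print(np.allclose(np.kron(A, B) @ np.kron(D, E), np.kron(A @ D, B @ E)))  # True
</code></pre>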
<h3 id="the-shampoo-preconditioner">The Shampoo Preconditioner</h3>
<figure id="figure-decomposed-matrix">
<img data-src="https://archana1998.github.io/post/vineet-gupta/1_hu82cf2efea7539d0f27dd878118672ff1_8584_2000x2000_fit_lanczos_2.png" class="lazyload" alt="" width="500" height="500">
<figcaption data-pre="Figure " data-post=":" class="numbered">
Decomposed Matrix
</figcaption>
</figure>
<h3 id="the-shampoo-update">The Shampoo Update:</h3>
<p><b>Adagrad update</b>: ${w} _{t+1} ={w} _{t}-\eta H _{t}^{-1} {g} _{t}$</p>
<p><b>Shampoo factorization</b>: $w_{t+1}=w_{t}-\eta\left(L_{t}^{\frac{1}{4}} \otimes R_{t}^{\frac{1}{4}}\right)^{-1} g_{t}$</p>
<p><b>Shampoo update</b>: $W_{t+1}=W_{t}-\eta L_{t}^{-\frac{1}{4}} G_{t} R_{t}^{-\frac{1}{4}}$
<b>Theorem (convergence)</b>:
If ${G} _{1}, \mathrm{G} _{2}, \ldots, \mathrm{G} _{\mathrm{T}}$ are of rank $\leq \mathrm{r},$ then the rate of convergence is:</p>
<p>$$\frac{\sqrt{\mathrm{r}}}{\mathrm{T}} \operatorname{Tr}\left(\mathrm{L} _{\mathrm{T}}^{\frac{1}{4}}\right) \operatorname{Tr}\left(\mathrm{R} _{\mathrm{T}}^{\frac{1}{4}}\right)=\mathrm{O}\left(\frac{1}{\sqrt{\mathrm{T}}}\right)$$</p>
<p>where $R_{t}=\sum_{s \leq t} G_{s}^{\top} G_{s}$ and $L_{t}=\sum_{s \leq t} G_{s} G_{s}^{T}$.</p>
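<p>To connect the factored matrix update with the Kronecker-factored preconditioner above, here is a small numpy check on a randomly generated gradient (a sketch only; the eigendecomposition-based inverse fourth root is just for illustration, not the iterative method used in practice):</p>
<pre><code class="language-python">import numpy as np

def inv_fourth_root(M, eps=1e-6):
    """(M + eps*I)^(-1/4) for a symmetric PSD matrix, via eigendecomposition."""
    vals, vecs = np.linalg.eigh(M + eps * np.eye(M.shape[0]))
    return (vecs * vals ** -0.25) @ vecs.T

rng = np.random.default_rng(0)
G = rng.normal(size=(3, 4))    # gradient G_t for a 3x4 parameter matrix W_t
L = G @ G.T                    # L_t = sum_s G_s G_s^T (a single step here)
R = G.T @ G                    # R_t = sum_s G_s^T G_s

# Shampoo update direction in matrix form: L_t^{-1/4} G_t R_t^{-1/4}
step_matrix = inv_fourth_root(L) @ G @ inv_fourth_root(R)

# The same direction via the Kronecker-factored preconditioner applied
# to the flattened gradient g_t.
precond = np.kron(inv_fourth_root(L), inv_fourth_root(R))
step_flat = precond @ G.reshape(-1)

print(np.allclose(step_matrix.reshape(-1), step_flat))  # True
</code></pre>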
<h3 id="implementing-shampoo">Implementing Shampoo</h3>
<p>The training system can be of two types:</p>
<ul>
<li>Asynchronous (accelerators don&rsquo;t need to talk to each other, however it is hard for the parameter servers to handle)</li>
<li>Synchronous (accelerator sends gradients to all the other accelerators, for them to average and update)</li>
</ul>
<h3 id="challenges">Challenges</h3>
<ul>
<li>Tensorflow and PyTorch focus on 1<sup>st</sup> order optimizations</li>
<li>Computing $L_{t}^{-\frac{1}{4}}$ and $R_{t}^{-\frac{1}{4}}$ is expensive</li>
<li>L, R have large condition numbers (up to the order of 10<sup>13</sup>).</li>
<li>SVD is very expensive: $O(n^{3})$ in largest dimension</li>
<li>Large layers are still impossible to precondition</li>
</ul>
<h3 id="solutions">Solutions</h3>
<ul>
<li>Using high precision arithmetic (float 64), not performing the computations on a TPU.</li>
<li>Computing preconditioners every 1000 steps is alright.</li>
<li>Replace SVD with an iterative method</li>
<li>Only matrix multiplications needed
<ul>
<li>Warm start: use previous preconditioner</li>
<li>Reduce condition number, remove top singular values</li>
</ul>
</li>
<li>Optimization for large layers
<ul>
<li>Precondition only one dimension</li>
<li>Block partitioning the layer works better</li>
</ul>
</li>
</ul>
<h3 id="shampoo-implementation-and-conclusion">Shampoo implementation and conclusion</h3>
<p>Shampoo is implemented on a TPU+CPU. It is a little more expensive per step than AdaGrad but much faster overall (it saves 40% of the training time with 1.95 times fewer steps). Shampoo works well in the language and speech domains, but it isn&rsquo;t suitable for image classification yet (for this, Adam and AdaGrad work much better).</p>
<p>The Shampoo paper can be found <a href ="https://arxiv.org/abs/1802.09568"> here </a></p>
</description>
</item>
<item>
<title>Summer School Series: Lecture 2 by Neil Houlsby</title>
<link>https://archana1998.github.io/post/neil-houlsby/</link>
<pubDate>Fri, 21 Aug 2020 23:11:51 +0530</pubDate>
<guid>https://archana1998.github.io/post/neil-houlsby/</guid>
<description><p><a href="https://research.google/people/NeilHoulsby/">Neil Houlsby</a> presented a great talk on Large Scale Visual Representation Learning and how Google has come up with solutions to some classical problems.</p>
<h3 id="evaluation-of-parameters">Evaluation of parameters</h3>
<p>There are two main ways of evaluating the parameters (representations) extracted by a network. They are:</p>
<ul>
<li>Linear Evaluation: We freeze the weights and retrain the head</li>
<li>Transfer Evaluation: We retrain end to end with new head</li>
</ul>
<h3 id="visual-task-adaptation-benchmark-vtab">Visual Task Adaptation Benchmark (VTAB)</h3>
<p><a href="https://ai.googleblog.com/2019/11/the-visual-task-adaptation-benchmark.html">VTAB</a> is an evaluation protocal designed to measure progress towards general and useful visual representations, and consists of a suite of evaluation vision tasks that a learning algorithm must solve. We mainly have three types of tasks, <b> Natural tasks, Specialized tasks and Structured Datasets. </b></p>
<p>A query that was posed was how useful ImageNet labels would be for pretrained models to work on these three tasks. It has been seen that ImageNet labels work well for Natural images, and not well for the other two tasks.</p>
<p>Representation learners pre-trained on ImageNet can be of three forms:</p>
<ul>
<li>GANs and autoencoders</li>
<li>Self-supervised</li>
<li>Semi-supervised / Supervised approach</li>
</ul>
<p>It has been seen that for natural tasks, representations prove to be more important than obtaining more data, and the supervised approach is far better than the unsupervised approach. For structured tasks, a combination of supervised and self-supervised learning works the best.</p>
<p>It was also mentioned that, by modern standards, ImageNet is incredibly small-scale, thus scaling models on ImageNet was not proven to be effective.</p>
<p>Something to specifically keep in mind is that upstream can be expensive, but downstream should be cheap (in terms of both data and compute). For the upstream, examples of suitable large datasets are ImageNet-21k for supervised learning, and YouTube-8M for self-supervised learning.</p>
<h3 id="bit-l">BiT-L</h3>
<p>Neil introduced the <a href="https://blog.tensorflow.org/2020/05/bigtransfer-bit-state-of-art-transfer-learning-computer-vision.html">Big Transfer Learning (BiT-L)</a> algorithm and talked about it in detail.
The first thing he mentioned about BiT-L was that batch normalization was replaced with <b> group normalization </b> for ultra-large data. Advantages of this were having no train/test discrepancy, and no state which made it easier to co-train with multiple steps.</p>
<p>It was highlighted that optimization at scale implies that schedule is crucial and not obvious. Also, early results of models can be misleading.</p>
<p>To perform cheap transfer, we need low compute, few/no validation data and diverse tasks. For doing few-shot transfer, pretraining on ImageNet-21k and JFT-300M helps.</p>
<h4 id="robustness">Robustness</h4>
<p>Models trained with ImageNet aren&rsquo;t necessarily robust most of the time. To test OOD robustness (Out-Of-Distribution), we use datasets like ImageNet C, ImageNet R and ObjectNet.</p>
<h4 id="modern-transfer-learning">Modern Transfer Learning</h4>
<p>Modern transfer learning calls for a big, labelled dataset, a big model and careful training (using about 10 optimization recipes).
When testing on OOD data, increasing the dataset size with a fixed model leads to an increase in performance, especially in the case of very large models.</p>
<p>To summarize, Bigger transfer $\rightarrow$ Better Accuracy $\rightarrow$ Better Robustness</p>
<ul>
<li>For checking impact on object <b> location </b> invariance, we see accuracy improves and becomes more uniform across location</li>
<li>This proves to be the same in the case of impact on object <b>size</b> invariance</li>
<li>However, in the case of object rotation invariance for ResNet50, it does not become more uniform across rotation angles, but ResNet101x3 maintains uniformity</li>
</ul>
<h3 id="conclusion">Conclusion</h3>
<p>Main takeaways from the talk and BiT-L were:</p>
<ul>
<li>Scale is one of the key drivers of representation learning performance</li>
<li>Especially effective for few-shot learning and OOD Robustness</li>
<li>Also seen and mirrored in language domain</li>
</ul>
<p>Links to the GitHub repositories are: <a href ="https://github.com/google-research/big_transfer"> Big Transfer </a> and <a href="https://github.com/google-research/task_adaptation"> Visual Task Adaptation Benchmark (VTAB)</a></p>
</description>
</item>
<item>
<title>Summer School Series: Lecture 1 by Jean-Philippe Vert</title>
<link>https://archana1998.github.io/post/jean-vert/</link>
<pubDate>Fri, 21 Aug 2020 19:07:17 +0530</pubDate>
<guid>https://archana1998.github.io/post/jean-vert/</guid>
<description><p>This is an article about what <a href ="http://members.cbio.mines-paristech.fr/~jvert/">Jean-Philippe Vert</a> talked about at the Google Research India-AI Summer School 2020. The lecture was titled <b> Differentiable Ranking and Sorting </b> and lasted about 2 hours.</p>
<h3 id="differentiable-programming">Differentiable Programming</h3>
<p>What is machine learning and deep learning?</p>
<p>Machine learning is, roughly, giving training data to a program so that it produces better results on complex problems. For example:</p>
<figure id="figure-a-neural-network-to-recognize-cats-and-dogs">
<img data-src="https://archana1998.github.io/post/jean-vert/fig1_hu2361eef24ba1250aaf0d087e444736ee_322268_2000x2000_fit_lanczos_2.png" class="lazyload" alt="" width="1166" height="626">
<figcaption data-pre="Figure " data-post=":" class="numbered">
A neural network to recognize cats and dogs
</figcaption>
</figure>
<p>These networks usually use <b>vectors</b> to do the computations within the network, however in recent research models are getting extended to non-vector objects (strings, graphs etc.)</p>
<p>Jean then gave an introduction to permutations and rankings and what he aspired to do, informally. Permutations are not vectors/graphs, but something else entirely. Some data are permutations (input, output etc) and some operations may involve ranking (histogram equalization, quantile normalization)</p>
<p>What do these operations aspire to do?</p>
<ul>
<li>Rank pixels</li>
<li>Extract a permutation and assign values to pixels only based on rankings</li>
</ul>
<h3 id="permutations">Permutations</h3>
<p>A permutation is formally defined as a bijection, that is:</p>
<p>$$\sigma:[1, N] \rightarrow[1, N]$$</p>
<ul>
<li>
<p>Over here, $\sigma(i)=$ rank of item $i$</p>
</li>
<li>
<p>The composition property is defined as: $\left(\sigma_{1} \sigma_{2}\right)(i)=\sigma_{1}\left(\sigma_{2}(i)\right)$</p>
</li>
<li>
<p>$\mathrm{S}_{N}$ is the symmetric group and</p>
</li>
<li>
<p>$\left|\mathbb{S}_{N}\right|=N !$</p>
</li>
</ul>
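<p>A small numpy illustration of these definitions (not from the lecture): ranks via a double argsort, and composition of two permutations:</p>
<pre><code class="language-python">import numpy as np

x = np.array([2.1, -0.4, 5.8])
order = np.argsort(x)           # indices of the items in increasing order
ranks = np.argsort(order) + 1   # sigma(i) = rank of item i (1-indexed)
print(ranks)                    # [2 1 3]

# Composition: (sigma1 sigma2)(i) = sigma1(sigma2(i)), here 0-indexed
sigma1 = np.array([2, 0, 1])
sigma2 = np.array([2, 1, 0])
print(sigma1[sigma2])           # [1 0 2], another permutation
</code></pre>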
<h3 id="goal">Goal</h3>
<p>Our primary goal is:</p>
<figure id="figure-moving-between-spaces">
<img data-src="https://archana1998.github.io/post/jean-vert/2_huacbb429b45f880db3bfef78880895f1e_33174_2000x2000_fit_lanczos_2.png" class="lazyload" alt="" width="1021" height="294">
<figcaption data-pre="Figure " data-post=":" class="numbered">
Moving between spaces
</figcaption>
</figure>
<p>Some definitions here are:</p>
<ol>
<li>Embed:</li>
</ol>
<ul>
<li>To define/optimize $f_{\theta}(\sigma)=g_{\theta}($embed$(\sigma))$ for $\sigma \in \mathbb{S}_{N}$</li>
<li>E.g., $\sigma$ given as input or output</li>
</ul>
<ol start="2">
<li>Differentiate:</li>
</ol>
<ul>
<li>To define/optimize $h_{\theta}(x)=f_{\theta}($argsort$(x))$ for $x \in \mathbb{R}^{n}$</li>
<li>E.g., normalization layer or rank-based loss</li>
</ul>
<h3 id="argmax">Argmax</h3>
<p>To put it in simple words, the argmax function identifies the dimension in a vector with the largest value. For example, $\operatorname{argmax}(2.1, -0.4, 5.8) = 3$</p>
<p>It is not differentiable because:</p>
<ul>
<li>As a function, $\mathbb{R}^{n} \rightarrow[1,n]$, the output space is <b> not continuous </b></li>
<li>It is <b>piecewise constant</b> (i.e, gradient = 0 almost everywhere even if the output space was continuous)</li>
</ul>
<h3 id="softmax">Softmax</h3>
<p>It is a <b>differentiable</b> function that maps from $\mathbb{R}^{n} \rightarrow \mathbb{R}^{n}$, where</p>
<p>$$\operatorname{softmax}_ {\epsilon} (x)_ {i} =\frac{e^{x_{i} / \epsilon}}{\sum_{j=1}^{n} e^{x_{j} / \epsilon}}$$</p>
<p>For example, $\operatorname{softmax}(2.1, -0.4, 5.8) = (0.024, 0.002, 0.974)$</p>
<h3 id="moving-from-softmax-to-argmax">Moving from Softmax to Argmax</h3>
<p>$$\lim _ {\epsilon \rightarrow 0} \operatorname{softmax}_{\epsilon}(2.1,-0.4, 5.8)=(0,0,1)=\Psi(3)$$</p>
<p>where $\psi:[1, n] \rightarrow \mathbb{R}^{n}$ is the one-hot encoding. More generally,
$$
\forall x \in \mathbb{R}^{n}, \quad \lim_ {\epsilon \rightarrow 0} \operatorname{softmax}_{\epsilon}(x)=\Psi(\operatorname{argmax}(x))
$$</p>
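<p>A short numerical check of this limit (an illustration only, with $\epsilon$ playing the role of a temperature):</p>
<pre><code class="language-python">import numpy as np

def softmax_eps(x, eps):
    """softmax_eps(x)_i = exp(x_i / eps) / sum_j exp(x_j / eps)."""
    z = np.exp((x - np.max(x)) / eps)   # subtract the max for numerical stability
    return z / z.sum()

x = np.array([2.1, -0.4, 5.8])
for eps in (1.0, 0.1, 0.01):
    print(eps, softmax_eps(x, eps).round(3))
# As eps -> 0 the output approaches the one-hot encoding (0, 0, 1) = Psi(argmax(x)).
</code></pre>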
<h3 id="moving-from-argmax-to-softmax">Moving from Argmax to Softmax</h3>
<h4 id="1-embedding">1. Embedding</h4>
<p>Let the simplex
$$
\Delta_{n-1}=\operatorname{conv}(\{\Psi(y): y \in[1, n]\})
$$
Then we have a variational characterization (exercise left to us):
$$
\Psi(\operatorname{argmax}(x))=\underset{z \in \Delta_{n-1}}{\operatorname{argmax}}\left(x^{\top} z\right)
$$</p>
<figure id="figure-simplex-representation">