<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<!-- Meta tags for social media banners, these should be filled in appropriately as they are your "business card" -->
<!-- Replace the content tag with appropriate information -->
<meta name="description" content="Self-supervised learning has driven significant progress in learning from single-subject, iconic images. However, there are still unanswered questions about the use of minimally-curated, naturalistic video data, which contain dense scenes with many independent objects, imbalanced class distributions, and varying object sizes. In this paper, we propose a novel approach that combines an invariance-based SSL objective on pooled representations with a dense SSL objective that enforces equivariance to optical flow warping. Our findings indicate that a unified objective applied at multiple feature scales is essential for learning effective image representations from high-resolution, naturalistic videos. We validate our approach on the BDD100K driving video dataset and the Walking Tours first-person video dataset, demonstrating its ability to capture spatial understanding from a dense objective and semantic understanding via a pooled representation objective.">
<meta property="og:title" content="PooDLe: Pooled and dense self-supervised learning from naturalistic videos"/>
<meta property="og:description" content="Self-supervised learning has driven significant progress in learning from single-subject, iconic images. However, there are still unanswered questions about the use of minimally-curated, naturalistic video data, which contain dense scenes with many independent objects, imbalanced class distributions, and varying object sizes. In this paper, we propose a novel approach that combines an invariance-based SSL objective on pooled representations with a dense SSL objective that enforces equivariance to optical flow warping. Our findings indicate that a unified objective applied at multiple feature scales is essential for learning effective image representations from high-resolution, naturalistic videos. We validate our approach on the BDD100K driving video dataset and the Walking Tours first-person video dataset, demonstrating its ability to capture spatial understanding from a dense objective and semantic understanding via a pooled representation objective."/>
<meta property="og:url" content="https://poodle-ssl.github.io/"/>
<!-- Path to banner image, should be in the path listed below. Optimal dimensions are 1200x630 -->
<meta property="og:image" content="static/image/method.png" />
<meta property="og:image:width" content="1200"/>
<meta property="og:image:height" content="630"/>
<meta name="twitter:title" content="PooDLe: Pooled and dense self-supervised learning from naturalistic videos">
<meta name="twitter:description" content="Self-supervised learning has driven significant progress in learning from single-subject, iconic images. However, there are still unanswered questions about the use of minimally-curated, naturalistic video data, which contain dense scenes with many independent objects, imbalanced class distributions, and varying object sizes. In this paper, we propose a novel approach that combines an invariance-based SSL objective on pooled representations with a dense SSL objective that enforces equivariance to optical flow warping. Our findings indicate that a unified objective applied at multiple feature scales is essential for learning effective image representations from high-resolution, naturalistic videos. We validate our approach on the BDD100K driving video dataset and the Walking Tours first-person video dataset, demonstrating its ability to capture spatial understanding from a dense objective and semantic understanding via a pooled representation objective.">
<!-- Path to banner image, should be in the path listed below. Optimal dimensions are 1200x600 -->
<meta name="twitter:image" content="static/images/method.png">
<meta name="twitter:card" content="summary_large_image">
<!-- Keywords for your paper to be indexed by-->
<meta name="keywords" content=" computer vision, representation learning, self-supervised learning, egocentric video, visual representation">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>PooDLe</title>
<link rel="icon" type="image/x-icon" href="static/images/poodle.ico">
<link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro"
rel="stylesheet">
<link rel="stylesheet" href="static/css/bulma.min.css">
<link rel="stylesheet" href="static/css/bulma-carousel.min.css">
<link rel="stylesheet" href="static/css/bulma-slider.min.css">
<link rel="stylesheet" href="static/css/fontawesome.all.min.css">
<link rel="stylesheet"
href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
<link rel="stylesheet" href="static/css/index.css">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<script src="https://documentcloud.adobe.com/view-sdk/main.js"></script>
<script defer src="static/js/fontawesome.all.min.js"></script>
<script src="static/js/bulma-carousel.min.js"></script>
<script src="static/js/bulma-slider.min.js"></script>
<script src="static/js/index.js"></script>
</head>
<body>
<section class="hero">
<div class="hero-body">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column has-text-centered">
<h1 class="title is-1 publication-title">PooDLe<img src="static/images/poodle.ico">: Pooled and dense self-supervised learning from naturalistic videos</h1>
<div class="is-size-5 publication-authors">
<!-- Paper authors -->
<span class="author-block">
<a href="https://www.alexn.wang/" target="_blank">Alex N. Wang</a><sup>*,1</sup>,</span>
<span class="author-block">
<a href="https://www.chrishoang.com/" target="_blank">Chris Hoang</a><sup>*,1</sup>,</span>
<span class="author-block">
<a href="https://www.cs.toronto.edu/~yuwen/" target="_blank">Yuwen Xiong</a>,</span>
<span class="author-block">
<a href="http://yann.lecun.com/" target="_blank">Yann LeCun</a><sup>1,2</sup>,</span>
<span class="author-block">
<a href="https://mengyeren.com/" target="_blank">Mengye Ren</a><sup>1</sup></span>
</div>
<div class="is-size-5 publication-authors">
<span class="author-block"><small><sup>1</sup>New York University, <sup>2</sup>Meta</small></span>
<!-- <br>Conferance name and year</span> -->
<span class="eql-cntrb"><small><br><sup>*</sup>Indicates equal contribution</small></span>
</div>
<div class="column has-text-centered">
<div class="publication-links">
<!-- Arxiv PDF link -->
<span class="link-block">
<a href="https://arxiv.org/abs/2408.11208" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fas fa-file-pdf"></i>
</span>
<span>Paper</span>
</a>
</span>
<!-- Supplementary PDF link -->
<!-- <span class="link-block">
<a href="static/pdfs/supplementary_material.pdf" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fas fa-file-pdf"></i>
</span>
<span>Supplementary</span>
</a>
</span> -->
<!-- Github link -->
<span class="link-block">
<a href="https://poodle-ssl.github.io/" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fab fa-github"></i>
</span>
<span>Code coming soon!</span>
</a>
</span>
<!-- ArXiv abstract Link -->
<!-- <span class="link-block">
<a href="https://arxiv.org/abs/<ARXIV PAPER ID>" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="ai ai-arxiv"></i>
</span>
<span>arXiv</span>
</a>
</span> -->
</div>
</div>
</div>
</div>
</div>
</div>
</section>
<!-- Teaser video-->
<!-- <section class="hero teaser">
<div class="container is-max-desktop">
<div class="hero-body">
<video poster="" id="tree" autoplay controls muted loop height="100%">
<source src="static/videos/banner_video.mp4"
type="video/mp4">
</video>
<h2 class="subtitle has-text-centered">
Aliquam vitae elit ullamcorper tellus egestas pellentesque. Ut lacus tellus, maximus vel lectus at, placerat pretium mi. Maecenas dignissim tincidunt vestibulum. Sed consequat hendrerit nisl ut maximus.
</h2>
</div>
</div>
</section> -->
<!-- End teaser video -->
<!-- Paper abstract -->
<!-- <section class="section hero is-light">
<div class="container is-max-desktop">
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Abstract</h2>
<div class="content has-text-justified">
<p>
Self-supervised learning has driven significant progress in learning from single-subject iconic images. However, there remain unanswered questions regarding the use of minimally curated naturalistic video data, particularly concerning object size imbalance and the preservation of geometric details in learned representations. In this paper, we propose a novel approach that combines the traditional SSL objective on pooled representations with a dense spatial objective aligned by external optical flow predictions. Our findings indicate that a unified objective, extended across multiple feature scales, is essential for effectively learning about objects of varying scales present in high-resolution naturalistic videos. We validate our approach on the BDD100K driving video dataset and the WalkingTours first-person video dataset, demonstrating its ability to capture both the fine-grained understanding from a dense objective and the semantic understanding facilitated from pooled representation learning.
</p>
</div>
</div>
</div>
</div>
</section> -->
<!-- End paper abstract -->
<!-- Problem -->
<section class="hero is-small">
<div class="hero-body">
<div class="container">
<div class="columns is-centered">
<div class="column is-four-fifths">
<h2 class="title">Problem: current SSL methods rely on iconic data assumptions</h2>
<hr style="border: 0; border-top: 1px solid lightgray;">
</div>
</div>
<div class="columns is-centered image-container">
<div class="column is-two-fifths">
<figure class="image" style="height: 0; padding-bottom: 75%; position: relative;">
<img src="static/images/imagenet-image.jpeg" alt="Iconic image" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; object-fit: cover;">
</figure>
<figcaption class="has-text-centered" style="margin-top: 10px;">Iconic image from ImageNet</figcaption>
</div>
<div class="column is-two-fifths">
<figure class="image" style="height: 0; padding-bottom: 75%; position: relative;">
<img src="static/images/bdd-image.png" alt="Naturalistic video" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; object-fit: cover;">
</figure>
<figcaption class="has-text-centered" style="margin-top: 10px;">Scene image from BDD100K driving video</figcaption>
</div>
</div>
<div class="columns is-centered">
<div class="column is-size-5 is-four-fifths">
Self-supervised learning (SSL) is able to learn visual representations without manual labels, enabling the use of large-scale internet data such as naturalistic video for training.
<!-- These methods could allow us to harness large-scale internet data such as naturalistic in-the-wild videos for training powerful vision models. -->
However, many SSL methods still revolve around the ImageNet dataset, which consists of iconic images with a single central subject and a balanced class distribution.
<!-- Nevertheless, many SSL methods still revolve around the typical ImageNet dataset and may implicitly rely on its biases. -->
<!-- In particular, the dataset consists of iconic images (shown above left) which contain a single central subject, and the dataset provides a balanced class distribution of subjects. -->
In contrast, naturalistic videos often contain scenes with multiple objects, imbalanced class distributions, and varying object scales, making them ill-suited for iconic SSL methods.
</div>
</div>
<div class="columns is-centered image-container">
<div class="column is-two-fifths">
<figure class="image" style="height: 0; padding-bottom: 75%; position: relative;">
<img src="static/images/bdd-semseg.png" alt="BDD semantic segmentation" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; object-fit: contain;">
</figure>
<figcaption class="has-text-centered" style="margin-top: 10px;">Semantic segmentation map for multi-object scene from BDD100K</figcaption>
</div>
<div class="column is-two-fifths">
<figure class="image" style="height: 0; padding-bottom: 75%; position: relative;">
<img src="static/images/class-distribution-colored.png" alt="BDD class distribution" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; object-fit: contain;">
</figure>
<figcaption class="has-text-centered" style="margin-top: 10px;">Crucial foreground objects only represent a small proportion of pixels</figcaption>
</div>
</div>
<div class="columns is-centered">
<div class="column is-size-5 is-four-fifths">
Iconic SSL methods learn pooled features from large crops of an image, which may not be effective for multi-object scenes where each crop may contain different objects.
Recently proposed dense SSL objectives distinguish between objects by learning 2D feature maps that maintain spatial information.
<!-- Iconic SSL methods typically learn representations by minimizing the distance between the pooled features of two augmented global crops of the same image. -->
<!-- This approach may not be effective for multi-object scenes because the model would try to learn incorrect invariances between two augmented views which are likely to contain semantically different object instances. -->
<!-- Recent works have proposed dense SSL objectives that can differentiate between different objects by learning 2D feature representations which maintain spatial information. -->
Nevertheless, dense SSL can suffer from spatial region imbalance, where the model is incentivized to focus on the background rather than smaller foreground objects.
To bootstrap learning of foreground objects, current methods for naturalistic video still rely on globally-pooled iconic objectives or iconic datasets.
<!-- For example, in the BDD100K dataset of driving videos, traffic signs on average occupy only 0.2% of pixels per video frame and thus, may be underrepresented in dense SSL objectives yet traffic signs are crucial for self-driving applications. -->
<!-- Overall, there remains a lack of SSL methods that can leverage both dense SSL and object-centric semantic learning to effectively learn visual representations from naturalistic videos. -->
</div>
</div>
</div>
</div>
</section>
<!-- End problem -->
<!-- Method -->
<section class="hero is-small">
<div class="hero-body">
<div class="container">
<div class="columns is-centered">
<div class="column is-four-fifths">
<h2 class="title">Method: Pooled and Dense Learning from naturalistic videos</h2>
<hr style="border: 0; border-top: 1px solid lightgray;">
</div>
</div>
<div class="columns is-centered image-container">
<div class="column is-four-fifths is-centered">
<figure>
<img src="static/images/method.png" alt="Method" class="image">
</figure>
<figcaption class="has-text-centered" style="margin-top: 10px;">Overview of PooDLe. <span style="color: green;">Green path</span>: dense objective applied to 2D feature maps of full paired frames outputted by spatial decoder. <span style="color: orange;">Orange path</span>: pooled objective applied to pooled features of flow-aware subcrops outputted by encoder</figcaption>
</div>
</div>
<div class="columns is-centered">
<div class="column is-size-5 is-four-fifths">
We propose PooDLe, an SSL method that combines a dense flow equivariance objective and a pooled invariance objective.
The dense objective captures spatial information by learning features that are equivariant to flow, or the motion between frames, at the scene level.
Conversely, the pooled objective learns high-level object semantics from small subcrops, which act as <em>pseudo-iconic</em> data.
The two objectives are unified within a single architecture that uses a lightweight spatial decoder to upsample high-level features into fine-grained feature maps.
</div>
</div>
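<div class="columns is-centered">
<div class="column is-size-5 is-four-fifths">
To make the two objectives concrete, below is a minimal PyTorch-style sketch of one training step. The module names (encoder, spatial decoder, predictors), the negative-cosine losses, and the bilinear flow-warping details are illustrative assumptions, not the exact implementation.
</div>
</div>
<div class="columns is-centered">
<div class="column is-four-fifths">
<pre><code># Hypothetical sketch of one PooDLe-style training step in PyTorch.
# Module names (encoder, spatial_decoder, predictor, pool_predictor), the
# negative-cosine losses, and the warping details are illustrative assumptions.
import torch
import torch.nn.functional as F

def neg_cosine(p, z):
    # Negative cosine similarity with a stop-gradient on the target branch.
    p = F.normalize(p, dim=1)
    z = F.normalize(z.detach(), dim=1)
    return -(p * z).sum(dim=1).mean()

def warp_with_flow(feat, flow):
    # Warp a feature map (B, C, H, W) with optical flow (B, 2, H, W) given in
    # pixel units, using bilinear sampling via grid_sample.
    B, C, H, W = feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=feat.device, dtype=feat.dtype),
        torch.arange(W, device=feat.device, dtype=feat.dtype),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow[:, 0]
    grid_y = ys.unsqueeze(0) + flow[:, 1]
    grid = torch.stack(  # normalize coordinates to [-1, 1] for grid_sample
        (2 * grid_x / (W - 1) - 1, 2 * grid_y / (H - 1) - 1), dim=-1
    )
    return F.grid_sample(feat, grid, align_corners=True)

def poodle_step(frame_t, frame_tp1, flow, crop_t, crop_tp1,
                encoder, spatial_decoder, predictor, pool_predictor):
    # Dense branch: full frames -> encoder -> spatial decoder -> 2D feature maps
    # (skip connections into the decoder are omitted here for brevity).
    dense_t = spatial_decoder(encoder(frame_t))
    dense_tp1 = spatial_decoder(encoder(frame_tp1))
    # Flow equivariance: frame-t features, warped by the flow, should match
    # frame-(t+1) features at the same spatial locations.
    dense_loss = neg_cosine(predictor(warp_with_flow(dense_t, flow)), dense_tp1)

    # Pooled branch: flow-aligned subcrops of the two frames act as
    # pseudo-iconic views; global average pooling gives one vector per crop.
    z_t = encoder(crop_t).mean(dim=(2, 3))
    z_tp1 = encoder(crop_tp1).mean(dim=(2, 3))
    pooled_loss = neg_cosine(pool_predictor(z_t), z_tp1)

    return dense_loss + pooled_loss
</code></pre>
</div>
</div>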
<div class="columns is-centered image-container">
<div class="column is-four-fifths is-centered">
<figure>
<img src="static/images/subcrop.png" alt="Method" class="image" style="display: block; margin-left: auto; margin-right: auto;">
</figure>
<figcaption class="has-text-centered" style="margin-top: 10px;">Probability of random subcrop covering ≥ 10% of an object versus pixel-level probability for varying object sizes. Graph data generated using simulation in toy and empirical settings</figcaption>
</div>
</div>
<div class="columns is-centered">
<div class="column is-size-5 is-four-fifths">
We hypothesize that using small subcrops can mitigate spatial region imbalance because the percentage of subcrops that have sufficient coverage of small objects will be significantly higher than the percentage of pixels that contain those objects.
We believe that the threshold for sufficient coverage can be relatively small because foreground objects have higher information content than repetitive background textures, and thus their information is preserved in pooled representations.
The graph above shows that the relative difference between subcrop hit probability and pixel probability is greater for smaller objects and tapers off for larger objects.
<!-- We hypothesize that using small subcrops can mitigate spatial region imbalance because the percentage of subcrops that have sufficient coverage of small objects will be significantly higher than the percentage of pixels that containing those objects.
We define <em>sufficient coverage</em> as when a subcrop contains ≥10% of an object, which we believe is a reasonable threshold as foreground objects have higher information content compared to repetitive background textures and thus, their information is preserved in pooled representations.
Using data generated from a toy model of subcrop coverage of foreground objects, the graph above shows that not only is the subcrop probability higher than the pixel probability, but also the relative proportion of subcrop-to-pixel probability (green labels) is much greater for smaller objects. -->
</div>
</div>
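<div class="columns is-centered">
<div class="column is-size-5 is-four-fifths">
As a rough illustration of this intuition, the following toy Monte Carlo sketch (our own illustrative setup, not the exact simulation behind the graph above) compares the probability that a random subcrop covers ≥ 10% of a small square object against the fraction of pixels that object occupies.
</div>
</div>
<div class="columns is-centered">
<div class="column is-four-fifths">
<pre><code># Toy Monte Carlo sketch (illustrative assumption, not the paper's exact
# simulation): probability that a random square subcrop covers at least 10%
# of a square object, versus the fraction of pixels the object occupies.
import random

def subcrop_vs_pixel_probability(obj_frac, crop_frac=0.2, trials=20_000):
    """obj_frac and crop_frac are side lengths relative to a unit-square frame."""
    pixel_prob = obj_frac ** 2  # fraction of frame pixels belonging to the object
    hits = 0
    for _ in range(trials):
        # Place the object and the subcrop uniformly at random in the frame.
        ox, oy = random.uniform(0, 1 - obj_frac), random.uniform(0, 1 - obj_frac)
        cx, cy = random.uniform(0, 1 - crop_frac), random.uniform(0, 1 - crop_frac)
        # Overlap area between the two axis-aligned squares.
        w = max(0.0, min(ox + obj_frac, cx + crop_frac) - max(ox, cx))
        h = max(0.0, min(oy + obj_frac, cy + crop_frac) - max(oy, cy))
        if w * h >= 0.1 * obj_frac ** 2:  # subcrop covers at least 10% of the object
            hits += 1
    return hits / trials, pixel_prob

for frac in (0.02, 0.05, 0.1, 0.2):
    crop_prob, pix_prob = subcrop_vs_pixel_probability(frac)
    print(f"object side {frac:.2f}: subcrop hit prob {crop_prob:.3f} vs pixel prob {pix_prob:.4f}")
</code></pre>
</div>
</div>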
<div class="columns is-centered image-container">
<div class="column is-two-fifths">
<figure class="image" style="height: 0; padding-bottom: 75%; position: relative;">
<img src="static/images/top-only-decoder.png" alt="Base architecture" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; object-fit: contain;">
</figure>
<figcaption class="has-text-centered" style="margin-top: 10px;">Naive: place both objectives at last encoder layer</figcaption>
</div>
<div class="column is-two-fifths">
<figure class="image" style="height: 0; padding-bottom: 75%; position: relative;">
<img src="static/images/top-down-decoder.png" alt="Spatial decoder architecture" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; object-fit: contain;">
</figure>
<figcaption class="has-text-centered" style="margin-top: 10px;">PooDLe: pooled objective on last encoder layer; dense objective on high-resolution output from spatial decoder</figcaption>
</div>
</div>
<div class="columns is-centered">
<div class="column is-size-5 is-four-fifths">
How do we combine the pooled and dense objectives within a single architecture?
Our intuition is that the pooled objective should operate on high-level semantic features while the dense objective should operate on high-resolution features that preserve small objects and fine details.
This leads us to introduce a lightweight spatial decoder that leverages skip connections to earlier layers to upsample features from the last encoder layer into high-resolution feature maps.
We believe the last encoder layer serves as an information bottleneck, as the features need to capture high-level object invariance for the pooled objective, but must also preserve spatial information for the spatial decoder and dense objective.
<!-- Consider a typical convolutional encoder network with multiple feature levels of decreasing resolution scale, such as ResNet-50.
A naive baseline (shown above left) is to apply both objectives at the last feature level.
However, the downsampling effect of the encoder (e.g. 32x in ResNet-50) limits the preservation of small objects and fine details for the dense objective.
To remedy this, we introduce a spatial decoder module (shown above right) that leverages skip connections to earlier feature layers to upsample and refine the last-layer encoder features into a higher-resolution feature map to be used for the dense objective.
We keep the pooled objective at the last encoder layer following prior SSL methods.
Our intuition is that the last-layer encoder features serve as an information bottleneck, because they need to capture high-level object invariances to optimize the pooled objective, but also must preserve sufficient spatial information to be readily upsampled by the spatial decoder and ultimately, satisfy the dense objective. -->
</div>
</div>
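<div class="columns is-centered">
<div class="column is-size-5 is-four-fifths">
As a concrete illustration, here is a minimal sketch of such a spatial decoder, assuming a ResNet-style encoder that exposes intermediate feature maps; the channel widths, number of skip levels, and fusion scheme are illustrative assumptions rather than the exact design.
</div>
</div>
<div class="columns is-centered">
<div class="column is-four-fifths">
<pre><code># Minimal sketch of a lightweight spatial decoder with skip connections.
# Channel widths and the fusion scheme are illustrative assumptions.
import torch.nn as nn
import torch.nn.functional as F

class SpatialDecoder(nn.Module):
    def __init__(self, top_dim=2048, skip_dims=(1024, 512), out_dim=256):
        super().__init__()
        self.top_proj = nn.Conv2d(top_dim, out_dim, kernel_size=1)
        # 1x1 projections for skip connections from earlier encoder layers.
        self.skip_projs = nn.ModuleList(
            nn.Conv2d(d, out_dim, kernel_size=1) for d in skip_dims
        )
        self.refine = nn.ModuleList(
            nn.Conv2d(out_dim, out_dim, kernel_size=3, padding=1) for _ in skip_dims
        )

    def forward(self, top_feat, skip_feats):
        # top_feat: last encoder layer (e.g. 1/32 resolution).
        # skip_feats: earlier encoder layers at increasing resolution (1/16, 1/8, ...).
        x = self.top_proj(top_feat)
        for proj, refine, skip in zip(self.skip_projs, self.refine, skip_feats):
            # Upsample to the skip resolution, fuse, then refine.
            x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=False)
            x = refine(x + proj(skip))
        return x  # high-resolution feature map used by the dense objective
</code></pre>
</div>
</div>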
</div>
</div>
</section>
<!-- End method -->
<!-- Results -->
<section class="hero is-small">
<div class="hero-body">
<div class="container">
<div class="columns is-centered">
<div class="column is-four-fifths">
<h2 class="title">Results: semantic segmentation and object detection</h2>
<hr style="border: 0; border-top: 1px solid lightgray;">
</div>
</div>
<div class="columns is-centered">
<div class="column is-four-fifths is-centered">
<figure>
<video poster="" id="tree" autoplay controls muted loop height="100%">
<source src="static/videos/semseg-comparison.mov"
type="video/mp4">
</video>
</figure>
<figcaption class="has-text-centered" style="margin-top: 10px;">Comparison of methods on semantic segmentation linear readout</figcaption>
</div>
</div>
<div class="columns is-centered image-container">
<div class="column is-four-fifths is-centered">
<figure>
<img src="static/images/bdd-results-table.png" alt="BDD Results" class="image">
</figure>
<figcaption class="has-text-centered" style="margin-top: 10px;">BDD100K semantic segmentation and object detection and Cityscapes semantic segmentation results using either lightweight or heavier readout headers. *Pretrained on BDD, initialized with supervised IN1K weights</figcaption>
</div>
</div>
<div class="columns is-centered">
<div class="column is-size-5 is-four-fifths">
PooDLe outperforms all BDD-pretrained baselines by a significant margin on in-distribution BDD semantic segmentation and object detection, as well as on transfer to Cityscapes semantic segmentation.
PooDLe also surpasses ImageNet (IN1K) supervised pretraining despite the latter's advantages of a class-balanced dataset with iconic views of objects.
In addition, we may pretrain PooDLe on BDD with weights initialized from the IN1K supervised checkpoint to further improve performance.
In the video, the IN1K supervised baseline generates noisy segmentation boundaries and the FlowE baseline struggles with small objects such as traffic lights, while PooDLe produces cleaner boundaries and picks up more small objects.
<!-- We observe that for BDD100K pretraining, PooDLe outperforms all baselines on all three tasks by a significant margin.
BDD-pretrained PooDLe is also competitive with IN1K-pretrained models despite the spatial region and class imbalance of BDD's naturalistic videos.
When initialized with weights from IN1K supervised classification pretraining and further pretrained on BDD100K, we observe that PooDLe is able to match or slightly outperform the strongest IN1K-pretrained models. -->
</div>
</div>
<div class="columns is-centered image-container">
<div class="column is-three-fifths is-centered">
<figure>
<img src="static/images/wt-results-table.png" alt="ADE Results" class="image">
</figure>
<figcaption class="has-text-centered" style="margin-top: 10px;">ADE20K semantic segmentation results using either linear readout or UperNet finetuneing</figcaption>
</div>
</div>
<div class="columns is-centered">
<div class="column is-size-5 is-four-fifths">
We also experiment with pretraining on the recent <a href="https://huggingface.co/datasets/shawshankvkt/Walking_Tours" target="_blank">WalkingTours</a> (WT) dataset, a first-person video dataset.
PooDLe outperforms all WT-pretrained baselines and is competitive with IN1K-pretrained DINO when pretrained on WT<sub>all</sub>.
</div>
</div>
<div class="columns is-centered image-container">
<div class="column is-three-fifths is-centered">
<figure>
<img src="static/images/grouping-results-table.png" alt="Grouping Results" class="image">
</figure>
<figcaption class="has-text-centered" style="margin-top: 10px;">BDD100K semantic semgmentation linear readout results with classes grouped by average pixel size (small vs. large) or occurrence frequency (rare vs. common)</figcaption>
</div>
</div>
<div class="columns is-centered">
<div class="column is-size-5 is-four-fifths">
<!-- We further analyze the BDD semantic segmentation linear readout results by grouping classes by either their average pixel size (small vs. large) or their occurrence proportion (rare vs. common). -->
PooDLe improves over FlowE, a dense SSL method, on small classes while maintaining strong performance on large classes, suggesting that PooDLe is able to learn better representations across object scales.
Conversely, PooDLe improves over IN1K supervised pretraining on large and common classes.
When initialized from IN1K supervised weights, PooDLe is able to retain the strong performance on small and rare classes that likely stems from the class-balanced distribution of IN1K.
<!-- We observe that FlowE, a dense SSL method, performs well on large classes, but poorly on small classes due to spatial region imbalance.
PooDLe improves over FlowE on small classes, while maintaining strong performance on large classes, suggesting that PooDLe is able to learn better representations across object scales. -->
<!-- On the other hand, IN1K supervised pretraining underperforms on large and common classes, but outperforms on small and rare classes, which we hypothesize is due to the lack of spatial learning and the class balanced distribution of IN1K respectively.
When initialized from IN1K supervised weights, BDD-pretrained PooDLe is able to retain the strong performance on small and rare classes while also improving on large and common classes. -->
</div>
</div>
</div>
</div>
</section>
<!-- End results -->
<!-- Conclusion -->
<section class="hero is-small">
<div class="hero-body">
<div class="container">
<div class="columns is-centered">
<div class="column is-four-fifths">
<h2 class="title">Conclusion: further exploration of learning from naturalistic video</h2>
<hr style="border: 0; border-top: 1px solid lightgray;">
</div>
</div>
<div class="columns is-centered">
<div class="column is-size-5 is-four-fifths">
We have proposed PooDLe, an SSL method that combines pooled invariance and dense flow equivariance objectives to learn visual representations from naturalistic videos.
We hope that this work will motivate further exploration on how to leverage naturalistic visual data for training next-generation vision models.
</div>
</div>
</div>
</div>
</section>
<!-- End conclusion -->
<!-- Image carousel -->
<!-- <section class="hero is-small">
<div class="hero-body">
<div class="container">
<div id="results-carousel" class="carousel results-carousel">
<div class="item">
<img src="static/images/carousel1.jpg" alt="MY ALT TEXT"/>
<h2 class="subtitle has-text-centered">
First image description.
</h2>
</div>
<div class="item">
<img src="static/images/carousel2.jpg" alt="MY ALT TEXT"/>
<h2 class="subtitle has-text-centered">
Second image description.
</h2>
</div>
<div class="item">
<img src="static/images/carousel3.jpg" alt="MY ALT TEXT"/>
<h2 class="subtitle has-text-centered">
Third image description.
</h2>
</div>
<div class="item">
<img src="static/images/carousel4.jpg" alt="MY ALT TEXT"/>
<h2 class="subtitle has-text-centered">
Fourth image description.
</h2>
</div>
</div>
</div>
</div>
</section> -->
<!-- End image carousel -->
<!-- Youtube video -->
<!-- <section class="hero is-small is-light">
<div class="hero-body">
<div class="container">
<h2 class="title is-3">Video Presentation</h2>
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<div class="publication-video">
<iframe src="https://www.youtube.com/embed/JkaxUblCGz0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>
</div>
</div>
</div>
</div>
</div>
</section> -->
<!-- End youtube video -->
<!-- Video carousel -->
<!-- <section class="hero is-small">
<div class="hero-body">
<div class="container">
<h2 class="title is-3">Another Carousel</h2>
<div id="results-carousel" class="carousel results-carousel">
<div class="item item-video1">
<video poster="" id="video1" autoplay controls muted loop height="100%">
<source src="static/videos/carousel1.mp4"
type="video/mp4">
</video>
</div>
<div class="item item-video2">
<video poster="" id="video2" autoplay controls muted loop height="100%">
<source src="static/videos/carousel2.mp4"
type="video/mp4">
</video>
</div>
<div class="item item-video3">
<video poster="" id="video3" autoplay controls muted loop height="100%">\
<source src="static/videos/carousel3.mp4"
type="video/mp4">
</video>
</div>
</div>
</div>
</div>
</section> -->
<!-- End video carousel -->
<!-- Paper poster -->
<!-- <section class="hero is-small is-light">
<div class="hero-body">
<div class="container">
<h2 class="title">Poster</h2>
<iframe src="static/pdfs/sample.pdf" width="100%" height="550">
</iframe>
</div>
</div>
</section> -->
<!--End paper poster -->
<!--BibTex citation -->
<section class="section" id="BibTeX">
<div class="container is-max-desktop content">
<h2 class="title">BibTeX</h2>
<pre><code>@misc{wanghoang2024poodle,
title={PooDLe: Pooled and dense self-supervised learning from naturalistic videos},
author={Alex N. Wang and Chris Hoang and Yuwen Xiong and Yann LeCun and Mengye Ren},
year={2024},
eprint={2408.11208},
archivePrefix={arXiv},
primaryClass={cs.CV}
}</code></pre>
</div>
</section>
<!--End BibTex citation -->
<footer class="footer">
<div class="container">
<div class="columns is-centered">
<div class="column is-8">
<div class="content">
<p>
This page was built using the <a href="https://github.com/eliahuhorwitz/Academic-project-page-template" target="_blank">Academic Project Page Template</a> which was adopted from the <a href="https://nerfies.github.io" target="_blank">Nerfies</a> project page.
You are free to borrow the source code of this website; we just ask that you link back to this page in the footer. <br> This website is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/" target="_blank">Creative
Commons Attribution-ShareAlike 4.0 International License</a>.
</p>
</div>
</div>
</div>
</div>
</footer>
<!-- Statcounter tracking code -->
<!-- You can add a tracker to track page visits by creating an account at statcounter.com -->
<!-- End of Statcounter Code -->
</body>
</html>