  <div class="container is-max-desktop">
    <div class="columns is-centered">
      <div class="column has-text-centered">
-       <h1 class="title is-1 publication-title">DiMSUM <img src="./static/images/dimsum_icon.png" class="logo" width=5.5% /> : <span style="color:red;">Di</span>ffusion <span style="color:red;">M</span>amba - A <span style="color:red;">S</span>calable and <span style="color:red;">U</span>nified
+       <h1 class="title is-1 publication-title">DiMSUM <img src="./static/images/dimsum_icon.png" class="logo" width="50px" /> : <span style="color:red;">Di</span>ffusion <span style="color:red;">M</span>amba - A <span style="color:red;">S</span>calable and <span style="color:red;">U</span>nified
        Spatial-Frequency <span style="color:red;">M</span>ethod for Image Generation</h1>
        <div class="is-size-5 publication-authors">
          <span class="author-block">
@@ -124,10 +124,10 @@ <h1 class="title is-1 publication-title">DiMSUM <img src="./static/images/dimsum
            <a href="https://viethoang1512.github.io/">Hoang Phan</a><sup>4</sup>
          </span>
          <span class="author-block">
-           <a href="https://people.cs.rutgers.edu/~dnm/">Dimitris N. Metaxas</a><sup>3</sup>
+           <a href="https://people.cs.rutgers.edu/~dnm/">Dimitris N. Metaxas</a><sup>2</sup>
          </span>
          <span class="author-block">
-           <a href="https://sites.google.com/site/anhttranusc/">Anh Tran</a><sup>1</sup>
+           <a href="https://scholar.google.com/citations?user=FYZ5ODQAAAAJ">Anh Tran</a><sup>1</sup>
          </span>
        </div>
@@ -524,11 +524,10 @@ <h2 class="title is-3">Unconditional Generation</h2>
      <!-- Motivation -->
      <h2 class="title is-3">Why is scanning in frequency space helpful?</h2>
-     <div class="columns is-centered">
+     <div class="columns is-centered" style="text-align: center;">
        <img src="./static/images/wavelet_vs_spatial_window.png" width="60%" class="scanning" />
      </div>
      <div class="content has-text-justified">
-
        <p>
          Previous state-space models, particularly for visual data, have not adequately addressed the design choice of scanning order: by relying exclusively on spatial processing, they neglect crucial long-range relations in the frequency spectrum.
          We propose a novel approach that integrates frequency scanning with the conventional spatial scanning mechanism.
@@ -595,88 +594,58 @@ <h3 class="title is-4">Globally-shared Transformer Block</h3>
      <div class="column is-full-width">
        <h2 class="title is-3">Results</h2>
      </div>
-     <div class="columns is-centered">
-       <div class="column">
-         <div class="content">
-           <figure>
-             <img src="./static/images/celeb256.jpg" class="interpolation-image" width=45% style="margin-right: 60px;" />
-             <img src="./static/images/celeb512.jpg" class="interpolation-image" width=45% />
-             <figcaption>Figure 1. Unconditional generation on CelebA HQ</figcaption>
-           </figure>
-
-         </div>
-       </div>
-     </div>
-     <br>
+     <div class="columns is-centered" style="text-align: center;">
+       <table>
+         <tr>
+           <td><img src="./static/images/celeb256.jpg" width="90%" /></td>
+           <td><img src="./static/images/celeb512.jpg" width="90%" /></td>
+         </tr>
+         <caption style="text-align: center; color: black">Figure 1. Unconditional generation on CelebA HQ 256 & 512</caption>
+       </table>
      </div>
-     <div class="columns is-centered">
-       <div class="column" style="margin-left: 13%;">
-         <div class="content">
-           <figure>
-             <img src="./static/images/church256.jpg" class="interpolation-image" width=70% />
-             <figcaption>Figure 2. Unconditional generation on LSUN Church</figcaption>
-           </figure>
-         </div>
-         <!-- <div class="content">
-           <figure>
-             <img src="./static/images/imnet.jpg" class="interpolation-image" width=35% />
-             <figcaption>Figure 3. Class-conditional generation on ImageNet1k 256</figcaption>
-           </figure>
-         </div> -->
+     <div class="columns is-centered" style="text-align: center;">
+       <div class="column">
+         <figure>
+           <img src="./static/images/training_convergence.png" class="interpolation-image" width="70%" />
+           <figcaption>Figure 2. Training convergence on CelebA HQ 256.</figcaption>
+         </figure>
      </div>
      <div class="column">
-       <div class="content" style="margin-left: -42%;">
-         <figure>
-           <img src="./static/images/imnet.jpg" class="interpolation-image" width=60% />
-           <figcaption>Figure 3. Class-conditional generation on ImageNet1k 256</figcaption>
-         </figure>
-       </div>
+       <figure>
+         <img src="./static/images/church256.jpg" class="interpolation-image" width="90%" />
+         <figcaption>Figure 3. Unconditional generation on LSUN Church</figcaption>
+       </figure>
      </div>
-     <!-- <div class="column">
-       <div class="content">
-         <figure>
-           <img src="./static/images/training_convergence.png" class="interpolation-image" width=35% />
-           <figcaption style="margin-left: 400px; margin-right: 400px;">
-             Figure 4. Training convergence on CelebA HQ 256.
-             Our method achieves faster training convergence, requiring fewer than half the training epochs compared to other diffusion models, while delivering a more stable training curve.
-           </figcaption>
-         </figure>
-       </div>
-     </div> -->
+     </div>
+     <div class="columns is-centered" style="text-align: center;">
      </div>
      <div class="columns is-centered">
        <div class="column">
          <div class="content">
            <figure>
-             <img src="./static/images/training_convergence.png" class="interpolation-image" width=35% />
-             <figcaption style="margin-left: 400px; margin-right: 400px;">
-               Figure 4. Training convergence on CelebA HQ 256.
-               Our method achieves faster training convergence, requiring fewer than half the training epochs compared to other diffusion models, while delivering a more stable training curve.
-             </figcaption>
+             <img src="./static/images/imnet.jpg" class="interpolation-image" width="60%" />
+             <figcaption>Figure 4. Class-conditional generation on ImageNet1k 256</figcaption>
            </figure>
          </div>
        </div>
      </div>
    </section>
-
-   <!-- <section class="section" id="bound">
-     <div class="container is-max-desktop content">
-       <h2 class="title">Theoretical analysis: Bounding estimation error</h2>
-       <div class="content has-text-justified">
-         We have shown that minimizing the FM objective on latent space controls the Wasserstein distance between the
-         target density \( p_0 \) and the reconstructed density \( \hat{p}_0 \), which coincides with Fréchet inception
-         distance (FID), a common metric for image generation.
-         This means that our latent flow matching is guaranteed to control this metric, given reasonable estimation of \(
-         \hat{v}(\mathbf{z}_t, t) \). Nonetheless, the analysis also suggests that the quality of latent flow matching
-         depends on the constants that define the expressivity of the decoders and encoders, which has been observed in
-         prior research on generative modeling in latent space.
-
-         <img src="./static/images/bound.jpeg" class="interpolation-image" />
-
-
+
+
+   <section class="section" id="speed">
+     <div class="container is-max-desktop">
+       <div class="column is-full-width">
+         <h2 class="title is-3">Speed</h2>
        </div>
-     </div>
-   </section> -->
+       <div class="container is-max-desktop content" style="text-align: center;">
+         <div>
+           <img src="./static/images/speed.jpg" class="interpolation-image" width="80%" />
+         </div>
+         <div class="content has-text-justified">
+           The speed gap between our method and DiT widens as the input resolution increases, highlighting the efficiency of our method for high-resolution synthesis.
+         </div>
+       </div>
+   </section>

    <!-- <section class="section" id="Related">
      <div class="container is-max-desktop content">