
X-Dreamer 💤

A PyTorch implementation of “X-Dreamer: Creating High-quality 3D Content by Bridging the Domain Gap Between Text-to-2D and Text-to-3D Generation”

[Paper] [Project Page]

Introduction Video 🎥

intro.MP4

Requirements

  • System: Ubuntu 20.04
  • Tested GPUs: RTX 3090
  • Environment installation:

```shell
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
```

How to Run

Text-to-3D generation from an ellipsoid

```shell
# Geometry modeling
python -m torch.distributed.launch --nproc_per_node=4 \
        train_x_dreamer.py \
        --config configs/cupcake_geometry.json \
        --out-dir 'results/result_XDreamer/cupcake_geometry'

# Appearance modeling
python -m torch.distributed.launch --nproc_per_node=4 \
        train_x_dreamer.py \
        --config configs/cupcake_appearance.json \
        --out-dir 'results/result_XDreamer/cupcake_appearance' \
        --base-mesh 'results/result_XDreamer/cupcake_geometry/dmtet_mesh/mesh.obj'
```

Text-to-3D generation from coarse-grained meshes

```shell
# Geometry modeling
python -m torch.distributed.launch --nproc_per_node=4 \
        train_x_dreamer.py \
        --config configs/Batman_geometry.json \
        --out-dir 'results/result_XDreamer/Batman_geometry'

# Appearance modeling
python -m torch.distributed.launch --nproc_per_node=4 \
        train_x_dreamer.py \
        --config configs/Batman_appearance.json \
        --out-dir 'results/result_XDreamer/Batman_appearance' \
        --base-mesh 'results/result_XDreamer/Batman_geometry/dmtet_mesh/mesh.obj'
```

Overview 💻

Overview of the proposed X-Dreamer, which consists of two main stages: geometry learning and appearance learning. In the geometry learning stage, we employ DMTET as the 3D representation and initialize it with a 3D ellipsoid using a mean squared error (MSE) loss. We then optimize DMTET and CG-LoRA using the score distillation sampling (SDS) loss and our proposed attention-mask alignment (AMA) loss to align the 3D representation with the input text prompt. In the appearance learning stage, we leverage bidirectional reflectance distribution function (BRDF) modeling: an MLP with trainable parameters predicts the surface materials. As in the geometry learning stage, we optimize the MLP and CG-LoRA using the SDS and AMA losses to align the 3D representation with the input text prompt.
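Both stages are driven by the SDS loss. As a rough illustration of the quantity being distilled (a NumPy sketch with toy tensors, not the repository's actual implementation), the gradient passed back to the 3D parameters is a timestep-dependent weight times the difference between the diffusion model's noise prediction and the noise that was actually injected into the rendered image:

```python
import numpy as np

rng = np.random.default_rng(0)

def sds_grad(noise_pred, noise, weight):
    # Score distillation sampling: the gradient on the rendered image
    # is w(t) * (eps_theta(x_t; y, t) - eps), i.e. the gap between the
    # diffusion model's noise prediction and the injected noise. It is
    # backpropagated through the renderer into the 3D parameters.
    return weight * (noise_pred - noise)

# Toy tensors standing in for a noised 64x64 RGB rendering.
noise = rng.standard_normal((64, 64, 3))
noise_pred = noise + 0.1  # pretend the model's prediction is slightly off
grad = sds_grad(noise_pred, noise, weight=0.5)
print(grad.shape)
```

In the actual pipeline this gradient comes from the Diffusers Stable Diffusion UNet evaluated on the noised rendering, and the AMA loss is added on top to align the model's cross-attention maps with the rendered object mask.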

News 📝

  • 2023.11.27: Repository created
  • 2023.12.28: Code released

Results 🔍

result.mp4

Example generated objects

We conduct the experiments on four NVIDIA RTX 3090 GPUs using the PyTorch library. To compute the SDS loss, we use the Stable Diffusion implementation from Hugging Face Diffusers. We implement the DMTET network and the material encoder as a two-layer MLP and a single-layer MLP, respectively, with a hidden dimension of 32. We optimize X-Dreamer for 2,000 iterations of geometry learning and 1,000 iterations of appearance learning.
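The two-layer MLP mentioned above can be pictured at the shape level as follows. This is a NumPy sketch with random stand-in weights; the ReLU activation and the input/output dimensions are illustrative assumptions, not the repository's actual network:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_2layer(x, w1, b1, w2, b2):
    """Two-layer MLP with hidden width 32, matching the size described
    above. Weights are random stand-ins for illustration only."""
    h = np.maximum(x @ w1 + b1, 0.0)  # hidden layer with ReLU (assumed)
    return h @ w2 + b2                # linear output layer

in_dim, hidden, out_dim = 3, 32, 4   # e.g. xyz query -> per-point outputs (assumed dims)
w1 = rng.standard_normal((in_dim, hidden)) * 0.1
b1 = np.zeros(hidden)
w2 = rng.standard_normal((hidden, out_dim)) * 0.1
b2 = np.zeros(out_dim)

pts = rng.standard_normal((8, in_dim))  # a batch of 8 query points
out = mlp_2layer(pts, w1, b1, w2, b2)
print(out.shape)
```

The single-layer material encoder is the same idea with one fewer hidden layer; in the repo both are trained PyTorch modules optimized by the SDS and AMA losses.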

Text-to-3D generation from an ellipsoid

We present representative results of X-Dreamer for text-to-3D generation, utilizing an ellipsoid as the initial geometry.

| Prompt | Image | Normal |
| --- | --- | --- |
| A DSLR photo of a blue and white porcelain vase, highly detailed, 8K, HD. | Vase Shaded | Vase Normal |
| A cabbage, highly detailed. | Cabbage Shaded | Cabbage Normal |
| A chocolate cupcake, highly detailed. | Cupcake Shaded | Cupcake Normal |
| A sliced loaf of fresh bread. | Bread Shaded | Bread Normal |
| A DSLR photo of a pear, highly detailed, 8K, HD. | Pear Shaded | Pear Normal |
| A hamburger. | Hamburger Shaded | Hamburger Normal |
| A DSLR photo of a corn, highly detailed, 8K, HD. | Corn Shaded | Corn Normal |

Text-to-3D generation from coarse-grained meshes

X-Dreamer also supports text-based mesh geometry editing and is capable of delivering excellent results.

| Prompt | Coarse-grained Mesh | Image | Normal |
| --- | --- | --- | --- |
| A beautifully carved wooden queen chess piece. | Chess Piece | Chess Shaded | Chess Normal |
| Barack Obama's head. | Obama's Head | Obama Shaded | Obama Normal |

Different lighting conditions

We demonstrate how swapping the HDR environment map results in diverse lighting, thereby creating various reflective effects on the generated 3D assets in X-Dreamer.

Each prompt below is rendered under five different HDR environment maps (Env. Map 1 to 5):

  • A DSLR photo of a brown cowboy hat.
  • Messi's head, highly detailed, 8K, HD.
  • A DSLR photo of a fox, highly detailed.
  • A DSLR photo of red rose, highly detailed, 8K, HD.
  • A marble bust of a mouse.
  • A small saguaro cactus planted in a clay pot.
  • A DSLR photo of a vase, highly detailed, 8K, HD.

Editing process

We demonstrate the editing process of the geometry and appearance of 3D assets in X-Dreamer using an ellipsoid and coarse-grained guided meshes as geometric shapes for initialization, respectively.

| From an ellipsoid | From coarse-grained guided meshes |
| --- | --- |
| A DSLR photo of a blue and white porcelain vase, highly detailed, 8K, HD. | A marble bust of an angel, 3D model, high resolution. |
| A stack of pancakes covered in maple syrup. | A DSLR photo of the Terracotta Army, 3D model, high resolution. |

Comparison

We compared X-Dreamer with four state-of-the-art (SOTA) methods: DreamFusion, Magic3D, Fantasia3D, and ProlificDreamer. The results are shown below:

Each prompt below is rendered by all five methods (DreamFusion, Magic3D, Fantasia3D, ProlificDreamer, X-Dreamer):

  • A 3D rendering of Batman, highly detailed.
  • A cat, highly detailed.
  • Garlic with white skin, highly detailed, 8K, HD.
  • A statue of Leonardo DiCaprio's head.
  • A DSLR photo of Lord Voldemort's head, highly detailed.

Using results in 3D computer graphics software 🔧

blender.mp4

BibTeX 📚

```bibtex
@article{ma2023xdreamer,
  title={X-Dreamer: Creating High-quality 3D Content by Bridging the Domain Gap Between Text-to-2D and Text-to-3D Generation},
  author={Ma, Yiwei and Fan, Yijun and Ji, Jiayi and Wang, Haowei and Sun, Xiaoshuai and Jiang, Guannan and Shu, Annan and Ji, Rongrong},
  journal={arXiv preprint arXiv:2312.00085},
  year={2023}
}
```