A from-scratch, modular PyTorch re-implementation of an Inkpunk-style Stable Diffusion v1.5 pipeline. I built this to demonstrate hands-on expertise with diffusion models, attention-based UNet architectures, CLIP text encoding, and VAE image compression — with clean, readable code and reproducible results across CUDA, Apple Silicon (MPS), and CPU.
- End-to-end diffusion pipeline: CLIP tokenizer/encoder → UNet denoiser → VAE decode
- Weight compatibility: load SD-compatible checkpoints (e.g., Inkpunk) into custom modules
- Deterministic, reproducible generation via seed control and fixed samplers
- Practical engineering: device auto-detection, modular design, and simple CLI
- Python, PyTorch, PIL
- Hugging Face `transformers` (CLIPTokenizer)
- CUDA / Apple Metal (MPS) / CPU
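The CLI picks the fastest available backend automatically. A minimal sketch of that kind of check, using standard PyTorch availability queries (`pick_device` is an illustrative name, not necessarily the helper used in `main.py`):

```python
import torch

def pick_device() -> str:
    # Prefer CUDA, then Apple Metal (MPS), then fall back to CPU.
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"
```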
Below are sample generations produced by the codebase using curated Inkpunk prompts. Each image is reproducible with the listed prompts and seed 42.
- Prompt (positive): nvinkpunk cyberpunk samurai with neon mask, glowing swords, graffiti wall background, rainbow smoke, vibrant spray paint textures, ultra detailed
  Negative prompt: flat shading, sticker-like outline, deformed fins, bland palette
- Prompt (positive): nvinkpunk hacker shrine, floating keyboards, rainbow cables, CRT glow, graffiti calligraphy, ultra detailed
  Negative prompt: generic UI overlays, unreadable text blocks, banding, chromatic noise
- Prompt (positive): nvinkpunk neon jellyfish city, bioluminescent tendrils, rainbow mist, aerosol dots, ultra detailed
  Negative prompt: duplicated tendrils, watery blur, plastic look, low contrast
- Prompt (positive): nvinkpunk desert racer hoverbike, sand neon trail, rainbow heat haze, graffiti decals, cinematic, ultra detailed
  Negative prompt: soft focus, mushy edges, duplicated handlebars, warped geometry
To add more results, generate with your preferred prompts (see Reproduce My Results) and place the images under outputs/. Then embed them here in the same way.
- Navigate to the project root:
  `cd Stable_diffusion`
- Install dependencies:
  `pip install -r requirements.txt`
- Weights
  - This project expects `assets/Inkpunk-Diffusion-v2.ckpt` to be present (already included in my setup).
  - If you need to fetch base SD 1.5 weights for experimentation, configure your Hugging Face token and use the helper script:
    # optional
    # cp .env
    # python download_weights.py
- Run the interactive app:
  `python main.py`
- Run `python main.py` and select “Text-to-Image Generation”.
- Pick a built-in Inkpunk prompt or enter your own. Use seed `42` to match my gallery.

Or call the pipeline directly from Python:
from transformers import CLIPTokenizer
from src.models.model_loader import preload_models_from_standard_weights
from src.pipeline import generate
device = "mps" # or "mps", "cpu"
tokenizer = CLIPTokenizer(
"assets/tokenizer/vocab.json",
merges_file="assets/tokenizer/merges.txt"
)
models = preload_models_from_standard_weights(
"assets/Inkpunk-Diffusion-v2.ckpt", device
)
image = generate(
prompt=(
"nvinkpunk neon taxi drifting in rain, chrome reflections, "
"rainbow streaks, street tags, cinematic, ultra detailed"
),
uncond_prompt=(
"warped wheels, melted chrome, motion smear, muddy puddles"
),
models=models,
device=device,
tokenizer=tokenizer,
seed=42,
do_cfg=True,
cfg_scale=9.0,
sampler_name="ddpm",
n_inference_steps=80,
)
# Save to outputs/ using your preferred filename
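The snippet above stops at the save comment. Assuming `generate` returns an (H, W, 3) `uint8` array (PIL is already a dependency), saving could look like the sketch below; the output filename is just an example:

```python
from PIL import Image

# Assumes `image` is an (H, W, 3) uint8 NumPy array; adjust if your
# version of generate() returns a PIL.Image or a torch tensor instead.
Image.fromarray(image).save("outputs/neon_taxi_seed42.png")
```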
Example additional curated prompts (Inkpunk style):

- nvinkpunk cyber koi swirling in midair, holographic water, rainbow reflections, ink splatter, ultra detailed
  Negative: flat shading, sticker-like outline, deformed fins, bland palette
- nvinkpunk hacker shrine, floating keyboards, rainbow cables, CRT glow, graffiti calligraphy, ultra detailed
  Negative: generic UI overlays, unreadable text blocks, banding, chromatic noise
- VAE: Variational Autoencoder for latent-space image compression and decoding
- CLIP: Text encoder (tokenization + embeddings) for prompt conditioning
- UNet: Denoising network with self- and cross-attention blocks (see the cross-attention sketch below)
- DDPM sampler: Iterative denoising loop (configurable steps and guidance scale)
All components are implemented in PyTorch with a clean, modular design for readability and extension.
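As a representative example, here is a hedged sketch of a cross-attention block of the kind used in the UNet: image features act as queries and the CLIP text embeddings supply keys and values, which is how the prompt steers denoising. This is illustrative only; the actual `src/models/attention.py` may differ in layout and defaults.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Illustrative cross-attention: UNet features attend to text context."""

    def __init__(self, d_embed: int, d_context: int = 768, n_heads: int = 8):
        super().__init__()
        self.q_proj = nn.Linear(d_embed, d_embed, bias=False)
        self.k_proj = nn.Linear(d_context, d_embed, bias=False)
        self.v_proj = nn.Linear(d_context, d_embed, bias=False)
        self.out_proj = nn.Linear(d_embed, d_embed)
        self.n_heads = n_heads
        self.d_head = d_embed // n_heads

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_img, d_embed) flattened UNet feature map
        # context: (batch, 77, d_context) CLIP text embeddings
        b, n, _ = x.shape
        q = self.q_proj(x).view(b, n, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(context).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(context).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        scores = q @ k.transpose(-1, -2) / self.d_head ** 0.5
        attn = torch.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.out_proj(out)
```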
The following diagram illustrates the complete text-to-image generation pipeline, showing how each component transforms data through the system:
graph TD
A["Text Input<br/>nvinkpunk cyberpunk samurai"] --> B["CLIP Tokenizer<br/>Text→Token IDs<br/>(1,77)"]
B --> C["CLIP Embedding<br/>Token+Position Embeddings<br/>(1,77,768)"]
C --> D["12-Layer CLIP Processing<br/>Self-Attention+Feed-Forward<br/>(1,77,768)"]
D --> E["Text Context Vector<br/>Positive+Negative Prompts<br/>(2,77,768)"]
F["Random Noise Generation<br/>Standard Normal Distribution<br/>(1,4,64,64)"] --> G["DDPM Sampler Init<br/>80 Timesteps<br/>999→0"]
G --> H["Iterative Denoising Loop<br/>80 Iterations"]
H --> I["Time Embedding<br/>Sinusoidal Encoding<br/>(1,320)"]
I --> J["Time Information Expansion<br/>(1,320)→(1,1280)"]
E --> K["UNet Encoder Path<br/>Downsampling+Attention<br/>64×64→8×8"]
J --> K
F --> K
K --> L["UNet Bottleneck<br/>Deep Feature Extraction<br/>(1,1280,8,8)"]
L --> M["UNet Decoder Path<br/>Upsampling+Skip Connections<br/>8×8→64×64"]
M --> N["Noise Prediction<br/>UNet Output Layer<br/>(1,4,64,64)"]
N --> O["Classifier-Free Guidance<br/>Conditional vs Unconditional<br/>Noise Mixing"]
O --> P["DDPM Sampling Step<br/>Remove Predicted Noise<br/>Update Latent Vector"]
P --> Q{"Completed<br/>80 Steps?"}
Q -->|No| H
Q -->|Yes| R["Final Latent Representation<br/>Fully Denoised<br/>(1,4,64,64)"]
R --> S["VAE Decoder<br/>Latent Space→Image Space<br/>(1,4,64,64)→(1,3,512,512)"]
S --> T["Post-processing<br/>Normalization+Format Conversion<br/>(-1,1)→(0,255)"]
T --> U["Final Output Image<br/>512×512 RGB<br/>Inkpunk Samurai Artwork"]
style A fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#000,font-weight:bold
style E fill:#f3e5f5,stroke:#4a148c,stroke-width:2px,color:#000,font-weight:bold
style F fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#000,font-weight:bold
style R fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px,color:#000,font-weight:bold
style U fill:#ffebee,stroke:#c62828,stroke-width:2px,color:#000,font-weight:bold
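One detail from the diagram worth spelling out is the sinusoidal time embedding that tells the UNet which denoising step it is on. A minimal sketch producing the (1, 320) vector shown above (illustrative; the exact code lives in the diffusion/pipeline modules and may differ):

```python
import torch

def time_embedding(timestep: int, dim: int = 320) -> torch.Tensor:
    # Half the dimensions get cosine components, half sine, at
    # geometrically spaced frequencies (the classic transformer scheme).
    half = dim // 2
    freqs = torch.exp(-torch.log(torch.tensor(10000.0)) * torch.arange(half) / half)
    args = timestep * freqs
    return torch.cat([torch.cos(args), torch.sin(args)]).unsqueeze(0)  # (1, dim)
```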
- Text Processing: CLIP transforms natural language into semantic embeddings
- Noise Initialization: Pure Gaussian noise in a compressed latent space (64×64, i.e. 8× smaller than the 512×512 image space per spatial dimension)
- Iterative Denoising: UNet progressively removes noise, guided by the text context, over 80 steps (see the loop sketch below)
- Image Reconstruction: VAE decoder converts latent representation back to RGB image
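Putting the four steps together, the core loop looks roughly like the sketch below. It is schematic rather than the exact `src/pipeline.py`; the `unet` and `sampler` call signatures are assumptions used for illustration.

```python
def denoise(latents, context, unet, sampler, cfg_scale=9.0):
    # latents: (1, 4, 64, 64) Gaussian noise; context: (2, 77, 768)
    # stacked conditional + unconditional text embeddings.
    for t in sampler.timesteps:                        # e.g. 80 steps, 999 -> 0
        time_emb = sampler.time_embedding(t)           # (1, 320) sinusoidal encoding
        model_in = latents.repeat(2, 1, 1, 1)          # run cond and uncond in one batch
        noise_pred = unet(model_in, context, time_emb)
        cond, uncond = noise_pred.chunk(2)
        guided = cfg_scale * (cond - uncond) + uncond  # classifier-free guidance mix
        latents = sampler.step(t, latents, guided)     # remove the predicted noise
    return latents                                     # (1, 4, 64, 64), ready for the VAE
```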
For a detailed technical walkthrough, see Inkpunk_Diffusion_Technical_Overview.md.
Inkpunk_Diffusion/
├── src/
│   ├── models/
│   │   ├── attention.py              # Self- & cross-attention
│   │   ├── vae.py                    # VAE encoder & decoder
│   │   ├── clip.py                   # CLIP text encoder
│   │   ├── unet.py                   # UNet denoiser
│   │   ├── diffusion.py              # Diffusion loop & DDPM sampler
│   │   └── model_loader.py           # Load/convert SD-compatible weights
│   └── pipeline.py                   # Orchestration for generation
├── assets/
│   ├── Inkpunk-Diffusion-v2.ckpt     # Inkpunk weights
│   └── tokenizer/                    # CLIP tokenizer
│       ├── merges.txt                # Merges file
│       └── vocab.json                # Vocab file
├── .env                              # Environment variables (Hugging Face token)
├── outputs/                          # Generated images
├── main.py                           # Interactive CLI
├── download_weights.py               # Optional weights helper
├── requirements.txt                  # Dependencies
├── Inkpunk_Diffusion_Technical_Overview.md  # Technical overview
└── README.md
This repository is for educational and portfolio purposes. The Inkpunk Diffusion model weights are governed by their original license. Base Stable Diffusion weights are governed by Stability AI’s license.
- Memory constraints: lower `n_inference_steps` or the image size; prefer GPU/MPS when available
- Import/runtime errors: ensure dependencies are installed via `pip install -r requirements.txt`



