Abstract
A unified diffusion-based framework with semantics-driven style control and cross-medium style augmentation generates consistent, high-fidelity multi-media painting processes, supported by a large-scale dataset and dedicated evaluation metrics.
Step-by-step painting tutorials are vital for learning artistic techniques, but existing video resources (e.g., YouTube) lack interactivity and personalization. While recent generative models have advanced artistic image synthesis, they struggle to generalize across media and often show temporal or structural inconsistencies, hindering faithful reproduction of human creative workflows. To address this, we propose a unified framework for multi-media painting process generation with a semantics-driven style control mechanism that embeds multiple media into a diffusion model's conditional space and applies cross-medium style augmentation. This enables consistent texture evolution and process transfer across styles. A reverse-painting training strategy further ensures smooth, human-aligned generation. We also build a large-scale dataset of real painting processes and evaluate cross-media consistency, temporal coherence, and final-image fidelity, achieving strong results on LPIPS, DINO, and CLIP metrics. Finally, our Perceptual Distance Profile (PDP) curve quantitatively models the creative sequence, i.e., composition, color blocking, and detail refinement, mirroring human artistic progression.
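The PDP curve can be read as the perceptual distance between each intermediate frame of the painting process and the finished image, tracked over time: large drops during composition and color blocking, then a slow tail during detail refinement. Below is a minimal sketch of that idea in Python, assuming LPIPS as the perceptual metric and frames stored as torch tensors in [-1, 1]; the function name `pdp_curve` and the choice of LPIPS backbone are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of a Perceptual Distance Profile (PDP) computation.
# Assumes each frame is a torch tensor of shape (3, H, W), scaled to [-1, 1].
import torch
import lpips

perceptual = lpips.LPIPS(net="alex")  # pretrained perceptual distance

def pdp_curve(frames: list[torch.Tensor]) -> list[float]:
    """Perceptual distance of every intermediate frame to the final painting.

    A human-like process should yield a decreasing curve: steep early
    (composition, color blocking), flattening late (detail refinement).
    """
    final = frames[-1].unsqueeze(0)
    curve = []
    with torch.no_grad():
        for frame in frames:
            d = perceptual(frame.unsqueeze(0), final)
            curve.append(d.item())
    return curve
```

Plotting this list against the frame index gives the profile described in the abstract; other perceptual metrics (e.g., DINO or CLIP feature distances) could be substituted for LPIPS in the same scheme.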
Community
The following papers were recommended by the Semantic Scholar API:
- Birth of a Painting: Differentiable Brushstroke Reconstruction (2025)
- AvatarTex: High-Fidelity Facial Texture Reconstruction from Single-Image Stylized Avatars (2025)
- UniLumos: Fast and Unified Image and Video Relighting with Physics-Plausible Feedback (2025)
- X2Video: Adapting Diffusion Models for Multimodal Controllable Neural Video Rendering (2025)
- VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning (2025)
- PickStyle: Video-to-Video Style Transfer with Context-Style Adapters (2025)
- Enhancing Video Inpainting with Aligned Frame Interval Guidance (2025)