LisaCLIP: Locally Incremental Semantics Adaptation towards Zero-shot Text-driven Image Synthesis
Release time: 2023-11-21
- Indexed by:
- Conference proceedings
- Journal:
- Proceedings of the International Joint Conference on Neural Networks
- Included Journals:
- EI, CPCI-S
- Place of Publication:
- Australia
- Discipline:
- Engineering
- First-Level Discipline:
- Computer Science and Technology
- Document Type:
- C
- Key Words:
- image synthesis, style transfer, CLIP model, adaptive patch selection
- DOI number:
- 10.1109/IJCNN54540.2023.10191516
- Date of Publication:
- 2023-06-18
- Abstract:
- The automatic transfer of a plain photo into a desired synthetic style has attracted numerous users in photo editing, visual art, and entertainment. By connecting images and texts, the Contrastive Language-Image Pre-Training (CLIP) model enables text-driven style transfer without exploring the image's latent domain. However, the trade-off between content fidelity and stylization remains challenging. In this paper, we present LisaCLIP, a CLIP-based image synthesis framework that exploits only the CLIP model to guide image manipulation with a depth-adaptive encoder-decoder network. Since an image patch's semantics depend on its size, LisaCLIP progressively downsizes the patches while adaptively selecting the most significant ones for further stylization. We introduce a multi-stage training strategy that speeds up LisaCLIP's convergence by decoupling the optimization objectives. Experiments on public datasets demonstrate that LisaCLIP supports a wide range of style transfer tasks and outperforms other state-of-the-art methods in balancing content and style.
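
The abstract describes CLIP-guided stylization with adaptive patch selection but gives no implementation details. Below is a minimal, hypothetical Python sketch, not the authors' code, of one plausible reading of that idea using PyTorch and the open-source OpenAI `clip` package: candidate patches are cropped from the current output image, scored against the target style prompt in CLIP space, and the patches farthest from the style are selected so the style loss concentrates on them. All function names, patch sizes, and the selection criterion are illustrative assumptions.

```python
# Hypothetical sketch of CLIP-guided adaptive patch selection; not LisaCLIP itself.
# Assumes: pip install torch and the OpenAI CLIP package (github.com/openai/CLIP).
# CLIP's input normalization (mean/std) is omitted here for brevity.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float()  # keep everything in float32 to avoid dtype mismatches


def encode_text(prompt: str) -> torch.Tensor:
    """Encode the style prompt once; it is a fixed target, so no gradients are needed."""
    tokens = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        feat = clip_model.encode_text(tokens)
    return F.normalize(feat, dim=-1)


def select_patches(image: torch.Tensor, style_feat: torch.Tensor,
                   patch_size: int = 128, n_candidates: int = 32, k: int = 8) -> torch.Tensor:
    """Crop random candidate patches and keep the k patches least similar to the style."""
    _, _, h, w = image.shape
    crops = []
    for _ in range(n_candidates):
        top = torch.randint(0, h - patch_size + 1, (1,)).item()
        left = torch.randint(0, w - patch_size + 1, (1,)).item()
        crops.append(image[:, :, top:top + patch_size, left:left + patch_size])
    patches = torch.cat(crops, dim=0)                                   # (N, 3, p, p)
    resized = F.interpolate(patches, size=224, mode="bicubic", align_corners=False)
    feats = F.normalize(clip_model.encode_image(resized), dim=-1)       # (N, 512)
    sims = (feats @ style_feat.T).squeeze(-1)                           # cosine similarity to style
    idx = sims.argsort()[:k]                                            # lowest similarity first
    return patches[idx]


def patch_style_loss(image: torch.Tensor, style_prompt: str) -> torch.Tensor:
    """Push the selected (least-stylized) patches toward the target style in CLIP space."""
    style_feat = encode_text(style_prompt)
    selected = select_patches(image, style_feat)
    resized = F.interpolate(selected, size=224, mode="bicubic", align_corners=False)
    feats = F.normalize(clip_model.encode_image(resized), dim=-1)
    return (1.0 - feats @ style_feat.T).mean()
```

In a full pipeline this patch-level style term would presumably be combined with a content-preservation objective on the whole image and optimized over the encoder-decoder network in separate stages, in line with the multi-stage training strategy the abstract mentions.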