LisaCLIP: Locally Incremental Semantics Adaptation towards Zero-shot Text-driven Image Synthesis
Posted: 2023-11-21
- Paper type: Conference proceedings
- Published in: Proceedings of the International Joint Conference on Neural Networks
- Indexed by: EI, CPCI-S
- Venue location: Australia
- Discipline category: Engineering
- First-level discipline: Computer Science and Technology
- Document type: C
- Keywords: image synthesis, style transfer, CLIP model, adaptive patch selection
- DOI: 10.1109/IJCNN54540.2023.10191516
- Publication date: 2023-06-18
- Abstract: The automatic transfer of a plain photo into a desired synthetic style has attracted numerous users in photo editing, visual art, and entertainment. By connecting images and texts, the Contrastive Language-Image Pre-Training (CLIP) model facilitates text-driven style transfer without exploring the image's latent domain. However, the trade-off between content fidelity and stylization remains challenging. In this paper, we present LisaCLIP, a CLIP-based image synthesis framework that exploits only the CLIP model to guide image manipulation through a depth-adaptive encoder-decoder network. Since an image patch's semantics depend on its size, LisaCLIP progressively downsizes the patches while adaptively selecting the most significant ones for further stylization. We introduce a multi-stage training strategy that speeds up LisaCLIP's convergence by decoupling the optimization objectives. Experiments on public datasets demonstrate that LisaCLIP supports a wide range of style transfer tasks and outperforms other state-of-the-art methods in balancing content and style.
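To make the abstract's core mechanism concrete, below is a minimal PyTorch sketch of CLIP-guided patch selection and stylization, assuming OpenAI's `clip` package. The selection rule (keep the patches least aligned with the style prompt) and all helper names (`encode_patches`, `select_patches`, `style_loss`) are hypothetical illustrations of the general idea, not the authors' implementation.

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model.eval()
for p in model.parameters():   # CLIP stays frozen; only the
    p.requires_grad_(False)    # generated image receives gradients

def encode_text(prompt: str) -> torch.Tensor:
    """Normalized CLIP embedding of the target style prompt."""
    with torch.no_grad():
        tokens = clip.tokenize([prompt]).to(device)
        return F.normalize(model.encode_text(tokens).float(), dim=-1)

def encode_patches(patches: torch.Tensor) -> torch.Tensor:
    """Normalized CLIP embeddings of image patches, shape (N, 3, 224, 224)."""
    return F.normalize(model.encode_image(patches).float(), dim=-1)

def select_patches(patches: torch.Tensor, text_emb: torch.Tensor, k: int):
    """Hypothetical selection rule: keep the k patches whose embeddings are
    farthest from the style text, i.e. those needing the most stylization."""
    with torch.no_grad():
        sims = (encode_patches(patches) @ text_emb.T).squeeze(1)  # (N,)
    idx = sims.topk(k, largest=False).indices
    return patches[idx]

def style_loss(selected: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Pull the selected patches' embeddings toward the text prompt."""
    sims = encode_patches(selected) @ text_emb.T
    return (1.0 - sims).mean()

# Usage sketch: score 64 crops of the image under optimization and
# stylize only the 16 least text-aligned ones.
text_emb = encode_text("a watercolor painting")
patches = torch.rand(64, 3, 224, 224, device=device, requires_grad=True)
loss = style_loss(select_patches(patches, text_emb, k=16), text_emb)
loss.backward()
```

Keeping CLIP frozen and scoring patches under `no_grad` means gradients flow only into the selected image patches through the loss term, which mirrors the abstract's claim that the framework exploits only the CLIP model for guidance.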