LisaCLIP: Locally Incremental Semantics Adaptation towards Zero-shot Text-driven Image Synthesis
Release time: 2023-11-21
- Indexed by:
- Conference proceedings
- Journal:
- Proceedings of the International Joint Conference on Neural Networks
- Included Journals:
- EI, CPCI-S
- Place of Publication:
- Australia
- Discipline:
- Engineering
- First-Level Discipline:
- Computer Science and Technology
- Document Type:
- C
- Key Words:
- image synthesis, style transfer, CLIP model, adaptive patch selection
- DOI number:
- 10.1109/IJCNN54540.2023.10191516
- Date of Publication:
- 2023-06-18
- Abstract:
- The automatic transfer of a plain photo into a desired synthetic style has attracted numerous users in photo editing, visual art, and entertainment. By connecting images and texts, the Contrastive Language-Image Pre-Training (CLIP) model enables text-driven style transfer without exploring the image's latent domain. However, the trade-off between content fidelity and stylization remains challenging. In this paper, we present LisaCLIP, a CLIP-based image synthesis framework that exploits only the CLIP model to guide image manipulation with a depth-adaptive encoder-decoder network. Since an image patch's semantics depend on its size, LisaCLIP progressively downsizes the patches while adaptively selecting the most significant ones for further stylization. We introduce a multi-stage training strategy that speeds up LisaCLIP's convergence by decoupling the optimization objectives. Experiments on public datasets demonstrate that LisaCLIP supports a wide range of style transfer tasks and outperforms other state-of-the-art methods in balancing content and style.
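
The abstract describes CLIP-guided stylization with adaptive patch selection but gives no implementation details. Below is a minimal, hypothetical Python sketch, not the authors' code, of one plausible reading of that idea using PyTorch and the open-source OpenAI `clip` package: candidate patches are cropped from the current output image, scored against the target style prompt in CLIP space, and the patches farthest from the style are selected so the style loss concentrates on them. All function names, patch sizes, and the selection criterion are illustrative assumptions.

```python
# Hypothetical sketch of CLIP-guided adaptive patch selection; not LisaCLIP itself.
# Assumes: pip install torch and the OpenAI CLIP package (github.com/openai/CLIP).
# CLIP's input normalization (mean/std) is omitted here for brevity.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float()  # keep everything in float32 to avoid dtype mismatches


def encode_text(prompt: str) -> torch.Tensor:
    """Encode the style prompt once; it is a fixed target, so no gradients are needed."""
    tokens = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        feat = clip_model.encode_text(tokens)
    return F.normalize(feat, dim=-1)


def select_patches(image: torch.Tensor, style_feat: torch.Tensor,
                   patch_size: int = 128, n_candidates: int = 32, k: int = 8) -> torch.Tensor:
    """Crop random candidate patches and keep the k patches least similar to the style."""
    _, _, h, w = image.shape
    crops = []
    for _ in range(n_candidates):
        top = torch.randint(0, h - patch_size + 1, (1,)).item()
        left = torch.randint(0, w - patch_size + 1, (1,)).item()
        crops.append(image[:, :, top:top + patch_size, left:left + patch_size])
    patches = torch.cat(crops, dim=0)                                   # (N, 3, p, p)
    resized = F.interpolate(patches, size=224, mode="bicubic", align_corners=False)
    feats = F.normalize(clip_model.encode_image(resized), dim=-1)       # (N, 512)
    sims = (feats @ style_feat.T).squeeze(-1)                           # cosine similarity to style
    idx = sims.argsort()[:k]                                            # lowest similarity first
    return patches[idx]


def patch_style_loss(image: torch.Tensor, style_prompt: str) -> torch.Tensor:
    """Push the selected (least-stylized) patches toward the target style in CLIP space."""
    style_feat = encode_text(style_prompt)
    selected = select_patches(image, style_feat)
    resized = F.interpolate(selected, size=224, mode="bicubic", align_corners=False)
    feats = F.normalize(clip_model.encode_image(resized), dim=-1)
    return (1.0 - feats @ style_feat.T).mean()
```

In a full pipeline this patch-level style term would presumably be combined with a content-preservation objective on the whole image and optimized over the encoder-decoder network in separate stages, in line with the multi-stage training strategy the abstract mentions.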