

ICCV 2021: Diverse Image Style Transfer via Invertible Cross-Space Mapping


Diverse Image Style Transfer via Invertible Cross-Space Mapping

Haibo Chen, Lei Zhao, Huiming Zhang, Zhizhong Wang, Zhiwen Zuo, Ailin Li, Wei Xing, Dongming Lu

College of Computer Science and Technology, Zhejiang University

[paper]

Contents

Abstract

1. Introduction

3. Approach

3.1. Stylization Branch

3.2. Disentanglement Branch

3.3. Inverse Branch

3.4. Final Objective and Network Architectures

4. Experiments

5. Conclusion


Abstract

Image style transfer aims to transfer the styles of artworks onto arbitrary photographs to create novel artistic images.

Although style transfer is inherently an underdetermined problem, existing approaches usually assume a deterministic solution, thus failing to capture the full distribution of possible outputs.

To address this limitation, we propose a Diverse Image Style Transfer (DIST) framework which achieves significant diversity by enforcing an invertible cross-space mapping.

Specifically, the framework consists of three branches: disentanglement branch, inverse branch, and stylization branch. Among them, the disentanglement branch factorizes artworks into content space and style space; the inverse branch encourages the invertible mapping between the latent space of input noise vectors and the style space of generated artistic images; the stylization branch renders the input content image with the style of an artist. Armed with these three branches, our approach is able to synthesize significantly diverse stylized images without loss of quality.

We conduct extensive experiments and comparisons to evaluate our approach qualitatively and quantitatively. The experimental results demonstrate the effectiveness of our method.

Research focus:

Image style transfer aims to transfer the style of artworks onto arbitrary photographs to create novel artistic images.

Core problem addressed:

Although style transfer is inherently an underdetermined problem, existing approaches usually assume a deterministic solution and therefore fail to capture the full distribution of possible outputs.

Main approach:

To address this limitation, the paper proposes a Diverse Image Style Transfer (DIST) framework that achieves significant diversity by enforcing an invertible cross-space mapping.

Details of the approach:

The framework consists of three branches: a disentanglement branch, an inverse branch, and a stylization branch. The disentanglement branch factorizes artworks into a content space and a style space; the inverse branch encourages an invertible mapping between the latent space of input noise vectors and the style space of generated artistic images; the stylization branch renders the input content image with the style of an artist. With these three branches, the approach can synthesize significantly diverse stylized images without loss of quality.

Experimental conclusions:

Extensive qualitative and quantitative experiments and comparisons demonstrate the effectiveness of the method.

1. Introduction

An exquisite artwork can take a diligent artist days or even months to create, which is labor-intensive and time-consuming. Motivated by this, a series of recent approaches studied the problem of repainting an existing photograph with the style of an artist using either a single artwork or a collection of artworks. These approaches are known as style transfer. Armed with style transfer techniques, anyone could create artistic images.

Subject of study: artistic image generation via style transfer.

Creating an exquisite artwork can take a diligent artist days or even months, which is labor-intensive and time-consuming. Motivated by this, a series of recent approaches study how to repaint an existing photograph in the style of an artist, using either a single artwork or a collection of artworks; these approaches are known as style transfer. With such techniques, anyone can create artistic images.

How to represent the content and style of an image is the key challenge of style transfer. Recently, the seminal work of Gatys et al. [7] first proposed to extract content and style features from an image using pre-trained Deep Convolutional Neural Networks (DCNNs). By separating and recombining contents and styles of arbitrary images, novel artworks can be created. This work showed the enormous potential of CNNs in style transfer and created a surge of interest in this field. Based on this work, a series of subsequent methods have been proposed to achieve better performance in many aspects, including efficiency [13, 21, 34], quality [20, 35, 40, 43, 39, 4], and generalization [6, 5, 10, 24, 30, 27, 22]. However, diversity, as another important aspect, has received relatively little attention.

Background and problem:

How to represent the content and style of an image is the key challenge of style transfer. The seminal work of Gatys et al. [7] first proposed extracting content and style features from an image with pre-trained Deep Convolutional Neural Networks (DCNNs); by separating and recombining the contents and styles of arbitrary images, novel artworks can be created. This work revealed the enormous potential of CNNs for style transfer and sparked a surge of interest in the field. Building on it, a series of follow-up methods pursued better performance in efficiency [13, 21, 34], quality [20, 35, 40, 43, 39, 4], and generalization [6, 5, 10, 24, 30, 27, 22]. Diversity, however, has received relatively little attention.

As the saying goes, “There are a thousand Hamlets in a thousand people’s eyes”. Similarly, different people have different understanding and interpretation of the style of an artwork. There is no uniform and quantitative definition of the artistic style of an image. Therefore, the stylization results should be diverse rather than unique, so that the preferences of different people can be satisfied. To put it another way, style transfer is an underdetermined problem, where a large number of solutions can be found. Unfortunately, existing style transfer methods usually assume a deterministic solution. As a result, they fail to capture the full distribution of possible outputs.

Motivation (the paper grounds its motivation in a solid rationale, which makes it feel reasonable and natural):

As the saying goes, "There are a thousand Hamlets in a thousand people's eyes." Likewise, different people understand and interpret the style of an artwork differently, and there is no uniform, quantitative definition of an image's artistic style. The stylization results should therefore be diverse rather than unique, so that different preferences can be satisfied. In other words, style transfer is an underdetermined problem with a large number of valid solutions. Unfortunately, existing style transfer methods usually assume a deterministic solution and thus fail to capture the full distribution of possible outputs.

A straightforward approach to handle diversity in style transfer is to take random noise vectors along with content images as inputs, i.e., utilizing the variability of the input noise vectors to produce diverse stylization results. However, the network tends to pay more attention to the high-dimensional and structured content images and ignores the noise vectors, leading to deterministic output. To ensure that the variability in the latent space can be passed into the image space, Ulyanov et al. [35] enforced the dissimilarity among generated images by enlarging their distance in the pixel space. Similarly, Li et al. [23] introduced a diversity loss that penalized the feature similarities of different samples in a mini-batch. Although these methods can achieve diversity to some extent, they have obvious limitations.

First, forcibly enlarging the distance among outputs may cause the results to deviate from the local optimum, resulting in the degradation of image quality.

Second, to avoid introducing too many artifacts to the generated images, the weight of the diversity loss is generally set to a small value. Consequently, the diversity of the stylization results is relatively limited.

Third, diversity is more than the pixel distance or feature distance among generated images; it carries a richer and more complex connotation. Most recently, Wang et al. [37] achieved better diversity by using an orthogonal noise matrix to perturb the image feature maps while keeping the original style information unchanged. However, this approach is apt to generate distorted results, providing insufficient visual quality. Therefore, the problem of diverse style transfer remains an open challenge.

Technical challenges (a finer-grained, deeper look at the problem):

A straightforward way to handle diversity in style transfer is to take random noise vectors along with the content images as inputs, i.e., to exploit the variability of the input noise to produce diverse stylization results. However, the network tends to focus on the high-dimensional, structured content images and ignore the noise vectors, which leads to deterministic outputs. To ensure that the variability of the latent space is passed into the image space, Ulyanov et al. [35] enforced dissimilarity among generated images by enlarging their distance in pixel space, and Li et al. [23] introduced a diversity loss that penalizes the feature similarity of different samples in a mini-batch. Although these methods achieve diversity to some extent, they have clear limitations.

First, forcibly enlarging the distance among outputs may push the results away from the local optimum and degrade image quality.

Second, to avoid introducing too many artifacts, the weight of the diversity loss is usually set to a small value, so the diversity of the stylization results remains limited.

Third, diversity is more than the pixel or feature distance among generated images; it carries a richer and more complex meaning. Most recently, Wang et al. [37] achieved better diversity by perturbing the image feature maps with an orthogonal noise matrix while keeping the original style information unchanged, but this approach tends to produce distorted results with insufficient visual quality.

The problem of diverse style transfer therefore remains an open challenge.
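As a concrete illustration of the mini-batch diversity loss discussed above, here is a minimal PyTorch-style sketch. It is only an approximation of the idea in Li et al. [23]; the function name, the choice of L1 distance, and the weighting are assumptions for illustration, not the authors' formulation.

```python
import torch

def minibatch_diversity_loss(features: torch.Tensor) -> torch.Tensor:
    """Penalize feature similarity among the N samples of a mini-batch.

    features: (N, C, H, W) feature maps of N stylized images.
    Returns the negative mean pairwise L1 distance, so minimizing this
    loss enlarges the distances between samples (i.e., increases diversity).
    """
    n = features.size(0)
    flat = features.reshape(n, -1)
    dist = torch.cdist(flat, flat, p=1)                       # (N, N) pairwise distances
    off_diag = ~torch.eye(n, dtype=torch.bool, device=flat.device)
    return -dist[off_diag].mean()

# In practice this term is added with a small weight, e.g.
#   total = task_loss + 0.01 * minibatch_diversity_loss(feats)
# which is exactly why the achievable diversity stays limited, as the paper points out.
```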

In this paper, we propose a Diverse Image Style Transfer (DIST) framework which achieves significant diversity without loss of quality by enforcing an invertible cross-space mapping. Specifically, the framework takes random noise vectors along with everyday photographs as its inputs, where the former are responsible for style variations and the latter determine the main contents. However, according to the above analyses, the noise vectors are prone to be ignored by the network. Our proposed DIST framework tackles this problem through three branches: disentanglement branch, inverse branch, and stylization branch.

The disentanglement branch factorizes artworks into content space and style space. The inverse branch encourages the invertible mapping between the latent space of input noise vectors and the style space of generated artistic images, which is inspired by [32]. But different from [32], we invert the style information rather than the whole generated image to the input noise vector, since the input noise vector mainly influences the style of the generated image. The stylization branch renders the input content image with the style of an artist. Equipped with these three branches, DIST is able to synthesize significantly diverse stylized images without loss of quality, as shown in Figure 1.

This paper proposes a Diverse Image Style Transfer (DIST) framework that achieves significant diversity without loss of quality by enforcing an invertible cross-space mapping. The framework takes random noise vectors and everyday photographs as inputs; the former are responsible for style variations and the latter determine the main content. As analyzed above, however, the noise vectors tend to be ignored by the network. The DIST framework tackles this problem with three branches: a disentanglement branch, an inverse branch, and a stylization branch.

The disentanglement branch factorizes artworks into a content space and a style space.

The inverse branch encourages an invertible mapping between the latent space of input noise vectors and the style space of generated artistic images, inspired by [32]; unlike [32], it inverts the style information rather than the whole generated image back to the input noise vector, since the noise vector mainly influences the style of the generated image.

The stylization branch renders the input content image with the style of an artist.

Equipped with these three branches, DIST can synthesize significantly diverse stylized images without degrading quality, as shown in Figure 1.

Overall, the contributions can be summarized as follows:

- We propose a novel style transfer framework which achieves significant diversity by learning the one-to-one mapping between latent space and style space.

- Different from existing style transfer methods [35, 23, 37] that obtain diversity with serious degradation of quality, our approach can produce both high-quality and diverse stylization results.

- Our approach provides a new way to disentangle the style and content of an image.

- We demonstrate the effectiveness and superiority of our approach by extensive comparison with several state-of-the-art style transfer methods.

In summary, the contributions are:

- A novel style transfer framework that achieves significant diversity by learning a one-to-one mapping between latent space and style space.

- Unlike existing diverse style transfer methods [35, 23, 37], which trade quality for diversity, the approach produces stylization results that are both high-quality and diverse.

- A new way to disentangle the style and content of an image.

- Extensive comparisons with several state-of-the-art style transfer methods that demonstrate the effectiveness and superiority of the approach.

[Reader's note: the contribution summary is somewhat brief.]

3. Approach

Inspired by [29, 17, 18, 33], we learn artistic style not from a single artwork but from a collection of related artworks. Formally, our task can be described as follows: given a collection of photos x ~ X and a collection of artworks y ~ Y (the contents of X and Y can be totally different), we aim to learn a style transformation G : X → Y with significant diversity. To achieve this goal, we propose a DIST framework consisting of three branches: stylization branch, disentanglement branch, and inverse branch. In this section, we introduce the three branches in detail.

Inspired by [29, 17, 18, 33], the artistic style is learned not from a single artwork but from a collection of related artworks. Formally, the task is: given a collection of photos x ~ X and a collection of artworks y ~ Y (the contents of X and Y can be totally different), learn a style transformation G : X → Y with significant diversity. To this end, the DIST framework consists of three branches: a stylization branch, a disentanglement branch, and an inverse branch.

3.1. Stylization Branch

The stylization branch aims to repaint x ~ X with the style of y ~ Y. To this end, we enable G to approximate the distribution of Y by employing a discriminator D to train against G: G tries to generate images that resemble the images in Y, while D tries to distinguish the stylized images from the real ones. Joint training of these two networks leads to a generator that is able to produce desired stylizations. This process can be formulated as follows (note that for G, we adopt an encoder-decoder architecture consisting of an encoder Ec and a decoder D):

L_adv = E_{y∼Y}[ log D(y) ] + E_{x∼X, z∼p(z)}[ log(1 − D(G(x, z))) ]    (1)

where z ∈ R^{d_z} is a random noise vector and p(z) is the standard normal distribution N(0, I). We leverage its variability to encourage diversity in generated images.

Stylization branch:

The stylization branch aims to repaint x ~ X with the style of y ~ Y. A discriminator D is trained against G so that G approximates the distribution of Y: G tries to generate images that resemble the images in Y, while D tries to distinguish the stylized images from the real ones. Joint training of the two networks yields a generator that produces the desired stylizations. The process is formulated in Eq. (1) (note that G adopts an encoder-decoder architecture consisting of an encoder Ec and a decoder D).

Here z ∈ R^{d_z} is a random noise vector and p(z) is the standard normal distribution N(0, I); its variability is leveraged to encourage diversity in the generated images.
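To make the stylization branch concrete, below is a minimal PyTorch-style sketch of the adversarial objective in Eq. (1), using a non-saturating BCE form. The module interfaces (disc, decoder, content_enc) and the exact loss form are assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def stylization_adv_losses(disc, decoder, content_enc, x, y, z):
    """x: content photos, y: real artworks, z ~ N(0, I) noise vectors.
    disc, decoder, content_enc stand in for the paper's D, decoder and Ec."""
    fake = decoder(content_enc(x), z)                 # stylized image D(Ec(x), z)

    # Discriminator side: real artworks -> 1, stylized images -> 0.
    real_logits = disc(y)
    fake_logits = disc(fake.detach())
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))

    # Generator side: try to make stylized images look real to the discriminator.
    gen_logits = disc(fake)
    g_loss = F.binary_cross_entropy_with_logits(gen_logits, torch.ones_like(gen_logits))
    return d_loss, g_loss
```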

Using the above adversarial loss alone cannot preserve the content information of x in the generated image, which does not meet the requirements of style transfer. The simplest solution is to utilize a pixel-wise loss between the content image x ~ X and the stylized image D(Ec(x), z). However, this loss is too strict and harms the quality of the stylized image. Therefore, we soften the constraint: instead of directly calculating the distance between the original images, we first feed them into an average pooling layer P and then calculate the distance between the pooled results. We express this content structure loss as:

L_p = E_{x∼X, z∼p(z)}[ ‖ P(x) − P(D(Ec(x), z)) ‖ ]    (2)

Compared with the pixel-wise loss, which requires the content image and the stylized image to be exactly the same, L_p measures their difference in a more coarse-grained manner and only requires them to be similar in overall content structure, which is more consistent with the goal of style transfer.

Although the stylization branch is sufficient to obtain remarkable stylized images, it can only produce a deterministic stylized image without diversity, because the network tends to ignore the random noise vector z.

Using the above adversarial loss alone cannot preserve the content information of x in the generated image, which does not meet the requirements of style transfer. The simplest solution is a pixel-wise loss between the content image x ~ X and the stylized image D(Ec(x), z), but this constraint is too strict and harms the quality of the stylized image. The constraint is therefore softened: instead of directly computing the distance between the original images, both are first passed through an average pooling layer P and the distance is computed on the pooled results, giving the content structure loss in Eq. (2).

Compared with the pixel-wise loss, which requires the content image and the stylized image to be identical, L_p measures their difference in a coarser-grained way and only requires them to be similar in overall content structure, which better matches the goal of style transfer.

Although the stylization branch alone is sufficient to obtain remarkable stylized images, it can only produce a deterministic stylized image without diversity, because the network tends to ignore the random noise vector z.
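A hedged sketch of the content structure loss in Eq. (2): the content photo and the stylized image are compared only after average pooling, so only coarse structure is constrained. The pooling kernel size and the use of an L1 distance are assumptions; the paper does not state them here.

```python
import torch
import torch.nn as nn

# Hypothetical pooling configuration; the paper does not specify the kernel size here.
P = nn.AvgPool2d(kernel_size=8, stride=8)

def content_structure_loss(x: torch.Tensor, stylized: torch.Tensor) -> torch.Tensor:
    """Eq. (2) sketch: distance between average-pooled content and stylized images."""
    return torch.mean(torch.abs(P(x) - P(stylized)))
```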

3.2. Disentanglement Branch

[32] alleviated the mode collapse issue in GANs by enforcing a bijective mapping between the input noise vectors and generated images. Different from [32], which only takes noise vectors as inputs, our model takes noise vectors along with content images as inputs, where the former are responsible for style variations and the latter determine the main contents. Therefore, in the inverse process, instead of inverting the whole generated image to the input noise vector like [32] does, we invert the style information of the stylized image to the input noise vector (details in Section 3.3). To be specific, we utilize a style encoder to extract the style information from the stylized image, and enforce the consistency between the style encoder's output and the input noise vector. The main problem now is how to obtain such a style encoder. We resolve this problem through the disentanglement branch.

Why the disentanglement branch is needed:

[32] alleviated the mode collapse issue in GANs by enforcing a bijective mapping between the input noise vectors and the generated images. Unlike [32], which takes only noise vectors as inputs, this model takes noise vectors together with content images, where the former are responsible for style variations and the latter determine the main content. Therefore, in the inverse process, instead of inverting the whole generated image back to the input noise vector as [32] does, only the style information of the stylized image is inverted back to the noise vector (details in Section 3.3). Concretely, a style encoder extracts the style information from the stylized image, and its output is constrained to be consistent with the input noise vector. The main question is how to obtain such a style encoder, which is resolved by the disentanglement branch.

First, the disentanglement branch employs an encoder E′c which takes the stylized image D(Ec(x), z) as input. Given that the content image and the stylized image share the same content but differ greatly in style, if we encourage the similarity between the output of Ec (whose input is the content image) and that of E′c (whose input is the stylized image), Ec and E′c will extract the shared content information and neglect the specific style information. Notice that Ec and E′c are two independent networks and do not share weights, because extracting the content of photographs differs somewhat from extracting the content of artworks. We define the corresponding content feature loss as,

L_FP = E_{x∼X, z∼p(z)}[ ‖ Ec(x) − E′c(D(Ec(x), z)) ‖ ]    (3)

First, the disentanglement branch employs an encoder E′c that takes the stylized image D(Ec(x), z) as input. Since the content image and the stylized image share the same content but differ greatly in style, encouraging similarity between the output of Ec (whose input is the content image) and that of E′c (whose input is the stylized image) drives Ec and E′c to extract the shared content information and ignore the style-specific information. Note that Ec and E′c are two independent networks and do not share weights, because extracting the content of photographs differs somewhat from extracting the content of artworks. The corresponding content feature loss is defined in Eq. (3).

However, L_FP may encourage Ec and E′c to output feature maps in which the value of each element is very small (i.e., ‖Ec(x)‖ → 0, ‖E′c(D(Ec(x), z))‖ → 0). In such a circumstance, although L_FP is minimized, the similarity between Ec(x) and E′c(D(Ec(x), z)) is not increased. To alleviate this problem, we employ a feature discriminator Df and introduce a content feature adversarial loss,

L_cadv = E_{x∼X}[ log Df(Ec(x)) ] + E_{x∼X, z∼p(z)}[ log(1 − Df(E′c(D(Ec(x), z)))) ]    (4)

L_cadv measures the distribution deviation and is less sensitive to the magnitude of its input than L_FP. In addition, L_cadv together with L_FP promotes similarity along two dimensions, further improving performance.

However, L_FP may encourage Ec and E′c to output feature maps in which every element is very small (i.e., ‖Ec(x)‖ → 0, ‖E′c(D(Ec(x), z))‖ → 0). In that case, L_FP is minimized but the similarity between Ec(x) and E′c(D(Ec(x), z)) is not actually increased. To alleviate this, a feature discriminator Df is employed and a content feature adversarial loss is introduced, Eq. (4).

L_cadv measures the deviation between distributions and is less sensitive to the magnitude of its input than L_FP; together, L_cadv and L_FP promote similarity along two dimensions and further improve performance.
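The encoder-side terms of Eqs. (3)-(4) might look like the following sketch; the L1 distance and the non-saturating BCE form are assumptions, and Df would be trained separately with the opposite labels.

```python
import torch
import torch.nn.functional as F

def content_feature_losses(Ec, Ec_prime, Df, x, stylized):
    """Eq. (3): direct distance between the photo's and the stylized image's
    content features. Eq. (4): adversarial alignment of the same two feature
    distributions through the feature discriminator Df, which is less sensitive
    to the absolute scale of the features."""
    f_x = Ec(x)                      # content features of the photo
    f_s = Ec_prime(stylized)         # content features of the stylized image
    l_fp = F.l1_loss(f_s, f_x)       # Eq. (3), L1 distance assumed

    logits = Df(f_s)                 # Eq. (4), encoder side (non-saturating form)
    l_cadv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    return l_fp, l_cadv
```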

Then the disentanglement branch adopts another encoder Es together with the content encoder E′c and the decoder D to reconstruct the artistic image. Since E′c is constrained to extract the content information, Es has to extract the style information to reconstruct the artistic image. Therefore, we get our desired style encoder Es. We formulate the reconstruction loss as,

L_recon = E_{y∼Y}[ ‖ D(E′c(y), Es(y)) − y ‖ ]    (5)

The disentanglement branch then adopts another encoder Es, together with the content encoder E′c and the decoder D, to reconstruct the artistic image. Since E′c is constrained to extract content information, Es has to extract the style information in order to reconstruct the artwork, which yields the desired style encoder Es. The reconstruction loss is defined in Eq. (5).
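A minimal sketch of the reconstruction term in Eq. (5), assuming an L1 distance; this is the constraint that forces Es to carry the style information.

```python
import torch.nn.functional as F

def reconstruction_loss(Ec_prime, Es, decoder, y):
    """Eq. (5) sketch: rebuild a real artwork y from its content code E'c(y)
    and style code Es(y)."""
    y_rec = decoder(Ec_prime(y), Es(y))
    return F.l1_loss(y_rec, y)       # L1 assumed; the paper's exact norm may differ
```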

3.3. Inverse Branch

Armed with the style encoder Es, we can access the style space of artistic images. To achieve diversity, the inverse branch enforces the one-to-one mapping between latent space and style space by employing the inverse loss,

L_inv = E_{x∼X, z∼p(z)}[ ‖ Es(D(Ec(x), z)) − z ‖ ]    (6)

The inverse loss ensures that the style information of the generated image D(Ec(x), z) can be inverted to the corresponding noise vector z, which implies that D(Ec(x), z) retains the influence and variability of z. In this way, we can get diverse stylization results by randomly sampling different z from the standard normal distribution N(0, I).

Armed with the style encoder Es, the style space of artistic images becomes accessible. To achieve diversity, the inverse branch enforces a one-to-one mapping between latent space and style space via the inverse loss in Eq. (6).

The inverse loss ensures that the style information of the generated image D(Ec(x), z) can be inverted back to the corresponding noise vector z, which implies that D(Ec(x), z) retains the influence and variability of z. Diverse stylization results can then be obtained by randomly sampling different z from the standard normal distribution N(0, I).
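A sketch of the inverse loss in Eq. (6); the L1 distance is an assumption. At test time, sampling several z ~ N(0, I) then yields distinct stylizations of the same content image.

```python
import torch
import torch.nn.functional as F

def inverse_loss(Es, decoder, Ec, x, z):
    """Eq. (6) sketch: the style code recovered from the stylized image must
    match the noise vector that produced it, enforcing a one-to-one mapping
    between the latent space and the style space."""
    stylized = decoder(Ec(x), z)
    return F.l1_loss(Es(stylized), z)
```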

3.4. Final Objective and Network Architectures

Figure 2 illustrates the full pipeline of our approach. We summarize all aforementioned losses and obtain the compound loss,

L = λ_adv·L_adv + λ_p·L_p + λ_fp·L_FP + λ_cadv·L_cadv + λ_recon·L_recon + λ_inv·L_inv

where the hyper-parameters λ_adv, λ_p, λ_fp, λ_cadv, λ_recon, and λ_inv control the importance of each term. We use the compound loss as the final objective to train our model.

Figure 2 illustrates the full pipeline of the approach. All the losses above are summed into the compound loss, where the hyper-parameters λ_adv, λ_p, λ_fp, λ_cadv, λ_recon and λ_inv control the importance of each term; this compound loss is the final training objective.
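As an illustration, the generator-side objective might be assembled as below from the individual loss terms (e.g., the sketches above), using the weights reported in the Network Architectures paragraph; this is a hedged reconstruction, not the authors' code.

```python
# Weights as reported by the paper (see Network Architectures below).
WEIGHTS = dict(adv=2.0, p=150.0, fp=100.0, cadv=10.0, recon=200.0, inv=600.0)

def compound_loss(losses: dict):
    """losses: dict with keys 'adv', 'p', 'fp', 'cadv', 'recon', 'inv',
    each holding the corresponding scalar loss term."""
    return sum(WEIGHTS[k] * losses[k] for k in WEIGHTS)
```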


Network Architectures

We build on the recent AST backbone [29] and extend it with our proposed changes to produce diverse stylization results. Specifically, the content encoders Ec and E′c have the same architecture and are composed of five convolution layers. The style encoder Es includes five convolution layers, a global average pooling layer, and a fully connected (FC) layer. Similar to [15], our decoder D has two branches. One branch takes the content image x as input, containing nine residual blocks [9], four upsampling blocks, and one convolution layer. The other branch takes the noise vector z as input (notice that at inference time, we can take either z or the style code Es(y) extracted from a reference image y as its input), containing one FC layer to produce a set of affine parameters γ, β. Then the two branches are combined through AdaIN [10],

AdaIN(a, γ, β) = γ · (a − μ(a)) / σ(a) + β

where a is the activation of the previous convolutional layer in branch one, and μ and σ are the channel-wise mean and standard deviation, respectively. The image discriminator D is a fully convolutional network with seven convolution layers. The feature discriminator Df consists of three convolution layers and one FC layer. As for P, it is an average pooling layer. The loss weights are set to λ_adv = 2, λ_p = 150, λ_fp = 100, λ_cadv = 10, λ_recon = 200, and λ_inv = 600. We use the Adam optimizer with a learning rate of 0.0002.

The network is built on the recent AST backbone [29] and extended with the proposed changes to produce diverse stylization results.

[29] A style-aware content loss for real-time hd style transfer. ECCV 2018.

Specifically, the content encoders Ec and E′c have the same architecture and consist of 5 convolution layers.

The style encoder Es consists of 5 convolution layers, a global average pooling layer, and a fully connected (FC) layer.

The decoder D has two branches. One branch takes the content image x as input and contains 9 residual blocks, 4 upsampling blocks, and 1 convolution layer. The other branch takes the noise vector z as input (at inference time, either z or the style code Es(y) extracted from a reference image y can be used) and contains one FC layer that produces a set of affine parameters γ, β.

The two branches are then combined through AdaIN, as in the equation above, where a is the activation of the previous convolutional layer in branch one, and μ and σ are the channel-wise mean and standard deviation, respectively.
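A small sketch of the AdaIN combination described above; the tensor shapes and the epsilon for numerical stability are assumptions.

```python
import torch

def adain(a: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor, eps: float = 1e-5):
    """Normalize the content activation a per channel, then apply the affine
    parameters (gamma, beta) predicted by the FC layer from z or Es(y).
    a: (N, C, H, W); gamma, beta: (N, C), broadcast over the spatial dims."""
    mu = a.mean(dim=(2, 3), keepdim=True)
    sigma = a.std(dim=(2, 3), keepdim=True) + eps
    return gamma[..., None, None] * (a - mu) / sigma + beta[..., None, None]
```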

The image discriminator D is a fully convolutional network with 7 convolution layers. The feature discriminator Df consists of 3 convolution layers and one FC layer. P is an average pooling layer.

The loss weights are set to λ_adv = 2, λ_p = 150, λ_fp = 100, λ_cadv = 10, λ_recon = 200, and λ_inv = 600, and the Adam optimizer is used with a learning rate of 0.0002.
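At inference time, diversity comes from sampling different noise vectors, or a style code can be extracted from a reference artwork. A hedged usage sketch follows; the latent dimension d_z and the module interfaces are assumptions.

```python
import torch

@torch.no_grad()
def stylize(decoder, Ec, x, Es=None, y_ref=None, num_samples=4, d_z=64):
    """If a reference artwork y_ref is given, use its style code Es(y_ref);
    otherwise sample num_samples noise vectors z ~ N(0, I) for diverse outputs.
    d_z is a hypothetical latent dimension."""
    content = Ec(x)
    if y_ref is not None and Es is not None:
        return [decoder(content, Es(y_ref))]
    return [decoder(content, torch.randn(x.size(0), d_z, device=x.device))
            for _ in range(num_samples)]
```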

4. Experiments

Dataset

Like [29, 17, 18, 33], we take Places365 [45] as the content dataset and WikiArt [14] as the style dataset (concretely, we collect hundreds of artworks for each artist from WikiArt and train a separate model for him/her). Training images were randomly cropped and resized to 768×768 resolution.

Places365 [GitHub][official site] is used as the content dataset and WikiArt [14] as the style dataset (concretely, hundreds of artworks are collected for each artist from WikiArt and a separate model is trained for him/her). Training images are randomly cropped and resized to 768×768 resolution.
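A plausible torchvision preprocessing pipeline for the 768×768 random crops described above; the crop scale range is an assumption, as the exact cropping parameters are not given here.

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(768, scale=(0.6, 1.0)),  # random crop, then resize to 768x768
    transforms.ToTensor(),
])
```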

Baselines

We take the following methods that can produce diversity as our baselines: Gatys et al. [7], Li et al. [23], Ulyanov et al. [35], DFP [37], and MUNIT [11]. Apart from the above methods, we also compare with AST [29] and Svoboda et al. [33] to make the experiments more sufficient. Note that we use their officially released code and default hyper-parameter settings for the experiments.

我們采用以下能夠產(chǎn)生多樣性的方法作為我們的基線:Gatys et al. [7], Li et al. [23], Ulyanov et al. [35], DFP[37],和MUNIT[11]。除上述方法外,我們還與AST[29]和Svoboda et al.[33]進(jìn)行了比較,使實(shí)驗(yàn)更加充分。請(qǐng)注意,我們使用他們官方發(fā)布的代碼和默認(rèn)超參數(shù)的實(shí)驗(yàn)設(shè)置。

[7] Image style transfer using convolutional neural networks. CVPR 2016.

[11] Multimodal unsupervised image-to-image translation. ECCV 2018.

[23] Diversified texture synthesis with feed-forward networks. CVPR 2017.

[29] A style-aware content loss for real-time hd style transfer. ECCV 2018.

[33] Two-stage peer-regularized feature recombination for arbitrary image style transfer. CVPR 2020. GitHub

[35] Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis. CVPR 2017.

[37] Diversified arbitrary style transfer via deep feature perturbation. CVPR 2020. GitHub

Experimental results: see the qualitative and quantitative comparison figures in the paper.

5. Conclusion

In this paper, we propose a Diverse Image Style Transfer (DIST) framework which achieves significant diversity without loss of quality by encouraging the one-to-one mapping between the latent space of input noise vectors and the style space of generated artistic images. The framework consists of three branches, where the stylization branch is responsible for stylizing the content image, and the other two branches (i.e., the disentanglement branch and the inverse branch) are responsible for diversity. Our extensive experimental results demonstrate the effectiveness and superiority of our method. In future work, we would like to extend our method to other tasks, such as text-to-image synthesis and image inpainting.

This paper proposes a Diverse Image Style Transfer (DIST) framework that achieves significant diversity without loss of quality by encouraging a one-to-one mapping between the latent space of input noise vectors and the style space of generated artistic images. The framework consists of three branches: the stylization branch is responsible for stylizing the content image, while the other two (the disentanglement branch and the inverse branch) are responsible for diversity. Extensive experimental results demonstrate the effectiveness and superiority of the method. In future work, the authors hope to extend the method to other tasks, such as text-to-image synthesis and image inpainting.
