DETAILED ACTION
As previously indicated in the Advisory action of 6/4/2021, Applicant’s remarks regarding independent claims 6 and 15, with respect to the Yao reference (Remarks, Pg. 8-9), were fully considered and found persuasive in overcoming the previous rejections. Therefore, the finality of the previous Office action has been withdrawn. However, upon further review, a new ground of rejection is warranted. Claims 1-20 are pending in the application.

Remarks
In the Final Office action of 1/6/2021, it was indicated that claims 1-5, 12-13, and 20 contained allowable subject matter. However, after having performed an updated search/review of prior art, it has been determined that items of information contained in newly found reference(s), along with NPL reference “Collaging on Internal Representations: An Intuitive Approach for Semantic Transfiguration”, furnished via IDS, are materially pertinent to the pending claims and necessitate further prosecution. Therefore, a new ground of rejection is warranted as indicated further below.

Claim Objections
Claim 9 is objected to because of the following informalities:
Claim 9, Ln. 4, recites the limitation “to infer for two regions”. However, it should recite “to infer for the two regions” instead. The Examiner believes the aforementioned discrepancy was due to a typographical error. Appropriate correction is required.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.



Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.
Claim 1, Ln. 14, recites the limitation “a substantially photorealistic image”. The term “substantially” in claim 1 is a relative term which renders the claim indefinite. The term “substantially” is not defined by the claim, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention.
Claims 2-4 are rejected by virtue of their dependency upon rejected base claim 1.
Claim 6, Ln. 4, recites the limitation “a substantially photorealistic image”. The term “substantially” in claim 6 is a relative term which renders the claim indefinite. The term “substantially” is not defined by the claim, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention.
Claims 7-14 are rejected by virtue of their dependency upon rejected base claim 6.
Claim 15, Ln. 7, recites the limitation “a substantially photorealistic image”. The term “substantially” in claim 15 is a relative term which renders the claim indefinite. The term “substantially” is not defined by the claim, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention.
Claims 16-20 are rejected by virtue of their dependency upon rejected base claim 15.
For examination purposes, the examiner has interpreted the claimed “a substantially photorealistic image” as “a photorealistic image”.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 4-11, and 15-19 are rejected under 35 U.S.C. 103 as being unpatentable over Fu et al. (PG Pub. No. 2019/0295302 A1), hereafter referred to as Fu, in view of Suzuki et al. (“Collaging on Internal Representations: An Intuitive Approach for Semantic Transfiguration”), hereafter referred to as Suzuki, Applicant-cited prior art.

Regarding claim 1, Fu discloses a computer-implemented method (Abstract: methods and systems for image generation through use of adversarial networks; Par. [0004]: a system for training an image generator… the system comprises a processor and a memory with computer code instructions stored thereon, wherein the processor and the memory, with the computer code instructions, are configured to cause the system to provide a generator, discriminator, and segmentor), comprising:
receiving a boundary input separating an image space into two regions associated with respective semantic labels, the respective image labels indicating respective types of image content;
generating a semantic segmentation mask representing the two regions with the respective semantic labels;
providing the semantic segmentation mask as input to a trained image synthesis network;
receiving, from the trained image synthesis network, value inferences for a plurality of pixel locations of the image space corresponding to the respective types of image content for the regions associated with those pixel locations; and
rendering a substantially photorealistic image from the image space using the value inferences, the photorealistic image including the types of image content for the regions defined by the boundary input (Fu, Par. [0041-46]: generator 240 takes as inputs, a target segmentation 227, a given image 226, and a vector 228 indicating desired attributes of the image to be generated. The generator 220 implemented with the blocks 221-225 is configured to receive the inputs 226, 227, 228 and generate a target image 229 that is based on, i.e., a translated version of, the input image 226 and consistent with the input segmentation 227 and attributes 228… The discriminator 240 is configured to take an image, e.g., the images 246 and/or 249 and produce a discrimination result 242 indicating if an input image is real, i.e., an image that did not get produced by the generator, or fake, i.e., an image created by the generator, and determine attributes 243 of the input image. The discriminator 240 pushes the generated images towards a target domain distribution, and meanwhile, utilizes an auxiliary attribute classifier to enable the SGGAN framework to generate images, such as the images in the row 102a in FIG. 1 with target attributes… the segmentor neural network 260 includes a convolutional block 261, a down-sampling convolutional block 262, residual block 263 (which may be implemented similarly to the residual block 223 described hereinabove), up-sampling convolutional block 264, and convolutional block 265. The segmentor network 260 implemented with the blocks 261-265 is configured to receive an input image 266a and/or 266b and generate a corresponding segmentation 267a and/or 267b, respectively, indicating features of the input images 266a and/or 266b. 
The segmentor 260 imposes semantic information on the image generation process… optimization tends to teach the generator 220 to impose the spatial constraints indicated in an input segmentation 271 on the translated images, e.g., 274. During the training 270, the segmentor 260 provides spatial guidance to the generator 220 to ensure the generated images, e.g., 274, comply with input segmentations, e.g., 271. The discriminator 240 aims to ensure the translated images, e.g., 274, are as realistic as the real images, e.g., 273… In the training procedure 270, the generator 220 is configured to receive the target segmentation 271, desired attributes vector 272, and real image 273 and from the inputs 271-273, generate the image 274. Further, the generator 220 (which is depicted twice in FIG. 2D to show additional processing) is configured to perform a reconstruction process that attempts to reconstruct the input image 273 using a segmentation 275 that is based on the real image 273, attributes 276 of the real image 273, and the generated image 274; Par. [0050-68]: generator 301 is configured to receive three inputs, an input image (source image) 304, a target segmentation 305, and a vector of target attributes 306. A goal of the training process is to configure the generator 301 to translate the input image 304 into a generated image (fake image) 307, which complies with the target segmentation 305 and attribute labels 306… path of generator training is a reconstruction loss path which takes the generated image 307 as an input to the generator 301, as well as two other inputs, a source segmentation 315 (which may be a ground-truth landmark based segmentation) and a source attributes label 316. 
This path is expected to reconstruct an image 317 from the generated fake image 307 that should match the input source image 304… To train the discriminator 302, the input source image 304 is fed to the discriminator 302 which generates the discrimination result 319 and classification result 320. The discrimination result 319 is used to calculate a real adversarial loss term 321, and the classification result 320 is compared with the real source attributes label 316 to calculate a real classification loss 322. The fake adversarial losses 313, real adversarial losses 321, and the real classification loss 322 are summed up and fed to the optimizer 323 to optimize the discriminator 302 … a problem formulation, details of the segmentor network, and an overall objective function of an embodiment are provided… Let x, s, and c be an image of size (HxWx3), with the segmentation map (HxW ns) and attributes vector (1xnc) in the source domain; while y, s' and c' are its corresponding image, segmentation, and attributes in the target domain. The number of segmentation classes is denoted as ns as classes and the number of all the attributes is denoted as nc. Note, that for s and s', each pixel is represented by a one-hot vector of ns classes, while for c and c', they are binary vectors of multiple labels, in the scenario of multi-domain translation. Given this formulation, a goal of an embodiment is to find a mapping… To achieve this… G is formulated as the generator network in the proposed SGGAN model. Meanwhile, such an embodiment employs a discriminator D and a segmentor S to supervise the training of the generator G… In order to guide the generator by the target segmentation information, an additional network is built which takes an image as input and generates the image's corresponding semantic segmentation. 
This network is referred to as the segmentor network S which is trained together with the GAN framework… a processor-generated image, where the processor may be a neural network, and a target segmentation, according to an embodiment, is a set of segments, e.g., sets of pixels or set of contours, that correspond to portions or landmarks… of an image… extract features of the target segmentation, a first concatenation block configured to concatenate the extracted features with a latent vector, an up-sampling block configured to construct a layout of the fake image using the concatenated extracted features and latent vector, a second concatenation block configured to concatenate the layout with an attribute label to generate a multidimensional matrix representing features of the fake image; Par. [0123-127]: SCGAN has both spatial and attribute-level controllability, with a segmentor network that guides the generator network with spatial information, and increases the model stability for convergence… to avoid foreground-background mismatch, the generator network is configured to first, extract spatial information from an input segmentation, second, concatenate that latent vector to provide variations, and third, use attribute labels to synthesize attribute-specific contents in the generated image… a SCGAN that takes latent vectors, attribute labels, and semantic segmentations as inputs, and decouples the image generation into three dimensions. As such, embodiments of the SCGAN are capable of generating images with controlled spatial contents and attributes and generate target images; Par. [0115-134]: generate more realistic results with better image quality (sharper and clearer details) after image translation… embodiments implement image generation with spatial constraints. 
An example embodiment implements a novel Spatially Constrained Generative Adversarial Network (SCGAN), which decouples the spatial constraints from the latent vector and makes them available as additional control signal inputs… SCGAN includes a generator network, a discriminator network with an auxiliary classifier, and a segmentor network, which are trained together adversarially… the generator is specially designed to take a semantic segmentation, a latent vector, and an attribute label as inputs step by step to synthesize a fake image… the discriminator network is configured to distinguish between real images and generated images as well as classify images into attributes. The discrimination and classification results may guide the generator to synthesize realistic images with correct target attributes. The segmentor network… attempts to determine semantic segmentations on both real images and fake images to deliver estimated segmentations to guide the generator in synthesizing spatially constrained images. With those networks, embodiments implementing the SCGAN generate realistic images guided by semantic segmentations and attribute labels… provide image generation that synthesizes realistic images which cannot be distinguished from the real images of a given target dataset. Embodiments employ spatial constraints in generating high-quality images with target-oriented controllability… In an embodiment, (x, c, s) denotes the joint distribution of a target dataset, where x is a real image sample of size (HxWx3) with H and Was the height and width of x, c is its attribute label of size (1xnc) with nc as the number of attributes, and s is its semantic segmentation of size (HxWxns) with ns as the number of segmentation classes. 
Each pixel in s is represented by a one-hot vector with dimension ns, which codes the semantic index of that pixel… the problem is be defined as G (z, c, s)… where G( , , ) is the generating function, z is the latent vector of size (1xnz), c defines the target attributes, s acts as a high-level and pixel-wise spatial constraint, and y is the conditionally generated image which complies with the target c and s… SCGAN 1330 comprises three networks, a generator network G 1340, a discriminator network D 1360 with auxiliary classifier, and a segmentor network S 1380 which are trained together. The generator 1340 is designed such that a semantic segmentation, a latent vector, and an attribute label are input to the generator 1340 step by step to generate a fake image. The discriminator takes either fake or real images as input and outputs a discrimination result and a classification result. Similar to the discriminator 1360, the segmentor 1380 takes either a fake or real image as input and outputs a segmentation result which is compared to the ground-truth segmentation to calculate segmentation loss, which guides the generator to synthesize fake images that comply with the input segmentation… to construct the basic structure of the output image and attribute label c 1352 is fed into the generator 1340 through the expand block 1354 to guide the generator 1340 to generate attribute-specific images which share the similar basic image contents generated from s 1350 and z 1351…the expand block 1354 performs an expand operation to an input vector (attribute label 1352) by repeating the vector 1352 to the same width and height as a reference vector (the output of the block 1344). In such an embodiment, the expansion allows the vectors to be concatenated at the block 1345. 
The attribute label c 1352 is concatenated at block 1345 with the basic structure from the up-sampling blocks 1343 and 1344, to generate a multidimensional matrix representing features of the image being generated… To obtain realistic results which are difficult to distinguish from original data x (e.g., a real image 1356), the discriminator network D 1360 is implemented to form a GAN framework with the generator G 1340; Par. [0172-182]: generator of an embodiment of the SCGAN takes three inputs, semantic segmentation, latent vector, and attribute label. A critical issue is that the contents in the synthesized image should be decoupled well to be controlled by those inputs (semantic segmentation, latent vector, and attribute label)… a generator that functions in a step-by-step way to first, extract spatial information from semantic segmentation to construct the basic spatial structure of the synthesized image. Second, such an embodiment of the generator, takes the latent vector to add variations to the other unregulated components, and, in turn, uses the attribute label to render attribute-specific contents. As a result… the generator can successfully decouple the contents of synthesized images into controllable inputs. This approach solves the foreground-background merging problem and generates spatially controllable and attribute-specific images with variations on other unregulated contents… a conditional and target-oriented image generation task using a novel deep learning based adversarial network. In particular, one such example embodiment increases the controllability of image generation by using semantic segmentation as spatial constraints and attribute labels as conditional guidance. 
Embodiments can control the spatial contents as well as attribute-specific contents and generate diversified images with sharper and more realistic details… target-oriented image generation with spatial constraints… Spatially Constrained, Generative Adversarial Network (SCGAN) that decouples the spatial constraints from a latent vector and makes them available as additional control signal inputs. A SCGAN embodiment includes a generator network, a discriminator network with an auxiliary classifier, and a segmentor network, which are trained together adversarially… the generator is specially designed to take a semantic segmentation, a latent vector, and an attribute label as inputs step by step to synthesize a fake image. The discriminator network tries to distinguish between real images and generated images as well as classify the images into attributes. The discrimination and classification results guide the generator to synthesize realistic images with correct target attributes. The segmentor network attempts to conduct semantic segmentations on both real images and fake images to deliver estimated segmentations to guide the generator in synthesizing spatially constrained images. With those networks, example embodiments have increased controllability of an image synthesis task. Embodiment generate target-oriented realistic images guided by semantic segmentations and attribute labels; receiving a boundary input separating an image space into two regions associated with respective semantic labels, the respective image labels indicating respective types of image content (e.g. generate target-oriented realistic images (i.e. photorealistic images) guided by using semantic segmentations (i.e. based on semantic layout, semantic segmentation mask, etc.) and attribute labels representing features of the image being generated (i.e. regions associated with respective semantic labels), including a generator which is configured to receive inputs (i.e. 
sources, indications, selections, etc.) and generate a target image that is based on a translated version of the input image and consistent with the input segmentation and corresponding attributes (i.e. the respective image labels indicating respective types of image content), including target segmentation(s) comprising a set of segments, such as sets of pixels or set of contours (i.e. boundaries, shapes, silhouettes, etc.) that correspond to portions (i.e. regions, areas, etc.) or landmarks of an image being generated (i.e. receiving a boundary input separating an image space into (first, second, third… Nth) regions associated with 
generating a semantic segmentation mask representing the two regions with the respective semantic labels (e.g. generate target-oriented realistic images (i.e. photorealistic images) guided by using semantic segmentations (i.e. based on semantic layout, semantic segmentation mask, etc.) and attribute labels representing features of the image being generated (i.e. a semantic segmentation mask representing (first, second, third… Nth) regions associated with respective semantic labels), including target segmentation(s) comprising a set of segments, such as sets of pixels or set of contours (i.e. boundaries, shapes, silhouettes, etc.) that correspond to portions (i.e. regions, areas, etc.) or landmarks of an image being generated (i.e. a boundary input separating an image space into the (first, second, third… Nth) regions of a digital representation of an image), by using a segmentor network which is configured to receive an input image and generate a corresponding segmentation indicating features of the input image (i.e. and generating the semantic layout (i.e. semantic segmentation mask) representing the (first, second, third… Nth) regions with the semantic labels), as indicated above), for example);
providing the semantic segmentation mask as input to a trained image synthesis network (e.g. generator is specially designed to take a semantic segmentation (i.e. the semantic segmentation mask), a latent vector, and an attribute label as inputs to synthesize photorealistic image, including a segmentor network which is trained (i.e. a trained image synthesis network) together with the GAN framework, as indicated above), for example);

rendering a substantially photorealistic [a photorealistic] image from the image space using the value inferences, the photorealistic image including the types of image content for the regions defined by the boundary input (e.g. generate target-oriented realistic images (i.e. photorealistic images) guided by using semantic segmentations (i.e. based on semantic layout, semantic segmentation mask, etc.) and attribute labels representing features of the image being generated (i.e. indication of semantic labels to be associated with image space regions), including a generator which is configured to receive inputs (i.e. receiving value inferences for a plurality of pixel locations of the image space corresponding to the respective types of image content for the regions associated with those pixel locations) and generate a target image that is based on a translated version of the input image and consistent with the input segmentation and corresponding attributes (i.e. semantic labels associated with image space regions including the types of image content for the regions defined by the boundary input), including target segmentation(s) comprising a set of segments, such as sets of pixels or set of contours (i.e. boundaries, shapes, silhouettes, etc.) that correspond to portions (i.e. regions, areas, etc.) or landmarks of an image being generated (i.e. a boundary input separating an image space into the (first, second, third… Nth) regions of a digital representation of an image), including a segmentor network which is configured to receive an input image and generate a corresponding segmentation indicating features of the input image (i.e. and rendering (i.e. generating, constructing, etc.) a photorealistic 
Fu further discloses a neural network architecture comprising an instance normalization (IN) step (Par. [0094]: Table 1 below illustrates the network architecture for the embodiments of the present invention implemented… In Table 1… IN refers to instance normalization; Par. [0154]: synthesis provides… images associated with attribute labels, caption, and semantic segmentation… Batch normalization [Ioffe et al., "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," In International Conference on Machine Learning, 448-456 (2015)] in both the generator and the segmentor was replaced with instance normalization [Ulyanov et al., "Instance Normalization: The Missing Ingredient for Fast Stylization"]), but does not expressly disclose the following as further recited in claim 1.
However, Suzuki teaches the trained image synthesis network including a spatially-adaptive normalization layer configured to propagate semantic information from the semantic segmentation mask throughout other layers of the trained image synthesis network (Pg. 1, Abstract: CNN-based image editing method that allows the user to change the semantic information of an image over a user-specified region. Our method makes this possible by combining the idea of manifold projection with spatial conditional batch normalization (sCBN), a version of conditional batch normalization with user-specifiable spatial weight maps. With sCBN and manifold projection, our method lets the user perform (1) spatial class translation that changes the class of an object over an arbitrary region of user’s choice, and (2) semantic transplantation that transplants semantic information contained in an arbitrary region of the reference image to an arbitrary region in the target image; Pg. 1, Par. 1-2: deep generative models like generative adversarial networks (GANs) [10] and variational autoencoders (VAEs) [20] make possible the unsupervised learning of rich latent semantic information from images… Image conditional GANs [24, 40, 17] based on encoder-decoder architectures have been popular both for their convenient implementation in end-to-end differentiable ML frameworks, and their uncanny ability to produce photo-realistic images; Pg. 2, Par. 3: CNN-based image editing method that grants the user this very freedom. With our method, the user can transform a user-chosen part of image in a copy-paste fashion–and the user can do this all the while preserving semantic consistency. 
More precisely, we present a method that features two types of image transformation: (1) spatial class-translation that translates a class category of a region of interest, and (2) semantic transplantation that transplants a semantic feature of an user-selected region in an arbitrary image to a region of interest in the target image. To facilitate this editing process, we also propose an efficient optimization method to project images onto the latent space of generator; Pg. 2, Par. 7-8: Class-conditional GAN [29, 26, 43, 2] is a framework designed to learn an invariant latent representation among various classes, and it is capable of generating diverse images from a same latent code z by changing class embedding (Figure 2). The work of [26, 2], in particular, succeeded in producing an impressive results by interpolating the parameters of conditional batch normalization layer, which was first introduced in [31, 5]. Conditional batch normalization (CBN) is mechanism that learns conditional information by separately learning condition-specific scaling parameter and shifting parameter for batch normalization. Our method extends the technique used in [26] by restricting the region of interpolation to a region that corresponds to the region of interest in the pixel space. We will refer to our approach spatial conditional batch normalization (sCBN). Unlike the manipulation done in style transfer [12], we introduce the conditional information at multiple levels in the network, depending on the style preference of the user. As we will show, sCBN in the lower layers transforms global features, and sCBN at upper layers transforms local features… Semantic transformation. In order to grant the user with wide freedom of semantic transformation, there has to be some mechanism to finely adjust the user-suggested transformation so that the final product becomes natural; Pg. 3, Par. 
3-4: Spatial Class-translation With our spatial class translation, the user can change the class of the object in the user-selected region of interest (ROI). The user can change the class of a part of the target objects in intuitive fashion… Spatial Semantic Transplantation With our semantic transplantation, the user can transplant a semantic feature of the user-selected object in the reference image to an object in the target image to be transformed. Our method first prompts the user to specify the region in the target image containing the object of interest, along with the reference image of equal size. The user will be also asked to specify the region in the reference image that contains the semantic information to be transplanted. The method then automatically transplants the semantic information of the specified region of the reference image into the target image; Pg. 4, Par. 2-5 and Pg. 5, Par. 1: spatial class translation Our method functions on a trained conditional generator G, paired with the discriminator D with which G was trained. Upon receiving the region of interest x clipped from the target image and the class c of the target object contained in x, the algorithm begins by looking for a latent variable z such that G(z; c) will be close to x in the feature space of D (Manifold Projection step). The class c can either be specified by the user or by a pre-trained classifier. Suppose that the user wants to partially translate a region R in x to a class c′, and let Vℓ be the set of features in ℓ-th conditional batch normalization(CBN) layers that correspond to R in the pixel space. Our method then simply substitutes the parameters governing the shift and mean parameters of Vℓ with those of c′ (Figure 6). This will result in a modification of G… in which the CBN parameters of Vℓ exclusively carry the style information of the class c′. 
A transformed image can be constructed by applying this modified G… to z… our spatial editing method is applicable to any generative model (e.g., GAN, VAE) that is equipped with a machanism to iteratively incorporate class information during its image generation process. We will next elaborate on the design of our sCBN and spatial semantic implantation, along with the other details omitted in the brief description above… Spatial conditional batch normalization (sCBN) is the core of our spatial class translation. As can be inferred from our naming, sCBN is based on batch normalization (BN) [16], a technique developed for the purpose of reducing the internal covariance shift to accelerate the training of neural network. More precisely, we will borrow our idea from conditional batch normalization (CBN) [8, 5], a variant of BN that incorporates the class specific semantic information in the parameters for BN. Given a set of batches sampled each from a single class, the conditional batch normalization [8, 5] works by modulating the set of intermediate features produced from each batch of inputs so that it follow a normal distribution with mean and variance that are specific to the corresponding class. Let us fix the layer ℓ, and let Fk,h,w represent the feature of ℓ-th layer at channel k, height location h, and width location w. Given a batch {Fi,k,h,w} of Fk,h,w s generated from class c, the CBN at layer ℓ then transforms Fi,k,h,w… In our implementation, we replaced CBN at each layer with sCBN; Pg. 6, Par. 3: generator used in our study is a ResNet-based generator trained as part of a conditional DCGAN. Each residua l block in our generator contains the conditional batch normalization (CBN) layer. At the time of inference, these CBN layers are replaced by the aforementioned sCBN layers that are tailored to the user’s preference. We base our architectures on those used in previous work [25, 26], and used the pre-trained model from [26]; Pg. 8, Par. 
6: image transformation method that allows the user to translate the class of an object and transplant semantic features over a user-specified pixel region of the image. Indeed, there is still much room left for the exploration of the semantic information contained in the intermediate feature spaces of CNNs. We were, however, able to show that we can manipulate this information in a somewhat intuitive manner and produce customized photorealistic images; Pg. 11, Par. 9: we conducted a set of automatic spatial class translations. For each one the selected images, we (1) used a pre-trained model to extract the region of the object to be transformed (dog/cat), (2) conducted the manifold projection to obtain the z, (3) passed z to the generator with the class map corresponding to the segmented region, and (4) conducted a post-processing over the segmented region. For the semantic segmentation, we used a TensorFlow implementation of DeepLab v3 Xception model trained on MS COCO dataset; the trained image synthesis network including a spatially-adaptive normalization layer configured to propagate semantic information from the semantic segmentation mask throughout other layers of the trained image synthesis network (e.g. image generative model which uses a neural network, such as a Generative Adversarial Network (GAN), that is equipped with a mechanism to iteratively incorporate class (i.e. feature, attribute, label, etc.) information during its image generation, including semantic features (i.e. class, attribute, label, etc.) of selected object regions (i.e. propagate semantic information), which are segmented/extracted in a reference (i.e. source, input, etc.) image, corresponding to object(s) in a target image to be transformed (i.e. an image to be generated), based on the set of features Vℓ in ℓ-th (first, second, third… Nth) conditional (i.e. adaptive, instance, etc.) batch normalization (CBN) layers (i.e. 
a spatially-adaptive normalization layer configured to propagate (i.e. transfer, pass, etc.) semantic information from the semantic layout throughout other layers of the neural network) that correspond to a region R in the pixel space, including a number of spatial conditional batch normalization (sCBN) layers, as indicated above), for example).
Fu and Suzuki are considered to be analogous art because they pertain to image processing applications based on neural networks. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the 

Regarding claim 2, claim 1 is incorporated and the combination of Fu and Suzuki, as a whole, teaches the method (Fu, Par. [0004]), further comprising:
modulating, by the spatially-adaptive normalization layer, a set of activations through a spatially-adaptive transformation in order to propagate the semantic information throughout the other layers of the trained image synthesis network (Suzuki, Pg. 1, Abstract: CNN-based image editing method that allows the user to change the semantic information of an image over a user-specified region. Our method makes this possible by combining the idea of manifold projection with spatial conditional batch normalization (sCBN), a version of conditional batch normalization with user-specifiable spatial weight maps. With sCBN and manifold projection, our method lets the user perform (1) spatial class translation that changes the class of an object over an arbitrary region of user’s choice, and (2) semantic transplantation that transplants semantic information contained in an arbitrary region of the reference image to an arbitrary region in the target image; Pg. 1, Par. 1-2: deep generative models like generative adversarial networks (GANs) [10] and variational autoencoders (VAEs) [20] make possible the unsupervised learning of rich latent semantic information from images… Image conditional GANs [24, 40, 17] based on encoder-decoder architectures have been popular both for their convenient implementation in end-to-end differentiable ML frameworks, and their uncanny ability to produce photo-realistic images; Pg. 2, Par. 3: CNN-based image editing method that grants the user this very freedom. With our method, the user can transform a user-chosen part of image in a copy-paste fashion–and the user can do this all the while preserving semantic consistency. More precisely, we present a method that features two types of image transformation: (1) spatial class-translation that translates a class category of a region of interest, and (2) semantic transplantation that transplants a semantic feature of an user-selected region in an arbitrary image to a region of interest in the target image. 
To facilitate this editing process, we also propose an efficient optimization method to project images onto the latent space of generator; Pg. 2, Par. 7-8: Class-conditional GAN [29, 26, 43, 2] is a framework designed to learn an invariant latent representation among various classes, and it is capable of generating diverse images from a same latent code z by changing class embedding (Figure 2). The work of [26, 2], in particular, succeeded in producing an impressive results by interpolating the parameters of conditional batch normalization layer, which was first introduced in [31, 5]. Conditional batch normalization (CBN) is mechanism that learns conditional information by separately learning condition-specific scaling parameter and shifting parameter for batch normalization. Our method extends the technique used in [26] by restricting the region of interpolation to a region that corresponds to the region of interest in the pixel space. We will refer to our approach spatial conditional batch normalization (sCBN). Unlike the manipulation done in style transfer [12], we introduce the conditional information at multiple levels in the network, depending on the style preference of the user. As we will show, sCBN in the lower layers transforms global features, and sCBN at upper layers transforms local features… Semantic transformation. In order to grant the user with wide freedom of semantic transformation, there has to be some mechanism to finely adjust the user-suggested transformation so that the final product becomes natural; Pg. 3, Par. 3-4: Spatial Class-translation With our spatial class translation, the user can change the class of the object in the user-selected region of interest (ROI). 
The user can change the class of a part of the target objects in intuitive fashion… Spatial Semantic Transplantation With our semantic transplantation, the user can transplant a semantic feature of the user-selected object in the reference image to an object in the target image to be transformed. Our method first prompts the user to specify the region in the target image containing the object of interest, along with the reference image of equal size. The user will be also asked to specify the region in the reference image that contains the semantic information to be transplanted. The method then automatically transplants the semantic information of the specified region of the reference image into the target image; Pg. 4, Par. 2-5 and Pg. 5, Par. 1: spatial class translation Our method functions on a trained conditional generator G, paired with the discriminator D with which G was trained. Upon receiving the region of interest x clipped from the target image and the class c of the target object contained in x, the algorithm begins by looking for a latent variable z such that G(z; c) will be close to x in the feature space of D (Manifold Projection step). The class c can either be specified by the user or by a pre-trained classifier. Suppose that the user wants to partially translate a region R in x to a class c′, and let Vℓ be the set of features in ℓ-th conditional batch normalization (CBN) layers that correspond to R in the pixel space. Our method then simply substitutes the parameters governing the shift and mean parameters of Vℓ with those of c′ (Figure 6). This will result in a modification of G… in which the CBN parameters of Vℓ exclusively carry the style information of the class c′. A transformed image can be constructed by applying this modified G… to z… our spatial editing method is applicable to any generative model (e.g., GAN, VAE) that is equipped with a mechanism to iteratively incorporate class information during its image generation process. 
We will next elaborate on the design of our sCBN and spatial semantic implantation, along with the other details omitted in the brief description above… Spatial conditional batch normalization (sCBN) is the core of our spatial class translation. As can be inferred from our naming, sCBN is based on batch normalization (BN) [16], a technique developed for the purpose of reducing the internal covariance shift to accelerate the training of neural network. More precisely, we will borrow our idea from conditional batch normalization (CBN) [8, 5], a variant of BN that incorporates the class specific semantic information in the parameters for BN. Given a set of batches sampled each from a single class, the conditional batch normalization [8, 5] works by modulating the set of intermediate features produced from each batch of inputs so that it follow a normal distribution with mean and variance that are specific to the corresponding class. Let us fix the layer ℓ, and let Fk,h,w represent the feature of ℓ-th layer at channel k, height location h, and width location w. Given a batch {Fi,k,h,w} of Fk,h,w s generated from class c, the CBN at layer ℓ then transforms Fi,k,h,w… In our implementation, we replaced CBN at each layer with sCBN; Pg. 6, Par. 3: generator used in our study is a ResNet-based generator trained as part of a conditional DCGAN. Each residual block in our generator contains the conditional batch normalization (CBN) layer. At the time of inference, these CBN layers are replaced by the aforementioned sCBN layers that are tailored to the user’s preference. We base our architectures on those used in previous work [25, 26], and used the pre-trained model from [26]; Pg. 8, Par. 6: image transformation method that allows the user to translate the class of an object and transplant semantic features over a user-specified pixel region of the image. 
Indeed, there is still much room left for the exploration of the semantic information contained in the intermediate feature spaces of CNNs. We were, however, able to show that we can manipulate this information in a somewhat intuitive manner and produce customized photorealistic images; Pg. 11, Par. 9: we conducted a set of automatic spatial class translations. For each one the selected images, we (1) used a pre-trained model to extract the region of the object to be transformed (dog/cat), (2) conducted the manifold projection to obtain the z, (3) passed z to the generator with the class map corresponding to the segmented region, and (4) conducted a post-processing over the segmented region. For the semantic segmentation, we used a TensorFlow implementation of DeepLab v3 Xception model trained on MS COCO dataset; modulating, by the spatially-adaptive normalization layer, a set of activations through a spatially-adaptive transformation in order to propagate the semantic information throughout the other layers of the trained image synthesis network (e.g. image generative model which uses a neural network, such as a Generative Adversarial Network (GAN), that is equipped with a mechanism to iteratively incorporate class (i.e. feature, attribute, label, etc.) information during its image generation, including semantic features (i.e. class, attribute, label, etc.) of selected object regions (i.e. propagate semantic information), which are segmented/extracted in a reference (i.e. source, input, etc.) image, corresponding to object(s) in a target image to be transformed (i.e. an image to be generated), based on the set of features Vℓ in ℓ-th (first, second, third… Nth) conditional (i.e. adaptive, instance, etc.) batch normalization (CBN) layers (i.e. spatially-adaptive normalization layer is a conditional layer configured to propagate (i.e. transfer, pass, etc.) 
semantic information from the semantic layout throughout other layers of the neural network) that correspond to a region R in the pixel space, by incorporating the class specific semantic information in the parameters for batch normalization (BN), and given a set of batches sampled each from a single class, the conditional batch normalization works by modulating the set of intermediate features produced from each batch of inputs so that it follow a normal distribution with mean and variance that are specific to the corresponding class (i.e. modulating, by the spatially-adaptive normalization layer, a set of activations through a spatially-adaptive transformation in order to propagate the semantic information throughout the other layers of the trained image synthesis network), as indicated above), for example).


Regarding claim 4, claim 1 is incorporated and the combination of Fu and Suzuki, as a whole teaches the method (Fu, Par. [0004]), wherein the trained image synthesis network includes a generative adversarial network (GAN) including a generator and a discriminator (Fu, Par. [0034-45]: embodiments employ a segmentor implemented with a neural network that is designed to impose semantic information on the generated images… GAN has the potential to provide realistic image generation… Segmentation Guided Generative Adversarial Network (SGGAN), which fully leverages semantic segmentation information to guide the image generation (e.g., translation) process… the image semantic segmentation can be obtained through a variety of methodologies, such as human annotations or any variety of existing segmentation methods… explicitly guide the generator with pixel-level semantic segmentations and, thus, further boost the quality of generated images. Further, the target segmentation employed in embodiments works as a strong prior, i.e., provides knowledge that stems from previous experience, for the image generator, which is able to use this prior knowledge to edit the spatial content… the segmentor neural network 260 includes a convolutional block 261… The segmentor network 260 implemented with the blocks 261-265 is configured to receive an input image 266a and/or 266b and generate a corresponding segmentation 267a and/or 267b, respectively, indicating features of the input images 266a and/or 266b. The segmentor 260 imposes semantic information on the image generation process … During training, estimated segmentations from the segmentor 260 are compared with their ground-truth values, which provides gradient information to optimize the generator 220. This optimization tends to teach the generator 220 to impose the spatial constraints indicated in an input segmentation 271 on the translated images, e.g., 274. 
During the training 270, the segmentor 260 provides spatial guidance to the generator 220 to ensure the generated images, e.g., 274, comply with input segmentations, e.g., 271. The discriminator 240 aims to ensure the translated images, e.g., 274, are as realistic as the real images, e.g., 273… the segmentor 260 receives a target segmentation 271 and a generated image 274 produced by the generator 220. Then, based upon a segmentation loss, i.e., the difference between a segmentation determined from the generated image 274 and the target segmentation 271, the segmentor 260 is adjusted, e.g., weights in a neural network implementing the segmentor 260 are modified so the segmentor 260 produces segmentations that are closer to the target segmentation 271. The generator 240 is likewise adjusted based upon the segmentation loss to generate images that are closer to the target segmentation 271; Par. [0050-68]: generator 301 is configured to receive three inputs, an input image (source image) 304, a target segmentation 305, and a vector of target attributes 306. A goal of the training process is to configure the generator 301 to translate the input image 304 into a generated image (fake image) 307, which complies with the target segmentation 305 and attribute labels 306… path of generator training is a reconstruction loss path which takes the generated image 307 as an input to the generator 301, as well as two other inputs, a source segmentation 315 (which may be a ground-truth landmark based segmentation) and a source attributes label 316. This path is expected to reconstruct an image 317 from the generated fake image 307 that should match the input source image 304… To train the discriminator 302, the input source image 304 is fed to the discriminator 302 which generates the discrimination result 319 and classification result 320. 
The discrimination result 319 is used to calculate a real adversarial loss term 321, and the classification result 320 is compared with the real source attributes label 316 to calculate a real classification loss 322. The fake adversarial losses 313, real adversarial losses 321, and the real classification loss 322 are summed up and fed to the optimizer 323 to optimize the discriminator 302 … a problem formulation, details of the segmentor network, and an overall objective function of an embodiment are provided… Let x, s, and c be an image of size (HxWx3), with the segmentation map (HxW ns) and attributes vector (1xnc) in the source domain; while y, s' and c' are its corresponding image, segmentation, and attributes in the target domain. The number of segmentation classes is denoted as ns as classes and the number of all the attributes is denoted as nc. Note, that for s and s', each pixel is represented by a one-hot vector of ns classes, while for c and c', they are binary vectors of multiple labels, in the scenario of multi-domain translation. Given this formulation, a goal of an embodiment is to find a mapping… To achieve this… G is formulated as the generator network in the proposed SGGAN model. Meanwhile, such an embodiment employs a discriminator D and a segmentor S to supervise the training of the generator G… In order to guide the generator by the target segmentation information, an additional network is built which takes an image as input and generates the image's corresponding semantic segmentation. 
This network is referred to as the segmentor network S which is trained together with the GAN framework… a processor-generated image, where the processor may be a neural network, and a target segmentation, according to an embodiment, is a set of segments, e.g., sets of pixels or set of contours, that correspond to portions or landmarks… of an image… extract features of the target segmentation, a first concatenation block configured to concatenate the extracted features with a latent vector, an up-sampling block configured to construct a layout of the fake image using the concatenated extracted features and latent vector, a second concatenation block configured to concatenate the layout with an attribute label to generate a multidimensional matrix representing features of the fake image; Par. [0115-134]: generate more realistic results with better image quality (sharper and clearer details) after image translation… embodiments implement image generation with spatial constraints. An example embodiment implements a novel Spatially Constrained Generative Adversarial Network (SCGAN), which decouples the spatial constraints from the latent vector and makes them available as additional control signal inputs… SCGAN includes a generator network, a discriminator network with an auxiliary classifier, and a segmentor network, which are trained together adversarially… the generator is specially designed to take a semantic segmentation, a latent vector, and an attribute label as inputs step by step to synthesize a fake image… the discriminator network is configured to distinguish between real images and generated images as well as classify images into attributes. The discrimination and classification results may guide the generator to synthesize realistic images with correct target attributes. 
The segmentor network… attempts to determine semantic segmentations on both real images and fake images to deliver estimated segmentations to guide the generator in synthesizing spatially constrained images. With those networks, embodiments implementing the SCGAN generate realistic images guided by semantic segmentations and attribute labels… provide image generation that synthesizes realistic images which cannot be distinguished from the real images of a given target dataset. Embodiments employ spatial constraints in generating high-quality images with target-oriented controllability… In an embodiment, (x, c, s) denotes the joint distribution of a target dataset, where x is a real image sample of size (HxWx3) with H and W as the height and width of x, c is its attribute label of size (1xnc) with nc as the number of attributes, and s is its semantic segmentation of size (HxWxns) with ns as the number of segmentation classes. Each pixel in s is represented by a one-hot vector with dimension ns, which codes the semantic index of that pixel… the problem is be defined as G (z, c, s)… where G( , , ) is the generating function, z is the latent vector of size (1xnz), c defines the target attributes, s acts as a high-level and pixel-wise spatial constraint, and y is the conditionally generated image which complies with the target c and s… SCGAN 1330 comprises three networks, a generator network G 1340, a discriminator network D 1360 with auxiliary classifier, and a segmentor network S 1380 which are trained together. The generator 1340 is designed such that a semantic segmentation, a latent vector, and an attribute label are input to the generator 1340 step by step to generate a fake image. The discriminator takes either fake or real images as input and outputs a discrimination result and a classification result. 
Similar to the discriminator 1360, the segmentor 1380 takes either a fake or real image as input and outputs a segmentation result which is compared to the ground-truth segmentation to calculate segmentation loss, which guides the generator to synthesize fake images that comply with the input segmentation… to construct the basic structure of the output image and attribute label c 1352 is fed into the generator 1340 through the expand block 1354 to guide the generator 1340 to generate attribute-specific images which share the similar basic image contents generated from s 1350 and z 1351…the expand block 1354 performs an expand operation to an input vector (attribute label 1352) by repeating the vector 1352 to the same width and height as a reference vector (the output of the block 1344). In such an embodiment, the expansion allows the vectors to be concatenated at the block 1345. The attribute label c 1352 is concatenated at block 1345 with the basic structure from the up-sampling blocks 1343 and 1344, to generate a multidimensional matrix representing features of the image being generated… To obtain realistic results which are difficult to distinguish from original data x (e.g., a real image 1356), the discriminator network D 1360 is implemented to form a GAN framework with the generator G 1340; Par. [0172-182]: generator of an embodiment of the SCGAN takes three inputs, semantic segmentation, latent vector, and attribute label. A critical issue is that the contents in the synthesized image should be decoupled well to be controlled by those inputs (semantic segmentation, latent vector, and attribute label)… a generator that functions in a step-by-step way to first, extract spatial information from semantic segmentation to construct the basic spatial structure of the synthesized image. 
Second, such an embodiment of the generator, takes the latent vector to add variations to the other unregulated components, and, in turn, uses the attribute label to render attribute-specific contents. As a result… the generator can successfully decouple the contents of synthesized images into controllable inputs. This approach solves the foreground-background merging problem and generates spatially controllable and attribute-specific images with variations on other unregulated contents… a conditional and target-oriented image generation task using a novel deep learning based adversarial network. In particular, one such example embodiment increases the controllability of image generation by using semantic segmentation as spatial constraints and attribute labels as conditional guidance. Embodiments can control the spatial contents as well as attribute-specific contents and generate diversified images with sharper and more realistic details… target-oriented image generation with spatial constraints… Spatially Constrained, Generative Adversarial Network (SCGAN) that decouples the spatial constraints from a latent vector and makes them available as additional control signal inputs. A SCGAN embodiment includes a generator network, a discriminator network with an auxiliary classifier, and a segmentor network, which are trained together adversarially… the generator is specially designed to take a semantic segmentation, a latent vector, and an attribute label as inputs step by step to synthesize a fake image. The discriminator network tries to distinguish between real images and generated images as well as classify the images into attributes. The discrimination and classification results guide the generator to synthesize realistic images with correct target attributes. The segmentor network attempts to conduct semantic segmentations on both real images and fake images to deliver estimated segmentations to guide the generator in synthesizing spatially constrained images. 
With those networks, example embodiments have increased controllability of an image synthesis task. Embodiment generate target-oriented realistic images guided by semantic segmentations and attribute labels; wherein the neural network is a generative adversarial network (GAN) including a generator and a discriminator (e.g. generate (i.e. produce, infer, construct, etc.) target-oriented realistic images (i.e. infer substantially photorealistic images) guided by semantic segmentations (i.e. based on semantic layout, semantic segmentation mask, etc.) and attribute labels representing features of the image being generated, including target segmentation(s) comprising a set of segments, such as sets of pixels or set of contours (i.e. boundaries, shapes, silhouettes, etc.) that correspond to portions (i.e. regions, areas, etc.) or landmarks of an image being generated (i.e. received semantic layout/segmentation mask indicating a plurality (first, second, third… Nth) of regions of a digital representation of an image), by using a neural network, such as a Spatially Constrained Generative Adversarial Network (SCGAN) (i.e. the neural network is a generative adversarial network (GAN)), including a generator network and a discriminator network (i.e. including a generator and a discriminator), to generate diversified images with 

Regarding claim 5, claim 1 is incorporated and the combination of Fu and Suzuki, as a whole, teaches the method (Fu, Par. [0004]), further comprising:
selecting content, from a plurality of content options of the type of image content for a first region of the two regions, to generate for the first region (Fu, Par. [0123-127]: SCGAN has both spatial and attribute-level controllability, with a segmentor network that guides the generator network with spatial information, and increases the model stability for convergence… to avoid foreground-background mismatch, the generator network is configured to first, extract spatial information from an input segmentation, second, concatenate that latent vector to provide variations, and third, use attribute labels to synthesize attribute-specific contents in the generated image… a SCGAN that takes latent vectors, attribute labels, and semantic segmentations as inputs, and decouples the image generation into three dimensions. As such, embodiments of the SCGAN are capable of generating images with controlled spatial contents and attributes and generate target images; selecting content, from a plurality of content options of the type of image content for a first region of the two regions, to generate for the first region (e.g. generate target-oriented realistic images (i.e. photorealistic images) guided by semantic segmentations (i.e. based on semantic layout, semantic segmentation mask, etc.) and attribute labels representing features of the image being generated (i.e. determined semantic labels), including target segmentation(s) comprising a set of segments, such th) regions of a digital representation of an image), including the (first, second, third… Nth) regions of a digital representation of an image (i.e. to generate for the first region), including an image generator which uses (i.e. selects) attribute (i.e. semantic) labels to render attribute-specific contents (i.e. 
selecting content, from a plurality of content options of the type of image content for a first region of the (first, second, third… Nth) regions, to generate for the first region) by controlling the spatial contents as well as attribute-specific contents to generate diversified images with sharper and more realistic details (i.e. photorealistic images), as indicated above), for example).

Regarding claim 6, Fu discloses a computer-implemented method (Abstract: methods and systems for image generation through use of adversarial networks; Par. [0004]: a system for training an image generator… the system comprises a processor and a memory with computer code instructions stored thereon, wherein the processor and the memory, with the computer code instructions, are configured to cause the system to provide a generator, discriminator, and segmentor), comprising:
receiving a semantic layout indicating two regions of a digital representation of an image; and
inferring a substantially photorealistic image using a neural network based, at least in part, on the received semantic layout (Par. [0004]: a processor-generated image, where the processor may be a neural network, and a target segmentation… is a set of segments, e.g., sets of pixels or set of contours, that correspond to portions or landmarks… of an image; Par. [0034-45]: embodiments employ a segmentor implemented with a neural network that is designed to impose semantic information on the generated images… GAN has the potential to provide realistic image generation… Segmentation Guided Generative Adversarial Network (SGGAN), which fully leverages semantic segmentation information to guide the image generation (e.g., translation) process… the image semantic segmentation can be obtained through a variety of methodologies, such as human annotations or any variety of existing segmentation methods… explicitly guide the generator with pixel-level semantic segmentations and, thus, further boost the quality of generated images. Further, the target segmentation employed in embodiments works as a strong prior, i.e., provides knowledge that stems from previous experience, for the image generator, which is able to use this prior knowledge to edit the spatial content… the segmentor neural network 260 includes a convolutional block 261… The segmentor network 260 implemented with the blocks 261-265 is configured to receive an input image 266a and/or 266b and generate a corresponding segmentation 267a and/or 267b, respectively, indicating features of the input images 266a and/or 266b. The segmentor 260 imposes semantic information on the image generation process … During training, estimated segmentations from the segmentor 260 are compared with their ground-truth values, which provides gradient information to optimize the generator 220. 
This optimization tends to teach the generator 220 to impose the spatial constraints indicated in an input segmentation 271 on the translated images, e.g., 274. During the training 270, the segmentor 260 provides spatial guidance to the generator 220 to ensure the generated images, e.g., 274, comply with input segmentations, e.g., 271. The discriminator 240 aims to ensure the translated images, e.g., 274, are as realistic as the real images, e.g., 273… the segmentor 260 receives a target segmentation 271 and a generated image 274 produced by the generator 220. Then, based upon a segmentation loss, i.e., the difference between a segmentation determined from the generated image 274 and the target segmentation 271, the segmentor 260 is adjusted, e.g., weights in a neural network implementing the segmentor 260 are modified so the segmentor 260 produces segmentations that are closer to the target segmentation 271. The generator 240 is likewise adjusted based upon the segmentation loss to generate images that are closer to the target segmentation 271; Par. [0050-68]: generator 301 is configured to receive three inputs, an input image (source image) 304, a target segmentation 305, and a vector of target attributes 306. A goal of the training process is to configure the generator 301 to translate the input image 304 into a generated image (fake image) 307, which complies with the target segmentation 305 and attribute labels 306… path of generator training is a reconstruction loss path which takes the generated image 307 as an input to the generator 301, as well as two other inputs, a source segmentation 315 (which may be a ground-truth landmark based segmentation) and a source attributes label 316. 
This path is expected to reconstruct an image 317 from the generated fake image 307 that should match the input source image 304… To train the discriminator 302, the input source image 304 is fed to the discriminator 302 which generates the discrimination result 319 and classification result 320. The discrimination result 319 is used to calculate a real adversarial loss term 321, and the classification result 320 is compared with the real source attributes label 316 to calculate a real classification loss 322. The fake adversarial losses 313, real adversarial losses 321, and the real classification loss 322 are summed up and fed to the optimizer 323 to optimize the discriminator 302 … a problem formulation, details of the segmentor network, and an overall objective function of an embodiment are provided… Let x, s, and c be an image of size (HxWx3), with the segmentation map (HxW ns) and attributes vector (1xnc) in the source domain; while y, s' and c' are its corresponding image, segmentation, and attributes in the target domain. The number of segmentation classes is denoted as ns as classes and the number of all the attributes is denoted as nc. Note, that for s and s', each pixel is represented by a one-hot vector of ns classes, while for c and c', they are binary vectors of multiple labels, in the scenario of multi-domain translation. Given this formulation, a goal of an embodiment is to find a mapping… To achieve this… G is formulated as the generator network in the proposed SGGAN model. Meanwhile, such an embodiment employs a discriminator D and a segmentor S to supervise the training of the generator G… In order to guide the generator by the target segmentation information, an additional network is built which takes an image as input and generates the image's corresponding semantic segmentation. 
This network is referred to as the segmentor network S which is trained together with the GAN framework… a processor-generated image, where the processor may be a neural network, and a target segmentation, according to an embodiment, is a set of segments, e.g., sets of pixels or set of contours, that correspond to portions or landmarks… of an image… extract features of the target segmentation, a first concatenation block configured to concatenate the extracted features with a latent vector, an up-sampling block configured to construct a layout of the fake image using the concatenated extracted features and latent vector, a second concatenation block configured to concatenate the layout with an attribute label to generate a multidimensional matrix representing features of the fake image; Par. [0115-134]: generate more realistic results with better image quality (sharper and clearer details) after image translation… embodiments implement image generation with spatial constraints. An example embodiment implements a novel Spatially Constrained Generative Adversarial Network (SCGAN), which decouples the spatial constraints from the latent vector and makes them available as additional control signal inputs… SCGAN includes a generator network, a discriminator network with an auxiliary classifier, and a segmentor network, which are trained together adversarially… the generator is specially designed to take a semantic segmentation, a latent vector, and an attribute label as inputs step by step to synthesize a fake image… the discriminator network is configured to distinguish between real images and generated images as well as classify images into attributes. The discrimination and classification results may guide the generator to synthesize realistic images with correct target attributes. 
The segmentor network… attempts to determine semantic segmentations on both real images and fake images to deliver estimated segmentations to guide the generator in synthesizing spatially constrained images. With those networks, embodiments implementing the SCGAN generate realistic images guided by semantic segmentations and attribute labels… provide image generation that synthesizes realistic images which cannot be distinguished from the real images of a given target dataset. Embodiments employ spatial constraints in generating high-quality images with target-oriented controllability… In an embodiment, (x, c, s) denotes the joint distribution of a target dataset, where x is a real image sample of size (HxWx3) with H and W as the height and width of x, c is its attribute label of size (1xnc) with nc as the number of attributes, and s is its semantic segmentation of size (HxWxns) with ns as the number of segmentation classes. Each pixel in s is represented by a one-hot vector with dimension ns, which codes the semantic index of that pixel… the problem is be defined as G (z, c, s)… where G( , , ) is the generating function, z is the latent vector of size (1xnz), c defines the target attributes, s acts as a high-level and pixel-wise spatial constraint, and y is the conditionally generated image which complies with the target c and s… SCGAN 1330 comprises three networks, a generator network G 1340, a discriminator network D 1360 with auxiliary classifier, and a segmentor network S 1380 which are trained together. The generator 1340 is designed such that a semantic segmentation, a latent vector, and an attribute label are input to the generator 1340 step by step to generate a fake image. The discriminator takes either fake or real images as input and outputs a discrimination result and a classification result. 
Similar to the discriminator 1360, the segmentor 1380 takes either a fake or real image as input and outputs a segmentation result which is compared to the ground-truth segmentation to calculate segmentation loss, which guides the generator to synthesize fake images that comply with the input segmentation… to construct the basic structure of the output image and attribute label c 1352 is fed into the generator 1340 through the expand block 1354 to guide the generator 1340 to generate attribute-specific images which share the similar basic image contents generated from s 1350 and z 1351…the expand block 1354 performs an expand operation to an input vector (attribute label 1352) by repeating the vector 1352 to the same width and height as a reference vector (the output of the block 1344). In such an embodiment, the expansion allows the vectors to be concatenated at the block 1345. The attribute label c 1352 is concatenated at block 1345 with the basic structure from the up-sampling blocks 1343 and 1344, to generate a multidimensional matrix representing features of the image being generated… To obtain realistic results which are difficult to distinguish from original data x (e.g., a real image 1356), the discriminator network D 1360 is implemented to form a GAN framework with the generator G 1340; Par. [0172-182]: generator of an embodiment of the SCGAN takes three inputs, semantic segmentation, latent vector, and attribute label. A critical issue is that the contents in the synthesized image should be decoupled well to be controlled by those inputs (semantic segmentation, latent vector, and attribute label)… a generator that functions in a step-by-step way to first, extract spatial information from semantic segmentation to construct the basic spatial structure of the synthesized image. 
Second, such an embodiment of the generator, takes the latent vector to add variations to the other unregulated components, and, in turn, uses the attribute label to render attribute-specific contents. As a result… the generator can successfully decouple the contents of synthesized images into controllable inputs. This approach solves the foreground-background merging problem and generates spatially controllable and attribute-specific images with variations on other unregulated contents… a conditional and target-oriented image generation task using a novel deep learning based adversarial network. In particular, one such example embodiment increases the controllability of image generation by using semantic segmentation as spatial constraints and attribute labels as conditional guidance. Embodiments can control the spatial contents as well as attribute-specific contents and generate diversified images with sharper and more realistic details… target-oriented image generation with spatial constraints… Spatially Constrained, Generative Adversarial Network (SCGAN) that decouples the spatial constraints from a latent vector and makes them available as additional control signal inputs. A SCGAN embodiment includes a generator network, a discriminator network with an auxiliary classifier, and a segmentor network, which are trained together adversarially… the generator is specially designed to take a semantic segmentation, a latent vector, and an attribute label as inputs step by step to synthesize a fake image. The discriminator network tries to distinguish between real images and generated images as well as classify the images into attributes. The discrimination and classification results guide the generator to synthesize realistic images with correct target attributes. The segmentor network attempts to conduct semantic segmentations on both real images and fake images to deliver estimated segmentations to guide the generator in synthesizing spatially constrained images. 
With those networks, example embodiments have increased controllability of an image synthesis task. Embodiment generate target-oriented realistic images guided by semantic segmentations and attribute labels; receiving a semantic layout indicating two regions of a digital representation of an image and inferring a substantially photorealistic image using a neural network based, at least in part, on the received semantic layout (e.g. generate (i.e. produce, infer, construct, etc.) target-oriented realistic images (i.e. infer substantially photorealistic images) guided by semantic segmentations (i.e. based on semantic layout, semantic segmentation mask, etc.) and attribute labels representing features of the image being generated, including target segmentation(s) comprising a set of segments, such as sets of pixels or set of contours (i.e. boundaries, shapes, silhouettes, etc.) that correspond to portions (i.e. regions, areas, etc.) or landmarks of an image being generated (i.e. received semantic layout/segmentation mask indicating a plurality (first, second, third… Nth) of regions of a digital representation of an image), by using a neural network, such as a Spatially Constrained Generative Adversarial Network (SCGAN), to generate diversified images with sharper and more realistic details (i.e. to infer a substantially photorealistic image), as indicated above), for example).
Fu further discloses a neural network architecture comprising an instance normalization (IN) step (Par. [0094]: Table 1 below illustrates the network architecture for the embodiments of the present invention implemented… In Table 1… IN refers to instance normalization; Par. [0154]: synthesis provides… images associated with attribute labels, caption, and semantic segmentation… Batch normalization [Ioffe et al., "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," In International Conference on Machine Learning, 448-456 (2015)] in both the generator and the segmentor was replaced with instance normalization [Ulyanov et al., "Instance Normalization: The Missing Ingredient for Fast Stylization"…]), but does not expressly disclose the following as further recited in claim 6.
However, Suzuki teaches wherein the neural network includes at least one spatially-adaptive normalization layer to normalize information from the semantic layout (Pg. 1, Abstract: CNN-based image editing method that allows the user to change the semantic information of an image over a user-specified region. Our method makes this possible by combining the idea of manifold projection with spatial conditional batch normalization (sCBN), a version of conditional batch normalization with user-specifiable spatial weight maps. With sCBN and manifold projection, our method lets the user perform (1) spatial class translation that changes the class of an object over an arbitrary region of user’s choice, and (2) semantic transplantation that transplants semantic information contained in an arbitrary region of the reference image to an arbitrary region in the target image; Pg. 1, Par. 1-2: deep generative models like generative adversarial networks (GANs) [10] and variational autoencoders (VAEs) [20] make possible the unsupervised learning of rich latent semantic information from images… Image conditional GANs [24, 40, 17] based on encoder-decoder architectures have been popular both for their convenient implementation in end-to-end differentiable ML frameworks, and their uncanny ability to produce photo-realistic images; Pg. 2, Par. 3: CNN-based image editing method that grants the user this very freedom. With our method, the user can transform a user-chosen part of image in a copy-paste fashion–and the user can do this all the while preserving semantic consistency. More precisely, we present a method that features two types of image transformation: (1) spatial class-translation that translates a class category of a region of interest, and (2) semantic transplantation that transplants a semantic feature of an user-selected region in an arbitrary image to a region of interest in the target image. 
To facilitate this editing process, we also propose an efficient optimization method to project images onto the latent space of generator; Pg. 2, Par. 7-8: Class-conditional GAN [29, 26, 43, 2] is a framework designed to learn an invariant latent representation among various classes, and it is capable of generating diverse images from a same latent code z by changing class embedding (Figure 2). The work of [26, 2], in particular, succeeded in producing an impressive results by interpolating the parameters of conditional batch normalization layer, which was first introduced in [31, 5]. Conditional batch normalization (CBN) is mechanism that learns conditional information by separately learning condition-specific scaling parameter and shifting parameter for batch normalization. Our method extends the technique used in [26] by restricting the region of interpolation to a region that corresponds to the region of interest in the pixel space. We will refer to our approach spatial conditional batch normalization (sCBN). Unlike the manipulation done in style transfer [12], we introduce the conditional information at multiple levels in the network, depending on the style preference of the user. As we will show, sCBN in the lower layers transforms global features, and sCBN at upper layers transforms local features… Semantic transformation. In order to grant the user with wide freedom of semantic transformation, there has to be some mechanism to finely adjust the user-suggested transformation so that the final product becomes natural; Pg. 3, Par. 3-4: Spatial Class-translation With our spatial class translation, the user can change the class of the object in the user-selected region of interest (ROI). 
The user can change the class of a part of the target objects in intuitive fashion… Spatial Semantic Transplantation With our semantic transplantation, the user can transplant a semantic feature of the user-selected object in the reference image to an object in the target image to be transformed. Our method first prompts the user to specify the region in the target image containing the object of interest, along with the reference image of equal size. The user will be also asked to specify the region in the reference image that contains the semantic information to be transplanted. The method then automatically transplants the semantic information of the specified region of the reference image into the target image; Pg. 4, Par. 2-5 and Pg. 5, Par. 1: spatial class translation Our method functions on a trained conditional generator G, paired with the discriminator D with which G was trained. Upon receiving the region of interest x clipped from the target image and the class c of the target object contained in x, the algorithm begins by looking for a latent variable z such that G(z; c) will be close to x in the feature space of D (Manifold Projection step). The class c can either be specified by the user or by a pre-trained classifier. Suppose that the user wants to partially translate a region R in x to a class c′, and let Vℓ be the set of features in ℓ-th conditional batch normalization(CBN) layers that correspond to R in the pixel space. Our method then simply substitutes the parameters governing the shift and mean parameters of Vℓ with those of c′ (Figure 6). This will result in a modification of G… in which the CBN parameters of Vℓ exclusively carry the style information of the class c′. A transformed image can be constructed by applying this modified G… to z… our spatial editing method is applicable to any generative model (e.g., GAN, VAE) that is equipped with a machanism to iteratively incorporate class information during its image generation process. 
We will next elaborate on the design of our sCBN and spatial semantic implantation, along with the other details omitted in the brief description above… Spatial conditional batch normalization (sCBN) is the core of our spatial class translation. As can be inferred from our naming, sCBN is based on batch normalization (BN) [16], a technique developed for the purpose of reducing the internal covariance shift to accelerate the training of neural network. More precisely, we will borrow our idea from conditional batch normalization (CBN) [8, 5], a variant of BN that incorporates the class specific semantic information in the parameters for BN. Given a set of batches sampled each from a single class, the conditional batch normalization [8, 5] works by modulating the set of intermediate features produced from each batch of inputs so that it follow a normal distribution with mean and variance that are specific to the corresponding class. Let us fix the layer ℓ, and let Fk,h,w represent the feature of ℓ-th layer at channel k, height location h, and width location w. Given a batch {Fi,k,h,w} of Fk,h,w s generated from class c, the CBN at layer ℓ then transforms Fi,k,h,w… In our implementation, we replaced CBN at each layer with sCBN; Pg. 6, Par. 3: generator used in our study is a ResNet-based generator trained as part of a conditional DCGAN. Each residua l block in our generator contains the conditional batch normalization (CBN) layer. At the time of inference, these CBN layers are replaced by the aforementioned sCBN layers that are tailored to the user’s preference. We base our architectures on those used in previous work [25, 26], and used the pre-trained model from [26]; Pg. 8, Par. 6: image transformation method that allows the user to translate the class of an object and transplant semantic features over a user-specified pixel region of the image. 
Indeed, there is still much room left for the exploration of the semantic information contained in the intermediate feature spaces of CNNs. We were, however, able to show that we can manipulate this information in a somewhat intuitive manner and produce customized photorealistic images; Pg. 11, Par. 9: we conducted a set of automatic spatial class translations. For each one of the selected images, we (1) used a pre-trained model to extract the region of the object to be transformed (dog/cat), (2) conducted the manifold projection to obtain the z, (3) passed z to the generator with the class map corresponding to the segmented region, and (4) conducted a post-processing over the segmented region. For the semantic segmentation, we used a TensorFlow implementation of DeepLab v3 Xception model trained on MS COCO dataset; wherein the neural network includes at least one spatially-adaptive normalization layer to normalize information from the semantic layout (e.g. an image generative model which uses a neural network, such as a Generative Adversarial Network (GAN), comprising a set of features Vℓ in ℓ-th (first, second, third… Nth) conditional (i.e. adaptive, instance, etc.) batch normalization (CBN) layers that correspond to a region R in the pixel space (i.e. at least one spatially-adaptive normalization layer to normalize information), including a number of spatial conditional batch normalization (sCBN) layers, as indicated above), for example).
Fu and Suzuki are considered to be analogous art because they pertain to image processing applications based on neural networks. Therefore, it would have been obvious to someone of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the method for image generation through use of adversarial networks (as disclosed by Fu) with a neural network that includes at least one spatially-adaptive normalization layer to normalize information from the semantic layout (as taught by Suzuki, Abstract, Pg. 1-6 and 11) to produce customized photorealistic images based on a set of photorealistic transformations (Suzuki, Abstract, Pg. 1-2 and 8).
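
For illustration only, and not as part of either cited disclosure, the normalization step that Suzuki describes (class-specific scale and shift parameters applied according to user-specifiable spatial weight maps) may be sketched as follows. This is a minimal sketch under stated assumptions: it handles a single channel, computes the statistics over one feature map rather than over a batch, and the names scbn_single_channel, weight_maps, gammas and betas are hypothetical and do not appear in Suzuki.

```python
import math

def scbn_single_channel(feature, weight_maps, gammas, betas, eps=1e-5):
    """Illustrative spatially weighted conditional normalization.

    feature:     H x W list of lists (one channel, one sample; an assumption).
    weight_maps: dict mapping class name -> H x W weights in [0, 1].
    gammas:      dict mapping class name -> scale for this channel.
    betas:       dict mapping class name -> shift for this channel.
    """
    # Normalize the channel to zero mean and unit variance.
    values = [v for row in feature for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var + eps)

    out = []
    for h, row in enumerate(feature):
        out_row = []
        for w, v in enumerate(row):
            # Mix the class-specific parameters per pixel, so that each
            # region receives the scale and shift of its own class.
            gamma = sum(weight_maps[c][h][w] * gammas[c] for c in weight_maps)
            beta = sum(weight_maps[c][h][w] * betas[c] for c in weight_maps)
            out_row.append(gamma * (v - mean) / std + beta)
        out.append(out_row)
    return out

# Two regions: the left column belongs to class "a", the right to class "b".
feature = [[1.0, 3.0], [1.0, 3.0]]
weight_maps = {"a": [[1, 0], [1, 0]], "b": [[0, 1], [0, 1]]}
result = scbn_single_channel(feature, weight_maps,
                             gammas={"a": 1.0, "b": 2.0},
                             betas={"a": 0.0, "b": 10.0})
```

In this sketch the same normalized feature is modulated differently in each region, which is the sense in which the layer is spatially adaptive.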

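Regarding claim 7, claim 6 is incorporated and the combination of Fu and Suzuki, as a whole, teaches the method (Fu, Par. [0004]), further comprising:

determining semantic labels associated with the two regions, the semantic labels indicating respective types of image content; and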
generating representations of the respective types of image content for the two regions of the substantially photorealistic image (Fu, Par. [0050-68]: generator 301 is configured to receive three inputs, an input image (source image) 304, a target segmentation 305, and a vector of target attributes 306. A goal of the training process is to configure the generator 301 to translate the input image 304 into a generated image (fake image) 307, which complies with the target segmentation 305 and attribute labels 306… path of generator training is a reconstruction loss path which takes the generated image 307 as an input to the generator 301, as well as two other inputs, a source segmentation 315 (which may be a ground-truth landmark based segmentation) and a source attributes label 316. This path is expected to reconstruct an image 317 from the generated fake image 307 that should match the input source image 304… To train the discriminator 302, the input source image 304 is fed to the discriminator 302 which generates the discrimination result 319 and classification result 320. The discrimination result 319 is used to calculate a real adversarial loss term 321, and the classification result 320 is compared with the real source attributes label 316 to calculate a real classification loss 322. The fake adversarial losses 313, real adversarial losses 321, and the real classification loss 322 are summed up and fed to the optimizer 323 to optimize the discriminator 302 … a problem formulation, details of the segmentor network, and an overall objective function of an embodiment are provided… Let x, s, and c be an image of size (HxWx3), with the segmentation map (HxW ns) and attributes vector (1xnc) in the source domain; while y, s' and c' are its corresponding image, segmentation, and attributes in the target domain. The number of segmentation classes is denoted as ns as classes and the number of all the attributes is denoted as nc. 
Note, that for s and s', each pixel is represented by a one-hot vector of ns classes, while for c and c', they are binary vectors of multiple labels, in the scenario of multi-domain translation. Given this formulation, a goal of an embodiment is to find a mapping… To achieve this… G is formulated as the generator network in the proposed SGGAN model. Meanwhile, such an embodiment employs a discriminator D and a segmentor S to supervise the training of the generator G… In order to guide the generator by the target segmentation information, an additional network is built which takes an image as input and generates the image's corresponding semantic segmentation. This network is referred to as the segmentor network S which is trained together with the GAN framework… a processor-generated image, where the processor may be a neural network, and a target segmentation, according to an embodiment, is a set of segments, e.g., sets of pixels or set of contours, that correspond to portions or landmarks… of an image… extract features of the target segmentation, a first concatenation block configured to concatenate the extracted features with a latent vector, an up-sampling block configured to construct a layout of the fake image using the concatenated extracted features and latent vector, a second concatenation block configured to concatenate the layout with an attribute label to generate a multidimensional matrix representing features of the fake image; Par. [0115-134]: generate more realistic results with better image quality (sharper and clearer details) after image translation… embodiments implement image generation with spatial constraints. 
An example embodiment implements a novel Spatially Constrained Generative Adversarial Network (SCGAN), which decouples the spatial constraints from the latent vector and makes them available as additional control signal inputs… SCGAN includes a generator network, a discriminator network with an auxiliary classifier, and a segmentor network, which are trained together adversarially… the generator is specially designed to take a semantic segmentation, a latent vector, and an attribute label as inputs step by step to synthesize a fake image… the discriminator network is configured to distinguish between real images and generated images as well as classify images into attributes. The discrimination and classification results may guide the generator to synthesize realistic images with correct target attributes. The segmentor network… attempts to determine semantic segmentations on both real images and fake images to deliver estimated segmentations to guide the generator in synthesizing spatially constrained images. With those networks, embodiments implementing the SCGAN generate realistic images guided by semantic segmentations and attribute labels… provide image generation that synthesizes realistic images which cannot be distinguished from the real images of a given target dataset. Embodiments employ spatial constraints in generating high-quality images with target-oriented controllability… In an embodiment, (x, c, s) denotes the joint distribution of a target dataset, where x is a real image sample of size (HxWx3) with H and W as the height and width of x, c is its attribute label of size (1xnc) with nc as the number of attributes, and s is its semantic segmentation of size (HxWxns) with ns as the number of segmentation classes. 
Each pixel in s is represented by a one-hot vector with dimension ns, which codes the semantic index of that pixel… the problem is be defined as G (z, c, s)… where G( , , ) is the generating function, z is the latent vector of size (1xnz), c defines the target attributes, s acts as a high-level and pixel-wise spatial constraint, and y is the conditionally generated image which complies with the target c and s… SCGAN 1330 comprises three networks, a generator network G 1340, a discriminator network D 1360 with auxiliary classifier, and a segmentor network S 1380 which are trained together. The generator 1340 is designed such that a semantic segmentation, a latent vector, and an attribute label are input to the generator 1340 step by step to generate a fake image. The discriminator takes either fake or real images as input and outputs a discrimination result and a classification result. Similar to the discriminator 1360, the segmentor 1380 takes either a fake or real image as input and outputs a segmentation result which is compared to the ground-truth segmentation to calculate segmentation loss, which guides the generator to synthesize fake images that comply with the input segmentation… to construct the basic structure of the output image and attribute label c 1352 is fed into the generator 1340 through the expand block 1354 to guide the generator 1340 to generate attribute-specific images which share the similar basic image contents generated from s 1350 and z 1351…the expand block 1354 performs an expand operation to an input vector (attribute label 1352) by repeating the vector 1352 to the same width and height as a reference vector (the output of the block 1344). In such an embodiment, the expansion allows the vectors to be concatenated at the block 1345. 
The attribute label c 1352 is concatenated at block 1345 with the basic structure from the up-sampling blocks 1343 and 1344, to generate a multidimensional matrix representing features of the image being generated… To obtain realistic results which are difficult to distinguish from original data x (e.g., a real image 1356), the discriminator network D 1360 is implemented to form a GAN framework with the generator G 1340; Par. [0172-182]: generator of an embodiment of the SCGAN takes three inputs, semantic segmentation, latent vector, and attribute label. A critical issue is that the contents in the synthesized image should be decoupled well to be controlled by those inputs (semantic segmentation, latent vector, and attribute label)… a generator that functions in a step-by-step way to first, extract spatial information from semantic segmentation to construct the basic spatial structure of the synthesized image. Second, such an embodiment of the generator, takes the latent vector to add variations to the other unregulated components, and, in turn, uses the attribute label to render attribute-specific contents. As a result… the generator can successfully decouple the contents of synthesized images into controllable inputs. This approach solves the foreground-background merging problem and generates spatially controllable and attribute-specific images with variations on other unregulated contents… a conditional and target-oriented image generation task using a novel deep learning based adversarial network. In particular, one such example embodiment increases the controllability of image generation by using semantic segmentation as spatial constraints and attribute labels as conditional guidance. 
Embodiments can control the spatial contents as well as attribute-specific contents and generate diversified images with sharper and more realistic details… target-oriented image generation with spatial constraints… Spatially Constrained, Generative Adversarial Network (SCGAN) that decouples the spatial constraints from a latent vector and makes them available as additional control signal inputs. A SCGAN embodiment includes a generator network, a discriminator network with an auxiliary classifier, and a segmentor network, which are trained together adversarially… the generator is specially designed to take a semantic segmentation, a latent vector, and an attribute label as inputs step by step to synthesize a fake image. The discriminator network tries to distinguish between real images and generated images as well as classify the images into attributes. The discrimination and classification results guide the generator to synthesize realistic images with correct target attributes. The segmentor network attempts to conduct semantic segmentations on both real images and fake images to deliver estimated segmentations to guide the generator in synthesizing spatially constrained images. With those networks, example embodiments have increased controllability of an image synthesis task. Embodiment generate target-oriented realistic images guided by semantic segmentations and attribute labels; determining semantic labels associated with the two regions, the semantic labels indicating respective types of image content and generating representations of the respective types of image content for the two regions of the substantially photorealistic image (e.g. generate target-oriented realistic images (i.e. photorealistic images) guided by semantic segmentations and attribute labels (i.e. semantic layout indicating (first, second, third… Nth) regions of a digital representation of an image), by using a neural network, including an image generator which uses attribute labels to render attribute-specific contents (i.e. 
the semantic labels indicating respective types of image content), to generate diversified images with sharper and more realistic details (i.e. and generating representations of the respective types of image content for the (first, second, third… Nth) regions of the substantially photorealistic image), as indicated above), for example).
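
For illustration only, the expand-and-concatenate step that Fu describes for blocks 1354 and 1345 (repeating the attribute label to the width and height of the feature map, then concatenating the two) may be sketched as follows. This is a minimal sketch under stated assumptions and is not Fu's implementation: the function names expand and concat_channels are hypothetical, and feature maps are represented as nested Python lists of shape H x W x C.

```python
def expand(label, height, width):
    """Repeat a 1-D attribute label at every spatial position (H x W x nc).

    Hypothetical helper mirroring the "expand" operation described in Fu.
    """
    return [[list(label) for _ in range(width)] for _ in range(height)]


def concat_channels(features, expanded):
    """Concatenate two H x W x * maps along the channel axis."""
    return [
        [f_px + e_px for f_px, e_px in zip(f_row, e_row)]
        for f_row, e_row in zip(features, expanded)
    ]


# A 2 x 2 feature map with 3 channels (standing in for the up-sampled layout)
features = [[[0.0, 0.0, 0.0] for _ in range(2)] for _ in range(2)]
# A binary attribute label with nc = 2 attributes
label = [1, 0]

combined = concat_channels(features, expand(label, height=2, width=2))
```

Each pixel of combined carries the 3 structural channels followed by the 2 attribute channels, which is the sense in which the attribute label conditions every spatial location.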

Regarding claim 8, claim 7 is incorporated and the combination of Fu and Suzuki, as a whole teaches the method (Fu, Par. [0004]), further comprising:
receiving a boundary input separating an image space into the two regions;
receiving indication of semantic labels to be associated with the two regions; and
generating the semantic layout representing the two regions with the semantic labels (Fu, Par. [0041-46]: generator 240 takes as inputs, a target segmentation 227, a given image 226, and a vector 228 indicating desired attributes of the image to be generated. The generator 220 implemented with the blocks 221-225 is configured to receive the inputs 226, 227, 228 and generate a target image 229 that is based on, i.e., a translated version of, the input image 226 and consistent with the input segmentation 227 and attributes 228… The discriminator 240 is configured to take an image, e.g., the images 246 and/or 249 and produce a discrimination result 242 indicating if an input image is real, i.e., an image that did not get produced by the generator, or fake, i.e., an image created by the generator, and determine attributes 243 of the input image. The discriminator 240 pushes the generated images towards a target domain distribution, and meanwhile, utilizes an auxiliary attribute classifier to enable the SGGAN framework to generate images, such as the images in the row 102a in FIG. 1 with target attributes… the segmentor neural network 260 includes a convolutional block 261, a down-sampling convolutional block 262, residual block 263 (which may be implemented similarly to the residual block 223 described hereinabove), up-sampling convolutional block 264, and convolutional block 265. The segmentor network 260 implemented with the blocks 261-265 is configured to receive an input image 266a and/or 266b and generate a corresponding segmentation 267a and/or 267b, respectively, indicating features of the input images 266a and/or 266b. The segmentor 260 imposes semantic information on the image generation process… optimization tends to teach the generator 220 to impose the spatial constraints indicated in an input segmentation 271 on the translated images, e.g., 274. 
During the training 270, the segmentor 260 provides spatial guidance to the generator 220 to ensure the generated images, e.g., 274, comply with input segmentations, e.g., 271. The discriminator 240 aims to ensure the translated images, e.g., 274, are as realistic as the real images, e.g., 273… In the training procedure 270, the generator 220 is configured to receive the target segmentation 271, desired attributes vector 272, and real image 273 and from the inputs 271-273, generate the image 274. Further, the generator 220 (which is depicted twice in FIG. 2D to show additional processing) is configured to perform a reconstruction process that attempts to reconstruct the input image 273 using a segmentation 275 that is based on the real image 273, attributes 276 of the real image 273, and the generated image 274; Par. [0050-68]: generator 301 is configured to receive three inputs, an input image (source image) 304, a target segmentation 305, and a vector of target attributes 306. A goal of the training process is to configure the generator 301 to translate the input image 304 into a generated image (fake image) 307, which complies with the target segmentation 305 and attribute labels 306… path of generator training is a reconstruction loss path which takes the generated image 307 as an input to the generator 301, as well as two other inputs, a source segmentation 315 (which may be a ground-truth landmark based segmentation) and a source attributes label 316. This path is expected to reconstruct an image 317 from the generated fake image 307 that should match the input source image 304… To train the discriminator 302, the input source image 304 is fed to the discriminator 302 which generates the discrimination result 319 and classification result 320. The discrimination result 319 is used to calculate a real adversarial loss term 321, and the classification result 320 is compared with the real source attributes label 316 to calculate a real classification loss 322. 
The fake adversarial losses 313, real adversarial losses 321, and the real classification loss 322 are summed up and fed to the optimizer 323 to optimize the discriminator 302 … a problem formulation, details of the segmentor network, and an overall objective function of an embodiment are provided… Let x, s, and c be an image of size (HxWx3), with the segmentation map (HxWxns) and attributes vector (1xnc) in the source domain; while y, s' and c' are its corresponding image, segmentation, and attributes in the target domain. The number of segmentation classes is denoted as ns and the number of all the attributes is denoted as nc. Note, that for s and s', each pixel is represented by a one-hot vector of ns classes, while for c and c', they are binary vectors of multiple labels, in the scenario of multi-domain translation. Given this formulation, a goal of an embodiment is to find a mapping… To achieve this… G is formulated as the generator network in the proposed SGGAN model. Meanwhile, such an embodiment employs a discriminator D and a segmentor S to supervise the training of the generator G… In order to guide the generator by the target segmentation information, an additional network is built which takes an image as input and generates the image's corresponding semantic segmentation. 
This network is referred to as the segmentor network S which is trained together with the GAN framework… a processor-generated image, where the processor may be a neural network, and a target segmentation, according to an embodiment, is a set of segments, e.g., sets of pixels or set of contours, that correspond to portions or landmarks… of an image… extract features of the target segmentation, a first concatenation block configured to concatenate the extracted features with a latent vector, an up-sampling block configured to construct a layout of the fake image using the concatenated extracted features and latent vector, a second concatenation block configured to concatenate the layout with an attribute label to generate a multidimensional matrix representing features of the fake image; Par. [0115-134]: generate more realistic results with better image quality (sharper and clearer details) after image translation… embodiments implement image generation with spatial constraints. An example embodiment implements a novel Spatially Constrained Generative Adversarial Network (SCGAN), which decouples the spatial constraints from the latent vector and makes them available as additional control signal inputs… SCGAN includes a generator network, a discriminator network with an auxiliary classifier, and a segmentor network, which are trained together adversarially… the generator is specially designed to take a semantic segmentation, a latent vector, and an attribute label as inputs step by step to synthesize a fake image… the discriminator network is configured to distinguish between real images and generated images as well as classify images into attributes. The discrimination and classification results may guide the generator to synthesize realistic images with correct target attributes. 
The segmentor network… attempts to determine semantic segmentations on both real images and fake images to deliver estimated segmentations to guide the generator in synthesizing spatially constrained images. With those networks, embodiments implementing the SCGAN generate realistic images guided by semantic segmentations and attribute labels… provide image generation that synthesizes realistic images which cannot be distinguished from the real images of a given target dataset. Embodiments employ spatial constraints in generating high-quality images with target-oriented controllability… In an embodiment, (x, c, s) denotes the joint distribution of a target dataset, where x is a real image sample of size (HxWx3) with H and W as the height and width of x, c is its attribute label of size (1xnc) with nc as the number of attributes, and s is its semantic segmentation of size (HxWxns) with ns as the number of segmentation classes. Each pixel in s is represented by a one-hot vector with dimension ns, which codes the semantic index of that pixel… the problem is defined as G (z, c, s)… where G( , , ) is the generating function, z is the latent vector of size (1xnz), c defines the target attributes, s acts as a high-level and pixel-wise spatial constraint, and y is the conditionally generated image which complies with the target c and s… SCGAN 1330 comprises three networks, a generator network G 1340, a discriminator network D 1360 with auxiliary classifier, and a segmentor network S 1380 which are trained together. The generator 1340 is designed such that a semantic segmentation, a latent vector, and an attribute label are input to the generator 1340 step by step to generate a fake image. The discriminator takes either fake or real images as input and outputs a discrimination result and a classification result. 
Similar to the discriminator 1360, the segmentor 1380 takes either a fake or real image as input and outputs a segmentation result which is compared to the ground-truth segmentation to calculate segmentation loss, which guides the generator to synthesize fake images that comply with the input segmentation… to construct the basic structure of the output image and attribute label c 1352 is fed into the generator 1340 through the expand block 1354 to guide the generator 1340 to generate attribute-specific images which share the similar basic image contents generated from s 1350 and z 1351…the expand block 1354 performs an expand operation to an input vector (attribute label 1352) by repeating the vector 1352 to the same width and height as a reference vector (the output of the block 1344). In such an embodiment, the expansion allows the vectors to be concatenated at the block 1345. The attribute label c 1352 is concatenated at block 1345 with the basic structure from the up-sampling blocks 1343 and 1344, to generate a multidimensional matrix representing features of the image being generated… To obtain realistic results which are difficult to distinguish from original data x (e.g., a real image 1356), the discriminator network D 1360 is implemented to form a GAN framework with the generator G 1340; Par. [0172-182]: generator of an embodiment of the SCGAN takes three inputs, semantic segmentation, latent vector, and attribute label. A critical issue is that the contents in the synthesized image should be decoupled well to be controlled by those inputs (semantic segmentation, latent vector, and attribute label)… a generator that functions in a step-by-step way to first, extract spatial information from semantic segmentation to construct the basic spatial structure of the synthesized image. 
Second, such an embodiment of the generator, takes the latent vector to add variations to the other unregulated components, and, in turn, uses the attribute label to render attribute-specific contents. As a result… the generator can successfully decouple the contents of synthesized images into controllable inputs. This approach solves the foreground-background merging problem and generates spatially controllable and attribute-specific images with variations on other unregulated contents… a conditional and target-oriented image generation task using a novel deep learning based adversarial network. In particular, one such example embodiment increases the controllability of image generation by using semantic segmentation as spatial constraints and attribute labels as conditional guidance. Embodiments can control the spatial contents as well as attribute-specific contents and generate diversified images with sharper and more realistic details… target-oriented image generation with spatial constraints… Spatially Constrained, Generative Adversarial Network (SCGAN) that decouples the spatial constraints from a latent vector and makes them available as additional control signal inputs. A SCGAN embodiment includes a generator network, a discriminator network with an auxiliary classifier, and a segmentor network, which are trained together adversarially… the generator is specially designed to take a semantic segmentation, a latent vector, and an attribute label as inputs step by step to synthesize a fake image. The discriminator network tries to distinguish between real images and generated images as well as classify the images into attributes. The discrimination and classification results guide the generator to synthesize realistic images with correct target attributes. The segmentor network attempts to conduct semantic segmentations on both real images and fake images to deliver estimated segmentations to guide the generator in synthesizing spatially constrained images. 
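With those networks, example embodiments have increased controllability of an image synthesis task. Embodiments generate target-oriented realistic images guided by semantic segmentations and attribute labels; receiving a boundary input separating an image space into the two regions, receiving indication of semantic labels to be associated with the two regions, and generating the semantic layout representing the two regions with the semantic labels (e.g. generate target-oriented realistic images (i.e. photorealistic images) guided by semantic segmentations (i.e. based on semantic layout, semantic segmentation mask, etc.) and attribute labels representing features of the image being generated (i.e. determined semantic labels), including target segmentation(s) comprising a set of segments, such as sets of pixels or set of contours (i.e. boundaries, shapes, silhouettes, etc.) that correspond to portions (i.e. regions, areas, etc.) or landmarks of an image being generated (i.e. semantic labels associated with the (first, second, third… Nth) regions of a digital representation of an image), including a segmentor network which is configured to receive an input image and generate a corresponding segmentation indicating features of the input image (i.e. and generating the semantic layout (i.e. semantic segmentation mask) representing the (first, second, third… Nth) regions with the semantic labels), as indicated above), for example).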
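
For illustration only, the relationship between a boundary input, two labeled regions, and the one-hot semantic segmentation of size (HxWxns) quoted from Fu may be sketched as follows. This is a minimal sketch under stated assumptions and reflects neither Fu's nor Applicant's implementation: the helper names layout_from_boundary and one_hot are hypothetical, and the boundary is assumed to be a single vertical column index.

```python
def layout_from_boundary(height, width, boundary_col, left_label, right_label):
    """Label pixels left of a vertical boundary with one class index
    and all remaining pixels with another (hypothetical helper)."""
    return [
        [left_label if col < boundary_col else right_label for col in range(width)]
        for _ in range(height)
    ]


def one_hot(layout, num_classes):
    """Encode each pixel's class index as a one-hot vector of dimension ns."""
    return [
        [[1 if k == label else 0 for k in range(num_classes)] for label in row]
        for row in layout
    ]


# Two regions of a 2 x 4 image space: class 0 left of column 1, class 2 elsewhere
layout = layout_from_boundary(2, 4, boundary_col=1, left_label=0, right_label=2)
mask = one_hot(layout, num_classes=3)
```

Here layout holds one class index per pixel and mask is the corresponding H x W x ns map in which exactly one channel is set per pixel.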

Regarding claim 9, claim 8 is incorporated and the combination of Fu and Suzuki, as a whole teaches the method (Fu, Par. [0004]), selecting content, from a plurality of content options of types of image content associated with the semantic labels, to infer for [the] two regions (Fu, Par. [0123-127]: SCGAN has both spatial and attribute-level controllability, with a segmentor network that guides the generator network with spatial information, and increases the model stability for convergence… to avoid foreground-background mismatch, the generator network is configured to first, extract spatial information from an input segmentation, second, concatenate that latent vector to provide variations, and third, use attribute labels to synthesize attribute-specific contents in the generated image… a SCGAN that takes latent vectors, attribute labels, and semantic segmentations as inputs, and decouples the image generation into three dimensions. As such, embodiments of the SCGAN are capable of generating images with controlled spatial contents and attributes and generate target images; selecting content, from a plurality of content options of types of image content associated with the semantic labels, to infer for [the] two regions (e.g. generate target-oriented realistic images (i.e. photorealistic images) guided by semantic segmentations (i.e. based on semantic layout, semantic segmentation mask, etc.) and attribute labels representing features of the image being generated (i.e. determined semantic labels), including target segmentation(s) comprising a set of segments, such as sets of pixels or set of contours (i.e. boundaries, shapes, silhouettes, etc.) that correspond to portions (i.e. regions, areas, etc.) or landmarks of an image being generated (i.e. semantic labels associated with the (first, second, third… Nth) regions of a digital representation of an image), including the (first, second, third… Nth) regions of a digital representation of an image (i.e. 
to infer the two regions), including an image generator which uses (i.e. selects) attribute (i.e. semantic) labels to render attribute-specific contents (i.e. selecting content, from a plurality of content options of types of image content associated with the 

Regarding claim 10, claim 6 is incorporated and the combination of Fu and Suzuki, as a whole teaches the method (Fu, Par. [0004]), wherein the at least one spatially-adaptive normalization layer is a conditional layer configured to propagate semantic information from the semantic layout throughout other layers of the neural network (Suzuki, Pg. 1, Abstract: CNN-based image editing method that allows the user to change the semantic information of an image over a user-specified region. Our method makes this possible by combining the idea of manifold projection with spatial conditional batch normalization (sCBN), a version of conditional batch normalization with user-specifiable spatial weight maps. With sCBN and manifold projection, our method lets the user perform (1) spatial class translation that changes the class of an object over an arbitrary region of user’s choice, and (2) semantic transplantation that transplants semantic information contained in an arbitrary region of the reference image to an arbitrary region in the target image; Pg. 1, Par. 1-2: deep generative models like generative adversarial networks (GANs) [10] and variational autoencoders (VAEs) [20] make possible the unsupervised learning of rich latent semantic information from images… Image conditional GANs [24, 40, 17] based on encoder-decoder architectures have been popular both for their convenient implementation in end-to-end differentiable ML frameworks, and their uncanny ability to produce photo-realistic images; Pg. 2, Par. 3: CNN-based image editing method that grants the user this very freedom. With our method, the user can transform a user-chosen part of image in a copy-paste fashion–and the user can do this all the while preserving semantic consistency. 
More precisely, we present a method that features two types of image transformation: (1) spatial class-translation that translates a class category of a region of interest, and (2) semantic transplantation that transplants a semantic feature of an user-selected region in an arbitrary image to a region of interest in the target image. To facilitate this editing process, we also propose an efficient optimization method to project images onto the latent space of generator; Pg. 2, Par. 7-8: Class-conditional GAN [29, 26, 43, 2] is a framework designed to learn an invariant latent representation among various classes, and it is capable of generating diverse images from a same latent code z by changing class embedding (Figure 2). The work of [26, 2], in particular, succeeded in producing an impressive results by interpolating the parameters of conditional batch normalization layer, which was first introduced in [31, 5]. Conditional batch normalization (CBN) is mechanism that learns conditional information by separately learning condition-specific scaling parameter and shifting parameter for batch normalization. Our method extends the technique used in [26] by restricting the region of interpolation to a region that corresponds to the region of interest in the pixel space. We will refer to our approach spatial conditional batch normalization (sCBN). Unlike the manipulation done in style transfer [12], we introduce the conditional information at multiple levels in the network, depending on the style preference of the user. As we will show, sCBN in the lower layers transforms global features, and sCBN at upper layers transforms local features… Semantic transformation. In order to grant the user with wide freedom of semantic transformation, there has to be some mechanism to finely adjust the user-suggested transformation so that the final product becomes natural; Pg. 3, Par. 
3-4: Spatial Class-translation With our spatial class translation, the user can change the class of the object in the user-selected region of interest (ROI). The user can change the class of a part of the target objects in intuitive fashion… Spatial Semantic Transplantation With our semantic transplantation, the user can transplant a semantic feature of the user-selected object in the reference image to an object in the target image to be transformed. Our method first prompts the user to specify the region in the target image containing the object of interest, along with the reference image of equal size. The user will be also asked to specify the region in the reference image that contains the semantic information to be transplanted. The method then automatically transplants the semantic information of the specified region of the reference image into the target image; Pg. 4, Par. 2-5 and Pg. 5, Par. 1: spatial class translation Our method functions on a trained conditional generator G, paired with the discriminator D with which G was trained. Upon receiving the region of interest x clipped from the target image and the class c of the target object contained in x, the algorithm begins by looking for a latent variable z such that G(z; c) will be close to x in the feature space of D (Manifold Projection step). The class c can either be specified by the user or by a pre-trained classifier. Suppose that the user wants to partially translate a region R in x to a class c′, and let Vℓ be the set of features in ℓ-th conditional batch normalization(CBN) layers that correspond to R in the pixel space. Our method then simply substitutes the parameters governing the shift and mean parameters of Vℓ with those of c′ (Figure 6). This will result in a modification of G… in which the CBN parameters of Vℓ exclusively carry the style information of the class c′. 
A transformed image can be constructed by applying this modified G… to z… our spatial editing method is applicable to any generative model (e.g., GAN, VAE) that is equipped with a mechanism to iteratively incorporate class information during its image generation process. We will next elaborate on the design of our sCBN and spatial semantic implantation, along with the other details omitted in the brief description above… Spatial conditional batch normalization (sCBN) is the core of our spatial class translation. As can be inferred from our naming, sCBN is based on batch normalization (BN) [16], a technique developed for the purpose of reducing the internal covariance shift to accelerate the training of neural network. More precisely, we will borrow our idea from conditional batch normalization (CBN) [8, 5], a variant of BN that incorporates the class specific semantic information in the parameters for BN. Given a set of batches sampled each from a single class, the conditional batch normalization [8, 5] works by modulating the set of intermediate features produced from each batch of inputs so that it follow a normal distribution with mean and variance that are specific to the corresponding class. Let us fix the layer ℓ, and let Fk,h,w represent the feature of ℓ-th layer at channel k, height location h, and width location w. Given a batch {Fi,k,h,w} of Fk,h,w s generated from class c, the CBN at layer ℓ then transforms Fi,k,h,w… In our implementation, we replaced CBN at each layer with sCBN; Pg. 6, Par. 3: generator used in our study is a ResNet-based generator trained as part of a conditional DCGAN. Each residual block in our generator contains the conditional batch normalization (CBN) layer. At the time of inference, these CBN layers are replaced by the aforementioned sCBN layers that are tailored to the user’s preference. We base our architectures on those used in previous work [25, 26], and used the pre-trained model from [26]; Pg. 8, Par. 
6: image transformation method that allows the user to translate the class of an object and transplant semantic features over a user-specified pixel region of the image. Indeed, there is still much room left for the exploration of the semantic information contained in the intermediate feature spaces of CNNs. We were, however, able to show that we can manipulate this information in a somewhat intuitive manner and produce customized photorealistic images; Pg. 11, Par. 9: we conducted a set of automatic spatial class translations. For each one the selected images, we (1) used a pre-trained model to extract the region of the object to be transformed (dog/cat), (2) conducted the manifold projection to obtain the z, (3) passed z to the generator with the class map corresponding to the segmented region, and (4) conducted a post-processing over the segmented region. For the semantic segmentation, we used a TensorFlow implementation of DeepLab v3 Xception model trained on MS COCO dataset; wherein the at least one spatially-adaptive normalization layer is a conditional layer configured to propagate semantic information from the semantic layout throughout other layers of the neural network (e.g. image generative model which uses a neural network, such as a Generative Adversarial Network (GAN), that is equipped with a mechanism to iteratively incorporate class (i.e. feature, attribute, label, etc.) information during its image generation, including semantic features (i.e. class, attribute, label, etc.) of selected object regions (i.e. propagate semantic information), which are segmented/extracted in a reference (i.e. source, input, etc.) image, corresponding to object(s) in a target image to be transformed (i.e. an image to be generated), based on the set of features Vℓ in ℓ-th (first, second, third… Nth) conditional (i.e. adaptive, instance, etc.) batch normalization (CBN) layers (i.e. spatially-adaptive normalization layer is a conditional layer configured to propagate (i.e. transfer, pass, etc.) semantic information from the semantic layout throughout other layers of the neural network) that correspond to a region R in the pixel space, including a number of spatial conditional batch normalization (sCBN) layers, as indicated above), for example).
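
For illustration only, the spatial conditional batch normalization (sCBN) concept quoted from Suzuki (class-specific scale and shift parameters mixed per location by spatial weight maps) may be sketched as follows. This is a minimal sketch under stated assumptions and is not Suzuki's implementation: the function name spatial_cbn is hypothetical, a single-channel H x W feature map is assumed, and normalization statistics are taken over that one map rather than over a batch.

```python
import math


def spatial_cbn(feature, weight_maps, gammas, betas, eps=1e-5):
    """Normalize a single-channel H x W feature map, then apply a per-pixel
    scale and shift blended from class-specific parameters.

    weight_maps: class name -> H x W weights (assumed to sum to 1 per pixel)
    gammas, betas: class name -> scalar scale / shift for that class
    """
    values = [v for row in feature for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    inv_std = 1.0 / math.sqrt(var + eps)

    out = []
    for h, row in enumerate(feature):
        out_row = []
        for w, v in enumerate(row):
            # Blend class parameters according to the spatial weight maps
            gamma = sum(weight_maps[c][h][w] * gammas[c] for c in gammas)
            beta = sum(weight_maps[c][h][w] * betas[c] for c in gammas)
            out_row.append(gamma * (v - mean) * inv_std + beta)
        out.append(out_row)
    return out


# Left pixel fully assigned to class "cat", right pixel to class "dog"
feature = [[0.0, 2.0]]
weight_maps = {"cat": [[1.0, 0.0]], "dog": [[0.0, 1.0]]}
result = spatial_cbn(
    feature,
    weight_maps,
    gammas={"cat": 1.0, "dog": 2.0},
    betas={"cat": 0.0, "dog": 10.0},
)
```

With a weight map that is 1 inside a region R and 0 elsewhere, only the features corresponding to R receive the parameters of the substituted class, which corresponds to the region-restricted substitution described in the quoted passage.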

Regarding claim 11, claim 10 is incorporated and the combination of Fu and Suzuki, as a whole teaches the method (Fu, Par. [0004]), further comprising:
modulating, by the spatially-adaptive normalization layer, a set of activations through a spatially-adaptive transformation in order to propagate the semantic information throughout the other layers of the neural network (Suzuki, Pg. 1, Abstract: CNN-based image editing method that allows the user to change the semantic information of an image over a user-specified region. Our method makes this possible by combining the idea of manifold projection with spatial conditional batch normalization (sCBN), a version of conditional batch normalization with user-specifiable spatial weight maps. With sCBN and manifold projection, our method lets the user perform (1) spatial class translation that changes the class of an object over an arbitrary region of user’s choice, and (2) semantic transplantation that transplants semantic information contained in an arbitrary region of the reference image to an arbitrary region in the target image; Pg. 1, Par. 1-2: deep generative models like generative adversarial networks (GANs) [10] and variational autoencoders (VAEs) [20] make possible the unsupervised learning of rich latent semantic information from images… Image conditional GANs [24, 40, 17] based on encoder-decoder architectures have been popular both for their convenient implementation in end-to-end differentiable ML frameworks, and their uncanny ability to produce photo-realistic images; Pg. 2, Par. 3: CNN-based image editing method that grants the user this very freedom. With our method, the user can transform a user-chosen part of image in a copy-paste fashion–and the user can do this all the while preserving semantic consistency. More precisely, we present a method that features two types of image transformation: (1) spatial class-translation that translates a class category of a region of interest, and (2) semantic transplantation that transplants a semantic feature of an user-selected region in an arbitrary image to a region of interest in the target image. 
To facilitate this editing process, we also propose an efficient optimization method to project images onto the latent space of generator; Pg. 2, Par. 7-8: Class-conditional GAN [29, 26, 43, 2] is a framework designed to learn an invariant latent representation among various classes, and it is capable of generating diverse images from a same latent code z by changing class embedding (Figure 2). The work of [26, 2], in particular, succeeded in producing an impressive results by interpolating the parameters of conditional batch normalization layer, which was first introduced in [31, 5]. Conditional batch normalization (CBN) is mechanism that learns conditional information by separately learning condition-specific scaling parameter and shifting parameter for batch normalization. Our method extends the technique used in [26] by restricting the region of interpolation to a region that corresponds to the region of interest in the pixel space. We will refer to our approach spatial conditional batch normalization (sCBN). Unlike the manipulation done in style transfer [12], we introduce the conditional information at multiple levels in the network, depending on the style preference of the user. As we will show, sCBN in the lower layers transforms global features, and sCBN at upper layers transforms local features… Semantic transformation. In order to grant the user with wide freedom of semantic transformation, there has to be some mechanism to finely adjust the user-suggested transformation so that the final product becomes natural; Pg. 3, Par. 3-4: Spatial Class-translation With our spatial class translation, the user can change the class of the object in the user-selected region of interest (ROI). 
The user can change the class of a part of the target objects in intuitive fashion… Spatial Semantic Transplantation With our semantic transplantation, the user can transplant a semantic feature of the user-selected object in the reference image to an object in the target image to be transformed. Our method first prompts the user to specify the region in the target image containing the object of interest, along with the reference image of equal size. The user will be also asked to specify the region in the reference image that contains the semantic information to be transplanted. The method then automatically transplants the semantic information of the specified region of the reference image into the target image; Pg. 4, Par. 2-5 and Pg. 5, Par. 1: spatial class translation Our method functions on a trained conditional generator G, paired with the discriminator D with which G was trained. Upon receiving the region of interest x clipped from the target image and the class c of the target object contained in x, the algorithm begins by looking for a latent variable z such that G(z; c) will be close to x in the feature space of D (Manifold Projection step). The class c can either be specified by the user or by a pre-trained classifier. Suppose that the user wants to partially translate a region R in x to a class c′, and let Vℓ be the set of features in ℓ-th conditional batch normalization (CBN) layers that correspond to R in the pixel space. Our method then simply substitutes the parameters governing the shift and mean parameters of Vℓ with those of c′ (Figure 6). This will result in a modification of G… in which the CBN parameters of Vℓ exclusively carry the style information of the class c′. A transformed image can be constructed by applying this modified G… to z… our spatial editing method is applicable to any generative model (e.g., GAN, VAE) that is equipped with a mechanism to iteratively incorporate class information during its image generation process. 
We will next elaborate on the design of our sCBN and spatial semantic implantation, along with the other details omitted in the brief description above… Spatial conditional batch normalization (sCBN) is the core of our spatial class translation. As can be inferred from our naming, sCBN is based on batch normalization (BN) [16], a technique developed for the purpose of reducing the internal covariance shift to accelerate the training of neural network. More precisely, we will borrow our idea from conditional batch normalization (CBN) [8, 5], a variant of BN that incorporates the class specific semantic information in the parameters for BN. Given a set of batches sampled each from a single class, the conditional batch normalization [8, 5] works by modulating the set of intermediate features produced from each batch of inputs so that it follow a normal distribution with mean and variance that are specific to the corresponding class. Let us fix the layer ℓ, and let Fk,h,w represent the feature of ℓ-th layer at channel k, height location h, and width location w. Given a batch {Fi,k,h,w} of Fk,h,w s generated from class c, the CBN at layer ℓ then transforms Fi,k,h,w… In our implementation, we replaced CBN at each layer with sCBN; Pg. 6, Par. 3: generator used in our study is a ResNet-based generator trained as part of a conditional DCGAN. Each residual block in our generator contains the conditional batch normalization (CBN) layer. At the time of inference, these CBN layers are replaced by the aforementioned sCBN layers that are tailored to the user’s preference. We base our architectures on those used in previous work [25, 26], and used the pre-trained model from [26]; Pg. 8, Par. 6: image transformation method that allows the user to translate the class of an object and transplant semantic features over a user-specified pixel region of the image. 
Indeed, there is still much room left for the exploration of the semantic information contained in the intermediate feature spaces of CNNs. We were, however, able to show that we can manipulate this information in a somewhat intuitive manner and produce customized photorealistic images; Pg. 11, Par. 9: we conducted a set of automatic spatial class translations. For each one the selected images, we (1) used a pre-trained model to extract the region of the object to be transformed (dog/cat), (2) conducted the manifold projection to obtain the z, (3) passed z to the generator with the class map corresponding to the segmented region, and (4) conducted a post-processing over the segmented region. For the semantic segmentation, we used a TensorFlow implementation of DeepLab v3 Xception model trained on MS COCO dataset; modulating, by the spatially-adaptive normalization layer, a set of activations through a spatially-adaptive transformation in order to propagate the semantic information throughout the other layers of the neural network (e.g. image generative model which uses a neural network, such as a Generative Adversarial Network (GAN), that is equipped with a mechanism to iteratively incorporate class (i.e. feature, attribute, label, etc.) information during its image generation, including semantic features (i.e. class, attribute, label, etc.) of selected object regions (i.e. propagate semantic information), which are segmented/extracted in a reference (i.e. source, input, etc.) image, corresponding to object(s) in a target image to be transformed (i.e. an image to be generated), based on the set of features Vℓ in ℓ-th (first, second, third… Nth) conditional (i.e. adaptive, instance, etc.) batch normalization (CBN) layers (i.e. spatially-adaptive normalization layer is a conditional layer configured to propagate (i.e. transfer, pass, etc.) 
semantic information from the semantic layout throughout other layers of the neural network) that correspond to a region R in the pixel space, by incorporating the class specific semantic information in the parameters for batch normalization (BN), and given a set of batches sampled each from a single class, the conditional batch normalization works by modulating the set of intermediate features produced from each batch of inputs so that it follow a normal distribution with mean and variance that are 
The same motivation to combine above-mentioned teachings applies, as previously indicated in claim 1.

Regarding claim 14, claim 6 is incorporated and the combination of Fu and Suzuki, as a whole teaches the method (Fu, Par. [0004]), wherein the neural network is a generative adversarial network (GAN) including a generator and a discriminator (Fu, Par. [0034-45]: embodiments employ a segmentor implemented with a neural network that is designed to impose semantic information on the generated images… GAN has the potential to provide realistic image generation… Segmentation Guided Generative Adversarial Network (SGGAN), which fully leverages semantic segmentation information to guide the image generation (e.g., translation) process… the image semantic segmentation can be obtained through a variety of methodologies, such as human annotations or any variety of existing segmentation methods… explicitly guide the generator with pixel-level semantic segmentations and, thus, further boost the quality of generated images. Further, the target segmentation employed in embodiments works as a strong prior, i.e., provides knowledge that stems from previous experience, for the image generator, which is able to use this prior knowledge to edit the spatial content… the segmentor neural network 260 includes a convolutional block 261… The segmentor network 260 implemented with the blocks 261-265 is configured to receive an input image 266a and/or 266b and generate a corresponding segmentation 267a and/or 267b, respectively, indicating features of the input images 266a and/or 266b. The segmentor 260 imposes semantic information on the image generation process … During training, estimated segmentations from the segmentor 260 are compared with their ground-truth values, which provides gradient information to optimize the generator 220. This optimization tends to teach the generator 220 to impose the spatial constraints indicated in an input segmentation 271 on the translated images, e.g., 274. 
During the training 270, the segmentor 260 provides spatial guidance to the generator 220 to ensure the generated images, e.g., 274, comply with input segmentations, e.g., 271. The discriminator 240 aims to ensure the translated images, e.g., 274, are as realistic as the real images, e.g., 273… the segmentor 260 receives a target segmentation 271 and a generated image 274 produced by the generator 220. Then, based upon a segmentation loss, i.e., the difference between a segmentation determined from the generated image 274 and the target segmentation 271, the segmentor 260 is adjusted, e.g., weights in a neural network implementing the segmentor 260 are modified so the segmentor 260 produces segmentations that are closer to the target segmentation 271. The generator 240 is likewise adjusted based upon the segmentation loss to generate images that are closer to the target segmentation 271; Par. [0050-68]: generator 301 is configured to receive three inputs, an input image (source image) 304, a target segmentation 305, and a vector of target attributes 306. A goal of the training process is to configure the generator 301 to translate the input image 304 into a generated image (fake image) 307, which complies with the target segmentation 305 and attribute labels 306… path of generator training is a reconstruction loss path which takes the generated image 307 as an input to the generator 301, as well as two other inputs, a source segmentation 315 (which may be a ground-truth landmark based segmentation) and a source attributes label 316. This path is expected to reconstruct an image 317 from the generated fake image 307 that should match the input source image 304… To train the discriminator 302, the input source image 304 is fed to the discriminator 302 which generates the discrimination result 319 and classification result 320. 
The discrimination result 319 is used to calculate a real adversarial loss term 321, and the classification result 320 is compared with the real source attributes label 316 to calculate a real classification loss 322. The fake adversarial losses 313, real adversarial losses 321, and the real classification loss 322 are summed up and fed to the optimizer 323 to optimize the discriminator 302 … a problem formulation, details of the segmentor network, and an overall objective function of an embodiment are provided… Let x, s, and c be an image of size (HxWx3), with the segmentation map (HxW ns) and attributes vector (1xnc) in the source domain; while y, s' and c' are its corresponding image, segmentation, and attributes in the target domain. The number of segmentation classes is denoted as ns as classes and the number of all the attributes is denoted as nc. Note, that for s and s', each pixel is represented by a one-hot vector of ns classes, while for c and c', they are binary vectors of multiple labels, in the scenario of multi-domain translation. Given this formulation, a goal of an embodiment is to find a mapping… To achieve this… G is formulated as the generator network in the proposed SGGAN model. Meanwhile, such an embodiment employs a discriminator D and a segmentor S to supervise the training of the generator G… In order to guide the generator by the target segmentation information, an additional network is built which takes an image as input and generates the image's corresponding semantic segmentation. 
This network is referred to as the segmentor network S which is trained together with the GAN framework… a processor-generated image, where the processor may be a neural network, and a target segmentation, according to an embodiment, is a set of segments, e.g., sets of pixels or set of contours, that correspond to portions or landmarks… of an image… extract features of the target segmentation, a first concatenation block configured to concatenate the extracted features with a latent vector, an up-sampling block configured to construct a layout of the fake image using the concatenated extracted features and latent vector, a second concatenation block configured to concatenate the layout with an attribute label to generate a multidimensional matrix representing features of the fake image; Par. [0115-134]: generate more realistic results with better image quality (sharper and clearer details) after image translation… embodiments implement image generation with spatial constraints. An example embodiment implements a novel Spatially Constrained Generative Adversarial Network (SCGAN), which decouples the spatial constraints from the latent vector and makes them available as additional control signal inputs… SCGAN includes a generator network, a discriminator network with an auxiliary classifier, and a segmentor network, which are trained together adversarially… the generator is specially designed to take a semantic segmentation, a latent vector, and an attribute label as inputs step by step to synthesize a fake image… the discriminator network is configured to distinguish between real images and generated images as well as classify images into attributes. The discrimination and classification results may guide the generator to synthesize realistic images with correct target attributes. 
The segmentor network… attempts to determine semantic segmentations on both real images and fake images to deliver estimated segmentations to guide the generator in synthesizing spatially constrained images. With those networks, embodiments implementing the SCGAN generate realistic images guided by semantic segmentations and attribute labels… provide image generation that synthesizes realistic images which cannot be distinguished from the real images of a given target dataset. Embodiments employ spatial constraints in generating high-quality images with target-oriented controllability… In an embodiment, (x, c, s) denotes the joint distribution of a target dataset, where x is a real image sample of size (HxWx3) with H and W as the height and width of x, c is its attribute label of size (1xnc) with nc as the number of attributes, and s is its semantic segmentation of size (HxWxns) with ns as the number of segmentation classes. Each pixel in s is represented by a one-hot vector with dimension ns, which codes the semantic index of that pixel… the problem is defined as G (z, c, s)… where G( , , ) is the generating function, z is the latent vector of size (1xnz), c defines the target attributes, s acts as a high-level and pixel-wise spatial constraint, and y is the conditionally generated image which complies with the target c and s… SCGAN 1330 comprises three networks, a generator network G 1340, a discriminator network D 1360 with auxiliary classifier, and a segmentor network S 1380 which are trained together. The generator 1340 is designed such that a semantic segmentation, a latent vector, and an attribute label are input to the generator 1340 step by step to generate a fake image. The discriminator takes either fake or real images as input and outputs a discrimination result and a classification result. 
Similar to the discriminator 1360, the segmentor 1380 takes either a fake or real image as input and outputs a segmentation result which is compared to the ground-truth segmentation to calculate segmentation loss, which guides the generator to synthesize fake images that comply with the input segmentation… to construct the basic structure of the output image and attribute label c 1352 is fed into the generator 1340 through the expand block 1354 to guide the generator 1340 to generate attribute-specific images which share the similar basic image contents generated from s 1350 and z 1351…the expand block 1354 performs an expand operation to an input vector (attribute label 1352) by repeating the vector 1352 to the same width and height as a reference vector (the output of the block 1344). In such an embodiment, the expansion allows the vectors to be concatenated at the block 1345. The attribute label c 1352 is concatenated at block 1345 with the basic structure from the up-sampling blocks 1343 and 1344, to generate a multidimensional matrix representing features of the image being generated… To obtain realistic results which are difficult to distinguish from original data x (e.g., a real image 1356), the discriminator network D 1360 is implemented to form a GAN framework with the generator G 1340; Par. [0172-182]: generator of an embodiment of the SCGAN takes three inputs, semantic segmentation, latent vector, and attribute label. A critical issue is that the contents in the synthesized image should be decoupled well to be controlled by those inputs (semantic segmentation, latent vector, and attribute label)… a generator that functions in a step-by-step way to first, extract spatial information from semantic segmentation to construct the basic spatial structure of the synthesized image. 
Second, such an embodiment of the generator, takes the latent vector to add variations to the other unregulated components, and, in turn, uses the attribute label to render attribute-specific contents. As a result… the generator can successfully decouple the contents of synthesized images into controllable inputs. This approach solves the foreground-background merging problem and generates spatially controllable and attribute-specific images with variations on other unregulated contents… a conditional and target-oriented image generation task using a novel deep learning based adversarial network. In particular, one such example embodiment increases the controllability of image generation by using semantic segmentation as spatial constraints and attribute labels as conditional guidance. Embodiments can control the spatial contents as well as attribute-specific contents and generate diversified images with sharper and more realistic details… target-oriented image generation with spatial constraints… Spatially Constrained, Generative Adversarial Network (SCGAN) that decouples the spatial constraints from a latent vector and makes them available as additional control signal inputs. A SCGAN embodiment includes a generator network, a discriminator network with an auxiliary classifier, and a segmentor network, which are trained together adversarially… the generator is specially designed to take a semantic segmentation, a latent vector, and an attribute label as inputs step by step to synthesize a fake image. The discriminator network tries to distinguish between real images and generated images as well as classify the images into attributes. The discrimination and classification results guide the generator to synthesize realistic images with correct target attributes. The segmentor network attempts to conduct semantic segmentations on both real images and fake images to deliver estimated segmentations to guide the generator in synthesizing spatially constrained images. 
With those networks, example embodiments have increased controllability of an image synthesis task. Embodiment generate target-oriented realistic images guided by semantic segmentations and attribute labels; wherein the neural network is a generative adversarial network (GAN) including a generator and a discriminator (e.g. generate (i.e. produce, infer, construct, etc.) target-oriented realistic images (i.e. infer substantially photorealistic images) guided by semantic segmentations (i.e. based on semantic layout, semantic segmentation mask, etc.) and attribute labels representing features of the image being generated, including target segmentation(s) comprising a set of segments, such as sets of pixels or set of contours (i.e. boundaries, shapes, silhouettes, etc.) that correspond to portions (i.e. regions, areas, etc.) or landmarks of an image being generated (i.e. received semantic layout/segmentation mask indicating a plurality (first, second, third… Nth) of regions of a digital representation of an image), by using a neural network, such as a Spatially Constrained Generative Adversarial Network (SCGAN) (i.e. the neural network is a generative 

Regarding claim 15, Fu discloses a system, comprising: 
at least one processor; and
memory including instructions that, when executed by the at least one processor, cause the system to (Abstract: methods and systems for image generation through use of adversarial networks; Par. [0004]: a system for training an image generator… the system comprises a processor and a memory with computer code instructions stored thereon, wherein the processor and the memory, with the computer code instructions, are configured to cause the system to provide a generator, discriminator, and segmentor).
The steps of the program further recited in claim 15 correspond to claim 6 when executed and are rejected as applied to method claim 6 above.

Regarding claim 16, claim 15 is incorporated, and claim 16 is a corresponding apparatus claim rejected as applied to method claim 7 above.

Regarding claim 17, claim 15 is incorporated and the combination of Fu and Suzuki, as a whole teaches the system (Fu, Par. [0004]), wherein the instructions when executed further cause the system to:

generate the semantic layout representing the two regions with the semantic labels (Fu, Par. [0041-46]: generator 240 takes as inputs, a target segmentation 227, a given image 226, and a vector 228 indicating desired attributes of the image to be generated. The generator 220 implemented with the blocks 221-225 is configured to receive the inputs 226, 227, 228 and generate a target image 229 that is based on, i.e., a translated version of, the input image 226 and consistent with the input segmentation 227 and attributes 228… The discriminator 240 is configured to take an image, e.g., the images 246 and/or 249 and produce a discrimination result 242 indicating if an input image is real, i.e., an image that did not get produced by the generator, or fake, i.e., an image created by the generator, and determine attributes 243 of the input image. The discriminator 240 pushes the generated images towards a target domain distribution, and meanwhile, utilizes an auxiliary attribute classifier to enable the SGGAN framework to generate images, such as the images in the row 102a in FIG. 1 with target attributes… the segmentor neural network 260 includes a convolutional block 261, a down-sampling convolutional block 262, residual block 263 (which may be implemented similarly to the residual block 223 described hereinabove), up-sampling convolutional block 264, and convolutional block 265. The segmentor network 260 implemented with the blocks 261-265 is configured to receive an input image 266a and/or 266b and generate a corresponding segmentation 267a and/or 267b, respectively, indicating features of the input images 266a and/or 266b. The segmentor 260 imposes semantic information on the image generation process… optimization tends to teach the generator 220 to impose the spatial constraints indicated in an input segmentation 271 on the translated images, e.g., 274. 
During the training 270, the segmentor 260 provides spatial guidance to the generator 220 to ensure the generated images, e.g., 274, comply with input segmentations, e.g., 271. The discriminator 240 aims to ensure the translated images, e.g., 274, are as realistic as the real images, e.g., 273… In the training procedure 270, the generator 220 is configured to receive the target segmentation 271, desired attributes vector 272, and real image 273 and from the inputs 271-273, generate the image 274. Further, the generator 220 (which is depicted twice in FIG. 2D to show additional processing) is configured to perform a reconstruction process that attempts to reconstruct the input image 273 using a segmentation 275 that is based on the real image 273, attributes 276 of the real image 273, and the generated image 274; Par. [0050-68]: generator 301 is configured to receive three inputs, an input image (source image) 304, a target segmentation 305, and a vector of target attributes 306. A goal of the training process is to configure the generator 301 to translate the input image 304 into a generated image (fake image) 307, which complies with the target segmentation 305 and attribute labels 306… path of generator training is a reconstruction loss path which takes the generated image 307 as an input to the generator 301, as well as two other inputs, a source segmentation 315 (which may be a ground-truth landmark based segmentation) and a source attributes label 316. This path is expected to reconstruct an image 317 from the generated fake image 307 that should match the input source image 304… To train the discriminator 302, the input source image 304 is fed to the discriminator 302 which generates the discrimination result 319 and classification result 320. The discrimination result 319 is used to calculate a real adversarial loss term 321, and the classification result 320 is compared with the real source attributes label 316 to calculate a real classification loss 322. 
The fake adversarial losses 313, real adversarial losses 321, and the real classification loss 322 are summed up and fed to the optimizer 323 to optimize the discriminator 302 … a problem formulation, details of the segmentor network, and an overall objective function of an embodiment are provided… Let x, s, and c be an image of size (HxWx3), with the segmentation map (HxW ns) and attributes vector (1xnc) in the source domain; while y, s' and c' are its corresponding image, segmentation, and attributes in the target domain. The number of segmentation classes is denoted as ns as classes and the number of all the attributes is denoted as nc. Note, that for s and s', each pixel is represented by a one-hot vector of ns classes, while for c and c', they are binary vectors of multiple labels, in the scenario of multi-domain translation. Given this formulation, a goal of an embodiment is to find a mapping… To achieve this… G is formulated as the generator network in the proposed SGGAN model. Meanwhile, such an embodiment employs a discriminator D and a segmentor S to supervise the training of the generator G… In order to guide the generator by the target segmentation information, an additional network is built which takes an image as input and generates the image's corresponding semantic segmentation. 
This network is referred to as the segmentor network S which is trained together with the GAN framework… a processor-generated image, where the processor may be a neural network, and a target segmentation, according to an embodiment, is a set of segments, e.g., sets of pixels or set of contours, that correspond to portions or landmarks… of an image… extract features of the target segmentation, a first concatenation block configured to concatenate the extracted features with a latent vector, an up-sampling block configured to construct a layout of the fake image using the concatenated extracted features and latent vector, a second concatenation block configured to concatenate the layout with an attribute label to generate a multidimensional matrix representing features of the fake image; Par. [0115-134]: generate more realistic results with better image quality (sharper and clearer details) after image translation… embodiments implement image generation with spatial constraints. An example embodiment implements a novel Spatially Constrained Generative Adversarial Network (SCGAN), which decouples the spatial constraints from the latent vector and makes them available as additional control signal inputs… SCGAN includes a generator network, a discriminator network with an auxiliary classifier, and a segmentor network, which are trained together adversarially… the generator is specially designed to take a semantic segmentation, a latent vector, and an attribute label as inputs step by step to synthesize a fake image… the discriminator network is configured to distinguish between real images and generated images as well as classify images into attributes. The discrimination and classification results may guide the generator to synthesize realistic images with correct target attributes. 
The segmentor network… attempts to determine semantic segmentations on both real images and fake images to deliver estimated segmentations to guide the generator in synthesizing spatially constrained images. With those networks, embodiments implementing the SCGAN generate realistic images guided by semantic segmentations and attribute labels… provide image generation that synthesizes realistic images which cannot be distinguished from the real images of a given target dataset. Embodiments employ spatial constraints in generating high-quality images with target-oriented controllability… In an embodiment, (x, c, s) denotes the joint distribution of a target dataset, where x is a real image sample of size (HxWx3) with H and W as the height and width of x, c is its attribute label of size (1xnc) with nc as the number of attributes, and s is its semantic segmentation of size (HxWxns) with ns as the number of segmentation classes. Each pixel in s is represented by a one-hot vector with dimension ns, which codes the semantic index of that pixel… the problem is defined as G (z, c, s)… where G( , , ) is the generating function, z is the latent vector of size (1xnz), c defines the target attributes, s acts as a high-level and pixel-wise spatial constraint, and y is the conditionally generated image which complies with the target c and s… SCGAN 1330 comprises three networks, a generator network G 1340, a discriminator network D 1360 with auxiliary classifier, and a segmentor network S 1380 which are trained together. The generator 1340 is designed such that a semantic segmentation, a latent vector, and an attribute label are input to the generator 1340 step by step to generate a fake image. The discriminator takes either fake or real images as input and outputs a discrimination result and a classification result. 
Similar to the discriminator 1360, the segmentor 1380 takes either a fake or real image as input and outputs a segmentation result which is compared to the ground-truth segmentation to calculate segmentation loss, which guides the generator to synthesize fake images that comply with the input segmentation… to construct the basic structure of the output image and attribute label c 1352 is fed into the generator 1340 through the expand block 1354 to guide the generator 1340 to generate attribute-specific images which share the similar basic image contents generated from s 1350 and z 1351…the expand block 1354 performs an expand operation to an input vector (attribute label 1352) by repeating the vector 1352 to the same width and height as a reference vector (the output of the block 1344). In such an embodiment, the expansion allows the vectors to be concatenated at the block 1345. The attribute label c 1352 is concatenated at block 1345 with the basic structure from the up-sampling blocks 1343 and 1344, to generate a multidimensional matrix representing features of the image being generated… To obtain realistic results which are difficult to distinguish from original data x (e.g., a real image 1356), the discriminator network D 1360 is implemented to form a GAN framework with the generator G 1340; Par. [0172-182]: generator of an embodiment of the SCGAN takes three inputs, semantic segmentation, latent vector, and attribute label. A critical issue is that the contents in the synthesized image should be decoupled well to be controlled by those inputs (semantic segmentation, latent vector, and attribute label)… a generator that functions in a step-by-step way to first, extract spatial information from semantic segmentation to construct the basic spatial structure of the synthesized image. 
Second, such an embodiment of the generator, takes the latent vector to add variations to the other unregulated components, and, in turn, uses the attribute label to render attribute-specific contents. As a result… the generator can successfully decouple the contents of synthesized images into controllable inputs. This approach solves the foreground-background merging problem and generates spatially controllable and attribute-specific images with variations on other unregulated contents… a conditional and target-oriented image generation task using a novel deep learning based adversarial network. In particular, one such example embodiment increases the controllability of image generation by using semantic segmentation as spatial constraints and attribute labels as conditional guidance. Embodiments can control the spatial contents as well as attribute-specific contents and generate diversified images with sharper and more realistic details… target-oriented image generation with spatial constraints… Spatially Constrained, Generative Adversarial Network (SCGAN) that decouples the spatial constraints from a latent vector and makes them available as additional control signal inputs. A SCGAN embodiment includes a generator network, a discriminator network with an auxiliary classifier, and a segmentor network, which are trained together adversarially… the generator is specially designed to take a semantic segmentation, a latent vector, and an attribute label as inputs step by step to synthesize a fake image. The discriminator network tries to distinguish between real images and generated images as well as classify the images into attributes. The discrimination and classification results guide the generator to synthesize realistic images with correct target attributes. The segmentor network attempts to conduct semantic segmentations on both real images and fake images to deliver estimated segmentations to guide the generator in synthesizing spatially constrained images. 
With those networks, example embodiments have increased controllability of an image synthesis task. Embodiment generate target-oriented realistic images guided by semantic segmentations and attribute labels; receiving a boundary input separating an image space into the two regions, receiving indication of semantic labels to be associated with the two regions, and generating the semantic layout representing the two regions with the semantic labels (e.g. generate target-oriented realistic images (i.e. photorealistic images) guided by using semantic segmentations (i.e. based on semantic layout, semantic segmentation mask, etc.) and attribute labels representing features of the image being generated (i.e. indication of semantic labels to be associated with image space regions), including a generator which is configured to receive inputs (i.e. sources, indications, selections, etc.) and generate a target image that is based on a translated version of the input image and consistent with the input segmentation and corresponding attributes (i.e. semantic labels associated with image space regions), including target segmentation(s) comprising a set of segments, such as sets of pixels or set of contours (i.e. boundaries, shapes, silhouettes, etc.) that correspond to portions (i.e. regions, areas, etc.) or landmarks of an image being generated (i.e. a boundary input separating an image space into the (first, second, third… Nth) regions of a digital representation of an image), including a segmentor network which is configured to receive an input image and generate a corresponding segmentation indicating features of the input image (i.e. generating the semantic layout representing the (first, second, third… Nth) regions with the semantic labels), as indicated above), for example). 

Regarding claim 18, claim 15 is incorporated, and claim 18 is a corresponding apparatus claim rejected as applied to method claim 10 above.

Regarding claim 19, claim 18 is incorporated and is a corresponding apparatus claim rejected as applied to the method claim 11 above.

Claims 3, 12-13, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Fu in view of Suzuki, as applied to claims 1 and 6 above, in further view of GAO et al. (PG Pub. No. 2019/0114511 A1), hereafter referred to as GAO, and in further view of SU et al. (PG Pub. No. 2021/0150812 A1), hereafter referred to as SU.

Regarding claim 3, claim 1 is incorporated and the combination of Fu and Suzuki, as a whole, teaches the method (Fu, Par. [0004]), but fails to teach the following as further recited in claim 3.
However, GAO teaches wherein the spatially-adaptive normalization layer is a conditional normalization layer, and further comprising:
normalizing, by the spatially-adaptive normalization layer, layer activations to zero mean (Par. [0104-120]: convolutional neural network is a special type of neural network. The fundamental difference between a densely connected layer and a convolution layer is this: Dense layers learn global patterns in their input feature space, whereas convolution layers learn local patterns: in the case of images, patterns found in small 2D windows of the inputs. This key characteristic gives convolutional neural networks two interesting properties: (1) the patterns they learn are translation invariant and (2) they can learn spatial hierarchies of patterns… convolutional neural network learns highly non-linear mappings by interconnecting layers of artificial neurons arranged in many different layers with activation functions that make the layers dependent. It includes one or more convolutional layers, interspersed with one or more sub-sampling layers and non-linear layers, which are typically followed by one or more fully connected layers. Each element of the convolutional neural network receives inputs from a set of features in the previous layer… Convolutions operate over 3D tensors, called feature maps, with two spatial axes (height and width) as well as a depth axis (also called the channels axis). For an RGB image, the dimension of the depth axis is 3, because the image has three color channels: red, green, and blue. For a black-and-white picture, the depth is 1 (levels of gray). The convolution operation extracts patches from its input feature map and applies the same transformation to all of these patches, producing an output feature map. This output feature map is still a 3D tensor: it has a width and a height … The algorithm includes computing the activation of all neurons in the network, yielding an output for the forward pass; Par. [0130]: convolution layers of the convolutional neural network serve as feature extractors. 
Convolution layers act as adaptive feature extractors capable of learning and decomposing the input data into hierarchical features; Par. [0164-171]: Batch normalization is a method for accelerating deep network training by making data standardization an integral part of the network architecture. Batch normalization can adaptively normalize data even as the mean and variance change over time during training. It works by internally maintaining an exponential moving average of the batch-wise mean and variance of the data seen during training… Batch normalization can be seen as yet another layer that can be inserted into the model architecture, just like the fully connected or convolutional layer. The BatchNormalization layer is typically used after a convolutional or densely connected layer. It can also be used before a convolutional or densely connected layer… Batch normalization provides a definition for feed-forwarding the input and computing the gradients with respect to the parameters and its own input via a backward pass. In practice, batch normalization layers are inserted after a convolutional or fully connected layer, but before the outputs are fed into an activation function. For convolutional layers, the different elements of the same feature map--i.e. the activations--at different locations are normalized in the same way in order to obey the convolutional property. Thus, all activations in a mini-batch are normalized over all locations, rather than per activation… The internal covariate shift is the phenomenon where the distribution of network activations change across layers due to the change in network parameters during training. Ideally, each layer should be transformed into a space where they have the same distribution but the functional relationship stays the same. 
In order to avoid costly calculations of covariance matrices to decorrelate and whiten the data at every layer and step, we normalize the distribution of each input feature in each layer across each mini-batch to have zero mean and a standard deviation of one… the batch normalization procedure is described herein per activation; Par. [0185]: batch normalization provides a new regularization method through normalization of scalar features for each activation within a mini-batch and learning each mean and variance as parameters; wherein the spatially-adaptive normalization layer is a conditional normalization layer, and further comprising normalizing, by the spatially-adaptive normalization layer, layer activations to zero mean (e.g. convolutional neural network receives inputs from a set of features of previous layers, by using convolutions, which operate over three-dimensional (3D) tensors, called feature maps, with two spatial axes (height and width) as well as a depth axis (the channels axis), to perform Batch normalization, which adaptively normalizes data (i.e. spatially-adaptive normalization) even as the mean and variance change over time during training, by performing normalization of scalar features for each activation within a mini-batch and learning each mean and variance as parameters in order to normalize the distribution of each input feature in each layer across each mini-batch to have zero mean and a standard deviation of one (i.e. normalizing layer activations to zero mean), as indicated above), for example).
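For illustration only, the normalization described in the cited GAO passages (each feature map normalized across the mini-batch and over all spatial locations to zero mean and a standard deviation of one) can be sketched as below. This is a minimal NumPy sketch rather than GAO's implementation; the function name `batch_normalize`, the `(N, H, W, C)` tensor layout, and the `eps` stabilizer are assumptions, and the learned parameters and moving averages mentioned in GAO are omitted.

```python
import numpy as np

def batch_normalize(x, eps=1e-5):
    """Normalize a mini-batch of feature maps per channel.

    x is assumed to have shape (N, H, W, C): batch, height, width, channels.
    Statistics are taken over the batch axis and both spatial axes, so all
    locations of the same feature map are normalized in the same way.
    """
    mean = x.mean(axis=(0, 1, 2), keepdims=True)   # one mean per channel
    var = x.var(axis=(0, 1, 2), keepdims=True)     # one variance per channel
    # eps (an assumed small constant) guards against division by zero.
    return (x - mean) / np.sqrt(var + eps)

# Synthetic activations with non-zero mean and non-unit spread.
rng = np.random.default_rng(0)
activations = rng.normal(loc=3.0, scale=2.0, size=(8, 4, 4, 3))
normalized = batch_normalize(activations)
```

Note that the statistics in this sketch are shared by every spatial location of a channel; nothing in it varies the normalization from one image region to another.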
Fu, Suzuki, and GAO are considered to be analogous art because they pertain to image processing applications based on neural networks. Therefore, the combined teachings of Fu, Suzuki, and GAO, as a whole, would have rendered obvious the invention recited in claim 3 with a reasonable expectation of success in order to modify the method for image generation through use of adversarial networks (as disclosed by 
The combination of Fu, Suzuki, and GAO, as a whole, teaches the method as indicated above, but fails to teach the following as further recited in claim 3.
 However, SU teaches and de-normalizing the normalized layer activations to modulate activation using an affine transformation (Par. [0028-29]: one or more neural network (NN) models, each adapted to approximate an image… The encoder selects a neural network model from the variety of NN models to determine an output image which approximates the second image based on the first image and the second image. Next, it determines at least some values of the parameters of the selected NN model according to an optimizing criterion, the first image, and the second image, wherein the parameters comprise node weights and/or node biases to be used with an activation function for at least some of the nodes in at least one layer of the selected NN model… For one or more color components of the encoded image, the image metadata may comprise: the number of neural-net layers in the NN, the number of neural nodes for at least one layer, and weights and offsets to be used with an activation function in some nodes of the at least one layer. After decoding the encoded image, the decoder generates an output image in the second dynamic range based on the encoded image and the parameters of the NN model; Par. [0050]: performance can be improved by renormalizing the input signals to the range [-1 1]… the neural network needs to include… a pre-scaling stage (normalization), where each channel in the input signal is scaled… a post-scaling stage (de-normalization), where each channel in the output signal… is scaled back to the original range; Par. 
[0138-139]: the image metadata comprise parameters for a neural network (NN) model to map the encoded image to an output image, wherein the image metadata comprise for one or more color components of the encoded image a number of neural-net layers in the NN, a number of neural nodes for each layer, and a weight and an offset to be used with an activation function of each node… parameters for a neural network (NN) model to map the encoded image to an output image, wherein the image metadata comprise for one or more color components of the encoded image a number of neural-net layers in the NN, a number of neural nodes for each layer, and a weight and an offset to be used with an activation function of each node; and… generating an output image based on the encoded image and the parameters of the NN model… wherein the image metadata further comprise scaling metadata, wherein for each color component of the encoded image the scaling metadata comprise a gain, a minimum, and a maximum value, and the method further comprises generating a de-normalizing output image based on the scaling metadata and the output image; and de-normalizing the normalized layer activations to modulate activation using an affine transformation (e.g. image metadata comprising parameters for a neural network (NN) model to map (i.e. transform, 
Fu, Suzuki, GAO, and SU are considered to be analogous art because they pertain to image processing applications based on neural networks. Therefore, the combined teachings of Fu, Suzuki, GAO, and SU, as a whole, would have rendered obvious the invention recited in claim 3 with a reasonable expectation of success in order to modify the method for image generation through use of adversarial networks (as disclosed by Fu) with de-normalizing the normalized layer activations to modulate activation using an affine transformation (as taught by SU, Abstract, Par. [0028-29, 50, 138-139]) to derive image-mapping functions based on neural-networks to improve 

Regarding claim 12, claim 6 is incorporated and the combination of Fu and Suzuki, as a whole, teaches the method (Fu, Par. [0004]), but fails to teach the following as further recited in claim 12.
However, GAO teaches further comprising:
normalizing, by the spatially-adaptive normalization layer, layer activations to zero mean (Par. [0104-120]: convolutional neural network is a special type of neural network. The fundamental difference between a densely connected layer and a convolution layer is this: Dense layers learn global patterns in their input feature space, whereas convolution layers learn local patterns: in the case of images, patterns found in small 2D windows of the inputs. This key characteristic gives convolutional neural networks two interesting properties: (1) the patterns they learn are translation invariant and (2) they can learn spatial hierarchies of patterns… convolutional neural network learns highly non-linear mappings by interconnecting layers of artificial neurons arranged in many different layers with activation functions that make the layers dependent. It includes one or more convolutional layers, interspersed with one or more sub-sampling layers and non-linear layers, which are typically followed by one or more fully connected layers. Each element of the convolutional neural network receives inputs from a set of features in the previous layer… Convolutions operate over 3D tensors, called feature maps, with two spatial axes (height and width) as well as a depth axis (also called the channels axis). For an RGB image, the dimension of the depth axis is 3, because the image has three color channels: red, green, and blue. For a black-and-white picture, the depth is 1 (levels of gray). The convolution operation extracts patches from its input feature map and applies the same transformation to all of these patches, producing an output feature map. This output feature map is still a 3D tensor: it has a width and a height … The algorithm includes computing the activation of all neurons in the network, yielding an output for the forward pass; Par. [0130]: convolution layers of the convolutional neural network serve as feature extractors. 
Convolution layers act as adaptive feature extractors capable of learning and decomposing the input data into hierarchical features; Par. [0164-171]: Batch normalization is a method for accelerating deep network training by making data standardization an integral part of the network architecture. Batch normalization can adaptively normalize data even as the mean and variance change over time during training. It works by internally maintaining an exponential moving average of the batch-wise mean and variance of the data seen during training… Batch normalization can be seen as yet another layer that can be inserted into the model architecture, just like the fully connected or convolutional layer. The BatchNormalization layer is typically used after a convolutional or densely connected layer. It can also be used before a convolutional or densely connected layer… Batch normalization provides a definition for feed-forwarding the input and computing the gradients with respect to the parameters and its own input via a backward pass. In practice, batch normalization layers are inserted after a convolutional or fully connected layer, but before the outputs are fed into an activation function. For convolutional layers, the different elements of the same feature map--i.e. the activations--at different locations are normalized in the same way in order to obey the convolutional property. Thus, all activations in a mini-batch are normalized over all locations, rather than per activation… The internal covariate shift is the phenomenon where the distribution of network activations change across layers due to the change in network parameters during training. Ideally, each layer should be transformed into a space where they have the same distribution but the functional relationship stays the same. 
In order to avoid costly calculations of covariance matrices to decorrelate and whiten the data at every layer and step, we normalize the distribution of each input feature in each layer across each mini-batch to have zero mean and a standard deviation of one… the batch normalization procedure is described herein per activation; Par. [0185]: batch normalization provides a new regularization method through normalization of scalar features for each activation within a mini-batch and learning each mean and variance as parameters; normalizing, by the spatially-adaptive normalization layer, layer activations to zero mean (e.g. convolutional neural network receives inputs from a set of features of previous layers, by using convolutions, which operate over three-dimensional (3D) tensors, called feature maps, with two spatial axes (height and width) as well as a depth axis (the channels axis), to perform Batch normalization, which adaptively normalizes data (i.e. spatially-adaptive normalization) even as the mean and variance change over time during training, by performing normalization of scalar features for each activation within a mini-batch and learning each mean and variance as parameters in order to normalize the distribution of each input feature in each layer across each mini-batch to have zero mean and a standard deviation of one (i.e. normalizing layer activations to zero mean), as indicated above), for example).
The same motivation to combine above-mentioned teachings applies, as previously indicated in claim 3.
However, SU teaches and de-normalizing the normalized layer activations to modulate activation using an affine transformation (Par. [0028-29]: one or more neural network (NN) models, each adapted to approximate an image… The encoder selects a neural network model from the variety of NN models to determine an output image which approximates the second image based on the first image and the second image. Next, it determines at least some values of the parameters of the selected NN model according to an optimizing criterion, the first image, and the second image, wherein the parameters comprise node weights and/or node biases to be used with an activation function for at least some of the nodes in at least one layer of the selected NN model… For one or more color components of the encoded image, the image metadata may comprise: the number of neural-net layers in the NN, the number of neural nodes for at least one layer, and weights and offsets to be used with an activation function in some nodes of the at least one layer. After decoding the encoded image, the decoder generates an output image in the second dynamic range based on the encoded image and the parameters of the NN model; Par. [0050]: performance can be improved by renormalizing the input signals to the range [-1 1]… the neural network needs to include… a pre-scaling stage (normalization), where each channel in the input signal is scaled… a post-scaling stage (de-normalization), where each channel in the output signal… is scaled back to the original range; Par. 
[0138-139]: the image metadata comprise parameters for a neural network (NN) model to map the encoded image to an output image, wherein the image metadata comprise for one or more color components of the encoded image a number of neural-net layers in the NN, a number of neural nodes for each layer, and a weight and an offset to be used with an activation function of each node… parameters for a neural network (NN) model to map the encoded image to an output image, wherein the image metadata comprise for one or more color components of the encoded image a number of neural-net layers in the NN, a number of neural nodes for each layer, and a weight and an offset to be used with an activation function of each node; and… generating an output image based on the encoded image and the parameters of the NN model… wherein the image metadata further comprise scaling metadata, wherein for each color component of the encoded image the scaling metadata comprise a gain, a minimum, and a maximum value, and the method further comprises generating a de-normalizing output image based on the scaling metadata and the output image; and de-normalizing the normalized layer activations to modulate activation using an affine transformation (e.g. image metadata comprising parameters for a neural network (NN) model to map (i.e. transform, translate, etc.) an encoded image to an output image (i.e. generate an image using an affine transformation), including a number of neural-net layers in the NN, a number of neural nodes for each layer, and a weight and an offset to be used with an activation function of each node parameters for a neural network (NN) model to map the encoded 
The same motivation to combine above-mentioned teachings applies, as previously indicated in claim 3.

Regarding claim 13, claim 12 is incorporated and the combination of Fu, Suzuki, GAO, and SU, as a whole teaches the method (Fu, Par. [0004]), wherein the de-normalizing uses different normalization parameter values for the two regions (SU, Par. [0028-29]: a neural network model from the variety of NN models to determine an output image which approximates the second image based on the first image and the second image. Next, it determines at least some values of the parameters of the selected NN model according to an optimizing criterion, the first image, and the second image, wherein the parameters comprise node weights and/or node biases to be used with an activation function for at least some of the nodes in at least one layer of the selected NN model… For one or more color components of the encoded image, the image metadata may comprise: the number of neural-net layers in the NN, the number of neural nodes for at least one layer, and weights and offsets to be used with an activation function in some nodes of the at least one layer. After decoding the encoded image, the decoder generates an output image in the second dynamic range based on the encoded image and the parameters of the NN model; Par. [0045-55]: input and output parameters of a NN may be expressed in terms of the mapping in equation… The goal is to find the parameters… in all (L+1) layers, to minimize the total minimum square error (MSE) for all P pixels… An L-layer neural-network based mapping can be represented using the following parameters, which can be communicated to a receiver as metadata… the normalization parameters for each input component (e.g., gain, min, and max); Par. [0094]: As described earlier, NNM metadata include the input normalization parameters and the neural-network parameters. These values are typically floating-point numbers in single or double precision; Par. 
[0138-139]: the image metadata comprise parameters for a neural network (NN) model to map the encoded image to an output image, wherein the image metadata comprise for one or more color components of the encoded image a number of neural-net layers in the NN, a number of neural nodes for each layer, and a weight and an offset to be used with an activation function of each node… generating an output image based on the encoded image and the parameters of the NN model… the image metadata further comprise scaling metadata, wherein for each color component of the encoded image the scaling metadata comprise a gain, a minimum, and a maximum value, and the method further comprises generating a de-normalizing output image based on the scaling metadata and the output image; wherein the de-normalizing uses different normalization parameter values for the two regions (e.g. encoder selects a neural network model from a variety of NN models to determine an output image which approximates an encoded input image and determines values of the parameters of the selected NN model according to an optimizing criterion, including image metadata comprising parameters for a neural network (NN) model to map the encoded image to an output image, and generating an output image based on the encoded image and the parameters of the NN model, the image metadata further comprising scaling metadata, wherein for each color component of the encoded image the scaling metadata comprise a gain, a minimum, and a maximum value (i.e. different parameter values), and generating a de-normalizing output image based on the scaling metadata (i.e. the de-normalizing uses different normalization parameter values for the (first, second, third… Nth) regions) and the output image, as indicated above), for example).
The same motivation to combine above-mentioned teachings applies, as previously indicated in claim 3.
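For illustration only, the pre-scaling (normalization) and post-scaling (de-normalization) stages quoted from SU can be sketched as a pair of per-channel affine maps. This is a minimal NumPy sketch under stated assumptions: the exact formula used in SU is not reproduced in the passages quoted above, so the mapping to the range [-1, 1], the definition of the gain as 2 / (max - min), and the function names `normalize` and `denormalize` are all hypothetical.

```python
import numpy as np

def normalize(x, ch_min, ch_max):
    """Pre-scaling: map each channel from [ch_min, ch_max] to [-1, 1]."""
    gain = 2.0 / (ch_max - ch_min)      # assumed definition of the gain
    return gain * (x - ch_min) - 1.0

def denormalize(y, ch_min, ch_max):
    """Post-scaling: affine map y -> scale * y + offset, back to the original range."""
    scale = (ch_max - ch_min) / 2.0
    offset = (ch_max + ch_min) / 2.0
    return scale * y + offset

# A 2x2 image with 3 color channels; the last axis is the channel axis.
image = np.arange(12, dtype=float).reshape(2, 2, 3)
ch_min = image.min(axis=(0, 1))         # one minimum per color component
ch_max = image.max(axis=(0, 1))         # one maximum per color component

normalized = normalize(image, ch_min, ch_max)
restored = denormalize(normalized, ch_min, ch_max)
```

In this sketch the scaling parameters differ per color component and are applied uniformly to every pixel of that component.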

Regarding claim 20, claim 15 is incorporated and is a corresponding apparatus claim rejected as applied to the method claim 12 above.

Contact Information

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vu Le, can be reached on 571-272-7332. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/GUILLERMO M RIVERA-MARTINEZ/           Primary Examiner, Art Unit 2668