DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant's arguments filed 1/5/2021 have been fully considered but they are not persuasive.
Applicant argues 

[Image: media_image1.png]

	The amendment does not simplify or rectify the issue.  The current claim requires “a semantic representation [[of]] generated from a first digital representation of an image”.  
The disclosure fails to define what a “digital representation of an image” is.1
The disclosure states:
[22] “The semantic representation may include a semantic label map, an edge map, a depth map, a relationship map (e.g., a relationship map between pairs of objects within an image), etc.”
[27] “In addition, in one embodiment, the instance feature map may be used to control a style of the created image (e.g., by dictating a color and/or texture of one or more components of the created image, etc.). In one embodiment, the feature encoder network may use instance-wise average pooling to ensure that features are uniform within the instance feature map. In this way, all similar features in the created image may be the same (e.g., same color grass, same type of road, etc.).”
[31] “the discriminator may extract one or more features (e.g., intermediate feature representations, etc.)”
[167] “the discriminator may extract a first set of intermediate feature representations from the image, and may also extract a second set of intermediate feature representations from the original image on which the semantic representation is based.”
[00133]  Figure 6 illustrates an exemplary network architecture of a generator 600, according to one embodiment. In one embodiment, we first train a residual network (Go) 602 on lower resolution images. Then this network is used to initialize our final network trained on high resolution images (G1) 604A-B. Specifically, the input to the residual blocks in G1 604A-B is the element-wise sum of the feature map from G1 604A-B and the last feature map from (Go) 602.
	Is “a first digital representation” one of the items above?  Not all of them have a resolution.  Why not simply claim “a semantic representation [[of]] generated from a first [[digital representation of an]] image”?
Additionally, throughout the claim set, “the semantic representation of the first digital representation of the image” must be changed to “the semantic representation [[of]] generated from the first digital representation of the image”.

Applicant argues 

[Image: media_image2.png]

[Image: media_image3.png]

Examiner’s Response
[23] states “In one embodiment, the coarse neural network may take the semantic representation as input, and may output a first image having a first resolution. In one embodiment, the coarse neural network may include a residual network that is trained on images having a first resolution.”
[165] As shown in operation 802, a semantic representation is received as input to a coarse-to-fine generator. Additionally, as shown in operation 804, the coarse-to-fine generator creates an image, using the semantic representation. Further, as shown in operation 806, the image is sent to a discriminator.
*Note: Figure 8 illustrates a flowchart of a method for training a coarse-to-fine generator, in accordance with an embodiment. This passage does not use the “only” language of the claim.
[00171]   As shown in operation 902, a semantic representation is received as input to a coarse-to-fine generator. Additionally, as shown in operation 904, a coarse neural network of the coarse-to-fine generator creates a first image having a first resolution, utilizing the semantic representation.
*Note: Figure 9 illustrates a flowchart of a method for implementing a trained coarse-to-fine generator, in accordance with an embodiment.
[00177] As shown in operation 1102, a coarse neural network is trained using only the semantic representation of the first digital representation of the image to generate a coarse digital representation of the image having a resolution that is less than the resolution of the first digital representation of the image. In one embodiment, the semantic representation of the first digital representation of the image includes a semantic label map of the first digital representation of the image. In one embodiment, the semantic representation of the first digital representation of the image includes an edge map of the first digital representation of the image. In one embodiment, the semantic representation of the first digital representation of the image includes a relationship map of the first digital representation of the image.
Examiner Note: A review of 62/586743 shows this claim language is not supported by the Priority Document. The need for training is briefly mentioned once on page 4 and once on page 12.
ONLY paragraph 177 gives support for the “only” limitation.  See next section.

How a normal GAN is trained2:
[Image: media_image4.png]

A GAN contains two separately trained networks (a generator and a discriminator).
The discriminator's training data comes from two sources:
(1) Real data instances, such as real pictures of people. The discriminator uses these instances as positive examples during training.
(2) Fake data instances created by the generator. The discriminator uses these instances as negative examples during training.
The generator part of a GAN learns to create fake data by incorporating feedback from the discriminator. It learns to make the discriminator classify its output as real.
So we train the generator with the following procedure:
1. Sample random noise.
2. Produce generator output from sampled random noise.
3. Get discriminator "Real" or "Fake" classification for generator output.
4. Calculate loss from discriminator classification.
5. Backpropagate through both the discriminator and generator to obtain gradients.
6. Use gradients to change only the generator weights.
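For illustration only, and not drawn from the record, the six-step generator procedure above can be sketched in Python. This is a minimal sketch under stated assumptions: the generator is a hypothetical one-dimensional affine map (a*z + b), the discriminator is a hypothetical single logistic unit, and the names generator_step, gen, and disc are invented for this example. It is intended only to show that the gradient reaches the generator by passing through the discriminator, while the discriminator weights themselves are left unchanged.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def generator_step(gen, disc, rng, lr=0.1, batch=16):
    """One generator update on a toy 1-D GAN (hypothetical model).

    gen  = {"a": ..., "b": ...}  generator weights, updated in place
    disc = {"w": ..., "c": ...}  discriminator weights, read but never modified
    Returns the mean generator loss over the batch.
    """
    a, b = gen["a"], gen["b"]
    w, c = disc["w"], disc["c"]
    grad_a = grad_b = loss = 0.0
    for _ in range(batch):
        z = rng.gauss(0.0, 1.0)          # step 1: sample random noise
        x = a * z + b                    # step 2: generator output
        p_real = sigmoid(w * x + c)      # step 3: discriminator classification
        loss += -math.log(p_real)        # step 4: loss from that classification
        dloss_dx = -(1.0 - p_real) * w   # step 5: backpropagate through the discriminator
        grad_a += dloss_dx * z           #         and then through the generator
        grad_b += dloss_dx
    gen["a"] = a - lr * grad_a / batch   # step 6: change only the generator weights
    gen["b"] = b - lr * grad_b / batch
    return loss / batch


if __name__ == "__main__":
    rng = random.Random(0)
    gen = {"a": 1.0, "b": 0.0}
    disc = {"w": 2.0, "c": -1.0}
    print("loss:", generator_step(gen, disc, rng))
    print("generator:", gen, "discriminator (unchanged):", disc)
```

In this sketch the loss is the commonly used non-saturating form, -log D(G(z)); that choice is an assumption of the example and is not taken from the cited material.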
As can be seen above, in a normal GAN the generator is trained by backpropagation from the discriminator, not by using ONLY the input (the semantic representation).  Returning to the last limitation of claim 1, “adjusting weight values associated with one or more nodes of one or both of the coarse neural network and the fine neural network…”: the adjusting of weights of the coarse neural network is the actual training of the generator, and the adjusting of weights of the fine neural network is the actual training of the discriminator.  The Examiner cannot determine what “training a coarse neural network using only the semantic representation of the first digital representation of the image …” is supposed to mean, as it does not make sense in the context of the field of endeavor (GANs) or of the claim as written.  Thus it is indefinite.
It is the Examiner’s opinion that the drafter of the disclosure erred in the use of GAN terminology, so the language fails to particularly point out and distinctly claim the subject matter.  If it is Applicant’s opinion that the claim language is accurate, the Examiner will advance the 35 USC 112(a) rejection as implied on page 14, paragraph 2 of the reply filed on 1/5/21.

Applicant argues 

[Image: media_image5.png]

The Examiner disagrees in part.  See the new rejection below.  In Zhang, the text description is fed into an embedding function, whose output is fed into the GAN (see below).  The text description is a first digital representation of an image; the output of the embedding function is the semantic representation.  The Examiner previously suggested language to avoid this interpretation.3

[Image: media_image6.png]

Applicant argues 

[Image: media_image7.png]

[Image: media_image8.png]

The Examiner disagrees.  Cox paragraph 41 continues to say “Machine learning engine 322 could then generate an updated configuration file 324 using the adjusted weight values.  Digital camera 302 could then generate a new set of trial images 330 based on the updated configuration file 324.  Training engine 328 could then repeat the training process with the new set of trial images 330 generated using the updated configuration file 324.”  Thus, this is an iterative process and meets the claim limitation.
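For illustration only, the compare, adjust, and repeat cycle described in Cox paragraph 41 can be sketched as below. This is a hedged sketch and not Cox's implementation: the single scalar weight `gain`, the function name train_gain, and the flat list-of-pixels image format are all assumptions made for this example. Each round produces a new trial image from the current weight, computes pixel differences against the target image, and applies one gradient descent update.

```python
def train_gain(trial, target, gain=0.0, lr=1.0, rounds=20):
    """Iteratively fit one weight so that gain * trial approximates target.

    trial, target: equal-length lists of pixel values (hypothetical format).
    Returns the final gain and the cost recorded at the start of each round.
    """
    n = len(trial)
    history = []
    for _ in range(rounds):
        output = [gain * p for p in trial]               # regenerate the trial image
        diffs = [o - t for o, t in zip(output, target)]  # pixel differences
        cost = sum(d * d for d in diffs) / n             # mean squared difference
        history.append(cost)
        grad = 2.0 * sum(d * p for d, p in zip(diffs, trial)) / n
        gain -= lr * grad                                # adjust the weight value
    return gain, history


if __name__ == "__main__":
    trial_image = [0.2, 0.4, 0.6, 0.8]
    target_image = [0.4, 0.8, 1.2, 1.6]  # twice as bright as the trial image
    final_gain, costs = train_gain(trial_image, target_image)
    print("gain:", final_gain, "first cost:", costs[0], "last cost:", costs[-1])
```

The recorded cost shrinks from round to round in this toy case, which is the sense in which the loop is iterative.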

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-23 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor, or for pre-AIA the applicant, regards as the invention.
Claim 1 recites “training a coarse neural network using only the semantic representation of the first digital representation of the image to generate a coarse digital representation of the image having a resolution that is less than the resolution of the first digital representation of the image”.  The disclosure describes the semantic representation as follows: “In one embodiment, the semantic representation may include a semantic label map, an edge map, a depth map, a relationship map (e.g., a relationship map between pairs of objects within an image), etc.”
Claim 1 requires “a semantic representation [[of]] generated from a first digital representation of an image”.  
The disclosure fails to define what a “digital representation of an image” is.4
The disclosure states:
[22] “The semantic representation may include a semantic label map, an edge map, a depth map, a relationship map (e.g., a relationship map between pairs of objects within an image), etc.”
[27] “In addition, in one embodiment, the instance feature map may be used to control a style of the created image (e.g., by dictating a color and/or texture of one or more components of the created image, etc.). In one embodiment, the feature encoder network may use instance-wise average pooling to ensure that features are uniform within the instance feature map. In this way, all similar features in the created image may be the same (e.g., same color grass, same type of road, etc.).”
[31] “the discriminator may extract one or more features (e.g., intermediate feature representations, etc.)”
[167] “the discriminator may extract a first set of intermediate feature representations from the image, and may also extract a second set of intermediate feature representations from the original image on which the semantic representation is based.”
[00133]  Figure 6 illustrates an exemplary network architecture of a generator 600, according to one embodiment. In one embodiment, we first train a residual network (Go) 602 on lower resolution images. Then this network is used to initialize our final network trained on high resolution images (G1) 604A-B. Specifically, the input to the residual blocks in G1 604A-B is the element-wise sum of the feature map from G1 604A-B and the last feature map from (Go) 602.
	Is “a first digital representation” one of the items above?  Not all of them have a resolution.  Why not simply claim “a semantic representation [[of]] generated from a first [[digital representation of an]] image”?
The phrase “of the image” is the descriptor of the first digital representation of the image.
It appears Applicant desires there to be an image which is converted to a digital representation, which is in turn converted to a semantic representation. At a minimum the claim would require “generating a semantic representation from a first image” or “generating a semantic representation from a first digital representation, which is generated from an image”.  There is no correlation between the semantic representation and any image resolution.  For example, if the semantic representation were a depth map, edge map, relationship map, etc., it would say nothing about the resolution of an input image.
Furthermore, a NN trained with ONLY semantic information would be unable to create an image.  In a GAN (generative adversarial network), training must be done with an image and semantic information.  See Applicant’s disclosure:
Paragraph 26 : “the creating may be performed during a training process (e.g., a training of the generator using a conditional adversarial network, etc.)”; “a separate neural network (e.g., a feature encoder network, etc.) may receive an original image on which the semantic representation is based as input, and may create an instance feature map based on the original image. In one embodiment, the instance feature map may be used as input to the generator along with the semantic representation.”
Paragraph 126 : “the training dataset is given as a set of pairs of corresponding images { (si, xi) }, where si is a semantic label map and xi is a corresponding natural photo.”  Additionally the adjusting of weights is the actual training of the neural network.
It appears Applicant has a different meaning for “training” than is ordinarily used with NNs.
Claim 16 is rejected under similar grounds as claim 1.
The claim set (Claims 1-23) uses “the semantic representation of the first digital representation of the image”.  This must be changed to “the semantic representation [[of]] generated from the first digital representation of the image” to maintain antecedent basis.
Claims 2-15 are rejected as dependent on a rejected claim.
Claims 17 and 23 recite “a coarse digital representation of the image having a resolution that is less than the resolution of the first digital representation of the image”.  Claims 17 and 23 require “a semantic representation [[of]] generated from a first digital representation of an image”.
The disclosure fails to define what a “digital representation of an image” is.5
The disclosure states:
[22] “The semantic representation may include a semantic label map, an edge map, a depth map, a relationship map (e.g., a relationship map between pairs of objects within an image), etc.”
[27] “In addition, in one embodiment, the instance feature map may be used to control a style of the created image (e.g., by dictating a color and/or texture of one or more components of the created image, etc.). In one embodiment, the feature encoder network may use instance-wise average pooling to ensure that features are uniform within the instance feature map. In this way, all similar features in the created image may be the same (e.g., same color grass, same type of road, etc.).”
[31] “the discriminator may extract one or more features (e.g., intermediate feature representations, etc.)”
[167] “the discriminator may extract a first set of intermediate feature representations from the image, and may also extract a second set of intermediate feature representations from the original image on which the semantic representation is based.”
[00133]  Figure 6 illustrates an exemplary network architecture of a generator 600, according to one embodiment. In one embodiment, we first train a residual network (Go) 602 on lower resolution images. Then this network is used to initialize our final network trained on high resolution images (G1) 604A-B. Specifically, the input to the residual blocks in G1 604A-B is the element-wise sum of the feature map from G1 604A-B and the last feature map from (Go) 602.
	Is “a first digital representation” one of the items above?  Not all of them have a resolution.  Why not simply claim “a semantic representation [[of]] generated from a first [[digital representation of an]] image”?
The phrase “of the image” is the descriptor of the first digital representation of the image.
It appears Applicant desires there to be an image which is converted to a digital representation, which is in turn converted to a semantic representation. At a minimum the claim would require “generating a semantic representation from a first image” or “generating a semantic representation from a first digital representation, which is generated from an image”.  There is no correlation between the semantic representation and any image resolution.  For example, if the semantic representation were a depth map, edge map, relationship map, etc., it would say nothing about the resolution of an input image.
Claims 18-22 are rejected as dependent on a rejected claim.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
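A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.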

Claims 1-2, 4-5, and 14-15 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang (“StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks”) in view of Cox (PGPub 2014/0152848).
Regarding claim 1, Zhang discloses a method comprising:
training a machine learning model based, at least in part, on a semantic representation generated from a first digital representation of an image, wherein training the machine learning model includes: 
training a coarse neural network using only the semantic representation of the first digital representation of the image to generate a coarse digital representation of the image having a resolution that is less than the resolution of the first digital representation of the image;  (Zhang, Section 3, “Stage-I GAN: it sketches the primitive shape and basic colors of the object conditioned on the given text description, and draws the background layout from a random noise vector, yielding a low-resolution image.”; Section 3.5, “For training, we first iteratively train D0 and G0 of Stage-I GAN for 600 epochs by fixing Stage-II GAN. Then we iteratively train D and G of Stage-II GAN for another 600 epochs by fixing Stage-I GAN. All networks are trained using ADAM solver with batch size 64 and an initial learning rate of 0.0002. The learning rate is decayed to 1/2 of its previous value every 100 epochs”, where the text is the first digital representation of an image, which is converted using an embedding function (see Fig. 2) and inputted into the GAN
[Image: media_image9.png]
)
training a fine neural network using the semantic representation of the first digital representation of the image and the coarse digital representation of the image to generate a fine digital representation of the image having a resolution that is greater than the resolution of the coarse digital representation of the image;  (Zhang, Section 3, “Stage-II GAN: it corrects defects in the low-resolution image from Stage-I and completes details of the object by reading the text description again, producing a high resolution photo-realistic image.”; Section 3.5, “For training, we first iteratively train D0 and G0 of Stage-I GAN for 600 epochs by fixing Stage-II GAN. Then we iteratively train D and G of Stage-II GAN for another 600 epochs by fixing Stage-I GAN. All networks are trained using ADAM solver with batch size 64 and an initial learning rate of 0.0002. The learning rate is decayed to 1/2 of its previous value every 100 epochs”)
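As a worked illustration of the schedule quoted from Zhang Section 3.5 (an initial learning rate of 0.0002 that is halved every 100 epochs over a 600-epoch stage), the following minimal sketch computes the learning rate at a given epoch. It assumes zero-based epoch numbering with the halving applied at epochs 100, 200, and so on; the function name learning_rate is hypothetical, and Zhang's actual training code is not of record.

```python
def learning_rate(epoch, initial=0.0002, decay=0.5, every=100):
    """Step-decay schedule: multiply by `decay` once per `every` epochs.

    Assumes zero-based epochs, so epochs 0-99 use the initial rate.
    """
    return initial * decay ** (epoch // every)


if __name__ == "__main__":
    # Print the rate at the start of each 100-epoch block of a 600-epoch stage.
    for epoch in range(0, 600, 100):
        print(epoch, learning_rate(epoch))
```

Under these assumptions the rate is halved five times within a 600-epoch stage, so the final 100 epochs run at 1/32 of the initial rate.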

Zhang does not expressly disclose “comparing the fine digital representation of the image to the first digital representation of the image; and
adjusting weight values associated with one or more nodes of one or both of the coarse neural network and the fine neural network to minimize a difference between the first digital representation of the image and the fine digital representation of the image.”

Cox discloses “comparing the fine digital representation of the image to the first digital representation of the image; and adjusting weight values associated with one or more nodes of one or both of the coarse neural network and the fine neural network to minimize a difference between the first digital representation of the image and the fine digital representation of the image. “ (Cox, paragraph 41, “For example, training engine 328 could compare each trial image 330 to a corresponding target image 326 and compute differences in pixel values between those two images.  Based on those differences, training engine 322 could adjust the weight values within machine learning engine 322 (e.g. using a cost function or gradient descent algorithm, etc.) to minimize the difference in pixel values between the two images.  Training engine 322 could repeat this training process for each trial image/target image pair.  Machine learning engine 322 could then generate an updated configuration file 324 using the adjusted weight values.  ”)
It would have been obvious to a person having ordinary skill in the art before the time of the effective filing date of the claimed invention of the instant application to use the training engine of Cox to train the neural network of Zhang.
The suggestion/motivation for doing so would have been to implement the training of Zhang.
Further, one skilled in the art could have combined the elements as described above by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results. 
Therefore, it would have been obvious to combine Zhang with Cox to obtain the invention as specified in claim 1.
	
Regarding claim 2, Zhang in view of Cox discloses the method of claim 1, wherein the semantic representation of the first digital representation of the image includes a semantic label map of the first digital representation of the image. (Zhang, Section 3, “Stage-I GAN: it sketches the primitive shape and basic colors of the object conditioned on the given text description, and draws the background layout from a random noise vector, yielding a low-resolution image.” Text reads on a semantic label map, absent any specific definition of this term)

Regarding claim 4, Zhang in view of Cox discloses. The method of claim 1, wherein the semantic representation of the first digital representation of the image includes a relationship map of the first digital representation of the image. (Zhang, Section 3, “Stage-I GAN: it sketches the primitive shape and basic colors of the object conditioned on the given text description, and draws the background layout from a random noise vector, yielding a low-resolution image.” Text reads on a relationship map, absent any specific definition of this term)
	 
Regarding claim 5, Zhang in view of Cox discloses. The method of claim 1, further comprising generating, utilizing the fine digital representation of the image, a downsampled fine digital representation of the image having a resolution that is less than the resolution of the fine digital representation of the image. (Zhang, Section 3.5, “For training, we first iteratively train D0 and G0 of Stage-I GAN for 600 epochs by fixing Stage-II GAN. Then we iteratively train D and G of Stage-II GAN for another 600 epochs by fixing Stage-I GAN. All networks are trained using ADAM solver with batch size 64 and an initial learning rate of 0.0002. The learning rate is decayed to 1/2 of its previous value every 100 epochs”, where an iterative training requires this step)

Regarding claim 14, Zhang in view of Cox discloses The method of claim 1, wherein the machine learning model is also trained based, at least in part, on an instance feature map of the first digital representation of the image. (inherent to a neural network, This would be the convolutional layer; See Zhang, Section 3.3, last paragraph, “The resulting tensor is further fed to a 1×1 convolutional layer to jointly learn features across the image and the text. Finally, a fully connected layer with one node is used to produce the decision score”)
 
Regarding claim 15, Zhang in view of Cox discloses The method of claim 14, wherein the instance feature map of the first digital representation of the image is added to the semantic representation of the first digital representation of the image as input to the machine learning model. (inherent to a neural network, This would be the convolutional layer; See Zhang, Section 3.3, last paragraph, “The resulting tensor is further fed to a 1×1 convolutional layer to jointly learn features across the image and the text. Finally, a fully connected layer with one node is used to produce the decision score”)
 
Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Zhang in view of Cox in view of Isola (“Image-to-Image Translation with Conditional Adversarial Networks”, IDS).
Regarding claim 3, Zhang in view of Cox discloses the method of claim 1, but does not expressly disclose “wherein the semantic representation of the first digital representation of the image includes an edge map of the first digital representation of the image”.
Isola (“Image-to-Image Translation with Conditional Adversarial Networks”, IDS) discloses “wherein the semantic representation of the first digital representation of the image includes an edge map of the first digital representation of the image” (Isola, Section 4, [Image: media_image10.png], Edges->Photo)
It would have been obvious to a person having ordinary skill in the art before the time of the effective filing date of the claimed invention of the instant application to use an edge map in place of the text in Zhang, as suggested by Isola.
The suggestion/motivation for doing so would have been that a GAN can use a variety of different inputs to implement image generation.
Further, one skilled in the art could have combined the elements as described above by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results. 
Therefore, it would have been obvious to combine Zhang and Cox with Isola to obtain the invention as specified in claim 3.

Claims 6-8 and 10-13 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang in view of Cox in view of Korkin (PGPub 2016/0117800).
Regarding claim 6, Zhang in view of Cox discloses the method of claim 5,
but does not expressly disclose further comprising “generating, utilizing the first digital representation of the image, a downsampled first digital representation of the image having a resolution that is less than the resolution of the first digital representation of an image.”
Korkin discloses  “generating, utilizing the first digital representation of the image, a downsampled first digital representation of the image having a resolution that is less than the resolution of the first digital representation of an image.” (Korkin, paragraph 59, “In one embodiment, the blur kernel is estimated using a calibration procedure involving acquisition of a test image by the photographic image acquisition device disclosed herein at two different distances corresponding to the ratio of the super-resolved image and the low-resolution image, and then minimizing the difference between said images by applying the blur kernel to the higher-resolution test image and downsampling it, and iteratively modifying the kernel weights, said minimization in the preferred embodiment comprising the least squares method, or the like. ”)
It would have been obvious to a person having ordinary skill in the art before the time of the effective filing date of the claimed invention of the instant application to downsample both images as shown by Korkin before the comparison in Cox.
The suggestion/motivation for doing so would have been to reduce the number of data points thereby increasing the speed of determining weights.
Further, one skilled in the art could have combined the elements as described above by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results. 
Therefore, it would have been obvious to combine Zhang and Cox with Korkin to obtain the invention as specified in claim 6.
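For illustration only, the notion of downsampling two images and then comparing them can be sketched as follows. This is a hedged sketch under stated assumptions: 2x2 block averaging is an assumed downsampling method chosen for simplicity (neither the claims nor Korkin paragraph 59 specify it, and Korkin's blur kernel is omitted), images are hypothetical nested lists of pixel values with even height and width, and the function names are invented for this example.

```python
def downsample_2x(image):
    """Halve the resolution by averaging each non-overlapping 2x2 block.

    Assumes `image` is a list of equal-length rows with even height and width.
    """
    rows, cols = len(image), len(image[0])
    return [
        [
            (image[r][c] + image[r][c + 1]
             + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


def mean_squared_difference(a, b):
    """Mean of squared per-pixel differences between two same-size images."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += (pa - pb) ** 2
            count += 1
    return total / count


if __name__ == "__main__":
    fine = [[0, 2, 4, 6], [2, 4, 6, 8], [1, 1, 3, 3], [1, 1, 3, 3]]
    first = [[1, 1, 5, 5], [3, 3, 7, 7], [1, 1, 3, 3], [1, 1, 3, 3]]
    small_fine = downsample_2x(fine)    # 4x4 input becomes 2x2
    small_first = downsample_2x(first)
    print(small_fine, small_first)
    print(mean_squared_difference(small_fine, small_first))
```

A 4x4 input yields a 2x2 output, so the comparison in this sketch runs over one quarter as many data points as a comparison at full resolution.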

Regarding claim 7, Zhang in view of Cox in view of Korkin discloses the method of claim 6.
Zhang in view of Cox does not expressly disclose further comprising “comparing the downsampled fine digital representation of the image to the downsampled first digital representation of the image”. Korkin discloses this limitation (Korkin, paragraph 59, “In one embodiment, the blur kernel is estimated using a calibration procedure involving acquisition of a test image by the photographic image acquisition device disclosed herein at two different distances corresponding to the ratio of the super-resolved image and the low-resolution image, and then minimizing the difference between said images by applying the blur kernel to the higher-resolution test image and downsampling it, and iteratively modifying the kernel weights, said minimization in the preferred embodiment comprising the least squares method, or the like.”)
It would have been obvious to a person having ordinary skill in the art before the time of the effective filing date of the claimed invention of the instant application to downsample both images as shown by Korkin before the comparison in Cox.
The suggestion/motivation for doing so would have been to reduce the number of datapoints thereby increasing the speed of determining weights.
Further, one skilled in the art could have combined the elements as described above by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results. 
Therefore, it would have been obvious to combine Zhang and Cox with Korkin to obtain the invention as specified in claim 7.

	Regarding claim 8, Zhang in view of Cox in view of Korkin discloses The method of claim 7, further comprising adjusting weight values associated with one or more nodes of one or both of the coarse neural network and the fine neural network to minimize a difference between the downsampled (Korkin, paragraph 59, “In one embodiment, the blur kernel is estimated using a calibration procedure involving acquisition of a test image by the photographic image acquisition device disclosed herein at two different distances corresponding to the ratio of the super-resolved image and the low-resolution image, and then minimizing the difference between said images by applying the blur kernel to the higher-resolution test image and downsampling it, and iteratively modifying the kernel weights, said minimization in the preferred embodiment comprising the least squares method, or the like. ”)

	Claims 10-13 are rejected under similar grounds as claims 5-8.  A downsampled image is a set of intermediate feature representations of the fine digital representation of the image.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
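A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 16-18 and 20-23 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Zhang.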

Regarding claim 16, Zhang discloses A method comprising:
training a machine learning model based, at least in part, on a semantic representation of a first digital representation of an image, wherein training the machine learning model includes: 
training a coarse neural network using only the semantic representation of the first digital representation of the image to generate a coarse digital representation of the image having a resolution that is less than the resolution of the first digital representation of the image;  and (Zhang, Section 3, “Stage-I GAN: it sketches the primitive shape and basic colors of the object conditioned on the given text description, and draws the background layout from a random noise vector, yielding a low-resolution image.”; Section 3.5, “For training, we first iteratively train D0 and G0 of Stage-I GAN for 600 epochs by fixing Stage-II GAN. Then we iteratively train D and G of Stage-II GAN for another 600 epochs by fixing Stage-I GAN. All networks are trained using ADAM solver with batch size 64 and an initial learning rate of 0.0002. The learning rate is decayed to 1/2 of its previous value every 100 epochs”, where the text is the first digital representation of an image, which converted using an embedding function (see Fig. 2)  and inputted into the GAN 
[Image: media_image9.png]
)
training a fine neural network using the semantic representation of the first digital representation of the image and the coarse digital representation of the image to generate a fine digital representation of the image having a resolution that is greater than the resolution of the coarse digital representation of the image. (Zhang, Section 3, “Stage-II GAN: it corrects defects in the low-resolution image from Stage-I and completes details of the object by reading the text description again, producing a high resolution photo-realistic image.” Section 3.5, “For training, we first iteratively train D0 and G0 of Stage-I GAN for 600 epochs by fixing Stage-II GAN. Then we iteratively train D and G of Stage-II GAN for another 600 epochs by fixing Stage-I GAN. All networks are trained using ADAM solver with batch size 64 and an initial learning rate of 0.0002. The learning rate is decayed to 1/2 of its previous value every 100 epochs”)
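For illustration only, the step schedule in the quoted passage of Zhang, Section 3.5 (initial learning rate 0.0002, decayed to 1/2 of its previous value every 100 epochs, 600 epochs per stage) may be sketched as follows. The function name, its parameters, and the zero-based epoch indexing are assumptions made for this sketch and do not appear in Zhang or in the application.

```python
def stepped_learning_rate(epoch, initial_lr=0.0002, decay_every=100, factor=0.5):
    """Learning rate for a zero-based epoch under a step-decay schedule.

    Hypothetical helper: only the numbers (0.0002, 1/2, 100 epochs) come from
    the quoted passage; the name and signature are illustrative assumptions.
    """
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    # The rate is halved once for every completed block of `decay_every` epochs.
    return initial_lr * factor ** (epoch // decay_every)


# Learning rate at the start of each 100-epoch block of one 600-epoch stage.
schedule = [stepped_learning_rate(e) for e in range(0, 600, 100)]
```

Under these assumptions the rate is constant within each 100-epoch block and is halved five times over a 600-epoch stage.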
 
Regarding claim 17, Zhang discloses a machine learning model that includes:
a coarse neural network that generates, using only a semantic representation of a first digital representation of an image, a coarse digital representation of the image having a resolution that is less than the resolution of the first digital representation of the image;  and (Zhang, Section 3, “Stage-I GAN: it sketches the primitive shape and basic colors of the object conditioned on the given text description, and draws the background layout from a random noise vector, yielding a low-resolution image.”)
a fine neural network that generates, using the semantic representation of the first digital representation of the image and the coarse digital representation of the image, a fine digital representation of the image having a resolution that is greater than the resolution of the coarse digital representation of the image. (Zhang, Section 3, “Stage-II GAN: it corrects defects in the low-resolution image from Stage-I and completes details of the object by reading the text description again, producing a high resolution photo-realistic image.”)
 
Regarding claim 18, Zhang discloses the machine learning model of claim 17, wherein the semantic representation of the first digital representation of the image includes a semantic label map of the first digital representation of the image. (Zhang, Section 3, “Stage-I GAN: it sketches the primitive shape and basic colors of the object conditioned on the given text description, and draws the background layout from a random noise vector, yielding a low-resolution image.” Text reads on a semantic label map, absent any specific definition of this term.)

Regarding claim 20, Zhang discloses the machine learning model of claim 17, wherein the semantic representation of the first digital representation of the image includes a relationship map of the first digital representation of the image. (Zhang, Section 3, “Stage-I GAN: it sketches the primitive shape and basic colors of the object conditioned on the given text description, and draws the background layout from a random noise vector, yielding a low-resolution image.” Text reads on a relationship map, absent any specific definition of this term.)
 
Regarding claim 21, Zhang discloses the machine learning model of claim 17, wherein the machine learning model also generates the fine digital representation of an image based, at least in part, on an instance feature map of the first digital representation of the image. (Inherent to a neural network; this would be the convolutional layer. See Zhang, Section 3.3, last paragraph, “The resulting tensor is further fed to a 1×1 convolutional layer to jointly learn features across the image and the text. Finally, a fully connected layer with one node is used to produce the decision score”)
 
Regarding claim 22, Zhang discloses the machine learning model of claim 21, wherein the instance feature map of the first digital representation of the image is added to the semantic representation of the first digital representation of the image as input to the machine learning model. (Inherent to a neural network; this would be the convolutional layer. See Zhang, Section 3.3, last paragraph, “The resulting tensor is further fed to a 1×1 convolutional layer to jointly learn features across the image and the text. Finally, a fully connected layer with one node is used to produce the decision score”)
 
Regarding claim 23, Zhang discloses a method comprising:
(Zhang, Section 3, “Stage-I GAN: it sketches the primitive shape and basic colors of the object conditioned on the given text description, and draws the background layout from a random noise vector, yielding a low-resolution image.”; Section 3.5, “For training, we first iteratively train D0 and G0 of Stage-I GAN for 600 epochs by fixing Stage-II GAN. Then we iteratively train D and G of Stage-II GAN for another 600 epochs by fixing Stage-I GAN. All networks are trained using ADAM solver with batch size 64 and an initial learning rate of 0.0002. The learning rate is decayed to 1/2 of its previous value every 100 epochs”, where the text is the first digital representation of an image, which is converted using an embedding function (see Fig. 2) and input into the GAN
[Image: media_image9.png]
)
generating, by a fine neural network using the semantic representation of the first digital representation of the image and the coarse digital representation of the image, a fine digital representation of the image having a resolution that is greater than the resolution of the coarse digital representation of the image. (Zhang, Section 3, “Stage-II GAN: it corrects defects in the low-resolution image from Stage-I and completes details of the object by reading the text description again, producing a high resolution photo-realistic image.”)

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Zhang in view of Isola.

Regarding claim 19, Zhang discloses the machine learning model of claim 17,
but does not expressly disclose “wherein the semantic representation of the first digital representation of the image includes an edge map of the first digital representation of the image”.
Isola (“Image-to-Image Translation with Conditional Adversarial Networks”, IDS) discloses “wherein the semantic representation of the first digital representation of the image includes an edge map of the first digital representation of the image” (Isola, Section 4, [Image: media_image10.png], Edges->Photo)
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention of the instant application to use an edge map in place of the text in Zhang, as suggested by Isola.
The suggestion/motivation for doing so would have been that a GAN can use a variety of different inputs to implement image generation.
Further, one skilled in the art could have combined the elements as described above by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results. 
Therefore, it would have been obvious to combine Zhang with Isola to obtain the invention as specified in claim 19.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
US 20200151938 [0056] After receiving the edge-stroke-training image 202 as an input, the NPR generator 204 generates a simplified-training image 206 of the edge-stroke-training image 202.  The term "simplified-training image" refers to a simplified image of a stroke-training image used to train a stroke-style-transfer-neural network.
US 20190362191 A1 [0035] In contrast to previous GANs that include a single discriminator that provides feedback to the generator, the evaluator consists of three sub-modules: a discriminator, a normalizer, and a semantic embedding module, that each evaluate the images generated by the generator and provide the generator with feedback.  During training, each sub-module receives as input a vector representation of an image generated by the generator that depicts an object in a normalized view, and a real image of the object in a normalized view.  The discriminator outputs a probability that the input image generated by the generator is a real image or a generated image.
US 20180028294 [0151] FIG. 18 illustrates examples of inputs and corresponding outputs of dental restorations generated by one of the methods 1050 and 1700.  In method 1700, depth maps 1800a, 1805a, and 1810a of prepared tooth are provided as inputs to the trained generative deep neural network (e.g., generator 510).  In response to the input depth maps, the trained generative deep neural network may generate images 1800b, 1805b, and 1810b that are generated by the trained generative deep neural network (generator 510) based on the input depth maps.  

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to GANDHI THIRUGNANAM whose telephone number is (571)270-3261.  The examiner can normally be reached on M-F 8:30-5PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Sumati Lefkowitz can be reached on 571-272-3638.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact 






/GANDHI THIRUGNANAM/ Primary Examiner, Art Unit 2662

        1 “of an image” is a descriptor of the digital representation.  
        2 https://developers.google.com/machine-learning/gan/gan_structure
        3 “generating a semantic representation from a first image” or “generating a semantic representation from a first digital representation, which is generated from an image”
        4 “of an image” is a descriptor of the digital representation.  
        5 “of an image” is a descriptor of the digital representation.