DETAILED ACTION
This Office Action is in response to the Application filed on October 18, 2019, which claims the benefit of U.S. Provisional Application No. 62/745605 filed on October 15, 2018. An action on the merits follows. Claims 1-20 are pending in the application.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Specification
The title of the invention is not descriptive. A new title is required that is clearly indicative of the invention to which the claims are directed. 

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-19 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claim 1, Ln. 3-4 and 5-6, recites the limitations “a first plurality of network (NN) layers” and “an initial neural network (NN) layer of the first plurality of NN layers”, respectively. However, the examiner cannot clearly ascertain whether the claimed “NN layers” refers to “network (NN) layers”, “neural network (NN) layer”, or both, since the term “NN” is defined twice in the claim. Therefore, the metes and bounds of the claim are not clearly set forth, as it is unclear which elements are encompassed by the claim language.
Claims 2-14 are rejected by virtue of their dependency upon rejected base claim 1.
For examination purposes the examiner has interpreted the claimed “a first plurality of network (NN) layers” and “an initial neural network (NN) layer of the first plurality of NN layers” of claim 1 above, as “a first plurality of neural network (NN) layers” and “an initial neural network layer of the first plurality of NN layers”, respectively.
Claim 15, Ln. 3-6, recites the limitations “a first plurality of network (NN) layers” and “an initial neural network (NN) layer of the first plurality of NN layers”, respectively. However, the examiner cannot clearly ascertain whether the claimed “NN layers” refers to “network (NN) layers”, “neural network (NN) layer”, or both, since the term “NN” is defined twice in the claim. Therefore, the metes and bounds of the claim are not clearly set forth, as it is unclear which elements are encompassed by the claim language.
Claims 16-19 are rejected by virtue of their dependency upon rejected base claim 15.
For examination purposes the examiner has interpreted the claimed “a first plurality of network (NN) layers” and “an initial neural network (NN) layer of the first plurality of NN layers” of claim 15 above, as “a first plurality of neural network (NN) layers” and “an initial neural network layer of the first plurality of NN layers”, respectively.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 2, 5, 7, and 10-20 are rejected under 35 U.S.C. 103 as being unpatentable over Poole et al. (U.S. Patent Publication No. 2020/0104640 A1), hereafter referred to as Poole, in view of Liu et al. (U.S. Patent Publication No. 2020/0012940 A1), hereafter referred to as Liu, and in further view of Risser et al. (U.S. Patent Publication No. 2018/0068463 A1), hereafter referred to as Risser.

Regarding claim 1, Poole discloses an information processing apparatus (Figs. 1 and 5), comprising:
a variational autoencoder (VAE) neural network system implemented as computer programs on one or more computers in one or more locations, and methods of training the system… When trained the VAE neural network system comprises a trained encoder neural network and a trained decoder neural network; Par. [0091-92]: processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output… Generally, a central processing unit will receive instructions and data from a read only memory or a random access memory or both. The elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data), wherein the encoder network includes a loss function (Par. [0052]: training engine 120 is configured to train the VAE neural network system 100 by back-propagating gradients of an objective function, in order to update neural network parameters 122 of the encoder neural network 104 and decoder neural network 110. The training engine uses prior distribution parameters 124 of a prior distribution of the set of latent variables in conjunction with the posterior distribution parameters to determine a divergence loss term of the objective function. The training engine 120 also determines, from an input data item and a corresponding output data item, a reconstruction loss term of the objective function which aims to match a distribution of the output data items to a distribution of the training data items; Par. 
[0072]: objective function may have the general form log p(x|z)… The reconstruction loss term log p(x|z) term can be evaluated from training data item and an output data item, e.g., by determining a cross-entropy loss or MSE (Mean Square Error) loss between the training data item and output data item; wherein the encoder network includes a loss function (e.g. training engine (i.e. the encoder network) uses prior distribution parameters of a prior distribution of the set of latent variables in conjunction with the posterior distribution parameters to determine a divergence loss term of the objective function (i.e. a loss function), as indicated above), for example) and a first plurality of [neural] network (NN) layers (Par. [0003]: Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer; Par. [0083]: an output of the neural network may define parameters of a mixture of logistics distributions (shown schematically) which are sampled to obtain pixel values, and the neural network may incorporate attention layers, e.g., attention over an output of the encoder neural network. The encoder neural network may have the same architecture as the decoder neural network; and a first plurality of neural network (NN) layers (e.g. neural networks include one or more (i.e. a first, second, third… Nth plurality of) hidden layers in addition to an output layer, as indicated above), for example); and 
a processor configured (Par. [0091]: processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output) to:
Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters… a variational autoencoder (VAE) determines a distribution for a set of latent variables representing an input data item, x. Thus the encoder determines parameters of a posterior distribution q(z|x) over the latent variables z; Par. [0021-22]: the VAE system may also be trained on video data and thus the trained encoder may encode or compress video data… a sample may be drawn from the prior and provided to the decoder to generate a sample output data item. In a system with an auxiliary neural network a sample may be provided to the auxiliary neural network to generate a sequence of latent variables which may then be provided to the decoder to generate a sample output data item… A generated data item may be a three dimensional data item such as an image sequence (video), in which case the latent variables may have a 3D feature space and the data item values may comprise pixel values for the image sequence (video). In the case of video the decoder may be an image decoder. For example, the VAE system could be trained to generate images conditional upon an additional data input defining movement/change of a viewpoint and then a sequence of sets of latent variables could be generated by applying a succession of such additional data inputs, each image being generated independently from a respective set of latent variables; input data to an initial neural network layer of the first plurality of NN layers (e.g. 
autoencoder determines a distribution for a set of latent variables representing an input data item, x (i.e. input data), by employing one or more neural network layers (i.e. an initial neural network layer of the first, second, third… Nth plurality of NN layers), as indicated above), for example);
generate a latent image as an output from a final NN layer of the first plurality of NN layers based on application of the encoder network on the input data (Par. [0003-4]: Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters… a variational autoencoder (VAE) determines a distribution for a set of latent variables representing an input data item, x. Thus the encoder determines parameters of a posterior distribution q(z|x) over the latent variables z; Par. [0021-22]: the VAE system may also be trained on video data and thus the trained encoder may encode or compress video data… a sample may be drawn from the prior and provided to the decoder to generate a sample output data item. In a system with an auxiliary neural network a sample may be provided to the auxiliary neural network to generate a sequence of latent variables which may then be provided to the decoder to generate a sample output data item… A generated data item may be a three dimensional data item such as an image sequence (video), in which case the latent variables may have a 3D feature space and the data item values may comprise pixel values for the image sequence (video). In the case of video the decoder may be an image decoder. 
For example, the VAE system could be trained to generate images conditional upon an additional data input defining movement/change of a viewpoint and then a sequence of sets of latent variables could be generated by applying a succession of such additional data inputs, each image being generated independently from a respective set of latent variables; generate a latent image as an output from a final NN layer of the first plurality of NN layers based on application of the encoder network on the input data (e.g. autoencoder determines a distribution for a set of latent variables representing an input data item, x (i.e. based on application of the encoder network on the input data), by employing one or more layers (i.e. the first, second, third… Nth plurality of NN layers), including one or more hidden layers in addition to an output layer (i.e. a final NN layer of the first plurality of NN layers), to predict an output for the received input, including a generated three dimensional data item, such as an image sequence (video) sequence (i.e. generate a latent image as an output from a final NN layer of the first plurality of NN layers), as indicated above), for example);
estimate a distance between the generated latent image and a reference image based on the loss function (Par. [0052]: training engine 120 is configured to train the VAE neural network system 100 by back-propagating gradients of an objective function, in order to update neural network parameters 122 of the encoder neural network 104 and decoder neural network 110. The training engine uses prior distribution parameters 124 of a prior distribution of the set of latent variables in conjunction with the posterior distribution parameters to determine a divergence loss term of the objective function. The training engine 120 also determines, from an input data item and a corresponding output data item, a reconstruction loss term of the objective function which aims to match a distribution of the output data items to a distribution of the training data items; Par. [0072]: objective function may have the general form log p(x|z)… The reconstruction loss term log p(x|z) term can be evaluated from training data item and an output data item, e.g., by determining a cross-entropy loss or MSE (Mean Square Error) loss between the training data item and output data item; estimate a distance between the generated latent image and a reference image based on the loss function (e.g. training engine (neural network) determines, from an input data item (i.e. a reference image) and a corresponding output data item (i.e. the generated latent image), a reconstruction loss term of the objective function (i.e. the loss function) to match a distribution of output data items to a distribution of training data items, including the reconstruction loss term log p(x|z), which is evaluated from a training data item and an output data item by determining a cross-entropy loss or MSE (Mean Square Error) loss between the training data item and the output data item (i.e. estimate a distance (difference, error, change, divergence, etc.) between the generated latent image and a reference image based on the loss function), as indicated above), for example);
update the encoder network based on the estimated distance; and 
generated data item may be a three dimensional data item such as an image sequence (video), in which case the latent variables may have a 3D feature space and the data item values may comprise pixel values for the image sequence (video). In the case of video the decoder may be an image decoder. For example, the VAE system could be trained to generate images conditional upon an additional data input defining movement/change of a viewpoint and then a sequence of sets of latent variables could be generated by applying a succession of such additional data inputs, each image being generated independently from a respective set of latent variables… the trained VAE system and/or encoder/decoder may be used for image or other data item processing tasks such as an image or other data item… the VAE system may be used to make a personalized recommendation for a user. For example the latent variables may be used to characterize a user's taste in data items. For example where the system is trained using data items comprising identifiers of items/content which a user has selected… the VAE system, and in particular the trained decoder, may be used to generate further examples of data items for training another machine learning system. For example the VAE system may be trained on a set of data items and then a set of latent variables may be determined and used generate new data items similar to those in the training data set. The set of latent variables may be determined by sampling from the (prior) distribution of latent variables and/or using the auxiliary neural network; Par. 
[0050-52]: data items are provided to an encoder neural network 104 which outputs a set of parameters 106 defining a posterior distribution of a set of latent variables, e.g., defining the mean and variance… The set of latent variables may define values for a latent variable data structure such as a latent variable vector z… The latent variables are processed using a decoder neural network 110 which generates a data item output 112. In some implementations the decoder neural network 110 generates the data item directly; in others it generates parameters of an output data item distribution which is sampled to obtain an example output data item. For example the decoder output may specify parameters of a distribution of the intensity of each pixel (or color sub-pixel) of an image… training engine 120 is configured to train the VAE neural network system 100 by back-propagating gradients of an objective function, in order to update neural network parameters 122 of the encoder neural network 104 and decoder neural network 110. The training engine uses prior distribution parameters 124 of a prior distribution of the set of latent variables in conjunction with the posterior distribution parameters to determine a divergence loss term of the objective function. The training engine 120 also determines, from an input data item and a corresponding output data item, a reconstruction loss term of the objective function which aims to match a distribution of the output data items to a distribution of the training data items; Par. [0070-72]: set of latent variables is then processed by the decoder neural network 110 to obtain an output data item (step 308), either directly or, e.g., by sampling from a multivariate distribution parameterized by an output of the decoder neural network. 
The process then backpropagates gradients of an objective function of the type previously described to update the parameters of the encoder and decoder neural networks… The encoder neural network generates an output defining the mean… and variance… of a distribution for the latent variable… the process of FIG. 3 may be repeated until convergence of the neural network parameters… objective function may have the general form log p(x|z)… The reconstruction loss term log p(x|z) term can be evaluated from training data item and an output data item, e.g., by determining a cross-entropy loss or MSE (Mean Square Error) loss between the training data item and output data item; update the encoder network based on the estimated distance; and output the updated encoder network as a trained encoder network based on the estimated distance (e.g. trained VAE neural network system comprises a trained encoder neural network and a trained decoder neural network (i.e. a trained encoder network), in which latent variables representing input data item(s) are processed using encoder neural network to generate data item output (s), including encoder neural network which generates an output defining the mean and variance of a distribution for the latent variable, which includes a reconstruction loss term of the objective function (i.e. the loss function), such as reconstruction loss term log p(x|z) term above, by determining a cross-entropy loss or MSE (Mean Square Error) loss between the training data item and output data item (i.e. the estimated distance), and the process includes back-propagating gradients of the objective function (i.e. the loss function) to update the parameters of the encoder and decoder neural networks until convergence of the neural network parameters (i.e. output the updated encoder network as a trained encoder network based on the estimated distance), as indicated above), for example).
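For illustration only, the training flow mapped above (apply the encoder to an input data item, evaluate an MSE loss against a reference item, back-propagate to update the encoder parameters, and output the updated encoder) can be sketched in Python as follows. The single linear layer, the function names, the toy values, and the learning rate are assumptions made for this sketch and do not appear in Poole.

```python
# Illustrative sketch only: a one-layer linear "encoder" trained by gradient
# descent on an MSE loss. All names and values are hypothetical.

def encode(weights, x):
    """Apply a single linear layer: one output value per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def mse(output, reference):
    """Mean squared error, used here as the 'distance' between two items."""
    return sum((o - r) ** 2 for o, r in zip(output, reference)) / len(output)

def train_step(weights, x, reference, lr=0.1):
    """Update the encoder weights by one gradient step on the MSE loss."""
    output = encode(weights, x)
    n = len(output)
    # d(MSE)/d(w[i][j]) = (2 / n) * (output[i] - reference[i]) * x[j]
    return [
        [w - lr * (2.0 / n) * (output[i] - reference[i]) * xj
         for w, xj in zip(row, x)]
        for i, row in enumerate(weights)
    ]

def train(weights, x, reference, steps=200, lr=0.1):
    """Repeat the update and return the updated ('trained') weights."""
    for _ in range(steps):
        weights = train_step(weights, x, reference, lr)
    return weights

x = [1.0, 0.5, -0.5]                  # toy input data item
reference = [0.2, 0.8]                # toy reference item
initial = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
trained = train(initial, x, reference)
initial_distance = mse(encode(initial, x), reference)
final_distance = mse(encode(trained, x), reference)
```

In this sketch the MSE value plays the role of the estimated distance, and the weights returned by train correspond to the updated encoder.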

Poole teaches the apparatus, as indicated above, but fails to teach the following, as further recited in claim 1.
However, Liu teaches input volume data (Par. [0038-40]: a deep neural network that transforms a three dimensional (3D) input volume into a 3D output volume… The input layer may comprise the raw pixel data of an image and the output layer may comprise a single vector of class scores of the image along the depth dimension. In this example, the input layer may hold pixels data of the input image frames 110. The three dimensions of the 3D input volume of the input layer may include height, width, and depth. The depth dimension of the input layer may be a color of one or more image frames 110 (e.g., Red, Green, Blue (RGB) channels for each frame)… Each convolutional layer may apply a convolution operation to (or "convolve") a corresponding kernel with the input volume, and may pass a result to a next layer; input volume data (e.g. deep neural network that transforms a three dimensional (3D) input volume (i.e. input volume data) into a 3D output volume via an input layer, which holds pixel data of the input image frames, and the three dimensions of the 3D input volume of the input layer include height, width, and depth, as indicated above), for example).
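For illustration only, the convolution of a kernel with a three dimensional input volume (height, width, and a depth of color channels), as described in the cited passages, can be sketched in Python as follows. The function name, the indexing convention volume[row][col][channel], and the toy values are assumptions made for this sketch and are not taken from Liu.

```python
# Illustrative sketch only; shapes and names are hypothetical.

def convolve_volume(volume, kernel):
    """Slide a kernel over an input volume indexed as
    volume[row][col][channel] (no padding, stride 1), summing the
    element-wise products over height, width, and depth."""
    height, width, depth = len(volume), len(volume[0]), len(volume[0][0])
    k_h, k_w = len(kernel), len(kernel[0])
    out = []
    for r in range(height - k_h + 1):
        out_row = []
        for c in range(width - k_w + 1):
            total = 0.0
            for dr in range(k_h):
                for dc in range(k_w):
                    for ch in range(depth):
                        total += volume[r + dr][c + dc][ch] * kernel[dr][dc][ch]
            out_row.append(total)
        out.append(out_row)
    return out

# 3 x 3 image with 3 (RGB) channels; every pixel is (1.0, 2.0, 3.0).
volume = [[[1.0, 2.0, 3.0] for _ in range(3)] for _ in range(3)]
# 2 x 2 kernel with one weight per channel, all equal to 1.0.
kernel = [[[1.0, 1.0, 1.0] for _ in range(2)] for _ in range(2)]
result = convolve_volume(volume, kernel)
```

Each output value here is the sum over a 2 x 2 window of all three channel values, which is the result that would be passed to a next layer.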
Poole and Liu are considered to be analogous art because they pertain to image processing applications using machine learning techniques. Therefore, it would have 
The combination of Poole and Liu, as a whole, teaches the apparatus, as indicated above, but fails to teach the following, as further recited in claim 1. 
 However, Risser teaches a trained encoder network based on the estimated distance being a minimum (Par. [0023]: the optimizing is performed to minimize to a loss function that includes the content loss function, a style loss function, and a histogram loss function; Par. [0073]: a CNN backpropagation training procedure may be used as the iterative optimization process to turn the… content image into an image that combines features of the content and style images. During backpropagation… the iterative optimization process can be directed by a loss function (equation 4) that the backpropagation training procedure is trying to minimize… the loss function is calculated as the difference between parametric models encoding the style of a style image and the image being synthesized… a content loss can be included as well, where the content loss is some distance metric between raw neural activations calculated for the content image and the image being synthesized; Par. [0100]: Style loss functions reproduce the textural component of the style image… style loss function approach may generate the global style loss function by applying the source style image to a CNN, gathering all activations for a layer in a CNN and building a parametric model from the gathered activations of the layer. An optimization process may then be used to cause the loss function of one image to appear statistically similar to the loss function of another image by minimizing the error distance between the parametric model of the loss functions of the two images (which act as a statistical fingerprint that is being matched); Par. [0120]: CNN backpropagation may be used to provide a style transfer process using global and/or local content loss. 
The use of CNN backpropagation can allow the image to be thought of as a point in a super-high dimensional space (a dimension for each color channel in each pixel of the image)… the combined loss function for style and content as well as optimizing towards a local minimum of the function, depending on where the noise commences in this space; a trained encoder network based on the estimated distance being a minimum (e.g. CNN backpropagation training procedure is used as the iterative optimization process to turn content image data into an image that combines features of the content and style images based on a loss function that the backpropagation training procedure is trying to minimize (i.e. a trained encoder network based on the estimated distance being a minimum), including an optimization process which is used to cause the loss function of one image to appear statistically similar to the loss function of another image by minimizing the error distance (distance metric) between the parametric model of the loss functions of the two images (i.e. the estimated distance being a minimum), as indicated above), for example).


Regarding claim 2, claim 1 is incorporated and the combination of Poole, Liu, and Risser, as a whole, teaches the apparatus (Poole, Figs. 1 and 5), wherein the auto-encoder is a Deep Neural Network (DNN) (Liu, Par. [0037-40]: robust video frame interpolation that may achieve frame interpolation, which may use a deep convolutional neural network… a deep neural network that transforms a three dimensional (3D) input volume into a 3D output volume… The input layer may comprise the raw pixel data of an image and the output layer may comprise a single vector of class scores of the image along the depth dimension. In this example, the input layer may hold pixels data of the input image frames 110. The three dimensions of the 3D input volume of the input layer may include height, width, and depth. The depth dimension of the input layer may be a color of one or more image frames 110 (e.g., Red, Green, Blue (RGB) channels for each frame)… Each convolutional layer may apply a convolution operation to (or "convolve") a corresponding kernel with the input volume, and may pass a result to a next layer; Par. [0099]: an example of a fully convolutional neural network architecture (CNNA) 1300 in accordance with various embodiments. In implementations, the CNNA 1300 may be or may be referred to as an "encoder-decoder network" 1300; wherein the auto-encoder is a Deep Neural Network (DNN) (e.g. video frame interpolation uses a deep convolutional neural network, such as a deep neural network that transforms a three dimensional (3D) input volume into a 3D output volume, including convolutional neural network architecture (CNNA) 1300, also referred to as an "encoder-decoder network" (i.e. the auto-encoder), as indicated above), for example).

Regarding claim 5, claim 1 is incorporated and the combination of Poole, Liu, and Risser, as a whole, teaches the apparatus (Poole, Figs. 1 and 5), wherein the generated latent image is a 3-channel RGB image (Liu, Par. [0037-40]: video frame interpolation that may achieve frame interpolation, which may use a deep convolutional neural network… a deep neural network that transforms a three dimensional (3D) input volume into a 3D output volume… The input layer may comprise the raw pixel data of an image and the output layer may comprise a single vector of class scores of the image along the depth dimension. In this example, the input layer may hold pixels data of the input image frames 110. The three dimensions of the 3D input volume of the input layer may include height, width, and depth. The depth dimension of the input layer may be a color of one or more image frames 110 (e.g., Red, Green, Blue (RGB) channels for each frame)… Each convolutional layer may apply a convolution operation to (or "convolve") a corresponding kernel with the input volume, and may pass a result to a next layer; Par. [0101]: CNNA 1300 may extract features that are given to four sub-networks that each estimate one of the four 1D kernels for each output pixel… each color channel may be treated equally, and the same 1D kernels may be applied to each of the Red-Green-Blue (RGB) channels to synthesize the output pixel; wherein the generated latent image is a 3-channel RGB image (e.g. synthesize the output pixel(s) for each generated image (i.e. the generated latent image), including Red, Green, Blue (RGB) channels for each frame (image), as indicated above), for example) and is a 2D latent representation of the input volume data (Poole, Par. [0022]: a generated data item may be a two dimensional data item such as an image, in which case the latent variables may have a 2D feature space and the data item values may comprise pixel values for the image such as brightness and/or color values. 
A generated data item may be a three dimensional data item such as an image sequence (video), in which case the latent variables may have a 3D feature space and the data item values may comprise pixel values for the image sequence (video); and is a 2D latent representation of the input volume data (e.g. autoencoder determines a distribution for a set of latent variables representing an input data item, x (i.e. input data) to predict an output for the received input, such as a generated data item, including a two dimensional data item such as an image (i.e. a 2D latent representation of the input data), as indicated above), for example).

Regarding claim 7, claim 1 is incorporated and the combination of Poole, Liu, and Risser, as a whole, teaches the apparatus (Poole, Figs. 1 and 5), wherein the processor is further configured to receive an input for a selection of a color image as the reference image from a set of color images (Risser, Par. [0058]: CNN may be used to determine localized loss functions for groups of pixels in the source content and/or source style images. The localized content and/or localized style loss functions may be used to generate a synthesized image that includes the content from the source content image and the texture from the source style image; Par. [0154-155]: auto-encoder to process color images… synthesis strategy involves using some color texture generated using another process as input. In addition, an exemplar material is given as input, where this material contains at least one map that is similar in appearance and purpose as the input color map. The input color map is then used as a guide to direct the synthesis of the full material. This is done through a nearest neighbor search where a pixel/patch is found in one of the maps in the material that is similar to a pixel/patch in the input color image; wherein the processor is further configured to receive an input for a selection of a color image as the reference image from a set of color images (e.g. source style images (i.e. a set of color images) including input color image (i.e. receive an input for a selection of a color image), as indicated above), for example).
The same motivation to combine the above-mentioned teachings applies, as previously indicated in claim 1.

Regarding claim 10, claim 1 is incorporated and the combination of Poole, Liu, and Risser, as a whole, teaches the apparatus (Poole, Figs. 1 and 5), wherein the loss function is a color loss function which indicates a color loss in the generated latent image with respect to the reference image (Risser, Par. [0100]: Style loss functions reproduce the textural component of the style image… style loss function approach may generate the global style loss function by applying the source style image to a CNN, gathering all activations for a layer in a CNN and building a parametric model from the gathered activations of the layer. An optimization process may then be used to cause the loss function of one image to appear statistically similar to the loss function of another image by minimizing the error distance between the parametric model of the loss functions of the two images (which act as a statistical fingerprint that is being matched); Par. [0120]: CNN backpropagation may be used to provide a style transfer process using global and/or local content loss. The use of CNN backpropagation can allow the image to be thought of as a point in a super-high dimensional space (a dimension for each color channel in each pixel of the image)… the combined loss function for style and content as well as optimizing towards a local minimum of the function, depending on where the noise commences in this space; wherein the loss function is a color loss function which indicates a color loss in the generated latent image with respect to the reference image (e.g. loss function for style and content is used depending on where the noise commences in this space, including a dimension for each color channel in each pixel of the two images (i.e. a color loss function which indicates a color loss in the generated latent image with respect to the reference image), as indicated above), for example).

Regarding claim 11, claim 1 is incorporated and the combination of Poole, Liu, and Risser, as a whole, teaches the apparatus (Poole, Figs. 1 and 5), wherein the processor is further configured to:
input the volume data to the initial NN layer of the trained encoder network (Liu, Par. [0038-40]: a deep neural network that transforms a three dimensional (3D) input volume into a 3D output volume… The input layer may comprise the raw pixel data of an image and the output layer may comprise a single vector of class scores of the image along the depth dimension. In this example, the input layer may hold pixels data of the input image frames 110. The three dimensions of the 3D input volume of the input layer may include height, width, and depth. The depth dimension of the input layer may be a color of one or more image frames 110 (e.g., Red, Green, Blue (RGB) channels for each frame)… Each convolutional layer may apply a convolution operation to (or "convolve") a corresponding kernel with the input volume, and may pass a result to a next layer; input the volume data to the initial NN layer of the trained encoder network (e.g. deep neural network that transforms a three dimensional (3D) input volume (i.e. input volume data) into a 3D output volume via input layer, which hold pixels data of the input image frames, and the three dimensions of the 3D input volume of the input layer (i.e. the initial NN layer of the trained encoder network) include height, width, and depth, as indicated above), for example); and
generate a color-shifted latent image as an output from the final NN layer of the trained encoder network, based on the application of the trained encoder network on the input volume data (Liu, Par. [0038-40]: a deep neural network that transforms a three dimensional (3D) input volume into a 3D output volume… The input layer may comprise the raw pixel data of an image and the output layer may comprise a single vector of class scores of the image along the depth dimension. In this example, the input layer may hold pixels data of the input image frames 110. The three dimensions of the 3D input volume of the input layer may include height, width, and depth. The depth dimension of the input layer may be a color of one or more image frames 110 (e.g., Red, Green, Blue (RGB) channels for each frame)… Each convolutional layer may apply a convolution operation to (or "convolve") a corresponding kernel with the input volume, and may pass a result to a next layer… for each output pixel (x, y), a convolution kernel K may be estimated and used to convolve with patches P1 and P2 centered at (x, y) in the input frames 110 to produce color I(x, y) of the output frame 115; Par. [0051-56]: estimate a convolution kernel and use the kernel to convolve the two frames to interpolate the pixel color… the color of pixel (x, y) in the target image to be interpolated can be obtained by convolving a proper kernel K over input patches P1(x, y) and P2(x, y), which may also be centered at (x, y) in the respective input images… patches P1 and P2 that the output kernel may convolve in order to produce the color for the output pixel (x, y); Par. 
[0091-92]: embodiments may employ phase-based interpolation that represents motion in the phase shift of individual pixels and generates intermediate frames by per-pixel phase modification… estimate a convolution kernel and use the kernel to convolve the two frames to interpolate the pixel color… the color of pixel (x, y) in the target image to be interpolated can be obtained by convolving a proper kernel K over input patches P1(x, y) and P2(x, y), which may also be centered at (x, y) in the respective input images… patches P1 and P2 that the output kernel may convolve in order to produce the color for the output pixel (x, y)… applying deep learning algorithms to optical flow estimation, style transfer, and image enhancement. Embodiments may employ deep neural networks for view synthesis. Some embodiments may render unseen views from input images for objects like faces and chairs, instead of complex real-world scenes. The DeepStereo method, for example, may generate a novel natural image by projecting input images onto multiple depth planes and combining colors at these depth planes to create the novel view. A view expansion operation for light field imaging may use two sequential convolutional neural networks to model disparity and color estimation operations of view interpolation, and these two networks may be trained simultaneously; and generate a color-shifted latent image as an output from the final NN layer of the trained encoder network, based on the application of the trained encoder network on the input volume data (e.g. deep neural network that transforms a three dimensional (3D) input volume into a 3D output volume by using kernel to convolve two frames to interpolate the pixel color (i.e. the trained encoder network on the input volume data), including phase-based interpolation that represents motion in the phase shift of individual pixels and generates intermediate frames by per-pixel phase modification (i.e. a color-shifted 
The same motivation to combine above-mentioned teachings applies, as previously indicated in claim 1.

Regarding claim 12, claim 1 is incorporated and the combination of Poole, Liu, and Risser, as a whole, teaches the apparatus (Poole, Figs. 1 and 5), wherein the processor is further configured to generate a style-transferred image based on application of a neural style transfer function on the generated latent image, and
wherein the neural style transfer function is based on a style transfer neural network trained to output the style-transferred image (Risser, Par. [0024]: performing style transfer in an image synthesis system where a synthesized image is generated with content from a source content image and texture from a source style image includes receiving a source content image that includes desired content for a synthesized image in the image synthesis system, receiving a source style image that includes a desired texture for the synthesized image in the image synthesis system, determining a localized loss function a pixel in at least one of the source content image and the source style image using the image synthesis system, and generating the synthesized image; Par. [0069-71]: CNN used for image style transfer… CNN-based image synthesis processes that perform style transfer synthesis operate in a similar manner to the texture synthesis process described above. However, a CNN-based image synthesis system receives a content image, C, and a style image, S, that are used to generate a styled image… The content loss is a feature distance between content and output that attempts to make output and content look similar; wherein the processor is further configured to generate a style-transferred image based on application of a neural style transfer function on the generated latent image, and wherein the neural style transfer function is based on a style transfer neural network trained to output the style-transferred image (e.g. CNN-based image synthesis processes that perform style transfer synthesis by performing style transfer in an image synthesis system where a synthesized image is generated (i.e. generate a style-transferred image) with content from a source content image and texture from a source style image, including receiving a source content image that includes desired content for a synthesized image in the image synthesis system (i.e. 
based on application of a neural style transfer function on the generated latent image), including receiving a source style image that includes a desired texture for the synthesized image in the image synthesis system (i.e. a neural style transfer function on the generated latent image), determining a localized loss function a pixel in at least one of the source content image and the source style image using the image synthesis system, and generating the synthesized image (i.e. a style transfer neural network trained to output the style-transferred image), as indicated above), for example).
The same motivation to combine above-mentioned teachings applies, as previously indicated in claim 1.

Regarding claim 13, claim 12 is incorporated and the combination of Poole, Liu, and Risser, as a whole, teaches the apparatus (Poole, Figs. 1 and 5), wherein the processor is further configured to:
input the generated style-transferred image to the decoder network; and
generate style-transferred volume data as an output of the decoder network based on application of the decoder network on the input style-transferred image (Risser, Par. [0145-146]: CNN-based image synthesis processes re-purpose style transfer to generate a continuous and progressive aging/de-aging process in a multiscale pyramid framework… Processes in accordance with many embodiments use the same concept to synthesize time sequences in a multiscale pyramid framework. These processes may bootstrap the animation by synthesizing the first frame in the sequence using the strategy described above. After the first frame is generated, subsequent frames can be created by using the frame before as a prior frame. As such, at any given point in time, two image pyramids are stored in memory, the pyramid for the previous frame and the pyramid for the current frame being synthesized. The synthesis order is illustrated in FIG. 23. As the multiple image sizes may be synthesized in parallel, processes in accordance with a number of embodiments may store an optimizer state for each pyramid level. When synthesizing the first frame in the sequence, the base of the pyramid may use white noise as a prior frame to start the synthesis and then each subsequent pyramid level starts from the final result of the previous level that is bi-linearly re-sized to the correct resolution… For all subsequent frames synthesized, a new image pyramid may be synthesized. In accordance with a number of embodiments, the first level of the new pyramid uses the first level of the previous frame as a prior image. 
For higher layers in the pyramid, the same layer from the previous frame is used as a prior image and a content loss is introduced by re-sizing the previous layer in the same frame, this content image can be seen as a blurry version of the desired result; input the generated style-transferred image to the decoder network; and generate style-transferred volume data as an output of the decoder network based on application of the decoder network on the input style-transferred image (e.g. CNN-based image synthesis processes that perform style transfer synthesis by performing style transfer in an image synthesis system where a synthesized image/frame is generated (i.e. the generated style-transferred image) and after the first frame is generated, subsequent frames can be created by using the frame before as a prior frame (i.e. input the generated style-transferred image to the decoder network), and for all subsequent frames synthesized, a new image pyramid is synthesized and the first level of the new pyramid uses the first level of the previous frame as a prior image (i.e. generate style-transferred volume data as an output of the decoder network based on application of the decoder network on the input style-transferred image), as indicated above), for example).
The same motivation to combine above-mentioned teachings applies, as previously indicated in claim 1.

Regarding claim 14, claim 1 is incorporated and the combination of Poole, Liu, and Risser, as a whole, teaches the apparatus (Poole, Figs. 1 and 5), wherein the processor is further configured to:
input the generated latent image to an initial NN layer of a second plurality of NN layers of the decoder network (Poole, Par. [0003-4]: Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters… a variational autoencoder (VAE) determines a distribution for a set of latent variables representing an input data item, x. Thus the encoder determines parameters of a posterior distribution q(z|x) over the latent variables z; Par. [0021-22]: the VAE system may also be trained on video data and thus the trained encoder may encode or compress video data… a sample may be drawn from the prior and provided to the decoder to generate a sample output data item. In a system with an auxiliary neural network a sample may be provided to the auxiliary neural network to generate a sequence of latent variables which may then be provided to the decoder to generate a sample output data item… A generated data item may be a three dimensional data item such as an image sequence (video), in which case the latent variables may have a 3D feature space and the data item values may comprise pixel values for the image sequence (video). In the case of video the decoder may be an image decoder. For example, the VAE system could be trained to generate images conditional upon an additional data input defining movement/change of a viewpoint and then a sequence of sets of latent variables could be generated by applying a succession of such additional data inputs, each image being generated independently from a respective set of latent variables; input the generated latent image to an initial NN layer of a second plurality of NN layers of the decoder network (e.g. autoencoder determines a distribution for a set of latent variables representing an input data item, x (i.e. input data), by employing one or more neural network layers (i.e. an initial NN layer of a second, third… Nth plurality of NN layers), including a sample drawn from the prior and provided to the decoder to generate a sample output data item (i.e. input the generated latent image to an initial NN layer of a second, third… Nth plurality of NN layers of the decoder network), as indicated above), for example);
generate reconstructed volume data as an output from a final NN layer of the second plurality of NN layers based on application of the decoder network on the generated latent image; and
estimate a reconstruction error between the reconstructed volume data and the input volume data (Poole, Par. [0052]: training engine 120 is configured to train the VAE neural network system 100 by back-propagating gradients of an objective function, in order to update neural network parameters 122 of the encoder neural network 104 and decoder neural network 110. The training engine uses prior distribution parameters 124 of a prior distribution of the set of latent variables in conjunction with the posterior distribution parameters to determine a divergence loss term of the objective function. The training engine 120 also determines, from an input data item and a corresponding output data item, a reconstruction loss term of the objective function which aims to match a distribution of the output data items to a distribution of the training data items; Par. [0072]: objective function may have the general form log p (x|z)… The reconstruction loss term log p(x|z) term can be evaluated from training data item and an output data item, e.g., by determining a cross-entropy loss or MSE (Mean Square Error) loss between the training data item and output data item; generate reconstructed volume data as an output from a final NN layer of the second plurality of NN layers based on application 
update both the encoder network and the decoder network based on the estimated reconstruction error; and
output the updated decoder network and the updated encoder network based on the estimated reconstruction error (Poole, Par. [0022-27]: generated data item may be a three dimensional data item such as an image sequence (video), in which case the latent variables may have a 3D feature space and the data item values may comprise pixel values for the image sequence (video). In the case of video the decoder may be an image decoder. For example, the VAE system could be trained to generate images conditional upon an additional data input defining movement/change of a viewpoint and then a sequence of sets of latent variables could be generated by applying a succession of such additional data inputs, each image being generated independently from a respective set of latent variables… the trained VAE system and/or encoder/decoder may be used for image or other data item processing tasks such as an image or other data item… the VAE system may be used to make a personalized recommendation for a user. For example the latent variables may be used to characterize a user's taste in data items. For example where the system is trained using data items comprising identifiers of items/content which a user has selected… the VAE system, and in particular the trained decoder, may be used to generate further examples of data items for training another machine learning system. For example the VAE system may be trained on a set of data items and then a set of latent variables may be determined and used generate new data items similar to those in the training data set. The set of latent variables may be determined by sampling from the (prior) distribution of latent variables and/or using the auxiliary neural network; Par. 
[0050-52]: data items are provided to an encoder neural network 104 which outputs a set of parameters 106 defining a posterior distribution of a set of latent variables, e.g., defining the mean and variance… The set of latent variables may define values for a latent variable data structure such as a latent variable vector z… The latent variables are processed using a decoder neural network 110 which generates a data item output 112. In some implementations the decoder neural network 110 generates the data item directly; in others it generates parameters of an output data item distribution which is sampled to obtain an example output data item. For example the decoder output may specify parameters of a distribution of the intensity of each pixel (or color sub-pixel) of an image… training engine 120 is configured to train the VAE neural network system 100 by back-propagating gradients of an objective function, in order to update neural network parameters 122 of the encoder neural network 104 and decoder neural network 110. The training engine uses prior distribution parameters 124 of a prior distribution of the set of latent variables in conjunction with the posterior distribution parameters to determine a divergence loss term of the objective function. The training engine 120 also determines, from an input data item and a corresponding output data item, a reconstruction loss term of the objective function which aims to match a distribution of the output data items to a distribution of the training data items; Par. [0070-72]: set of latent variables is then processed by the decoder neural network 110 to obtain an output data item (step 308), either directly or, e.g., by sampling from a multivariate distribution parameterized by an output of the decoder neural network. 
The process then backpropagates gradients of an objective function of the type previously described to update the parameters of the encoder and decoder neural networks… The encoder neural network generates an output defining the mean… and variance… of a distribution for the latent variable… the process of FIG. 3 may be repeated until convergence of the neural network parameters… objective function may have the general form log p(x|z)… The reconstruction loss term log p(x|z) term can be evaluated from training data item and an output data item, e.g., by determining a cross-entropy loss or MSE (Mean Square Error) loss between the training data item and output data item; update both the encoder network and the decoder network based on the estimated reconstruction error; and output the updated decoder network and the updated encoder network based on the estimated the optimizing is performed to minimize to a loss function that includes the content loss function, a style loss function, and a histogram loss function; Par. [0073]: a CNN backpropagation training procedure may be used as the iterative optimization process to turn the… content image into an image that combines features of the content and style images. During backpropagation… the iterative optimization process can be directed by a loss function (equation 4) that the backpropagation training procedure is trying to minimize… the loss function is calculated as the difference between parametric models encoding the style of a style image and the image being synthesized… a content loss can be included as well, where the content loss is some distance metric between raw neural activations calculated for the content image and the image being synthesized; Par. 
[0100]: Style loss functions reproduce the textural component of the style image… style loss function approach may generate the global style loss function by applying the source style image to a CNN, gathering all activations for a layer in a CNN and building a parametric model from the gathered activations of the layer. An optimization process may then be used to cause the loss function of one image to appear statistically similar to the loss function of another image by minimizing the error distance between the parametric model of the loss functions of the two images (which act as a statistical fingerprint that is being matched); Par. [0120]: CNN backpropagation may be used to provide a style transfer process using global and/or local content loss. The use of CNN backpropagation can allow the image to be thought of as a point in a super-high dimensional space (a dimension for each color channel in each pixel of the image)… the combined loss function for style and content as well as optimizing towards a local minimum of the function, depending on where the noise commences in this space; the estimated reconstruction error (e.g. CNN backpropagation training procedure is used as the iterative optimization process to turn content image data into an image that combines features of the content and style images based on a loss function that the backpropagation training procedure is trying to minimize, including an optimization process which is used to cause the loss function of one image to appear statistically similar to the loss function of another image by minimizing the error distance (distance 
The same motivation to combine above-mentioned teachings applies, as previously indicated in claim 1.

Regarding claim 15, it is a corresponding method claim and is rejected as applied to apparatus claim 1 above. The recited steps of claim 15 correspond to claim 1 when executed.

Regarding claim 16, claim 15 is incorporated, and claim 16 is a corresponding method claim rejected as applied to apparatus claim 11 above.

Regarding claim 17, claim 15 is incorporated, and claim 17 is a corresponding method claim rejected as applied to apparatus claim 12 above.

Regarding claim 18, claim 17 is incorporated, and claim 18 is a corresponding method claim rejected as applied to apparatus claim 13 above.

Regarding claim 19, claim 15 is incorporated, and claim 19 is a corresponding method claim rejected as applied to apparatus claim 14 above.

Regarding claim 20, Poole discloses a method (Par. [0042-43]: a variational autoencoder (VAE) neural network system implemented as computer programs on one or more computers in one or more locations, and methods of training the system), comprising:
providing an auto-encoder comprising an encoder network and a decoder network (Par. [0042-43]: a variational autoencoder (VAE) neural network system implemented as computer programs on one or more computers in one or more locations, and methods of training the system… When trained the VAE neural network system comprises a trained encoder neural network and a trained decoder neural network),
wherein the encoder network is trained to generate a latent image based on an input of data to the encoder network (Par. [0003-4]: Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters… a variational autoencoder (VAE) determines a distribution for a set of latent variables representing an input data item, x. Thus the encoder determines parameters of a posterior distribution q(z|x) over the latent variables z; Par. [0021-22]: the VAE system may also be trained on video data and thus the trained encoder may encode or compress video data… a sample may be drawn from the prior and provided to the decoder to generate a sample output data item. In a system with an auxiliary neural network a sample may be provided to the auxiliary neural network to generate a sequence of latent variables which may then be provided to the decoder to generate a sample output data item… A generated data item may be a three dimensional data item such as an image sequence (video), in which case the latent variables may have a 3D feature space and the data item values may comprise pixel values for the image sequence (video). In the case of video the decoder may be an image decoder. 
For example, the VAE system could be trained to generate images conditional upon an additional data input defining movement/change of a viewpoint and then a sequence of sets of latent variables could be generated by applying a succession of such additional data inputs, each image being generated independently from a respective set of latent variables; wherein the encoder network is trained to generate a latent image based on an input of data to the encoder network (e.g. autoencoder determines a distribution for a set of latent variables representing an input data item, x, to predict an output for the received input (i.e. based on an input of data to the encoder network), including a generated three dimensional data item, such as an image sequence (video) (i.e. generate a latent image), as indicated above), for example), and
wherein the encoder network is trained based on a loss function which measures a distance between the latent image and the data (Par. [0052]: training engine 120 is configured to train the VAE neural network system 100 by back-propagating gradients of an objective function, in order to update neural network parameters 122 of the encoder neural network 104 and decoder neural network 110. The training engine uses prior distribution parameters 124 of a prior distribution of the set of latent variables in conjunction with the posterior distribution parameters to determine a divergence loss term of the objective function. The training engine 120 also determines, from an input data item and a corresponding output data item, a reconstruction loss term of the objective function which aims to match a distribution of the output data items to a distribution of the training data items; Par. [0072]: objective function may have the general form log p(x|z)… The reconstruction loss term log p(x|z) term can be evaluated from training data item and an output data item, e.g., by determining a cross-entropy loss or MSE (Mean Square Error) loss between the training data item and output data item; wherein the encoder network is trained based on a loss function which measures a distance between the latent image and the data (e.g. training engine (neural network) determines, from an input data item (i.e. the data) and a corresponding output data item (i.e. the latent image), a reconstruction loss term of the objective function (i.e. based on a loss function) to match a distribution of output data items to a distribution of training data items, including reconstruction loss term log p(x|z) term, which is evaluated from training data item and an output data item by determining a cross-entropy loss or MSE (Mean Square Error) loss between the training data item and the output data item (i.e. a loss function which measures a distance (differences, errors, changes, divergences, etc.) between the latent image and the data), as indicated above), for example).
Although the teachings of Poole above disclose a set of latent variables representing an input data item, x (i.e. input data), and generating data items including three dimensional (3D) data item(s), such as an image sequence (video), in which case the latent variables have a 3D feature space and the data item values comprise pixel values 
However, Liu teaches input of volume data (Par. [0038-40]: a deep neural network that transforms a three dimensional (3D) input volume into a 3D output volume… The input layer may comprise the raw pixel data of an image and the output layer may comprise a single vector of class scores of the image along the depth dimension. In this example, the input layer may hold pixels data of the input image frames 110. The three dimensions of the 3D input volume of the input layer may include height, width, and depth. The depth dimension of the input layer may be a color of one or more image frames 110 (e.g., Red, Green, Blue (RGB) channels for each frame)… Each convolutional layer may apply a convolution operation to (or "convolve") a corresponding kernel with the input volume, and may pass a result to a next layer; input volume data (e.g. deep neural network that transforms a three dimensional (3D) input volume (i.e. input of volume data) into a 3D output volume via input layer, which hold pixels data of the input image frames, and the three dimensions of the 3D input volume of the input layer include height, width, and depth, as indicated above), for example).
The same motivation to combine above-mentioned teachings applies, as previously indicated in claim 1.
The combination of Poole and Liu, as a whole, teaches the method, as indicated above, but fails to teach the following, as further recited in claim 20. 
However, Risser teaches generating a style-transferred image based on application of a neural style transfer function on the latent image,
wherein the neural style transfer function is based on a style transfer neural network trained to output the style-transferred image (Par. [0024]: performing style transfer in an image synthesis system where a synthesized image is generated with content from a source content image and texture from a source style image includes receiving a source content image that includes desired content for a synthesized image in the image synthesis system, receiving a source style image that includes a desired texture for the synthesized image in the image synthesis system, determining a localized loss function a pixel in at least one of the source content image and the source style image using the image synthesis system, and generating the synthesized image; Par. [0069-71]: CNN used for image style transfer… CNN-based image synthesis processes that perform style transfer synthesis operate in a similar manner to the texture synthesis process described above. However, a CNN-based image synthesis system receives a content image, C, and a style image, S, that are used to generate a styled image… The content loss is a feature distance between content and output that attempts to make output and content look similar; generating a style-transferred image based on application of a neural style transfer function on the latent image, wherein the neural style transfer function is based on a style transfer neural network trained to output the style-transferred image (e.g. CNN-based image synthesis processes that perform style transfer synthesis by performing style transfer in an image synthesis system where a synthesized image is generated (i.e. generate a style-transferred image) with content from a source content image and texture from a source style image, including receiving a source content image that includes desired content for a synthesized image in the 
inputting the generated style-transferred image to the decoder network; and
generating style-transferred volume data as an output of the decoder network based on application of the decoder network on the input style-transferred image (Par. [0145-146]: CNN-based image synthesis processes re-purpose style transfer to generate a continuous and progressive aging/de-aging process in a multiscale pyramid framework… Processes in accordance with many embodiments use the same concept to synthesize time sequences in a multiscale pyramid framework. These processes may bootstrap the animation by synthesizing the first frame in the sequence using the strategy described above. After the first frame is generated, subsequent frames can be created by using the frame before as a prior frame. As such, at any given point in time, two image pyramids are stored in memory, the pyramid for the previous frame and the pyramid for the current frame being synthesized. The synthesis order is illustrated in FIG. 23. As the multiple image sizes may be synthesized in parallel, processes in accordance with a number of embodiments may store an optimizer state for each pyramid level. When synthesizing the first frame in the sequence, the base of the pyramid may use white noise as a prior frame to start the synthesis and then each subsequent pyramid level starts from the final result of the previous level that is bi-linearly re-sized to the correct resolution… For all subsequent frames synthesized, a new image pyramid may be synthesized. In accordance with a number of embodiments, the first level of the new pyramid uses the first level of the previous frame as a prior image. 
For higher layers in the pyramid, the same layer from the previous frame is used as a prior image and a content loss is introduced by re-sizing the previous layer in the same frame, this content image can be seen as a blurry version of the desired result; inputting the generated style-transferred image to the decoder network; and generating style-transferred volume data as an output of the decoder network based on application of the decoder network on the input style-transferred image (e.g. CNN-based image synthesis processes that perform style transfer synthesis by performing style transfer in an image synthesis system where a synthesized image/frame is generated (i.e. the generated style-transferred image) and after the first frame is generated, subsequent frames can be created by using the frame before as a prior frame (i.e. inputting the generated style-transferred image to the decoder network), and for all subsequent frames synthesized, a new image pyramid is synthesized and the first level of the new pyramid uses the first level of the previous frame as a prior image (i.e. generating style-transferred volume data as an output of the decoder network based on application of the decoder network on the input style-transferred image), as indicated above), for example).

Claims 3 and 4 are rejected under 35 U.S.C. 103 as being unpatentable over Poole, in view of Liu, in further view of Risser, as applied to claim 1 above, and in further view of Ecins et al. (U.S. Patent Publication No. 2020/0110158 A1), hereafter referred to as Ecins.

Regarding claim 3, claim 1 is incorporated and the combination of Poole, Liu, and Risser, as a whole, teaches the apparatus (Poole, Figs. 1 and 5), wherein the input volume data comprises voxel information (Liu, Par. [0038-39]: a convolutional neural network (CNN or ConvNet) 105, which may be a deep neural network that transforms a three dimensional (3D) input volume into a 3D output volume… the input layer may hold pixels data of the input image frames 110. The three dimensions of the 3D input volume of the input layer may include height, width, and depth. The depth dimension of the input layer may be a color of one or more image frames 110 (e.g., Red, Green, Blue (RGB) channels for each frame); Par. [0093]: deep voxel flow approach may develop a deep neural network to output dense voxel flows that may be optimized for frame interpolation; wherein the input volume data comprises voxel information (e.g. deep neural network that transforms a three dimensional (3D) input volume into a 3D output volume, including a deep voxel flow approach may develop a deep neural network to output dense voxel flows that are optimized for frame interpolation (i.e. the input volume data comprises voxel 
However, Ecins teaches voxel information sampled at regularly aligned voxel centers for an object-of-interest in 3D space (Par. [0047]: change voxel data for the voxels associated with the noisy surface normal vectors region so that the surface normal vectors are substantially aligned to the same angle; Par. [0072-73]: map generation component 238 can generate a 3D map including a mesh, wherein the mesh includes a plurality of polygons that define the shape of objects in the environment… surfaces in a 3D map can be represented by one or more polygons. In some instances, objects can be represented by voxels; Par. [0091-105]: determine that the first voxel or a centroid (or point) associated with the first voxel is less than a threshold distance from the second voxel or a centroid (or point) associated with the second… determine a centroid (also referred to as a mean) of data represented within the first voxel and a centroid of data represented within the second voxel. The computing device can then determine that the distance between the centroid associated with the first voxel and the centroid associated with the second voxel… identify a first centroid (or point) associated with the first voxel. For example, the computing device can identify the first centroid for the first voxel based on data stored in the first voxel data (e.g., statistical data for the first voxel indicating the first centroid). The first centroid can be a geometric center of points or other three-dimensional locations represented in the first voxel data… identify a second centroid (or point) associated with the second voxel. For example, the computing device can identify the second centroid for the second voxel based on data stored in the second voxel data (e.g., statistical data for the second voxel indicating the second centroid). 
The second centroid can be a geometric center of points or other three-dimensional locations represented in the second voxel data… determine that an angle between a surface normal vector (for a surface associated with the first voxel data) and a line from the first centroid (or point) to the second centroid (or point) satisfies one or more angle criteria; voxel information sampled at regularly aligned voxel centers for an object-of-interest in 3D space (e.g. first, second… Nth centroid including a geometric center of points or other three-dimensional locations represented in the first, second… Nth voxel data, in which objects are represented by voxels (i.e. voxel centers for an object-of-interest in 3D space), including changing voxel data for the voxels associated with the noisy surface normal vectors region so that the surface normal vectors are substantially aligned to the same angle (i.e. voxel information sampled at regularly aligned voxel centers in 3D space), as indicated above), for example).
Poole, Liu, Risser, and Ecins are considered to be analogous art because they pertain to image processing applications using machine learning techniques. Therefore, the combined teachings of Poole, Liu, Risser, and Ecins, as a whole, would have rendered obvious the invention recited in claim 3 with a reasonable expectation of success in order to modify the variational autoencoder (VAE) neural network system that comprises a trained encoder neural network and a trained decoder neural network (as disclosed by Poole) with voxel information sampled at regularly aligned voxel centers for an object-of-interest in 3D space (as taught by Ecins, Abstract, Par. [0047, 

Regarding claim 4, claim 3 is incorporated and the combination of Poole, Liu, Risser, and Ecins, as a whole, teaches the apparatus (Poole, Figs. 1 and 5), wherein each voxel in the voxel information comprises a set of channels that define a set of volumetric attributes for the corresponding voxel (Liu, Par. [0038-39]: a convolutional neural network (CNN or ConvNet) 105, which may be a deep neural network that transforms a three dimensional (3D) input volume into a 3D output volume… the input layer may hold pixels data of the input image frames 110. The three dimensions of the 3D input volume of the input layer may include height, width, and depth. The depth dimension of the input layer may be a color of one or more image frames 110 (e.g., Red, Green, Blue (RGB) channels for each frame); Par. [0093]: deep voxel flow approach may develop a deep neural network to output dense voxel flows that may be optimized for frame interpolation; wherein each voxel in the voxel information comprises a set of channels that define a set of volumetric attributes for the corresponding voxel (e.g. deep neural network that transforms a three dimensional (3D) input volume into a 3D output volume, including a deep voxel flow approach may develop a deep neural network to output dense voxel flows that are optimized for frame interpolation (i.e. a set of volumetric attributes for the corresponding voxel), the three dimensions of the 3D input volume of the input layer include height, width, and depth, and the depth dimension of the input layer may be a color of one or 
The same motivation to combine the above-mentioned teachings applies, as previously indicated in claim 1.

Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Poole, in view of Liu, in further view of Risser, as applied to claim 1 above, and in further view of Vogels et al. (U.S. Patent Publication No. 2018/0293496 A1), hereafter referred to as Vogels.

Regarding claim 8, claim 7 is incorporated and the combination of Poole, Liu, and Risser, as a whole, teaches the apparatus (Poole, Figs. 1 and 5), but fails to teach the following as further recited in claim 8.
However, Vogels teaches wherein the color image is an explosion image (Par. [0032]: training a neural-network based denoiser using importance sampling, where more challenging patches within a training dataset are selected with higher probabilities than others. The sampling probabilities can depend on some image metrics, such as average pixel color variance within a patch; Par. [0083-85]: ensure that the final color estimate always lies within the convex hull of the respective neighborhood of the input image. This can vastly reduce the search space of output values as compared to the direct-prediction method and avoids potential artifacts (e.g., color shifts)… a neural network may be trained on a first training dataset, and then be re-trained to be specialized for a specific production… an initial model may be trained across a set of general images of a movie, and then that model may be re-used in a new model that specializes in certain special effects of the movie, such as explosions, clouds, fog, smoke, and the like. The new specialized model may be further specialized. For example, it may be further specialized to certain types of explosions; Par. [0118]: a first set of data may include images of a general scene, and a second set of data may be images of a special lighting effects, such as an explosion that may include fire, water, oil, and other visual effects; wherein the color image is an explosion image (e.g. data includes color images of special lighting effects, such as an explosion (i.e. an explosion image), as indicated above), for example).
Poole, Liu, Risser, and Vogels are considered to be analogous art because they pertain to image processing applications using machine learning techniques. Therefore, the combined teachings of Poole, Liu, Risser, and Vogels, as a whole, would have rendered obvious the invention recited in claim 8 with a reasonable expectation of success in order to modify the variational autoencoder (VAE) neural network system that comprises a trained encoder neural network and a trained decoder neural network (as disclosed by Poole) with wherein the color image is an explosion image (as taught by Vogels, Abstract, Par. [0032, 83-85, 118]) to reduce the number of samples needed while still producing high-quality images and for the neural network to improve denoising quality during training (Vogels, Abstract, Par. [0006, 130]).

Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Poole, in view of Liu, in further view of Risser, as applied to claim 1 above, and in further view of Li et al. (U.S. Patent Publication No. 2019/0244329 A1), hereafter referred to as Li.

Regarding claim 9, claim 7 is incorporated and the combination of Poole, Liu, and Risser, as a whole, teaches the apparatus (Poole, Figs. 1 and 5), but fails to teach the following as further recited in claim 9.
However, Li teaches wherein the color image is a green forest image (2019/0244329 A1, Par. [0029]: photorealistic style image and the photorealistic content image… are processed by the photo style transfer neural network model 110 to produce the stylized photorealistic image Y. The cloud pattern in the photorealistic content image is retained in the stylized photorealistic image while a blue color of the sky and the green color of the landscape in the photorealistic style image appear in the stylized photorealistic image--the color of the sky and the landscape areas is changed compared with the photorealistic content image; wherein the color image is a green forest image (e.g. photo style transfer neural network model is used to produce stylized photorealistic image, including the green color of the landscape in the photorealistic style image (i.e. a green forest image) appearing in the stylized photorealistic image, as indicated above), for example).
Poole, Liu, Risser, and Li are considered to be analogous art because they pertain to image processing applications using machine learning techniques. Therefore, the combined teachings of Poole, Liu, Risser, and Li, as a whole, would have rendered obvious the invention recited in claim 9 with a reasonable expectation of success in .

Allowable Subject Matter
Claim 16 would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims, and by overcoming the 35 U.S.C. 112 rejections set forth in this Office action.
The following is a statement of reasons for the indication of allowable subject matter:  The prior art of record fails to anticipate or render obvious the following limitations as claimed:
In view of claim 1 in its entirety, the further limitations of “…wherein the processor is further configured to compress, by the encoder network, the input volume data along a user-defined depth axis of the input volume data to generate the latent image” as recited in claim 16.

Contact Information
Any inquiry concerning this communication or earlier communications from the examiner should be directed to GUILLERMO M RIVERA-MARTINEZ whose telephone number is (571)272-4979.  The examiner can normally be reached on 9 am to 5 pm.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vu Le, can be reached at 571-272-7332.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/GUILLERMO M RIVERA-MARTINEZ/           Primary Examiner, Art Unit 2668