Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This Office action is in response to the application filed on 1/25/2019.
Claims 1-20 are presented for examination.
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f), because the claim limitations use a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  Such claim limitations are: “extracting unit” in claims 14-17, and “embedding unit” in claim 14.
Because these claim limitations are being interpreted under 35 U.S.C. 112(f), they are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have these limitations interpreted under 35 U.S.C. 112(f), applicant may:  (1) amend the claim limitations to avoid them being interpreted under 35 U.S.C. 112(f) (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitations recite sufficient structure to perform the claimed function so as to avoid them being interpreted under 35 U.S.C. 112(f).

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

Claim limitations “extracting unit” and “embedding unit” invoke 35 U.S.C. 112(f). However, the written description fails to disclose the corresponding structure, material, or acts for performing the entire claimed function and to clearly link the structure, material, or acts to the function. Therefore, the claim is indefinite and is rejected under 35 U.S.C. 112(b).
Applicant may:
(a)        Amend the claim so that the claim limitation will no longer be interpreted as a limitation under 35 U.S.C. 112(f); 
(b)        Amend the written description of the specification such that it expressly recites what structure, material, or acts perform the entire claimed function, without introducing any new matter (35 U.S.C. 132(a)); or 
(c)        Amend the written description of the specification such that it clearly links the structure, material, or acts disclosed therein to the function recited in the claim, without introducing any new matter (35 U.S.C. 132(a)).
If applicant is of the opinion that the written description of the specification already implicitly or inherently discloses the corresponding structure, material, or acts and clearly links them to the function so that one of ordinary skill in the art would recognize what structure, material, or acts perform the claimed function, applicant should clarify the record by either: 
(a)        Amending the written description of the specification such that it expressly recites the corresponding structure, material, or acts for performing the claimed function and clearly links or associates the structure, material, or acts to the claimed function, without introducing any new matter (35 U.S.C. 132(a)); or 
(b)        Stating on the record what the corresponding structure, material, or acts, which are implicitly or inherently set forth in the written description of the specification, perform the claimed function. For more information, see 37 CFR 1.75(d) and MPEP §§ 608.01(o) and 2181.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-5, 8-12, and 14-17 are rejected under 35 U.S.C. 103 as being unpatentable over Arandjelovic et al. (Objects that Sound, herein Arandjelovic-1) and Arandjelovic et al. (Look, Listen and Learn, herein Arandjelovic-2).


Regarding claim 1,
	Arandjelovic-1 teaches a computer-implemented method of learning data-augmentations from unlabeled media (Arandjelovic-1, page 1, paragraph 1, line 1 “In this paper our objectives are, first, networks that can embed audio and visual inputs into a common space that is suitable for cross-modal retrieval; and second, a network that can localize the object that sounds in an image, given the audio signal.  We achieve both these objectives by training from unlabeled video using only audio-visual correspondence (AVC) as the objective function.” In other words, “training” corresponds to learning, “networks that can embed audio and visual inputs” corresponds to a computer-implemented method of learning data-augmentations, and “from unlabeled video” corresponds to from unlabeled media.), the method comprising:
receiving media data including moving images of an object and audio generated by the object (Arandjelovic-1, page 2, paragraph 1, line 2 “In particular, we use unlabeled video as our source material, and employ audio-visual correspondence (AVC) as the training objective [4].”  In other words, “unlabeled video” corresponds to moving images of an object and audio generated by the object.);
extracting an image frame of the object among the moving images and extracting an audio segment from the audio (Arandjelovic-1, Fig. 2, 

[Image: Arandjelovic-1, Fig. 2 (media_image1.png)]

In other words, from Fig. 2, (a) and (b) show the vision and audio ConvNets which perform initial feature extraction from the image and audio embeddings.);
	generating first embeddings of the image frame and second embeddings of the audio segment (Arandjelovic-1,  Fig. 2(c), and page 3, paragraph 3, line 1 “In this section we describe a network architecture capable of learning good visual and audio embeddings from scratch and without labels.  Furthermore, the two embeddings are aligned in order to enable querying across modalities, e.g., using an image to search for related sounds.” And, from figure 2 (c), “Our AVE-Net is designed to produce aligned visual and audio embeddings…” In other words, from Fig. 2, produce aligned vision and audio embeddings is generating an image embedding and audio embedding.); 
	[concatenating the first and second embeddings together to generate concatenated embeddings; and labeling the media data based at least in part on the concatenated embeddings.] 
	Thus far, Arandjelovic-1 does not explicitly teach concatenating the first and second embeddings together to generate concatenated embeddings; and labeling the media data based at least in part on the concatenated embeddings. (Examiner notes that Arandjelovic-1 discloses this limitation in Fig. 2(d).  However, it is recited as a reference to an earlier paper by the same author.  For the purpose of clarity, the earlier paper is included in this office action.)
	Arandjelovic-2 teaches concatenating the first and second embeddings together to generate concatenated embeddings; and labeling the media data based at least in part on the concatenated embeddings (Arandjelovic-2, Figure 2, and page 611, column 1, paragraph 1, “The two 512-D visual and audio features are concatenated into a 1024-D vector which is passed through the fusion network to produce a 2-way classification output, namely, whether the vision and audio correspond or not.” 

[Image: Arandjelovic-2, Figure 2 (media_image2.png)]

In other words, visual and audio features are concatenated is concatenating the first and second embeddings together, and produce 2-way classification is labeling the media data based at least in part on the concatenated embeddings.)
	Both Arandjelovic-1 and Arandjelovic-2 are directed to learning visual and audio data. Arandjelovic-1 teaches networks that can embed audio and visual inputs in a way that enables cross-modal retrieval and can localize the object that sounds in an image, given the audio signal, but does not explicitly teach concatenating the image and audio embeddings. Arandjelovic-2 teaches a method that is able to learn both visual and audio semantic information in an unsupervised manner that includes, among other things, concatenating the image and audio embeddings. In view of the teaching of Arandjelovic-1, it would be obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Arandjelovic-2 into Arandjelovic-1.
	One of ordinary skill in the art would be motivated to do this because being able to use unlabeled videos to train audio and visual networks from scratch through concatenation provides the ability to learn from a nearly infinite source of data. (Arandjelovic-2, page 1, column 1, paragraph 1, line 3 “There is a valuable, but so far untapped, source of information contained in the video itself – the correspondence between the visual and audio streams…” And, page 1, column 2, paragraph 1, line 2 “…it is interesting to learn from a virtually infinite source free of supervision (video with visual and audio modes in this case) rather than requiring strong supervision; second, this is a possible source of supervision that an infant could use as their visual and audio capability develop…”)
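For illustration only, the concatenation and 2-way classification quoted above from Arandjelovic-2 can be sketched as follows. This is a minimal sketch under stated assumptions, not the cited authors' implementation: the single linear layer stands in for the fusion network, the random vectors stand in for learned embeddings and weights, and the names `concatenate`, `linear`, and `classify_correspondence` are hypothetical. Only the 512-D and 1024-D sizes and the 2-way output come from the quoted passage.

```python
import math
import random

EMBED_DIM = 512  # per-modality feature size quoted from Arandjelovic-2


def concatenate(visual, audio):
    """Join the visual and audio feature vectors end to end."""
    return list(visual) + list(audio)


def linear(x, weights, bias):
    """Fully connected layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Turn raw scores into probabilities that sum to one."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_correspondence(visual, audio, weights, bias):
    """Concatenate both embeddings, then produce a 2-way label."""
    fused = concatenate(visual, audio)
    probs = softmax(linear(fused, weights, bias))
    label = "correspond" if probs[1] >= probs[0] else "do not correspond"
    return fused, probs, label


# Hypothetical inputs: random values stand in for learned embeddings and weights.
rng = random.Random(0)
visual = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
audio = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
weights = [[rng.gauss(0.0, 0.01) for _ in range(2 * EMBED_DIM)] for _ in range(2)]
bias = [0.0, 0.0]

fused, probs, label = classify_correspondence(visual, audio, weights, bias)
print(len(fused), label)
```

The label produced here depends entirely on the placeholder weights; the sketch is intended only to show that the classification output is computed from the concatenated vector rather than from either embedding alone.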
Regarding claim 2,
	The combination of Arandjelovic-1 and Arandjelovic-2 teaches the computer-implemented method of claim 1, wherein 
	the image frame is extracted at a point of time in the media data (Arandjelovic-1, page 2, paragraph 1, line 8 “…frame and audio coming from the same time in a video are positives…” In other words, the video is the media data, and a frame coming from a given time in the video is an image frame extracted at a point of time in the media data.).
Regarding claim 3,
	The combination of Arandjelovic-1 and Arandjelovic-2 teaches the computer-implemented method of claim 2, wherein 
	the audio segment is extracted at the point of time corresponding to the extracted image frame (Arandjelovic-1, page 2, paragraph 1, line 8. See mapping of claim 2.).
Regarding claim 4,
	The combination of Arandjelovic-1 and Arandjelovic-2 teaches the computer-implemented method of claim 3, wherein 
	the audio segment is extracted in response to generating a spectrogram of the audio at the point of time corresponding to the extracted image frame (Arandjelovic-1, page 3, paragraph 4, line 2 “The input image and 1 second of audio (represented as a log-spectrogram) are processed by vision and audio subnetworks (Figures 2a and 2b), respectively, followed by feature fusion whose goal is to determine whether the image and the audio correspond under the AVC task.”  In other words, audio (represented as a log-spectrogram) is spectrogram of the audio, and processed is extracted.  See mapping of claims 2 and 3 for point of time corresponding to the extracted image frame.).
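For illustration only, the time alignment discussed in the mappings of claims 2-4 (an image frame at a point of time, and 1 second of audio at the corresponding time represented as a log-spectrogram) can be sketched as follows. This is a sketch under stated assumptions rather than the procedure of Arandjelovic-1: the sample rate, frame rate, window length, hop length, and Hann window are illustrative choices, and the function names are hypothetical. Only the 1-second duration and the log-spectrogram representation come from the quoted passage.

```python
import cmath
import math


def frame_index_at(time_s, fps):
    """Index of the video frame closest to time_s."""
    return int(round(time_s * fps))


def audio_segment_at(samples, sample_rate, time_s, duration_s=1.0):
    """Return duration_s of audio centered on time_s, clipped to the track."""
    n = int(duration_s * sample_rate)
    center = int(round(time_s * sample_rate))
    start = max(0, min(center - n // 2, len(samples) - n))
    return samples[start:start + n]


def log_spectrogram(segment, window=50, hop=25, eps=1e-10):
    """Log power spectrogram from a short-time DFT with a Hann window."""
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * i / (window - 1)) for i in range(window)]
    rows = []
    for start in range(0, len(segment) - window + 1, hop):
        chunk = [x * h for x, h in zip(segment[start:start + window], hann)]
        row = []
        for k in range(window // 2 + 1):
            coeff = sum(x * cmath.exp(-2j * math.pi * k * i / window)
                        for i, x in enumerate(chunk))
            row.append(math.log(abs(coeff) ** 2 + eps))
        rows.append(row)
    return rows


# Hypothetical media: 3 seconds of a 100 Hz tone sampled at 1000 Hz, video at 25 fps.
SAMPLE_RATE, FPS, TIME_S = 1000, 25, 2.0
samples = [math.sin(2 * math.pi * 100 * t / SAMPLE_RATE) for t in range(3 * SAMPLE_RATE)]

frame_idx = frame_index_at(TIME_S, FPS)
segment = audio_segment_at(samples, SAMPLE_RATE, TIME_S)
spec = log_spectrogram(segment)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(frame_idx, len(segment), len(spec), len(spec[0]), peak_bin)
```

With a 50-sample window at 1000 Hz each frequency bin spans 20 Hz, so the 100 Hz test tone falls in bin 5 of every row.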
Regarding claim 5,
	The combination of Arandjelovic-1 and Arandjelovic-2 teaches the computer-implemented method of claim 4, further comprising 
	encoding the concatenated embeddings, and labeling the media data based at least in part on the encoded concatenated embeddings (Arandjelovic-2, Figure 2, and page 611, column 1, paragraph 1, line 1 “The two 512-D visual and audio features are concatenated into a 1024-D vector which is passed through the fusion network to produce a 2-way classification output, namely, whether the vision and audio correspond or not.”  See mapping of claim 1.  In other words, “concatenated into a 1024-D vector” corresponds to encoding the concatenated embeddings, and “produce a 2-way classification output” corresponds to labeling the media data based at least in part on the encoded concatenated embeddings.)
Claims 8-12 are computer program product claims corresponding to computer-implemented method claims 1-5, respectively. Computer program products are output of computer programs. Arandjelovic-1 teaches this (Arandjelovic-1, Fig.1, and page 2, paragraph 1, line 10 “As the labels are constructed directly from the data itself, this is an example of “self-supervision” [13-22], a subclass of unsupervised methods.”

[Image: Arandjelovic-1, Fig. 1 (media_image3.png)]

In other words, Fig.1 (b) and labels are computer program products.).  Therefore, claims 8-12 are rejected for the same reasons as claims 1-5, respectively.
Claims 14-17 are computing system claims that correspond to computer-implemented method claims 1-4, respectively.  Arandjelovic-1 teaches a computing system (Arandjelovic-1, page 9, paragraph 1, line 10 “Training is done using 16 GPUs in parallel with synchronous updates implemented in TensorFlow, where each worker processes a 128-element batch, thus making the effective batch size 2048.” In other words, 16 GPUs in parallel is a computing system.)  Therefore, claims 14-17 are rejected for the same reasons as claims 1-4, respectively.

Claims 6 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Arandjelovic-1 and Arandjelovic-2 in view of Li et al. (Disentangled Sequential Autoencoder, herein Li).
Regarding claim 6,
The combination of Arandjelovic-1 and Arandjelovic-2 teaches the computer-implemented method of claim 5, further comprising:
Thus far, the combination of Arandjelovic-1 and Arandjelovic-2 does not explicitly teach decoding the encoded concatenated embeddings based at least in part on the first embeddings of the image frame and latent vectors of the encoded concatenated embeddings; and generating a voice feature of the object in response to decoding the encoded concatenated embeddings.  
	Li teaches decoding the encoded [concatenated - Arandjelovic-2, Figure 2, and page 611, column 1, paragraph 1 - see mapping of claim 1] embeddings based at least in part on the first embeddings of the image frame and latent vectors of the encoded [concatenated] embeddings (Li, Fig. 3, and page 1, paragraph 1, line 1 “We present a VAE architecture for encoding and generating high dimensional sequential data, such as video or audio.” And, page 4, column 2, paragraph 3, line 4 “These panels demonstrate that the encoder and the decoder have learned a factored representation for content and pose.”

[Image: Li, Fig. 3 (media_image4.png)]

In other words, decode is decoding the encoded embeddings at least in part on the first embeddings.  Examiner notes that a VAE (variational autoencoder) is an encoder/decoder architecture.); and 
	generating a voice feature of the object in response to decoding the encoded [concatenated] embeddings (Li,  page 6, column 1, paragraph 2, line 1 “We perform voice conversion experiments to demonstrate the disentanglement of the learned representation.  The goal here is to convert male voice to female voice ( and vice versa) with the speech content being preserved.”  In other words, voice conversion is generating a voice feature of the object.).
	Both Li and the combination of Arandjelovic-1 and Arandjelovic-2 are directed to encoding and generating high dimensional data, among other things.  The combination of Arandjelovic-1 and Arandjelovic-2 teaches encoding and labeling multimedia data based on embeddings, but does not explicitly teach decoding the embeddings.  Li teaches encoding and decoding embeddings for multimedia data such as video and audio.  In view of the teaching of the combination of Arandjelovic-1 and Arandjelovic-2 it would be obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Li into the combination of Arandjelovic-1 and Arandjelovic-2.  This would result in being able to encode, decode, label and generate high dimensional multimedia data.
	One of ordinary skill in the art would be motivated to do this because representation learning has remained a difficult problem in the art and disentangling representations of different modalities would open new ways of style manipulation as well as other useful applications. (Li, page 1, column 1, paragraph 2, line 1 “Representation learning remains an outstanding research problem in machine learning and computer vision.  Recently there is a rising interest in disentangled representations, in which each component of learned features refers to a semantically meaningful concept.  In the example of video sequence modelling, an ideal disentangled representation would be able to separate time-independent concepts (e.g. the identity of the object in the scene) from dynamical information (e.g. the time-varying position and the orientation or pose of that object).  Such disentangled representations would open new efficient ways of compression and style manipulation, among other applications.”)
Claim 13 is a computer program product claim corresponding to the computer-implemented method of claim 6, with the additional limitation of “generating one or both of a voice feature of the object and a hallucinated feature of the object”. Since claim 6 requires generating a voice feature, claim 13 is rejected for the same reasons as claim 6.
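For illustration only, the encode/decode arrangement discussed in the mapping of claim 6 (decoding based at least in part on the image embedding and latent vectors of the encoded concatenated embeddings) can be sketched as a generic variational autoencoder step. This is a sketch under stated assumptions and does not reproduce the architecture of Li: the single linear layers, the tiny dimensions, the random weights, and the names `encode`, `sample_latent`, and `decode` are all hypothetical. The reparameterized sampling step (mean plus scaled noise) is the standard VAE construction.

```python
import math
import random


def linear(x, weights, bias):
    """Fully connected layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def random_layer(rng, n_out, n_in):
    """Hypothetical untrained layer: small random weights, zero bias."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def encode(fused, mu_layer, logvar_layer):
    """Map the concatenated embeddings to a latent mean and log-variance."""
    return linear(fused, *mu_layer), linear(fused, *logvar_layer)


def sample_latent(mu, logvar, rng):
    """Reparameterized sample: mean plus standard deviation times unit noise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def decode(latent, image_embedding, dec_layer):
    """Decode from the latent vector together with the image embedding."""
    return linear(list(latent) + list(image_embedding), *dec_layer)


# Illustrative sizes only; none of these values come from the cited references.
IMG_DIM, AUD_DIM, LATENT_DIM, OUT_DIM = 4, 4, 2, 3
rng = random.Random(0)
image_embedding = [rng.gauss(0.0, 1.0) for _ in range(IMG_DIM)]
audio_embedding = [rng.gauss(0.0, 1.0) for _ in range(AUD_DIM)]
fused = image_embedding + audio_embedding

mu_layer = random_layer(rng, LATENT_DIM, IMG_DIM + AUD_DIM)
logvar_layer = random_layer(rng, LATENT_DIM, IMG_DIM + AUD_DIM)
dec_layer = random_layer(rng, OUT_DIM, LATENT_DIM + IMG_DIM)

mu, logvar = encode(fused, mu_layer, logvar_layer)
latent = sample_latent(mu, logvar, rng)
generated = decode(latent, image_embedding, dec_layer)
print(len(latent), len(generated))
```

The decoder input is the latent vector joined with the image embedding, which is the sense in which the output is conditioned on both.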

Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Arandjelovic-1, Arandjelovic-2, and Li, in view of Liu et al (Unsupervised Image-To-Image Translation Networks, herein Liu).
Regarding claim 7,
	The combination of Arandjelovic-1, Arandjelovic-2, and Li teaches the computer-implemented method of claim 5, further comprising:
	decoding the encoded concatenated embeddings based at least in part on the first embeddings of the image frame and latent vectors of the encoded concatenated embeddings (Li, Fig. 3, and page 1, paragraph 1, line 1. See mapping of claim 6.); and
	It would be obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Li into the combination of Arandjelovic-1 and Arandjelovic-2 for at least the same reasons discussed above in claim 6.  
	Thus far, the combination of Arandjelovic-1, Arandjelovic-2, and Li does not explicitly teach generating a hallucinated feature of the object in response to decoding the encoded concatenated embeddings.
	Liu teaches generating a hallucinated feature of the object in response to decoding the encoded [concatenated - Arandjelovic-2, Figure 2, and page 611, column 1, paragraph 1 - see mapping of claim 1] embeddings (The specification of the instant application recites “A hallucination includes, for example, generating objects (e.g.) faces in different lighting conditions…” (Specification, paragraph [0067], line 1.) Liu, page 1, paragraph 4, line 4 “We model each image domain using a VAE-GAN. The adversarial training objective interacts with a weight-sharing constraint, which enforces a shared-latent space, to generate corresponding images in two domains, while the variational autoencoders relate translated images with input images in the respective domains.” And, page 6, paragraph 3, line 1 “We applied the proposed framework to several unsupervised street scene image translation tasks including sunny to rainy, day to night, summery to snowy, and vice versa.” And, page 6, paragraph 4, line 3 “For the real to synthetic translation, we found our method made the cityscape images cartoon like.” And, page 6, paragraph 5, line 4 “We found our method translated a dog to a different breed.” In other words, translating a street scene from night to day or from summery to snowy is generating a hallucinated feature of the object.)
	Both Liu and the combination of Arandjelovic-1, Arandjelovic-2, and Li are directed to unsupervised learning of images, among other things. The combination of Arandjelovic-1, Arandjelovic-2, and Li teaches encoding and decoding of embedded images and audio, but does not explicitly teach generating “hallucinated” feature(s) after decoding the embedded image.  Liu teaches generating “hallucinated” feature(s) after decoding the embedded image.  In view of the teaching of the combination of Arandjelovic-1, Arandjelovic-2, and Li, it would be obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Liu into the combination of Arandjelovic-1, Arandjelovic-2, and Li.
	One of ordinary skill in the art would be motivated to do this because there are many vision problems that reduce to image-to-image translation, such as “hallucinating” the image.  By showing how to do image-to-image translation, these problems can be resolved.  (Liu, page 1, paragraph 2, line 1 “Many computer vision problems can be posed as an image-to-image translation problem, mapping an image in one domain to a corresponding image in another domain.  For example, super-resolution can be considered as a problem of mapping a low-resolution image to a corresponding high-resolution image; colorization can be considered as a problem of mapping a gray-scale image to a corresponding color image.”)

Claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over Arandjelovic-1, and Arandjelovic-2 in view of Pandey et al (Variational method for Conditional Multimodal Deep Learning, herein Pandey).
Regarding Claim 18,
	The combination of Arandjelovic-1 and Arandjelovic-2 teaches the computing system of claim 17, further comprising 
	[a conditional variational autoencoder (VAE)] 
	configured to encode the concatenated embeddings, wherein the media data is labeled based at least in part on the encoded concatenated embeddings (Arandjelovic-2, See mapping of claim 5.)
	Thus far, the combination of Arandjelovic-1 and Arandjelovic-2 does not explicitly teach a conditional variational autoencoder (VAE).
	Pandey teaches a conditional variational autoencoder (VAE) (Pandey, Fig. 3, and page 308, column 1, paragraph 1, line 1 “In this paper, we address the problem of conditional modality learning, whereby one is interested in generating one modality given the other.  While it is straightforward to learn a joint distribution over multiple modalities using a deep multi-modal architecture, we observe that such models are not very effective at conditional generation. Hence, we address the problem by learning conditional distributions between the modalities.  We use variational methods for maximizing the corresponding conditional log-likelihood.  The resultant deep model, which we refer to as conditional multimodal autoencoder (CMMA), forces the latent representation obtained from a single modality alone to be ‘close’ to the joint representation obtained from multiple modalities.  We use the proposed model to generate faces from attributes.” And, page 311, column 1, paragraph 5, line 1 “Both GAN and VAE are directed probabilistic models with an edge from the latent layer to the data.  Conditional extensions of both these models for incorporating attributes/labels have also been proposed [7], [4], [11].  The graphical representation of a conditional GAN or conditional VAE is shown in Fig. 3.” 


[Image: Pandey, Fig. 3 (media_image5.png)]

In other words, using a conditional VAE for conditional multi-modal learning is using a conditional VAE for encoding and decoding multi-modal data such as images and audio.) 
	Both Pandey and the combination of Arandjelovic-1 and Arandjelovic-2 are directed to classifying and using multi-modal data, among other things.  The combination of Arandjelovic-1 and Arandjelovic-2 teaches labeling multi-modal media data, including image and voice from video, identifying objects within images and matching them with the audio, and generating the synchronized voice and image data, but does not teach using a conditional VAE to perform the actions. Pandey teaches using conditional VAEs to learn several modalities simultaneously.  In view of the teaching of the combination of Arandjelovic-1 and Arandjelovic-2, it would be obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Pandey into the combination of Arandjelovic-1 and Arandjelovic-2.  This would result in being able to use conditional VAEs for labeling multi-modal media data, among other things.
	One of ordinary skill in the art would be motivated to do this because the task of learning multi-modalities is applicable to many real-world applications thereby making it important to identify new and improved ways of learning. One of the difficulties of concatenating embeddings is that the latent representation of each of the modalities may be “far” from the joint representation, thus making learning the joint embedding more challenging. (Pandey, page 308, column 2, paragraph 2, line 1 “The problem of learning from several modalities simultaneously has garnered the attention of several deep learning researchers over the past few years [12], [15], [16].  This is primarily because of the wide availability of such data, and the numerous real-world applications where multimodal data is used. For instance, speech may be accompanied with text and the resultant data can be used for training speech-to-text or text-to-speech engines.  Even with the same medium, several modalities may exist simultaneously, for instance, the plan and elevation of a 3d object, or multiple translations of a text.” And, page 1, paragraph 1, line 6 “Hence, we address the problem by learning conditional distributions between the modalities.  We use variational method for maximizing the corresponding conditional log-likelihood.  The resultant deep model, which we refer to as conditional multimodal autoencoder (CMMA), forces the latent representation obtained from a single modality alone to be ‘close’ to the joint representation obtained from multiple modalities.”) 
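For illustration only, the notion of a latent representation from a single modality being “close” to or “far” from the joint representation can be made concrete with a divergence between two distributions. The passage quoted from Pandey does not state which measure is used, so the Kullback-Leibler divergence between diagonal Gaussians below is an assumption chosen because it is the usual term in variational autoencoders; the function name and the example values are hypothetical.

```python
import math


def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) for two Gaussians with diagonal covariance.

    Per dimension: 0.5 * (log(var_p / var_q) + (var_q + (mu_q - mu_p)^2) / var_p - 1).
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


# Hypothetical 2-D latents: a joint representation and a single-modality representation.
joint_mu, joint_logvar = [0.0, 0.0], [0.0, 0.0]
single_mu, single_logvar = [1.0, 0.0], [0.0, 0.0]

same = kl_diag_gaussians(joint_mu, joint_logvar, joint_mu, joint_logvar)
shifted = kl_diag_gaussians(joint_mu, joint_logvar, single_mu, single_logvar)
print(same, shifted)  # 0.0 0.5
```

A value of zero means the two representations coincide; the value grows as the single-modality mean moves away from the joint mean, which is the sense of “far” used above.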

Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Arandjelovic-1, Arandjelovic-2, and Pandey in view of Li.
Regarding claim 19,
	The combination of Arandjelovic-1, Arandjelovic-2, and Pandey teaches the computing system of claim 18, wherein
	the conditional VAE (Pandey, Fig. 3, and page 308, column 1, paragraph 1, line 1.  See mapping of claim 18.)
	Thus far, the combination of Arandjelovic-1, Arandjelovic-2 and Pandey does not explicitly teach decodes the encoded concatenated embeddings based at least in part on the first embeddings of the image frame and latent vectors of the encoded concatenated embeddings.
	Li teaches decodes the encoded concatenated embeddings based at least in part on the first embeddings of the image frame and latent vectors of the encoded concatenated embeddings (Li, Fig. 3, and page 1, paragraph 1, line 1.  See mapping of claim 6.); wherein
	Li teaches the conditional VAE generates a voice feature of the object in response to decoding the encoded concatenated embeddings (Li, page 6, column 1, paragraph 2, line 1. See mapping of claim 6.).  
	Both Li and the combination of Arandjelovic-1, Arandjelovic-2, and Pandey are directed to encoding and generating high dimensional data, among other things.  The combination of Arandjelovic-1, Arandjelovic-2, and Pandey teaches encoding and labeling multimedia data based on embeddings using a conditional VAE, but does not explicitly teach decoding the embeddings.  Li teaches encoding and decoding embeddings for multimedia data such as video and audio.  In view of the teaching of the combination of Arandjelovic-1, Arandjelovic-2, and Pandey it would be obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Li into the combination of Arandjelovic-1, Arandjelovic-2, and Pandey.  This would result in being able to encode, decode, label and generate high dimensional multimedia data.
	One of ordinary skill in the art would be motivated to do this because representation learning has remained a difficult problem in the art and disentangling representations of different modalities would open new ways of style manipulation as well as other useful applications. (Li, page 1, column 1, paragraph 2, line 1 “Representation learning remains an outstanding research problem in machine learning and computer vision.  Recently there is a rising interest in disentangled representations, in which each component of learned features refers to a semantically meaningful concept.  In the example of video sequence modelling, an ideal disentangled representation would be able to separate time-independent concepts (e.g. the identity of the object in the scene) from dynamical information (e.g. the time-varying position and the orientation or pose of that object).  Such disentangled representations would open new efficient ways of compression and style manipulation, among other applications.”)

Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over Arandjelovic-1, Arandjelovic-2, Pandey, and Li in view of Liu.
Regarding claim 20,
	The combination of Arandjelovic-1, Arandjelovic-2, Pandey, and Li teaches the computing system of claim 18, wherein
	decodes the encoded concatenated embeddings based at least in part on the first embeddings of the image frame and latent vectors of the encoded concatenated embeddings (Li, Fig. 3, and page 1, paragraph 1, line 1. See mapping of claim 6.); and 
	It would be obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Li into the combination of Arandjelovic-1, Arandjelovic-2, and Pandey for at least the same reasons discussed above in claim 19.  
	the conditional VAE (Pandey, Fig. 3, and page 308, column 1, paragraph 1, line 1.  See mapping of claim 18.)
	generates a hallucinated feature of the object in response to decoding the encoded concatenated embeddings (Liu, page 1, paragraph 4, line 4, and page 6, paragraph 3, line 1. See mapping of claim 7.).
	It would be obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Liu into the combination of Arandjelovic-1, Arandjelovic-2, Pandey, and Li for at least the same reasons discussed above in claim 7, and further for the reasons discussed in claim 18 regarding the conditional VAE of Pandey.

Conclusion
	Any inquiry concerning this communication or earlier communications from the examiner should be directed to BART RYLANDER whose telephone number is (571)272-8359. The examiner can normally be reached Monday - Thursday 8:00 to 5:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang, can be reached on 571-270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


/B.I.R./Examiner, Art Unit 2124             

/NICHOLAS KLICOS/Primary Examiner, Art Unit 2145