Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 02/01/2022 has been entered.
 
Amendments
	Claims 1-2, 4, and 12-17 are amended. Claims 1-17 are pending and have been considered.

Claim Objections
Claim 1 is objected to because of the following informalities:  
In line 5, the noun “piece” should be plural as was recited in the claim filed 09/22/2021.
Line 18 of the first page of the claim through the end of claim 1 contain numerous grammatical errors, as do the corresponding limitations in independent claims 12-17. Applicant is respectfully asked to resolve the grammatical errors in these claims.
For purposes of examination, Examiner interprets line 18 of the first page of the claim to mean “a second decoder is inputted”; line 2 of the second page of the claim to mean “encoder is inputted”; lines 2-3 to mean “outputs [[an]] output information”; line 3 to mean “the output information corresponding 13 of the second page of the claim are interpreted similarly to lines 20 of the first page to line 4 of the second page; line 14 of the second page of the claim is interpreted to mean “training, via machine learning, the model with the output of the first encoder…”; and in lines 18-19, both recitations of “was” are interpreted as “is”.
	Claims 2-3 and 5-10 are objected to because they recite “the processor learns” instead of “the processor trains”.
	Claim 4 is objected to because the limitation “decoders that generates” in line 2 from the end was not amended like lines 2, 3, 4, and 7 were amended.
	Claims 12-17 are objected to for many of the grammatical issues set forth in the objections to claim 1. 
Claim 13, line 4 should recite “plurality of pieces of input information”.
Appropriate correction is required.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-17 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.
Claims 1 and 13-17 recite “machine learning the model” in the second-to-last paragraph of each claim. This limitation is unclear to Examiner, and the specification does not define what “machine learning the model” comprises. For purposes of examination, Examiner interprets this limitation to mean “training the model”.
Claim 12 recites “machine learn the model” in the second-to-last paragraph of the claim. This limitation is unclear to Examiner, and the specification does not define what “machine learn the model” comprises. For purposes of examination, Examiner interprets this limitation to mean “train the model”.
Claims 2-11 are rejected for failing to cure the deficiencies of claim 1 upon which they depend.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-5, 7-8, and 10-17 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Silberer et al. (“Learning Grounded Meaning Representations with Autoencoders”, cited in PTO-892 filed 06/03/2021).
	
	Regarding CLAIM 1, Silberer teaches: A learning device comprising: a memory; and a processor operatively coupled to the memory, the processor including a plurality of encoders and a plurality of decoders, the processor being programmed to: (The experimental results, disclosed in § 5 on p. 728, are evidence of a computer comprising a memory and a processor. A plurality of encoders and a plurality of decoders are taught by the text encoders, text decoders, image encoders, and image decoders shown in the bimodal autoencoder in Fig. 1 on p. 724, which is described below in more detail.)
		acquire a plurality of pieces of input information, the plurality of piece of input information having a first classification or a second classification, which is different than the first classification; and (According to p. 725, col. 2, first paragraph, the model takes as input two (real-valued) vectors representing the visual and textual modalities, and the text inputs are obtained from Wikipedia; according to p. 725, col. 2, second paragraph, the image inputs are obtained from ImageNet. See also Fig. 1.)
		execute a machine learning process by implementing a model that receives the plurality of pieces of input information as inputs, and outputs a plurality of pieces of output information corresponding to the respective pieces of input information, the machine learning process including: (Silberer teaches implementing a bimodal autoencoder which receives a text input as a first modality and an image input as a second modality, encodes both the text and image inputs into a bimodal coding, and outputs a text reconstruction and an image reconstruction. The structure of the bimodal autoencoder is shown in Fig. 1 on p. 724 and is further described at least by subsections “Autoencoders” (first paragraph) and “Stacked Autoencoder” on p. 723, “Bimodal Autoencoder” and “Stacked Bimodal Autoencoder” on p. 724, and the second paragraph on p. 726, col. 1. The training portion of a machine learning process is discussed below in more detail.)
			generating a first encoder and a second encoder of the plurality of encoders and a first decoder and a second encoder* of the plurality of decoders, (The claim limitation “second encoder” marked with an asterisk is being interpreted as a second decoder. An autoencoder contains both an encoder and a decoder, as discussed on p. 723, col. 1, “Autoencoder”, lines 1-10. Generating a first encoder and a first decoder is taught by a text autoencoder, and generating a second encoder and a second decoder is taught by an image autoencoder. See p. 723, col. 2, § 3.2 to p. 724, line 2; also, p. 724, “Unimodal Autoencoders” teaches a textual autoencoder and a visual/image autoencoder.)
			modifying the first encoder and the first decoder such that: (At p. 724, col. 1, lines 1-8, Silberer explicitly teaches training a text autoencoder separately from a visual autoencoder before the two are fused into a bimodal autoencoder. Training an autoencoder is explained at p. 723, col. 1, subsection “Autoencoders”, from “The training objective” to the end of the paragraph.)
				(a) when the input information having the first classification inputted to the first encoder, the first encoder outputs characteristic information indicating characteristics of the input information that is inputted to the first encoder, and (p. 723, col. 1, subsection “Autoencoders”, lines 1-8, where characteristic information is a latent representation.)
				(b) when the characteristic information that is outputted from the first encoder inputted to the first decoder, the first decoder outputs an output information having the first classification, the output information is corresponded to the input information inputted to the first encoder; (p. 723, col. 1, subsection “Autoencoders”, lines 8-10.)
modifying the second encoder and the second decoder such that: (At p. 724, col. 1, lines 1-8, Silberer explicitly teaches training the visual autoencoder separately from the text autoencoder before the two are fused into a bimodal autoencoder. Training an autoencoder is explained at p. 723, col. 1, subsection “Autoencoders”, from “The training objective” to the end of the paragraph.)
				(a) when the input information having the second classification inputted to the second encoder, the second encoder outputs characteristic information indicating characteristics of the input information that is inputted to the second encoder, and (p. 723, col. 1, subsection “Autoencoders”, lines 1-8, where characteristic information is a latent representation.)
(b) when the characteristic information that is outputted from the second encoder inputted to the second decoder, the second decoder outputs an output information having the second classification, the output information is corresponded to the input information inputted to the second encoder; (p. 723, col. 1, subsection “Autoencoders”, lines 8-10.)
			machine learning the model by the output of the first encoder and the second encoder being inputted to a synthesizing model and the output of the synthesizing model being inputted to the first decoder and the second decoder; and (The outputs of the text encoder and the image encoder (the arrows in line with $W^{(5)}$ in Fig. 1) are input into bimodal coding $\breve{y}$. The outputs of the bimodal coding $\breve{y}$ (the arrows in line with $W^{(5)'}$ in Fig. 1) are input to the text decoder and the image decoder. This limitation is interpreted as the forward propagation step of gradient descent which is used to update autoencoder parameters (see p. 723, col. 1, 2 lines above the last paragraph).)
			modifying the model so that when the input information having the first classification was inputted to the first encoder and the input information having the second classification was inputted to the second encoder, the first decoder outputs output information having the first classification and the second decoder outputs output information having the second classification. (Modifying the model is updating the parameters W and b in the stacked bimodal autoencoder during the backpropagation step of gradient descent. Training and gradient descent are taught in subsection “Autoencoders”, lines 10-17 on p. 723, col. 1. Further, p. 724, col. 2, subsection “Stacked Autoencoder” teaches: “all network parameters are fine-tuned with backpropagation” and p. 724, col. 2, subsection “Stacked Bimodal Autoencoder” teaches: “We finally build a stacked bimodal autoencoder (SAE) with all pre-trained layers and fine-tune them with respect to a semi-supervised criterion”. Training the stacked bimodal autoencoder is taught by p. 724, col. 2, line 2 under equation 3 to p. 725, line 6. 
Again, Fig. 1 shows the text decoder on the left outputs a reconstruction of the input text and the image decoder on the right outputs a reconstruction of the input image.)
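For illustration only, the arrangement mapped above (two modality-specific encoders feeding a shared coding layer, which in turn feeds two modality-specific decoders, with all parameters updated by gradient descent) can be sketched as follows. This is a minimal sketch and is not code from Silberer or from the application: the layer sizes, the random stand-in data, the plain squared-error loss, and the names `init_params`, `forward`, and `train_step` are assumptions chosen for the example. The denoising and the semi-supervised criterion described by Silberer are omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(d_text, d_image, d_hidden, d_code, seed=0):
    # Hypothetical layer sizes; each entry is a (weights, bias) pair.
    rng = np.random.default_rng(seed)
    def layer(n_in, n_out):
        return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)
    return {
        "enc_text": layer(d_text, d_hidden),    # first encoder
        "enc_image": layer(d_image, d_hidden),  # second encoder
        "fuse": layer(2 * d_hidden, d_code),    # shared (bimodal) coding layer
        "dec_text": layer(d_code, d_text),      # first decoder
        "dec_image": layer(d_code, d_image),    # second decoder
    }

def forward(params, x_text, x_image):
    # Forward propagation: encoders -> shared coding -> decoders.
    h_text = sigmoid(x_text @ params["enc_text"][0] + params["enc_text"][1])
    h_image = sigmoid(x_image @ params["enc_image"][0] + params["enc_image"][1])
    h = np.concatenate([h_text, h_image], axis=1)
    code = sigmoid(h @ params["fuse"][0] + params["fuse"][1])
    r_text = code @ params["dec_text"][0] + params["dec_text"][1]
    r_image = code @ params["dec_image"][0] + params["dec_image"][1]
    return h_text, h_image, h, code, r_text, r_image

def train_step(params, x_text, x_image, lr=0.1):
    # One gradient-descent update of every W and b; returns the loss before the update.
    h_text, h_image, h, code, r_text, r_image = forward(params, x_text, x_image)
    n = x_text.shape[0]
    loss = 0.5 * (np.sum((r_text - x_text) ** 2) + np.sum((r_image - x_image) ** 2)) / n
    # Backpropagation of the mean squared reconstruction error.
    g_rt = (r_text - x_text) / n
    g_ri = (r_image - x_image) / n
    g_code = g_rt @ params["dec_text"][0].T + g_ri @ params["dec_image"][0].T
    g_zc = g_code * code * (1.0 - code)
    g_h = g_zc @ params["fuse"][0].T
    k = h_text.shape[1]
    g_zt = g_h[:, :k] * h_text * (1.0 - h_text)
    g_zi = g_h[:, k:] * h_image * (1.0 - h_image)
    grads = {
        "dec_text": (code.T @ g_rt, g_rt.sum(axis=0)),
        "dec_image": (code.T @ g_ri, g_ri.sum(axis=0)),
        "fuse": (h.T @ g_zc, g_zc.sum(axis=0)),
        "enc_text": (x_text.T @ g_zt, g_zt.sum(axis=0)),
        "enc_image": (x_image.T @ g_zi, g_zi.sum(axis=0)),
    }
    for name, (g_w, g_b) in grads.items():
        w, b = params[name]
        w -= lr * g_w  # in-place update of the stored arrays
        b -= lr * g_b
    return loss

# Random stand-in data: 32 samples, 6 "text" features and 4 "image" features.
data_rng = np.random.default_rng(1)
x_text = data_rng.random((32, 6))
x_image = data_rng.random((32, 4))
params = init_params(d_text=6, d_image=4, d_hidden=5, d_code=3)
losses = [train_step(params, x_text, x_image) for _ in range(200)]
```

In this sketch, the call to `forward` corresponds to the forward propagation step relied on for the “machine learning the model” limitation, and the in-place updates in `train_step` correspond to modifying the model so that each decoder outputs a reconstruction in its own modality.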

Regarding CLAIM 2, Silberer teaches: The learning device according to claim 1, wherein: 
the processor learns the first decoder and second decoder of the plurality of decoders that generate the pieces of output information from synthesized information, and (Training the stacked bimodal autoencoder is taught by p. 724, line 2 under equation 3 to p. 725, line 6. As seen in Fig. 1, the output of the bimodal coding is input into the text decoder and the image decoder, each of which outputs a reconstruction of the input text or image.)
the respective classifications of each piece of the output information are different and the respective classification of each piece of the output information is the same classification as the classification of each piece of input information input to the different encoders of the plurality of encoders. (As seen in Fig. 1, the output of the bimodal coding is input into the text decoder and the image decoder, each of which outputs a reconstruction of the input text or image. Fig. 1 is further described at p. 726, col. 1, second paragraph.)

Regarding CLAIM 3, Silberer teaches: The learning device according to claim 1, wherein the processor learns the plurality of encoders that have learned the pieces of characteristic information of different classifications, and the plurality of decoders that have learned the pieces of characteristic information of the same classification as the respective plurality of encoders. (This limitation is interpreted as either training the autoencoders separately or training the bimodal autoencoder. Training the separate autoencoders is taught by p. 723, col. 1, subsection “Autoencoders”, from “The training objective” to the end of the paragraph. Training the stacked bimodal autoencoder is taught by p. 724, line 2 under equation 3 to p. 725, line 6.)

Regarding CLAIM 4, Silberer teaches: The learning device according to claim 1, wherein the first encoder of the plurality of encoders generates a characteristic of an image, the second encoder of the plurality of encoders generates a characteristic of text, (Fig. 1 on p. 724 shows a text encoder by the second and third layers from the bottom on the left side, and an image encoder by the second and third layers from the bottom on the right side; also see p. 726, col. 1, second paragraph, lines 5-13.)
a synthesizer generates synthesized information obtained by synthesizing the characteristic of the image and the characteristic of the text respectively generated by the first encoder and the second encoder, (Fig. 1 on p. 724 shows a bimodal coding by the middle layer into which $W^{(5)}$ is input; also see p. 726, col. 1, second paragraph, lines 13-15.)
the first decoder of the plurality of decoders generates output information corresponding to the image from the synthesized information, and the second decoder of the plurality of decoders that generates output information corresponding to the text from the synthesized information. (Fig. 1 on p. 724 shows a text decoder by the second and third layers from the top on the left side, and an image decoder by the second and third layers from the top on the right side. The output of each decoder is a reconstruction of the input.)

	Regarding CLAIM 5, Silberer teaches: The learning device according to claim 1, wherein the processor learns a synthesizer that generates synthesized information obtained by synthesizing the pieces of characteristic information generated by the plurality of encoders in a synthesizing mode corresponding to an output mode of the output information. (Training the stacked bimodal autoencoder is taught by p. 724, line 2 under equation 3 to p. 725, line 6. Fig. 1 on p. 724 shows that a hidden representation of each of the text and the image is input into the bimodal coding. The hidden representations of the text and the image correspond to the reconstructions of the text and the image output by the decoders in the stacked bimodal autoencoder.)

Regarding CLAIM 7, Silberer teaches: The learning device according to claim 5, wherein the processor learns a synthesizer that generates synthesized information corresponding to an output mode of the output information from combined information obtained by linearly combining the pieces of characteristic information generated by the plurality of encoders. (Training the stacked bimodal autoencoder is taught by p. 724, line 2 under equation 3 to p. 725, line 6. Linearly combining is taught by concatenating final hidden codings at p. 724, col. 1, last paragraph, first sentence.)
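For illustration only, the reading of concatenation as a linear combination can be checked numerically. The following is a minimal sketch under assumed dimensions (3 and 2) and with random stand-in vectors; the names `h_text`, `h_image`, `A`, and `B` are hypothetical and do not appear in Silberer or in the application. It shows that concatenating two vectors equals a sum of the two vectors after each is multiplied by a fixed block matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
h_text = rng.normal(size=3)   # stand-in for a final hidden coding of the text encoder
h_image = rng.normal(size=2)  # stand-in for a final hidden coding of the image encoder

# Fixed block matrices that place each vector in its own slice of the result.
A = np.vstack([np.eye(3), np.zeros((2, 3))])  # shape (5, 3)
B = np.vstack([np.zeros((3, 2)), np.eye(2)])  # shape (5, 2)

combined = A @ h_text + B @ h_image           # linear combination with matrix coefficients
concatenated = np.concatenate([h_text, h_image])

same = bool(np.allclose(combined, concatenated))
```

Here `same` is true because `A` and `B` only copy each input into a disjoint part of the output, which is one way to express concatenation as a linear operation on the two codings.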

Regarding CLAIM 8, Silberer teaches: The learning device according to claim 1, wherein the processor learns a plurality of models that have a structure corresponding to a classification of the input information and generate an intermediate representation indicating the characteristic of input information, and learns the plurality of encoders that generate the characteristic information from the intermediate representation generated by each model of the plurality of models. (This limitation is interpreted as training the text and image autoencoders separately, where each of the text and image autoencoders has at least two encoder layers. This is evidenced by p. 723, col. 2, subsection “Stacked Autoencoders”, which discloses that several (denoising) autoencoders can be used as building blocks to form a deep neural network; also, Fig. 1 and p. 726, col. 1, ¶ 2, lines 5-16 disclose the fusion of both autoencoders, each of which contains two encoder layers. The first of the two encoder layers corresponds to the “intermediate representation” as claimed, and the second of the at least two encoder layers corresponds to the “encoder” as claimed. Finally, training each autoencoder is disclosed by p. 723, col. 1, subsection “Autoencoders”, from “The training objective” to the end of the paragraph.)

Regarding CLAIM 10, Silberer teaches: The learning device according to claim 1, wherein the processor learns the plurality of encoders and the plurality of decoders included in a plurality of groups of an encoder and a decoder, and each of the plurality of groups has learned characteristics of pieces of information belonging to the different classifications. (This limitation is interpreted as training the autoencoders separately. At p. 724, col. 1, lines 1-8, Silberer explicitly teaches training a text autoencoder separately from a visual autoencoder before the two are fused into a bimodal autoencoder. Training an autoencoder is explained at p. 723, col. 1, subsection “Autoencoders”, from “The training objective” to the end of the paragraph.)

Regarding CLAIM 11, Silberer teaches: The learning device according to claim 1, wherein the processor outputs the pieces of output information having content with a same characteristic from a plurality of the pieces of input information included in predetermined content. (A characteristic is a modality. According to p. 724, col. 1, lines 1-2, the text and visual autoencoders are each trained separately. Predetermined content includes text and image training data. According to p. 725, col. 2, first paragraph, the text inputs are obtained from Wikipedia; according to the second paragraph, the image inputs are obtained from ImageNet.)

Regarding CLAIM 12, Silberer teaches: A generation device comprising: a processor operatively coupled to a memory, (The experimental results, disclosed in § 5 on p. 728, are evidence of a computer comprising a memory and a processor.)
the processor including a plurality of encoders and a plurality of decoders, the processor being programmed to: (A plurality of encoders and a plurality of decoders are taught by the text encoders, text decoders, image encoders, and image decoders shown in the bimodal autoencoder in Fig. 1 on p. 724, which is described below in more detail.)
acquire a plurality of pieces of output information corresponding to a plurality of pieces of input information included in a predetermined content (According to p. 725, col. 2, first paragraph, the model takes as input two (real-valued) vectors representing the visual and textual modalities, and the text inputs are obtained from Wikipedia; according to the second paragraph, the image inputs are obtained from ImageNet.)
by executing a machine learning process by using a plurality of encoders that generate pieces of characteristic information indicating characteristics of the plurality of pieces of input information of different classifications, and a plurality of decoders that generate a plurality of pieces of output information corresponding to the plurality of pieces of input information of a first classification or a second classification, which is different than the first classification; (Silberer teaches implementing a bimodal autoencoder which receives a text input as a first modality and an image input as a second modality, encodes both the text and image inputs into a bimodal coding, and outputs a text reconstruction and an image reconstruction. The structure of the bimodal autoencoder is shown in Fig. 1 on p. 724 and is further described at least by subsections “Autoencoders” (first paragraph) and “Stacked Autoencoder” on p. 723, “Bimodal Autoencoder” and “Stacked Bimodal Autoencoder” on p. 724, and the second paragraph on p. 726, col. 1. The training portion of a machine learning process is discussed below in more detail.)
generate a first encoder and a second encoder of the plurality of encoders and a first decoder and a second encoder* of the plurality of decoders; (The claim limitation “second encoder” marked with an asterisk is being interpreted as a second decoder. An autoencoder contains both an encoder and a decoder, as discussed on p. 723, col. 1, “Autoencoder”, lines 1-10. Generating a first encoder and a first decoder is taught by a text autoencoder, and generating a second encoder and a second decoder is taught by an image autoencoder. See p. 723, col. 2, § 3.2 to p. 724, line 2; also, p. 724, “Unimodal Autoencoders” teaches a textual autoencoder and a visual/image autoencoder.)
modify the first encoder and the first decoder such that: (At p. 724, col. 1, lines 1-8, Silberer explicitly teaches training a text autoencoder separately from a visual autoencoder before the two are fused into a bimodal autoencoder. Training an autoencoder is explained at p. 723, col. 1, subsection “Autoencoders”, from “The training objective” to the end of the paragraph.)
(a) when the input information having the first classification inputted to the first encoder, the first encoder outputs characteristic information indicating characteristics of the input information that is inputted to the first encoder, and-6-Application No. 15/996,968 (p. 723, col. 1, subsection “Autoencoders”, lines 1-8, where characteristic information is a latent representation.)
(b) when the characteristic information that is outputted from the first encoder inputted to the first decoder, the first decoder outputs an output information having the first classification, the output information is corresponded to the input information inputted to the first encoder; (p. 723, col. 1, subsection “Autoencoders”, lines 8-10.)
modify the second encoder and the second decoder such that: (At p. 724, col. 1, lines 1-8, Silberer explicitly teaches training the visual autoencoder separately from the text autoencoder before the two are fused into a bimodal autoencoder. Training an autoencoder is explained at p. 723, col. 1, subsection “Autoencoders”, from “The training objective” to the end of the paragraph.)
(a) when the input information having the second classification inputted to the second encoder, the second encoder outputs characteristic information indicating characteristics of the input information that is inputted to the second encoder, and (p. 723, col. 1, subsection “Autoencoders”, lines 1-8, where characteristic information is a latent representation.)
(b) when the characteristic information that is outputted from the second encoder inputted to the second decoder, the second decoder outputs an output information having the second classification, the output information is corresponded to the input information inputted to the second encoder; (p. 723, col. 1, subsection “Autoencoders”, lines 8-10.)
machine learn the model by the output of the first encoder and the second encoder being inputted to a synthesizing model and the output of the synthesizing model being inputted to the first decoder and the second decoder; and (The outputs of the text encoder and the image encoder (the arrows in line with $W^{(5)}$ in Fig. 1) are input into bimodal coding $\breve{y}$. The outputs of the bimodal coding $\breve{y}$ (the arrows in line with $W^{(5)'}$ in Fig. 1) are input to the text decoder and the image decoder. This limitation is interpreted as the forward propagation step of gradient descent which is used to update autoencoder parameters (see p. 723, col. 1, 2 lines above the last paragraph).)
modify the model so that when the input information having the first classification was inputted to the first encoder and the input information having the second classification was inputted to the second encoder, the first decoder outputs output information having the first classification and the second decoder outputs output information having the second classification. (Modifying the model is updating the parameters W and b in the stacked bimodal autoencoder during the backpropagation step of gradient descent. Training and gradient descent are taught in subsection “Autoencoders”, lines 10-17 on p. 723, col. 1. Further, p. 724, col. 2, subsection “Stacked Autoencoder” teaches: “all network parameters are fine-tuned with backpropagation” and p. 724, col. 2, subsection “Stacked Bimodal Autoencoder” teaches: “We finally build a stacked bimodal autoencoder (SAE) with all pre-trained layers and fine-tune them with respect to a semi-supervised criterion”. Training the stacked bimodal autoencoder is taught by p. 724, col. 2, line 2 under equation 3 to p. 725, line 6. 
Again, Fig. 1 shows the text decoder on the left outputs a reconstruction of the input text and the image decoder on the right outputs a reconstruction of the input image.)

Claim 13 is a method and Claim 15 is a product which recite the same features as system claim 1. Claims 13 and 15 are rejected for the reasons set forth in the rejection of claim 1.
Claim 14 is a method and Claim 16 is a product which recite the same features as system claim 12. Claims 14 and 16 are rejected for the reasons set forth in the rejection of claim 12.

Regarding CLAIM 17, Silberer teaches: A non-transitory computer-readable storage medium having stored therein a program that causes a computer (The experimental results, disclosed in § 5 on p. 728, are evidence of a computer comprising a memory and a processor.)
to execute a machine learning process by implementing a model that receives a plurality of pieces of input information as inputs and outputs a plurality of pieces of output information corresponding to the respective pieces of input information, the plurality of piece of input information having a first classification or a second classification, which is different than the first classification, the machine learning process comprising: (According to p. 725, col. 2, first paragraph, a bimodal autoencoder takes as input two (real-valued) vectors representing the visual and textual modalities, and the text inputs are obtained from Wikipedia; according to the second paragraph, the image inputs are obtained from ImageNet. Silberer teaches implementing a bimodal autoencoder which receives a text input as a first modality and an image input as a second modality, encodes both the text and image inputs into a bimodal coding, and outputs a text reconstruction and an image reconstruction. The structure of the bimodal autoencoder is shown in Fig. 1 on p. 724 and is further described at least by subsections “Autoencoders” (first paragraph) and “Stacked Autoencoder” on p. 723, “Bimodal Autoencoder” and “Stacked Bimodal Autoencoder” on p. 724, and the second paragraph on p. 726, col. 1. The training portion of a machine learning process is discussed below in more detail.)
generating a first encoder and a second encoder of the plurality of encoders and a first decoder and a second encoder* of the plurality of decoders; (The claim limitation “second encoder” marked with an asterisk is being interpreted as a second decoder. An autoencoder contains both an encoder and a decoder, as discussed on p. 723, col. 1, “Autoencoder”, lines 1-10. Generating a first encoder and a first decoder is taught by a text autoencoder, and generating a second encoder and a second decoder is taught by an image autoencoder. See p. 723, col. 2, § 3.2 to p. 724, line 2; also, p. 724, “Unimodal Autoencoders” teaches a textual autoencoder and a visual/image autoencoder.)
modifying the first encoder and the first decoder such that: (At p. 724, col. 1, lines 1-8, Silberer explicitly teaches training a text autoencoder separately from a visual autoencoder before the two are fused into a bimodal autoencoder. Training an autoencoder is explained at p. 723, col. 1, subsection “Autoencoders”, from “The training objective” to the end of the paragraph.)
(a) when the input information having the first classification inputted to the first encoder, the first encoder outputs characteristic information indicating characteristics of the input information that is inputted to the first encoder, and (p. 723, col. 1, subsection “Autoencoders”, lines 1-8, where characteristic information is a latent representation.)
(b) when the characteristic information that is outputted from the first encoder inputted to the first decoder, the first decoder outputs an output information having the first classification, the output information is corresponded to the input information inputted to the first encoder; (p. 723, col. 1, subsection “Autoencoders”, lines 8-10.)
modifying the second encoder and the second decoder such that: (At p. 724, col. 1, lines 1-8, Silberer explicitly teaches training the visual autoencoder separately from the text autoencoder before the two are fused into a bimodal autoencoder. Training an autoencoder is explained at p. 723, col. 1, subsection “Autoencoders”, from “The training objective” to the end of the paragraph.)
(a) when the input information having the second classification inputted to the second encoder, the second encoder outputs characteristic information indicating characteristics of the input information that is inputted to the second encoder, and (p. 723, col. 1, subsection “Autoencoders”, lines 1-8, where characteristic information is a latent representation.)
(b) when the characteristic information that is outputted from the second encoder inputted to the second decoder, the second decoder outputs an output information having the second classification, the output information is corresponded to the input information inputted to the second encoder; (p. 723, col. 1, subsection “Autoencoders”, lines 8-10.)
machine learning the model by the output of the first encoder and the second encoder being inputted to a synthesizing model and the output of the synthesizing model being inputted to the first decoder and the second decoder; and (The outputs of the text encoder and the image encoder (the arrows in line with $W^{(5)}$ in Fig. 1) are input into bimodal coding $\breve{y}$. The outputs of the bimodal coding $\breve{y}$ (the arrows in line with $W^{(5')}$ in Fig. 1) are input to the text decoder and the image decoder. This limitation is interpreted as the forward propagation step of gradient descent which is used to update autoencoder parameters (see p. 723, col. 1, 2 lines above the last paragraph).)
modifying the model so that when the input information having the first classification was inputted to the first encoder and the input information having the second classification was inputted to the second encoder, the first decoder outputs output information having the first classification and the second decoder outputs output information having the second classification. (Modifying the model is updating the parameters W and b in the stacked bimodal autoencoder during the backpropagation step of gradient descent. Training and gradient descent are taught in subsection “Autoencoders”, lines 10-17 on p. 723, col. 1. Further, p. 724, col. 2, subsection “Stacked Autoencoder” teaches: “all network parameters are fine-tuned with backpropagation” and p. 724, col. 2, subsection “Stacked Bimodal Autoencoder” teaches: “We finally build a stacked bimodal autoencoder (SAE) with all pre-trained layers and fine-tune them with respect to a semi-supervised criterion”. Training the stacked bimodal autoencoder is taught by p. 724, col. 2, line 2 under equation 3 to p. 725, line 6.
Again, Fig. 1 shows the text decoder on the left outputs a reconstruction of the input text and the image decoder on the right outputs a reconstruction of the input image.)
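
For illustration only, the data flow relied upon in the mapping above (two modality-specific encoders, a shared bimodal coding, and two decoders) may be sketched as follows. This is a minimal sketch under stated assumptions and is neither Silberer's implementation nor the claimed invention: the layer names, the vector sizes, the random inputs, and the use of a single encoder layer per modality (Silberer's Fig. 1 shows two) are all hypothetical simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_layer(n_in, n_out):
    # Small random weights and zero biases (illustrative initialization only).
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)

# Hypothetical sizes: text vector, image vector, per-modality code, bimodal code.
D_TEXT, D_IMAGE, D_HID, D_BI = 8, 6, 4, 3

# Hypothetical layer names; "fuse" plays the role of the synthesizing model.
params = {
    "enc_text": init_layer(D_TEXT, D_HID),
    "enc_image": init_layer(D_IMAGE, D_HID),
    "fuse": init_layer(2 * D_HID, D_BI),
    "unfuse": init_layer(D_BI, 2 * D_HID),
    "dec_text": init_layer(D_HID, D_TEXT),
    "dec_image": init_layer(D_HID, D_IMAGE),
}

def layer(x, name):
    W, b = params[name]
    return sigmoid(x @ W + b)

def forward(text, image):
    # Each encoder maps its own modality to a latent representation.
    h_text = layer(text, "enc_text")
    h_image = layer(image, "enc_image")
    # The two encoder outputs are concatenated and fused into one bimodal coding.
    coding = layer(np.concatenate([h_text, h_image], axis=-1), "fuse")
    # The bimodal coding is expanded and split between the two decoders.
    h_out = layer(coding, "unfuse")
    text_rec = layer(h_out[..., :D_HID], "dec_text")
    image_rec = layer(h_out[..., D_HID:], "dec_image")
    return coding, text_rec, image_rec

# Random stand-ins for five paired text and image input vectors.
text = rng.random((5, D_TEXT))
image = rng.random((5, D_IMAGE))
coding, text_rec, image_rec = forward(text, image)
```

In this sketch each reconstruction has the same dimensionality as the corresponding input, which mirrors the mapping of the first decoder to a text output and the second decoder to an image output.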

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Silberer et al. (“Learning Grounded Meaning Representations with Autoencoders”, cited in PTO-892 filed 06/03/2021) in view of Sullivan et al. (US 20130018833 A1).

Regarding CLAIM 6, Silberer teaches: The learning device according to claim 5, wherein the processor learns a synthesizer that generates synthesized information obtained by synthesizing the pieces of characteristic information generated by the plurality of encoders in a synthesizing mode (Training the stacked bimodal autoencoder is taught by p. 724, line 2 under equation 3 to p. 725, line 6. Fig. 1 on p. 724 shows that a hidden representation of each of text and images is input into the bimodal coding. The hidden representations of text and images correspond to the reconstructions of text and images output by the decoders in the stacked bimodal autoencoder.)
However, Silberer does not explicitly teach: characteristic information corresponding to an attribute of a user that is an output destination of the output information.
But Sullivan teaches: characteristic information corresponding to an attribute of a user that is an output destination of the output information. (Sullivan discloses a neural network that learns how to distribute content to particular recipients based on a usefulness metric; see ¶ [0035], [0045] and [0048] for an overview. ¶ [0084] to [0087] disclose that each recipient provides feedback or ratings regarding the usefulness of the content they received, and the neural network incorporates this usefulness metric as an attribute of the neural network.)
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated Sullivan’s usefulness metric as an attribute of Silberer’s neural network. A motivation for the combination is to filter a large amount of information based on the preferences of a particular recipient and usefulness of the information. (¶ [0080] lines 1-5 and ¶ [0003], [0006])

Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Silberer et al. (“Learning Grounded Meaning Representations with Autoencoders”, cited in PTO-892 filed 06/03/2021) and Kiros et al. (“Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models”, cited in the PTO-892 filed 06/03/2021).

Regarding CLAIM 9, Silberer teaches: The learning device according to claim 8, 
wherein the processor learns a model… that generates an intermediate representation of the input information that is text, and learns a model… that generates an intermediate representation of the input information that is an image. (Intermediate representation –P. 726, col. 1, ¶ 2, lines 5-16 and Fig. 1 on p. 724 disclose a bimodal autoencoder with two encoder layers per modality. The first encoder layer corresponds to the “intermediate” model as claimed and the second encoder layer corresponds to the “encoder” of parent claim 1. Input information that is text and an image – p. 724, col. 1, last 4 lines teach hidden codings of textual and visual modalities; p. 726, col. 1, ¶ 2, lines 5-16 and Fig. 1 on p. 724 disclose a bimodal autoencoder with text and image modalities.)
	However, Silberer does not explicitly teach: learns a model that is a recurrent neural network as a model that generates an intermediate representation of the input information that is text
	learns a model that is a convolution neural network as a model that generates an intermediate representation of the input information that is an image. 
	But Kiros teaches: learns a model that is a recurrent neural network as a model that generates an intermediate representation of the input information that is text (P. 1, § 1, ¶ 2, lines 2-5; P. 3, Fig. 1 caption for “Encoder”. The RNN is further discussed on p. 4, § 2.1-2.2.)
learns a model that is a convolution neural network as a model that generates an intermediate representation of the input information that is an image. (P. 1, § 1, ¶ 2, lines 2-5; P. 3, Fig. 1 caption for “Encoder”. The CNN is further discussed on p. 4, § 2.2. The OxfordNet CNN is taught at p. 7, § 3.1, lines 5-7)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have used Kiros’ long short-term memory to generate a hidden representation for text and Kiros’ OxfordNet CNN to generate a hidden representation for an image. A motivation for the combination is that LSTM RNNs are used to encode sentences (Kiros, p. 4, § 2, “We first review LSTM RNNs which are used for encoding sentences”) and a motivation for using the OxfordNet CNN is that it classifies images well (Kiros, p. 7, § 3.1, lines 5-7).

Response to Arguments
Examiner herein responds to Applicant’s remarks and claim amendments filed 02/01/2022. 

Claim Objections: The objections to claims 1, 2, 13, and 15 for typographical errors are withdrawn due to the claim amendments. Upon further consideration of claim 12, Examiner withdraws the previous objection to claim 12.

Claim Rejections Under 35 U.S.C. 112: The rejections of claims 12, 14, and 16 are withdrawn due to the claim amendments.


Claim Rejections Under 35 U.S.C. 102 and 103 (Remarks pp. 19-20): Applicant’s arguments with respect to claims 1-17 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. Claim 1 amounts to training two unimodal autoencoders of different modalities, fusing the last hidden representations into a synthesis model of a multimodal autoencoder, and further training the multimodal autoencoder. Silberer explicitly teaches training text and image autoencoders separately, concatenating the last hidden representations into a bimodal coding of a bimodal autoencoder, and further training the bimodal autoencoder.
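
For illustration only, the staged procedure summarized above (training two unimodal autoencoders separately, then fusing their hidden representations) may be sketched as follows. This is a minimal sketch under stated assumptions, not Silberer's implementation or the claimed invention: the function `train_autoencoder`, the sizes, the learning rate, and the random data are hypothetical, and the final end-to-end fine-tuning of the whole stacked network is omitted; the fusion layer is instead trained on its own on the concatenated hidden codes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, steps=200, lr=0.5, seed=0):
    """Train a one-hidden-layer autoencoder by gradient descent on mean squared error."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, (n_hidden, d))
    b2 = np.zeros(d)
    losses = []
    for _ in range(steps):
        # Forward propagation: encode to a latent code, then reconstruct.
        H = sigmoid(X @ W1 + b1)
        R = H @ W2 + b2  # linear decoder (a simplifying assumption)
        err = R - X
        losses.append(float(np.mean(err ** 2)))
        # Backpropagation of the mean squared reconstruction error.
        dR = 2.0 * err / err.size
        dW2 = H.T @ dR
        db2 = dR.sum(axis=0)
        dZ = (dR @ W2.T) * H * (1.0 - H)
        dW1 = X.T @ dZ
        db1 = dZ.sum(axis=0)
        # Gradient descent update of the parameters W and b.
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2

    def encode(A):
        return sigmoid(A @ W1 + b1)

    return encode, losses

# Random stand-ins for 20 paired text and image input vectors.
rng = np.random.default_rng(1)
text = rng.random((20, 8))
image = rng.random((20, 6))

# Stage 1: train the two unimodal autoencoders separately.
encode_text, text_losses = train_autoencoder(text, 4, seed=2)
encode_image, image_losses = train_autoencoder(image, 4, seed=3)

# Stage 2: concatenate the hidden codes and train a fusion autoencoder on them.
fused_input = np.concatenate([encode_text(text), encode_image(image)], axis=1)
encode_bimodal, bimodal_losses = train_autoencoder(fused_input, 3, seed=4)
bimodal_coding = encode_bimodal(fused_input)
```

The design choice of reusing one training routine for all three stages keeps the sketch short; a fuller treatment would backpropagate through the encoders and decoders jointly after fusion.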

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Cheng et al. (US 20160093048 A1) teaches a multimodal autoencoder.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Asher H. Jablon whose telephone number is (571)270-7648. The examiner can normally be reached Monday - Friday, 9:00 am - 6:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Abdullah Al Kawsar, can be reached on (571)270-3169. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/A.H.J./Examiner, Art Unit 2127                                                                                                                                                                                                        

/ABDULLAH AL KAWSAR/Supervisory Patent Examiner, Art Unit 2127