Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Amendments
	Claims 1, 5-10, and 12-17 are amended. Claims 2-4 are canceled. Claims 1 and 5-17 are pending and have been considered.


Claim Objections
Claims 1 and 12-17 are objected to because of the following informalities: In claim 1, line 21 recites “a summary of the text data” while p. 3, line 13 recites “the text summary.” For the sake of consistency, line 21 should recite “a text summary of the text data.” Claims 12-17 are objected to for the reasons set forth in the objection to claim 1. Appropriate action is required.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1, 5, 7-8, and 10-17 are rejected under 35 U.S.C. 103 as being unpatentable over Silberer et al. (“Learning Grounded Meaning Representations with Autoencoders”, cited in the PTO-892 filed 06/03/2021) in view of Li et al. (“Multimedia News Summarization in Search”).

	Regarding CLAIM 1, Silberer teaches: A learning device comprising: a memory; and a processor operatively coupled to the memory, the processor including a plurality of encoders and a plurality of decoders, the processor being programmed to: (The experimental results, disclosed in § 5 on p. 728, are evidence of a computer comprising a memory and a processor. A plurality of encoders and a plurality of decoders are taught by the text encoders, text decoders, image encoders, and image decoders shown in the bimodal autoencoder in Fig. 1 on p. 724, which is described below in more detail.)
acquire a plurality of pieces of input information, including text data and image data extracted from [a same] distribution content; and (Bracketed text, crossed out in the original, is not explicitly taught by the reference. According to p. 725, col. 2, first paragraph, the model takes as input two real-valued vectors representing the visual and textual modalities. The text inputs are obtained from Wikipedia. According to p. 725, col. 2, second paragraph, the image inputs are obtained from ImageNet; Figure 1 on p. 724 shows the text and images are input to a neural network.)
execute a machine learning process by implementing a model that receives the plurality of pieces of input information as inputs, and outputs a plurality of pieces of output information corresponding to the respective pieces of input information, the machine learning process including: (Silberer teaches implementing a bimodal autoencoder which receives a text input as a first modality and image input as a second modality, encodes both text and images inputs into a bimodal coding, and outputs a text reconstruction and an image reconstruction. The structure of the bimodal autoencoder is shown in Fig. 1 on p. 724 and is further described at least by subsections “Autoencoders” (first paragraph) and “Stacked Autoencoder” on p. 723, “Bimodal Autoencoder” and “Stacked Bimodal Autoencoder” on p. 724, and the second paragraph on p. 726, col. 1. The training portion of a machine learning process is discussed below in more detail.)
generating a first encoder and a second encoder of the plurality of encoders and a first decoder and a second decoder of the plurality of decoders, (An autoencoder contains both an encoder and a decoder, as discussed on p. 723, col. 1, “Autoencoder”, lines 1-10. Generating a first encoder and a first decoder is taught by a text autoencoder, and generating a second encoder and a second decoder is taught by an image autoencoder. See p. 723, col. 2, § 3.2 to p. 724, line 2; also, p. 724, “Unimodal Autoencoders” teaches a textual autoencoder and a visual/image autoencoder.)
modifying the first encoder and the first decoder such that: (At p. 724, col. 1, lines 1-8, Silberer explicitly teaches training a text autoencoder separately from a visual autoencoder before the two are fused into a bimodal autoencoder. Training an autoencoder is explained at p. 723, col. 1, subsection “Autoencoders”, from “The training objective” to the end of the paragraph.)
(a) when the text data is inputted to the first encoder, the first encoder outputs characteristic information indicating characteristics of the text data inputted to the first encoder, and (P. 723, col. 1, subsection “Autoencoders”, lines 1-8, where characteristic information is a latent representation.)
(b) when the characteristic information that is outputted from the first encoder is inputted to the first decoder, the first decoder outputs [a summary of the] text data inputted to the first encoder as output information; (Bracketed text, crossed out in the original, is not explicitly taught by the reference. P. 723, col. 1, subsection “Autoencoders”, lines 8-10 teaches the decoder outputs a reconstruction of the same text data that was input into the input layer.)
modifying the second encoder and the second decoder such that: (At p. 724, col. 1, lines 1-8, Silberer explicitly teaches training the visual autoencoder separately from the text autoencoder before the two are fused into a bimodal autoencoder. Training an autoencoder is explained at p. 723, col. 1, subsection “Autoencoders”, from “The training objective” to the end of the paragraph.)
(a) when the image data is inputted to the second encoder, the second encoder outputs characteristic information indicating characteristics of the image data that is inputted to the second encoder, and (P. 723, col. 1, subsection “Autoencoders”, lines 1-8, where characteristic information is a latent representation.)
(b) when the characteristic information that is outputted from the second encoder is inputted to the second decoder, the second decoder outputs [a thumbnail of] the image data inputted to the second encoder as output information; (Bracketed text, crossed out in the original, is not explicitly taught by the reference. P. 723, col. 1, subsection “Autoencoders”, lines 8-10 teaches the decoder outputs a reconstruction of the same image data that was input into the input layer.)
training, via machine learning, the model by the output of the first encoder and the second encoder being inputted to a synthesizing model and the output of the synthesizing model being inputted to the first decoder and the second decoder; and (The outputs of the text encoder and the image encoder (the arrows in line with $W^{(5)}$ in Fig. 1) are input into bimodal coding $\breve{y}$. The outputs of the bimodal coding $\breve{y}$ (the arrows in line with $W^{(5')}$ in Fig. 1) are input to the text decoder and the image decoder. This limitation is interpreted as the forward propagation step of gradient descent which is used to update autoencoder parameters (see p. 723, col. 1, 2 lines above the last paragraph).)
modifying the model so that when the text data is inputted to the first encoder and the image data is inputted to the second encoder, the first decoder outputs the text [summary] and the second decoder outputs the [thumbnail]. (Bracketed text, crossed out in the original, is not explicitly taught by the reference. Modifying the model is updating the parameters W and b in the stacked bimodal autoencoder during the backpropagation step of gradient descent. Training and gradient descent are taught in subsection “Autoencoders”, lines 10-17 on p. 723, col. 1. Further, p. 724, col. 2, subsection “Stacked Autoencoder” teaches: “all network parameters are fine-tuned with backpropagation” and p. 724, col. 2, subsection “Stacked Bimodal Autoencoder” teaches: “We finally build a stacked bimodal autoencoder (SAE) with all pre-trained layers and fine-tune them with respect to a semi-supervised criterion”. Training the stacked bimodal autoencoder is taught by p. 724, col. 2, line 2 under equation 3 to p. 725, line 6. Fig. 1 on p. 724 shows the text decoder on the left outputs a reconstruction of the input text and the image decoder on the right outputs a reconstruction of the input image.)
	Silberer teaches the text and image decoders respectively output a reconstruction of the text and images input to the neural network. However, Silberer does not explicitly teach: text data and image data extracted from a same distribution content; the first decoder outputs a summary of the text data inputted to the first encoder as output information; the second decoder outputs a thumbnail of the image data inputted to the second encoder as output information; the first decoder outputs the text summary and the second decoder outputs the thumbnail.
	But Li teaches: text data and image data extracted from a same distribution content; (P. 12, § 6.1, line 4 teaches: “We build a large-scale multimedia news dataset collected from four news Web sites, including ABCNews.com, BBC.co.uk, CNN.com, and Google News. There are 135,308 news articles and 69,144 news images in total, whose distribution over these four Web sites is shown in Table I.” A same distribution content includes the dataset built by Li.)
the first decoder outputs a summary of the text data inputted to the first encoder as output information; (A summary of the text data includes keywords. P. 11, § 5, lines 16-18 teaches: “For each subtopic, we present the keywords of its representative article and the chosen representative image. Here, we present an illustrative example of multimedia news summarization in search in Figure 1” (Fig. 1 is on p. 2). Figure 5 on p. 12 also contains keywords.)
the second decoder outputs a thumbnail of the image data inputted to the second encoder as output information; (§ 4.2 on pp. 9-10 generally teaches selecting and downsizing an image representative of the text summary. See specifically § 4.2, p. 9, lines 7-11; p. 10, lines 11-19; and Fig. 4 on p. 10. P. 11, § 5, lines 16-18 teaches: “For each subtopic, we present the keywords of its representative article and the chosen representative image. Here, we present an illustrative example of multimedia news summarization in search in Figure 1” (Fig. 1 is on p. 2).)
the first decoder outputs the text summary and the second decoder outputs the thumbnail.
(A text summary includes keywords and a thumbnail includes a downsized image. For text summary and thumbnail, see P. 11, § 5, lines 16-18, Fig. 1 on p. 2; and Fig. 5 on p. 12. For thumbnail, see also § 4.2, p. 9, lines 7-11; P. 10, lines 11-19; and Fig. 4 on p. 10.)
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated Li’s method of outputting text keywords and an image thumbnail from Silberer’s stacked denoising autoencoder. A motivation for the combination is to present a user with a multimodal news summary view. (P. 5, last paragraph before section 2)
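For illustration only, the data flow mapped to claim 1 above (two modality-specific encoders, a synthesizing layer producing a bimodal coding, two decoders, and parameter updates by gradient descent) can be sketched as follows. This is a minimal sketch under stated assumptions and is not the implementation of Silberer or of the application: the layer sizes, the tanh activation, the squared-error loss, the omission of bias terms, the random stand-in data, and all names (`W_te`, `W_ve`, `W_j`, `W_td`, `W_vd`) are hypothetical, and the stacking, denoising, and semi-supervised criterion described in the reference are left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: text vector, image vector, per-modality code, bimodal code.
D_TEXT, D_IMAGE, D_HIDDEN, D_JOINT = 6, 4, 3, 3


def init(rows, cols):
    """Small random weights; bias terms are omitted to keep the sketch short."""
    return rng.normal(0.0, 0.1, size=(rows, cols))


params = {
    "W_te": init(D_TEXT, D_HIDDEN),      # first (text) encoder
    "W_ve": init(D_IMAGE, D_HIDDEN),     # second (image) encoder
    "W_j": init(2 * D_HIDDEN, D_JOINT),  # synthesizing layer (bimodal coding)
    "W_td": init(D_JOINT, D_TEXT),       # first (text) decoder
    "W_vd": init(D_JOINT, D_IMAGE),      # second (image) decoder
}


def forward(p, x_t, x_v):
    """Forward propagation: encoders -> bimodal coding -> decoders."""
    h_t = np.tanh(x_t @ p["W_te"])          # characteristic information (text)
    h_v = np.tanh(x_v @ p["W_ve"])          # characteristic information (image)
    h = np.concatenate([h_t, h_v], axis=1)  # both codes fed to the synthesizer
    y = np.tanh(h @ p["W_j"])               # bimodal coding
    r_t = y @ p["W_td"]                     # text reconstruction
    r_v = y @ p["W_vd"]                     # image reconstruction
    return h_t, h_v, h, y, r_t, r_v


def loss_and_grads(p, x_t, x_v):
    """Reconstruction loss and its gradients, obtained by backpropagation."""
    n = x_t.shape[0]
    h_t, h_v, h, y, r_t, r_v = forward(p, x_t, x_v)
    loss = 0.5 * (np.sum((r_t - x_t) ** 2) + np.sum((r_v - x_v) ** 2)) / n
    d_rt = (r_t - x_t) / n
    d_rv = (r_v - x_v) / n
    d_y = d_rt @ p["W_td"].T + d_rv @ p["W_vd"].T
    d_zy = d_y * (1.0 - y ** 2)             # derivative of tanh
    d_h = d_zy @ p["W_j"].T
    d_ht, d_hv = d_h[:, :D_HIDDEN], d_h[:, D_HIDDEN:]
    grads = {
        "W_td": y.T @ d_rt,
        "W_vd": y.T @ d_rv,
        "W_j": h.T @ d_zy,
        "W_te": x_t.T @ (d_ht * (1.0 - h_t ** 2)),
        "W_ve": x_v.T @ (d_hv * (1.0 - h_v ** 2)),
    }
    return loss, grads


# Random stand-in data: 20 paired (text, image) input vectors.
x_text = rng.random((20, D_TEXT))
x_image = rng.random((20, D_IMAGE))

losses = []
for _ in range(300):
    loss, grads = loss_and_grads(params, x_text, x_image)
    losses.append(loss)
    for name in params:
        params[name] -= 0.05 * grads[name]  # gradient descent update
```

In this sketch each decoder reconstructs its own input. Producing a summary or a thumbnail instead would require different training targets, which is the difference the rejection addresses with Li.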

Regarding CLAIM 5, the combination of Silberer and Li teaches: The learning device according to claim 1, 
Silberer teaches: wherein the processor trains a synthesizer that generates synthesized information obtained by synthesizing the characteristic information generated by each of the plurality of encoders in a synthesizing mode corresponding to an output mode of the output information. (Training the stacked bimodal autoencoder is taught by p. 724, line 2 under equation 3 to p. 725, line 6. Fig. 1 on p. 724 shows that a hidden representation of each of the text and the images is input into the bimodal coding. Hidden representations of each of text and images correspond to the reconstruction of text and images from the outputs of the decoders in the stacked bimodal autoencoder.)

Regarding CLAIM 7, the combination of Silberer and Li teaches: The learning device according to claim 5, 
Silberer teaches: wherein the processor trains a synthesizer that generates synthesized information corresponding to an output mode of the output information from combined information obtained by linearly combining the characteristic information generated by each of the plurality of encoders. (Training the stacked bimodal autoencoder is taught by p. 724, line 2 under equation 3 to p. 725, line 6. Linearly combining is taught by concatenating final hidden codings at p. 724, col. 1, last paragraph, first sentence.)
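As a worked illustration of why concatenation can be read as a linear operation on the two codings, the short sketch below writes the concatenated vector as a sum of each hidden coding multiplied by a fixed block matrix. The vectors, their dimensions, and the names `P_text` and `P_image` are hypothetical and are not taken from Silberer. The sketch shows only the arithmetic identity and takes no position on claim construction.

```python
import numpy as np

# Hypothetical final hidden codings of the two modalities.
h_text = np.array([0.2, -0.5, 0.9])
h_image = np.array([0.7, 0.1])

# Fixed block matrices that place each coding into its slot of the joint vector.
P_text = np.vstack([np.eye(3), np.zeros((2, 3))])   # shape (5, 3)
P_image = np.vstack([np.zeros((3, 2)), np.eye(2)])  # shape (5, 2)

concatenated = np.concatenate([h_text, h_image])
linear_combination = P_text @ h_text + P_image @ h_image

# The two constructions give the same 5-dimensional vector.
same = np.allclose(concatenated, linear_combination)
```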

Regarding CLAIM 8, the combination of Silberer and Li teaches: The learning device according to claim 1, 
Silberer teaches: wherein the processor trains a plurality of models that have a structure corresponding to a classification of the input information and generate an intermediate representation indicating the characteristic of input information, and learns the plurality of encoders that generate the characteristic information from the intermediate representation generated by each model of the plurality of models. (This limitation is interpreted as training the text and image autoencoders separately, where each of the text and image autoencoders has at least two encoder layers. This is evidenced by p. 723, col. 2, subsection “Stacked Autoencoders”, which discloses that several (denoising) autoencoders can be used as building blocks to form a deep neural network; also, Fig. 1 and p. 726, col. 1, ¶ 2, lines 5-16 disclose the fusion of both autoencoders, each of which contains two encoder layers. The first of the two encoder layers corresponds to the “intermediate representation” as claimed, and the second of at least two encoder layers corresponds to the “encoder” as claimed. Finally, training each autoencoder is disclosed by p. 723, col. 1, subsection “Autoencoders”, from “The training objective” to the end of the paragraph.)

Regarding CLAIM 10, the combination of Silberer and Li teaches: The learning device according to claim 1, 
Silberer teaches: wherein the processor trains the plurality of encoders and the plurality of decoders included in a plurality of groups of an encoder and a decoder, and each of the plurality of groups has learned characteristics of pieces of information including the text data and the image data. (This limitation is interpreted as training the autoencoders separately. At p. 724, col. 1, lines 1-8, Silberer explicitly teaches training a text autoencoder separately from a visual autoencoder before the two are fused into a bimodal autoencoder. Training an autoencoder is explained at p. 723, col. 1, subsection “Autoencoders”, from “The training objective” to the end of the paragraph.)

Regarding CLAIM 11, the combination of Silberer and Li teaches: The learning device according to claim 1, 
Silberer teaches: wherein the processor outputs the pieces of output information having content with a same characteristic from a plurality of the pieces of input information included in predetermined content. (A characteristic is a modality. According to p. 724, col. 1, lines 1-2, each of the text and visual autoencoders is trained separately. Predetermined content includes text and image training data. According to p. 725, col. 2, first paragraph, the text inputs are obtained from Wikipedia; according to the second paragraph, the image inputs are obtained from ImageNet.)

	Independent claims 12-17 implement the same features as the system of claim 1 and are therefore rejected for at least the same reasons therein. 

Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Silberer et al. (“Learning Grounded Meaning Representations with Autoencoders”, cited in PTO-892 filed 06/03/2021) in view of Li et al. (“Multimedia News Summarization in Search”) and Sullivan et al. (US 20130018833 A1, cited in PTO-892 filed 06/22/2022).

Regarding CLAIM 6, the combination of Silberer and Li teaches: The learning device according to claim 5, 
	Silberer teaches: wherein the processor trains a synthesizer that generates synthesized information obtained by synthesizing the characteristic information generated by each of the plurality of encoders in a synthesizing mode (Training the stacked bimodal autoencoder is taught by p. 724, line 2 under equation 3 to p. 725, line 6. Fig. 1 on p. 724 shows that a hidden representation of each of the text and the images is input into the bimodal coding. Hidden representations of each of text and images correspond to the reconstruction of text and images from the outputs of the decoders in the stacked bimodal autoencoder.)
	However, neither Silberer nor Li explicitly teaches: synthesizing characteristic information corresponding to an attribute of a user that is an output destination of the output information. 
But Sullivan teaches: synthesizing characteristic information corresponding to an attribute of a user that is an output destination of the output information. (Sullivan discloses a neural network that learns how to distribute content to particular recipients based on a usefulness metric; see ¶ [0035], [0045] and [0048] for an overview. ¶ [0084] to [0087] disclose that each recipient provides feedback or ratings regarding the usefulness of the content they received, and the neural network incorporates this usefulness metric as an attribute of the neural network.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated Sullivan’s usefulness metric as an attribute of Silberer’s neural network in Silberer/Li’s system. A motivation for the combination is to filter a large amount of information based on the preferences of a particular recipient and usefulness of the information. (¶ [0080] lines 1-5 and ¶ [0003], [0006])

Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Silberer et al. (“Learning Grounded Meaning Representations with Autoencoders”, cited in PTO-892 filed 06/03/2021) in view of Li et al. (“Multimedia News Summarization in Search”) and Kiros et al. (“Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models”, cited in the PTO-892 filed 06/03/2021).

Regarding CLAIM 9, the combination of Silberer and Li teaches: The learning device according to claim 8, 
Silberer teaches: wherein the processor trains a model  that generates an intermediate representation of the input information that is text, and learns a model  that generates an intermediate representation of the input information that is an image. (Crossed-out text is not explicitly taught by the reference. P. 724, col. 1, last 4 lines teaches hidden codings of textual and visual modalities; p. 726, col. 1, ¶ 2, lines 5-16 and Fig. 1 on p. 724 disclose a bimodal autoencoder with two encoder layers for the text modality and two encoder layers for the image modality. Examiner refers to the first and second layers after the input layer as first and second encoder layers for each modality. The first text encoder layer corresponds to the “intermediate representation of the input information that is text” and the first image encoder layer corresponds to the “intermediate representation of the input information that is an image” as claimed. The second text and image encoder layers correspond to the text and image encoders of claim 1.)
	However, neither Silberer nor Li explicitly teaches: trains a model that is a recurrent neural network as a model that generates an intermediate representation of the input information that is text; and
	trains a model that is a convolution neural network as a model that generates an intermediate representation of the input information that is an image.
	But Kiros teaches: trains a model that is a recurrent neural network as a model that generates an intermediate representation of the input information that is text (P. 1, § 1, ¶ 2, lines 2-5; P. 3, Fig. 1 caption for “Encoder”. The RNN is further discussed on p. 4, § 2.1-2.2.)
trains a model that is a convolution neural network as a model that generates an intermediate representation of the input information that is an image. (P. 1, § 1, ¶ 2, lines 2-5; P. 3, Fig. 1 caption for “Encoder”. The CNN is further discussed on p. 4, § 2.2. The OxfordNet CNN is taught at p. 7, § 3.1, lines 5-7)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have used Kiros’ long short-term memory to generate a hidden representation for text and Kiros’ OxfordNet CNN to generate a hidden representation for an image in Silberer/Li’s system. A motivation for the combination is that LSTM RNNs are used to encode sentences (Kiros, p. 4, § 2, “We first review LSTM RNNs which are used for encoding sentences”) and a motivation for using the OxfordNet CNN is that it classifies images well (Kiros, p. 7, § 3.1, lines 5-7).

Response to Arguments
Applicant’s arguments with respect to claims 1 and 5-17 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. “Multi-Modal Event Topic Model for Social Event Analysis” to Qian et al. teaches a multi-modal event tracking and evolution framework (see Fig. 2). The input is multi-modality data collected from Google News, including images and texts. Based on the input data, the algorithm learns multi-modality topics and tracks multiple events. After tracking, each event can be visualized with texts and images over time.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Asher H. Jablon whose telephone number is (571)270-7648. The examiner can normally be reached Monday - Friday, 9:00 am - 6:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Abdullah Al Kawsar, can be reached on (571)270-3169. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/A.H.J./Examiner, Art Unit 2127

/ABDULLAH AL KAWSAR/Supervisory Patent Examiner, Art Unit 2127