DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Objections
Claims 32 – 34 and 36 – 38 are objected to because of the following informalities:
In claims 32 – 34 and 36 – 38, line 1, please replace “The system of claim 21” with “The system of claim 31”.
Appropriate correction is required.

Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.
Claims 21 – 41 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1 – 10 and 12 – 31 of U.S. Patent No. 10,789,427. Although the claims at issue are not identical, they are not patentably distinct from each other because claims 21 – 41 of the instant application are similar in scope and content to claims 1 – 10 and 12 – 31 of the patent, which is from the same applicant.

Instant Application 16/984,337 | Patent 10,789,427 | Comparison
21. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to implement: | 1. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to implement: | Same
a machine learning model that comprises: | a machine learning model that comprises: | Same
a plurality of input modality neural networks, wherein each input modality neural network corresponds to a different modality of multiple modalities and is configured to map received data inputs of the corresponding modality to | |
an encoder neural network that is configured to process mapped data inputs from the unified representation space to generate respective encoder data outputs; | an encoder neural network that is configured to process mapped data inputs from the unified representation space to generate respective encoder data outputs; | Same
a decoder neural network that is configured to | |
a plurality of multiple output modality neural | |

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 21 – 25, 29, 40, 41 are rejected under 35 U.S.C. 103 as being unpatentable over Ngiam et al., (Multimodal Deep Learning) in view of Kramer et al., (US PAP 2017/0293736).
As per claim 21, Ngiam et al., teach a system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to implement:
a machine learning model that comprises: a plurality of input modality neural networks, wherein each input modality neural network corresponds to a different modality of multiple modalities and is configured to map received data inputs of the corresponding modality to mapped data inputs from a variable-sized unified representation space (“resulting in a 483 dimension vector which was reduced to 100 dimension”; section 4.1), wherein the received data inputs of different modalities have different sizes and dimensions and wherein the mapped data inputs for the received data inputs of the different modalities from the variable-sized unified representation space vary in size; (“Multimodal learning involves relating information from multiple sources. For example, images and 3-d depth scans are correlated at first-order as depth discontinuities often manifest as strong edges in images. Conversely, audio and visual data for speech recognition have correlations at a ‘mid-level’, as phonemes and visemes (lip pose and motions); it can be difficult to relate raw pixels to audio waveforms or spectrograms.”; pages 1 and 3, fig. 3);
a neural network that is configured to process mapped data inputs from the unified representation space to generate respective encoder data outputs (“neural networks for multimodal learning”; section 5); a neural network that is configured to process encoder data outputs to generate respective decoder data outputs from the unified representation space (“shared representation”; page 3, see also figs. 2, 3); and

However, Ngiam et al., do not specifically teach an encoder neural network and a decoder neural network.
Kramer et al., disclose training an encoder (e.g., a multi-modal neural network). As shown in FIG. 4, the first processor, for example, generates a full multi-modal neural network 400 with one additional layer 402 at the top of the multi-modal neural network 400. The first processor then performs supervised training on the multi-modal neural network 400 (paragraph 35)... The first processor may calculate the reconstruction error by using a decoder related to the encoder 400 (e.g., with transposed weights or different weights compared to the encoder 400) to reconstruct inputs, and by comparing the reconstructed inputs with the input data 406 input into the multi-modal neural network 400 as part of the supervised training (paragraph 37).
Therefore, it would have been obvious to one of ordinary skill in the art at the time the invention was made to use an encoder neural network as taught by Kramer et al.,

As per claim 22, Ngiam et al., in view of Kramer et al., further disclose the multiple modalities comprise one or more of (i) image recognition, (ii) speech recognition, (iii) translation, (iv) image captioning, or (v) parsing (“image processing”; Kramer et al., paragraph 68; Ngiam et al., page 1).

As per claim 23, Ngiam et al., in view of Kramer et al., further disclose the received data inputs comprise data inputs from different modalities and with different sizes and dimensions, and wherein mapped data inputs from the unified representation space vary in size (“A first modality-specific dataset of the plurality of modality-specific datasets has a first dimensionality, and a second modality-specific dataset of the plurality of modality-specific datasets has a second dimensionality. The first dimensionality is different than the second dimensionality.”; Kramer et al., paragraph 6; Ngiam et al., page 4).

As per claim 24, Ngiam et al., in view of Kramer et al., further disclose the plurality of input modality networks comprise neural networks corresponding to different modalities, and wherein the plurality of output modality networks comprise neural networks corresponding to different modalities (“generates a full multi-modal neural network 400 with one additional layer 402 at the top of the multi-modal neural network


As per claim 25, Ngiam et al., in view of Kramer et al., further disclose the plurality of input modality networks and plurality of output modality networks modalities comprise (i) language modality networks, (ii) image modality networks, (iii) audio modality networks, and (iv) categorical data modality networks (Ngiam et al., page 3, figs 2, 3; Kramer et al., paragraph 35, fig.5).

As per claim 29, Ngiam et al., in view of Kramer et al., further disclose a categorical output modality network is configured to reshape a one-dimensional decoder neural network output into a two-dimensional output and perform progressive down sampling on the two-dimensional output (“The method includes identifying, by a processor, a plurality of modality-specific datasets. A first modality-specific dataset of the plurality of modality-specific datasets has a first dimensionality, and a second modality-specific dataset of the plurality of modality-specific datasets has a second dimensionality. The first dimensionality is different than the second dimensionality.”; Kramer et al., paragraphs 6 – 9).
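
For illustration only, the limitation recited in claim 29 (reshaping a one-dimensional decoder output into a two-dimensional output and progressively down sampling it) may be sketched as follows. The sketch is not taken from the application or from the cited references. The row-major reshape, the use of 2x2 average pooling as the down sampling step, and the names reshape_to_2d, downsample_2x, and progressive_downsample are all assumptions made for this example.

```python
def reshape_to_2d(vector, height, width):
    """Reshape a flat list into a height x width grid (row-major; an assumption)."""
    if len(vector) != height * width:
        raise ValueError("vector length must equal height * width")
    return [vector[r * width:(r + 1) * width] for r in range(height)]


def downsample_2x(grid):
    """Halve each dimension by averaging non-overlapping 2x2 blocks (assumed pooling)."""
    h, w = len(grid), len(grid[0])
    return [
        [
            (grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1]) / 4.0
            for c in range(0, w - 1, 2)
        ]
        for r in range(0, h - 1, 2)
    ]


def progressive_downsample(grid, steps):
    """Apply the 2x down sampling step repeatedly, shrinking the grid each time."""
    for _ in range(steps):
        grid = downsample_2x(grid)
    return grid


# Hypothetical one-dimensional decoder output with 16 values.
vector = [float(i) for i in range(16)]
grid = reshape_to_2d(vector, 4, 4)        # 4 x 4 two-dimensional output
once = progressive_downsample(grid, 1)    # 2 x 2 after one step
twice = progressive_downsample(grid, 2)   # 1 x 1 after two steps
```

In this sketch each step halves both dimensions, so a 4 x 4 output becomes 2 x 2 and then 1 x 1; other down sampling operations (for example, strided convolution) would fit the same claim language.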

As per claims 40 and 41, Ngiam et al., teach a computer-implemented method comprising:
receiving a request to perform a machine learning task on an input of a first modality of a plurality of modalities, wherein the machine learning task comprises a 
selecting an input modality neural network that corresponds to the first modality from a plurality of input modality neural networks, wherein the selected input modality neural network is configured to map data inputs of the first modality to mapped data inputs of a variable-sized unified representation space (“resulting in a 483 dimension vector which was reduced to 100 dimension”; section 4.1), wherein data inputs of different modalities have different sizes and dimensions and wherein the mapped data inputs for the data inputs of the different modalities from the variable-sized unified representation space vary in size (“shared representation”; page 3, see also figs. 2, 3);
processing the input of the first modality using the selected input modality neural network to generate a mapped input of the unified representation space (page 3, figs.2, 3);
processing the mapped input of the unified representation space using an encoder neural network and a decoder neural network to generate a decoder output, the decoder output representing a representation of an output of the machine learning task in the unified representation space; (“audio and video”; page 3, figs. 2, 3);

However, Ngiam et al., do not specifically teach an encoder neural network and a decoder neural network.
Kramer et al., disclose training an encoder (e.g., a multi-modal neural network). As shown in FIG. 4, the first processor, for example, generates a full multi-modal neural network 400 with one additional layer 402 at the top of the multi-modal neural network 400. The first processor then performs supervised training on the multi-modal neural network 400 (paragraph 35)... The first processor may calculate the reconstruction error by using a decoder related to the encoder 400 (e.g., with transposed weights or different weights compared to the encoder 400) to reconstruct inputs, and by comparing the reconstructed inputs with the input data 406 input into the multi-modal neural network 400 as part of the supervised training (paragraph 37).
Therefore, it would have been obvious to one of ordinary skill in the art at the
.

7. 	Claim 28 is rejected under 35 U.S.C. 103 as being unpatentable over Ngiam et al., (Multimodal Deep Learning) in view of Kramer et al., (US PAP 2017/0293736); and further in view of Yu et al., (US PAP 2017/0330068).
As per claim 28, Ngiam et al., in view of Kramer et al., do not specifically teach an image input modality network is configured to deepen a received input image feature depth using one or more residual convolutional layers.
Yu et al., disclose that the neural network 200 includes a first-stage neural network 210 (e.g., a deep neural network, an autoencoder, a convolutional neural network, a recurrent neural network, a de-convolutional neural network) and a second-stage neural network (SSNN) 220 (paragraph 24)... The first neural network 600A includes a visual first-stage neural network 610A, a visual second-stage neural network 620A, and a visual joint-encoding network 629A. The second neural network 600B includes a depth first-stage neural network 610B, a depth second-stage neural network 620B, and a depth joint-encoding network 629B (paragraph 38).
Therefore, it would have been obvious to one of ordinary skill in the art at the time the invention was made to use one or more residual convolutional layers as taught by Yu et al., in Ngiam et al., in view of Kramer et al., because that would help achieve better performance (paragraph 29).

8. 	Claim 30 is rejected under 35 U.S.C. 103 as being unpatentable over Ngiam et al., (Multimodal Deep Learning) in view of Kramer et al., (US PAP 2017/0293736); and further in view of Edwards et al., (US Patent 6,125,105).
As per claim 30, Ngiam et al., in view of Kramer et al., do not specifically teach the decoder neural network is an autoregressive decoder neural network.
Edwards et al., disclose that, because the trends analyzer 1 is based on neural network technology, it has the following beneficial attributes: Accuracy--predictions using neural network engines have been shown to outperform multi-variate discriminant analysis, autoregressive integrated moving average, and autoregressive moving average (col. 9, lines 49 – 54).
Therefore, it would have been obvious to one of ordinary skill in the art at the time the invention was made to use an autoregressive decoder neural network as taught by Edwards et al., in Ngiam et al., in view of Kramer et al., because that would help achieve accurate results (col. 9, lines 49 – 50).

Allowable Subject Matter
9. 	Claims 26, 27, and 31 – 39 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims, and upon the filing of a terminal disclaimer.
The following is a statement of reasons for the indication of allowable subject matter:


As per claim 27, neither Ngiam et al., nor Kramer et al., nor Edwards et al., nor Yu et al., teach or suggest a language output modality network is configured to: receive as input a decoder output from the decoder neural network; perform a learned linear mapping followed by a softmax activation function to generate a probability distribution over the token vocabulary.
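
For illustration only, the operation recited in claim 27 (a linear mapping followed by a softmax activation that yields a probability distribution over a token vocabulary) may be sketched as follows. The sketch is not taken from the application or from the cited references. The weight and bias values, the three-token vocabulary size, and the names linear, softmax, and token_distribution are hypothetical; in a trained model the weights would be learned rather than fixed by hand.

```python
import math


def linear(x, weights, bias):
    """Linear map producing one logit per vocabulary token."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]


def softmax(logits):
    """Convert logits to probabilities; subtracting the max keeps exp() stable."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_distribution(decoder_output, weights, bias):
    """Linear mapping followed by softmax, as the claim language describes."""
    return softmax(linear(decoder_output, weights, bias))


# Hypothetical decoder output (2 features) and a 3-token vocabulary.
decoder_output = [0.5, -1.0]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # stand-in for learned weights
bias = [0.0, 0.0, 0.0]
probs = token_distribution(decoder_output, weights, bias)
```

The resulting values are non-negative and sum to one, so they can be read as a probability for each token in the vocabulary.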

As per claims 31 – 39, neither Ngiam et al., nor Kramer et al., nor Edwards et al., nor Yu et al., teach or suggest the encoder neural network and decoder neural network comprise neural network components from multiple machine learning domains, comprising (i) one or more convolutional neural network layers, (ii) one or more attention neural network layers configured to perform respective attention mechanisms, and (iii) one or more sparsely gated neural network layers.

Conclusion
10.	The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Metallinou et al. teach multi-modal natural language. Healey et .

11.	Any inquiry concerning this communication or earlier communications from the examiner should be directed to LEONARD SAINT CYR whose telephone number is (571) 272-4247. The examiner can normally be reached Monday-Friday.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil, can be reached at (571) 272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/LEONARD SAINT CYR/           Primary Examiner, Art Unit 2658