DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-20 are presented for examination.
Oath/Declaration
For the record, the Examiner acknowledges that the Oaths/Declarations submitted on 3/19/2018 have been received.
Drawings
The drawings filed on 3/19/2018 combined with the amended drawings filed on 5/2/2018 are acceptable for examination purposes.
Specification
The Specification filed on 3/19/2018 is acceptable for examination purposes.
Claim Objections
Claim 1 recites “the format and range values” in line 5.  There is insufficient antecedent basis for this limitation in the claim.  For the purposes of prior art examination, Examiner is interpreting as “the format and value ranges.” 
Claim 4 recites “the training data” in line 2.  There is insufficient antecedent basis for this limitation in the claim.  For the purposes of prior art examination, Examiner is interpreting as “training data.”
Claim 6 recites “the format and range values” in line 5. There is insufficient antecedent basis for this limitation in the claim.  For the purposes of prior art examination, Examiner is interpreting as “the format and value ranges.”
Claim 14 recites “the training examples” in line 13.  There is insufficient antecedent basis for this limitation in the claim.  For the purposes of prior art examination, Examiner is interpreting as “training examples.”
Claim 15 recites “the encapsulated samples” in line 2.  There is insufficient antecedent basis for this limitation in the claim.  For the purposes of prior art examination, Examiner is interpreting as “encapsulated samples.”
Claim 19 recites “the neural network parameter values” in line 2.  There is insufficient antecedent basis for this limitation in the claim.  For the purposes of examination, Examiner is interpreting as “neural network parameter values.”
Claim 19 recites “the output processor” in line 3.  There is insufficient antecedent basis for this limitation in the claim.  For the purposes of examination, Examiner is interpreting as “an output processor.”
Claim 20 recites “the output processor” in line 2.  There is insufficient antecedent basis for this limitation in the claim. For the purpose of prior art examination, Examiner is interpreting as “an output processor.”
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
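A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.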

Claims 1, 2, 6, 7, and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Kaiser et al (One Model to Learn Them All, herein Kaiser) and Glickman (US 20120209795 A1, herein Glickman).
Regarding Claim 1,
	Kaiser teaches a computer-implemented method of training a deep learning neural network to undertake neural processing of a plurality of disparate data items and related disparate feature data, the method comprising: (Kaiser, Page 1, Paragraph 1, Line 2  “But for each problem, getting a deep model to work well involves research into the architecture and a long period of tuning.”  In other words, deep model is deep learning network. And, Page 1, Paragraph 1, Line 5 “In particular, this single model is trained concurrently on ImageNet, multiple translation tasks, image captioning (COCO dataset), a speech recognition corpus, and an English parsing task.” In other words, trained concurrently is neural processing;  on ImageNet, multiple translation tasks, image captioning (COCO dataset), a speech recognition corpus, and an English parsing task is a plurality of disparate data items and related disparate feature data.)
	determining format and value ranges for the plurality of disparate data items; (Kaiser, Page 2, Paragraph 4, Line 2 “To allow training on input data of widely different sizes and dimensions, such as images, sound waves and text, we need sub-networks to convert inputs into a joint representation space.” In other words, convert inputs is determining format and value ranges.  Examiner notes that it is implicit that to convert data into a joint representation it would first be necessary to determine the format and value ranges of that input data.) 
	 [rescaling] the feature data to correspond to the format and range values; (Kaiser, Page 2, Paragraph 4, Line 2, “To allow training on input data of widely different sizes and dimensions, such as images, sound waves and text, we need sub-networks to convert inputs into a joint representation space.”  In other words, input data of widely different sizes and dimensions is feature data.)
	merging the [rescaled] feature data with the disparate data items to create a single encapsulated data item corresponding to the disparate data items and the disparate feature data; (Kaiser, Page 2, Paragraph 4, Line 2 “To allow training on input data of widely different sizes and dimensions, such as images, sound waves and text, we need sub-networks to convert inputs into a joint representation space.  We call these sub-networks modality nets as they are specific to each modality (images, speech, text) and define transformations between these external domains and a unified representation.” In other words, transformation into a joint representation space is merging the [rescaled] feature data with the disparate data items to create a single encapsulated data item.)
[combining into a training set the single encapsulated data item and a known correct output corresponding to the] disparate data items and the single encapsulated data item; and (Kaiser, Page 2, Paragraph 4, Line 2 “To allow training on input data of widely different sizes and dimensions, such as images, sound waves and text, we need sub-networks to convert inputs into a joint representation space.  ” In other words, transformation into a joint representation space is combining into a training set the single encapsulated data item.)
training the deep learning neural network with the training set. (Kaiser, Page 6, Section 3, line 1, “We implemented the MultiModel architecture described above using TensorFlow and trained it in a number of configurations.”  In other words trained it in a number of configurations is training the deep learning neural network.)
	Kaiser, thus far, does not explicitly teach rescaling.  Kaiser also does not explicitly teach combining into a training set the single encapsulated data item and a known correct output.
	Glickman teaches rescaling. (Glickman, Fig. 12, Block 1240; and Page 13, Column 2, Paragraph [0218] “Step 1240: The feature data is scaled. Data scaling typically involves changing the attribute weights.  Scaling is typically performed to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges.” In other words scaling is rescaling.) 
	It would be obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine Glickman into the teaching of Kaiser. This would result in rescaling the feature data to correspond to the format and range values.
	One of ordinary skill in the art would be motivated to do this because scaling is required in order to allow multiple disparate data to be combined into one representation. This is because datasets are typically fixed in size for ease of processing. If one image is encoded in 512 bits and the predetermined range for the image data is 256 bits, the image would need to be scaled to fit into the allocated range in order to proceed with processing.
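By way of illustration only, the kind of rescaling discussed above can be sketched as a linear (min-max) mapping of feature values onto a target range. The function name `rescale` and the sample values are hypothetical and are not taken from Kaiser, Glickman, or the application; neither reference is asserted to use this particular formula.

```python
def rescale(values, new_min, new_max):
    """Linearly map values onto [new_min, new_max] (min-max scaling).

    Hypothetical helper for illustration; not drawn from any cited reference.
    """
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # All values are identical, so map everything to the lower bound.
        return [float(new_min) for _ in values]
    span = old_max - old_min
    return [new_min + (v - old_min) * (new_max - new_min) / span for v in values]


# Feature values spanning 0..512 are mapped onto an allocated range of 0..256.
feature = [0, 128, 512]
scaled = rescale(feature, 0, 256)
print(scaled)
```

After this mapping, features with a large numeric range no longer dominate features with a small one, which is the purpose Glickman gives for scaling in paragraph [0218].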
	The combination of Kaiser and Glickman thus far still does not teach combining into a training set the single encapsulated data item and a known correct output.  	
combining into a training set the single encapsulated data item and a known correct output. (Glickman, Page 4, Figure 2, Step 340 “Pass content items represented as features along with corresponding target demographic characteristics to supervised learning algorithm to obtain a prediction model;” Examiner notes that Supervised Learning is known in the field of machine learning and means a training set that maps input to an output based on example input-output pairs.  The set combines the input with the known correct output in order to train the neural network. See O’Reilly Library- Chapter 2 Supervised Learning, Page 1, Paragraph 1.)
	It would be obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teaching of Glickman into Kaiser. This would result in combining into a training set the single encapsulated data item and a known correct output.
	One of ordinary skill in the art would be motivated to do so, because using supervised learning where the known correct output is combined with the input dataset speeds up the training process.
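As a minimal sketch of the supervised-learning arrangement noted above, a training set can be represented as a list of pairs, each pairing one encapsulated data item with its known correct output. The name `build_training_set` and the sample data are hypothetical and serve only to illustrate the input-output pairing; they do not appear in Glickman or Kaiser.

```python
def build_training_set(encapsulated_items, known_outputs):
    """Pair each encapsulated data item with its known correct output.

    Hypothetical helper for illustration; not drawn from any cited reference.
    """
    if len(encapsulated_items) != len(known_outputs):
        raise ValueError("each item needs exactly one known correct output")
    return list(zip(encapsulated_items, known_outputs))


# Two encapsulated items (already merged and rescaled) and their labels.
items = [[0.1, 0.9, 0.3], [0.7, 0.2, 0.5]]
labels = ["cat", "dog"]
training_set = build_training_set(items, labels)
print(training_set)
```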
Regarding claim 2,
The combination of Kaiser and Glickman teaches the computer-implemented method of claim 1, 
wherein a plurality of single encapsulated data items are combined into a multi-set and used in assembly of the training set, and (Kaiser, Page 2, Paragraph 4, Line 2 “To allow training on input data of widely different sizes and dimensions, such as images, sound waves and text, we need sub-networks to convert inputs into a joint representation space.  We call these sub-networks modality nets as they are specific to each modality (images, speech, text) and define transformations between these external domains and a unified representation.” In other words, transformation into a joint representation space is single encapsulated data items and input data is multi-set; and Kaiser, Page 6, Section 3, Line 1 “We implemented the MultiModel architecture described above using TensorFlow and trained it in a number of configurations.” In other words, trained is used in assembly of the training set.) 
where the known correct output relates to the assembled training set. (Glickman, Page 4, Figure 2, Step 340. “Pass content items represented as features along with corresponding target demographic characteristics to supervised learning algorithm to obtain a prediction model;” In other words, supervised learning algorithm is known correct output relates to the assembled training set.) 
Regarding claim 6,
Claim 6 is using the neural network that was trained in the computer-implemented method of claim 1. 
The combination of Kaiser and Glickman teaches the method of claim 1. (Examiner notes that claim 1 explicitly recites use of a computer and that upon completion of training, a neural network is ready to be used without further action.)  
Additionally, claim 6 recites
providing the single encapsulated data item and previously determined weights and biases as inputs to the neural network in a feed forward computation mode to determine an output for downstream processing. (Examiner notes that neural networks have weights and biases.  Once the training is complete, the predetermined weights and biases remain in place for feed forward computation to produce output.  Once trained, the single encapsulated data item would be an input to produce output.)
Therefore, Claim 6 is rejected for the same reasons as claim 1.
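For illustration only, feed forward computation with previously determined weights and biases can be sketched as follows. The two-layer network, the ReLU activation, and all numeric values are assumptions chosen to keep the example small; they do not describe Kaiser's MultiModel or the claimed network.

```python
def relu(x):
    """Rectified linear activation (assumed here purely for illustration)."""
    return max(0.0, x)


def dense(inputs, weights, biases):
    """One fully connected layer: activation(W . x + b) for each output unit."""
    return [
        relu(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def feed_forward(inputs, layers):
    """Pass the input through each layer using fixed, already trained parameters."""
    out = inputs
    for weights, biases in layers:
        out = dense(out, weights, biases)
    return out


# Hypothetical predetermined parameters: (weights, biases) per layer.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 1.0]),
    ([[1.0, 1.0]], [-0.5]),
]
result = feed_forward([2.0, 1.0], layers)
print(result)
```

No parameter is modified during this computation, which is what distinguishes feed forward use of a trained network from the training step of claim 1.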
Regarding claim 7,
The combination of Kaiser and Glickman teaches the computer-implemented method of claim 6, 
wherein the output for downstream processing comprises at least one of: predictive analysis, classification, feature detection, and ranking. (Kaiser, Page 2 , Paragraph 2, Line 3  “Concretely, we train the MultiModel simultaneously on the following 8 corpora: (1) WSJ speech corpus; (2) ImageNet dataset (3) COCO image captioning dataset (4) WSJ parsing dataset (5) WMT English-German translation corpus (6) The reverse of the above: German-English translation. (7) WMT English-French translation corpus (8) The reverse of the above: German-French translation.” In other words, the following 8 corpora is at least one of predictive analysis, classification, feature detection, and ranking.  In particular, WSJ speech corpus is classification, and ImageNet dataset is feature detection.)
Regarding claim 10,
The combination of Kaiser and Glickman teaches the computer-implemented method of claim 6, 
wherein the neural processing is performed on a plurality of encapsulated data items and the output is compared to known outcomes (Glickman, Page 4, Figure 2, Step 340 “Pass content items represented as features along with corresponding target demographic characteristics to supervised learning algorithm to obtain a prediction model;” Examiner notes that Supervised Learning is known in the field of machine learning and means a training set that maps input to an output based on example input-output pairs. The set combines the input with the known correct output in order to train the neural network.  In this instance, the training set is a plurality of encapsulated data items and known outcomes. See O’Reilly Library- Chapter 2 Supervised Learning, Page 1, Paragraph 1.)
The combination of Kaiser and Glickman, thus far, does not explicitly teach calculate a measure of forecast skill.
Glickman teaches calculate a measure of forecast skill. (Glickman, Figure 2, Step 340. “Pass content items represented as features along with corresponding target demographic characteristics to supervised learning algorithm to obtain a prediction model;” In this limitation, calculating a measure of forecast skill means that the neural processing is training the ability to forecast.  Each iteration through the data will produce a forecast which is compared to the known output causing the weights and biases to be adjusted until the forecast is accurate.  In other words, supervised learning algorithm is output is compared to known outcomes, prediction is forecast, and supervised training wherein the prediction, as output, is evaluated in the supervised learning is the output to calculate a measure of forecast skill.)
It would be obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to further combine Glickman into the teaching of Kaiser and Glickman.  This would result in calculating a measure of forecast skill.
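As a hedged illustration of comparing outputs to known outcomes, one conventional measure of forecast skill is a skill score of the form 1 - MSE(forecast) / MSE(reference), where the reference forecast is the mean of the known outcomes. This particular formula is an assumption made for the sketch; neither Glickman nor the claim is asserted to define forecast skill this way, and the function names are hypothetical.

```python
def mean_squared_error(predictions, outcomes):
    """Average squared difference between predictions and known outcomes."""
    return sum((p - o) ** 2 for p, o in zip(predictions, outcomes)) / len(outcomes)


def skill_score(predictions, outcomes):
    """Skill relative to always predicting the mean of the known outcomes.

    1.0 means a perfect forecast; 0.0 means no better than the reference.
    """
    baseline = sum(outcomes) / len(outcomes)
    mse_ref = mean_squared_error([baseline] * len(outcomes), outcomes)
    if mse_ref == 0:
        raise ValueError("reference error is zero; skill score is undefined")
    return 1.0 - mean_squared_error(predictions, outcomes) / mse_ref


known = [1.0, 2.0, 3.0]
print(skill_score([1.0, 2.0, 3.0], known))  # perfect forecast
print(skill_score([2.0, 2.0, 2.0], known))  # same as the reference forecast
```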
Claims 4, 5, and 8 are rejected under 35 U.S.C. 103 as being unpatentable over Kaiser et al (One Model to Learn Them All, herein Kaiser), Glickman (US 20120209795 A1, herein Glickman) and van Merrienboer et al (Blocks and Fuel: Frameworks for deep learning, herein van Merrienboer).
Regarding claim 4,
The combination of Kaiser and Glickman teaches the computer-implemented method of claim 1, 
Thus far, the combination of Kaiser and Glickman does not explicitly teach situational information is added to the training data.
van Merrienboer teaches situational information is added to the training data. (van Merrienboer, Page 3, Section 3.2, Line 1 “Datasets are distributed in a wide range of formats.  Fuel simplifies dataset storage by converting all built-in datasets to annotated HDF5 files (The HDF Group, 1997-2015).  In addition to being an efficient format for large dataset that don’t fit into memory, HDF5 is easy to organize and document.  All of the data is stored in a single HDF5 file, with the following metadata attached:  - What are the data sources available? -  How are the data sources officially split? - Are some data sources unavailable for some splits?” In other words, metadata attached is situational information added to the training data.)
It would be obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine van Merrienboer into the teaching of Kaiser and Glickman.  This would result in situational information being added to the training data.
	One of ordinary skill in the art would be motivated to do so because adding situational data speeds up the process of training a neural network by providing additional context to the input.
Regarding claim 5,
The combination of Kaiser and Glickman teaches the computer-implemented method of claim 1 
Thus far, the combination of Kaiser and Glickman does not explicitly teach reordering a plurality of encapsulated data items into new permutations to create additional training examples.
van Merrienboer teaches reordering a plurality of encapsulated data items into new permutations to create additional training examples. (van Merrienboer, Page 3, Section  3.1, Line 1 “Fuel allows for different ways of iterating over these datasets, such as sequential or shuffled minibatches, support of in-memory and out-of-core datasets, and resampling (cross validation bootstrapping.)” In other words, iterating over these datasets, such as sequential or shuffled minibatches, is reordering a plurality of encapsulated data items into new permutations to create additional training examples.)
It would be obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine van Merrienboer into the teaching of Kaiser and Glickman.  This would result in reordering a plurality of encapsulated data items into new permutations to create additional training examples.
One of ordinary skill in the art would be motivated to do so because increasing the variety and permutations of data improves the effectiveness of the training so that the neural network is more effective over a wider range of possible input.
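For illustration only, reordering a set of encapsulated data items into new permutations can be sketched with the Python standard library. This sketch does not use the Fuel API described in van Merrienboer; the function name `shuffled_permutations`, the fixed seed, and the sample items are hypothetical.

```python
import random


def shuffled_permutations(training_set, count, seed=0):
    """Return `count` reordered copies of the training set.

    Each copy holds the same items in a shuffled order; the original
    list is left untouched. A fixed seed keeps the result repeatable.
    """
    rng = random.Random(seed)
    permutations = []
    for _ in range(count):
        reordered = list(training_set)
        rng.shuffle(reordered)
        permutations.append(reordered)
    return permutations


items = ["item_a", "item_b", "item_c", "item_d"]
for permutation in shuffled_permutations(items, count=3):
    print(permutation)
```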
Regarding claim 8,
The combination of Kaiser and Glickman teaches the computer-implemented method of claim 6, 
further comprising using a plurality of encapsulated data items to assemble a training set, (Kaiser, Page 2, Paragraph 4, Line 2 “To allow training on input data of widely different sizes and dimensions, such as images, sound waves and text, we need sub-networks to convert inputs into a joint representation space.  We call these sub-networks modality nets as they are specific to each modality (images, speech, text) and define transformations between these external domains and a unified representation.” In other words, input data of widely different sizes and dimensions, such as images, sound waves and text is a plurality of encapsulated data items and define transformations between these external domains and a unified representation is assemble a training set.)
 [reordering the training set to create additional training set permutations], and using the training set and the training set permutations to determine the weights and biases. (Kaiser, Page 1, Paragraph 1, Line 5 “In particular, this single model is trained concurrently on ImageNet, multiple translation tasks, image captioning (COCO dataset), a speech recognition corpus, and an English parsing task.” In other words, trained is using the training set to determine weights and biases.)
Thus far, the combination of Kaiser and Glickman, does not explicitly teach reordering the training set to create additional training set permutations.
Van Merrienboer teaches reordering the training set to create additional training set permutations. (van Merrienboer, Page 3, Section 3.1, Line 1 “Fuel allows for different ways of iterating over these datasets, such as sequential or shuffled minibatches, support of in-memory and out-of-core datasets, and resampling (cross validation bootstrapping.)” In other words, iterating over these datasets, such as sequential or shuffled minibatches, is reordering the training set to create additional training set permutations.)
It would be obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine van Merrienboer into the teaching of Kaiser and Glickman.  This would result in reordering the training set to create additional training set permutations, and using the training set and the training set permutations to determine the weights and biases.
One of ordinary skill in the art would be motivated to do so in order to increase the number of datasets thereby more effectively training the neural network by exposing it to a wider range of input data.
Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Kaiser et al (One Model to Learn Them All, herein Kaiser), Glickman (US 20120209795 A1, herein Glickman), and Liberty et al (US 20100274753 A1, herein Liberty).

Regarding claim 9,
The combination of Kaiser and Glickman teaches the computer-implemented method of claim 6, 
further comprising using a plurality of encapsulated data items to assemble a training set; (Kaiser, Page 2, Paragraph 4, Line 2 “To allow training on input data of widely different sizes and dimensions, such as images, sound waves and text, we need sub-networks to convert inputs into a joint representation space.  We call these sub-networks modality nets as they are specific to each modality (images, speech, text) and define transformations between these external domains and a unified representation.” In other words, input data of widely different sizes and dimensions, such as images, sound waves and text is a plurality of encapsulated data items and define transformations between these external domains and a unified representation is assemble a training set.)
[calculating, from other training data, proxy training data corresponding to missing data]; and 
using the [proxy training data] and the training set to determine the weights and biases. (Examiner notes that neural networks have weights and biases and a neural network is trained by determining weights and biases as a result of using a training set.)
Thus far, the combination of Kaiser and Glickman does not explicitly teach calculating, from other training data, proxy training data corresponding to missing data. The combination also does not teach using the proxy training data …to determine the weights and biases.
Liberty teaches calculating, from other training data, proxy training data corresponding to missing data; and using the proxy training data and the training set to determine the weights and biases. (Liberty, Page 1, Abstract, Line 1 “The present invention is directed to a method for inferring/estimating missing values in a data matrix (d(q,r) …” In other words, inferring/estimating missing values in a data matrix is calculating, from other training data, proxy training data corresponding to missing data. Examiner notes that once the missing data has been filled by the proxy data, the training data is used the same as any other training data by being input to the neural network to determine weights and biases during training.)
It would be obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Liberty into the teaching of Kaiser and Glickman.  This would result in calculating, from other training data, proxy training data corresponding to missing data; and using the proxy training data and the training set to determine the weights and biases.
One of ordinary skill in the art would be motivated to do so because using inferred proxy data to fill in missing data speeds up the process of training the neural network by giving more complete input data without gaps.
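As a minimal sketch of calculating proxy data from other training data, missing entries in a data matrix can be filled with the mean of the observed values in the same column. Column-mean imputation is a deliberately simple stand-in chosen for illustration; it is not the inference method disclosed in Liberty, and the function name `impute_column_means` is hypothetical.

```python
def impute_column_means(rows):
    """Replace None entries with the mean of the observed values in that column.

    Simple stand-in for illustration only; not the method of any cited reference.
    """
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        observed = [row[c] for row in rows if row[c] is not None]
        # A column with no observed values falls back to 0.0 (an assumption).
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [
        [means[c] if row[c] is None else row[c] for c in range(n_cols)]
        for row in rows
    ]


# None marks missing data in the training matrix.
matrix = [[1.0, None], [3.0, 4.0], [None, 8.0]]
filled = impute_column_means(matrix)
print(filled)
```

Once filled, the matrix has no gaps and can be supplied to training in the same way as any other training data.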
Claims 11-13 are rejected under 35 U.S.C. 103 as being unpatentable over Kaiser et al (One Model to Learn Them All, herein Kaiser), Glickman (US 20120209795 A1, herein Glickman) and Forman (US 8885928 B2, herein Forman).
Regarding claim 11,
Kaiser teaches a deep learning neural network apparatus for neural processing of a plurality of disparate data items and related disparate feature data, the apparatus comprising: (Kaiser, Page 1, Paragraph 1, Line 2  “But for each problem, getting a deep model to work well involves research into the architecture and a long period of tuning.”  In other words, deep model is deep learning network. And, Page 1, Paragraph 1, Line 5 “In particular, this single model is trained concurrently on ImageNet, multiple translation tasks, image captioning (COCO dataset), a speech recognition corpus, and an English parsing task.” In other words, trained concurrently is neural processing;  on ImageNet, multiple translation tasks, image captioning (COCO dataset), a speech recognition corpus, and an English parsing task is a plurality of disparate data items and related disparate feature data.)
[an input data processor] configured for storage and delivery to a [merge processor] of plural disparate data elements, the plural disparate data elements defining plural ranges; (Kaiser, Page 1, Paragraph 1, Line 5 “In particular, this single model is trained concurrently on ImageNet, multiple translation tasks, image captioning (COCO dataset), a speech recognition corpus, and an English parsing task.” In other words, on ImageNet, multiple translation tasks, image captioning (COCO dataset), a speech recognition corpus, and an English parsing task is for storage and delivery of a plurality of disparate data items and related disparate feature data.)
a [[rescaling] processor] configured to accept as input the ranges from a [range repository] and plural disparate data feature elements from a [feature processor], the plural disparate data feature elements corresponding to the disparate data elements, (Kaiser, Page 2, Paragraph 4, Line 2, “To allow training on input data of widely different sizes and dimensions, such as images, sound waves and text, we need sub-networks to convert inputs into a joint representation space.”  In other words, input data of widely different sizes and dimensions is plural disparate feature data.)
the [[rescaling] processor] being configured to [rescale] the disparate data feature elements to correspond with the disparate data elements; (Kaiser, Page 2, Paragraph 4, Line 2, “To allow training on input data of widely different sizes and dimensions, such as images, sound waves and text, we need sub-networks to convert inputs into a joint representation space.”  In other words, input data of widely different sizes and dimensions is disparate data feature elements.)
a [merge processor] configured to accept as input the disparate data elements from the [input data processor] and the [rescaled] disparate data feature elements from the [rescaling processor] and to produce therefrom a single encapsulated data type representative of the plural disparate data feature elements and the disparate data feature elements; (Kaiser, Page 2, Paragraph 4, Line 2 “To allow training on input data of widely different sizes and dimensions, such as images, sound waves and text, we need sub-networks to convert inputs into a joint representation space. ” In other words, transformation into a joint representation space is combining into a training set the single encapsulated data item.)
a [neural network preprocessor] configured to accept as input the single encapsulated data type from the [merge processor] and a set of trained neural net parameter values from a [parameter repository], (Kaiser, Page 4, Section 2.4, Line 1 “The body of the MultiModel consists of 3 parts: the encoder that only processes the inputs, the mixer that mixes the encoded inputs with previous outputs (autoregressive part), and a decoder that processes the inputs and the mixture to generate new outputs.” In other words, processes the inputs is accept as input.) 
the [neural network preprocessor] further configured to produce therefrom neural network weights, biases, and input values; (Kaiser, Page 4, Section 2.4, Line 1 “The body of the MultiModel consists of 3 parts: the encoder that only processes the inputs, the mixer that mixes the encoded inputs with previous outputs (autoregressive part), and a decoder that processes the inputs and the mixture to generate new outputs.” In other words, the combination of the encoder and the autoregression part of the mixer produces the network weights, biases, and input values. Examiner notes that neural networks have weights and biases and training a neural network comprises determining the weights and biases as a result of processing the training data.)
and a [neural network processor] operatively connected to the neural network preprocessor and configured to accept as input the weights, biases, and input values, and perform multilayer feed forward computational processing, producing therefrom a neural network result. (Kaiser, Page 4, Section 2.4, Line 1 “The body of the MultiModel consists of 3 parts: the encoder that only processes the inputs, the mixer that mixes the encoded inputs with previous outputs (autoregressive part), and a decoder that processes the inputs and the mixture to generate new outputs.” and, Page 5, Section 2.4 Paragraph 2, Line 5 “The decoder consist of 4 blocks of convolutions and attention, with a mixture-of-experts layer in the middle. Crucially, the convolutions in the mixer and decoder are padded on the left, so they can never access any information in the future.  This allows the model to be autoregressive, and this convolutional autoregressive generation scheme offers large receptive fields over the inputs and past outputs, which are capable of establishing long term dependencies.”  In other words, the decoder accepts as input, weights, biases and input values, and performs multilayer feed forward computational processing to produce the neural network result.)
Kaiser, thus far, does not explicitly teach rescaling.  Kaiser also, thus far, does not explicitly teach implementing the functions of the neural network with a plurality of processors and hardware components such as: input data processor, rescaling processor, range repository, parameter repository, feature processor, merge processor, neural network preprocessor, neural network processor.
Glickman teaches rescaling. (Glickman, Fig. 12, Block 1240; and Page 13, Column 2, Paragraph [0218] “Step 1240: The feature data is scaled. Data scaling typically involves changing the attribute weights.  Scaling is typically performed to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges.” In other words scaling is rescaling.) 
	It would be obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine Glickman into the teaching of Kaiser. This would result in the rescaling [processor] being configured to rescale the disparate data feature elements to correspond with the disparate data elements.
	One of ordinary skill in the art would be motivated to do this because scaling is required in order to allow multiple disparate data to be combined into one representation. This is because datasets are typically fixed in size for ease of processing. Varying sizes of data would need to be scaled to fit into the allocated range in order to proceed with processing.
The combination of Kaiser and Glickman, thus far, still does not explicitly teach implementing the functions of the neural network with a plurality of processors and hardware components such as: input data processor, rescaling processor, range repository, parameter repository, feature processor, merge processor, neural network preprocessor, neural network processor.
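Forman teaches implementing the functions of the neural network with a plurality of processors and hardware components. (Forman, Column 9, Line 14 “In addition, although general-purpose programmable devices have been described above, in alternate embodiments one or more special-purpose processors or computers instead (or in addition) are used.  In general, it should be noted that, except as expressly noted otherwise, any of the functionality described above can be implemented in software, hardware, firmware or any combination of these, with the particular implementation being selected based on known engineering tradeoffs.” In other words, implemented in software, hardware, firmware, or any combination of these is implemented with a plurality of processors and hardware components.)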
It would be obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Forman into the teaching of Kaiser and Glickman.  This would result in the neural network where the functions are implemented with a plurality of processors and hardware components.
One of ordinary skill in the art would be motivated to do so because implementing the neural network with a plurality of processors and hardware components would increase the speed of processing the output for the neural network.
Regarding claim 12,
The combination of Kaiser, Glickman, and Forman teaches the apparatus of claim 11, 
wherein the disparate data elements include image elements of a first data type, video elements of a second data type, and sound elements of a third data type, and the ranges include an image range, a video range, and a sound range. (Kaiser, Page 2 , Paragraph 2, Line 3  “Concretely, we train the MultiModel simultaneously on the following 8 corpora: (1) WSJ speech corpus; (2) ImageNet dataset (3) COCO image captioning dataset (4) WSJ parsing dataset (5) WMT English-German translation corpus (6) The reverse of the above: German-English translation. (7) WMT English-French translation corpus (8) The reverse of the above: German-French translation.” In other words, ImageNet dataset is image elements, WSJ speech corpus is sound elements.  Examiner notes that ImageNet is also used for training video, video being a series of images. Kaiser, Page 2, Paragraph 2, Line 1 “In this work, we take a step toward positively answering the above question by introducing the MultiModel architecture, a single deep-learning model that can simultaneously learn multiple tasks from various domains.” Examiner notes that Kaiser does not limit the application of MultiModel by the specific tasks presented in Kaiser but instead claims multiple tasks from various domains. Since video is a sequence of images, and ImageNet is a commonly used data source for video, and Kaiser specifically cites ImageNet, ImageNet dataset is video elements.)
Regarding claim 13,
The combination of Kaiser, Glickman, and Forman teaches the apparatus of claim 11, 
further comprising an output processor operatively connected to the neural network processor, the output processor receiving the neural network result and initiating downstream processing. (Kaiser, Page 4, Section 2.4, Line 1 “The body of the MultiModel consists of 3 parts: the encoder that only processes the inputs, the mixer that mixes the encoded inputs with previous outputs (autoregressive part), and a decoder that processes the inputs and the mixture to generate new outputs.” and, Page 5, Section 2.4, Paragraph 2, Line 5 “The decoder consist of 4 blocks of convolutions and attention, with a mixture-of-experts layer in the middle. Crucially, the convolutions in the mixer and decoder are padded on the left, so …” In other words, the decoder accepts as input weights, biases, and input values, and performs multilayer feed forward computational processing to produce the neural network result and initiate downstream processing. And, Forman, Column 9, Line 14 “In addition, although general-purpose programmable devices have been described above, in alternate embodiments one or more special-purpose processors or computers instead (or in addition) are used.  In general, it should be noted that, except as expressly noted otherwise, any of the functionality described above can be implemented in software, hardware, firmware or any combination of these, with the particular implementation being selected based on known engineering tradeoffs.” The combination with Forman results in an output processor to perform the function of output.)
Claims 14 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Kaiser et al (One Model to Learn Them All, herein Kaiser), Glickman (US 20120209795 A1, herein Glickman), Forman (US 8885928 B2, herein Forman), Liberty et al (US 20100274753 A1, herein Liberty) and van Merrienboer et al (Blocks and Fuel: Frameworks for deep learning, herein van Merrienboer).
Regarding Claim 14,
The combination of Kaiser, Glickman, and Forman teaches the apparatus of claim 11, further comprising 
a training subsystem, the training subsystem comprising: a [known data repository] configured to store a plurality of known disparate data elements; (Kaiser, Page 2, Paragraph 4, Line 2 “To allow training on input data of widely different sizes and dimensions, such as images, sound waves and text, we need sub-networks to convert inputs into a joint representation space.  We call these sub-networks modality nets as they are specific to each modality (images, speech, text) and define transformations between these external domains and a unified representation.” In other words, input data of widely different sizes and dimensions, such as images, sound waves and text, is a plurality of known disparate data elements.)
a [known feature repository] configured to store a plurality of known disparate feature data elements corresponding to the known plural disparate data elements; (Kaiser, Page 2, Paragraph 4, Line 2 “To allow training on input data of widely different sizes and dimensions, such as images, sound waves and text, we need sub-networks to convert inputs into a joint representation space.  We call these sub-networks modality nets as they are specific to each modality (images, speech, text) and define transformations between these external domains and a unified representation.” In other words, input data of widely different sizes and dimensions, such as images, sound waves and text, is a plurality of known disparate data elements.)
a [known outcome repository] storing known outcomes corresponding to the known disparate data elements and the known disparate situational data elements; (Kaiser, Page 2, Paragraph 4, Line 2 “To allow training on input data of widely different sizes and dimensions, such as images, sound waves and text, we need sub-networks to convert inputs into a joint representation space.  We call these sub-networks modality nets as they are specific to each modality (images, speech, text) and define transformations between these external domains and a unified representation.” In other words, input data of widely different sizes and dimensions, such as images, sound waves and text, is a plurality of known disparate data elements.)
[a missing data replacement processor configured to create values for any missing feature data;] 
a training set creation processor configured to accept as input at least one single [encapsulated data training item in combination with situational information] about the at least one single encapsulated data training item and known outcomes to create training set examples (Kaiser, Page 2, Paragraph 4, Line 2 “To allow training on input data of widely different sizes and dimensions, such as images, sound waves and text, we need sub-networks to convert inputs into a joint representation space.  We call these sub-networks modality nets as they are specific to each modality (images, speech, text) and define transformations between these external domains and a unified representation.” In other words, disparate input data is accepted. The combination with Forman results in the training set creation processor); 
and a neural network training processor configured to accept as input the training examples and known outcomes (The combination of Kaiser and Glickman teaches the training examples and known outcomes. The combination with Forman would result in the neural network training processor to perform the function), 
to perform back propagation training processing to determine optimal weights and biases, and to store the optimal weights and biases in a training neural network parameter values subsystem. (Kaiser, Page 1, Section 1, Line 2 “Convolutional networks excel at tasks …” Examiner notes that both convolutional networks and recurrent neural networks use backpropagation as a step in determining optimal weights and biases while training the neural network.) 
The combination of Kaiser, Glickman, and Forman, thus far, also does not teach a known data repository, a known feature repository, a known outcome repository, a missing data replacement processor configured to create values for any missing feature data, or an encapsulated data training item in combination with situational information. 
Glickman teaches a known data repository. (Glickman, Page 2, Figure 1a, Block 70 Input Content. In other words, Input Content is the known data repository.)
It would be obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine Glickman into the teaching of Kaiser, Glickman, and Forman.  This would result in a training subsystem, the training subsystem comprising: a known data repository configured to store a plurality of known disparate data elements.
One of ordinary skill in the art would be motivated to do this in order to train the neural network with known data. 
Thus far, the combination of Kaiser, Glickman, and Forman still does not teach: a known feature repository, a known outcome repository, a missing data replacement processor configured to create values for any missing feature data, encapsulated data training item in combination with situational information.
Glickman teaches a known feature repository. (Glickman, Page 2, Figure 1a, Block 10 Content with Demographic Information. In other words, Content with Demographic Information is the known feature repository.)
It would be obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine Glickman into the teaching of Kaiser, Glickman, and Forman.  This would result in a known feature repository configured to store a plurality of known disparate feature data elements corresponding to the known plural disparate data elements.
One of ordinary skill in the art would be motivated to do so in order to improve the training by including disparate feature data with the training set.  This would speed up the training.
Thus far, the combination of Kaiser, Glickman, and Forman still does not teach: a known outcome repository, a missing data replacement processor configured to create values for any missing feature data, encapsulated data training item in combination with situational information.
Glickman teaches a known outcome repository. (Glickman, Page 2, Figure 1a, Block 90 Content with Predicted Demographic Information. In other words, Content with Predicted Demographic Information is the known outcome repository.)
It would be obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine Glickman into the teaching of Kaiser, Glickman, and Forman.  This would result in a known outcome repository storing known outcomes corresponding to the known disparate data elements and the known disparate situational data elements.
Thus far, the combination of Kaiser, Glickman, and Forman still does not teach: a missing data replacement processor configured to create values for any missing feature data, encapsulated data training item in combination with situational information.
Liberty teaches a missing data replacement … configured to create values for any missing feature data. (Liberty, Page 1, Abstract, Line 1 “The present invention is directed to a method for inferring/estimating missing values in a data matrix (d(q,r) …” In other words, inferring/estimating missing values in a data matrix is missing data replacement to create values for any missing feature data. The combination of Kaiser with Forman results in the function being performed by a processor.)
It would be obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine Liberty into the teaching of Kaiser, Glickman, and Forman.  This would result in a missing data replacement processor configured to create values for any missing feature data.
One of ordinary skill in the art would be motivated to do so in order to improve the effectiveness of the training by replacing missing data. This would have the combined effect of providing more data to use for training thereby improving the neural network and by speeding up the training.
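For illustration only, the kind of missing-value replacement discussed above can be sketched as follows. This is a minimal column-mean imputation in Python and is not the inference method disclosed in Liberty; the function name `fill_missing_with_column_mean`, the use of `None` to mark a missing entry, and the 0.0 fallback are assumptions made for this example.

```python
from typing import List, Optional


def fill_missing_with_column_mean(
    rows: List[List[Optional[float]]],
) -> List[List[float]]:
    """Replace each missing entry (None) with the mean of the known values in its column."""
    if not rows:
        return []
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        known = [row[c] for row in rows if row[c] is not None]
        # Hypothetical fallback: a column with no known values is filled with 0.0.
        means.append(sum(known) / len(known) if known else 0.0)
    return [
        [means[c] if row[c] is None else row[c] for c in range(n_cols)]
        for row in rows
    ]


# Three feature rows with two missing entries.
features = [[1.0, None], [3.0, 4.0], [None, 8.0]]
filled = fill_missing_with_column_mean(features)
```

After replacement every row is complete, so no training example has to be discarded for lack of a feature value.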
Thus far, the combination of Kaiser, Glickman, Forman and Liberty still does not teach: encapsulated data training item in combination with situational information.
van Merrienboer teaches a training set creation processor, encapsulated data training item in combination with situational information and neural network training processor. (van Merrienboer, Page 3, Paragraph 3.2, Line 1 “Datasets are distributed in a wide range of formats.  Fuel simplifies dataset storage by converting all built-in datasets to annotated HDF5 files (The HDF Group, 1997-2015).  In addition to being an efficient format for large datasets that don’t fit into memory, HDF5 is easy to organize and document.  All of the data is stored in a single HDF5 file, with the following metadata attached:  - What are the data sources available? -  How are the data sources officially split? - Are some data sources unavailable for some splits?” In other words, Fuel is a training set creation processor, “all of the data is stored in a single HDF5 file … with the following metadata attached” is in combination with situational information, and Fuel is a neural network training processor as well.)
It would be obvious to one of ordinary skill in the art, before the effective filing date of the invention to combine van Merrienboer into the teaching of Kaiser, Glickman, and Forman.  This would result in encapsulated data training item in combination with situational information.
One of ordinary skill in the art would be motivated to do so in order to improve the effectiveness of the training by consolidating training set creation into a training set creation processor thereby speeding up training. 
Regarding claim 15,
The combination of Kaiser, Glickman, Forman, Liberty, and van Merrienboer teaches the apparatus of claim 14, 
further comprising a [shuffle processor configured to accept as input the training result, shuffle one or more of the encapsulated samples, and re-submit the shuffled training result to the neural network training processor for iterative training]. 
The combination of Kaiser, Glickman, Forman, Liberty, and van Merrienboer, thus far, does not explicitly teach shuffle processor configured to accept as input the training result, shuffle one or more of the encapsulated samples, and re-submit the shuffled training result to the neural network training processor for iterative training.
van Merrienboer teaches shuffle processor configured to accept as input the training result, shuffle one or more of the encapsulated samples, and re-submit the shuffled training result to the neural network training processor for iterative training. (van Merrienboer, Page 3, Paragraph 3.1, Line 1 “Fuel allows for different ways of iterating over these datasets, such as sequential or shuffled minibatches, support of in-memory and out-of-core datasets, and resampling (cross validation bootstrapping.)” In other words, shuffled minibatches is shuffle processor, iterating over these datasets is shuffle one or more of the encapsulated samples, resampling is re-submit the shuffled training, and iterating over these datasets is iterative training.)
It would be obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine van Merrienboer into the teaching of Kaiser, Glickman, Forman, Liberty, and van Merrienboer.  This would result in a shuffle processor configured to accept as input the training result, shuffle one or more of the encapsulated samples, and re-submit the shuffled training result to the neural network training processor for iterative training.
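For illustration only, iterating over a dataset in shuffled minibatches across repeated passes can be sketched as follows. This is a standard-library Python sketch and does not use Fuel's API; the function name `shuffled_minibatches` and its parameters are assumptions made for this example.

```python
import random
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def shuffled_minibatches(
    samples: Sequence[T], batch_size: int, epochs: int, seed: int = 0
) -> Iterator[List[T]]:
    """Yield minibatches, reshuffling the sample order before every epoch."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)  # new sample order for this pass
        for start in range(0, len(order), batch_size):
            yield [samples[i] for i in order[start:start + batch_size]]


# Five samples, batches of two, two passes over the data.
batches = list(shuffled_minibatches(["a", "b", "c", "d", "e"], batch_size=2, epochs=2))
```

Each pass covers every sample exactly once, in a different order, which is the sense in which shuffled samples are re-submitted for iterative training.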
Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Kaiser et al (One Model to Learn Them All, herein Kaiser), Glickman (US 20120209795 A1, herein Glickman), Forman (US 8885928 B2, herein Forman) and Kutsuna et al (US 20150279061 A1, herein Kutsuna).
Regarding Claim 16,
The combination of Kaiser, Glickman, and Forman teaches the apparatus of claim 11,
further comprising a [medical picture archiving and communication system] configured to store output and a feature processor configured to provide data from individual patient episodes of care.  (Examiner notes, the feature processor is not part of the claimed deep learning network apparatus and the function does not result in a structural difference, hence it is not being given patentable weight.)
Thus far, the combination of Kaiser, Glickman, and Forman does not explicitly teach a medical picture archiving and communication system.
Kutsuna teaches a medical picture archiving and communication system. (Kutsuna, Page 2, FIG. 1 [figure image omitted]. In other words, Block 200 “Picture Archiving and Communication Server” is the medical picture archiving and communication system.)
	It would be obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention to combine Kutsuna into the teaching of Kaiser, Glickman, and Forman.  This would result in a medical picture archiving and communication system.
	One of ordinary skill in the art would be motivated to do this in order to have a means for medical picture archival and reference.
Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Kaiser et al (One Model to Learn Them All, herein Kaiser), Glickman (US 20120209795 A1, herein Glickman), Forman (US 8885928 B2, herein Forman), and Yuzhakov et al (US 20170299772 A1, herein Yuzhakov).
Regarding claim 17, 
The combination of Kaiser, Glickman, and Forman teaches the apparatus of claim 11, 
wherein the input data processor takes as [input digitized representations of weather and wherein the feature processor is configured to provide location specific weather information]. (Examiner notes, the feature processor is not part of the claimed deep learning network apparatus and the function does not result in a structural difference, hence it is not being given patentable weight.)
The combination of Kaiser, Glickman, and Forman, thus far, does not explicitly teach input digitized representations of weather and wherein the feature processor is configured to provide location specific weather information.
Yuzhakov teaches input digitized representations of weather and wherein the feature processor is configured to provide location specific weather information. (Yuzhakov, Page 8, FIG. 7, Block 710 “creating by the machine learning module the normalized value of a weather forecasting parameter based on the normalized value of the weather measuring parameter, and the normalized value of the weather forecasting parameter is connected to the moment of forecasting after the measurement time” and Block 712 “based on the normalized value of the weather forecasting parameter, creating the weather forecast.” In other words, weather parameter is input digitized representations of weather, and creating the weather forecast is provide location specific weather information.)
It would be obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine Yuzhakov into the teaching of Kaiser, Glickman, and Forman.  This would result in the input data processor takes as input digitized representations of weather and wherein the feature processor is configured to provide location specific weather information.
One of ordinary skill in the art would be motivated to do so in order to be able to provide location specific weather forecasts.
Claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over Kaiser et al (One Model to Learn Them All, herein Kaiser), Glickman (US 20120209795 A1, herein Glickman), Forman (US 8885928 B2, herein Forman), and Peschmann (US 20050117700 A1, herein Peschmann).
Regarding claim 18,
The combination of Kaiser, Glickman, and Forman teaches the apparatus of claim 11, 
wherein the input data processor [takes as input radiographic images of contents of parcels and wherein the feature processor is configured to provide quantitative and qualitative data about the parcels] (Kaiser, Page 3, Figure 2, Input Encoder, In other words, input encoder is the input data processor.)
	The combination of Kaiser, Glickman, and Forman, thus far, does not explicitly teach takes as input radiographic images of contents of parcels and wherein the feature processor is configured to provide quantitative and qualitative data about parcels.
	Peschmann teaches takes as input radiographic images of contents of parcels and wherein the feature processor is configured to provide quantitative and qualitative data about the parcels. (Peschmann, Page 2, Paragraph [0019], Line 5 “The apparatus for identifying an object concealed within a container comprises a first stage inspection system having a Computed Tomography scanning system to generate a first set of data, a plurality of processors …” In other words, X-ray is radiographic image, and “second stage inspection system produces a second set of data having an X-ray signature” is provide quantitative and qualitative data about the parcels.)
	It would be obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Peschmann into the teaching of Kaiser, Glickman, and Forman.  This would result in the input data processor takes as input radiographic images of contents of parcels and wherein the feature processor is configured to provide quantitative and qualitative data about the parcels.
	One of ordinary skill in the art would be motivated to do so in order to effectively detect and provide quantitative and qualitative information from the radiographic images.
Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Kaiser et al (One Model to Learn Them All, herein Kaiser), Glickman (US 20120209795 A1, herein Glickman), Forman (US 8885928 B2, herein Forman), and Xiangbo (CN 107145596 A, herein Xiangbo).
Regarding claim 19,
The combination of Kaiser, Glickman, and Forman teaches the apparatus of claim 11, 
wherein the input data processor takes as input [information about competitors], the neural network parameter values correspond to [information about the competitors in past competitive matchups], and the output processor is configured to generate [performance predictions for the competitors] (Kaiser, Page 3, Figure 2, Input Encoder, In other words, input encoder is the input data processor.)
The combination of Kaiser, Glickman, and Forman, thus far, does not explicitly teach information about competitors, information about the competitors in past competitive matchups, and performance predictions for the competitors.
Xiangbo teaches information about competitors, information about the competitors in past competitive matchups, and performance predictions for the competitors. (Xiangbo, Page 1, Paragraph 1, Line 2 “The method comprises the following steps that a large quantity of electronic competitive race data on the internet is obtained based on a web crawler mechanism; the competitive race data is divided into testing data and training data, victory team numbers are marked; the deep neural network is established, and Batch Normalization is embedded among network layers; the competitive race data is used for training and testing the network to obtain network parameters; two teams needing competition prediction are selected, and competition data of the two teams is obtained, calculation is performed by using the network parameters, and the number of the team which most possibly wins victory is obtained.” In other words, competitive race data on the internet is information about the competitors, competitive race data is used for training and testing the network is information about the competitors in past competitive matchups, and calculation is performed by using the network parameters, and the number of the team which most possibly wins victory is obtained is performance predictions for the competitors.)
It would be obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine Xiangbo into the teaching of Kaiser, Glickman, and Forman.  This would result in the input data processor takes as input information about competitors, the neural network parameter values correspond to information about the competitors in past competitive matchups, and the output processor is configured to generate performance predictions for the competitors.
One of ordinary skill in the art would be motivated to do so in order to be able to make predictions about competitors in events.
Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over Kaiser et al (One Model to Learn Them All, herein Kaiser), Glickman (US 20120209795 A1, herein Glickman), Forman (US 8885928 B2, herein Forman), and Weng et al (US 20170008168 A1, herein Weng).
Regarding claim 20,
The combination of Kaiser, Glickman, and Forman teaches the apparatus of claim 11, 
wherein the plural disparate data elements are provided by plural disparate subsystems of an [autonomous vehicle], and the output processor is configured to generate [robotic movements of the autonomous vehicle]. (Kaiser, Page 1, paragraph 1 Line 5 “In particular, this single model is trained concurrently on ImageNet, multiple translation tasks, image captioning (COCO dataset), a speech recognition corpus, and an English parsing task.” In other words, concurrently on ImageNet, multiple translation tasks, image captioning (COCO dataset), a speech recognition corpus, and an English parsing task are a plurality of disparate data items and related disparate feature data.)
The combination of Kaiser, Glickman, and Forman, thus far, does not explicitly teach autonomous vehicle and robotic movements of the autonomous vehicle.
Weng teaches autonomous vehicle and robotic movements of the autonomous vehicle. (Weng, Page 15, Paragraph [0003], Line 11 “Similarly, a self-driven automotive vehicle may use similar optical sensors to “see” traffic patterns as they develop, and then use its effectors to control operation of the vehicle movement, by controlling steering, acceleration, braking and the like.” In other words, self-driven automotive vehicle is autonomous vehicle, and “use its effectors to control operation of the vehicle” is robotic movements of the autonomous vehicle.)
It would be obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Weng into the teaching of Kaiser, Glickman, and Forman. This would result in the plural disparate data elements are provided by plural disparate subsystems of an autonomous vehicle, and the output processor is configured to generate robotic movements of the autonomous vehicle.
One of ordinary skill in the art would be motivated to do this in order to generate and control robotic movements of an autonomous vehicle.
Allowable Subject Matter
Claim 3 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter: Claim 3 requires, among other things, that a predefined “keystone location,” which is related to known correct output, is combined with an encapsulated set of input data items into a training set. Kaiser teaches a joint representation space of disparate input data items.  Glickman teaches content items with demographic data combined with known correct outputs. van Merrienboer teaches disparate data combined into a single dataset with metadata.  None of the references explicitly discloses a predefined keystone location which is related to known correct output combined into a training set with an encapsulated set of data items.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to BART I RYLANDER whose telephone number is (571)272-8359.  The examiner can normally be reached on Monday - Thursday 8:00 to 5:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Amir Mehrmanesh, can be reached on 571-270-3351.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished 
/B.I.R./Examiner, Art Unit 4172

/JYOTI MEHTA/Primary Examiner, Art Unit 2182