DETAILED ACTION
Claims 1-13 are pending.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 10/11/2019 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.
Priority
Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-9 and 11-13 are rejected under 35 U.S.C. 103 as being unpatentable over Dalton et al. (US 2018/0039884 A1) in view of Nishimura et al. (US 2018/0032865 A1), and further in view of Catanzaro et al. (US 2017/0148433 A1).

Nishimura and Catanzaro were cited in the IDS.

Regarding claim 1, Dalton teaches the invention substantially as claimed including a computer-implemented method comprising: 
in a computing network comprising a number of nodes 1 to X having processors and memory ([0052] the computational units 610 can include one or more processors configured to perform one or more neural network layer operations; [0056] computational unit 610, memory devices 630), dividing neurons of a Convolutional Neural Network (CNN) between the number of nodes (Abstract; [0017] In an aspect, there is provided a system for training a neural network having a plurality of interconnected layers. The system includes a first set of neural network units and a second set of neural networking units. Each neural network unit in the first set is configured to compute parameter update data for one of a plurality of instances of a first portion of the neural network. Each neural network unit in the first set includes a communication interface for communicating its parameter update data for combination with parameter update data from another neural network unit in the first set. Each neural network unit in the second set is configured to compute parameter update data for one of a plurality of instances of a second portion of the neural network. Each neural network unit in the second set includes a communication interface for communicating its parameter update data for combination with parameter update data from another neural network unit in the second set; [0051] computational unit 610 can be configured to perform multiplications, accumulations, additions, subtractions, divisions, comparisons, matrix operations, down sampling, up sampling, convolutions, drop outs, and/or any other operation that may be used in a neural network process.); 
allocating a mini-batch of input data from among mini-batches of input data to a node of the number of nodes ([0026] During training, a large training input data set 225 can be split into smaller batches or smaller data sets, sometimes referred to as mini-batches 235. In some instances, the size and number of mini-batches can affect time and resource costs associated with training, as well as the performance of the trained neural network (i.e. how accurately the neural network classifies data). [0027] As illustrated by FIG. 3A, each mini-batch is fed through a neural network architecture 300. During the feed forward stage, one or more of the layers of the neural network process the mini-batch data using one or more parameters such as weights w.sub.1 and w.sub.2. During the back-propagation stage, parameter adjustments are calculated based on the back propagation of errors between the calculated and expected outputs. In some embodiments, these parameter updates are applied before the next mini-batch is processed by the neural network.); 
splitting for the node, from among the number of nodes, the mini-batch into a number of mini-batch sections X corresponding and equal to the number of nodes (Fig. 3A; Fig. 3B; [0028] To introduce parallelism, a neural network architecture can include multiple instances of a neural network with each instance computing data points in parallel. For example, FIG. 3B shows an example neural network architecture 310 including three instances of the neural network 300A, 300B, 300C. Rather than all nine of the data sets 215 of the mini-batch 235 being processed by a single neural network (as in FIG. 3A), the mini-batch 235 is split into three with each neural network instance 300A, 300B, 300C processing a different subset of the mini-batch.); 
at the node retaining a mini-batch section from among the split mini-batch sections which has a same number as the node and sending other mini-batch sections of the split mini-batch sections to corresponding other nodes ([0028] For example, FIG. 3B shows an example neural network architecture 310 including three instances of the neural network 300A, 300B, 300C. Rather than all nine of the data sets 215 of the mini-batch 235 being processed by a single neural network (as in FIG. 3A), the mini-batch 235 is split into three with each neural network instance 300A, 300B, 300C processing a different subset of the mini-batch; [0055] Depending on the architecture of the neural network, the input data sets 215 of a mini-batch can be streamed through the neural network layers; each of the instances 300A-300C retains its received subset (i.e., three of the nine data sets 215 of the mini-batch 235), as shown below);

[Image: media_image1.png, greyscale]

[Image: media_image2.png, greyscale]
 
multiplying the matrix by the neurons to provide output data sections having one section of output data per each mini-batch section of the split mini-batch sections ([0051] matrix operations; [0060] Based on the input data and any parameters p, the computational unit can, in some instances, be configured to compute or otherwise generate output data for a subsequent layer in the neural network and/or parameter update data.); 
at the node sending the output data sections corresponding to the other corresponding nodes to the corresponding nodes and combining the output data sections in the node so that the node has output data for entire of the split mini-batch sections ([0059] In some embodiments, the neural network unit 600 is configured to receive or access input data 640 from an input data set or from a previous neural network unit in the neural network instance. In some embodiments, the input data may be received via a communication interface 640 and/or a memory device 630. The input data may include values for processing during the feed forward phase and/or error propagation values for processing during the back propagation phase. [0060] Based on the input data and any parameters p, the computational unit can, in some instances, be configured to compute or otherwise generate output data for a subsequent layer in the neural network and/or parameter update data. In some embodiments, the neural network unit 600 is configured to communicate the output data via a communication interface 650 and/or a memory device 630. [0061] The neural network unit 600 includes at least one communication interface 620 for communicating parameter update data ∇p for combination with parameter update data from one or more other neural network units 600. In some embodiments, the at least one communication interface 620 provides an interface to a central node or another neural network unit 600. In some embodiments, the parameter update data from one neural network unit 600 can be communicated to another neural network unit 600 via the at least one communication interface and central node as part of a combined parameter update.).
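For illustration only, the splitting and retain-and-send steps mapped above can be sketched as follows. The sketch is not taken from Dalton or from the application; the function names `split_mini_batch` and `retain_and_send`, the use of plain Python lists, and the zero-based node index are assumptions made for the example. It uses the nine-data-set, three-instance example of Dalton's FIG. 3B.

```python
def split_mini_batch(mini_batch, num_nodes):
    """Split a mini-batch into num_nodes equal, contiguous sections."""
    if len(mini_batch) % num_nodes != 0:
        raise ValueError("mini-batch size must be divisible by the number of nodes")
    size = len(mini_batch) // num_nodes
    return [mini_batch[i * size:(i + 1) * size] for i in range(num_nodes)]


def retain_and_send(sections, node_index):
    """Return the section the node keeps and a map of destination node to section.

    node_index is zero-based here (an assumption); the claim numbers nodes 1 to X.
    """
    retained = sections[node_index]
    outgoing = {dest: sec for dest, sec in enumerate(sections) if dest != node_index}
    return retained, outgoing


# Nine data sets divided among three instances, as in the FIG. 3B example.
mini_batch = list(range(9))
sections = split_mini_batch(mini_batch, 3)
retained, outgoing = retain_and_send(sections, node_index=0)
```

In this sketch the number of sections equals the number of nodes, so each node keeps exactly one section and forwards one section to every other node.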

Dalton does not expressly teach sending other mini-batch sections of the split mini-batch sections to corresponding other nodes according to a number of the split mini-batch sections; 
collating at the node the split mini-batch sections at the node into a single matrix and multiplying the collated matrix.

However, Nishimura teaches sending other mini-batch sections of the split mini-batch sections to corresponding other nodes according to a number of the split mini-batch sections ([0075] Hereinafter, the number of pieces of training data collectively used by each GPU 12, i.e. each learning thread, will be referred to as a sub-batch number Nsubbatch. All pieces of training data are divided to be stored in the storages 13 of the respective nodes 1 before start of learning. Specifically, in each storage 13, pieces of training data, which are accessed by the corresponding GPU 12 for learning, are stored. [0076-82]).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Nishimura of associating portions of a sub-batch based on a number with the teachings of Dalton of assigning subsets of a mini-batch to separate computing instances. The modification would have been motivated by the desire to ensure that each node receives its respective portion of the mini-batch.

Neither Dalton nor Nishimura expressly teaches collating at the node the split mini-batch sections at the node into a single matrix and multiplying the collated matrix.

However, Catanzaro teaches collating at the node the split mini-batch sections at the node into a single matrix and multiplying the collated matrix ([0129] A common configuration used a minibatch of 512 on 8 GPUs. Embodiments of a training pipeline used herein binds one process to each GPU. These processes then exchange gradient matrices during the back-propagation by using all-reduce, which exchanges a matrix between multiple processes and sums the result so that at the end, each process has a copy of the sum of all matrices from all processes.; [0150] and [0151]: In embodiments, each utterance in a minibatch is mapped to a compute thread block (such as CUDA thread block). Since there are no dependencies between the elements of a column, all of them can be computed in parallel by the threads in a thread block. There are dependencies between columns, since the column corresponding to time-step t+1 cannot be computed before the column corresponding to time-step t. The reverse happens when computing the B matrix, when column corresponding to time-step t cannot be computed before the column corresponding to time-step t+1. Thus, in both cases, columns are processed sequentially by the thread block; (b) mapping (1220) the forward and backward passes to corresponding compute kernels. In embodiments, the compute kernels are GPU executed compute kernels, such as CUDA kernels. This is straightforward since there are no data dependencies between elements of a column. The kernel that does the backward pass also computes the gradient. However, since the gradients must be summed up based on the label values, with each character as key, data dependencies must be dealt with due to repeated characters in an utterance label).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Catanzaro with the teachings of Dalton and Nishimura to organize the results into a matrix to train the model. The modification would have been motivated by the desire to ensure that the matrix includes the results of all processes.
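For illustration only, the collating and multiplying limitation of claim 1 can be read as in the following sketch. The sketch reflects one possible reading of the claim language and is not drawn from any of the cited references; the helper names `matmul` and `collate_and_multiply`, the row-per-sample layout, and the features-by-neurons weight layout are assumptions made for the example.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def collate_and_multiply(received_sections, node_weights):
    """Stack the sections row-wise into one matrix, multiply once by the
    neurons held at this node, then cut the product back into one output
    section per input section."""
    collated = [row for section in received_sections for row in section]
    product = matmul(collated, node_weights)
    outputs, start = [], 0
    for section in received_sections:
        outputs.append(product[start:start + len(section)])
        start += len(section)
    return outputs


# Two sections of one sample each (two features per sample); this node is
# assumed to hold three neurons, stored as a 2 x 3 weight matrix.
received_sections = [[[1, 2]], [[3, 4]]]
node_weights = [[1, 0, 1],
                [0, 1, 1]]
output_sections = collate_and_multiply(received_sections, node_weights)
```

Under this reading, a single multiplication over the collated matrix replaces one multiplication per section, and each output section can then be returned to the node whose mini-batch it came from.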

Regarding claim 2, Dalton teaches wherein the method is used in a forward propagation phase in a fully connected layer of the CNN, for training the CNN ([0027] As illustrated by FIG. 3A, each mini-batch is fed through a neural network architecture 300. During the feed forward stage, one or more of the layers of the neural network process the mini-batch data using one or more parameters such as weights w.sub.1 and w.sub.2.).

Regarding claim 3, Dalton teaches wherein each node includes a memory and processing capability, including processing capability as an accelerator in a graphics processing unit (GPU) ([0052] For example, in some embodiments, a computational unit 610 may be implemented on and/or include a graphics processing unit (GPU), a central processing unit (CPU), one or more cores of a multi-core device, and the like.).

Regarding claim 4, Catanzaro teaches further comprising: adding a bias term to the combined output data sections ([0061-62]; provided below directly from Catanzaro’s description).

[Image: media_image3.png, greyscale]

Regarding claim 5, Dalton teaches further comprising, in a forward propagation of a test phase at a fully connected layer: 
creating new threads from a root solver thread at a main node, from among the number of nodes, executing a test iteration, each created new thread assigned to a different node, from among the number of nodes, the created new threads accessing memory addresses of neuron parameters held at the different nodes ([0032] FIG. 4 shows an example neural network architecture 400 having n layers 450. Each layer 450 in the architecture 400 can rely on one or more parameters p.sub.1 . . . p.sub.n to process input data. In some embodiments, a single layer may utilize a single parameter, multiple parameters, or no parameters. For example, a fully-connected layer (see for example FIG. 1) may have anywhere from a few parameters to millions of parameters in the form of interconnect weights. Another example is a layer which performs a constant computation and does not rely on any parameters; [0057] The computational unit 610, in some embodiments, is configured to access the memory device(s) 630 to access parameter values for the computation of a parameter update value, an error value, and/or a value for use in another layer. [0058] In some embodiments, the memory device(s) 630 are part of the neural network unit 600. In other embodiments, the memory device(s) 630 are separate from the neural network unit 600 and may be accessed via one or more communication interfaces. [0059] In some embodiments, the neural network unit 600 is configured to receive or access input data 640 from an input data set or from a previous neural network unit in the neural network instance. In some embodiments, the input data may be received via a communication interface 640 and/or a memory device 630. The input data may include values for processing during the feed forward phase and/or error propagation values for processing during the back propagation phase.).

Regarding claim 6, Nishimura teaches further comprising: the main node broadcasting input data for the test phase to the created new threads, and the created new threads computing an output of the fully connected layer before all the created new threads are joined ([0122-128] That is, the weight updating cycle is carried out by the AR thread, i.e. the CPU 11 of each node, to communicate the weight update quantities with the other nodes to update, based on the weight update quantities calculated by all the nodes 1 for each weight, the corresponding weight.).

Regarding claim 7, Catanzaro teaches wherein, in a backward propagation phase at a convolutional layer, each node receives input data gradients for the allocated mini-batch and sends the input data gradients to each node where a mini-batch section of the allocated mini-batch was processed and each node multiplies the input data gradients at each node with the collated split mini-batch sections from the forward propagation phase to produce parameter gradients at each node from all the split mini-batch sections ([0129] A standard technique of data-parallelism was used to train on multiple GPUs using synchronous Stochastic Gradient Descent (SGD). A common configuration used a minibatch of 512 on 8 GPUs. Embodiments of a training pipeline used herein binds one process to each GPU. These processes then exchange gradient matrices during the back-propagation by using all-reduce, which exchanges a matrix between multiple processes and sums the result so that at the end, each process has a copy of the sum of all matrices from all processes. [0139] The CTC loss function used to train the models has two passes: forward and backward, and the gradient computation involves element-wise addition of two matrices, α and β, generated during the forward and backward passes respectively. Finally, the gradients are summed using the character in the utterance label as the key, to generate one gradient per character. These gradients are then back-propagated through the network. The inputs to the CTC loss function are probabilities calculated by the softmax function which can be very small, so it is computed in log probability space for better numerical stability.).
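For illustration only, the all-reduce exchange described in Catanzaro [0129] (each process ends with a copy of the sum of all gradient matrices) can be simulated in a single process as follows. The function name `all_reduce_sum` and the list-of-lists matrix representation are assumptions made for the example; no real communication library is used or implied.

```python
def all_reduce_sum(matrices):
    """Simulate all-reduce: given one matrix per process, return one result
    per process, each an independent copy of the element-wise sum."""
    rows, cols = len(matrices[0]), len(matrices[0][0])
    total = [[sum(m[r][c] for m in matrices) for c in range(cols)]
             for r in range(rows)]
    # Every process receives its own copy of the summed matrix.
    return [[row[:] for row in total] for _ in matrices]


# Gradient matrices held by two hypothetical processes.
gradients = [
    [[1, 2], [3, 4]],
    [[10, 20], [30, 40]],
]
reduced = all_reduce_sum(gradients)
```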

Regarding claim 8, Dalton teaches wherein the input data gradients are stored at each node in a memory space used for the output data for the entire split mini-batch sections ([0056] The computational unit 610, in some embodiments, includes, is connected to, or is otherwise configured to access one or more memory devices 630. In some embodiments, the memory devices 630 may be internal/embedded memory blocks, memory logic array blocks, integrated memory devices, on-chip memory, external memory devices, random access memories, block RAMs, registers, flash memories, electrically erasable programmable read-only memory, hard drives, or any other suitable data storage device(s)/element(s) or combination thereof. The memory device(s) 630 can, in some embodiments, be configured to store parameter data, error propagation data, and/or any other data and/or instructions that may be used in the performance of one or more aspects of a neural network layer. [0104] As described herein or otherwise, in some embodiments, the method includes computing or otherwise performing data processing for each stage/layer to generate intermediate data sets which may be used in the next stage and/or provided for storage in a memory device for later processing).

Regarding claim 9, Catanzaro teaches further comprising using backward propagation to calculate data gradients, wherein each node multiplies the output data for the entire split mini-batch sections by the parameter gradients to provide output data gradients and the output data gradients corresponding to the other corresponding nodes are sent to the corresponding nodes so that each node holds the output data gradients for the entire split mini-batch sections ([0129] A standard technique of data-parallelism was used to train on multiple GPUs using synchronous Stochastic Gradient Descent (SGD). A common configuration used a minibatch of 512 on 8 GPUs. Embodiments of a training pipeline used herein binds one process to each GPU. These processes then exchange gradient matrices during the back-propagation by using all-reduce, which exchanges a matrix between multiple processes and sums the result so that at the end, each process has a copy of the sum of all matrices from all processes. [0139] The CTC loss function used to train the models has two passes: forward and backward, and the gradient computation involves element-wise addition of two matrices, α and β, generated during the forward and backward passes respectively. Finally, the gradients are summed using the character in the utterance label as the key, to generate one gradient per character. These gradients are then back-propagated through the network. The inputs to the CTC loss function are probabilities calculated by the softmax function which can be very small, so it is computed in log probability space for better numerical stability.).

Regarding claim 11, Catanzaro teaches wherein the CNN is a Deep Neural Network, DNN ([0043] Feed-forward neural network acoustic models were explored more than 20 years ago. Recurrent neural networks and networks with convolution were also used in speech recognition around the same time. More recently, deep neural networks (DNNs) have become a fixture in the ASR pipeline with almost all state-of-the-art speech work containing some form of deep neural network. Convolutional networks have also been found beneficial for acoustic models. Recurrent neural networks, typically LSTMs, are just beginning to be deployed in state-of-the art recognizers and work well together with convolutional layers for the feature extraction. Models with both bidirectional and unidirectional recurrence have been explored as well.).

Regarding claim 12, it is a system-type claim having limitations similar to those of claim 1 above. Therefore, it is rejected under the same rationale. Further, the additional limitations “a processor; and a memory having instructions stored thereon, the instructions when executed by the apparatus implementing a node among the number of nodes, causing the node to control operations including” are taught by Dalton in at least [0105] “Systems and methods of the described embodiments may be capable of being distributed in a computer program product including a physical, non-transitory computer readable medium that bears computer usable instructions for one or more processors.”

Regarding claim 13, it is a media/product-type claim having limitations similar to those of claim 1 above. Therefore, it is rejected under the same rationale. See Dalton’s [0019].

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Dalton, Nishimura, and Catanzaro, as applied to claim 1, and further in view of Gokmen (US 9,646,243 B1).

Gokmen was cited in the IDS.

Regarding claim 10, neither Dalton, Nishimura, nor Catanzaro expressly teaches wherein the bias term is only synchronized at the fully connected layer before the neuron parameters are updated.
Gokmen teaches wherein the bias term is only synchronized at the fully connected layer before the neuron parameters are updated (Col. 9, lines 35-57: The data values for each layer in the CNN is typically represented using matrices (or tensors in some examples) and computations are performed as matrix computations. The indexes (and/or sizes) of the matrices vary from layer to layer and network to network, as illustrated in FIG. 4. Different implementations orient the matrices or map the matrices to computer memory differently. Referring to FIG. 4, in the example CNN illustrated, each level is a matrix of neuron values, as is illustrated by matrix dimensions for each layer of the neural network. The values in a matrix at a layer are multiplied by connection strengths, which are in a transformation matrix. This matrix multiplication scales each value in the previous layer according to the connection strengths, and then summed. A bias matrix is then added to the resulting product matrix to account for the threshold of each neuron in the next level. Further, an activation function is applied to each resultant value, and the resulting values are placed in the matrix for the next layer. In an example, the activation function can be rectified linear units, sigmoid, or tanh( ). Thus, as FIG. 4 shows, the connections between each layer, and thus an entire network, can be represented as a series of matrices. Training the CNN includes finding proper values for these matrices.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Gokmen with the teachings of Dalton, Nishimura, and Catanzaro to add the bias after all parameters have been updated. The modification would have been motivated by the desire to shift the activation functions.
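For illustration only, the layer computation described in the cited Gokmen passage (Col. 9, lines 35-57: multiply by connection strengths, add a bias, apply an activation function) can be sketched as follows. The function name `dense_forward`, the features-by-neurons weight layout, and the choice of rectified linear units from among the activations Gokmen lists are assumptions made for the example.

```python
def dense_forward(inputs, weights, bias):
    """Forward pass of one layer: inputs x weights, plus bias, then ReLU.

    inputs:  one row per sample, one column per input value
    weights: one row per input value, one column per neuron (assumed layout)
    bias:    one value per neuron
    """
    outputs = []
    for row in inputs:
        # Weighted sum for each neuron, shifted by that neuron's bias.
        pre_activation = [
            sum(x * w for x, w in zip(row, column)) + b
            for column, b in zip(zip(*weights), bias)
        ]
        # Rectified linear unit applied to each resulting value.
        outputs.append([max(0.0, value) for value in pre_activation])
    return outputs


layer_output = dense_forward(
    inputs=[[1.0, -2.0]],
    weights=[[1.0, 0.5],
             [1.0, -1.0]],
    bias=[0.5, 0.0],
)
```

The sketch shows only the forward computation; it does not model when or where the bias term is synchronized across nodes, which is the subject of claim 10.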

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JORGE A CHU JOY-DAVILA whose telephone number is (571)270-0692. The examiner can normally be reached Monday-Friday, 9:00am-5:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Meng-Ai T An, can be reached at (571)-272-3756. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/JORGE A CHU JOY-DAVILA/Primary Examiner, Art Unit 2195