Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Remarks
This Office Action is responsive to Applicants' Amendment filed on April 8, 2022, in which claims 1, 4, 11, 17, and 21-27 are amended. Claims 7-10, 13-16, and 18-20 are cancelled.  Claims 28-31 have been newly added.  Claims 1-6, 11-12, 17, and 21-31 are currently pending.

Information Disclosure Statement
The information disclosure statement (IDS) submitted on May 6, 2022 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.

Response to Arguments
Applicant’s arguments with respect to the rejection of claims 1-6, 9-12, 15-17, and 21-27 under 35 U.S.C. 103, based on the amendment, have been considered and are persuasive. However, the arguments are moot in view of the new ground of rejection set forth below.

Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA  35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claim 22 is rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA ), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA  35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.

With regard to claim 22, the amended claim limitation "wherein the neural network processing circuits comprise M MUXes and N MACs, where M is less than N, and wherein M is at least 1 and N is at least 2." lacks support in the specification.  FIG. 3 suggests that M may be less than N where N is the number of layers and M is the active layer.  The instant specification further states ([¶0064] "That is, rather than having N MUXes and N MACs, the die includes, for example, M MUXes and M MACs, where M<N").  The quoted passage describes equal numbers of MUXes and MACs; there is no disclosure of a configuration where the number of MUXes is less than the number of MACs.  Having at least two MAC units and at least one multiplexer is therefore interpreted as incorporating new matter into the claim, which does not have support in the original disclosure.


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


	Claims 1-2, 5-6, 11-12, 17, and 25 are rejected under 35 U.S.C. 103 as being unpatentable over Park (“3-D Stacked Synapse Array Based on Charge-Trap Flash Memory for Implementation of Deep Neural Networks”, 2018) in view of Huang (“LTNN: An Energy-efficient Machine Learning Accelerator on 3D CMOS-RRAM for Layer-wise Tensorized Neural Network”, 2017).

	Regarding claim 1, Park teaches An apparatus comprising: a die comprising non-volatile memory (NVM) elements formed in the die and arranged in a plurality of NAND blocks, each comprising a plurality of word lines; ([Abstract] "This paper proposes a synaptic device based on charge-trap flash memory...we also propose a 3-D stacked synapse array and present the structure, operation, and process methods" Flash memory interpreted as synonymous with non-volatile memory.)
	a plurality of neural network processing circuits formed in the die and configured to access synaptic weight values in parallel ([p. 420 §I] "A neuromorphic system is characterized by massively parallel architecture connecting myriad computing elements (neurons) and adaptive memory elements (synapses).  Each synapse has its own synaptic weight, which refers to the connection strength between neurons."  See also FIG. 7. [p. 425 §IV] "3D-NAND process techniques can be used to fabricate the proposed 3-D stacked synapse array, such as the punch and plug process for channel formation [30] and the gate replacement process for metal gate formation [31]" Synapse array interpreted as synonymous with plurality of neural network processing circuits.  Fabrication process describes die formation.  See also FIG. 10 for fabrication process showing formation in the die.)
	from the word lines of a particular NAND block ([p. 425 §IV] "The proposed array structure is similar to commercialized 3-D stacked NAND flash memory in which WLs are vertically stacked in both structures")
	and wherein each neural network layer is stored in a separate NAND block. ("The specific configuration based on the proposed synapse device is shown in Fig. 8. The synapse arrays corresponding to each layer of the DNN are stacked vertically" See FIG. 9(a) "WL connection design of 3-D stacked synapse array architecture and its selective operation method for each layer" [p. 424 §III]).
	However, Park does not explicitly teach perform neural network operations in parallel using the synaptic weight values,  the plurality of neural network processing circuits comprising multiplexers (MUXes) and multiply-accumulate (MAC) circuits,
	with the MUXes configured to route particular synaptic weight values to particular MAC circuits in accordance with a particular MUX connectivity configuration 
	and a MUX connectivity configuration circuit formed in the die and configured to determine the particular MUX connectivity configuration for different layers of a neural network. 

Huang, in the same field of endeavor, teaches perform neural network operations in parallel using the synaptic weight values, ([p. 283 §IIID] "Secondly, a tensorzaiton of weight matrix can decompose the big matrix into many small tensor-core matrices, which can effectively reduce the configuration time of RRAM. Lastly, the multiplication of small matrix can be performed in a highly parallel fashion on RRAM to speed-up the large neural network processing time")
	the plurality of neural network processing circuits comprising multiplexers (MUXes) and multiply-accumulate (MAC) circuits ([p. 283 §IIIC] "The detailed design of a tensor core is also shown in Fig. 3. In each tensor core, we store different slices of the 3-dimensional matrix into different RRAM-crossbars. Since only one 2D matrix is used at a time, two tensor core Multiplexers (MUX) are used" [p. 282§IIIA] "In one RRAM-crossbar, given the input probing voltage, the current on each bit-line (BL) is the multiplication-accumulation of current through each RRAM device on the BL" See also FIG. 3. RRAM crossbar interpreted as synonymous with multiply-accumulate circuit.)
	with the MUXes configured to route particular synaptic weight values to particular MAC circuits in accordance with a particular MUX connectivity configuration; ([p. 283 §IIIC] "In each tensor core, we store different slices of the 3-dimensional matrix into different RRAM-crossbars. Since only one 2D matrix is used at a time, two tensor core Multiplexers (MUX) are used so that only one matrix is connected to the input voltage as well as the output ADC. The TC selection module controls the input and output MUX according to i and j"  [p. 283 §IIID] "Secondly, a tensorzaiton of weight matrix can decompose the big matrix into many small tensor-core matrices, which can effectively reduce the configuration time of RRAM. Lastly, the multiplication of small matrix can be performed in a highly parallel fashion on RRAM to speed-up the large neural network processing time" FIG. 3 on p. 283 shows that the weights are passed through the RRAM to the multiplexers.  Huang explicitly teaches that the weight matrices are subdivided and routed through the input and output multiplexers.)
	and a MUX connectivity configuration circuit formed in the die and configured to determine the particular MUX connectivity configuration for different layers of a neural network; ([p. 283 Sec. III C.] "The detailed design of a tensor core is also shown in Fig. 3. In each tensor core, we store different slices of the 3-dimensional matrix into different RRAM-crossbars. Since only one 2D matrix is used at a time, two tensor core Multiplexers (MUX) are used so that only one matrix is connected to the input voltage as well as the output ADC. The TC selection module controls the input and output MUX according to i and j." FIG. 3 on p. 283 shows the MUX connectivity configuration circuit with respect to a particular hidden layer.).
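
For illustration only, the selection behaviour described in the quoted passages can be sketched in Python. The sketch is not taken from Huang or from the instant application: the function names, the list-based data layout, and the use of a single index (Huang's selection module uses i and j) are assumptions made here to keep the example small. It models a multiplexer that connects one stored 2D slice at a time, and a crossbar whose bit-line outputs are multiply-accumulate sums.

```python
# Illustrative sketch only. All names below are hypothetical and do not
# appear in Huang or in the application under examination.

def select_slice(tensor_core, index):
    """Model a MUX: connect exactly one stored 2D slice to the data path."""
    return tensor_core[index]

def crossbar_mac(slice_2d, voltages):
    """Model a crossbar: each bit-line (column) accumulates the products of
    the input voltages and the stored conductances (weights) in that column."""
    n_rows = len(slice_2d)
    n_cols = len(slice_2d[0])
    return [
        sum(voltages[r] * slice_2d[r][c] for r in range(n_rows))
        for c in range(n_cols)
    ]

# Two stored slices; only the selected one takes part in the computation.
tensor_core = [
    [[1, 0], [0, 1]],  # slice 0
    [[2, 1], [1, 3]],  # slice 1
]
inputs = [1, 2]
out_slice0 = crossbar_mac(select_slice(tensor_core, 0), inputs)
out_slice1 = crossbar_mac(select_slice(tensor_core, 1), inputs)
```

Changing only the selection index changes which weights reach the multiply-accumulate step, which is the sense in which the selection is read above as a connectivity configuration.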

Park and Huang are both directed towards a 3D stacked memory implementation of a neural network accelerator.  Therefore, Park and Huang are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the neural network accelerator in Park with that of Huang by using multiplexers and multiply-accumulate circuits in the accelerator.  Huang outlines a number of benefits on [p. 283 Sec. III D] including but not limited to (“the multiplication of small matrix can be performed in a highly parallel fashion on RRAM to speed-up the large neural network processing time”).

	Regarding claim 2, the combination of Park and Huang teaches The apparatus of claim 1, wherein the neural network processing circuits are configured as one or more of under-the-array circuits and next-to-the-array circuits. (Huang [p. 282 Sec. III B] "The proposed 3D CMOS-RRAM accelerator is shown in Fig. 2(a). This accelerator is composed of a top layer of wordlines, a bottom layer of CMOS circuits and vertical connection between both layers by RRAM" Bottom layer of CMOS circuits is interpreted as synonymous with under-the-array circuit.).

Regarding claim 5, the combination of Park and Huang teaches The apparatus of claim 1, wherein the neural network processing circuits are configured to perform backpropagation operations in parallel on the synaptic weight values. (Huang [p. 282 §IIB] "The TNN can be further fine tuned by backward propagation on the tensor cores" [p. 284 §V] "In this paper, we propose a 3D CMOS-RRAM accelerator for highly-parallel yet energy-efficient machine learning").

	Regarding claim 6, the combination of Park and Huang teaches The apparatus of claim 5, wherein the neural network processing circuits comprise: a plurality of synaptic weight determination circuits disposed in parallel (Huang [p. 282 §IIB] "We define a tensorized neural network (TNN) if the weight of the neural network can be represented in the tensor-train data format" Tensor train interpreted as synonymous with weight determination circuit.)
	and a plurality of synaptic weight update circuits disposed in parallel (Huang [p. 282 §IIB] "Then we adjust one tensor core and fix the rest tensor cores for the minimization of ||HG1G2...Gd − W||2. Finally, we iteratively perform the optimization of each tensor core until the error is small or the maximum iterative time reaches. The TNN can be further fine tuned by backward propagation on the tensor cores" Adjusting tensor core to minimize loss interpreted as synonymous with updating weights.).

	Claims 11-12 are substantially similar to claims 1-2.  Therefore, the rejection applied to claims 1-2 also applies to claims 11-12.

	Claim 17 is substantially similar to claim 1.  Therefore, the rejection applied to claim 1 also applies to claim 17.

	Regarding claim 25, the combination of Park and Huang teaches The method of claim 11, wherein the particular MUX connectivity configuration is loaded based on a relevant set of synaptic weights. (Huang [p. 283 Sec. III C.] "Since only one 2D matrix is used at a time, two tensor core Multiplexers (MUX) are used so that only one matrix is connected to the input voltage as well as the output ADC. The TC selection module controls the input and output MUX according to i and j" The indices i and j of the 2D matrix are explicitly taught as being synaptic weights.).


	Claims 3 and 23 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Park and Huang, and in further view of Garbin (US20200151550A1).

	Regarding claim 3, the combination of Park and Huang teaches The apparatus of claim 1.
	However, the combination of Park and Huang does not explicitly teach the neural network processing circuits are configured to perform feedforward neural network operations in parallel using the synaptic weight values.  

Garbin, in the same field of endeavor, teaches the neural network processing circuits are configured to perform feedforward neural network operations in parallel using the synaptic weight values. ([¶0003] "In DNNs, data flows from the input layer to the output layer without looping back; they are feedforward networks." [¶0011] "In a neural network circuit according to embodiments of the present disclosure, the weighted current components may be provided by driving multiple transistors in parallel."). 

	Park, Huang, and Garbin are all directed towards a 3D stacked memory implementation of a neural network accelerator.  Therefore, Park, Huang, and Garbin are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the neural network accelerator in Park and Huang with that of Garbin by having the neural network operate in a feedforward mode.  One of ordinary skill in the art would recognize that a feedforward neural network is the simplest form of neural network and a feedforward mode is well known.  The perceived intention of a stacked memory neural network accelerator is to increase processor throughput by increasing density.  In view of this, Garbin provides as a motivation for combination ([¶0071] "In example embodiments of the present disclosure a 3D NAND configuration provides the highest possible density option.").  

	Regarding claim 23, the combination of Park and Huang teaches The apparatus of claim 1.  
However, the combination of Park and Huang does not explicitly teach the MUX connectivity configuration circuit is configured to load the particular MUX connectivity configuration based on a relevant set of synaptic weights for each NAND block.  

Garbin, in the same field of endeavor, teaches the MUX connectivity configuration circuit is configured to load the particular MUX connectivity configuration based on a relevant set of synaptic weights for each NAND block. ([¶0072] "Also the reference signals for the reference pull-up and pull-down networks are provided as an input to the 3D NAND array 61. By the MAC operation, output signals are generated at the port OUT, which output signals are brought to the nodes of a next layer of the neural network." [¶0073] " the output signals of a particular layer may also be fed back to the input of a next layer, where these signals will act as the new input signals to be applied to this next layer. The output of the array should be stored, for example in a register 63. At the next clock cycle, the control unit 62 will provide the correct signals to the multiplexers"). 

	Park, Huang, and Garbin are all directed towards a 3D stacked memory implementation of a neural network accelerator.  Therefore, Park, Huang, and Garbin are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the neural network accelerator in Park and Huang with that of Garbin by having the neural network operate in a feedforward mode.  One of ordinary skill in the art would recognize that a feedforward neural network is the simplest form of neural network and a feedforward mode is well known.  The perceived intention of a stacked memory neural network accelerator is to increase processor throughput by increasing density.  In view of this, Garbin provides as a motivation for combination ([¶0071] "In example embodiments of the present disclosure a 3D NAND configuration provides the highest possible density option.").  

	Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Park, Huang, and Garbin, and in further view of Ma (US 2018/0075344 A1).

	Regarding claim 4, the combination of Park, Huang, and Garbin teaches The apparatus of claim 3, wherein the neural network processing circuits include multiplication circuits (Garbin [¶0003] "During inference (classification) mode, input data (image, sound track, etc.) are transformed by a series of Multiply Accumulate (MAC) operations, i.e. sums weighted by the synapses values, and non-linearity functions performed by the neurons. At the output layer, the active neuron will indicate the class of the input (classification)")
	configured for computing products of synaptic weight values and activation values; summation circuits configured to sum the products (Garbin [¶0047] "The weighted sum is a multiply accumulate (MAC) operation. In this calculation, a set of inputs VIN,i are multiplied by a set of weights Wi,j, and those values are summed to create a final result." Set of inputs Vin,i interpreted as synonymous with activation values.)
and rectified linear unit (RLU) and/or sigmoid function circuits configured to compute RLU and/or sigmoid functions from resulting values (Park [p. 423 §IID] "We also used a rectifier linear unit as an activation function").
	However, the combination of Park, Huang, and Garbin does not explicitly teach bias addition circuits configured to add a bias value to the sums. 

Ma, in the same field of endeavor, teaches bias addition circuits configured to add a bias value to the sums; ([¶0035] “Their activations can hence be computed with a matrix multiplication followed by a bias offset.").
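
For illustration only, the sequence of operations recited in claim 4 (products, summation, bias addition, then a rectified linear unit or sigmoid function) can be sketched in Python. This is a software sketch under stated assumptions, not a description of any circuit in Park, Huang, Garbin, or Ma; the function names and the example values are hypothetical.

```python
# Illustrative sketch only. Function names and example values are
# hypothetical and are not drawn from any cited reference.
import math

def weighted_sum(weights, activations):
    """Multiply each activation by its weight and sum the products (MAC)."""
    return sum(w * a for w, a in zip(weights, activations))

def neuron_output(weights, activations, bias, function="relu"):
    """Add a bias to the weighted sum, then apply the chosen function."""
    total = weighted_sum(weights, activations) + bias
    if function == "relu":
        return max(0.0, total)
    if function == "sigmoid":
        return 1.0 / (1.0 + math.exp(-total))
    raise ValueError("function must be 'relu' or 'sigmoid'")

positive = neuron_output([1.0, -2.0], [3.0, 1.0], 0.5)        # 3 - 2 + 0.5
clipped = neuron_output([1.0], [-2.0], 0.5)                    # -2 + 0.5 < 0
midpoint = neuron_output([1.0], [0.0], 0.0, function="sigmoid")
```

The bias is added after summation and before the non-linearity, which is the ordering reflected in the claim language and in the passage of Ma quoted above.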

	Park, Huang, Garbin, and Ma are all directed towards a memory based neural network accelerator.  Therefore, Park, Huang, Garbin, and Ma are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Park, Huang, and Garbin with the teachings of Ma by adding a bias to the sum.
A bias term is well known in the art and it would be obvious to one of ordinary skill in the art to use one.  This is further reinforced by Ma, who describes an analogous neural network accelerator and mentions as a motivation for combination with other arts regarding neural network accelerators ([¶0008] "Embodiments of the present disclosure are directed to a neural network hardware accelerator architecture and the operating method thereof capable of improving the performance and efficiency of a neural network accelerator").

	Claims 21 and 24 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Park and Huang, and in further view of Chang Huang (US10241837B2).

	Regarding claim 21, the combination of Park and Huang teaches The apparatus of claim 1, wherein the full MUX connectivity connects all neuron outputs of a previous layer to neurons of a next layer of the layers of the neural network, and wherein the partial MUX connectivity connects only some of the neuron outputs of the previous layer to neurons of the next layer (Huang [p. 281 §IIA] "the fully-connected layer is a special case of convolutional layer with kernel size 1 × 1, such tensorized weights can also be applied to other convolutional layers." [p. 285 §V] "3-layer neural network with two full-connected layer...hidden nodes are all fixed to 1024 with 1 hidden layers....4-layer neural network with 3 full-connected layer" Huang explicitly teaches both full and partial hidden layer connectivity and further explicitly teaches using a neural network with both fully and partially connected layers.  It would further be obvious to one of ordinary skill in the art that a layer that is not fully connected would be partially connected, and by definition would be a layer where not all neurons of a previous layer are connected to a next layer (as is shown in FIG. 1 of Huang, where H(L-1) is not connected to the first node of H(L)).).
	However, the combination of Park and Huang does not explicitly teach the MUX connectivity configuration circuit is configured to select between a partial MUX connectivity and a full MUX connectivity.

Chang Huang, in the same field of endeavor, teaches the MUX connectivity configuration circuit is configured to select between a partial MUX connectivity and a full MUX connectivity. ([Col. 24 l. 31-34] "the same set of calculation circuits can be used for different types of layers including convolution layer, pooling layer, upscale, ReLU or fully-connected layer. In some cases, different operations may share the same set of calculation circuits by using a multiplexer for controlling data paths or data flow in accordance with the operations."). 
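
For illustration only, the distinction between full and partial connectivity discussed above can be sketched in Python as a routing mask. The sketch is not taken from Chang Huang, Huang, or the instant application. In particular, the sliding-window pattern used for the partial case and the parameter `fan_in` are arbitrary assumptions chosen to give a concrete example; neither the claims nor the references prescribe that pattern.

```python
# Illustrative sketch only. The names and the partial-connectivity pattern
# are hypothetical assumptions, not taken from any cited reference.

def connectivity_mask(n_prev, n_next, mode, fan_in=2):
    """Return mask[j][i], True when previous-layer output i feeds neuron j."""
    if mode == "full":
        return [[True] * n_prev for _ in range(n_next)]
    if mode == "partial":
        # Assumed pattern: neuron j sees fan_in consecutive outputs from j.
        return [
            [(i - j) % n_prev < fan_in for i in range(n_prev)]
            for j in range(n_next)
        ]
    raise ValueError("mode must be 'full' or 'partial'")

def route(prev_outputs, mask):
    """Deliver to each next-layer neuron only the outputs its mask row selects."""
    return [
        [value for value, keep in zip(prev_outputs, row) if keep]
        for row in mask
    ]

prev_outputs = [10, 20, 30, 40]
full_routing = route(prev_outputs, connectivity_mask(4, 2, "full"))
partial_routing = route(prev_outputs, connectivity_mask(4, 2, "partial"))
```

Under the full setting every next-layer neuron receives all four outputs; under the partial setting each receives only a subset, while the same routing function is reused for both.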

	Park, Huang, and Chang Huang are all directed towards a 3D stacked memory implementation of a neural network accelerator.  Therefore, Park, Huang, and Chang Huang are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the neural network accelerator in Park and Huang with that of Chang Huang by using the multiplexer configuration circuit for both fully connected and partially connected neural network layers. Chang Huang teaches that this added flexibility broadens the scope of application of the neural network, and provides as further motivation for combination with other analogous arts ([Col. 7 l. 3-10] “the method and system provide an efficient data transmission between a main memory and a chip implements the parallel operations. The efficient data transmission may be achieved by dense parameter and input data packing. This data arrangement may also simplify instructions and reduce memory access. The parallel operations may include operations in a CNN layer and a smooth data pipelining or seamless dataflow between layers may be provided by data management").    

Regarding claim 24, the combination of Park and Huang teaches The method of claim 11, wherein the full MUX connectivity connects all neuron outputs of a previous layer to neurons of a next layer of the layers of the neural network, and wherein the partial MUX connectivity connects only some of the neuron outputs of the previous layer to neurons of the next layer. (Huang [p. 281 §IIA] "the fully-connected layer is a special case of convolutional layer with kernel size 1 × 1, such tensorized weights can also be applied to other convolutional layers." [p. 285 §V] "3-layer neural network with two full-connected layer...hidden nodes are all fixed to 1024 with 1 hidden layers....4-layer neural network with 3 full-connected layer" Huang explicitly teaches both full and partial hidden layer connectivity and further explicitly teaches using a neural network with both fully and partially connected layers.  It would further be obvious to one of ordinary skill in the art that a layer that is not fully connected would be partially connected, and by definition would be a layer where not all neurons of a previous layer are connected to a next layer (as is shown in FIG. 1 of Huang, where H(L-1) is not connected to the first node of H(L)).).
However, the combination of Park and Huang does not explicitly teach modifying the MUX connectivity configuration comprises changing the particular MUX connectivity configuration between a partial MUX connectivity and a full MUX connectivity.

	Chang Huang, in the same field of endeavor, teaches modifying the MUX connectivity configuration comprises changing the particular MUX connectivity configuration between a partial MUX connectivity and a full MUX connectivity. ([Col. 24 l. 31-34] "the same set of calculation circuits can be used for different types of layers including convolution layer, pooling layer, upscale, ReLU or fully-connected layer. In some cases, different operations may share the same set of calculation circuits by using a multiplexer for controlling data paths or data flow in accordance with the operations.").
	 
	Park, Huang, and Chang Huang are all directed towards a 3D stacked memory implementation of a neural network accelerator.  Therefore, Park, Huang, and Chang Huang are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the neural network accelerator in Park and Huang with that of Chang Huang by using the multiplexer configuration circuit for both fully connected and partially connected neural network layers. Chang Huang teaches that this added flexibility broadens the scope of application of the neural network, and provides as further motivation for combination with other analogous arts ([Col. 7 l. 3-10] “the method and system provide an efficient data transmission between a main memory and a chip implements the parallel operations. The efficient data transmission may be achieved by dense parameter and input data packing. This data arrangement may also simplify instructions and reduce memory access. The parallel operations may include operations in a CNN layer and a smooth data pipelining or seamless dataflow between layers may be provided by data management").    


	Claim 22 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Park and Huang, and in further view of Whatmough (US20200133831A1).

	Regarding claim 22, the combination of Park and Huang teaches The apparatus of claim 1.
	However, the combination of Park and Huang does not explicitly teach the neural network comprises N layers, and wherein the neural network processing circuits comprise M MUXes and N MACs, where M is less than N, and wherein M is at least 1 and N is at least 2.  

Whatmough, in the same field of endeavor, teaches The apparatus of claim 1, wherein the neural network comprises N layers, and wherein the neural network processing circuits comprise M MUXes and N MACs, where M is less than N, and wherein M is at least 1 and N is at least 2. ([¶0069] "Specifically, FIG. 8, illustrates a hardware arrangement 800 to implement a transpose, shown as an IM2COL transpose (FIG. 7). As shown in FIG. 8, arrangement 800 includes one or more IFM SRAMs, shown generally as 802, a transpose module 804, a temporary register 806, crossbar 809, which could be a multiplex (mux) crossbar, and module 820, which may be a SIMD unit or un-pipelined MAC array." See also FIG. 8.  It would be obvious to one of ordinary skill in the art that an array refers to more than one unit, while Whatmough shows that only a single multiplexer is needed.).

	Park, Huang, and Whatmough are all directed towards a memory based neural network accelerator.  Therefore, Park, Huang, and Whatmough are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Park and Huang with the teachings of Whatmough by using multiple MAC units for every multiplexer.  Whatmough teaches that a GEMM is a generic matrix multiplication operation commonly used in neural networks, and teaches that said operations can be performed through a MAC unit array ([¶0040] “A common approach to implement convolution in a CPU (central processing unit), a GPU (graphics processing unit) and dedicated hardware accelerators is to convert it into a generic matrix multiplication (GEMM) operation.” [¶0042] “In software, the GEMM operation is performed by calling a library function. In hardware, the GEMM operation is often implemented efficiently as a 2D MAC array”).
	
	Claims 26-28 and 31 are rejected under 35 U.S.C. 103 as being unpatentable over Huang in view of Li (US 20190019564 A1).

	Regarding claim 26, Huang teaches An apparatus, comprising: a die comprising non-volatile memory (NVM) elements; (See FIG. 2.  [p. 282 Sec. III B] "stacking non-volatile memories on top of microprocessors enables cost-effective heterogeneous integration")
	a plurality of neural network processing circuits formed in the die and configured to read synaptic weight values in parallel from a plurality of word lines of NVM elements of the die and perform neural network operations in parallel using the synaptic weight values; and a circuit formed on the die ([p. 282 Sec. III A] "In one RRAM-crossbar, given the input probing voltage, the current on each bit-line (BL) is the multiplication-accumulation of current through each RRAM device on the BL. Therefore, the RRAM-crossbar array can intrinsically perform the analog matrix-vector multiplication [17]. Given an input voltage vector...where ci,j is configurable conductance of the RRAM resistance Ri,j , which can represent real number of weight." [p. 283 Sec. III D] "the multiplication of small matrix can be performed in a highly parallel fashion on RRAM to speed-up the large neural network processing time" [p. 280 Sec. I] "the 3D CMOS-RRAM integration can further support more parallelism with higher I/O bandwidth in acceleration" Huang explicitly teaches that the RRAM accesses weights from the wordlines to perform multiplication and that the multiplication can be performed in a highly parallel fashion. Huang further teaches that the overall aim of the CMOS-RRAM integration circuit is to support higher parallelism through higher I/O bandwidth.)
	 and configured to perform an on-chip NVM fold operation to: read at least some of the synaptic weight values from a plurality of first word lines of the plurality of word lines, each of the first word lines comprising single-level-cell (SLC) NVM elements of a portion of the NVM configured to run in an SLC mode ([p. 281 §IIA] "A two-dimensional weight is folded into three-dimensional tensor and then decomposes into tensor cores G1,G2, ...Gd" FIG. 2 shows that the explicit word lines are expressed as a single NVM layer. FIG. 2 (b) shows that the word lines represent synaptic weights. Operation of single layer cells interpreted as running in single layer cell mode.)
	update the synaptic weight values read from the first word lines using at least one of the plurality of the neural network processing circuits, ([p. 281 Sec. II A] "To build a multi-layer neural network, we propose a layerwise training process based on stack auto-encoder for low rank tensor cores and high compression rate. An auto-encoder layer is to set the layer output T the same as input X and find an optimal weight to represent itself. For example, we need to train a tensorized weight W" Training the weight is interpreted as synonymous with updating the synaptic weight. Setting output T to input X is interpreted as updating the value of input X.).
	However, Huang does not explicitly teach  and store the updated synaptic weight values in a second word line of the plurality of word lines, the second word line comprising multi-level-cell (MLC) NVM elements of a portion of the NVM configured to run in an MLC mode.  

Li, in the same field of endeavor, teaches to store the updated synaptic weight values in a second word line of the plurality of word lines, the second word line comprising multi-level-cell (MLC) NVM elements of a portion of the NVM configured to run in an MLC mode. ([¶0196] "the MLC NVM matrix circuit 1900 is also configured to train the resistance of the MLC NVM storage circuits MLC-R.sub.00-MLC-R.sub.mn by supporting backwards propagation of a weight update according to the following formula:"). 

	Huang and Li are both directed towards memory-based neural network accelerators.  Therefore, Huang and Li are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the neural network accelerator in Huang with the multi-level cell NVM elements for neural network acceleration taught in Li. The combination would have been obvious because a person of ordinary skill in the art would be able to determine that while Huang does not explicitly teach multi-level memory cells, Huang implicitly teaches storing updated weight values in stacked memories.  Li is therefore introduced to reinforce and to implicitly teach storing updated weight values in stacked memory in the scope of a neural network accelerator that is interpreted as having similar design goals as the accelerator taught in Huang.  Li further supports the combination in ([¶0242] “The system memory chip 2608 could be connected to the dedicated MLC NVM matrix circuit chip 2602 through a dedicated local bus to improve performance. The dedicated MLC NVM matrix circuit chip 2602 could also be embedded into the SoC 2606 to save power and improve performance.”).
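
	For illustration only, the read, update, and store sequence recited in the fold limitation can be sketched as below. This is a simplified model and not an implementation disclosed by Huang, Li, or the application. It assumes that each SLC word line holds one bit plane of the weights, that cell k of the MLC word line stores the combined level for weight k, and that the update is an arbitrary caller-supplied function. The names read_weights and fold_slc_to_mlc are hypothetical.

```python
from typing import Callable, List


def read_weights(slc_word_lines: List[List[int]]) -> List[int]:
    """Combine bit planes read from several SLC word lines into integer weights.

    Assumption: word line w contributes bit w of every weight.
    """
    width = len(slc_word_lines[0])
    return [
        sum(line[k] << w for w, line in enumerate(slc_word_lines))
        for k in range(width)
    ]


def fold_slc_to_mlc(
    slc_word_lines: List[List[int]],
    update: Callable[[int], int],
) -> List[int]:
    """Read weights from SLC word lines, update them, and return one MLC word line.

    Each returned entry is the level stored in a single MLC cell; values are
    clamped to the number of levels that len(slc_word_lines) bits can encode.
    """
    levels = 1 << len(slc_word_lines)        # e.g. 2 SLC lines -> 4 MLC levels
    weights = read_weights(slc_word_lines)   # read step
    updated = [update(w) for w in weights]   # update step
    return [min(max(u, 0), levels - 1) for u in updated]  # store step


# Two SLC word lines of four cells each fold into one four-cell MLC word line.
slc = [[1, 0, 1, 0],   # bit 0 of each weight
       [0, 1, 1, 0]]   # bit 1 of each weight
mlc_word_line = fold_slc_to_mlc(slc, lambda w: w + 1)
```

	In this model the fold halves the number of word lines needed to hold the same weights, which is the density benefit usually associated with moving data from SLC to MLC storage.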

	Regarding claim 27, claim 27 is substantially similar to claim 26.  Therefore, the rejection applied to claim 26 also applies to claim 27.

	Regarding claim 28, the combination of Huang and Li teaches The method of claim 27, further comprising: performing neural network operations in parallel using the synaptic weight values, (Huang [p. 283 §IIID] "Secondly, a tensorzaiton of weight matrix can decompose the big matrix into many small tensor-core matrices, which can effectively reduce the configuration time of RRAM. Lastly, the multiplication of small matrix can be performed in a highly parallel fashion on RRAM to speed-up the large neural network processing time")
	wherein the neural network operations are performed in parallel by a plurality of neural network processing components formed within the die (Huang [p. 283 §IIID] "Secondly, a tensorzaiton of weight matrix can decompose the big matrix into many small tensor-core matrices, which can effectively reduce the configuration time of RRAM. Lastly, the multiplication of small matrix can be performed in a highly parallel fashion on RRAM to speed-up the large neural network processing time")
	the plurality of neural network processing components comprising multiplexers (MUXes) and multiply-accumulate (MAC) components, (Huang [p. 283 §IIIC] "The detailed design of a tensor core is also shown in Fig. 3. In each tensor core, we store different slices of the 3-dimensional matrix into different RRAM-crossbars. Since only one 2D matrix is used at a time, two tensor core Multiplexers (MUX) are used" [p. 282 §IIIA] "In one RRAM-crossbar, given the input probing voltage, the current on each bit-line (BL) is the multiplication-accumulation of current through each RRAM device on the BL" See also FIG. 3. RRAM crossbar interpreted as synonymous with multiply-accumulate circuit.)
	with the MUXes configured to route particular synaptic weight values to particular MAC circuits in accordance with a particular MUX connectivity configuration; (Huang [p. 283 §IIIC] "In each tensor core, we store different slices of the 3-dimensional matrix into different RRAM-crossbars. Since only one 2D matrix is used at a time, two tensor core Multiplexers (MUX) are used so that only one matrix is connected to the input voltage as well as the output ADC. The TC selection module controls the input and output MUX according to i and j"  [p. 283 §IIID] "Secondly, a tensorzaiton of weight matrix can decompose the big matrix into many small tensor-core matrices, which can effectively reduce the configuration time of RRAM. Lastly, the multiplication of small matrix can be performed in a highly parallel fashion on RRAM to speed-up the large neural network processing time" FIG. 3 on p. 283 shows that the weights are passed through the RRAM to the multiplexers.  Huang explicitly teaches that the weight matrices are subdivided and routed through the input and output multiplexers.)
	modifying the MUX connectivity configuration for a different layer of a neural network and then performing additional neural network operations; and wherein each neural network layer is stored in a separate NAND block of the die. (Huang [p. 283 Sec. III C. ] "The detailed design of a tensor core is also shown in Fig. 3. In each tensor core, we store different slices of the 3-dimensional matrix into different RRAM-crossbars. Since only one 2D matrix is used at a time, two tensor core Multiplexers (MUX) are used so that only one matrix is connected to the input voltage as well as the output ADC. The TC selection module controls the input and output MUX according to i and j." FIG. 3 on p. 283 shows the MUX connectivity configuration circuit with respect to a particular hidden layer.). 
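
	For illustration only, the multiplexer-selected multiply-accumulate behavior described in the cited passages of Huang can be sketched as below. This is a simplified numerical model and not Huang's circuit. It assumes ideal devices, models each RRAM crossbar as a matrix of conductances, and uses a single hypothetical index named select in place of the i and j selection signals. The names crossbar_mvm and tensor_core_output are hypothetical.

```python
from typing import List

Matrix = List[List[int]]


def crossbar_mvm(voltages: List[int], conductances: Matrix) -> List[int]:
    """Model one crossbar: the current on bit line j is sum_i V_i * c[i][j]."""
    rows = len(conductances)
    cols = len(conductances[0])
    return [
        sum(voltages[i] * conductances[i][j] for i in range(rows))
        for j in range(cols)
    ]


def tensor_core_output(core: List[Matrix], select: int,
                       voltages: List[int]) -> List[int]:
    """Model the MUX pair: only the selected 2D slice sees the input voltages
    and drives the output."""
    return crossbar_mvm(voltages, core[select])


# A tensor core holding two 2x2 slices, each stored in its own crossbar.
core = [
    [[1, 2], [3, 4]],   # slice 0
    [[0, 1], [1, 0]],   # slice 1
]
inputs = [1, 2]

out_slice0 = tensor_core_output(core, 0, inputs)
# Changing the selection reconnects the same inputs to a different slice.
out_slice1 = tensor_core_output(core, 1, inputs)
```

	Changing select between calls corresponds to modifying the MUX connectivity configuration before performing additional operations.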

	Regarding claim 31, claim 31 is substantially similar to claim 28.  Therefore, the rejection applied to claim 28 also applies to claim 31. 

	Claims 29 and 30 are rejected under 35 U.S.C. 103 as being unpatentable over Park, in view of Huang and in further view of Li. 

	Regarding claim 29, Huang teaches The apparatus of claim 1, further comprising a NVM storage circuit formed in the die and (See FIG. 2.  [p. 282 Sec. III B] "stacking non-volatile memories on top of microprocessors enables cost-effective heterogeneous integration")
	configured to perform an on-chip NVM fold operation by: reading at least some of the synaptic weight values from a plurality of first word lines of the plurality of word lines, each of the first word lines comprising single-level-cell (SLC) NVM elements of a portion of the NVM configured to run in an SLC mode ([p. 281 §IIA] "A two-dimensional weight is folded into three-dimensional tensor and then decomposes into tensor cores G1,G2, ...Gd" FIG. 2 shows that the explicit word lines are expressed as a single NVM layer. FIG. 2 (b) shows that the word lines represent synaptic weights. Operation of single layer cells interpreted as running in single layer cell mode.)
	updating the synaptic weight values read from the first word lines using at least one of the plurality of the neural network processing circuits, ([p. 281 Sec. II A] "To build a multi-layer neural network, we propose a layerwise training process based on stack auto-encoder for low rank tensor cores and high compression rate. An auto-encoder layer is to set the layer output T the same as input X and find an optimal weight to represent itself. For example, we need to train a tensorized weight W" Training the weight is interpreted as synonymous with updating the synaptic weight. Setting output T to input X is interpreted as updating the value of input X.).
	However, Huang does not explicitly teach and storing the updated synaptic weight values in a second word line of the plurality of word lines,  the second word line comprising multi-level-cell (MLC) NVM elements of a portion of the NVM configured to run in an MLC mode.  

Li, in the same field of endeavor, teaches and storing the updated synaptic weight values in a second word line of the plurality of word lines,  the second word line comprising multi-level-cell (MLC) NVM elements of a portion of the NVM configured to run in an MLC mode. ([¶0196] "the MLC NVM matrix circuit 1900 is also configured to train the resistance of the MLC NVM storage circuits MLC-R.sub.00-MLC-R.sub.mn by supporting backwards propagation of a weight update according to the following formula:"). 

	Park, Huang, and Li are all directed towards memory-based neural network accelerators.  Therefore, Park, Huang, and Li are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the neural network accelerator in Park and Huang with the multi-level cell NVM elements for neural network acceleration taught in Li. The combination would have been obvious because a person of ordinary skill in the art would be able to determine that while Huang does not explicitly teach multi-level memory cells, Huang implicitly teaches storing updated weight values in stacked memories.  Li is therefore introduced to reinforce and to implicitly teach storing updated weight values in stacked memory in the scope of a neural network accelerator that is interpreted as having similar design goals as the accelerator taught in Huang.  Li further supports the combination in ([¶0242] “The system memory chip 2608 could be connected to the dedicated MLC NVM matrix circuit chip 2602 through a dedicated local bus to improve performance. The dedicated MLC NVM matrix circuit chip 2602 could also be embedded into the SoC 2606 to save power and improve performance.”).

	Regarding claim 30, claim 30 is substantially similar to claim 29.  Therefore, the rejection applied to claim 29 also applies to claim 30.  


Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SIDNEY VINCENT BOSTWICK whose telephone number is (571)272-4720.  The examiner can normally be reached on M-F 7:30am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang can be reached on (571)270-7092.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/SB/Examiner, Art Unit 2124

/LUIS A SITIRICHE/Primary Examiner, Art Unit 2126