Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Application 16/862,515 filed 4/29/2020 has been examined.
In this Office Action claims 1-24 are currently pending.

Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.
Claims 1-24 are provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over:

Claims 1-20 of copending Application No. 16/840,216;
Claims 1-21 of copending Application No. 16/854,861;
Claims 1-20 of copending Application No. 16/841,598;
Claims 1-20 of copending Application No. 16/846,263; and
Claims 1-20 of copending Application No. 16/841,601.

Although the claims at issue are not identical, they are not patentably distinct from each other because the current application claims as stated above would have been anticipated by the respective copending application claims as stated above. 

This is a provisional nonstatutory double patenting rejection because the patentably indistinct claims have not in fact been patented.


Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.



Claims 1-24 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Claim 1 recites:
Determining a partial computation metric based on the computations.
The limitation of determining a partial computation metric based on the computations, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. That is, other than reciting a computer program/semiconductor, nothing in the claim element precludes the step from practically being performed in the mind. For example, but for the computer program/semiconductor language, determining in the context of this claim encompasses the user manually determining a generic “computation metric” based on generic “computations”. Similarly, the limitations of specifying, generating, allocating, and outputting, as drafted, are processes that, under their broadest reasonable interpretation, cover performance of the limitations in the mind but for the recitation of generic computer components. For example, but for the computer program/semiconductor language, specifying, generating, allocating, and outputting in the context of this claim encompass the user manually generating a generic “computation metric” based on generic “computations”. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas (concepts performed in the human mind, including an observation, evaluation, judgment, or opinion).
Further, these concepts also recite “Certain Methods of Organizing Human Activity” (such as commercial or legal interactions, including agreements in the form of contracts; legal obligations; advertising, marketing or sales activities or behaviors; and business relations), in that determining a generic “computation metric” based on generic “computations” is a method of human activity in commercial or legal interactions.
Accordingly, the claim recites an abstract idea.
This judicial exception is not integrated into a practical application. In particular, the claim recites only one additional element: using a computer program/semiconductor to perform the specifying, generating, allocating, outputting, and determining steps. The program/semiconductor in these steps is recited at a high level of generality (i.e., as a generic processor determining a generic “computation metric” based on generic “computations”) such that it amounts to no more than mere instructions to apply the exception using a generic computer component. Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of using a computer program/semiconductor to perform the specifying, generating, allocating, outputting, and determining steps amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claim is not patent eligible.

Dependent claims 2-17 merely add further details of the abstract steps/elements recited in claim 1 without integrating the idea into a practical application; or including an improvement to another technology or technical field, an improvement to the functioning of the computer itself, or meaningful limitations beyond generally linking the use of an abstract idea to a particular technological environment. Therefore, dependent claims 2-17 are also directed towards nonstatutory subject matter.

Independent claims 18 and 22 are also rejected as ineligible subject matter under 35 U.S.C. 101 for substantially the same reasons as method claim 1. The components (i.e., system/medium) described in independent claims 18 and 22 do not provide for integrating the abstract idea into a practical application. At best, the claims merely provide alternate environments in which to implement the abstract idea.

Dependent claims 19-21 and 23-24 merely add further details of the abstract steps/elements recited in claim 1 without integrating the idea into a practical application; or including an improvement to another technology or technical field, an improvement to the functioning of the computer itself, or meaningful limitations beyond generally linking the use of an abstract idea to a particular technological environment. Therefore, dependent claims 19-21 and 23-24 are also directed towards nonstatutory subject matter.






Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-24 are rejected under 35 U.S.C. 103 as being unpatentable over Bleiweiss et al., US Pub. No. 2019/0205737, in view of Knowles et al., US Pub. No. 2020/0150713.

As to claim 1 (and substantially similar claim 18), Bleiweiss discloses
a method for implementing a machine learning network on a machine learning accelerator
(MLA), the MLA comprising one or more meshes of interconnected Tiles implemented
on a semiconductor die,
(Bleiweiss [0026] FIG. 20 illustrates a computing device employing a machine learning acceleration mechanism, according to an embodiment.;
See also [0140] Each unit of logic 1172, 1174 can be implemented within a semiconductor die and coupled with the substrate 1180 via an interconnect structure 1173. The interconnect structure 1173 may be configured to route electrical signals between the logic 1172, 1174 and the substrate 1180, and can include interconnects such as, but not limited to bumps or pillars.;
see also [0062] In some embodiments, a ring based interconnect unit 212 is used to couple the internal components of the processor 200. However, an alternative interconnect unit
may be used, such as a point-to-point interconnect, a switched interconnect,)

the method comprising:
receiving a description of the machine learning network, the machine learning network
comprising a plurality of interconnected layers and the description specifying
computations performed to implement each of the layers; 
(Bleiweiss [0170] Once the neural network is structured, a learning model can be applied to the network to train the network to perform specific tasks. The learning model describes how to
adjust the weights within the model to reduce the output error of the network.;
see also [0158] A feedforward network may be implemented as an acyclic graph in which the nodes are arranged in layers. Typically, a feedforward network topology includes an input
layer and an output layer that are separated by at least one hidden layer. The hidden layer transforms input received by the input layer into a representation that is useful for
generating output in the output layer; see also [0158] Depending on the specific model being represented by the algorithm being executed, the output from the neural network algorithm can
take various forms.) 

and
Bleiweiss does not disclose:
generating a computer program that implements the machine learning network on the
MLA, wherein the computer program comprises Tile instructions for execution by
the Tiles to implement the machine learning network, and wherein generating the
computer program comprises:
for each of one or more of the layers in the machine learning network:
determining, for that layer, a partial computation metric based on the
computations performed to implement that layer;
allocating a group of Tiles to execute the Tile instructions implementing the
computations of that layer, wherein a number of Tiles in the group is
based on the partial computation metric for that layer; and
outputting the computer program;

however, Knowles discloses:
generating a computer program that implements the machine learning network on the
MLA, wherein the computer program comprises Tile instructions for execution by
the Tiles to implement the machine learning network, 
 (Knowles [0012] In this way, the compiler implementing the above method may automatically allocate respective local programs to respective processing units (tiles) in a computer
which is architected to operate in a time deterministic fashion.)

and wherein generating the computer program comprises:
for each of one or more of the layers in the machine learning network:
determining, for that layer, a partial computation metric based on the
computations performed to implement that layer;
(Knowles [0089] Further, each of the one or more parameters of each node's function is characterized by a respective error value. Moreover, a respective error condition may be associated with the error(s) in the parameter(s) of each node 102. For a node 102 representing a function parameterized by a single error parameter, the error condition may be a simple threshold, i.e. the error condition is satisfied if the error is within the specified threshold but not satisfied if the error is beyond the threshold.)

allocating a group of Tiles to execute the Tile instructions implementing the computations of that layer, wherein a number of Tiles in the group is based on the partial computation metric for that layer; 
(Knowles [0073] In order to ensure each individual tile executes SEND instructions and switch control instructions at appropriate times to transmit and receive the correct data, exchange scheduling requirements need to be met by the programmer or compiler that allocates individual programs to the individual tiles in the computer. This function is carried out by an exchange scheduler which needs to be aware of the following exchange timing (BNET) parameters.  )

and
outputting the computer program
(Knowles [0029] generating a local program for each processing unit comprising a sequence of executable instructions;).


It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to apply the tile execution taught by Knowles to the machine learning acceleration of Bleiweiss, since it was known in the art that this paradigm is particularly effective in the context of knowledge models for machine learning: the architecture uses time determinism, as in an exchange phase of a BSP paradigm, to efficiently process very large amounts of data (Knowles [0108]).

As to claim 2, Knowles discloses under the rationale above the method of claim 1, wherein generating the computer program further comprises:
partitioning the Tile instructions into one or more deterministic phases each utilizing
multiple Tiles; 
(Knowles [0010] The inventors have made a machine which makes certain time deterministic
guarantees to optimise computation on machine intelligence models. This allows a compiler to partition and schedule work across the nodes in a time deterministic fashion. It is this time determinism which is utilised in following described embodiments for significant optimisations in
designing a computer optimised to process workloads based on knowledge models.;
see also [0012] In this way, the compiler implementing the above method may automatically allocate respective local programs to respective processing units (tiles) in a computer
which is architected to operate in a time deterministic fashion.)
and
statically scheduling the Tile instructions within each deterministic phase relative to the
other Tile instructions in the same deterministic phase
(Knowles [0064] Thus, the computer described herein is time deterministic. Each tile 
operates a program which has been allocated to it by the programmer or by a compiler exercise, where the programmer or the compiler function has knowledge of what will be transmitted by a particular tile at a certain time and what needs to be received by a recipient tile at a certain time.; see also [0022] determining for each processing unit a relative time of execution of instructions of each local program whereby a local program allocated to one
processing unit is scheduled to execute with a predetermined delay relative to a synchronisation signal a send instruction to transmit at least one data packet at a predetermined transmit time, relative to the synchronisation signal, destined for a recipient processing
unit but having no destination identifier, and a local program allocated to the recipient processing unit is scheduled to execute at a predetermined switch time a switch control instruction to control the switching circuitry to connect its processing unit
wire to the switching fabric to receive the data packet at a receive time;).

As to claim 3, Knowles discloses under the rationale above the method of claim 1, wherein allocating a group of Tiles comprises:
determining an overall computation metric based on the computations for implementing
the machine learning network;
(Knowles [0089] As another example, a combined metric may be defined combining the errors in the different parameters for the same node 102, and the error condition may be satisfied on condition that the value of the combined metric falls within a specified threshold, but
otherwise the error condition is not satisfied if the value of the combined metric is beyond the threshold (or vice versa depending on the definition of the metric). Whatever the error condition, this gives a measure of whether the error in the parameter(s) of the node falls below a certain level or degree of acceptability.)
determining a proportion of the partial computation metric to the overall computation
metric;
(Knowles [0089] For a node 102 representing a function parameterized by a single error parameter, the error condition may be a simple threshold, i.e. the error condition is satisfied if the error is within the specified threshold but not satisfied if the error is beyond the threshold. For a node 102 parameterized by more than one respective parameter, the error condition for that node 102 may be more complex. For example, the error condition may be satisfied only if each of the parameters of that node 102 falls within respective threshold.)
determining a total number of Tiles available to implement the machine learning
network; and
(Knowles [0006] In general, there may exist dependencies between the portions of a program running on different tiles in the array. A technique is therefore required to prevent a piece of
code on one tile running ahead of data upon which it is dependent being made available by another piece of code on another tile.)
determining the number of Tiles in the group as a function of the partial computation
metric, the overall computation metric, and the total number of Tiles
(Knowles [0006] There are a number of possible schemes for achieving this, but the scheme of interest herein is known as "bulk synchronous parallel" (BSP). According to BSP, each tile performs a compute phase and an exchange phase in an alternating manner. During the compute phase each tile performs one or more computation tasks locally on tile, but does not communicate any results of its computations with any others of the tiles. In the exchange phase each tile is allowed to exchange one or more results of the computations from the preceding compute phase to and/or from one or more others of the tiles in the group, but does not yet begin a new compute phase until that tile has finished its exchange;
See also [0006] That is to say, either: (a) all tiles are required to complete their respective compute phases before any in the group is allowed to proceed to the next exchange phase, or
(b) all tiles in the group are required to complete their respective exchange phases before any tile in the group is allowed to proceed to the next compute phase, or (c) both. When used herein the phrase "between a compute phase and an exchange phase" encompasses all these options).
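For illustration only, the arithmetic recited in claim 3 can be sketched as below. This is a minimal sketch under stated assumptions and is not drawn from Bleiweiss, Knowles, or the applicant's specification: the function name allocate_tiles, the use of a plain sum as the overall computation metric, and the simple proportional formula are all hypothetical.

```python
from typing import List


def allocate_tiles(partial_metrics: List[float], total_tiles: int) -> List[float]:
    """Hypothetical sketch: size each layer's Tile group in proportion to its metric.

    partial_metrics: one partial computation metric per layer (assumed non-negative).
    total_tiles:     total number of Tiles available to the network.
    Returns a possibly fractional Tile count per layer.
    """
    # Assumption: the overall computation metric is the sum of the per-layer metrics.
    overall = sum(partial_metrics)
    if overall <= 0:
        raise ValueError("overall computation metric must be positive")
    # Group size as a function of the partial metric, the overall metric,
    # and the total number of Tiles.
    return [total_tiles * m / overall for m in partial_metrics]


if __name__ == "__main__":
    print(allocate_tiles([100, 300, 600], 10))
```

Under these assumptions a layer carrying 30% of the overall metric receives 30% of the Tiles. The result may be fractional, for example two equal layers sharing three Tiles each receive 1.5.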

 
As to claim 4, Knowles discloses under the rationale above the method of claim 3, wherein determining the number of Tiles in the group comprises:
determining the number of Tiles in the group such that a proportion of the number of
Tiles in the group relative to the total number of Tiles is within a predefined
rounding range of a proportion of the partial computation metric to the overall
computation metric
(Knowles [0017] In a computer intended to execute the local programs, the processing units may have a fixed positional relationship with respect to each other, wherein the step of determining comprises determining a fixed delay based on the positional relationship between each pair of processing units in the computer. That is, each pair would include one processing unit scheduled to execute a send instruction and another processing unit scheduled to include a switch control instruction. This could be a pair simultaneously sending to and receiving from each other.).
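For illustration only, the "predefined rounding range" condition of claim 4 can be read as a tolerance check on two proportions. The sketch below is hypothetical: the function name within_rounding_range and the absolute-difference form of the tolerance are assumptions, not language taken from either reference or from the claims.

```python
def within_rounding_range(group_tiles: int, total_tiles: int,
                          partial_metric: float, overall_metric: float,
                          tolerance: float) -> bool:
    """Hypothetical check: is the Tile share close enough to the metric share?"""
    if total_tiles <= 0 or overall_metric <= 0:
        raise ValueError("total_tiles and overall_metric must be positive")
    tile_share = group_tiles / total_tiles          # proportion of Tiles in the group
    metric_share = partial_metric / overall_metric  # proportion of the computation
    # Assumption: the rounding range is an absolute tolerance on the difference.
    return abs(tile_share - metric_share) <= tolerance


if __name__ == "__main__":
    # 3 of 16 Tiles (18.75%) for a layer with 20% of the computation.
    print(within_rounding_range(3, 16, 20, 100, 0.05))
```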

As to claim 5, Knowles discloses under the rationale above the method of claim 3, wherein the number of Tiles in the group comprises a fractional number indicative of a single Tile being allocated to implementing the layer and to implementing at least one other layer (Knowles [0012] In this way, the compiler implementing the above method may automatically allocate respective local programs to respective processing units (tiles) in a computer which is architected to operate in a time deterministic fashion.).

As to claim 6, Knowles discloses under the rationale above the method of claim 1, wherein determining the total number of Tiles available to implement
the machine learning network comprises:
determining the number of Tiles in the MLA that are not assigned to implement another
machine learning network (Knowles [0012] In this way, the compiler implementing the above
method may automatically allocate respective local programs to respective processing units (tiles) in a computer which is architected to operate in a time deterministic fashion.).

As to claim 7, Knowles discloses under the rationale above the method of claim 1, wherein determining the total number of Tiles available to implement
the machine learning network comprises:
determining the total number of Tiles based on the overall computation metric (Knowles [0012] In this way, the compiler implementing the above method may automatically allocate respective local programs to respective processing units (tiles) in a computer which is architected to operate in a time deterministic fashion.).

As to claim 8, Knowles discloses under the rationale above the method of claim 1, wherein allocating a group of Tiles comprises:
determining the number of Tiles in the group such that the Tile instructions for each
group are executed in respective processing times that are within a predefined
time range from each other (Knowles [0022] determining for each processing unit a relative
time of execution of instructions of each local program whereby a local program allocated to one
processing unit is scheduled to execute with a predetermined delay relative to a synchronisation signal a send instruction to transmit at least one data packet at a predetermined transmit time, relative to the synchronisation signal, destined for a recipient processing unit but having no destination identifier, and a local program allocated to the recipient processing
unit is scheduled to execute at a predetermined switch time a switch control instruction to control the switching circuitry to connect its processing unit wire to the switching fabric to receive the data packet at a receive time;
see also claim 7: “7. The method of claim 1, wherein the configuring the first local program comprises accessing a look-up table holding information about delays enabling the transmit time at the first processing unit and a switching time at the second processing unit to be determined.”).

As to claim 9, Bleiweiss discloses the method of claim 1, wherein determining the partial computation metric comprises
determining a number of matrix multiply operations associated with the computations for
that layer (Bleiweiss [0224] In one embodiment, CPU 2012 and GPU 2014 operate individually serially or simultaneously in parallel. As discussed above, deep learning algorithms are compute
intensive, and include algorithms, such as 2D Convolution, Rectifier Linear Unit (RELU), Batch Normalization, Matrix Multiplications, and other operators that are optimized by various software and hardware vendors.;
see also [0235] In a further embodiment, accelerator 2708 computes the transpose for each input matrix upon receiving the matrices during the forward propagation compute. Thus, the transpose of the input matrices are immediately available for compute upon receiving output matrices during back-propagation.).
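For illustration only, one way a per-layer metric could be derived from matrix multiply operations is to count scalar multiply-accumulates. This is a hypothetical sketch: the function names, the (rows, inner, cols) shape convention, and the choice to count multiply-accumulates rather than whole matrix multiplications are assumptions and do not come from Bleiweiss or from the claims.

```python
from typing import Iterable, Tuple


def matmul_macs(rows: int, inner: int, cols: int) -> int:
    """Multiply-accumulates for a (rows x inner) by (inner x cols) matrix product."""
    return rows * inner * cols


def layer_metric(matmul_shapes: Iterable[Tuple[int, int, int]]) -> int:
    """Hypothetical partial computation metric: total MACs over a layer's matmuls."""
    return sum(matmul_macs(r, k, c) for r, k, c in matmul_shapes)


if __name__ == "__main__":
    # A layer made of a 2x3 by 3x4 product and a 1x5 by 5x5 product.
    print(layer_metric([(2, 3, 4), (1, 5, 5)]))
```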

As to claim 10, Knowles discloses under the rationale above the method of claim 1, wherein determining the partial computation metric comprises
determining a total computation time associated with the computations for that layer
(Knowles [0064] Instead, the recipient tile knows that it will be expecting a datum
from a certain transmitting tile at a certain time. Thus, the computer described herein is time deterministic. Each tile operates a program which has been allocated to it by the
programmer or by a compiler exercise, where the programmer or the compiler function has knowledge of what will be transmitted by a particular tile at a certain time and what
needs to be received by a recipient tile at a certain time. In order to achieve this, SEND instructions are included in the local programs executed by the processor on each tile, where
the time of execution of the SEND instruction is predetermined relative to the timing of other instructions being executed on other tiles in the computer.;
See also [0064] To implement the switching, the local programs executed on the tiles include switch control instructions (PUTi) which cause a multiplexer control signal 214 to be issued to control the multiplexer associated with that tile to switch its input at a certain time ahead of the time at which a particular datum is expected to be received at the tile.).

As to claim 11, Knowles discloses under the rationale above the method of claim 1, wherein determining the partial computation metric comprises determining a total computation and data transfer time associated with the computations for that layer (Knowles [0076] III. The tile to tile exchange delay, BNET_TT (TID of sending tile, TID of receiving tile). This is the
number of cycles between a SEND instruction being issued on one tile and the earliest point at which the receiving tile could issue a (hypothetical) load instruction pointing to the sent value in its own memory. This has been determined from the tile IDs of the sending
and receiving tiles, either by accessing a table such as has already been discussed, or by calculation. Looking again at FIG. 4, this delay comprises the time taken for
data to travel from transmit tile 4r from its ex_out interface 226r to the switching fabric 14 along its exchange bus 218r and then via the input mux 210R at the receiving tile 4R to the ex_in interface 224R of the receiving tile.
See also [0024] The step of determining can comprise determining the fixed delay for the switch control instruction to reach the multiplexer and an output data packet from the multiplexer to reach the input interface of its processing unit based on the predetermined physical locations and consequent transfer times.;
See also [0022] a send instruction to transmit at least one data packet at a predetermined transmit time, relative to the synchronisation signal, destined for a recipient processing
unit).

As to claim 12, Bleiweiss discloses the method of claim 1, wherein determining the partial computation metric comprises determining an estimated power consumption associated with the computations for that layer
(Bleiweiss [0075] In some embodiments, graphics core array 414 is scalable, such that the array includes a variable number of graphics cores, each having a variable number of execution
units based on the target power and performance level of GPE 410. ;
See also [0081] The SoC interface 537 can also implement power management controls for the graphics core 500 and enable an interface between a clock domain of the graphic core 500 and other clock domains within the SoC.;
See also [0078] The graphics processor core 500 is exemplary of one graphics core slice,
and a graphics processor as described herein may include multiple graphics core slices based on target power and performance envelopes.).

As to claim 13, Knowles discloses under the rationale above the method of claim 1, wherein allocating the group of Tiles comprises:
identifying a contiguous block of adjacent Tiles in a layout of the machine learning
accelerator
(Knowles [0040] The chips can be connected together into cards by a further 6 chip-to-chip
links 30a, 30b arranged along the "East" side of the chip. A host may access a computer which is architected as a single chip processor 2 as described herein or a group of multiple interconnected single chip processors 2 depending on the workload from the host application.
See also [0064] Each tile indicates its synchronisation state to a sync module 36. Once it has been established that each tile is ready to send data, the synchronisation process 30 causes
the system to enter an exchange phase which is shown on the right-hand side of FIG. 3. In this exchange phase, data values move between tiles (in fact between the memories of
tiles in a memory-to-memory data movement).).

As to claim 14, Knowles discloses under the rationale above the method of claim 13, wherein identifying the contiguous block of adjacent Tiles comprises:
obtaining one or more constraints that constrain at least one of a shape of the group of
Tiles, a minimum number of tiles along a first dimension, a maximum number of
Tiles along the first dimension, a minimum number of Tiles along a second
dimension, and a maximum number of Tiles along the second dimension; and
identifying the contiguous block in accordance with the one or more constraints
(Knowles [0040] The processor 2 comprises multiple processing units referred to as tiles. In one
embodiment, there are 1216 tiles organised in arrays 6a, 6b which are referred to herein as "North" and "South". In the described example, each array has eight columns of 76 tiles (in fact generally there will be 80 tiles, for redundancy purposes). It will be appreciated that the concepts described herein extend to a number of different physical architectures; one example is given here to aid understanding. The chip 2 has two chip to host links 8a, 8b and 4 chip to chip links 30a, 30b arranged on the "West" edge of the chip 2. The chip 2 receives work from a host (not shown) which is connected to the chip via one of the card-to-host links in the
form of input data to be processed by the chip 2. The chips can be connected together into cards by a further 6 chip-to-chip links 30a, 30b arranged along the "East" side of the
chip. A host may access a computer which is architected as a single chip processor 2 as described herein or a group of multiple interconnected single chip processors 2 depending
on the workload from the host application.).
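
For illustration only, the constraint-based limitation of claim 14 can be pictured with the following minimal sketch. It is not taken from Knowles or from the application; the function names, the rectangular-block assumption, and the tie-breaking rule (smallest area, then closest to square) are all hypothetical choices made for this example.

```python
from typing import Iterator, Optional, Tuple

def candidate_shapes(num_tiles: int, min_rows: int, max_rows: int,
                     min_cols: int, max_cols: int) -> Iterator[Tuple[int, int]]:
    """Yield every (rows, cols) rectangle within the bounds that holds num_tiles."""
    for rows in range(min_rows, max_rows + 1):
        for cols in range(min_cols, max_cols + 1):
            if rows * cols >= num_tiles:
                yield rows, cols

def pick_shape(num_tiles: int, min_rows: int, max_rows: int,
               min_cols: int, max_cols: int) -> Optional[Tuple[int, int]]:
    """Pick the smallest-area shape; ties go to the shape closest to square.

    Returns None when no rectangle satisfies the constraints.
    (Hypothetical selection rule, assumed for illustration.)
    """
    shapes = list(candidate_shapes(num_tiles, min_rows, max_rows,
                                   min_cols, max_cols))
    if not shapes:
        return None
    return min(shapes, key=lambda s: (s[0] * s[1], abs(s[0] - s[1])))

if __name__ == "__main__":
    # 12 tiles, 2 to 4 rows, 2 to 8 columns
    print(pick_shape(12, 2, 4, 2, 8))
```

The sketch only selects a block shape; placing that shape at a free position on the mesh would be a separate step.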

As to claim 15, Knowles discloses under the rationale above the method of claim 13, wherein identifying the contiguous block of adjacent Tiles comprises:
obtaining one or more optimization criteria; 
(Knowles [0010] This allows a compiler to partition and schedule work across the nodes in a time deterministic fashion. It is this time determinism which is utilised in following
described embodiments for significant optimisations in designing a computer optimised to process workloads based on knowledge models.;)
and
identifying the contiguous block in accordance with the optimization criterion (Knowles [0012] In this way, the compiler implementing the above method may automatically allocate respective local programs to respective processing units (tiles) in a computer which is architected to operate in a time deterministic fashion. Examples of such a computer are described herein
and referred to as an IPU [intelligence processing unit], and reference is further made to application numbers [PWF Ref: 408525 and 408527], the contents of which are herein
incorporated by reference.;
See also [0044] Note too that the program loaded into each tile is determined by a processor or compiler to allocate work based on the graph of the machine intelligence model being supported.).

As to claim 16, Knowles discloses under the rationale above the method of claim 13, wherein identifying the contiguous block of adjacent Tiles comprises: 
identifying the contiguous block such that the contiguous block is adjacent to a block of
Tiles associated with an immediately previous layer in the machine learning
network and is adjacent to a block of Tiles associated with an immediately
subsequent layer in the machine learning network
(Knowles [0012] In this way, the compiler implementing the above method may automatically allocate respective local programs to respective processing units (tiles) in a computer which is architected to operate in a time deterministic fashion. Examples of such a computer are described herein and referred to as an IPU [intelligence processing unit], and reference is further made to application numbers [PWF Ref: 408525 and 408527], the contents of which are herein incorporated by reference.;
See also [0044] Note too that the program loaded into each tile is determined by a processor or compiler to allocate work based on the graph of the machine intelligence model being supported.).
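
For illustration only, the adjacency limitation of claim 16 can be pictured as placing the block for each layer in consecutive column strips of a mesh, so that each intermediate block borders the blocks of the previous and subsequent layers. The sketch below is an assumption made for this example and is not taken from Knowles or from the application; `strip_layout` and `are_adjacent` are hypothetical names.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (first_column, last_column), inclusive

def strip_layout(widths: List[int]) -> List[Span]:
    """Place one column strip per layer, left to right, in layer order."""
    spans: List[Span] = []
    start = 0
    for width in widths:
        if width <= 0:
            raise ValueError("each layer needs at least one column")
        spans.append((start, start + width - 1))
        start += width
    return spans

def are_adjacent(a: Span, b: Span) -> bool:
    """Two strips are adjacent when one ends right before the other begins."""
    return a[1] + 1 == b[0] or b[1] + 1 == a[0]

if __name__ == "__main__":
    spans = strip_layout([2, 3, 1])
    print(spans)
    print(all(are_adjacent(spans[i], spans[i + 1])
              for i in range(len(spans) - 1)))
```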

As to claim 17, Knowles discloses under the rationale above the method of claim 1, wherein allocating the group of Tiles comprises:
determining that the group of Tiles implements either a first layer of the machine learning
network or a last layer of the machine learning network; and
allocating the group of Tiles to include at least one Tile adjacent to a memory of the
MLA (Knowles [0012] In this way, the compiler implementing the above method may automatically allocate respective local programs to respective processing units (tiles) in a computer which is architected to operate in a time deterministic fashion. Examples of such a computer are described herein and referred to as an IPU [intelligence processing unit], and reference is further made to application numbers [PWF Ref: 408525 and 408527], the contents of which are herein incorporated by reference.;
See also [0044] Note too that the program loaded into each tile is determined by a processor or compiler to allocate work based on the graph of the machine intelligence model being supported.).


Referring to claim 19, this dependent claim recites similar limitations as claim 2;
therefore, the arguments above regarding claim 2 are also applicable to claim 19.

Referring to claim 20, this dependent claim recites similar limitations as claim 3;
therefore, the arguments above regarding claim 3 are also applicable to claim 20.

Referring to claim 21, this dependent claim recites similar limitations as claim 4;
therefore, the arguments above regarding claim 4 are also applicable to claim 21.

As to claim 22, Bleiweiss discloses a machine learning accelerator device comprising:
one or more meshes of interconnected Tiles implemented on a semiconductor die;
(Bleiweiss [0026] FIG. 20 illustrates a computing device employing a machine learning
acceleration mechanism, according to an embodiment.;
See also [0140] Each unit of logic 1172, 1174 can be implemented within a semiconductor die
and coupled with the substrate 1180 via an interconnect structure 1173. The interconnect
structure 1173 may be configured to route electrical signals between the logic 1172, 1174 and
the substrate 1180, and can include interconnects such as, but not limited to bumps or pillars.;
see also [0062] In some embodiments, a ring based interconnect unit 212 is used to couple the
internal components of the processor 200. However, an alternative interconnect unit
may be used, such as a point-to-point interconnect, a switched interconnect,)

a controller comprising a processor and a non-transitory computer-readable storage
medium for storing a computer program executable by the processor, 
(Bleiweiss [0053-0054] In one embodiment the processor(s) 102 include an integrated
memory controller 116 and a platform controller hub 130. The memory controller 116 facilitates communication between a memory device and other components of the system 100, while the platform controller hub (PCH) 130 provides connections to I/O devices via a local I/O bus.
[0054] The memory device 120 can be a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, flash memory device, phase change
memory device, or some other memory device having suitable performance to serve as process memory.; see also [0055] a data storage device 124 (e.g., hard disk drive, flash memory, etc.). The data storage device 124 can connect via a storage interface (e.g., SATA) or via a peripheral bus, such as a Peripheral Component Interconnect bus (e.g., PCI, PCI Express).)

Bleiweiss does not disclose:
wherein the computer program when executed causes the processor to perform steps
including:
obtaining an allocation of Tile instructions of the computer program to be executed by different ones of the Tiles, the allocation causing different blocks of Tiles to perform computations associated with different layers of a machine learning network, 
wherein the different blocks are sized based on respective corresponding proportions of respective partial computation metrics for the different layers relative to an overall computation metric for the machine-learned model; 
and
distributing the instructions to the Tiles based on the allocation;

however, Knowles discloses:

wherein the computer program when executed causes the processor to perform steps
including:
obtaining an allocation of Tile instructions of the computer program to be executed by different ones of the Tiles, the allocation causing different blocks of Tiles to perform computations associated with different layers of a machine learning network, 
(Knowles [0012] In this way, the compiler implementing the above method may automatically
allocate respective local programs to respective processing units (tiles) in a computer
which is architected to operate in a time deterministic fashion.;
see also Knowles [0073] In order to ensure each individual tile executes SEND instructions and switch control instructions at appropriate times to transmit and receive the correct data, exchange scheduling requirements need to be met by the programmer or compiler that allocates individual programs to the individual tiles in the computer. This function is carried out by an exchange scheduler which needs to be aware of the following exchange timing (BNET) parameters.)

wherein the different blocks are sized based on respective corresponding proportions of respective partial computation metrics for the different layers relative to an overall computation metric for the machine-learned model; 
(Knowles [0017] In a computer intended to execute the local programs, the processing units
may have a fixed positional relationship with respect to each other, wherein the step of
determining comprises determining a fixed delay based on the positional relationship between
each pair of processing units in the computer. That is, each pair would include one processing
unit scheduled to execute a send instruction and another processing unit scheduled to include a
switch control instruction. This could be a pair simultaneously sending to and receiving from
each other.).
and
distributing the instructions to the Tiles based on the allocation.
(Knowles [0029] generating a local program for each processing unit comprising a sequence of
executable instructions;).

It would have been obvious to one having ordinary skill in the art at the time of the
effective filing date to apply tile execution with machine learning as taught by Knowles, since it
was known in the art that this paradigm is particularly effective in the context of knowledge
models for machine learning: the architecture utilizes time determinism, as in an exchange
phase of a BSP paradigm, to efficiently process very large amounts of data
(Knowles [0108]).
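
For illustration only, the sizing limitation of claim 22 (blocks sized in proportion to each layer's partial computation metric relative to the overall metric) can be worked through as below. The sketch is not taken from Knowles, Bleiweiss, or the application; the function name `size_blocks`, the use of a single integer cost per layer, and the largest-remainder rounding rule are assumptions made for this example.

```python
from typing import List

def size_blocks(layer_costs: List[int], total_tiles: int) -> List[int]:
    """Split total_tiles among layers in proportion to each layer's cost.

    Each layer first receives floor(cost / total_cost * total_tiles) tiles;
    leftover tiles go to the layers with the largest remainders
    (hypothetical rounding rule, assumed for illustration).
    """
    total_cost = sum(layer_costs)
    if total_cost <= 0:
        raise ValueError("overall computation metric must be positive")
    sizes = [cost * total_tiles // total_cost for cost in layer_costs]
    remainders = [cost * total_tiles % total_cost for cost in layer_costs]
    leftover = total_tiles - sum(sizes)
    # Stable sort: ties keep layer order, so earlier layers win.
    by_remainder = sorted(range(len(layer_costs)), key=lambda i: -remainders[i])
    for i in by_remainder[:leftover]:
        sizes[i] += 1
    return sizes

if __name__ == "__main__":
    # Three layers with 10%, 30% and 60% of the overall metric, 20 tiles
    print(size_blocks([10, 30, 60], 20))
```

Under this rule a layer with a very small share can receive zero tiles; a practical allocator would likely enforce a minimum block size, which is omitted here.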

Referring to claim 23, this dependent claim recites similar limitations as claim 2;
therefore, the arguments above regarding claim 2 are also applicable to claim 23.

As to claim 24, Knowles discloses under the rationale above the machine learning accelerator device of claim 22, further comprising:
one or more memory blocks;
(Knowles [0040] It will be appreciated that the concepts described herein extend to a number of different physical architectures - one example is given here to aid understanding. The
chip 2 has two chip to host links 8a, 8b and 4 chip to chip links 30a, 30b arranged on the "West" edge of the chip 2. The chip 2 receives work from a host (not shown) which is connected to the chip via one of the card-to-host links in the form of input data to be processed by the chip 2. The chips can be connected together into cards by a further 6 chip-to-chip links 30a, 30b arranged along the "East" side of the chip. A host may access a computer which is architected as
a single chip processor 2 as described herein or a group of multiple interconnected single chip processors 2 depending on the workload from the host application.)

wherein the different blocks are furthermore allocated such that a block implementing an
intermediate layer of the machine learning network is adjacent to a block
implementing an immediately previous layer of the machine learning network
(Knowles [0012] In this way, the compiler implementing the above method may
automatically allocate respective local programs to respective processing units (tiles) in a
computer which is architected to operate in a time deterministic fashion. Examples of such a
computer are described herein and referred to as an IPU [intelligence processing unit], and
reference is further made to application numbers [PWF Ref: 408525 and 408527], the contents
of which are herein incorporated by reference.;
See also [0044] Note too that the program loaded into each tile is determined by a processor or
compiler to allocate work based on the graph of the machine intelligence model being
supported.)

and is adjacent to a block implementing an immediately subsequent layer of the
machine learning network, a block implementing a first layer of the machine
learning network is adjacent to the one or more memory blocks, and a block implementing a last layer of the machine learning network is adjacent to the one
or more memory blocks
(Knowles [0040] The processor 2 comprises multiple processing units referred to as tiles. In one
embodiment, there are 1216 tiles organised in arrays 6a, 6b which are referred to herein as
"North" and "South". In the described example, each array has eight columns of 76 tiles (in fact
generally there will be 80 tiles, for redundancy purposes). It will be appreciated that the
concepts described herein extend to a number of different physical architectures - one
example is given here to aid understanding. The chip 2 has two chip to host links 8a, 8b and 4
chip to chip links 30a, 30b arranged on the "West" edge of the chip 2. The chip 2 receives work
from a host (not shown) which is connected to the chip via one of the card-to-host links in the
form of input data to be processed by the chip 2. The chips can be connected together into
cards by a further 6 chip-to-chip links 30a, 30b arranged along the "East" side of the
chip. A host may access a computer which is architected as a single chip processor 2 as
described herein or a group of multiple interconnected single chip processors 2 depending
on the workload from the host application.).






Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 

Daga et al., US Pub. No. 2020/0320403 A1, teaches an apparatus to facilitate execution of non-linear function operations. The apparatus comprises accelerator circuitry including a compute grid having a plurality of processing elements to execute neural network computations, store values resulting from the neural network computations, and perform piecewise linear (PWL) approximations of one or more non-linear functions using the stored
values as input data; and

Madar et al., US Pub. No. 2022/0044153 A1, teaches methods, systems, and apparatus, including computer programs encoded on computer storage media, for virtualizing external memory as local to a machine learning accelerator. One ambient computing system comprises: an ambient machine learning engine; a low-power CPU; and an SRAM that is shared among at least the ambient machine learning engine and the low-power CPU; wherein the ambient
machine learning engine comprises virtual address logic to translate from virtual addresses generated by the ambient machine learning engine to physical addresses within the SRAM.








CONTACT INFORMATION
Any inquiry concerning this communication or earlier communications from the examiner should be directed to EVAN S ASPINWALL whose telephone number is (571)270-7723. The examiner can normally be reached Monday-Friday 8am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Neveen Abel-Jalil, can be reached on 571-270-0474. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


/Evan Aspinwall/
Primary Examiner, Art Unit 2152