DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1, 9, and 17 have been amended.
Claims 1-24 have been examined.
The drawing objections in the previous Office Action have been addressed and are withdrawn.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on July 30, 2021 has been entered.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3, 6-8, 9, 11, 14-16, 17, 19, and 22-24 are rejected under 35 U.S.C. 103 as being unpatentable over “Instruction Scheduling for a Tiled Dataflow Architecture” by Mercaldi et al. (hereinafter referred to as “Mercaldi”) in view of US Patent No. 6,609,189 by Kuszmaul et al. (cited as pertinent by the Examiner on December 14, 2020 and hereinafter referred to as “Kuszmaul”).
Regarding claims 1, 9, and 17, taking claim 1 as representative, Mercaldi discloses:
a computer hardware device having a clock speed and a clock cycle, the device comprising: a plurality of arithmetic logic units (ALU) within a data path including a first, second and third ALUs, the (Mercaldi discloses, at § 1, right column of p. 141, a tiled architecture, which discloses a hardware device having a clock speed and a clock cycle, having a number of processing elements (PEs), which discloses first, second, and third ALUs, where the PEs are connected using an on-chip network (datapath) and some PEs are within a predefined range, e.g., two adjacent PEs form a pod, and other PEs are not within the predefined range, e.g., PEs in other pods, domains, or clusters. Mercaldi also discloses, at § 2.1, left column of p. 142, fetching instructions, which discloses storing instructions in non-transitory computer readable media.); 
the device being programmed to execute a series of instructions stored in a non-transitory memory to perform operations, the operations comprising: first assigning a first instruction operation to the first ALU (Mercaldi discloses, at § 1, right column of p. 141, fetching instructions, which discloses storing the instructions in non-transitory memory, and assigning each instruction to a specific PE, which discloses assigning a first instruction to the first ALU.); 
first determining, for a second instruction operation having an input that depends directly on an output of a first instruction operation, whether a condition (a) exists in which all inputs for the second instruction operation are available within the locally predefined range from the first ALU, or a condition (b) exists in which at least one input for the second instruction operation are produced by the third ALU; second assigning, in response to at least the first determining finding condition (a), the second instruction operation to the second ALU (Mercaldi discloses, at § 1, right column of p. 141, placing dependent instructions on the same or adjacent tiles, which discloses determining that all inputs are available locally and assigning the dependent instruction to a second ALU that is within the locally predefined range. Mercaldi discloses, at § 3.3.1, assigning an instruction to a PE based on determining the number of operands the instruction shares with its successor instructions already assigned to the PE. This discloses determining whether or not all operands are locally available, which discloses determining whether the first or second condition exists.); 
in response to at least the first determining finding condition (b):... (Mercaldi discloses, at § 3.3.1, assigning an instruction to a PE based on determining the number of operands the instruction shares with its successor instructions already assigned to the PE. This discloses determining that not all operands are locally available.); 
third assigning, after the ensuring, the second instruction operation to an ALU of the plurality of ALUs (Mercaldi discloses, at § 3.3.1, assigning an instruction to a PE based on determining the number of operands the instruction shares with its successor instructions already assigned to the PE.). 
Mercaldi does not explicitly disclose ensuring a pause of at least one clock cycle will occur between execution of the first instruction operation and the second instruction operation.
However, in the same field of endeavor (e.g., instruction scheduling) Kuszmaul discloses:
ensuring a pause of at least one clock cycle will occur between execution of the first instruction operation and the second instruction operation (Kuszmaul discloses, at col. 61, lines 44-47, introducing an extra delay between execution of dependent instructions that execute in different clusters.).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify Mercaldi to include an extra delay in order to facilitate a reduced clock cycle by compensating for critical path length. See Kuszmaul, col. 9, lines 14-16.
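For illustration only, the combination discussed above can be sketched as follows. This is a minimal sketch and is not a characterization of the actual algorithm of Mercaldi or Kuszmaul. The names `Placement`, `assign_dependent`, `LOCAL_RANGE`, and `candidate_alu`, the one-dimensional ALU indexing, and the specific cycle offsets are assumptions introduced solely for the example.

```python
from dataclasses import dataclass

# Assumption for illustration: ALUs are indexed along a line, and the
# "locally predefined range" is an index distance of at most LOCAL_RANGE.
LOCAL_RANGE = 1


@dataclass
class Placement:
    alu: int    # index of the ALU the operation is assigned to
    cycle: int  # clock cycle in which the operation executes


def assign_dependent(first: Placement, producer_alus: list[int],
                     candidate_alu: int) -> Placement:
    """Place a second operation that depends directly on `first`.

    producer_alus: ALUs that produce each input of the second operation.
    candidate_alu: hypothetical target ALU chosen by the scheduler.
    """
    all_local = all(abs(p - first.alu) <= LOCAL_RANGE for p in producer_alus)
    if all_local:
        # Condition (a): every input is available within the local range,
        # so the dependent operation runs in the very next cycle.
        return Placement(alu=candidate_alu, cycle=first.cycle + 1)
    # Condition (b): at least one input is produced by a distant ALU, so a
    # pause of one extra clock cycle is ensured before execution.
    return Placement(alu=candidate_alu, cycle=first.cycle + 2)
```

In this sketch, a dependent operation whose inputs all come from ALUs 0 and 1 follows the first operation immediately, while one with an input from ALU 5 is held back by one additional cycle.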

Regarding claims 3, 11, and 19, taking claim 3 as representative, Mercaldi discloses the elements of claim 1, as discussed above. Mercaldi also discloses:
wherein the locally predefined range is a distance between two adjacent ALUs (Mercaldi discloses, at § 2.2, forming a pod of two adjacent PEs to reduce communication costs.).

Regarding claims 6, 14, and 22, taking claim 6 as representative, Mercaldi discloses the elements of claim 1, as discussed above. Mercaldi also discloses:
wherein: the clock cycle of the device is shorter than an amount of time to needed to guarantee that the third ALU (a) receives and selects an input produced from the first ALU and (b) executes the second instruction operation (Mercaldi discloses, at p. 143, Table 1, 8 PEs per domain and that ALUs within a domain but outside a pod have a transmission latency of 4 cycles, which discloses that the clock cycle is shorter than the amount of time needed to guarantee that the third ALU receives and executes the second instruction operation.).

Regarding claims 7, 15, and 23, taking claim 7 as representative, Mercaldi discloses the elements of claim 1, as discussed above. Mercaldi also discloses:
wherein the ensuring further comprises: second determining whether the first and second instruction operations are already separated in time of execution by at least one clock cycle of the device; and in response to a negative outcome of the second determining, inserting a delay of at least one clock cycle of the device between execution of the first and second instruction operations (Mercaldi discloses, at § 3.3.2, delaying an instruction by one or more cycles if the instructions could execute at the same time, i.e., are not already separated in time by at least one clock cycle.).

Regarding claims 8, 16, and 24, taking claim 8 as representative, Mercaldi discloses the elements of claim 1, as discussed above. Mercaldi also discloses:
wherein the first assigning further comprises setting the first and second instruction operations to be executed during a same clock cycle of the device (Mercaldi discloses, at § 4.2, left column of p. 145, dividing instructions between multiple PEs to exploit parallelism, which discloses executing first and second instructions during the same clock cycle.).

Claims 2, 10, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Mercaldi in view of Kuszmaul in view of “Design and Analysis of Routed Inter-ALU Networks for ILP Scalability and Performance” by Singh et al. (hereinafter referred to as “Singh”). 
Regarding claims 2, 10, and 18, taking claim 2 as representative, Mercaldi discloses the elements of claim 1, as discussed above. Mercaldi does not explicitly disclose wherein a clock speed of the device is defined in part by a worst case time of transmission between a consumer ALU and producer ALU of the plurality of ALUs within the locally predefined range.
However, in the same field of endeavor (e.g., inter-ALU transmission), Singh discloses: 
wherein a clock speed of the device is defined in part by a worst case time of transmission between a consumer ALU and producer ALU of the plurality of ALUs within the locally predefined range (Singh discloses, at § 1, p. 1, bypass latency (transmission delay) is a component in setting the cycle time (clock speed). Singh also discloses, at § 4.1, p. 11, using the worst case delay.).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify Mercaldi’s scheduling to use the worst case transmission delay disclosed by Singh in order to ensure high instruction throughput rates by preventing an increase in delay between execution of dependent instructions. See Singh, § 1, p. 1.

Claims 4, 12, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Mercaldi in view of Kuszmaul in view of Singh in view of “Scaling to the End of Silicon with EDGE Architectures” by Burger et al. (hereinafter referred to as “Burger”).
Regarding claims 4, 12, and 20, taking claim 4 as representative, Mercaldi, as modified, discloses the elements of claim 3, as discussed above. Mercaldi does not explicitly disclose wherein the locally predefined range is further defined by inputs and outputs of the two adjacent ALUs facing each other. 
However, in the same field of endeavor (e.g., array processing), Burger discloses: 
wherein the locally predefined range is further defined by inputs and outputs of the two adjacent ALUs facing each other (Burger discloses, at Figure 1, execution nodes having inputs and outputs facing each other.).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify Mercaldi’s scheduling to use the layout disclosed by Burger in order to improve performance by minimizing the physical distance that operands for dependent instruction chains must travel across the chip. See Burger, p. 46, right column.

Claims 5, 13, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Mercaldi in view of Kuszmaul in view of “Scaling to the End of Silicon with EDGE Architectures” by Burger et al. (hereinafter referred to as “Burger”).
Regarding claims 5, 13, and 21, taking claim 5 as representative, Mercaldi discloses the elements of claim 1, as discussed above. Mercaldi does not explicitly disclose wherein the first and second ALU are the same, and the locally predefined range is an ALU to itself.
However, in the same field of endeavor (e.g., array processing), Burger discloses: 
wherein the first and second ALU are the same, and the locally predefined range is an ALU to itself (Burger discloses, at page 49, left column, assigning instructions to the same ALU.).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify Mercaldi’s scheduling to use the instruction assignment disclosed by Burger in order to improve performance by minimizing the physical distance that operands for dependent instruction chains must travel across the chip. See Burger, p. 46, right column.

Response to Arguments
On pages 9-10 of the response filed July 30, 2021 (“response”), the Applicant argues that Mercaldi does not disclose all elements of claim 1, as amended. In support of this position, the Applicant argues, “Wavescalar as disclosed by Mercaldi is alleged in the FIND-AUS algorithm to generate a pause of at least clock cycle between instructions. Mercaldi discloses only one condition where this occurs - when "resource conflicts that arise when two instructions at the same PE can execute at the same time". Claim 1 as amended ensures a pause of at least one clock cycle between the first and second dependent instructions in response to different criteria, and in particular the second instruction requiring one or more inputs produced by a third ALU, which is outside of the local range from the first ALU that received the first instructions. The potential clock cycle pause in Wavescaler was thus in response to different and unrelated criteria then claim 1 as amended, and indeed is responsive to a condition that had nothing to do with inputs at all, let alone a condition of which ALU is producing the input. Similarly, the claimed condition has no relation to a resource allocation conflict. Wavescalar therefore does not teach or suggest ensuring a clock pause under the conditions as recited in claim 1.”
These remarks have been fully considered and, in light of the claim amendments presented in the response, are deemed persuasive. Please see above for new grounds of rejection of the amended claims. Specifically, Kuszmaul explicitly discloses that it is known to insert an extra cycle between execution of dependent instructions that execute in different clusters.
The Examiner notes that it is a fundamental and inherent property of circuits that longer wires require more time to traverse. This fact is explicitly recognized in Mercaldi. For example, Mercaldi discloses, at § 1, “placing dependent instructions on the same or adjacent tiles reduces producer-to-consumer operand latency.” Mercaldi also discloses, at Table 1, that latency within a pod, i.e., a pair of processing elements (PEs), is a single cycle, but that the latency within a domain, i.e., a group of four pods, is longer, i.e., four cycles. Thus, Mercaldi discloses that if any dependent instruction requires an input from outside the pod, more than one cycle will be required.
While Mercaldi does not explicitly disclose ensuring a pause in response to determining that an input of a dependent instruction is produced outside a pod, this disclosure is implicit. Mercaldi discloses attempting to locate all instructions in a dependency chain on the same PE, but it is evident from Mercaldi’s disclosure at § 3.3.1 of “assigning the instruction to the PE with the largest number of communicating operands,” that it is not always possible to do so. In those cases in which a dependent instruction is not placed on the same or an adjacent PE, it is evident that in order to ensure correct operation, additional delay will have to be introduced. This interpretation is supported by Mercaldi’s disclosure at § 3.3.2 of “accounting for stalling due to execution conflicts, as well as operand latency.” (emphasis supplied). However, rather than rely on implicit disclosure, in an effort to expedite prosecution the Examiner cites to Kuszmaul, which explicitly discloses adding an extra cycle of delay to compensate for increased transmission distance. The Examiner further notes that it is extremely well-known to add a cycle to compensate for delay in receiving results from processing resources that are relatively distant. A number of the references cited as pertinent in the Office action of December 14, 2020 disclose such features, as does the Steiss reference cited below.
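For illustration only, the pod and domain latencies cited above from Mercaldi's Table 1 can be expressed as a simple lookup. This is a minimal sketch under stated assumptions: the function name `operand_latency`, the consecutive numbering of PEs (so that integer division identifies a pod or domain), and the treatment of a PE sending to itself as an intra-pod transfer are all introduced for the example and are not taken from the reference. Latency beyond a domain is not addressed in this action, so the sketch declines to return a value for that case.

```python
POD_SIZE = 2               # two adjacent PEs form a pod
DOMAIN_SIZE = 8            # eight PEs (four pods) per domain
INTRA_POD_LATENCY = 1      # cycles within a pod, as cited from Table 1
INTRA_DOMAIN_LATENCY = 4   # cycles within a domain, as cited from Table 1


def operand_latency(producer_pe: int, consumer_pe: int) -> int:
    """Return the cycles needed to move an operand between two PEs.

    Assumes PEs are numbered consecutively, so PEs 0-1 share a pod and
    PEs 0-7 share a domain (an indexing convention for this sketch only).
    """
    if producer_pe // POD_SIZE == consumer_pe // POD_SIZE:
        return INTRA_POD_LATENCY
    if producer_pe // DOMAIN_SIZE == consumer_pe // DOMAIN_SIZE:
        return INTRA_DOMAIN_LATENCY
    # Transfers that leave the domain are outside what is cited above.
    raise ValueError("latency beyond a domain is not modeled in this sketch")
```

Under these assumptions, an operand passed from PE 0 to PE 1 arrives after one cycle, whereas an operand passed from PE 0 to PE 2 crosses a pod boundary and needs four cycles, which is the additional delay the discussion above refers to.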

On pages 10-11 of the response the Applicant presents arguments related to setting or basing clock speeds off of sub-groups of ALUs.
The Examiner notes that the Applicant is arguing features not in the claims. Claim 1 does not recite basing a system clock speed on transmission time between a sub-group of ALUs.

Conclusion
The following prior art made of record and not relied upon is considered pertinent to Applicant’s disclosure.
US 6766440 by Steiss discloses inserting a stall when an operand from one cluster is needed in another.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHAWN DOMAN whose telephone number is (571)270-5677.  The examiner can normally be reached on Monday through Friday 8:30am-6pm Eastern Time.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Aimee J. Li can be reached on 571-272-4169.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/SHAWN DOMAN/Primary Examiner, Art Unit 2183