DETAILED ACTION
Claims 1-2 and 4-12 are pending.
The office acknowledges the following papers:
Claims, specification, and remarks filed on 12/13/2021.

	Withdrawn objections and rejections
The specification objections have been withdrawn due to amendment.

Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the "right to exclude" granted by a patent and to prevent possible harassment by multiple assignees. See In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); and In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) may be used to overcome an actual or provisional rejection based on a nonstatutory double patenting ground provided the conflicting application or patent is shown to be commonly owned with this application. See 37 CFR 1.130(b).
Effective January 1, 1994, a registered attorney or agent of record may sign a terminal disclaimer. A terminal disclaimer signed by the assignee must fully comply with 37 CFR 3.73(b).

Applicants can file an eTerminal Disclaimer (eTD) in utility applications filed under 35 U.S.C. 111(a) or in compliance with 35 U.S.C. 371, and design applications. Filing an eTD via EFS-Web is highly recommended due to an extensive backlog for processing paper TDs. However, applicants may still file a TD for manual review.
Claims 1-2 and 4-8 are rejected under the judicially created doctrine of obviousness-type double patenting as being unpatentable over claims 1-4, 7 and 9-10 of 
Instant Application (claim 1):
1. An on-chip heterogeneous Artificial Intelligence (AI) processor, comprising:
at least two different architectural types of computation units, each of the computation units being associated with a task queue configured to store computation subtasks to be executed by the computation unit, wherein a first one of the computation units is a customized computation unit for a particular AI algorithm or operation, and a second one of the computation units is a programmable computation unit;
a controller configured to partition a received computation graph associated with a neural network into a plurality of computation subtasks and distribute the plurality of computation subtasks to the respective task queues associated with the computation units, wherein the controller distributes each computation subtask according to a type of the computation subtask to the task queue associated with a computation unit suitable for processing the type of the computation subtask;
a storage unit;
an access interface configured to access an off-chip memory.

Copending Application (claim 1):
1. A configurable heterogeneous Artificial Intelligence (AI) processor, comprising:
at least two different architectural types of computation units, wherein each of the computation units is associated with a task queue, at least one of the computation units is a customized computation unit for a particular AI algorithm or operation, and at least another one of the computation units is a programmable computation unit; and a controller, wherein the controller comprises a task scheduling module, a task synchronization module and an access control module, and wherein:
the task scheduling module is configured to partition, according to a configuration option indicating task allocation, a computation graph associated with a neural network into a plurality of computation subtasks, distribute the computation subtasks to the respective task queues of the computation units, and set a dependency among the computation subtasks, wherein at least one of the task allocations is based on type matching and the controller is configured to distribute each of the computation subtasks according to a 
the task synchronization module is configured to realize the synchronization of the computation subtasks according to the set dependency; and
the access control module is configured to control access to data involved in the computation subtasks on the storage unit and an off-chip memory.

Dependent claims 2 and 4-8 are read upon by the dependent claims 2-3, 7 and 9-10 of copending application 16/812,832.
This is a provisional obviousness-type double patenting rejection.  
Claims 9-12 are provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1 and 7-8 of copending Application No. 16/812,832, in view of Che et al. (U.S. 2020/0249998), in view of Wu et al. (U.S. 2020/0012521), in view of Official Notice.
This is a provisional obviousness-type double patenting rejection.  

New Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be 
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2 and 4-12 are rejected under 35 U.S.C. 103 as being unpatentable over Che et al. (U.S. 2020/0249998), in view of Wu et al. (U.S. 2020/0012521), in view of Official Notice.
As per claim 1:
Che and Wu disclosed an on-chip heterogeneous Artificial Intelligence (AI) processor, comprising:
at least two different architectural types of computation units (Che: Figures 1-2 elements 100 and 220, paragraphs 14, 26, and 30)(The neural network processing unit is a heterogeneous platform that can include SIMD/GPU/FPGA/ASIC accelerators.), each of the computation units being associated with a task queue configured to store computation subtasks to be executed by the computation unit (Wu: Figures 1, 3, 7, and 9 elements 620, S302-303, and S702-703, paragraphs 102, 111-117, and 158-159)(Che: Figures 1-2 elements 100 and 220, paragraphs 14, 26, and 30)(Wu disclosed distributing tasks of a directed acyclic graph (DAG) to a distributed set of work queues for parallel execution on processor cores. The combination adds the work queues of Wu into the neural network processing unit of Che such that the target devices include work queues.), wherein a first one of the computation units is a customized computation unit for a particular AI algorithm or operation (Che: Figures 1-2 elements 100 and 220, 
a controller configured to partition a received computation graph associated with a neural network into a plurality of computation subtasks (Che: Figures 2-4 elements 210-213 and 403, paragraphs 14, 32-33, 38-40, and 42)(The graph partitioner of the scheduler partitions a received computation graph into a plurality of subsets. Neural network models can be graphically represented by such computation graphs.) and distribute the plurality of computation subtasks to the respective task queues associated with the computation units, wherein the controller distributes each computation subtask according to a type of the computation subtask to the task queue associated with a computation unit suitable for processing the type of the computation subtask (Wu: Figures 1, 3, 7, and 9 elements 620, S302-303, and S702-703, paragraphs 102, 111-117, 151, and 158-159)(Che: Figures 2-3 elements 210-220, paragraphs 29, 42-43, 49, and 55)(Che disclosed the task allocation generator & optimizer assigns tasks of the computation graph to target devices for execution. Wu disclosed distributing tasks of a directed acyclic graph (DAG) to a distributed set of work queues for parallel execution on 
a storage unit configured to store data associated with executing the plurality of computation subtasks (Wu: Figure 1 element 620, paragraph 102)(Che: Figures 1-2 elements 104 and 220, paragraphs 19 and 29)(The host memory is used to load instructions and data of tasks to the cores of the neural network processing unit for task execution. In addition, the combination allows for Che to include caches for each target device.); and
an access interface configured to access an off-chip memory (Che: Figure 1 element 106, paragraphs 19-20)(The memory controller accesses the off-chip host memory.).
The advantage of work queues buffering tasks is that tasks waiting to be executed can be stored and load balanced. Thus, it would have been obvious to one of ordinary skill in the art at the time of the effective filing date to implement the work queues of Wu into the processing system of Che for the above advantage.
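For illustration only, the claimed distribution of computation subtasks to task queues according to subtask type can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Che, Wu, or the instant application: the names `ComputeUnit` and `dispatch`, the unit names, and the operation labels (`conv`, `matmul`, `control`) are hypothetical, and the subtask list stands in for the output of a graph partitioner.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ComputeUnit:
    """Hypothetical computation unit with its own task queue."""
    name: str
    supported_types: frozenset
    queue: deque = field(default_factory=deque)


def dispatch(subtasks, units, fallback):
    """Place each (subtask_id, subtask_type) pair on the queue of the first
    unit that supports its type; otherwise use the programmable fallback."""
    for task_id, task_type in subtasks:
        target = next(
            (u for u in units if task_type in u.supported_types), fallback
        )
        target.queue.append(task_id)


# Hypothetical units: a customized unit for specific AI operations and a
# programmable unit that accepts any remaining subtask type.
asic = ComputeUnit("asic", frozenset({"conv", "matmul"}))
gpgpu = ComputeUnit("gpgpu", frozenset())

# Hypothetical subtasks standing in for a partitioned computation graph.
subtasks = [("t0", "conv"), ("t1", "control"), ("t2", "matmul")]
dispatch(subtasks, [asic], fallback=gpgpu)

print(list(asic.queue))   # ['t0', 't2']
print(list(gpgpu.queue))  # ['t1']
```

In this sketch each queue buffers the subtasks awaiting execution on its unit, which is the role the work queues play in the combination described above.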
As per claim 2:
Che and Wu disclosed the on-chip heterogeneous AI processor according to claim 1, wherein the architectural types include at least one of the Application Specific Integrated Circuit (ASIC), General-Purpose Graphics Processing Unit (GPGPU), Field-
As per claim 4:
Che and Wu disclosed the on-chip heterogeneous AI processor according to claim 1, wherein the computation units comprise a computation unit of an Application Specific Integrated Circuit (ASIC) architecture (Che: Figures 1-2 elements 100 and 220, paragraphs 14, 26, and 30)(The neural network processing unit is a heterogeneous platform that can include SIMD/GPU/FPGA/ASIC accelerators.) and a computation unit of a General-Purpose Graphics Processing Unit (GPGPU) architecture (Che: Figures 1-2 elements 100 and 220, paragraphs 14, 26, and 30)(The neural network processing unit is a heterogeneous platform that can include SIMD/GPU/FPGA/ASIC accelerators.).
As per claim 5:
Che and Wu disclosed the on-chip heterogeneous AI processor according to claim 1, wherein the storage unit comprises a cache memory and a scratch-pad memory (Che: Figure 1 element 104, paragraphs 19 and 28)(The host memory is used to load instructions and data of tasks to the cores of the neural network processing unit for task execution. Official notice is given that storage elements can include cache and scratch-pad memories for the advantage of faster memory access. Thus, it would have been obvious to one of ordinary skill in the art to implement off-chip cache and scratch-pad memories in Che.).
As per claim 6:
Che and Wu disclosed the on-chip heterogeneous AI processor according to claim 
As per claim 7:
Che and Wu disclosed the on-chip heterogeneous AI processor according to claim 1, wherein the computation units are configured to support one or more of an independent parallel mode, a cooperative parallel mode, or an interactive cooperation mode, and wherein:
in the independent parallel mode, at least two of the plurality of computation subtasks are executed independently and in parallel with each other (Wu: Figures 3 and 8 elements S302-S303, paragraphs 111-116, 142-145)(Che: Figures 2 and 4 elements 210-220 and 403, paragraphs 29 and 48)(Che disclosed task allocation to target devices, but doesn’t explicitly state whether task execution is performed sequentially or in parallel. Wu disclosed distributing tasks that can be executed in parallel to parallel work queues for execution. The combination allows for executing a plurality of independent tasks in the computation graph of Che in parallel when the tasks aren’t dependent upon each other.);
in the cooperative parallel mode, at least two of the plurality of computation subtasks are executed cooperatively in a pipelined manner; and

As per claim 8:
Che and Wu disclosed the on-chip heterogeneous AI processor according to claim 1, wherein the controller distributes the plurality of computation subtasks to the computation units according to the capabilities of the computation units (Che: Figure 2 element 210, paragraphs 14 and 29)(The scheduler takes into consideration target device capabilities to process received tasks for execution.).
As per claim 9:
Che and Wu disclosed the on-chip heterogeneous AI processor according to claim 1, wherein the controller further comprises an access control module configured to read, from the off-chip memory and into the storage unit via the access interface, operational data required by at least one of the computation units to execute one or more computation subtasks distributed to the at least one of the computation units (Wu: Figure 1 element 620, paragraph 102)(Che: Figure 2 element 220, paragraph 29)(Che disclosed a single heterogeneous platform for executing tasks. Wu disclosed processor cores with caches for data storage. The combination allows for Che to include caches for each target device. Official notice is given that cache controllers can be implemented for the advantage of controlling memory access to caches. Thus, it would have been obvious to one of ordinary skill in the art to implement a cache controller in the combination. In view 
As per claim 10:
Che and Wu disclosed an on-chip heterogeneous Artificial Intelligence (AI) processor, comprising:
a plurality of computation clusters connected through an on-chip data exchange network (Wu: Figure 2 elements 710-720, paragraph 13)(Che: Figure 2 element 220, paragraph 29)(Che disclosed a single heterogeneous platform for executing tasks. Wu disclosed multiple processing clusters for parallel task execution. The combination allows for Che to include multiple heterogeneous platforms for task execution.), each of the computation clusters comprising:
at least two different architectural types of computation units (Che: Figures 1-2 elements 100 and 220, paragraphs 14, 26, and 30)(The neural network processing unit is a heterogeneous platform that can include 
an access control module (Wu: Figure 2 elements 710-720, paragraph 103)(Che: Figure 2 element 220, paragraph 29)(Che disclosed a single heterogeneous platform for executing tasks. Wu disclosed multiple processing 
a cache and an on-chip memory shared by the computation units (Wu: Figure 2 elements 710-720, paragraph 13)(Che: Figure 2 element 220, paragraph 29)(Che disclosed a single heterogeneous platform for executing tasks. Wu disclosed multiple processing clusters for parallel task execution, each including processor cores and caches. The combination allows for Che to include multiple heterogeneous platforms for task execution. Official notice is given that processing systems can include multiple on-chip memories for the advantage of increased memory access. Thus, it would have been obvious to one of ordinary skill in the art to implement added on-chip memories in the combination.);
a controller configured to partition a received computation graph associated with a neural network into a plurality of computation subtasks (Che: Figures 2-3 elements 210, 311-313, and 403, paragraphs 14, 32-33, 38-40, and 42)(The graph partitioner of the scheduler partitions a received computation graph into a plurality of subsets. Neural network models can be graphically represented by such computation graphs.) and distribute the plurality of computation subtasks to respective task queues associated with the computation units in each computation cluster, wherein the controller distributes each computation subtask according to a type of the computation subtask to the task queue 
an access interface configured to access an off-chip memory (Che: Figure 1 element 106, paragraphs 19-20)(The memory controller accesses the off-chip host memory.); and
a host interface configured to interact with an off-chip host processor (Che: Figure 2, paragraphs 29-30)(Official notice is given that accelerators can receive offloaded tasks for execution from a host processor and host interface for the advantage of allowing parallel execution on a host processor and faster execution of offloaded tasks on an accelerator. Thus, it would have been obvious to one of ordinary skill in the art to implement a host processor and host interface in Che for the above advantages.).

As per claim 11:
The additional limitation(s) of claim 11 basically recite the additional limitation(s) of claim 3. Therefore, claim 11 is rejected for the same reason(s) as claim 3.
As per claim 12:
The additional limitation(s) of claim 12 basically recite the additional limitation(s) of claim 2. Therefore, claim 12 is rejected for the same reason(s) as claim 2.

Response to Arguments
The arguments presented by Applicant in the response received on 12/13/2021 are not considered persuasive.
Applicant argues for claims 1 and 10:
“Che may mention that a heterogeneous platform may include various accelerators such as GPUs, FPGAs and ASICs, each of which can be used to process operations of machines-learning or deep-learning model. The accelerators discussed in Che (e.g., such as GPUs, FPGAs and ASICs) are specific to the operations of machines-learning or deep-learning models. And as shown in Figure 1 of Che, the top layer of cores provides circuitry representing an input layer to neural network, while the second layer of cores provides circuitry representing a hidden layer of a neural network. Accordingly, Che at best only teaches that the various types of accelerators included on a heterogeneous platform are all used for AI algorithms such as machines-learning, deep-learning models, graphics processing, etc. 
In contrast, in the solutions provided by amended claims 1 and 10, the two different architectural types of computation units included in the processor are not both used for AI algorithms or operations. Rather, a first one of the computation units is customized for a particular AI algorithm or operation, while a second one of the computation units is a programmable computation unit for performing 

This argument is not found to be persuasive for the following reason. Che disclosed a heterogeneous platform that includes various accelerators (e.g. GPUs, FPGAs, ASICs, etc.). Che alone doesn’t specify what types of applications/software programs are executed by the various accelerators. Official notice was given that it is well known to one of ordinary skill in the art that these types of accelerators (e.g. GPU/ASIC) execute AI algorithms/operations. As such, an accelerator executing such a program reads upon the customized computation unit. The FPGA of the heterogeneous platform reads upon the programmable computation unit. Thus, the combination reads upon the claimed limitations.
Applicant argues for claims 1 and 10:
“Wu may discuss distributing a plurality of tasks to a plurality of work queues of one processor that includes multiple processor cores. But the tasks in Wu are distributed to these work queues according to the dependencies among the tasks instead of the type of the tasks. See, e.g., paragraphs 158-159 of Wu. In fact, Wu is directed to a multi-core processor and is completely silent about a heterogeneous platform of various accelerators. So, Wu fails to disclose or teach that a first one of two different architectural types of computation units is a customized computation unit for a particular AI algorithm or operation, and a second one of the computation units is a programmable computation unit. Neither does Wu disclose or teach distributing each computation subtask according to a type of the computation subtask to the task queue associated with a computation unit suitable for processing the type of the computation subtask, as recited in claims 1 and 10.”

This argument is not found to be persuasive for the following reason. Wu disclosed distributing tasks to work queues of processor cores with the capability of executing a particular algorithm (i.e. type). The combination adds the work queues and 
Applicant argues for claims 1 and 10:
“Last but not least, Che teaches producing a sequence of nodes for representing an execution order of operations and a sequence of target devices corresponding to the sequence of nodes, and thus no work queue is required in Che. So, Applicant respectfully submits that those skilled in the art do not have any motivation to combine the work queue of Wu with the heterogeneous platform of Che to obtain the technical solutions recited by amended claims 1 and 10.” 

This argument is not found to be persuasive for the following reason. Proper motivation was given for the advantage of implementing work queues (i.e. buffering work and load balancing). Thus, the combination is proper.
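As an illustration of the buffering and load-balancing advantage relied upon above, the following minimal sketch places each incoming task on the currently shortest work queue. It is offered under stated assumptions only and is not taken from Wu or Che: the function name `enqueue_balanced`, the shortest-queue policy, and the task labels are hypothetical.

```python
from collections import deque


def enqueue_balanced(task, queues):
    """Append the task to the currently shortest queue and return its index.

    Ties are resolved in favor of the lowest queue index.
    """
    index = min(range(len(queues)), key=lambda i: len(queues[i]))
    queues[index].append(task)
    return index


# Two hypothetical work queues, one per processing unit.
queues = [deque(), deque()]
placements = [enqueue_balanced(t, queues) for t in ["a", "b", "c", "d", "e"]]

print(placements)                # [0, 1, 0, 1, 0]
print([len(q) for q in queues])  # [3, 2]
```

Tasks that cannot yet be executed remain stored in a queue, and the placement policy keeps the queue lengths within one task of each other.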

	Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JACOB A. PETRANEK whose telephone number is (571)272-5988.  The examiner can normally be reached on M-F 8:00-4:30.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Aimee Li, can be reached on (571) 272-4169.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/JACOB PETRANEK/
Primary Examiner, Art Unit 2183