DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on June 2, 2022, has been entered.
  
Claims 1-9 and 11-14 are pending in this Office action and are presented for examination. Claims 1 and 12-13 are newly amended, and claim 10 is newly cancelled, by the RCE received June 2, 2022.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 7-9, and 11-14 are rejected under 35 U.S.C. 103 as being unpatentable over Dao et al. (Dao) (US 6148395) in view of Aingaran et al. (Aingaran) (US 20140095468 A1), Lee et al. (Lee) (US 20140344194 A1), Shah et al. (Shah) (US 20150269074 A1), and Dockser (US 20070078923 A1).
Consider claim 1, Dao discloses a computing method applied to a chip (col. 2, line 60, single-chip multiprocessor), the chip comprising at least one processor core (col. 2, lines 64-65, two or more microprocessor central processing units, or CPUs) and a computational accelerator, the computational accelerator being connected to each of the at least one processor core (col. 4, lines 19-21, multiple CPUs on the same integrated circuit chip should therefore be able to share a single high-performance FPU), wherein the computational accelerator comprises a plurality of computing units, each of the plurality of computing units is configured for executing a complex computation (Dao, col. 8, line 66 to col. 9, line 4, in this example, a first path of execution circuitry 65 is multiplication circuitry 70, for performing floating-point multiplication and division operations. Multiplication circuitry 70 may include a sequence of circuitry known in the art, such as a Booth recorder, multiplier arrays, an adder, and rounding circuitry; col. 9, lines 11-15, a second path of execution circuitry 65 is implemented as adder circuitry 72, for performing additive and subtractive operations. In this example, adder circuitry 72 includes a sequence of an aligner, an adder, a LEO (left end out) shifter, normalization circuitry and rounding circuitry; col. 9, lines 23-28, the third path of execution circuitry 65 is a single-cycle execution unit 68, by way of which special single-cycle floating-point instructions may be executed. Examples of such single-cycle floating-point instructions include floating-point change of sign (FCHS) and floating-point absolute value (FABS)) and the plurality of computing units are configured for executing complex computations in parallel (Figure 4 shows the instructions being executed in parallel via use of pipelining), and the plurality of computing units correspond to a set of preset complex computational identifiers (col. 5, lines 27-35, the first stage of floating-point pipeline 40 is performed simultaneously for floating-point instructions detected by instruction decoders 16.sub.0, 16.sub.1, in their respective integer predecode stages 34.sub.0, 34.sub.1. As such, instruction queue stage 41.sub.0 receives a series of instruction codes for floating-point instructions detected in predecode stage 34.sub.0, while instruction queue stage 41.sub.1 receives a series of instruction codes for floating-point instructions detected in predecode stage 34.sub.1; col. 6, lines 35-40, each stage of instruction buffers 50 is preferably able to store an entire instruction code for a floating-point instruction which, in this case of x86-architecture CPUs 10, may include up to eight bytes (one for the tag, and up to seven bytes for the instruction code with identifiers for source and destination operand locations); note that an instruction code, also called an opcode, in processor architecture is used to identify a particular instruction to be executed), the computing method comprising:
decoding (col. 5, lines 27-35, the first stage of floating-point pipeline 40 is performed simultaneously for floating-point instructions detected by instruction decoders 16.sub.0, 16.sub.1, in their respective integer predecode stages 34.sub.0, 34.sub.1. As such, instruction queue stage 41.sub.0 receives a series of instruction codes for floating-point instructions detected in predecode stage 34.sub.0, while instruction queue stage 41.sub.1 receives a series of instruction codes for floating-point instructions detected in predecode stage 34.sub.1), by a target processor core (col. 2, lines 64-65, two or more microprocessor central processing units, or CPUs; in other words, the recited target processor core corresponds to any of two or more microprocessor central processing units) among the at least one processor core (col. 2, lines 64-65, two or more microprocessor central processing units, or CPUs), a to-be-executed instruction to obtain a computational identifier and at least one operand (col. 5, lines 27-35, the first stage of floating-point pipeline 40 is performed simultaneously for floating-point instructions detected by instruction decoders 16.sub.0, 16.sub.1, in their respective integer predecode stages 34.sub.0, 34.sub.1. As such, instruction queue stage 41.sub.0 receives a series of instruction codes for floating-point instructions detected in predecode stage 34.sub.0, while instruction queue stage 41.sub.1 receives a series of instruction codes for floating-point instructions detected in predecode stage 34.sub.1; col. 6, lines 35-40, each stage of instruction buffers 50 is preferably able to store an entire instruction code for a floating-point instruction which, in this case of x86-architecture CPUs 10, may include up to eight bytes (one for the tag, and up to seven bytes for the instruction code with identifiers for source and destination operand locations); note that an instruction code, also called an opcode, in processor architecture is used to identify a particular instruction to be executed);
in response to determining that the computational identifier obtained by decoding is a preset complex computational identifier included in the set of preset complex computational identifiers, generating, by the target processor core, a complex computational instruction using the computational identifier and the at least one operand obtained by the decoding (col. 5, lines 27-35, the first stage of floating-point pipeline 40 is performed simultaneously for floating-point instructions detected by instruction decoders 16.sub.0, 16.sub.1, in their respective integer predecode stages 34.sub.0, 34.sub.1. As such, instruction queue stage 41.sub.0 receives a series of instruction codes for floating-point instructions detected in predecode stage 34.sub.0, while instruction queue stage 41.sub.1 receives a series of instruction codes for floating-point instructions detected in predecode stage 34.sub.1; col. 6, lines 35-40, each stage of instruction buffers 50 is preferably able to store an entire instruction code for a floating-point instruction which, in this case of x86-architecture CPUs 10, may include up to eight bytes (one for the tag, and up to seven bytes for the instruction code with identifiers for source and destination operand locations)), and adding the generated complex computational instruction to a complex computational instruction queue (col. 5, lines 27-35, the first stage of floating-point pipeline 40 is performed simultaneously for floating-point instructions detected by instruction decoders 16.sub.0, 16.sub.1, in their respective integer predecode stages 34.sub.0, 34.sub.1. As such, instruction queue stage 41.sub.0 receives a series of instruction codes for floating-point instructions detected in predecode stage 34.sub.0, while instruction queue stage 41.sub.1 receives a series of instruction codes for floating-point instructions detected in predecode stage 34.sub.1);
selecting, by the computational accelerator, a complex computational instruction from the complex computational instruction queue (col. 5, lines 50-53, dispatch stage 44 will also effect arbitration between instruction sequences from CPUs 10.sub.0, 10.sub.1 in the event of simultaneous floating-point requests);
executing, by a computing unit corresponding to a preset complex computational identifier in the selected complex computational instruction in the computational accelerator, a complex computation indicated by the complex computational identifier in the selected complex computational instruction using at least one operand in the selected complex computational instruction as an inputted parameter (col. 6, lines 7-13, in this example, three execution stages 46, 47, 48 are included within floating-point pipeline 40, indicating that the floating-point arithmetic instructions may require up to three cycles to execute; of course, single-cycle instructions (such as change sign) may also be performed, in which case certain of execution stages 46, 47, 48 may be skipped; col. 6, lines 35-40, each stage of instruction buffers 50 is preferably able to store an entire instruction code for a floating-point instruction which, in this case of x86-architecture CPUs 10, may include up to eight bytes (one for the tag, and up to seven bytes for the instruction code with identifiers for source and destination operand locations)), to obtain a computational result (col. 6, line 15, results of the instruction); and
writing, by the computational accelerator, the obtained computational result as a complex computational result into a complex computational result queue (col. 6, lines 46-48, results from completion unit 75 are output to result register 76 which, in turn, drives writeback bus WB at its output; col. 9, lines 49-52, as shown in FIG. 3, writeback bus WB is coupled to output buffer 78, through which communication of the result of the floating-point operation to memory, via internal bus IBUS, may be effected).
To any extent to which Dao does not disclose “generating, by the target processor core, a complex computational instruction using the computational identifier and the at least one operand obtained by the decoding” (in view of an interpretation of “operand” that does not encompass an “operand location” — Examiner notes that this interpretation is argued against in a previous response to arguments section), Aingaran nevertheless discloses both that an operand may be an address where data is stored for a coprocessor to perform an operation, and an operand may be an immediate operand or an indirect operand ([0036], lines 1-6, an operand indicated in a CCB may be one of two types: an immediate operand or an indirect operand. An immediate operand is an operand that can be used immediately by a coprocessor when the coprocessor performs the operation without first requiring translation of the operand, such as a memory lookup; [0036], lines 9-14, an indirect operand is an operand that must first be translated or looked up before the coprocessor can perform the designated operation. An example of an indirect operand is a physical address that indicates where (e.g., in memory 140) table data is stored for the coprocessor to perform the operation).
Aingaran’s teaching of using an immediate operand precludes the necessity of a memory lookup (Aingaran, [0036], lines 1-6).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Aingaran with the invention of Dao in order to preclude the necessity of a memory lookup. Alternatively, this modification merely entails combining prior art elements (Dao’s instruction, and Aingaran’s immediate operand) according to known methods (Examiner submits that instructions comprising immediate operands have been well-known for decades) to yield predictable results (Dao’s instruction, comprising immediate operands), which is an exemplary rationale that may support a conclusion of obviousness, as per MPEP 2143.
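By way of illustration only, the distinction drawn from Aingaran between an immediate operand and an indirect operand may be sketched as follows. The sketch is not taken from Aingaran or Dao; the names `Operand`, `resolve`, and the dictionary standing in for memory are hypothetical and are assumed solely for the purpose of the example.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Operand:
    """Hypothetical operand record; the field names are assumptions."""
    kind: str                  # "immediate" or "indirect"
    value: Union[int, float]   # the data itself, or an address into memory


def resolve(operand: Operand, memory: Dict[int, float]) -> Union[int, float]:
    """Return the data a coprocessor would operate on.

    An immediate operand is usable as-is; an indirect operand must first
    be looked up (here, in a dict that stands in for memory).
    """
    if operand.kind == "immediate":
        return operand.value            # no lookup required
    if operand.kind == "indirect":
        return memory[operand.value]    # lookup at the stored address
    raise ValueError(f"unknown operand kind: {operand.kind}")
```

Under these assumptions, the immediate path returns without touching `memory`, which corresponds to the stated benefit of precluding a memory lookup.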
	However, the combination thus far does not entail that the chip is an artificial intelligence chip. The combination thus far also does not entail a cache, the cache being connected to the computational accelerator and each of the at least one processor core respectively by wired connections, wherein the complex computational instruction queue is stored in the cache. The combination thus far also does not entail the complex computations comprise an exponentiation computation, a square root extraction computation, and a trigonometric function computation.
	On the other hand, Lee discloses a chip being an artificial intelligence chip ([0024], lines 1-2, machine-learning accelerator (MLA) block; [0021], lines 1-3, for example, application specific integrated circuits (ASICs) are typically used to realize algorithms implemented in hardware). 
Lee’s teaching supports a range of computations required in various machine-learning frameworks while employing a specialized architecture that can exploit algorithmic structure in order to achieve low energy (Lee, [0007], lines 3-8).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Lee with the combination of Dao and Aingaran in order to support a range of computations required in various machine-learning frameworks while employing a specialized architecture that can exploit algorithmic structure in order to achieve low energy.
However, the combination thus far does not entail a cache, the cache being connected to the computational accelerator and each of the at least one processor core respectively by wired connections, wherein the complex computational instruction queue is stored in the cache. The combination thus far also does not entail the complex computations comprise an exponentiation computation, a square root extraction computation, and a trigonometric function computation.
On the other hand, Shah discloses a cache ([0027], line 16, shared cache), the cache being connected to a computational accelerator and each of at least one processor core respectively by wired connections (Figure 2, shared cache 230, accelerators 290A-B, cores 210A-C), wherein a complex computational instruction queue is stored in the cache ([0027], lines 15-16, instruction have been written to a queue in the shared cache).
Shah’s teaching increases efficiency (Shah, [0003], lines 1-5; title).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Shah with the combination of Dao, Aingaran, and Lee in order to increase efficiency. Alternatively, this modification merely entails a combination of prior art elements (the complex computational instruction queue of the combination of Dao, Aingaran, and Lee as cited above, and Shah’s teaching of a cache to store accelerator instructions) according to known methods (Shah’s teaching of a cache to store accelerator instructions) to yield predictable results (the complex computational instruction queue of the combination of Dao, Aingaran, and Lee, implemented using a cache according to Shah), which is an exemplary rationale that may support a conclusion of obviousness, as per MPEP 2143.
However, the combination thus far does not entail the complex computations comprise an exponentiation computation, a square root extraction computation, and a trigonometric function computation.
On the other hand, Dockser discloses complex computations comprising an exponentiation computation ([0001], line 4, exponential functions; [0019], lines 10-12, a floating-point exponential operator configured to execute floating-point exponential instructions), a square root extraction computation ([0019], lines 8-10, a floating-point square-root extractor configured to perform floating-point square-root extract instructions), and a trigonometric function computation ([0001], line 3, trigonometric functions; [0019], lines 14-16, a floating-point trigonometric operator configured to perform instructions for calculating trigonometric functions). (Note that, to any extent to which Dao does not disclose the computation being “complex”, Dockser teaches “complex” computations as cited above.)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Dockser with the combination of Dao, Aingaran, Lee, and Shah, in order to increase the computational capabilities of the AI chip of the aforementioned combination. Alternatively, this modification merely entails a combination of prior art elements (the combination of Dao, Aingaran, Lee, and Shah as described above, and Dockser’s explicit disclosure of the well-known computation types of exponentiation, square root extraction, and trigonometric functions) according to known methods (Dockser explicitly discloses the well-known concept of using hardware to perform exponentiation, square root extraction, and trigonometric functions) to yield predictable results (the combination of Dao, Aingaran, Lee, and Shah as described above, further supporting performing exponentiation, square root extraction, and trigonometric functions), which is an exemplary rationale that may support a conclusion of obviousness, as per MPEP 2143.
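By way of illustration only, the sequence of steps recited in claim 1, as mapped to the overall combination above, may be sketched as follows. This is a simplified software model and not the implementation of any cited reference; the identifier strings (`"EXP"`, `"SQRT"`, `"SIN"`), the class `ComplexInstruction`, and the function names are hypothetical, and the two queues are modeled as in-process FIFO deques rather than as structures stored in a shared cache.

```python
import math
from collections import deque
from dataclasses import dataclass
from typing import Tuple

# Hypothetical preset identifiers, each mapped to one "computing unit".
COMPUTING_UNITS = {
    "EXP": math.exp,    # exponentiation
    "SQRT": math.sqrt,  # square root extraction
    "SIN": math.sin,    # trigonometric function
}


@dataclass
class ComplexInstruction:
    identifier: str
    operands: Tuple[float, ...]


instruction_queue = deque()  # complex computational instruction queue (FIFO)
result_queue = deque()       # complex computational result queue (FIFO)


def core_decode_and_enqueue(raw) -> bool:
    """Processor-core side: decode, test the identifier, and enqueue.

    `raw` is assumed to be a tuple (identifier, operand, ...).
    Returns True if the instruction was handed to the accelerator queue.
    """
    identifier, *operands = raw
    if identifier in COMPUTING_UNITS:
        instruction_queue.append(ComplexInstruction(identifier, tuple(operands)))
        return True
    return False  # not a preset complex identifier; the core handles it itself


def accelerator_step() -> bool:
    """Accelerator side: select one instruction, execute it, write the result."""
    if not instruction_queue:
        return False
    instr = instruction_queue.popleft()       # select in FIFO order
    unit = COMPUTING_UNITS[instr.identifier]  # unit matching the identifier
    result_queue.append(unit(*instr.operands))
    return True
```

In this sketch, an instruction whose identifier is outside the preset set is never placed on the queue, which mirrors the "in response to determining" condition of the claim.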

Consider claim 7, the overall combination entails the computational accelerator is an application specific integrated circuit chip or a field programmable gate array (Dao, col. 2, line 60, single-chip multiprocessor; Lee, [0022], line 2, machine-learning accelerator (MLA) integrated circuit).

Consider claim 8, the overall combination entails the complex computational instruction queue and the complex computational result queue are first-in-first-out queues (Dao, col. 5, line 41, FIFO order; col. 6, lines 26-28, as noted above relative to the description of queue stages 41 in floating-point pipeline 40, instruction buffers 50 are preferably arranged in a FIFO manner; col. 6, lines 46-48, results from completion unit 75 are output to result register 76 which, in turn, drives writeback bus WB at its output; col. 9, lines 49-52, as shown in FIG. 3, writeback bus WB is coupled to output buffer 78, through which communication of the result of the floating-point operation to memory, via internal bus IBUS, may be effected).

Consider claim 9, the combination thus far discloses the complex computational instruction queue is stored in the cache (see the rejection of claim 1, which recites “a complex computational instruction queue stored in the cache”). In addition, Shah further discloses a complex computational result queue is stored in the cache ([0019], lines 1-2, after the accelerator performs operations on the data, it writes output data to the shared cache). Analogous to the rationales for obviousness involving Shah in the rejection of the independent claim: it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Shah with the combination of Dao, Aingaran, Lee, and Dockser in order to increase efficiency. Alternatively, this modification merely entails a combination of prior art elements (the complex computational result queue of Dao, Aingaran, Lee, and Dockser as cited above, and Shah’s teaching of a cache to store accelerator results) according to known methods (Shah’s teaching of a cache to store accelerator results) to yield predictable results (the complex computational result queue of Dao, Aingaran, Lee, and Dockser, implemented using a cache according to Shah), which is an exemplary rationale that may support a conclusion of obviousness, as per MPEP 2143.

Consider claim 11, the overall combination entails the preset complex computational identifier is an exponentiation identifier (Dockser, [0001], line 4, exponential functions; [0019], lines 10-12, a floating-point exponential operator configured to execute floating-point exponential instructions), a square root extraction identifier (Dockser, [0019], lines 8-10, a floating-point square-root extractor configured to perform floating-point square-root extract instructions), or a trigonometric function computation identifier (Dockser, [0001], line 3, trigonometric functions; [0019], lines 14-16, a floating-point trigonometric operator configured to perform instructions for calculating trigonometric functions).

Consider claim 12, Dao discloses a chip (col. 2, line 60, single-chip multiprocessor) comprising: at least one processor core (col. 2, lines 64-65, two or more microprocessor central processing units, or CPUs); a computational accelerator connected to each of the at least one processor core (col. 4, lines 19-21, multiple CPUs on the same integrated circuit chip should therefore be able to share a single high-performance FPU), wherein the computational accelerator comprises a plurality of computing units, each of the plurality of computing units is configured for executing a complex computation (Dao, col. 8, line 66 to col. 9, line 4, in this example, a first path of execution circuitry 65 is multiplication circuitry 70, for performing floating-point multiplication and division operations. Multiplication circuitry 70 may include a sequence of circuitry known in the art, such as a Booth recorder, multiplier arrays, an adder, and rounding circuitry; col. 9, lines 11-15, a second path of execution circuitry 65 is implemented as adder circuitry 72, for performing additive and subtractive operations. In this example, adder circuitry 72 includes a sequence of an aligner, an adder, a LEO (left end out) shifter, normalization circuitry and rounding circuitry; col. 9, lines 23-28, the third path of execution circuitry 65 is a single-cycle execution unit 68, by way of which special single-cycle floating-point instructions may be executed. Examples of such single-cycle floating-point instructions include floating-point change of sign (FCHS) and floating-point absolute value (FABS)) and the plurality of computing units are configured for executing complex computations in parallel (Figure 4 shows the instructions being executed in parallel via use of pipelining), and the plurality of computing units correspond to a set of preset complex computational identifiers (col. 5, lines 27-35, the first stage of floating-point pipeline 40 is performed simultaneously for floating-point instructions detected by instruction decoders 16.sub.0, 16.sub.1, in their respective integer predecode stages 34.sub.0, 34.sub.1. As such, instruction queue stage 41.sub.0 receives a series of instruction codes for floating-point instructions detected in predecode stage 34.sub.0, while instruction queue stage 41.sub.1 receives a series of instruction codes for floating-point instructions detected in predecode stage 34.sub.1; col. 6, lines 35-40, each stage of instruction buffers 50 is preferably able to store an entire instruction code for a floating-point instruction which, in this case of x86-architecture CPUs 10, may include up to eight bytes (one for the tag, and up to seven bytes for the instruction code with identifiers for source and destination operand locations); note that an instruction code, also called an opcode, in processor architecture is used to identify a particular instruction to be executed); the chip to implement operations, the operations comprising:
decoding (col. 5, lines 27-35, the first stage of floating-point pipeline 40 is performed simultaneously for floating-point instructions detected by instruction decoders 16.sub.0, 16.sub.1, in their respective integer predecode stages 34.sub.0, 34.sub.1. As such, instruction queue stage 41.sub.0 receives a series of instruction codes for floating-point instructions detected in predecode stage 34.sub.0, while instruction queue stage 41.sub.1 receives a series of instruction codes for floating-point instructions detected in predecode stage 34.sub.1), by a target processor core (col. 2, lines 64-65, two or more microprocessor central processing units, or CPUs; in other words, the recited target processor core corresponds to any of two or more microprocessor central processing units) among the at least one processor core (col. 2, lines 64-65, two or more microprocessor central processing units, or CPUs), a to-be-executed instruction to obtain a computational identifier and at least one operand;
in response to determining that the computational identifier obtained by decoding is a preset complex computational identifier included in the set of preset complex computational identifiers, generating, by the target processor core, a complex computational instruction using the computational identifier and the at least one operand obtained by the decoding (col. 5, lines 27-35, the first stage of floating-point pipeline 40 is performed simultaneously for floating-point instructions detected by instruction decoders 16.sub.0, 16.sub.1, in their respective integer predecode stages 34.sub.0, 34.sub.1. As such, instruction queue stage 41.sub.0 receives a series of instruction codes for floating-point instructions detected in predecode stage 34.sub.0, while instruction queue stage 41.sub.1 receives a series of instruction codes for floating-point instructions detected in predecode stage 34.sub.1; col. 6, lines 35-40, each stage of instruction buffers 50 is preferably able to store an entire instruction code for a floating-point instruction which, in this case of x86-architecture CPUs 10, may include up to eight bytes (one for the tag, and up to seven bytes for the instruction code with identifiers for source and destination operand locations)), and adding the generated complex computational instruction to a complex computational instruction queue (col. 5, lines 27-35, the first stage of floating-point pipeline 40 is performed simultaneously for floating-point instructions detected by instruction decoders 16.sub.0, 16.sub.1, in their respective integer predecode stages 34.sub.0, 34.sub.1. As such, instruction queue stage 41.sub.0 receives a series of instruction codes for floating-point instructions detected in predecode stage 34.sub.0, while instruction queue stage 41.sub.1 receives a series of instruction codes for floating-point instructions detected in predecode stage 34.sub.1);
selecting, by the computational accelerator, a complex computational instruction from the complex computational instruction queue (col. 5, lines 50-53, dispatch stage 44 will also effect arbitration between instruction sequences from CPUs 10.sub.0, 10.sub.1 in the event of simultaneous floating-point requests);
executing, by a computing unit corresponding to a preset complex computational identifier in the selected complex computational instruction in the computational accelerator, a complex computation indicated by the complex computational identifier in the selected complex computational instruction using at least one operand in the selected complex computational instruction as an inputted parameter (col. 6, lines 7-13, in this example, three execution stages 46, 47, 48 are included within floating-point pipeline 40, indicating that the floating-point arithmetic instructions may require up to three cycles to execute; of course, single-cycle instructions (such as change sign) may also be performed, in which case certain of execution stages 46, 47, 48 may be skipped; col. 6, lines 35-40, each stage of instruction buffers 50 is preferably able to store an entire instruction code for a floating-point instruction which, in this case of x86-architecture CPUs 10, may include up to eight bytes (one for the tag, and up to seven bytes for the instruction code with identifiers for source and destination operand locations)), to obtain a computational result (col. 6, line 15, results of the instruction); and
writing, by the computational accelerator, the obtained computational result as a complex computational result into a complex computational result queue (col. 6, lines 46-48, results from completion unit 75 are output to result register 76 which, in turn, drives writeback bus WB at its output; col. 9, lines 49-52, as shown in FIG. 3, writeback bus WB is coupled to output buffer 78, through which communication of the result of the floating-point operation to memory, via internal bus IBUS, may be effected).
To any extent to which Dao does not disclose “generating, by the target processor core, a complex computational instruction using the computational identifier and the at least one operand obtained by the decoding” (in view of an interpretation of “operand” that does not encompass an “operand location” — Examiner notes that this interpretation is argued against in the response to arguments section), Aingaran nevertheless discloses both that an operand may be an address where data is stored for a coprocessor to perform an operation, and an operand may be an immediate operand or an indirect operand ([0036], lines 1-6, an operand indicated in a CCB may be one of two types: an immediate operand or an indirect operand. An immediate operand is an operand that can be used immediately by a coprocessor when the coprocessor performs the operation without first requiring translation of the operand, such as a memory lookup; [0036], lines 9-14, an indirect operand is an operand that must first be translated or looked up before the coprocessor can perform the designated operation. An example of an indirect operand is a physical address that indicates where (e.g., in memory 140) table data is stored for the coprocessor to perform the operation).
Aingaran’s teaching of using an immediate operand precludes the necessity of a memory lookup (Aingaran, [0036], lines 1-6).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Aingaran with the invention of Dao in order to preclude the necessity of a memory lookup. Alternatively, this modification merely entails combining prior art elements (Dao’s instruction, and Aingaran’s immediate operand) according to known methods (Examiner submits that instructions comprising immediate operands have been well-known for decades) to yield predictable results (Dao’s instruction, comprising immediate operands), which is an exemplary rationale that may support a conclusion of obviousness, as per MPEP 2143.
However, the combination thus far does not disclose that the chip is an artificial intelligence chip. The combination thus far also does not entail a cache, the cache being connected to the computational accelerator and each of the at least one processor core respectively by wired connections, wherein the complex computational instruction queue is stored in the cache. The combination thus far also does not disclose a storage apparatus, storing at least one program thereon, wherein the at least one program, when executed by the artificial intelligence chip, causes the artificial intelligence chip to implement the aforementioned operations. The combination thus far also does not entail that the complex computations comprise an exponentiation computation, a square root extraction computation, and a trigonometric function computation.
On the other hand, Lee discloses a chip being an artificial intelligence chip ([0024], lines 1-2, machine-learning accelerator (MLA) block; [0021], lines 1-3, for example, application specific integrated circuits (ASICs) are typically used to realize algorithms implemented in hardware).
Lee’s teaching supports a range of computations required in various machine-learning frameworks while employing a specialized architecture that can exploit algorithmic structure in order to achieve low energy (Lee, [0007], lines 3-8).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Lee with the combination of Dao and Aingaran in order to support a range of computations required in various machine-learning frameworks while employing a specialized architecture that can exploit algorithmic structure in order to achieve low energy.
However, the combination thus far also does not entail a cache, the cache being connected to the computational accelerator and each of the at least one processor core respectively by wired connections, wherein the complex computational instruction queue is stored in the cache. The combination thus far also does not entail a storage apparatus, storing at least one program thereon, wherein the at least one program, when executed by the artificial intelligence chip, causes the artificial intelligence chip to implement the aforementioned operations. The combination thus far also does not entail that the complex computations comprise an exponentiation computation, a square root extraction computation, and a trigonometric function computation.
On the other hand, Shah discloses a cache ([0027], line 16, shared cache), the cache being connected to a computational accelerator and each of at least one processor core respectively by wired connections (Figure 2, shared cache 230, accelerators 290A-B, cores 210A-C), wherein a complex computational instruction queue is stored in the cache ([0027], lines 15-16, instruction have been written to a queue in the shared cache). Shah also discloses a storage apparatus, storing at least one program thereon, wherein the at least one program, when executed by a chip, causes the chip to implement operations ([0127]).
Shah’s teaching increases efficiency (Shah, [0003], lines 1-5; title).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Shah with the combination of Dao, Aingaran, and Lee in order to increase efficiency. Additionally, this modification merely entails a combination of prior art elements (the complex computational instruction queue and the artificial intelligence chip of the combination of Dao, Aingaran, and Lee as cited above, and Shah’s teaching of a cache to store accelerator instructions, as well as Shah’s teaching of a storage apparatus, storing at least one program thereon, wherein the at least one program, when executed by a chip, causes the chip to implement operations) according to known methods (Shah’s teaching of a cache to store accelerator instructions; Shah’s teaching of a storage apparatus, storing at least one program thereon, wherein the at least one program, when executed by a chip, causes the chip to implement operations) to yield predictable results (the complex computational instruction queue of the combination of Dao, Aingaran, and Lee, implemented using a cache according to Shah, with operations of the artificial intelligence chip of the combination of Dao, Aingaran, and Lee implemented using a computer-readable medium according to Shah), which is an exemplary rationale that may support a conclusion of obviousness, as per MPEP 2143. 
However, the combination thus far does not entail that the complex computations comprise an exponentiation computation, a square root extraction computation, and a trigonometric function computation.
On the other hand, Dockser discloses that complex computations comprise an exponentiation computation ([0001], line 4, exponential functions; [0019], lines 10-12, a floating-point exponential operator configured to execute floating-point exponential instructions), a square root extraction computation ([0019], lines 8-10, a floating-point square-root extractor configured to perform floating-point square-root extract instructions), and a trigonometric function computation ([0001], line 3, trigonometric functions; [0019], lines 14-16, a floating-point trigonometric operator configured to perform instructions for calculating trigonometric functions). (Note that, to any extent to which Dao does not disclose the computation being “complex”, Dockser teaches “complex” computations as cited above.)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Dockser with the combination of Dao, Aingaran, Lee, and Shah, in order to increase the computational capabilities of the AI chip of the aforementioned combination. Alternatively, this modification merely entails a combination of prior art elements (the combination of Dao, Aingaran, Lee, and Shah as described above, and Dockser’s explicit disclosure of the well-known computation types of exponentiation, square root extraction, and trigonometric functions) according to known methods (Dockser explicitly discloses the well-known concept of using hardware to perform exponentiation, square root extraction, and trigonometric functions) to yield predictable results (the combination of Dao, Aingaran, Lee, and Shah as described above, further supporting performing exponentiation, square root extraction, and trigonometric functions), which is an exemplary rationale that may support a conclusion of obviousness, as per MPEP 2143.
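For illustration only, the flow recited in claim 1 as mapped above (decoding to obtain a computational identifier and operands, checking the identifier against a preset set, queuing the instruction, and having the accelerator execute it on the matching computing unit and queue the result) can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the application or of any cited reference: the names `UNITS`, `core_issue`, and `accelerator_step`, the tuple encoding of an instruction, and the first-in-first-out selection are all hypothetical.

```python
import math
from collections import deque

# Hypothetical set of preset complex computational identifiers, each mapped
# to the "computing unit" (here, a plain function) that executes it.
UNITS = {"EXP": math.exp, "SQRT": math.sqrt, "SIN": math.sin}


def core_issue(instruction, instr_queue):
    """Decode an instruction tuple and queue it if its identifier is preset.

    Returns True when the instruction was handed to the accelerator queue,
    False when the core would handle it some other way.
    """
    identifier, operands = instruction[0], tuple(instruction[1:])
    if identifier in UNITS:
        instr_queue.append((identifier, operands))
        return True
    return False


def accelerator_step(instr_queue, result_queue):
    """Select one queued instruction, execute it, and queue the result."""
    if not instr_queue:
        return False
    identifier, operands = instr_queue.popleft()  # FIFO selection (assumed)
    result_queue.append(UNITS[identifier](*operands))
    return True
```

Here an instruction such as `("SQRT", 9.0)` is queued for the accelerator, whereas an instruction whose identifier is not in `UNITS` is not.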

Consider claim 13, Dao discloses a chip (col. 2, line 60, single-chip multiprocessor) that implements operations, the chip comprising at least one processor core (col. 2, lines 64-65, two or more microprocessor central processing units, or CPUs) and a computational accelerator, the computational accelerator being connected to each of the at least one processor core (col. 4, lines 19-21, multiple CPUs on the same integrated circuit chip should therefore be able to share a single high-performance FPU), wherein the computational accelerator comprises a plurality of computing units, each of the plurality of computing units is configured for executing a complex computation (Dao, col. 8, line 66 to col. 9, line 4, in this example, a first path of execution circuitry 65 is multiplication circuitry 70, for performing floating-point multiplication and division operations. Multiplication circuitry 70 may include a sequence of circuitry known in the art, such as a Booth recorder, multiplier arrays, an adder, and rounding circuitry; col. 9, lines 11-15, a second path of execution circuitry 65 is implemented as adder circuitry 72, for performing additive and subtractive operations. In this example, adder circuitry 72 includes a sequence of an aligner, an adder, a LEO (left end out) shifter, normalization circuitry and rounding circuitry; col. 9, lines 23-28, the third path of execution circuitry 65 is a single-cycle execution unit 68, by way of which special single-cycle floating-point instructions may be executed. Examples of such single-cycle floating-point instructions include floating-point change of sign (FCHS) and floating-point absolute value (FABS)) and the plurality of computing units are configured for executing complex computations in parallel (Figure 4 shows the instructions being executed in parallel via use of pipelining), and the plurality of computing units correspond to a set of preset complex computational identifiers (col. 5, lines 27-35, the first stage of floating-point pipeline 40 is performed simultaneously for floating-point instructions detected by instruction decoders 16.sub.0, 16.sub.1, in their respective integer predecode stages 34.sub.0, 34.sub.1. As such, instruction queue stage 41.sub.0 receives a series of instruction codes for floating-point instructions detected in predecode stage 34.sub.0, while instruction queue stage 41.sub.1 receives a series of instruction codes for floating-point instructions detected in predecode stage 34.sub.1; col. 6, lines 35-40, each stage of instruction buffers 50 is preferably able to store an entire instruction code for a floating-point instruction which, in this case of x86-architecture CPUs 10, may include up to eight bytes (one for the tag, and up to seven bytes for the instruction code with identifiers for source and destination operand locations); note that an instruction code, also called an opcode, in processor architecture is used to identify a particular instruction to be executed), the operations comprising: decoding (col. 5, lines 27-35, the first stage of floating-point pipeline 40 is performed simultaneously for floating-point instructions detected by instruction decoders 16.sub.0, 16.sub.1, in their respective integer predecode stages 34.sub.0, 34.sub.1. As such, instruction queue stage 41.sub.0 receives a series of instruction codes for floating-point instructions detected in predecode stage 34.sub.0, while instruction queue stage 41.sub.1 receives a series of instruction codes for floating-point instructions detected in predecode stage 34.sub.1), by a target processor core (col. 2, lines 64-65, two or more microprocessor central processing units, or CPUs; in other words, the recited target processor core corresponds to any of two or more microprocessor central processing units) among the at least one processor core (col. 2, lines 64-65, two or more microprocessor central processing units, or CPUs), a to-be-executed instruction to obtain a computational identifier and at least one operand (col. 5, lines 27-35, the first stage of floating-point pipeline 40 is performed simultaneously for floating-point instructions detected by instruction decoders 16.sub.0, 16.sub.1, in their respective integer predecode stages 34.sub.0, 34.sub.1. As such, instruction queue stage 41.sub.0 receives a series of instruction codes for floating-point instructions detected in predecode stage 34.sub.0, while instruction queue stage 41.sub.1 receives a series of instruction codes for floating-point instructions detected in predecode stage 34.sub.1; col. 6, lines 35-40, each stage of instruction buffers 50 is preferably able to store an entire instruction code for a floating-point instruction which, in this case of x86-architecture CPUs 10, may include up to eight bytes (one for the tag, and up to seven bytes for the instruction code with identifiers for source and destination operand locations); note that an instruction code, also called an opcode, in processor architecture is used to identify a particular instruction to be executed); in response to determining that the computational identifier obtained by decoding is a preset complex computational identifier included in the set of preset complex computational identifiers, generating, by the target processor core, a complex computational instruction using the computational identifier and the at least one operand obtained by the decoding (col. 5, lines 27-35, the first stage of floating-point pipeline 40 is performed simultaneously for floating-point instructions detected by instruction decoders 16.sub.0, 16.sub.1, in their respective integer predecode stages 34.sub.0, 34.sub.1. As such, instruction queue stage 41.sub.0 receives a series of instruction codes for floating-point instructions detected in predecode stage 34.sub.0, while instruction queue stage 41.sub.1 receives a series of instruction codes for floating-point instructions detected in predecode stage 34.sub.1; col. 6, lines 35-40, each stage of instruction buffers 50 is preferably able to store an entire instruction code for a floating-point instruction which, in this case of x86-architecture CPUs 10, may include up to eight bytes (one for the tag, and up to seven bytes for the instruction code with identifiers for source and destination operand locations)), and adding the generated complex computational instruction to a complex computational instruction queue (col. 5, lines 27-35, the first stage of floating-point pipeline 40 is performed simultaneously for floating-point instructions detected by instruction decoders 16.sub.0, 16.sub.1, in their respective integer predecode stages 34.sub.0, 34.sub.1. As such, instruction queue stage 41.sub.0 receives a series of instruction codes for floating-point instructions detected in predecode stage 34.sub.0, while instruction queue stage 41.sub.1 receives a series of instruction codes for floating-point instructions detected in predecode stage 34.sub.1); selecting, by the computational accelerator, a complex computational instruction from the complex computational instruction queue (col. 5, lines 50-53, dispatch stage 44 will also effect arbitration between instruction sequences from CPUs 10.sub.0, 10.sub.1 in the event of simultaneous floating-point requests); executing, by a computing unit corresponding to a preset complex computational identifier in the selected complex computational instruction in the computational accelerator, a complex computation indicated by the complex computational identifier in the selected complex computational instruction using at least one operand in the selected complex computational instruction as an inputted parameter (col. 6, lines 7-13, in this example, three execution stages 46, 47, 48 are included within floating-point pipeline 40, indicating that the floating-point arithmetic instructions may require up to three cycles to execute; of course, single-cycle instructions (such as change sign) may also be performed, in which case certain of execution stages 46, 47, 48 may be skipped; col. 6, lines 35-40, each stage of instruction buffers 50 is preferably able to store an entire instruction code for a floating-point instruction which, in this case of x86-architecture CPUs 10, may include up to eight bytes (one for the tag, and up to seven bytes for the instruction code with identifiers for source and destination operand locations)), to obtain a computational result (col. 6, line 15, results of the instruction); and writing, by the computational accelerator, the obtained computational result as a complex computational result into a complex computational result queue (col. 6, lines 46-48, results from completion unit 75 are output to result register 76 which, in turn, drives writeback bus WB at its output; col. 9, lines 49-52, as shown in FIG. 3, writeback bus WB is coupled to output buffer 78, through which communication of the result of the floating-point operation to memory, via internal bus IBUS, may be effected).
To any extent to which Dao does not disclose “generating, by the target processor core, a complex computational instruction using the computational identifier and the at least one operand obtained by the decoding” (in view of an interpretation of “operand” that does not encompass an “operand location” — Examiner notes that this interpretation is argued against in the response to arguments section), Aingaran nevertheless discloses both that an operand may be an address where data is stored for a coprocessor to perform an operation, and an operand may be an immediate operand or an indirect operand ([0036], lines 1-6, an operand indicated in a CCB may be one of two types: an immediate operand or an indirect operand. An immediate operand is an operand that can be used immediately by a coprocessor when the coprocessor performs the operation without first requiring translation of the operand, such as a memory lookup; [0036], lines 9-14, an indirect operand is an operand that must first be translated or looked up before the coprocessor can perform the designated operation. An example of an indirect operand is a physical address that indicates where (e.g., in memory 140) table data is stored for the coprocessor to perform the operation).
Aingaran’s teaching of using an immediate operand precludes the necessity of a memory lookup (Aingaran, [0036], lines 1-6).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Aingaran with the invention of Dao in order to preclude the necessity of a memory lookup. Alternatively, this modification merely entails combining prior art elements (Dao’s instruction, and Aingaran’s immediate operand) according to known methods (Examiner submits that instructions comprising immediate operands have been well-known for decades) to yield predictable results (Dao’s instruction, comprising immediate operands), which is an exemplary rationale that may support a conclusion of obviousness, as per MPEP 2143.
However, the combination thus far does not disclose that the chip is an artificial intelligence chip. The combination thus far also does not entail a cache, the cache being connected to the computational accelerator and each of the at least one processor core respectively by wired connections. The combination thus far also does not disclose a non-transitory computer readable medium, storing a computer program thereon, wherein the program, when executed by the artificial intelligence chip, implements the aforementioned operations. The combination thus far also does not entail that the complex computations comprise an exponentiation computation, a square root extraction computation, and a trigonometric function computation.
On the other hand, Lee discloses a chip being an artificial intelligence chip ([0024], lines 1-2, machine-learning accelerator (MLA) block; [0021], lines 1-3, for example, application specific integrated circuits (ASICs) are typically used to realize algorithms implemented in hardware).
Lee’s teaching supports a range of computations required in various machine-learning frameworks while employing a specialized architecture that can exploit algorithmic structure in order to achieve low energy (Lee, [0007], lines 3-8).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Lee with the combination of Dao and Aingaran in order to support a range of computations required in various machine-learning frameworks while employing a specialized architecture that can exploit algorithmic structure in order to achieve low energy.
However, the combination thus far also does not entail a cache, the cache being connected to the computational accelerator and each of the at least one processor core respectively by wired connections. The combination thus far also does not disclose a non-transitory computer readable medium, storing a computer program thereon, wherein the program, when executed by the artificial intelligence chip, implements the aforementioned operations. The combination thus far also does not entail that the complex computations comprise an exponentiation computation, a square root extraction computation, and a trigonometric function computation.
On the other hand, Shah discloses a cache ([0027], line 16, shared cache), the cache being connected to a computational accelerator and each of at least one processor core respectively by wired connections (Figure 2, shared cache 230, accelerators 290A-B, cores 210A-C). Shah also discloses a non-transitory computer readable medium, storing a computer program thereon, wherein the computer program, when executed by a chip, implements operations ([0127]).
Shah’s teaching increases efficiency (Shah, [0003], lines 1-5; title).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Shah with the combination of Dao, Aingaran, and Lee in order to increase efficiency. Additionally, this modification merely entails a combination of prior art elements (the artificial intelligence chip of the combination of Dao, Aingaran, and Lee as cited above, and Shah’s teaching of a non-transitory computer readable medium, storing a computer program thereon, wherein the computer program, when executed by a chip, implements operations) according to known methods (Shah’s teaching of a cache on a chip; Shah’s teaching of a non-transitory computer readable medium, storing a computer program thereon, wherein the computer program, when executed by a chip, implements operations) to yield predictable results (the chip of the combination of Dao, Aingaran, and Lee, comprising a shared cache (for example, to store the complex computational instruction queue, as per [0027], lines 15-16, of Shah) according to Shah, with operations of the artificial intelligence chip of the combination of Dao, Aingaran, and Lee implemented using a non-transitory computer readable medium according to Shah), which is an exemplary rationale that may support a conclusion of obviousness, as per MPEP 2143.
However, the combination thus far does not entail that the complex computations comprise an exponentiation computation, a square root extraction computation, and a trigonometric function computation.
On the other hand, Dockser discloses that complex computations comprise an exponentiation computation ([0001], line 4, exponential functions; [0019], lines 10-12, a floating-point exponential operator configured to execute floating-point exponential instructions), a square root extraction computation ([0019], lines 8-10, a floating-point square-root extractor configured to perform floating-point square-root extract instructions), and a trigonometric function computation ([0001], line 3, trigonometric functions; [0019], lines 14-16, a floating-point trigonometric operator configured to perform instructions for calculating trigonometric functions). (Note that, to any extent to which Dao does not disclose the computation being “complex”, Dockser teaches “complex” computations as cited above.)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Dockser with the combination of Dao, Aingaran, Lee, and Shah, in order to increase the computational capabilities of the AI chip of the aforementioned combination. Alternatively, this modification merely entails a combination of prior art elements (the combination of Dao, Aingaran, Lee, and Shah as described above, and Dockser’s explicit disclosure of the well-known computation types of exponentiation, square root extraction, and trigonometric functions) according to known methods (Dockser explicitly discloses the well-known concept of using hardware to perform exponentiation, square root extraction, and trigonometric functions) to yield predictable results (the combination of Dao, Aingaran, Lee, and Shah as described above, further supporting performing exponentiation, square root extraction, and trigonometric functions), which is an exemplary rationale that may support a conclusion of obviousness, as per MPEP 2143.

Consider claim 14, the overall combination entails an electronic device, comprising: a processor (Lee, [0022], line 3, central processing unit), a storage apparatus (Lee, [0022], lines 11-13, the CPU core 12 is interfaced with a program memory 16 and a data memory 18), and at least one artificial intelligence chip according to claim 12 (Lee, [0024], lines 1-2, machine-learning accelerator (MLA) block; [0021], lines 1-3, for example, application specific integrated circuits (ASICs) are typically used to realize algorithms implemented in hardware).

Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Dao, Aingaran, Lee, Shah, and Dockser as applied to claim 1 above, and further in view of Wu et al. (Wu) (US 20120233477 A1).
Consider claim 2, the combination thus far does not entail that, before decoding, by a target processor core among the at least one processor core, a to-be-executed instruction, the computing method further comprises: selecting, in response to receiving the to-be-executed instruction, a processor core for executing the to-be-executed instruction from the at least one processor core for use as the target processor core.
On the other hand, Wu discloses that, before decoding, by a target processor core among at least one processor core, a to-be-executed instruction, a method further comprises: selecting, in response to receiving the to-be-executed instruction, a processor core for executing the to-be-executed instruction from the at least one processor core for use as the target processor core ([0029], lines 1-6, code is distributed between core 101 and 102 based on maximizing performance and power. For example, code regions are identified to perform better on one of the two cores 101, 102. As a result, when one of such code regions is encountered/detected, that code section is distributed to the appropriate core).
Wu’s teaching optimizes power and performance efficiency (Wu, [0029], lines 1-3; [0001], lines 1-3).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Wu with the combination of Dao, Aingaran, Lee, Shah, and Dockser in order to optimize power and performance efficiency.

Claims 3-4 are rejected under 35 U.S.C. 103 as being unpatentable over Dao, Aingaran, Lee, Shah, and Dockser as applied to claim 1 above, and further in view of Koehler et al. (Koehler) (US 20090113212 A1).
Consider claim 3, the combination thus far entails that the complex computational instruction queue comprises a complex computational instruction queue corresponding to the each of the at least one processor core (Dao, col. 5, lines 27-35, the first stage of floating-point pipeline 40 is performed simultaneously for floating-point instructions detected by instruction decoders 16.sub.0, 16.sub.1, in their respective integer predecode stages 34.sub.0, 34.sub.1. As such, instruction queue stage 41.sub.0 receives a series of instruction codes for floating-point instructions detected in predecode stage 34.sub.0, while instruction queue stage 41.sub.1 receives a series of instruction codes for floating-point instructions detected in predecode stage 34.sub.1); and the adding the generated complex computational instruction to a complex computational instruction queue comprises: adding, by the target processor core, the generated complex computational instruction to a complex computational instruction queue corresponding to the target processor core (Dao, col. 5, lines 27-35, the first stage of floating-point pipeline 40 is performed simultaneously for floating-point instructions detected by instruction decoders 16.sub.0, 16.sub.1, in their respective integer predecode stages 34.sub.0, 34.sub.1. As such, instruction queue stage 41.sub.0 receives a series of instruction codes for floating-point instructions detected in predecode stage 34.sub.0, while instruction queue stage 41.sub.1 receives a series of instruction codes for floating-point instructions detected in predecode stage 34.sub.1); and the selecting, by the computational accelerator, a complex computational instruction from the complex computational instruction queue comprises: selecting, by the computational accelerator, the complex computational instruction from a complex computational instruction queue corresponding to the each of the at least one processor core (Dao, col. 5, lines 50-53, dispatch stage 44 will also effect arbitration between instruction sequences from CPUs 10.sub.0, 10.sub.1 in the event of simultaneous floating-point requests).
However, the combination thus far does not entail that the complex computational result queue comprises a complex computational result queue corresponding to each of the at least one processor core, and that the writing, by the computational accelerator, the obtained computational result as a complex computational result into a complex computational result queue comprises: writing, by the computational accelerator, the obtained computational result as the complex computational result into a complex computational result queue corresponding to a processor core corresponding to the complex computational instruction queue of the selected complex computational instruction.
On the other hand, Koehler discloses that a complex computational result queue comprises a complex computational result queue corresponding to each of at least one processor core, and that a writing, by a computational accelerator, an obtained computational result as a complex computational result into a complex computational result queue comprises: writing, by the computational accelerator, the obtained computational result as the complex computational result into a complex computational result queue corresponding to a processor core corresponding to the complex computational instruction queue of the selected complex computational instruction ([0020], lines 1-4, there may be a dedicated set of input/output buffers available in the crypto unit for every processor core that participates in sharing).
Koehler’s teaching enables independent and concurrent operation of engines within an accelerator (Koehler, [0034], lines 8-11).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Koehler with the combination of Dao, Aingaran, Lee, Shah, and Dockser in order to enable independent and concurrent operation of engines within an accelerator. Alternatively, this modification merely entails simple substitution of one known element (a result queue) for another (per-core result queues) to obtain predictable results (the combination of Dao, Aingaran, Lee, Shah, and Dockser, entailing per-core result queues rather than a result queue), which is an exemplary rationale that may support a conclusion of obviousness, as per MPEP 2143.
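For illustration only, the per-core arrangement discussed for claim 3 (one instruction queue and one result queue per processor core, with each result written to the result queue of the core whose instruction queue supplied the instruction) can be sketched as follows. The class name `SharedAccelerator`, the `(function, operand)` instruction encoding, and the round-robin selection among cores are assumptions made for this sketch; they are not taken from Dao, Koehler, or the application.

```python
from collections import deque


class SharedAccelerator:
    """Hypothetical accelerator shared by several cores (illustrative only)."""

    def __init__(self, num_cores):
        # One instruction queue and one result queue per processor core.
        self.instr_queues = [deque() for _ in range(num_cores)]
        self.result_queues = [deque() for _ in range(num_cores)]
        self._next = 0  # round-robin pointer (assumed arbitration policy)

    def step(self):
        """Execute one queued instruction; return the core served, or None."""
        n = len(self.instr_queues)
        for offset in range(n):
            core = (self._next + offset) % n
            if self.instr_queues[core]:
                fn, operand = self.instr_queues[core].popleft()
                # The result goes to the result queue of the same core
                # whose instruction queue supplied the instruction.
                self.result_queues[core].append(fn(operand))
                self._next = (core + 1) % n
                return core
        return None
```

Because each result lands in the queue indexed by the originating core, no core identifier needs to travel with the instruction in this arrangement.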

Consider claim 4, the overall combination entails that, after writing, by the computational accelerator, the obtained computational result as the complex computational result into a complex computational result queue corresponding to a processor core corresponding to the complex computational instruction queue of the selected complex computational instruction, the computing method further comprises: selecting, by the target processor core, the complex computational result from the complex computational result queue corresponding to the target processor core, and writing the selected complex computational result into at least one of: a result register in the target processor core, or a memory of the artificial intelligence chip (Dao, col. 9, lines 49-52, output buffer 78, through which communication of the result of the floating-point operation to memory, via internal bus IBUS, may be effected; Koehler, [0035], lines 4-6, retrieving target data from its associated, hard-wire connected output buffer 24A, 24B; [0052], lines 16-19, if there are target data units available in the respective output buffer, these data units may be stored to main memory).

Claims 5-6 are rejected under 35 U.S.C. 103 as being unpatentable over Dao, Aingaran, Lee, Shah, and Dockser as applied to claim 1 above, and further in view of Kahle et al. (Kahle) (US 6725354 B1).
Consider claim 5, the combination thus far entails that the generating, by the target processor core, a complex computational instruction using the computational identifier and the at least one operand obtained by the decoding in response to determining that the computational identifier obtained by decoding is a preset complex computational identifier comprises: generating, by the target processor core, the complex computational instruction using the computational identifier, the at least one operand obtained by the decoding, in response to determining that the computational identifier obtained by the decoding is the preset complex computational identifier (Dao, col. 5, lines 27-35, the first stage of floating-point pipeline 40 is performed simultaneously for floating-point instructions detected by instruction decoders 16.sub.0, 16.sub.1, in their respective integer predecode stages 34.sub.0, 34.sub.1. As such, instruction queue stage 41.sub.0 receives a series of instruction codes for floating-point instructions detected in predecode stage 34.sub.0, while instruction queue stage 41.sub.1 receives a series of instruction codes for floating-point instructions detected in predecode stage 34.sub.1; col. 6, lines 35-40, each stage of instruction buffers 50 is preferably able to store an entire instruction code for a floating-point instruction which, in this case of x86-architecture CPUs 10, may include up to eight bytes (one for the tag, and up to seven bytes for the instruction code with identifiers for source and destination operand locations)); and the writing, by the computational accelerator, the obtained computational result as a complex computational result into a complex computational result queue comprises: writing, by the computational accelerator, the obtained computational result as the complex computational result into the complex computational result queue (Dao, col. 6, lines 46-48, results from completion unit 75 are output to result register 76 which, in turn, drives writeback bus WB at its output; col. 9, lines 49-52, as shown in FIG. 3, writeback bus WB is coupled to output buffer 78, through which communication of the result of the floating-point operation to memory, via internal bus IBUS, may be effected).
However, the combination thus far does not entail that the aforementioned generation entails an identifier of the target processor core, and that the aforementioned writing entails a processor core identifier in the selected complex computational instruction.
On the other hand, Kahle discloses that generation entails an identifier of a target processor core (col. 6, lines 7-10, tag 264 of each entry 261 identifies either first processor core 201a or second processor core 201b as the source of entry's corresponding instruction 266), and that writing entails a processor core identifier in a selected complex computational instruction (col. 6, lines 46-55, when a floating point instruction 266 completes execution in one of the pipelines 230, the depicted embodiment of shared floating point unit 231 routes instruction 266 to first processor core 201a and second processor core 201b. Each processor core 201 then examines the floating point instruction's tag 264 to determine the instruction's "owner." The processor core 201 that owns the floating point instruction will store the instructions result in an appropriate rename register while the processor core 201 that does not own the instruction will discard or ignore the instruction's results).
Kahle’s teaching enables the simultaneous processing of distinct execution streams or "threads" in a single shared resource (Kahle, col. 7, lines 2-6).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Kahle with the combination of Dao, Aingaran, Lee, Shah, and Dockser in order to enable the simultaneous processing of distinct execution streams or "threads" in a single shared resource. Alternatively, this modification merely entails simple substitution of one known element (the manner by which the combination of Dao, Aingaran, Lee, Shah, and Dockser directs a result to the corresponding processor core) for another (Kahle’s method of directing a result to the corresponding processor core) to obtain predictable results (the combination of Dao, Aingaran, Lee, Shah, and Dockser entailing Kahle’s identifier of a target processor core to direct a result to the corresponding processor core), which is an exemplary rationale that may support a conclusion of obviousness, as per MPEP 2143.

Consider claim 6, the overall combination entails after writing, by the computational accelerator, the obtained computational result and a processor core identifier in the selected complex computational instruction as the complex computational result into the complex computational result queue, the computing method further comprises: selecting, by the target processor core, a computational result in the complex computational result with the processor core identifier being the identifier of the target processor core from the complex computational result queue, and writing the computational result into at least one of: a result register in the target processor core, or a memory of the artificial intelligence chip (Kahle, col. 6, lines 7-10, tag 264 of each entry 261 identifies either first processor core 201a or second processor core 201b as the source of entry's corresponding instruction 266; col. 6, lines 46-55, when a floating point instruction 266 completes execution in one of the pipelines 230, the depicted embodiment of shared floating point unit 231 routes instruction 266 to first processor core 201a and second processor core 201b. Each processor core 201 then examines the floating point instruction's tag 264 to determine the instruction's "owner." The processor core 201 that owns the floating point instruction will store the instructions result in an appropriate rename register while the processor core 201 that does not own the instruction will discard or ignore the instruction's results; Dao, col. 9, lines 49-52, output buffer 78, through which communication of the result of the floating-point operation to memory, via internal bus IBUS, may be effected).
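
For illustration only, the selection recited in claim 6 (a processor core taking from a shared result queue only those results tagged with its own identifier) may be sketched as follows. This is a minimal sketch under stated assumptions and is not drawn from Kahle, Dao, or the application: the names TaggedResult, core_id, and select_results are hypothetical, and the queue is modeled as a simple software deque rather than hardware.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class TaggedResult:
    core_id: int   # hypothetical tag identifying the core that issued the instruction
    value: float   # computational result produced by the shared unit


def select_results(result_queue, core_id):
    """Remove and return the values tagged with core_id; leave other entries queued."""
    owned = []
    remaining = deque()
    while result_queue:
        entry = result_queue.popleft()
        if entry.core_id == core_id:
            owned.append(entry.value)   # this core "owns" the result
        else:
            remaining.append(entry)     # belongs to another core; keep it queued
    result_queue.extend(remaining)      # restore unowned entries in original order
    return owned


# Shared queue holding results for two hypothetical cores (0 and 1).
queue = deque([TaggedResult(0, 2.0), TaggedResult(1, 9.0), TaggedResult(0, 0.5)])
core0_results = select_results(queue, 0)
```

In this sketch, core 0 receives only its own two results, and the single entry tagged for core 1 remains in the queue for later selection.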

Response to Arguments
Applicant on page 9 argues: 'Applicant has amended the limitation "the complex computation refers to an exponentiation computation, a square root extraction computation, or a trigonometric function computation" to read "the complex computation is an exponentiation computation, a square root extraction computation, or a trigonometric function computation". Withdrawal of the claim rejection under 35 USC § 112 is respectfully requested.'
In view of the cancellation of claim 10, the previously presented rejection under 35 USC § 112 is withdrawn. Examiner generally notes that a related amendment to claim 1 made by the response dated June 2, 2022, differs from that which is recited above.

Applicant on page 10 argues: "A person skilled in the art understands that the complex computation such as exponentiation computation, square root extraction computation, or trigonometric function computation is computation that cannot be constituted by only a simple combination of additive operation and multiplication (see specification, paragraph [0045])".
Applicant may be arguing that paragraph [0045]'s disclosure that "The complex computation refers to computation that cannot be constituted by simple combination of additive operation and multiplication, such as exponentiation, square root extraction, and trigonometric function computation" serves as an explicit definition, such that the metes and bounds of the limitation "complex computation" require computation that cannot be constituted by simple combination of additive operation and multiplication, such as exponentiation, square root extraction, and trigonometric function computation. However, paragraph [3] discloses "Complex computation may be implemented by a basic computational instruction, but will reduce the execution efficiency of the complex computation (e.g., floating point square root extraction, floating point exponentiation, or trigonometric function computation)." In addition, paragraphs [45], [74], and [94] further disclose "Here, the complex computation refers to computation with huge computational workload with respect to simple computation, while the simple computation may refer to computation with small computational workload." Therefore, it may be unclear a) whether Applicant intends "complex computation" to have an explicit definition in the specification, and b) which portions of the specification are intended to provide an explicit definition and which portions of the specification are not intended to provide an explicit definition. Examiner notes that the BRI of the relevant limitation may not necessarily allow for certain portions of the specification to contribute to an explicit definition but not other portions. For example, the language "refers to" in one portion of the specification being intended to signify an explicit definition may mean that the same language in another portion of the specification would also signify an explicit definition.

Applicant across pages 11-12 argues: ‘Evidently, the three paths/units for executing the calculation in Dao are the multiplication circuitry 70, adder circuitry 72, single-cycle execution unit 68. None of them are capable of executing a complex computation such as exponentiation computation, square root extraction computation, or trigonometric function computation independently, and thus the multiplication circuitry 70, adder circuitry 72, and single-cycle execution unit 68 of Dao are not capable of executing the claimed complex computations in parallel. Therefore, the multiplication circuitry 70, adder circuitry 72, single-cycle execution unit 68 are physically different from the claimed "a plurality of computing units" each of which is configured for executing a complex computation such as an exponentiation computation, a square root extraction computation, or a trigonometric function computation, and the plurality of computing units are configured for executing complex computations in parallel. Accordingly, Dao at least fails to disclose the features of "the computational accelerator comprises a plurality of computing units, each of the plurality of computing units is configured for executing a complex computation and the plurality of computing units are configured for executing complex computations in parallel, wherein the complex computations comprise an exponentiation computation, a square root extraction computation, and a trigonometric function computation". 
Since none of the multiplication circuitry 70, adder circuitry 72, and single-cycle execution unit 68 of Dao is capable of executing a complex computation such as exponentiation computation, square root extraction computation, or trigonometric function computation independently, therefore even if a complex computation (such as such as exponentiation computation, square root extraction computation, or trigonometric function computation) were disclosed by the rest prior references, a person skilled in the art would not be motivated to apply the complex computation into the multiplication circuitry 70, adder circuitry 72, and single-cycle execution unit 68 of Dao to arrive the claimed invention. Therefore, Applicant submits that the other references Aingaran, Lee, Shah, Wu, Lee, and Kahle, do not remedy the deficiencies of Dao.’
However, Examiner submits that while Dao considered alone may only disclose multiplication circuitry, adder circuitry, and a single-cycle execution unit, Dockser explicitly discloses the specifically recited complex computations, and it would have been obvious to implement this functionality into the invention of Dao to increase the computational capabilities of the AI chip of the aforementioned combination. Examiner further submits that one of ordinary skill in the art before the effective filing date of the claimed invention would readily recognize that additional computational capabilities (e.g., supporting exponentiation, square root extraction, or trigonometric functions) can be implemented by adding additional circuitry alongside already-present circuitry configured to perform already-present functions (akin to how, in Dao, multiplication circuitry 70 is alongside adder circuitry 72) or by modifying the already-present circuitry so that the already-present circuitry can perform both already-present functions and additional functions (akin to an ALU being able to perform both arithmetic and logic functions). Therefore, Examiner submits that the overall combination renders obvious the newly amended limitations.
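
As a purely illustrative software analogy of the first alternative above (adding units alongside already-present units), the following minimal sketch dispatches operations by identifier to separate functions. It is not taken from Dao, Dockser, or the application: the identifiers FMUL, FADD, FSQRT, FEXP, and FSIN, the UNITS table, and the execute function are all hypothetical names chosen for the sketch, and software functions stand in for hardware circuitry.

```python
import math

# Hypothetical "already-present" units, keyed by an assumed operation identifier.
UNITS = {
    "FMUL": lambda a, b: a * b,   # stands in for multiplication circuitry
    "FADD": lambda a, b: a + b,   # stands in for adder circuitry
}

# Hypothetical additional units placed alongside the existing ones.
# The existing entries are left unmodified.
UNITS.update({
    "FSQRT": lambda a: math.sqrt(a),       # square root extraction
    "FEXP": lambda a, b: math.pow(a, b),   # exponentiation (a raised to b)
    "FSIN": lambda a: math.sin(a),         # trigonometric function
})


def execute(identifier, *operands):
    """Route the operands to the unit registered under the given identifier."""
    return UNITS[identifier](*operands)


root = execute("FSQRT", 9.0)
power = execute("FEXP", 2.0, 3.0)
total = execute("FADD", 1.0, 2.0)
```

The sketch only shows that registering additional operations does not disturb the operations already present; it makes no claim about how any cited reference implements its circuitry.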

Applicant on page 12 argues: “Independent claims 12 and 13 recite similar features as in amended claim 1, and are also patentable for at least the same reasons. Similarly, claims 2-9 and 11, depend from independent claim 1 and claim 14 depends from independent claim 13, include all of the features recited therein. Accordingly, claims 2-9, 11, and 14 are patentably distinguishable over Dao, Aingaran, Lee, Shah, Wu, Lee, and Kahle for at least those reasons stated above with respect to amended claim 1 and to the independent claims from which they ultimately depend.”
Examiner’s response to arguments above is likewise applicable to the arguments directed to the aforementioned further claims. 

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KEITH E VICARY whose telephone number is (571)270-1314. The examiner can normally be reached Monday to Friday, 9:00 AM to 5:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jyoti Mehta, can be reached at (571)270-3995. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/KEITH E VICARY/            Primary Examiner, Art Unit 2182