DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.


Response to Amendment
This Office Action is in response to applicant’s communication filed 11 November 2022, in response to the Office Action mailed 11 July 2022.  The applicant’s remarks and any amendments to the claims or specification have been considered, with the results that follow.


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1, 4, 6, 8-16, 19, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Mody (US 2018/0197067 – provisional app. No. 62/445493 cited), in view of Ambrose (US 2017/0344882), further in view of Kuszmaul (US 6,609,189), and further in view of Hsu (US 2001/0049782).

As per claim 1, Mody teaches an accelerator for processing of a convolutional neural network (CNN) comprising: a compute core comprising: a compute unit [a multi-mode engine (MME – the compute unit) including a vector multiply unit (VMU) including a plurality of multiply/accumulate (MAC) nodes (para. 0004, fig. 3, etc.)], each compute unit comprising: a first memory cache configured to store a map trace arranged as a plurality of cache lines each having at least one vector [the VMU is connected to a data feeder storing vectors of input data from an L2 cache which are provided to the VMU to process the set of inputs for a layer (paras. 0004, 0022-27; figs. 3 and 16; etc.)], the map trace corresponding to a contiguous set of input data in an input map for one layer of the CNN [the VMU is connected to a data feeder storing vectors of input data from an L2 cache which are provided to the VMU to process the set of inputs for a layer (paras. 0004, 0022-27; figs. 3 and 16; etc.)]; a second memory cache configured to store at least one vector in a kernel trace [the VMU is connected to a weight feeder storing vectors of weight data from an L2 cache to L1 weight caches, to output weights to the VMU to process the set of inputs for a layer (paras. 0004, 0022-27; figs. 3 and 15; etc.)], the kernel trace corresponding to a contiguous set of weights in a kernel of a convolutional layer or a fully-connected layer of the CNN [the VMU is connected to a weight feeder storing vectors of weight data from an L2 cache to L1 weight caches, to output weights to the VMU to process the set of inputs for a layer (paras. 0004, 0022-27; figs. 3 and 15; etc.)]; a vector multiply-accumulate unit (vMACs) connected to the first memory cache and the second memory cache [a multi-mode engine (MME – the compute core) including a vector multiply unit (VMU – the vMAC) including a plurality of multiply/accumulate (MAC) nodes (para. 0004, fig. 3, etc.)], each vMAC comprising: a plurality of multiply-accumulate units (MACs) [a multi-mode engine (MME – the compute core) including a vector multiply unit (VMU – the vMAC) including a plurality of multiply/accumulate (MAC) nodes (para. 0004, fig. 3, etc.)], each MAC including a multiplier unit configured to, for each of the plurality of cycles of the plurality of vMACs, multiply a first word that forms a portion of the at least one vector of a respective cache line of the plurality of cache lines in the map trace in the first memory cache by a second word that forms a portion of the at least one vector in the kernel trace in the second memory cache to produce an intermediate product [the MAC nodes each include multiplier and accumulator units to operate on bytes of weight and input data and output the result (paras. 0024-27, fig. 6, etc.)], and an adder unit that, for each of the plurality of cycles of the plurality of vMACs, adds the intermediate product to a third word to generate a sum of the intermediate product and the third word as an output [the MAC nodes each include multiplier and accumulator units to operate on bytes of weight and input data and output the result (paras. 0024-27, fig. 6, etc.)]; and a control core operatively connected to the plurality of compute units in the compute core [a controller is connected for data distribution over busses in the system (Mody: fig. 3, Ambrose: fig. 5, etc.)].
While Mody teaches an array of MAC units (see above) it does not explicitly teach a plurality of compute units with a plurality of vMAC units.  Furthermore, while Mody teaches the control core (see above) and that the MAC nodes operate on selected bytes of weight and input data and output the result (see, e.g., Mody: paras. 0024-27, fig. 6, etc.) it does not explicitly teach the control core being configured to select an operating mode of the compute core depending on a number of input maps to the convolutional layer of the CNN and control operation of the plurality of compute units according to the selected operating mode, the operating mode being either one of a cooperative mode and an independent mode, wherein, in the cooperative mode, each MAC in the plurality of MACs in at least one of the plurality of vMACs in each compute unit receives the first word from a different portion of the map trace, and wherein, in the independent mode, each MAC in the plurality of MACs in at least one of the plurality of vMACs in each compute unit receives the first word from a single portion of the map trace.  Mody has also not been relied upon for teaching a trace decoder configured to (i) receive a trace instruction that identifies an operating mode of the computer core, a start address of the map trace, and a length of the map trace and (ii) based on the trace instruction, for each of the plurality of cycles of the plurality of vMACs, increment the start address, fetch the respective cache line from the plurality of cache lines in the first memory cache, and forward the respective cache line to the plurality of vMACs, until the length of the map trace is reached, the control core being configured to provide the trace instruction to the trace decoder.
Ambrose teaches a plurality of compute units with a plurality of vMAC units [a system on chip (SoC) with multiple accelerators for executing a Convolutional Neural Network, which can include multiple groups of processing units (paras. 0021, 0055-60, fig. 5, etc.)], and the control core being configured to select an operating mode of the compute core depending on a number of input maps to the convolutional layer of the CNN and control operation of the plurality of compute units according to the selected operating mode, the operating mode being either one of a cooperative mode and an independent mode [based upon the number of inputs and the available number of processing units, a scheduling scheme may be selected to determine scheduling of the input and weight data for processing (paras. 0179-197, etc.), using the controller of Mody, above], wherein, in the cooperative mode, each MAC in the plurality of MACs in at least one of the plurality of vMACs in each compute unit receives the first word from a different portion of the map trace, and wherein, in the independent mode, each MAC in the plurality of MACs in at least one of the plurality of vMACs in each compute unit receives the first word from a single portion of the map trace [based upon the number of inputs and the available number of processing units, a scheduling scheme may be selected to determine scheduling of the input and weight data for processing (paras. 0179-197, etc.) where the two possible options for scheduling are the same or different portions of the vector].
Mody and Ambrose are analogous art, as they are within the same field of endeavor, namely accelerators for CNNs.
It would have been obvious to one of ordinary skill in the art to include multiple processing units in the system for processing the CNN layers, as well as scheduling of the input and weight data processing, as taught by Ambrose, for multiple copies of the compute unit MME taught by Mody.
Ambrose provides motivation as [having multiple processing units allows better exploitation of available parallelism in CNN algorithms (para. 0005, etc.) and where dynamic scheduling allows the system to pick an optimal schedule for different CNN layers using different sizes/parameters (para. 0013, etc.)].  Furthermore, it has been held that the mere duplication of the essential working parts of a device involves only routine skill in the art.  St. Regis Paper Co. v. Bemis Co., 193 USPQ 8.
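For illustration only, the distinction between the two recited operating modes can be sketched in Python as follows. The function name `vmac_cycle`, the `lane` parameter, and the list-based vectors are assumptions made for this sketch and do not appear in the claims, Mody, or Ambrose. The sketch shows only how each MAC's first word is taken from a different portion (cooperative) or from a single portion (independent) of the map-trace vector before the multiply and add recited in claim 1.

```python
from typing import List


def vmac_cycle(map_vector: List[int], kernel_vector: List[int],
               accumulators: List[int], mode: str, lane: int = 0) -> List[int]:
    """One cycle of a hypothetical vMAC holding len(accumulators) MACs."""
    if mode == "cooperative":
        # Each MAC receives a different word of the map-trace vector.
        first_words = list(map_vector)
    elif mode == "independent":
        # Every MAC receives the same single word of the map-trace vector.
        first_words = [map_vector[lane]] * len(accumulators)
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Multiplier: first word x second word; adder: product + third word.
    return [acc + a * w
            for a, w, acc in zip(first_words, kernel_vector, accumulators)]
```

Under these assumptions, `vmac_cycle([1, 2, 3, 4], [10, 10, 10, 10], [0, 0, 0, 0], "cooperative")` gives one product per distinct map word, while the same call in `"independent"` mode with `lane=2` multiplies every kernel word by the single map word 3.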
Kuszmaul teaches a trace decoder configured to (i) receive a trace instruction that identifies an operating mode of the computer core, a start address of the map trace, [execution stations start fetching a new trace from the trace cache, based on the starting PC (address) of the trace from a prefix attached to the instruction starting the trace (col. 55, lines 17-34; etc.) which instruction(s) can be decoded by the decode logic (cols. 23-24, section 4.2.2 Decode Logic); therefore the instruction starting the trace identifies trace execution (mode) and the starting PC (start address) of the trace] and (ii) based on the trace instruction, for each of the plurality of cycles of the plurality of vMACs, increment the start address, fetch the respective cache line from the plurality of cache lines in the first memory cache, and forward the respective cache line to the plurality of vMACs [execution stations start fetching a new trace from the trace cache, based on the starting PC (address) of the trace from a prefix attached to the instruction starting the trace (col. 55, lines 17-34; etc.) where each instruction fetched will lead to incrementing/computing the next PC to fetch the next instruction, which could be fetched from the lines of the trace cache (col. 16, lines 26-60; col. 55, lines 17-34; figs. 21-22; etc.); for vMAC execution in Mody/Ambrose, above], the control core being configured to provide the trace instruction to the trace decoder [execution stations start fetching a new trace from the trace cache, based on the starting PC (address) of the trace from a prefix attached to the instruction starting the trace (col. 55, lines 17-34; etc.) which instruction(s) can be decoded by the decode logic (cols. 23-24, section 4.2.2 Decode Logic)].
Mody/Ambrose and Kuszmaul are analogous art, as they are within the same field of endeavor, namely instruction processing acceleration/performance.
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to use the trace cache including identifying and decoding trace instructions, as taught by Kuszmaul, for the instruction processing for the CNN in the system taught by Mody/Ambrose.
Kuszmaul provides motivation as [trace caches allow greater parallelism and performance by allowing parallel execution across multiple branches (col. 54, line 64 to col. 55, line 16; etc.)].
Hsu teaches (i) receiving a trace instruction that identifies an operating mode of the computer core, a length of the map trace [a trace is identified including identifying the length of the trace, which may be set as a multiple of the cache line size (para. 0028, etc.)] and (ii) based on the trace instruction, for each of the plurality of cycles of the plurality of vMACs, increment the start address, fetch the respective cache line from the plurality of cache lines in the first memory cache, and forward the respective cache line to the plurality of vMACs, until the length of the map trace is reached [each instruction increments the program counter (para. 0023, etc.) where execution of the trace continues until the trace is stopped based on reaching the trace length (para. 0028, etc.); for vMAC execution in Mody/Ambrose, above].
Mody/Ambrose/Kuszmaul and Hsu are analogous art, as they are within the same field of endeavor, namely instruction processing acceleration/performance.
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to use the trace length indicator taught by Hsu for ending the trace in the system taught by Mody/Ambrose/Kuszmaul.
Hsu provides motivation as [by including a trace length indicator that matches a multiple of the cache line size of the trace cache, cache operations are made easier as entire cache lines are used at a time (para. 0028, etc.) and allows for more freedom of the trace selection (para. 0009, etc.)].
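The trace decoder limitation as mapped above (a start address from Kuszmaul, a trace length from Hsu, one cache line forwarded per cycle) can be sketched as follows. This is a minimal illustration under stated assumptions: `TraceInstruction`, `run_trace`, the list-of-lines cache model, and the choice to fetch before incrementing are all hypothetical and are not taken from the claims or from any cited reference.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TraceInstruction:
    mode: str           # operating mode identified by the instruction
    start_address: int  # index of the first cache line of the map trace
    length: int         # number of cache lines in the map trace


def run_trace(instr: TraceInstruction, cache: List[List[int]],
              forward: Callable[[List[int], str], None]) -> int:
    """Fetch and forward one cache line per cycle until the length is reached."""
    address = instr.start_address
    cycles = 0
    while cycles < instr.length:
        line = cache[address]       # fetch the respective cache line
        forward(line, instr.mode)   # forward it to the vMACs
        address += 1                # increment the address for the next cycle
        cycles += 1
    return cycles
```

Here `forward` stands in for the connection to the vMACs; a caller could pass a function that appends each line to a list in order to observe the sequence of forwarded cache lines.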

As per claim 4, Mody/Ambrose/Kuszmaul/Hsu teaches a memory interface [the MME is connected via an L2 cache to main memory via a master port (Mody: para. 0021, fig. 3, etc.) using an external memory interface to connect to external memory (Ambrose: fig. 5, etc.)]; and a compute cluster connected to the memory interface, the compute cluster comprising: the compute core [the multi-accelerator SoC is connected to the external memory interface (Ambrose: fig. 5, etc.); where the MME is connected via an L2 cache to main memory via a master port (Mody: para. 0021, fig. 3, etc.)]; a data distribution network [the controller is connected for data distribution over busses in the system (Ambrose: fig. 5, Mody: fig. 3, etc.)]; and the control core, the control core being operatively connected to the data distribution network [the controller is connected for data distribution over busses in the system (Mody: fig. 3, Ambrose: fig. 5, etc.)].

As per claim 6, Mody/Ambrose/Kuszmaul/Hsu teaches wherein, in response to the number of input maps to the convolution layer of the CNN being a multiple of 16, the control core selects the cooperative mode as the operating mode of the compute core [the MAC nodes each include multiplier and accumulator units to operate on selected bytes of weight and input data and output the result (Mody: paras. 0024-27, fig. 6, etc.) where, based upon the number of inputs and the available number of processing units, a scheduling scheme may be selected to determine scheduling of the input and weight data for processing (Ambrose: paras. 0179-197, etc.)].
Mody teaches that the vectors are 16 bytes and the units handle each byte (see, e.g., Mody: para. 0022) but does not explicitly teach the determination of a multiple of 16.  However, it has been held that where the general conditions of the claim are disclosed in the prior art, discovering the optimum or working value/ranges involves only routine skill in the art.  In re Aller, 105 USPQ 233 and In re Boesch, 617 F.2d 272, 205 USPQ 215 (CCPA 1980).

As per claim 8, Mody/Ambrose/Kuszmaul/Hsu teaches wherein, in response to the number of input maps to the convolutional layer of the CNN not being a multiple of 16, the control core selects the independent mode as the operating mode of the compute core [the MAC nodes each include multiplier and accumulator units to operate on selected bytes of weight and input data and output the result (Mody: paras. 0024-27, fig. 6, etc.) where, based upon the number of inputs and the available number of processing units, a scheduling scheme may be selected to determine scheduling of the input and weight data for processing (Ambrose: paras. 0179-197, etc.)].
Mody teaches that the vectors are 16 bytes and the units handle each byte (see, e.g., Mody: para. 0022) but does not explicitly teach the determination of a multiple of 16.  However, it has been held that where the general conditions of the claim are disclosed in the prior art, discovering the optimum or working value/ranges involves only routine skill in the art.  In re Aller, 105 USPQ 233 and In re Boesch, 617 F.2d 272, 205 USPQ 215 (CCPA 1980).
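The selection rule recited in claims 6 and 8 reduces to a divisibility test on the number of input maps. A minimal sketch follows, assuming a hypothetical function `select_operating_mode` and a `lanes` parameter defaulting to 16; neither name comes from the claims or the cited art.

```python
def select_operating_mode(num_input_maps: int, lanes: int = 16) -> str:
    """Return the operating mode for a convolutional layer.

    Cooperative when the number of input maps is a multiple of `lanes`
    (16 in claims 6 and 8); independent otherwise.
    """
    if num_input_maps <= 0:
        raise ValueError("num_input_maps must be positive")
    return "cooperative" if num_input_maps % lanes == 0 else "independent"
```

For example, a layer with 32 input maps would select the cooperative mode, while a first layer with 3 input maps (such as RGB channels) would select the independent mode.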

As per claim 9, Mody/Ambrose/Kuszmaul/Hsu teaches wherein the first memory cache in each compute unit is implemented as a first scratchpad memory device in the compute core and the second memory cache in each compute unit is implemented as a second scratchpad memory device [the VMU is connected to an L2 cache storing weight and input data, which is passed to weight and data feeders which temporarily store the weight/input data and then passed to the VMU (Mody: paras. 0004, 0022-27; figs. 3 and 15; etc.)].

As per claim 10, Mody/Ambrose/Kuszmaul/Hsu teaches wherein each compute unit in the plurality of compute units comprises four vMACs and each vMAC further comprises sixteen MACs [the vectors are 16 bytes and the units handle each byte (see, e.g., Mody: para. 0022)].  
While Mody/Ambrose does not explicitly teach the number of necessary vMACs and MACs, it has been held that where the general conditions of the claim are disclosed in the prior art, discovering the optimum or working value/ranges involves only routine skill in the art.  In re Aller, 105 USPQ 233 and In re Boesch, 617 F.2d 272, 205 USPQ 215 (CCPA 1980).

As per claim 11, Mody/Ambrose/Kuszmaul/Hsu teaches wherein a first vMAC in the plurality of vMACs reads 256-bits of data as the at least one vector in the map trace from the first memory cache and each MAC in the plurality of MACs in the first vMAC receives a 16-bit first word from the 256-bits of data [the vectors may be stored in a 4x256 register for 256 bit inputs (Mody: para. 0029, fig. 3, etc.)]. 
While Mody/Ambrose does not explicitly teach the size of necessary vectors/portions used, it has been held that where the general conditions of the claim are disclosed in the prior art, discovering the optimum or working value/ranges involves only routine skill in the art.  In re Aller, 105 USPQ 233 and In re Boesch, 617 F.2d 272, 205 USPQ 215 (CCPA 1980).

As per claim 12, Mody/Ambrose/Kuszmaul/Hsu teaches wherein the first memory cache stores 256 kB of map trace data and the second memory cache stores 32 kB of kernel trace data [the L1 memory is at least 4 KB (Mody: fig. 15, etc.)].
While Mody/Ambrose does not explicitly teach the size of all the memories used, it has been held that where the general conditions of the claim are disclosed in the prior art, discovering the optimum or working value/ranges involves only routine skill in the art.  In re Aller, 105 USPQ 233 and In re Boesch, 617 F.2d 272, 205 USPQ 215 (CCPA 1980).

As per claim 13, Mody/Ambrose/Kuszmaul/Hsu teaches a single memory cache that is the first memory cache in each compute unit in the plurality of compute units in the compute core, the single memory cache being shared by the plurality of compute units [the weights may come from a shared L2 cache to be processed by the processing units (Mody: fig. 3, etc.)].

As per claim 14, Mody/Ambrose/Kuszmaul/Hsu teaches wherein the compute core further comprises four compute units [the system includes at least 4 compute units (Ambrose: fig. 5, etc.)].

As per claim 15, Mody/Ambrose/Kuszmaul/Hsu teaches wherein the single memory cache stores 512 kB of the map trace data [the L1 memory is at least 4 KB (Mody: fig. 15, etc.)].
While Mody/Ambrose does not explicitly teach the size of all the memories used, it has been held that where the general conditions of the claim are disclosed in the prior art, discovering the optimum or working value/ranges involves only routine skill in the art.  In re Aller, 105 USPQ 233 and In re Boesch, 617 F.2d 272, 205 USPQ 215 (CCPA 1980).

As per claim 16, see the rejection of claim 1, above, wherein Mody/Ambrose/Kuszmaul/Hsu also teaches loading, with a control core in the accelerator, the map trace and kernel trace data [the VMU is connected to an L2 cache storing weight and input data, which is passed to weight and data feeders which temporarily store the weight/input data and then passed to the VMU (Mody: paras. 0004, 0022-27; figs. 3 and 15; etc.) and the controller is connected for data distribution over busses in the system (Mody: fig. 3, Ambrose: fig. 5, etc.)].

As per claim 19, Mody/Ambrose/Kuszmaul/Hsu teaches operating, with the control core, the plurality of MACs in the cooperative mode in response to the number of input maps to the convolutional layer of the CNN being a multiple of 16; and operating, with the control core, the plurality of MACs in the independent mode in response to the number of input maps to the convolutional layer of the CNN not being a multiple of 16 [the MAC nodes each include multiplier and accumulator units to operate on selected bytes of weight and input data and output the result (Mody: paras. 0024-27, fig. 6, etc.) where, based upon the number of inputs and the available number of processing units, a scheduling scheme may be selected to determine scheduling of the input and weight data for processing (Ambrose: paras. 0179-197, etc.)].
Mody teaches that the vectors are 16 bytes and the units handle each byte (see, e.g., Mody: para. 0022) but does not explicitly teach the determination of a multiple of 16.  However, it has been held that where the general conditions of the claim are disclosed in the prior art, discovering the optimum or working value/ranges involves only routine skill in the art.  In re Aller, 105 USPQ 233 and In re Boesch, 617 F.2d 272, 205 USPQ 215 (CCPA 1980).

As per claim 20, see the rejection of claim 3, above.


Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Mody, Ambrose, Kuszmaul, and Hsu as applied to claim 1 above, and further in view of Bittner (US 2018/0157465).

As per claim 2, Mody/Ambrose/Kuszmaul/Hsu teaches the accelerator of claim 1, as described above.
While Mody/Ambrose/Kuszmaul/Hsu teaches shifting the results (see, e.g., Mody: para. 0025, etc.) it does not teach each vMAC further comprising: a shift register connected to outputs of the plurality of MACs, the shift register being configured to generate a series of outputs, each output in the series of outputs corresponding to an output of one MAC in the plurality of MACs; and a gather adder configured to generate one of: a single sum of the series of outputs of the shift register and a bias value as an output; or a plurality of sums, each sum in the plurality of sums corresponding to a sum of an output of one MAC in the plurality of MACs received from the shift register and another bias value.
Bittner teaches each vMAC further comprising: a shift register connected to outputs of the plurality of MACs [one or more shifters may be used on the MAC outputs (paras. 0040-45, fig. 1, etc.)], the shift register being configured to generate a series of outputs, each output in the series of outputs corresponding to an output of one MAC in the plurality of MACs [one or more shifters may be used on the MAC outputs (paras. 0040-45, fig. 1, etc.)]; and a gather adder configured to generate one of: a single sum of the series of outputs of the shift register and a bias value as an output; or a plurality of sums, each sum in the plurality of sums corresponding to a sum of an output of one MAC in the plurality of MACs received from the shift register and another bias value [the shifted data may be added to bias values to produce an output (paras. 0040-45, fig. 1, etc.)].
Mody/Ambrose/Kuszmaul/Hsu and Bittner are analogous art, as they are within the same field of endeavor, namely neural network accelerators.
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to include the shifter(s) and bias values taught by Bittner, for the shifting output in the CNN in the system taught by Mody/Ambrose/Kuszmaul/Hsu.
Bittner provides motivation as [the MAC outputs must be appropriately shifted, while typically bias values are added in NN calculations (paras. 0002, 0042-44, etc.)].
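The two alternative gather adder outputs recited in claim 2 can be illustrated with the following sketch. The names `shift_out`, `gather_single`, and `gather_per_mac`, and the use of a `deque` to model the shift register, are assumptions for illustration and are not drawn from the claims or from Bittner.

```python
from collections import deque
from typing import Iterable, Iterator, List


def shift_out(mac_outputs: Iterable[int]) -> Iterator[int]:
    """Model a shift register that emits the MAC outputs one at a time."""
    register = deque(mac_outputs)
    while register:
        yield register.popleft()


def gather_single(mac_outputs: List[int], bias: int) -> int:
    """First alternative: a single sum of the series of outputs plus a bias."""
    return sum(shift_out(mac_outputs)) + bias


def gather_per_mac(mac_outputs: List[int], biases: List[int]) -> List[int]:
    """Second alternative: one sum per MAC output, each with its own bias."""
    return [out + b for out, b in zip(shift_out(mac_outputs), biases)]
```

With MAC outputs `[1, 2, 3, 4]`, the first alternative with a bias of 10 yields a single value, whereas the second alternative with biases `[10, 20, 30, 40]` yields four values, one per MAC.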


Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Mody, Ambrose, Kuszmaul, and Hsu as applied to claim 1 above, and further in view of Henry (US 2018/0157961).

As per claim 3, Mody/Ambrose/Kuszmaul/Hsu teaches the accelerator of claim 1, as described above.
While Mody/Ambrose/Kuszmaul/Hsu also teaches utilizing a max pooling layer (see, e.g., Mody: para. 0003) it does not explicitly teach each compute unit in the plurality of compute units further comprising: a vector maxpool unit (vMAX) connected to the first memory cache of the compute unit, the vMAX comprising a plurality of comparators that implement a max pooling layer of the CNN based on outputs from the plurality of vMACs that are stored in the first memory cache of the compute unit.
Henry teaches each compute unit in the plurality of compute units further comprising: a vector maxpool unit (vMAX) connected to the first memory cache of the compute unit, the vMAX comprising a plurality of comparators that implement a max pooling layer of the CNN based on outputs from the plurality of vMACs that are stored in the first memory cache of the compute unit [an ALU including a combination of comparators and muxes may be used to select a max value for the pooling layer (para. 0184, etc.)].
Mody/Ambrose/Kuszmaul/Hsu and Henry are analogous art, as they are within the same field of endeavor, namely convolutional neural network accelerators.
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to utilize the maxpool hardware for the pooling layer, taught by Henry, for performing the max pooling layer taught by Mody/Ambrose/Kuszmaul/Hsu.
Henry and Mody provide motivation as [the max operation is often necessary in pooling layers in neural network applications (Henry: para. 0184, etc.); which may be used to normalize results across various classes of interest (Mody: para. 0003, etc.)].
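As a minimal sketch of the comparator-based max pooling addressed in the rejection of claim 3, the following assumes that the vMAC outputs for one pooling window are available as equal-length vectors. The names `compare_max` and `vmax_pool` are hypothetical and do not come from the claims, Henry, or Mody.

```python
from typing import List


def compare_max(a: int, b: int) -> int:
    """A single comparator followed by a mux selecting the larger input."""
    return a if a >= b else b


def vmax_pool(window_vectors: List[List[int]]) -> List[int]:
    """Element-wise maximum over the vectors of one pooling window."""
    if not window_vectors:
        raise ValueError("pooling window must contain at least one vector")
    result = list(window_vectors[0])
    for vector in window_vectors[1:]:
        # One comparator per vector element, applied in parallel in hardware.
        result = [compare_max(r, v) for r, v in zip(result, vector)]
    return result
```

For a two-vector window `[[1, 5, 3], [4, 2, 6]]`, each element position is compared independently, so the pooled vector keeps the larger value at each position.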


Response to Arguments
Applicant's arguments filed 11 November 2022 have been fully considered but they are not persuasive.

Applicant argues that the cited art does not teach a control core operatively connected to the plurality of compute units in the compute core, the control core being configured to select an operating mode of the compute core depending on a number of input maps to the convolutional layer of the CNN and control operation of the plurality of compute units according to the selected operating mode, the operating mode being either one of a cooperative mode and an independent mode, wherein, in the cooperative mode, each MAC in the plurality of MACs in at least one of the plurality of vMACs in each compute unit receives the first word from a different portion of the at least one vector in the map trace, and wherein, in the independent mode, each MAC in the plurality of MACs in at least one of the plurality of vMACs in each compute unit receives the first word from a single portion of the at least one vector in the map trace.
However, Mody/Ambrose teaches a controller is connected for data distribution over busses in the system (Mody: fig. 3, Ambrose: fig. 5, etc.) where the MAC nodes each include multiplier and accumulator units to operate on bytes of weight and input data and output the result (Mody: paras. 0024-27, fig. 6, etc.) and, based upon the number of inputs and the available number of processing units, a scheduling scheme may be selected to determine scheduling of the input and weight data for processing (Ambrose: paras. 0179-197, etc.) where the only two possible options for scheduling are the same or different portions of the vector.

Applicant’s further remarks are drawn to the amendments made to the claims, which have been addressed, above, including by the newly cited references to Kuszmaul and Hsu.


Conclusion
The following is a summary of the treatment and status of all claims in the application as recommended by M.P.E.P. 707.07(i): claims 5, 7, 17, and 18 are cancelled; claims 1-4, 6, 8-16, 19, and 20 are rejected.

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Peled (US 6,073,213) – discloses a trace cache including a start instruction for a trace.
Davis (US 2008/0114964) – discloses a trace cache including a field for a number of instructions in the trace.
Gonzalez et al. (Trace-level reuse, Sept. 1999, pgs. 1-8) – discloses a reuse trace memory including identifying a trace instruction at an initial PC and different sizes/set associative implementations for the RTM (trace cache).

The examiner requests, in response to this Office action, that support be shown for language added to any original claims on amendment and any new claims. That is, indicate support for newly added claim language by specifically pointing to page(s) and line number(s) in the specification and/or drawing figure(s). This will assist the examiner in prosecuting the application.

When responding to this office action, Applicant is advised to clearly point out the patentable novelty which he or she thinks the claims present, in view of the state of the art disclosed by the references cited or the objections made. He or she must also show how the amendments avoid such references or objections.  See 37 CFR 1.111(c).

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to GEORGE GIROUX whose telephone number is (571)272-9769. The examiner can normally be reached M-F 10am-6pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Omar Fernandez Rivas, can be reached at 571-272-2589. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/GEORGE GIROUX/Primary Examiner, Art Unit 2128