DETAILED ACTION
Claims 1-8 and 10-20 are pending.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 07/13/2022 has been entered.
 
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-6, 10-12, 15, 16, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Smith et al. (US 2015/0355996 A1) in view of Lai (US 2011/0082999 A1).

Smith was cited in the previous Office Action.
Regarding claim 1, Smith teaches the invention substantially as claimed including a computer-implemented method, comprising: 
generating a first signal to sample performance data for a plurality of engines included in a processor, wherein the performance data is captured by a plurality of performance monitors ([0047]: a single trace cell 420 may be implemented for each partition in the streaming multi-processor (SM) 250; [0049]: In addition to allowing a programmer or driver to change the sampling frequency of the trace cell 420, the trace cell(s) 420 may be programmed to collect trace information from one to many SMs 250…the logic 421 may include registers for indicating whether the trace cell 420 should be enabled for a particular SM 250 (such as by using a 16-bit vector to indicate which SMs 250 should activate trace information collection), registers for indicating whether the trace cell 420 should collect data for any particular thread block(s) resident on the SM 250, registers for specifying a start time and an end time that indicate a range for collecting trace information based on a system clock or based on an elapsed time since the start of execution of a workload, and so forth; [0054]: The programmer may run a development platform such as NVIDIA® Nsight for Visual Studio on a host computer… Prior to executing the program on the PPU 200, the graphics application may be configured to setup the PPU 200 to collect various execution statistics using the trace cell 420 implemented in the PPU 200. For example, an API call generated by the graphics application may cause the driver to transmit an instruction to the PPU 200 that sets a value in a register to enable trace information collection.); 
receiving, based on the first signal, the performance data from the plurality of performance monitors ([0054]: Then, during execution of the program, the trace cell 420 collects records containing trace information; [0058]: a performance monitor that tracks performance statistics for the various SMs 250 of the PPU 200 may be implemented. The performance monitor may include various performance monitor (PM) counters that track, among other statistics, how many clock cycles a particular SM 250 was active or inactive during a given GPU context, a number of tasks or threads launched by a particular SM 250, and so forth. In one embodiment, the functionality of the trace cell 420 described above may be implemented in the existing performance monitor unit); 
extracting a plurality of performance data subsets from the performance data based on a plurality of identifiers included in the performance data, wherein each of the plurality of performance data subsets corresponds to a different identifier included in the plurality of identifiers, and wherein each of the plurality of identifiers corresponds to a different engine included in the plurality of engines ([0047]: a single trace cell 420 may be implemented per SM 250; [0049]: the trace cell(s) 420 may be programmed to collect trace information from one to many SMs 250 based on a core identifier for each SM 250, collect trace information from one to many thread blocks based on a thread block identifier for each thread block, collect trace information for a specific workload (i.e., task) based on a task identifier, or collect trace information within a particular time range for multiple workloads. [0058]: The performance monitor may include various performance monitor (PM) counters that track, among other statistics, how many clock cycles a particular SM 250 was active or inactive during a given GPU context, a number of tasks or threads launched by a particular SM 250, and so forth); and 
storing each of the plurality of performance data subsets in a different one of a plurality of data stores ([0048]: The trace cell 420 is a hardware unit that determines when trace information should be sampled from the scheduler unit 310 and initiates operations to store the trace information in the memory 204; [0033]: In one embodiment, the PPU 200 comprises U memory interfaces 280(U), where each memory interface 280(U) is connected to a corresponding memory device 204(U). For example, PPU 200 may be connected to up to 6 memory devices 204; [0050]; [0058]), wherein each data store included in the plurality of data stores is accessible to a corresponding engine included in the plurality of engines ([0036]: In one embodiment, the TMU 215 may configure different SMs 250 to execute different shader programs concurrently. For example, a first subset of SMs 250 may be configured to execute a vertex shader program while a second subset of SMs 250 may be configured to execute a pixel shader program. The first subset of SMs 250 processes vertex data to produce processed vertex data and writes the processed vertex data to the L2 cache 265 and/or the memory 204; [0045]: the SM 250 comprises J texture units 390. The texture units 390 are configured to load texture maps (i.e., a 2D array of texels) from the memory 204).
While Smith teaches receiving performance data from the one or more performance monitors as noted above, Smith does not expressly teach extracting a first subset of the performance data that is associated with a first engine included in the plurality of engines; and
wherein each data store included in the plurality of data stores is accessible from a distinct virtual address space that is assigned to the corresponding engine.

	However, Smith does teach:
	[0047]: a single trace cell 420 may be implemented per SM 250.
[0049]: the trace cell(s) 420 may be programmed to collect trace information from one to many SMs 250 based on a core identifier for each SM 250.
[0058]: The performance monitor may include various performance monitor (PM) counters that track, among other statistics, how many clock cycles a particular SM 250 was active or inactive during a given GPU context, a number of tasks or threads launched by a particular SM 250, and so forth.

	Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to understand the teachings of Smith to encompass extracting a first subset of performance data associated with a first engine, given that the “receiving” limitation allows for a single performance monitor and that Smith's collection of trace information based on an identifier reasonably teaches the limitation. As such, Smith teaches the claimed limitation.

	Smith does not expressly teach, but Lai does teach, wherein each data store included in the plurality of data stores is accessible from a distinct virtual address space that is assigned to the corresponding engine ([0004]: A conventional data processing engine (such as a general purpose microprocessor) may access one or more address spaces. Each address space may be used to access either memory or I/O devices, or both. The address spaces of memory and I/O devices may be separated by different load/store instructions. For example, the instruction LoadMemory is used to access the memory address space, while the instruction LoadIO is used to access the I/O address space. Alternatively, the address spaces of memory and I/O devices may be separated according to physical address space segments (without address translation) or virtual address space segments (with address translation). Each segment has a different address range; [0039]; [0044]).
	
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Lai with the teachings of Smith to further provide a memory address space to which accesses to a data store are directed. The modification would have been motivated by the desire to have access to dedicated resources.

Regarding claim 2, Smith teaches wherein the first data store is inaccessible to all other engines included in the plurality of engines ([0034]: In one embodiment, each of the SMs 250 also implements an L1 cache. The L1 cache is private memory that is dedicated to a particular SM 250.).  

Regarding claim 3, Smith teaches further comprising: 
extracting a portion of the performance data that is not traceable to any engine included in the plurality of engines and storing the portion of the performance data that is not traceable to any engine in a second data store ([0018] In one embodiment, each core implements at least one trace cell that includes logic for collecting trace information associated with each of the threads in the thread blocks managed by one or more micro-scheduler units included in the core. The trace cell may be configured to periodically collect trace information on each of the thread blocks managed by the micro-scheduler. The trace information for a particular thread block may include an identifier that indicates which core in the plurality of cores the thread block is allocated to, an address associated with a program counter for the thread block, and a stall vector (i.e., a vector that indicates a reason why the thread block is stalled). The trace cell may stream this information to a buffer (i.e., a FIFO) for temporary storage until the trace cell can write the trace information out to an event buffer in memory for later analysis.).  

Regarding claim 4, Smith teaches wherein the first data store is accessible to an authorized entity associated with the processor and inaccessible to all engines included in the plurality of engines ([0018]: The trace cell may stream this information to a buffer (i.e., a FIFO) for temporary storage until the trace cell can write the trace information out to an event buffer in memory for later analysis. The trace information may be displayed to a programmer for the programmer to be able to analyze hotspots or bottlenecks in the source code. For example, NVIDIA® Nsight is a development platform for generating shader code in Microsoft® Visual Studio. Nsight includes a graphical user interface that may be configured to display trace information such that a programmer can analyze the source code based on execution of the compiled program on the parallel processing unit.).

Regarding claim 5, Smith teaches wherein generating the first signal to sample the performance data comprises: 
transmitting the first signal to an array of signal counters included in a first performance monitor included in the plurality of performance monitors and sampling, via the array of signal counters, at least a portion of the performance data ([0024]: At step 154, during each clock cycle, a determination is made as to whether to sample the trace information. In one embodiment, trace information is sampled at a particular sampling frequency every N clock cycles. A counter may be incremented based on a CLK signal and, when the counter reaches a threshold value, trace information is collected from the micro-scheduler and the counter is reset; [0058]: a performance monitor that tracks performance statistics for the various SMs 250 of the PPU 200 may be implemented. The performance monitor may include various performance monitor (PM) counters that track, among other statistics, how many clock cycles a particular SM 250 was active or inactive during a given GPU context, a number of tasks or threads launched by a particular SM 250, and so forth.).

Regarding claim 6, Smith teaches wherein generating the first signal to sample the performance data comprises: 
combining one or more signals received by a first performance monitor included in the plurality of performance monitors according to a logical signal expression ([0002]: Trace information can be collected where the current program counter value for the active thread is sampled at periodic intervals such as every 10,000 cycles or when an event counter reaches a particular value (e.g., after every 100 cache misses, after 50 branch calls, etc.). Such collection methods may be enabled by hardware implemented within the processor such as the Performance Monitor); 
determining that a condition of the logical signal expression is met ([0024] A counter may be incremented based on a CLK signal and, when the counter reaches a threshold value, trace information is collected from the micro-scheduler and the counter is reset.); 
in response, transmitting the first signal to an array of signal counters included in the first performance monitor and sampling, via the array of signal counters, at least a portion of the performance data ([0058]: a performance monitor that tracks performance statistics for the various SMs 250 of the PPU 200 may be implemented. The performance monitor may include various performance monitor (PM) counters that track, among other statistics, how many clock cycles a particular SM 250 was active or inactive during a given GPU context, a number of tasks or threads launched by a particular SM 250, and so forth.).  

Regarding claim 10, it is a media/product type claim having similar limitations as claim 1 above. Therefore, it is rejected under the same rationale above.

Regarding claim 11, it is a media/product type claim having similar limitations as claim 5 above. Therefore, it is rejected under the same rationale above.

Regarding claim 12, it is a media/product type claim having similar limitations as claim 6 above. Therefore, it is rejected under the same rationale above.

Regarding claim 15, Smith teaches wherein the performance data is based on a first performance monitor associated with a first clock signal domain and a second performance monitor associated with a second clock signal domain ([0058] In some conventional graphics processors, a performance monitor that tracks performance statistics for the various SMs 250 of the PPU 200 may be implemented. The performance monitor may include various performance monitor (PM) counters that track, among other statistics, how many clock cycles a particular SM 250 was active or inactive during a given GPU context, a number of tasks or threads launched by a particular SM 250, and so forth. In these systems, the performance monitor may implement a streaming interface to transmit the performance monitor statistics to memory for analysis. In one embodiment, the functionality of the trace cell 420 described above may be implemented in the existing performance monitor unit).

Regarding claim 16, Smith teaches wherein the performance data is associated with a duration of time between the first signal and a second signal to sample the performance data for the plurality of engines ([0049] In addition to allowing a programmer or driver to change the sampling frequency of the trace cell 420, the trace cell(s) 420 may be programmed to collect trace information from one to many SMs 250 based on a core identifier for each SM 250, collect trace information from one to many thread blocks based on a thread block identifier for each thread block, collect trace information for a specific workload (i.e., task) based on a task identifier, or collect trace information within a particular time range for multiple workloads. In other words, the logic 421 may include registers for indicating whether the trace cell 420 should be enabled for a particular SM 250 (such as by using a 16-bit vector to indicate which SMs 250 should activate trace information collection), registers for indicating whether the trace cell 420 should collect data for any particular thread block(s) resident on the SM 250, registers for specifying a start time and an end time that indicate a range for collecting trace information based on a system clock or based on an elapsed time since the start of execution of a workload, and so forth.).

Regarding claim 18, it is a system type claim having similar limitations as claim 1 above. Therefore, it is rejected under the same rationale above.

Claims 7, 8, 13, and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Smith and Lai, as applied to claim 1, in further view of Griffin et al. (US 8,738,860 B1).

Regarding claim 7, Smith and Lai do not expressly teach wherein the performance data is based on a first signal group received via a first multiplexor.
	
However, Griffin teaches wherein the performance data is based on a first signal group received via a first multiplexor (Col. 89 lines 41-56: A set of multiplexors are configured to select internal information to be traced. For example, items such as the current program counter, trap address, L2 cache miss address. etc. are traced for debugging and/or profiling. The select control for the trace information multiplexor can be provided, for example, in a Special Purpose Register (SPR). A set of multiplexors (i.e., first multiplexor) are configured to select internal events which determine when the state being traced is sampled. For example, items such as conditional branches taken, L1 cache misses, and program stalls can be events of interest. The value of the trace information when such events occur can be useful to determine during debugging program errors or performance issues because the trace information can help pinpoint what caused the stall or the cache miss or the branch outcome. In some implementations, the select control for the event multiplexor can be provided in a Special Purpose Register (SPR).).

	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Griffin with the teachings of Smith and Lai to utilize multiplexors to trigger sampling. The modification would have been motivated by the desire to obtain information on cache misses and program stalls.

Regarding claim 8, Griffin teaches wherein the performance data is further based on a second signal group received via a second multiplexor (Col. 89 lines 41-56: A set of multiplexors are configured to select internal information to be traced. For example, items such as the current program counter, trap address, L2 cache miss address. etc. are traced for debugging and/or profiling. The select control for the trace information multiplexor can be provided, for example, in a Special Purpose Register (SPR). A set of multiplexors (i.e., second multiplexor) are configured to select internal events which determine when the state being traced is sampled. For example, items such as conditional branches taken, L1 cache misses, and program stalls can be events of interest. The value of the trace information when such events occur can be useful to determine during debugging program errors or performance issues because the trace information can help pinpoint what caused the stall or the cache miss or the branch outcome. In some implementations, the select control for the event multiplexor can be provided in a Special Purpose Register (SPR).).

Regarding claim 13, it is a media/product type claim having similar limitations as claim 7 above. Therefore, it is rejected under the same rationale above.

Regarding claim 14, it is a media/product type claim having similar limitations as claim 8 above. Therefore, it is rejected under the same rationale above.

Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Smith and Lai as applied to claim 16, in further view of Faasse et al. (US 2019/0278355 A1).

Regarding claim 17, Smith and Lai do not expressly teach wherein the first signal coincides with a first context switch event associated with a first engine and the second signal coincides with a second context switch event associated with the first engine.
	
	However, Faasse teaches wherein the first signal coincides with a first context switch event associated with the first engine and the second signal coincides with a second context switch event associated with the first engine ([0037] In one example, the context switch block 202 may also be to receive a signal to perform a subsequent context switch and to store a current processor performance state with an updated context information in response to the signal. In other words, when a subsequent context switch is initiated, the context switch block 202 may cause the processor 102 may update the context information with the current processor performance state. The updated context information may then be stored in memory for later retrieval.).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Faasse with the teachings of Smith and Lai to monitor performance during execution of different tasks. The modification would have been motivated by the desire to monitor processor utilization while executing different tasks.

Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Smith and Lai as applied to claim 18, in further view of Khodorkovsky et al. (US 2013/0155073 A1).

Regarding claim 19, Smith and Lai do not expressly teach wherein the processor executes a plurality of virtual machines, and further comprising: determining that no virtual machine included in the plurality of virtual machines is utilizing a first circuit subsection included in the processor and reducing a supply voltage associated with the first circuit subsection.

	However, Khodorkovsky teaches wherein the processor executes a plurality of virtual machines, and further comprising: determining that no virtual machine included in the plurality of virtual machines is utilizing a first circuit subsection included in the processor and reducing a supply voltage associated with the first circuit subsection ([0051]: If the active VM 40 has an estimated active time slice greater than the threshold time at block 90, the power management unit 24 uses the activity history context for the active VM 40 as illustrated at block 93. Therefore, the physical machine power manager 60 adjusts a power state by changing the preferred frequency and voltage settings of the clock generator 22 or engine controls 62 based on the virtual machine activity history context. For example, if the VM 40 used a lower power state for an engine 14, 16, 18, 20 of the graphics processing core 12 during one or more previous time slices, the power manager 60 will prefer to decrease the frequency of the particular engine.).

	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Khodorkovsky with the teachings of Smith and Lai to modify voltage settings based on virtual machine utilization. The modification would have been motivated by the desire to reduce power consumption.

Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over Smith and Lai as applied to claim 18, in further view of Jain et al. (US 2019/0384367 A1).

Regarding claim 20, Smith and Lai do not expressly teach wherein each circuit subsection included in a plurality of circuit subsections is associated with a different engine included in the plurality of engines, and further comprising: determining that a first circuit subsection included in the plurality of circuit subsections is consuming more power than each of the other circuit subsections included in the plurality of circuit subsections and reducing a frequency of a clock signal associated with the first circuit subsection.

	However, Jain teaches wherein each circuit subsection included in a plurality of circuit subsections is associated with a different engine included in the plurality of engines, and further comprising: determining that a first circuit subsection included in the plurality of circuit subsections is consuming more power than each of the other circuit subsections included in the plurality of circuit subsections and reducing a frequency of a clock signal associated with the first circuit subsection (Abstract: an electronic device that can be configured to include a plurality of chiplets, a plurality of resources, a system thermal engine, and at least one processor. The at least one processor is configured to cause the system thermal engine to monitor the plurality of chiplets, where the plurality of chiplets are part of a multi-chip module, determine that a first chiplet from the plurality of chiplets has reached a threshold temperature, and reduce power to the first chiplet without reducing power to the other chiplets in the plurality of chiplets; [0044]: Different chiplets have different residency and clock frequency characteristics based on the workload assigned to the chiplet. For example, graphic centric workloads will have higher graphic chiplet workload requirements (e.g., a graphic chiplet needs a longer residency or amount to time to use a higher power for an increased clock frequency as compared to a logic chiplet) and lower logic workload requirements (e.g., a logic chiplet needs a lower residency or amount to time to use a higher power for an increased clock frequency as compared to a graphic chiplet). 
In the example of similar functions on each chiplet (e.g., multiple core chiplets with different speeds) the amount of time a chiplet will use a higher power to use an increased clock frequency will be different (e.g., the amount of time will be longer for slower/lower power chiplets as they generate less heat and therefore can operate at the higher power longer). System thermal engine 108 can allow each chiplet to reach its threshold temperature before its power and clock frequency is reduced; [0068]; Fig. 6).

	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Jain with the teachings of Smith and Lai to determine whether one chiplet/engine is exceeding power consumption and modify the power provided to it without affecting other chiplets/engines. The modification would have been motivated by the desire of avoiding reduction of performance due to thermal increases based on excessive consumption in chiplets of a device.

Response to Arguments
Applicant’s arguments with respect to claims 1-8 and 10-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
(US 2010/0191923 A1) See at least relevant portions of Abstract, [0056], and [0075].
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JORGE A CHU JOY-DAVILA whose telephone number is (571)270-0692. The examiner can normally be reached Monday-Friday, 9:00am-5:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Meng-Ai T An, can be reached on (571)-272-3756. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/JORGE A CHU JOY-DAVILA/
Primary Examiner, Art Unit 2195