DETAILED ACTION
Response to Arguments

	Applicant argues that Gene Wu, et al., "GPGPU performance and power estimation using machine learning," IEEE 21st international symposium on high performance computer architecture (HPCA), IEEE, 2015 (hereinafter "Wu") does not teach claim 1’s derived counter value. See pg. 19 of Applicant’s Remarks submitted on 07/25/2022 (stating that “[w]u does not teach a model that predicts a ‘derived counter value,’ as claimed. Moreover, Wu does not teach a ‘derived counter value’ that ‘predicts performance attributes of the processor.’”).
	Examiner respectfully disagrees. As MPEP § 2173.01(I) states, “Under a broadest reasonable interpretation, words of the claim must be given their plain meaning…[t]he plain meaning of a term means the ordinary and customary meaning given to the term by those of ordinary skill in the art at the time of the invention…[h]owever, the best source for determining the meaning of a claim term is the specification - the greatest clarity is obtained when the specification serves as a glossary for the claim terms.” (Emphasis added). With this statement in mind, ¶ 0015 of Applicant’s Specification filed 03/15/2018 defines a derived counter value as follows: “the derived counter value indicates application performance for a portion of a program executing on the processor…[i]n some embodiments, the derived counter value indicates a predicted memory address, a predicted power requirement, or a predicted frequency requirement.” (Emphasis added).
	As Wu details in fig. 2 on page 567, reproduced below, the target execution time and power are the ultimate outputs.

[Image: Wu, fig. 2, page 567 (media_image1.png)]


Furthermore, as Wu states on page 575 in the conclusion section, “[i]n this work, we presented a high-level GPGPU performance and power predictor.” Accordingly, Wu teaches claim 1’s limitation that the derived counter value predicts performance attributes of the processor.

	Applicant also argues that Wu does not teach counter engine circuitry comprising an artificial neural network (ANN) configured to dynamically modify the model based on the derived counter value. See pgs. 19-20 of Applicant’s Remarks submitted on 07/25/2022 (stating that Wu does not teach “counter engine circuitry configured to determine the derived counter value by applying a model to the hardware performance counter value, wherein: the derived counter value predicts performance attributes of the processor, and the counter engine circuitry comprising an artificial neural network (ANN) configured to dynamically modify the model based on the derived counter value.”).
	Examiner respectfully disagrees. As MPEP § 2173.01(I) states, “Under a broadest reasonable interpretation, words of the claim must be given their plain meaning…[t]he plain meaning of a term means the ordinary and customary meaning given to the term by those of ordinary skill in the art at the time of the invention…[h]owever, the best source for determining the meaning of a claim term is the specification - the greatest clarity is obtained when the specification serves as a glossary for the claim terms.” (Emphasis added). 
Accordingly, ¶ 0015 of Applicant’s Specification filed 03/15/2018 defines a derived counter value as follows: “the derived counter value indicates application performance for a portion of a program executing on the processor…[i]n some embodiments, the derived counter value indicates a predicted memory address, a predicted power requirement, or a predicted frequency requirement.” (Emphasis added). In addition, ¶ 0033 of Applicant’s Specification filed 03/15/2018 defines hardware counters as follows: “[h]ardware counters can include and can also be referred to as hardware performance counters, performance monitors, event counters…[h]ardware counters can be configured with a [‘]tick rate[’]. For example, rather than incrementing or counting once for every core cycle, a performance counter can be configured to increment or count once after every 64 core cycles, or at any other desired rate.” (Emphasis added).
Fig. 2 on page 567 of Wu, reproduced below, details the model construction and usage flow.

[Image: Wu, fig. 2, page 567 (media_image2.png)]

	As page 567 of Wu details, “[t]he construction algorithm uses a training data set containing execution times and performance counter values [i.e., hardware performance counter value] collected from executing training kernels on real hardware… [o]nce the model is constructed, it can be used to predict the performance of new kernels [i.e., program performance], from outside the training set, at any target hardware configuration…[t]o make a prediction, the kernel’s performance counter values and base execution time must first be gathered by executing it on the base hardware configuration. These are then passed to the model, along with the desired target hardware configuration, which will output a predicted execution time at that target configuration [i.e., the derived counter value predicts performance attributes of the processor].” (Emphasis added).
	Then, on page 570, Wu details that “[a]fter forming representative clusters, the next step is to build classifiers, which are implemented using neural networks [i.e., an artificial neural network (ANN)], that can map kernels to clusters using performance counter values…[o]ne neural network is built and trained per cluster set [i.e., configured to dynamically modify the model based on the derived counter value].” (Emphasis added).
Accordingly, Wu does teach “counter engine circuitry configured to determine the derived counter value by applying a model to the hardware performance counter value, wherein: the derived counter value predicts performance attributes of the processor, and the counter engine circuitry comprising an artificial neural network (ANN) configured to dynamically modify the model based on the derived counter value.”
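For illustration only, the prediction flow quoted above from pages 567 and 570 of Wu can be sketched in Python. This is a minimal sketch and not Wu's implementation: the function names, the sample classifier outputs, the sample centroid values, and the use of a single scaling value per cluster are all assumptions made for this example.

```python
# Hypothetical sketch of the quoted flow: the classifier emits one value
# between 0 and 1 per cluster, the cluster with the highest value is chosen,
# and that cluster's centroid scales the base execution time.

def select_cluster(nn_outputs):
    """Return the index of the cluster with the highest classifier output."""
    return max(range(len(nn_outputs)), key=lambda k: nn_outputs[k])

def predict_execution_time(base_time, nn_outputs, centroids):
    """Scale the base-configuration execution time by the chosen centroid.

    Assumption: each centroid is reduced to one scaling value for a single
    target hardware configuration.
    """
    cluster = select_cluster(nn_outputs)
    return base_time * centroids[cluster]

# Illustrative (made-up) values: three clusters, base execution time of 10.0.
nn_outputs = [0.1, 0.7, 0.2]
centroids = [1.0, 0.5, 2.0]
predicted = predict_execution_time(10.0, nn_outputs, centroids)
print(predicted)  # 5.0
```

In this sketch the second cluster has the highest output, so its scaling value of 0.5 is applied to the base execution time.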

Applicant argues that Wu does not teach a model that is trained based on a comparison between exemplary pairs of hardware performance counter values and corresponding derived counter values. See pg. 21 of Applicant’s Remarks submitted on 07/25/2022 (stating that Wu in view of Dimitrov does not teach a model that is trained based on a comparison between exemplary pairs of hardware performance counter values and corresponding derived counter values).
Examiner respectfully disagrees. As noted above, ¶ 0015 of Applicant’s Specification filed 03/15/2018 defines a derived counter value as follows: “the derived counter value indicates application performance for a portion of a program executing on the processor…[i]n some embodiments, the derived counter value indicates a predicted memory address, a predicted power requirement, or a predicted frequency requirement.” (Emphasis added). In addition, ¶ 0033 of Applicant’s Specification filed 03/15/2018 defines hardware counters as follows: “[h]ardware counters can include and can also be referred to as hardware performance counters, performance monitors, event counters…[h]ardware counters can be configured with a [‘]tick rate[’]. For example, rather than incrementing or counting once for every core cycle, a performance counter can be configured to increment or count once after every 64 core cycles, or at any other desired rate.” (Emphasis added).
As Wu details on page 570, “[a]fter forming representative clusters, the next step is to build classifiers, which are implemented using neural networks, that can map kernels to clusters using performance counter values. One neural network is built and trained per cluster set.” (Emphasis added). Accordingly, in training the neural network (i.e., the model), both performance counter values [i.e., hardware performance counter values] and clusters [i.e., derived counter values] are compared and used to train the neural network (i.e., via the loss function and backpropagation). This comparison between performance counter values and clusters is illustrated by fig. 9, reproduced below, which details how the neural network is trained.

[Image: Wu, fig. 9 (media_image3.png)]

Accordingly, Wu does teach a model that is trained based on a comparison between exemplary pairs of hardware performance counter values and corresponding derived counter values.
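For illustration only, training a classifier on pairs of performance counter values and cluster labels, with a loss function comparing the network output against the labeled cluster, can be sketched in Python as follows. This is a minimal sketch under stated assumptions and not Wu's topology or training procedure: it uses a single layer of sigmoid outputs (so backpropagation reduces to one gradient step per layer), a squared-error loss, made-up two-feature inputs, and hypothetical function names.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(weights, features):
    """One sigmoid output per cluster (assumed single-layer network)."""
    return [sigmoid(sum(w * x for w, x in zip(row, features))) for row in weights]

def loss(outputs, target_cluster):
    """Squared error between the outputs and a one-hot cluster label."""
    return sum((o - (1.0 if k == target_cluster else 0.0)) ** 2
               for k, o in enumerate(outputs))

def train_step(weights, features, target_cluster, lr=0.5):
    """One gradient-descent update driven by the output/label comparison."""
    outputs = forward(weights, features)
    updated = []
    for k, row in enumerate(weights):
        target = 1.0 if k == target_cluster else 0.0
        o = outputs[k]
        grad_z = 2.0 * (o - target) * o * (1.0 - o)  # d(loss)/d(pre-activation)
        updated.append([w - lr * grad_z * x for w, x in zip(row, features)])
    return updated

# Made-up training pairs: (performance counter features, cluster label).
pairs = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]
weights = [[0.0, 0.0], [0.0, 0.0]]

loss_before = sum(loss(forward(weights, f), c) for f, c in pairs)
for _ in range(200):
    for f, c in pairs:
        weights = train_step(weights, f, c)
loss_after = sum(loss(forward(weights, f), c) for f, c in pairs)
print(loss_before, loss_after)
```

Each update is computed from the difference between the network's output for a set of counter values and the cluster paired with those values, which is the comparison described above.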
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-2, 4, 7, 9, 11-13, 15, 18, 20, 22-26, 29, 31, 33-37, 40, 42, and 44-50 are rejected under 35 U.S.C. 103 as being unpatentable over Wu, Gene, et al. "GPGPU performance and power estimation using machine learning." 2015 IEEE 21st international symposium on high performance computer architecture (HPCA). IEEE, 2015 (“Wu”) in view of Dimitrov et al., US 2019/0213775 A1 (“Dimitrov”).
Regarding claim 1, Wu teaches a processor configured to determine a derived counter value based on a hardware performance counter, the processor comprising: counter engine circuitry configured to determine the derived counter value by applying a model to the hardware performance counter value wherein: the derived counter value predicts performance attributes of the processor, and the counter engine circuitry comprising an artificial neural network (ANN) configured to dynamically modify the model based on the derived counter value; and output circuitry configured to communicate the derived counter value to a hardware control circuit (Wu, pgs. 567-568, right-column, see also figs. 2, 3, 4, 5, 6, and 9, “In the second phase, a classifier is constructed to predict which cluster’s scaling behavior best describes a new kernel based on its performance counter values…[f]ig. 6 gives a detailed view of the model architecture. Notice that the model contains multiple sets of clusters and classifiers. Each cluster set and classifier pair is responsible for providing scaling behaviors for a subset of the CU, engine frequency, and memory frequency parameter space. For example, the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8. The exact number of sets and classifier pairs in a model depends on the hardware configurations that appear in the training set. The scaling information from these cluster sets allows scaling from the base to any other target configuration” & see Wu, pg. 570, left-column, “After forming representative clusters, the next step is to build classifiers, which are implemented using neural networks, that can map kernels to clusters using performance counter values… [o]ne neural network is built and trained per cluster set. 
The features used as inputs to the neural networks are listed in Table II… [t]he neural network topology is shown in Fig. 9. The neural network outputs one value, between 0 and 1, per cluster in its cluster set. The cluster with the highest output value is selected as the chosen cluster for the kernel… [a]fter construction, the neural network is used to select which clusters best describe the scaling behavior of the kernel… [t]he centroids of the selected clusters are used as the kernel’s scaling values. The predicted target configuration execution time is then calculated by multiplying the base hardware configuration execution time by the appropriate scaling values….” Wu teaches, with reference to fig. 6, that a classifier is constructed to predict which cluster’s scaling behavior best describes a new kernel based on its performance counter values (i.e., counter engine circuitry configured to determine the derived counter value by applying a model to the hardware performance counter value). Each cluster set is responsible for providing scaling behaviors for a subset of the CU, engine frequency, and memory frequency parameter space; for example, the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8, that is, performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8 (i.e., wherein the derived counter value predicts performance attributes of the processor). The neural network topology is shown in Fig. 9. The neural network outputs one value, between 0 and 1, per cluster in its cluster set, and the cluster with the highest output value is selected as the chosen cluster for the kernel. After construction, the neural network is used to select which clusters best describe the scaling behavior of the kernel, and the centroids of the selected clusters are used as the kernel’s scaling values (i.e., the counter engine circuitry comprising an artificial neural network (ANN) configured to dynamically modify the model based on the derived counter value). The predicted target configuration execution time is then calculated by multiplying the base hardware configuration execution time by the appropriate scaling values (i.e., output circuitry configured to communicate the derived counter value to a hardware control circuit)).
Wu does not teach: input circuitry configured to input a hardware performance counter value. 
However, Dimitrov teaches: input circuitry configured to input a hardware performance counter value (Dimitrov, para. 0017, “The multiprocessing unit includes performance monitoring counters (PMs), comprising logic circuits configured to measure different performance-related values in real-time. In one embodiment, PMs may be configured to monitor at least one of a memory request counter, a memory system bandwidth utilization, a memory system storage capacity utilization, a cache hit rate, a count of instructions executed per clock cycle for one or more threads of a multithreaded program, and a count of instructions executed for one or more threads of the multithreaded program.”). 
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Wu’s processor in view of Dimitrov. The motivation to do so would be to have performance monitoring counters (PMs) for monitoring multithreaded applications (Dimitrov, para. 0017, “The multiprocessing unit includes performance monitoring counters (PMs), comprising logic circuits… PMs may be configured to monitor at least one of a memory request counter, a memory system bandwidth utilization, a memory system storage capacity utilization, a cache hit rate, a count of instructions executed per clock cycle for one or more threads of a multithreaded program, and a count of instructions executed for one or more threads of the multithreaded program.”).
Regarding claim 2, Wu in view of Dimitrov teaches the processor of claim 1, wherein the hardware control circuit comprises an operating system scheduler, a memory controller, a power manager, a data prefetcher, or a cache controller (Dimitrov, para. 0025, “[T]he one or more operating parameters include at least one of a maximum number of concurrently executing threads, a maximum number of active processing cores, a tile caching enable/disable flag, a core clock frequency, a memory interface clock frequency, and a core operating voltage.” Note: It is being interpreted that the memory interface clock frequency represents a data prefetcher).1
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Wu with the above teachings of Dimitrov for the same rationale stated at Claim 1.
Regarding claim 4, Wu in view of Dimitrov teaches the processor of claim 1, wherein the model comprises or is generated by the artificial neural network (ANN) (Dimitrov, para. 0030, fig. 1C(124), “FIG. 1C illustrates an exemplary neural network 124, configured to implement one or more aspects of one embodiment.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Wu with the above teachings of Dimitrov for the same rationale stated at Claim 1.
Regarding claim 7, Wu in view of Dimitrov teaches the processor of claim 1, wherein the derived counter value(Wu, pgs. 567-568, right-column, see also figs. 2, 3, 4, 5, 6, and 9, “In the second phase, a classifier is constructed to predict which cluster’s scaling behavior best describes a new kernel based on its performance counter values…[f]ig. 6 gives a detailed view of the model architecture. Notice that the model contains multiple sets of clusters and classifiers. Each cluster set and classifier pair is responsible for providing scaling behaviors for a subset of the CU, engine frequency, and memory frequency parameter space. For example, the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8. The exact number of sets and classifier pairs in a model depends on the hardware configurations that appear in the training set. The scaling information from these cluster sets allows scaling from the base to any other target configuration.”)indicates a predicted execution time for a portion of a program executing on the processor(Wu, pg. 567, fig. 2, “Once the model is constructed, it can be used to predict the performance of new kernels, from outside the training set, at any target hardware configuration within the range of the training data. To make a prediction, the kernel’s performance counter values and base execution time must first be gathered by executing it on the base hardware configuration. These are then passed to the model, along with the desired target hardware configuration, which will output a predicted execution time at that target configuration.” Wu teaches: will output a predicted execution time at that target configuration (i.e. indicates a predicted execution time for a portion of a program executing on the processor)).
Regarding claim 9 Wu in view of Dimitrov teaches the processor of claim 1, wherein the derived counter value indicates a predicted memory address, a predicted power requirement, or a predicted frequency requirement (Wu, pgs. 567-568, right-column, see also figs. 2, 3, 4, 5, 6, and 9, “In the second phase, a classifier is constructed to predict which cluster’s scaling behavior best describes a new kernel based on its performance counter values…[f]ig. 6 gives a detailed view of the model architecture. Notice that the model contains multiple sets of clusters and classifiers. Each cluster set and classifier pair is responsible for providing scaling behaviors for a subset of the CU, engine frequency, and memory frequency parameter space. For example, the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8. The exact number of sets and classifier pairs in a model depends on the hardware configurations that appear in the training set. The scaling information from these cluster sets allows scaling from the base to any other target configuration.” Wu teaches: the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8 (i.e. wherein the derived counter value indicates a predicted frequency requirement)).2
Regarding claim 11, Wu in view of Dimitrov teaches the processor of claim 1, further comprising circuitry configured to manage power or frequency of the processor(Dimitrov, para. 0025, “[T]he control unit includes a machine learning model configured to receive the performance monitor values as inputs and to update [i.e., predict] the one or more operating parameters as outputs during execution of the multithreaded application…the one or more operating parameters include at least one of a maximum number of concurrently executing threads, a maximum number of active processing cores, a tile caching enable/disable flag, a core clock frequency, a memory interface clock frequency, and a core operating voltage.” Note: It is being interpreted that the core clock frequency represents the frequency of the processor and the core operating voltage represents manage power)3based on the derived counter value(Wu, pgs. 567-568, right-column, see also figs. 2, 3, 4, 5, 6, and 9, “In the second phase, a classifier is constructed to predict which cluster’s scaling behavior best describes a new kernel based on its performance counter values…[f]ig. 6 gives a detailed view of the model architecture. Notice that the model contains multiple sets of clusters and classifiers. Each cluster set and classifier pair is responsible for providing scaling behaviors for a subset of the CU, engine frequency, and memory frequency parameter space. For example, the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8. The exact number of sets and classifier pairs in a model depends on the hardware configurations that appear in the training set. The scaling information from these cluster sets allows scaling from the base to any other target configuration.”).4
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Wu with the above teachings of Dimitrov for the same rationale stated at Claim 1.
Regarding claim 12, Wu teaches a prediction unit implemented on a processor core and configured to determine a derived counter value based on a hardware performance counter, the processor core comprising: counter engine circuitry configured to determine the derived counter value based on applying a model to the hardware performance counter value, wherein: the derived counter value predicts performance attributes of the processor, and the model is trained based on a comparison between exemplary pairs of hardware performance counter values and corresponding derived counter values; and output circuitry configured to communicate the derived counter value to a hardware control circuit(Wu, pgs. 567-568, left-column, see also figs. 2, 3, 4, 5, 6, and 9, “The model construction and usage flow are depicted in Fig. 2. The construction algorithm uses a training data set containing execution times and performance counter values collected from executing training kernels on real hardware. The values in the training set are shown in Fig. 3. For each training kernel, execution times and performance counter values across a range of hardware configurations are stored in the training set. The performance counter values collected while executing each training kernel on the base hardware configuration are also stored…[i]n the second phase, a classifier is constructed to predict which cluster’s scaling behavior best describes a new kernel based on its performance counter values…[f]ig. 6 gives a detailed view of the model architecture. Notice that the model contains multiple sets of clusters and classifiers. Each cluster set and classifier pair is responsible for providing scaling behaviors for a subset of the CU, engine frequency, and memory frequency parameter space. For example, the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. 
This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8. The exact number of sets and classifier pairs in a model depends on the hardware configurations that appear in the training set. The scaling information from these cluster sets allows scaling from the base to any other target configuration” & see Wu, pg. 570, left-column, “After forming representative clusters, the next step is to build classifiers, which are implemented using neural networks, that can map kernels to clusters using performance counter values… [o]ne neural network is built and trained per cluster set. The features used as inputs to the neural networks are listed in Table II… [t]he neural network topology is shown in Fig. 9. The neural network outputs one value, between 0 and 1, per cluster in its cluster set. The cluster with the highest output value is selected as the chosen cluster for the kernel… [a]fter construction, the neural network is used to select which clusters best describe the scaling behavior of the kernel… [t]he centroids of the selected clusters are used as the kernel’s scaling values. The predicted target configuration execution time is then calculated by multiplying the base hardware configuration execution time by the appropriate scaling values….” Wu teaches, with reference to fig. 6, that a classifier is constructed to predict which cluster’s scaling behavior best describes a new kernel based on its performance counter values (i.e., counter engine circuitry configured to determine the derived counter value by applying a model to the hardware performance counter value). Each cluster set is responsible for providing scaling behaviors for a subset of the CU, engine frequency, and memory frequency parameter space; for example, the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8, that is, performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8 (i.e., wherein the derived counter value predicts performance attributes of the processor). The model construction and usage flow are depicted in Fig. 2. The construction algorithm uses a training data set containing execution times and performance counter values collected from executing training kernels on real hardware. The values in the training set are shown in Fig. 3. For each training kernel, execution times and performance counter values across a range of hardware configurations are stored in the training set (i.e., the model is trained based on a comparison between exemplary pairs of hardware performance counter values and corresponding derived counter values). The predicted target configuration execution time is then calculated by multiplying the base hardware configuration execution time by the appropriate scaling values (i.e., output circuitry configured to communicate the derived counter value to a hardware control circuit)).
Wu does not teach: input circuitry configured to input a hardware performance counter value. 
However, Dimitrov teaches: input circuitry configured to input a hardware performance counter value (Dimitrov, para. 0017, “The multiprocessing unit includes performance monitoring counters (PMs), comprising logic circuits configured to measure different performance-related values in real-time. In one embodiment, PMs may be configured to monitor at least one of a memory request counter, a memory system bandwidth utilization, a memory system storage capacity utilization, a cache hit rate, a count of instructions executed per clock cycle for one or more threads of a multithreaded program, and a count of instructions executed for one or more threads of the multithreaded program.”). 
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Wu’s unit in view of Dimitrov. The motivation to do so would be to have performance monitoring counters (PMs) for monitoring multithreaded applications (Dimitrov, para. 0017, “The multiprocessing unit includes performance monitoring counters (PMs), comprising logic circuits… PMs may be configured to monitor at least one of a memory request counter, a memory system bandwidth utilization, a memory system storage capacity utilization, a cache hit rate, a count of instructions executed per clock cycle for one or more threads of a multithreaded program, and a count of instructions executed for one or more threads of the multithreaded program.”).
Regarding claim 13, Wu in view of Dimitrov teaches the prediction unit of claim 12, wherein the hardware control circuit comprises an operating system scheduler, a memory controller, a power manager, a data prefetcher, or a cache controller (Dimitrov, para. 0025, “[T]he one or more operating parameters include at least one of a maximum number of concurrently executing threads, a maximum number of active processing cores, a tile caching enable/disable flag, a core clock frequency, a memory interface clock frequency, and a core operating voltage.” Note: It is being interpreted that the memory interface clock frequency represents a data prefetcher).5
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Wu with the above teachings of Dimitrov for the same rationale stated at Claim 12.
Regarding claim 15, Wu in view of Dimitrov teaches the prediction unit of claim 12, wherein the model comprises or is generated by an artificial neural network (ANN) (Dimitrov, para. 0030, fig. 1C(124), “FIG. 1C illustrates an exemplary neural network 124, configured to implement one or more aspects of one embodiment.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Wu with the above teachings of Dimitrov for the same rationale stated at Claim 12.
Regarding claim 18, Wu in view of Dimitrov teaches the prediction unit of claim 12, wherein the derived counter value(Wu, pgs. 567-568, right-column, see also figs. 2, 3, 4, 5, 6, and 9, “In the second phase, a classifier is constructed to predict which cluster’s scaling behavior best describes a new kernel based on its performance counter values…[f]ig. 6 gives a detailed view of the model architecture. Notice that the model contains multiple sets of clusters and classifiers. Each cluster set and classifier pair is responsible for providing scaling behaviors for a subset of the CU, engine frequency, and memory frequency parameter space. For example, the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8. The exact number of sets and classifier pairs in a model depends on the hardware configurations that appear in the training set. The scaling information from these cluster sets allows scaling from the base to any other target configuration.”)indicates a predicted execution time for a portion of a program executing on the processor(Wu, pg. 567, fig. 2, “Once the model is constructed, it can be used to predict the performance of new kernels, from outside the training set, at any target hardware configuration within the range of the training data. To make a prediction, the kernel’s performance counter values and base execution time must first be gathered by executing it on the base hardware configuration. These are then passed to the model, along with the desired target hardware configuration, which will output a predicted execution time at that target configuration.” Wu teaches: will output a predicted execution time at that target configuration (i.e. indicates a predicted execution time for a portion of a program executing on the processor)).
Regarding claim 20, Wu in view of Dimitrov teaches the prediction unit of claim 12, wherein the derived counter value indicates a predicted memory address, a predicted power requirement, or a predicted frequency requirement (Wu, pgs. 567-568, right-column, see also figs. 2, 3, 4, 5, 6, and 9, “In the second phase, a classifier is constructed to predict which cluster’s scaling behavior best describes a new kernel based on its performance counter values…[f]ig. 6 gives a detailed view of the model architecture. Notice that the model contains multiple sets of clusters and classifiers. Each cluster set and classifier pair is responsible for providing scaling behaviors for a subset of the CU, engine frequency, and memory frequency parameter space. For example, the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8. The exact number of sets and classifier pairs in a model depends on the hardware configurations that appear in the training set. The scaling information from these cluster sets allows scaling from the base to any other target configuration.” Wu teaches: the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8 (i.e. wherein the derived counter value indicates a predicted frequency requirement)).
Regarding claim 22, Wu in view of Dimitrov teaches the prediction unit of claim 12, further comprising circuitry configured to manage power or frequency of the processor based (Dimitrov, para. 0025, “[T]he control unit includes a machine learning model configured to receive the performance monitor values as inputs and to update [i.e., predict] the one or more operating parameters as outputs during execution of the multithreaded application…the one or more operating parameters include at least one of a maximum number of concurrently executing threads, a maximum number of active processing cores, a tile caching enable/disable flag, a core clock frequency, a memory interface clock frequency, and a core operating voltage.” Note: It is being interpreted that the core clock frequency represents the frequency of the processor and the core operating voltage represents managing power) on the derived counter value (Wu, pgs. 567-568, right-column, see also figs. 2, 3, 4, 5, 6, and 9, “In the second phase, a classifier is constructed to predict which cluster’s scaling behavior best describes a new kernel based on its performance counter values…[f]ig. 6 gives a detailed view of the model architecture. Notice that the model contains multiple sets of clusters and classifiers. Each cluster set and classifier pair is responsible for providing scaling behaviors for a subset of the CU, engine frequency, and memory frequency parameter space. For example, the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8. The exact number of sets and classifier pairs in a model depends on the hardware configurations that appear in the training set. The scaling information from these cluster sets allows scaling from the base to any other target configuration.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Wu with the above teachings of Dimitrov for the same rationale stated at Claim 12.
Regarding claim 23 Wu teaches a method for determining a derived counter value based on a hardware performance counter of a processor, the method comprising: determining the derived counter value by applying a model to the hardware performance counter value using the counter engine, wherein: the derived counter value predicts performance attributes of the processor, and the model is trained based on a comparison between exemplary pairs of hardware performance counter values and corresponding derived counter values; and communicating the derived counter value to a hardware control circuit (Wu, pgs. 567-568, left-column, see also figs. 2, 3, 4, 5, 6, and 9, “The model construction and usage flow are depicted in Fig. 2. The construction algorithm uses a training data set containing execution times and performance counter values collected from executing training kernels on real hardware. The values in the training set are shown in Fig. 3. For each training kernel, execution times and performance counter values across a range of hardware configurations are stored in the training set. The performance counter values collected while executing each training kernel on the base hardware configuration are also stored…[i]n the second phase, a classifier is constructed to predict which cluster’s scaling behavior best describes a new kernel based on its performance counter values…[f]ig. 6 gives a detailed view of the model architecture. Notice that the model contains multiple sets of clusters and classifiers. Each cluster set and classifier pair is responsible for providing scaling behaviors for a subset of the CU, engine frequency, and memory frequency parameter space. For example, the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8. 
The exact number of sets and classifier pairs in a model depends on the hardware configurations that appear in the training set. The scaling information from these cluster sets allows scaling from the base to any other target configuration” & see Wu, pg. 570, left-column, “After forming representative clusters, the next step is to build classifiers, which are implemented using neural networks, that can map kernels to clusters using performance counter values… [o]ne neural network is built and trained per cluster set. The features used as inputs to the neural networks are listed in Table II… [t]he neural network topology is shown in Fig. 9. The neural network outputs one value, between 0 and 1, per cluster in its cluster set. The cluster with the highest output value is selected as the chosen cluster for the kernel… [a]fter construction, the neural network is used to select which clusters best describe the scaling behavior of the kernel… [t]he centroids of the selected clusters are used as the kernel’s scaling values. The predicted target configuration execution time is then calculated by multiplying the base hardware configuration execution time by the appropriate scaling values….” Wu teaches fig. 6 and a classifier is constructed to predict which cluster’s scaling behavior best describes a new kernel based on its performance counter values (i.e. determining the derived counter value by applying a model to the hardware performance counter value using the counter engine) Each cluster set is responsible for providing scaling behaviors for a subset of the CU, engine frequency, and memory frequency parameter space, for example, the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8 (i.e. 
wherein: the derived counter value predicts performance attributes of the processor); the model construction and usage flow are depicted in Fig. 2. The construction algorithm uses a training data set containing execution times and performance counter values collected from executing training kernels on real hardware. The values in the training set are shown in Fig. 3. For each training kernel, execution times and performance counter values across a range of hardware configurations are stored in the training set (i.e. the model is trained based on a comparison between exemplary pairs of hardware performance counter values and corresponding derived counter values); and the predicted target configuration execution time is then calculated by multiplying the base hardware configuration execution time by the appropriate scaling values (i.e. communicating the derived counter value to a hardware control circuit)).
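For illustration only, the prediction step quoted from Wu above (the classifier outputs one value between 0 and 1 per cluster, the cluster with the highest output is selected, and the base execution time is multiplied by that cluster's scaling value) can be sketched as follows. This is a minimal sketch under stated assumptions, not Wu's implementation: the function names, the use of a single scaling value per cluster for one target configuration, and the numeric inputs are all hypothetical.

```python
from typing import Sequence


def select_cluster(classifier_outputs: Sequence[float]) -> int:
    """Return the index of the cluster with the highest classifier output."""
    if not classifier_outputs:
        raise ValueError("classifier produced no outputs")
    return max(range(len(classifier_outputs)), key=lambda i: classifier_outputs[i])


def predict_execution_time(
    base_time: float,
    classifier_outputs: Sequence[float],
    centroid_scaling_values: Sequence[float],
) -> float:
    """Scale the base-configuration execution time by the chosen cluster's centroid.

    Assumption: each cluster contributes one scaling value for the single
    target configuration of interest.
    """
    cluster = select_cluster(classifier_outputs)
    return base_time * centroid_scaling_values[cluster]


# Hypothetical inputs: three clusters, one classifier output per cluster.
outputs = [0.1, 0.7, 0.2]      # values between 0 and 1, one per cluster
centroids = [2.0, 0.5, 1.0]    # made-up scaling value for each cluster
predicted = predict_execution_time(10.0, outputs, centroids)
```

With these made-up inputs the second cluster is selected, so the predicted execution time is the base time multiplied by that cluster's scaling value.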
Wu does not teach: inputting a hardware performance counter value to a counter engine. 
However, Dimitrov teaches: inputting a hardware performance counter value to a counter engine (Dimitrov, para. 0017, “The multiprocessing unit includes performance monitoring counters (PMs), comprising logic circuits configured to measure different performance-related values in real-time. In one embodiment, PMs may be configured to monitor at least one of a memory request counter, a memory system bandwidth utilization, a memory system storage capacity utilization, a cache hit rate, a count of instructions executed per clock cycle for one or more threads of a multithreaded program, and a count of instructions executed for one or more threads of the multithreaded program.”). 
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Wu’s method in view of Dimitrov. The motivation to do so would be to have performance monitoring counters (PMs) for monitoring multithreaded applications (Dimitrov, para. 0017, “The multiprocessing unit includes performance monitoring counters (PMs), comprising logic circuits configured to measure different performance-related values in real-time. In one embodiment, PMs may be configured to monitor at least one of a memory request counter, a memory system bandwidth utilization, a memory system storage capacity utilization, a cache hit rate, a count of instructions executed per clock cycle for one or more threads of a multithreaded program, and a count of instructions executed for one or more threads of the multithreaded program.”).
Regarding claim 24, Wu in view of Dimitrov teaches the method of claim 23, wherein the hardware control circuit comprises an operating system scheduler, a memory controller, a power manager, a data prefetcher, or a cache controller (Dimitrov, para. 0025, “[T]he one or more operating parameters include at least one of a maximum number of concurrently executing threads, a maximum number of active processing cores, a tile caching enable/disable flag, a core clock frequency, a memory interface clock frequency, and a core operating voltage.” Note: It is being interpreted that the memory interface clock frequency represents a data prefetcher).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Wu with the above teachings of Dimitrov for the same rationale stated at Claim 23.
Regarding claim 25, Wu in view of Dimitrov teaches the method of claim 23, further comprising dynamically changing the model during operation of the processor(Dimitrov, para. 0020, “Furthermore, different portions of a given application can have different model parameters [for a neural network]. Model parameters can be loaded into the neural network subsystem prior to launching the application, and the model parameters can be updated as the application executes.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Wu with the above teachings of Dimitrov for the same rationale stated at Claim 23.
Regarding claim 26, Wu in view of Dimitrov teaches the method of claim 23, wherein the model comprises or is generated by an artificial neural network (ANN) (Dimitrov, para. 0030, fig. 1C(124), “FIG. 1C illustrates an exemplary neural network 124, configured to implement one or more aspects of one embodiment.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Wu with the above teachings of Dimitrov for the same rationale stated at Claim 23.
Regarding claim 29, Wu in view of Dimitrov teaches the method of claim 23, wherein the derived counter value (Wu, pgs. 567-568, right-column, see also figs. 2, 3, 4, 5, 6, and 9, “In the second phase, a classifier is constructed to predict which cluster’s scaling behavior best describes a new kernel based on its performance counter values…[f]ig. 6 gives a detailed view of the model architecture. Notice that the model contains multiple sets of clusters and classifiers. Each cluster set and classifier pair is responsible for providing scaling behaviors for a subset of the CU, engine frequency, and memory frequency parameter space. For example, the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8. The exact number of sets and classifier pairs in a model depends on the hardware configurations that appear in the training set. The scaling information from these cluster sets allows scaling from the base to any other target configuration.”) indicates a predicted execution time for a portion of a program executing on the processor (Wu, pg. 567, fig. 2, “Once the model is constructed, it can be used to predict the performance of new kernels, from outside the training set, at any target hardware configuration within the range of the training data. To make a prediction, the kernel’s performance counter values and base execution time must first be gathered by executing it on the base hardware configuration. These are then passed to the model, along with the desired target hardware configuration, which will output a predicted execution time at that target configuration.” Wu teaches: will output a predicted execution time at that target configuration (i.e. indicates a predicted execution time for a portion of a program executing on the processor)).
Regarding claim 31, Wu in view of Dimitrov teaches the method of claim 23, wherein the derived counter value indicates a predicted memory address, a predicted power requirement, or a predicted frequency requirement (Wu, pgs. 567-568, right-column, see also figs. 2, 3, 4, 5, 6, and 9, “In the second phase, a classifier is constructed to predict which cluster’s scaling behavior best describes a new kernel based on its performance counter values…[f]ig. 6 gives a detailed view of the model architecture. Notice that the model contains multiple sets of clusters and classifiers. Each cluster set and classifier pair is responsible for providing scaling behaviors for a subset of the CU, engine frequency, and memory frequency parameter space. For example, the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8. The exact number of sets and classifier pairs in a model depends on the hardware configurations that appear in the training set. The scaling information from these cluster sets allows scaling from the base to any other target configuration.” Wu teaches: the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8 (i.e. wherein the derived counter value indicates a predicted frequency requirement)).
Regarding claim 33, Wu in view of Dimitrov teaches the method of claim 23, further comprising determining a power or frequency of the processor based (Dimitrov, para. 0025, “[T]he control unit includes a machine learning model configured to receive the performance monitor values as inputs and to update [i.e., predict] the one or more operating parameters as outputs during execution of the multithreaded application…the one or more operating parameters include at least one of a maximum number of concurrently executing threads, a maximum number of active processing cores, a tile caching enable/disable flag, a core clock frequency, a memory interface clock frequency, and a core operating voltage.” Note: It is being interpreted that the core clock frequency represents the frequency of the processor and the core operating voltage represents managing power) on the derived counter value (Wu, pgs. 567-568, right-column, see also figs. 2, 3, 4, 5, 6, and 9, “In the second phase, a classifier is constructed to predict which cluster’s scaling behavior best describes a new kernel based on its performance counter values…[f]ig. 6 gives a detailed view of the model architecture. Notice that the model contains multiple sets of clusters and classifiers. Each cluster set and classifier pair is responsible for providing scaling behaviors for a subset of the CU, engine frequency, and memory frequency parameter space. For example, the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8. The exact number of sets and classifier pairs in a model depends on the hardware configurations that appear in the training set. The scaling information from these cluster sets allows scaling from the base to any other target configuration.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Wu with the above teachings of Dimitrov for the same rationale stated at Claim 23.
Regarding claim 34, Wu teaches instructions stored on a non-transitory computer-readable medium which when executed by a processor cause the processor to determine a derived counter value based on a hardware performance counter by: determining the derived counter value by applying a model to the hardware performance counter value using the counter engine; wherein: the derived counter value predicts performance attributes of the processor, and the model is trained based on a comparison between exemplary pairs of hardware performance counter values and corresponding derived counter values; and communicating the derived counter value to a hardware control circuit (Wu, pgs. 567-568, left-column, see also figs. 2, 3, 4, 5, 6, and 9, “The model construction and usage flow are depicted in Fig. 2. The construction algorithm uses a training data set containing execution times and performance counter values collected from executing training kernels on real hardware. The values in the training set are shown in Fig. 3. For each training kernel, execution times and performance counter values across a range of hardware configurations are stored in the training set. The performance counter values collected while executing each training kernel on the base hardware configuration are also stored…[i]n the second phase, a classifier is constructed to predict which cluster’s scaling behavior best describes a new kernel based on its performance counter values…[f]ig. 6 gives a detailed view of the model architecture. Notice that the model contains multiple sets of clusters and classifiers. Each cluster set and classifier pair is responsible for providing scaling behaviors for a subset of the CU, engine frequency, and memory frequency parameter space. For example, the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. 
This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8. The exact number of sets and classifier pairs in a model depends on the hardware configurations that appear in the training set. The scaling information from these cluster sets allows scaling from the base to any other target configuration” & see Wu, pg. 570, left-column, “After forming representative clusters, the next step is to build classifiers, which are implemented using neural networks, that can map kernels to clusters using performance counter values… [o]ne neural network is built and trained per cluster set. The features used as inputs to the neural networks are listed in Table II… [t]he neural network topology is shown in Fig. 9. The neural network outputs one value, between 0 and 1, per cluster in its cluster set. The cluster with the highest output value is selected as the chosen cluster for the kernel… [a]fter construction, the neural network is used to select which clusters best describe the scaling behavior of the kernel… [t]he centroids of the selected clusters are used as the kernel’s scaling values. The predicted target configuration execution time is then calculated by multiplying the base hardware configuration execution time by the appropriate scaling values….” Wu teaches fig. 6 and a classifier is constructed to predict which cluster’s scaling behavior best describes a new kernel based on its performance counter values (i.e. determining the derived counter value by applying a model to the hardware performance counter value using the counter engine) Each cluster set is responsible for providing scaling behaviors for a subset of the CU, engine frequency, and memory frequency parameter space, for example, the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8 (i.e. 
wherein: the derived counter value predicts performance attributes of the processor); the model construction and usage flow are depicted in Fig. 2. The construction algorithm uses a training data set containing execution times and performance counter values collected from executing training kernels on real hardware. The values in the training set are shown in Fig. 3. For each training kernel, execution times and performance counter values across a range of hardware configurations are stored in the training set (i.e. the model is trained based on a comparison between exemplary pairs of hardware performance counter values and corresponding derived counter values); and the predicted target configuration execution time is then calculated by multiplying the base hardware configuration execution time by the appropriate scaling values (i.e. communicating the derived counter value to a hardware control circuit)).
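For illustration only, the training-set structure quoted from Wu above (execution times stored per training kernel across a range of hardware configurations) can be sketched as follows. This is a minimal sketch under stated assumptions, not Wu's construction algorithm: it assumes a scaling value is the ratio of execution time at a configuration to execution time at the base configuration, and that a cluster centroid is the per-configuration mean over the kernels in the cluster. The kernel names, configuration tuples, and timings are hypothetical.

```python
from statistics import mean

# Hypothetical configuration key: (CU count, engine MHz, memory MHz).
BASE = (8, 1000, 1375)
TARGET = (8, 500, 1375)


def scaling_values(times_by_config: dict, base_config: tuple) -> dict:
    """Express each configuration's execution time relative to the base time."""
    base_time = times_by_config[base_config]
    return {cfg: t / base_time for cfg, t in times_by_config.items()}


def centroid(kernel_scalings: list) -> dict:
    """Average the scaling values of the kernels grouped into one cluster."""
    configs = kernel_scalings[0].keys()
    return {cfg: mean(k[cfg] for k in kernel_scalings) for cfg in configs}


# Made-up execution times (arbitrary units) for two training kernels.
training_times = {
    "kernel_a": {BASE: 4.0, TARGET: 8.0},
    "kernel_b": {BASE: 2.0, TARGET: 3.0},
}

per_kernel = [scaling_values(t, BASE) for t in training_times.values()]
cluster_centroid = centroid(per_kernel)
```

Under these assumptions, multiplying a new kernel's base execution time by the centroid value for a target configuration yields the kind of predicted execution time described in the quoted passage.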
Wu does not teach: inputting a hardware performance counter value to a counter engine.
However, Dimitrov teaches: inputting a hardware performance counter value to a counter engine(Dimitrov, para. 0017, “The multiprocessing unit includes performance monitoring counters (PMs), comprising logic circuits configured to measure different performance-related values in real-time. In one embodiment, PMs may be configured to monitor at least one of a memory request counter, a memory system bandwidth utilization, a memory system storage capacity utilization, a cache hit rate, a count of instructions executed per clock cycle for one or more threads of a multithreaded program, and a count of instructions executed for one or more threads of the multithreaded program.”). 
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Wu’s medium in view of Dimitrov. The motivation to do so would be to have performance monitoring counters (PMs) for monitoring multithreaded applications (Dimitrov, para. 0017, “The multiprocessing unit includes performance monitoring counters (PMs), comprising logic circuits configured to measure different performance-related values in real-time. In one embodiment, PMs may be configured to monitor at least one of a memory request counter, a memory system bandwidth utilization, a memory system storage capacity utilization, a cache hit rate, a count of instructions executed per clock cycle for one or more threads of a multithreaded program, and a count of instructions executed for one or more threads of the multithreaded program.”).
Regarding claim 35, Wu in view of Dimitrov teaches the instructions of claim 34, wherein the hardware control circuit comprises an operating system scheduler, a memory controller, a power manager, a data prefetcher, or a cache controller (Dimitrov, para. 0025, “[T]he one or more operating parameters include at least one of a maximum number of concurrently executing threads, a maximum number of active processing cores, a tile caching enable/disable flag, a core clock frequency, a memory interface clock frequency, and a core operating voltage.” Note: It is being interpreted that the memory interface clock frequency represents a data prefetcher).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Wu with the above teachings of Dimitrov for the same rationale stated at Claim 34.
Regarding claim 36, Wu in view of Dimitrov teaches the instructions of claim 34, further comprising instructions for dynamically changing the model during operation of the processor (Dimitrov, para. 0020, “Furthermore, different portions of a given application can have different model parameters [for a neural network]. Model parameters can be loaded into the neural network subsystem prior to launching the application, and the model parameters can be updated as the application executes.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Wu with the above teachings of Dimitrov for the same rationale stated at Claim 34.
Regarding claim 37, Wu in view of Dimitrov teaches the instructions of claim 34, wherein the model comprises or is generated by an artificial neural network (ANN) (Dimitrov, para. 0030, fig. 1C(124), “FIG. 1C illustrates an exemplary neural network 124, configured to implement one or more aspects of one embodiment.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Wu with the above teachings of Dimitrov for the same rationale stated at Claim 34.
Regarding claim 40, Wu in view of Dimitrov teaches the instructions of claim 34, wherein the derived counter value (Wu, pgs. 567-568, right-column, see also figs. 2, 3, 4, 5, 6, and 9, “In the second phase, a classifier is constructed to predict which cluster’s scaling behavior best describes a new kernel based on its performance counter values…[f]ig. 6 gives a detailed view of the model architecture. Notice that the model contains multiple sets of clusters and classifiers. Each cluster set and classifier pair is responsible for providing scaling behaviors for a subset of the CU, engine frequency, and memory frequency parameter space. For example, the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8. The exact number of sets and classifier pairs in a model depends on the hardware configurations that appear in the training set. The scaling information from these cluster sets allows scaling from the base to any other target configuration.”) indicates a predicted execution time for a portion of a program executing on the processor (Wu, pg. 567, fig. 2, “Once the model is constructed, it can be used to predict the performance of new kernels, from outside the training set, at any target hardware configuration within the range of the training data. To make a prediction, the kernel’s performance counter values and base execution time must first be gathered by executing it on the base hardware configuration. These are then passed to the model, along with the desired target hardware configuration, which will output a predicted execution time at that target configuration.” Wu teaches: will output a predicted execution time at that target configuration (i.e. indicates a predicted execution time for a portion of a program executing on the processor)).
Regarding claim 42, Wu in view of Dimitrov teaches the instructions of claim 34, wherein the derived counter value indicates a predicted memory address, a predicted power requirement, or a predicted frequency requirement (Wu, pgs. 567-568, right-column, see also figs. 2, 3, 4, 5, 6, and 9, “In the second phase, a classifier is constructed to predict which cluster’s scaling behavior best describes a new kernel based on its performance counter values…[f]ig. 6 gives a detailed view of the model architecture. Notice that the model contains multiple sets of clusters and classifiers. Each cluster set and classifier pair is responsible for providing scaling behaviors for a subset of the CU, engine frequency, and memory frequency parameter space. For example, the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8. The exact number of sets and classifier pairs in a model depends on the hardware configurations that appear in the training set. The scaling information from these cluster sets allows scaling from the base to any other target configuration.” Wu teaches: the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8 (i.e. wherein the derived counter value indicates a predicted frequency requirement)).
Regarding claim 44, Wu in view of Dimitrov teaches the instructions of claim 34, further comprising instructions for determining a power or frequency of the processor (Dimitrov, para. 0025, “[T]he control unit includes a machine learning model configured to receive the performance monitor values as inputs and to update [i.e., predict] the one or more operating parameters as outputs during execution of the multithreaded application…the one or more operating parameters include at least one of a maximum number of concurrently executing threads, a maximum number of active processing cores, a tile caching enable/disable flag, a core clock frequency, a memory interface clock frequency, and a core operating voltage.” Note: It is being interpreted that the core clock frequency represents the frequency of the processor and the core operating voltage represents managing power) based on the derived counter value (Wu, pgs. 567-568, right-column, see also figs. 2, 3, 4, 5, 6, and 9, “In the second phase, a classifier is constructed to predict which cluster’s scaling behavior best describes a new kernel based on its performance counter values…[f]ig. 6 gives a detailed view of the model architecture. Notice that the model contains multiple sets of clusters and classifiers. Each cluster set and classifier pair is responsible for providing scaling behaviors for a subset of the CU, engine frequency, and memory frequency parameter space. For example, the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8. The exact number of sets and classifier pairs in a model depends on the hardware configurations that appear in the training set. The scaling information from these cluster sets allows scaling from the base to any other target configuration.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Wu with the above teachings of Dimitrov for the same rationale stated at Claim 34.
Regarding claim 45,  Wu teaches a system comprising: a processor(Wu, pg. 570, left-column, “In order to validate the accuracy of our performance and power prediction models, we execute a collection of OpenCL…applications on a real GPU while varying its hardware configuration…[w]e used an AMD RadeonTM HD 7970 GPU as our test platform. By default, this GPU has 32 compute units (2048 execution units), which can run at up to 1 GHz, and 12 channels of GDDR5 memory running at 1375 MHz (yielding 264 GB/s of DRAM bandwidth).”); counter engine circuitry configured to determine a derived counter value based on applying a model to the hardware performance counter value, wherein: the derived counter value predicts performance attributes of the processor, and the model is trained based on a comparison between exemplary pairs of hardware performance counter values and corresponding derived counter values; and output circuitry configured to communicate the derived counter value to a hardware control circuit of the processor(Wu, pgs. 567-568, left-column, see also figs. 2, 3, 4, 5, 6, and 9, “The model construction and usage flow are depicted in Fig. 2. The construction algorithm uses a training data set containing execution times and performance counter values collected from executing training kernels on real hardware. The values in the training set are shown in Fig. 3. For each training kernel, execution times and performance counter values across a range of hardware configurations are stored in the training set. The performance counter values collected while executing each training kernel on the base hardware configuration are also stored…[i]n the second phase, a classifier is constructed to predict which cluster’s scaling behavior best describes a new kernel based on its performance counter values…[f]ig. 6 gives a detailed view of the model architecture. Notice that the model contains multiple sets of clusters and classifiers. 
Each cluster set and classifier pair is responsible for providing scaling behaviors for a subset of the CU, engine frequency, and memory frequency parameter space. For example, the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8. The exact number of sets and classifier pairs in a model depends on the hardware configurations that appear in the training set. The scaling information from these cluster sets allows scaling from the base to any other target configuration” & see Wu, pg. 570, left-column, “After forming representative clusters, the next step is to build classifiers, which are implemented using neural networks, that can map kernels to clusters using performance counter values… [o]ne neural network is built and trained per cluster set. The features used as inputs to the neural networks are listed in Table II… [t]he neural network topology is shown in Fig. 9. The neural network outputs one value, between 0 and 1, per cluster in its cluster set. The cluster with the highest output value is selected as the chosen cluster for the kernel… [a]fter construction, the neural network is used to select which clusters best describe the scaling behavior of the kernel… [t]he centroids of the selected clusters are used as the kernel’s scaling values. The predicted target configuration execution time is then calculated by multiplying the base hardware configuration execution time by the appropriate scaling values….” Wu teaches fig. 6 and a classifier is constructed to predict which cluster’s scaling behavior best describes a new kernel based on its performance counter values (i.e. 
counter engine circuitry configured to determine the derived counter value by applying a model to the hardware performance counter value). Each cluster set is responsible for providing scaling behaviors for a subset of the CU, engine frequency, and memory frequency parameter space; for example, the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8 (i.e. wherein: the derived counter value predicts performance attributes of the processor). The model construction and usage flow are depicted in Fig. 2. The construction algorithm uses a training data set containing execution times and performance counter values collected from executing training kernels on real hardware. The values in the training set are shown in Fig. 3. For each training kernel, execution times and performance counter values across a range of hardware configurations are stored in the training set (i.e. wherein the model is trained based on a comparison between exemplary pairs of hardware performance counter values and corresponding derived counter values). The predicted target configuration execution time is then calculated by multiplying the base hardware configuration execution time by the appropriate scaling values (i.e. and output circuitry configured to communicate the derived counter value to a hardware control circuit)).
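For illustration only, the prediction step described in the quoted passages of Wu (the classifier outputs one value per cluster, the cluster with the highest output is chosen, and the base execution time is multiplied by that cluster's scaling value) can be sketched as follows. This is a minimal sketch and not Wu's implementation: the function names, the use of a single cluster set, the scalar centroids, and the numeric values are all hypothetical assumptions made for the example.

```python
# Hypothetical sketch of the prediction flow quoted from Wu.
# Assumptions (not from Wu): one cluster set, one scalar scaling
# value (centroid) per cluster, and made-up numbers.

def select_cluster(classifier_outputs):
    """Return the index of the cluster with the highest classifier output."""
    return max(range(len(classifier_outputs)),
               key=lambda i: classifier_outputs[i])

def predict_execution_time(base_time, classifier_outputs, centroids):
    """Scale the base-configuration execution time by the chosen centroid."""
    cluster = select_cluster(classifier_outputs)
    return base_time * centroids[cluster]

# Example: three clusters; the classifier favors cluster 1.
outputs = [0.1, 0.7, 0.2]      # one value between 0 and 1 per cluster
centroids = [1.0, 0.5, 0.25]   # hypothetical scaling values
predicted = predict_execution_time(8.0, outputs, centroids)
```

In this sketch the predicted value plays the role that the rejection maps to the claimed derived counter value; how Wu combines scaling values across multiple cluster sets is not modeled here.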
Wu does not teach: input circuitry configured to input a hardware performance counter value from the processor.
However, Dimitrov teaches: input circuitry configured to input a hardware performance counter value from the processor (Dimitrov, para. 0017, “The multiprocessing unit includes performance monitoring counters (PMs), comprising logic circuits configured to measure different performance-related values in real-time. In one embodiment, PMs may be configured to monitor at least one of a memory request counter, a memory system bandwidth utilization, a memory system storage capacity utilization, a cache hit rate, a count of instructions executed per clock cycle for one or more threads of a multithreaded program, and a count of instructions executed for one or more threads of the multithreaded program.”). 
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Wu’s system in view of Dimitrov. The motivation to do so would be to have performance monitoring counters (PMs) for monitoring multithreaded applications (Dimitrov, para. 0017, “The multiprocessing unit includes performance monitoring counters (PMs), comprising logic circuits configured to measure different performance-related values in real-time. In one embodiment, PMs may be configured to monitor at least one of a memory request counter, a memory system bandwidth utilization, a memory system storage capacity utilization, a cache hit rate, a count of instructions executed per clock cycle for one or more threads of a multithreaded program, and a count of instructions executed for one or more threads of the multithreaded program.”).
Regarding claim 46, Wu in view of Dimitrov teaches the system of claim 45, wherein the hardware control circuit comprises an operating system scheduler, a memory controller, a power manager, a data prefetcher, or a cache controller (Dimitrov, para. 0025, “[T]he one or more operating parameters include at least one of a maximum number of concurrently executing threads, a maximum number of active processing cores, a tile caching enable/disable flag, a core clock frequency, a memory interface clock frequency, and a core operating voltage.” Note: It is being interpreted that the memory interface clock frequency represents a data prefetcher).14
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Wu with the above teachings of Dimitrov for the same rationale stated at Claim 45.
Regarding claim 47, Wu in view of Dimitrov teaches the system of claim 45, wherein the model comprises or is generated by an artificial neural network (ANN) (Dimitrov, para. 0030, fig. 1C(124), “FIG. 1C illustrates an exemplary neural network 124, configured to implement one or more aspects of one embodiment.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Wu with the above teachings of Dimitrov for the same rationale stated at Claim 45.
Regarding claim 48, Wu in view of Dimitrov teaches the system of claim 45, wherein the derived counter value (Wu, pgs. 567-568, right-column, see also figs. 2, 3, 4, 5, 6, and 9, “In the second phase, a classifier is constructed to predict which cluster’s scaling behavior best describes a new kernel based on its performance counter values…[f]ig. 6 gives a detailed view of the model architecture. Notice that the model contains multiple sets of clusters and classifiers. Each cluster set and classifier pair is responsible for providing scaling behaviors for a subset of the CU, engine frequency, and memory frequency parameter space. For example, the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8. The exact number of sets and classifier pairs in a model depends on the hardware configurations that appear in the training set. The scaling information from these cluster sets allows scaling from the base to any other target configuration.”) indicates a predicted execution time for a portion of a program executing on the processor (Wu, pg. 567, fig. 2, “Once the model is constructed, it can be used to predict the performance of new kernels, from outside the training set, at any target hardware configuration within the range of the training data. To make a prediction, the kernel’s performance counter values and base execution time must first be gathered by executing it on the base hardware configuration. These are then passed to the model, along with the desired target hardware configuration, which will output a predicted execution time at that target configuration.” Wu teaches: will output a predicted execution time at that target configuration (i.e. indicates a predicted execution time for a portion of a program executing on the processor)).
Regarding claim 49, Wu in view of Dimitrov teaches the system of claim 45, wherein the derived counter value indicates a predicted memory address, a predicted power requirement, or a predicted frequency requirement (Wu, pgs. 567-568, right-column, see also figs. 2, 3, 4, 5, 6, and 9, “In the second phase, a classifier is constructed to predict which cluster’s scaling behavior best describes a new kernel based on its performance counter values…[f]ig. 6 gives a detailed view of the model architecture. Notice that the model contains multiple sets of clusters and classifiers. Each cluster set and classifier pair is responsible for providing scaling behaviors for a subset of the CU, engine frequency, and memory frequency parameter space. For example, the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8. The exact number of sets and classifier pairs in a model depends on the hardware configurations that appear in the training set. The scaling information from these cluster sets allows scaling from the base to any other target configuration.” Wu teaches: the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8 (i.e. wherein the derived counter value indicates a predicted frequency requirement)).15
Regarding claim 50, Wu in view of Dimitrov teaches the system of claim 45, wherein the counter engine is disposed on the processor (Dimitrov, paras. 0026-0027, fig. 1B(110,112, 120, 122, 114), “As shown, the processing system 110 includes a multiprocessing unit 112 and a control unit 120… [in] one embodiment, the multiprocessing unit 112 and the control unit 120 are fabricated within a common integrated circuit die, such as a GPU die… [t]he control unit 120 implements a machine learning model 122, configured to receive the monitor values 114 as inputs.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Wu with the above teachings of Dimitrov for the same rationale stated at Claim 45.
Claims 3, 6, 14, 17, 28, and 39 are rejected under 35 U.S.C. 103 as being unpatentable over Wu, Gene, et al. "GPGPU performance and power estimation using machine learning." 2015 IEEE 21st international symposium on high performance computer architecture (HPCA). IEEE, 2015 (“Wu”) in view of Dimitrov et al. US 2019/0213775 A1 (“Dimitrov”) and in view of Sayadi et al. "Machine learning-based approaches for energy-efficiency prediction and scheduling in composite cores architectures." 2017 IEEE International Conference on Computer Design (ICCD). IEEE, 2017 (“Sayadi”).
Regarding dependent claim 3, Wu in view of Dimitrov teaches the processor of claim 1, but does not teach circuitry configured to dynamically select a new model from stored models during operation of the processor.
However, Sayadi teaches circuitry configured to dynamically select a new model from stored models during operation of the processor (Sayadi, pg. 2, right-column, “[As] Fig. 3 depicts[,] our three-stage approach for predicting the right core type and application configuration when running a multithreaded application on composite cores architecture. Our machine learning-based approach begins from extracting microarchitectural data (referred as feature extraction), from different parallel regions of application to characterize the multithreaded workload. These data (or features) include the hardware performance counter data, which are representative of application behavior at run-time. Next, a machine learning based predictor (that is built off-line) takes in these features and predicts the best configuration settings for a given parallel region. For this purpose, we have implemented five well known machine learning algorithms and compare them in terms of accuracy, power and area overhead to find to the most effective learning model which yields in optimized energy efficiency. Finally, we configure the processor and schedule the application to run on the predicted configuration.” Note: It is being interpreted that implementing five well-known machine learning algorithms and comparing them in terms of accuracy, power, and area overhead to find the most effective learning model to configure the processor and scheduler represents the limitation of: circuitry configured to dynamically select a new model from stored models during operation of the processor).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Wu in view of Dimitrov and in view of Sayadi. The motivation to do so would be to devise an optimal scheduling system for a composite core architecture (CCA) and multithreaded applications (Sayadi, pg. 2, left-column, “Our study is focused on a CCA where many little cores (base) can be configured into few big cores (composed) and vice versa… [g]iven the dispersed pattern of optimum configuration, we develop various machine learning models to predict the energy-efficiency of parallel regions, and guide scheduling and fine-tuning parameters to maximize the energy efficiency.”).
Regarding dependent claim 6, Wu in view of Dimitrov teaches the processor of claim 1, but does not teach wherein the model comprises a regression model.
However, Sayadi teaches wherein the model comprises a regression model (Sayadi, pg. 5, right-column, “As mentioned earlier, in this work we implement different machine learning models to estimate the EDP. Table IV shows five machine learning models that we use for predicting the best processor and application configuration to deliver the lowest EDP. These models include least square median, linear regression, Multi-layer Perceptron (an artificial neural network model), and two decision tree techniques namely REPTree and M5Tree.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Wu in view of Dimitrov with the above teachings of Sayadi for the same rationale stated at dependent claim 3.
Referring to dependent claims 14 and 17, they are rejected on the same basis as dependent claims 3 and 6 since they are analogous claims.
Referring to dependent claim 28, it is rejected on the same basis as dependent claim 6 since it is an analogous claim.
Referring to dependent claim 39, it is rejected on the same basis as dependent claim 6 since it is an analogous claim.
Claims 5, 15, 28, and 38 are rejected under 35 U.S.C. 103 as being unpatentable over Wu, Gene, et al. "GPGPU performance and power estimation using machine learning." 2015 IEEE 21st international symposium on high performance computer architecture (HPCA). IEEE, 2015 (“Wu”) in view of Dimitrov et al. US 2019/0213775 A1 (“Dimitrov”) and in view of Cummins, et al. "End-to-end deep learning of optimization heuristics." 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT). IEEE, 2017 (“Cummins”).
Regarding dependent claim 5, Wu in view of Dimitrov teaches the processor of claim 4, but does not teach wherein the ANN comprises at least one of a convolutional neural network (CNN), a recurrent neural network (RNN), or a combination of a CNN or an RNN with a fully connected neural network. 
However, Cummins teaches wherein the ANN comprises at least one of a convolutional neural network (CNN), a recurrent neural network (RNN), or a combination of a CNN or an RNN with a fully connected neural network (Cummins, pg. 222, left-column, “We use the […] Long Short-Term Memory (LSTM) architecture […] for sequence characterization. LSTMs implements a Recurrent Neural Network in which the activations of neurons are learned with respect not just to their current inputs, but to previous inputs in a sequence…We use a two layer LSTM network. The network receives a sequence of embedding vectors, and returns a single output vector, characterizing the entire sequence [as detailed by figure 4].” & see also Cummins, pg. 222, right-column, “The final component of DeepTune is comprised of two fully connected neural network layers. The first layer consists of 32 neurons. The second layer consists of a single neuron for each possible heuristic decision [as detailed by figure 4].”).16
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Wu in view of Dimitrov and in view of Cummins. The motivation to do so would be to develop a better heuristic for device mapping and thread coarsening without the need for expert intervention regarding optimization (Cummins, pg. 220, left-column, “We evaluated our approach on two problems: heterogeneous device mapping and GPU thread coarsening. Good heuristics for these two problems are important for extracting performance from heterogeneous systems, and the fact that machine learning has been used before for heuristic construction for these problems allows direct comparison. Prior machine learning approaches resulted in good heuristics which extracted 73% and 79% of the available performance respectively but required extensive human effort to select the appropriate features. Nevertheless, our approach was able to outperform them by 14% and 12%, which indicates a better identification of important program characteristics, without any expert help.”).
Referring to dependent claims 15, 28, and 38, they are rejected on the same basis as dependent claim 5 since they are analogous claims.
Claims 8, 19, 30, and 41 are rejected under 35 U.S.C. 103 as being unpatentable over Wu, Gene, et al. "GPGPU performance and power estimation using machine learning." 2015 IEEE 21st international symposium on high performance computer architecture (HPCA). IEEE, 2015 (“Wu”) in view of Dimitrov et al. US 2019/0213775 A1 (“Dimitrov”) and in view of Zheng, et al. "Integrating profile-driven parallelism detection and machine-learning-based mapping." ACM Transactions on Architecture and Code Optimization (TACO) 11.1 (2014) (“Zheng”).
Regarding dependent claim 8, Wu in view of Dimitrov teaches the processor of claim 1, further comprising based on the derived counter value(Wu, pgs. 567-568, right-column, see also figs. 2, 3, 4, 5, 6, and 9, “In the second phase, a classifier is constructed to predict which cluster’s scaling behavior best describes a new kernel based on its performance counter values…[f]ig. 6 gives a detailed view of the model architecture. Notice that the model contains multiple sets of clusters and classifiers. Each cluster set and classifier pair is responsible for providing scaling behaviors for a subset of the CU, engine frequency, and memory frequency parameter space. For example, the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8. The exact number of sets and classifier pairs in a model depends on the hardware configurations that appear in the training set. The scaling information from these cluster sets allows scaling from the base to any other target configuration.”). 
Wu in view of Dimitrov does not teach: circuitry configured to determine whether to execute a portion of a program serially or in parallel.
However, Zheng teaches: circuitry configured to determine whether to execute a portion of a program serially or in parallel (Zheng, pg. 2, sec. Overview, fig. 4, fig. 5, “Our approach integrates profile-driven parallelism detection and machine-learning-based mapping into a single framework. We use profiling data to extract actual control and data dependence and enhance the corresponding static analysis with dynamic information. Subsequently, we apply an offline trained machine learning-based prediction mechanism to each parallel loop candidate and decide if and how the parallel mapping should be performed.”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Wu in view of Dimitrov and in view of Zheng. The motivation to do so would be to automate the difficult task of parallelizing sequential code rather than having expensive expert programmers do so (Zheng, pgs. 1-2, “Multicore computing systems are widely seen as the most viable means of delivering performance with increasing transistor densities…[h]owever, this potential cannot be realized unless the application has been well parallelized. Unfortunately, efficient parallelization of a sequential program is a challenging and error-prone task. It is widely acknowledged that manual parallelization by expert programmers results in the most efficient parallel implementation but is a costly and time-consuming approach. Parallelizing compiler technology, on the other hand, has the potential to greatly reduce this cost.”).
Regarding claim 19, Wu in view of Dimitrov teaches the prediction unit of claim 12, further comprising based on the derived counter value(Wu, pgs. 567-568, right-column, see also figs. 2, 3, 4, 5, 6, and 9, “In the second phase, a classifier is constructed to predict which cluster’s scaling behavior best describes a new kernel based on its performance counter values…[f]ig. 6 gives a detailed view of the model architecture. Notice that the model contains multiple sets of clusters and classifiers. Each cluster set and classifier pair is responsible for providing scaling behaviors for a subset of the CU, engine frequency, and memory frequency parameter space. For example, the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8. The exact number of sets and classifier pairs in a model depends on the hardware configurations that appear in the training set. The scaling information from these cluster sets allows scaling from the base to any other target configuration.”). 
Wu in view of Dimitrov does not teach: circuitry configured to determine whether to execute a portion of a program serially or in parallel.
However, Zheng teaches: circuitry configured to determine whether to execute a portion of a program serially or in parallel (Zheng, pg. 2, sec. Overview, fig. 4, fig. 5, “Our approach integrates profile-driven parallelism detection and machine-learning-based mapping into a single framework. We use profiling data to extract actual control and data dependence and enhance the corresponding static analysis with dynamic information. Subsequently, we apply an offline trained machine learning-based prediction mechanism to each parallel loop candidate and decide if and how the parallel mapping should be performed.”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Wu in view of Dimitrov and in view of Zheng. The motivation to do so would be to automate the difficult task of parallelizing sequential code rather than having expensive expert programmers do so (Zheng, pgs. 1-2, “Multicore computing systems are widely seen as the most viable means of delivering performance with increasing transistor densities…[h]owever, this potential cannot be realized unless the application has been well parallelized. Unfortunately, efficient parallelization of a sequential program is a challenging and error-prone task. It is widely acknowledged that manual parallelization by expert programmers results in the most efficient parallel implementation but is a costly and time-consuming approach. Parallelizing compiler technology, on the other hand, has the potential to greatly reduce this cost.”).
Regarding claim 30, Wu in view of Dimitrov teaches the method of claim 23, further comprising based on the derived counter value(Wu, pgs. 567-568, right-column, see also figs. 2, 3, 4, 5, 6, and 9, “In the second phase, a classifier is constructed to predict which cluster’s scaling behavior best describes a new kernel based on its performance counter values…[f]ig. 6 gives a detailed view of the model architecture. Notice that the model contains multiple sets of clusters and classifiers. Each cluster set and classifier pair is responsible for providing scaling behaviors for a subset of the CU, engine frequency, and memory frequency parameter space. For example, the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8. The exact number of sets and classifier pairs in a model depends on the hardware configurations that appear in the training set. The scaling information from these cluster sets allows scaling from the base to any other target configuration.”). 
Wu in view of Dimitrov does not teach: determining whether to execute a portion of a program serially or in parallel.
However, Zheng teaches: determining whether to execute a portion of a program serially or in parallel (Zheng, pg. 2, sec. Overview, fig. 4, fig. 5, “Our approach integrates profile-driven parallelism detection and machine-learning-based mapping into a single framework. We use profiling data to extract actual control and data dependence and enhance the corresponding static analysis with dynamic information. Subsequently, we apply an offline trained machine learning-based prediction mechanism to each parallel loop candidate and decide if and how the parallel mapping should be performed.”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Wu in view of Dimitrov and in view of Zheng. The motivation to do so would be to automate the difficult task of parallelizing sequential code rather than having expensive expert programmers do so (Zheng, pgs. 1-2, “Multicore computing systems are widely seen as the most viable means of delivering performance with increasing transistor densities…[h]owever, this potential cannot be realized unless the application has been well parallelized. Unfortunately, efficient parallelization of a sequential program is a challenging and error-prone task. It is widely acknowledged that manual parallelization by expert programmers results in the most efficient parallel implementation but is a costly and time-consuming approach. Parallelizing compiler technology, on the other hand, has the potential to greatly reduce this cost.”).
Regarding claim 41, Wu in view of Dimitrov teaches the instructions of claim 34, further comprising based on the derived counter value(Wu, pgs. 567-568, right-column, see also figs. 2, 3, 4, 5, 6, and 9, “In the second phase, a classifier is constructed to predict which cluster’s scaling behavior best describes a new kernel based on its performance counter values…[f]ig. 6 gives a detailed view of the model architecture. Notice that the model contains multiple sets of clusters and classifiers. Each cluster set and classifier pair is responsible for providing scaling behaviors for a subset of the CU, engine frequency, and memory frequency parameter space. For example, the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8. The exact number of sets and classifier pairs in a model depends on the hardware configurations that appear in the training set. The scaling information from these cluster sets allows scaling from the base to any other target configuration.”). 
Wu in view of Dimitrov does not teach: instructions for determining whether to execute a portion of a program serially or in parallel.
However, Zheng teaches: instructions for determining whether to execute a portion of a program serially or in parallel (Zheng, pg. 2, sec. Overview, fig. 4, fig. 5, “Our approach integrates profile-driven parallelism detection and machine-learning-based mapping into a single framework. We use profiling data to extract actual control and data dependence and enhance the corresponding static analysis with dynamic information. Subsequently, we apply an offline trained machine learning-based prediction mechanism to each parallel loop candidate and decide if and how the parallel mapping should be performed.”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Wu in view of Dimitrov and in view of Zheng. The motivation to do so would be to automate the difficult task of parallelizing sequential code rather than having expensive expert programmers do so (Zheng, pgs. 1-2, “Multicore computing systems are widely seen as the most viable means of delivering performance with increasing transistor densities…[h]owever, this potential cannot be realized unless the application has been well parallelized. Unfortunately, efficient parallelization of a sequential program is a challenging and error-prone task. It is widely acknowledged that manual parallelization by expert programmers results in the most efficient parallel implementation but is a costly and time-consuming approach. Parallelizing compiler technology, on the other hand, has the potential to greatly reduce this cost.”).
Claims 10, 21, 32, and 43 are rejected under 35 U.S.C. 103 as being unpatentable over Wu, Gene, et al. "GPGPU performance and power estimation using machine learning." 2015 IEEE 21st international symposium on high performance computer architecture (HPCA). IEEE, 2015 (“Wu”) in view of Dimitrov et al. US 2019/0213775 A1 (“Dimitrov”) and in view of Song et al. "A simplified and accurate model of power-performance efficiency on emergent GPU architectures." 2013 IEEE 27th International Symposium on Parallel and Distributed Processing. IEEE, 2013 (“Song”).
Regarding claim 10, Wu in view of Dimitrov teaches the processor of claim 1, further comprising based on the derived counter value(Wu, pgs. 567-568, right-column, see also figs. 2, 3, 4, 5, 6, and 9, “In the second phase, a classifier is constructed to predict which cluster’s scaling behavior best describes a new kernel based on its performance counter values…[f]ig. 6 gives a detailed view of the model architecture. Notice that the model contains multiple sets of clusters and classifiers. Each cluster set and classifier pair is responsible for providing scaling behaviors for a subset of the CU, engine frequency, and memory frequency parameter space. For example, the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8. The exact number of sets and classifier pairs in a model depends on the hardware configurations that appear in the training set. The scaling information from these cluster sets allows scaling from the base to any other target configuration.”).  
Wu in view of Dimitrov does not teach: circuitry configured to determine an address for a memory access.
However, Song teaches: circuitry configured to determine an address for a memory access (Song, pg. 683, sec. Identifying Potential Performance Bottlenecks, fig. 6, fig. 15, Fig. 15 details that a’s use of global memory was optimized in c when global memory usage was reduced by coalescing memory access and using shared memory units and then in d in which shared memory bank conflicts were eliminated. Note: It is being interpreted that the memory optimization from a to c to d represents circuitry configured to determine an address for a memory access).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Wu in view of Dimitrov and in view of Song. The motivation to do so would be to decrease performance bottlenecks through the effective usage of GPU memory and performance counters (Song, pg. 674, sec. I Introduction, fig. 2, “We believe GPU power models must be simpler, more accurate, and applicable to emergent systems...[f]urthermore, such models should lend themselves to use at runtime and provide enough insight to isolate both power and performance bottlenecks despite their simplicity. We propose an approach (see Fig. 2) that relies on GPU performance counter data to estimate energy use on a real system without the need of external power metering hardware or simulation.”).
Regarding claim 21, Wu in view of Dimitrov teaches the prediction unit of claim 12, further comprising based on the derived counter value (Wu, pgs. 567-568, right-column, see also figs. 2, 3, 4, 5, 6, and 9, “In the second phase, a classifier is constructed to predict which cluster’s scaling behavior best describes a new kernel based on its performance counter values…[f]ig. 6 gives a detailed view of the model architecture. Notice that the model contains multiple sets of clusters and classifiers. Each cluster set and classifier pair is responsible for providing scaling behaviors for a subset of the CU, engine frequency, and memory frequency parameter space. For example, the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8. The exact number of sets and classifier pairs in a model depends on the hardware configurations that appear in the training set. The scaling information from these cluster sets allows scaling from the base to any other target configuration.”).
Wu in view of Dimitrov does not teach: circuitry configured to determine an address for a memory access.
However, Song teaches: circuitry configured to determine an address for a memory access (Song, pg. 683, sec. Identifying Potential Performance Bottlenecks, fig. 6, fig. 15, Fig. 15 details that a’s use of global memory was optimized in c when global memory usage was reduced by coalescing memory access and using shared memory units and then in d in which shared memory bank conflicts were eliminated. Note: It is being interpreted that the memory optimization from a to c to d represents circuitry configured to determine an address for a memory access).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Wu in view of Dimitrov and in view of Song. The motivation to do so would be to decrease performance bottlenecks through the effective usage of GPU memory and performance counters (Song, pg. 674, sec. I Introduction, fig. 2, “We believe GPU power models must be simpler, more accurate, and applicable to emergent systems...[f]urthermore, such models should lend themselves to use at runtime and provide enough insight to isolate both power and performance bottlenecks despite their simplicity. We propose an approach (see Fig. 2) that relies on GPU performance counter data to estimate energy use on a real system without the need of external power metering hardware or simulation.”).
Regarding claim 32, Wu in view of Dimitrov teaches the method of claim 23, further comprising based on the derived counter value (Wu, pgs. 567-568, right-column, see also figs. 2, 3, 4, 5, 6, and 9, “In the second phase, a classifier is constructed to predict which cluster’s scaling behavior best describes a new kernel based on its performance counter values…[f]ig. 6 gives a detailed view of the model architecture. Notice that the model contains multiple sets of clusters and classifiers. Each cluster set and classifier pair is responsible for providing scaling behaviors for a subset of the CU, engine frequency, and memory frequency parameter space. For example, the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8. The exact number of sets and classifier pairs in a model depends on the hardware configurations that appear in the training set. The scaling information from these cluster sets allows scaling from the base to any other target configuration.”).
Wu in view of Dimitrov does not teach: circuitry configured to determine an address for a memory access.
However, Song teaches: circuitry configured to determine an address for a memory access (Song, pg. 683, sec. Identifying Potential Performance Bottlenecks, fig. 6, fig. 15, Fig. 15 details that a’s use of global memory was optimized in c when global memory usage was reduced by coalescing memory access and using shared memory units and then in d in which shared memory bank conflicts were eliminated. Note: It is being interpreted that the memory optimization from a to c to d represents circuitry configured to determine an address for a memory access).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Wu in view of Dimitrov and in view of Song. The motivation to do so would be to decrease performance bottlenecks through the effective usage of GPU memory and performance counters (Song, pg. 674, sec. I Introduction, fig. 2, “We believe GPU power models must be simpler, more accurate, and applicable to emergent systems...[f]urthermore, such models should lend themselves to use at runtime and provide enough insight to isolate both power and performance bottlenecks despite their simplicity. We propose an approach (see Fig. 2) that relies on GPU performance counter data to estimate energy use on a real system without the need of external power metering hardware or simulation.”).
Regarding claim 43, Wu in view of Dimitrov teaches the instructions of claim 34, further comprising based on the derived counter value (Wu, pgs. 567-568, right-column, see also figs. 2, 3, 4, 5, 6, and 9, “In the second phase, a classifier is constructed to predict which cluster’s scaling behavior best describes a new kernel based on its performance counter values…[f]ig. 6 gives a detailed view of the model architecture. Notice that the model contains multiple sets of clusters and classifiers. Each cluster set and classifier pair is responsible for providing scaling behaviors for a subset of the CU, engine frequency, and memory frequency parameter space. For example, the top cluster set in Fig. 6 provides information about the scaling behavior when CU count is 8. This set provides performance scaling behavior when engine and memory frequencies are varied and CU count is fixed at 8. The exact number of sets and classifier pairs in a model depends on the hardware configurations that appear in the training set. The scaling information from these cluster sets allows scaling from the base to any other target configuration.”).
Wu in view of Dimitrov does not teach: instructions for determining an address for a memory access.
However, Song teaches: instructions for determining an address for a memory access (Song, pg. 683, sec. Identifying Potential Performance Bottlenecks, fig. 6, fig. 15, Fig. 15 details that a’s use of global memory was optimized in c when global memory usage was reduced by coalescing memory access and using shared memory units and then in d in which shared memory bank conflicts were eliminated. Note: It is being interpreted that the memory optimization from a to c to d represents instructions for determining an address for a memory access).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Wu in view of Dimitrov and in view of Song. The motivation to do so would be to decrease performance bottlenecks through the effective usage of GPU memory and performance counters (Song, pg. 674, sec. I Introduction, fig. 2, “We believe GPU power models must be simpler, more accurate, and applicable to emergent systems...[f]urthermore, such models should lend themselves to use at runtime and provide enough insight to isolate both power and performance bottlenecks despite their simplicity. We propose an approach (see Fig. 2) that relies on GPU performance counter data to estimate energy use on a real system without the need of external power metering hardware or simulation.”).

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Adam Clark Standke, whose telephone number is (571) 270-1806. The examiner can normally be reached 10AM-7PM M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael J Huntley, can be reached at (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



Adam Clark Standke
Assistant Examiner
Art Unit 2129
/MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129

1-16 According to the broadest reasonable interpretation (BRI), the use of alternative language amounts to the claim requiring one or more elements but not all.