DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
The action is responsive to the applicant’s amendment filed on 03/22/2021. Claims 1-21 remain pending in the application. Applicant’s amendments to claims 1, 7-8, 10, 15, 17-18, and 20, and new claim 21, have been considered. Applicant’s amendments and arguments regarding claims 1, 15, and 20 are persuasive; however, upon reconsideration, a new rejection has been made for independent claims 1, 15, and 20. See the rejection under 35 U.S.C. 103 below.

Response to Argument
In response to the applicant’s argument for claim 8 on Remarks page 9, that “the cited Pineiro appears to disparage the approach, due to its increased manufacturing cost”, the argument has not been found persuasive.
Examiner respectfully disagrees. While Pineiro’s approach might increase manufacturing cost, it still provides an efficiency that one of ordinary skill in the art would be motivated to use: when a low-precision approximation of a function such as the reciprocal is performed, only a 14-bit binary mantissa is required in the result, whereas a high-precision estimate of the reciprocal function requires a 23-bit mantissa, as shown in figure 5 and paragraph [0024]. Computer systems often involve tradeoffs; providing both low and high precision gives the system the flexibility to use whichever precision suits the function being performed, which is efficient for the system.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 5-7, 10, 13-18, and 20-21 are rejected under 35 U.S.C. 103 as being unpatentable over Yu (US 2017/0083287 A1) in view of the Introduction to Computer Architecture slides (NPL), Azadet (US 2014/0086361 A1), and Christopher (NPL – Optimizing SIMT Architecture using CUDA).

Regarding claim 1, Yu teaches a processor (Yu, Fig. 1, CPU) that uses a series of instructions to compute a result for each of the plurality of data element values using a polynomial function to approximate a complex function (Yu, [0076-0077]; figures 3A and 3B show coefficients a1-c4 of the second-order polynomial expressions, the coefficients corresponding to variables x1, x2, x3…, and the second-order polynomials are used to approximate the elementary function). The series of instructions uses coefficients stored in a lookup location for the complex function (Yu, [0077]; the coefficients are obtained from the LUT as shown in figure 3B), data element values within different data element value ranges use different sets of coefficients (Yu, [0080]; the variable X is split into non-uniform variable sections, which correspond to different sets of coefficients as shown in figures 4A and 12), and results of the computation are stored in the output location (Yu, [0066]; the memory device may store data being processed by the GPU, such as results of arithmetic operations). Yu teaches that either the CPU or the GPU of Fig. 1 may be used to perform the computation (Yu, [0065]) and that either the CPU or GPU can access the lookup table (Yu, [0067]). Taken together, these portions of Yu suggest that either the CPU or GPU can be used to perform the complex function approximation.
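For illustration only, the kind of lookup-table-driven approximation described in the cited portions of Yu can be sketched as follows. This is a minimal sketch and not Yu's implementation: the section boundaries, the three-point fitting procedure, and the names `fit_quadratic`, `build_lut`, `approx`, `BOUNDS` and `LUT` are all assumptions made for this example. The reciprocal is used as the target because it is among the operations Yu lists.

```python
import bisect

def fit_quadratic(f, lo, hi):
    """Coefficients (c0, c1, c2) of the quadratic in x' = x - lo that
    matches f at lo, at the midpoint, and at hi (an assumed fitting method)."""
    h = (hi - lo) / 2.0
    f0, f1, f2 = f(lo), f(lo + h), f(hi)
    c0 = f0
    c2 = (f2 - 2.0 * f1 + f0) / (2.0 * h * h)
    c1 = (f1 - f0) / h - c2 * h
    return (c0, c1, c2)

def build_lut(f, boundaries):
    """One coefficient set per variable section [boundaries[i], boundaries[i+1])."""
    return [fit_quadratic(f, lo, hi) for lo, hi in zip(boundaries, boundaries[1:])]

def approx(x, boundaries, lut):
    """Select the section containing x, fetch its coefficients, evaluate the polynomial."""
    idx = bisect.bisect_right(boundaries, x) - 1
    idx = min(max(idx, 0), len(lut) - 1)   # clamp to the first/last section
    c0, c1, c2 = lut[idx]
    xp = x - boundaries[idx]               # offset of x within its section
    return c0 + xp * (c1 + xp * c2)

# Hypothetical non-uniform sections: narrower where 1/x changes fastest.
BOUNDS = [1.0, 1.125, 1.25, 1.5, 2.0]
LUT = build_lut(lambda v: 1.0 / v, BOUNDS)
```

In this sketch, different input ranges select different coefficient sets from the table, which is the behavior the rejection maps to the claimed lookup location.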
Yu describes the CPU at a high level (a functional block) and does not disclose the internals of the CPU. Accordingly, Yu does not teach decode circuitry and execution circuitry as claimed.
The Introduction to Computer Architecture slides (ICA Slides) teach the ordinary stages of a processor pipeline, including decode and execution stages (ICA slides #17).
It would have been obvious to one of ordinary skill in the art before the effective filing date to fill in the “gap” in Yu’s disclosure and modify the CPU of Yu to have the ordinary features of a processor (i.e., a decode stage/circuit and an execution stage/circuit). This modification would have been obvious because it is merely combining prior art elements (the standard stages of a processor pipeline) according to known methods (techniques taught in a basic computer architecture class) to yield predictable results (an operative general purpose processor). 
As modified, the combination of Yu in view of the ICA Slides teaches a CPU having a decode circuit and an execution circuit in which a series of instructions is used to compute a result for each of the plurality of data element values using a polynomial function to approximate a complex function.
Azadet teaches a processor having an instruction set with user-defined non-linear functions for approximating complex functions using polynomial functions (Azadet, [0007]). Azadet describes how digital pre-distortion may be applied using at least one instruction; that is, at least a single instruction performs a complex function approximation.
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the processor of the combination of Yu in view of the ICA Slides to use a single instruction, as taught by Azadet, to respond to and implement the complex function approximation. This modification would have been obvious because using a single instruction to perform the approximation of the complex function would be easier for an assembly-level programmer, provide good code density, and improve the program through the use of a complex instruction, as described in ICA slide #6 comparing CISC and RISC. In the resulting system, the processor of the combination of Yu in view of the ICA Slides and Azadet could either provide hardware support for implementing the instruction or trap the instruction (as described in ICA Slide #15) and then execute the existing functional software of Yu in order to emulate the single instruction.
As modified, the combination of Yu in view of ICA and Azadet teaches a system having a decode circuit and an execution circuit in which an instruction is used to compute a result for each of the plurality of data element values using a polynomial function to approximate a complex function.
The combined system of Yu in view of ICA and Azadet also teaches that the GPU executes various types of graphics processing pipelines, such as a compute unified device architecture (CUDA) (Yu, [0065]); however, the combined system of Yu in view of ICA and Azadet does not explicitly disclose that the computing for the plurality of data element values is performed simultaneously by a plurality of threads of the execution circuitry.
Christopher teaches a method of performing parallel computation using CUDA on a GPU, in particular on Nvidia GPUs, whose parallel architecture is based on the SIMT architecture (Christopher, section “Introduction”). In SIMT (Single Instruction Multiple Threading), multiple threads are processed by a single instruction in lock-step, and each thread executes the same instruction, but possibly on different data.
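For illustration only, the lock-step behavior attributed to SIMT above can be emulated sequentially as follows. This is a sketch under stated assumptions: the function name `run_warp` and the representation of an instruction as a Python callable are hypothetical, and the code models only the "same instruction, different data" ordering, not real parallel hardware or Christopher's method.

```python
def run_warp(program, lane_data):
    """Emulate SIMT lock-step execution.

    program   : list of instructions, each modeled as a one-argument callable
    lane_data : one data element per thread (lane)
    """
    regs = list(lane_data)
    for instr in program:                  # a single instruction is issued at a time
        regs = [instr(r) for r in regs]    # every thread applies it to its own data
    return regs

# Two instructions applied by three threads, each holding a different value.
result = run_warp([lambda r: r * 2, lambda r: r + 1], [1, 2, 3])
```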


Regarding claim 2, the combined system of Yu in view of the ICA slides, Azadet and Christopher discloses the invention as in parent claim above, including that an operation code (opcode) of the instruction specifies a transcendental function for which the approximation is performed (Yu, figures 16-21, [0015]; the arithmetic operations include a square root function, an inverse square root function, a reciprocal function, a log function, an exponential function, a power series function, and a trigonometric function).

Regarding claim 3, the combined system of Yu in view of the ICA slides, Azadet and Christopher discloses the invention as in parent claim above, including the lookup location stores mappings, each being between a data element value range and a set of coefficients for the polynomial function to approximate the complex function (Yu, [0100]; figure 7 illustrates the LUT that includes the set of coefficients corresponding to the address item, which is the variable section).

Regarding claim 5, the combined system of Yu in view of the ICA slides, Azadet and Christopher discloses the invention as in parent claim above, including the width of the data element value range is a fixed value for each of the plurality of data element values (Yu, [0049]; the width of the variable section can be uniform and have a fixed value as shown in figure 15, or the width of the variable section can be non-uniform as shown in figure 8).

Regarding claim 6, the combined system of Yu in view of the ICA slides, Azadet and Christopher discloses the invention as in parent claim above, including the execution includes to compare a data element value with a plurality of mappings in parallel for range determination of the data element value (Yu, [0105], figure 8; the LUT addressing 820 compares the variable x to determine the address or section to which x should be mapped. See also [0146]: in examples where multiple processors are used, the hardware component may have any one or more of different processing configurations, for example parallel processors).
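For illustration only, range determination by comparing one value against every mapping can be sketched as follows. The name `match_range`, the half-open range convention, and the example ranges are assumptions for this sketch and are not taken from Yu; the comparisons are written so that each is independent of the others, which is what would allow hardware to evaluate them in parallel, although Python evaluates them one after another.

```python
def match_range(x, ranges):
    """Return the index of the (lo, hi) range containing x, or None.

    Each comparison depends only on x and one range, so the comparisons
    have no ordering dependency among themselves.
    """
    hits = [lo <= x < hi for lo, hi in ranges]   # one comparison per mapping
    return hits.index(True) if True in hits else None

# Hypothetical non-uniform sections.
SECTIONS = [(1.0, 1.25), (1.25, 1.5), (1.5, 2.0)]
section_of_1_3 = match_range(1.3, SECTIONS)
```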

Regarding claim 7, the combined system of Yu in view of the ICA slides, Azadet and Christopher discloses the invention as in parent claim above, including wherein the instruction comprises an additional operand specifying the lookup location (Azadet, [0008]; the user-specified parameter may comprise a look-up table storing values of the non-linear function for a finite number of input values).

Regarding claim 10, the combined system of Yu in view of the ICA slides, Azadet and Christopher discloses the invention as in parent claim above, including the instruction comprises an additional operand specifying the complex function to be approximated (Yu, [0007]; performing an arithmetic operation by a processing apparatus includes determining a polynomial expression approximating an arithmetic operation to be performed on a variable; see [0068] for the list of arithmetic operations that can be performed).

Regarding claim 13, the combined system of Yu in view of the ICA slides, Azadet and Christopher discloses the invention as in parent claim above, including the execution circuitry comprises two or more computing units, each for executing the instruction using a warp or a thread (Yu, [0147]; the operations of figures 1-22 are performed by computing hardware, for example by one or more processors or computers, executing instructions as described above).

Regarding claim 14, the combined system of Yu in view of the ICA slides, Azadet and Christopher discloses the invention as in parent claim above, including the processor is a graphics processing unit (GPU) (Yu, figure 1 shows that the processing apparatus is the GPU).

Regarding claim 15, claim 15 is a method that corresponds to the apparatus of claim 1; therefore, it is rejected for the same reasons as claim 1.

Regarding claim 16, the same teaching that is used for claim 3 can be applied equally to teach claim 16.

Regarding claim 17, the same teaching that is used for claim 7 can be applied equally to teach claim 17.

Regarding claim 18, the same teaching that is used for claim 10 can be applied equally to teach claim 18.

Regarding claim 20, claim 20 is a non-transitory machine readable medium storing an instruction corresponding to the apparatus of claim 1; therefore, it is rejected for the same reasons as claim 1.
Regarding claim 21, the combined system of Yu in view of the ICA slides, Azadet and Christopher discloses the invention as in parent claim above, including the computing for the plurality of data element values comprises a plurality of iterations, and wherein a result from an earlier iteration for each of the plurality of data element values is an input to the earlier iteration's immediate next iteration along with exponentiation of a corresponding data element value (Yu, figure 10, [0112]; the computation uses the coefficients obtained from the LUTs and the variable X. Calculator 140 performs multiply-accumulation in which a result of C0 [i.e. a result from an earlier iteration] is accumulated with the product (C1 X') [i.e. the earlier iteration’s immediate next iteration]; the result of C0 + C1 X' is then accumulated with C2 X^2. Note that when C0 is input to the earlier iteration’s immediate next iteration, the variable X^0 [i.e. exponentiation of a corresponding data element value] is also input).
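For illustration only, the iteration-by-iteration accumulation described above can be sketched as follows. The function name `mac_polynomial` and the returned trace are hypothetical and introduced only for this example; the sketch mirrors the mapping in the rejection (each iteration takes the previous result plus a coefficient times a power of the variable) and is not a reproduction of Yu's calculator 140.

```python
def mac_polynomial(coeffs, x):
    """Evaluate c0*x**0 + c1*x**1 + c2*x**2 + ... by repeated multiply-accumulate.

    Returns the final value and the running result after each iteration.
    """
    acc = 0.0
    power = 1.0                  # x**0 for the first iteration
    trace = []
    for c in coeffs:
        acc = acc + c * power    # earlier result accumulated with coefficient * power of x
        trace.append(acc)
        power *= x               # next power of x for the next iteration
    return acc, trace

# Coefficients (C0, C1, C2) = (1, 2, 3) evaluated at x = 2.
value, steps = mac_polynomial([1.0, 2.0, 3.0], 2.0)
```

Here `steps` holds the intermediate results C0, C0 + C1 x, and C0 + C1 x + C2 x^2 in order, so each entry is the input to the next iteration.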

Claims 4 and 8-9 are rejected under 35 U.S.C. 103 as being unpatentable over Yu in view of ICA slides, Azadet, and Christopher as applied to claim 1 above, and further in view of Pineiro (US 2014/0222883).

Regarding claim 4, the combined system of Yu in view of the ICA slides, Azadet, and Christopher discloses the invention substantially as claimed. See the rejection of claim 1 above. The combined system of Yu in view of ICA slides and Azadet further discloses that as the variable sections are reduced, the precision of the approximation is increased (Yu, [0126]). However, the system of Yu in view of ICA slides and Azadet does not explicitly disclose a first data element value range, which requires a first computational precision, shares low order coefficients for the polynomial function with a second data element value range that requires a second and lower computational precision.
Pineiro teaches a first data element value range, which requires a first computational precision, shares low order coefficients for the polynomial function with a second data element value range that requires a second and lower computational precision (Pineiro, [0022, 0023], figure 3; the math circuit is able to compute low-precision and high-precision estimates, and the low-precision estimate shares the coefficient (C1, C2) table of the high-precision estimate).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combined system of Yu in view of ICA slides and Azadet to have higher order coefficients for the polynomial function as the variable sections are reduced, and to have the low-precision estimate share the coefficients of the high-precision estimate as disclosed in Pineiro. The modification can be made by reusing a portion of a lookup table in the high-precision estimate that was generated to produce the coefficients of a higher order polynomial (Pineiro, [0014]). This modification would have been obvious because Pineiro explicitly discloses an implementation that approximates at low precision by sharing the look-up table of the high-precision estimate (Pineiro, [0014]). Doing this would allow more efficient use of chip real estate, as recognized by Pineiro in [0014].

Regarding claim 8, the combined system of Yu in view of the ICA slides, Azadet and Christopher discloses the invention substantially as claimed. See the rejection of claim 1 above. The combined system of Yu in view of ICA slides and Azadet, and further in view of Pineiro as described for claim 4, also discloses the invention, including the instruction comprises an additional operand specifying a precision requirement for the computation (Pineiro, [0012]; the new processor family may be expected to implement various precisions of a particular math function).

Regarding claim 9, the combined system of Yu in view of ICA slides, Azadet, and Christopher and further in view of Pineiro discloses the invention as in parent claim above, including a first precision requirement causes a first set of coefficients to be used for the computation and a second and different precision requirement causes a second set of coefficients to be used (Pineiro, figure 3; the high precision circuit 12 uses the set of coefficients C0, C1, C3 to obtain higher precision, and the low precision circuit 11 uses the different set of coefficients C0, C1 to obtain lower precision).
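For illustration only, the idea of a low-precision estimate reusing the low-order entries of a table built for a higher-order, high-precision estimate can be sketched as follows. The table layout, the coefficient values (a truncated series for 1/(1+t) near t = 0), and the names `ROW`, `estimate_low` and `estimate_high` are assumptions for this sketch; Pineiro's actual circuits and table contents are not reproduced here.

```python
# One shared table row (c0, c1, c2): an assumed truncated series for 1/(1 + t).
ROW = (1.0, -1.0, 1.0)

def estimate_low(t, row=ROW):
    """Lower-precision estimate: reads only the low-order coefficients c0, c1."""
    c0, c1, _ = row
    return c0 + c1 * t

def estimate_high(t, row=ROW):
    """Higher-precision estimate: reads the same c0, c1 plus the extra term c2."""
    c0, c1, c2 = row
    return c0 + t * (c1 + t * c2)

t = 0.1
exact = 1.0 / (1.0 + t)
err_low = abs(estimate_low(t) - exact)
err_high = abs(estimate_high(t) - exact)
```

Both estimates read from the same stored row, so no separate low-precision table is needed in this sketch; the higher-order term only tightens the error.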

Claims 11-12 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Yu in view of ICA slides, Azadet, and Christopher as applied to claim 1 above, and further in view of Koster (US 2017/0316307).

Regarding claim 11, the combined system of Yu in view of the ICA slides, Azadet, and Christopher discloses the invention substantially as claimed. See the rejection of claim 1 above. However, the system of Yu in view of ICA slides and Azadet does not explicitly disclose that the complex function is an activation function of a neural network.
Koster teaches the complex function is an activation function of a neural network (Koster, [0032]; the activation of the neural network is obtained by applying a function, e.g., a non-linear sigmoid function. As in the applicant’s disclosure, [00250], complex functions are widely used in today’s computation, and an activation function may be a complex function, e.g., a sigmoid function).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combined system of Yu in view of ICA slides and Azadet to apply the approximation of the complex function to the sigmoid function, since the sigmoid function is calculated using an exponential operation, and to use it as the activation function of the neural network. Doing so would allow 

Regarding claim 12, the combined system of Yu in view of the ICA slides, Azadet, and Christopher discloses the invention substantially as claimed. See the rejection of claim 1 above. However, the system of Yu in view of ICA slides and Azadet does not explicitly disclose that the plurality of data element values is from a tile of two-dimensional data elements.
Koster teaches the plurality of data element values is from a tile of two-dimensional data elements (Koster, [0019]; the input tensor may be two-dimensional, and a tensor typically comprises a plurality of values).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the plurality of inputs of the combined system of Yu in view of ICA slides and Azadet to be a two-dimensional input on which the approximation is performed. Doing so would allow the system to increase and maintain a high level of precision when performing computation for the neural network, as recognized by Koster [0002].

Regarding claim 19, the same teaching that is used for claim 11 can be applied equally to teach claim 19.
Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HUY DUONG whose telephone number is (571)272-2764.  The examiner can normally be reached on Mon-Friday 7:30-5:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Aimee Li can be reached on 571-272-4169.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

/HUY DUONG/ Examiner, Art Unit 2182 (571)272-2764

/Aimee Li/            Supervisory Patent Examiner, Art Unit 2183