DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment
This action is responsive to the applicant’s amendment filed on 07/21/2021. Claims 1-7 and 9-21 remain pending in the application.

Claim Objections
Claims 1, 15, and 20 are objected to because of the following informalities:
In claims 1, 15, and 20, at lines 5, 5, and 7, respectively, “the computation” should be “a computation”.
Appropriate correction is required.

Claim Rejections - 35 USC § 112(a)
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

Claims 7, 10, 17, and 18 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.
Claims 7 and 17 recite “the instruction comprises an additional operand specifying the lookup location”. Specification [0285] describes the instruction as comprising operand1_output, operand2_input, and operand3_lookup. The specification does not describe a single instruction that contains both the operand specifying a precision requirement for the computation and an additional operand specifying the lookup location.

Claims 10 and 18 recite “the instruction comprises an additional operand specifying the complex function to be approximated”. Specification [0286] describes the instruction as comprising operand1_output, operand2_input, and operand3_complex. The specification does not describe a single instruction that contains both the operand specifying a precision requirement for the computation and an additional operand specifying the complex function to be approximated.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-7, 9-10, 13-18, and 20-21 are rejected under 35 U.S.C. 103 as being unpatentable over Yu (US 2017/0083287 A1) in view of Pineiro (US 2014/0222883), Azadet (US 2014/0086361 A1), Saulsbury (US 2002/0019928), and Christopher (NPL: Optimizing SIMT Architecture using CUDA).

Regarding claim 1, Yu teaches a processor (Yu, Fig. 1, CPU) that uses a series of instructions to compute a result for each of the plurality of data element values using a polynomial function to approximate a complex function (Yu, [0076-0077]: figures 3A and 3B show coefficients a1-c4 of second-order polynomial expressions, the coefficients corresponding to variables x1, x2, x3…, and the second-order polynomial is used to approximate the elementary function). The series of instructions uses coefficients stored in a lookup location for the complex function (Yu, [0077]: the coefficients are obtained from the LUT as shown in figure 3B), data element values within different data element value ranges use different sets of coefficients (Yu, [0080]: the variable X is split into non-uniform variable sections, which correspond to different sets of coefficients as shown in figures 4A and 12), and results of the computation are stored in the output location (Yu, [0066]: the memory device may store data being processed by the GPU, such as results of arithmetic operations). Yu teaches that either the CPU or the GPU of Fig. 1 may be used to perform the computation (Yu, [0065]) and that either the CPU or the GPU can access the lookup table (Yu, [0067]). Taken together, these portions of Yu suggest that either the CPU or the GPU can be used to perform the complex function approximation.
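For illustration only, the lookup-based approximation described above can be sketched as follows. This is a minimal sketch under stated assumptions, not Yu's implementation: the names BOUNDS, build_lut, and approx_sqrt, the section boundaries, and the choice of the square root as the elementary function are all hypothetical, and the coefficients are derived here from a second-order Taylor expansion at each section midpoint simply to populate a table.

```python
import bisect
import math

# Hypothetical non-uniform section boundaries for x in [1, 4).
BOUNDS = [1.0, 1.25, 1.5, 2.0, 3.0, 4.0]


def build_lut(bounds):
    """Build one (midpoint, c0, c1, c2) entry per section for f(x) = sqrt(x).

    The coefficients are Taylor coefficients at the section midpoint; this is
    an assumption made for the sketch, not a method taken from the references.
    """
    lut = []
    for lo, hi in zip(bounds, bounds[1:]):
        m = (lo + hi) / 2.0
        c0 = math.sqrt(m)                   # f(m)
        c1 = 0.5 / math.sqrt(m)             # f'(m)
        c2 = -0.125 / (m * math.sqrt(m))    # f''(m) / 2
        lut.append((m, c0, c1, c2))
    return lut


LUT = build_lut(BOUNDS)


def approx_sqrt(x):
    """Approximate sqrt(x) with the second-order polynomial of x's section."""
    if not BOUNDS[0] <= x < BOUNDS[-1]:
        raise ValueError("x is outside the tabulated range")
    idx = bisect.bisect_right(BOUNDS, x) - 1   # section containing x
    m, c0, c1, c2 = LUT[idx]
    d = x - m
    return c0 + c1 * d + c2 * d * d


def approx_all(values):
    """Compute a result for each of a plurality of data element values."""
    return [approx_sqrt(v) for v in values]
```

Narrower sections are placed where the function curves more sharply, which is one plausible reason to prefer non-uniform sections over uniform ones.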
Yu describes the CPU at a high level (a functional block) and does not disclose the internals of the CPU. Accordingly, Yu does not teach decode circuitry and execution circuitry as claimed.
Pineiro teaches a processor that comprises decoding circuitry and execution circuitry to execute the decoded instruction (Pineiro, figure 5) and a math circuit (Pineiro, figure 1, [0016-0017], illustrating that the approximation can be performed with at least a lower precision and a higher precision, wherein computing circuits 2 and 5 share or reuse the lookup table, and coefficients are selected based on the desired precision. For example, for a high-precision estimate, coefficients C0, C1, and C2 are selected, and for a low-precision estimate, only C0 and C1 are selected).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Yu’s system to reuse or share the coefficient table for different precision requirements as disclosed in Pineiro. This modification would have been obvious because both Yu and Pineiro disclose systems that approximate a complex function using lookup tables, and reusing or sharing the coefficients for approximations at different precisions allows more efficient use of chip real estate, as recognized by Pineiro in [0014]. In addition, selecting different coefficient sets for different precision requirements to best fit the approximation would improve the efficiency of the overall system, because lower-precision computations use fewer bits than higher-precision computations and therefore require less computation time.
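The precision-dependent coefficient selection discussed above can be illustrated with a short sketch. It is hedged throughout: SHARED_COEFFS and estimate are hypothetical names, the coefficient values are the Taylor coefficients of exp(x) at 0 chosen only for the example, and nothing here is taken from Pineiro's circuit.

```python
import math

# Hypothetical shared table entry (C0, C1, C2): Taylor coefficients of
# exp(x) at x = 0, used only to give the sketch concrete numbers.
SHARED_COEFFS = (1.0, 1.0, 0.5)


def estimate(x, precision):
    """Evaluate the polynomial with a precision-dependent coefficient subset.

    "high" uses C0, C1, and C2; "low" reuses only C0 and C1 from the same
    table, so no second table is needed for the low-precision estimate.
    """
    if precision == "high":
        coeffs = SHARED_COEFFS
    elif precision == "low":
        coeffs = SHARED_COEFFS[:2]
    else:
        raise ValueError("precision must be 'high' or 'low'")
    result = 0.0
    for c in reversed(coeffs):   # Horner evaluation, highest order first
        result = result * x + c
    return result
```

For a small input such as x = 0.1, the high-precision estimate lies closer to exp(0.1) than the low-precision estimate, while the low-precision path performs one fewer multiply-add.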
As modified, the system of Yu in view of Pineiro teaches the decoding circuitry and the execution circuitry to compute a result for each of the plurality of data element values using a polynomial function to approximate a complex function, wherein coefficients are stored in a lookup table and selected based on the precision requirement. The combined system of Yu in view of Pineiro does not teach decoding an instruction.
Azadet teaches a processor having an instruction set with user-defined non-linear functions for approximating complex functions using polynomial functions (Azadet, [0007]). Azadet describes how digital pre-distortion may be applied using at least one instruction. Thus, at least a single instruction is used to perform a complex function approximation.
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the processor of the combination of Yu in view of Pineiro to use a single instruction, as taught by Azadet, to respond to and implement the complex function approximation. This modification would have been obvious because using a single instruction to perform the approximation of a complex function would simplify assembly-level programming, provide good code density, and improve the program through the use of a complex instruction. In the resulting system, the processor of the combination of Yu in view of Pineiro and Azadet could either provide hardware support for implementing the instruction or trap the instruction and then execute the existing functional software of Yu in order to emulate the single instruction.
As modified, the combination of Yu in view of Pineiro and Azadet teaches a system having a decode circuit and an execution circuit in which an instruction is used to compute a result for each of the plurality of data element values using a polynomial function to approximate a complex function. However, the combined system of Yu in view of Pineiro and Azadet does not teach the instruction format that comprises a first operand specifying an output location, a second operand specifying a plurality of data element values to be computed, and a third operand specifying a precision requirement for the computation.
Saulsbury teaches an instruction format that comprises a first operand specifying an output location, a second operand specifying a plurality of data element values to be computed, and a third operand specifying a precision requirement for the computation (Saulsbury, figure 4, [0040]: Rd [0:5] 420 is the first operand specifying the destination address, and Rs1 412 is the second operand, which is a source address used to load registers from the register file; [0043], Table I: the type field [18-20] is the third operand, which specifies a different precision depending on the type, see Table I).
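A three-operand instruction word of the kind discussed above can be sketched as a simple bit-field encode and decode. This is an illustrative sketch only: the Rd position (bits 0-5) and the type field position (bits 18-20) follow the bit ranges quoted above, but the Rs1 position (bits 6-11), the field widths beyond those quoted, and all names are assumptions and are not taken from Saulsbury.

```python
from typing import NamedTuple


class Decoded(NamedTuple):
    rd: int          # first operand: destination (output location)
    rs1: int         # second operand: source of the data element values
    type_field: int  # third operand: selects the precision


# (low bit, high bit), inclusive. RD and TYPE follow the ranges quoted in the
# text; the RS1 position is a hypothetical placeholder.
RD_BITS = (0, 5)
RS1_BITS = (6, 11)
TYPE_BITS = (18, 20)


def _extract(word, field):
    lo, hi = field
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def _place(value, field):
    lo, hi = field
    if not 0 <= value < (1 << (hi - lo + 1)):
        raise ValueError("value does not fit in the field")
    return value << lo


def encode(rd, rs1, type_field):
    """Pack the three operands into one instruction word."""
    return _place(rd, RD_BITS) | _place(rs1, RS1_BITS) | _place(type_field, TYPE_BITS)


def decode(word):
    """Unpack the three operands from an instruction word."""
    return Decoded(_extract(word, RD_BITS), _extract(word, RS1_BITS), _extract(word, TYPE_BITS))
```

A decoder built this way recovers all three operands from a single word, which is the property relied on when one instruction carries the output location, the input values, and the precision requirement together.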

As modified, the combination of Yu in view of Pineiro, Azadet and Saulsbury teaches a system having a decode circuit and an execution circuit to decode and execute an instruction that comprises three operands to compute a result for each of the plurality of data element values using a polynomial function to approximate a complex function, where the coefficients are stored in the LUT and are selected based on the precision requirement, and wherein data element values within different data element value ranges use different sets of coefficients.
The combined system of Yu in view of Pineiro, Azadet and Saulsbury also teaches that the GPU executes various types of graphics processing pipelines, such as a compute unified device architecture (CUDA) (Yu, [0065]). However, the combined system of Yu in view of Pineiro, Azadet and Saulsbury does not explicitly disclose that the computing for the plurality of data element values is performed simultaneously by a plurality of threads of the execution circuitry.
Christopher teaches a method of performing parallel computation using CUDA on a GPU, in particular on Nvidia GPUs, whose parallel architecture is based on the SIMT architecture (Christopher, section “Introduction”). In SIMT (Single Instruction, Multiple Threads), a single instruction is executed by multiple threads in lock-step: each thread executes the same instruction, but possibly on different data.
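The lock-step behavior described above can be modeled with a few lines of code. The sketch is a sequential model of the ordering only, not of parallel hardware, and the names run_warp and PROGRAM are hypothetical.

```python
def run_warp(program, lane_inputs):
    """Model lock-step execution: every lane executes instruction k before
    any lane moves on to instruction k + 1, each lane on its own data."""
    lanes = list(lane_inputs)
    for instruction in program:
        # The same instruction is applied to every lane's private value.
        lanes = [instruction(value) for value in lanes]
    return lanes


# A two-instruction program: square the value, then add one.
PROGRAM = [lambda v: v * v, lambda v: v + 1.0]
```

For example, run_warp(PROGRAM, [1.0, 2.0, 3.0]) returns [2.0, 5.0, 10.0]: three lanes, one instruction stream, three different data values.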
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combined system of Yu in view of Pineiro, Azadet and Saulsbury to use the method described by Christopher to perform parallel computation 

Regarding claim 2, the combined system of Yu in view of Pineiro, Azadet, Saulsbury and Christopher discloses the invention as in the parent claim above, including that an operation code (opcode) of the instruction specifies a transcendental function for which the approximation is performed (Yu, figures 16-21, [0015]: the arithmetic operations include a square root function, an inverse square root function, a reciprocal function, a log function, an exponential function, a power series function, and a trigonometric function).

Regarding claim 3, the combined system of Yu in view of Pineiro, Azadet, Saulsbury and Christopher discloses the invention as in the parent claim above, including that the lookup location stores mappings, each being between a data element value range and a set of coefficients for a corresponding polynomial function to approximate the complex function (Yu, [0100]: figure 7 illustrates the LUT, which includes the set of coefficients corresponding to each address item, the address item being the variable section).

Regarding claim 4, the combined system of Yu in view of Pineiro, Azadet, Saulsbury and Christopher discloses the invention as in the parent claim above, including that a first data element value range, which requires a first computational precision, shares low-order coefficients for the corresponding polynomial function with a second data element value range that requires a second and lower computational precision (Pineiro, [0022, 0023], figure 3: the math circuit is able to compute at low precision and at high precision. The low-precision estimate shares the coefficient (C1, C2) table of the high-precision estimate).

Regarding claim 5, the combined system of Yu in view of Pineiro, Azadet, Saulsbury and Christopher discloses the invention as in the parent claim above, including that the width of the data element value range is a fixed value for each of the plurality of data element values (Yu, [0049]: the width of the variable section can be uniform and have a fixed value as shown in figure 15, or the width of the variable section can be non-uniform as shown in figure 8).

Regarding claim 6, the combined system of Yu in view of Pineiro, Azadet, Saulsbury and Christopher discloses the invention as in the parent claim above, including that the execution includes comparing a data element value with a plurality of mappings in parallel for range determination of the data element value (Yu, [0105], figure 8: the LUT addressing 820 compares the variable x to determine the address or section to which x should be mapped; [0146]: in some examples multiple processors are used, and a hardware component may have any one or more of different processing configurations, for example parallel processors).

Regarding claim 7, the combined system of Yu in view of Pineiro, Azadet, Saulsbury and Christopher discloses the invention as in the parent claim above, including wherein the instruction comprises an additional operand specifying the lookup location (Azadet, [0008]: the user-specified parameter may comprise a look-up table storing values of the non-linear function for a finite number of input values; [0036, 0050]: the lookup table for each user-defined function can be stored and loaded into a register).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the instruction format as disclosed in the combined system of Yu in view of Pineiro, Azadet, Saulsbury and Christopher to include an operand specifying the lookup location as disclosed in Azadet. This modification would have been obvious because all the elements were known in the prior art, and one of ordinary skill in the art could modify the instruction format disclosed in Saulsbury in combination with Azadet to yield the predictable result of an instruction that includes an operand to receive a specification of lookup tables for storage in a register.

Regarding claim 9, the combined system of Yu in view of Pineiro, Azadet, Saulsbury and Christopher discloses the invention, including that a first precision requirement causes a first set of coefficients to be used for the computation and a second and different precision requirement causes a second set of coefficients to be used (Pineiro, figure 3: the high-precision circuit 12 uses the set of coefficients C0, C1, C3 to obtain higher precision, and the low-precision circuit 11 uses the different set of coefficients C0, C1 to obtain lower precision).

Regarding claim 10, the combined system of Yu in view of Pineiro, Azadet, Saulsbury and Christopher discloses the invention as in the parent claim above, including that the instruction comprises an additional operand specifying the complex function to be approximated (Yu, [0007]: performing an arithmetic operation by a processing apparatus includes determining a polynomial expression approximating an arithmetic operation to be performed on a variable; see [0068] for the list of arithmetic operations that can be performed. Pineiro, figure 3, [0022]: a function identifier FID is used to indicate the type of function to be estimated).


Regarding claim 13, the combined system of Yu in view of the ICA slides, Azadet and Christopher discloses the invention as in the parent claim above, including that the execution circuitry comprises two or more computing units, each for executing the instruction using a warp or a thread (Yu, [0147]: the operations shown in figures 1-22 are performed by computing hardware, for example by one or more processors or computers executing instructions, as described above).

Regarding claim 14, the combined system of Yu in view of the ICA slides, Azadet and Christopher discloses the invention as in the parent claim above, including that the processor is a graphics processing unit (GPU) (Yu, figure 1 shows that the processing apparatus is the GPU).

Regarding claims 15-18, these are method claims corresponding to apparatus claims 1, 3, 7, and 10, respectively. Thus, they are rejected for the same reasons as the apparatus claims.

Regarding claim 20, claim 20 is directed to a non-transitory machine-readable medium storing an instruction and corresponds to apparatus claim 1. It is rejected for the same reasons as claim 1.
Regarding claim 21, the combined system of Yu in view of Pineiro, Azadet, Saulsbury and Christopher including the computing for the plurality of data element values comprises a plurality of iterations, and wherein a result from an earlier iteration for each of the plurality of data element values is an input to the earlier iteration's immediate next iteration along with exponentiation of a corresponding data element value (Yu, figure 10 [0112] compute the coefficients obtained from the LUTs and the variables X. calculator 140 performs multiply-accumulation where a result of $C_0$ [i.e. a result from an earlier iteration] is accumulated to the product of ($C_1 X'$) [i.e. the earlier iteration’s immediate next iteration], then the result of $C_0 + C_1 X'$ is accumulated with $C_2 X^2$. Note when $C_0$ is input to the earlier iteration’s immediate next iteration, the variable $X^0$ [i.e. exponentiation of a corresponding data element value] is also input).
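The multiply-accumulate sequence described for claim 21 can be sketched as an iteration in which each step takes the previous partial result together with the next power of the data element value. This is a minimal sketch with hypothetical names (mac_partials); for simplicity it uses a single variable x for every power, which is an assumption and does not reproduce the distinction between X and X' in the cited passage.

```python
def mac_partials(coeffs, x):
    """Return the partial result after each multiply-accumulate iteration.

    Iteration k adds coeffs[k] * x**k to the result of iteration k - 1, so the
    earlier result and the next power of x are both inputs to each step.
    """
    partials = []
    acc = 0.0
    power = 1.0                    # x**0 accompanies C0 in the first iteration
    for c in coeffs:
        acc = acc + c * power      # previous result + C_k * x**k
        partials.append(acc)
        power *= x                 # prepare x**(k + 1) for the next iteration
    return partials
```

With coefficients (1.0, 2.0, 3.0) and x = 2.0, the partial results are 1.0, then 1.0 + 2.0 * 2.0 = 5.0, then 5.0 + 3.0 * 4.0 = 17.0.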

Claims 11-12 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Yu in view of ICA slides, Azadet and Christopher as applied to claim 1 above, and further in view of Koster (US 2017/0316307).

Regarding claim 11, the combined system of Yu in view of the ICA slides, Azadet, and Christopher discloses the invention substantially as claimed. See the rejection of claim 1 above. However, the system of Yu in view of ICA slides and Azadet does not explicitly disclose that the complex function is an activation function of a neural network.
Koster teaches that the complex function is an activation function of a neural network (Koster, [0032]: an activation function of the neural network, e.g., a non-linear sigmoid function, is applied. As stated in the applicant’s disclosure at [00250], complex functions are widely used in today’s computation, and an activation function may be a complex function, e.g., a sigmoid function).


Regarding claim 12, the combined system of Yu in view of the ICA slides, Azadet, and Christopher discloses the invention substantially as claimed. See the rejection of claim 1 above. However, the system of Yu in view of ICA slides and Azadet does not explicitly disclose that the plurality of data element values is from a tile of two-dimensional data elements.
Koster teaches that the plurality of data element values is from a tile of two-dimensional data elements (Koster, [0019]: the input tensor may be two-dimensional, and a tensor typically comprises a plurality of values).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the plurality of inputs of the combined system of Yu in view of ICA slides and Azadet to be a two-dimensional input on which the approximation is performed. Doing so would allow the system to increase and maintain a high level of precision when performing computation for the neural network, as recognized by Koster [0002].

Regarding claim 19, the same teaching that is used for claim 11 can be applied equally to teach claim 19.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HUY DUONG whose telephone number is (571)272-2764.  The examiner can normally be reached on Mon-Friday 7:30-5:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Aimee Li can be reached on 571-272-4169.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/HUY DUONG/Examiner, Art Unit 2182, (571)272-2764

/Aimee Li/Supervisory Patent Examiner, Art Unit 2183