DETAILED ACTION

Notice of Pre-AIA or AIA Status
1.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

EXAMINER’S AMENDMENT
2.	Authorization for this examiner’s amendment was given in an interview with Michael A. DeSanctis (Reg. No. 39,957) on May 27, 2021.
The application has been amended as follows: 
1. 	(Currently Amended)  A method comprising: 
responsive to receipt by a graphics processing unit (GPU) of a single instruction specifying a vector normalization operation to be performed on each vector of a set of V vectors:
generating, by a first processing unit of the GPU, V squared length values, each representing a squared length of one of the set of V vectors, N squared length values at a time, by, for each N sets of inputs, each representing a plurality of component vectors for N of the set of V vectors and stored in respective registers of a first set of V/N registers, performing N parallel dot product operations on the N sets of inputs, wherein V and N are integers > 0 and V > N; and
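generating, by a second processing unit of the GPU, V sets of outputs, each representing a plurality of normalized component vectors of one of the set of V vectors, N sets of outputs at a time, by, for each N squared length values of the V squared length values, performing N parallel operations on the N squared length values, wherein each of the N parallel operations implement a combination of a reciprocal square root function and a vector scaling function.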

2. 	(Original)  The method of claim 1, wherein said generating, by a second processing unit of the GPU, V sets of outputs stores the V sets of outputs, N sets of outputs at a time, in respective registers of a second set of V/N registers.

3. 	(Original)  The method of claim 2, wherein V is eight and wherein N is two.

4. 	(Original)  The method of claim 3, wherein the first set of V/N registers comprises four 256-bit registers, and wherein the plurality of component vectors comprises three 32-bit component vectors.

5. 	(Original)  The method of claim 4, wherein the second set of V/N registers comprises four 256-bit registers, and wherein the plurality of normalized component vectors comprises three 32-bit normalized component vectors.

6. 	(Original)  The method of claim 3, wherein the first processing unit comprises a floating point unit (FPU) and wherein the second processing unit comprises a co-processor.

7. 	(Original)  The method of claim 3, wherein the N parallel dot product operations result from a 2-wide Single Instruction Multiple Data (SIMD) dot product instruction.

8. 	(Original)  The method of claim 3, wherein the N parallel operations result from a 2-wide Single Instruction Multiple Data (SIMD) instruction.

9. 	(Original)  The method of claim 1, wherein the reciprocal square root function comprises performing a single-precision reciprocal square root operation on an operand, including:
performing a reciprocal square root operation on an exponent component of the operand;
performing a reciprocal square root operation on a mantissa component of the operand, comprising:
dividing the mantissa component into a first sub-component and a second sub-component;
determining a result of the reciprocal square root operation for the first sub-component; and
determining a result of the reciprocal square root operation for the second sub-component; and
returning a result of the reciprocal square root operation.

10. 	(Currently Amended)  A graphics processing unit (GPU) comprising: 
a first set of V/N registers, wherein V and N are integers > 0 and V > N;
a first processing unit coupled to the first set of V/N registers;
a second processing unit coupled to the first set of V/N registers;
an execution unit operable to, responsive to receipt of a single instruction specifying a vector normalization operation to be performed on each vector of a set of V vectors, (i) issue V/N N-wide Single Instruction Multiple Data (SIMD) dot product operations to be performed by the first processing unit; and (ii) issue V/N N-wide Single Instruction Multiple Data (SIMD) operations that implement a combination of a reciprocal square root function and a vector scaling function to be performed by the second processing unit; 
wherein the first processing unit is operable to generate V squared length values, each representing a squared length of one of the set of V vectors, N squared length values at a time, by, for each N sets of inputs, each representing a plurality of component vectors for N of the set of V vectors and stored in respective registers of the first set of V/N registers, executing one of the V/N N-wide SIMD dot product operations; and
wherein the second processing unit is operable to generate V sets of outputs, each representing a plurality of normalized component vectors of one of the set of V vectors, N sets of outputs at a time, by for each N squared length values of the V squared length values, executing one of the V/N N-wide SIMD operations.

11. 	(Original)  The GPU of claim 10, further comprising a second set of V/N registers and wherein the V sets of outputs are stored, N sets of outputs at a time, in respective registers of the second set of V/N registers.

12. 	(Original)  The GPU of claim 11, wherein V is eight and wherein N is two.

13. 	(Original)  The GPU of claim 12, wherein the first set of V/N registers comprises four 256-bit registers, and wherein the plurality of component vectors comprises three 32-bit component vectors.

14. 	(Original)  The GPU of claim 13, wherein the second set of V/N registers comprises four 256-bit registers, and wherein the plurality of normalized component vectors comprises three 32-bit normalized component vectors.

15. 	(Original)  The GPU of claim 12, wherein the first processing unit comprises a floating point unit (FPU) and wherein the second processing unit comprises a co-processor.

16. 	(Original)  The GPU of claim 10, wherein the reciprocal square root function comprises performing a single-precision reciprocal square root operation on an operand, including:
performing a reciprocal square root operation on an exponent component of the operand;
performing a reciprocal square root operation on a mantissa component of the operand, comprising:
dividing the mantissa component into a first sub-component and a second sub-component;
determining a result of the reciprocal square root operation for the first sub-component; and
determining a result of the reciprocal square root operation for the second sub-component; and
returning a result of the reciprocal square root operation.

17. 	(Original)  The GPU of claim 16, wherein determining the value of the first sub-component comprises determining an initial estimate for the first sub-component and determining a difference between an actual value of the first sub-component and the initial estimate for the first sub-component.

18. 	(Original)  The GPU of claim 17, wherein determining the initial estimate comprises performing a linear interpolation.

19. 	(Original)  The GPU of claim 18, wherein the difference between the actual value of the first sub-component and the initial estimate for the first sub-component is determined via a piecewise linear approximation.

20. 	(Original)  The GPU of claim 16, wherein determining the result of the reciprocal square root operation for the first and second sub-components is performed in parallel.
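
For illustration only, the exponent and mantissa handling recited in claims 9 and 16 may be approximated in software roughly as follows. This is a minimal Python sketch under stated assumptions and is not the claimed hardware: the function name `rsqrt`, the `segments` parameter, and the 16-entry interpolation grid are hypothetical, the mantissa is not divided into two sub-components, and a single Newton-Raphson step is used here in place of the piecewise linear correction of claim 19.

```python
import math


def rsqrt(x, segments=16):
    """Approximate 1/sqrt(x) by treating exponent and mantissa separately."""
    if math.isnan(x) or math.isinf(x) or x <= 0.0:
        raise ValueError("expected a positive finite operand")

    # Split the operand: x == m * 2**e with 0.5 <= m < 1.
    m, e = math.frexp(x)
    if e % 2:
        # Force an even exponent so that 2**(-e/2) is an exact power of two.
        m *= 2.0
        e -= 1
    # At this point 0.5 <= m < 2 and e is even.

    # Mantissa part: initial estimate by linear interpolation between grid
    # points. A hardware design would hold the end-point values as
    # precomputed constants; math.sqrt stands in for that table here.
    step = 1.5 / segments
    i = min(int((m - 0.5) / step), segments - 1)
    lo = 0.5 + i * step
    y_lo = 1.0 / math.sqrt(lo)
    y_hi = 1.0 / math.sqrt(lo + step)
    y = y_lo + (m - lo) / step * (y_hi - y_lo)

    # Refinement: one Newton-Raphson step (an assumption of this sketch,
    # substituted for the correction term described in the claims).
    y = y * (1.5 - 0.5 * m * y * y)

    # Exponent part: halve and negate the even exponent, then recombine.
    return math.ldexp(y, -e // 2)
```

With these assumed settings, `rsqrt(4.0)` should come out close to 0.5; the accuracy depends on the grid size chosen and is well short of full single precision.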

Response to Arguments
3.	Applicant’s Amendment/Remarks filed on June 18, 2018 have been considered and are persuasive. Claims 1 and 10 have been amended.  Therefore, claims 1-20 are allowed.

Allowable Subject Matter
4.	Claims 1-20 are allowed.
The following is an examiner’s statement of reasons for allowance:
Regarding independent claims 1 and 10, the best prior art found of record during the examination of the present application, together with the prior art made of record and considered pertinent to the applicant’s disclosure, does not teach or suggest the claimed limitations.
Per claim 1, the cited prior art, taken individually or in combination, does not teach the following claim limitations:
generating, by a first processing unit of the GPU, V squared length values, each representing a squared length of one of the set of V vectors, N squared length values at a time, by, for each N sets of inputs, each representing a plurality of component vectors for N of the set of V vectors and stored in respective registers of a first set of V/N registers, performing N parallel dot product operations on the N sets of inputs, wherein V and N are integers > 0 and V > N; and
generating, by a second processing unit of the GPU, V sets of outputs, each representing a plurality of normalized component vectors of one of the set of V vectors, N sets of outputs at a time, by, for each N squared length values of the V squared length values, performing N parallel operations on the N squared length values, wherein each of the N parallel operations implement a combination of a reciprocal square root function and a vector scaling function.
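
For illustration only, the two-stage data flow recited in these limitations may be sketched in software roughly as follows. This is a hedged Python sketch and not the claimed hardware: the function name `normalize_batch` is hypothetical, ordinary loops stand in for the N-wide parallel operations, V is assumed to be an exact multiple of N (consistent with the V/N registers recited), and zero-length vectors are not handled.

```python
import math


def normalize_batch(vectors, n=2):
    """Normalize V vectors, N at a time, in two stages."""
    v = len(vectors)
    if n <= 0 or v <= n or v % n != 0:
        raise ValueError("expected integers V > N > 0 with V a multiple of N")

    # Stage 1 (role of the first processing unit): V/N passes, each
    # producing N squared lengths as dot products of a vector with itself.
    squared_lengths = []
    for start in range(0, v, n):
        group = vectors[start:start + n]
        squared_lengths.extend(sum(c * c for c in vec) for vec in group)

    # Stage 2 (role of the second processing unit): V/N passes, each
    # combining a reciprocal square root with scaling for N vectors.
    outputs = []
    for start in range(0, v, n):
        group = vectors[start:start + n]
        lengths = squared_lengths[start:start + n]
        for vec, sq in zip(group, lengths):
            inv_len = 1.0 / math.sqrt(sq)
            outputs.append(tuple(c * inv_len for c in vec))
    return outputs
```

With V = 8 and N = 2, as in claims 3 and 12, each stage makes four passes; an input of eight copies of (3.0, 0.0, 4.0) should yield eight vectors close to (0.6, 0.0, 0.8).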

Per claim 10, the cited prior art, taken individually or in combination, does not teach the following claim limitations:
an execution unit operable to, responsive to receipt of a single instruction specifying a vector normalization operation to be performed on each vector of a set of V vectors, (i) issue V/N N-wide Single Instruction Multiple Data (SIMD) dot product operations to be performed by the first processing unit; and (ii) issue V/N N-wide Single Instruction Multiple Data (SIMD) operations that implement a combination of a reciprocal square root function and a vector scaling function to be performed by the second processing unit; 
wherein the first processing unit is operable to generate V squared length values, each representing a squared length of one of the set of V vectors, N squared length values at a time, by, for each N sets of inputs, each representing a plurality of component vectors for N of the set of V vectors and stored in respective registers of the first set of V/N registers, executing one of the V/N N-wide SIMD dot product operations; and
wherein the second processing unit is operable to generate V sets of outputs, each representing a plurality of normalized component vectors of one of the set of V vectors, N sets of outputs at a time, by for each N squared length values of the V squared length values, executing one of the V/N N-wide SIMD operations.

5.	Accordingly, in light of the cited references, the present invention is novel and non-obvious, since the prior art of record does not disclose, either explicitly or implicitly, the limitations as a whole as recited in claims 1-20. In addition, no reasonable combination of the cited references can be used to reconstruct the claimed invention. Therefore, the present application is allowable as claimed.
6.	Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee. Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance”.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KIM THANH THI TRAN whose telephone number is (571)270-1408.  The examiner can normally be reached on Monday-Friday 7:00am-4:00pm.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, JENNIFER MEHMOOD, can be reached on (571)272-2976.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/KIM THANH T. TRAN/
Examiner, Art Unit 2612

/JENNIFER MEHMOOD/
Supervisory Patent Examiner, Art Unit 2612