DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Double Patenting

1.	The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory obviousness-type double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); and In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).

2.	A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on a nonstatutory double patenting ground provided the conflicting application or patent either is shown to be commonly owned with this application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement.
Effective January 1, 1994, a registered attorney or agent of record may sign a terminal disclaimer. 

3.	A terminal disclaimer signed by the assignee must fully comply with 37 CFR 3.73(b).

4.	Claims 1, 4, 9, 14, 15, 16, and 19 are rejected on the ground of nonstatutory obviousness-type double patenting as being unpatentable over claims 1, 6, 9, 14, and 15 of U.S. Patent No. 11361496.

5.	The subject matter claimed in the instant application is fully disclosed in the referenced patent since the referenced patent and the instant application are claiming common subject matter, as shown in Table 1 below.

6.	Furthermore, there is no apparent reason why applicant would be prevented from presenting claims corresponding to those of the instant application in the other patent.  See In re Schneller, 397 F.2d 350, 158 USPQ 210 (CCPA 1968).  See also MPEP § 804.

Table 1

Instant Application, 17/827067

Claim 1. A graphics processor comprising:
a) a first processing cluster to perform parallel processing operations, the parallel processing operations including a ray tracing operation and a matrix multiply operation; and
b) a second processing cluster coupled to the first processing cluster, wherein the first processing cluster includes a floating-point unit to perform floating point operations, the floating-point unit is configured to process an instruction using a bfloat16 (BF16) format with a multiplier to multiply second and third source operands while an accumulator adds a first source operand with output from the multiplier.

Claim 4. The graphics processor of claim 1, wherein the first processing cluster includes a ray tracing circuit to perform the ray tracing operation.

Claim 9. A method comprising:
a) fetching an instruction from an instruction cache of a graphics processing unit (GPU), the instruction a single instruction multiple data (SIMD) instruction having multiple operands, at least one of the multiple operands in a bfloat16 (BF16) number format, wherein the multiple operands include a first source operand, a second source operand, and a third source operand, the GPU includes a shared memory coupled with the instruction cache, and circuitry coupled with the shared memory and the instruction cache;
b) dispatching a warp of threads to a processing resource of the GPU in response to the instruction, wherein the processing resource includes multiple texture units, a first core including hardware to accelerate matrix operations, and a second core configured to execute a thread of the instruction; and executing the thread of the instruction using the second core, wherein executing the thread of the instruction includes multiplying an element of the second source operand by an element of the third source operand and adding an element of the first source operand to a result of the multiply.

Claim 14. The method of claim 9, further comprising performing a dot product operation via the second core based on the instruction.

Claim 15. The method of claim 9, further comprising performing a dot product operation via the first core based on the instruction.

Patent, 11361496

Claim 1. A graphics processing unit (GPU) comprising: a single instruction, multiple thread (SIMT) multiprocessor comprising: an instruction cache; a shared memory coupled with the instruction cache; circuitry coupled with the shared memory and the instruction cache, the circuitry including: multiple texture units;
a) a first core including hardware to accelerate matrix operations; a second core configured to: receive an instruction having multiple operands, wherein the instruction is a single instruction multiple data (SIMD) instruction, at least one of the multiple operands in a bfloat16 (BF16) number format, the multiple operands include a first source operand,
b) a second source operand, and a third source operand, and the BF16 number format is a sixteen-bit floating point format having an eight-bit exponent; and process the instruction, wherein to process the instruction includes to multiply the second source operand by the third source operand, add a first source operand to a result of the multiply, and apply a rectified linear unit function to a result of the add…

Claim 6. The GPU of claim 1, further comprising a third core including hardware to accelerate ray tracing operations.

Claim 9. A method comprising:
a) fetching an instruction from an instruction cache of a graphics processing unit (GPU), the instruction a single instruction multiple data (SIMD) instruction having multiple operands, at least one of the multiple operands in a bfloat16 (BF16) number format, wherein the multiple operands include a first source operand, a second source operand, and a third source operand, and the BF16 number format is a sixteen-bit floating point format having an eight-bit exponent, the GPU includes a shared memory coupled with the instruction cache and circuitry coupled with the shared memory and the instruction cache;
b) dispatching a warp of threads to a single instruction multiple thread (SIMT) multiprocessor of the GPU in response to the instruction, wherein the SIMT multiprocessor includes multiple texture units, a first core including hardware to accelerate matrix operations, and a second core configured to execute a thread of the instruction; and processing the instruction using the second core, wherein processing the instruction includes multiplying the second source operand by the third source operand, adding a first source operand to a result of the multiply, and applying a rectified linear unit function to a result of the adding.

Claim 14. The method of claim 9, further comprising performing texture processing operations via texture processing circuitry that is external to and coupled with the SIMT multiprocessor.

Claim 15. The method of claim 9, further comprising performing a dot product operation via the second core in response to the instruction.

8.	Regarding claims 1, 4, 9, 14, 15, 16, and 19 of the instant application, 17/827067, the first row of Table 1 above shows that these claims map to claims 1, 6, 9, 14, and 15 of Patent 11361496, where the bold lowercase-lettered sections correspond across the columns of the table to the corresponding features of the instant application and the referenced patent. The above grouping of claim elements supports a nonstatutory obviousness-type double patenting rejection because the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) in that the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s).

Claim Rejections - 35 USC § 103

9.        In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  

10.         The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.

11.	Claims 1-6, 8, and 16-20 are rejected under 35 U.S.C. 103 as being unpatentable over Pasca et al., US 2019/0042193 A1, in view of Shi et al., US 2018/0373200 A1, and further in view of Dammertz et al., US 2009/0189898 A1.

12.         As per claim 1, Pasca discloses: A graphics processor comprising: 
a second processing cluster coupled to the first processing cluster, wherein the first processing cluster includes a floating-point unit to perform floating point operations, the floating-point unit is configured to process an instruction using a bfloat16 (BF16) format (Pasca, [0061], “Each DSP circuitry 60 may then determine a portion of the final dot-product. For example, a DSP circuitry 60 may determine a first product of a first pair of inputs, may determine a second product of a second pair of inputs, and may output a sum of the first product and the second product.”, and [0049], “As illustrated, in some embodiments, the process 140 may begin by scaling set of inputs to the DSP circuitry 60 from a first format to a second format (process block 142). For example, an input having a bfloat16 floating-point format (e.g., a 1-bit sign field, an 8-bit exponent field, and a 7-bit fraction field) may be scaled to half-precision floating-point format.”)
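
For illustration only, the bfloat16 layout quoted from Pasca at [0049] (a 1-bit sign field, an 8-bit exponent field, and a 7-bit fraction field) can be sketched as follows. The function names are hypothetical, and the conversion by truncating a single-precision bit pattern to its upper 16 bits is an assumption made for this sketch; it is not taken from Pasca or from the claims.

```python
import struct


def float32_bits(x):
    """Return the IEEE 754 single-precision bit pattern of x as an integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def to_bf16_bits(x):
    """Keep the upper 16 bits of the float32 pattern (assumed truncation)."""
    return float32_bits(x) >> 16


def bf16_fields(bits):
    """Split a 16-bit bfloat16 pattern into (sign, exponent, fraction)."""
    sign = (bits >> 15) & 0x1       # 1-bit sign field
    exponent = (bits >> 7) & 0xFF   # 8-bit exponent field
    fraction = bits & 0x7F          # 7-bit fraction field
    return sign, exponent, fraction


def bf16_to_float(bits):
    """Widen a bfloat16 pattern back to a Python float via float32."""
    return struct.unpack(">f", struct.pack(">I", (bits & 0xFFFF) << 16))[0]
```

Because the exponent field has the same width as in single precision, widening a bfloat16 value back to single precision in this sketch only appends zero fraction bits.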

13.	Pasca doesn’t expressly disclose:
 a multiplier to multiply second and third source operands while an accumulator adds a first source operand with output from the multiplier.
A first processing cluster to perform parallel processing operations, the parallel processing operations including a ray tracing operation and a matrix multiply operation; 

14.	Shi discloses: 
a multiplier to multiply second and third source operands while an accumulator adds a first source operand with output from the multiplier. (Shi, [0108], “Tensor cores configured to perform matrix operations, and, in one embodiment, one or more tensor cores are included in the cores 550. In particular, the tensor cores are configured to perform deep learning matrix arithmetic, such as convolution operations for neural network training and inferencing. In one embodiment, each tensor core operates on a 4×4 matrix and performs a matrix multiply and accumulate operation D=A×B+C, where A, B, C, and D are 4×4 matrices.”)
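
For illustration only, the matrix multiply and accumulate operation D=A×B+C quoted from Shi at [0108] can be sketched in software as below. The function name and the nested-list representation are assumptions made for this sketch; Shi describes hardware tensor cores, not this code.

```python
def matmul_accumulate(a, b, c):
    """Return D = A x B + C for square matrices given as nested lists."""
    n = len(a)
    # Start from a copy of C so the accumulator input is left unmodified.
    d = [[c[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # Multiply one element pair and accumulate into D[i][j].
                d[i][j] += a[i][k] * b[k][j]
    return d
```

With A set to the 4×4 identity matrix, the result reduces to B + C, which gives a quick check of the accumulate step.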

15.	Pasca is analogous art with respect to Shi because they are from the same field of endeavor, namely image processing. At the time the application was filed, it would have been obvious to a person of ordinary skill in the art to include a multiplier to multiply second and third source operands while an accumulator adds a first source operand with output from the multiplier, as taught by Shi, into the teaching of Pasca. The suggestion for doing so would have been to increase the range of the formats. Therefore, it would have been obvious to combine Pasca with Shi.

16.	Pasca in view of Shi doesn’t expressly disclose:
A first processing cluster to perform parallel processing operations, the parallel processing operations including a ray tracing operation and a matrix multiply operation; 

17.	Dammertz discloses:
A first processing cluster to perform parallel processing operations, the parallel processing operations including a ray tracing operation and a matrix multiply operation; (Dammertz, Abstract , “Methods, systems, devices, and computer program code (software) products enable acceleration of ray tracing by using acceleration data structures with high arity to enable processing of nodes using streaming SIMD (Single Instruction, Multiple Data) instructions with reduced memory requirements.”)

18.	Dammertz is analogous art with respect to Pasca in view of Shi because they are from the same field of endeavor, namely image processing. At the time the application was filed, it would have been obvious to a person of ordinary skill in the art to include a first processing cluster to perform parallel processing operations, the parallel processing operations including a ray tracing operation and a matrix multiply operation, as taught by Dammertz, into the teaching of Pasca in view of Shi. The suggestion for doing so would have been to provide fast ray tracing. Therefore, it would have been obvious to combine Dammertz with Pasca in view of Shi.

19.         As per claim 2, Pasca in view of Shi, and in view of Dammertz discloses: The graphics processor of claim 1, wherein the instruction causes the multiplier to multiply second and third source operands having the BF16 format while the accumulator adds a first source operand with output from the BF16 multiplier.  (Pasca, [0061], “Each DSP circuitry 60 may then determine a portion of the final dot-product. For example, a DSP circuitry 60 may determine a first product of a first pair of inputs, may determine a second product of a second pair of inputs, and may output a sum of the first product and the second product.”, and [0049], “As illustrated, in some embodiments, the process 140 may begin by scaling set of inputs to the DSP circuitry 60 from a first format to a second format (process block 142). For example, an input having a bfloat16 floating-point format (e.g., a 1-bit sign field, an 8-bit exponent field, and a 7-bit fraction field) may be scaled to half-precision floating-point format.”)

20.         As per claim 3, Pasca in view of Shi, and in view of Dammertz discloses: The graphics processor of claim 2, wherein the first source operand comprises a single- precision floating point format while the second and third source operands comprise BF16 format.  (Pasca, [0061], “Each DSP circuitry 60 may then determine a portion of the final dot-product. For example, a DSP circuitry 60 may determine a first product of a first pair of inputs, may determine a second product of a second pair of inputs, and may output a sum of the first product and the second product.”, and [0049], “As illustrated, in some embodiments, the process 140 may begin by scaling set of inputs to the DSP circuitry 60 from a first format to a second format (process block 142). For example, an input having a bfloat16 floating-point format (e.g., a 1-bit sign field, an 8-bit exponent field, and a 7-bit fraction field) may be scaled to half-precision floating-point format.”)

21.         As per claim 4, Pasca in view of Shi, and in view of Dammertz discloses: The graphics processor of claim 1, wherein the first processing cluster includes a ray tracing circuit to perform the ray tracing operation. (Dammertz, Abstract, “Methods, systems, devices, and computer program code (software) products enable acceleration of ray tracing by using acceleration data structures with high arity to enable processing of nodes using streaming SIMD (Single Instruction, Multiple Data) instructions with reduced memory requirements.”)

22.	As per claim 5, Pasca in view of Shi, and in view of Dammertz discloses: The graphics processor of claim 4, wherein the ray tracing circuit is configured to perform the ray tracing operation in response to a request received from a processing resource of the first processing cluster. (Dammertz, [0247], “The following discussion describes in greater detail certain issues in ray tracing technology, and particular aspects of the invention that address those issues. FIG. 29 is a diagram illustrating the "self-intersection" problem. FIG. 29 shows a ray tracing procedure 500, including an image surface 502, an observation point 504, and a light source 506. In order to synthesize an image of the surface, a series of computations are performed in order to locate rays extending between the observation point 504 and the surface 502. FIG. 29 shows one such ray 508. Ideally, there is then calculated the exact point of intersection 510 between the ray 508 and the surface 502.”)
            
23.	As per claim 6, Pasca in view of Shi, and in view of Dammertz discloses: The graphics processor of claim 1, wherein the first processing cluster includes a matrix processing circuit to perform the matrix multiply operation. (Shi, [0108], “Tensor cores configured to perform matrix operations, and, in one embodiment, one or more tensor cores are included in the cores 550. In particular, the tensor cores are configured to perform deep learning matrix arithmetic, such as convolution operations for neural network training and inferencing. In one embodiment, each tensor core operates on a 4×4 matrix and performs a matrix multiply and accumulate operation D=A×B+C, where A, B, C, and D are 4×4 matrices.”)

24.         As per claim 8, Pasca in view of Shi, and in view of Dammertz discloses: The graphics processor of claim 1, wherein the floating-point unit of the second processing cluster is included in a matrix processing circuit of the second processing cluster. (Pasca, [0015], “More specifically, the present disclosure relates to scaling a variable to a suitable representation based on available hardware (e.g., hard logic) in an integrated circuit. For example, an input in a first number format (e.g., bfloat16) may be scaled to a second number format (e.g., half-precision floating-point) so that a digital signal processing (DSP) circuit implemented to receive inputs in the second number format may perform one or more arithmetic operations on the input. Further, in some embodiments, the output produced by the DSP circuit in a second or third number format (e.g., single-precision floating-point) may be scaled back to the first number format. Accordingly, arithmetic operations, such as a dot-product, performed in a first format may be emulated by scaling the inputs to and/or the outputs from arithmetic operations performed in a second format.”)

25.	Claims 16-20, which are similar in scope respectively to claims 1-5, are thus rejected under the same rationale.

26.         Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Pasca et al., US 2019/0042193 A1, in view of Shi et al., US 2018/0373200 A1, and in view of Dammertz et al., US 2009/0189898 A1, and further in view of Vantrease et al., US 2019/0294413 A1.

27.         As per claim 7, Pasca in view of Shi, and in view of Dammertz discloses: The graphics processor of claim 5, (See the rejection of claim 5 above.) 

28.	Pasca in view of Shi, and in view of Dammertz doesn’t expressly disclose: the matrix processing circuit includes a systolic array. 

29.	Vantrease discloses: the matrix processing circuit includes a systolic array. (Vantrease, [0075], “ In some implementations, computing engine 524 may be a matrix multiplication unit that may be used for matrix convolution and/or matrix multiplication, and thus may be used to implement a convolution layer or a fully-connected layer. For example, in some implementations, computing engine 524 may include a systolic array that includes a two-dimensional array of processing elements arranged in rows and columns.”)
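
For illustration only, a two-dimensional array of processing elements arranged in rows and columns, as quoted from Vantrease at [0075], can be simulated as below. The output-stationary dataflow, the skewed feeding of inputs at the array edges, and all names are assumptions made for this sketch; Vantrease does not specify them in the quoted passage.

```python
def systolic_matmul(a, b):
    """Simulate an n x m grid of processing elements computing A x B.

    Rows of A enter from the left edge and columns of B enter from the
    top edge, each delayed by one cycle per row or column (skew). Every
    processing element multiplies the two values passing through it and
    adds the product to its own local accumulator.
    """
    n, k, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]     # one accumulator per element
    a_reg = [[0] * m for _ in range(n)]   # values travelling rightwards
    b_reg = [[0] * m for _ in range(n)]   # values travelling downwards
    for t in range(n + m + k - 2):
        # Shift every value one element to the right and one element down.
        for i in range(n):
            for j in range(m - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(m):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Feed the skewed inputs at the left and top edges.
        for i in range(n):
            s = t - i
            a_reg[i][0] = a[i][s] if 0 <= s < k else 0
        for j in range(m):
            s = t - j
            b_reg[0][j] = b[s][j] if 0 <= s < k else 0
        # Each element performs one multiply-accumulate per cycle.
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc
```

In this sketch the element at row i and column j sees the operand pair with inner index t - i - j at cycle t, so n + m + k - 2 cycles are enough for every element to see all k pairs.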

30.	Vantrease is analogous art with respect to Pasca in view of Shi, and in view of Dammertz because they are from the same field of endeavor, namely image processing. At the time the application was filed, it would have been obvious to a person of ordinary skill in the art to include the feature that the matrix processing circuit includes a systolic array, as taught by Vantrease, into the teaching of Pasca in view of Shi, and in view of Dammertz. The suggestion for doing so would have been to generate a high-precision output of the sum of products. Therefore, it would have been obvious to combine Vantrease with Pasca in view of Shi and in view of Dammertz.

31.         Claims 9-11, and 13-15 are rejected under 35 U.S.C. 103 as being unpatentable over Nalluri et al., US 2019/0066255 A1, in view of Pasca et al., US 2019/0042193 A1, and further in view of Shi et al., US 2018/0373200 A1.

32. 	As per claim 9, Nalluri discloses: A method comprising:
fetching an instruction from an instruction cache of a graphics processing unit (GPU), (Nalluri, [0078], “As illustrated in FIG. 6B, a graphics execution unit 608 can include an instruction fetch unit 637, a general register file array (GRF) 624, an architectural register file array (ARF) 626, a thread arbiter 622, a send unit 630, a branch unit 632, a set of SIMD floating point units (FPUs) 634, and in one embodiment a set of dedicated integer SIMD ALUs 635.”), the instruction a single instruction multiple data (SIMD) instruction having multiple operands, (Nalluri, [0086], “For each format, instruction opcode 712 defines the operation that the execution unit is to perform. The execution units execute each instruction in parallel across the multiple data elements of each operand. For example, in response to an add instruction the execution unit performs a simultaneous add operation across each color channel representing a texture element or picture element.”)
dispatching a warp of threads to a processing resource of the GPU in response to the instruction, (Nalluri, [0052], “The 3D pipeline 312 and media pipeline 316 process the commands and data by performing operations via logic within the respective pipelines or by dispatching one or more execution threads to a graphics core array 414.”), wherein the processing resource includes multiple texture units, a first core including hardware to accelerate matrix operations, and a second core configured to execute a thread of the instruction; (Nalluri, [0135], “As shown in FIG. 14A, the graphics core 1400 includes a shared instruction cache 1402, a texture unit 1418, and a cache/shared memory 1420 that are common to the execution resources within the graphics core 1400. The graphics core 1400 can include multiple slices 1401A-1401N or partition for each core, and a graphics processor can include multiple instances of the graphics core 1400. The slices 1401A-1401N can include support logic including a local instruction cache 1404A-1404N, a thread scheduler 1406A-1406N, a thread dispatcher 1408A-1408N, and a set of registers 1410A. To perform logic operations, the slices 1401A-1401N can include a set of additional function units (AFUs 1412A-1412N), floating-point units (FPU 1414A-1414N), integer arithmetic logic units (ALUs 1416-1416N), address computational units (ACU 1413A-1413N), double-precision floating-point units (DPFPU 1415A-1415N), and matrix processing units (MPU 1417A-1417N).”, and also [0052], [0136])

33.	Nalluri doesn’t expressly disclose:
at least one of the multiple operands in a bfloat16 (BF16) number format, wherein the multiple operands include a first source operand, a second source operand, and a third source operand, the GPU includes a shared memory coupled with the instruction cache, and circuitry coupled with the shared memory and the instruction cache;
executing the thread of the instruction using the second core, wherein executing the thread of the instruction includes multiplying an element of the second source operand by an element of the third source operand and adding an element of the first source operand to a result of the multiply.

34.	Pasca discloses: at least one of the multiple operands in a bfloat16 (BF16) number format, wherein the multiple operands include a first source operand, a second source operand, and a third source operand, the GPU includes a shared memory coupled with the instruction cache, and circuitry coupled with the shared memory and the instruction cache; (Pasca, [0061], “Each DSP circuitry 60 may then determine a portion of the final dot-product. For example, a DSP circuitry 60 may determine a first product of a first pair of inputs, may determine a second product of a second pair of inputs, and may output a sum of the first product and the second product.”, and [0049], “As illustrated, in some embodiments, the process 140 may begin by scaling set of inputs to the DSP circuitry 60 from a first format to a second format (process block 142). For example, an input having a bfloat16 floating-point format (e.g., a 1-bit sign field, an 8-bit exponent field, and a 7-bit fraction field) may be scaled to half-precision floating-point format.”)

35.	Pasca is analogous art with respect to Nalluri because they are from the same field of endeavor, namely image processing. At the time the application was filed, it would have been obvious to a person of ordinary skill in the art to include at least one of the multiple operands in a bfloat16 (BF16) number format, wherein the multiple operands include a first source operand, a second source operand, and a third source operand, the GPU includes a shared memory coupled with the instruction cache, and circuitry coupled with the shared memory and the instruction cache, as taught by Pasca, into the teaching of Nalluri. The suggestion for doing so would have been to increase the range of the formats. Therefore, it would have been obvious to combine Pasca with Nalluri.

36.	Nalluri in view of Pasca doesn’t expressly disclose:
executing the thread of the instruction using the second core, wherein executing the thread of the instruction includes multiplying an element of the second source operand by an element of the third source operand and adding an element of the first source operand to a result of the multiply. 

37.	Shi discloses: executing the thread of the instruction using the second core, wherein executing the thread of the instruction includes multiplying an element of the second source operand by an element of the third source operand and adding an element of the first source operand to a result of the multiply. (Shi, [0108], “Tensor cores configured to perform matrix operations, and, in one embodiment, one or more tensor cores are included in the cores 550. In particular, the tensor cores are configured to perform deep learning matrix arithmetic, such as convolution operations for neural network training and inferencing. In one embodiment, each tensor core operates on a 4×4 matrix and performs a matrix multiply and accumulate operation D=A×B+C, where A, B, C, and D are 4×4 matrices.”)
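
For illustration only, the element-wise operation recited in claim 9 (multiplying an element of the second source operand by an element of the third source operand and adding an element of the first source operand to the result) can be sketched per lane as below. The function name, the list-based operands, and the operand order src0, src1, src2 for the first, second, and third source operands are assumptions made for this sketch.

```python
def simd_multiply_add(src0, src1, src2):
    """Per lane i, return src1[i] * src2[i] + src0[i]."""
    if not (len(src0) == len(src1) == len(src2)):
        raise ValueError("all source operands must have the same lane count")
    # One multiply and one add per lane, independent across lanes.
    return [b * c + a for a, b, c in zip(src0, src1, src2)]
```

For example, under these assumptions, simd_multiply_add([1, 2, 3], [4, 5, 6], [7, 8, 9]) yields [29, 42, 57].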

38.	Shi is analogous art with respect to Nalluri in view of Pasca because they are from the same field of endeavor, namely image processing. At the time the application was filed, it would have been obvious to a person of ordinary skill in the art to include executing the thread of the instruction using the second core, wherein executing the thread of the instruction includes multiplying an element of the second source operand by an element of the third source operand and adding an element of the first source operand to a result of the multiply, as taught by Shi, into the teaching of Nalluri in view of Pasca. The suggestion for doing so would have been to scale the range of a variable to a suitable representation based on available hardware. Therefore, it would have been obvious to combine Shi with Nalluri in view of Pasca.

39. 	As per claim 10, Nalluri in view of Pasca, and in view of Shi discloses: The method of claim 9, further comprising performing, via the first core, a parallel matrix multiply operation on input having the BF16 number format. (Pasca, [0061], “Each DSP circuitry 60 may then determine a portion of the final dot-product. For example, a DSP circuitry 60 may determine a first product of a first pair of inputs, may determine a second product of a second pair of inputs, and may output a sum of the first product and the second product.”, and [0049], “As illustrated, in some embodiments, the process 140 may begin by scaling set of inputs to the DSP circuitry 60 from a first format to a second format (process block 142). For example, an input having a bfloat16 floating-point format (e.g., a 1-bit sign field, an 8-bit exponent field, and a 7-bit fraction field) may be scaled to half-precision floating-point format.”)
 
40.	As per claim 11, Nalluri in view of Pasca, and in view of Shi discloses: The method as in claim 9, wherein the processing resource is a single instruction, multiple thread (SIMT) multiprocessor and the method comprises dispatching the warp of threads to the SIMT multiprocessor. (Nalluri, [0052], “The 3D pipeline 312 and media pipeline 316 process the commands and data by performing operations via logic within the respective pipelines or by dispatching one or more execution threads to a graphics core array 414.”, and Nalluri, [0135], “As shown in FIG. 14A, the graphics core 1400 includes a shared instruction cache 1402, a texture unit 1418, and a cache/shared memory 1420 that are common to the execution resources within the graphics core 1400. The graphics core 1400 can include multiple slices 1401A-1401N or partition for each core, and a graphics processor can include multiple instances of the graphics core 1400. The slices 1401A-1401N can include support logic including a local instruction cache 1404A-1404N, a thread scheduler 1406A-1406N, a thread dispatcher 1408A-1408N, and a set of registers 1410A. To perform logic operations, the slices 1401A-1401N can include a set of additional function units (AFUs 1412A-1412N), floating-point units (FPU 1414A-1414N), integer arithmetic logic units (ALUs 1416-1416N), address computational units (ACU 1413A-1413N), double-precision floating-point units (DPFPU 1415A-1415N), and matrix processing units (MPU 1417A-1417N).”, and also [0052], [0136])

41.	As per claim 13, Nalluri in view of Pasca, and in view of Shi discloses: The method of claim 11, further comprising performing texture processing operations via texture processing circuitry that is external to and coupled with the SIMT multiprocessor. (Nalluri, [0052], “The 3D pipeline 312 and media pipeline 316 process the commands and data by performing operations via logic within the respective pipelines or by dispatching one or more execution threads to a graphics core array 414.”, and [0135], “As shown in FIG. 14A, the graphics core 1400 includes a shared instruction cache 1402, a texture unit 1418, and a cache/shared memory 1420 that are common to the execution resources within the graphics core 1400. The graphics core 1400 can include multiple slices 1401A-1401N or partition for each core, and a graphics processor can include multiple instances of the graphics core 1400. The slices 1401A-1401N can include support logic including a local instruction cache 1404A-1404N, a thread scheduler 1406A-1406N, a thread dispatcher 1408A-1408N, and a set of registers 1410A. To perform logic operations, the slices 1401A-1401N can include a set of additional function units (AFUs 1412A-1412N), floating-point units (FPU 1414A-1414N), integer arithmetic logic units (ALUs 1416-1416N), address computational units (ACU 1413A-1413N), double-precision floating-point units (DPFPU 1415A-1415N), and matrix processing units (MPU 1417A-1417N).”, and also [0052], [0136])

42. 	As per claim 14, Nalluri in view of Pasca, and in view of Shi discloses: The method of claim 9, further comprising performing a dot product operation via the second core based on the instruction.  (Nalluri, [0083], “In one embodiment, arrays of multiple instances of the graphics execution unit 608 can be instantiated in a graphics sub-core grouping (e.g., a sub-slice). For scalability, product architects can chose the exact number of execution units per sub-core grouping. In one embodiment the execution unit 608 can execute instructions across a plurality of execution channels. In a further embodiment, each thread executed on the graphics execution unit 608 is executed on a different channel.”)

43. 	As per claim 15, Nalluri in view of Pasca, and in view of Shi discloses: The method of claim 9, further comprising performing a dot product operation via the first core based on the instruction. (Nalluri, [0083], “In one embodiment, arrays of multiple instances of the graphics execution unit 608 can be instantiated in a graphics sub-core grouping (e.g., a sub-slice). For scalability, product architects can chose the exact number of execution units per sub-core grouping. In one embodiment the execution unit 608 can execute instructions across a plurality of execution channels. In a further embodiment, each thread executed on the graphics execution unit 608 is executed on a different channel.”)

44. 	Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Nalluri et al., US 2019/0066255 A1, in view of Pasca et al., US 2019/0042193 A1, and in view of Shi et al., US 2018/0373200 A1, and further in view of Dammertz et al., US 2009/0189898 A1.

45. 	As per claim 12, Nalluri in view of Pasca, and in view of Shi discloses: The method of claim 11, wherein the SIMT multiprocessor includes a third core to accelerate ray tracing operations and the method further comprises accelerating a ray tracing operation via the third core in parallel with processing the instruction.  

46.	Pasca in view of Shi does not expressly disclose:
A first processing cluster to perform parallel processing operations, the parallel processing operations including a ray tracing operation and a matrix multiply operation; 

47.	Dammertz discloses:
A first processing cluster to perform parallel processing operations, the parallel processing operations including a ray tracing operation and a matrix multiply operation; (Dammertz, Abstract, “Methods, systems, devices, and computer program code (software) products enable acceleration of ray tracing by using acceleration data structures with high arity to enable processing of nodes using streaming SIMD (Single Instruction, Multiple Data) instructions with reduced memory requirements.”)

48. 	Dammertz is analogous art with respect to Shi in view of Pasca because they are from the same field of endeavor, namely image processing. At the time the application was filed, it would have been obvious to a person of ordinary skill in the art to incorporate a first processing cluster to perform parallel processing operations, the parallel processing operations including a ray tracing operation and a matrix multiply operation, as taught by Dammertz, into the teaching of Shi in view of Pasca. The motivation for doing so would have been to provide fast ray tracing. Therefore, it would have been obvious to combine Dammertz with Shi in view of Pasca.

Conclusion 
49.	Any inquiry concerning this communication or earlier communications from the examiner should be directed to ABDERRAHIM MEROUAN whose telephone number is (571)270-5254.  The examiner can normally be reached on Monday to Friday 8 AM-5 PM.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor Kent Chang can be reached on 571-272-7761. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ABDERRAHIM MEROUAN/Primary Examiner, Art Unit 2619