DETAILED ACTION
*Note in the following document:
1. Texts in italic bold format are limitations quoted either directly or conceptually from claims/descriptions disclosed in the instant application.
2. Texts in regular italic format are quoted directly from cited reference or Applicant’s arguments.
3. Texts with underlining are added by the Examiner for emphasis.
4. Texts with 

	Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 7 February 2022 has been entered.

 Status of Claims
This is in response to applicant’s amendment/response filed on 7 February 2022, which has been entered and made of record.  Claims 2 and 8 have been amended.  No claim has been added or cancelled.  Claims 2-13 are pending in the application.
	
	Response to Arguments
Applicant’s arguments, see p.5-6, filed on 7 February 2022, with respect to the 35 U.S.C. §103 rejection of Claims 1-12 have been fully considered but they are not persuasive.
Applicant argues that the Laine reference fails to disclose a multi-core group that comprises a plurality of graphics cores to process one or more shader programs; a plurality of tensor cores, apart from the plurality of graphics cores, to perform matrix operations including matrix multiplication operations for neural network training and inferencing; and one or more ray tracing cores, apart from the plurality of graphics cores and the plurality of tensor cores, to perform ray tracing operations (p.5-6).  More specifically, Applicant argues that the graphics cores, tensor cores and ray tracing cores are apart from each other or separately operated.  The Examiner respectfully disagrees.
Laine shows in Fig.22 an SM module, which is a sub-module of a GPC 1750 as shown in Fig.20, that includes a plurality of cores 1950, SFUs 1952 and LSUs 1954 ([0337]: FIG. 22 illustrates the streaming multi-processor 1840 of FIG. 20, in accordance with an embodiment. As shown in FIG. 22, the SM 1840 includes an instruction cache 1905, one or more scheduler units 1910, a register file 1920, one or more processing cores 1950, one or more special function units (SFUs) 1952, one or more load/store units (LSUs) 1954, an interconnect network 1980, a shared memory/L1 cache 1970). The processing cores 1950, SFUs 1952 and LSUs 1954 share the L1 cache 1970.


	
Laine further teaches that the processing cores 1950 can be configured to process one or more shader programs ([0337]-[0338]: The tasks are allocated to a particular DPC 1820 within a GPC 1750 and, if the task is associated with a shader program, the task may be allocated to an SM 1840. The scheduler unit 1910 receives the tasks from the work distribution unit 1725 and manages instruction scheduling for one or more thread blocks assigned to the SM 1840. The scheduler unit 1910 schedules thread blocks for execution as warps of parallel threads, where each thread block is allocated at least one warp. In an embodiment, each warp executes 32 threads. The scheduler unit 1910 may manage a plurality of different thread blocks, allocating the warps to the different thread blocks and then dispatching instructions from the plurality of different cooperative groups to the various functional units (i.e., cores 1950, SFUs 1952, and LSUs 1954) during each clock cycle), and to include tensor cores ([0343]: In an embodiment, the cores 1950 include 64 single-precision (32-bit) floating point cores, 64 integer cores, 32 double-precision (64-bit) floating point cores, and 8 tensor cores) to perform matrix operations including matrix multiplication operations for neural network training and inferencing ([0344]: In particular, the tensor cores are configured to perform deep learning matrix arithmetic, such as convolution operations for neural network training and inferencing. In an embodiment, each tensor core operates on a 4×4 matrix and performs a matrix multiply and accumulate operation D=A×B+C, where A, B, C, and D are 4×4 matrices).  Laine further teaches or suggests that the SFU 1952 can be configured to perform ray tracing operations ([0155]: FIG. 10A shows an exemplary ray tracing shading pipeline 900 that may be performed by SM 132 and accelerated by TTU 700.  Note Laine teaches ray tracing is a shader program and performed by SM.  
Laine further teaches that the SM is the core to which a shader program task is allocated, as cited in [0337]-[0338] above, and further teaches Each SM 1840 also comprises M SFUs 1952 that perform special functions (e.g., attribute evaluation, reciprocal square root, and the like). In an embodiment, the SFUs 1952 may include a tree traversal unit configured to traverse a hierarchical tree data structure.  Note Laine teaches that traversing an acceleration data structure is required for real time ray tracing, see [0008]: a hardware-based traversal coprocessor that efficiently traverses an acceleration data structure e.g., for real time ray tracing).
Laine discloses that there are one or more processing cores 1950, SFUs, and LSUs.  Therefore it would have been obvious to a POSITA before the effective filing date of the claimed invention to dedicate one of the processing cores 1950 to perform shader programs, to dedicate another one of the processing cores 1950 to perform matrix operations by removing some of the floating point, integer, and double-precision floating point cores, and to dedicate one SFU to perform ray tracing operations.  The motivation is to simplify and optimize logic and computation according to different operation requirements.  As shown in Fig.22, the processing cores 1950, SFUs 1952 and LSUs 1954 are all apart from each other and share an L1 cache 1970.  Therefore Laine teaches or suggests the limitation argued by Applicant.
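For reference, the matrix multiply and accumulate operation quoted from Laine at [0344] can be illustrated in software.  The following Python is only a minimal sketch of D=A×B+C on 4×4 matrices; the function name mma_4x4 and the list-of-rows representation are assumptions made for illustration and appear neither in Laine nor in the instant application.

```python
def mma_4x4(a, b, c):
    """Return D = A x B + C for 4x4 matrices given as lists of rows."""
    n = 4
    d = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = c[i][j]                  # start from the accumulator input C
            for k in range(n):
                acc += a[i][k] * b[k][j]   # add the row-by-column products
            d[i][j] = acc
    return d

# Hypothetical inputs: identity x ones + ones gives a matrix of twos.
identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
ones = [[1] * 4 for _ in range(4)]
result = mma_4x4(identity, ones, ones)
```

The sketch says nothing about how the cited hardware arranges its cores; it only shows the arithmetic that a single multiply and accumulate step performs.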
Based on the above reasoning, the Examiner maintains the 35 U.S.C. §103 rejection of Claims 1/8.  Applicant’s arguments regarding the dependent claims are based on their dependency on Claims 1/8 (p.7 lines 3-5).  Therefore the same reasoning applies to the dependent claims.
	
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.


In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 2-13 are rejected under 35 U.S.C. 103 as being unpatentable over Laine et al. (US 2020/0051314 A1) in view of Fuetterling et al. (“Accelerated Single Ray Tracing for Wide Vector Units”, HPG’17 July 28-30, 2017).
Regarding Claim 2, Laine discloses a graphics processing unit ([0065]:  the technology herein provides a generic capability to determine, for a thread running in a GPU …), comprising: 
a plurality of multi-core groups ([0311]: In an embodiment, each GPC 1750 includes … one or more Data Processing Clusters (DPCs) 1820.  Notice the DPC 1820 includes a plurality of SM 1840 in Fig.20), wherein a multi-core group comprises: 
a plurality of graphics cores to process one or more shader programs ([0337]-[0338]: The tasks are allocated to a particular DPC 1820 within a GPC 1750 and, if the task is associated with a shader program, the task may be allocated to an SM 1840. The scheduler unit 1910 receives the tasks from the work distribution unit 1725 and manages instruction scheduling for one or more thread blocks assigned to the SM 1840. The scheduler unit 1910 schedules thread blocks for execution as warps of parallel threads, where each thread block is allocated at least one warp. In an embodiment, each warp executes 32 threads. The scheduler unit 1910 may manage a plurality of different thread blocks, allocating the warps to the different thread blocks and then dispatching instructions from the plurality of different cooperative groups to the various functional units (i.e., cores 1950, SFUs 1952, and LSUs 1954) during each clock cycle); 
a plurality of tensor cores ([0343]: In an embodiment, the cores 1950 include 64 single-precision (32-bit) floating point cores, 64 integer cores, 32 double-precision (64-bit) floating point cores, and 8 tensor cores), apart from the plurality of graphics cores (Laine discloses that there are one or more processing cores 1950.  Therefore it would have been obvious to a POSITA before the effective filing date of the claimed invention to dedicate one of the processing cores 1950 to perform shader programs and another one of the processing cores 1950 to perform matrix operations by removing some of the floating point, integer, and double-precision floating point cores.  The motivation is to simplify and optimize logic and computation according to different operation requirements.  As shown in Fig.22, the processing cores 1950, SFUs 1952 and LSUs 1954 are all apart from each other and share an L1 cache 1970), to perform matrix operations ([0344]: Tensor cores are configured to perform matrix operation) including matrix multiplication operations for neural network training and inferencing ([0344]: In particular, the tensor cores are configured to perform deep learning matrix arithmetic, such as convolution operations for neural network training and inferencing. In an embodiment, each tensor core operates on a 4×4 matrix and performs a matrix multiply and accumulate operation D=A×B+C, where A, B, C, and D are 4×4 matrices); 
one or more ray tracing cores, apart from the plurality of graphics cores and the plurality of tensor cores (Laine discloses that there are one or more processing cores 1950, SFUs, and LSUs.  Therefore it would have been obvious to a POSITA before the effective filing date of the claimed invention to dedicate one of the processing cores 1950 to perform shader programs, to dedicate another one of the processing cores 1950 to perform matrix operations by removing some of the floating point, integer, and double-precision floating point cores, and to dedicate one SFU to perform ray tracing operations.  The motivation is to simplify and optimize logic and computation according to different operation requirements.  As shown in Fig.22, the processing cores 1950, SFUs 1952 and LSUs 1954 are all apart from each other and share an L1 cache 1970), to perform ray tracing ([0082]: Ray tracing performed by SMs 132.  [0155]: FIG. 10A shows an exemplary ray tracing shading pipeline 900 that may be performed by SM 132 and accelerated by TTU 700.  Note Laine teaches ray tracing is a shader program and performed by SM.  Laine further teaches that the SM is the core to which a shader program task is allocated, as cited in [0337]-[0338] above, and further teaches Each SM 1840 also comprises M SFUs 1952 that perform special functions (e.g., attribute evaluation, reciprocal square root, and the like). In an embodiment, the SFUs 1952 may include a tree traversal unit configured to traverse a hierarchical tree data structure.  Note Laine teaches that traversing an acceleration data structure is required for real time ray tracing, see [0008]: a hardware-based traversal coprocessor that efficiently traverses an acceleration data structure e.g., for real time ray tracing) operations; 
a cache shared among the plurality of graphics cores, the plurality of tensor cores, and one or more ray tracing cores (Fig.22: Shared Memory/L1 Cache 1970.  [0350]: The TTU 700 may communicate with the SMs 1840 via a TTU input/output block in memory input-output and with a L1 cache via a dedicated read interface); and 
a set of register files (Fig.22: Register File 1920) to store operand values ([0342]: The register file 1920 provides temporary storage for operands connected to the data paths of the functional units).
Laine teaches As shown in FIG. 17, the PPU 1700 includes … one or more general processing clusters (GPCs) 1750 ([00253] Fig.17).  A GPC includes one or more Data Processing Clusters (DPCs) ([0271] and Fig.19), Each DPC 1820 included in the GPC 1750 includes an M-Pipe Controller (MPC) 1830, a primitive engine 1835, one or more SMs 1840, one or more Texture Units 1842, and one or more TTUs 700 ([0273] and Fig.19) and The TTU 700 receives queries from one or more SMs 132 to perform tree traversal operations ([0119]).  Laine discloses In an embodiment, the PPU 1700 is a multi-threaded processor that is implemented on one or more integrated circuit devices ([0249]) and Each SM 1840 is multi-threaded and configured to execute a plurality of threads (e.g., 32 threads comprising a warp) from a particular group of threads concurrently ([0294]).  Laine further discloses TTU 700 receives ray information and a BVH (or a portion of a BVH) for intersection testing from SM 132. The instruction that triggers TTU 700 to perform the accelerated intersection detection (“TTU query”) may require many operands to specify the ray information and the BVH information to TTU 700. … The corresponding result output by TTU 700 includes at least an identifier for each intersected primitive/item and a t-value (e.g., current length of the ray) … ([0154]) and Each SM 1840 also comprises M SFUs 1952 that perform special functions (e.g., attribute evaluation, reciprocal square root, and the like). In an embodiment, the SFUs 1952 may include a tree traversal unit configured to traverse a hierarchical tree data structure ([0304]).  Therefore Laine discloses wherein execution circuitry of at least one of the graphics cores, tensor cores, and ray tracing cores (Fig.21) is to execute a first instruction including a first operand specifying values associated with a plurality of threads ([0300]: Each task may comprise one or more groups of related threads, referred to herein as a warp. In an embodiment, a warp comprises 32 related threads that may be executed in parallel) to perform the operation of: returning a .
But Laine fails to explicitly disclose the returned value is a minimum value and the set of threads selected from the plurality of threads based on a mask.
However, Fuetterling, in the same field of endeavor, discloses that it was already known before the effective filing date of the claimed invention to use a mask to identify valid nodes in a child cluster or the number of primitive clusters within a leaf (p.4 right column third paragraph lines 8-9) and to compress the active elements into a continuous array (p.4 left column line 12).  Fuetterling further discloses computing the minimum and maximum distances of the slab test (p.4 right column second para lines 3-4) in order to retrieve the active mask (p.5 left column third para lines 1-3) and to apply the mask for compressing the valid nodes into a continuous array (p.3 Fig.2 explanation last two lines). Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teaching of Fuetterling into that of Laine and to add the limitation of wherein execution circuitry of at least one of the graphics cores, tensor cores, and ray tracing cores is to execute a first instruction including a first operand specifying values associated with a plurality of threads to perform the operation of: returning a minimum value from values associated with a set of threads, the set of threads selected from the plurality of threads based on a mask in order to obtain performance gains as taught by Fuetterling (Abstract last two lines).
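The combined limitation can be illustrated with a short software sketch.  The Python below models returning a minimum from per-thread values where a bit mask selects which threads participate.  The name masked_min, the use of a Python integer as the mask, and the 8-thread example values are all assumptions made for illustration; the code is not quoted from Laine, Fuetterling, or the instant application.

```python
def masked_min(values, mask):
    """Return the minimum of values[i] over threads i whose mask bit is 1.

    Returns None when the mask selects no thread.
    """
    selected = [v for i, v in enumerate(values) if (mask >> i) & 1]
    return min(selected) if selected else None

# Hypothetical per-thread values for an 8-thread group (thread 0 first).
thread_values = [7, 3, 9, 1, 4, 8, 2, 6]
mask = 0b00100101                          # threads 0, 2 and 5 are selected
lowest = masked_min(thread_values, mask)   # min(7, 9, 8) = 7
```

Threads excluded by the mask (for example thread 3, which holds the overall smallest value 1) do not affect the returned value.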

Regarding Claim 3, Fuetterling discloses wherein the mask comprises one bit associated with each thread, wherein a first bit value indicates that a corresponding thread is included in the set of threads, and a second bit value indicates that the thread is not included in the set of threads (p.5 left column 4th para lines 6-8: The stack pointer is incremented according to the number of set bits in the active mask.  A skilled person would have recognized that a “set” is one value and “unset” is a second value).  The same reason to combine as taught in Claim 2 is incorporated herein.
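As a hedged illustration of the one-bit-per-thread reading above, the sketch below decodes a mask into the threads it includes and counts the set bits, which is the quantity by which the cited passage says the stack pointer is incremented.  The function names and the 8-bit example mask are hypothetical and are not taken from Fuetterling.

```python
def included_threads(mask, num_threads):
    """List the thread indices whose mask bit is set (the first bit value)."""
    return [i for i in range(num_threads) if (mask >> i) & 1]

def count_set_bits(mask):
    """Return the number of set bits in a non-negative mask."""
    return bin(mask).count("1")

active_mask = 0b10010110
members = included_threads(active_mask, 8)   # [1, 2, 4, 7]
set_bits = count_set_bits(active_mask)       # 4
```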

Regarding Claim 4, Laine discloses wherein each of the values associated with the plurality of threads comprises an integer ([0300]: In an embodiment, a warp comprises 32 related threads that may be executed in parallel.  “32” is an integer).

Regarding Claim 5, Laine discloses wherein the set of threads are synchronized ([0340]: Cooperative Groups enables programmers to define groups of threads explicitly at sub-block (i.e., as small as a single thread) and multi-block granularities, and to perform collective operations such as synchronization on the threads in a cooperative group.  [0285]: The TTU structure described above can be implemented in, or in association with, an example non-limiting parallel processing system architecture such as that described below in relation to FIGS. 18-25).

Regarding Claim 6, Fuetterling further discloses wherein the execution circuitry of at least one of the graphics cores, tensor cores, and ray tracing cores to execute a second instruction including a second operand specifying the values associated with the plurality of threads to perform the operation of: returning a maximum value from the values associated with the set of threads, the set of threads selected from the plurality of threads based on the mask (p.4 right column second para lines 3-4: computing the minimum and maximum distance of the slab test).  The same reason to combine as taught in Claim 2 is incorporated herein.

Regarding Claim 7, Laine discloses wherein performing the ray tracing operations comprises: generating rays for traversal through a graphics scene ([0018]:  ray tracing simulates the physics of light by modeling light transport through the scene to compute all global effects (including for example reflections from shiny surfaces) using ray optics); constructing a hierarchical acceleration data structure comprising a plurality of hierarchically arranged nodes (Fig.8A/B and [0128]: FIGS. 8A and 8B show a recursively-subdivided bounding volume of a 3D scene (FIG. 8A) and a corresponding tree data structure (FIG. 8B) that may be accessed by the traversal coprocessor 138 and used for hardware-accelerated operations performed by traversal coprocessor. The division of the bounding volumes may be represented in a hierarchical tree data structure with the large bounding volume shown in FIG. 2B represented by a parent node of the tree and the smaller bounding volumes represented by children nodes of the tree that are contained by the parent node); and traversing one or more of the rays through the hierarchical acceleration data structure and intersecting the one or more rays with primitives contained within the hierarchically arranged nodes (Fig.9 and [0134]: Tree traversal operations may include, for example, determining whether a ray intersects bounding volumes and/or primitives of a tree data structure (e.g., a BVH tree), which tests may involve transforming the ray into object space).
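The traversal recited above can be sketched in software as a stack-based walk over a bounding volume hierarchy.  The Python below is an illustrative sketch only: the Node class, the axis-aligned boxes, the slab_test helper, and the example scene are assumptions, and the final ray/primitive intersection test is omitted, so the function returns only the candidate primitives held in leaves whose boxes the ray enters.  It does not represent the hardware traversal described in Laine.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    lo: tuple                                         # minimum corner of the box
    hi: tuple                                         # maximum corner of the box
    children: list = field(default_factory=list)      # child nodes, if any
    primitives: list = field(default_factory=list)    # leaf payload (labels only)

def slab_test(origin, direction, lo, hi):
    """Return (t_min, t_max) of the ray/box overlap, or None on a miss."""
    t_min, t_max = 0.0, float("inf")
    for o, d, l, h in zip(origin, direction, lo, hi):
        if d == 0.0:
            if o < l or o > h:
                return None          # parallel to this slab and outside it
            continue
        t0, t1 = (l - o) / d, (h - o) / d
        if t0 > t1:
            t0, t1 = t1, t0          # order the entry and exit distances
        t_min, t_max = max(t_min, t0), min(t_max, t1)
        if t_min > t_max:
            return None              # the slab intervals do not overlap
    return t_min, t_max

def traverse(root, origin, direction):
    """Collect the primitives of every node whose box the ray enters."""
    hits, stack = [], [root]
    while stack:
        node = stack.pop()
        if slab_test(origin, direction, node.lo, node.hi) is None:
            continue                 # prune this subtree
        hits.extend(node.primitives)
        stack.extend(node.children)
    return hits

# Hypothetical two-leaf scene; the ray travels along +z through the left box.
left = Node((0, 0, 0), (1, 1, 1), primitives=["tri_a"])
right = Node((2, 0, 0), (3, 1, 1), primitives=["tri_b"])
root = Node((0, 0, 0), (3, 1, 1), children=[left, right])
candidates = traverse(root, (0.5, 0.5, -1.0), (0.0, 0.0, 1.0))
```

The slab test here also yields the minimum and maximum distances along the ray, the same quantities referred to in the Fuetterling passages cited for Claims 2 and 6.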

Regarding Claims 8 and 12, Claims 8 and 12 are similar in scope to Claims 1 and 6, except that Claim 8 recites returning a maximum value and Claim 12 recites returning a minimum value, whereas Claim 1 recites returning a minimum value and Claim 6 recites returning a maximum value.  Therefore the rejections of Claims 1 and 6 are also applied to Claims 8 and 12.

Regarding Claims 9-11 and 13, Claims 9-11 and 13 are similar in scope to Claims 3-5 and 7.  Therefore the rejections of Claims 3-5 and 7 are also applied to Claims 9-11 and 13.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to YINGCHUN HE whose telephone number is (571)270-7218. The examiner can normally be reached M-F 8:00-5:00 MT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Xiao M Wu can be reached on 571-272-7761. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/YINGCHUN HE/Primary Examiner, Art Unit 2613