DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This office correspondence is in response to application number 16/591353, filed on October 2, 2019. Claims 1 – 20 are pending.
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
Such claim limitations that are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, are recited in claim 20:
means for allocating, based at least in part on a texture processor to shading processor arithmetic logic unit (ALU) resource ratio . . . 
means for processing the portion of input activation data . . . 
The specification outlines specific structure for performing the functions recited in the limitations (see Fig. 1; pages 6 – 14, ¶¶ [0029-0049]), and that structure is used for the claim analysis.
35 USC § 101 Analysis
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title. 

Claims 1 – 20 are directed to statutory subject matter. The claims are directed to non-abstract improvements in computer-related technology. A claim is non-statutory when it is directed to a judicial exception (e.g., a mathematical concept, a mental process, or a certain method of organizing human activity) without significantly more. The claimed invention is not directed to a judicial exception. Instead, the claimed invention is directed to a technological improvement to resource allocation for machine learning workloads on a device that includes a GPU. The device provides workload balancing by allocating, based at least in part on a texture processor to shading processor arithmetic logic unit (ALU) resource ratio, a first set of one or more weight batches associated with a portion of input activation data to the texture processor and a second set of one or more weight batches associated with the portion of input activation data to the shading processor, so that the portion of input activation data is processed based at least in part on the first set of one or more weight batches and the second set of one or more weight batches using the texture processor and the shading processor in parallel. The ordered combination of limitations in the claimed invention improves the efficiency of processing a machine learning workload by optimally balancing the resources used for processing the workload. As such, the claimed invention is statutory.
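For illustration only, the ratio-based allocation summarized above can be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's disclosed implementation and not a teaching of any cited reference: the function name `split_weight_batches`, its parameters, and the round-half-up split rule are all hypothetical.

```python
def split_weight_batches(weight_batches, texture_alus, shading_alus):
    """Split weight batches between two processors in proportion to ALU counts.

    Hypothetical helper: the names and the rounding rule are assumptions made
    for illustration; they do not come from the application or the prior art.
    """
    if texture_alus < 0 or shading_alus < 0 or texture_alus + shading_alus == 0:
        raise ValueError("ALU counts must be non-negative and not both zero")
    total_alus = texture_alus + shading_alus
    # Integer round-half-up of len * texture_alus / total_alus.
    n_texture = (len(weight_batches) * texture_alus + total_alus // 2) // total_alus
    first_set = weight_batches[:n_texture]    # assumed to go to the texture processor
    second_set = weight_batches[n_texture:]   # assumed to go to the shading processor
    return first_set, second_set


if __name__ == "__main__":
    # Eight batches with an assumed 1:3 texture-to-shading ALU ratio.
    texture_set, shading_set = split_weight_batches(list(range(8)), 1, 3)
    print(texture_set, shading_set)
```

In this sketch every batch is assigned to exactly one of the two sets, so the two sets together always cover the full list of weight batches.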
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1 – 2, 13 – 14, and 20 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Bleiweiss (U.S. 2019/0205737 A1; herein referred to as Bleiweiss).
In regard to claim 1, Bleiweiss  teaches a method for workload balancing for machine learning, comprising (see abstract “ . . . An apparatus to facilitate acceleration of machine learning operations is disclosed. The apparatus comprises at least one processor to perform operations to implement a neural network and accelerator logic to perform communicatively coupled to the processor to perform compute operations for the neural network . . .”): 
allocating, based at least in part on a texture processor to shading processor arithmetic logic unit (ALU) resource ratio (e.g. a sample state) (see ¶¶ [0085-0086] “ . . . the additional fixed function logic 516 can also include machine-learning acceleration logic, such as fixed function matrix multiplication logic, for implementations including optimizations for machine learning training or inferencing.  Within each graphics sub-core 501A-501F includes a set of execution resources that may be used to perform graphics, media, and compute operations in response to requests by graphics pipeline, media pipeline, or shader programs. The graphics sub-cores 501A-501F include multiple EU arrays 502A-502F, 504A-504F, thread dispatch and inter-thread communication (TD/IC) logic 503A-503F, a 3D (e.g., texture) sampler 505A-505F, a media sampler 506A-506F, a shader processor 507A-507F, and shared local memory (SLM) 508A-508F. The EU arrays 502A-502F, 504A-504F each include multiple execution units, which are general-purpose graphics processing units capable of performing floating-point and integer/fixed-point logic operations in service of a graphics, media, or compute operation, including graphics, media, or compute shader programs. The TD/IC logic 503A-503F performs local thread dispatch and thread control operations for the execution units within a sub-core and facilitate communication between threads executing on the execution units of the sub-core. The 3D sampler 505A-505F can read texture or other 3D graphics related data into memory. The 3D sampler can read texture data differently based on a configured sample state and the texture format associated with a given texture. The media sampler 506A-506F can perform similar read operations based on the type and format associated with media data. In one embodiment, each graphics sub-core 501A-501F can alternately include a unified 3D and media sampler. 
Threads executing on the execution units within each of the sub-cores 501A-501F can make use of shared local memory 508A-508F within each sub-core, to enable threads executing within a thread group to execute using a common pool of on-chip memory. . . “), a first set of one or more weight batches (sampled texture data) associated with a portion of input activation data (e.g. thread initiation requests) to the texture processor (see ¶¶ [0094-0095] “ . . . one or more data caches (e.g., 612) are included to cache thread data during thread execution. In some embodiments, a sampler 610 is included to provide texture sampling for 3D operations and media sampling for media operations. In some embodiments, sampler 610 includes specialized texture or media sampling functionality to process texture or media data during the sampling process before providing the sampled data to an execution unit.  During execution, the graphics and media pipelines send thread initiation requests to thread execution logic 600 via thread spawning and dispatch logic. Once a group of geometric objects has been processed and rasterized into pixel data, pixel processor logic (e.g., pixel shader logic, fragment shader logic, etc.) within the shader processor 602 is invoked to further compute output information and cause results to be written to output surfaces (e.g., color buffers, depth buffers, stencil buffers, etc.). In some embodiments, a pixel shader or fragment shader calculates the values of the various vertex attributes that are to be interpolated across the rasterized object. In some embodiments, pixel processor logic within the shader processor 602 then executes an application programming interface (API)-supplied pixel or fragment shader program. To execute the shader program, the shader processor 602 dispatches threads to an execution unit (e.g., 608A) via thread dispatcher 604. 
In some embodiments, shader processor 602 uses texture sampling logic in the sampler 610 to access texture data in texture maps stored in memory. Arithmetic operations on the texture data and the input geometry data compute pixel color data for each geometric fragment, or discards one or more pixels from further processing . . .”) and a second set of one or more weight batches (e.g. a set of detailed geometric objects based on a coarse geometric model) associated with the portion of input activation data  (e.g. dispatched threads) to the shading processor  (see ¶¶ [0114-0115] “ . . . execution units 852A-852B are an array of vector processors having an instruction set for performing graphics and media operations. In some embodiments, execution units 852A-852B have an attached L1 cache 851 that is specific for each array or shared between the arrays. The cache can be configured as a data cache, an instruction cache, or a single cache that is partitioned to contain data and instructions in different partitions. In some embodiments, geometry pipeline 820 includes tessellation components to perform hardware-accelerated tessellation of 3D objects. In some embodiments, a programmable hull shader 811 configures the tessellation operations. A programmable domain shader 817 provides back-end evaluation of tessellation output. A tessellator 813 operates at the direction of hull shader 811 and contains special purpose logic to generate a set of detailed geometric objects based on a coarse geometric model that is provided as input to geometry pipeline 820. In some embodiments, if tessellation is not used, tessellation components (e.g., hull shader 811, tessellator 813, and domain shader 817) can be bypassed.  In some embodiments, complete geometric objects can be processed by a geometry shader 819 via one or more threads dispatched to execution units 852A-852B, or can proceed directly to the clipper 829. 
In some embodiments, the geometry shader operates on entire geometric objects, rather than vertices or patches of vertices as in previous stages of the graphics pipeline. If the tessellation is disabled the geometry shader 819 receives input from the vertex shader 807. In some embodiments, geometry shader 819 is programmable by a geometry shader program to perform geometry tessellation if the tessellation units are disabled . . .”); and 
processing the portion of input activation data based at least in part on the first set of one or more weight batches and the second set of one or more weight batches using the texture processor and the shading processor in parallel (see  ¶ [0003] “ . . . To further increase performance, graphics processors typically implement processing techniques such as pipelining that attempt to process, in parallel, as much graphics data as possible throughout the different parts of the graphics pipeline. Parallel graphics processors with single instruction, multiple thread (SIMT) architectures are designed to maximize the amount of parallel processing in the graphics pipeline. In an SIMT architecture, groups of parallel threads attempt to execute program instructions synchronously together as often as possible to increase processing efficiency . . . “ see  ¶ [0260] “ . . . the processing cluster array 3312 is configured to perform parallel graphics processing operations. In embodiments in which the parallel processor 3300 is configured to perform graphics processing operations, the processing cluster array 3312 can include additional logic to support the execution of such graphics processing operations, including, but not limited to texture sampling logic to perform texture operations, as well as tessellation logic and other vertex processing logic. Additionally, the processing cluster array 3312 can be configured to execute graphics processing related shader programs such as, but not limited to vertex shaders, tessellation shaders, geometry shaders, and pixel shaders. The parallel processing unit 3302 can transfer data from system memory via the I/O unit 3304 for processing. During processing the transferred data can be stored to on-chip memory (e.g., parallel processor memory 3322) during processing, then written back to system memory . . . “).
In regard to claim 2, Bleiweiss  teaches identifying, based at least in part on a size of a level one cache of the texture processor (see  ¶ [0114] “ . . . execution units 852A-852B have an attached L1 cache 851 that is specific for each array or shared between the arrays . . .”), the portion of input activation data (see  ¶ [0114] “ . . . The cache can be configured as a data cache, an instruction cache, or a single cache that is partitioned to contain data and instructions in different partitions. In some embodiments, geometry pipeline 820 includes tessellation components to perform hardware-accelerated tessellation of 3D objects. In some embodiments, a programmable hull shader 811 configures the tessellation operations. A programmable domain shader 817 provides back-end evaluation of tessellation output. A tessellator 813 operates at the direction of hull shader 811 and contains special purpose logic to generate a set of detailed geometric objects based on a coarse geometric model that is provided as input to geometry pipeline 820 . . .”) for an iterative machine-learning process (see  ¶ [0161] “ . . . FIG. 15 is a generalized diagram of a machine learning software stack 1500. A machine learning application 1502 can be configured to train a neural network using a training dataset or to use a trained deep neural network to implement machine intelligence. The machine learning application 1502 can include training and inference functionality for a neural network and/or specialized software that can be used to train a neural network before deployment. The machine learning application 1502 can implement any type of machine intelligence including but not limited to image recognition, mapping and localization, autonomous navigation, speech synthesis, medical imaging, or language translation . . .”); and 
loading the portion of input activation data into the level one cache of the texture processor based at least in part on the identifying (see ¶ [0273] “ . . . The instructions transmitted to the processing cluster 3314 constitutes a thread. A set of threads executing across the set of parallel processing engines is a thread group. A thread group executes the same program on different input data. Each thread within a thread group can be assigned to a different processing engine within a graphics multiprocessor 3334. A thread group may include fewer threads than the number of processing engines within the graphics multiprocessor 3334. When a thread group includes fewer threads than the number of processing engines, one or more of the processing engines may be idle during cycles in which that thread group is being processed. A thread group may also include more threads than the number of processing engines within the graphics multiprocessor 3334. When the thread group includes more threads than the number of processing engines within the graphics multiprocessor 3334, processing can be performed over consecutive clock cycles. In one embodiment multiple thread groups can be executed concurrently on a graphics multiprocessor 3334. In one embodiment the graphics multiprocessor 3334 includes an internal cache memory to perform load and store operations. In one embodiment, the graphics multiprocessor 3334 can forego an internal cache and use a cache memory (e.g., L1 cache 3348) within the processing cluster 3314. . . Embodiments in which the processing cluster 3314 includes multiple instances of the graphics multiprocessor 3334 can share common instructions and data, which may be stored in the L1 cache 3348 . . .”).
In regard to claim 13, Bleiweiss  teaches an apparatus for workload balancing for machine learning, comprising (see abstract “ . . . An apparatus to facilitate acceleration of machine learning operations is disclosed. The apparatus comprises at least one processor to perform operations to implement a neural network and accelerator logic to perform communicatively coupled to the processor to perform compute operations for the neural network . . .”): 
a processor (see ¶ [0006] “ . . . FIG. 2 is a block diagram of an embodiment of a processor having one or more processor cores, an integrated memory controller, and an integrated graphics processor. . . .”), 
memory coupled with the processor (see ¶ [0054] “ . . . The memory device 120 can be a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, flash memory device, phase-change memory device, or some other memory device having suitable performance to serve as process memory. . .”); and 
instructions stored in the memory and executable by the processor to cause the apparatus to (see ¶ [0054] “ . . . the memory device 120 can operate as system memory for the system 100, to store data 122 and instructions 121 for use when the one or more processors 102 executes an application or process. Memory controller 116 also couples with an optional external graphics processor 112, which may communicate with the one or more graphics processors 108 in processors 102 to perform graphics and media operations. . . “):
allocate, based at least in part on a texture processor to shading processor arithmetic logic unit (ALU) resource ratio (e.g. a sample state) (see ¶¶ [0085-0086] as described for the rejection of claim 1 and is incorporated herein), a first set of one or more weight batches (sampled texture data) associated with a portion of input activation data (e.g. thread initiation requests) to the texture processor (see ¶¶ [0094-0095] as described for the rejection of claim 1 and is incorporated herein) and 
a second set of one or more weight batches (e.g. a set of detailed geometric objects based on a coarse geometric model) associated with the portion of input activation data  (e.g. dispatched threads) to the shading processor (see ¶¶ [0114-0115] as described for the rejection of claim 1 and is incorporated herein); and process the portion of input activation data based at least in part on the first set of one or more weight batches and the second set of one or more weight batches using the texture processor and the shading processor in parallel (see ¶ [0003], ¶ [0260] as described for the rejection of claim 1 and is incorporated herein).
In regard to claim 14, Bleiweiss teaches identify, based at least in part on a size of a level one cache of the texture processor (see ¶ [0114] as described for the rejection of claim 2 and is incorporated herein), the portion of input activation data (see ¶ [0114] as described for the rejection of claim 2 and is incorporated herein) for an iterative machine-learning process (see ¶ [0161] as described for the rejection of claim 2 and is incorporated herein); and 
load the portion of input activation data into the level one cache of the texture processor based at least in part on the identifying (see ¶ [0273] as described for the rejection of claim 2 and is incorporated herein).
In regard to claim 20, Bleiweiss  teaches an apparatus for workload balancing for machine learning, comprising (see abstract as described for the rejection of claim 1 and is incorporated herein): 
means for allocating, based at least in part on a texture processor to shading processor arithmetic logic unit (ALU) resource ratio (e.g. a sample state) (see ¶¶ [0085-0086] as described for the rejection of claim 1 and is incorporated herein), a first set of one or more weight batches (sampled texture data) associated with a portion of input activation data (e.g. thread initiation requests) to the texture processor (see ¶¶ [0094-0095] as described for the rejection of claim 1 and is incorporated herein) and a second set of one or more weight batches (e.g. a set of detailed geometric objects based on a coarse geometric model) associated with the portion of input activation data (e.g. dispatched threads) to the shading processor (see ¶¶ [0114-0115] as described for the rejection of claim 1 and is incorporated herein); and 
means for processing the portion of input activation data based at least in part on the first set of one or more weight batches and the second set of one or more weight batches using the texture processor and the shading processor in parallel (see ¶ [0003], ¶ [0260] as described for the rejection of claim 1 and is incorporated herein).



Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.


Claims 3 – 6 and 15 – 18 are rejected under 35 U.S.C. 103 as being unpatentable over Bleiweiss (U.S. 2019/0205737 A1; herein referred to as Bleiweiss) as applied to claims 1 – 2, 13 – 14, and 20, in view of Barik et al. (U.S. 2018/0307980 A1; herein referred to as Barik).
In regard to claim 3, Bleiweiss  teaches wherein processing the portion of input activation data (see ¶¶ [0094-0095], ¶¶ [0114-0115] as described for the rejection of claim 1 and is incorporated herein).
Bleiweiss fails to explicitly teach that the processing further comprises: performing one or more filtering operations on the portion of input activation data, using the first set of one or more weight batches and the second set of one or more weight batches. However, Barik teaches performing one or more filtering operations on the portion of input activation data (see ¶ [0056] “ . . . the processing cluster array 212 is configured to perform general-purpose parallel compute operations. For example, the processing cluster array 212 can include logic to execute processing tasks including filtering of video and/or audio data, performing modeling operations, including physics operations, and performing data transformations . . .”), using the first set of one or more weight batches (e.g. the texture data) (see ¶ [0073] “ . . . In graphics and computing applications, a processing cluster 214 may be configured such that each graphics multiprocessor 234 is coupled to a texture unit 236 for performing texture mapping operations, e.g., determining texture sample positions, reading texture data, and filtering the texture data. Texture data is read from an internal texture L1 cache (not shown) or in some embodiments from the L1 cache within graphics multiprocessor 234 and is fetched from an L2 cache, local parallel processor memory, or system memory, as needed . . .”) and the second set of one or more weight batches (see ¶ [0058] “ . . . when the parallel processing unit 202 is used to perform graphics processing, the scheduler 210 can be configured to divide the processing workload into approximately equal sized tasks, to better enable distribution of the graphics processing operations to multiple clusters 214A-214N of the processing cluster array 212. In some embodiments, portions of the processing cluster array 212 can be configured to perform different types of processing. 
For example a first portion may be configured to perform vertex shading and topology generation, a second portion may be configured to perform tessellation and geometry shading, and a third portion may be configured to perform pixel shading or other screen space operations, to produce a rendered image for display. Intermediate data produced by one or more of the clusters 214A-214N may be stored in buffers to allow the intermediate data to be transmitted between clusters 214A-214N for further processing. . . “).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the applicant’s application to incorporate a system and method to perform machine learning operations, using an apparatus comprising a decode unit to decode a single instruction into a decoded instruction, the decoded instruction to perform one or more machine learning operations, wherein the decode unit, based on parameters of the one or more machine learning operations, is to request a scheduler to schedule the one or more machine learning operations to one of an array of programmable compute units and a fixed function compute unit, as taught by Barik, into a system and method to perform operations to implement a neural network and accelerator logic to perform compute operations for the neural network using parallel processing, as taught by Bleiweiss. Such incorporation yields a parallel processing device for machine learning data whose filtering operations can enable efficiencies between the texture processing and the shading processing.
In regard to claim 4, the combination of Bleiweiss and Barik teaches wherein each of the one or more filtering operations further comprises a multiply-accumulate operation (see Barik ¶ [0206] “ . . . each block of the fixed function matrix multiplication logic 1900 includes an input element row element 1911, an input column element 1915, a multiply-accumulate logic unit 1912, and an output elements 1913. Multiple rows of input column data are provided via a column buffer 1914. Output data for a row can be stored temporarily in a row output buffer 1916 during output to an output matrix. The illustrated logic can be configured to perform a matrix multiplication between two multi-dimensional matrices. Input data can be shifted into and out of the row and column input elements before the data within those elements are processed by the multiply-accumulate logic units. Output data can then be shifted out via the output data elements. An N×N logic unit is illustrated, where the specific dimension of the fixed function matrix multiplication logic 1900 varying across embodiments. In one embodiment the fixed function matrix multiplication logic 1900 is sized to perform matrix operations for specific matrix sizes that would not be efficient to process using programmable logic units, such as the matrix operations associated with 5×5 or 7×7 convolutions . . “), wherein a multiplication aspect of the multiply-accumulate operation comprises multiplying a first batch of the first set of one or more weight batches (see Barik ¶ [0207] “ . . . FIG. 20 illustrates exemplary multiply-add logic 2001 within an embodiment. While fused multiply-add is generally described with respect to floating point operations, the combined multiply and add operations are not limited to floating point operations and logic can be configured to perform integer and/or fixed point operations. 
The multiply-add operations can execute on multiple data elements in the same number of clock cycles as a single multiply on unpacked data. The multiply-add logic accepts multiple inputs including Source1[63:0] 2031, Source2[63:0] 2033, and Enable 2080. Operation control 2002 processes an input control signals for the multiply-add logic 2001 and provides the enable 2080 input to activate the multiply-add logic 2011. The multiply-add logic 2001 includes four 16×16 multiplier circuits (e.g., 16×16 multiplier A 2010A, 16×16 multiplier B 2010B, 16×16 multiplier C 2010C, 16×16 multiplier D 2010D). The 32-bit intermediate results generated by 16×16 multiplier A 2010A and 16×16 multiplier B 2010B are received by adder 2020A, while the 32-bit intermediate results generated by 16×16 multiplier C 2010C and 16×16 multiplier D 2010D are received by adder 2020B. The output of adder 2020B (i.e., bits 31 through 0 of the Result) and the output of adder 2020A (i.e., bits 63 through 32 of the Result) are combined into the 64-bit Result and communicated to Result Register 2030. In one embodiment, each of adder 2051 and adder 2050 are composed of four 8-bit adders with the appropriate propagation delays. However, alternative embodiments could implement adder 2020A-2020B in any number of ways (e.g., two 32-bit adders and/or redundant arithmetic compression circuitry). . . ”) or the second set of one or more weight batches with the portion of input activation data (see Barik ¶¶ [0208-0209] “ . . . FIG. 21 illustrates 1×1 convolution on embodiments described herein. Given an input of size H×W with C channels, a 1×1 convolution with K features results in a new image of size H×W for each of the K features. Each of the K features consists of C filters, which for 1×1 convolution are scalar values. The convolution of the input image with a feature is the sum of convolutions of the individual channels with the corresponding filters that constitute the feature. 
In one embodiment 1×1 convolution can be efficiently performed via a macroinstruction to sequence the multiple operations used to perform 1×1 convolution over an input volume. A status weight vector 2104 can store the weights of a 1×1 convolutional filter.  The number of weights that may be stored can be determined based on the precision of each of the individual weight values and the size of the static weight vector. For example and in one embodiment sixteen 8-bit weight values are stored in a weight vector 2104 of 128-bits. In such embodiment the static weight vector 2104 can also store eight 16-bit weight values. Depending upon the SIMD width of the underlying computational logic, a channel data batch 2106 can be selected for input into the SIMD compute units. A dot product between the weight channel data batch and a set of feature data channels 2108 within an input volume 2110 can be performed. The set of operations are automatically sequenced to perform the 1×1 convolution operation across the input volume 2110. . . “).
The motivation to combine Barik with Bleiweiss is described for the rejection of claim 3 and is incorporated herein.  Additionally, Barik provides the functions of multiply-accumulate operations for processing the input data for the machine learning displays.   
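For illustration only, the 1×1 convolution described in the cited Barik ¶¶ [0208-0209] passage can be sketched as a per-location multiply-accumulate over the C channels. The following is a minimal Python sketch under stated assumptions: the function name conv1x1 and the nested-list data layout are hypothetical and are not taken from Barik or from the application.

```python
def conv1x1(image, features):
    """Illustrative 1x1 convolution (hypothetical helper, not Barik's logic).

    image:    H x W x C nested lists (C channel values per location).
    features: K x C nested lists (C scalar weights per feature).
    Returns an H x W x K nested list.
    """
    output = []
    for row in image:
        out_row = []
        for pixel in row:
            # For each feature, multiply every channel by its scalar weight
            # and accumulate the products (a multiply-accumulate over C).
            out_row.append(
                [sum(w * x for w, x in zip(feature, pixel)) for feature in features]
            )
        output.append(out_row)
    return output


# Usage: H=1, W=2, C=2 input with K=3 features.
image = [[[1, 2], [3, 4]]]
features = [[1, 0], [1, 1], [2, -1]]
result = conv1x1(image, features)  # [[[1, 3, 0], [3, 7, 2]]]
```

Each output location holds K values, one per feature, consistent with the quoted statement that a 1×1 convolution with K features yields an H×W result for each feature.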
In regard to claim 5,  Bleiweiss  teaches determining a number of available ALU resources for the texture processor (see ¶ [0071] “ . . . Each graphics core includes a set of graphics execution resources that includes general-purpose and graphics specific execution logic to perform graphics and compute operations, as well as fixed function texture processing and/or machine learning and artificial intelligence acceleration logic. . . “); 
determining a number of available ALU resources for the shading processor (see ¶ [0072] “ . . . the 3D pipeline 312 includes fixed function and programmable logic to process one or more shader programs, such as vertex shaders, geometry shaders, pixel shaders, fragment shaders, compute shaders, or other shader programs, by processing the instructions and dispatching execution threads to the graphics core array 414. The graphics core array 414 provides a unified block of execution resources for use in processing these shader programs. Multi-purpose execution logic (e.g., execution units) within the graphics core(s) 415A-414B of the graphic core array 414 includes support for various 3D API shader languages and can execute multiple simultaneous execution threads associated with multiple shaders. . . .”); 
determining a total number of available ALU resources comprising the number of available ALU resources for the texture processor  (see ¶ [0283] “ . . .  FIG. 34A shows a graphics multiprocessor 3425 according to an additional embodiment. The graphics multiprocessor 3425 includes multiple additional instances of execution resource units relative to the graphics multiprocessor 3334 of FIG. 33D. For example, the graphics multiprocessor 3425 can include multiple instances of the instruction unit 3432A-3432B, register file 3434A-3434B, and texture unit(s) 3444A-3444B. The graphics multiprocessor 3425 also includes multiple sets of graphics or compute execution units (e.g., GPGPU core 3436A-3436B, GPGPU core 3437A-3437B, GPGPU core 3438A-3438B) and multiple sets of load/store units 3440A-3440B. In one embodiment the execution resource units have a common instruction cache 3430, texture and/or data cache memory 3442, and shared memory 3446. . . “) and the number of available ALU resources for the shading processor   (see ¶ [0285] “ . . . FIG. 34B shows a graphics multiprocessor 3450 according to an additional embodiment. The graphics processor includes multiple sets of execution resources 3456A-3456D, where each set of execution resource includes multiple instruction units, register files, GPGPU cores, and load store units, as illustrated in FIG. 33D and FIG. 34A. The execution resources 3456A-3456D can work in concert with texture unit(s) 3460A-3460D for texture operations, while sharing an instruction cache 3454, and shared memory 3462. In one embodiment the execution resources 3456A-3456D can share an instruction cache 3454 and shared memory 3462, as well as multiple instances of a texture and/or data cache memory 3458A-3458B. The various components can communicate via an interconnect fabric 3452 similar to the interconnect fabric 3427 of FIG. 34A . . .”); 
Bleiweiss  fails to explicitly teach and identifying the texture processor to shading processor ALU resource ratio based at least in part on the number of available ALU resources for the texture processor and the number of available ALU resources for the shading processor.  However Barik teaches and identifying the texture processor to shading processor ALU resource ratio based at least in part on the number of available ALU resources for the texture processor and the number of available ALU resources for the shading processor (see  ¶ [0248] “ . . . graphics processor 2800 includes scalable thread execution resources featuring modular cores 2880A-2880N (sometimes referred to as core slices), each having multiple sub-cores 2850A-550N, 2860A-2860N (sometimes referred to as core sub-slices). In some embodiments, graphics processor 2800 can have any number of graphics cores 2880A through 2880N. In some embodiments, graphics processor 2800 includes a graphics core 2880A having at least a first sub-core 2850A and a second sub-core 2860A. In other embodiments, the graphics processor is a low power processor with a single sub-core (e.g., 2850A). In some embodiments, graphics processor 2800 includes multiple graphics cores 2880A-2880N, each including a set of first sub-cores 2850A-2850N and a set of second sub-cores 2860A-2860N. Each sub-core in the set of first sub-cores 2850A-2850N includes at least a first set of execution units 2852A-2852N and media/texture samplers 2854A-2854N. Each sub-core in the set of second sub-cores 2860A-2860N includes at least a second set of execution units 2862A-2862N and samplers 2864A-2864N. In some embodiments, each sub-core 2850A-2850N, 2860A-2860N shares a set of shared resources 2870A-2870N. In some embodiments, the shared resources include shared cache memory and pixel operation logic. Other shared resources may also be included in the various embodiments of the graphics processor. . . “).
The motivation to combine Barik with Bleiweiss is described for the rejection of claim 3 and is incorporated herein. Additionally Barik provides the functionality to distribute resources amongst the differing GPU processes.
In regard to claim 6, the combination of  Bleiweiss and Barik teaches identifying an accumulation register space available within the shading processor (see Barik  ¶ [0211] “ . . . If all channels of the weight values have been multiplied by the corresponding feature data channels at the location as determined at block 2307, the 1×1 convolution logic 2300 can output the value of the accumulator register as the output feature map value for the location, as shown at block 2312. The logic 2300 can then load the feature data channels 2108 of the next location (e.g., x+1, y), as shown at block 2314 and an additional dot product can be performed starting at block 2306. If the dot products have not been performed for all data values at the location at block 2307, an intermediate value can be stored in an accumulation register and the logic 2300 can select the next subset of values at block 2311 for example, by sliding the weight channel data batch 2106 along the static weight vector 2104 and loading the new weight channel data batch 2106 into input registers for processing by dot product logic using the previously processed set of feature data channels 2108. . . .”), wherein determining the total number of available ALU resources is based at least in part on the accumulation register space (Barik  ¶ [0212] “ . . . FIG. 22 illustrates a circuit 2201 for performing multiply-add operation on packed data vectors, according to an embodiment. The circuit 2201 accepts a first source, (Source1[63:0] 2231) and a second source (Source2[63:0]2232). In one embodiment, the first and second sources are stored in N-bit long SIMD registers within the GPGPU. For two input vectors 2231 and 2232, the multiply-add instruction implemented on such registers would produce result[63:0] 2290, which can be stored to a destination register or added to an accumulation register for a multiply-accumulate operation. 
The illustrated logic example shows an 8-bit byte to 16-bit word embodiment of a multiply-add operation. While packed data sources and destinations are represented as having 64-bits, it will be appreciated that the principals disclosed herein may be extended to other conveniently selected lengths, such as 80-bits, 128-bits or 256-bits . . “).
The motivation to combine Barik with Bleiweiss is described for the rejection of claim 3 and is incorporated herein.  Additionally, Barik provides an accumulation register for determining resources necessary for the machine learning.
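As an illustration of the accumulation described in the cited Barik ¶ [0211] passage, the sketch below steps a fixed-size batch along a weight vector and adds each partial dot product to a running accumulator. It is a minimal Python sketch under stated assumptions: the name batched_dot, the batch_size parameter, and the use of a plain integer in place of an accumulation register are hypothetical and do not describe the hardware of Barik.

```python
def batched_dot(weights, feature_channels, batch_size):
    """Dot product accumulated batch by batch (hypothetical illustration)."""
    if len(weights) != len(feature_channels):
        raise ValueError("weights and feature_channels must have equal length")
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    accumulator = 0  # stands in for the accumulation register
    for start in range(0, len(weights), batch_size):
        # Select the next batch of weights and the matching feature channels.
        w_batch = weights[start:start + batch_size]
        f_batch = feature_channels[start:start + batch_size]
        # Partial dot product for this batch, added to the running total.
        accumulator += sum(w * f for w, f in zip(w_batch, f_batch))
    return accumulator


# Usage: sixteen weights processed four at a time.
weights = list(range(1, 17))
feature_channels = [2] * 16
total = batched_dot(weights, feature_channels, batch_size=4)  # 272
```

The result does not depend on the batch size; only the number of intermediate values held in the accumulator changes.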
In regard to claim 15, Bleiweiss teaches wherein the instructions to process the portion of input activation data (see ¶¶ [0094-0095], ¶¶ [0114-0115] as described for the rejection of claim 1 and is incorporated herein).
Bleiweiss fails to explicitly teach further are executable by the processor to cause the apparatus to: perform one or more filtering operations on the portion of input activation data, using the first set of one or more weight batches and the second set of one or more weight batches.  However, Barik teaches further are executable by the processor to cause the apparatus to: perform one or more filtering operations on the portion of input activation data (see ¶ [0056] as described for the rejection of claim 3 and is incorporated herein), using the first set of one or more weight batches (e.g., the texture data) (see ¶ [0073] as described for the rejection of claim 3 and is incorporated herein) and the second set of one or more weight batches (see ¶ [0058] as described for the rejection of claim 3 and is incorporated herein).
The motivation to combine Barik with Bleiweiss is described for the rejection of claim 3 and is incorporated herein.
In regard to claim 16, the combination of Bleiweiss and Barik teaches wherein each of the one or more filtering operations further comprises a multiply-accumulate operation (see Barik ¶ [0206] as described for the rejection of claim 4 and is incorporated herein), wherein a multiplication aspect of the multiply-accumulate operation comprises multiplying a first batch of the first set of one or more weight batches (see Barik ¶ [0207] as described for the rejection of claim 4 and is incorporated herein) or the second set of one or more weight batches with the portion of input activation data (see Barik ¶¶ [0208-0209] as described for the rejection of claim 4 and is incorporated herein).
The motivation to combine Barik with Bleiweiss is described for the rejection of claim 4 and is incorporated herein.
In regard to claim 17,   Bleiweiss  teaches wherein the instructions are further executable by the processor to cause the apparatus to: determine a number of available ALU resources for the texture processor (see ¶ [0071] as described for the rejection of claim 5 and is incorporated herein); 
determine a number of available ALU resources for the shading processor (see ¶ [0072] as described for the rejection of claim 5 and is incorporated herein); 
determine a total number of available ALU resources comprising the number of available ALU resources for the texture processor (see ¶ [0283] as described for the rejection of claim 5 and is incorporated herein) and the number of available ALU resources for the shading processor (see ¶ [0285] as described for the rejection of claim 5 and is incorporated herein); 
Bleiweiss fails to explicitly teach and identify the texture processor to shading processor ALU resource ratio based at least in part on the number of available ALU resources for the texture processor and the number of available ALU resources for the shading processor.  However, Barik teaches and identify the texture processor to shading processor ALU resource ratio based at least in part on the number of available ALU resources for the texture processor and the number of available ALU resources for the shading processor (see ¶ [0248] as described for the rejection of claim 5 and is incorporated herein).
The motivation to combine Barik with Bleiweiss is described for the rejection of claim 5 and is incorporated herein.
In regard to claim 18, the combination of Bleiweiss and Barik teaches identify an accumulation register space available within the shading processor (see Barik ¶ [0211] as described for the rejection of claim 6 and is incorporated herein), wherein determining the total number of available ALU resources is based at least in part on the accumulation register space (see Barik ¶ [0212] as described for the rejection of claim 6 and is incorporated herein).
The motivation to combine Barik with Bleiweiss is described for the rejection of claim 6 and is incorporated herein.
Claims 7 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Bleiweiss (U.S. 2019/0205737 A1; herein referred to as Bleiweiss) in view of Barik et al. (U.S. 2018/0307980 A1; herein referred to as Barik) as applied to claims 3 – 6 and 15 – 18 in further view of Kopinsky (U.S. 2020/0218978 A1; herein referred to as Kopinsky).  
In regard to claim 7, the combination of Bleiweiss and Barik fails to explicitly teach determining a level two weight batch caching constraint for a second level of an iterative machine-learning process, wherein determining the total number of available ALU resources is based at least in part on the level two weight batch caching constraint.  However Kopinsky teaches determining a level two weight batch caching constraint for a second level of an iterative machine-learning process  (see ¶ [0135] “ . . . the invention may initialize several output vector registers (OR1, . . . , ORC) which will be accumulated into. Embodiments may then loop over inputs, and broadcast a scalar input value to an input vector register IR. Embodiments may performing a vectoral operation such as a multiplication-accumulation operation (e.g., FMA) by multiplying IR against a corresponding vector of kernel values from memory (e.g., a cache memory such as element 4B of FIG. 2) to accumulate into each output vector register OR (OR1, . . . , ORC) . . “), wherein determining the total number of available ALU resources is based at least in part on the level two weight batch caching constraint (see ¶ [0137] “ . . . The order of loops in the pseudocode may be chosen carefully to maximize the resource utilization of the memory system. For example, typical kernel sizes may be 3×3×64×64 and may require 4 bytes per each value (e.g., weight). The total memory footprint of the kernels may thus require approximately 150 KB. Modern CPUs typically include a 32 KB L1 cache, (e.g., too small to fit all the kernel values). Therefore, the kernel data elements may need to reside in the L2 cache. As explained herein, the present invention may not include reuse of kernel values and therefore may require loading a new kernel value for each FMA instruction. As a result, waiting on L2 may become a bottleneck, and may decrease the efficiency of calculation. 
By looping over the spatial locations in the kernels in the loop, the working set for each outer loop iteration may become only 64×64, and the corresponding memory footprint may thus be reduced to approximately 16 KB, thus fitting in the L1 cache and significantly increasing compute utilization . .  .”).
It would have been obvious to one with ordinary skill in the art before the effective filing date of the applicant’s application to incorporate a system and method of executing a convolution layer of a neural network for efficient compressed convolution using activation sparsity, as taught by Kopinsky, into a system and method to perform operations to implement a neural network and accelerator logic to perform compute operations for the neural network using parallel processing, and based on parameters of the one or more machine learning operations, to request a scheduler to schedule one or more machine learning operations to one of an array of programmable compute units and a fixed function compute unit, as taught by the combination of Bleiweiss and Barik.  Such incorporation provides multiple levels for the neural network calculations. 
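The cache-footprint figures quoted from Kopinsky ¶ [0137] in the rejection of claim 7 can be checked with simple arithmetic. The following is a minimal Python sketch; the helper name kernel_footprint_bytes is hypothetical, and the 32 KB L1 size is taken from the quoted passage and read here as 32 × 1024 bytes, which is an assumption.

```python
def kernel_footprint_bytes(dims, bytes_per_value=4):
    """Bytes needed to hold a kernel tensor with the given dimensions."""
    total = bytes_per_value
    for dim in dims:
        total *= dim
    return total


L1_CACHE_BYTES = 32 * 1024  # assumed reading of the quoted "32 KB L1 cache"

# Full 3x3x64x64 kernel set at 4 bytes per weight.
full_kernels = kernel_footprint_bytes((3, 3, 64, 64))  # 147456 bytes (144 KiB)

# Working set for one spatial location: a 64x64 slice.
per_location = kernel_footprint_bytes((64, 64))  # 16384 bytes (16 KiB)

fits_full = full_kernels <= L1_CACHE_BYTES          # False
fits_per_location = per_location <= L1_CACHE_BYTES  # True
```

The full kernel set (about 147 thousand bytes, which the passage rounds to approximately 150 KB) exceeds the assumed L1 capacity, while the per-location working set of 16 KiB fits within it.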
In regard to claim 19, the combination of Bleiweiss and Barik fails to explicitly teach determine a level two weight batch caching constraint for a second level of an iterative machine-learning process, wherein determining the total number of available ALU resources is based at least in part on the level two weight batch caching constraint.  However Kopinsky teaches determine a level two weight batch caching constraint for a second level of an iterative machine-learning process (see ¶ [0135] as described for the rejection of claim 7 and is incorporated herein), wherein determining the total number of available ALU resources is based at least in part on the level two weight batch caching constraint (see ¶ [0137] as described for the rejection of claim 7 and is incorporated herein).
The motivation to combine Kopinsky with the combination of Bleiweiss and Barik is described for the rejection of claim 7 and is incorporated herein.
Claims 8 and 9 are rejected under 35 U.S.C. 103 as being unpatentable over Bleiweiss (U.S. 2019/0205737 A1; herein referred to as Bleiweiss) as applied to claims 1 – 2, 13 – 14, and 20 in view of Fuller et al. (U.S. 2019/0304138 A1; herein referred to as Fuller).
In regard to claim 8, Bleiweiss fails to explicitly teach generating a portion of output activation data based at least in part on the processing the portion of input activation data; and identifying, based at least in part on having generated the portion of output activation data and based at least in part on the size of a level one cache of the texture processor, a second portion of input activation data for an iterative machine-learning process.  However Fuller teaches generating a portion of output activation data based at least in part on the processing the portion of input activation data (see ¶ [0081] “ . . . Computer device 102 may also include a user interface component 46 operable to receive inputs from a user of computer device 102 and further operable to generate outputs for presentation to the user. User interface component 46 may include one or more input devices, including but not limited to a keyboard, a number pad, a mouse, a touch-sensitive display, a navigation key, a function key, a microphone, a voice recognition component, any other mechanism capable of receiving an input from a user, or any combination thereof. Further, user interface component 46 may include one or more output devices, including but not limited to a display, a speaker, a haptic feedback mechanism, a printer, any other mechanism capable of presenting an output to a user, or any combination thereof . . . “); and 
identifying, based at least in part on having generated the portion of output activation data and based at least in part on the size of a level one cache of the texture processor (see ¶ [0018] “ . . . Block compression is used by computer games for storing textures, since block compression may be read directly by the GPU, saving memory, bandwidth, and/or cache pressure on a computer device. However, it is possible to achieve higher compression ratios at acceptable image quality using other types of compression, for example, but not limited to, machine learning image compression, joint photographic experts group (JPEG) compression, wavelet compression, and/or general purpose lossless compression (e.g., zip, lzma, and kraken). Using a compression format with a higher compression ratio on textures for games may be desirable for reducing input/output bandwidth and/or for reducing the size of games on the hard disk, optical media, or when downloaded over the internet. Unfortunately, these other compressions schemes are not directly usable by the GPU . . .”), a second portion of input activation data for an iterative machine-learning process (see ¶ [0037] “ . . . the trained machine learning model 40 may use a trained Machine Learning Adversarial Network to block compress images and predict the best modes, shapes, and/or end points to use based on the learning achieved during a training phase of the adversarial network. For example, the machine learning networks may create a set of metadata for each iteration performed during the block compression that generates the best results. The machine learning networks may compare the block compressed image against a perfect image until the blocked compressed image is a close as possible to the original source image 23. 
When the comparison determines that the block compressed image is a close as possible to the original source image 23 (e.g., a measurement of error 25 is below a threshold value), the metadata may be saved as one or more hints 20 . . .”).
It would have been obvious to one with ordinary skill in the art before the effective filing date of the applicant’s application to incorporate a system and method for real time texture compression including accessing graphics hardware incompatible compressed textures in a format incompatible with the GPU, and a metadata file associated with the graphics hardware incompatible compressed textures, wherein the metadata file includes at least one hint that provides information to use for compression of decompressed textures from the graphics hardware incompatible compressed textures into hardware compatible compressed textures, as taught by Fuller, into a system and method to perform operations to implement a neural network and accelerator logic to perform compute operations for the neural network using parallel processing, as taught by Bleiweiss.  Such incorporation enables graphics data to be accessed and input into machine learning iterations.
In regard to claim 9, the combination of Bleiweiss and Fuller teaches performing one or more iterations of the iterative machine-learning process until all of the input activation data has been processed (see Fuller ¶ [0073] “ . . .the trained machine learning model 40 may use a trained Machine Learning Adversarial Network to block compress images and predict the best modes, shapes, and/or end points to use based on the learning achieved during a training phase of the adversarial network. For example, the machine learning networks may create a set of metadata for each iteration performed during the block compression that generates the best results. The machine learning networks may compare the block compressed image against a perfect image until the blocked compressed image is a close as possible to the original source image 23. When the comparison determines that the block compressed image is a close as possible to the original source image 23 (e.g., a measurement of error 25 is below a threshold value), the metadata may be saved as one or more hints 20. . . . “).
The motivation to combine Fuller with Bleiweiss is described for the rejection of claim 8 and is incorporated herein.  Additionally, Fuller evaluates each iteration for efficient compression of the input data used for the machine learning.
Claims 10 and 11 are rejected under 35 U.S.C. 103 as being unpatentable over Bleiweiss (U.S. 2019/0205737 A1; herein referred to as Bleiweiss) as applied to claims 1 – 2, 13 – 14, and 20 in view of Kilgard et al. (U.S. 7,782,334 B1; herein referred to as Kilgard).
In regard to claim 10, Bleiweiss  fails to explicitly teach identifying, by the texture processor, the first set of one or more weight batches from a system memory; and  identifying, by the shading processor, the second set of one or more weight batches from the system memory.   However Kilgard teaches identifying, by the texture processor, the first set of one or more weight batches from a system memory (see Col 2: Lines 31-46 “ . . . Various embodiments of the invention include a graphics processing system for resizing a source data array to produce a destination data array. The graphics processing system includes a texture fetch unit and a programmable shader computation unit. The texture fetch unit is configured to read one or more entries from the source data array, each of the one or more entries corresponding to a region of the source data array bounded by a box defined by a pixel of the destination data array mapped onto the source data array. the programmable shader computation unit is configured to compute a weight corresponding to the region and is configured to determine a source sample based on the one or more entries, scale the source sample by the weight to produce a weighted source sample, and combine the weighted source sample with other weighted source samples to produce a destination sample corresponding to the entry of the destination data array . . .”); and identifying, by the shading processor, the second set of one or more weight batches from the system memory (Col 3: Lines 51-67; Col 4: Lines 1 - 3 “ . . . A pixel shader program may be used to configure a graphics processor to sample and filter a source data array of any dimensions to produce a high-quality destination data array of other dimensions. The source and destination data arrays may include image data. One or more destination data arrays may be mip maps of the source data array and the filter may be a box filter or other type of filter, e.g., Gaussian, median, or the like. 
Each pixel in the destination data array is produced in isolation, i.e., independently. Therefore, source sample positions and weights are directly evaluated for each pixel in the destination data array, thereby permitting the use of parallel processing to produce each pixel in the destination data array. In contrast, conventional methods used to produce mip maps of non-power-of-two data arrays, incrementally compute source sample positions and weights incrementally, resulting in serialized generation of each pixel in the destination data array. Consequently, using direct evaluation to produce each pixel in the destination data array may result in higher performance than using incremental evaluation to produce each pixel in the destination data array . . . “).
It would have been obvious to one with ordinary skill in the art before the effective filing date of the applicant’s application to incorporate a system and method for filtering texture map data using a graphics processor, as taught by Kilgard, into a system and method to perform operations to implement a neural network and accelerator logic to perform compute operations for the neural network using parallel processing, as taught by Bleiweiss.  Such incorporation provides more details of the texture and shading processes for the GPU. 
In regard to claim 11, the combination of Bleiweiss and  Kilgard  teaches identifying, by the texture processor, the first set of one or more weight batches and the second set of one or more weight batches from a system memory (see Kilgard Fig. 4B, Col 7: Lines 39-67; Col 8: Lines 1 – 13 “ . . .  FIG. 4B is a block diagram of an exemplary embodiment of texture fetch unit 460 shown in FIG. 4A in accordance with one or more aspects of the present invention. In some embodiments, texture fetch unit 460 receives data, e.g., program instructions, and attributes associated with fragments (coverage information, texture identifiers, texture coordinates such as s, t, and r, and the like) from a rasterizer, as described in conjunction with FIG. 8.  Texture fetch unit 460 includes a texture address unit 452, a read request unit 456, and a texture configuration unit 454. Weight computation unit 454 computes the boxes bounding the destination data array entries, e.g., pixels, mapped onto the source data array, the destination pixel boxes, and provides texture address unit 452 with the source sample position information, such as the texture coordinate sets for each destination pixel box and the width and height scale factors. In some embodiments of the present invention, shader computation top unit 445 computes the texture coordinate sets. Texture address unit 452 uses the texture coordinate sets and width and height scale factors to determine the number of source samples to fetch for each destination data array entry, e.g., destination image pixel. Texture address unit 452 also computes a level of detail (LOD) for mip mapped textures and the texture coordinates are used to compute an address for reading source samples from source data arrays stored in the memory resource. Read request unit 456 generates source sample read requests to fetch source samples from the memory resource.  
Weight computation unit 454 clamps each source sample's box to the corresponding destination pixel box to compute the source sample regions. Weight computation unit 454 then computes areas of the source sample regions and the areas destination pixel boxes. Weight computation unit 454 divides areas of the source sample regions by the areas of the destination pixel boxes to compute the weight for each of the source samples. The weights are output by weight computation unit 454 to texture filter unit 470. In other embodiments of the present invention weight computation unit 454 is omitted and the texture coordinate sets and weights are computed by shader computation top unit 445 and/or shader computation bottom unit 480  . . . “); and 
sending, by the texture processor, the second set of one or more weight batches to the shading processor (see Kilgard Fig. 4C, Col 8: Lines 14-36 “ . . . FIG. 4C is a block diagram of an exemplary embodiment of texture filter unit 470 shown in FIG. 4A in accordance with one or more aspects of the present invention. Texture filter unit 470 includes a sample scale unit 472 and a destination sample computation unit 474. Sample scale unit 472 receives source samples from the memory resource and weights from weight computation unit 454. In some embodiments of the present invention the weights may be scaled by filter tap values to for a particular filter. For example, when box filtering is used the weight is scaled by a filter tap value of one to produce a filter coefficient. Sample scale unit 472 scales each source samples by its corresponding weight or filter coefficient to produce weighted source samples that are output to destination sample computation unit 474. Destination computation unit 474 sums all of the weighted samples corresponding to each destination data array entry and outputs the destination data array entries to shader computation bottom unit 480. In some embodiments of the present invention, shader computation bottom unit 480 and/or shader computation top unit 445 may be configured to perform the scaling and summing to produce the destination data array entries. In those embodiments of the present invention, sample scale unit 472 and destination computation unit 474 may be omitted. . . . “).
The motivation to combine Kilgard with Bleiweiss is described for the rejection of claim 10 and is incorporated herein.  Additionally, Kilgard discloses additional functions to pass data from a texture processor to a shader processor.
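As an illustration of the weight computation quoted from Kilgard (Col 7: Line 39 through Col 8: Line 36), the sketch below clamps each source sample's box to a destination pixel box, divides the clamped extent by the destination box extent to obtain a weight, and sums the weighted source samples. It is a minimal Python sketch under stated assumptions: it works in one dimension, so the areas of the quoted passage become lengths, and the names box_weights and resize_1d are hypothetical and are not taken from Kilgard.

```python
from fractions import Fraction


def box_weights(dst_index, dst_size, src_size):
    """Weights of the source samples covered by one destination pixel (1D)."""
    # Destination pixel box mapped onto the source array.
    lo = Fraction(dst_index * src_size, dst_size)
    hi = Fraction((dst_index + 1) * src_size, dst_size)
    weights = {}
    for j in range(src_size):
        # Clamp the source sample box [j, j + 1) to the destination box.
        overlap = min(hi, j + 1) - max(lo, j)
        if overlap > 0:
            # Weight = clamped source region / destination pixel box.
            weights[j] = overlap / (hi - lo)
    return weights


def resize_1d(src, dst_size):
    """Box-filter a 1D source array down (or up) to dst_size entries."""
    out = []
    for i in range(dst_size):
        weights = box_weights(i, dst_size, len(src))
        # Scale each source sample by its weight and sum the results.
        out.append(sum(w * src[j] for j, w in weights.items()))
    return out


# Usage: resize three samples to two (a non-power-of-two case).
resized = resize_1d([10, 20, 30], 2)  # [Fraction(40, 3), Fraction(80, 3)]
```

Each destination entry is computed independently of the others, which matches the quoted observation that direct evaluation permits parallel processing of destination pixels.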
Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Bleiweiss (U.S. 2019/0205737 A1; herein referred to as Bleiweiss) as applied to claims 1 – 2, 13 – 14, and 20 in view of Aumi et al. (U.S. 2020/0364549 A1; herein referred to as Aumi).
In regard to claim 12, Bleiweiss fails to explicitly teach determining a number of fibers associated with a first iteration of an iterative machine-learning process, wherein identifying the portion of input activation data for the iterative machine-learning process is based at least in part on the number of fibers.  However Aumi teaches determining a number of fibers associated with a first iteration of an iterative machine-learning process (see ¶ [0016] “ . . . a training method comprises accessing a stored dataset comprising, for each of multiple optical fiber preforms, a plurality of images of each optical fiber preform coupled with an indication of a number of fiber kilometers lost due to diameter upset of a cable built using the optical fiber preform, wherein each image represents a portion of the optical fiber preform. The training method comprises preprocessing the stored dataset to generate a training dataset. The training method comprises training, using the training dataset, a convolutional neural network (CNN) to predict diameter upset performance of an optical fiber preform based on visual information representing the optical fiber preform, the CNN comprising an input layer, a plurality of hidden layers, and an output layer, wherein each of the input layer and the plurality of hidden layers comprises a plurality of artificial neurons. The training method comprises providing an output representing the trained CNN . . .”), wherein identifying the portion of input activation data for the iterative machine-learning process is based at least in part on the number of fibers (see  ¶ [0100-0102] “ . . . FIG. 9 is a data flow diagram 900 for feature extraction and classification using a convolutional neural network (CNN) to predict diameter upset performance of an optical fiber preform, in accordance with some embodiments.  As shown, in the feature extraction from image block 905, an input image 910 is provided. 
The input image 910 is passed through filters to generate the output from the first filter layer 915. Additional filters are used to generate the output from the second filter layer 920. The output from the second filter layer 920 is collapsed to a vector 925. The vector 925 is provided to a machine learning algorithm 930 in the classification phase. The classification phase outputs “good” or “bad” depending on whether the number of fiber kilometers lost due to diameter upset of the cable built using optical fiber drawn from the optical fiber preform is predicted to exceed a threshold value. FIG. 10 illustrates example images 1000 of optical fiber preforms, in accordance with some embodiments. As shown, two images A and B are shown, and each rectangular image is divided into 20 parts 1001-1020, with each part of the image encompassing 1/20 of the height of the image and the entire width. In other embodiments, each rectangular image may be divided into n parts, with each part of the image encompassing 1/n of the height of the image and the entire width, where n is a positive integer. . . .”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the applicant’s application to incorporate a system and method for predicting optical fiber manufacturing performance using a neural network, as taught by Aumi, into a system and method to perform operations to implement a neural network and accelerator logic to perform compute operations for the neural network using parallel processing, as taught by Bleiweiss. Such incorporation uses the neural network parallel processing to determine optimal fiber numbers.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure. It is listed on the PTO-892 accompanying this action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JAMES N FIORILLO, whose telephone number is (571)272-9909. The examiner can normally be reached on 7:30 - 5 PM Mon - Fri.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, John A. Follansbee, can be reached on 571-272-3964. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/JAMES N FIORILLO/Examiner, Art Unit 2444