Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Specification
The lengthy specification has not been checked to the extent necessary to
determine the presence of all possible minor errors. Applicant's cooperation is
requested in correcting any errors of which applicant may become aware in the
specification.

Examiner Notes
Examiner cites particular columns, paragraphs, figures and line numbers in the references as applied to the claims below for the convenience of the applicant. Although the specified citations are representative of the teachings in the art and are applied to the specific limitations within the individual claim, other passages and figures may apply as well. It is respectfully requested that, in preparing responses, the applicant fully consider the references in their entirety as potentially teaching all or part of the claimed invention, as well as the context of the passage as taught by the prior art or disclosed by the examiner. The entire reference is considered to provide disclosure relating to the claimed invention. The claims & only the claims form the metes & bounds of the invention. Office personnel are to give the claims their broadest reasonable interpretation in light of the supporting disclosure. Unclaimed limitations appearing in the specification are not read into the claim. Prior art was referenced using terminology familiar to one of ordinary skill in the art. Such an approach is broad in concept and can be either explicit or implicit in meaning. Examiner's Notes are provided with the cited references to assist the applicant to better understand how the examiner interprets the applied prior art. Such comments are entirely consistent with the intent & spirit of compact prosecution.


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1, 4-6, 10-14, 17-19 and 23-26 are rejected under 35 U.S.C. 103 as being unpatentable over Janedula et al. (US 2019/0042923 A1), hereinafter Janedula, in view of Du et al. (US 2018/0232629 A1), hereinafter Du.

Claim 1. A processor-implemented method of a neural network, the method comprising: 
Janedula discloses obtaining intermediate pooling results, respectively corresponding to sub-pooling kernels obtained by decomposing an original pooling kernel, 
Janedula: [0179] “As shown in FIG. 24, a process 2400 to perform hardware-based Winograd convolution using kernels of multiple sizes, according to embodiments described herein, includes to decompose a higher-order convolution kernel having a first kernel size into multiple sub-kernels having a second kernel size, as shown at block 2402, [correspond to sub-pooling kernels obtained by decomposing an original pooling kernel] and transform at least a patch of an input feature map and the multiple sub-kernels based on a Winograd transform associated with the second kernel size, as shown at block 2404. The process 2400 additionally includes to perform multiple successive Winograd convolution operations to generate a set of partial output feature maps as shown at block 2406, and accumulate the multiple partial output feature maps into an output feature map, as shown at block 2408. [correspond to obtaining intermediate pooling results] The process 2400 additionally includes to perform an inverse Winograd transform on the output feature map to generate a transformed output feature map, as shown at block 2410.”
Janedula discloses by performing a pooling operation on input pixels included in a current window in an input feature map using the sub-pooling kernels;
Janedula: [0199] “FIG. 28A illustrates various layers within a CNN. As shown in FIG. 28A, an exemplary CNN used to model image processing can receive input 2802 describing the red, green, and blue (RGB) components of an input image. [correspond to input pixels included in a current window] The input 2802 can be processed by multiple convolutional layers (e.g., convolutional layer 2804, convolutional layer 2806). The output from the multiple convolutional layers may optionally be processed by a set of fully connected layers 2808. Neurons in a fully connected layer have full connections to all activations in the previous layer, as previously described for a feedforward network. The output from the fully connected layers 2808 can be used to generate an output result from the network.”
[0201] “FIG. 28B illustrates exemplary computation stages within a convolutional layer of a CNN. Input to a convolutional layer 2812 of a CNN can be processed in three stages of a convolutional layer 2814. The three stages can include a convolution stage 2816, a detector stage 2818, and a pooling stage 2820. The convolution layer 2814 can then output data to a successive convolutional layer. The final convolutional layer of the network can generate output feature map data or provide input to a fully connected layer, for example, to generate a classification value for the input to the CNN.” [correspond to performing a pooling operation on input pixels included in a current window in an input feature map using the sub-pooling kernels]
[0204] “The pooling stage 2820 uses a pooling function that replaces the output of the convolutional layer 2806 with a summary statistic of the nearby outputs. The pooling function can be used to introduce translation invariance into the neural network, such that small translations to the input do not change the pooled outputs. Invariance to local translation can be useful in scenarios where the presence of a feature in the input data is more important than the precise location of the feature…”
Janedula discloses obtaining a final pooling result corresponding to the current window by post-processing the intermediate pooling results;  
Janedula: [0204] “The pooling stage 2820 uses a pooling function that replaces the output of the convolutional layer 2806 with a summary statistic of the nearby outputs. [correspond to obtaining a final pooling result corresponding to the current window by post-processing the intermediate pooling results] The pooling function can be used to introduce translation invariance into the neural network, such that small translations to the input do not change the pooled outputs. Invariance to local translation can be useful in scenarios where the presence of a feature in the input data is more important than the precise location of the feature.”
Janedula discloses determining an output pixel value of an output feature map, based on the final pooling result, 
Janedula: [0152] “FIG. 18 illustrates an exemplary logic and data layout 1800 to enable multi-stride convolution, according to an embodiment. The exemplary logic and data layout 1800 is configured for kernel size of five and a kernel stride of two. In one embodiment, a steering logic block (e.g., steering write logic 1806) is provided that hides the alternate pixels from Winograd transform. A 1×5 kernel 1802 having elements {k1, k2, k3, k4, k5} can be decomposed into two 1×3 kernels based on an exemplary kernel stride of two, where a first decomposed kernel 1808A includes elements {k1, K3, and K5}, while a second decomposed kernel 1808B includes elements {k2, k4, 0}. A patch of input data 1804 that is stored contiguously in memory can be written out to a buffer within Winograd compute logic via steering write logic 1806. The steering write logic can write the input data such that data of stride 2 is stored contiguously within the buffer. For example, a 1×12 input patch having elements {I1, I2, I3, I4, I5, I6, I7, I8, I9, I10, I11, I12} can be written out to a buffer, such that a first input data tensor 1812A includes elements {I1, I3, I5, I7, I9, I11} and a second input data tensor 1812 includes elements {I2, I4, I6, I8, I10, I12}. A Winograd compute unit 1810 can then perform an F(4,3) Winograd convolution between the respective 1×3 kernels 1808A-1808B and 1×6 input data tensors 1812A-1812B, the partial outputs being summed to create a final output 1814. The illustrated technique can be modified as necessary to enable multi-stride convolution of various strides, as control logic, such as the steering write logic 1806 can be configured accordingly for various strides greater than one.”
[0164] “For each input feature map fetched from memory, the Winograd compute architecture described herein, in one embodiment, performs calculation for multiple output feature maps before fetching the next input feature map. The number of output feature maps being computed in parallel depends on number of processing tiles.”
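For illustration only, the kernel decomposition quoted from Janedula [0152] can be sketched in plain Python. The sketch replaces the Winograd F(4,3) transform with a direct sliding dot product, which is an assumption made for readability, and the function names correlate_1d and strided_via_decomposition are hypothetical and do not appear in the reference. It shows a 1×5 kernel applied at a stride of two producing the same four outputs as the sum of two stride-one passes with 1×3 kernels over the de-interleaved 1×12 input.

```python
def correlate_1d(signal, kernel, stride=1):
    """Valid-mode sliding dot product (no kernel flip)."""
    n_out = (len(signal) - len(kernel)) // stride + 1
    return [
        sum(k * signal[j * stride + t] for t, k in enumerate(kernel))
        for j in range(n_out)
    ]


def strided_via_decomposition(signal, kernel):
    """Stride-2 result built from two stride-1 passes (illustrative only)."""
    taps_a = kernel[0::2]                      # {k1, k3, k5}
    taps_b = kernel[1::2]                      # {k2, k4}
    # Zero-pad the shorter sub-kernel so both have the same size: {k2, k4, 0}.
    taps_b = taps_b + [0] * (len(taps_a) - len(taps_b))
    data_a = signal[0::2]                      # {I1, I3, ..., I11}
    data_b = signal[1::2]                      # {I2, I4, ..., I12}
    part_a = correlate_1d(data_a, taps_a)      # first partial output
    part_b = correlate_1d(data_b, taps_b)      # second partial output
    # The partial outputs are summed element-wise to form the final output.
    return [a + b for a, b in zip(part_a, part_b)]


signal = list(range(1, 13))    # stand-in values for I1..I12
kernel = [1, 2, 3, 4, 5]       # stand-in values for k1..k5
direct = correlate_1d(signal, kernel, stride=2)
decomposed = strided_via_decomposition(signal, kernel)
print(direct, decomposed)      # both lists are [55, 85, 115, 145]
```

The numeric values are arbitrary stand-ins; only the equality of the two results is the point of the sketch.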
Janedula discloses wherein the current window is determined according to the original pooling kernel, according to a raster scan order, in the input feature map.  
Janedula: [0109] “In some embodiments, render output pipeline 870 contains a rasterizer and depth test component 873 that converts vertex-based objects into an associated pixel-based representation. In some embodiments, the rasterizer logic includes a windower/masker unit to perform fixed function triangle and line rasterization. [correspond to wherein the current window is determined according to the original pooling kernel, according to a raster scan order, in the input feature map] An associated render cache 878 and depth cache 879 are also available in some embodiments. A pixel operations component 877 performs pixel-based operations on the data, though in some instances, pixel operations associated with 2D operations (e.g. bit block image transfers with blending) are performed by the 2D engine 841, or substituted at display time by the display controller 843 using overlay display planes. In some embodiments, a shared L3 cache 875 is available to all graphics components, allowing the sharing of data without the use of main system memory.”

Janedula does not appear to explicitly disclose the original pooling kernel having been slid.
However, Du discloses that the current window is determined according to the original pooling kernel. Du: [0055] “In addition, the control unit 7 can also retrieve the needed instruction and convolution information from external memory by data memory access. After the instruction decoder 71 decodes the instruction, the buffer device 2 retrieves the instruction and the convolution information. The instruction may include the size of the stride of the sliding window, the address of the sliding window, and the numbers of columns and rows of the image data.”
Janedula and Du are analogous art because they are from the “same field of endeavor,” namely convolutional neural networks (CNNs).
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art, having the teachings of Janedula and Du before him or her, to modify the method of Janedula to include the pooling operation feature of Du because this combination improves the performance of the CNN.
The suggestion/motivation for doing so would be found in Du: [0005] “In view of the foregoing, an objective of the disclosure is to provide a pooling operation device and method that can reduce the required reading bandwidth for inputted data and thus enhance the pooling operation performance.”
Therefore, it would have been obvious to combine Janedula and Du to obtain the invention as specified in the instant claim(s).

Regarding Claim 13, the same ground of rejection is made as discussed above for substantially similar rationale. 
In addition, Claim 13 recites “one or more processors”.
Janedula discloses one or more processors. Janedula: [0231] “One embodiment provides for a data processing system comprising a non-transitory machine-readable medium to store instructions for execution by one or more processors of the data processing system and a general-purpose graphics processing unit including a hardware accelerator including a compute unit to perform a Winograd convolution, the compute unit configurable to perform the Winograd convolution for a first kernel size using a transform associated with a second kernel size. The compute unit of the data processing system can perform any of the Winograd convolution operations otherwise described herein.” See Fig. 1.

Claims 4 and 17
Janedula discloses wherein the final pooling result is obtained in response to all of the intermediate pooling results being obtained for the current window.  
Janedula: [0179] “As shown in FIG. 24, a process 2400 to perform hardware-based Winograd convolution using kernels of multiple sizes, according to embodiments described herein, includes to decompose a higher-order convolution kernel having a first kernel size into multiple sub-kernels having a second kernel size, as shown at block 2402, and transform at least a patch of an input feature map and the multiple sub-kernels based on a Winograd transform associated with the second kernel size, as shown at block 2404. The process 2400 additionally includes to perform multiple successive Winograd convolution operations to generate a set of partial output feature maps as shown at block 2406, and accumulate the multiple partial output feature maps into an output feature map, as shown at block 2408. [correspond to the intermediate pooling results being obtained for the current window] The process 2400 additionally includes to perform an inverse Winograd transform on the output feature map to generate a transformed output feature map, as shown at block 2410.”
Janedula: [0204] “The pooling stage 2820 uses a pooling function that replaces the output of the convolutional layer 2806 with a summary statistic of the nearby outputs. [correspond to the final pooling result is obtained in response to all of the intermediate pooling results] The pooling function can be used to introduce translation invariance into the neural network, such that small translations to the input do not change the pooled outputs. Invariance to local translation can be useful in scenarios where the presence of a feature in the input data is more important than the precise location of the feature.”

Claims 5 and 18
Du discloses wherein intermediate pooling results corresponding to a same window are respectively stored in memory cells comprising memory addresses of a same column and different rows in a share line buffer.  
Du: [0046] “For example, in the current pooling operation, the pooling units 81˜85 can read the data A0˜A7 of the 0˜3 columns. The pooling unit 82 performs the pooling operation with the data A0˜A2, and the outputted result of the pooling operation is stored at the address A2. The pooling unit 83 performs the pooling operation with the data A2˜A4, and the outputted result of the pooling operation is stored at the address A4. The pooling unit 84 performs the pooling operation with the data A4˜A6, and the outputted result of the pooling operation is stored at the address A6. The pooling unit 85 performs the pooling operation with the data A6˜A7 and a placeholder, and the outputted result of the pooling operation is registered in the row buffer unit. In the next pooling operation as shown in FIG. 4B, the pooling result registered in the row buffer unit is provided to one of the inputs of the pooling unit 81. The pooling result registered in the row buffer unit and the data A8˜A15 of a next pooling operation are inputted to the pooling units 81˜85.”
[0055] “In addition, the control unit 7 can also retrieve the needed instruction and convolution information from external memory by data memory access. After the instruction decoder 71 decodes the instruction, the buffer device 2 retrieves the instruction and the convolution information. The instruction may include the size of the stride of the sliding window, the address of the sliding window, and the numbers of columns and rows of the image data.”
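For illustration only, one possible reading of the pooling sequence quoted from Du [0046] is sketched below in Python. The sketch assumes max pooling, a window of three, a stride of two, and eight data items read per pooling operation; these values are inferred from the quoted example (A0˜A2, A2˜A4, A4˜A6, A6˜A7 plus a placeholder) and are not stated as parameters in the reference. The function name chunked_max_pool and the variable carry, which stands in for the value registered in the row buffer unit, are hypothetical.

```python
def chunked_max_pool(chunks):
    """Max-pool a 1-D stream delivered in chunks (window 3, stride 2 assumed).

    The window that straddles two chunks is finished in the next operation,
    using the partial result carried over from the previous one.
    """
    outputs = []
    carry = None  # stand-in for the result held in the row buffer
    for chunk in chunks:
        if carry is not None:
            # Complete the straddling window with the first new item.
            outputs.append(max(carry, chunk[0]))
        # Windows that lie fully inside this chunk: [0:3], [2:5], [4:7].
        for start in range(0, len(chunk) - 2, 2):
            outputs.append(max(chunk[start:start + 3]))
        # Last two items plus a placeholder: keep only a partial result.
        carry = max(chunk[-2:])
    return outputs


data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]   # stand-ins for A0..A15
chunked = chunked_max_pool([data[:8], data[8:]])
direct = [max(data[s:s + 3]) for s in range(0, len(data) - 2, 2)]
print(chunked == direct)   # the two approaches agree on this data
```

Carrying the partial result forward means the two items at the end of a chunk do not have to be read again in the next operation, which is consistent with the reduced reading bandwidth that Du [0005] states as an objective.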

Claims 6 and 19
Janedula discloses receiving a value of a current input pixel included in the current window according to the raster scan order for the input feature map, 
Janedula: [0109] “In some embodiments, render output pipeline 870 contains a rasterizer and depth test component 873 that converts vertex-based objects into an associated pixel-based representation. In some embodiments, the rasterizer logic includes a windower/masker unit to perform fixed function triangle and line rasterization. [correspond to wherein the current window is determined according to the original pooling kernel, according to a raster scan order, in the input feature map] An associated render cache 878 and depth cache 879 are also available in some embodiments. A pixel operations component 877 performs pixel-based operations on the data, though in some instances, pixel operations associated with 2D operations (e.g. bit block image transfers with blending) are performed by the 2D engine 841, or substituted at display time by the display controller 843 using overlay display planes. In some embodiments, a shared L3 cache 875 is available to all graphics components, allowing the sharing of data without the use of main system memory.”
Janedula discloses wherein the obtaining of the intermediate pooling results comprises updating at least one partial pooling result stored in at least one memory cell affected by the received value of the current input pixel, based on the received value of the current input pixel.
Janedula: [0179] “As shown in FIG. 24, a process 2400 to perform hardware-based Winograd convolution using kernels of multiple sizes, according to embodiments described herein, includes to decompose a higher-order convolution kernel having a first kernel size into multiple sub-kernels having a second kernel size, as shown at block 2402, and transform at least a patch of an input feature map and the multiple sub-kernels based on a Winograd transform associated with the second kernel size, as shown at block 2404. The process 2400 additionally includes to perform multiple successive Winograd convolution operations to generate a set of partial output feature maps as shown at block 2406, and accumulate the multiple partial output feature maps into an output feature map, as shown at block 2408. [correspond to updating at least one partial pooling result stored in at least one memory cell affected by the received value of the current input pixel, based on the received value of the current input pixel] The process 2400 additionally includes to perform an inverse Winograd transform on the output feature map to generate a transformed output feature map, as shown at block 2410.”
Examiner considers “perform multiple successive Winograd convolution operations” to include the step of updating a partial pooling result.
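For illustration only, the claim language of claims 6 and 19 (receiving pixel values in raster scan order and updating each affected partial pooling result) can be sketched in Python as follows. The sketch is a reading of the claim wording, not an implementation taken from either cited reference. It assumes max pooling; the function name stream_max_pool is hypothetical, and the dictionary partial is a hypothetical stand-in for the memory cells that hold partial results.

```python
def stream_max_pool(feature_map, k, stride):
    """Max pooling computed by updating partial results pixel by pixel."""
    rows, cols = len(feature_map), len(feature_map[0])
    out_rows = (rows - k) // stride + 1
    out_cols = (cols - k) // stride + 1
    partial = {}   # (window row, window column) -> running maximum
    output = [[None] * out_cols for _ in range(out_rows)]
    for r in range(rows):          # raster scan: top to bottom
        for c in range(cols):      # and left to right within a row
            value = feature_map[r][c]
            for wr in range(out_rows):
                top = wr * stride
                if not (top <= r < top + k):
                    continue       # pixel row lies outside this window
                for wc in range(out_cols):
                    left = wc * stride
                    if not (left <= c < left + k):
                        continue   # pixel column lies outside this window
                    key = (wr, wc)
                    # Update the partial result affected by this pixel.
                    if key in partial:
                        partial[key] = max(partial[key], value)
                    else:
                        partial[key] = value
                    # The bottom-right pixel of a window completes it.
                    if r == top + k - 1 and c == left + k - 1:
                        output[wr][wc] = partial.pop(key)
    return output


fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
print(stream_max_pool(fmap, k=2, stride=2))   # [[6, 8], [14, 16]]
print(stream_max_pool(fmap, k=3, stride=1))   # [[11, 12], [15, 16]]
```

With a stride smaller than the kernel size, one incoming pixel updates several partial results, which is why the inner loops visit every window that covers the pixel.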

Claims 10 and 23
Janedula discloses obtaining a hyper-parameter, of the neural network, comprising information about any one or any combination of any two or more of a size of the original pooling kernel, a stride size, and a pooling type, wherein a share line buffer storing the obtained intermediate pooling results is addressed based on the obtained hyper-parameter.  
Janedula: [0154] “The Winograd acceleration architecture 1900 additionally includes an input write controller 1903 and a kernel write controller 1926, which each include steering logic, such as the steering write logic 1806 of FIG. 18, to enable support for multi-stride convolution. The kernel write controller 1926 is also configured to distribute specific kernel data to specific tiles 1910A-1910M. A set of IP registers 1924 includes accelerator configuration registers to provide topology information to the Winograd acceleration architecture 1900. Provided topology information incudes kernel and input patch size, kernel stride, number of input and output feature maps, and other information used to configure convolution operations within the Winograd acceleration architecture 1900. [correspond to obtaining a hyper-parameter, of the neural network, comprising information about any one or any combination of any two or more of a size of the original pooling kernel, a stride size, and a pooling type] In one embodiment, a separate control interface is provided to configure the IP registers 1924. The IP registers 1924 can be configured on a layer-by-layer basis, such that each CNN layer can be configured differently. In one embodiment, the layer-by-layer topology of an entire CNN model can be pre-configured within the IP registers 1924 to enable streamlined processing of the CNN.”
Janedula: [0204] “The pooling stage 2820 uses a pooling function that replaces the output of the convolutional layer 2806 with a summary statistic of the nearby outputs. The pooling function can be used to introduce translation invariance into the neural network, such that small translations to the input do not change the pooled outputs. Invariance to local translation can be useful in scenarios where the presence of a feature in the input data is more important than the precise location of the feature. Various types of pooling functions can be used during the pooling stage 2820, including max pooling, average pooling, and l2-norm pooling.”


Claims 11 and 24
Du discloses wherein the pooling operation is an operation based on a pooling type of max pooling, wherein each of the intermediate pooling results is a maximum value from among values of input pixels mapped to a corresponding sub-pooling kernel and the final pooling result is a maximum value among the intermediate pooling results, or the pooling operation is an operation based on a pooling type of average pooling, wherein each of the intermediate pooling results is a sum of the values of input pixels mapped to the corresponding sub-pooling kernel and the final pooling result is a value obtained by dividing a sum of the intermediate pooling results by a size of the original pooling kernel.
Du: [0056] “The sum buffer unit 5 is coupled to the interleaving sum unit 4. The sum buffer unit 5 includes a partial sum region 51 and a pooling unit 52. The partial sum region 51 is configured for registering data outputted from the interleaving sum unit 4. The pooling unit 52 performs a pooling operation with the data registered in the partial sum region 51. The pooling operation is a max pooling or an average pooling.”
[0060] “As shown in FIG. 6, the data of the same column are together read from the convolution operation module 3 or the memory 1, and the read data can be the pixel data of an image. These data can be classified to max pooling (e.g. 2×2 or 3×3) and inputted to the corresponding max pooling unit. In this embodiment, the sum buffer unit 5 includes a plurality of pooling units 52, and each pooling unit 52 includes a register set REG, a comparator COMP, and an output switch. The comparator COMP has four inputs and one output. The register set REG has four registers, which can output the stored data to the comparator COMP. Three of the registers can receive and store the data read from the convolution operation module 3 or the memory 1, and the residual register can receive the output of the comparator COMP and store the maximum value of the outputs of the comparator COMP. The comparator COMP can compare the three inputted data and the maximum value of the previous comparison so as to output the maximum value. In other words, the maximum value outputted by the comparator COMP in the previous clock is registered in the register, so that it can be provided for next comparison with other new inputted data in the next clock. …”
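For illustration only, the two alternatives recited in claims 11 and 24 can be sketched in Python as follows. The sketch assumes that the sub-pooling kernels are the rows of the original pooling kernel; that decomposition is an assumption chosen for simplicity and is not taken from the claims or from either reference. The function name pool_window_decomposed is hypothetical.

```python
def pool_window_decomposed(window, pooling_type="max"):
    """Pool one window through per-row intermediate results.

    window is a list of rows; each row is treated as the set of input
    pixels mapped to one sub-pooling kernel (assumed decomposition).
    """
    if pooling_type == "max":
        # Intermediate result: maximum over each sub-kernel.
        intermediates = [max(row) for row in window]
        # Final result: maximum among the intermediate results.
        return max(intermediates)
    if pooling_type == "avg":
        # Intermediate result: sum over each sub-kernel.
        intermediates = [sum(row) for row in window]
        # Size of the original kernel: total number of elements.
        kernel_size = sum(len(row) for row in window)
        # Final result: sum of intermediates divided by the kernel size.
        return sum(intermediates) / kernel_size
    raise ValueError("pooling_type must be 'max' or 'avg'")


window = [[1, 5, 2],
          [7, 3, 4],
          [6, 9, 8]]
print(pool_window_decomposed(window, "max"))   # 9
print(pool_window_decomposed(window, "avg"))   # 5.0
```

Both alternatives give the same value as pooling the whole window in one step, because a maximum of maxima and a sum of sums are unaffected by how the window is partitioned.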

Claim 12. A non-transitory computer-readable storage medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform the method of claim 1.  
Janedula: [0231] “One embodiment provides for a data processing system comprising a non-transitory machine-readable medium to store instructions for execution by one or more processors of the data processing system and a general-purpose graphics processing unit including a hardware accelerator including a compute unit to perform a Winograd convolution, the compute unit configurable to perform the Winograd convolution for a first kernel size using a transform associated with a second kernel size. The compute unit of the data processing system can perform any of the Winograd convolution operations otherwise described herein.” See Fig. 1.

Regarding Claim 14, the same ground of rejection is made as discussed above (claims 12 and 1) for substantially similar rationale. 

Claim 25. A processor-implemented method of a neural network, the method comprising: 
Janedula discloses obtaining intermediate pooling results, respectively corresponding to sub-pooling kernels obtained by decomposing an original pooling kernel, from input pixels included in a current window to be pooled in an input feature map with sub-pooling kernels; 
Janedula: [0179] “As shown in FIG. 24, a process 2400 to perform hardware-based Winograd convolution using kernels of multiple sizes, according to embodiments described herein, includes to decompose a higher-order convolution kernel having a first kernel size into multiple sub-kernels having a second kernel size, as shown at block 2402, [correspond to sub-pooling kernels obtained by decomposing an original pooling kernel] and transform at least a patch of an input feature map and the multiple sub-kernels based on a Winograd transform associated with the second kernel size, as shown at block 2404. The process 2400 additionally includes to perform multiple successive Winograd convolution operations to generate a set of partial output feature maps as shown at block 2406, and accumulate the multiple partial output feature maps into an output feature map, as shown at block 2408. [correspond to obtaining intermediate pooling results] The process 2400 additionally includes to perform an inverse Winograd transform on the output feature map to generate a transformed output feature map, as shown at block 2410.”
Janedula: [0199] “FIG. 28A illustrates various layers within a CNN. As shown in FIG. 28A, an exemplary CNN used to model image processing can receive input 2802 describing the red, green, and blue (RGB) components of an input image. [correspond to input pixels included in a current window] The input 2802 can be processed by multiple convolutional layers (e.g., convolutional layer 2804, convolutional layer 2806). The output from the multiple convolutional layers may optionally be processed by a set of fully connected layers 2808. Neurons in a fully connected layer have full connections to all activations in the previous layer, as previously described for a feedforward network. The output from the fully connected layers 2808 can be used to generate an output result from the network.”
[0201] “FIG. 28B illustrates exemplary computation stages within a convolutional layer of a CNN. Input to a convolutional layer 2812 of a CNN can be processed in three stages of a convolutional layer 2814. The three stages can include a convolution stage 2816, a detector stage 2818, and a pooling stage 2820. The convolution layer 2814 can then output data to a successive convolutional layer. The final convolutional layer of the network can generate output feature map data or provide input to a fully connected layer, for example, to generate a classification value for the input to the CNN.” [correspond to performing a pooling operation on input pixels included in a current window in an input feature map using the sub-pooling kernels]
[0204] “The pooling stage 2820 uses a pooling function that replaces the output of the convolutional layer 2806 with a summary statistic of the nearby outputs. The pooling function can be used to introduce translation invariance into the neural network, such that small translations to the input do not change the pooled outputs. Invariance to local translation can be useful in scenarios where the presence of a feature in the input data is more important than the precise location of the feature…”
Janedula discloses obtaining a final pooling result corresponding to the current window from the intermediate pooling results, 
Janedula: [0204] “The pooling stage 2820 uses a pooling function that replaces the output of the convolutional layer 2806 with a summary statistic of the nearby outputs. [correspond to obtaining a final pooling result corresponding to the current window by post-processing the intermediate pooling results] The pooling function can be used to introduce translation invariance into the neural network, such that small translations to the input do not change the pooled outputs. Invariance to local translation can be useful in scenarios where the presence of a feature in the input data is more important than the precise location of the feature.”
Janedula discloses in response to the intermediate pooling being complete for the current window, the current window being determined in the input feature map,
Janedula: [0109] “In some embodiments, render output pipeline 870 contains a rasterizer and depth test component 873 that converts vertex-based objects into an associated pixel-based representation. In some embodiments, the rasterizer logic includes a windower/masker unit to perform fixed function triangle and line rasterization. [correspond to wherein the current window is determined according to the original pooling kernel in the input feature map] An associated render cache 878 and depth cache 879 are also available in some embodiments. A pixel operations component 877 performs pixel-based operations on the data, though in some instances, pixel operations associated with 2D operations (e.g. bit block image transfers with blending) are performed by the 2D engine 841, or substituted at display time by the display controller 843 using overlay display planes. In some embodiments, a shared L3 cache 875 is available to all graphics components, allowing the sharing of data without the use of main system memory.”
Janedula discloses determining an output pixel value of an output feature map, based on the final pooling result.  
Janedula: [0152] “FIG. 18 illustrates an exemplary logic and data layout 1800 to enable multi-stride convolution, according to an embodiment. The exemplary logic and data layout 1800 is configured for kernel size of five and a kernel stride of two. In one embodiment, a steering logic block (e.g., steering write logic 1806) is provided that hides the alternate pixels from Winograd transform. A 1×5 kernel 1802 having elements {k1, k2, k3, k4, k5} can be decomposed into two 1×3 kernels based on an exemplary kernel stride of two, where a first decomposed kernel 1808A includes elements {k1, K3, and K5}, while a second decomposed kernel 1808B includes elements {k2, k4, 0}. A patch of input data 1804 that is stored contiguously in memory can be written out to a buffer within Winograd compute logic via steering write logic 1806. The steering write logic can write the input data such that data of stride 2 is stored contiguously within the buffer. For example, a 1×12 input patch having elements {I1, I2, I3, I4, I5, I6, I7, I8, I9, I10, I11, I12} can be written out to a buffer, such that a first input data tensor 1812A includes elements {I1, I3, I5, I7, I9, I11} and a second input data tensor 1812 includes elements {I2, I4, I6, I8, I10, I12}. A Winograd compute unit 1810 can then perform an F(4,3) Winograd convolution between the respective 1×3 kernels 1808A-1808B and 1×6 input data tensors 1812A-1812B, the partial outputs being summed to create a final output 1814. The illustrated technique can be modified as necessary to enable multi-stride convolution of various strides, as control logic, such as the steering write logic 1806 can be configured accordingly for various strides greater than one.”
[0164] “For each input feature map fetched from memory, the Winograd compute architecture described herein, in one embodiment, performs calculation for multiple output feature maps before fetching the next input feature map. The number of output feature maps being computed in parallel depends on number of processing tiles.”
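For illustration only, the kernel decomposition quoted from Janedula [0152] can be sketched as follows. The sketch assumes a 1×5 kernel with a stride of two and uses a plain sliding dot product in place of the F(4,3) Winograd transform; the function names are hypothetical and do not appear in the reference.

```python
def strided_conv_1d(signal, kernel, stride):
    """Direct 1D sliding dot product ('valid' positions only)."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(0, len(signal) - k + 1, stride)
    ]


def decomposed_stride2_conv(signal, kernel):
    """Stride-2 convolution of a 1x5 kernel via two 1x3 stride-1 convolutions."""
    assert len(kernel) == 5, "sketch covers the 1x5, stride-2 case only"
    first_kernel = kernel[0::2]           # {k1, k3, k5}
    second_kernel = kernel[1::2] + [0]    # {k2, k4, 0}
    first_input = signal[0::2]            # {I1, I3, I5, ...}
    second_input = signal[1::2]           # {I2, I4, I6, ...}
    partial_a = strided_conv_1d(first_input, first_kernel, 1)
    partial_b = strided_conv_1d(second_input, second_kernel, 1)
    # The partial outputs are summed to create the final output.
    return [a + b for a, b in zip(partial_a, partial_b)]


signal = list(range(1, 13))   # stands in for the 1x12 input patch I1..I12
kernel = [1, 2, 3, 4, 5]      # stands in for k1..k5
direct = strided_conv_1d(signal, kernel, 2)
decomposed = decomposed_stride2_conv(signal, kernel)
```

Under these assumptions the two 1×6 input tensors each yield four partial outputs, which matches the four outputs of the direct stride-2 computation over the 1×12 patch.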

Janedula does not appear to explicitly disclose the original pooling kernel having been slid.
However, Du discloses the current window being determined as the original pooling kernel is slid in the input feature map;
Du: [0055] “In addition, the control unit 7 can also retrieve the needed instruction and convolution information from external memory by data memory access. After the instruction decoder 71 decodes the instruction, the buffer device 2 retrieves the instruction and the convolution information. The instruction may include the size of the stride of the sliding window, the address of the sliding window, and the numbers of columns and rows of the image data.”
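For illustration only, the sliding-window parameters named in Du [0055] (stride size, window position, and the numbers of rows and columns of the image data) can be pictured with the sketch below. The function names and the choice of max pooling are assumptions made for the example and are not taken from the reference.

```python
def window_origins(rows, cols, kernel_h, kernel_w, stride):
    """Yield the (top, left) position of each window, row by row, left to right."""
    for top in range(0, rows - kernel_h + 1, stride):
        for left in range(0, cols - kernel_w + 1, stride):
            yield top, left


def max_pool(fmap, kernel_h, kernel_w, stride):
    """Max pooling obtained by sliding the kernel over the input feature map."""
    rows, cols = len(fmap), len(fmap[0])
    out_cols = (cols - kernel_w) // stride + 1
    flat = [
        max(
            fmap[r][c]
            for r in range(top, top + kernel_h)
            for c in range(left, left + kernel_w)
        )
        for top, left in window_origins(rows, cols, kernel_h, kernel_w, stride)
    ]
    # Regroup the flat, scan-ordered results into output rows.
    return [flat[i:i + out_cols] for i in range(0, len(flat), out_cols)]


feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
origins = list(window_origins(4, 4, 2, 2, 2))
pooled = max_pool(feature_map, 2, 2, 2)
```

Each window position produces one output pixel value, and the order in which positions are generated corresponds to a row-by-row scan of the input feature map.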
Janedula and Du are analogous art because they are from the “same field of endeavor,” namely convolutional neural networks (CNNs).
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art, having the teachings of Janedula and Du before him or her, to modify the method of Janedula to include the pooling operation feature of Du because this combination improves the performance of the CNN.
The suggestion/motivation for doing so is provided by Du: [0005] “In view of the foregoing, an objective of the disclosure is to provide a pooling operation device and method that can reduce the required reading bandwidth for inputted data and thus enhance the pooling operation performance.”
Therefore, it would have been obvious to combine Janedula and Du to obtain the invention as specified in the instant claim(s).

Claim 26. The method of claim 25, Janedula discloses a raster scan order.  
Janedula: [0109] “In some embodiments, render output pipeline 870 contains a rasterizer and depth test component 873 that converts vertex-based objects into an associated pixel-based representation. In some embodiments, the rasterizer logic includes a windower/masker unit to perform fixed function triangle and line rasterization. [corresponds to a raster scan order] An associated render cache 878 and depth cache 879 are also available in some embodiments. A pixel operations component 877 performs pixel-based operations on the data, though in some instances, pixel operations associated with 2D operations (e.g. bit block image transfers with blending) are performed by the 2D engine 841, or substituted at display time by the display controller 843 using overlay display planes. In some embodiments, a shared L3 cache 875 is available to all graphics components, allowing the sharing of data without the use of main system memory.”


Allowable Subject Matter
Claims 2-3, 7-9, 15-16, 20-22 and 27-28 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter:  
Janedula et al (US 2019/0042923 A1) teaches a method of implementing machine learning operations in a computing unit to perform a Winograd convolution for a first kernel size using a transform associated with a second kernel size. See [0179-0181], [0204], [0216].
Du et al (US 2018/0232629 A1) teaches a pooling operation method for a convolutional neural network that includes the following steps: reading multiple new data in at least one column of a pooling window; performing a first pooling operation with the new data to generate at least a pooling result column; storing the pooling result column in a buffer; and performing a second pooling operation with the pooling result column and at least a preceding pooling result column stored in the buffer to generate a pooling result of the pooling window. The first pooling operation and the second pooling operation are max pooling operations.
Chen et al (NPL: Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks, 2016) teaches a method of processing dataflow, called row stationary (RS), on a spatial architecture with 168 processing elements that reconfigures the computation mapping of a given shape to optimize energy efficiency by maximally reusing data locally to reduce expensive data movement, such as DRAM accesses.
Kang et al (NPL: NNSim: Fast performance estimation based on sampled simulation of GPGPU kernels for neural networks, 2018) teaches three sampling techniques (e.g. Inter-Kernel sampling, Intra-Kernel sampling, and Streaming Multiprocessor sampling) for neural networks.
These references taken either alone or in combination with the prior art of record fail to disclose instructions, including:
Claims 2, 15 and 27: “wherein the sub-pooling kernels are 1-dimensional (1 D) kernels, respectively comprising row elements of the original pooling kernel, and a total number of sub-pooling kernels obtained by decomposing from the original pooling kernel corresponds to a height of the original pooling kernel.”
Claims 7 and 20: “reading the intermediate pooling results for the current window from the memory cells of the share line buffer; and obtaining the final pooling result corresponding to the output pixel value by performing, on the read intermediate pooling results, a post-processing operation according to a pre-set pooling type.”
Claims 8 and 21: “wherein the share line buffer stores, in memory lines of a total number of rows corresponding to a height of the original pooling kernel, intermediate pooling results obtained for other windows in the input feature map, in a circular manner.”
in combination with the remaining elements and features of the claimed invention.
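For illustration only, one possible reading of the quoted limitations is sketched below: each row of the pooling kernel acts as a 1D sub-pooling kernel, the per-row intermediate results are kept in a line buffer with as many lines as the kernel height, the lines are reused in a circular manner, and a post-processing step selected by the pooling type produces the output pixel value. The function name, the stride of one, and the choice of "max" and "avg" types are assumptions for the example and are not taken from the application or the cited references.

```python
def pool2d_via_row_kernels(fmap, kernel_h, kernel_w, pooling="max"):
    """Stride-1 pooling built from 1D row results held in a circular line buffer."""
    reduce_fn = max if pooling == "max" else sum
    rows, cols = len(fmap), len(fmap[0])
    out_cols = cols - kernel_w + 1
    line_buffer = [None] * kernel_h      # one memory line per kernel row
    output = []
    for r in range(rows):
        # Intermediate results: 1D pooling along input row r, written to
        # slot r % kernel_h so the oldest line is overwritten (circular use).
        line_buffer[r % kernel_h] = [
            reduce_fn(fmap[r][c:c + kernel_w]) for c in range(out_cols)
        ]
        if r >= kernel_h - 1:
            out_row = []
            for c in range(out_cols):
                # Read the intermediate results for the current window.
                partials = [line[c] for line in line_buffer]
                # Post-processing according to the pooling type.
                value = reduce_fn(partials)
                if pooling == "avg":
                    value /= kernel_h * kernel_w
                out_row.append(value)
            output.append(out_row)
    return output


feature_map = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
max_result = pool2d_via_row_kernels(feature_map, 2, 2, "max")
avg_result = pool2d_via_row_kernels(feature_map, 2, 2, "avg")
```

In this sketch each row result is computed once and then read by every window that overlaps that row, which is the sense in which the buffer is shared across windows.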


Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CHUEN-MEEI GAN whose telephone number is (469)295-9127. The examiner can normally be reached Monday-Friday 9:00 am to 4:00 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Rehana Perveen, can be reached at 571-272-3676. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/CHUEN-MEEI GAN/Primary Examiner, Art Unit 2148