Detailed Action

Notice of Pre-AIA or AIA Status
The present application, filed on or after April 30, 2019, is being examined under the first inventor to file provisions of the AIA.

Specification
The disclosure is objected to because of the following informalities: 
Paragraph 0028, the last sentence. “In this example, at N values between 64 and 128, the performance using the CHWN format starts to output perform the NCHW format performance. As can be seen form this example, a Nt threshold can be determined between N=64 and N=128” should read “In this example, at N values between 64 and 128, the performance using the CHWN format starts to outperform the NCHW format performance. As can be seen form this example, a Nt threshold can be determined between N=64 and N=128” 
Paragraph 0026, the first sentence. “FIG. 3 illustrates one example block diagram of a system 300 having a NN simulator 302 generating a meta-file 304 from input data 301” should read “FIG. 3 illustrates one example block diagram of a system 300 having a NN simulator 302 generating a meta-file 306 from input data 301”
Appropriate correction is required.

Claim Rejections - 35 USC § 101
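35 U.S.C. 101 reads as follows:

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

	Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.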

	Regarding claim 1,
	2A Prong 1: The limitation of selecting a memory layout for a neural network (NN) among a plurality of different memory layouts based on thresholds derived from performance simulations of the NN is a mental process of making a judgment based on analysis or observation.
	2A Prong 2: This judicial exception is not integrated into a practical application. The claim does not recite any additional element. The limitation of storing multi-dimensional NN kernel computation data using the selected memory layout during NN inference is a form of insignificant extra-solution activity.
	2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The limitation of storing multi-dimensional NN kernel computation data using the selected memory layout during NN inference was considered to be a form of insignificant extra-solution activity in Step 2A Prong 2, and thus it is re-evaluated in Step 2B to determine if it is more than what is well-understood, routine, and conventional activity in the field. The limitation merely recites “storing and retrieving information in memory”, which is a well-understood, routine, and conventional function (MPEP 2106.05(d)(II)(iv)). The claim is not patent eligible.
	
	Regarding claim 8, 
2A Prong 1: The limitation of selected memory layout for the NN among a plurality of different memory layouts based on thresholds derived from performance simulations of the NN is a mental process of making a judgment based on analysis or observation.
	2A Prong 2: This judicial exception is not integrated into a practical application. In particular, the claim recites the additional elements of a plurality of multiply accumulate units (MACs) and memories. Using MACs and memories to receive and store the input data amounts to no more than mere instructions to apply the exception using generic computer components. Furthermore, the limitation of one or more memories storing input feature maps is an insignificant extra-solution activity.
	2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, the additional elements of MACs and memories to receive and store data amount to no more than mere instructions to apply the exception using generic computer components. Furthermore, the limitations of transform logic to store the output kernel computations and a plurality of multiply accumulate units (MACs) to receive the input feature maps to perform kernel computations for a neural network (NN) merely indicate the particular technological field or environment in which the abstract idea is performed (MPEP 2106.05(h)), because the transform logic is merely logic. The limitation of one or more memories storing input feature maps is an insignificant extra-solution activity that merely recites “storing and retrieving information in memory”, which is a well-understood, routine, and conventional function (MPEP 2106.05(d)(II)(iv)). The claim is not patent eligible.

	Regarding claim 14, the limitation of a non-transitory computer-readable medium including instructions, which if executed by a processing unit, causes the processing unit to perform an operation is a generic computer component. Claim 14 is a non-transitory computer-readable medium claim having limitations similar to those of method claim 1 above. Therefore, it is rejected under the same rationale as claim 1 above.

	Regarding claim 2,
2A Prong 1: The limitation of wherein selecting the memory layout includes selecting a channel, height, width, and batches (CHWN) layout if a number of channels (C) based on input data is less than a channel number threshold (Ct) or if a number of batches (N) based on the input data is equal to or greater than a batch number threshold (Nt) is a mental process, as it merely recites selecting a layout based on threshold value (performance) which can be done with the aid of pen and paper.
	2A Prong 2: This judicial exception is not integrated into a practical application. The claim does not recite any additional element.
	2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claim is not patent eligible.
Claim 9 is a system claim having limitations similar to those of method claim 2 above. Therefore, it is rejected under the same rationale as claim 2 above.
Claim 15 is a non-transitory computer-readable medium claim having limitations similar to those of method claim 2 above. Therefore, it is rejected under the same rationale as claim 2 above.

	Regarding claim 3,
2A Prong 1: The limitation of wherein selecting the memory layout includes selecting a batches, height, width and channel (NHWC) layout if a current NN layer giga floating point operations (GFlops) divided by a current NN layer memory size is greater than a feature map threshold (Fmt) is a mental process, as it merely recites selecting a layout based on threshold value (performance) which can be done by using pen and paper.
	2A Prong 2: This judicial exception is not integrated into a practical application. The claim does not recite any additional element.
	2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claim is not patent eligible.
Claim 10 is a system claim having limitations similar to those of method claim 3 above. Therefore, it is rejected under the same rationale as claim 3 above.
Claim 16 is a non-transitory computer-readable medium claim having limitations similar to those of method claim 3 above. Therefore, it is rejected under the same rationale as claim 3 above.

	Regarding claim 4, the claim merely recites storing the values in a memory, which is a well-understood, routine, and conventional function, because the limitation merely recites “storing and retrieving information in memory” (MPEP 2106.05(d)(II)(iv)). The claim is not patent eligible.
Claim 11 is a system claim having limitations similar to those of method claim 4 above. Therefore, it is rejected under the same rationale as claim 4 above.
Claim 17 is a non-transitory computer-readable medium claim having limitations similar to those of method claim 4 above. Therefore, it is rejected under the same rationale as claim 4 above.
	
	Regarding claim 5, 
2A Prong 1: The limitation of wherein the selecting the memory layout includes selecting a batch, channel, height and width (NCHW) layout if the CHWN and NHWC layouts are not selected is a mental process, as it merely recites a process of selecting the NCHW layout if the performance of the other memory layouts is worse than the performance of the NCHW layout, which can be done by using pen and paper.
	2A Prong 2: This judicial exception is not integrated into a practical application. The claim does not recite any additional element.
	2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claim is not patent eligible.
Claim 12 is a system claim having limitations similar to those of method claim 5 above. Therefore, it is rejected under the same rationale as claim 5 above.
Claim 18 is a non-transitory computer-readable medium claim having limitations similar to those of method claim 5 above. Therefore, it is rejected under the same rationale as claim 5 above.

	Regarding claim 6,
2A Prong 1: The limitation of wherein storing the multi-dimensional NN kernel computational data includes if the multi-dimensional NN kernel computation data is not in the selected memory layout transforming the multi-dimensional NN kernel computation data for the selected memory layout is a mental process, because it merely recites transforming the layout of data by using a matrix operation if the memory layout differs from the selected layout, which can be done by using pen and paper.
	2A Prong 2: This judicial exception is not integrated into a practical application. The claim does not recite any additional element.
	2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claim is not patent eligible.
Claim 19 is a non-transitory computer-readable medium claim having limitations similar to those of method claim 6 above. Therefore, it is rejected under the same rationale as claim 6 above.

	Regarding claim 7, the limitation of wherein transforming the multi-dimensional NN kernel computation data is performed in hardware merely recites the field of use or technological environment (MPEP 2106.05(h)). The claim is not patent eligible.

	Regarding claim 13, the limitation of wherein the transform logic includes a field programmable array (FPGA), programmable logic arrays (PLAs), or hard-wired circuitry merely recites the field of use or technological environment (MPEP 2106.05(h)). The claim is not patent eligible.

	Regarding claim 20, the limitation of wherein the processing unit performs an operation comprising: performing simulations of kernel computations in different memory layouts based on varying multi-dimension parameters is a mental process, as it merely recites the process of testing the computation to select which memory layout works best, which can be done by using pen and paper.
 
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-2, 6-9, 13-15, and 19-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Li (Chao Li et al, 2016, “Optimizing Memory Efficiency for Deep Convolutional Neural Networks on GPUs”).

Regarding claim 1, Li teaches a neural network method comprising: selecting a memory layout for a neural network (NN) among a plurality of different memory layouts based on thresholds derived from performance simulations of the NN ([Li, page 4, right column, entire second paragraph and third paragraph; Fig 3; Fig 4a] “Fig. 3 shows the performance comparison between two different data layouts in the convolutional layers. As discussed above, different data layouts will also impact on the implementations of the convolutional layers. The performance of each data layout is evaluated using their best performant implementations … In other words, for the CHWN data layout, the N dimension is used for both memory coalescing and data reuse (in registers), and therefore the performance is very sensitive to the value of N”, [Li, page 5, left column, third paragraph, line 6-12; Fig 4] “For a given convolutional configuration, (1) if the value of C is smaller than a threshold Ct, CHWN will be preferred as the cost of memory transformation used in NCHW data layout is high; (2) if N is greater than or equal to a threshold Nt, the CHWN data layout is still the better choice as N is large enough to achieve both memory coalescing and data reuse. For the rest of the configurations, NCHW is the preferred choice”, [Li, page 6, first paragraph, first two sentences] “Compared to convolutional layers, pooling layers, another essential part of CNNs, are memory-intensive and also work on 4D data structures. Fig. 6 shows the performance of pooling layers with different data layouts. As we can see, cuda-convnet (i.e., CHWN) significantly outperforms Caffe and cuDNN (i.e., NCHW) across the board, with a speedup up to 16.3x”, all of these passages of the Li reference disclose the process of simulating and measuring the performance of a model with a variety of memory layouts, and [Li, Fig. 1] shows the performance comparison between the CHWN layout (cuda-convnet2) and the NCHW layout (cuDNNv4));
storing multi-dimensional NN kernel computation data using the selected memory layout during NN inference ([Li, page 10, right column, second paragraph, line 11-19] “Our optimized framework selects the right layout across all different convolutional shapes, CHWN for CV1 and NCHW for the rest. Then, for the pooling layers, based on our study, CHWN is the best and cuda-convnet consistently performs better than cuDNN. Our optimizations on these three pooling layers further improves the performance by up to 27.8% over cuda-convenet. Finally, for the softmax layer, our memory optimization shows the significant speedup, with up to 20.1x speedup over cuDNN and 8.2x over cuda-convnet”, Li selects CHWN for CV1, then uses the optimized layouts, and improves the performance of the neural network processing method).
	
Regarding claim 8, Li teaches one or more memories storing input feature maps ([Li, page 4, left column, Data Layout in Generic Implementation, the first paragraph, the last sentence] “Third, the depth of input feature maps (Ci) is 1 for grey-scale images or 3 for RGB images in the first convolutional layers, and then is a multiple of 16 in the rest of the convolutional layers, which can also provide regular memory accesses for a warp of GPU threads (warp size =32)”, implies that the input feature map is stored in a memory); a plurality of multiply accumulate units (MACs) to receive the input feature maps to perform kernel computations for a neural network (NN) ([Li, page 2, left column, third paragraph] “In this paper, we look into these memory issues and propose a set of methods to optimize memory efficiency for accelerating CNNs on GPUs. The main contributions of this paper are: … With the promising results on the state-of-the-art networks including LeNet [17] and AlexNet [12], our work improves the development of deep neural network libraries on GPUs, hence contributing to the advance in machine learning applications”, since all operations occur on GPUs, it is apparent that there is a unit that calculates and accumulates the results of computations, as every GPU comprises processors and memories); and transform logic to store the output kernel computations in a selected memory layout for the NN among a plurality of different memory layouts based on thresholds derived from performance simulations of the NN ([Li, page 5, left column, third paragraph, line 6-12] “For a given convolutional configuration, (1) if the value of C is smaller than a threshold Ct, CHWN will be preferred as the cost of memory transformation used in NCHW data layout is high; (2) if N is greater than or equal to a threshold Nt, the CHWN data layout is still the better choice as N is large enough to achieve both memory coalescing and data reuse. For the rest of the configurations, NCHW is the preferred choice”, [Li, page 6, first paragraph, first two sentences] “Compared to convolutional layers, pooling layers, another essential part of CNNs, are memory-intensive and also work on 4D data structures. Fig. 6 shows the performance of pooling layers with different data layouts. As we can see, cuda-convnet (i.e., CHWN) significantly outperforms Caffe and cuDNN (i.e., NCHW) across the board, with a speedup up to 16.3x”, [Li, Fig. 1] shows the performance comparison between the CHWN layout (cuda-convnet2) and the NCHW layout (cuDNNv4)).

	Regarding claim 14, Li teaches a non-transitory computer-readable medium including instructions, which if executed by a processing unit, causes the processing unit to perform an operation ([Li, page 2, left column, third paragraph, second and third bullets] “Second, we support one network with multiple data layouts by proposing a fast multi-dimension data layout transformation on GPUs. We integrate the support for automatic data layout selection and transformation into a popular deep learning framework, Caffe. Third, we study the memory behavior of the memory-bounded pooling and softmax layers and optimize their memory access efficiency on GPUs”, GPUs are the processing unit, and [Li, page 9, right column, first paragraph, the last sentence] “For example, in PL3 with a pool window of 3 and a stride of 2, our optimized kernel effectively reduced 9.1% global memory transactions and 36.0% DRAM accesses respectively, compared to cuda-convnet, and the overall performance has improved by 33.9%”, DRAM is the non-transitory computer-readable medium). Claim 14 is a non-transitory computer-readable medium claim having limitations similar to those of method claim 1 above. Therefore, it is rejected under the same rationale as claim 1 above.

Regarding claim 2, Li teaches wherein selecting the memory layout includes selecting a channel, height, width, and batches (CHWN) layout if a number of channels (C) based on input data is less than a channel number threshold (Ct) or if a number of batches (N) based on the input data is equal to or greater than a batch number threshold (Nt) ([Li, page 2, right column, second paragraph, line 17-21] “To differentiate data layouts in the 4D arrays, we use the following notation in the paper: N (the number of images), C (the number of feature maps), H (the image height), and W (the image width). With this notation, we can see that Equation 1 uses the NCHW layout” discloses the basic notations N, C, H, and W, and [Li, page 4, left column, Data Layouts in Generic Implementations, second sentence] “First, the batch size, N, is generally a multiple of 16, and has limited choices as described in prior works [12][25]. Therefore, using N as the lowest dimension is a good choice to meet the requirements for coalesced memory accessing as the threads are organized accordingly”, clearly shows that the N corresponds to the batch size. [Li, page 5, left column, third paragraph, line 6-11] “For a given convolutional configuration, (1) if the value of C is smaller than a threshold Ct, CHWN will be preferred as the cost of memory transformation used in NCHW data layout is high; (2) if N is greater than or equal to a threshold Nt, the CHWN data layout is still the better choice as N is large enough to achieve both memory coalescing and data reuse”).
Claim 9 is a system claim having limitations similar to those of method claim 2 above. Therefore, it is rejected under the same rationale as claim 2 above.
Claim 15 is a non-transitory computer-readable medium claim having limitations similar to those of method claim 2 above. Therefore, it is rejected under the same rationale as claim 2 above.
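
For illustration only, the two-threshold heuristic quoted from Li for claim 2 can be sketched as follows. The function name `select_layout_li` and the example threshold values are hypothetical; they appear in neither the application nor Li, which obtains its thresholds by one-time profiling per GPU architecture.

```python
def select_layout_li(c: int, n: int, ct: int, nt: int) -> str:
    """Return "CHWN" or "NCHW" following the heuristic quoted from Li.

    c, n   -- number of channels (C) and batch size (N) of the layer
    ct, nt -- channel and batch thresholds (Ct, Nt); the values used
              below are assumed for illustration, not taken from Li
    """
    # (1) C below Ct, or (2) N at or above Nt: CHWN is preferred.
    if c < ct or n >= nt:
        return "CHWN"
    # For the rest of the configurations, NCHW is preferred.
    return "NCHW"


# Example with assumed thresholds Ct = 32 and Nt = 128.
print(select_layout_li(c=3, n=64, ct=32, nt=128))    # small C
print(select_layout_li(c=64, n=64, ct=32, nt=128))   # neither condition met
print(select_layout_li(c=64, n=128, ct=32, nt=128))  # large N
```

With these assumed thresholds, the three calls return CHWN, NCHW, and CHWN respectively.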

Regarding claim 6, Li teaches wherein storing the multi-dimensional NN kernel computational data includes if the multi-dimensional NN kernel computation data is not in the selected memory layout transforming the multi-dimensional NN kernel computation data for the selected memory layout ([Li, page 6, left column, second-third paragraph, C. A Fast Data Layout Transformation for CNNs, line 3-10] “We also derive the preferred data layout based on the performance implication of the memory behavior of different layers. A subsequent question is how to enable the different suitable data layouts into one network? We propose an efficient data layout transformation. For brevity, we will mainly discuss the approach to transform from CHWN to NCHW. Transforming an array in the CHWN layout to the NCHW layout is essentially a transpose operation on a 4D array”).
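
As a minimal sketch of what the quoted passage calls “a transpose operation on a 4D array”, the following plain-Python routine reorders a flat CHWN buffer into a flat NCHW buffer using index arithmetic. It is not Li's GPU implementation; the function name `chwn_to_nchw` and the tiny example tensor are hypothetical.

```python
def chwn_to_nchw(data, c, h, w, n):
    """Reorder a flat buffer from CHWN order to NCHW order.

    In CHWN the batch index N varies fastest; in NCHW the width
    index W varies fastest. This is a naive reference loop, not an
    optimized kernel.
    """
    if len(data) != c * h * w * n:
        raise ValueError("buffer length does not match C*H*W*N")
    out = [None] * len(data)
    for ci in range(c):
        for hi in range(h):
            for wi in range(w):
                for ni in range(n):
                    src = ((ci * h + hi) * w + wi) * n + ni  # CHWN offset
                    dst = ((ni * c + ci) * h + hi) * w + wi  # NCHW offset
                    out[dst] = data[src]
    return out


# Two channels, 1x1 spatial size, batch of two: elements are grouped
# by channel on input and by batch item on output.
print(chwn_to_nchw([0, 1, 2, 3], c=2, h=1, w=1, n=2))  # [0, 2, 1, 3]
```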


Regarding claim 7, Li teaches wherein transforming the multi-dimensional NN kernel computation data is performed in hardware ([Li, page 2, left column, third paragraph, the second bullet] “Second, we support one network with multiple data layouts by proposing a fast multi-dimension data layout transformation on GPUs” teaches the multi-dimensional data layout transformation performed in hardware (GPU)).

Regarding claim 13, Li teaches wherein the transform logic includes a field programmable array (FPGA), programmable logic arrays (PLAs), or hard-wired circuitry ([Li, page 2, left column, third paragraph, the second bullet] “Second, we support one network with multiple data layouts by proposing a fast multi-dimension data layout transformation on GPUs. We integrate the support for automatic data layout selection and transformation into a popular deep learning framework, Caffe”, teaches that the transform logic is executed on GPUs, which correspond to the hard-wired circuitry).

Regarding claim 20, Li teaches the non-transitory computer-readable medium of claim 14, wherein the processing unit performs an operation comprising: performing simulations of kernel computations in different memory layouts based on varying multi-dimension parameters ([Li, page 6, B. Data Layout In Pooling Layers, first paragraph, line 3-6; Fig. 6] “Fig. 6 shows the performance of pooling layers with different data layouts. As we can see, cuda-convnet (i.e., CHWN) significantly outperforms Caffe and cuDNN (i.e., NCHW) across the board, with a speedup up to 16.3x”, ‘based on multi-dimension parameters’ is interpreted as ‘based on parameters C, N, W, H (channel, batch, width, height)’).

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 3-5, 10-12, and 16-18 are rejected under 35 U.S.C. 103 as being unpatentable over Li (Chao Li et al, 2016, “Optimizing Memory Efficiency for Deep Convolutional Neural Networks on GPUs”) in view of Herbert (Herbert et al, 03/21/2019, “CNN Inference with cuDNN”), and further in view of Zhou (Zhou et al, 2018, “Resource-Efficient Neural Architect”).

	Regarding claim 3, Li teaches selecting different memory layouts based on performance ([Li, page 4, right column, entire second paragraph and third paragraph; Fig 3; Fig 4a] “Fig. 3 shows the performance comparison between two different data layouts in the convolutional layers. As discussed above, different data layouts will also impact on the implementations of the convolutional layers. The performance of each data layout is evaluated using their best performant implementations … In other words, for the CHWN data layout, the N dimension is used for both memory coalescing and data reuse (in registers), and therefore the performance is very sensitive to the value of N”, [Li, page 5, left column, third paragraph, line 6-12; Fig 4] “For a given convolutional configuration, (1) if the value of C is smaller than a threshold Ct, CHWN will be preferred as the cost of memory transformation used in NCHW data layout is high; (2) if N is greater than or equal to a threshold Nt, the CHWN data layout is still the better choice as N is large enough to achieve both memory coalescing and data reuse. For the rest of the configurations, NCHW is the preferred choice”), but Li does not specifically teach selecting a batches, height, width and channel (NHWC) layout.
Herbert teaches wherein selecting the memory layout includes selecting a batches, height, width and channel (NHWC) layout based on the performance of the layout (Fmt) ([Herbert, page 38, TENSOR CORES ON VOLTA AND TURING, NCHW vs NHWC] The slide describes a calculation speed comparison across various input tensor sizes and output tensor sizes. The NHWC layout performs better with an input tensor size of 32x32x64 and an output tensor size of 16x16x128, input 128x128x128 output 128x128x128 … and 7x7 filter size. Fmt can also be calculated by dividing the measured calculation time of each layout by the output or input tensor size (memory size). For example, the Fmt of NHWC when the output tensor size is 16 x 16 x 128 and the runtime is 0.04 ms is $\frac{0.04\ \text{ms}}{16 \times 16 \times 128}$. The values calculated for the different models can then be compared to obtain an Fmt threshold value at which NHWC performs better than NCHW).
Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of both Li and Herbert, to use the process of Herbert of selecting a memory layout by comparing the performance of each memory layout, including NHWC, to implement the neural network memory layout selection method of Li. The suggestion and/or motivation for doing so is to improve the performance of the model by choosing the best memory layout from a more diverse set of memory layouts.
Li in view of Herbert does not specifically teach calculating performance using a current NN layer giga floating point operations (GFlops) divided by a current NN layer memory size (Fmt).
Zhou teaches calculating a performance using current NN layer giga floating point operations (GFlops) divided by a current NN layer memory size (Fmt) ([Zhou, page 3, 3rd paragraph, (iii) Compute intensity] “Compute intensity is defined as the average number of FLOPs per data access (i.e. data transfer between the fast and slow memory). Compute intensity is a measure of how efficiently an algorithm can re-use data. For modern multi-core architectures like GPUs and TPUs, it is an indirect measure of how fast the algorithm can be run. In general, if a neural network reuses data, it requires less memory bandwidth and achieves higher compute intensity … ”, teaches calculating Fmt, as the feature map threshold is calculated by $Fmt = \frac{\text{Gflops}}{\text{memory size}}$, i.e. Flops/byte. Examples of compute intensity can be found in page 7 Table 1 and Table 2, as they show the unit of compute intensity, FLOPs/byte).
Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of Li, Herbert, and Zhou, to use the performance measurement of Zhou using GFlops/memory size to implement the process of measuring performance and selecting a memory layout of Li and Herbert. The suggestion and/or motivation for doing so is to measure how efficiently an algorithm can re-use data, and to use the efficiency measurement to select which layout to use.
Claim 10 is a system claim having limitations similar to those of method claim 3 above. Therefore, it is rejected under the same rationale as claim 3 above.
Claim 16 is a non-transitory computer-readable medium claim having limitations similar to those of method claim 3 above. Therefore, it is rejected under the same rationale as claim 3 above.
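
For illustration only, the combined selection recited across claims 2, 3, and 5 can be sketched as a three-way decision. The function name `select_layout`, the order in which the CHWN and NHWC conditions are tested, the units of the memory size, and all numeric values are assumptions made for this sketch; none of them is taken from the application or from the cited references.

```python
def select_layout(c, n, gflops, mem_size, ct, nt, fmt):
    """Pick a memory layout from the three thresholds Ct, Nt, and Fmt.

    c, n      -- channel count and batch size of the current layer
    gflops    -- GFlops of the current layer
    mem_size  -- memory size of the current layer (unit assumed)
    ct, nt    -- channel and batch thresholds
    fmt       -- feature map threshold on gflops / mem_size
    """
    if mem_size <= 0:
        raise ValueError("mem_size must be positive")
    # Claim 2 condition: small C or large N selects CHWN.
    # Testing this before the NHWC condition is an assumption.
    if c < ct or n >= nt:
        return "CHWN"
    # Claim 3 condition: GFlops per unit of memory above Fmt selects NHWC.
    if gflops / mem_size > fmt:
        return "NHWC"
    # Claim 5 condition: neither CHWN nor NHWC selected.
    return "NCHW"


# Assumed thresholds: Ct = 32, Nt = 128, Fmt = 0.5.
print(select_layout(c=64, n=64, gflops=2.0, mem_size=1.0, ct=32, nt=128, fmt=0.5))
print(select_layout(c=64, n=64, gflops=0.2, mem_size=1.0, ct=32, nt=128, fmt=0.5))
```

Under these assumed values the first call returns NHWC (ratio 2.0 exceeds 0.5) and the second returns NCHW (ratio 0.2 does not).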

	Regarding claim 4, Li in view of Herbert, and further in view of Zhou teaches further comprising: storing the Ct, and Nt thresholds in a meta file ([Li, page 5, left column, 3rd paragraph] “First, the N and C dimension are revealed as being highly correlated with memory performance. Second, a heuristic to select the suitable data layout for a convolutional shape can be derived based on the performance sensitivity analysis. For a given convolutional configuration, (1) if the value of C is smaller than a threshold Ct … Considering that the heuristic parameters only relate to the property of the hardware, for each GPU architecture, we only need one-time profiling (as the one shown in Fig. 4 on varying N and C) to determine the thresholds”, discloses the Ct and Nt, and determining threshold happens in GPU. It is obvious that Ct, Nt, and Fmt will be stored in some kind of file in GPUs, and the term meta file merely recites a name of a file without any detail).
	Li in view of Herbert does not specifically teach storing the Fmt thresholds in a file.
	Zhou teaches storing Fmt thresholds (i.e. FLOPS/memory size) in a file ([Zhou, page 3, 3rd paragraph, (iii) Compute intensity] “Compute intensity is defined as the average number of FLOPs per data access (i.e. data transfer between the fast and slow memory). Compute intensity is a measure of how efficiently an algorithm can re-use data. For modern multi-core architectures like GPUs and TPUs, it is an indirect measure of how fast the algorithm can be run. In general, if a neural network reuses data, it requires less memory bandwidth and achieves higher compute intensity … ”, teaches calculating Fmt, as the feature map threshold is calculated by $Fmt = \frac{\text{Gflops}}{\text{memory size}}$, i.e. Flops/byte. Examples of compute intensity can be found in page 7 Table 1 and Table 2, as they show the unit of compute intensity, FLOPs/byte. The Zhou reference also uses GPUs or TPUs to store and process data, so it is obvious that the Fmt will be stored in a file).
Claim 11 is a system claim having limitations similar to those of method claim 4 above. Therefore, it is rejected under the same rationale as claim 4 above.
Claim 17 is a non-transitory computer-readable medium claim having limitations similar to those of method claim 4 above. Therefore, it is rejected under the same rationale as claim 4 above.
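
As a hedged illustration of what storing the Ct, Nt, and Fmt thresholds in a meta file could look like, the sketch below writes the three values to a JSON file and reads them back. The JSON format, the file name `layout_meta.json`, the helper names, and the numeric values are all assumptions; neither the application as quoted here nor the cited references specifies a file format.

```python
import json
import os
import tempfile


def save_thresholds(path, ct, nt, fmt):
    """Write the three thresholds to a JSON meta file (format assumed)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"Ct": ct, "Nt": nt, "Fmt": fmt}, f)


def load_thresholds(path):
    """Read the thresholds back as a (Ct, Nt, Fmt) tuple."""
    with open(path, encoding="utf-8") as f:
        meta = json.load(f)
    return meta["Ct"], meta["Nt"], meta["Fmt"]


# Round trip through a temporary directory with assumed values.
with tempfile.TemporaryDirectory() as tmp:
    meta_path = os.path.join(tmp, "layout_meta.json")
    save_thresholds(meta_path, ct=32, nt=128, fmt=0.5)
    print(load_thresholds(meta_path))  # (32, 128, 0.5)
```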

	Regarding claim 5, Li in view of Herbert, and further in view of Zhou teaches wherein the selecting the memory layout includes selecting a batch, channel, height and width (NCHW) layout if the CHWN and NHWC layouts are not selected ([Li, page 5, left column, third paragraph, line 6-12] “For a given convolutional configuration, (1) if the value of C is smaller than a threshold Ct, CHWN will be preferred as the cost of memory transformation used in NCHW data layout is high; (2) if N is greater than or equal to a threshold Nt, the CHWN data layout is still the better choice as N is large enough to achieve both memory coalescing and data reuse. For the rest of the configurations, NCHW is the preferred choice”, Li discloses NCHW is the preferred choice if other layouts are not selected).
Claim 12 is a system claim having limitations similar to those of method claim 5 above. Therefore, it is rejected under the same rationale as claim 5 above.
Claim 18 is a non-transitory computer-readable medium claim having limitations similar to those of method claim 5 above. Therefore, it is rejected under the same rationale as claim 5 above.

	 
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure.
Regarding data layout optimization:
US 20160342888 A1
Heehoon Kim et al, 2017, “Performance Analysis of CNN Frameworks for GPUs”

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JUN KWON whose telephone number is (571)272-2072. The examiner can normally be reached on 7:30 AM - 5:30 PM. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Abdullah Kawsar can be reached on (571)270-3169. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).

/JUN KWON/
Examiner, Art Unit 2127
/ABDULLAH AL KAWSAR/Supervisory Patent Examiner, Art Unit 2127