DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 9/8/2022 has been entered.
Claims 1-2, 8-9, and 15-16 have been amended. Claims 1-20 are pending and have been examined.

Response to Arguments
Applicant’s arguments, see pp. 10 and 12-15, filed 9/8/2022, with respect to the rejections of claims 1-20, have been fully considered and are persuasive.  The rejections under 35 USC § 103 of claims 1-20 have been withdrawn. 
Applicant’s arguments, see pp. 10 and 12-15, filed 9/8/2022, with respect to the rejection(s) of claim(s) 1-2, 8-9, and 15-16 under 35 USC § 103 have been fully considered and are persuasive.  Therefore, the rejection has been withdrawn.  However, upon further consideration, a new ground(s) of rejection is made in view of Ravi (“ProjectionNet: Learning Efficient On-Device Deep Networks Using Neural Projections”), Wang et al. (“Fast and guaranteed tensor decomposition via sketching”), and Tjandra et al. (“Compressing recurrent neural network with tensor train”).

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1, 3, 6-8, 10, 13-15 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Shaji et al. (US 2018/0039879 A1) in view of Tai et al. (“Convolutional Neural Networks with Low-Rank Regularization”), Guo et al. (US 2020/0167654 A1), and Ravi (“ProjectionNet: Learning Efficient On-Device Deep Networks Using Neural Projections”).

Regarding claim 1
Shaji teaches 
- deploying a neural network (NN) model on an electronic device (Shaji: [Abstract] “The method includes providing a base neural network for generating learned features.”, [Fig.5] discloses the neural network model on a system which is an electronic device [Fig. 3]), 
- the NN model being generated by training a first NN architecture on a first dataset, wherein a first function defines a first layer of the first NN architecture … (Shaji: [0069] “The method further includes updating the base neural network to generate a personalized neural network based on the received second set of training images.” [0074] “According to some embodiments, the final layers of the personalized neural network comprise linear and/or non-linear multi-dimensionality reduction functions,”; “a personalized neural network” in 0069 reads on the claimed “a first NN architecture” and “set of training images” reads on “a first dataset”;  one of “the final layers of the personalized neural network” in 0074 reads on “a first layer of the first NN architecture” and “reduction functions” reads on “a first function”), 
- enabling retraining of the NN model on the electronic device using a second data set and … and retraining of the NN model … provides a personalized deep learning model for a user of the electronic device (Shaji: [0076] "Updating the personalized neural network comprises re-training the final layers of the personalized neural network with the third set of images"; “the third set of images” reads on “a second data set”).
	Shaji does not distinctly disclose:
- the first function being constructed based on approximating a second function applied by a second layer of a second NN architecture 
- wherein the first layer of the first NN architecture replaces the second layer of the second NN architecture, the first function and the second function have a same input and output dimensionality, and retraining of the NN model that is reduced in size …
	However, Tai teaches: 
- the first function being constructed based on approximating a second function applied by a second layer of a second NN architecture ([Section 3], “The goal is to find an approximation Ŵ of W that facilitates more efficient computation while maintaining the classification accuracy of the CNN”; 
[equation images omitted: media_image1.png, media_image2.png]
; Ŵ reads on “first function” and W reads on “a second function”
; [Figure 2] discloses how Ŵ and W represent the first function of the first layer and the second function of the second layer, respectively.
[figure image omitted: media_image3.png]
)
- wherein the first layer of the first NN architecture replaces the second layer of the second NN architecture, and the first function and the second function have a same input and output dimensionality, and retraining of the NN model that is reduced in size. ([Fig. 2] discloses the replacement of layers and each layer has same input dimensionality C and output dimensionality N)

[figure image omitted: media_image3.png]
([Section 3.3] “Using the above scheme to train a new CNN from scratch is conceptually straightforward. Simply parametrize the convolutional to be of the form in (1), and the rest is not very different from training a non-constrained CNN.”)
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to combine the personalized aesthetic scoring neural network system of Shaji with the function approximation of Tai in order to remove the redundancy in the layer, thereby obtaining an exact solution efficiently (Tai: [Section 1] “As the tensor decomposition is the most important step in approximating CNNs, being able to obtain an exact solution efficiently thus provides great advantages.”).
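For illustration only, the relationship relied on above (a replacement layer with the same input dimensionality C and output dimensionality N as the layer it replaces, but fewer parameters) can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Tai: it uses a fully connected layer rather than a convolution, the rank K and all function names are hypothetical, and the factors are random placeholders rather than a fitted approximation Ŵ of W.

```python
import random


def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of Gaussian placeholder weights."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


# Hypothetical sizes: C inputs, N outputs, and an assumed rank K.
C, N, K = 64, 32, 4
rng = random.Random(0)

W = make_matrix(N, C, rng)  # original layer: N x C
U = make_matrix(N, K, rng)  # replacement factor: N x K (placeholder values)
V = make_matrix(K, C, rng)  # replacement factor: K x C (placeholder values)


def original_layer(x):
    return matvec(W, x)


def replacement_layer(x):
    # Two small products stand in for the one large product.
    return matvec(U, matvec(V, x))


params_original = N * C
params_replacement = K * (N + C)

x = [rng.gauss(0.0, 1.0) for _ in range(C)]
y_original = original_layer(x)
y_replacement = replacement_layer(x)

print(len(x), len(y_original), len(y_replacement))  # 64 32 32
print(params_original, params_replacement)          # 2048 384
```

Both layers accept a length-C input and return a length-N output, so one can stand in for the other in the surrounding network; only the parameter count differs.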
The combination of Shaji and Tai does not appear to distinctly disclose
- the approximating based on a sketching operation performed on network function parameters of the second function through linear projections of the second function with random vectors, a number of network function parameters for the first function is less than a number of network function parameters for the second function for parameter reduction, and the network function parameters for the second function are reduced without training the second NN architecture
However, Guo teaches 
- the approximating based on a sketching operation performed on network function parameters of the second function ([0213] “As described above, a first goal is to find a binary expansion of W that approximates it well (as illustrated in FIG. 16, which means
[equation image omitted: media_image4.png]
”, W is the approximation of second function and aj and bj are the function parameters; [0246] “Considering that the fully-connected layers of AlexNet contain more than 95% of its parameters, sketching them to an extreme can be attempted, namely 1 bit.”, [Fig. 15] also discloses sketching), a number of network function parameters for the first function is less than a number of network function parameters for the second function ([0250] “TABLE 6: Network sketching technique generates binary-weight ResNets with the ability to make faithful inference and roughly 7.4× fewer parameters than its reference (in bits).”; “reference” reads on “the second function”) and the network function parameters for the second function are reduced without training the second NN architecture ([0250] “TABLE 6: Sketch (dir.)”; “Sketch (dir.)” discloses parameter reduction via direct sketching which does not train the model.)
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to modify the personalized aesthetic scoring system as taught by Shaji and Tai to include function sketching as taught by Guo in order to implement a more flexible network, thereby achieving classification efficiency and faithful inference (Guo: [0251] “It can be more flexible than current available methods and it enables researchers and engineers to regulate the precision of generated sketches and get better trade-off between the model efficiency and accuracy. Both theoretical and empirical analyses have been given to validate its efficacy. Moreover, an associative implementation of binary tensor convolutions can be implemented to further speedup the sketches. As a result, binary-weight AlexNets and ResNets can be generated with the ability to make both efficient and faithful inference on the ImageNet classification task.”).
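The binary expansion of W into scale factors and binary tensors described in the Guo passage quoted above can be illustrated with a greedy residual fit. This is a minimal sketch, assuming each binary tensor is the sign of the current residual and each scale factor is the mean absolute residual; that fitting rule and every function name are assumptions made for illustration and are not quoted from Guo. The sketch only reads existing weights and performs no gradient-based training.

```python
import random


def binary_expansion(weights, m):
    """Greedily fit weights ~ sum_j a_j * b_j with b_j in {+1, -1}.

    Assumed rule: b_j is the sign of the residual and a_j is the mean
    absolute residual, which minimizes the squared error for that b_j.
    """
    residual = list(weights)
    scales, binaries = [], []
    for _ in range(m):
        b = [1 if r >= 0 else -1 for r in residual]
        a = sum(abs(r) for r in residual) / len(residual)
        residual = [r - a * s for r, s in zip(residual, b)]
        scales.append(a)
        binaries.append(b)
    return scales, binaries


def reconstruct(scales, binaries):
    """Rebuild the approximate weights from the expansion."""
    size = len(binaries[0])
    return [sum(a * b[i] for a, b in zip(scales, binaries)) for i in range(size)]


def squared_error(u, v):
    return sum((p - q) ** 2 for p, q in zip(u, v))


rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(256)]  # stand-in for a trained W

errors = []
for m in (1, 2, 4):
    scales, binaries = binary_expansion(weights, m)
    errors.append(squared_error(weights, reconstruct(scales, binaries)))

print(errors)  # the error shrinks as more binary terms are used
```

Each added term lowers the reconstruction error, which is the trade-off between sketch precision and size that the quoted motivation refers to.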
The combination of Shaji, Tai, and Guo does not appear to distinctly teach
- approximating … through linear projections of the second function with random vectors.
However, Ravi teaches:
- approximating … through linear projections of the second function with random vectors (Ravi, [top of p. 6], “The projection matrix P is fixed prior to training and inference. Note that we never need to explicitly store the random projection vector Pk since we can compute them on the fly using hash functions rather than invoking a random number generator. In addition, this also permits us to perform projection operations that are linear in the observed feature size rather than the overall feature size which can be prohibitively large for high-dimensional data, thereby saving both memory and computation cost.”). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use the approximation of Guo with Ravi’s projection in order to save memory and computation cost as suggested by Ravi. 
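The quoted Ravi passage describes projection vectors that are recomputed on the fly from hash functions instead of being stored, with cost linear in the observed feature size. A minimal sketch of that idea follows, assuming SHA-256 as the hash and entries of +1 or -1; the hash choice, the entry values, the feature indices, and all function names are assumptions for illustration, since the passage states only that hash functions are used.

```python
import hashlib


def hashed_sign(k, i):
    """Entry i of hypothetical projection vector k, recomputed from a hash."""
    digest = hashlib.sha256(f"{k}:{i}".encode()).digest()
    return 1 if digest[0] & 1 else -1


def project(sparse_x, num_projections):
    """Linear projections of a sparse input onto hashed +/-1 vectors.

    Only the observed (stored) features are visited, so the cost is
    linear in the number of observed features, not the full dimension.
    """
    return [
        sum(hashed_sign(k, i) * value for i, value in sparse_x.items())
        for k in range(num_projections)
    ]


# Hypothetical sparse input: feature index -> value.
x = {3: 0.5, 17: -1.25, 42: 2.0}
sketch = project(x, 8)
print(sketch)
```

Because the vectors are a deterministic function of their indices, no projection matrix needs to be kept in memory, and the same input always yields the same sketch.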


Regarding claim 3
Shaji as modified by Tai and Guo teaches all of the limitations of claim 1 as cited above and Tai further teaches
- the first function and the second function have a same input and output dimensionality ([Fig. 2] discloses the replacement of layers and each layer has same input dimensionality C and output dimensionality N)

[figure image omitted: media_image3.png]
Same motivation as claim 1.
Guo further teaches 
- wherein the sketching operation is performed along different dimensions of a tensor space for generating multiple different first functions that are combined to form the first layer (Guo: [0213] “As described above, a first goal is to find a binary expansion of W that approximates it well (as illustrated in FIG. 16, which means
W ≈ ⟨B, a⟩ = Σ_{j=0}^{m-1} a_j b_j
in which B ∈ {+1, −1}^{c×w×h×m} and a ∈ R^m are the concatenations of m binary tensors {B_0, . . . , B_{m-1}} and the same number of scale factors {a_0, . . . , a_{m-1}}, respectively”; the tensors {B_0, . . . , B_{m-1}} read on “different dimensions of a tensor space”; [Fig. 17] also shows how multiple subsets of first layers are formed.) and that replicate functionality of the second layer ([0214] “Generally, the reconstruction error (or approximation error, round-off error) should be minimized to retain the model accuracy after expansion.”; “retain the model accuracy” reads on “functionality of the second layer”)
Same motivation as claim 1.


Regarding claim 6
Shaji as modified by Tai and Guo teaches all of the limitations of claim 1 as cited above and Shaji further teaches
- wherein retraining of the first NN architecture is not tied to a particular dataset (Shaji: [0076] “Updating the personalized neural network comprises re-training the final layers of the personalized neural network with the third set of images and keeping the initial layers of the personalized neural network”; [0073] shows re-training is not done with the specific dataset. )

Regarding claim 7
Shaji as modified by Tai and Guo teaches all of the limitations of claim 1 as cited above and Shaji further teaches
- wherein the electronic device comprises a mobile electronic device (Shaji: [0036] “In some embodiments, the training of the personalized layer might be done in mobile device, smartphone, or other low-powered portable device.”)
	Guo further teaches
- the parameter reduction results in reduction in computing resources required for processing the NN model by the mobile electronic device including computation time, storage space and transmission bandwidth. ([0250] “TABLE 6: Network sketching technique generates binary-weight ResNets with the ability to make faithful inference and roughly 7.4× fewer parameters than its reference (in bits).”; [0086] “Thus, the techniques described herein may be implemented on any properly configured processing unit, including, without limitation, one or more mobile application processors ...”; [0126] “At the same time, the ability to access GPU attached memory 420-423 without cache coherence overheads can be critical to the execution time of an offloaded computation. In cases with substantial streaming write memory traffic, for example, cache coherence overhead can significantly reduce the effective write bandwidth seen by a GPU 410-413”)
	Same motivation as claim 1.

Regarding claim 8
Shaji teaches 
- a memory storing instructions (Shaji: [0065]; see the structure of a device) 
- at least one processor executing the instructions including a process configured to: (Shaji: [0065]; see the structure of a device) 
- deploy a neural network (NN) model on an electronic device (Shaji: [Abstract] “The method includes providing a base neural network for generating learned features.”, [Fig.5] discloses the neural network model on a system which is an electronic device [Fig. 3]), 
- the NN model being generated by training a first NN architecture on a first dataset wherein a first function defines a first layer of the first NN architecture … (Shaji: [0069] “The method further includes updating the base neural network to generate a personalized neural network based on the received second set of training images.” [0074] “According to some embodiments, the final layers of the personalized neural network comprise linear and/or non-linear multi-dimensionality reduction functions,”; “a personalized neural network” in 0069 reads on the claimed “a first NN architecture” and “set of training images” reads on “a first dataset”;  one of “the final layers of the personalized neural network” in 0074 reads on “a first layer of the first NN architecture” and “reduction functions” reads on “a first function”), 
- enable retraining of the NN model on the electronic device using a second data set and … and retraining of the NN model … provides a personalized deep learning model for a user of the electronic device (Shaji: [0076] "Updating the personalized neural network comprises re-training the final layers of the personalized neural network with the third set of images"; “the third set of images” reads on “a second data set”).
	Shaji does not distinctly disclose:
- the first function being constructed based on approximating a second function applied by a second layer of a second NN architecture 
- wherein the first layer of the first NN architecture replaces the second layer of the second NN architecture, the first function and the second function have a same input and output dimensionality, and retraining of the NN model that is reduced in size.
	However, Tai teaches: 
- the first function being constructed based on approximating a second function applied by a second layer of a second NN architecture ([Section 3], “The goal is to find an approximation Ŵ of W that facilitates more efficient computation while maintaining the classification accuracy of the CNN”; 
[equation images omitted: media_image1.png, media_image2.png]
; Ŵ reads on “first function” and W reads on “a second function”
; [Figure 2] discloses how Ŵ and W represent the first function of the first layer and the second function of the second layer, respectively.
[figure image omitted: media_image3.png]
)
- wherein the first layer of the first NN architecture replaces the second layer of the second NN architecture, the first function and the second function have a same input and output dimensionality, and retraining of the NN model that is reduced in size. ([Fig. 2] discloses the replacement of layers and each layer has same input dimensionality C and output dimensionality N)

[figure image omitted: media_image3.png]
([Section 3.3] “Using the above scheme to train a new CNN from scratch is conceptually straightforward. Simply parametrize the convolutional to be of the form in (1), and the rest is not very different from training a non-constrained CNN.”)
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to combine the personalized aesthetic scoring neural network system of Shaji with the function approximation of Tai in order to remove the redundancy in the layer, thereby obtaining an exact solution efficiently (Tai: [Section 1] “As the tensor decomposition is the most important step in approximating CNNs, being able to obtain an exact solution efficiently thus provides great advantages.”).
The combination of Shaji and Tai does not appear to distinctly disclose
- the approximating based on a sketching operation performed on network function parameters of the second function through linear projections of the second function with random vectors, a number of network function parameters for the first function is less than a number of network function parameters for the second function for parameter reduction, and the network function parameters for the second function are reduced without training the second NN architecture
However, Guo teaches 
- the approximating based on a sketching operation performed on network function parameters of the second function ([0213] “As described above, a first goal is to find a binary expansion of W that approximates it well (as illustrated in FIG. 16, which means
[equation image omitted: media_image4.png]
”, W is the approximation of second function and aj and bj are the function parameters; [0246] “Considering that the fully-connected layers of AlexNet contain more than 95% of its parameters, sketching them to an extreme can be attempted, namely 1 bit.”, [Fig. 15] also discloses sketching), a number of network function parameters for the first function is less than a number of network function parameters for the second function ([0250] “TABLE 6: Network sketching technique generates binary-weight ResNets with the ability to make faithful inference and roughly 7.4× fewer parameters than its reference (in bits).”; “reference” reads on “the second function”) and the network function parameters for the second function are reduced without training the second NN architecture ([0250] “TABLE 6: Sketch (dir.)”; “Sketch (dir.)” discloses parameter reduction via direct sketching which does not train the model.)
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to modify the personalized aesthetic scoring system as taught by Shaji and Tai to include function sketching as taught by Guo in order to implement a more flexible network, thereby achieving classification efficiency and faithful inference (Guo: [0251] “It can be more flexible than current available methods and it enables researchers and engineers to regulate the precision of generated sketches and get better trade-off between the model efficiency and accuracy. Both theoretical and empirical analyses have been given to validate its efficacy. Moreover, an associative implementation of binary tensor convolutions can be implemented to further speedup the sketches. As a result, binary-weight AlexNets and ResNets can be generated with the ability to make both efficient and faithful inference on the ImageNet classification task.”).
The combination of Shaji, Tai, and Guo does not appear to distinctly teach
- approximating … through linear projections of the second function with random vectors.
However, Ravi teaches:
- approximating … through linear projections of the second function with random vectors (Ravi, [top of p. 6], “The projection matrix P is fixed prior to training and inference. Note that we never need to explicitly store the random projection vector Pk since we can compute them on the fly using hash functions rather than invoking a random number generator. In addition, this also permits us to perform projection operations that are linear in the observed feature size rather than the overall feature size which can be prohibitively large for high-dimensional data, thereby saving both memory and computation cost.”). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use the approximation of Guo with Ravi’s projection in order to save memory and computation cost as suggested by Ravi. 


Regarding claim 10
Shaji as modified by Tai and Guo teaches all of the limitations of claim 8 as cited above and Tai further teaches
- the first function and the second function have a same input and output dimensionality ([Fig. 2] discloses the replacement of layers and each layer has same input dimensionality C and output dimensionality N)

[figure image omitted: media_image3.png]
Same motivation as claim 8.
Guo further teaches 
- wherein the sketching operation is performed along different dimensions of a tensor space for generating multiple different first functions that are combined to form the first layer (Guo: [0213] “As described above, a first goal is to find a binary expansion of W that approximates it well (as illustrated in FIG. 16, which means
W ≈ ⟨B, a⟩ = Σ_{j=0}^{m-1} a_j b_j
in which B ∈ {+1, −1}^{c×w×h×m} and a ∈ R^m are the concatenations of m binary tensors {B_0, . . . , B_{m-1}} and the same number of scale factors {a_0, . . . , a_{m-1}}, respectively”; the tensors {B_0, . . . , B_{m-1}} read on “different dimensions of a tensor space”; [Fig. 17] also shows how multiple subsets of first layers are formed.) and that replicate functionality of the second layer ([0214] “Generally, the reconstruction error (or approximation error, round-off error) should be minimized to retain the model accuracy after expansion.”; “retain the model accuracy” reads on “functionality of the second layer”)
Same motivation as claim 8.

Regarding claim 13
Shaji as modified by Tai and Guo teaches all of the limitations of claim 9 as cited above and Shaji further teaches: 
- wherein retraining of the first NN architecture is not tied to a particular dataset (Shaji: [0076] “Updating the personalized neural network comprises re-training the final layers of the personalized neural network with the third set of images and keeping the initial layers of the personalized neural network”; [0073] shows re-training is not done with the specific dataset.)

Regarding claim 14
Shaji as modified by Tai and Guo teaches all of the limitations of claim 9 as cited above and Shaji further teaches
- wherein the electronic device comprises a mobile electronic device (Shaji: [0036] “In some embodiments, the training of the personalized layer might be done in mobile device, smartphone, or other low-powered portable device.”)
	Guo further teaches
- the parameter reduction results in reduction in computing resources required for processing the NN model by the mobile electronic device including computation time, storage space and transmission bandwidth. ([0250] “TABLE 6: Network sketching technique generates binary-weight ResNets with the ability to make faithful inference and roughly 7.4× fewer parameters than its reference (in bits).”; [0086] “Thus, the techniques described herein may be implemented on any properly configured processing unit, including, without limitation, one or more mobile application processors ...”; [0126] “At the same time, the ability to access GPU attached memory 420-423 without cache coherence overheads can be critical to the execution time of an offloaded computation. In cases with substantial streaming write memory traffic, for example, cache coherence overhead can significantly reduce the effective write bandwidth seen by a GPU 410-413”)
	Same motivation as claim 8.

Regarding claim 15
Shaji teaches 
- A non-transitory processor-readable medium (Shaji: [0059] “memory”) that includes a program that when executed by a processor performing a method comprising:
- deploying a neural network (NN) model on an electronic device (Shaji: [Abstract] “The method includes providing a base neural network for generating learned features.”, [Fig.5] discloses the neural network model on a system which is an electronic device [Fig. 3]), 
- the NN model being generated by training a first NN architecture on a first dataset wherein a first function defines a first layer of the first NN architecture … (Shaji: [0069] “The method further includes updating the base neural network to generate a personalized neural network based on the received second set of training images.” [0074] “According to some embodiments, the final layers of the personalized neural network comprise linear and/or non-linear multi-dimensionality reduction functions,”; “a personalized neural network” in 0069 reads on the claimed “a first NN architecture” and “set of training images” reads on “a first dataset”;  one of “the final layers of the personalized neural network” in 0074 reads on “a first layer of the first NN architecture” and “reduction functions” reads on “a first function”), 
- enabling retraining of the NN model on the electronic device using a second data set and … and retraining of the NN model … provides a personalized deep learning model for a user of the electronic device (Shaji: [0076] "Updating the personalized neural network comprises re-training the final layers of the personalized neural network with the third set of images"; “the third set of images” reads on “a second data set”).
	Shaji does not distinctly disclose:
- the first function being constructed based on approximating a second function applied by a second layer of a second NN architecture 
- wherein the first layer of the first NN architecture replaces the second layer of the second NN architecture, and the first function and the second function have a same input and output dimensionality, and retraining of the NN model that is reduced in size …
	However, Tai teaches: 
- the first function being constructed based on approximating a second function applied by a second layer of a second NN architecture ([Section 3], “The goal is to find an approximation Ŵ of W that facilitates more efficient computation while maintaining the classification accuracy of the CNN”; 
[equation images omitted: media_image1.png, media_image2.png]
; Ŵ reads on “first function” and W reads on “a second function”
; [Figure 2] discloses how Ŵ and W represent the first function of the first layer and the second function of the second layer, respectively.
[figure image omitted: media_image3.png]
)
- wherein the first layer of the first NN architecture replaces the second layer of the second NN architecture, and the first function and the second function have a same input and output dimensionality, and retraining of the NN model that is reduced in size.  ([Fig. 2] discloses the replacement of layers and each layer has same input dimensionality C and output dimensionality N)

[figure image omitted: media_image3.png]
([Section 3.3] “Using the above scheme to train a new CNN from scratch is conceptually straightforward. Simply parametrize the convolutional to be of the form in (1), and the rest is not very different from training a non-constrained CNN.”)
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to combine the personalized aesthetic scoring neural network system of Shaji with the function approximation of Tai in order to remove the redundancy in the layer, thereby obtaining an exact solution efficiently (Tai: [Section 1] “As the tensor decomposition is the most important step in approximating CNNs, being able to obtain an exact solution efficiently thus provides great advantages.”).
The combination of Shaji and Tai does not appear to distinctly disclose
- the approximating based on a sketching operation performed on network function parameters of the second function through linear projections of the second function with random vectors, a number of network function parameters for the first function is less than a number of network function parameters for the second function for parameter reduction, and the network function parameters for the second function are reduced without training the second NN architecture
However, Guo teaches 
- the approximating based on a sketching operation performed on network function parameters of the second function ([0213] “As described above, a first goal is to find a binary expansion of W that approximates it well (as illustrated in FIG. 16, which means $W \approx \langle B, a \rangle = \sum_{j=0}^{m-1} a_j b_j$”, W is the approximation of the second function and $a_j$ and $b_j$ are the function parameters; [0246] “Considering that the fully-connected layers of AlexNet contain more than 95% of its parameters, sketching them to an extreme can be attempted, namely 1 bit.”, [Fig. 15] also discloses sketching), a number of network function parameters for the first function is less than a number of network function parameters for the second function ([0250] “TABLE 6: Network sketching technique generates binary-weight ResNets with the ability to make faithful inference and roughly 7.4× fewer parameters than its reference (in bits).”; “reference” reads on “the second function”) and the network function parameters for the second function are reduced without training the second NN architecture ([0250] “TABLE 6: Sketch (dir.)”; “Sketch (dir.)” discloses parameter reduction via direct sketching which does not train the model.)
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to modify the personalized aesthetic scoring system as taught by Shaji and Tai to include function sketching as taught by Guo in order to implement a more flexible network, thereby achieving classification efficiency and faithful inference (Guo: [0251] “It can be more flexible than current available methods and it enables researchers and engineers to regulate the precision of generated sketches and get better trade-off between the model efficiency and accuracy. Both theoretical and empirical analyses have been given to validate its efficacy. Moreover, an associative implementation of binary tensor convolutions can be implemented to further speedup the sketches. As a result, binary-weight AlexNets and ResNets can be generated with the ability to make both efficient and faithful inference on the ImageNet classification task.”)
The combination of Shaji, Tai, and Guo does not appear to distinctly teach
- approximating … through linear projections of the second function with random vectors.
However, Ravi teaches:
- approximating … through linear projections of the second function with random vectors (Ravi, [top of p. 6], “The projection matrix P is fixed prior to training and inference. Note that we never need to explicitly store the random projection vector Pk since we can compute them on the fly using hash functions rather than invoking a random number generator. In addition, this also permits us to perform projection operations that are linear in the observed feature size rather than the overall feature size which can be prohibitively large for high-dimensional data, thereby saving both memory and computation cost.”). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use the approximation of Guo with Ravi’s projection in order to save memory and computation cost as suggested by Ravi. 
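The passage quoted from Ravi describes projection vectors that are computed on the fly from hash functions rather than stored. A minimal sketch of that idea follows, assuming ±1 projection entries derived from SHA-256; the names `hashed_sign` and `sketch`, the seed string, and the choice of hash are hypothetical and are not taken from Ravi, Guo, or the application.

```python
import hashlib


def hashed_sign(k, i, seed="demo"):
    """Return a pseudo-random +1 or -1 for entry i of projection vector k.

    The sign is recomputed from a hash on every call, so the projection
    vector itself is never stored (an illustrative assumption).
    """
    digest = hashlib.sha256(f"{seed}:{k}:{i}".encode()).digest()
    return 1 if digest[0] & 1 else -1


def sketch(weights, num_projections):
    """Linearly project a parameter vector onto random sign vectors."""
    return [
        sum(hashed_sign(k, i) * w for i, w in enumerate(weights))
        for k in range(num_projections)
    ]


# Eight hypothetical parameters are summarized by three projections.
weights = [0.5, -1.25, 2.0, 0.75, -0.5, 1.5, -2.25, 0.25]
sketched = sketch(weights, num_projections=3)
```

Because each projection is a plain weighted sum, the operation is linear in the parameters, and repeating the call reproduces the same values without any stored random matrix.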

Regarding claim 20
Shaji as modified by Tai teaches all of the limitations of claim 15 as cited above and Shaji further teaches
-wherein the electronic device comprises a mobile electronic device (Shaji: [0036] “In some embodiments, the training of the personalized layer might be done in mobile device, smartphone, or other low-powered portable device.”)
Guo further teaches
- the parameter reduction results in reduction in computing resources required for processing the NN model by the mobile electronic device including computation time, storage space and transmission bandwidth. ([0250] “TABLE 6: Network sketching technique generates binary-weight ResNets with the ability to make faithful inference and roughly 7.4× fewer parameters than its reference (in bits).”; [0086] “Thus, the techniques described herein may be implemented on any properly configured processing unit, including, without limitation, one or more mobile application processors ...”; [0126] “At the same time, the ability to access GPU attached memory 420-423 without cache coherence overheads can be critical to the execution time of an offloaded computation. In cases with substantial streaming write memory traffic, for example, cache coherence overhead can significantly reduce the effective write bandwidth seen by a GPU 410-413”)
	Same motivation as claim 15.


Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Shaji in view of Tai, Guo, and Ravi as cited above, and further in view of Wang et al. (“Fast and guaranteed tensor decomposition via sketching”).

Regarding claim 2
Shaji as modified by Tai and Guo teaches all of the limitations of claim 1 as cited above and Guo further teaches 
- wherein prior knowledge of the first dataset is not required due to the training occurring after the parameter reduction ([0247] “Just to avoid the propagation of reconstruction errors, we need to somehow fine-tune the generated sketches. … one is known as projection gradient descent and the other is stochastic gradient descent with full precision weight update as described in Reference [1]. The latter can be chose by virtue of its better convergence. The training batch size can be set as 256 and the momentum is 0.9.”; discloses the training is done after sketching which reduced the parameters)
	Same motivation as claim 1.
Shaji, Tai, Guo, and Ravi do not expressly teach
- and the first layer is a sketching converted layer that is parameterized by a sequence of tensor-matrix pairs
However, this is taught by Wang.  See Wang, bottom of p. 3, e.g.:

    [Image: media_image5.png]

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use Shaji’s neural network layers with Wang’s sketching converted data in order to provide efficient tensor decomposition as suggested by Wang (See Wang, section 1).

Claims 4-5, 11-12 and 17-19 are rejected under 35 U.S.C. 103 as being unpatentable over Shaji in view of Tai in view of Guo as shown above, further in view of Kisilev et al. (US 2018/0060719 A1).

Regarding claim 4
Shaji as modified by Tai and Guo teaches all of the limitations of claim 1 as cited above and Shaji further teaches
- wherein the second dataset is a personal data set on the electronic device (Shaji: [0027], “Further, in some embodiments, the personalization layer (or layers) (i.e., those layers that are updated based on the personalized data set) may be comprised of a linear and/or non-linear multi dimensionality reduction functions.”)
-the first NN architecture is trained on the first dataset … ([0027] “the personalization layer (or layers) (i.e., those layers that are updated based on the personalized data set) may be comprised of a linear and/or non-linear multi dimensionality reduction functions. WSABIE is an instance of linear multi-dimensionality reduction functions. In some embodiments, these functions can be trained quickly using stochastic gradient methods.”; “The personalization layer” reads on “the first NN architecture” and “the personalized data set” reads on “the first dataset”)
	Guo further teaches
- the sketching operation is not performed on input data of the second function ([0022] “FIG. 15 is an overview diagram of sketching a network model by exploiting binary structures according to exemplary embodiments.”; “sketching” approximates the structure of the network and it does not deal with the input data.)
	Shaji as modified by Tai and Guo does not distinctly disclose
- after the second NN architecture is reduced to the first NN architecture
	However, Kisilev teaches 
- after the second NN architecture is reduced to the first NN architecture ([0027] “The reduced network may be trained on the data in that stage.”; “The reduced network” reads on the “first NN architecture” reduced from the second NN architecture.)
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to modify the personalized aesthetic scoring system as taught by Shaji, Tai and Guo to include network reduction as taught by Kisilev in order to train the neural network architecture after reduction, thereby improving the speed of training (Kisilev: [0027] “By avoiding training all nodes on all training data, dropout may decrease overfitting in neural networks and may also significantly improve the speed of training.”).

Regarding claim 5
Shaji as modified by Tai, Guo and Kisilev teaches all of the limitations of claim 4 as cited above and Kisilev further teaches
- wherein the first NN architecture is computationally reduced from the second NN architecture (Kisilev: [0027] “A dropout layer of processing may be performed to prevent overfitting. In dropout processing, individual nodes may be either “dropped out” of the neural network with probability 1-p or kept with probability p, so that a reduced network is left Likewise, incoming and outgoing edges to a dropped-out node may also be removed.”; [0027] discloses how the neural network is computationally reduced.)
	Same motivation as claim 4.
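The dropout passage quoted from Kisilev [0027] (nodes kept with probability p, with edges to dropped nodes removed) can be sketched as follows. The layer sizes, the fixed random seed, and the choice to always keep input and output nodes are assumptions made for illustration and are not taken from Kisilev.

```python
import random


def drop_nodes(layer_sizes, keep_prob, rng):
    """Keep each hidden node with probability keep_prob; return kept indices per layer."""
    kept = []
    last = len(layer_sizes) - 1
    for depth, size in enumerate(layer_sizes):
        if depth in (0, last):
            # Assumption: input and output nodes are never dropped.
            kept.append(list(range(size)))
        else:
            kept.append([n for n in range(size) if rng.random() < keep_prob])
    return kept


def count_edges(kept):
    """Count edges of a fully connected network restricted to the kept nodes."""
    return sum(len(a) * len(b) for a, b in zip(kept, kept[1:]))


rng = random.Random(0)
layer_sizes = [4, 8, 8, 2]  # hypothetical architecture
kept = drop_nodes(layer_sizes, keep_prob=0.5, rng=rng)
full_edges = count_edges([list(range(s)) for s in layer_sizes])
reduced_edges = count_edges(kept)
```

Counting edges only between kept nodes reflects the quoted statement that incoming and outgoing edges of a dropped-out node are also removed, so the reduced network never has more edges than the full one.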

Regarding claim 11
Shaji as modified by Tai and Guo teaches all of the limitations of claim 8 as cited above and Shaji further teaches: 
- wherein the second dataset is a personal data set on the electronic device (Shaji: [0027], “Further, in some embodiments, the personalization layer (or layers) (i.e., those layers that are updated based on the personalized data set) may be comprised of a linear and/or non-linear multi dimensionality reduction functions.”)
-the first NN architecture is trained on the first dataset … ([0027] “the personalization layer (or layers) (i.e., those layers that are updated based on the personalized data set) may be comprised of a linear and/or non-linear multi dimensionality reduction functions. WSABIE is an instance of linear multi-dimensionality reduction functions. In some embodiments, these functions can be trained quickly using stochastic gradient methods.”; “The personalization layer” reads on “the first NN architecture” and “the personalized data set” reads on “the first dataset”)
	Guo further teaches
- the sketching operation is not performed on input data of the second function ([0022] “FIG. 15 is an overview diagram of sketching a network model by exploiting binary structures according to exemplary embodiments.”; “sketching” approximates the structure of the network and it does not deal with the input data.)
	Shaji as modified by Tai and Guo does not distinctly disclose
- after the second NN architecture is reduced to the first NN architecture
	However, Kisilev teaches 
- after the second NN architecture is reduced to the first NN architecture ([0027] “The reduced network may be trained on the data in that stage.”; “The reduced network” reads on the “first NN architecture” reduced from the second NN architecture.)
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to modify the personalized aesthetic scoring system as taught by Shaji, Tai and Guo to include network reduction as taught by Kisilev in order to train the neural network architecture after reduction, thereby improving the speed of training (Kisilev: [0027]).

Regarding claim 12
Shaji as modified by Tai, Guo and Kisilev teaches all of the limitations of claim 11 as cited above and Kisilev further teaches
- wherein the first NN architecture is computationally reduced from the second NN architecture (Kisilev: [0027] “A dropout layer of processing may be performed to prevent overfitting. In dropout processing, individual nodes may be either “dropped out” of the neural network with probability 1-p or kept with probability p, so that a reduced network is left Likewise, incoming and outgoing edges to a dropped-out node may also be removed.”; [0027] discloses how the neural network is computationally reduced.)
	Same motivation as claim 11.

Regarding claim 17
Shaji as modified by Tai and Guo teaches all of the limitations of claim 15 as cited above and Shaji further teaches: 
- wherein the second dataset is a personal data set on the electronic device (Shaji: [0027], “Further, in some embodiments, the personalization layer (or layers) (i.e., those layers that are updated based on the personalized data set) may be comprised of a linear and/or non-linear multi dimensionality reduction functions.”)
-the first NN architecture is trained on the first dataset … ([0027] “the personalization layer (or layers) (i.e., those layers that are updated based on the personalized data set) may be comprised of a linear and/or non-linear multi dimensionality reduction functions. WSABIE is an instance of linear multi-dimensionality reduction functions. In some embodiments, these functions can be trained quickly using stochastic gradient methods.”; “The personalization layer” reads on “the first NN architecture” and “the personalized data set” reads on “the first dataset”)
- the first function and the second function have a same input and output dimensionality ([Fig. 2] discloses the replacement of layers and each layer has same input dimensionality C and output dimensionality N; [Figure image: media_image3.png])
Same motivation as claim 15.
Guo further teaches: 
- wherein the sketching operation is performed along different dimensions of a tensor space for generating multiple different first functions that are combined to form the first layer (Guo: [0213] “As described above, a first goal is to find a binary expansion of W that approximates it well (as illustrated in FIG. 16, which means $W \approx \langle B, a \rangle = \sum_{j=0}^{m-1} a_j b_j$ in which $B \in \{+1, -1\}^{c \times w \times h \times m}$ and $a \in \mathbb{R}^m$ are the concatenations of m binary tensors $\{B_0, \ldots, B_{m-1}\}$ and the same number of scale factors $\{a_0, \ldots, a_{m-1}\}$, respectively”, tensors $\{B_0, \ldots, B_{m-1}\}$ reads on “different dimensions of a tensor space”, [Fig. 17] also shows how multiple subsets of first layers are formed.) and that replicate functionality of the second layer ([0214] “Generally, the reconstruction error (or approximation error, round-off error) should be minimized to retain the model accuracy after expansion.”; “retain the model accuracy” reads on “functionality of the second layer”)
	Shaji as modified by Tai and Guo does not distinctly disclose
- after the second NN architecture is reduced to the first NN architecture
	However, Kisilev teaches 
- after the second NN architecture is reduced to the first NN architecture ([0027] “The reduced network may be trained on the data in that stage.”; “The reduced network” reads on the “first NN architecture” reduced from the second NN architecture.)
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to modify the personalized aesthetic scoring system as taught by Shaji, Tai and Guo to include network reduction as taught by Kisilev in order to train the neural network architecture after reduction, thereby improving the speed of training (Kisilev: [0027] “By avoiding training all nodes on all training data, dropout may decrease overfitting in neural networks and may also significantly improve the speed of training.”).
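The binary expansion quoted from Guo [0213] in the claim 17 analysis, a sum of scale factors times ±1 tensors, can be illustrated with a small greedy residual scheme. This is a sketch under assumptions: the greedy procedure, the flattened weight list, and the function names are hypothetical and may differ from the algorithm Guo actually discloses.

```python
def binary_expansion(weights, m):
    """Greedily approximate weights by sum_j a_j * b_j with b_j in {+1, -1}^n."""
    residual = list(weights)
    scales, signs = [], []
    for _ in range(m):
        b = [1 if r >= 0 else -1 for r in residual]
        # For b = sign(residual), the least-squares scale is the mean absolute residual.
        a = sum(abs(r) for r in residual) / len(residual)
        scales.append(a)
        signs.append(b)
        residual = [r - a * s for r, s in zip(residual, b)]
    return scales, signs


def reconstruct(scales, signs):
    """Rebuild the approximation from the scale factors and sign vectors."""
    n = len(signs[0])
    return [sum(a * b[i] for a, b in zip(scales, signs)) for i in range(n)]


def squared_error(u, v):
    """Sum of squared differences between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


weights = [0.9, -0.4, 0.1, -1.3, 0.6, 0.2]  # hypothetical flattened weights
errors = []
for m in (1, 2, 3):
    scales, signs = binary_expansion(weights, m)
    errors.append(squared_error(weights, reconstruct(scales, signs)))
```

No training data or gradient step appears in this scheme; it operates only on the existing weights, and the reconstruction error does not increase as more terms are added.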

Regarding claim 18
Shaji as modified by Tai, Guo and Kisilev teaches all of the limitations of claim 17 as cited above and Guo further teaches
- the sketching operation is not performed on input data of the second function ([0022] “FIG. 15 is an overview diagram of sketching a network model by exploiting binary structures according to exemplary embodiments.”; “sketching” approximates the structure of the network and it does not deal with the input data.)
Kisilev further teaches
- wherein the first NN architecture is computationally reduced from the second NN architecture (Kisilev: [0027] “A dropout layer of processing may be performed to prevent overfitting. In dropout processing, individual nodes may be either “dropped out” of the neural network with probability 1-p or kept with probability p, so that a reduced network is left Likewise, incoming and outgoing edges to a dropped-out node may also be removed.”; [0027] discloses how the neural network is computationally reduced.)
	Same motivation as claim 17.

Regarding claim 19
Shaji as modified by Tai, Guo and Kisilev teaches all of the limitations of claim 18 as cited above and Shaji further teaches:
- wherein retraining of the first NN architecture is not tied to a particular dataset ([0076] “Updating the personalized neural network comprises re-training the final layers of the personalized neural network with the third set of images and keeping the initial layers of the personalized neural network”; [0076] shows re-training is not done with the specific dataset.)

Claims 9 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Shaji in view of Tai, Guo, and Ravi as cited above, and further in view of Tjandra et al. (“Compressing recurrent neural network with tensor train”).

Regarding claim 9
Shaji as modified by Tai and Guo teaches all of the limitations of claim 8 as cited above and Guo further teaches
- wherein prior knowledge of the first dataset is not required due to the training occurring after the parameter reduction ([0247] “Just to avoid the propagation of reconstruction errors, we need to somehow fine-tune the generated sketches. … one is known as projection gradient descent and the other is stochastic gradient descent with full precision weight update as described in Reference [1]. The latter can be chose by virtue of its better convergence. The training batch size can be set as 256 and the momentum is 0.9.”; discloses the training is done after sketching which reduced the parameters)
	Same motivation as claim 8.
Shaji, Tai, Guo, and Ravi do not expressly teach
- and the first layer is a sketching fully-connected layer that is parameterized by a bias vector and a sequence of matrix pairs. However, this is taught by Tjandra ([Section IIA, p. 4451]:

    [Image: media_image6.png])
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use Shaji’s neural network layers with Tjandra’s vector sequence in order to model sequential dependencies as suggested by Tjandra ([Section IIA, p. 4451]).

Regarding claim 16
Shaji as modified by Tai and Guo teaches all of the limitations of claim 15 as cited above and Guo further teaches 
- wherein prior knowledge of the first dataset is not required due to the training occurring after the parameter reduction ([0247] “Just to avoid the propagation of reconstruction errors, we need to somehow fine-tune the generated sketches. … one is known as projection gradient descent and the other is stochastic gradient descent with full precision weight update as described in Reference [1]. The latter can be chose by virtue of its better convergence. The training batch size can be set as 256 and the momentum is 0.9.”; discloses the training is done after sketching which reduced the parameters)
	Same motivation as claim 15.

Shaji, Tai, Guo, and Ravi do not expressly teach
- and the first layer is a sketching fully-connected layer that is parameterized by a bias vector and a sequence of matrix pairs. However, this is taught by Tjandra ([Section IIA, p. 4451]:

    [Image: media_image6.png])
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use Shaji’s neural network layers with Tjandra’s vector sequence in order to model sequential dependencies as suggested by Tjandra ([Section IIA, p. 4451]).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
U.S. Patent Application Publication 2017/0154262 by Sussillo et al. See ¶ 0032, e.g. “the inputs to the trained neural network 102 and the retrained, resized neural network 104 are features of a personalized recommendation for a user” Also see Fig. 2 and ¶ 0038, e.g. “The system retrains the resized neural network (step 206).” 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to James D Rutten whose telephone number is (571)272-3703. The examiner can normally be reached M-F 9:00-5:30 ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B Zhen can be reached on (571)272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/James D. Rutten/
Primary Examiner, Art Unit 2121