DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. Claims 1-20 are pending under this Office action.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 4, 6, 12, and 14 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention. Note that every term in the equation should be well defined: such terms as X, $X_i$, $x_i$, i, min, and L, etc. are not well defined in the current format of the claims.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Edrenkin (US 20200380355 A1) in view of Gambetta et al. (US 20200342347 A1), further in view of Stojevic et al. (US 20210398621 A1).
Regarding claim 1, Edrenkin teaches a method of processing data in a machine learning model (See Edrenkin: Figs. 1A-B, and [0028], “FIG. 1A is an illustration of a general neural network comprising an input layer, several hidden layers, and an output layer. The neural network may comprise an input layer having at least one neuron, one or several hidden layers, each hidden layer usually having multiple neurons, and an output layer having again at least one neuron. In this matter, a neuron is a node of the neural network, wherein a collection of neurons, i.e. nodes, forms a layer, for example, an input layer, a hidden layer, or an output layer”), comprising:
receiving input data at a machine learning model, the machine learning model (See Edrenkin: Figs. 1A-B, and [0037], “A general neural classification model architecture may use a neural network as shown in FIGS. 1A and 1B or may use a recurrent neural network, or a transformer neural network being used especially for language models with regard to translations. The computational complexity of the neural classification model is mostly dominated by the fully connected last hidden layer having N neurons typically in the order of hundreds, and the output layer having V neurons typically in the order of millions. When training the parameters in these two layers, for example the weights w associated with these two layers, O(N*V) operations may be performed during training. As mentioned above, as N may be a value in the order of hundreds and V may be a value in the order of millions, the number of operations used for training may be quite high leading to increased computational burden and extremely long training times of the classification model. If, however, the number of neurons in the last hidden layer and/or the number of neurons of the output layer are reduced in order to decrease computational burden and training times, the throughput and capacity of the classification model is also reduced, and the quality, accuracy, efficiency, and/or preciseness of classification is impaired”) comprising: 
a plurality of processing layers (See Edrenkin: Figs. 1A-B, and [0028], “FIG. 1A is an illustration of a general neural network comprising an input layer, several hidden layers, and an output layer”); 
a plurality of gate logics (See Edrenkin: Fig. 2, and [0088], “According to another embodiment, the activation function f used for the neural network 230 to compute the probabilities assigned to the neurons of the output layer may be a hierarchical softmax function or any other softmax-like activation function that outputs a valid probability distribution over words. For instance, the coarse training unit 210 uses hierarchical softmax or any other softmax-like activation function that outputs a valid probability distribution over words as activation function f in order to further reduce the computational complexity of the classification”);
a plurality of gates (See Edrenkin: Fig. 2, and [0047], “In one embodiment, the subset of neurons V′ of the output layer is determined by a determination unit being included in the classification apparatus 200, wherein the fine training unit 220 trains the neural network on the set of neurons k of the last hidden layer and the determined subset of neurons l′ of the output layer. The determination unit may be part of the fine training unit 220 or may be a single unit in the classification apparatus 200. The determination unit may comprise a partitioning unit and a selection unit, wherein the partitioning unit partitions the neurons of the output layer in subsets, and wherein the selection unit selects a subset which is used by the fine training unit 220 for training the neural network 230”); and 
a fully connected layer connected to an output of one of the plurality of processing layers (See Edrenkin: Figs. 4A-B, and [0043], “FIG. 4A is an illustration of coarse training a neural network according to an embodiment. Coarse training is performed by the coarse training unit 210 and may be also called coarse softmax. FIG. 4A shows the last hidden layer and the output layer of the neural network 230. During coarse training of the neural network 230, the coarse training unit trains the neural network on a subset of neurons k′ of the last hidden layer and a set of neurons l of the output layer. The set of neurons l of the output layer may be the total number of neurons V of the output layer or a set of neurons smaller than the total number of neurons of the output layer. The subset of neurons k′ of the last hidden layer may comprise a number of neurons of the last hidden layer smaller than a total number of neurons N of the last hidden layer. Again, the parameter V indicates the number of neurons of the output layer and the parameter N indicates the number of neurons of the fully connected last hidden layer”); 
determining based on a plurality of gate parameters associated with the plurality of gate logics, a subset of the plurality of processing layers with which to process the input data (See Edrenkin: Fig. 7, and [0072], “Thus, in step S760, the fine probabilities are computed for a subset of neurons of the output layer, i.e. for a subset of elements of the classification training dataset. In order to do so, the parameter k for the number of neurons in the last hidden layer used for fine training is, for example, defined in step S750. The parameter k is, for example, defined manually by a user of the classification apparatus 200 or during manufacturing, dependent on hardware configuration and capacity, the domain and size of the classification training dataset etc. Typical values for k are, for example, in the 1000′s”); 
processing the input data with the subset of the plurality of processing layers and the fully connected layer to generate an inference (See Edrenkin: Figs. 4A-B, and [0044], “In FIG. 4A, it is shown that the coarse training unit 210 trains the neural network 230 on k′=4 neurons of the last hidden layer and on l=V neurons of the output layer, i.e. the total number of neurons V of the output layer. Thus, the number of operations during coarse training is reduced to O(k′*V) operations which means that the coarse training unit 210 performs N/k′ times less operations compared to the prior art”); 
determining a prediction loss based on the inference and a training label associated with the input data (See Edrenkin: Figs. 1A-B, and [0036], “Training a neural network may mean calibrating the weights w associated with the inputs of the neurons. In the beginning, initial weights may be randomly selected based on, for example, Gaussian distribution. The process for training a neural network may then be repeated multiple times on the initial weights until the weights are calibrated using the backpropagation algorithm to accurately predict an output”); 
determining an energy loss based on the subset of the plurality of processing layers used to process the input data (See Edrenkin: Fig. 2, and [0078], “According to an embodiment, the weights of the neural network 230 are calculated with a joint loss function, wherein the joint loss function is a sum of a loss function of the coarse training unit 210 and a loss function of the fine training unit 220. As a loss function, which is also sometimes called error function, cross-entropy or any other loss function suitable for a classification task may be used. Minimizing the joint loss function is tantamount to minimizing each summand, i.e. minimizing both the loss function of the coarse training unit 210 and the loss function of the fine training unit 220, and hence coarse training and fine training is possible at the same time”); and 
optimizing the machine learning model (See Edrenkin: Fig. 2, and [0068], “To reduce the size of the classification training dataset, for example the number of neurons V in the output layer, as illustrated in FIGS. 4A and 4B, or as illustrated in FIGS. 5A and 5B, every neuron of the output layer may be allocated with a probability distribution by the coarse training unit 210. In order to do so, a subset of neurons k′ of the last hidden layer is defined in step S720 and the neural network 230 is trained based on the subset of neurons k′ of the last hidden layer and the set of neurons l of the output layer, wherein the set of neurons l may comprise the total number of neurons of the output layer. The value k′ for defining the subset of neurons of the last hidden layer may be set manually by a user of the classification apparatus 200 or may be automatically determined in a separate training model. For example, as separate training model for determining the parameter k′ for setting the number of neurons of the last hidden layer used by the coarse training unit 210, a reinforcement machine learning model is used. A large k′ may result in a finer and more accurate calculation of the coarse probabilities and a more accurate probability distribution of the classification training dataset with a larger computational burden. The use of the term “coarse” in the present solution designates the fidelity or resolution of the probability distribution for the classification training dataset. The parameter k′ may be configured to optimize the training of the model, and may vary based on the type of input dataset, hardware requirements, etc.”) based on: 
the prediction (See Edrenkin: Figs. 1A-B, and [0036], “Training a neural network may mean calibrating the weights w associated with the inputs of the neurons. In the beginning, initial weights may be randomly selected based on, for example, Gaussian distribution. The process for training a neural network may then be repeated multiple times on the initial weights until the weights are calibrated using the backpropagation algorithm to accurately predict an output”) loss; 
the energy (See Edrenkin: Fig. 2, and [0078], “According to an embodiment, the weights of the neural network 230 are calculated with a joint loss function, wherein the joint loss function is a sum of a loss function of the coarse training unit 210 and a loss function of the fine training unit 220. As a loss function, which is also sometimes called error function, cross-entropy or any other loss function suitable for a classification task may be used. Minimizing the joint loss function is tantamount to minimizing each summand, i.e. minimizing both the loss function of the coarse training unit 210 and the loss function of the fine training unit 220, and hence coarse training and fine training is possible at the same time”) loss; and 
a prior probability associated with the training label (See Edrenkin: Fig. 2, and [0048], “During training by the coarse training unit 210, each neuron v.sub.i∈l of the set of neurons l of the output layer may be assigned to a coarse probability P.sub.c(v.sub.i), wherein the coarse probability may be an output result of the coarse training unit 210. Similar to the calculations performed in the neural network of FIGs. 1A and 1B, the coarse probability P.sub.c(v.sub.i) may be calculated in each neuron v.sub.i of the output layer by using an activation function f. The activation function f may be a softmax function outputting probabilities summing up to one, and may use as inputs the weighted sum of the m.sub.c inputs of each neuron v.sub.i of the set of neurons l of the output layer during coarse training as shown below”).
However, Edrenkin fails to explicitly disclose a plurality of gate logics; determining based on a plurality of gate parameters associated with the plurality of gate logics; the prediction loss; and the energy loss.
However, Gambetta teaches a plurality of gate logics (See Gambetta: Fig. 1, and [0097], “Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, configuration data for integrated circuitry, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Edrenkin to have a plurality of gate logics as taught by Gambetta in order to significantly reduce resource demands, thereby increasing efficiency and decreasing complexity (See Gambetta: [0011], “In conventional circuits, Boolean logic gates arranged in succession manipulate a series of bits. The technology for optimizing the gate-logic for binary computations is well-known. Circuit optimization software for conventional circuits aims to increase efficiency and decrease complexity of conventional circuits. Circuit optimization software for conventional circuits functions in part by decomposing the overall desired behavior of the conventional circuit into simpler functions. The conventional circuit optimization software more easily manipulates and processes the simpler functions. The circuit optimization software generates an efficient layout of design elements on the conventional circuit. As a result, circuit optimization software for conventional circuits significantly reduces resource demands, thereby increasing efficiency and decreasing complexity”). Edrenkin teaches a method and system that may train the classification apparatus with a coarse training and a fine training on different subsets of layers to increase the training speed and reduce the computational complexity, with activation functions to assign weights to different neurons, while Gambetta teaches a system and method that may train the machine learning model with a training dataset for validation of a quantum circuit by a first processor with a set of rules and programmable logic arrays. Therefore, it is obvious to one of ordinary skill in the art to modify Edrenkin by Gambetta to enhance the machine learning model training with logic arrays.
The motivation to modify Edrenkin by Gambetta is “Use of known technique to improve similar devices (methods, or products) in the same way”.
However, Edrenkin, as modified by Gambetta, fails to explicitly disclose determining based on a plurality of gate parameters associated with the plurality of gate logics; the prediction loss; and the energy loss.
However, Stojevic teaches determining based on a plurality of gate parameters associated with the plurality of gate logics (See Stojevic: Figs. 1-3, and [0785], “The adjustable parameters of gates, w.sub.1 to w.sub.10 in the diagram, are simply classical parameters. These may be optimised in a standard stochastic gradient routine to minimise a cost function of interest. The quantum part of the calculation determines the predictors, and the gradients for the gate parameters. The algorithm is therefore split into classical and quantum subroutines. The quantum subroutine deals with the calculation of the cost function and the gradients, whilst the updates of the parameters, and the rest of the stochastic gradient subroutine, are logic dealt with classically. The basic arrangement of splitting a gradient descent algorithm into quantum and classical parts addresses a machine learning problem, rather than purely wave function optimisation for a single molecule. The algorithm(s) and arrangement disclosed herein can be tested on computers that are not universal and do not have full quantum error correction implemented, for example the “IBM-Q” or the upcoming Google chipset (or the D-WAVE computer). Even though non-universal, future iterations of these machines may be capable of exploiting quantum entanglement that is not efficiently accessible classically. It has been inconclusively argued that this may be the case even for systems available at present. All accessible entanglement would, in principle, be utilised by the present approach, and there is be no requirement to understand details of decoherence effects or how to correct for these. The effects of decoherence may only be noticed in the lack of improvement for quantum circuits beyond a certain depth”);
 the prediction loss (See Stojevic: Figs. 1-3, and [0773], “Data may be taken in the classical algorithm, expressed as a set of bits (0s and 1s), and mapped to a qubit input on the quantum computing side (classical 0 is mapped to a qubit pointing ‘down’, and 1 to a qubit pointing ‘up’). On the classical side, a standard training approach may be used, wherein the predictors are minimised with respect to the training data using standard cost functions. The predictor on the quantum side is obtained by actually measuring the operator (described by the operator network) on the quantum state many times in order to obtain the expectation value of the operator statistically. The quantum state on which the measurement of the operator expectation value is performed is the result of applying the transformative quantum circuit (equivalent of the transformative network on the classical side) to the inputs. More generally, in the case when left and right inputs are different, the quantum measurement corresponds to obtaining the amplitude of the quantum state corresponding to the transformed right input data into the quantum state corresponding to the transformed left input data, with an insertion of the operator. Therefore, in the quantum algorithm, the parameters as previously disclosed are updated using results of these measurements. There may be no quantum element regarding these parameters i.e. they are just real or complex numbers, but they may be operable to be updated using a result of an average of quantum measurements. In the classical algorithm, the parameters (or equivalently the weights) may be updated using back-propagation. In the quantum computer implementation (or quantum version), the cost function and the updates of the weights classically obtained using back propagation are obtained via quantum measurements”); 
the energy loss (See Stojevic: Fig. 55, and [0543], “From a HF solution, a subset of occupied and virtual orbitals is selected to act as active space. The remaining occupied and virtual orbitals are kept frozen at HF level and the electronic structure in the active space is solved for exactly. The notation CAS(N, L) refers to an active space containing N electrons distributed between all configurations that can be constructed from L molecular orbitals. A CAS-SCF simulation is a two-step process where the energy can be iteratively minimized by doing a full-CI calculation only in the active space (CAS-CI). That information is then used to rotate the occupied and active orbital spaces to minimize the energy even further. Because the many-body Hilbert space grows exponentially with the number of single-particle states, only small active spaces up to 18 electrons in 18 orbitals can be treated with CAS-CI (cf. exact diagonalization). Dynamic correlation is usually small and can be recovered with good accuracy by means of perturbative methods on top of the CAS solution which should contain the proper static correlation”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Edrenkin to include determining based on a plurality of gate parameters associated with the plurality of gate logics; the prediction loss; and the energy loss, as taught by Stojevic, in order to allow one to construct the model entirely from graph-convolutional networks (See Stojevic: [0813], “where pi is model's distribution over the ith node given the input graph custom-character, {circumflex over (π)} is a node permuting operator, and π(i) is the index of the node to which the ith node is mapped under {circumflex over (π)}. This comes with other advantages: it allows one to construct the model entirely from graph-convolutional networks without using any fully connected layers, which means we can deal with molecules of any size without changing the number of parameters”). Edrenkin teaches a method and system that may train the classification apparatus with a coarse training and a fine training on different subsets of layers to increase the training speed and reduce the computational complexity, with activation functions to assign weights to different neurons, while Stojevic teaches a system and method that may have quantum circuits configured as an infinite tensor network representation of quantum states of the infinite physical or chemical system and may train the network by minimizing the prediction loss and energy loss functions. Therefore, it is obvious to one of ordinary skill in the art to modify Edrenkin by Stojevic to train the machine learning model by minimizing the prediction loss and energy loss functions. The motivation to modify Edrenkin by Stojevic is “Use of known technique to improve similar devices (methods, or products) in the same way”.
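For orientation only, the steps recited in claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation and not anything disclosed by Edrenkin, Gambetta, or Stojevic: the sigmoid gate rule, the scalar toy layers, the per-layer energy costs, and every name below are hypothetical.

```python
import math

def toy_layer(weight, bias):
    """Hypothetical scalar processing layer: x -> max(0, weight * x + bias)."""
    return lambda x: max(0.0, weight * x + bias)

def select_layers(x, gate_params, threshold=0.5):
    """Assumed gate logic: keep a layer when sigmoid(g * x) exceeds threshold."""
    return [1.0 / (1.0 + math.exp(-g * x)) > threshold for g in gate_params]

def forward(x, layers, gate_params, fc_weights):
    """Process x with the gated-on subset of layers, then a fully connected
    layer followed by softmax, and return the inference plus the gate mask."""
    active = select_layers(x, gate_params)
    h = x
    for layer, on in zip(layers, active):
        if on:
            h = layer(h)
    logits = [w * h for w in fc_weights]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps], active

def prediction_loss(probs, label):
    """Cross-entropy between the inference and the training label."""
    return -math.log(probs[label])

def energy_loss(active, layer_costs):
    """Sum of assumed per-layer energy costs over the layers actually used."""
    return sum(cost for cost, on in zip(layer_costs, active) if on)

# Hypothetical model and input.
layers = [toy_layer(1.0, 0.1), toy_layer(0.5, 0.0), toy_layer(2.0, -0.2)]
gate_params = [1.0, -1.0, 0.5]
fc_weights = [0.3, -0.3]
layer_costs = [1.0, 2.0, 4.0]

probs, active = forward(2.0, layers, gate_params, fc_weights)
pred = prediction_loss(probs, label=0)
energy = energy_loss(active, layer_costs)
```

In this sketch the second layer is gated off for the given input, so only the first and third layers contribute to the energy term. How the two losses and the label prior are combined during optimization is left out here.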
Regarding claim 2, Edrenkin, Gambetta, and Stojevic teach all the features with respect to claim 1 as outlined above. Further, Edrenkin teaches the method of Claim 1, wherein optimizing the machine learning model is based on a loss function comprising a prior probability element (See Edrenkin: Fig. 2, and [0048], “During training by the coarse training unit 210, each neuron v.sub.i∈l of the set of neurons l of the output layer may be assigned to a coarse probability P.sub.c(v.sub.i), wherein the coarse probability may be an output result of the coarse training unit 210. Similar to the calculations performed in the neural network of FIGs. 1A and 1B, the coarse probability P.sub.c(v.sub.i) may be calculated in each neuron v.sub.i of the output layer by using an activation function f. The activation function f may be a softmax function outputting probabilities summing up to one, and may use as inputs the weighted sum of the m.sub.c inputs of each neuron v.sub.i of the set of neurons l of the output layer during coarse training as shown below”).
Regarding claim 3, Edrenkin, Gambetta, and Stojevic teach all the features with respect to claim 2 as outlined above. Further, Stojevic teaches the method of Claim 2, wherein optimizing the machine learning model comprises:
determining updated layer weights for one or more of the processing layers (See Stojevic: [0773], “In the quantum computer implementation (or quantum version), the cost function and the updates of the weights classically obtained using back propagation are obtained via quantum measurements”); and 
determining updated gate logic parameters for one or more of the gate logics (See Stojevic: [0773], “Therefore, in the quantum algorithm, the parameters as previously disclosed are updated using results of these measurements”). 
Regarding claim 4, Edrenkin, Gambetta, and Stojevic teach all the features with respect to claim 3 as outlined above. Further, Stojevic teaches the method of Claim 3, wherein:
the loss function is: $\mathrm{Loss} = \min_{W,\,G,\,X,\,P} \sum_{X} \left( L_{X_i}(W,G) + \alpha_{X_i} E_{X_i}(W,G) \right) P(X_i)$,
$x_i$ comprises the input data in a class $X_i$,
$P(X_i)$ comprises the prior probability associated with the class $X_i$,
W comprises the updated layer weights,
G comprises the updated gate logic parameters, and
$\alpha_{X_i}$ comprises a predetermined scalar value (See Stojevic: Fig. 7, and [0136], “In transitioning to the quantum algorithm it will be useful to think of the above as a procedure for minimizing the cost function”).
Regarding claim 5, Edrenkin, Gambetta, and Stojevic teach all the features with respect to claim 2 as outlined above. Further, Stojevic teaches the method of Claim 2, wherein optimizing the machine learning model comprises determining updated gate logic parameters for one or more of the gate logics (See Stojevic: Figs. 1A-B, and [0773], “Data may be taken in the classical algorithm, expressed as a set of bits (0s and 1s), and mapped to a qubit input on the quantum computing side (classical 0 is mapped to a qubit pointing ‘down’, and 1 to a qubit pointing ‘up’). On the classical side, a standard training approach may be used, wherein the predictors are minimised with respect to the training data using standard cost functions. The predictor on the quantum side is obtained by actually measuring the operator (described by the operator network) on the quantum state many times in order to obtain the expectation value of the operator statistically. The quantum state on which the measurement of the operator expectation value is performed is the result of applying the transformative quantum circuit (equivalent of the transformative network on the classical side) to the inputs. More generally, in the case when left and right inputs are different, the quantum measurement corresponds to obtaining the amplitude of the quantum state corresponding to the transformed right input data into the quantum state corresponding to the transformed left input data, with an insertion of the operator. Therefore, in the quantum algorithm, the parameters as previously disclosed are updated using results of these measurements. There may be no quantum element regarding these parameters i.e. they are just real or complex numbers, but they may be operable to be updated using a result of an average of quantum measurements. In the classical algorithm, the parameters (or equivalently the weights) may be updated using back-propagation. In the quantum computer implementation (or quantum version), the cost function and the updates of the weights classically obtained using back propagation are obtained via quantum measurements”).
Regarding claim 6, Edrenkin, Gambetta, and Stojevic teach all the features with respect to claim 5 as outlined above. Further, Stojevic teaches the method of Claim 5, wherein: the loss function is: $\mathrm{Loss} = \min_{G, X, P} \sum_{X} L_{X_i}(G) + \beta_{X_i}(G)\, P(X_i)$, $x_i$ comprises the input data in a class $X_i$, $P(X_i)$ comprises the prior probability associated with the class $X_i$, $G$ comprises the updated gate logic parameters, and $\beta_{X_i}(G)$ comprises a predetermined scalar value (See Stojevic: [0283], “The term ‘cost function’ preferably connotes a mathematical function representing a measure of performance of an artificial neural network, or a tensor network, in relation to a desired output. The weights in the network are optimised to minimise some desired cost function”).
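For illustration only, the summation in the loss function recited in claim 6 can be read as adding, for each class, a per-class loss term and a scalar-weighted prior probability. The sketch below evaluates that sum for fixed values under one possible reading of the claim language; it does not perform the minimization, and the function name `prior_weighted_loss` and its list-based inputs are hypothetical and do not appear in the claims or in the cited references.

```python
from typing import Sequence


def prior_weighted_loss(
    class_losses: Sequence[float],
    betas: Sequence[float],
    priors: Sequence[float],
) -> float:
    """Evaluate sum_i (L_i + beta_i * P_i) for fixed per-class values.

    Assumption: class_losses[i] stands in for the per-class loss term,
    betas[i] for the predetermined scalar, and priors[i] for the prior
    probability of class i. These names are illustrative only.
    """
    if not (len(class_losses) == len(betas) == len(priors)):
        raise ValueError("all inputs must have one entry per class")
    # Each class contributes its loss plus its scalar-weighted prior.
    return sum(l + b * p for l, b, p in zip(class_losses, betas, priors))


# Two classes: (0.5 + 2.0 * 0.25) + (1.0 + 4.0 * 0.75) = 1.0 + 4.0
total = prior_weighted_loss([0.5, 1.0], [2.0, 4.0], [0.25, 0.75])
print(total)  # 5.0
```

Under this reading, a larger scalar or a larger prior probability for a class increases that class's contribution to the total.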
Regarding claim 7, Edrenkin, Gambetta, and Stojevic teach all the features with respect to claim 3 as outlined above. Further, Edrenkin teaches the method of Claim 3, further comprising: determining an updated prior probability associated with the training label based on the inference (See Edrenkin: Fig. 2, and [0050], “In one example, the fine training unit 220 is then able to assign, during fine training of the neural network 230, a fine probability P.sub.f(v.sub.i) to each neuron v.sub.i ∈ l.sub.i of the output layer in the subset of neurons l′, the fine probability P.sub.c(v.sub.i) having a higher probability distribution accuracy than the coarse probability P.sub.c(v.sub.i) and being an output result of the fine training unit 220. Again, similar to the calculations performed in the neural network of FIGS. 1A and 1B, the fine probability P.sub.f(v.sub.i) may be calculated in each neuron v.sub.i of the subset of neurons l′ of the output layer by using the activation function f. For example, the activation function f uses as inputs the weighted sum of the m.sub.f inputs of each neuron v.sub.i of the subset of neurons l′ of the output layer as shown below”).
Regarding claim 8, Edrenkin, Gambetta, and Stojevic teach all the features with respect to claim 7 as outlined above. Further, Stojevic teaches the method of Claim 7, wherein determining the updated layer weights and determining the updated gate logic parameters are based on the updated prior probability (See Stojevic: [0773], “Therefore, in the quantum algorithm, the parameters as previously disclosed are updated using results of these measurements. There may be no quantum element regarding these parameters i.e. they are just real or complex numbers, but they may be operable to be updated using a result of an average of quantum measurements. In the classical algorithm, the parameters (or equivalently the weights) may be updated using back-propagation. In the quantum computer implementation (or quantum version), the cost function and the updates of the weights classically obtained using back propagation are obtained via quantum measurements”).
Regarding claim 9, Edrenkin, Gambetta, and Stojevic teach all the features with respect to claim 1 as outlined above. Further, Edrenkin, Gambetta, and Stojevic teach a processing system (See Edrenkin: Figs. 1A-B, and [0028], “FIG. 1A is an illustration of a general neural network comprising an input layer, several hidden layers, and an output layer. The neural network may comprise an input layer having at least one neuron, one or several hidden layers, each hidden layer usually having multiple neurons, and an output layer having again at least one neuron. In this matter, a neuron is a node of the neural network, wherein a collection of neurons, i.e. nodes, forms a layer, for example, an input layer, a hidden layer, or an output layer”), comprising: 
a memory comprising computer-executable instructions (See Edrenkin: Fig. 10, and [0093], “FIG. 10 shows a computing device 1000 for implementing the classification apparatus 200 or 800. The computing device 1000 comprises an input 1010, a processor 1020, a memory 1030, and an output 1040”); 
one or more processors configured to execute the computer-executable instructions and cause the processing system (See Edrenkin: Fig. 10, and [0097], “In addition, the memory 1030 may store a computer program to execute the classification methods described above, the classification methods being performed by the classification apparatus. The computer program comprises instructions which, when the program is executed by a computer, or by the computing device 1000, cause the computer to carry out the following steps: a coarse step of training the neural network on a subset of neurons of a last hidden layer and a set of neurons of the output layer; and a fine step of training the neural network on a set of neurons of the last hidden layer and a subset of neurons of the output layer, the subset of neurons of the last hidden layer comprising a smaller number of neurons than the set of neurons of the last hidden layer, and the subset of neurons of the output layer comprising a smaller number of neurons than the set of neurons of the output layer. According to another embodiment, the computer program may comprise instructions which cause the computer to carry out the steps described with respect to the FIGS. 1A to 9, particularly the steps described with respect to FIGS. 3, 6, 7, and 9”) to: 
receive input data at a machine learning model, the machine learning model (See Edrenkin: Figs. 1A-B, and [0037], “A general neural classification model architecture may use a neural network as shown in FIGS. 1A and 1B or may use a recurrent neural network, or a transformer neural network being used especially for language models with regard to translations. The computational complexity of the neural classification model is mostly dominated by the fully connected last hidden layer having N neurons typically in the order of hundreds, and the output layer having V neurons typically in the order of millions. When training the parameters in these two layers, for example the weights w associated with these two layers, O(N*V) operations may be performed during training. As mentioned above, as N may be a value in the order of hundreds and V may be a value in the order of millions, the number of operations used for training may be quite high leading to increased computational burden and extremely long training times of the classification model. If, however, the number of neurons in the last hidden layer and/or the number of neurons of the output layer are reduced in order to decrease computational burden and training times, the throughput and capacity of the classification model is also reduced, and the quality, accuracy, efficiency, and/or preciseness of classification is impaired”) comprising: 
a plurality of processing layers (See Edrenkin: Figs. 1A-B, and [0028], “FIG. 1A is an illustration of a general neural network comprising an input layer, several hidden layers, and an output layer”); 
a plurality of gate logics (See Gambetta: Fig. 1, and [0097], “Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, configuration data for integrated circuitry, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention”); 
a plurality of gates (See Edrenkin: Fig. 2, and [0047], “In one embodiment, the subset of neurons V′ of the output layer is determined by a determination unit being included in the classification apparatus 200, wherein the fine training unit 220 trains the neural network on the set of neurons k of the last hidden layer and the determined subset of neurons l′ of the output layer. The determination unit may be part of the fine training unit 220 or may be a single unit in the classification apparatus 200. The determination unit may comprise a partitioning unit and a selection unit, wherein the partitioning unit partitions the neurons of the output layer in subsets, and wherein the selection unit selects a subset which is used by the fine training unit 220 for training the neural network 230”); and 
a fully connected layer connected to an output of one of the plurality of processing layers (See Edrenkin: Figs. 4A-B, and [0043], “FIG. 4A is an illustration of coarse training a neural network according to an embodiment. Coarse training is performed by the coarse training unit 210 and may be also called coarse softmax. FIG. 4A shows the last hidden layer and the output layer of the neural network 230. During coarse training of the neural network 230, the coarse training unit trains the neural network on a subset of neurons k′ of the last hidden layer and a set of neurons l of the output layer. The set of neurons l of the output layer may be the total number of neurons V of the output layer or a set of neurons smaller than the total number of neurons of the output layer. The subset of neurons k′ of the last hidden layer may comprise a number of neurons of the last hidden layer smaller than a total number of neurons N of the last hidden layer. Again, the parameter V indicates the number of neurons of the output layer and the parameter N indicates the number of neurons of the fully connected last hidden layer”); 
determine based on a plurality of gate parameters associated with the plurality of gate logics (See Edrenkin: Figs. 4A-B, and [0043], “FIG. 4A is an illustration of coarse training a neural network according to an embodiment. Coarse training is performed by the coarse training unit 210 and may be also called coarse softmax. FIG. 4A shows the last hidden layer and the output layer of the neural network 230. During coarse training of the neural network 230, the coarse training unit trains the neural network on a subset of neurons k′ of the last hidden layer and a set of neurons l of the output layer. The set of neurons l of the output layer may be the total number of neurons V of the output layer or a set of neurons smaller than the total number of neurons of the output layer. The subset of neurons k′ of the last hidden layer may comprise a number of neurons of the last hidden layer smaller than a total number of neurons N of the last hidden layer. Again, the parameter V indicates the number of neurons of the output layer and the parameter N indicates the number of neurons of the fully connected last hidden layer”), a subset of the plurality of processing layers with which to process the input data (See Edrenkin: Fig. 7, and [0072], “Thus, in step S760, the fine probabilities are computed for a subset of neurons of the output layer, i.e. for a subset of elements of the classification training dataset. In order to do so, the parameter k for the number of neurons in the last hidden layer used for fine training is, for example, defined in step S750. The parameter k is, for example, defined manually by a user of the classification apparatus 200 or during manufacturing, dependent on hardware configuration and capacity, the domain and size of the classification training dataset etc. Typical values for k are, for example, in the 1000′s”); 
process the input data with the subset of the plurality of processing layers and the fully connected layer to generate an inference (See Edrenkin: Fig. 7, and [0072], “Thus, in step S760, the fine probabilities are computed for a subset of neurons of the output layer, i.e. for a subset of elements of the classification training dataset. In order to do so, the parameter k for the number of neurons in the last hidden layer used for fine training is, for example, defined in step S750. The parameter k is, for example, defined manually by a user of the classification apparatus 200 or during manufacturing, dependent on hardware configuration and capacity, the domain and size of the classification training dataset etc. Typical values for k are, for example, in the 1000′s”); 
determine a prediction loss based on the inference and a training label associated with the input data (See Edrenkin: Figs. 1A-B, and [0036], “Training a neural network may mean calibrating the weights w associated with the inputs of the neurons. In the beginning, initial weights may be randomly selected based on, for example, Gaussian distribution. The process for training a neural network may then be repeated multiple times on the initial weights until the weights are calibrated using the backpropagation algorithm to accurately predict an output”); 
determine an energy loss based on the subset of the plurality of processing layers used to process the input data (See Edrenkin: Fig. 2, and [0078], “According to an embodiment, the weights of the neural network 230 are calculated with a joint loss function, wherein the joint loss function is a sum of a loss function of the coarse training unit 210 and a loss function of the fine training unit 220. As a loss function, which is also sometimes called error function, cross-entropy or any other loss function suitable for a classification task may be used. Minimizing the joint loss function is tantamount to minimizing each summand, i.e. minimizing both the loss function of the coarse training unit 210 and the loss function of the fine training unit 220, and hence coarse training and fine training is possible at the same time”); and 
optimize the machine learning model (See Edrenkin: Fig. 2, and [0068], “To reduce the size of the classification training dataset, for example the number of neurons V in the output layer, as illustrated in FIGS. 4A and 4B, or as illustrated in FIGS. 5A and 5B, every neuron of the output layer may be allocated with a probability distribution by the coarse training unit 210. In order to do so, a subset of neurons k′ of the last hidden layer is defined in step S720 and the neural network 230 is trained based on the subset of neurons k′ of the last hidden layer and the set of neurons l of the output layer, wherein the set of neurons l may comprise the total number of neurons of the output layer. The value k′ for defining the subset of neurons of the last hidden layer may be set manually by a user of the classification apparatus 200 or may be automatically determined in a separate training model. For example, as separate training model for determining the parameter k′ for setting the number of neurons of the last hidden layer used by the coarse training unit 210, a reinforcement machine learning model is used. A large k′ may result in a finer and more accurate calculation of the coarse probabilities and a more accurate probability distribution of the classification training dataset with a larger computational burden. The use of the term “coarse” in the present solution designates the fidelity or resolution of the probability distribution for the classification training dataset. The parameter k′ may be configured to optimize the training of the model, and may vary based on the type of input dataset, hardware requirements, etc.”) based on: 
the prediction loss (See Stojevic: Figs. 1-3, and [0773], “Data may be taken in the classical algorithm, expressed as a set of bits (0s and 1s), and mapped to a qubit input on the quantum computing side (classical 0 is mapped to a qubit pointing ‘down’, and 1 to a qubit pointing ‘up’). On the classical side, a standard training approach may be used, wherein the predictors are minimised with respect to the training data using standard cost functions. The predictor on the quantum side is obtained by actually measuring the operator (described by the operator network) on the quantum state many times in order to obtain the expectation value of the operator statistically. The quantum state on which the measurement of the operator expectation value is performed is the result of applying the transformative quantum circuit (equivalent of the transformative network on the classical side) to the inputs. More generally, in the case when left and right inputs are different, the quantum measurement corresponds to obtaining the amplitude of the quantum state corresponding to the transformed right input data into the quantum state corresponding to the transformed left input data, with an insertion of the operator. Therefore, in the quantum algorithm, the parameters as previously disclosed are updated using results of these measurements. There may be no quantum element regarding these parameters i.e. they are just real or complex numbers, but they may be operable to be updated using a result of an average of quantum measurements. In the classical algorithm, the parameters (or equivalently the weights) may be updated using back-propagation. In the quantum computer implementation (or quantum version), the cost function and the updates of the weights classically obtained using back propagation are obtained via quantum measurements”); 
the energy loss (See Stojevic: Fig. 55, and [0543], “From a HF solution, a subset of occupied and virtual orbitals is selected to act as active space. The remaining occupied and virtual orbitals are kept frozen at HF level and the electronic structure in the active space is solved for exactly. The notation CAS(N, L) refers to an active space containing N electrons distributed between all configurations that can be constructed from L molecular orbitals. A CAS-SCF simulation is a two-step process where the energy can be iteratively minimized by doing a full-CI calculation only in the active space (CAS-CI). That information is then used to rotate the occupied and active orbital spaces to minimize the energy even further. Because the many-body Hilbert space grows exponentially with the number of single-particle states, only small active spaces up to 18 electrons in 18 orbitals can be treated with CAS-CI (cf. exact diagonalization). Dynamic correlation is usually small and can be recovered with good accuracy by means of perturbative methods on top of the CAS solution which should contain the proper static correlation”); and 
a prior probability associated with the training label (See Edrenkin: Fig. 2, and [0048], “During training by the coarse training unit 210, each neuron v.sub.i∈l of the set of neurons l of the output layer may be assigned to a coarse probability P.sub.c(v.sub.i), wherein the coarse probability may be an output result of the coarse training unit 210. Similar to the calculations performed in the neural network of FIGs. 1A and 1B, the coarse probability P.sub.c(v.sub.i) may be calculated in each neuron v.sub.i of the output layer by using an activation function f. The activation function f may be a softmax function outputting probabilities summing up to one, and may use as inputs the weighted sum of the m.sub.c inputs of each neuron v.sub.i of the set of neurons l of the output layer during coarse training as shown below”).
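For illustration only, the softmax activation described in the cited passage of Edrenkin ([0048]), which takes the weighted sums of each output neuron's inputs and outputs probabilities summing to one, can be sketched as follows. This is a generic softmax and not a reproduction of the formula in the reference; the helper names `weighted_sum` and `softmax` and the sample values are hypothetical.

```python
import math
from typing import List, Sequence


def weighted_sum(weights: Sequence[float], inputs: Sequence[float]) -> float:
    """Weighted sum of one neuron's inputs (illustrative helper)."""
    return sum(w * x for w, x in zip(weights, inputs))


def softmax(scores: Sequence[float]) -> List[float]:
    """Map one score per output neuron to probabilities that sum to one."""
    # Subtracting the maximum keeps exp() from overflowing and does not
    # change the result.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical values: three output neurons sharing two inputs.
inputs = [1.0, 2.0]
neuron_weights = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.5]]
scores = [weighted_sum(w, inputs) for w in neuron_weights]
probabilities = softmax(scores)
print(probabilities)
```

The neuron with the largest weighted sum receives the largest probability, and the outputs sum to one up to floating-point rounding.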
Regarding claim 10, Edrenkin, Gambetta, and Stojevic teach all the features with respect to claim 9 as outlined above. Further, Edrenkin teaches the processing system of Claim 9, wherein the one or more processors are further configured to optimize the machine learning model based on a loss function comprising a prior probability element (See Edrenkin: Fig. 2, and [0048], “During training by the coarse training unit 210, each neuron v.sub.i∈l of the set of neurons l of the output layer may be assigned to a coarse probability P.sub.c(v.sub.i), wherein the coarse probability may be an output result of the coarse training unit 210. Similar to the calculations performed in the neural network of FIGs. 1A and 1B, the coarse probability P.sub.c(v.sub.i) may be calculated in each neuron v.sub.i of the output layer by using an activation function f. The activation function f may be a softmax function outputting probabilities summing up to one, and may use as inputs the weighted sum of the m.sub.c inputs of each neuron v.sub.i of the set of neurons l of the output layer during coarse training as shown below”).
Regarding claim 11, Edrenkin, Gambetta, and Stojevic teach all the features with respect to claim 10 as outlined above. Further, Stojevic teaches the processing system of Claim 10, wherein in order to optimize the machine learning model, the one or more processors are further configured to:
determine updated layer weights for one or more of the processing layers (See Stojevic: [0773], “In the quantum computer implementation (or quantum version), the cost function and the updates of the weights classically obtained using back propagation are obtained via quantum measurements”); and 
determine updated gate logic parameters for one or more of the gate logics (See Stojevic: [0773], “Therefore, in the quantum algorithm, the parameters as previously disclosed are updated using results of these measurements”). 
Regarding claim 12, Edrenkin, Gambetta, and Stojevic teach all the features with respect to claim 11 as outlined above. Further, Stojevic teaches the processing system of Claim 11, wherein:
the loss function is: $\mathrm{Loss} = \min_{W, G, X, P} \sum_{X} \left( L_{X_i}(W, G) + \alpha_{X_i} E_{X_i}(W, G) \right) P(X_i)$,

$x_i$ comprises the input data in a class $X_i$,

$P(X_i)$ comprises the prior probability associated with the class $X_i$,

W comprises the updated layer weights,

G comprises the updated gate logic parameters, and

$\alpha_{X_i}$ comprises a predetermined scalar value (See Stojevic: Fig. 7, and [0136], “In transitioning to the quantum algorithm it will be useful to think of the above as a procedure for minimizing the cost function”).

Regarding claim 13, Edrenkin, Gambetta, and Stojevic teach all the features with respect to claim 10 as outlined above. Further, Stojevic teaches that the processing system of Claim 10, wherein in order to optimize the machine learning model, the one or more processors are further configured to determine updated gate logic parameters for one or more of the gate logics (See Stojevic: Figs. 1A-B, and [0773], “Data may be taken in the classical algorithm, expressed as a set of bits (0s and 1s), and mapped to a qubit input on the quantum computing side (classical 0 is mapped to a qubit pointing ‘down’, and 1 to a qubit pointing ‘up’). On the classical side, a standard training approach may be used, wherein the predictors are minimised with respect to the training data using standard cost functions. The predictor on the quantum side is obtained by actually measuring the operator (described by the operator network) on the quantum state many times in order to obtain the expectation value of the operator statistically. The quantum state on which the measurement of the operator expectation value is performed is the result of applying the transformative quantum circuit (equivalent of the transformative network on the classical side) to the inputs. More generally, in the case when left and right inputs are different, the quantum measurement corresponds to obtaining the amplitude of the quantum state corresponding to the transformed right input data into the quantum state corresponding to the transformed left input data, with an insertion of the operator. Therefore, in the quantum algorithm, the parameters as previously disclosed are updated using results of these measurements. There may be no quantum element regarding these parameters i.e. they are just real or complex numbers, but they may be operable to be updated using a result of an average of quantum measurements. 
In the classical algorithm, the parameters (or equivalently the weights) may be updated using back-propagation. In the quantum computer implementation (or quantum version), the cost function and the updates of the weights classically obtained using back propagation are obtained via quantum measurements”).
Regarding claim 14, Edrenkin, Gambetta, and Stojevic teach all the features with respect to claim 13 as outlined above. Further, Stojevic teaches that the processing system of Claim 13, wherein: the loss function is: $\mathrm{Loss} = \min_{G, X, P} \sum_{X} \left( L_{X_i}(G) + \beta_{X_i}(G) P(X_i) \right)$, $x_i$ comprises the input data in a class $X_i$, $P(X_i)$ comprises the prior probability associated with the class $X_i$, G comprises the updated gate logic parameters, and $\beta_{X_i}(G)$ comprises a predetermined scalar value (See Stojevic: [0283], “The term ‘cost function’ preferably connotes a mathematical function representing a measure of performance of an artificial neural network, or a tensor network, in relation to a desired output. The weights in the network are optimised to minimise some desired cost function”).
Regarding claim 15, Edrenkin, Gambetta, and Stojevic teach all the features with respect to claim 11 as outlined above. Further, Edrenkin teaches that the processing system of Claim 11, wherein the one or more processors are further configured to determine an updated prior probability associated with the training label based on the inference (See Edrenkin: Fig. 2, and [0050], “In one example, the fine training unit 220 is then able to assign, during fine training of the neural network 230, a fine probability P.sub.f(v.sub.i) to each neuron v.sub.i ∈ l.sub.i of the output layer in the subset of neurons l′, the fine probability P.sub.c(v.sub.i) having a higher probability distribution accuracy than the coarse probability P.sub.c(v.sub.i) and being an output result of the fine training unit 220. Again, similar to the calculations performed in the neural network of FIGS. 1A and 1B, the fine probability P.sub.f(v.sub.i) may be calculated in each neuron v.sub.i of the subset of neurons l′ of the output layer by using the activation function f. For example, the activation function f uses as inputs the weighted sum of the m.sub.f inputs of each neuron v.sub.i of the subset of neurons l′ of the output layer as shown below”).
Regarding claim 16, Edrenkin, Gambetta, and Stojevic teach all the features with respect to claim 15 as outlined above. Further, Stojevic teaches that the processing system of Claim 15, wherein the one or more processors are further configured to determine the updated layer weights and determine the updated gate logic parameters based on the updated prior probability (See Stojevic: [0773], “Therefore, in the quantum algorithm, the parameters as previously disclosed are updated using results of these measurements. There may be no quantum element regarding these parameters i.e. they are just real or complex numbers, but they may be operable to be updated using a result of an average of quantum measurements. In the classical algorithm, the parameters (or equivalently the weights) may be updated using back-propagation. In the quantum computer implementation (or quantum version), the cost function and the updates of the weights classically obtained using back propagation are obtained via quantum measurements”).
Regarding claim 17, Edrenkin, Gambetta, and Stojevic teach all the features with respect to claim 1 as outlined above. Further, Edrenkin, Gambetta, and Stojevic teach that a non-transitory computer-readable medium comprising computer-executable instructions that, when executed by one or more processors of a processing system, cause the processing system to perform a method of processing data in a machine learning model (See Edrenkin: Figs. 1A-B, and [0028], “FIG. 1A is an illustration of a general neural network comprising an input layer, several hidden layers, and an output layer. The neural network may comprise an input layer having at least one neuron, one or several hidden layers, each hidden layer usually having multiple neurons, and an output layer having again at least one neuron. In this matter, a neuron is a node of the neural network, wherein a collection of neurons, i.e. nodes, forms a layer, for example, an input layer, a hidden layer, or an output layer”), the method comprising:
receiving input data at a machine learning model, the machine learning model (See Edrenkin: Figs. 1A-B, and [0037], “A general neural classification model architecture may use a neural network as shown in FIGS. 1A and 1B or may use a recurrent neural network, or a transformer neural network being used especially for language models with regard to translations. The computational complexity of the neural classification model is mostly dominated by the fully connected last hidden layer having N neurons typically in the order of hundreds, and the output layer having V neurons typically in the order of millions. When training the parameters in these two layers, for example the weights w associated with these two layers, O(N*V) operations may be performed during training. As mentioned above, as N may be a value in the order of hundreds and V may be a value in the order of millions, the number of operations used for training may be quite high leading to increased computational burden and extremely long training times of the classification model. If, however, the number of neurons in the last hidden layer and/or the number of neurons of the output layer are reduced in order to decrease computational burden and training times, the throughput and capacity of the classification model is also reduced, and the quality, accuracy, efficiency, and/or preciseness of classification is impaired”) comprising: 
a plurality of processing layers (See Edrenkin: Figs. 1A-B, and [0028], “FIG. 1A is an illustration of a general neural network comprising an input layer, several hidden layers, and an output layer”); 
a plurality of gate logics (See Gambetta: Fig. 1, and [0097], “Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, configuration data for integrated circuitry, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention”); 
a plurality of gates (See Edrenkin: Fig. 2, and [0047], “In one embodiment, the subset of neurons V′ of the output layer is determined by a determination unit being included in the classification apparatus 200, wherein the fine training unit 220 trains the neural network on the set of neurons k of the last hidden layer and the determined subset of neurons l′ of the output layer. The determination unit may be part of the fine training unit 220 or may be a single unit in the classification apparatus 200. The determination unit may comprise a partitioning unit and a selection unit, wherein the partitioning unit partitions the neurons of the output layer in subsets, and wherein the selection unit selects a subset which is used by the fine training unit 220 for training the neural network 230”); and 
a fully connected layer connected to an output of one of the plurality of processing layers (See Edrenkin: Figs. 4A-B, and [0043], “FIG. 4A is an illustration of coarse training a neural network according to an embodiment. Coarse training is performed by the coarse training unit 210 and may be also called coarse softmax. FIG. 4A shows the last hidden layer and the output layer of the neural network 230. During coarse training of the neural network 230, the coarse training unit trains the neural network on a subset of neurons k′ of the last hidden layer and a set of neurons l of the output layer. The set of neurons l of the output layer may be the total number of neurons V of the output layer or a set of neurons smaller than the total number of neurons of the output layer. The subset of neurons k′ of the last hidden layer may comprise a number of neurons of the last hidden layer smaller than a total number of neurons N of the last hidden layer. Again, the parameter V indicates the number of neurons of the output layer and the parameter N indicates the number of neurons of the fully connected last hidden layer”); 
determining based on a plurality of gate parameters associated with the plurality of gate logics (See Edrenkin: Figs. 4A-B, and [0043], “FIG. 4A is an illustration of coarse training a neural network according to an embodiment. Coarse training is performed by the coarse training unit 210 and may be also called coarse softmax. FIG. 4A shows the last hidden layer and the output layer of the neural network 230. During coarse training of the neural network 230, the coarse training unit trains the neural network on a subset of neurons k′ of the last hidden layer and a set of neurons l of the output layer. The set of neurons l of the output layer may be the total number of neurons V of the output layer or a set of neurons smaller than the total number of neurons of the output layer. The subset of neurons k′ of the last hidden layer may comprise a number of neurons of the last hidden layer smaller than a total number of neurons N of the last hidden layer. Again, the parameter V indicates the number of neurons of the output layer and the parameter N indicates the number of neurons of the fully connected last hidden layer”), a subset of the plurality of processing layers with which to process the input data (See Edrenkin: Fig. 7, and [0072], “Thus, in step S760, the fine probabilities are computed for a subset of neurons of the output layer, i.e. for a subset of elements of the classification training dataset. In order to do so, the parameter k for the number of neurons in the last hidden layer used for fine training is, for example, defined in step S750. The parameter k is, for example, defined manually by a user of the classification apparatus 200 or during manufacturing, dependent on hardware configuration and capacity, the domain and size of the classification training dataset etc. Typical values for k are, for example, in the 1000′s”); 
processing the input data with the subset of the plurality of processing layers and the fully connected layer to generate an inference (See Edrenkin: Fig. 7, and [0072], “Thus, in step S760, the fine probabilities are computed for a subset of neurons of the output layer, i.e. for a subset of elements of the classification training dataset. In order to do so, the parameter k for the number of neurons in the last hidden layer used for fine training is, for example, defined in step S750. The parameter k is, for example, defined manually by a user of the classification apparatus 200 or during manufacturing, dependent on hardware configuration and capacity, the domain and size of the classification training dataset etc. Typical values for k are, for example, in the 1000′s”); 
determining a prediction loss based on the inference and a training label associated with the input data (See Edrenkin: Figs. 1A-B, and [0036], “Training a neural network may mean calibrating the weights w associated with the inputs of the neurons. In the beginning, initial weights may be randomly selected based on, for example, Gaussian distribution. The process for training a neural network may then be repeated multiple times on the initial weights until the weights are calibrated using the backpropagation algorithm to accurately predict an output”); 
determining an energy loss based on the subset of the plurality of processing layers used to process the input data (See Edrenkin: Fig. 2, and [0078], “According to an embodiment, the weights of the neural network 230 are calculated with a joint loss function, wherein the joint loss function is a sum of a loss function of the coarse training unit 210 and a loss function of the fine training unit 220. As a loss function, which is also sometimes called error function, cross-entropy or any other loss function suitable for a classification task may be used. Minimizing the joint loss function is tantamount to minimizing each summand, i.e. minimizing both the loss function of the coarse training unit 210 and the loss function of the fine training unit 220, and hence coarse training and fine training is possible at the same time”); and 
optimizing the machine learning model (See Edrenkin: Fig. 2, and [0068], “To reduce the size of the classification training dataset, for example the number of neurons V in the output layer, as illustrated in FIGS. 4A and 4B, or as illustrated in FIGS. 5A and 5B, every neuron of the output layer may be allocated with a probability distribution by the coarse training unit 210. In order to do so, a subset of neurons k′ of the last hidden layer is defined in step S720 and the neural network 230 is trained based on the subset of neurons k′ of the last hidden layer and the set of neurons l of the output layer, wherein the set of neurons l may comprise the total number of neurons of the output layer. The value k′ for defining the subset of neurons of the last hidden layer may be set manually by a user of the classification apparatus 200 or may be automatically determined in a separate training model. For example, as separate training model for determining the parameter k′ for setting the number of neurons of the last hidden layer used by the coarse training unit 210, a reinforcement machine learning model is used. A large k′ may result in a finer and more accurate calculation of the coarse probabilities and a more accurate probability distribution of the classification training dataset with a larger computational burden. The use of the term “coarse” in the present solution designates the fidelity or resolution of the probability distribution for the classification training dataset. The parameter k′ may be configured to optimize the training of the model, and may vary based on the type of input dataset, hardware requirements, etc.”) based on: 
the prediction loss (See Stojevic: Figs. 1-3, and [0773], “Data may be taken in the classical algorithm, expressed as a set of bits (0s and 1s), and mapped to a qubit input on the quantum computing side (classical 0 is mapped to a qubit pointing ‘down’, and 1 to a qubit pointing ‘up’). On the classical side, a standard training approach may be used, wherein the predictors are minimised with respect to the training data using standard cost functions. The predictor on the quantum side is obtained by actually measuring the operator (described by the operator network) on the quantum state many times in order to obtain the expectation value of the operator statistically. The quantum state on which the measurement of the operator expectation value is performed is the result of applying the transformative quantum circuit (equivalent of the transformative network on the classical side) to the inputs. More generally, in the case when left and right inputs are different, the quantum measurement corresponds to obtaining the amplitude of the quantum state corresponding to the transformed right input data into the quantum state corresponding to the transformed left input data, with an insertion of the operator. Therefore, in the quantum algorithm, the parameters as previously disclosed are updated using results of these measurements. There may be no quantum element regarding these parameters i.e. they are just real or complex numbers, but they may be operable to be updated using a result of an average of quantum measurements. In the classical algorithm, the parameters (or equivalently the weights) may be updated using back-propagation. In the quantum computer implementation (or quantum version), the cost function and the updates of the weights classically obtained using back propagation are obtained via quantum measurements”); 
the energy loss (See Stojevic: Fig. 55, and [0543], “From a HF solution, a subset of occupied and virtual orbitals is selected to act as active space. The remaining occupied and virtual orbitals are kept frozen at HF level and the electronic structure in the active space is solved for exactly. The notation CAS(N, L) refers to an active space containing N electrons distributed between all configurations that can be constructed from L molecular orbitals. A CAS-SCF simulation is a two-step process where the energy can be iteratively minimized by doing a full-CI calculation only in the active space (CAS-CI). That information is then used to rotate the occupied and active orbital spaces to minimize the energy even further. Because the many-body Hilbert space grows exponentially with the number of single-particle states, only small active spaces up to 18 electrons in 18 orbitals can be treated with CAS-CI (cf. exact diagonalization). Dynamic correlation is usually small and can be recovered with good accuracy by means of perturbative methods on top of the CAS solution which should contain the proper static correlation”); and 
a prior probability associated with the training label (See Edrenkin: Fig. 2, and [0048], “During training by the coarse training unit 210, each neuron v.sub.i∈l of the set of neurons l of the output layer may be assigned to a coarse probability P.sub.c(v.sub.i), wherein the coarse probability may be an output result of the coarse training unit 210. Similar to the calculations performed in the neural network of FIGs. 1A and 1B, the coarse probability P.sub.c(v.sub.i) may be calculated in each neuron v.sub.i of the output layer by using an activation function f. The activation function f may be a softmax function outputting probabilities summing up to one, and may use as inputs the weighted sum of the m.sub.c inputs of each neuron v.sub.i of the set of neurons l of the output layer during coarse training as shown below”).
Regarding claim 18, Edrenkin, Gambetta, and Stojevic teach all the features with respect to claim 17 as outlined above. Further, Edrenkin teaches the non-transitory computer-readable medium of Claim 17, wherein optimizing the machine learning model is based on a loss function comprising a prior probability element (See Edrenkin: Fig. 2, and [0048], “During training by the coarse training unit 210, each neuron v.sub.i∈l of the set of neurons l of the output layer may be assigned to a coarse probability P.sub.c(v.sub.i), wherein the coarse probability may be an output result of the coarse training unit 210. Similar to the calculations performed in the neural network of FIGs. 1A and 1B, the coarse probability P.sub.c(v.sub.i) may be calculated in each neuron v.sub.i of the output layer by using an activation function f. The activation function f may be a softmax function outputting probabilities summing up to one, and may use as inputs the weighted sum of the m.sub.c inputs of each neuron v.sub.i of the set of neurons l of the output layer during coarse training as shown below”).
Regarding claim 19, Edrenkin, Gambetta, and Stojevic teach all the features with respect to claim 18 as outlined above. Further, Stojevic teaches the non-transitory computer-readable medium of Claim 18, wherein optimizing the machine learning model comprises: 
determining updated layer weights for one or more of the processing layers (See Stojevic: [0773], “In the quantum computer implementation (or quantum version), the cost function and the updates of the weights classically obtained using back propagation are obtained via quantum measurements”); and 
determining updated gate logic parameters for one or more of the gate logics (See Stojevic: [0773], “Therefore, in the quantum algorithm, the parameters as previously disclosed are updated using results of these measurements”).
Regarding claim 20, Edrenkin, Gambetta, and Stojevic teach all the features with respect to claim 19 as outlined above. Further, Edrenkin and Stojevic teach the non-transitory computer-readable medium of Claim 19, wherein the method further comprises: 
determining an updated prior probability associated with the training label based on the inference (See Edrenkin: Fig. 2, and [0050], “In one example, the fine training unit 220 is then able to assign, during fine training of the neural network 230, a fine probability P.sub.f(v.sub.i) to each neuron v.sub.i ∈ l.sub.i of the output layer in the subset of neurons l′, the fine probability P.sub.f(v.sub.i) having a higher probability distribution accuracy than the coarse probability P.sub.c(v.sub.i) and being an output result of the fine training unit 220. Again, similar to the calculations performed in the neural network of FIGS. 1A and 1B, the fine probability P.sub.f(v.sub.i) may be calculated in each neuron v.sub.i of the subset of neurons l′ of the output layer by using the activation function f. For example, the activation function f uses as inputs the weighted sum of the m.sub.f inputs of each neuron v.sub.i of the subset of neurons l′ of the output layer as shown below”), and 
wherein determining the updated layer weights and determining the updated gate logic parameters are based on the updated prior probability (See Stojevic: [0773], “Therefore, in the quantum algorithm, the parameters as previously disclosed are updated using results of these measurements. There may be no quantum element regarding these parameters i.e. they are just real or complex numbers, but they may be operable to be updated using a result of an average of quantum measurements. In the classical algorithm, the parameters (or equivalently the weights) may be updated using back-propagation. In the quantum computer implementation (or quantum version), the cost function and the updates of the weights classically obtained using back propagation are obtained via quantum measurements”).


Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to GORDON G LIU, whose telephone number is (571) 270-0382. The examiner can normally be reached Monday - Friday, 8:00-5:00.
Examiner interviews are available via telephone, in person, and by video conference using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jennifer Mehmood, can be reached at 571-272-2976. The fax number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/GORDON G LIU/Primary Examiner, Art Unit 2612