DETAILED ACTION
Response to Arguments
Applicant’s arguments with respect to claims 1, 10 and 19 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Objections
Claims 10 and 19 are objected to because of the following informalities: the amended limitation “selecting a hardware device, of the set of hardware devices, on which to execute the” is incomplete. Appropriate correction is required. For purposes of this Office Action, the amended limitation of claims 10 and 19 is being interpreted as follows: “selecting a hardware device, of the set of hardware devices, on which to execute the expert.”

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-2, 4-5, 8-11, 13-14, and 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over Marco et al., "Improving spark application throughput via memory aware task co-location: A mixture of experts approach," Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference (2017) (“Marco”), in view of Rattanatamrong, Prapaporn, Real-time scheduling of ensemble systems with limited resources, University of Florida (2011) (“Prapaporn”), and in view of Stefan et al., "Omnivore: An optimizer for multi-device deep learning on cpus and gpus," arXiv preprint arXiv:1606.04487 (2016) (“Stefan”).
Regarding claim 1, Marco teaches a method for distributing an expert of a mixture-of-experts system for execution in a set of hardware devices, the method comprising: retrieving priority data based on the one or more execution parameters and the expert (Marco, pg. 100, sec. 4.2 Resource Monitor, figure. 5, “Each computing node runs a daemon that periodically reports to the resource monitor its memory usage and CPU load. Our current implementation reports the average memory usage and system load within a 5-minute window.” Note: As figure 5 details the memory function used to determine how many data items to give under a memory budget represents the limitation of: priority data); identifying a hardware device of the set of hardware devices based on the priority data; and dispatching the expert to the identified hardware device for execution(Marco,  pg. 100, 4.3 Job Dispatcher, figure. 5, “Once we have the memory function of the highest-priority application, the job dispatcher will spawn a new executor for the application to run on severs that have spare memory and if the aggregate CPU load of all co-running tasks will not go over 100%.” Note: It is being interpreted that a new executor for the application represents the limitation of: dispatching the expert). 
Marco does not teach: the expert being a component of a mixture-of-experts machine learning model; the priority data ranking hardware devices of the set of hardware devices for executing the expert.
However, Prapaporn teaches: the expert being a component of a mixture-of-experts machine learning model(Prapaporn, pgs. 61-68, see also fig. 3-1 and table 3-1, As fig. 3-1 details, “[t]he architecture and workflow of the EES manager. The Task Utilization Adaptor (TUA) assigns resource-utilization demands of experts according to the experts’ responsibilities, determined by a gating component of an ensemble system, in order to minimize the impact of limited resources on the quality of system outputs.”); the priority data ranking hardware devices of the set of hardware devices for executing the expert(Prapaporn, pgs. 61-68, see also fig. 3-1 and table 3-1, As fig. 3-1 details, “[t]he Real-time Task Scheduler (RTS) allocates resources to experts based on their assigned resource-utilization demands and selects only a subset of experts to execute in each cycle.”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Marco with the above teachings of Prapaporn. The motivation to do so would be to schedule experts in a mixture of experts in a way that does not overburden the hardware systems (Prapaporn, pg. 16, “Inspired by the strategy of divide and conquer, ensemble systems utilize multiple simple computational models (called ‘experts’) that can, individually or in some combination, generate solutions for a larger range of input cases than their single original model. Real system requirements of ensemble systems (e.g., size, weight, power and cost constraints) often lead to limited availability of computational resources required to support concurrent execution of all experts. This dissertation proposes a generalized architecture, called Elastic Ensemble Scheduling (EES) manager, to address the problem of scheduling experts in ensemble systems in the way that the overall system performance is minimally affected by limited resources.”).
Marco does not teach:  identifying one or more execution parameters to act as a basis for selecting a hardware device, of the set of hardware devices, on which to execute the expert.
However, Stefan does teach: identifying one or more execution parameters to act as a basis for selecting a hardware device, of the set of hardware devices, on which to execute the expert (Stefan, pg. 6, right-column, see also fig. 5, “As we have decomposed each layer [i.e. the expert] into a model server and compute servers and further mapped compute servers to compute groups over several devices [i.e. selecting a hardware device, of the set of hardware devices, on which to execute], it is the responsibility of our execution engine to make sure that all data is on each device when needed…our predicted model for iteration time, HE(g), is: $HE(g) = \max\{t_{fc},\ (t_{conv}(k) + t_{fc})/g\}$ …[t]he parameters…can be measured with high accuracy and low variance. $t_{fc}$ can be measured by running an iteration on a single device, but $t_{conv}(k)$, though still directly measurable, requires measurements for each k. Instead, $t_{conv}(k)$ can be calculated from (i) the throughput of each node; (ii) the network speed; and (iii) a measurement of $t_{conv}(1)$ (which only needs to be measured for a single k, k = 1, and on a single device) [i.e. identifying one or more execution parameters to act as a basis for]. Figure 5(b) shows that our hardware efficiency is accurate.”).
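For illustration only, the iteration-time model HE(g) quoted from Stefan above can be evaluated with a short Python sketch. The function name, argument names, and sample timings below are hypothetical and are not taken from Stefan; the reading of g as a number of compute groups is an assumption based on the quoted passage.

```python
def predicted_iteration_time(t_fc: float, t_conv_k: float, g: int) -> float:
    """Evaluate HE(g) = max{t_fc, (t_conv(k) + t_fc) / g} as quoted from Stefan.

    t_fc     -- measured time attributed to the t_fc term (hypothetical units)
    t_conv_k -- measured or calculated t_conv(k) for the chosen k
    g        -- assumed here to be the number of compute groups (must be >= 1)
    """
    if g < 1:
        raise ValueError("g must be a positive integer")
    return max(t_fc, (t_conv_k + t_fc) / g)


if __name__ == "__main__":
    # Hypothetical timings chosen only to show how the max{} term behaves.
    for g in (1, 2, 4, 8):
        print(g, predicted_iteration_time(t_fc=2.0, t_conv_k=10.0, g=g))
```

With these made-up inputs the predicted time falls as g grows until the t_fc term dominates, which is the behavior the max{} expression encodes.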
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Marco in view of Stefan. The motivation to do so would be to minimize the time to train a deep learning system (Stefan, pg. 2, “We focus on perhaps the most popular deep learning models, convolutional neural networks (CNNs), which are state-of-the-art for a wide range of applications (e.g., image processing, video analysis, drug discovery). Our study answers the following question: ‘Given a cluster (e.g., X machines, Y GPUs, Z CPUs, etc.), how do I train my CNN as quickly as possible?’. We assume that the following are given: (i) a deep learning model (network architecture), (ii) a dataset for training this model, and (iii) a set of computational resources (a number of devices on machines, their throughput, and the network speed). We then study how to minimize the total training time. We build a complete prototype capable of training the most popular deep learning models. This allows us to hone in on two major choices: (i) how to use hardware on each node and (ii) the degree to which asynchrony can be tolerated.”).
Regarding claim 2, Marco in view of Prapaporn and in view of Stefan teaches the method of claim 1, wherein the priority data includes ranking information for ranking hardware devices of the set of hardware devices for the combination of the expert and the one or more execution parameters(Marco,  pg. 100, 4.3 Job Dispatcher, figure. 5, “Once we have the memory function of the highest-priority application, the job dispatcher will spawn a new executor for the application to run on se[r]vers that have spare memory and if the aggregate CPU load of all co-running tasks will not go over 100%.” Note: It is being interpreted that the highest-priority application to run on servers that have spare memory represents the limitation of: ranking information for ranking hardware devices of the set of hardware devices).
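For illustration only, the dispatch condition quoted from Marco (servers with spare memory, and an aggregate CPU load that does not exceed 100%) can be sketched in Python as follows. The `Node` type, its field names, and the `eligible_nodes` helper are hypothetical and do not appear in Marco; this is a minimal sketch of the quoted condition, not Marco's implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    # Hypothetical per-node state, as a resource monitor might report it.
    name: str
    spare_memory_gb: float
    cpu_load_pct: float


def eligible_nodes(nodes: List[Node], task_memory_gb: float,
                   task_cpu_pct: float) -> List[Node]:
    """Return nodes with enough spare memory whose aggregate CPU load,
    including the new task, would not go over 100%."""
    return [
        n for n in nodes
        if n.spare_memory_gb >= task_memory_gb
        and n.cpu_load_pct + task_cpu_pct <= 100.0
    ]


if __name__ == "__main__":
    cluster = [
        Node("node-a", spare_memory_gb=16.0, cpu_load_pct=40.0),
        Node("node-b", spare_memory_gb=2.0, cpu_load_pct=10.0),
        Node("node-c", spare_memory_gb=32.0, cpu_load_pct=90.0),
    ]
    print([n.name for n in eligible_nodes(cluster, 8.0, 30.0)])
```

In this made-up cluster only the first node satisfies both conditions: the second lacks spare memory and the third would exceed 100% CPU load.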
Regarding claim 4, Marco in view of Prapaporn and in view of Stefan teaches the method of claim 1, wherein the execution parameters comprise one or more of execution throughput, execution latency, power consumption, and execution speed(Marco, pg. 98, sec. 3.2 Runtime Features, Table 2, figure 4(b), “Collected feature values are encoded to a vector of real values… [c]ache features, L1_TCM, L1_DCM and L1_STM, are found to be important for describing memory behaviors. This is not supervising as cache hit/miss rates are shown to be useful in characterizing the application behavior in prior works…[o]ther features of virtual memory usage (vcache), I/O (bo) and thread contention (cs) are also considered to be useful.” & See also Table 2 for additional execution parameters.  Note: It is being interpreted that I/O represents the limitation of: execution speed and thread contention represents the limitation of: execution latency).1
Regarding claim 5, Marco in view of Prapaporn and in view of Stefan teaches the method of claim 1, further comprising automatically obtaining the priority data by testing the expert (Marco, pgs. 96-97, 2.3 Overview of Our Approach, figure.  1, “For each “new" application that is ready to run, we predict which of the off-line learned experts, termed ‘memory function’ in this paper, best describes its memory behavior, i.e. how the memory footprint changes as the input size varies. The selection of the memory function is based on runtime information of the program, such as the number of L1 data and instruction cache misses. This information is collected by running the application on a small portion (around 100MB) of the input data items.” Note: It is being interpreted that the memory function represents the limitation of: the priority data and runtime information collected by running the application on a small portion of the input data represents the limitation of: testing the expert).
Regarding claim 8, Marco in view of Prapaporn and in view of Stefan teaches the method of claim 1, wherein identifying a hardware device of the set of hardware devices to dispatch the expert further comprises retrieving priority data based on one or more model characteristics or parameters (Marco, pgs. 96-97, 2.3 Overview of Our Approach, figure.  1, “For each “new" application that is ready to run, we predict which of the off-line learned experts, termed ‘memory function’ in this paper, best describes its memory behavior, i.e. how the memory footprint changes as the input size varies. The selection of the memory function is based on runtime information of the program, such as the number of L1 data and instruction cache misses. This information is collected by running the application on a small portion (around 100MB) of the input data items. We then calibrate the selected function to tailor its parameters to the target program and input.” Note: It is being interpreted that the memory function represents the limitation of: the priority data and it is being interpreted that runtime information of the program, such as the number of L1 data and instruction cache misses represents the limitation of model characteristics or parameters).2
Regarding claim 9, Marco in view of Prapaporn and in view of Stefan teaches the method of claim 1, wherein the hardware devices comprise one or more of a central processing unit, a graphics processing unit, a field programmable gate array, a dataflow execution unit, an application specific integrated circuit, and a microprocessor(Marco pg. 100, sec. 5 Experimental Setup, “We use a multi-core cluster with 40 nodes, each has an 8-core Xeon E5-2650 CPU @ 2.6GHz (16 threads with hyperthreading)… [n]odes have SSD storage and are connected through 10Gbps Ethernet…”).3
Regarding claim 10, Marco teaches a system for distributing an expert of a mixture-of-experts system for execution in a set of hardware devices, the system comprising: a data store configured to store priority data(Marco, pg. 100, sec. 4.2 Resource Monitor, “Each computing node runs a daemon that periodically reports to the resource monitor its memory usage and CPU load. Our current implementation reports the average memory usage and system load within a 5-minute window. The information is retrieved from the Linux [‘]/proc[’] system.”Note: It is being interpreted that the Linux “/proc” file system represents the limitation of: the data store configured to store priority data); and an orchestrator configured to: retrieve priority data from the data store based on the one or more execution parameters and the expert(Marco, pg. 100, sec. 4.2 Resource Monitor, figure. 5, “Each computing node runs a daemon that periodically reports to the resource monitor its memory usage and CPU load. Our current implementation reports the average memory usage and system load within a 5-minute window.” Note: As figure 5 details the memory function used to determine how many data items to give under a memory budget represents the limitation of: priority data); identify a hardware device of the set of hardware devices based on the priority data; and dispatch the expert to the identified hardware device for execution(Marco,  pg. 100, 4.3 Job Dispatcher, figure. 5, “Once we have the memory function of the highest-priority application, the job dispatcher will spawn a new executor for the application to run on severs that have spare memory and if the aggregate CPU load of all co-running tasks will not go over 100%.” Note: It is being interpreted that a new executor for the application represents the limitation of: dispatching the expert).
Marco does not teach: the expert being a component of a mixture-of-experts machine learning model; the priority data ranking hardware devices of the set of hardware devices for executing the expert.
However, Prapaporn teaches: the expert being a component of a mixture-of-experts machine learning model(Prapaporn, pgs. 61-68, see also fig. 3-1 and table 3-1, As fig. 3-1 details, “[t]he architecture and workflow of the EES manager. The Task Utilization Adaptor (TUA) assigns resource-utilization demands of experts according to the experts’ responsibilities, determined by a gating component of an ensemble system, in order to minimize the impact of limited resources on the quality of system outputs.”); the priority data ranking hardware devices of the set of hardware devices for executing the expert(Prapaporn, pgs. 61-68, see also fig. 3-1 and table 3-1, As fig. 3-1 details, “[t]he Real-time Task Scheduler (RTS) allocates resources to experts based on their assigned resource-utilization demands and selects only a subset of experts to execute in each cycle.”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Marco with the above teachings of Prapaporn. The motivation to do so would be to schedule experts in a mixture of experts in a way that does not overburden the hardware systems (Prapaporn, pg. 16, “Inspired by the strategy of divide and conquer, ensemble systems utilize multiple simple computational models (called ‘experts’) that can, individually or in some combination, generate solutions for a larger range of input cases than their single original model. Real system requirements of ensemble systems (e.g., size, weight, power and cost constraints) often lead to limited availability of computational resources required to support concurrent execution of all experts. This dissertation proposes a generalized architecture, called Elastic Ensemble Scheduling (EES) manager, to address the problem of scheduling experts in ensemble systems in the way that the overall system performance is minimally affected by limited resources.”).
Marco does not teach: identify one or more execution parameters to act as a basis for selecting a hardware device, of the set of hardware devices, on which to execute the expert. 
However, Stefan teaches: identify one or more execution parameters to act as a basis for selecting a hardware device, of the set of hardware devices, on which to execute the expert (Stefan, pg. 6, right-column, see also fig. 5, “As we have decomposed each layer [i.e. the expert] into a model server and compute servers and further mapped compute servers to compute groups over several devices [i.e. selecting a hardware device, of the set of hardware devices, on which to execute], it is the responsibility of our execution engine to make sure that all data is on each device when needed…our predicted model for iteration time, HE(g), is: $HE(g) = \max\{t_{fc},\ (t_{conv}(k) + t_{fc})/g\}$ …[t]he parameters…can be measured with high accuracy and low variance. $t_{fc}$ can be measured by running an iteration on a single device, but $t_{conv}(k)$, though still directly measurable, requires measurements for each k. Instead, $t_{conv}(k)$ can be calculated from (i) the throughput of each node; (ii) the network speed; and (iii) a measurement of $t_{conv}(1)$ (which only needs to be measured for a single k, k = 1, and on a single device) [i.e. identifying one or more execution parameters to act as a basis for]. Figure 5(b) shows that our hardware efficiency is accurate.”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Marco in view of Stefan. The motivation to do so would be to minimize the time to train a deep learning system (Stefan, pg. 2, “We focus on perhaps the most popular deep learning models, convolutional neural networks (CNNs), which are state-of-the-art for a wide range of applications (e.g., image processing, video analysis, drug discovery). Our study answers the following question: ‘Given a cluster (e.g., X machines, Y GPUs, Z CPUs, etc.), how do I train my CNN as quickly as possible?’. We assume that the following are given: (i) a deep learning model (network architecture), (ii) a dataset for training this model, and (iii) a set of computational resources (a number of devices on machines, their throughput, and the network speed). We then study how to minimize the total training time. We build a complete prototype capable of training the most popular deep learning models. This allows us to hone in on two major choices: (i) how to use hardware on each node and (ii) the degree to which asynchrony can be tolerated.”).
Regarding claim 11, Marco in view of Prapaporn and in view of Stefan teaches the system of claim 10, wherein the priority data includes ranking information for ranking hardware devices of the set of hardware devices for the combination of the expert and the one or more execution parameters(Marco,  pg. 100, 4.3 Job Dispatcher, figure. 5, “Once we have the memory function of the highest-priority application, the job dispatcher will spawn a new executor for the application to run on severs that have spare memory and if the aggregate CPU load of all co-running tasks will not go over 100%.”Note: It is being interpreted that the highest-priority application to run on servers that have spare memory represents the limitation of: ranking information for ranking hardware devices of the set of hardware devices).
Regarding claim 13, Marco in view of Prapaporn and in view of Stefan teaches the system of claim 10, wherein the execution parameters comprise one or more of execution throughput, execution latency, power consumption, and execution speed(Marco, pg. 98, sec. 3.2 Runtime Features, Table 2, figure 4(b), “Collected feature values are encoded to a vector of real values… [c]ache features, L1_TCM, L1_DCM and L1_STM, are found to be important for describing memory behaviors. This is not supervising as cache hit/miss rates are shown to be useful in characterizing the application behavior in prior works…[o]ther features of virtual memory usage (vcache), I/O (bo) and thread contention (cs) are also considered to be useful.” & See also Table 2 for additional execution parameters. Note: It is being interpreted that I/O represents the limitation of: execution speed and thread contention represents the limitation of: execution latency ).4
Regarding claim 14, Marco in view of Prapaporn and in view of Stefan teaches the system of claim 10, wherein the orchestrator is further configured to automatically obtain the priority data by testing the expert(Marco, pgs. 96-97, 2.3 Overview of Our Approach, figure.  1, “For each “new" application that is ready to run, we predict which of the off-line learned experts, termed ‘memory function’ in this paper, best describes its memory behavior, i.e. how the memory footprint changes as the input size varies. The selection of the memory function is based on runtime information of the program, such as the number of L1 data and instruction cache misses. This information is collected by running the application on a small portion (around 100MB) of the input data items.” Note: It is being interpreted that the memory function represents the limitation of: the priority data and runtime information collected by running the application on a small portion of the input data represents the limitation of: testing the expert).
Regarding claim 17, Marco in view of Prapaporn and in view of Stefan teaches the system of claim 10, wherein the orchestrator is configured to identify a hardware device of the set of hardware devices to dispatch the expert by: retrieving priority data based on one or more model characteristics or parameters(Marco, pgs. 96-97, 2.3 Overview of Our Approach, figure.  1, “For each “new" application that is ready to run, we predict which of the off-line learned experts, termed ‘memory function’ in this paper, best describes its memory behavior, i.e. how the memory footprint changes as the input size varies. The selection of the memory function is based on runtime information of the program, such as the number of L1 data and instruction cache misses. This information is collected by running the application on a small portion (around 100MB) of the input data items. We then calibrate the selected function to tailor its parameters to the target program and input.” Note: It is being interpreted that the memory function represents the limitation of: the priority data and it is being interpreted that runtime information of the program, such as the number of L1 data and instruction cache misses represents the limitation of model characteristics or parameters).5
Regarding claim 18, Marco in view of Prapaporn and in view of Stefan teaches the system of claim 10, wherein the hardware devices comprise one or more of a central processing unit, a graphics processing unit, a field programmable gate array, a dataflow execution unit, an application specific integrated circuit, and a microprocessor(Marco pg. 100, sec. 5 Experimental Setup, “We use a multi-core cluster with 40 nodes, each has an 8-core Xeon E5-2650 CPU @ 2.6GHz (16 threads with hyperthreading)… [n]odes have SSD storage and are connected through 10Gbps Ethernet…”).6
Regarding claim 19, Marco teaches a system, comprising: a set of hardware devices (Marco, pg. 100, sec. 5 Experimental Setup, “We use a multi-core cluster with 40 nodes, each has an 8-core Xeon E5-2650 CPU @ 2.6GHz (16 threads with hyperthreading)… [n]odes have SSD storage and are connected through 10Gbps Ethernet…”); and a system for distributing an expert of a mixture-of-experts system for execution in the set of hardware devices, the system comprising: a data store configured to store priority data (Marco, pg. 100, sec. 4.2 Resource Monitor, “Each computing node runs a daemon that periodically reports to the resource monitor its memory usage and CPU load. Our current implementation reports the average memory usage and system load within a 5-minute window. The information is retrieved from the Linux [‘]/proc[’] system.” Note: It is being interpreted that the Linux “/proc” file system represents the limitation of: the data store configured to store priority data); and an orchestrator configured to: retrieve priority data from the data store based on the one or more execution parameters and the expert (Marco, pg. 100, sec. 4.2 Resource Monitor, figure. 5, “Each computing node runs a daemon that periodically reports to the resource monitor its memory usage and CPU load. Our current implementation reports the average memory usage and system load within a 5-minute window.” Note: As figure 5 details, the memory function used to determine how many data items to give under a memory budget represents the limitation of: priority data); identify a hardware device of the set of hardware devices based on the priority data; and dispatch the expert to the identified hardware device for execution (Marco, pg. 100, 4.3 Job Dispatcher, figure. 5, “Once we have the memory function of the highest-priority application, the job dispatcher will spawn a new executor for the application to run on servers that have spare memory and if the aggregate CPU load of all co-running tasks will not go over 100%.” Note: It is being interpreted that a new executor for the application represents the limitation of: dispatching the expert).
Marco does not teach: the expert being a component of a mixture-of-experts machine learning model; the priority data ranking hardware devices of the set of hardware devices for executing the expert.
However, Prapaporn teaches: the expert being a component of a mixture-of-experts machine learning model(Prapaporn, pgs. 61-68, see also fig. 3-1 and table 3-1, As fig. 3-1 details, “[t]he architecture and workflow of the EES manager. The Task Utilization Adaptor (TUA) assigns resource-utilization demands of experts according to the experts’ responsibilities, determined by a gating component of an ensemble system, in order to minimize the impact of limited resources on the quality of system outputs.”); the priority data ranking hardware devices of the set of hardware devices for executing the expert(Prapaporn, pgs. 61-68, see also fig. 3-1 and table 3-1, As fig. 3-1 details, “[t]he Real-time Task Scheduler (RTS) allocates resources to experts based on their assigned resource-utilization demands and selects only a subset of experts to execute in each cycle.”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Marco with the above teachings of Prapaporn. The motivation to do so would be to schedule experts in a mixture of experts in a way that does not overburden the hardware systems (Prapaporn, pg. 16, “Inspired by the strategy of divide and conquer, ensemble systems utilize multiple simple computational models (called ‘experts’) that can, individually or in some combination, generate solutions for a larger range of input cases than their single original model. Real system requirements of ensemble systems (e.g., size, weight, power and cost constraints) often lead to limited availability of computational resources required to support concurrent execution of all experts. This dissertation proposes a generalized architecture, called Elastic Ensemble Scheduling (EES) manager, to address the problem of scheduling experts in ensemble systems in the way that the overall system performance is minimally affected by limited resources.”).
Marco does not teach: identify one or more execution parameters to act as a basis for selecting a hardware device, of the set of hardware devices, on which to execute the expert. 
However, Stefan teaches: identify one or more execution parameters to act as a basis for selecting a hardware device, of the set of hardware devices, on which to execute the expert (Stefan, pg. 6, right-column, see also fig. 5, “As we have decomposed each layer [i.e. the expert] into a model server and compute servers and further mapped compute servers to compute groups over several devices [i.e. selecting a hardware device, of the set of hardware devices, on which to execute], it is the responsibility of our execution engine to make sure that all data is on each device when needed…our predicted model for iteration time, HE(g), is: HE(g) = max{t_fc, (t_conv(k) + t_fc)/g}…[t]he parameters…can be measured with high accuracy and low variance. t_fc can be measured by running an iteration on a single device, but t_conv(k), though still directly measurable, requires measurements for each k. Instead, t_conv(k) can be calculated from (i) the throughput of each node; (ii) the network speed; and (iii) a measurement of t_conv(1) (which only needs to be measured for a single k, k = 1, and on a single device) [i.e. identifying one or more execution parameters to act as a basis for]. Figure 5(b) shows that our hardware efficiency is accurate.”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Marco in view of Stefan. The motivation to do so would be to minimize the time to train a deep learning system (Stefan, pg. 2, “We focus on perhaps the most popular deep learning models, convolutional neural networks (CNNs), which are state-of-the-art for a wide range of applications (e.g., image processing, video analysis, drug discovery). Our study answers the following question: ‘Given a cluster (e.g., X machines, Y GPUs, Z CPUs, etc.), how do I train my CNN as quickly as possible?’. We assume that the following are given: (i) a deep learning model (network architecture), (ii) a dataset for training this model, and (iii) a set of computational resources (a number of devices on machines, their throughput, and the network speed). We then study how to minimize the total training time. We build a complete prototype capable of training the most popular deep learning models. This allows us to hone in on two major choices: (i) how to use hardware on each node and (ii) the degree to which asynchrony can be tolerated.”).
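For illustration only, the iteration-time model quoted from Stefan, HE(g) = max{t_fc, (t_conv(k) + t_fc)/g}, can be evaluated numerically. The following Python sketch is not taken from Stefan: the function name and the sample timings are hypothetical, and g is assumed here to denote the number of compute groups. It shows only how the measured parameters t_fc and t_conv(k) would feed the predicted iteration time.

```python
def predicted_iteration_time(t_fc: float, t_conv_k: float, g: int) -> float:
    """Evaluate HE(g) = max{t_fc, (t_conv(k) + t_fc) / g}.

    t_fc     -- measured time attributed to the fully connected part (assumed units: seconds)
    t_conv_k -- measured or calculated t_conv(k) for the chosen k (same units)
    g        -- assumed to be the number of compute groups (hypothetical reading)
    """
    if g < 1:
        raise ValueError("g must be a positive integer")
    # The prediction is bounded below by t_fc, which does not shrink as g grows.
    return max(t_fc, (t_conv_k + t_fc) / g)


# Hypothetical timings, chosen only to show the two regimes of the max{...}.
sample_t_fc = 2.0
sample_t_conv_k = 6.0
predictions = {g: predicted_iteration_time(sample_t_fc, sample_t_conv_k, g)
               for g in (1, 2, 4, 8)}
```

With these hypothetical values the prediction falls as g increases until the t_fc term dominates, after which adding groups no longer lowers the predicted time.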
Regarding claim 20, Marco in view of Prapaporn and in view of Stefan teaches the system of claim 19, wherein the orchestrator is further configured to automatically obtain the priority data by testing the expert(Marco, pgs. 96-97, 2.3 Overview of Our Approach, figure.  1, “For each “new" application that is ready to run, we predict which of the off-line learned experts, termed ‘memory function’ in this paper, best describes its memory behavior, i.e. how the memory footprint changes as the input size varies. The selection of the memory function is based on runtime information of the program, such as the number of L1 data and instruction cache misses. This information is collected by running the application on a small portion (around 100MB) of the input data items.” Note: It is being interpreted that the memory function represents the limitation of: the priority data and runtime information collected by running the application on a small portion of the input data represents the limitation of: testing the expert).
Claims 3, 6, 12, and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Marco et al. "Improving spark application throughput via memory aware task co-location: A mixture of experts approach." Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference. (2017) (“Marco”) and in view of Rattanatamrong, Prapaporn. Real-time scheduling of ensemble systems with limited resources. University of Florida, (2011) (“Prapaporn”) and in view of Stefan, et al. "Omnivore: An optimizer for multi-device deep learning on cpus and gpus." arXiv preprint arXiv:1606.04487 (2016) (“Stefan”) and further in view of US 2018/0234491 A1 (“GOMES DE OLIVEIRA”).
Regarding claim 3, Marco in view of Prapaporn and in view of Stefan teaches the method of claim 2, but does not teach: wherein identifying the hardware device based on the priority data includes identifying the highest ranked available hardware device.
However, GOMES DE OLIVEIRA teaches: wherein identifying the hardware device based on the priority data includes identifying the highest ranked available hardware device (GOMES DE OLIVEIRA, para. 0033, fig. 2, “[R]anking engine 204 represents generally a combination of hardware and programming to rank each of the servers of the set of servers with an efficiency ranking. Ranking engine 204 determines the efficiency rankings based upon the efficiency rates that were determined by efficiency rate engine 202. For instance, utilizing an efficiency rate model wherein a first server with a determined efficiency rate of "444" is indicative of a higher power consumption efficiency than a second server that has a determined efficiency rate of "220", ranking engine 204 may assign an efficiency ranking to the first server such as "Rank 1", "First Rank", or "Rank A", or the like and may assign an efficiency ranking to the second server such as "Rank 2", "Second Rank", "Rank B", or the like according to a given efficiency ranking construct.”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Marco’s method in view of Prapaporn and in view of Stefan and further in view of GOMES DE OLIVEIRA to teach: wherein identifying the hardware device based on the priority data includes identifying the highest ranked available hardware device. The motivation to do so would be to have the most efficient underlying hardware executing data-intensive applications as a means of lowering service fees (GOMES DE OLIVEIRA, para. 0009, “For instance in addition to having multiple types of servers, similar servers can have significantly different numbers of cores and core configurations. Cloud service providers and their users will thus appreciate a system and method to automatically and effectively deploy programs among a set of heterogeneous servers in a manner that maximizes energy efficiencies of the servers set.”).
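For illustration only, the ranking described in GOMES DE OLIVEIRA, para. 0033, together with the claimed step of identifying the highest ranked available hardware device, can be sketched as follows. This Python sketch is not taken from the reference: the function names, server names, and availability set are hypothetical, and it assumes (per the quoted example of rates "444" and "220") that a higher efficiency rate receives a better rank.

```python
from typing import Optional


def rank_servers(efficiency_rates: dict[str, float]) -> list[str]:
    """Return server names ordered from best rank to worst.

    Assumption: a higher efficiency rate means a better rank, as in the
    quoted example where rate 444 outranks rate 220.
    """
    return sorted(efficiency_rates, key=lambda name: efficiency_rates[name], reverse=True)


def highest_ranked_available(ranking: list[str], available: set[str]) -> Optional[str]:
    """Return the first server in rank order that is currently available."""
    for name in ranking:
        if name in available:
            return name
    return None  # no server in the ranking is available


# Hypothetical data mirroring the rates quoted from para. 0033.
rates = {"second_server": 220.0, "first_server": 444.0}
ranking = rank_servers(rates)
```

When the top-ranked server is unavailable, the selection falls through to the next rank, which is the behavior the claim language "highest ranked available" appears to require.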
Regarding claim 6, Marco in view of Prapaporn and in view of Stefan teaches the method of claim 5 and further teaches: recording measurements from the executions of the experts as the priority data (Marco, pg. 100, sec. 4.2 Resource Monitor, figure. 5, “Each computing node runs a daemon that periodically reports to the resource monitor its memory usage and CPU load. Our current implementation reports the average memory usage and system load within a 5-minute window.”).
Marco in view of Prapaporn and in view of Stefan does not teach: wherein testing the expert comprises: executing the expert on different hardware devices of the set of hardware devices.
However, GOMES DE OLIVEIRA teaches: wherein testing the expert comprises: executing the expert on different hardware devices of the set of hardware devices (GOMES DE OLIVEIRA, para. 0034, fig. 2, “In an example, wherein a first server has a ranking of "Rank 1", "First Rank", or "Rank A", or the like and a second server has a ranking such as "Rank 2", "Second Rank", "Rank B", or the like, deployment engine 206 may deploy programs to the highest ranking server until the highest ranking server is at maximum capacity, and then may deploy programs from the set of programs to the server with the next highest efficiency ranking, and so on.” Note: It is being interpreted that the deployment engine deploying programs to the servers based on their rankings represents the limitation of: executing the expert on different hardware devices of the set of hardware devices).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Marco’s method in view of Prapaporn and in view of Stefan and further in view of GOMES DE OLIVEIRA to teach: wherein testing the expert comprises: executing the expert on different hardware devices of the set of hardware devices. The motivation to do so would be to have the most efficient underlying hardware executing data-intensive applications as a means of lowering service fees (GOMES DE OLIVEIRA, para. 0009, “For instance in addition to having multiple types of servers, similar servers can have significantly different numbers of cores and core configurations. Cloud service providers and their users will thus appreciate a system and method to automatically and effectively deploy programs among a set of heterogeneous servers in a manner that maximizes energy efficiencies of the servers set.”).
Regarding claim 12, Marco in view of Prapaporn and in view of Stefan teaches the system of claim 11 but does not teach: wherein the orchestrator is configured to identify the hardware device based on the priority data by identifying the highest ranked available hardware device.
However, GOMES DE OLIVEIRA teaches: wherein the orchestrator is configured to identify the hardware device based on the priority data by identifying the highest ranked available hardware device (GOMES DE OLIVEIRA, para. 0033, fig. 2, “[R]anking engine 204 represents generally a combination of hardware and programming to rank each of the servers of the set of servers with an efficiency ranking. Ranking engine 204 determines the efficiency rankings based upon the efficiency rates that were determined by efficiency rate engine 202. For instance, utilizing an efficiency rate model wherein a first server with a determined efficiency rate of "444" is indicative of a higher power consumption efficiency than a second server that has a determined efficiency rate of "220", ranking engine 204 may assign an efficiency ranking to the first server such as "Rank 1", "First Rank", or "Rank A", or the like and may assign an efficiency ranking to the second server such as "Rank 2", "Second Rank", "Rank B", or the like according to a given efficiency ranking construct.”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Marco’s method in view of Prapaporn and in view of Stefan and further in view of GOMES DE OLIVEIRA to teach: wherein the orchestrator is configured to identify the hardware device based on the priority data by identifying the highest ranked available hardware device. The motivation to do so would be to have the most efficient underlying hardware executing data-intensive applications as a means of lowering service fees (GOMES DE OLIVEIRA, para. 0009, “For instance in addition to having multiple types of servers, similar servers can have significantly different numbers of cores and core configurations. Cloud service providers and their users will thus appreciate a system and method to automatically and effectively deploy programs among a set of heterogeneous servers in a manner that maximizes energy efficiencies of the servers set.”).
Regarding claim 15, Marco in view of Prapaporn and in view of Stefan teaches the system of claim 14 and further teaches: recording measurements from the executions of the experts as the priority data (Marco, pg. 100, sec. 4.2 Resource Monitor, figure. 5, “Each computing node runs a daemon that periodically reports to the resource monitor its memory usage and CPU load. Our current implementation reports the average memory usage and system load within a 5-minute window.”).
Marco in view of Prapaporn and in view of Stefan does not teach: wherein the orchestrator is configured to test the expert by: executing the expert on different hardware devices of the set of hardware devices.
However, GOMES DE OLIVEIRA teaches: wherein the orchestrator is configured to test the expert by: executing the expert on different hardware devices of the set of hardware devices (GOMES DE OLIVEIRA, para. 0034, fig. 2, “In an example, wherein a first server has a ranking of "Rank 1", "First Rank", or "Rank A", or the like and a second server has a ranking such as "Rank 2", "Second Rank", "Rank B", or the like, deployment engine 206 may deploy programs to the highest ranking server until the highest ranking server is at maximum capacity, and then may deploy programs from the set of programs to the server with the next highest efficiency ranking, and so on.” Note: It is being interpreted that the deployment engine deploying programs to the servers based on their rankings represents the limitation of: executing the expert on different hardware devices of the set of hardware devices).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Marco’s method in view of Prapaporn and in view of Stefan and further in view of GOMES DE OLIVEIRA to teach: wherein the orchestrator is configured to test the expert by: executing the expert on different hardware devices of the set of hardware devices. The motivation to do so would be to have the most efficient underlying hardware executing data-intensive applications as a means of lowering service fees (GOMES DE OLIVEIRA, para. 0009, “For instance in addition to having multiple types of servers, similar servers can have significantly different numbers of cores and core configurations. Cloud service providers and their users will thus appreciate a system and method to automatically and effectively deploy programs among a set of heterogeneous servers in a manner that maximizes energy efficiencies of the servers set.”).
Claims 7 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Marco et al. "Improving spark application throughput via memory aware task co-location: A mixture of experts approach." Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference. (2017) (“Marco”) in view of Rattanatamrong, Prapaporn. Real-time scheduling of ensemble systems with limited resources. University of Florida, (2011) (“Prapaporn”) and in view of Stefan, et al. "Omnivore: An optimizer for multi-device deep learning on cpus and gpus." arXiv preprint arXiv:1606.04487 (2016) (“Stefan”) and in view of US 2018/0234491 A1 (“GOMES DE OLIVEIRA”) and further in view of Rattanatamrong et al. "Real-time scheduling of mixture-of-experts systems with limited resources." Proceedings of the 13th ACM international conference on Hybrid systems: computation and control. (2010) (“Rattanatamrong”).
Regarding claim 7, Marco in view of Prapaporn and in view of Stefan and in view of GOMES DE OLIVEIRA teaches the method of claim 6 but does not teach: comparing the recorded measurements to generate ranking data for the experts. 
However, Rattanatamrong teaches: comparing the recorded measurements to generate ranking data for the experts (Rattanatamrong, pgs. 75-76, sec. 5 MoE Resource Scheduling Architecture and Heuristic, figure.1, figure.3, figure.4, “In each cycle, the expert responsibilities R(t)={R_1(t),…R_N(t)}, together with information about the MoE policy and available resources are used to define a new optimization problem P(w(t)) for the cycle at time t…[u]sing slack time or scheduled time before the new cycle begins, the…[responsibility predictor] utilizes logs of historic input data and performance data, collected by the…[logging and monitoring module], to estimate the next responsibilities pred_R(t+1). These estimated responsibilities can then be mapped to a set of weights (pred_w(t+1))…” Note: It is being interpreted that the estimated next responsibilities pred_R(t+1) for experts represents the limitation of: ranking data for the experts).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Marco’s method in view of Prapaporn and in view of Stefan and in view of GOMES DE OLIVEIRA and further in view of Rattanatamrong to teach: comparing the recorded measurements to generate ranking data for the experts. The motivation to do so would be to have the most important program/expert also have the most optimal resource time while also allowing less important programs/experts to have non-optimal schedules, allowing for faster problem solving overall (Rattanatamrong, pg. 72, sec. 1 Introduction, “We propose a new faster heuristic (taking O(N) time) that uses optimal-solution sensitivity analysis and prediction of responsibilities to enable a simple test to determine the optimal schedule at the beginning of most cycles. Our approach proposes the use of a responsibility predictor to predict the responsibilities of experts prior to each cycle. This enables the use of the TC algorithm during the cycle preceding the cycle for which responsibilities are predicted.”).
Regarding claim 16, Marco in view of Prapaporn and in view of Stefan and in view of  GOMES DE OLIVEIRA teaches the system of claim 15, but does not teach: wherein the orchestrator is further configured to: compare the recorded measurements to generate ranking data for the experts.
However, Rattanatamrong teaches: wherein the orchestrator is further configured to: compare the recorded measurements to generate ranking data for the experts (Rattanatamrong, pgs. 75-76, sec. 5 MoE Resource Scheduling Architecture and Heuristic, figure.1, figure.3, figure.4, “In each cycle, the expert responsibilities R(t)={R_1(t),…R_N(t)}, together with information about the MoE policy and available resources are used to define a new optimization problem P(w(t)) for the cycle at time t…[u]sing slack time or scheduled time before the new cycle begins, the…[responsibility predictor] utilizes logs of historic input data and performance data, collected by the…[logging and monitoring module], to estimate the next responsibilities pred_R(t+1). These estimated responsibilities can then be mapped to a set of weights (pred_w(t+1))…” Note: It is being interpreted that the estimated next responsibilities pred_R(t+1) for experts represents the limitation of: ranking data for the experts).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Marco’s method in view of Prapaporn and in view of Stefan and in view of GOMES DE OLIVEIRA and further in view of Rattanatamrong to teach: wherein the orchestrator is further configured to: compare the recorded measurements to generate ranking data for the experts. The motivation to do so would be to have the most important program/expert also have the most optimal resource time while also allowing less important programs/experts to have non-optimal schedules, allowing for faster problem solving overall (Rattanatamrong, pg. 72, sec. 1 Introduction, “We propose a new faster heuristic (taking O(N) time) that uses optimal-solution sensitivity analysis and prediction of responsibilities to enable a simple test to determine the optimal schedule at the beginning of most cycles. Our approach proposes the use of a responsibility predictor to predict the responsibilities of experts prior to each cycle. This enables the use of the TC algorithm during the cycle preceding the cycle for which responsibilities are predicted.”).
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Adam Clark Standke whose telephone number is (571)270-1806. The examiner can normally be reached 10AM-7PM M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael J Huntley, can be reached at (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



Adam Clark Standke
Assistant Examiner
Art Unit 2129
/MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129

1 According to the broadest reasonable interpretation (BRI), the use of alternative language amounts to the claim requiring one or more elements but not all.
2 According to the broadest reasonable interpretation (BRI), the use of alternative language amounts to the claim requiring one or more elements but not all.
3 According to the broadest reasonable interpretation (BRI), the use of alternative language amounts to the claim requiring one or more elements but not all.
4 According to the broadest reasonable interpretation (BRI), the use of alternative language amounts to the claim requiring one or more elements but not all.
5 According to the broadest reasonable interpretation (BRI), the use of alternative language amounts to the claim requiring one or more elements but not all.
6 According to the broadest reasonable interpretation (BRI), the use of alternative language amounts to the claim requiring one or more elements but not all.