DETAILED ACTION
Claims 1, 2, 4, 7-9, 11, 12, and 15 are amended. Claims 3, 6, and 14 are cancelled. Claims 1, 2, 4, 5, 7-13, and 15-20 are pending in the application.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Claim Objections
Claims 9-10 are objected to because of the following informalities:  
Claim 9, lines 2-3: “the subsets of supernodes” should have been --subsets of the supernodes--.
Claim 10 inherits the features of claim 9 and is objected to accordingly.
Appropriate corrections are required. Applicant is advised to review the entire claims for further needed corrections.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
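A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.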

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1, 2, 4, 5, 7-9, 11-13, and 15-20 are rejected under 35 U.S.C. 103 as being unpatentable over Archer et al. (US 2012/0066310 A1; from IDS filed on 02/27/2020; hereinafter Archer).

With respect to claim 1, Archer teaches: A computer-implemented method comprising: 
performing a reduce-scatter operation among a plurality of nodes of a parallel processing system (see e.g. paragraph 21: “perform a desired collective operation for the compute nodes of the parallel computer… Examples of collective operations include… a reduce operation”; and paragraph 28: “Examples of vector variants of MPI operations include scatterv (the vector variant of a scatter operation)”) using a plurality of parallel processing stages (see e.g. paragraph 21: “A collective operation generally refers to a message-passing instruction that is executed simultaneously (or approximately so) by all the compute nodes of an operational group of compute nodes”), wherein the plurality of nodes comprises clusters of nodes (see e.g. paragraph 27: “a cluster may be built”); and 
communicating messages among the nodes of each cluster of the clusters in an initial stage of the plurality of parallel processing stages (see e.g. paragraph 27: “Different networks connecting the compute nodes of the parallel computer have different characteristic… As another example, a cluster may be built”; paragraph 21: “An allreduce operation functions as a reduce operation, followed by a broadcast (to store the result of the reduce operation in the result buffer of each process)”; and Fig. 4), wherein the initial stage is associated with a higher node injection bandwidth (see e.g. paragraph 8: “The primary communication strategy for the Blue Gene/L system is message passing over a torus network”; and paragraph 27: “networks connecting the compute nodes of the parallel computer have different characteristics… second network may support transferring data at a higher bandwidth… a torus network (higher bandwidth)”) than a [subsequent] stage of the plurality of parallel processing stages (see e.g. paragraph 27: “Different networks connecting the compute nodes of the parallel computer have different characteristics. For example, a first network may support transferring data with a lower latency”; and paragraph 25: “the low latency protocol provides a low bandwidth”); and
communicating messages among the clusters in the subsequent stage (see e.g. paragraph 27: “Different networks connecting the compute nodes of the parallel computer have different characteristic… As another example, a cluster may be built”; paragraph 21: “An allreduce operation functions as a reduce operation, followed by a broadcast (to store the result of the reduce operation in the result buffer of each process)”; and Fig. 4).
Even though Archer discloses primarily using a higher bandwidth network, such as a torus network, for the nodes (see e.g. paragraphs 4, 27; Fig. 2) and using a low-latency low-bandwidth network, such as a collective network, for the nodes (see e.g. paragraph 27), Archer does not explicitly disclose using the low-bandwidth network “subsequent” to using the higher bandwidth network.
However, Archer does disclose connecting the nodes using both the low-bandwidth and high-bandwidth networks (see e.g. paragraph 27: “compute nodes of the parallel computer may be connected by both a collective network (lower latency) and a point-to-point network, such as a torus network (higher bandwidth)”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to implement node-to-node communications to perform a collective operation (e.g. allreduce in Fig. 4) by utilizing a higher-bandwidth network followed by utilizing a lower-bandwidth network based on the communication requirements of the nodes. The motivation/suggestion would be to accommodate particular communication protocols used by the nodes and to avoid any inter-node messaging incompatibilities (see e.g. paragraphs 25-26).
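
For illustration only, the staged ordering discussed above (communication among the nodes of each cluster first, followed by communication among the clusters) can be sketched as a small simulation. The sketch is not taken from Archer or from the application. The function name hierarchical_reduce_scatter, the cluster_size parameter, and the layout in which vectors[i][j] is node i's contribution destined for node j are all assumptions made for the example, and network bandwidth is not modeled.

```python
def hierarchical_reduce_scatter(vectors, cluster_size):
    """Simulate a two-stage reduce-scatter over len(vectors) nodes.

    vectors[i][j] is node i's contribution destined for node j (assumed layout).
    Returns a list whose j-th entry is the sum over i of vectors[i][j].
    """
    n = len(vectors)
    assert n % cluster_size == 0, "sketch assumes equally sized clusters"
    num_clusters = n // cluster_size

    # Stage 1 (intra-cluster): the node with local index l in cluster c sums,
    # over its cluster peers, the elements destined for local index l in
    # every cluster. This yields one partial sum per destination cluster.
    partial = [[0] * num_clusters for _ in range(n)]
    for c in range(num_clusters):
        peers = range(c * cluster_size, (c + 1) * cluster_size)
        for l in range(cluster_size):
            node = c * cluster_size + l
            for dest_cluster in range(num_clusters):
                dest = dest_cluster * cluster_size + l
                partial[node][dest_cluster] = sum(vectors[p][dest] for p in peers)

    # Stage 2 (inter-cluster): nodes sharing a local index exchange one
    # partial sum per destination cluster and complete the reduction.
    result = [0] * n
    for dest_cluster in range(num_clusters):
        for l in range(cluster_size):
            dest = dest_cluster * cluster_size + l
            result[dest] = sum(
                partial[c * cluster_size + l][dest_cluster]
                for c in range(num_clusters)
            )
    return result


if __name__ == "__main__":
    # Four nodes in two clusters of two; node i contributes 10*i + j to node j.
    example = [[10 * i + j for j in range(4)] for i in range(4)]
    print(hierarchical_reduce_scatter(example, cluster_size=2))
```

In this sketch each stage-1 message carries one element per cluster, while each stage-2 message carries a single element, so the larger messages are exchanged inside the clusters.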

With respect to claim 2, Archer teaches: The method of claim 1, wherein: 
a message size associated with the initial stage is larger than a message size associated with the subsequent stage (see e.g. paragraph 24: “Which message passing protocol is used may depend on cutoffs based on message size”; and paragraph 26: “to achieve high message bandwidth, a message passing protocol may specify to transmit partially described packets and to have packets routed dynamically. This protocol maximizes both the amount of data to be transmitted as well the number of packets transmitted per unit time”).

With respect to claim 4, Archer teaches: The method of claim 1, wherein performing the reduce-scatter operation comprises processing elements of a data vector in parallel among the plurality of nodes (see e.g. paragraph 67: “performing a vector collective operation on a parallel computing system includes multiple compute nodes and a network connecting the compute nodes that includes ALU hardware. The compute nodes may perform a collective operation to determine displacements for performing the vector collective operation”) to reduce the elements (see e.g. paragraph 21: “A reduce operation is a collective operation that executes arithmetic or logical functions on data distributed among the compute nodes of an operational group”) and scattering the reduced elements across the plurality of nodes (see e.g. paragraph 28: “Examples of vector variants of MPI operations include scatterv (the vector variant of a scatter operation)”). 

With respect to claim 5, Archer teaches: The method of claim 1, further comprising: 
for the initial stage of the plurality of parallel processing stages, communicating a plurality of messages from a first node of the plurality of nodes to other nodes of the plurality of nodes to communicate data from the other node to the first node (see e.g. paragraph 21: “A broadcast operation is a collective operation for moving data among compute nodes of an operational group… An allreduce operation functions as a reduce operation, followed by a broadcast (to store the result of the reduce operation in the result buffer of each process)”; paragraph 52: “four compute nodes participate in the vector collective operation: compute nodes 0 through 3”; and Fig. 4: “Allreduce”), and processing the communicated data in the first node to apply a reduction operation to the communicated data (see e.g. paragraph 21: “An allreduce operation functions as a reduce operation”). 

With respect to claim 7, Archer teaches: The method of claim 1, wherein the clusters comprise respective supernodes (see e.g. paragraph 34: “Compute core 101 contains M Psets 115A-C, each including a single I/O node 111 and N compute nodes 112”; and Fig. 3: “115A-C”), the method further comprising: 
communicating messages among the nodes of each supernode of the super nodes in the initial stage (see e.g. paragraph 35: “The compute nodes within a Pset 115 communicate with the corresponding I/O node over a corresponding local I/O collective network 113A-C”; paragraph 21: “An allreduce operation functions as a reduce operation, followed by a broadcast (to store the result of the reduce operation in the result buffer of each process)”; and Fig. 4); and 
(see e.g. paragraph 35: “The compute nodes within a Pset 115 communicate with the corresponding I/O node over a corresponding local I/O collective network 113A-C”; paragraph 21: “An allreduce operation functions as a reduce operation, followed by a broadcast (to store the result of the reduce operation in the result buffer of each process)”; and Fig. 4). 

With respect to claim 8, Archer teaches: The method of claim 1, wherein the clusters comprise supernodes (see e.g. paragraph 34: “Compute core 101 contains M Psets 115A-C, each including a single I/O node 111 and N compute nodes 112”; and Fig. 3: “115A-C”), and subsets of the supernodes are arranged in meshes (see e.g. paragraph 27: “a point-to-point network, such as a torus network”), the method further comprising: 
communicating messages among the nodes of each supernode in the initial stage (see e.g. paragraph 27: “Different networks connecting the compute nodes of the parallel computer have different characteristic… a point-to-point network, such as a torus network”; paragraph 21: “An allreduce operation functions as a reduce operation, followed by a broadcast (to store the result of the reduce operation in the result buffer of each process)”; and Fig. 4); 
communicating messages among the supernodes of each mesh in a second stage of the plurality of parallel processing stages (see e.g. paragraph 27: “Different networks connecting the compute nodes of the parallel computer have different characteristic… a point-to-point network, such as a torus network”; paragraph 21: “An allreduce operation functions as a reduce operation, followed by a broadcast (to store the result of the reduce operation in the result buffer of each process)”; and Fig. 4); and 
communicating messages among the meshes in a third stage of the plurality of parallel processing stages (see e.g. paragraph 27: “Different networks connecting the compute nodes of the parallel computer have different characteristic… a point-to-point network, such as a torus network”; paragraph 21: “An allreduce operation functions as a reduce operation, followed by a broadcast (to store the result of the reduce operation in the result buffer of each process)”; and Fig. 4). 

With respect to claim 9, Archer teaches: The method of claim 1, wherein the clusters comprise supernodes (see e.g. paragraph 34: “Compute core 101 contains M Psets 115A-C, each including a single I/O node 111 and N compute nodes 112”; and Fig. 3: “115A-C”), and the subsets of supernodes are arranged in meshes (see e.g. paragraph 27: “a point-to-point network, such as a torus network”), the method further comprising: 
communicating messages among the nodes of each supernode in the initial stage (see e.g. paragraph 27: “Different networks connecting the compute nodes of the parallel computer have different characteristic… a point-to-point network, such as a torus network”; paragraph 21: “An allreduce operation functions as a reduce operation, followed by a broadcast (to store the result of the reduce operation in the result buffer of each process)”; and Fig. 4); 
communicating messages among the supernodes of each mesh in a second stage of the plurality of parallel processing stages (see e.g. paragraph 27: “Different networks connecting the compute nodes of the parallel computer have different characteristic… a point-to-point network, such as a torus network”; paragraph 21: “An allreduce operation functions as a reduce operation, followed by a broadcast (to store the result of the reduce operation in the result buffer of each process)”; and Fig. 4); and 
communicating messages among the meshes in a plurality of other stages of the plurality of parallel processing stages (see e.g. paragraph 27: “Different networks connecting the compute nodes of the parallel computer have different characteristic… a point-to-point network, such as a torus network”; paragraph 21: “An allreduce operation functions as a reduce operation, followed by a broadcast (to store the result of the reduce operation in the result buffer of each process)”; and Fig. 4). 

With respect to claim 11, Archer teaches: A non-transitory computer readable storage medium to store instructions that, when executed by a parallel processing machine, causes the machine (see e.g. paragraph 10: “a computer-readable storage medium containing a program which, when executed, performs an operation to perform a collective operation on a parallel computer comprising a plurality of compute nodes, each compute node having at least a processor and a memory”) to: 
for each stage of a plurality of parallel processing stages (see e.g. paragraph 21: “A collective operation generally refers to a message-passing instruction that is executed simultaneously (or approximately so) by all the compute nodes of an operational group of compute nodes”), communicate messages among a plurality of processing nodes of the machine to exchange and reduce data (see e.g. paragraph 21: “An allreduce operation functions as a reduce operation, followed by a broadcast (to store the result of the reduce operation in the result buffer of each process)”; and Fig. 4), wherein each processing stage is associated with an injection bandwidth (see e.g. paragraph 27: “Different networks connecting the compute nodes of the parallel computer have different characteristics. For example, a first network may support transferring data with a lower latency than a second network, while the second network may support transferring data at a higher bandwidth than the first network”), the injection bandwidths differ (see e.g. paragraph 27: “a first network may support transferring data with a lower latency than a second network, while the second network may support transferring data at a higher bandwidth than the first network”), the plurality of processing nodes comprises subsets of nodes arranged in supernodes (see e.g. paragraph 34: “Compute core 101 contains M Psets 115A-C, each including a single I/O node 111 and N compute nodes 112”; and Fig. 3: “115A-C”), and subsets of the supernodes are arranged in meshes (see e.g. paragraph 27: “a point-to-point network, such as a torus network”); 
cause the nodes of each supernode to communicate with each other to reduce data in an initial stage of the plurality of parallel processing stages (see e.g. paragraph 27: “Different networks connecting the compute nodes of the parallel computer have different characteristic… a point-to-point network, such as a torus network”; paragraph 21: “An allreduce operation functions as a reduce operation, followed by a broadcast (to store the result of the reduce operation in the result buffer of each process)”; and Fig. 4), wherein the initial stage is associated with the highest injection bandwidth of the associated injection bandwidths (see e.g. paragraph 8: “The primary communication strategy for the Blue Gene/L system is message passing over a torus network”; and paragraph 27: “a torus network (higher bandwidth)”);
cause the super nodes of each mesh to communicate with each other to reduce data in a second stage of the plurality of parallel processing stages (see e.g. paragraph 27: “Different networks connecting the compute nodes of the parallel computer have different characteristic… a point-to-point network, such as a torus network”; paragraph 21: “An allreduce operation functions as a reduce operation, followed by a broadcast (to store the result of the reduce operation in the result buffer of each process)”; and Fig. 4); and 
cause the meshes to communicate with each other to reduce data in at least one other third stage of the plurality of parallel processing stages (see e.g. paragraph 27: “Different networks connecting the compute nodes of the parallel computer have different characteristic… a point-to-point network, such as a torus network”; paragraph 21: “An allreduce operation functions as a reduce operation, followed by a broadcast (to store the result of the reduce operation in the result buffer of each process)”; and Fig. 4).

Even though Archer discloses primarily using a higher bandwidth network, such as a torus network, for the nodes (see e.g. paragraphs 4, 27; Fig. 2) and using a low-latency low-bandwidth network, such as a collective network, for the nodes (see e.g. paragraph 27), Archer does not explicitly disclose using the low-bandwidth network after using the higher bandwidth network at the initial stage.
However, Archer does disclose connecting the nodes using both the low-bandwidth and high-bandwidth networks (see e.g. paragraph 27: “compute nodes of the parallel computer may be connected by both a collective network (lower latency) and a point-to-point network, such as a torus network (higher bandwidth)”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to implement node-to-node communications to perform a collective operation (e.g. allreduce in Fig. 4) by utilizing a higher-bandwidth network followed by utilizing a lower-bandwidth network based on the communication requirements of the nodes. The motivation/suggestion would be to accommodate particular communication protocols used by the nodes and to avoid any inter-node messaging incompatibilities (see e.g. paragraphs 25-26).

With respect to claim 12, Archer teaches: The computer readable storage medium of claim 11, wherein the computer readable storage medium stores instructions that, when executed by the parallel processing machine, cause the machine to provide a message interface library providing a function that allows ordering of the stages (see e.g. paragraph 10: “The torus network allows application programs developed for parallel processing systems to use high level interfaces such as Message Passing Interface (MPI) and Aggregate Remote Memory Copy Interface (ARMCI) to perform computing tasks and to distribute data among a set of compute nodes”; and paragraph 28: “the ALU may be used to construct descriptors in vector variants of MPI operations. The vector variants of MPI operations may require a displacement array and a length array. Examples of vector variants of MPI operations include scatterv (the vector variant of a scatter operation) and gatherv (the vector variant of a gather operation)”). 

With respect to claim 13, Archer teaches: The computer readable storage medium of claim 11, wherein the computer readable storage medium stores instructions that, when executed by the parallel processing machine, cause the machine to order the stages according to the associated injection bandwidths so that a stage associated with a relatively higher injection bandwidth is performed (see e.g. paragraph 8: “The primary communication strategy for the Blue Gene/L system is message passing over a torus network”; and paragraph 27: “networks connecting the compute nodes of the parallel computer have different characteristics… second network may support transferring data at a higher bandwidth… a torus network (higher bandwidth)”) before a stage associated with a relatively lower injection bandwidth (see e.g. paragraph 27: “Different networks connecting the compute nodes of the parallel computer have different characteristics. For example, a first network may support transferring data with a lower latency”; and paragraph 25: “the low latency protocol provides a low bandwidth”). 
Even though Archer discloses primarily using a higher bandwidth network, such as a torus network, for the nodes (see e.g. paragraphs 4, 27; Fig. 2) and using a low-latency low-bandwidth network, such as a collective network, for the nodes (see e.g. paragraph 27), Archer does not explicitly disclose using the low-bandwidth network for stages after using the higher bandwidth network at the initial stage.
However, Archer does disclose connecting the nodes using both the low-bandwidth and high-bandwidth networks (see e.g. paragraph 27: “compute nodes of the parallel computer may be connected by both a collective network (lower latency) and a point-to-point network, such as a torus network (higher bandwidth)”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to implement node-to-node communications to perform a collective operation (e.g. allreduce in Fig. 4) by utilizing a higher-bandwidth network followed by utilizing a lower-bandwidth network based on the communication requirements of the nodes. The motivation/suggestion would be to accommodate particular communication protocols used by the nodes and to avoid any inter-node messaging incompatibilities (see e.g. paragraphs 25-26).

With respect to claim 15, Archer teaches: A system (see e.g. Fig. 1) comprising: 
a plurality of processing meshes (see e.g. paragraph 27: “a point-to-point network, such as a torus network”), wherein: 
each mesh comprises a plurality of supernodes (see e.g. paragraph 34: “Compute core 101 contains M Psets 115A-C, each including a single I/O node 111 and N compute nodes 112”; and Fig. 3: “115A-C”); and 
each supernode comprises a plurality of computer processing nodes (see e.g. paragraph 34: “M Psets 115A-C, each including a single I/O node 111 and N compute nodes 112, for a total of MxN compute nodes 112” and Fig. 1: “C Node 0-N-1 112A-I”); and 
a coordinator (see e.g. paragraph 21: “An operational group may be implemented, for example, as an MPI "communicator" object”) to separate a reduce-scatter parallel processing operation for a first dataset into a plurality of parallel processing phases (see e.g. paragraph 21: “A collective operation generally refers to a message-passing instruction that is executed simultaneously (or approximately so) by all the compute nodes of an operational group of compute nodes… Examples of collective operations include… a reduce operation”; and paragraph 28: “Examples of vector variants of MPI operations include scatterv (the vector variant of a scatter operation)”) comprising a first phase (see e.g. Fig. 4: “Compute Node 0”), a second phase (see e.g. Fig. 4: “Compute Node 1”) and at least one additional phase (see e.g. Fig. 4: “Compute Node 2”, “Compute Node 3”), wherein: 
in the initial phase, the computer processing nodes of each supernode communicate messages with each other to reduce the first dataset to provide a second dataset (see e.g. paragraph 21: “An allreduce operation functions as a reduce operation, followed by a broadcast (to store the result of the reduce operation in the result buffer of each process)”; paragraph 52: “four compute nodes participate in the vector collective operation: compute nodes 0 through 3… The displacement contribution array for the respective compute node specifies the contribution of the respective compute node to displacements of each other compute node participating in the vector collective operation”; and Fig. 4: “Allreduce”); 
in the second phase, the supernodes of each mesh communicate messages with each other to reduce the second dataset to produce a third dataset (see e.g. paragraph 21: “An allreduce operation functions as a reduce operation, followed by a broadcast (to store the result of the reduce operation in the result buffer of each process)”; paragraph 52: “four compute nodes participate in the vector collective operation: compute nodes 0 through 3… The displacement contribution array for the respective compute node specifies the contribution of the respective compute node to displacements of each other compute node participating in the vector collective operation”; and Fig. 4: “Allreduce”); and 
in the at least one additional phase, the meshes communicate messages with each other to further reduce the third dataset (see e.g. paragraph 21: “An allreduce operation functions as a reduce operation, followed by a broadcast (to store the result of the reduce operation in the result buffer of each process)”; paragraph 52: “four compute nodes participate in the vector collective operation: compute nodes 0 through 3… The displacement contribution array for the respective compute node specifies the contribution of the respective compute node to displacements of each other compute node participating in the vector collective operation”; and Fig. 4: “Allreduce”). 
Even though Archer discloses an exemplary allreduce operation as a collective parallel processing operation going through different phases of execution at each compute node (see e.g. Fig. 4), Archer does not explicitly disclose a specific case for the scatter operation.
However, note that Archer does disclose scatter operations as additional exemplary collective parallel processing operations (see e.g. Archer, paragraph 28).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to implement node-to-node communications to perform reduce-scatter collective operations. The motivation/suggestion would be to provide different chunks of vector data to the nodes in order to accommodate each node’s requirements (as opposed to using broadcast).
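
For illustration only, the distinction drawn above between broadcasting a reduced result (allreduce) and providing each node with a different chunk of the reduced vector (reduce-scatter) can be shown with a short simulation. The sketch is not taken from Archer; the helper names elementwise_sum, allreduce, and reduce_scatter, and the assumption that the vector length is a multiple of the node count, are hypothetical and chosen only for the example.

```python
def elementwise_sum(vectors):
    """Reduce step: sum the contributions of all nodes position by position."""
    return [sum(column) for column in zip(*vectors)]


def allreduce(vectors):
    """Reduce, then broadcast: every node ends up with the full reduced vector."""
    reduced = elementwise_sum(vectors)
    return [list(reduced) for _ in vectors]


def reduce_scatter(vectors):
    """Reduce, then scatter: node i ends up with only the i-th chunk."""
    n = len(vectors)
    reduced = elementwise_sum(vectors)
    assert len(reduced) % n == 0, "sketch assumes equally sized chunks"
    chunk = len(reduced) // n
    return [reduced[i * chunk:(i + 1) * chunk] for i in range(n)]


if __name__ == "__main__":
    # Four simulated nodes, each contributing a four-element vector.
    contributions = [
        [1, 2, 3, 4],
        [10, 20, 30, 40],
        [100, 200, 300, 400],
        [1000, 2000, 3000, 4000],
    ]
    print(allreduce(contributions))       # each node holds all four sums
    print(reduce_scatter(contributions))  # each node holds one of the sums
```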

With respect to claim 16, Archer teaches: The system of claim 15, wherein the coordinator comprises a Message Passing Interface (MPI) (see e.g. paragraph 21: “An operational group may be implemented, for example, as an MPI "communicator" object”). 

With respect to claim 17, Archer teaches: The system of claim 15, wherein the computer processing node comprises a plurality of processing cores (see e.g. paragraph 5: “Each compute node includes a single application specific integrated circuit (ASIC) with 2 CPU's and memory”). 

With respect to claim 18, Archer teaches: The system of claim 15, wherein in the initial phase, a given computer processing node of a given supernode communicates multiple messages with another computer processing node of the given supernode (see e.g. paragraph 21: “An allreduce operation functions as a reduce operation, followed by a broadcast (to store the result of the reduce operation in the result buffer of each process)”; and Fig. 4). 

With respect to claim 19, Archer teaches: The system of claim 18, wherein, in the at least one additional phase comprises a third phase, and in the third phase, each mesh communicates a single message with another mesh (see e.g. paragraph 27: “Different networks connecting the compute nodes of the parallel computer have different characteristic… a point-to-point network, such as a torus network”; paragraph 21: “An allreduce operation functions as a reduce operation, followed by a broadcast (to store the result of the reduce operation in the result buffer of each process)”; and Fig. 4). 

With respect to claim 20, Archer teaches: The system of claim 15, wherein the computer processing node comprises a server blade (see e.g. paragraph 5: “one family of parallel computing systems has been (and continues to be) developed by International Business Machines (IBM) under the name Blue Gene®. The Blue Gene/L architecture provides a scalable, parallel computer that may be configured with a maximum of 65,536 (2^16) compute nodes. Each compute node includes a single application specific integrated circuit (ASIC) with 2 CPU's and memory”).
Since Archer discloses Blue Gene/L architecture, Archer inherently discloses server blades included in this architecture.

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Archer as applied to claim 9 above, and further in view of Jain et al. (“Collectives on Two-tier Direct Networks”; September 2012; from IDS filed on 02/27/2020; hereinafter Jain).

With respect to claim 10, Archer teaches:  The method of claim 9, wherein communicating messages among the meshes in a plurality of other stages of the plurality of parallel processing stages comprises communicating (see e.g. Archer paragraph 27: “Different networks connecting the compute nodes of the parallel computer have different characteristic… a point-to-point network, such as a torus network”; paragraph 21: “An allreduce operation functions as a reduce operation, followed by a broadcast (to store the result of the reduce operation in the result buffer of each process)”)
Archer does not teach, but Jain teaches:
according to a Rabenseifner-based algorithm (see e.g. Jain, page 4, section 4, Table 1: “Rabenseifner’s Reduce-Scatter with Gather”). 
Archer and Jain are analogous art because they are in the same field of endeavor: managing communications between parallel processing stages for performing a collective operation by a plurality of nodes. Therefore, it would have been obvious to one with ordinary skill in the art before the effective filing date of the claimed invention to modify Archer with the teachings of Jain. The motivation/suggestion would be to accommodate imbalanced process arrival times.
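
For illustration only, the reduce-scatter phase of a Rabenseifner-style algorithm is commonly described as recursive halving: at each step a node pairs with a partner, keeps one half of its active range, and adds the partner's values for that half. The sketch below simulates that phase in a single process. It is not taken from Jain or Archer; the function name recursive_halving_reduce_scatter, the power-of-two node count, and the one-element-per-node layout are assumptions made for the example.

```python
def recursive_halving_reduce_scatter(vectors):
    """Simulate recursive-halving reduce-scatter on n = 2**k nodes.

    vectors[r] is node r's length-n input vector (assumed layout).
    Returns a list whose r-th entry is the fully reduced element r,
    which is the element node r holds at the end of the exchange.
    """
    n = len(vectors)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two node count"
    assert all(len(v) == n for v in vectors)

    data = [list(v) for v in vectors]
    lo = [0] * n          # start of the index range each node is still reducing
    distance = n // 2
    while distance >= 1:
        new_data = [list(v) for v in data]
        for r in range(n):
            partner = r ^ distance
            # Nodes with this bit clear keep the lower half of the range;
            # nodes with it set keep the upper half.
            keep_lo = lo[r] if r & distance == 0 else lo[r] + distance
            for j in range(keep_lo, keep_lo + distance):
                new_data[r][j] = data[r][j] + data[partner][j]
            lo[r] = keep_lo
        data = new_data
        distance //= 2

    # After log2(n) steps node r's range has shrunk to the single index r.
    return [data[r][lo[r]] for r in range(n)]


if __name__ == "__main__":
    example = [[10 * i + j for j in range(4)] for i in range(4)]
    print(recursive_halving_reduce_scatter(example))
```

The number of exchange steps in this sketch grows with the base-2 logarithm of the node count, and the amount of data exchanged halves at every step.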

Applicant's arguments filed 08/26/2021 have been fully considered but they are not persuasive. In detail:

(1)	Regarding claim 1, Applicant argues that “Archer does not disclose or render obvious a particular order for parallel processing stages for a reduce-scatter operation” (Remarks, pages 8-9).
	However, note that Archer does disclose an allreduce operation that comprises a reduce operation (i.e. an initial stage) followed by a broadcast operation (i.e. a subsequent stage for distributing/scattering the results from the reduce operation), both of which are disclosed as types of collective operations that are executed in parallel within clusters of nodes (see e.g. paragraph 21: “An allreduce operation functions as a reduce operation, followed by a broadcast (to store the result of the reduce operation in the result buffer of each process)”; paragraph 27: “Different networks connecting the compute nodes of the parallel computer have different characteristic… As another example, a cluster may be built”; and Fig. 4). 
	Therefore, Archer does disclose different stages for a reduce-scatter operation, which in turn teaches the limitations “an initial stage of the plurality of parallel processing stages” and “a subsequent stage of the plurality of parallel processing stages” as recited in claim 1. Consequently, the Examiner maintains the rejection directed to claim 1. For more details, please see the corresponding rejection above.

(2)	Regarding Applicant’s arguments with respect to the reasons for modifying claim 1 (Remarks, page 9), note that Archer does disclose a particular order for the processing stages of a reduce-scatter operation (e.g. a reduce operation stage followed by a broadcast operation stage) as described above; that is, no modifications are necessary regarding the particular ordering of the processing stages.
	However, as also noted in this and the previous Office Actions, even though Archer discloses using a low latency network or a high latency network (see paragraph 27), Archer does not explicitly disclose using a low-bandwidth network “subsequent” to using a higher bandwidth network for the processing stages.
	On the other hand, since Archer discloses using different latency networks based on the communication requirements of the nodes (see paragraphs 25-26), one of ordinary skill in the art would recognize that the bandwidth utilizations associated with particular processing stages could be arranged (e.g. higher bandwidth for the initial reduce operation stage followed by lower bandwidth for the broadcasting stage) in order to accommodate particular communication protocols used by the nodes for performing the parallel processing (see e.g. paragraphs 25-26). This would result in the benefit of avoiding potential inter-node messaging incompatibilities, such as lost data packets due to imbalanced network traffic. The Examiner also notes that these reasons are highlighted in both this and the previous Office Actions.


(3)	Regarding Applicant’s arguments with respect to the rejection directed to claim 11 (Remarks, page 9), note that Archer does disclose a particular order of parallel processing stages as described above in item (1). Further note that Archer discloses a “store” operation (i.e. a third stage) following the broadcast operation as part of the parallel processing stages (see e.g. paragraph 21: “An allreduce operation functions as a reduce operation, followed by a broadcast (to store the result of the reduce operation in the result buffer of each process)”).
	Therefore, Archer teaches the limitation “at least one other third stage of the plurality of parallel processing stages” as recited in claim 11, and the Examiner maintains the corresponding rejection. For more details, please see the rejection directed to claim 11 above.

CONCLUSION
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 

Contact Information
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Umut Onat whose telephone number is (571)270-1735.  The examiner can normally be reached on M-Th 9:00-7:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Dennis Chow, can be reached at (571) 272-7767.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/UMUT ONAT/Primary Examiner, Art Unit 2194