DETAILED ACTION

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This final office action is responsive to the amendments filed on 01/28/2021.
Claims 1-20 are pending.

Response to Amendment

Applicant has amended independent claims 1, 9, and 20 and dependent claims 2-8, 10-17, and 19 to include new/old limitations in a form not previously presented, necessitating a new search and further consideration.

Claim Objections

Claim 1 is objected to because of the following informalities:
-- are be -- should be -- are -- in claim 1.
Appropriate correction is required.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.



Claims 1-20 are rejected under 35 U.S.C. 112 (b), as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor regards as the invention.

The following claim language is not clearly understood:
Claim 1 recites a “sub-set of execution lanes”. It is unclear whether the sub-set may be the null set (i.e., a sub-set in which no execution lanes are selected) or the complete set (i.e., a sub-set in which every execution lane is selected).

Claim 4 recites “output data is provided to a thread” and “to the same execution lane”. It is unclear if there is any correspondence between the thread and execution lane to which the data output has been provided (i.e. output is provided to any thread or a thread corresponding to the execution lane). 

Claims 9 and 20 recite elements of claim 1 and have a similar deficiency to claim 1. Therefore, they are rejected for the same rationale. The remaining dependent claims are also rejected because of their dependency on the rejected independent claims.


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 4-10, 12-17, 19-20 is/are rejected under 35 USC 103 as being unpatentable over Chen et al. (US Pub. No. 2019/0004814 A1, hereafter Chen) in view of Applicant Admitted Prior Art (hereafter AAA).
AAA was cited in the last office action.
Highlighted claim elements are missing from the respective cited prior art.

As per claim 1, Chen teaches the invention substantially as claimed including a method of operating a data processor (fig 4 vector processors) configured to execute programs to perform data processing operations ([0027] vector processor, data operands, execution pipelines) and in which execution threads executing a program to perform data processing operations are be grouped together into thread groups in which the plural threads of a thread group each execute a set of instructions in lockstep (fig 4 execution pipelines, separate quadrants [0027] execution pipelines, data operands, instructions, executed [0029] quadrant, execution pipeline [0041] separate sets of threads); 
the data processor comprising (fig 4 vector processors): 
an instruction execution processing circuit configured to execute instructions to perform processing operations for execution threads executing a program (fig 4 execution pipelines, crossbars, multiplexer [0029] [0041] operands, multiplexer, separate sets of threads), the execution processing circuit being configured as a plurality of execution lanes (fig 4 multiple execution pipelines 455 460 470 465), each execution lane being configured to perform processing operations for a respective execution thread of a thread group (fig 4 execution pipelines [0041] separate sets of threads [0035] execution lanes of the vector processor [0013] multi-lane execution pipeline); and 
a cross-lane permutation circuit, the cross-lane permutation circuit comprising a plurality of data lanes ([0001] crossbar, multiple lanes that allows data on lane, input, output [0013] permutation, multiple crossbars fig 4 [0022] cross-lane permutation, data routed to crossbar [0023] crossbar, multi-lane, number of lanes), each data lane having an input and an output ([0001] crossbar, multiple lanes, data, input, output [0022] data, conveyed, crossbar [0023] input lanes, output lanes, multiple smaller crossbar, four 8x8 lane crossbar [0024] output of crossbar), the cross-lane permutation circuit having fewer data lanes than the number of execution lanes of the instruction execution processing circuit ([0022] cross-lane permutation [0023] multi-lane crossbar, numbers of lanes, 4 8x8 lane crossbar, 16 lane execution pipeline [0039] crossbar, smaller, than the lane width of the vector unit), wherein each data lane is associated with a different sub-set of the execution lanes of the instruction execution processing circuit ([0016] crossbar, route data, proper, processing lanes; fig 4 cross bar, multiplexer, execution pipelines), such that each execution lane is contained within only one of the sub-sets of execution lanes ([0001] vector processor, plurality of processing lanes, data processing operations, parallel, upon respective operands [0029] fig 4 execution pipelines [0039] vector unit 32 lanes, first lanes 0-15, second lanes 16-31), wherein each data lane is configured to only receive input data associated with and provide output data for use by execution lanes of ([0001] vector processor, plurality of processing lanes, data processing operations, parallel, upon respective operands [0039] The desired output 805 is shown below the lane IDs, with desired output 805 showing how the data should be arranged in the lanes of the vector unit subsequent to the permutation fig 8 lane id-desired output data lanes fig 7 sort, operands, set of lanes, target lanes 720 725 730 735 740), wherein the cross-lane permutation circuit is configured, for each data lane, allow a data value input to that data lane to be provided to zero or more of the other data lanes ([0022] requires, cross-lane permutation [0023] multi-lane, cross-bar, permutate operand from input lanes, appropriate output lanes), such that the provided data value will be output by the zero or more other data lanes ([0023] multi-lane, cross-bar, permutate operand from input lanes, appropriate output lanes); 
the method comprising, when executing an instruction for the threads of a thread group using the execution processing circuit (fig 4 execution pipelines [0041] separate set of threads), to provide a data value or values from a thread or threads of a thread group to another thread or threads of the thread group, performing one or more permutation processing passes using the cross-lane permutation circuit ([0022] cross-lane permutations fig 8 pass 810A-B [0040] first pass [0041] next pass), wherein the data processor (fig 4 vector processor) is configured to allow, for any given processing pass (fig 1 pass 810A/B), any single execution lane of the sub-set of execution lanes associated with a data lane of the cross-lane permutation circuit to be selected to provide input data to the associated data lane (fig 8 lane ID desired output 805 pass 810 lane ID-direct write [0030] output lanes, crossbar, coupled, execution pipeline), each permutation processing ([0022] cross-lane permutations fig 8 pass 810A-B [0040] first pass [0041] next pass): 
for each data lane of the cross-lane permutation circuit (fig 8 lane ids [0038] performing multi-step permutation, crossbar [0001] crossbar, multiple lanes that allows data on lane, input, output), selecting from any of the execution lanes of the sub-set of execution lanes associated with the data lane a single execution lane from which to provide input data to the data lane ([0039] vector unit, 32 lanes, first 16 lanes, first crossbars, permutating, second 16 lanes, second crossbar for permutating data), and providing said input data from the selected single execution lane executing a corresponding execution thread of the thread group to the data lane (fig 8 lane ID 0-31, pass 810A-B direct write, cross-write [0039] vector units, 32 lanes, first 16 lanes first crossbar permutating data across 0-15 lanes [0041] separate set of threads); 
the cross-lane permutation circuit performing a permutation operation using data provided to the data lanes, to provide as an output for at least one of the data lanes ([0022] cross-lane permutation, data routed to crossbar [0023] crossbar, multi-lane, number of lanes fig 7 perform a first permutation by permutating operand across a first set of N/2 lanes using a first N/2xN/2 crossbar 705 write the results of first permutation 710; fig 8 810A), a data value input to a different one of the data lanes (fig 7 perform a second permutation to arrange operands in each set of lanes to be cross written to other set of lanes 715; fig 8 810B); and 
the cross-lane permutation circuit, after performing the permutation operation, for at least one of the data lanes ([0022] cross-lane permutation, data routed to crossbar [0023] crossbar, multi-lane, number of lanes fig 7 perform a second permutation to arrange operands in each set of lanes to be cross written to other set of lanes 715; fig 8 810B), providing the data value output from the data lane for use by one of the execution lanes of the sub-set of execution lanes associated with that data lane (fig 7 perform a second permutation to arrange operands in each set of lanes to be cross written to other set of lanes 715; fig 8 810B) executing a corresponding execution thread of the thread group (fig 4 execution pipelines [0041] set of threads fig 7  convey the merged results to the multi-lane execution pipeline 740).
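
For illustration only, the arrangement recited in claim 1 may be sketched as follows. The sketch is not taken from Chen or AAA and is not a construction of the claim; the lane counts, the contiguous equal-sized grouping of execution lanes, and every name used (lane_subsets, permutation_pass, selected, routing, targets) are assumptions chosen for the example.

```python
from typing import Dict, List, Optional, Sequence


def lane_subsets(num_exec_lanes: int, num_data_lanes: int) -> List[List[int]]:
    """Split execution lanes into disjoint sub-sets, one per data lane.

    Contiguous, equal-sized sub-sets are an assumption of this sketch.
    """
    if num_data_lanes <= 0 or num_exec_lanes % num_data_lanes != 0:
        raise ValueError("execution lanes must divide evenly among data lanes")
    size = num_exec_lanes // num_data_lanes
    return [list(range(d * size, (d + 1) * size)) for d in range(num_data_lanes)]


def permutation_pass(
    exec_values: Sequence[int],
    subsets: Sequence[Sequence[int]],
    selected: Sequence[int],
    routing: Sequence[int],
    targets: Sequence[Optional[int]],
) -> Dict[int, int]:
    """Run one hypothetical permutation processing pass.

    selected[d]: execution lane (within subsets[d]) feeding data lane d
    routing[d]:  data lane whose input value is output by data lane d
    targets[d]:  execution lane (within subsets[d]) receiving data lane d's
                 output, or None if that data lane provides no output
    Returns a mapping from receiving execution lane to the value delivered.
    """
    for d, lane in enumerate(selected):
        if lane not in subsets[d]:
            raise ValueError(f"lane {lane} is not in the sub-set of data lane {d}")
    # Each data lane takes input from exactly one execution lane of its sub-set.
    inputs = [exec_values[lane] for lane in selected]
    # The permutation moves values between data lanes only.
    outputs = [inputs[src] for src in routing]
    delivered: Dict[int, int] = {}
    for d, target in enumerate(targets):
        if target is None:
            continue
        if target not in subsets[d]:
            raise ValueError(f"lane {target} is not in the sub-set of data lane {d}")
        delivered[target] = outputs[d]
    return delivered
```

With 8 execution lanes and 4 data lanes, lane_subsets(8, 4) gives the sub-sets [0, 1], [2, 3], [4, 5], [6, 7], so the permutation circuit in this sketch is half the width of the execution lanes.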

Chen does not specifically teach executing instructions in lockstep, or providing a data value or values from a thread or threads of a thread group to another thread or threads of the thread group.

AAA, however, teaches executing instructions in lockstep ([0001] in which the plural threads of a thread group can each execute a set of instructions in lockstep), and providing a data value or values from a thread or threads of a thread group to another thread or threads of the thread group ([0001] provision of data from one or more threads of a thread group to one or more other threads of thread group [0006] copy/transfer data between the threads of a thread group [0007] cross-lane operations).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Chen with the teachings of AAA of executing instructions in lockstep and providing data from one or more threads of a thread group to one or more other threads of the thread group for performing cross-lane operations, in order to improve efficiency and to allow the method of Chen to execute instructions in lockstep and to provide a data value or values from a thread or threads of a thread group to another thread or threads of the thread group, as in the instant invention. The combination of analogous arts (Chen [0001]; AAA [0001]) would have been obvious because it applies the known method of exchanging data among threads of a thread group, as taught by AAA, to the cross-lane operation method taught by Chen to yield the predictable result of providing data between threads of a thread group, with a reasonable expectation of success and motivated by improved efficiency (Chen [0001]; AAA [0002]).
 

As per claim 2, Chen teaches the permutation operation for a permutation processing pass comprises at least one of: 
retaining an input data value provided to a data lane such that an output data value for the data lane corresponds to the input data value that was provided to the data lane (fig 3 vector register file 305 [0027] data operands, retrieved, vector register file, coupled to crossbars); 
copying an input data value from a data lane to at least one other data lanes ([0013] permutations, crossbar, operands across lanes [0014] crossbar rearranges operands between lanes); and 
moving an input data value from a data lane to another data lane ([0014] crossbar rearranges operands between lanes).
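
By way of a hedged illustration, the three permutation operations listed in claim 2 (retaining, copying, moving) can each be expressed as a choice of source data lane per output data lane. The function name route, the source-index encoding, and the four-lane example values are assumptions made for this sketch and do not appear in Chen or in the claims.

```python
from typing import List, Sequence


def route(inputs: Sequence[str], sources: Sequence[int]) -> List[str]:
    """Return the data lane outputs, where output d takes the input of lane sources[d]."""
    return [inputs[s] for s in sources]


lane_inputs = ["a", "b", "c", "d"]

# Retaining: every data lane outputs the value it received.
retained = route(lane_inputs, [0, 1, 2, 3])

# Copying: the input of data lane 0 is also output by data lanes 1 and 2.
copied = route(lane_inputs, [0, 0, 0, 3])

# Moving: data lanes 1 and 2 each output the value input to the other.
moved = route(lane_inputs, [0, 2, 1, 3])
```

A single source list can mix the three operations across lanes, which is consistent with the claim's "at least one of" wording.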



As per claim 4, Chen teaches wherein for any given permutation processing pass, for each data lane from which an output data value is provided to a thread, the output data value from the data lane is output to the same execution lane of the sub-set of execution lanes that is associated with the data lane (fig 8 lane IDs - desired output, pass 810A input-direct write [0023] multi-lane crossbar configured to permutate operands from input lanes to the appropriate output lanes [0012] instruction, doesn’t require permutation on the input operands).  


As per claim 5, Chen teaches wherein when executing an instruction using the cross-lane permutation circuit ([0022] cross-lane permutation, data routed to crossbar [0023] crossbar, multi-lane, number of lanes ), the cross-lane permutation circuit performs a sequence of permutation processing passes (fig 8 Lane IDs, pass 810A pass 810B merge with pass 810A); 
wherein for any given permutation processing pass, for each data lane from which an output data value is provided to a thread, the output data value from the data lane is provided to a different execution lane of the sub-set of execution lanes associated with the data lane compared to other passes in the sequence of passes (fig 8 lane IDs - desired output, pass 810B cross write [0023] multi-lane crossbar configured to permutate operands from input lanes to the appropriate output lanes [0012] instruction, require permutation on the input operands, conveys the input operands to the multi-lane execution pipeline via the crossbar fig 7 715).  

As per claim 6, Chen teaches wherein for each data lane, the output data values are provided to the execution lanes of the sub-set of execution lanes associated with the data lane over the sequence of permutation processing passes in a predetermined order (fig 8 lane IDs 0-31 pass 810A direct write pass 810B cross write merge with pass 810A fig 7 sort, operands, set of lanes, align their target lanes 720 725 write aligned operands to set of lanes 730 735 merge results 740).

As per claim 7, Chen teaches wherein for any given permutation processing pass, for each data lane, the input data value for the data lane is provided from the same execution lane of the sub-set of execution lanes that is associated with the data lane (fig 8 lane IDs - desired output, pass 810A input-direct write [0023] multi-lane crossbar configured to permutate operands from input lanes to the appropriate output lanes fig 7 sort/align operands 720 725).  

As per claim 8, Chen teaches wherein the input data to the data lanes for a given permutation processing pass are selected based on at least one of: 
predetermined threads to which output data will be provided in that pass ([0030] output lanes of crossbars, coupled, execution pipelines [0031] coupling of the output lanes, crossbars, execution units, allows permutations [0041] cross-writing between separate set of threads fig 8 810A direct write); and 
the threads from which the predetermined threads will require data according to the cross-lane instruction ([0031] coupling of the output lanes, crossbars, execution units, allows permutations [0041] cross-writing between separate set of threads fig 8 810B cross-write).

Claim 9 recites a data processor configured to execute programs to perform limitations similar to those of claim 1. Therefore, it is rejected for the same rationale.

Claim 10 recites the data processor to perform limitations similar to those of claim 2. Therefore, it is rejected for the same rationale.
Claim 12 recites the data processor to perform limitations similar to those of claim 4. Therefore, it is rejected for the same rationale.
Claim 13 recites the data processor to perform limitations similar to those of claim 5. Therefore, it is rejected for the same rationale.
Claim 14 recites the data processor to perform limitations similar to those of claim 6. Therefore, it is rejected for the same rationale.
Claim 15 recites the data processor to perform limitations similar to those of claim 7. Therefore, it is rejected for the same rationale.
Claim 16 recites the data processor to perform limitations similar to those of claim 8. Therefore, it is rejected for the same rationale.

As per claim 17, Chen teaches wherein the cross-lane permutation circuit comprises a control circuit configured to control the cross-lane permutation circuit to perform a permutation processing pass by setting which threads will provide input data to the data lanes in a pass, and configuring the permutation circuit to perform a permutation operation for the pass ([0001] routing performed by the crossbar, dependent, control signal, control logic [0022] cross-lane permutation, data routed to crossbar [0023] crossbar, multi-lane, number of lanes [0040] [0041] lane ID fig 8 pass 810A direct write pass 810B cross write [0031] coupling of output lanes of crossbar to the various execution units, permutations).


As per claim 19, Chen teaches wherein the cross-lane permutation circuit comprises a control circuit, wherein the control circuit is configured to control ([0001] routing performed by the crossbar, dependent, control signal, control logic [0022] cross-lane permutation, data routed to crossbar) which execution lanes the output data from the data lanes is provided to in a pass based on a predetermined order in which data is to be output to the execution lanes in a sequence of passes (fig 8 810A direct write pass 810B cross write pass, lane IDs - desired output [0023] multi-lane crossbar configured to permutate operands from input lanes to the appropriate output lanes fig 7 sort/align operands 720 725).


Claim 20 recites a non-transitory computer readable storage medium storing computer software code which when executing on a processor performs (Chen [0042]) limitations similar to those recited in claim 1. Therefore, it is rejected for the same rationales.


Claims 3, 11 is/are rejected under 35 USC 103 as being unpatentable over Chen in view of Applicant Admitted Prior Art (hereafter AAA) and further in view of Han et al. (US Pub. No. 2018/0018299 A1, hereafter Han).
Han was cited in the last office action.

As per claim 3, Chen teaches when executing an instruction using the cross-lane permutation circuit ([0022] cross-lane permutation, data routed to crossbar [0023] crossbar, multi-lane, number of lanes ), the cross-lane permutation circuit performs a sequence of permutation processing passes (fig 8 Lane IDs, pass 810A pass 810B merge with pass 810A); 
wherein the number of passes in the sequence of permutation passes equals the number of execution lanes in the sub-set of execution lanes associated with each data lane, wherein each sub-set of execution lanes associated with each data lane comprises a same number of execution lanes ([0014] N/2 lanes of multi-lane execution pipeline using a first set of N/2xN/2 crossbar, the first crossbar rearranges operands between lanes of the first set of N/2 lanes while the second crossbar rearranges operands between lanes of the second set of N/2 lanes).  
Chen and AAA, in combination, do not specifically teach wherein the number of passes in the sequence of permutation passes equals the number of execution lanes in the sub-set of execution lanes associated with each data lane.
Han, however, teaches the number of passes in the sequence of permutation passes equals the number of execution threads in the sub-set of execution threads associated with each data lane ([0017] shuffler circuit repeats steps for successive source/destination processing lanes - equivalent to the number of processing threads [0126] shuffle operation in N cycles N=W/M).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Chen and AAA with the teachings of Han of repeating steps for each processing lane based on the number of processing lanes that can receive the data, in order to improve efficiency and to allow, in the method of Chen and AAA, the number of passes in the sequence of permutation passes to equal the number of execution threads in the sub-set of execution threads associated with each data lane, as in the instant invention. The combination of analogous arts (Chen [0001]; AAA [0001]; Han [0002]) would have been obvious because it applies the known method of making the number of passes depend on the number of execution lanes receiving the data, as taught by Han, to the cross-lane permutation method taught by Chen and AAA to yield the predictable result that the number of passes in the sequence of permutation passes equals the number of execution lanes receiving the data in the sub-set of execution lanes, with a reasonable expectation of success and motivated by improved efficiency (Chen [0001]; AAA [0002]; Han [0004]).
Claim 11 recites the data processor to perform limitations similar to those of claim 3. Therefore, it is rejected for the same rationale.
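
The pass-count relationship discussed for claims 3 and 11 can be illustrated with a short sketch. It assumes equal-sized, contiguous sub-sets and reads the relation N = W/M cited from Han with W as the number of execution lanes and M as the number of data lanes; that reading, the function name pass_schedule, and the lane numbering are assumptions made for the example and are not drawn from the cited references.

```python
from typing import List


def pass_schedule(num_exec_lanes: int, num_data_lanes: int) -> List[List[int]]:
    """List, for each pass, the execution lane served by each data lane.

    The number of passes equals the sub-set size W / M, and in pass p every
    data lane serves the p-th execution lane of its own sub-set.
    """
    if num_data_lanes <= 0 or num_exec_lanes % num_data_lanes != 0:
        raise ValueError("execution lanes must divide evenly among data lanes")
    subset_size = num_exec_lanes // num_data_lanes
    return [
        [d * subset_size + p for d in range(num_data_lanes)]
        for p in range(subset_size)
    ]


schedule = pass_schedule(16, 4)  # 16 execution lanes, 4 data lanes
number_of_passes = len(schedule)  # 4 passes, one per lane of each sub-set
```

Under these assumptions every execution lane is served exactly once over the sequence, in a fixed order within its sub-set.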



Claim 18 is rejected under 35 USC 103 as being unpatentable over Chen in view of Applicant Admitted Prior Art (AAA), as applied to the claims above, and further in view of Van Berkel et al. (US Patent No. 8,510,534 B2, hereafter Van Berkel).
Van Berkel was cited in the last office action.

As per claim 18, Chen teaches the cross-lane permutation circuit comprises a control circuit ([0001] routing performed by the crossbar, dependent, control signal, control logic [0022] cross-lane permutation, data routed to crossbar), wherein the control circuit comprises a counter which is incremented during execution of an instruction to keep track of the permutation processing passes ([0001] control logic fig 8 pass 810A-B).  

Chen and AAA, in combination, do not specifically teach that the circuit comprises a counter which is incremented during execution of an instruction to keep track of passes.
Van Berkel, however, teaches a circuit comprising a counter which is incremented during execution of an instruction to keep track of passes (fig 4 loop unit, program counter 430 col 12 lines 1-5).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Chen and AAA with the teachings of Van Berkel of a loop unit program counter, in order to improve efficiency and to allow the circuit in the method of Chen and AAA to comprise a counter which is incremented during execution, as in the instant invention. The combination would have been obvious because it applies the known loop counter taught by the analogous art of Van Berkel to the teachings of Chen and AAA to count the number of passes, with a reasonable expectation of success and improved efficiency.
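
As a minimal sketch of the counter limitation of claim 18, and not a description of the loop unit of Van Berkel or the control logic of Chen, a pass counter incremented once per permutation processing pass might look as follows. The class name PassCounter, its methods, and the helper run_instruction are hypothetical.

```python
from typing import List


class PassCounter:
    """Hypothetical counter tracking permutation passes for one instruction."""

    def __init__(self, total_passes: int) -> None:
        if total_passes < 0:
            raise ValueError("total_passes must be non-negative")
        self.total_passes = total_passes
        self.current = 0

    def done(self) -> bool:
        """True once every pass of the instruction has been counted."""
        return self.current >= self.total_passes

    def advance(self) -> int:
        """Return the index of the pass being run, then increment the counter."""
        if self.done():
            raise RuntimeError("all passes already performed")
        index = self.current
        self.current += 1
        return index


def run_instruction(total_passes: int) -> List[int]:
    """Perform the passes of one instruction and return the pass indices in order."""
    counter = PassCounter(total_passes)
    performed: List[int] = []
    while not counter.done():
        performed.append(counter.advance())
    return performed
```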

Response to Arguments


The previous objections to the specification have been withdrawn.
The previous objections to drawings have been withdrawn.
The previous objections to the claims have been withdrawn. 
The previous rejections under 35 USC 112 (b) have been withdrawn.
The previous rejections under 35 USC 101 have been withdrawn.
Applicant's arguments filed on 01/28/2021 have been fully considered, but they are moot in view of the new ground(s) of rejection. This is stated without acquiescing to any characterization of the previously cited prior art by the Applicant, and only to advance prosecution.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Jiang et al. (US Pub. No. 2020/0162367 A1) teaches data stream transmission.
Lin (WO 2016/091164 A1) teaches a multilane/multicore system and method.
Schluessler et al. (US Pub. No. 2019/0066355 A1) teaches a method and apparatus for profile-guided graphics processing optimizations.
Xu et al. (US Pub. No. 2019/0073337 A1) teaches apparatuses capable of providing composite instructions in the instruction set architecture of a processor.

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ABU ZAR GHAFFARI whose telephone number is (571)270-3799.  The examiner can normally be reached on Monday-Thursday 9:00 - 17:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Meng-Ai AN can be reached on 571-272-3756.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


ABU ZAR GHAFFARI
Primary Examiner
Art Unit 2195



/ABU ZAR GHAFFARI/Primary Examiner, Art Unit 2195