DETAILED ACTION
Claims 1-5, 8-10, 12-18, and 21-22 have been examined.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority
Acknowledgment is made of applicant’s claim for foreign priority under 35 U.S.C. 119 (a)-(d).  The certified copy of IN 202041019062 was electronically retrieved by the USPTO on September 28, 2021.

Information Disclosure Statement
In the IDS submitted on September 23, 2021, the first cited document has been struck through because it had been previously cited by the examiner on July 22, 2021.  The strike-through is, thus, not indicative of a lack of consideration, but only of duplication.  The reference has been considered, as evidenced by the rejections set forth.

Specification
The lengthy specification has not been checked to the extent necessary to determine the presence of all possible minor errors. Applicant’s cooperation is requested in correcting any errors of which applicant may become aware in the specification.
The clean disclosure submitted on October 22, 2021, is objected to because of the following informalities:
In paragraph [0112], replace all instances of “509N” with --509K--.
In paragraph [0202], line 5, insert --the-- before “ALU”.
In paragraph [0202], line 8, insert --in-- after “shuffled”.
In paragraph [0202], line 10, delete the space between “1” and the comma.
Appropriate correction is required.

Drawings
Replacement FIG.16 is objected to because of the following minor informalities:
At the bottom right, the arrows are in the wrong entries.  The arrows should be in entries 2-3 and 6-7, not 1-2 and 5-6.
A corrected drawing sheet in compliance with 37 CFR 1.121(d) is required in reply to the Office action to avoid abandonment of the application. Please ensure any replacement is only in black and white to avoid pixelation and further objection.  Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. The figure or figure number of an amended drawing should not be labeled as “amended.” Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.

Claim Objections
Claim 1 is objected to because of the following informalities:
In line 10, insert --and-- after the comma.
Claim 8 is objected to because of the following informalities:
In line 8, it appears “first” should be replaced with --second--, as it is understood that the second crossbar circuitry does the de-compaction.  For purposes of prior art examination, the last paragraph will be interpreted to configure the second crossbar, not the first.
Claim 10 is objected to because of the following informalities:
In line 9, insert a comma after “ALU”.
In line 11, insert --the-- after “with”.
Claim 13 is objected to because of the following informalities:
In line 2, insert --the-- after “with”.
Claim 21 is objected to because of the following informalities:
In line 2, replace “include” with --including--.
Appropriate correction is required.

Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

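The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112:

The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.
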
Claims 4-5, 8-10, 12-16, and 18 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA ), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA  35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.
Referring to claim 4, applicant now claims that the ALU is used to execute one or more threads of a SIMT instruction.  However, the examiner could find no original disclosure of a single ALU being used to execute multiple threads simultaneously.  In paragraph [0195], a SIMT processing engine is disclosed to process multiple threads by multiple thread processors, and that the techniques described in relation to lanes of a SIMD processor can be adapted for application to multiple thread processors of a SIMT processor.  Each processor is generally understood to include at least one ALU (and, thus, a collection of ALUs would be used to execute multiple SIMT threads).  Nowhere does the original specification describe a single ALU to execute multiple SIMT threads.  As such, applicant’s amendments set forth new matter.
Claims 10, 14, 16, and 18 are rejected for similar reasons as claim 4.
Further referring to claim 10, applicant now claims compacting active data elements without adjusting data channel settings associated with the instruction.  Paragraph [0129] is the only place that gives an example of what these settings are.  The example given is predication.  However, claim 8 (and paragraph [0232]) sets forth that the compaction occurs based on the predicate mask.  Paragraph [0202] states “[t]he SIMD lanes can be optimally used by shuffling the SIMD lanes dynamically based on predication mask and compacting active lanes.”  Paragraph [0208] states “[i]n one embodiment the shuffling is performed according to the predication mask”.  As such, because the original disclosure identifies a predicate as a data channel setting, and a predicate is described as controlling the shuffling/compaction, it now constitutes new matter to claim that data channel settings are not adjusted to achieve the compaction.  It appears they must be adjusted to set the active/inactive lanes and perform related compaction.
Claims 5, 8-9, and 12-16 are rejected due to their dependence on a claim lacking adequate written description.

The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 3-5 and 8-9 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.
The claims recite the following limitations for which there is a lack of antecedent basis:
In claim 3, line 3, “the instruction”, because there is an instruction in claim 1 (2nd to last paragraph) and an instruction in claim 2, line 3.
Claims 4-5 and 8-9 are rejected due to their dependence on an indefinite claim.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claim 17 is rejected under 35 U.S.C. 102(a)(1) as being anticipated by Vaidya et al., U.S. Patent Application Publication No. 2014/0181477 A1 (herein referred to as Vaidya).
Claim 17 is partly rejected for similar reasons set forth in the rejection of claim 10 below (i.e., the steps performed by the hardware circuitry in response to receiving an instruction in claim 17 are mostly rejected for similar reasons as claim 10).  Vaidya has further taught:
a) configur[ing] a crossbar to map active…elements…to processor lanes to realize the sequential compaction (see FIG.6, circuitry 240, and FIG.7B, 340, which form crossbar circuitry that maps active elements to first/inactive lanes), the crossbar configured without adjusting data channel swizzle settings associated with the instruction (note that even though paragraph [0043] states that a swizzle setting is computed based on the mask to carry out the compaction (which includes the configuration of the crossbar), these swizzle settings would be set once and not further adjusted.  As such, the compaction is done without adjusting swizzling (it is set once for the entire compaction).  Any adjusting would result in a different permutation, which would be incorrect.  Alternatively, from paragraph [0043], unswizzle settings are also determined.  These are swizzle settings for the outputs.  As such, since these swizzle settings don’t relate to compaction, but de-compaction, these swizzle settings are not adjusted for configuring the crossbar);
in reverse sequential order of compaction (see FIG.6, 260 and paragraph [0043].  After processing, the output is unswizzled/de-compacted for storage into a register file.  It is the inverse of the compaction, which is shown to be sequential in the explanation in the rejection of claim 10.  Thus, the unswizzling/de-compaction is also sequential in nature.  Also, it reverses the compaction.  Thus, it de-compacts in reverse sequential order); and
c) a data processing system comprising:
c1) a memory device (e.g. any of FIG.5, 130; FIG.6, 213; FIG.8, 740; FIG.9, 1110, 1175; and FIG.12, 832, 834, 828); and
c2) a graphics processor (see paragraph [0040] and FIG.5, 120) comprising one or more hardware tiles (FIG.5, 1250-n) including processing resources having a multi-lane parallel processor architecture (each execution unit 125 is shown in FIG.6 to have a SIMD ALU 250, which is a part of a multi-lane parallel processor architecture that executes a SIMDx instruction (paragraph [0038])) and hardware circuitry configured to compact diverged processor lanes (see the title, paragraphs [0002] and [0016], and FIGs.2 and 4.  The system may use BCC and/or SCC to compact/compress divergent lanes so as to reduce disabled lanes.  SCC involves permutation, e.g. shuffling or swizzling (performed by FIG.6, 240)), wherein the hardware circuitry includes an arithmetic logic unit (ALU) (FIG.6, ALU 250) including a first number of logical processor lanes (as shown in FIGs.2-4, the ALU has 16 logical lanes to handle SIMD16 instructions) and a second number of physical processor lanes (from paragraph [0045] and FIGs.2-4, the ALU may have four physical lanes (SIMD4)), the first number is a multiple of the second number (SIMD16 is 4x SIMD4), and the ALU is configured to process the logical processor lanes over multiple clock cycles when active logical processor lanes outnumber physical processor lanes (see FIGs.2-4, which show that multiple ALU cycles are required to execute active lanes when they outnumber inactive lanes (e.g. in FIG.2, for instruction 35, there are 12 active lanes and 4 inactive lanes, and these lanes take 3 cycles to execute (cycles T+1, T+2, T+3))).

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim 1 is rejected under 35 U.S.C. 103 as being unpatentable over Vaidya et al., U.S. Patent Application Publication No. 2014/0181477 A1 (herein referred to as Vaidya).
Referring to claim 1, Vaidya has taught an accelerator device (FIG.5, 120) comprising:
a) a host interface (FIG.5, 140);
b) a fabric interconnect coupled with the host interface (see paragraph [0040].  Core domain and/or system agent (either of which is a “host”) is interfaced with the accelerator over an interconnect); and
c) one or more hardware tiles (FIG.5, 1250-n) coupled with the fabric interconnect, wherein the one or more hardware tiles include processing resources having a multi-lane parallel processor architecture (each execution unit 125 is shown in FIG.6 to have a SIMD ALU 250, which is a part of a multi-lane parallel processor architecture that executes a SIMDx instruction (paragraph [0038])) and hardware circuitry configured to compact diverged processor lanes (see the title, paragraphs [0002] and [0016], and FIGs.2 and 4.  The system may use BCC and/or SCC to compact/compress divergent lanes so as to reduce disabled lanes.  SCC involves permutation, e.g. shuffling or swizzling (performed by FIG.6, 240)), wherein the hardware circuitry includes:
c1) an arithmetic logic unit (ALU) (FIG.6, ALU 250) including a first number of logical processor lanes (as shown in FIGs.2-4, the ALU has 16 logical lanes to handle SIMD16 instructions) and a second number of physical processor lanes (from paragraph [0045] and FIGs.2-4, the ALU may have four physical lanes (SIMD4)), wherein the first number is a multiple of the second number (SIMD16 is 4x SIMD4), the ALU is configured to process the logical processor lanes over multiple clock cycles when active logical processor lanes outnumber physical processor lanes (see FIGs.2-4, which show that multiple ALU cycles are required to execute active lanes when they outnumber inactive lanes (e.g. in FIG.2, for instruction 35, there are 12 active lanes and 4 inactive lanes, and these lanes take 3 cycles to execute (cycles T+1, T+2, T+3))));
c2) first crossbar switch circuitry configured to input data into the ALU (FIG.6, circuitry 240 and FIG.7B, 340 form crossbar circuitry that inputs data into the ALU), the first crossbar circuitry configurable, based on a predicate mask for an instruction received for execution (see paragraph [0043]), to provide input associated with a second set of logical processor lanes as input to a first set of logical processor lanes (see FIG.4.  Again, based on a mask, input associated with lanes 4 and 6 is provided as input to lanes 1 and 3); and
c3) second circuitry configured to provide output from the ALU (FIG.6, circuitry 260 outputs data from the ALU to memory via write-back 270), the second circuitry configurable to provide output from the first set of logical processor lanes to memory associated with the second set of logical processor lanes (from paragraph [0043], unswizzle 260 is the inverse of the crossbar switch circuitry).  Again, note that paragraph [0043] sets forth that the output circuitry performs the inverse of the input circuitry, the latter being disclosed as including crossbar circuitry 340 (FIG.7B).  Thus, as one of skill in the art would have recognized that an output crossbar could perform the inverse of the input crossbar, this would have been a natural implementation for the second circuitry.  Crossbar circuitry is useful as it allows any input to be switched to any output, thereby maximizing flexibility in transmission of data.  Consequently, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the second circuitry to be second crossbar switch circuitry.
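The cycle counts relied upon in limitation c1) can be illustrated with a short sketch.  It is offered only as an illustration of the arithmetic described above (16 logical lanes processed on 4 physical lanes), not as Vaidya's circuitry.  The function names and the particular FIG.2-like mask (one fully inactive group of four lanes) are assumptions made for the example.

```python
import math

def cycles_no_compaction(mask, physical=4):
    # Every group of `physical` logical lanes takes one ALU cycle.
    return len(mask) // physical

def cycles_skip_empty_groups(mask, physical=4):
    # A group is issued only if at least one of its lanes is active.
    return sum(any(mask[b:b + physical]) for b in range(0, len(mask), physical))

def cycles_fully_compacted(mask, physical=4):
    # Lower bound once active lanes are packed into contiguous groups.
    return math.ceil(sum(mask) / physical)

# Assumed FIG.2-like mask: 12 active lanes, 4 inactive lanes in one group.
fig2_like = [True] * 12 + [False] * 4
# FIG.4-like mask as described above: even lanes active, odd lanes inactive.
fig4_like = [i % 2 == 0 for i in range(16)]

print(cycles_no_compaction(fig2_like), cycles_skip_empty_groups(fig2_like))
print(cycles_skip_empty_groups(fig4_like), cycles_fully_compacted(fig4_like))
```

Under these assumptions, the FIG.2-like mask needs 3 cycles instead of 4, while the FIG.4-like mask still needs 4 cycles unless the active lanes are first permuted into contiguous lanes, after which 2 cycles suffice.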

Claims 2-5, 8-10, 12-16, 18, and 21-22 are rejected under 35 U.S.C. 103 as being unpatentable over Vaidya in view of the examiner’s taking of Official Notice.
Referring to claim 2, Vaidya has taught the accelerator device as in claim 1, wherein the host interface is configured to communicatively couple the accelerator device to a processor of a host computing device (again, see paragraph [0040].  Also, see FIG.8, which couples a host (e.g. multi-core processor) to GPU 720.  Alternatively, see FIG.10, which couples a graphics processor 875 to a host core 874a,b).  Vaidya has not taught that the host interface is configured to receive an instruction to be executed by the accelerator device.  However, the examiner notes that it is well known in the art for a host CPU to pass graphics instructions to a graphics processor for execution.  As such, in order to realize the advantages of Vaidya’s compression/compaction in such an architecture (where a CPU offloads graphics instructions to a GPU), it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Vaidya such that the host interface is configured to receive an instruction to be executed by the accelerator device.
Referring to claim 3, Vaidya, as modified, has taught the accelerator device as in claim 2, the one or more hardware tiles further comprising: decode circuitry to decode the instruction into a decoded instruction (FIG.6, instruction decoder 215), the decoded instruction associated with the predicate mask (see paragraphs [0027]-[0029] and [0042]-[0043]), wherein the predicate mask indicates a set of active lanes and a set of inactive lanes (see at least paragraphs [0027]-[0029] and [0042]-[0043].  Vaidya identifies divergent execution (e.g. if/else code, where some elements are active at the “if” part and others at the “else” part) and sets a mask to indicate such.  The mask would indicate active lanes (those unshaded in FIGs.2-4) and inactive lanes (those shaded in FIGs.2-4)), and the hardware circuitry is to map active lanes in a second portion of logical processor lanes to inactive lanes in a first portion of logical processor lanes (see FIG.4 and the description thereof.  For instance, it is determined that the odd lanes are inactive for an instruction (FIG.4, 50a).  In response, active lanes 4, 6, 12, and 14 are mapped to lanes 1, 3, 9, and 11 (FIG.4a, 50b) as part of SCC compaction).
Referring to claim 4, Vaidya, as modified, has taught the accelerator device as in claim 3, but has not taught wherein one or more logical processor lanes of the ALU are configured to execute one or more threads of a single instruction multiple thread (SIMT) instruction and the predicate mask is to indicate at least one active thread of the SIMT instruction and at least one inactive thread of the SIMT instruction.  However, SIMT is known in the art as the thread equivalent of SIMD.  With SIMT, multiple threads use parallel hardware to execute the same instruction in parallel on their own data sets.  Thus, SIMT is useful when the same task needs to be performed repeatedly on different data sets.  It is known that SIMT experiences similar divergent execution based on masking and would benefit from the compaction disclosed by Vaidya.  As such, in order to realize the benefits of SIMT, which is compatible with the invention of Vaidya, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Vaidya such that one or more logical processor lanes of the ALU are configured to execute one or more threads of a single instruction multiple thread (SIMT) instruction and the predicate mask is to indicate at least one active thread of the SIMT instruction and at least one inactive thread of the SIMT instruction.
Referring to claim 5, Vaidya, as modified, has taught the accelerator device as in claim 4, wherein the hardware circuitry is configured to: map input associated with the at least one active thread of the SIMT instruction to a logical lane associated with the at least one inactive thread of the SIMT instruction (this would be the purpose of circuitry 240 (FIG.6) and the crossbar circuitry of FIG.7B.  To compact the active threads together, an active thread would be mapped to a logical lane currently associated with an inactive thread (e.g. in FIG.4, active threads 4 and 6 would be mapped to lanes 1 and 3, which are associated with inactive threads 1 and 3)).
Referring to claim 8, Vaidya, as modified, has taught the accelerator device as in claim 5, wherein the one or more hardware tiles are configured to: configure the first crossbar switch circuitry to sequentially compact the diverged processor lanes into contiguous logical processor lanes based on the predicate mask without adjusting swizzle settings for input operands (see FIG.4 and note that logical lanes 0, 4, 2, and 6 are compacted into contiguous logical lanes 0, 1, 2, and 3, respectively.  Similarly, note that logical lanes 8, 12, 10, and 14 are compacted into contiguous logical lanes 8, 9, 10, and 11, respectively.  The compaction of FIG.4, for instance, is sequential in nature.  That is, from left to right, the active ; process the contiguous logical processor lanes over a reduced number of clock cycles (the above compacting allows for processing over a reduced number of cycles (see FIG.4) within four physical lanes of the ALU); and configure the [second] crossbar switch circuitry to sequentially de-compact the diverged processor lanes (again, see paragraph [0043] and FIG.6, 260).
Referring to claim 9, Vaidya, as modified, has taught the accelerator device as in claim 8, wherein the ALU includes integer and floating-point logic (from paragraph [0045], “various integer and floating point instructions can be performed in the floating point ALU”).
Referring to claim 10, Vaidya has taught a method comprising:
a) receiving an instruction (e.g. FIG.4, 50a) having predicated data elements (from FIG.4, the odd elements are turned off, and the even elements are turned on based on a predicate/mask (e.g. paragraphs [0002], [0015], [0027], etc.));
b) Vaidya has not taught wherein the instruction is a single instruction multiple thread (SIMT) instruction.  However, SIMT is known in the art as the thread equivalent of SIMD.  With SIMT, multiple threads use parallel hardware to execute the same instruction in parallel on their own data sets.  Thus, SIMT is useful when the same task needs to be performed repeatedly on different data sets.  It is known that SIMT experiences similar divergent execution based on masking and would benefit from the compaction disclosed by Vaidya.  As such, in order to realize the benefits of SIMT, which is compatible with the invention of Vaidya, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Vaidya such that the instruction is a single instruction multiple thread (SIMT) instruction.
c) Vaidya, as modified, has further taught determining, via a predication mask associated with the instruction, a set of inactive threads for the instruction (again, based on a mask, it is determined that the off threads are inactive, meaning they are indicated as not satisfying the “IF” condition of the program.  As an example, in FIG.4, as modified, the mask would indicate that the odd threads are off/inactive and that the even threads are on/active);
d) Vaidya, as modified, has further taught compacting data elements associated with active threads into processing lanes associated with inactive threads to create a contiguous set of active processing lanes (see FIG.4.  Active threads 4 and 6, and their data elements, are compacted into lanes 1 and 3 to create contiguous active lanes), wherein the contiguous set of active processing lanes are processing lanes of a multi-lane ALU (see FIG.6, 250.  Lanes 0, 4, 2, and 6 have been compacted to execute on physical lanes 0, 1, 2, and 3, of the ALU in cycle T (FIG.4)) and compacting the data elements associated with the active threads into the processing lanes associated with the inactive threads includes sequentially compacting active data elements into the processing lanes associated with inactive threads without adjusting data channel settings associated with the instruction (see FIG.4 and note that active threads (and elements) 4 and 6 are compacted into inactive lanes 1 and 3, respectively.  The compaction of FIG.4, for instance, is sequential in nature.  That is, from left to right, the active threads beyond lanes 0-3 are mapped in left-to-right (sequential) fashion to the inactive lanes in 0-3.  For example, after lane 3, the first active thread is thread 4.  This is compacted to first inactive lane 1.  After lane 3, the second active thread is thread 6.  This is compacted to second inactive lane 3.  This is sequential compacting that is designed for left-to-right compaction.  Note that there is no disclosure in Vaidya of “data channel settings”.  Thus, there is no adjustment of data channel settings associated with the instruction.  From paragraph [0043], a mask is set, and a swizzle setting is set.  However, no data channel setting is adjusted.  Due to the negative nature of this limitation, it is extremely broad.  As one example, a data channel setting could be register identifiers set forth by the instruction to obtain the operands.  Such would not be adjusted to perform compaction.  That is, if an instruction indicates that the data in register R1 is to be operated upon, R1 will be the identifier used to access the register file.  If it were adjusted, the wrong data would be obtained and operated upon.  Many other settings not involved in the compaction could be considered data channel settings);
e) performing a processing operation on the contiguous set of active processing lanes (again, see cycle T in FIG.4); and
f) de-compacting output of the processing operation into an output memory, wherein de-compacting output of the processing operation into the output memory includes sequentially de-compacting output of the processing operation into the output memory (see FIG.6, 260 and paragraph [0043].  After processing, the output is unswizzled/de-compacted for .
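The left-to-right compaction and inverse de-compaction discussed in limitations d) through f) can be sketched as follows.  This is a minimal illustration of the lane mapping as characterized in this Office action for FIG.4, not a description of Vaidya's actual swizzle logic.  The names `compact` and `decompact`, the group width of 4, and the rule of skipping groups with no active lanes are assumptions made for the example.

```python
def compact(mask, width=4):
    """Return a slot -> source-lane list (None marks an empty slot)."""
    n = len(mask)
    slots = [i if mask[i] else None for i in range(n)]
    for base in range(0, n, width):
        group = range(base, base + width)
        if all(slots[i] is None for i in group):
            continue  # no active lane left here; this cycle can be skipped
        # Active lanes to the right of this group, in left-to-right order.
        donors = [j for j in range(base + width, n) if slots[j] is not None]
        for i in group:
            if slots[i] is None and donors:
                j = donors.pop(0)
                slots[i] = slots[j]  # active lane j moves into inactive slot i
                slots[j] = None
    return slots

def decompact(slots, results):
    """Route each result back to the lane it came from (inverse mapping)."""
    out = [None] * len(slots)
    for slot, lane in enumerate(slots):
        if lane is not None:
            out[lane] = results[slot]
    return out

# FIG.4-like mask: even lanes active, odd lanes inactive.
mask = [i % 2 == 0 for i in range(16)]
slots = compact(mask)
print(slots[:4], slots[8:12])

# Stand-in "ALU operation": multiply each source lane index by 10.
results = [None if lane is None else lane * 10 for lane in slots]
print(decompact(slots, results))
```

In this sketch the mapping is computed once from the mask, and `decompact` simply applies the same mapping in the opposite direction, which mirrors the characterization of the unswizzle as the inverse of the swizzle.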
Referring to claim 12, Vaidya, as modified, has taught the method as in claim 10, wherein the output memory is an output register (from paragraph [0034], write-back occurs to a register file (which is a register, or includes a register)).
Referring to claim 13, Vaidya, as modified, has taught the method as in claim 10, wherein compacting active data elements into the processing lanes associated with inactive threads includes configuring a crossbar (FIG.7B, 340) to map active input data elements associated with a second set of processing lanes to processing lanes in a first set of processing lanes (paragraph [0047].  Basically, in FIG.4, since active thread 4 is being mapped to inactive lane 1, the data corresponding to lane 4 must be switched through the cross-bar in order to be provided to lane 1), the processing lanes in the first set of processing lanes associated with inactive threads (the first (odd) set of lanes are associated with inactive elements (denoted by shading, per paragraph [0023])).
Referring to claim 14, Vaidya, as modified, has taught the method as in claim 13, wherein the multi-lane ALU is a single instruction multiple data (SIMD) ALU including a first number of logical SIMD lanes and a second number of physical SIMD lanes (from FIGs.4 and 6 and paragraph [0045], the SIMD ALU has four physical lanes and sixteen logical lanes such that a SIMD4 ALU can execute SIMD16 instructions (i.e., a 16-thread SIMT instruction)), the first number is a multiple of the second number (16 is a multiple of 4), and the SIMD ALU processes logical SIMD lanes over multiple clock cycles when active logical SIMD lanes outnumber physical SIMD lanes (see FIGs.2-4, which show that multiple ALU cycles are required to execute active lanes when they outnumber inactive lanes (e.g. in FIG.2, for instruction 35, there are 12 active lanes and 4 inactive lanes, and these lanes take 3 cycles to execute (cycles T+1, T+2, T+3))), and wherein the SIMD ALU is mapped to multiple SIMD threads (whatever hardware simultaneously processes the different threads as shown in FIG.4, as modified, would be hardware that makes up the SIMD ALU).
Referring to claim 15, Vaidya, as modified, has taught the method as in claim 14, wherein performing the processing operation on the contiguous set of active processing lanes includes bypassing execution of the inactive threads for the instruction multiple logical SIMD lanes and processing the instruction in a reduced number of clock cycles (see FIGs.2 and 4 for examples.  For instance, in FIG.4, execution of inactive threads 1 and 3 is bypassed so that active lanes 4 and 6 can be executed.  This reduces the number of execution cycles).
Referring to claim 16, Vaidya, as modified, has taught the method as in claim 15, wherein the multi-lane ALU is a SIMD16 ALU having sixteen logical lanes (from paragraph [0045], while the ALU is disclosed as a SIMD4 ALU, this is only in the physical sense because it includes four physical lanes.  However, it is also a SIMD16 ALU in the logical sense because it executes instructions having 16 logical lanes (see FIGs.2-4)), and the SIMD16 ALU is configurable to execute 16 SIMT threads (the SIMD16 ALU will execute all 16 threads, over 4 cycles, if the mask indicates that all 16 threads are to be executed.  Any number of threads may be executed as this is dependent on program conditions).  Vaidya has not taught that the ALU has eight physical lanes (again, Vaidya has instead taught four physical lanes).  However, changing the physical size is considered a routine expedient that does not amount to a patentable distinction given that applicant has not demonstrated the criticality of the size.  See MPEP 2144.04(IV)(A).  As a result, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Vaidya such that the ALU has eight physical lanes.
Referring to claim 18, Vaidya has taught the data processing system as in claim 17, wherein the output memory is an output register (from paragraph [0034], write-back occurs to a register file (which is a register, or includes a register)), but has not taught wherein the ALU includes 32 logical processor lanes, and the ALU is configured to execute 32 threads of a single instruction multiple thread (SIMT) instruction.  However, SIMT is known in the art as the thread equivalent of SIMD.  With SIMT, multiple threads use parallel hardware to execute the same instruction in parallel on their own data sets.  Thus, SIMT is useful when the same task needs to be performed repeatedly on different data sets.  It is known that SIMT experiences similar divergent execution based on masking and would benefit from the compaction disclosed by Vaidya.  As such, in order to realize the benefits of SIMT, which is compatible with the invention of Vaidya, it would have first been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Vaidya such that the ALU is configured to execute threads of a single instruction multiple thread (SIMT) instruction.  Further, Vaidya has taught 16 logical lanes (FIG.4), not 32.  However, changing the size is considered a routine expedient that does not amount to a patentable distinction given that applicant has not demonstrated the criticality of the size.  See MPEP 2144.04(IV)(A).  The examiner notes that the ALU includes 32 logical processor lanes, and the ALU is configured to execute 32 threads of a single instruction multiple thread (SIMT) instruction (the ALU will execute all 32 threads if the mask indicates that all 32 threads are to be executed.  Any number of threads may be executed as this is dependent on program conditions).
Referring to claim 21, Vaidya has taught a graphics processor comprising:
a) one or more hardware tiles (FIG.5, 1250-n) include processing resources having a multi-lane parallel processor architecture (each execution unit 125 is shown in FIG.6 to have a SIMD ALU 250, which is a part of a multi-lane parallel processor architecture that executes a SIMDx instruction (paragraph [0038])) and hardware circuitry configured to compact diverged processor lanes (see the title, paragraphs [0002] and [0016], and FIGs.2 and 4.  The system may use BCC and/or SCC to compact/compress divergent lanes so as to reduce disabled lanes.  SCC involves permutation, e.g. shuffling or swizzling (performed by FIG.6, 240)),
b) wherein the hardware circuitry includes an arithmetic logic unit (ALU) (FIG.6, at least ALU 250, though the ALU can be said to comprise surrounding components which related to ALU processing) having sixteen logical processing lanes (as shown in FIGs.2-4, the ALU has 16 logical lanes to handle SIMD16 instructions) and four physical processing lanes (from paragraph [0045] and FIGs.2-4, the ALU may have four physical lanes (SIMD4).  Vaidya has, eight physical processing lanes.  However, changing the physical size is considered a routine expedient that does not amount to a patentable distinction given that applicant has not demonstrated the criticality of the size.  See MPEP 2144.04(IV)(A).  The examiner notes that one of ordinary skill in the art would have recognized the scalability of Vaidya to work with various combinations of ALU and instruction widths.  The examples in FIGs.2-4 illustrate 4 physical lanes.  The number of physical lanes could be trivially doubled to 8 to increase throughput (a wider ALU can execute more at once).  As a result, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Vaidya such that the ALU has eight physical processing lanes.
c) Vaidya, as modified, has further taught the ALU including:
c1) a hardware input circuit to store sixteen data elements (see FIG.7B, registers 320 and/or latch 330.  Such may be considered included by (or part of) the ALU);
c2) a first crossbar switch circuit (see FIG.7B, 340, which again may be considered as included by (or part of) the ALU) configured to route data elements from the hardware input circuit to the eight physical processing lanes of the ALU (see paragraph [0047]), wherein the first crossbar switch circuit includes switching logic configured to compact, in sequential order, active processing lanes in an upper half of the sixteen logical processing lanes into inactive lanes in a lower half of the sixteen logical processing lanes (as modified, with eight physical lanes and sixteen logical lanes, any element that is inactive in the lower half would be utilized by an active logical lane in the upper half (assuming an inactive lane in the lower half).  This is within the scope of teachings of Vaidya and is entirely dependent on the mask, which could take on any value.  For instance, assume the mask is set such that active and inactive lanes are indicated as shown in FIG.4.  With eight physical lanes, active lanes 8, 10, 12, and 14 would ; and
c3) second circuitry configured to provide output from the ALU, the second crossbar switch circuitry configured to de-compact, in sequential order, the active processing lanes in the upper half of the sixteen logical processing lanes (see paragraph [0043] and FIG.6, 260, which again may be considered as included by (or part of) the ALU.  After processing, the output associated with the active lanes of the upper half is unswizzled/de-compacted for storage into a register file.  It is the inverse of the compaction, which is shown to be sequential in the explanation above.  Thus, the unswizzling/de-compaction is also sequential in nature).  While Vaidya has not explicitly taught that the second circuitry is second crossbar switch circuitry, again note that paragraph [0043] sets forth that the output circuitry performs the inverse of the input circuitry, the latter being disclosed as including crossbar circuitry 340 (FIG.7B).  Thus, as one of skill in the art would have recognized that an output crossbar could perform the inverse of the input crossbar, this would have been a natural implementation for the second circuitry.  Crossbar circuitry is useful as it allows any input to be switched to any output, thereby maximizing flexibility in transmission of data.  Consequently, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the second circuitry to be second crossbar switch circuitry.
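As a purely illustrative aid, the compaction and inverse de-compaction discussed in c2) and c3) can be sketched in software.  This is a simplified model under stated assumptions, not the circuitry of Vaidya or of the application: the mask below is hypothetical (it is not asserted to match FIG.4), the function names are invented for the sketch, and a crossbar is modeled as a routing table from source lane to destination slot.

```python
# Hypothetical model: 16 logical lanes, 8 physical lanes.
# Active upper-half lanes are routed, in ascending order, into
# inactive lower-half slots; the output side applies the inverse routing.


def build_routing(mask):
    """Map each active upper-half lane to an inactive lower-half slot, in order."""
    half = len(mask) // 2
    free_lower = [i for i in range(half) if not mask[i]]
    active_upper = [i for i in range(half, len(mask)) if mask[i]]
    # zip stops at the shorter list; leftover upper lanes would need a
    # second pass, which this sketch does not model.
    return dict(zip(active_upper, free_lower))


def compact(data, mask):
    """Input-side routing: fill inactive lower-half slots with upper-half data."""
    routing = build_routing(mask)
    packed = list(data[: len(data) // 2])
    for src, dst in routing.items():
        packed[dst] = data[src]
    return packed, routing


def decompact(results, routing, mask):
    """Output-side routing: return each result to its original logical lane."""
    out = [None] * len(mask)
    for lane in range(len(mask) // 2):
        if mask[lane]:
            out[lane] = results[lane]
    for src, dst in routing.items():
        out[src] = results[dst]
    return out


# Assumed example mask: even lanes active, odd lanes inactive.
mask = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
data = list(range(100, 116))

packed, routing = compact(data, mask)
results = [x * 2 for x in packed]  # stand-in for the ALU operation
restored = decompact(results, routing, mask)
```

Under this assumed mask, lanes 8, 10, 12, and 14 are routed into slots 1, 3, 5, and 7, so all eight active lanes are processed in a single pass through eight physical lanes, and the output routing is simply the inverse of the input routing.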
Referring to claim 22, Vaidya, as modified, has taught the graphics processor as in claim 21, wherein the ALU is a SIMD ALU (FIG.6, 250) and each of the sixteen logical processing lanes are mappable to a single instruction multiple data (SIMD) channel (see FIGs.2-4 for examples.  Any active lane will ultimately be mapped to one of the eight physical channels (as modified) of the ALU) or a thread of a single instruction multiple thread (SIMT) instruction.

Response to Arguments
On pages 12-13 of the response, applicant argues that a logical processor lane is treated by the system as a physical lane, even if the logical lane is not directly backed by an equal number of physical lanes.
This explanation is consistent with the examiner’s application of Vaidya.  The ALU of Vaidya is a 16-logical lane ALU because it may execute SIMD16 instructions having 16 lanes.  The hardware is able to efficiently handle 16 logical lanes with fewer physical lanes through use of compaction and multi-cycle operation.
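The relationship between logical lanes, physical lanes, and multi-cycle operation can be illustrated with simple arithmetic.  The following is a simplified sketch under an idealized assumption (perfect compaction of active lanes, no other overhead); the function names are hypothetical and the cycle counts are not taken from Vaidya.

```python
# Hypothetical cycle-count model for a SIMD16 instruction on a narrower ALU.


def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def cycles_without_compaction(logical_lanes, physical_lanes):
    """Every logical lane occupies a physical slot, active or not."""
    return ceil_div(logical_lanes, physical_lanes)


def cycles_with_compaction(mask, physical_lanes):
    """Idealized: only active lanes occupy physical slots."""
    return ceil_div(sum(mask), physical_lanes)


# Assumed mask: 8 of 16 logical lanes active.
mask = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

baseline = cycles_without_compaction(16, 4)
compacted = cycles_with_compaction(mask, 4)
```

In this model, a four-lane ALU needs four cycles for 16 logical lanes without compaction and two cycles when only the eight active lanes are issued.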

On page 13, applicant argues that Vaidya has not taught second crossbar switch circuitry.
The examiner agrees.  However, from the related rejections above, it is obvious for the second circuitry to be second crossbar switch circuitry.  The examiner also notes that original claim 7 did not require that the second circuitry include crossbar circuitry.  Instead, a valid interpretation is that the combination of first and second circuitry included crossbar circuitry.  Because Vaidya has taught the first circuitry including crossbar switch circuitry, it can be said that the first and second circuitries as a whole include crossbar switch circuitry.

On pages 13-14, applicant argues that Vaidya has not taught SIMT.
The examiner agrees, but asserts that it is obvious to modify Vaidya to implement a SIMT environment.

On page 14, applicant argues that Vaidya uses swizzling, which is claimed to not be used in claim 10.
The examiner first notes that “swizzle” or the like does not appear in claim 10.  Thus, this argument is not applicable.  However, even assuming applicant accidentally omitted “swizzle” from claim 10, this is not a distinction (see at least the rejection of claim 17).

On page 14, applicant argues that Vaidya does not teach the SIMT limitation of claim 14 or claim 16.
The examiner agrees, but again asserts this is obvious.

Applicant argues, without providing any significant reasoning, that Vaidya fails to teach numerous limitations in claims 17 and 21.
The examiner either disagrees and asserts that Vaidya does teach at least some of the limitations, or agrees, but asserts that they are obvious modifications to Vaidya.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to David J. Huisman whose telephone number is 571-272-4168.  The examiner can normally be reached on Monday-Friday, 9:00 am-5:30 pm.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jyoti Mehta, can be reached at 571-270-3995.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov.  Should you have questions on access to the Private 






/David J. Huisman/
Primary Examiner, Art Unit 2183