DETAILED ACTION
Claims 1-5, 8-10, 12-18, and 21-22 have been examined.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on April 6, 2022, has been entered.

Specification
The specification is objected to as failing to provide proper antecedent basis for the “execution stage of an instruction pipeline” in claim 21.  See 37 CFR 1.75(d)(1) and MPEP § 608.01(o).  Presumably, FIG.18 is the execution stage since the ALU performs execution (paragraph [0194]) and the components of FIG.18 can be part of the ALU (from paragraph [0206], “The components of the hardware system 1800 can be included in any ALU…”).  However, the components of FIG.18 need to be equated to the execution stage of a pipeline in the specification.

Claim Objections
Claim 10 is objected to because of the following informalities:
In line 3, re-insert --instruction-- before the semicolon.  This word was deleted from the previous set of claims without indication.
Due to deletion in lines 13-15, it appears that “, and compacting…” in line 11, through the end of the paragraph in line 15 is entirely redundant (with respect to the language in lines 8-11) and should be deleted.
Due to deletion of “sequentially” in the 2nd to last line, the “wherein…” portion of the last paragraph is entirely redundant (with respect to the previous language in the same paragraph) and should be deleted.
Appropriate correction is required.

Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.  Such claim limitation(s) is/are:
In claim 21, on page 8, 2nd to last line, “switching logic configured to compact…”.  Applicant has not set forth sufficient structure in the specification corresponding to this switching logic.  As such, broadest reasonable interpretation of this logic will be taken, and related 112(a)/(b) rejections are set forth below.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA  35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claims 8-9, 13-18, and 21-22 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA ), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA  35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.
Referring to claim 8, applicant claims configuring the first crossbar switch circuitry to compact lanes while maintaining a default data channel swizzle setting.  Applicant has not pointed to support for this feature, nor has the examiner found it in the lengthy specification.  In paragraph [0129], applicant states  “By default, the execution unit performs each instruction across all data channels of the operands. In some embodiments, instruction control field 714 enables control over certain execution options, such as channels selection (e.g., predication) and data channel order (e.g., swizzle).”  If the default is to use all channels, but the crossbar is configured to not use all channels (and as a result, compact the lanes), then the default is not being maintained.  Applicant has not adequately described, so as to demonstrate possession at the time of filing, how the default of using all channels is maintained while simultaneously compacting lanes because not all channels are used.  In addition, the examiner sees no original disclosure of a default data swizzle setting.  The default is simply tied to using all channels and not tied to swizzling.  The specification appears broad enough to allow each instruction to dictate swizzling and thus there would be no default swizzle.
Further referring to claim 8, applicant has not adequately described how the configuring of the first crossbar based on a predicate mask works in conjunction with maintaining an instruction-specified data channel swizzle setting so as to demonstrate possession thereof at the time of filing.  What if a swizzle setting indicates an order that is different from that indicated by the predicate?  If the predicate mask is given priority, i.e., the swizzle setting is ignored or overruled, then the system is not maintaining the swizzle setting.  Maintaining the swizzle setting suggests that it is used to carry out a swizzle.
Claims 13 and 17 are rejected for similar reasons as claim 8.
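The potential conflict noted with respect to claim 8 can be illustrated with a minimal sketch.  It assumes, hypothetically, that a swizzle setting is a per-position list of source channels and that compaction packs active channels into the lowest positions.  Neither representation is taken from applicant's specification, and the names swizzle_route, compaction_route, and conflicts are illustrative only.

```python
# Hypothetical model (an assumption, not applicant's disclosure): a "route"
# lists, for each output position, the source data channel that feeds it.
# None means the position is left unused.

def swizzle_route(order):
    """Route implied by an instruction-specified channel order alone."""
    return list(order)

def compaction_route(mask):
    """Route implied by packing active channels into the lowest positions."""
    active = [ch for ch, is_active in enumerate(mask) if is_active]
    return active + [None] * (len(mask) - len(active))

def conflicts(route_a, route_b):
    """Positions at which the two routes name different source channels."""
    return [pos for pos, (a, b) in enumerate(zip(route_a, route_b)) if a != b]

swizzle = swizzle_route([3, 2, 1, 0])                      # assumed reversed order
compacted = compaction_route([True, False, True, False])   # channels 0 and 2 active
print(compacted)                       # [0, 2, None, None]
print(conflicts(swizzle, compacted))   # [0, 2, 3]
```

Under this assumed model the two routes disagree at three of the four positions, so honoring one necessarily overrides the other, which is the ambiguity raised above.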
Referring to claim 21, “switching logic…to compact…” invokes 112(f).  However, the disclosure does not provide adequate structure to perform the claimed function of compacting.  FIG.18 merely shows black box 1814, and paragraph [0207], which sets forth the switching logic, does not also set forth sufficient structure corresponding to the switching logic.  As such, the specification does not demonstrate that applicant has made an invention that achieves the claimed function because the invention is not described with sufficient detail such that one of ordinary skill in the art can reasonably conclude that the inventor had possession of the claimed invention.
Claims 9, 14-16, 18, and 22 are rejected due to their dependence on a claim lacking adequate written description.

The following is a quotation of 35 U.S.C. 112(b):

(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1-5, 8-9, 15-16, and 21-22 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.
The claims recite the following limitations for which there is a lack of antecedent basis:
In claim 1, line 16, “the input hardware”.  Does applicant mean --the input circuitry-- from the previous paragraph?
In claim 1, lines 19-20, “the instruction”.  There is an instruction in lines 12-13 and another in line 19.
In claims 2-3, “the instruction” for similar reasons.
In claim 15, each instance of “the instruction”.  There is an instruction in claim 10, line 2, and another in claim 10, line 6.
Referring to claim 21, the claim limitation “switching logic…to compact…” invokes 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.  However, as explained above, the written description fails to disclose the corresponding structure, material, or acts for performing the entire claimed function and to clearly link the structure, material, or acts to the function.  Therefore, the claim is indefinite and is rejected under 35 U.S.C. 112(b) or pre-AIA  35 U.S.C. 112, second paragraph.
Applicant may:
(a)        Amend the claim so that the claim limitation will no longer be interpreted as a limitation under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph; 
(b)        Amend the written description of the specification such that it expressly recites what structure, material, or acts perform the entire claimed function, without introducing any new matter (35 U.S.C. 132(a)); or 
(c)        Amend the written description of the specification such that it clearly links the structure, material, or acts disclosed therein to the function recited in the claim, without introducing any new matter (35 U.S.C. 132(a)).
If applicant is of the opinion that the written description of the specification already implicitly or inherently discloses the corresponding structure, material, or acts and clearly links them to the function so that one of ordinary skill in the art would recognize what structure, material, or acts perform the claimed function, applicant should clarify the record by either: 
(a)        Amending the written description of the specification such that it expressly recites the corresponding structure, material, or acts for performing the claimed function and clearly links or associates the structure, material, or acts to the claimed function, without introducing any new matter (35 U.S.C. 132(a)); or 
(b)        Stating on the record what the corresponding structure, material, or acts, which are implicitly or inherently set forth in the written description of the specification, perform the claimed function. For more information, see 37 CFR 1.75(d) and MPEP §§ 608.01(o) and 2181.
Claims 2-5, 8-9, 16, and 22 are rejected due to their dependence on an indefinite claim.

Claim Rejections - 35 USC § 102/103
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim 17 is rejected under 35 U.S.C. 102(a)(1) as anticipated by or, in the alternative, under 35 U.S.C. 103 as obvious over Vaidya et al., U.S. Patent Application Publication No. 2014/0181477 A1 (herein referred to as Vaidya).
Claim 17 is partly rejected for similar reasons set forth in the rejection of claims 10 and 13 below (i.e., the steps performed by the hardware circuitry in response to receiving an instruction in claim 17 are mostly rejected for similar reasons as claims 10 and/or 13).  Vaidya has further taught:
a) a data processing system comprising:
a1) a memory device (e.g. any of FIG.5, 130; FIG.6, 213; FIG.8, 740; FIG.9, 1110, 1175; and FIG.12, 832, 834, 828); and
a2) a graphics processor (see paragraph [0040] and FIG.5, 120) comprising one or more hardware tiles (FIG.5, 125_0-n) including processing resources having a multi-lane parallel processor architecture (each execution unit 125 is shown in FIG.6 to have a SIMD ALU 250, which is a part of a multi-lane parallel processor architecture that executes a SIMDx instruction (paragraph [0038])) and hardware circuitry configured to compact diverged processor lanes (see the title, paragraphs [0002] and [0016], and FIGs.2 and 4.  The system may use BCC and/or SCC to compact/compress divergent lanes so as to reduce disabled lanes.  SCC involves permutation, e.g. shuffling or swizzling (performed by FIG.6, 240)), wherein the hardware circuitry includes an arithmetic logic unit (ALU) (FIG.6, ALU 250) including a first number of logical processor lanes (as shown in FIGs.2-4, the ALU has 16 logical lanes to handle SIMD16 instructions) and a second number of physical processor lanes (from paragraph [0045] and FIGs.2-4, the ALU may have four physical lanes (SIMD4)), the first number is a multiple of the second number (SIMD16 is 4x SIMD4), and the ALU is configured to process the logical processor lanes over multiple clock cycles when active logical processor lanes outnumber physical processor lanes (see FIGs.2-4, which show that multiple ALU cycles are required to execute active lanes when they outnumber inactive lanes (e.g. in FIG.2, for instruction 35, there are 12 active lanes and 4 inactive lanes, and these lanes take 3 cycles to execute (cycles T+1, T+2, T+3)))).
Note that based on reasoning in part (g1) of the rejection of claim 10, this would be a 102 rejection.  Based on reasoning in part (g2) of the rejection of claim 10, this would be a 103 rejection.
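The cycle counts relied upon above can be checked with a minimal sketch.  It assumes a hypothetical issue model in which the 16 logical lanes are issued in groups of four and a group with no active lane is skipped.  The mask below is a placeholder chosen only to have 12 active and 4 inactive lanes, and the function names are illustrative rather than taken from Vaidya.

```python
from math import ceil

def cycles_skipping_empty_groups(mask, physical_lanes):
    """Cycles needed when each run of `physical_lanes` consecutive logical
    lanes is issued in its own cycle and all-inactive runs are skipped
    (an assumed model, not Vaidya's disclosed circuit)."""
    groups = [mask[i:i + physical_lanes]
              for i in range(0, len(mask), physical_lanes)]
    return sum(1 for group in groups if any(group))

def cycles_if_fully_compacted(mask, physical_lanes):
    """Lower bound: active lanes packed densely onto the physical lanes."""
    return ceil(sum(mask) / physical_lanes)

# Placeholder SIMD16 mask: 12 active lanes followed by 4 inactive lanes.
mask = [1] * 12 + [0] * 4
print(cycles_skipping_empty_groups(mask, 4))  # 3
print(cycles_if_fully_compacted(mask, 4))     # 3

# Alternating mask (even lanes active): every group of four has an active
# lane, so skipping alone does not help, while dense packing halves the count.
alternating = [1, 0] * 8
print(cycles_skipping_empty_groups(alternating, 4))  # 4
print(cycles_if_fully_compacted(alternating, 4))     # 2
```

In either case the active logical lanes outnumber the four physical lanes, so more than one cycle is required.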

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1 and 21-22 are rejected under 35 U.S.C. 103 as being unpatentable over Vaidya et al., U.S. Patent Application Publication No. 2014/0181477 A1 (herein referred to as Vaidya).
Referring to claim 1, Vaidya has taught an accelerator device (FIG.5, 120) comprising:
a) a host interface (FIG.5, 140, and paragraph [0040], which states that the interface couples the graphics domain to a core domain or system agent (either of which is a “host”));
b) a fabric interconnect coupled with the host interface (see paragraph [0040].  Core domain and/or system agent is interfaced with the accelerator over an interconnect); and
c) one or more hardware tiles (FIG.5, 125_0-n) coupled with the fabric interconnect, wherein the one or more hardware tiles include processing resources having a multi-lane parallel processor architecture (each execution unit 125 is shown in FIG.6 to have a SIMD ALU 250, which is a part of a multi-lane parallel processor architecture that executes a SIMDx instruction (paragraph [0038])) and hardware circuitry configured to compact diverged processor lanes (see the title, paragraphs [0002] and [0016], and FIGs.2 and 4.  The system may use BCC and/or SCC to compact/compress divergent lanes so as to reduce disabled lanes.  SCC involves permutation, e.g. shuffling or swizzling (performed by FIG.6, 240)), wherein the hardware circuitry includes:
c1) an arithmetic logic unit (ALU) (FIG.6, ALU 250) having multiple processor lanes, the multiple processor lanes including a first number of logical processor lanes (as shown in FIGs.2-4, the ALU has 16 logical lanes to handle SIMD16 instructions) and a second number of physical processor lanes (from paragraph [0045] and FIGs.2-4, the ALU may have four physical lanes (SIMD4)), wherein the first number is a multiple of the second number (SIMD16 is 4x SIMD4), and the ALU is configured to process the logical processor lanes over multiple clock cycles when active logical processor lanes for an instruction outnumber physical processor lanes (see FIGs.2-4, which show that multiple ALU cycles are required to execute active lanes when they outnumber inactive lanes (e.g. in FIG.2, for instruction 35, there are 12 active lanes and 4 inactive lanes, and these lanes take 3 cycles to execute (cycles T+1, T+2, T+3))));
c2) input circuitry including input data channels corresponding to the first number of logical processor lanes (see FIG.7B and paragraph [0047].  The input circuitry may be registers 320, latch 330, wires from registers 320 to latch 330, and/or wires from latch 330 to crossbar 340.  A channel is simply circuitry through which data flows.  As this circuitry sends data for use by the logical/physical lanes, the channels correspond to the number of logical lanes);
c3) first crossbar switch circuitry coupled with the input hardware (see FIG.6, circuitry 240 and FIG.7B, 340, which are coupled with the input circuitry), the first crossbar switch circuitry configured to, based on a predicate mask for an instruction received for execution and during execution of the instruction (see paragraph [0043].  Note that at least components 240-260 may be part of execution of the instruction), provide input associated with a second set of logical processor lanes to physical processor lanes associated with a first set of logical processor lanes (see FIG.4.  Again, based on a mask, input associated with second logical lanes 4 and 6 is provided as input to physical lanes 1 and 3 of the SIMD4 pipe, which are associated with first set of logical lanes 1 and 3 (the first set of logical lanes includes those that are issued to the four physical lanes first (cycle T))); and
c4) second circuitry configured to provide output from the physical processor lanes associated with the first set of logical lanes to memory associated with the second set of logical processor lanes (from FIG.6, circuitry 260 outputs data from the ALU’s physical lanes (which again are associated with logical lanes 0-3 in cycle T) to memory via write-back 270.  From paragraph [0043], unswizzle 260 is the inverse of swizzle 240.  Thus, the outputs from lanes 1 and 3 will be provided back to logical lanes 4 and 6 for storage).
d) While Vaidya has not explicitly taught that the second circuitry is second crossbar switch circuitry, again note that paragraph [0043] sets forth that the output circuitry performs the inverse of the input circuitry, the latter being disclosed as including crossbar circuitry 340 (FIG.7B).  Thus, as one of skill in the art would have recognized that an output crossbar could perform the inverse of the input crossbar, this would have been a natural implementation for the second circuitry.  Crossbar circuitry is useful as it allows any input to be switched to any output, thereby maximizing flexibility in transmission of data.  Consequently, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the second circuitry to be second crossbar switch circuitry.
e) With respect to the ALU comprising the input circuitry, the first crossbar switch circuitry, the second crossbar switch circuitry, and the memory:
e1) Under a first interpretation, one can draw a box around all of these components and call it the ALU.  It would be a collective unit that feeds data and instructions for arithmetic/logical execution and stores a result.
e2) Under a second interpretation, even if Vaidya can’t be said to teach an ALU including all of these components, this amounts to a rearrangement of parts or integrating components into a single unit.  Such actions are routine expedients, not patentable distinctions.  In other words, Vaidya already teaches all claimed parts.  Whether they are part of an ALU or separate from an ALU, the operation of Vaidya would not change.  See MPEP 2144.04, first paragraph, and sections V(B) and VI(C).  As a result, as applicant has not demonstrated the criticality of the claimed components being within the ALU, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, based on the established case law, to modify Vaidya such that the ALU comprises the input circuitry, the first crossbar switch circuitry, the second crossbar switch circuitry, and the memory.  In other words, it is obvious to move circuitry that appears before and after the ALU into the ALU.  There is no functional difference.
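The relationship between the input crossbar and its inverse on the output side, as discussed in parts (c3), (c4), and (d), can be sketched as a gather followed by a scatter over the same routing table.  This is a minimal illustration under stated assumptions: the function names, the operand values, and the doubling operation standing in for the ALU are hypothetical, and logical lanes 0 and 2 are assumed to remain on physical lanes 0 and 2 while lanes 4 and 6 are routed to physical lanes 1 and 3 as described for FIG.4.

```python
def crossbar_in(operands, route):
    """Gather: physical lane p receives the operand of logical lane route[p]."""
    return [operands[lane] for lane in route]

def crossbar_out(results, route, memory):
    """Scatter (inverse of crossbar_in): the result from physical lane p is
    written to the memory slot of logical lane route[p]."""
    for p, lane in enumerate(route):
        memory[lane] = results[p]
    return memory

# Placeholder operands, one per logical lane of a SIMD16 instruction.
operands = [100 + lane for lane in range(16)]

# Assumed cycle-T routing: logical lanes 4 and 6 feed physical lanes 1 and 3.
route = [0, 4, 2, 6]

# Placeholder ALU operation (doubling), applied on the four physical lanes.
alu_out = [x * 2 for x in crossbar_in(operands, route)]

memory = crossbar_out(alu_out, route, [None] * 16)
print(memory[:8])  # [200, None, 204, None, 208, None, 212, None]
```

Because the output side reuses the routing table of the input side, the results computed on physical lanes 1 and 3 land in the storage associated with logical lanes 4 and 6, and the inactive lanes are left untouched.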
Referring to claim 21, Vaidya has taught a graphics processor comprising:
a) one or more hardware tiles (FIG.5, 125_0-n) including processing resources having a multi-lane parallel processor architecture (each execution unit 125 is shown in FIG.6 to have a SIMD ALU 250, which is a part of a multi-lane parallel processor architecture that executes a SIMDx instruction (paragraph [0038])) and hardware circuitry configured to compact diverged processor lanes during an execution stage of an instruction pipeline (see the title, paragraphs [0002] and [0016], and FIGs.2 and 4.  The system may use BCC and/or SCC to compact/compress divergent lanes so as to reduce disabled lanes.  SCC involves permutation, e.g. shuffling or swizzling (performed by FIG.6, 240).  Also, “pipeline” is mentioned many times in Vaidya, as a pipeline is used to process instructions.  The examiner notes that an execution stage can be considered a stage where compaction occurs.  Compaction is related to execution as it determines which logical lanes are mapped to the physical lanes of the ALU for execution.  Thus, a compaction stage is an execution stage at least for this reason),
b) wherein the hardware circuitry includes an arithmetic logic unit (ALU) (FIG.6, at least ALU 250, though the ALU can be said to comprise surrounding components which relate to ALU processing) having sixteen logical processing lanes (as shown in FIGs.2-4, the ALU has 16 logical lanes to handle SIMD16 instructions) and four physical processing lanes (from paragraph [0045] and FIGs.2-4, the ALU may have four physical lanes (SIMD4)).  Vaidya has, thus, not taught eight physical processing lanes.  However, changing the physical size is considered a routine expedient that does not amount to a patentable distinction.  See MPEP 2144.04(IV)(A).  The examiner notes that one of ordinary skill in the art would have recognized the scalability of Vaidya to work with various combinations of ALU and instruction widths.  The examples in FIGs.2-4 illustrate 4 physical lanes.  The number of physical lanes could be trivially doubled to 8 to increase throughput (a wider ALU can execute more at once).  As a result, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Vaidya such that the ALU has eight physical processing lanes.
c) Vaidya, as modified, has further taught a hardware input circuit to store sixteen data elements (see FIG.7B, registers 320 and/or latch 330.  Such may be considered included by (or part of) the ALU);
d) Vaidya, as modified, has further taught a first crossbar switch circuit (see FIG.7B, 340, which again may be considered as included by (or part of) the ALU) configured to route data elements from the hardware input circuit to the eight physical processing lanes of the ALU (see paragraph [0047]), wherein the first crossbar switch circuit includes switching logic configured to compact, during the execution stage, active processing lanes in an upper half of the sixteen logical processing lanes into inactive lanes in a lower half of the sixteen logical processing lanes (as modified, with eight physical lanes and sixteen logical lanes, any element that is inactive in the lower half would be utilized by an active logical lane in the upper half (assuming an inactive lane in the lower half).  This is within the scope of teachings of Vaidya and is entirely dependent on the mask, which could take on any value.  For instance, assume the mask is set such that active and inactive lanes are indicated as shown in FIG.4.  With eight physical lanes, active lanes 8, 10, 12, and 14 would be moved to inactive lanes 1, 3, 5, and 7, respectively, and the entire workload would be executed in one cycle); and
e) Vaidya, as modified, has further taught second circuitry configured to provide output from the ALU, the second crossbar switch circuitry configured to de-compact the active processing lanes in the upper half of the sixteen logical processing lanes (see paragraph [0043] and FIG.6, 260, which again may be considered as included by (or part of) the ALU.  After processing, the output associated with the active lanes of the upper half is unswizzled/de-compacted for storage into a register file).  While Vaidya has not explicitly taught that the second circuitry is second crossbar switch circuitry, again note that paragraph [0043] sets forth that the output circuitry performs the inverse of the input circuitry, the latter being disclosed as including crossbar circuitry 340 (FIG.7B).  Thus, as one of skill in the art would have recognized that an output crossbar could perform the inverse of the input crossbar, this would have been a natural implementation for the second circuitry.  Crossbar circuitry is useful as it allows any input to be switched to any output, thereby maximizing flexibility in transmission of data.  Consequently, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the second circuitry to be second crossbar switch circuitry.
f) Vaidya has further taught a hardware output circuit (see FIG.6, circuit 270) to output data elements received from the second crossbar switch circuitry (circuit 270 writes data received from second crossbar 260 for storage to a register file (paragraph [0045])).
g) With respect to the ALU including the hardware input circuit, the first crossbar switch circuitry, the second crossbar switch circuitry, and the hardware output circuit:
g1) Under a first interpretation, one can draw a box around all of these components and call it the ALU.  It would be a collective unit that feeds data and instructions for arithmetic/logical execution and stores a result.
g2) Under a second interpretation, even if Vaidya can’t be said to teach an ALU including all of these components, this amounts to a rearrangement of parts or integrating components into a single unit.  Such actions are routine expedients, not patentable distinctions.  In other words, Vaidya already teaches all claimed parts.  Whether they are part of an ALU or separate from an ALU, the operation of Vaidya would not change.  See MPEP 2144.04, first paragraph, and sections V(B) and VI(C).  As a result, as applicant has not demonstrated the criticality of the claimed components being within the ALU, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, based on the established case law, to modify Vaidya such that the ALU comprises the hardware input circuit, the first crossbar switch circuitry, the second crossbar switch circuitry, and the hardware output circuit.
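The eight-lane example given in part (d) of the rejection of claim 21 can be sketched as follows.  The routine is a hypothetical model, not circuitry disclosed by Vaidya or by applicant: it assumes 16 logical lanes, 8 physical lanes, and the alternating active/inactive mask described for FIG.4, and it fills each inactive slot in the lower half with the next active lane from the upper half.

```python
def compact_upper_into_lower(mask):
    """Return (route, leftover) for one issue cycle.

    route[s] is the logical lane placed on physical slot s (None if unused);
    leftover lists upper-half active lanes that did not fit and would need
    another cycle.  Assumed model for illustration only.
    """
    half = len(mask) // 2
    # Active lower-half lanes stay where they are.
    route = [lane if mask[lane] else None for lane in range(half)]
    free_slots = [s for s in range(half) if not mask[s]]
    upper_active = [lane for lane in range(half, len(mask)) if mask[lane]]
    # Move upper-half active lanes into the inactive lower-half slots.
    for slot, lane in zip(free_slots, upper_active):
        route[slot] = lane
    return route, upper_active[len(free_slots):]

# Even lanes active, odd lanes inactive, as described for FIG.4.
mask = [True, False] * 8
route, leftover = compact_upper_into_lower(mask)
print(route)     # [0, 8, 2, 10, 4, 12, 6, 14]
print(leftover)  # []
```

With this mask, lanes 8, 10, 12, and 14 occupy slots 1, 3, 5, and 7, nothing is left over, and the whole workload fits in a single cycle, consistent with the example above.  A fully active mask would instead leave lanes 8 through 15 for a second cycle.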
Referring to claim 22, Vaidya, as modified, has taught the graphics processor as in claim 21, wherein the ALU is a SIMD ALU (FIG.6, 250) and each of the sixteen logical processing lanes are mappable to a single instruction multiple data (SIMD) channel (see FIGs.2-4 for examples.  Any active lane will ultimately be mapped to one of the eight physical channels (as modified) of the ALU) or a thread of a single instruction multiple thread (SIMT) instruction.
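For illustration only, the inverse relationship between the input crossbar and the output crossbar relied upon above for claim 21 can be sketched in Python.  The sketch assumes that a crossbar is modeled as a selection list in which output position i takes its value from input position select[i], and that the selection is a permutation.  The names apply_crossbar and invert_selection, and the eight-lane selection shown, are hypothetical and are not taken from Vaidya.

```python
def apply_crossbar(select, inputs):
    """Route inputs so that output position i receives inputs[select[i]]."""
    return [inputs[src] for src in select]


def invert_selection(select):
    """Build the selection that undoes `select` (assumes a permutation)."""
    inverse = [0] * len(select)
    for out_pos, src in enumerate(select):
        inverse[src] = out_pos
    return inverse


# Hypothetical input-side setting for eight lanes (an assumption, not Vaidya's).
select = [0, 4, 2, 6, 1, 5, 3, 7]
data = ["a", "b", "c", "d", "e", "f", "g", "h"]

compacted = apply_crossbar(select, data)                        # input crossbar
restored = apply_crossbar(invert_selection(select), compacted)  # output crossbar
```

Under these assumptions, applying the inverted selection at the output returns every element to its original lane position, which is the sense in which an output crossbar can perform the inverse of the input crossbar.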

Claims 2-5, 8-10, 12-16, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Vaidya in view of the examiner’s taking of Official Notice.
Referring to claim 2, Vaidya has taught the accelerator device as in claim 1, wherein the host interface is configured to communicatively couple the accelerator device to a processor of a host computing device (again, see paragraph [0040].  Also, see FIG.8, which couples a host (e.g. multi-core processor) to GPU 720.  Alternatively, see FIG.10, which couples a graphics processor 875 to a host core 874a,b).  Vaidya has not taught that the host interface is configured to receive the instruction for execution by the accelerator device.  However, the examiner notes that it is well known in the art for a host CPU to pass graphics instructions to a graphics processor for more efficient/faster graphics processing.  As such, in order to realize the advantages of Vaidya’s compression/compaction in such an architecture (where a CPU offloads graphics instructions to a GPU), it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Vaidya such that the host interface is configured to receive the instruction for execution by the accelerator device.
Referring to claim 3, Vaidya, as modified, has taught the accelerator device as in claim 2, the one or more hardware tiles further comprising: decode circuitry to decode the instruction into a decoded instruction (FIG.6, instruction decoder 215), the decoded instruction associated with the predicate mask (see paragraphs [0027]-[0029] and [0042]-[0043]), wherein the predicate mask indicates a set of active lanes and a set of inactive lanes (see at least paragraphs [0027]-[0029] and [0042]-[0043].  Vaidya identifies divergent execution (e.g. if/else code, where some elements are active at the “if” part and others at the “else” part) and sets a mask to indicate such.  The mask would indicate active lanes (those unshaded in FIGs.2-4) and inactive lanes (those shaded in FIGs.2-4)), and the hardware circuitry is to map active lanes in a second portion of logical processor lanes to inactive lanes in a first portion of logical processor lanes during the execution of the decoded instruction (see FIG.4 and the description thereof.  For instance, it is determined that the odd lanes are inactive for an instruction (FIG.4, 50a).  In response, active lanes 4, 6, 12, and 14 are mapped to lanes 1, 3, 9, and 11 (FIG.4, 50b) as part of SCC compaction.  Again, this mapping of lanes can be considered part of execution).
Referring to claim 4, Vaidya, as modified, has taught the accelerator device as in claim 3, but has not taught wherein one or more logical processor lanes of the ALU are configured as a thread processor to execute a thread of a single instruction multiple thread (SIMT) instruction and the predicate mask is to indicate at least one active thread of the SIMT instruction and at least one inactive thread of the SIMT instruction.  However, SIMT is known in the art as the thread equivalent of SIMD.  With SIMT, multiple threads use parallel hardware to execute the same instruction in parallel on their own data sets.  Thus, SIMT is useful when the same task needs to be performed repeatedly on different data sets.  It is known that SIMT experiences similar divergent execution based on masking and would benefit from the compaction disclosed by Vaidya.  As such, in order to realize the benefits of SIMT, which is compatible with the invention of Vaidya, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Vaidya such that one or more logical processor lanes of the ALU are configured as a thread processor to execute a thread of a single instruction multiple thread (SIMT) instruction and the predicate mask is to indicate at least one active thread of the SIMT instruction and at least one inactive thread of the SIMT instruction.
Referring to claim 5, Vaidya, as modified, has taught the accelerator device as in claim 4, wherein the hardware circuitry is configured to: map input associated with the at least one active thread of the SIMT instruction to a logical lane associated with the at least one inactive thread of the SIMT instruction (this would be the purpose of circuitry 240 of FIG.7B.  To compact the active threads together, an active thread would be mapped to a logical lane currently associated with an inactive thread (e.g. in FIG.4, active threads 4 and 6 would be mapped to lanes 1 and 3, which are associated with inactive threads 1 and 3)).
Referring to claim 8, Vaidya, as modified, has taught the accelerator device as in claim 5, wherein the one or more hardware tiles are configured to: configure the first crossbar switch circuitry to sequentially compact the diverged processor lanes into contiguous logical processor lanes based on the predicate mask while maintaining a default of instruction-specified data channel swizzle setting for input operands (see FIG.4 and note that logical lanes 0, 4, 2, and 6 are compacted into contiguous logical lanes 0, 1, 2, and 3, respectively.  Similarly, note that logical lanes 8, 12, 10, and 14 are compacted into contiguous logical lanes 8, 9, 10, and 11, respectively.  The compaction of FIG.4, for instance, is sequential in nature.  That is, from left to right, the active lanes beyond lanes 0-3 are mapped in left-to-right (sequential) fashion in the inactive lanes in 0-3.  For example, after lane 3, the first active lane is lane 4.  This is compacted to first inactive lane 1.  After lane 3, the second active lane is lane 6.  This is compacted to second inactive lane 3.  This is sequential compacting that is designed for left-to-right compaction.  Note from paragraph [0043] that a swizzle setting is computed based on the mask to carry out the compaction.  The mask and/or the swizzle setting is a data channel swizzle setting and it is also instruction-specified because the setting can only be derived based on instruction execution.  There is only one swizzle setting for a corresponding compaction.  As such, the swizzle setting is maintained and not changed); process the contiguous logical processor lanes over a reduced number of clock cycles (the above compacting allows for processing over a reduced number of cycles (see FIG.4) within four physical lanes of the ALU); and configure the second crossbar switch circuitry to sequentially de-compact the diverged processor lanes (again, see paragraph [0043] and FIG.6, 260).
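For illustration only, the left-to-right compaction described above for FIG.4 can be sketched in Python.  This is a minimal sketch of the mapping as characterized in this action, not a model of Vaidya's circuitry.  It assumes four physical lanes, that each group of four lanes draws replacement lanes only from the four lanes that follow it, and that the mask is a list of booleans.  The function name compact_group is hypothetical.

```python
def compact_group(mask, base, width=4):
    """Return, for slots base..base+width-1, the logical lane placed in each slot.

    Active lanes in the first `width` lanes stay in place.  Each inactive slot
    is filled, left to right, by the next active lane taken from the following
    `width` lanes.  A slot with no available donor is left as None.
    """
    donors = [lane for lane in range(base + width, base + 2 * width) if mask[lane]]
    slots = []
    for lane in range(base, base + width):
        if mask[lane]:
            slots.append(lane)           # already active, keep in place
        elif donors:
            slots.append(donors.pop(0))  # first remaining active donor lane
        else:
            slots.append(None)           # nothing left to compact into this slot
    # Donors that do not fit would need a further cycle; that case is
    # outside the scope of this sketch.
    return slots


# Mask corresponding to the FIG.4 example: even lanes active, odd lanes inactive.
mask = [lane % 2 == 0 for lane in range(16)]
first_cycle = compact_group(mask, 0)    # [0, 4, 2, 6]
second_cycle = compact_group(mask, 8)   # [8, 12, 10, 14]
```

With the even-lanes-active mask, the sketch reproduces the lane orderings noted above: lanes 0, 4, 2, and 6 occupy slots 0-3, and lanes 8, 12, 10, and 14 occupy slots 8-11.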
Referring to claim 9, Vaidya, as modified, has taught the accelerator device as in claim 8, wherein the ALU includes integer and floating-point logic (from paragraph [0045], “various integer and floating point instructions can be performed in the floating point ALU”).
Referring to claim 10, Vaidya has taught a method comprising:
a) receiving an instruction (e.g. FIG.4, 50a) having predicated data elements (from FIG.4, the odd elements are turned off, and the even elements are turned on based on a predicate/mask (e.g. paragraphs [0002], [0015], [0027], etc.));
b) Vaidya has not taught that the instruction is a single instruction multiple thread (SIMT) instruction.  However, SIMT is known in the art as the thread equivalent of SIMD.  With SIMT, multiple threads use parallel hardware to execute the same instruction in parallel on their own data sets.  Thus, SIMT is useful when the same task needs to be performed repeatedly on different data sets.  It is known that SIMT experiences similar divergent execution based on masking and would benefit from the compaction disclosed by Vaidya.  As such, in order to realize the benefits of SIMT, which is compatible with the invention of Vaidya, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Vaidya such that the instruction is a single instruction multiple thread (SIMT) instruction.
c) Vaidya, as modified, has further taught determining, via a predication mask associated with the instruction, a set of inactive threads for the instruction (again, based on a mask, it is determined that the off threads are inactive, meaning they are indicated as not satisfying the “IF” condition of the program.  As an example, in FIG.4, as modified, the mask would indicate that the odd threads are off/inactive and that the even threads are on/active); and
d) Vaidya, as modified, has further taught compacting data elements associated with active threads into processing lanes associated with inactive threads to create a contiguous set of active processing lanes (see FIG.4.  Active threads 4 and 6, and their data elements, are compacted into lanes 1 and 3 to create contiguous active lanes), wherein the contiguous set of active processing lanes are processing lanes of the multi-lane ALU (see FIG.6, 250.  Lanes 0, 4, 2, and 6 have been compacted to execute on physical lanes 0, 1, 2, and 3, of the ALU in cycle T (FIG.4)) and compacting the data elements associated with the active threads into the processing lanes associated with the inactive threads includes compacting active data elements into the processing lanes associated with the inactive threads (this is inherent and redundant and, thus, taught for similar reasons set forth above);
e) Vaidya, as modified, has further taught performing a processing operation on the contiguous set of active processing lanes (again, see cycle T in FIG.4); and
f) Vaidya, as modified, has further taught de-compacting output of the processing operation into output memory, wherein de-compacting output of the processing operation into the output memory includes de-compacting output of the processing operation into the output memory (see FIG.6, 260 and paragraph [0043].  After processing, the output is unswizzled/de-compacted for storage into a register file (this is the output memory of the ALU)).
g) With respect to the limitation that the compacting, performing, and de-compacting steps are performed during execution of a decoded instruction at a multi-lane arithmetic logic unit (ALU):
g1) Under a first interpretation, one can draw a box around all of these components and call it the ALU.  It would be a collective unit that feeds data and instructions for arithmetic/logical execution and stores a result.  Any functionality performed by this collective unit is part of execution.
g2) Under a second interpretation, even if Vaidya can’t be said to teach an ALU including all of these components, this amounts to a rearrangement of parts or integrating components into a single unit.  Such actions are routine expedients, not patentable distinctions.  In other words, Vaidya already teaches all claimed parts.  Whether they are part of an ALU or separate from an ALU, the operation of Vaidya would not change.  See MPEP 2144.04, first paragraph, and sections V(B) and VI(C).  As a result, as applicant has not demonstrated the criticality of the claimed components being within the ALU, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, based on the established case law, to modify Vaidya such that the compacting, performing, and de-compacting steps are performed during execution of a decoded instruction at a multi-lane arithmetic logic unit (ALU).
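For illustration only, steps (c) through (f) above can be sketched end to end in Python: active lanes are determined from the mask, their data elements are gathered into groups no wider than the physical ALU, the operation is performed, and each result is written back to its original position in output memory.  The sketch assumes four physical lanes and simplifies the ordering of lanes within a cycle (active lanes are packed in ascending order rather than in the FIG.4 slot order).  The name execute_masked and its parameters are hypothetical.

```python
def execute_masked(op, data, mask, output, physical_lanes=4):
    """Apply `op` to the active elements of `data`; return the cycles used.

    `output` is updated in place.  Positions of inactive lanes are untouched.
    """
    active = [lane for lane, on in enumerate(mask) if on]  # determine active set
    cycles = 0
    for start in range(0, len(active), physical_lanes):
        lanes = active[start:start + physical_lanes]
        operands = [data[lane] for lane in lanes]      # compaction (input side)
        results = [op(value) for value in operands]    # processing operation
        for lane, value in zip(lanes, results):        # de-compaction (output side)
            output[lane] = value
        cycles += 1
    return cycles


# Example corresponding to FIG.4: even lanes active, odd lanes inactive.
mask = [lane % 2 == 0 for lane in range(16)]
data = list(range(16))
output = [None] * 16
cycles = execute_masked(lambda x: x * 10, data, mask, output)
```

In this example the eight active elements are processed in two passes of four, and the entries of output for the inactive odd lanes are left unchanged.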
Referring to claim 12, Vaidya, as modified, has taught the method as in claim 10, wherein the output memory is an output buffer or output register of the multi-lane ALU (from paragraph [0034], write-back occurs to a register file (which is an output register, or includes an output register)).
Referring to claim 13, Vaidya, as modified, has taught the method as in claim 10, wherein compacting active data elements into the processing lanes associated with the inactive threads includes configuring a crossbar (FIG.7B, 340) to map active input data elements associated with a second set of processing lanes to processing lanes in a first set of processing lanes while maintaining a default of instruction-specified data channel swizzle setting (see paragraph [0047].  Basically, in FIG.4, since active thread 4 is being mapped to inactive lane 1, the data corresponding to lane 4 must be switched through the cross-bar in order to be provided to lane 1.  Further, note from paragraph [0043] that a swizzle setting is computed based on the mask to carry out the compaction.  The mask and/or the swizzle setting is a data channel swizzle setting and it is also instruction-specified because the setting can only be derived based on instruction execution.  There is only one swizzle setting for a corresponding compaction.  As such, the swizzle setting is maintained and not changed), the processing lanes in the first set of processing lanes associated with inactive threads (the first (odd) set of lanes are associated with inactive elements (denoted by shading, per paragraph [0023])).
Referring to claim 14, Vaidya, as modified, has taught the method as in claim 13, wherein the multi-lane ALU is a single instruction multiple data (SIMD) ALU including a first number of logical SIMD lanes and a second number of physical SIMD lanes (from FIGs.4 and 6 and paragraph [0045], the SIMD ALU has four physical lanes and sixteen logical lanes such that a SIMD4 ALU can execute SIMD16 instructions (i.e., a 16-thread SIMT instruction)), the first number is a multiple of the second number (16 is a multiple of 4), and the SIMD ALU processes logical SIMD lanes over multiple clock cycles when active logical SIMD lanes outnumber physical SIMD lanes (see FIGs.2-4, which show that multiple ALU cycles are required to execute active lanes when they outnumber inactive lanes (e.g. in FIG.2, for instruction 35, there are 12 active lanes and 4 inactive lanes, and these lanes take 3 cycles to execute (cycles T+1, T+2, T+3).  In FIG.4, active lanes 0, 2, 4, 6, 8, 10, 12, and 14 are executed over two cycles (T and T+1))), and wherein the SIMD ALU is mapped to multiple SIMD threads (whatever hardware simultaneously processes the different threads as shown in FIG.4, as modified, would be hardware that makes up the SIMD ALU).
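For illustration only, the cycle counts cited above are consistent with a ceiling division of the number of active lanes by the number of physical lanes.  The short Python sketch below assumes ideal packing, i.e., that every cycle other than the last is completely filled with active lanes; the function name cycles_needed is hypothetical.

```python
def cycles_needed(active_lanes, physical_lanes):
    """Ceiling of active_lanes / physical_lanes using integer arithmetic."""
    return -(-active_lanes // physical_lanes)


fig2_cycles = cycles_needed(12, 4)  # 12 active lanes on 4 physical lanes
fig4_cycles = cycles_needed(8, 4)   # 8 active lanes on 4 physical lanes
```

Under this assumption, the FIG.2 example (12 active lanes) yields 3 cycles and the FIG.4 example (8 active lanes) yields 2 cycles, matching the counts noted above.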
Referring to claim 15, Vaidya, as modified, has taught the method as in claim 14, wherein performing the processing operation on the contiguous set of active processing lanes includes bypassing execution of the inactive threads for the instruction multiple logical SIMD lanes and processing the instruction in a reduced number of clock cycles (see FIGs.2 and 4 for examples.  For instance, in FIG.4, execution of inactive threads 1 and 3 is bypassed so that active lanes 4 and 6 can be executed.  This reduces the number of execution cycles).
Referring to claim 16, Vaidya, as modified, has taught the method as in claim 15, wherein the multi-lane ALU is a SIMD16 ALU having sixteen logical lanes (from paragraph [0045], while the ALU is disclosed as a SIMD4 ALU, this is only in the physical sense because it includes four physical lanes.  However, it is also a SIMD16 ALU in the logical sense because it executes instructions having 16 logical lanes (see FIGs.2-4)), and the SIMD16 ALU is configurable to execute 16 SIMT threads (the SIMD16 ALU will execute all 16 threads, over 4 cycles, if the mask indicates that all 16 threads are to be executed.  Any number of threads may be executed as this is dependent on program conditions).  Vaidya has not taught that the ALU has eight physical lanes (again, Vaidya has instead taught four physical lanes).  However, changing the physical size is considered a routine expedient that does not amount to a patentable distinction given that applicant has not demonstrated the criticality of the size.  See MPEP 2144.04(IV)(A).  The examiner notes that one of ordinary skill in the art would have recognized the scalability of Vaidya to work with various combinations of ALU and instruction widths.  The examples in FIGs.2-4 illustrate 16 logical lanes and 4 physical lanes.  The number of physical lanes could be trivially doubled to 8 to increase throughput (a wider ALU can execute more at once).  As a result, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Vaidya such that the ALU has eight physical lanes.
Referring to claim 18, Vaidya has taught the data processing system as in claim 17, wherein the output memory is an output register (from paragraph [0034], write-back occurs to a register file (which is an output register, or includes an output register)), but has not taught wherein the ALU includes 32 logical processor lanes, and the ALU is configured to execute 32 threads of a single instruction multiple thread (SIMT) instruction.  However, SIMT is known in the art as the thread equivalent of SIMD.  With SIMT, multiple threads use parallel hardware to execute the same instruction in parallel on their own data sets.  Thus, SIMT is useful when the same task needs to be performed repeatedly on different data sets.  It is known that SIMT experiences similar divergent execution based on masking and would benefit from the compaction disclosed by Vaidya.  As such, in order to realize the benefits of SIMT, which is compatible with the invention of Vaidya, it would have first been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Vaidya such that the ALU is configured to execute threads of a single instruction multiple thread (SIMT) instruction.  Further, Vaidya has taught 16 logical lanes (FIG.4), not 32.  However, changing the size is considered a routine expedient that does not amount to a patentable distinction.  See MPEP 2144.04(IV)(A).  The examiner notes that one of ordinary skill in the art would have recognized the scalability of Vaidya to work with various combinations of ALU and instruction widths.  Thus, the number of logical lanes could be trivially doubled to 32 to allow for more threads, i.e., more parallel execution, to increase throughput.  This may be paired with an increase in physical ALU size (i.e., the four physical lanes of FIG.4 could also be doubled to 8 to carry out more work at once).  
As a result, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Vaidya such that the ALU includes 32 logical processor lanes, and the ALU is configured to execute 32 threads of a single instruction multiple thread (SIMT) instruction (the ALU will execute all 32 threads if the mask indicates that all 32 threads are to be executed.  Any number of threads may be executed as this is dependent on program conditions).

Response to Arguments
On pages 10-11 of applicant’s response, applicant argues that Vaidya has not taught performing steps of claim 17 during execution of the instructions at the ALU.
Under one interpretation, the examiner disagrees, as applicant has merely named a collection of components that performs the steps in question an ALU.  This is not a patentable distinction.  Under a second interpretation, it is obvious to integrate or rearrange components in such a way that an ALU comprises each of the components that perform the steps in question.  As such, for multiple reasons, the examiner asserts this is not a patentable distinction.

On page 11 of applicant’s response, applicant argues that a swizzle setting is not maintained in Vaidya.
The examiner disagrees for reasons set forth in the rejections above.

On pages 12-13 of applicant’s response, applicant argues that Vaidya has not taught that the claimed input hardware, first crossbar, and second crossbar are part of the ALU.
This is not persuasive for reasons set forth above for the first argument.  That is, the claimed collection of components can be referred to as the ALU.  Even if this naming were impossible (which the examiner does not believe to be the case), it is obvious for these components to be moved from outside the ALU to within the ALU.

Conclusion
The following prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Adelman, 5,598,362, has taught an ALU with many different components, including a register file, multiplexers (i.e., switches), control circuitry, etc. (see FIG.2).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to David J. Huisman whose telephone number is 571-272-4168.  The examiner can normally be reached on Monday-Friday, 9:00 am-5:30 pm.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jyoti Mehta, can be reached at 571-270-3995.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov.  Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).  If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/David J. Huisman/
Primary Examiner, Art Unit 2183