Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
1.  Applicant’s arguments, filed May 19th, 2022, with respect to the 35 USC 103 rejection of claim 16 have been fully considered and are persuasive.  Therefore, the rejection has been withdrawn.

2.  Applicant's arguments filed May 19th, 2022, with respect to the 35 USC 103 rejections of claims 1 and 9 have been fully considered but they are not persuasive.
Regarding claim 1, Applicant’s arguments rely on the limitation present in claim 16 regarding a “storage device” not dedicated to any of the plurality of VSPs.  However, this limitation is not present within the amended language of claim 1, and therefore the arguments which are persuasive with respect to claim 16 are not applicable to claim 1.  Therefore, the prior rejection of claim 1 is maintained.

Regarding claim 9, Applicant argues that Chen and Biscondi fail to teach “the first VSP is not configured to receive operands from operand gathering components dedicated to another VSP of the plurality of VSPs” as “[the] operand gathering components…are part of other SIMD blocks via the crossbar 350”.
In response to the above argument, Examiner respectfully disagrees.  The operand gathering components of Chen are clearly shown in Figures 2 and 3 to be dedicated to an individual SIMD block.  Figure 2 shows the source operand flops as a component of operand delivery network 240 within a single SIMD block.  Figure 3 additionally shows the operand delivery networks 240 being dedicated to a single SIMD block.  While an interconnect between the multiple SIMD sub-processors does exist, this interconnect is not equated to the “operand gathering components” of claim 9.  Therefore, Applicant’s arguments are not considered persuasive and the rejections are maintained.

The remainder of Applicant’s arguments rely on the arguments addressed above.  The above responses therefore apply to them as well.


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

3.  Claims 1-2 and 4-15 are rejected under 35 U.S.C. 103 as being unpatentable over Chen et al (US 2018/0121386, herein Chen) in view of Biscondi et al (US 2009/0254718, herein Biscondi).

Regarding claim 1, Chen teaches a processor, comprising:
a plurality of vector sub-processors (VSPs) (Fig 3, [0059], super-SIMDs 200a-d);
a broadcast switch configured to broadcast operands between the plurality of VSPs (Figs 2&3, LDS 420); and
a plurality of memory banks dedicated to respective VSPs of the plurality of VSPs (Figs 1A & 3, [0021], VGPRs 110a-d), wherein a first memory bank dedicated to a first VSP of the plurality of VSPs comprises:
a first plurality of vector general purpose register (VGPR) banks (Fig 1A, [0021], VGPR banks 110a, 110b, any number of VGPRs can be utilized); and
a second plurality of VGPR banks corresponding to the first plurality of VGPR banks (Fig 1A, [0021], VGPR banks 110c, 110d, any number of VGPRs can be utilized);
wherein the first plurality of VGPR banks are configured to send operands to the first VSP without sending the operands through the broadcast switch (Fig 2, [0034], [0038], per-thread VGPRs & [0033], inputs from local VGPRs go directly to ALUs without traveling through the LDS).
	Chen fails to teach wherein the VGPR banks are partitioned into high and low VGPR banks, or wherein the first VSP is not configured to receive data from memory banks dedicated to other VSPs of the plurality of VSPs.
	Biscondi teaches a processor, comprising:
	a plurality of memory banks dedicated to a respective vector processor (Fig 5, [0040], vector memory banks) comprising a first plurality of high vector general purpose (VGPR) banks and a first plurality of low VGPR banks corresponding to the plurality of high VGPR banks (Fig 9, [0057-0059], [0064], concatenated register pairs split high and low order bits between adjacent banks of vector registers); and
wherein a first vector sub-processor (VSP) is not configured to receive data from memory banks dedicated to other VSPs of a plurality of VSPs (Fig 3, [0031-0033], processing clusters 30 as multiple VSPs, [0037-0038], dedicated local memories 33 and 35 as dedicated memory banks, “Each of these local memory resources 33, 35 is associated only with its associated sub-cluster 32, 34, respectively”).
	It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the teachings of Chen and Biscondi to utilize register pairs for operands that are wider than a single vector register of the processor and exclusively dedicated memory banks.  While Chen teaches each vector sub-processor utilizing four or more adjacent vector register banks, Chen does not explicitly contemplate these register banks being utilized in a paired or concatenated manner.  However, Chen does disclose a vector execution unit reading multiple VGPRs as source operands based on the SIMD width (Chen [0022]).  While Chen discloses memory banks (vector register banks) as being dedicated to one of the multiple VSPs (Chen’s super-SIMDs), Chen also discloses that the operand delivery network may output data to external texture units and a local data share unit (Chen [0033], Fig 2).  While Chen’s VSPs do not deliver operands directly from their dedicated memory banks to the dedicated memory banks or even the local operand delivery network of other VSPs, there is a pathway through the local data share unit for data to travel through multiple components to another VSP (Chen Figs 2 & 3).  However, utilizing exclusively dedicated memories as taught by Biscondi may reduce the complexity and power consumption of the processor, while also providing the utility of an exclusive local memory resource “useful for storing digital filter coefficients, storing and holding FFT parameters, storing tables of pseudo-random values as useful in the Kasumi cipher algorithm, and the like” (Biscondi [0038]).  Therefore, as both wide operands and exclusively dedicated local memory resources are a routine and conventional aspect of SIMD and vector processors in the art, splitting the register banks between high and low pairs to hold wide operands and including exclusively dedicated memory banks, as disclosed by Biscondi, would be an obvious means of implementing wide vector operands.  Doing so may increase the functionality and efficiency of the SIMD processor, and would merely entail a combination of known prior art elements to achieve predictable results.

	Regarding claim 2, the combination of Chen and Biscondi teaches the processor of claim 1, further comprising a second memory bank dedicated to a second VSP of the plurality of VSPs, wherein the second memory bank comprises: a second plurality of high VGPR banks; and a second plurality of low VGPR banks corresponding to the second plurality of high VGPR banks (Chen Figs 1A & 3, VGPRs 110a-d of second super-SIMD 200b & Biscondi Fig 9, [0057-0059], [0064], paired register banks for high and low order bits).

Regarding claim 4, the combination of Chen and Biscondi teaches the processor of claim 1, wherein the first memory bank further comprises a plurality of operand gathering components corresponding to VGPR banks of the first VSP, wherein a first operand gathering component is configured to store a first plurality of operands from a corresponding high VGPR bank and to store a second plurality of operands from a corresponding low VGPR bank (Chen Figs 1A & 2, input multiplexors 105 & read crossbar 330, Biscondi Fig 9, [0057-0059], [0064], paired register banks for high and low order bits).

Regarding claim 5, the combination of Chen and Biscondi teaches the processor of claim 4, further comprising a phase multiplexer of the first VSP, wherein the phase multiplexer is configured to provide operands from the first operand gathering component to an arithmetic logic unit (ALU) of the first VSP (Chen Fig 1A, input multiplexers 105).

Regarding claim 6, the combination of Chen and Biscondi teaches the processor of claim 1, further comprising a scheduler configured to assign threads to individual VSPs of the plurality of VSPs (Chen [0014], [0050], scheduler).

Regarding claim 7, the combination of Chen and Biscondi teaches the processor of claim 6, wherein dedicating a first thread to the first VSP comprises identifying a first high VGPR bank of the first VSP and a first low VGPR bank of the first VSP to store data of the first thread (Chen [0034], [0038], per-thread VGPRs & Biscondi Fig 9, [0057-0059], [0064], paired register banks for high and low order bits).

Regarding claim 8, the combination of Chen and Biscondi teaches the processor of claim 7, wherein identifying the first high VGPR bank is based on at least a portion of an address of the first thread (Chen [0038], [0064], VGPR addressing & Biscondi [0043], vector memory addressing & Fig 9, [0057-0059], [0064], paired register banks for high and low order bits).

Regarding claim 9, Chen teaches a processor, comprising:
a plurality of vector sub-processors (VSPs) (Fig 3, [0059], super-SIMDs 200a-d); and
a plurality of memory banks dedicated to respective VSPs of the plurality of VSPs (Figs 1A & 3, [0021], VGPRs 110a-d), wherein a first memory bank dedicated to a first VSP of the plurality of VSPs comprises:
a plurality of operand gathering components configured to be assigned to individual threads and to store operands for the assigned individual threads while the threads are assigned to the first VSP, wherein the plurality of operand gathering components are configured to send operands to the first VSP (Chen Figs 1A & 2, [0033], source operand flip-flops storing operands & read crossbar 330, [0034], [0038], per-thread VGPRs).
	Chen fails to teach wherein the first VSP is not configured to receive data from memory banks dedicated to other VSPs of the plurality of VSPs.
	Biscondi teaches a processor, comprising:
	a plurality of memory banks dedicated to a respective vector processor (Fig 5, [0040], vector memory banks) wherein a first vector sub-processor (VSP) is not configured to receive data from memory banks dedicated to other VSPs of a plurality of VSPs (Fig 3, [0031-0033], processing clusters 30 as multiple VSPs, [0037-0038], dedicated local memories 33 and 35 as dedicated memory banks, “Each of these local memory resources 33, 35 is associated only with its associated sub-cluster 32, 34, respectively”).
	It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the teachings of Chen and Biscondi to utilize exclusively dedicated memory banks. While Chen discloses memory banks (vector register banks) as being dedicated to one of the multiple VSPs (Chen’s super-SIMDs), Chen also discloses that the operand delivery network may output data to external texture units and a local data share unit (Chen [0033], Fig 2).  While Chen’s VSPs do not deliver operands directly from their dedicated memory banks to the dedicated memory banks or even the local operand delivery network of other VSPs, there is a pathway through the local data share unit for data to travel through multiple components to another VSP (Chen Figs 2 & 3).  However, utilizing exclusively dedicated memories as taught by Biscondi may reduce the complexity and power consumption of the processor, while also providing the utility of an exclusive local memory resource “useful for storing digital filter coefficients, storing and holding FFT parameters, storing tables of pseudo-random values as useful in the Kasumi cipher algorithm, and the like” (Biscondi [0038]).  Therefore, as exclusively dedicated local memory resources are a routine and conventional aspect of SIMD and vector processors in the art, including exclusively dedicated memory banks, as disclosed by Biscondi, would be an obvious means of providing dedicated storage for vector operation inputs.  Doing so may increase the functionality and efficiency of the SIMD processor, and would merely entail a combination of known prior art elements to achieve predictable results.

Regarding claim 10, the combination of Chen and Biscondi teaches the processor of claim 9, wherein a first operand gathering component of operand gathering components comprises a first storage component configured to store a first operand from a high vector general purpose register (VGPR) bank of the first memory bank and a second operand from a low VGPR bank of the first memory bank (Chen Figs 1A & 2, input multiplexors 105 & read crossbar 330, [0034], [0038], per-thread VGPRs & Biscondi Fig 9, [0057-0059], [0064], concatenated register pairs split high and low order bits between adjacent banks of vector registers).

Regarding claim 11, the combination of Chen and Biscondi teaches the processor of claim 10, wherein the first operand gathering component is configured to receive the first operand and the second operand concurrently (Chen [0022]).

Regarding claim 12, the combination of Chen and Biscondi teaches the processor of claim 10, wherein the first operand gathering component is configured to provide the first operand and the second operand to a phase multiplexer of the first VSP (Chen Figs 1A & 2, input/source muxes).

Regarding claim 13, the combination of Chen and Biscondi teaches the processor of claim 12, wherein the phase multiplexer of the first VSP is configured to provide the first operand and the second operand to an arithmetic logic unit (ALU) of the first VSP (Chen Figs 1A & 2, ALUs).

Regarding claim 14, the combination of Chen and Biscondi teaches the processor of claim 13, further comprising a broadcast switch configured to broadcast operands between the plurality of VSPs (Chen [0026-0027], operand delivery network 240).

Regarding claim 15, the combination of Chen and Biscondi teaches the processor of claim 14, wherein the first VSP is configured to send a matrix multiplication input via the broadcast switch (Biscondi [0030], matrix algebra).

Allowable Subject Matter
4.  Claims 16-20 are allowed.
5.  Claim 3 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter:
The combination of Chen and Biscondi teaches a processor comprising a plurality of vector sub-processors (VSPs), a broadcast switch, a plurality of memory banks dedicated to each VSP, and a plurality of high and low register banks dedicated to each VSP.  However, neither reference teaches the broadcast switch connecting an additional storage device not dedicated to any of the VSPs to the plurality of VSPs as required by claims 3 and 16.  While Chen does disclose a broadcast switch including storage for operands to be communicated between VSPs (Chen Fig 3, LDS 420) and both references disclose an interconnect network for broadcasting operands, neither reference teaches these two separate structures as described by claims 3 and 16.

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL J METZGER whose telephone number is (571)272-3105. The examiner can normally be reached Monday-Friday 7:30-4.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jyoti Mehta, can be reached at 571-270-3995. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/MICHAEL J METZGER/             Primary Examiner, Art Unit 2182