DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
The applicant argues: As such, among other differences, the Dreslinski reference does not describe the prefetch of data by the prefetcher of the first GPU being halted upon reaching a page boundary between the first page and the second page or upon reaching a boundary of a memory surface. 
The examiner respectfully disagrees because it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the prefetch of data by the prefetcher of the first GPU being halted upon reaching a page boundary between the first page and the second page or upon reaching a boundary of a memory surface, with Gierach, Surti, Rao, and Dreslinski, the motivation being to avoid the fetching of useless data by not fetching from pages that are from non-contiguous physical locations in memory and thereby improve the processing of the apparatus.
The applicant argues: However, it is submitted that the Gierach, Surti, Rao, and Dreslinski references do not teach a memory for storage of data, the memory including a plurality of memory elements, each of a plurality of GPUs being coupled with one or more of the plurality of memory units, the plurality of memory units providing a unified virtual memory that is accessible to each of the plurality of GPUs, the unified virtual memory including at least a first page and a second page, the first page and the second page being adjacent in the unified virtual memory; wherein the prefetcher of each of the plurality of GPUs is to prefetch data from the memory to the cache of the respective GPU; and wherein, in a prefetch operation by a first GPU that includes a prefetch to the first page, the prefetcher of the first GPU is allowed to prefetch data from the first page if it is owned by the first GPU or the host 
The examiner respectfully disagrees because the above limitations have been rejected over Gierach, Surti, Rao, and Dreslinski; see the detailed action below.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.



Claims 1, 9, 17 and 23 are rejected under 35 U.S.C. 103 as being unpatentable over Gierach US 2015/0378920 A1 in view of Surti US 9,912,957 B1 further in view of Rao US 2019/0266695 A1 further in view of Dreslinski, "Analysis of Hardware Prefetching Across Virtual Page Boundaries", Dept. of Electrical Engineering and Computer Science 2260 Hayward Ave, Ann Arbor, MI 48109-2121, Copyright 2007 ACM, CF '07, May 7-9, 2007, pgs. 1-4.

Regarding claim 1, Gierach teaches: 1. (Currently amended) An apparatus comprising:

	a plurality of processors including a host processor and a plurality of graphics processing units (GPUs) to process data including at least a first GPU, each of the plurality of GPUs including a prefetcher and a cache (fig. 1:107-108 see also par. 25; fig. 11:1100,1120,1122,1124 and 1126 see also pars. 105-106); and
	a memory for storage of data (fig. 1,120 see also par. 30; fig. 11,1130 see also par. 105), wherein the prefetcher of each of the plurality of GPUs is to prefetch data from the memory to the cache of the respective GPU (fig. 1:107-108 see also par. 25; fig. 11:1122-1126 and 1130 see also par. 106).
	Gierach doesn't teach, however the analogous prior art Surti teaches: the memory including a plurality of memory elements, [[for]] each of the plurality of GPUs being coupled with one or more of the plurality of memory units, the plurality of memory units providing a unified memory (Surti: fig. 4F: 410-413, 420-423 see also col. 19 ll. 13-30).
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the memory including a plurality of memory elements, [[for]] each of the plurality of GPUs being coupled with one or more of the plurality of memory units, the plurality of memory units providing a unified memory as shown in Surti with Gierach for the benefit of increasing processing efficiency of the parallel processors (see col. 1 ll. 23-30).
	The previous combination of Gierach and Surti remains as above but doesn't teach however the analogous prior art Rao teaches: a unified virtual memory that is accessible to each of the plurality of GPUs (Rao: see par. 106).
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine a unified virtual memory that is accessible to each of the plurality of GPUs as shown in Rao with the previous combination for the benefit of facilitating efficient and effective utilization of unified virtual addresses across multiple components, thereby improving system utilization (see pars. 5-6).
	Although Gierach as modified by Surti and Rao don't teach, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the unified virtual memory including at least a first page and a second page, the first page and the second page being adjacent in the unified virtual memory with Gierach as modified, the motivation being to decrease memory access time, thereby providing an efficient memory management technique. 
	The previous combination of Gierach, Surti and Rao remains as above but doesn't teach however the analogous prior art Dreslinski teaches: the memory including at least a first page and a second page, the first page and the second page being adjacent (Dreslinski: pg. 2, sec. 1; pgs. 3-4, secs. 3.1-3.2); the prefetcher is prohibited from prefetching from a page (Dreslinski: pg. 2, sec. 1; pgs. 3-4, secs. 3.1-3.2).
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the memory including at least a first page and a second page, the first page and the second page being adjacent; the prefetcher is prohibited from prefetching from a page as shown in Dreslinski with the previous combination for the benefit of improving performance of data cache prefetching (see abstract).
	Although Gierach as modified doesn't teach, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine wherein, in a prefetch operation by the first GPU that includes a prefetch to the first page: 
	the prefetcher of the first GPU is allowed to prefetch data from the first page if it is owned by the first GPU or the host processor, and is prohibited from prefetching from the first page if the first page is owned by a GPU other than the first GPU, and Attorney Docket No.: AB9737-US-2- Application Filed: March 15, 2019 Application No.: 16/355,274
	upon allowing the prefetch from the first page, the prefetch of data by the prefetcher of the first GPU is halted upon reaching a page boundary between the first page and the second page or upon reaching a boundary of a memory surface with Gierach as modified, the motivation being to avoid the fetching of useless data by not fetching from pages that are from non-contiguous physical locations in memory and thereby improve the processing of the apparatus.
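For illustration only, the ownership and boundary conditions recited in the limitations above can be sketched as follows. This is a minimal sketch under stated assumptions and is not drawn from the application or from any cited reference: the 4 KiB page size, the owner labels, and the names `PAGE_SIZE`, `HOST`, `may_prefetch`, and `prefetch_range` are all hypothetical.

```python
PAGE_SIZE = 4096  # assumed page size in bytes (hypothetical)
HOST = "host"     # assumed label for the host processor (hypothetical)


def may_prefetch(page_owner, requesting_gpu):
    """Allow the prefetch only if the page is owned by the requester or the host."""
    return page_owner in (requesting_gpu, HOST)


def prefetch_range(start, length, page_owners, requesting_gpu, surface_end):
    """Return the (start, end) byte range actually prefetched, or None if prohibited.

    page_owners maps a page index to its owner label.
    surface_end is the first address past the memory surface.
    """
    page = start // PAGE_SIZE
    if not may_prefetch(page_owners[page], requesting_gpu):
        return None  # page owned by a different GPU: prefetch prohibited
    page_end = (page + 1) * PAGE_SIZE  # boundary between this page and the next
    # Halt at whichever comes first: requested end, page boundary, surface boundary.
    end = min(start + length, page_end, surface_end)
    return (start, end)
```

In this sketch a request that would run past the end of its page is truncated at the page boundary rather than continued into the adjacent page, and a request to a page owned by a different GPU is dropped entirely.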



Regarding claim 9, Gierach teaches: 9. (Currently amended) One or more non-transitory computer-readable storage mediums having stored thereon executable computer program 
	generating a prefetch instruction for one or more prefetch operations by a prefetcher of a first graphics processing unit (GPU), the first GPU being one GPU of a plurality of GPUs in a computing system, the prefetch instruction being directed to a memory (fig. 1:107-108 see also par. 25; fig. 11:1100,1120,1122,1124,1126 and 1130 see also pars. 105-106); and
caching prefetched data in a cache of the first GPU (fig. 11:1122-1126 and 1130 see also par. 106).
	Gierach doesn't teach, however the analogous prior art Surti (with the same motivation from claim 1) teaches: a memory including a plurality of memory elements, [[for]] each of the plurality of GPUs being coupled with one or more of the plurality of memory units, the plurality of memory units providing a unified memory (Surti: fig. 4F: 410-413, 420-423 see also col. 19 ll. 13-30).
	The previous combination of Gierach and Surti remains as above but doesn't teach, however the analogous prior art Rao (with the same motivation from claim 1) teaches: a unified virtual memory that is accessible to each of the plurality of GPUs (Rao: see par. 106).
	Although Gierach as modified by Surti and Rao don't teach, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the unified virtual memory including at least a first page and a second page, the first page and the second page being adjacent in the unified virtual memory with Gierach as modified, the motivation being to decrease memory access time, thereby providing an efficient memory management technique.
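	The previous combination of Gierach, Surti and Rao remains as above but doesn't teach, however the analogous prior art Dreslinski (with the same motivation from claim 1) teaches: the memory including at least a first page and a second page, the first page and the second page being adjacent; the prefetcher is prohibited from prefetching from a page (Dreslinski: pg. 2, sec. 1; pgs. 3-4, secs. 3.1-3.2).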
	Although Gierach as modified doesn't teach, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine wherein, in a prefetch operation by the first GPU that includes a prefetch to the first page: 
	the prefetcher of the first GPU is allowed to prefetch data from the first page if it is owned by the first GPU or a host processor of the computing system, and is prohibited from prefetching from the first page if the first page is owned by a GPU other than the first GPU, and 
	upon allowing the prefetch from the first page, the prefetch of data by the prefetcher of the first GPU is halted upon reaching a page boundary between the first page and the second page or upon reaching a boundary of a memory surface with Gierach as modified, the motivation being to avoid the fetching of useless data by not fetching from pages that are from non-contiguous physical locations in memory and thereby improve the processing of the apparatus.


Claim 17 is analogous to claim 9 and is therefore rejected using the same rationale. Claim 17 further requires a method, which is also taught by Gierach (see par. 22).

(Gierach: fig. 11: 1120, 1124, 1126, 1130 see also pars. 105, 106 and 117; fig. 5 see also par. 53).

Claims 5, 13 and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Gierach in view of Surti further in view of Rao further in view of Dreslinski further in view of Gonion US 2013/0318306 A1 further in view of Alexander US 2013/0346697 A1.

Regarding claim 5, Gierach teaches: The apparatus, wherein a prefetch instruction from a prefetcher of a GPU of the plurality of GPUs (fig. 11: 1122-1126, 1130 see also par. 106). 
Gierach doesn't teach, however the analogous prior art Gonion teaches a gather/scatter load architecture including a vector memory access instruction that references a vector of effective addresses (Gonion: see par. 113).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention, to combine a gather/scatter load architecture including a vector memory access instruction that references a vector of effective addresses as shown in Gonion with Gierach as modified for the benefit of addressing the shortcomings of the prior art in that hardware prefetchers in a conventional processor typically wait for a memory access instruction to execute numerous times to confirm that memory accesses are being performed in a streaming pattern, and to identify the stride of that pattern. If hardware streaming prefetch is initiated too soon (i.e., before a true stream is identified), performance may suffer due to unnecessary memory accesses being performed. If the prefetch is initiated too late, performance may suffer due to memory latency because the data that is requested has not already been fetched from memory (see par. 5).
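Gierach as modified doesn't teach, however the analogous prior art Alexander teaches a prefetch message (e.g., mask field to select function) of an instruction for a prefetch-for-coprocessor process that determines how the instruction encoding will operate in terms of access exception checking and cache behavior (Alexander: see fig. 5, 120 see also pars. 47-51).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention, to combine a prefetch message (e.g., mask field to select function) of an instruction for a prefetch-for-coprocessor process that determines how the instruction encoding will operate in terms of access exception checking and cache behavior as shown in Alexander with Gierach as modified for the benefit of providing prefetching techniques to try to supply memory data to the L1 cache ahead of time to reduce latency (see par. 13).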
Gierach as modified doesn't explicitly teach: wherein a prefetch instruction from a prefetcher of a GPU of the plurality of GPUs is a gather/scatter prefetch message including a plurality of prefetch addresses, however it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention, and the results would have been predictable, to combine wherein a prefetch instruction from a prefetcher of a GPU of the plurality of GPUs is a gather/scatter prefetch message including a plurality of prefetch addresses with Gierach as modified, the motivation being to improve memory prefetching and thereby the processing of the apparatus.

Claim 13 is analogous to claim 5 (albeit slightly broader) and is therefore rejected using the same rationale.

Claim 21 is analogous to claim 13 and is therefore rejected using the same rationale.

Claims 6-8 and 14-16 are rejected under 35 U.S.C. 103 as being unpatentable over Gierach in view of Surti further in view of Rao further in view of Dreslinski further in view of Gonion further in view of Alexander further in view of Sperber US 2015/0074373 A1.

Regarding claim 6, Gierach as modified above doesn't teach, however the analogous prior art Sperber teaches to parse the gather/scatter prefetch instruction into an opcode and corresponding data and control fields that are used by the micro-architecture to perform operations (Sperber: see Abstract, fig. 2: 226, 228 and par. 50).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention, to combine parse the gather/scatter prefetch instruction into an opcode and corresponding data and control fields that are used by the micro-architecture to perform operations as shown in Sperber with Gierach as modified for the benefit of providing the use of an index array and finite state machine responsive to, and/or in support of scatter/gather operations for improving memory access and ordering data to and from wider vectors for generating local contiguous memory access or data from other non-local and/or noncontiguous memory locations (see pars. 1 and 4).
	Gierach as modified doesn't explicitly teach:  wherein the apparatus is to parse the gather/scatter prefetch message and issue a prefetch message for each of the plurality of prefetch addresses, however it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention, and the results would have been predictable, to combine wherein the apparatus is to parse the gather/scatter prefetch message and issue a prefetch message for each of the plurality of prefetch addresses with Gierach as modified, the 

Regarding claim 7, Gierach as modified (with the same motivation from claim 5) teaches: a prefetch message (e.g., mask field to select function) of an instruction for a prefetch-for-coprocessor process that determines how the instruction encoding will operate in terms of access exception checking and cache behavior (Alexander: see fig. 5, 120 see also pars. 47-51).
	Gierach as modified doesn't explicitly teach: wherein the gather/scatter prefetch message further includes an entry for each of the plurality of addresses to indicate a cache level for prefetching, however it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention, and the results would have been predictable, to combine wherein the gather/scatter prefetch message further includes an entry for each of the plurality of addresses to indicate a cache level for prefetching with Gierach as modified, the motivation being to improve gather/scatter operations and provide faster access to a cache line, thereby improving the processing of the apparatus.

Regarding claim 8, Gierach teaches: the prefetcher of the first GPU (fig. 11: 1122-1126, 1130 see also par. 106). Gierach also teaches multithreading on one or more cores (see par. 34); Gierach as modified (with the same motivation from claim 6) teaches signaling or flagging using a completion mask when gather/scatter operations are complete (Sperber: see fig. 7, 710 and pars. 114-116).
	Gierach as modified doesn't explicitly teach: wherein the prefetcher of the first GPU is to send a flag to a thread in a core of the first GPU when a prefetch for the thread is complete, however it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention, and the results would have been predictable, to combine wherein the prefetcher of the first GPU is to send a flag to a thread in a core of the first GPU when a prefetch for the thread is complete with Gierach as modified, the motivation being to decrease delays associated with gather/scatter operations and thereby improve the processing of the apparatus.
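For illustration only, the limitations addressed for claims 5-8 (a gather/scatter prefetch message carrying a plurality of addresses, a cache-level entry per address, one prefetch message issued per address, and a flag sent to the requesting thread on completion) can be sketched as follows. This is a minimal sketch under stated assumptions and does not reproduce any encoding from the application, Gonion, Alexander, or Sperber: the names `GatherScatterPrefetch`, `Prefetcher`, `handle`, and the field layout are hypothetical, and each prefetch is modeled as completing as soon as it is issued.

```python
from dataclasses import dataclass, field


@dataclass
class GatherScatterPrefetch:
    """Hypothetical message: one (address, cache_level) entry per prefetch target."""
    thread_id: int
    entries: list  # list of (address, cache_level) tuples


@dataclass
class Prefetcher:
    issued: list = field(default_factory=list)      # per-address prefetch messages
    done_flags: dict = field(default_factory=dict)  # thread_id -> completion flag

    def handle(self, msg):
        # Parse the combined message and issue one prefetch message per address,
        # carrying along the cache level requested for that address.
        for address, cache_level in msg.entries:
            self.issued.append({"addr": address, "level": cache_level})
        # In this simplified model issuing equals completing, so flag the
        # requesting thread once all of its entries have been processed.
        self.done_flags[msg.thread_id] = True
```

A call such as `Prefetcher().handle(GatherScatterPrefetch(7, [(0x1000, 1), (0x2040, 2)]))` would record two per-address prefetch messages and set the completion flag for thread 7.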

Claim 14 is analogous to claim 6 and is therefore rejected using the same rationale.

Claim 15 is analogous to claim 7 and is therefore rejected using the same rationale.

Claim 16 is analogous to claim 8 and is therefore rejected using the same rationale.
Allowable Subject Matter
Claim 22 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter:  
Regarding claim 22, the prior art doesn’t teach: 22. (Currently amended) The apparatus of claim 1, wherein the prefetching of data by the plurality of GPUs does not recognize a physical structure of the memory in the unified virtual memory.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. MILOUSHEV US20140207871A1.
THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MAURICE L MCDOWELL, JR whose telephone number is (571)270-3707.  The examiner can normally be reached on Mon-Thurs 5:30-4:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jennifer Mehmood can be reached on 571-272-2976.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact 






/MAURICE L. MCDOWELL, JR/Primary Examiner, Art Unit 2612