DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Status of the Claims
Claims 1-20 are pending, of which claims 1, 11 and 20 are in independent form.  Claims 1-20 are rejected under 35 U.S.C. 103.

Response to Claim Amendments and Arguments
The claim amendments and arguments filed on 13 July 2022, as they apply to the 35 U.S.C. 112(b) rejections of the claims, have been fully considered and are persuasive.  On page 8 of the remarks, Applicant’s representative clarified that the first and the second recipe denote instances of recipes and not their relative position in the ranked list, thus resolving the 35 U.S.C. 112(b) issue.  Therefore, the 35 U.S.C. 112(b) rejections of the claims have been withdrawn.
With respect to the 35 U.S.C. 103 rejections of the claims, on pages 9-10 of the remarks, Applicant’s representative appears to argue that the cited prior art references do not disclose the newly amended independent claim limitations reciting in part, determining an incompatibility between a first recipe in the ranked recipe list and a second recipe in the ranked recipe list, the incompatibility being determined in response to each of the first recipe and the second recipe being associated with a respective file cluster candidate that comprise at least one same file, and ignoring the first recipe in response to the second recipe being ranked higher than the first recipe in the ranked recipe list.  Applicant’s arguments are not persuasive.

Examiner’s Response:
Chakerian, in the Abstract and paragraphs [0037]-[0039] and [0042], discloses calculating a similarity score for each cluster of a plurality of clusters with respect to a document and assigning the document to the cluster with the highest match score.  Examiner is of the position that Chakerian at paragraph [0038], in teaching assigning the document to the cluster with the highest match score, assigns the document to the highest scoring cluster first and ignores or discards all other potential candidates with lower scores for the same document.  This reads on a ranked recipe list of file cluster candidates that are associated with the same file.  Because the document is clustered into only the highest ranking cluster, the lower scoring cluster candidates are incompatible with the higher scoring candidate and are discarded, which reads on the newly amended claim language argued above.
Additionally, in an effort to provide compact prosecution and to make Examiner’s position with regard to the Chakerian reference clearer, Examiner asks Applicant to consider the broadest reasonable interpretation of the argued claim language, and whether the recipe list and the recipes themselves can consist of only a single file and a plurality of cluster candidates.  In other words, Applicant is asked to consider whether the claim as worded can encompass a ranked recipe list of recipes such as, for example, move file 1 to cluster 2, move file 1 to cluster 3, etc.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-6, 8-16 and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Menezes et al. U.S. Patent No. 10,303,797 (hereinafter “Menezes”) in view of Hirsch et al. U.S. Pub. No. 2018/0113643 (hereinafter “Hirsch”) in view of Patterson et al. U.S. Pub. No. 2014/0337363 (hereinafter “Patterson”) in further view of Chakerian et al. U.S. Pub. No. 2016/0004764 (hereinafter “Chakerian”).
Regarding independent claim 1, Menezes discloses:
generating a plurality of file cluster candidates for a plurality of files stored at a multi-node storage system comprising a plurality of data nodes, each of the plurality of file cluster candidates comprising some of the plurality of files (Menezes in the Abstract discloses in part, “Clustering files in deduplication systems is based on an estimate of similarity between files in a file system. The estimates of similarity are based on how much content the files share…”  Examiner is of the position that the files being clustered based on an estimate of similarity between files recited in the section of Menezes cited above reads on file cluster candidates.  Further, Menezes at Column 6, Lines 1-5 discloses a multi-node storage system, more specifically Menezes discloses, “Storage units 108-109 may be implemented locally (e.g., single node operating environment) or remotely (e.g., multi-node operating environment) via interconnect 120, which may be a bus and/or a network (e.g., a storage network or a network similar to network 103).”)

While Menezes in the Abstract discloses determining similarity estimates between files, Menezes does not disclose a similarity index, more specifically, Menezes does not disclose:
determining, for each of the plurality of file cluster candidates, a similarity index based on similarity of the files comprised in the file cluster candidate.
However, Hirsch at paragraph [0032] teaches in part, “In addition to this process, statistical information, indicating the similarity of the new file and other the files found to be similar to the new file, is stored. A "file similarity index" may be used to store the statistical information.”
Both the Menezes reference and the Hirsch reference, in the sections cited by the Examiner, are in the field of endeavor of calculating file similarity and clustering of similar files in a deduplication file system.  Before the effective filing date of the claimed invention it would have been obvious to one of ordinary skill in the art to combine the determining of similarity estimates between files disclosed in Menezes with the file similarity index taught in Hirsch to increase the efficiency and performance of a deduplication system (See Hirsch at paragraph [0006]).

generating a ranked recipe list comprising a plurality of recipes, wherein each recipe is associated with one of the plurality of file cluster candidates, comprises a destination data node for the associated file cluster candidate, and is associated with a deduplication space savings determined based on a total file size and the similarity index of the associated file cluster candidate, wherein the plurality of recipes in the ranked recipe list are sorted based on data movement cost-adjusted deduplication space savings… (Menezes at Column 11, Lines 13-24 discloses, “…assigning 602 all N files F1, F2, . . . FN into a super-cluster C0 at the top level (level-0) of the hierarchy. The process 600 continues sorting 604 the individual lists of K found offset numbers [O_1, . . . , O_K] accumulated in F_K for each of the N files F1, . . . , FN such that O_1 is the most significant number and O_K is the least significant number. In this manner, similar files belonging to the same cluster based on their offsets should be co-located sequentially in the sorted file list. Once the file lists are sorted, the process 600 continues at 606 to assign each group of similar files F into clusters.”  Additionally, as illustrated in Menezes in the Abstract, similarity is estimated based on comparing file segments, and Menezes at Column 14 Line 64 – Column 15 Line 3 discloses, “For example, the file(s) may be broken into segments by identifying segment boundaries. Segment boundaries may be determined using file boundaries, directory boundaries, byte counts, content-based boundaries (e.g., when a hash of data in a window is equal to a value), or any other appropriate method of determining a boundary.”  Examiner is of the position that Menezes in the above cited sections discloses that segments may be determined in part by file boundaries and byte counts and are used in estimating the similarity between files, and that the files are put in a sorted list [i.e., recipe list] based on significance and assigned to clusters.  Lastly, Menezes at Column 12, Lines 16-21 discloses “For example, the approximated optimal cluster definitions allow placing together files that compress well inside the same deduplication/compression domains because they can be collocated in the same partitions or same nodes in order to optimize physical space utilization. [i.e., space savings]”.
With respect to the claim limitation, total file size… Menezes in the Abstract discloses in part, “Clustering files in deduplication systems is based on an estimate of similarity between files in a file system. The estimates of similarity are based on how much content the files share, where the estimate of how much content is shared is based on an estimate of segments shared.”  Additionally, Menezes at Column 14 Line 64 – Column 15 Line 3 discloses, “For example, the file(s) may be broken into segments by identifying segment boundaries. Segment boundaries may be determined using file boundaries…”  Lastly, Menezes at Column 12, Lines 16-21 discloses “For example, the approximated optimal cluster definitions allow placing together files that compress well inside the same deduplication/compression domains because they can be collocated in the same partitions or same nodes in order to optimize physical space utilization. [i.e., space savings]”  Examiner is of the position that in the embodiment in which file boundaries are used as segment boundaries, the entire file size is used and relevant when comparing segments and calculating file similarities and used to optimize physical space utilization.  If Applicant intends to detail more specifically how the total file size is used, such language must be explicitly claimed.  In the event Applicant is unpersuaded by Examiner’s argument, in an effort to provide compact prosecution, Examiner also points to the Hirsch reference at paragraph [0030] teaching in part, “In other words, mutually deduplicated data having a higher similarity score are preferred to be stored in the same external finite-sized container, since the deduplication between them can save space in that container.”  And Hirsch at paragraphs [0032] – [0033] teaches calculating similarities between two files taking into consideration the total file size of each file.)

Menezes does not disclose:
A tag node mapping table comprising a mapping between a tag and each of the plurality of data nodes…
However, Patterson at paragraph [0027] teaches in part the following:
…In various embodiments, a tag comprises a sketch, a fingerprint or hash of all or some of the bytes of the segment (e.g. of the first N bytes), some metadata included in the stream and associated with the segment such as a file name, a file size, a file create date and time, a file modify date and time, a file inode number, a hash of such metadata, or any other measure useful in identifying likely similar segments. In various embodiments, a tag comprises a tag based on the content of a segment or not based on the content of a segment. In some embodiments, a tag not based on the content includes: (when not included in a stream) a file name, a file size, a file create date and time, a file modify date and time, a file inode number, a hash of such metadata, or any other measure useful in identifying likely similar segments. In some embodiments, tags associated with a segment are stored (e.g., in an index) to be used to identify segments similar to an in-coming segment.

Additionally, Patterson at paragraph [0054] teaches in part, “The tag index can be in memory, on disk, or in any other appropriate index location. Also, the tag can be one or more hash tables (e.g., for sketch features), a tree structure, part of file system data structures (e.g., a directory where you look up the file name), or any other appropriate index type.”  Lastly, Patterson at paragraph [0066] teaches in part, “In various embodiments, different combination of sending/receiving tags and/or fingerprints to identify duplicate subsegments are possible because comparisons and or calculations can be made either on the originating system or the remote or replica system.”
Both the Menezes reference and the Patterson reference, in the sections cited by the Examiner, are in the field of endeavor of determining similarity in data segments.  Before the effective filing date of the claimed invention it would have been obvious to one of ordinary skill in the art to combine the determining of file segment similarities disclosed in Menezes with the use of a tag and tag index between a plurality of systems to identify duplicate segments as taught in Patterson to facilitate efficient identification of duplicate segments between systems (See Patterson at paragraph [0066]).


moving at least some of the plurality of files between the plurality of data nodes based on the recipes in the ranked recipe list to improve deduplication space savings in the multi-node storage system (Menezes at claim 17 discloses in part, “…the processor further configured to optimize storage capacity utilization of the data processing system, comprising placing at least some of the plurality of files in a same domain, partition, or storage node of the data processing system based on the cluster definition for compression or deduplication together.”)

While Menezes in the Abstract discloses clustering files in a deduplication system by similarity, Menezes does not disclose:
including determining an incompatibility between a first recipe in the ranked recipe list and a second recipe in the ranked recipe list, the incompatibility being determined in response to each of the first recipe and the second recipe being associated with a respective file cluster candidate that comprise at least one same file, and ignoring the first recipe in response to the second recipe being ranked higher than the first recipe in the ranked recipe list, wherein file movements based on highest ranked recipes in the ranked recipe list are performed first.
However, Chakerian, in the Abstract and paragraphs [0037]-[0039] and [0042], discloses calculating a similarity score for each cluster of a plurality of clusters with respect to a document and assigning the document to the cluster with the highest match score.  Examiner is of the position that Chakerian at paragraph [0038], in teaching assigning the document to the cluster with the highest match score, assigns the document to the highest scoring cluster first and ignores or discards all other potential candidates with lower scores for the same document.  This reads on a ranked recipe list of file cluster candidates that are associated with the same file.  Because the document is clustered into only the highest ranking cluster, the lower scoring cluster candidates are incompatible with the higher scoring candidate and are discarded, which reads on the newly amended claim language argued above.
Additionally, in an effort to provide compact prosecution and to make Examiner’s position with regard to the Chakerian reference clearer, Examiner asks Applicant to consider the broadest reasonable interpretation of the argued claim language, and whether the recipe list and the recipes themselves can consist of only a single file and a plurality of cluster candidates.  In other words, Applicant is asked to consider whether the claim as worded can encompass a ranked recipe list of recipes such as, for example, move file 1 to cluster 2, move file 1 to cluster 3, etc.
Both the Menezes reference and the Chakerian reference, in the sections cited by the Examiner, are in the field of endeavor of data clustering based on similarity.  Before the effective filing date of the claimed invention it would have been obvious to one of ordinary skill in the art to combine the clustering of files based on similarities as disclosed in Menezes with the clustering of documents into clusters based on cluster similarity and the clustering of a document into the cluster with the highest similarity as taught in Chakerian to facilitate quickly grouping, sorting and categorizing large volumes of data (See Chakerian at paragraphs [0001] and [0009]).
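
For illustration only, the incompatibility limitation discussed above can be read as a greedy pass over the ranked recipe list: recipes are visited from highest to lowest rank, and a recipe is ignored when its file cluster candidate shares at least one file with a higher ranked recipe that has already been accepted.  The following is a minimal sketch under that assumed reading.  The names Recipe and select_compatible, the score field, and the rule of comparing only against accepted recipes are hypothetical and appear in neither the claims nor the cited references.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    """Hypothetical recipe: a file cluster candidate plus a destination node."""
    name: str
    files: frozenset      # files in the associated file cluster candidate
    destination: str      # destination data node
    score: float          # assumed ranking key (e.g., cost-adjusted savings)


def select_compatible(recipes):
    """Walk the list from highest to lowest score and ignore any recipe
    that shares a file with a higher ranked recipe already accepted."""
    ranked = sorted(recipes, key=lambda r: r.score, reverse=True)
    claimed = set()   # files already covered by an accepted recipe
    kept = []
    for recipe in ranked:
        if recipe.files & claimed:
            # Incompatible with a higher ranked recipe: ignore it.
            continue
        kept.append(recipe)
        claimed |= recipe.files
    return kept


recipes = [
    Recipe("a", frozenset({"f1", "f2"}), "node1", 5.0),
    Recipe("b", frozenset({"f2", "f3"}), "node2", 3.0),
    Recipe("c", frozenset({"f4"}), "node3", 1.0),
]
kept_names = [r.name for r in select_compatible(recipes)]
```

In this sketch recipe "b" is ignored because it shares file "f2" with the higher ranked recipe "a", while recipe "c" is retained because it shares no file with an accepted recipe.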

Regarding dependent claim 2, all of the particulars of claim 1 have been addressed above.  Additionally, Menezes discloses:
wherein each of the plurality of files is associated with the tag indicative of a source of the file, and wherein each file cluster candidate comprises all files associated with one or more tags (Menezes at Column 12, Lines 62-65 discloses one or more storage units storing metadata and data objects.  Additionally, Menezes at Column 13, Lines 40-50 discloses metadata may include fingerprints or representatives contained within data objects and fingerprints are mapped to a particular data object enabling the system to identify the location of the data object containing a data segment [i.e., tag].  Further, Menezes at Column 13, Lines 60-63 discloses, “In one embodiment, metadata 816 may include a file name, a storage unit identifier (ID) identifying a storage unit in which the segments associated with the file name are stored…”  Lastly, as illustrated in the rejection of claim 1 provided above, Patterson at paragraph [0027] teaches a segment tag comprising a plurality of information including file information and a file name.)

Regarding dependent claim 3, all of the particulars of claim 1 have been addressed above.  Additionally, Menezes as modified with Hirsch discloses:
wherein the similarity index comprises a Jaccard index (Menezes in the Abstract discloses estimating similarity between files and Menezes at Column 4, Lines 1-3 discloses, “For example, given two files F1 and F2, their similarity can be expressed by calculating the Jaccard's Coefficient of the sets of segments…”  Additionally, as illustrated in the rejection of claim 1, Hirsch at paragraph [0032] teaches a file similarity index.)
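
For reference, the Jaccard coefficient of two sets A and B is the size of their intersection divided by the size of their union.  The following is a minimal sketch of that calculation applied to sets of segment identifiers.  The function name jaccard, the example segment labels, and the choice to return 1.0 for two empty sets are assumptions made for illustration and are not taken from Menezes or Hirsch.

```python
def jaccard(segments_a, segments_b):
    """Jaccard coefficient |A intersect B| / |A union B| of two segment sets."""
    a, b = set(segments_a), set(segments_b)
    union = a | b
    if not union:
        # Assumption: two empty files are treated as identical.
        return 1.0
    return len(a & b) / len(union)


# Two hypothetical files sharing two of four distinct segments.
similarity = jaccard({"s1", "s2", "s3"}, {"s2", "s3", "s4"})
```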

Regarding dependent claim 4, all of the particulars of claim 1 have been addressed above.  While Menezes discloses calculating a file similarity, Menezes does not disclose dividing the deduplication space savings by a quantity of files that need to be moved, more specifically, Menezes does not disclose:
wherein determining the data movement cost-adjusted deduplication space savings associated with a recipe comprises dividing the deduplication space savings associated with the recipe by a quantity of files that need to be moved.
However, Hirsch at paragraph [0030] discloses in part specifying what a user wants to pack into a container including, “This "input" refers to what the user wants to pack into containers, which is different than the input 302, as described above. The 10 6 refers to the number of 1 GB files…”  Examiner is of the position that it would have been obvious to one of ordinary skill in the art to reduce or divide the similarity score by the number of files a user is trying to fit into a container.

Regarding dependent claim 5, all of the particulars of claim 1 have been addressed above.  While Menezes discloses calculating a file similarity, Menezes does not disclose dividing the deduplication space savings by an amount of data that needs to be moved, more specifically, Menezes does not disclose:
wherein determining the data movement cost-adjusted deduplication space savings associated with a recipe comprises dividing the deduplication space savings associated with the recipe by an amount of data that need to be moved.
However, Hirsch at paragraph [0036] teaches in part, “The score-based similarity between two files defined is symmetric. For example, if file "A" is 80% similar to file "B" (and file A and file B are both of the same size), then file B is 80% similar to file A.”  Examiner is of the position that, in the above cited section of Hirsch, if file “A” is 100% similar to file “B”, in other words all of file A is found in file B, but file B is twice the size of file A, then files A and B would be 50% similar.
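
For illustration only, the division recited in claims 4 and 5 can be sketched as a single helper in which the deduplication space savings associated with a recipe is divided by a movement cost, the cost being either a quantity of files that need to be moved (claim 4) or an amount of data that needs to be moved (claim 5).  The function name cost_adjusted_savings, the units, and the handling of a non-positive cost are assumptions and are not drawn from the claims or the cited references.

```python
def cost_adjusted_savings(savings_bytes, movement_cost):
    """Divide deduplication space savings by a data movement cost.

    movement_cost is either a count of files to move (claim 4 reading)
    or a number of bytes to move (claim 5 reading).
    """
    if movement_cost <= 0:
        # Assumption: a recipe that moves nothing is not scored this way.
        raise ValueError("movement_cost must be positive")
    return savings_bytes / movement_cost


# Hypothetical recipe saving 600 bytes by moving 3 files totalling 1200 bytes.
per_file = cost_adjusted_savings(600, 3)
per_byte = cost_adjusted_savings(600, 1200)
```

Under either reading, a recipe that achieves the same savings with less data movement receives a higher score and would sort earlier in a list ranked on this value.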

Regarding dependent claim 6, all of the particulars of claim 1 have been addressed above.  Additionally, Menezes as modified discloses:
wherein the first recipe in the ranked recipe list is further invalidated or removed from the ranked recipe list in response to determining the incompatibility between the first recipe and the second recipe and in response to determining that the second recipe is ranked higher than the first recipe (As illustrated in the rejection of claim 1 provided above, Chakerian at paragraph [0038] teaches assigning the document to the cluster with the highest match score and thus discarding the lower candidates.  Because the document is clustered into only the highest ranking cluster, the other candidate clusters are incompatible, which in the Examiner’s opinion reads on invalidating the other potential candidate matches.)

Regarding dependent claim 8, all of the particulars of claim 1 have been addressed above.  Additionally, Menezes discloses:
wherein the generating of the plurality of file cluster candidates, the determining of the similarity indexes, and the generating of the ranked recipe list are performed within a self-contained environment separate from an operating system of the multi-node storage system (Menezes at Figure 8 provided below illustrates storage units separate from a storage file system and deduplication logic [i.e., self-contained environment].)

[Image: media_image1.png, greyscale reproduction of Menezes Figure 8]



Regarding dependent claim 9, all of the particulars of claims 1 and 8 have been addressed above.  Additionally, Menezes discloses:
wherein the ranked recipe list is stored at a shared database, and wherein the operating system of the multi-node storage system polls the shared database to read the ranked recipe list prior to the moving of at least some of the plurality of files (Menezes at Column 16, Lines 28-30 discloses “Detection is accomplished by building a database (e.g., index 824) that maintains a digest (e.g., SHA, checksum) and a deduplication key for each data block [i.e., shared database].”)

Regarding dependent claim 10, all of the particulars of claim 1 have been addressed above.  Menezes as modified discloses:
wherein one or more of the plurality of file cluster candidates whose similarity indexes are below a threshold are discarded, and not included in the ranked recipe list (Hirsch at paragraph [0032] teaches a file similarity index and at paragraph [0006] teaches in part, “…the similarity score indicating an overall deduplication ratio between the similarly compared files of the deduplicated data; wherein calculating the similarity score further includes calculating an nth percentage threshold of common data intersections shared between the plurality of similarly compared files of the deduplicated data, and wherein a transitive closure between the plurality of similarly compared files of the deduplicated data is determined.”  Additionally, Chakerian at paragraph [0038] teaches a similarity threshold for which matches below the threshold are discarded.)

Regarding independent claim 11, while independent claim 11, a non-transitory machine readable medium claim, and independent claim 1, a method claim, are directed towards different statutory classes, they are similar in scope.  Therefore, claim 11 is rejected under the same rationale as claim 1.

Regarding dependent claim 12, all of the particulars of claim 11 have been addressed above.  Additionally claim 12 is rejected under the same rationale as claim 2.

Regarding dependent claim 13, all of the particulars of claim 11 have been addressed above.  Additionally claim 13 is rejected under the same rationale as claim 3.

Regarding dependent claim 14, all of the particulars of claim 11 have been addressed above.  Additionally claim 14 is rejected under the same rationale as claim 4.

Regarding dependent claim 15, all of the particulars of claim 11 have been addressed above.  Additionally claim 15 is rejected under the same rationale as claim 5.

Regarding dependent claim 16, all of the particulars of claim 11 have been addressed above.  Additionally claim 16 is rejected under the same rationale as claim 6.

Regarding dependent claim 18, all of the particulars of claim 11 have been addressed above.  Additionally claim 18 is rejected under the same rationale as claim 8.

Regarding dependent claim 19, all of the particulars of claims 11 and 18 have been addressed above.  Additionally claim 19 is rejected under the same rationale as claim 9.

Regarding independent claim 20, while independent claim 20, a system claim, and independent claim 1, a method claim, are directed towards different statutory classes, they are similar in scope.  Therefore, claim 20 is rejected under the same rationale as claim 1.  With respect to the hardware limitations recited in the system claim, a processor; and a memory coupled to the processor…(See Menezes at Column 12, Lines 22-31).

Claims 7 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Menezes in view of Hirsch in view of Patterson in view of Chakerian in further view of Lee et al. U.S. Pub. No. 2018/0218001 (hereinafter “Lee”).
Regarding dependent claim 7, all of the particulars of claim 1 have been addressed above.  While Menezes as modified by Hirsch in the Abstract teaches, “Deduplicated data is packed into finite-sized containers”, Menezes as modified does not disclose removing a file that would cause a violation of the storage size limit, more specifically, Menezes as modified with Hirsch does not disclose:
wherein each of the plurality of data nodes is associated with a storage size limit, and wherein a third recipe in the ranked recipe list is ignored, invalidated, or removed when moving all files in the file cluster candidate associated with the third recipe to the destination node of the third recipe would cause a violation of the storage size limit of the destination node.
However, Lee at paragraph [0028] teaches in part, “when the size of a file selected to reuse the cluster chain is less than a lower limit of a predetermined size range and the residual capacity of the storage medium is less than a minimum of spare capacity of the storage medium, the method may further comprise deleting the selected file, selecting another file…”
The Menezes, Hirsch and Lee references, in the sections cited by the Examiner, are in the field of endeavor of writing files to a storage medium.  Before the effective filing date of the claimed invention it would have been obvious to one of ordinary skill in the art to combine the packing of finite-sized containers as disclosed by Menezes as modified with Hirsch with the deleting or selecting of another file based on a space capacity taught in Lee to help prevent the storage capacity from being exceeded.

Regarding dependent claim 17, all of the particulars of claim 11 have been addressed above.  Additionally claim 17 is rejected under the same rationale as claim 7.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANTHONY G GEMIGNANI whose telephone number is (571)272-1018. The examiner can normally be reached M-F 8-5 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Hosain T Alam, can be reached at 571-272-3978. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/A.G.G./Examiner, Art Unit 2154               

/HOSAIN T ALAM/Supervisory Patent Examiner, Art Unit 2154