EXAMINER'S AMENDMENT
 	An examiner’s amendment to the record appears below.  Should the changes and/or additions be unacceptable to applicant, an amendment may be filed as provided by 37 CFR 1.312.  To ensure consideration of such an amendment, it MUST be submitted no later than the payment of the issue fee.
 	Authorization for this examiner’s amendment was given in a telephone interview with Michael Carmen, Registration No. 43,533 on 06/07/2022.
 	The title of the invention has been changed to the following: “DATA PROCESSING METHOD, ELECTRONIC DEVICE AND COMPUTER READABLE STORAGE METHOD FOR DEDUPLICATION OF A TRAINING DATASET”.

 	The claims have been amended as follows:

  1. (Currently amended)  A data processing method, comprising:
obtaining a first subset and at least a second subset in a training dataset for learning, the first subset and the at least [[a]] second subset having a same size;
determining a set of substrings based on data strings in the first subset and the at least [[a]] second subset, the substrings in the set of substrings being suffix substrings of the data strings and being sorted in a lexicographical order; 
determining a grain for deduplication of the training dataset from a set of longest common prefix (CLP) lengths of adjacent substrings in the set of substrings, for use in the deduplication; and 
training a machine learning model based on the determined grain and the deduplication of the training dataset.

  2. (Currently amended)  The method of claim 1, wherein determining the set of substrings comprises:
for each of the at least [[a]] second subset,
concatenating a first data string corresponding to the first subset and a second data string corresponding to the second subset by inserting a separator between the first data string and the second data string; and
determining the set of substrings based on a sort of suffix substrings of the concatenated data string in the lexicographical order.

  3. (Original)  The method of claim 1, wherein determining the grain comprises:
determining the grain from the set of the CLP lengths based on a length of the first subset and each CLP length in the set of the CLP lengths.

  4. (Original)  The method of claim 3, wherein determining the grain comprises:
determining, for each CLP length in the set of the CLP lengths, a modulo operation result of the length of the first subset and the CLP length;
in response to the modulo operation result being greater than zero, removing the CLP length from the set of the CLP lengths; and
determining the grain from the removed set of the CLP lengths.

  5. (Original)  The method of claim 3, wherein determining the grain comprises:
for each CLP length in the set of the CLP lengths,
comparing the CLP length with a predetermined value, and
in response to the CLP length being less than or equal to the predetermined value, removing the CLP length from the set of the CLP lengths; and
determining the grain from the removed set of the CLP lengths.

  6. (Currently amended)  The method of claim 1, wherein determining the grain comprises:
determining the grain from the set of the CLP lengths based on a length of the first subset and a length of each substring in the set of substrings.

  7. (Currently amended)  The method of claim 6, wherein determining the grain comprises:
comparing a length difference between adjacent substrings with the length of the first subset;
in response to the length difference being less than or equal to the length of the first subset, removing respective CLP lengths of the compared adjacent substrings from the set of the CLP lengths; and
determining the grain from the removed set of the CLP lengths.

  8. (Currently amended)  The method of claim 1, wherein determining the grain comprises:
determining the grain from the set of the CLP lengths based on the number of times that each substring in the set of substrings occurs in the set of substrings corresponding to the respective second subset.

  9. (Currently amended)  The method of claim 8, wherein determining the grain comprises:
determining a product of the number of times that each substring in the set of substrings occurs in the set of substrings corresponding to the respective second subset and a length of the substring; and
determining a CLP length corresponding to the substring having the largest product as the grain.

  10. (Currently amended)  An electronic device, comprising:
a processor; and
a memory storing instructions which, when executed by the processor, cause the electronic device to:
obtain a first subset and at least a second subset in a training dataset for learning, the first subset and the at least [[a]] second subset having a same size;
determine a set of substrings based on data strings in the first subset and the at least [[a]] second subset, the substrings in the set of substrings being suffix substrings of the data strings and being sorted in a lexicographical order; 
determine a grain for a deduplication of the training dataset from a set of longest common prefix (CLP) lengths of adjacent substrings in the set of substrings, for use in the deduplication; and
train a machine learning model based on the determined grain and the deduplication of the training dataset.

  11. (Currently amended)  The electronic device of claim 10, wherein the memory further stores instructions which, when executed by the processor, cause the electronic device to:
for each of the at least [[a]] second subset,
concatenate a first data string corresponding to the first subset and a second data string corresponding to the second subset by inserting a separator between the first data string and the second data string; and
determine the set of substrings based on a sort of suffix substrings of the concatenated data string in the lexicographical order.

  12. (Original)  The electronic device of claim 10, wherein the memory further stores instructions which, when executed by the processor, cause the electronic device to:
determine the grain from the set of the CLP lengths based on a length of the first subset and each CLP length in the set of the CLP lengths.

  13. (Original)  The electronic device of claim 12, wherein the memory further stores instructions which, when executed by the processor, cause the electronic device to:
determine, for each CLP length in the set of the CLP lengths, a modulo operation result of the length of the first subset and the CLP length;
in response to the modulo operation result being greater than zero, remove the CLP length from the set of the CLP lengths; and
determine the grain from the removed set of the CLP lengths.

  14. (Original)  The electronic device of claim 12, wherein the memory further stores instructions which, when executed by the processor, cause the electronic device to:
for each CLP length in the set of the CLP lengths, 
compare the CLP length with a predetermined value, and
in response to the CLP length being less than or equal to the predetermined value, remove the CLP length from the set of the CLP lengths; and
determine the grain from the removed set of the CLP lengths.

  15. (Currently amended)  The electronic device of claim 10, wherein the memory further stores instructions which, when executed by the processor, cause the electronic device to:
determine the grain from the set of the CLP lengths based on a length of the first subset and a length of each substring in the set of substrings.

  16. (Currently amended)  The electronic device of claim 15, wherein the memory further stores instructions which, when executed by the processor, cause the electronic device to:
compare a length difference between adjacent substrings with the length of the first subset;
in response to the length difference being less than or equal to the length of the first subset, remove respective CLP lengths of the compared adjacent substrings from the set of the CLP lengths; and
determine the grain from the removed set of the CLP lengths.

  17. (Currently amended)  The electronic device of claim 10, wherein the memory further stores instructions which, when executed by the processor, cause the electronic device to:
determine the grain from the set of the CLP lengths based on the number of times that each substring in the set of substrings occurs in the set of substrings corresponding to the respective second subset.

  18. (Currently amended)  The electronic device of claim 17, wherein the memory further stores instructions which, when executed by the processor, cause the electronic device to:
determine a product of the number of times that each substring in the set of substrings occurs in the set of substrings corresponding to the respective second subset and a length of the substring; and
determine a CLP length corresponding to the substring having the largest product as the grain.

  19. (Currently amended)  A non-transitory computer readable storage medium having computer executable instructions stored thereon which, when executed, cause a machine to implement a data processing method, comprising:
obtaining a first subset and at least a second subset in a training dataset for learning, the first subset and the at least [[a]] second subset having a same size;
determining a set of substrings based on data strings in the first subset and the at least [[a]] second subset, the substrings in the set of substrings being suffix substrings of the data strings and being sorted in a lexicographical order; 
determining a grain for deduplication of the training dataset from a set of longest common prefix (CLP) lengths of adjacent substrings in the set of substrings, for use in the deduplication; and
training a machine learning model based on the determined grain and the deduplication of the training dataset.

  20. (Currently amended)  The computer readable storage medium of claim 19, wherein determining the set of substrings comprises:
for each of the at least [[a]] second subset,
concatenating a first data string corresponding to the first subset and a second data string corresponding to the second subset by inserting a separator between the first data string and the second data string; and
determining the set of substrings based on a sort of suffix substrings of the concatenated data string in the lexicographical order.
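
As a reading aid, the steps recited in the method claims (concatenation with a separator, lexicographic sorting of suffix substrings, CLP lengths of adjacent substrings, and removal of CLP lengths by the modulo and predetermined-value tests) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the function names, the separator character, and the sample strings are hypothetical, a naive sort stands in for whatever suffix-sorting technique is actually used, and because the claims do not say how the grain is finally selected from the remaining CLP lengths, the sketch stops at the filtered candidate set.

```python
def sorted_suffixes(first: str, second: str, sep: str = "\x00") -> list[str]:
    """Concatenate two data strings with a separator and sort all suffixes.

    The separator (assumed here to be NUL) must not occur in the data.
    """
    text = first + sep + second
    return sorted(text[i:] for i in range(len(text)))


def common_prefix_length(a: str, b: str) -> int:
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def adjacent_clp_lengths(suffixes: list[str]) -> list[int]:
    """CLP length for each pair of adjacent suffixes in sorted order."""
    return [common_prefix_length(a, b) for a, b in zip(suffixes, suffixes[1:])]


def candidate_grains(lengths: list[int], subset_len: int, min_len: int) -> list[int]:
    """Keep CLP lengths that pass both removal tests.

    A length is removed if it is zero or <= min_len (the predetermined-value
    test), or if subset_len % length > 0 (the modulo test).
    """
    kept = set()
    for n in lengths:
        if n == 0 or n <= min_len:
            continue
        if subset_len % n != 0:
            continue
        kept.add(n)
    return sorted(kept)


# Hypothetical sample data: two subsets of the same size (8 characters each).
first_subset = "abcdabcd"
second_subset = "abcdabcd"

suffixes = sorted_suffixes(first_subset, second_subset)
lengths = adjacent_clp_lengths(suffixes)
candidates = candidate_grains(lengths, subset_len=len(first_subset), min_len=2)
print(candidates)
```

With these sample strings, the candidate set contains only lengths that evenly divide the subset length of 8 and exceed the assumed threshold of 2.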

	The following is an examiner's statement of reasons for allowance:

 	The prior art of record, including the newly cited prior art, whether taken individually or in combination, does not expressly teach or render obvious the limitations recited in independent claims 1, 10 and 19 as a whole.
 	At best, the prior art of record, specifically Xia (Xia et al., “A Comprehensive Study of the Past, Present and Future of Data Deduplication,” published Aug. 2, 2016 by IEEE), provides a survey of data deduplication technologies for storage systems (e.g., see Xia, Abstract, pages 1681-1682).  Xu (US 2019/0294588) teaches a text deduplication method involving matching substrings in a text (e.g., see Xu, Abstract).  Sharangpani (US 9,292,584) teaches performing lossless data reduction on data sets; the losslessly reduced data can then be transmitted; a method for compressing data is to identify repeated occurrences of a string within a sliding window (e.g., see Sharangpani, col. 2, lines 53-66).  Austermann (US 2012/0117076) teaches identifying a subset of a data set by comparing suffixes of query field values to data field values of records in the data set (e.g., see Austermann, Abstract).

	In addition, no reference was uncovered that would have provided a basis of evidence for asserting a motivation to combine, nor would one of ordinary skill in the art at the time the invention was made, knowing the teachings of the prior art of record, have combined them to arrive at the present invention as recited in the context of independent claims 1, 10 and 19 as a whole.

 	Thus, independent claims 1, 10 and 19 are allowed over the prior art of record.

 	Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee. Such submissions should be clearly labeled "Comments on Statement of Reasons for Allowance."
 	Any inquiry concerning this communication or earlier communications from the examiner should be directed to ERIC YOON whose telephone number is (408)918-7581.  The examiner can normally be reached on Monday-Friday, 8 am to 5 pm, PST.  

 	If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Jennifer Welch, can be reached at 571-272-7212. The fax phone number for the organization where this application or proceeding is assigned is 571-483-7388.  Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).  If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ERIC J YOON/Primary Examiner, Art Unit 2143