DETAILED ACTION
This action is in response to the reply received 3/21/22. After consideration of applicant's amendments and/or remarks:
Examiner withdraws rejections under 35 USC § 112.
Claims 1-21 are rejected under 35 USC § 103.


Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-6, 10-16, 18, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Yih et al., U.S. PG-Publication No. 2012/0323968 A1, in view of Zhang, Dell, Jun Wang, Deng Cai, and Jinsong Lu. "Self-taught hashing for fast similarity search." In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pp. 18-25. 2010 (hereinafter Zhang), further in view of Gusev et al., U.S. Patent No. 11,233,761 B1.

Claim 1
	Yih discloses a duplicate document detection method of a computer apparatus including processing circuitry. Yih discloses a "process for applying an optimized set of parameters while comparing a plurality of text objects." Yih, ¶ 48; FIG. 4. Text objects include unstructured and/or structured documents. Id. at ¶ 44. The input to the method is a document 401 and the output comprises "documents 402 that are duplicates or near-duplicates of document 401." Id. at ¶ 50. The method is implemented on a computer apparatus comprising processing circuitry. Id. at ¶¶ 54-55.
	Yih discloses the method comprising: acquiring, by the processing circuitry, a respective vector expression for each of a plurality of documents using a similarity model, the similarity model being trained based on a respective [mathematical] similarity associated with each of a plurality of reference document pairs. In one embodiment, "a model is used to map a raw text representation of a text object or document to a vector space." Id. at ¶ 8. The method measures text similarity "using a vector-based method." When comparing documents, "term vectors are constructed to represent each of the documents." Id. at ¶ 18. A "label associated with the two vectors indicates a degree of similarity between the objects represented by the vectors." Id. at ¶ 9. Further, "[p]airs of raw term vectors and their labels, which indicate the similarity of the vectors, are used to train the model." Id. at ¶ 20.
	Yih discloses detecting a duplicate document from among the plurality of documents. The input to the method is a document 401 and the output comprises "documents 402 that are duplicates or near-duplicates of document 401." Id. at ¶ 50.
	Yih does not expressly disclose generating a key by performing a vector quantization on the respective vector expression, the key including a binary character string; and detecting a duplicate document from among the plurality of documents using the key.
	Zhang discloses generating a key by performing a vector quantization on the respective vector expression, the key including a binary character string. Zhang discloses a method related to similarity search; the method is "given a query document" and finds "its most similar documents from a very large document collection." One way to "accelerate similarity search" is using semantic hashing "which designs compact binary codes for a large number of documents so that semantically similar documents are mapped to similar codes (within a short Hamming distance)." Zhang, 18. Zhang discloses that "hashing techniques map feature vectors to binary codes, which is key to extremely fast similarity search." One means for obtaining "binary codes for text document is to binaries the real-valued low-dimensional vectors (obtained from dimensionality reduction techniques like LSI) via thresholding." Id. at ¶ 19. The method can "convert the … l-dimensional real-valued vectors … into binary codes via thresholding," such that if the p-th element of the vector "is larger than the specified threshold," then the p-th bit of the binary code is on (one); otherwise the p-th bit of the binary code is off (zero). Id. at ¶¶ 20-21 (3.1 Stage 1: Unsupervised Learning of Binary Codes).
	Zhang discloses detecting a duplicate document from among the plurality of documents using the key. Zhang uses the binary codes to "return all the documents that are hashed into a tight Hamming ball centered around the binary code of the query document." Id. at 18-19. Zhang discloses that "semantically similar documents should be mapped to similar codes within a short Hamming distance." Id. at 20; see also 22-23 (4.3 Results).
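	For illustration only, the Hamming-ball retrieval described above may be sketched as follows. This is a minimal sketch, not Zhang's implementation: the function names, the 8-bit codes, and the radius of 1 are hypothetical and appear in neither the claims nor the cited reference.

```python
from typing import Dict, List


def hamming_distance(code_a: str, code_b: str) -> int:
    """Count the bit positions at which two equal-length binary strings differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    return sum(a != b for a, b in zip(code_a, code_b))


def hamming_ball(query_code: str, codes: Dict[str, str], radius: int) -> List[str]:
    """Return ids of documents whose code lies within `radius` bits of the query code."""
    return [doc_id for doc_id, code in codes.items()
            if hamming_distance(query_code, code) <= radius]


# Hypothetical 8-bit codes for three documents (illustrative values only).
codes = {"doc_a": "10110010", "doc_b": "10110011", "doc_c": "01001101"}
near = hamming_ball("10110010", codes, radius=1)
```

	In this sketch a radius of 0 returns only documents whose code is identical to the query code, while a larger radius also returns near matches.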
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify the document similarity determination method of Yih to incorporate quantifying document vectors as binary strings as taught by Zhang. One of ordinary skill in the art would be motivated to integrate quantifying document vectors as binary strings into Yih, with a reasonable expectation of success, in order to increase the speed of determining semantically similar documents in a large corpus of documents. See Zhang, 18 (1. Introduction).
	Yih-Zhang does not expressly disclose the respective semantic similarity being obtained by increasing or decreasing a corresponding mathematical similarity.
	Gusev discloses the respective semantic similarity being obtained by increasing or decreasing a corresponding mathematical similarity. Gusev discloses using a "trained machine learning model" to "determine similarity" between two embedding vectors. Gusev, 5:21-43. The embedded vectors are generated from text content. Id. at 5:44-50. The method uses "a cosine similarity function … resulting in an indication or measurement as to the similarity" of a content item and a landing page (e.g. text content). Id. at 7:13-37. This indication or measurement is analogous to the claimed "mathematical similarity." Further, Gusev discloses that "a non-linear weighting may be applied to this measurement … such that the closer the two items of content are, the more weight [is] afforded to this measurement." Id. Accordingly, the measurement (i.e. mathematical similarity) is increased using a non-linear weighting function to generate a weighted measurement (i.e. semantic similarity).
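	For illustration only, a cosine similarity measurement followed by a non-linear weighting may be sketched as follows. The cited passage does not specify the weighting function; the weight `1 + s**2` and all function names are assumptions chosen solely so that closer items receive more weight.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def weighted_similarity(similarity: float) -> float:
    """Apply a hypothetical non-linear weight that grows with the similarity.

    The weight 1 + s**2 is an assumption made for this sketch; it is not
    taken from Gusev or from the present specification.
    """
    weight = 1.0 + similarity ** 2
    return similarity * weight
```

	Under this assumed weight, a positive measurement is increased, and the increase is larger the closer the two items are. A different weight could instead decrease the measurement.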
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify the text similarity determination method of Yih-Zhang to incorporate non-linear weightings as taught by Gusev. One of ordinary skill in the art would be motivated to integrate non-linear weightings into Yih-Zhang, with a reasonable expectation of success, in order to increase similarity determination accuracy by enabling "individual items of input data" to be "weighted in any given processing node such that the weighted input data plays a greater or lesser role in the overall computation for that processing node." See Gusev, 12:7-30.

Claim 2
	Zhang discloses wherein the respective vector expression is an N dimensional real vector, N denoting a natural number of 2 or more. The method of Zhang starts with "a collection of n documents which are represented as m-dimensional vectors." Id. at ¶ 20. Further, Zhang discloses that "documents are typically represented as feature vectors in a space of more than thousands of dimensions." Id. at ¶ 19.

Claim 3
	Zhang discloses wherein the generating the key comprises generating the binary character string by: replacing a value in the respective vector expression of 0 or more with 1, or replacing a negative value in the respective vector expression with 0. The method can "convert the … l-dimensional real-valued vectors … into binary codes via thresholding," such that if the p-th element of the vector "is larger than the specified threshold," then the p-th bit of the binary code is on (one); otherwise the p-th bit of the binary code is off (zero). Id. at ¶¶ 20-21 (3.1 Stage 1: Unsupervised Learning of Binary Codes).
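	For illustration only, the thresholding of a real-valued vector into a binary character string may be sketched as follows. The function name and the sample vector are hypothetical. The sketch follows the claim language (a value of 0 or more becomes 1, a negative value becomes 0); Zhang is quoted above as setting a bit when the element "is larger than" the threshold, so the treatment of a value exactly equal to the threshold is an assumption here.

```python
from typing import Sequence


def quantize_to_key(vector: Sequence[float], threshold: float = 0.0) -> str:
    """Map each element to '1' if it is at or above the threshold, else '0'."""
    return "".join("1" if value >= threshold else "0" for value in vector)


# Hypothetical 4-dimensional vector expression (illustrative values only).
key = quantize_to_key([0.7, -0.2, 0.0, -1.5])
```

	With the default threshold of 0, the sample vector yields the key "1010".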

Claim 4
	Zhang discloses wherein the detecting the duplicate document comprises detecting two among the plurality of documents associated with the key. One way to "accelerate similarity search" is using semantic hashing "which designs compact binary codes for a large number of documents so that semantically similar documents are mapped to similar codes (within a short Hamming distance)." Zhang, 18.

Claim 5
	Yih discloses training the similarity model using a loss function adjusted based on a weight, the weight corresponding to a difference between an output value of the similarity model and a calculated value. Yih discloses a "loss function ... based upon the computed similarity scores and labels associated with the pairs of vectors." The parameters (e.g. weights) of the model "are adjusted or tuned to minimize the loss function." Yih, ¶ 8. Further, the "parameters in a model matrix are trained to minimize the loss of similarity scores of the output vectors." Id. at ¶ 20. The loss function is trained (i.e. adjusted) to identify "the minimum error value in the loss function" and determine parameters "used by [a] … date comparison application, or other process to compare text objects." Id. at ¶¶ 39-41.
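	For illustration only, a loss computed from similarity scores and pair labels, and the selection of parameters that minimize it, may be sketched as follows. This is not Yih's model: the per-term weights, the squared-error loss, the two candidate parameter sets, and all names are assumptions made for the sketch.

```python
import math
from typing import List, Sequence, Tuple

# Each training pair is (term vector, term vector, label); label 1.0 = similar.
Pair = Tuple[Sequence[float], Sequence[float], float]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def project(vector: Sequence[float], weights: Sequence[float]) -> List[float]:
    """Hypothetical model: scale each term by a learned per-term weight."""
    return [w * x for w, x in zip(weights, vector)]


def pair_loss(weights: Sequence[float], pairs: Sequence[Pair]) -> float:
    """Mean squared difference between the similarity score and the label."""
    total = 0.0
    for x, y, label in pairs:
        score = cosine(project(x, weights), project(y, weights))
        total += (score - label) ** 2
    return total / len(pairs)


# Illustrative labeled pairs; the second term is treated as uninformative.
pairs = [
    ([1.0, 1.0, 0.0], [1.0, 0.0, 0.0], 1.0),
    ([0.0, 1.0, 1.0], [0.0, 1.0, 0.0], 0.0),
]
candidates = [[1.0, 1.0, 1.0], [1.0, 0.1, 1.0]]
best = min(candidates, key=lambda w: pair_loss(w, pairs))
```

	Choosing the better of two fixed candidates stands in for the iterative tuning described in the reference. Here the candidate that down-weights the second term gives the lower loss.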

Claim 6
	Yih discloses adjusting an average distance between a plurality of vector expressions by adjusting a value of the weight, the plurality of vector expressions including the respective vector expression. Yih discloses that a "plurality of corresponding loss functions may be average[d] and the average loss function used to adjust model parameters" (i.e. weights). Id. at ¶ 47.

Claim 10
	Claim 10 recites a medium storing instructions for performing the steps of the method recited in claim 1. Accordingly, claim 10 is rejected as indicated in the rejection of claim 1.

Claims 11-16
	Claims 11-16 recite a system configured to perform the steps of the method recited in claims 1-6. Accordingly, claims 11-16 are rejected as indicated in the rejection of claims 1-6.

Claim 18
	Yih discloses wherein the training includes inputting vector expressions for a candidate document pair and a calculated semantic similarity for the candidate document pair into the loss function. Yih discloses that "each of the documents Dn from the first set of text objects is mapped to compact, low-dimensional vector LDn … using a set of parameters." The loss function "has [a] pair of compact vectors and the label data as inputs" (e.g. LDn and LDm). Id. at ¶¶ 38-39.

Claim 20
	Claim 20 recites a system configured to perform the steps of the method recited in claim 18. Accordingly, claim 20 is rejected as indicated in the rejection of claim 18.


Claims 7-9 are rejected under 35 U.S.C. 103 as being unpatentable over Yih, in view of Zhang, further in view of Gusev, further in view of Perram et al., U.S. PG-Publication No. 2018/0075138 A1.

Claim 7
	Yih discloses extracting, by the processing circuitry, a similar document pair set and a dissimilar document pair set from a document database, the similar document pair set and the dissimilar document pair set being included among the plurality of reference document pairs. Yih discloses obtaining "text object pairs" including "documents" that "are associated with labels" indicating "whether the text objects are similar or dissimilar." Yih, ¶ 7.
	Yih discloses calculating, by the processing circuitry, a mathematical similarity for each of the plurality of similar document pairs and each of the plurality of dissimilar document pairs using a mathematical measure to obtain a first plurality of mathematical similarities based on the plurality of the similar document pairs and a second plurality of mathematical similarities based on the plurality of dissimilar document pairs. Yih discloses that the "function for computing similarity scores may be a cosine, Jaccard, or any differentiable function." Id. at ¶ 9.
	Yih discloses calculating, by the processing circuitry, the respective semantic similarity for each of the plurality of similar document pairs and each of the plurality of dissimilar document pairs to obtain a first plurality of semantic similarities based on the plurality of similar document pairs and a second plurality of semantic similarities based on the plurality of dissimilar document pairs, each of the first plurality of semantic similarities being higher than a corresponding one of the first plurality of mathematical similarities, and each of the second plurality of semantic similarities being lower than a corresponding one of the second plurality of mathematical similarities. Yih discloses that "each element of [an] output vector may be a non-linear transformation, such as sigmoid, of [a] linear function." Id. at ¶ 10. Yih discloses that first and second concept vectors vp and vq are created from a first and second text object. A similarity score 110 is calculated using a similarity function 109. Specifically, "a nonlinear activation function, such as sigmoid, may be added … to modify the resulting concept vector." The similarity score 110 generated from the concept vectors and nonlinear activation function is "not just a measurement of literal similarity between the text objects, but provides a measurement of the text objects' semantic similarity." Id. at ¶¶ 24-26. Further, it follows that a first semantically similar pair of documents (e.g. document pairs associated with a label indicating similarity) will generate a higher similarity score; conversely, a second semantically dissimilar pair of documents (e.g. document pairs associated with a label indicating dissimilarity) will generate a lower similarity score.
	Yih discloses training, by the processing circuitry, the similarity model based on the plurality of similar document pairs, the plurality of dissimilar document pairs, the first plurality of semantic similarities and the second plurality of semantic similarities to obtain a trained similarity model. Yih discloses that "parameters in a model matrix are trained to minimize the loss of similarity scores of the output vectors," such that "[p]airs of raw term vectors and their labels, which indicate the similarity of the vectors, are used to train the model." Id. at ¶ 20. Further, a "projection model may be train[ed] using known pairs of text objects" using a dataset (table 203) comparing pairs of text objects with "any number of additional levels of similarity/dissimilarity." Id. at ¶ 35.
	Yih-Zhang-Gusev does not expressly disclose the similar document pair set including a plurality of similar document pairs having a common attribute, and the dissimilar document pair set including a plurality of dissimilar document pairs extracted randomly.
	Perram discloses the similar document pair set including a plurality of similar document pairs having a common attribute, and the dissimilar document pair set including a plurality of dissimilar document pairs extracted randomly. Perram discloses a method for "identifying, in [a] set of electronic documents … duplicate electronic documents." Perram, ¶ 24. The method comprises a step of "electronically assigning a document training set to be classified." Id. at ¶¶ 35; 105. The method can "identify near duplicate documents … on the basis of … dates of creation or modification" (i.e. common attributes of registration time range). Id. at ¶ 110. In another embodiment, a "random sample generator … can be used in the quality assurance of the classification process." Id. at ¶ 121. Documents presented "can be from a random selection of all files from the set of electronic documents to be classified and can be further divided into a training subset." Id. at ¶ 123.
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify the method of identifying duplicate documents of Yih-Zhang-Gusev to incorporate identifying near duplicate documents based on time attributes and training by random document selection as taught by Perram. One of ordinary skill in the art would be motivated to integrate identifying near duplicate documents based on time attributes and training by random document selection into Yih-Zhang-Gusev, with a reasonable expectation of success, in order to "provide a large and dynamic training set, thus improving the accuracy of the classification system." Perram, ¶ 107.
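	For illustration only, the pair-set limitation of claim 7 (similar pairs sharing a common attribute, dissimilar pairs extracted randomly) may be sketched as follows. The document records, the "author" attribute, the fixed seed, and the function names are hypothetical and are not taken from Perram or the present specification.

```python
import itertools
import random
from typing import Dict, List, Tuple


def similar_pairs_by_attribute(docs: List[dict], attribute: str) -> List[Tuple[int, int]]:
    """Pair up every two documents that share the same value of `attribute`."""
    groups: Dict[object, List[int]] = {}
    for doc in docs:
        groups.setdefault(doc[attribute], []).append(doc["id"])
    pairs: List[Tuple[int, int]] = []
    for ids in groups.values():
        pairs.extend(itertools.combinations(ids, 2))
    return pairs


def random_pairs(docs: List[dict], count: int, seed: int = 0) -> List[Tuple[int, int]]:
    """Draw `count` pairs of distinct documents uniformly at random."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    ids = [doc["id"] for doc in docs]
    return [tuple(rng.sample(ids, 2)) for _ in range(count)]


# Hypothetical document database (illustrative values only).
docs = [
    {"id": 1, "author": "kim"},
    {"id": 2, "author": "kim"},
    {"id": 3, "author": "lee"},
    {"id": 4, "author": "kim"},
]
similar = similar_pairs_by_attribute(docs, "author")
dissimilar = random_pairs(docs, count=5)
```

	A randomly drawn pair may occasionally share the attribute; the sketch does not filter such pairs out.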

Claim 8
	Perram discloses wherein the common attribute comprises at least one of an author of a document, a post section of the document, or a registration time range of the document. Perram discloses a method for "identifying, in [a] set of electronic documents … duplicate electronic documents." Perram, ¶ 24. The method comprises a step of "electronically assigning a document training set to be classified." Id. at ¶¶ 35; 105. The method can "identify near duplicate documents … on the basis of … dates of creation or modification" (i.e. common attributes of registration time range). Id. at ¶ 110.

Claim 9
	Yih discloses the calculating the semantic similarity comprises: calculating the first plurality of semantic similarities by inputting the first plurality of mathematical similarities to a first nonlinear function and calculating the second plurality of semantic similarities by inputting the second plurality of mathematical similarities to a second nonlinear function; and the first nonlinear function outputs a value greater than a value output by the second nonlinear function based on any value input to both the first nonlinear function and the second nonlinear function. Yih discloses that a "model is optimized by defining a function for computing a similarity score based upon two output vectors" and a "loss function is based upon the computed similarity scores and labels," wherein "parameters of the model are adjusted or tuned to minimize the loss function." Yih expressly discloses that in some embodiments, "two different sets of parameters models may be trained concurrently." Yih, ¶ 8. Further, Yih discloses that "the same o[r] different mapping functions may be used for the first set of text objects and the second set of text objects," wherein "the mapping functions may be … non-linear." Id. at ¶ 42.
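	For illustration only, a pair of nonlinear functions satisfying the relationship recited in claim 9 may be sketched as follows. Neither the specification nor the cited art gives concrete functions; the square root and the square, the restriction of inputs to the range 0 to 1, and all names are assumptions made for the sketch.

```python
import math


def raise_similarity(mathematical: float) -> float:
    """Hypothetical first nonlinear function: square root, which raises values in (0, 1)."""
    return math.sqrt(mathematical)


def lower_similarity(mathematical: float) -> float:
    """Hypothetical second nonlinear function: square, which lowers values in (0, 1)."""
    return mathematical * mathematical


def semantic_similarity(mathematical: float, is_similar_pair: bool) -> float:
    """Increase the score for a similar pair and decrease it for a dissimilar pair."""
    if not 0.0 <= mathematical <= 1.0:
        raise ValueError("mathematical similarity must lie in [0, 1]")
    if is_similar_pair:
        return raise_similarity(mathematical)
    return lower_similarity(mathematical)
```

	For any input strictly between 0 and 1, the first function returns a larger value than the second; at exactly 0 or 1 the two assumed functions coincide.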


Claims 17 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Yih, in view of Zhang, further in view of Gusev, further in view of Stein, U.S. PG-Publication No. 2011/0055332 A1.

Claim 17
	Stein discloses displaying a Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA) in response to the detecting; and blocking registration of the duplicate document in response to the detecting. Stein discloses a method of "document matching techniques for determining similarity between a candidate document and a reference document." Stein, ¶ 5. In one embodiment, "if a message is determined to be similar to a reference document, a CAPTCHA … message may be issued and sent to the message sender 104 to confirm that the message was not machine generated." Further, "the message similar to the reference document may be blocked." Id. at ¶ 37.
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify the document similarity determination method of Yih-Zhang-Gusev to incorporate displaying a CAPTCHA if text is determined similar to a reference document taught by Stein. One of ordinary skill in the art would be motivated to integrate displaying a CAPTCHA if text is determined similar to a reference document into Yih-Zhang-Gusev, with a reasonable expectation of success, in order to "detect unwanted messages more promptly and efficiently." Stein, ¶ 38.

Claim 19
	Claim 19 recites a system configured to perform the steps of the method recited in claim 17. Accordingly, claim 19 is rejected as indicated in the rejection of claim 17.


Claim 21 is rejected under 35 U.S.C. 103 as being unpatentable over Yih, in view of Zhang, further in view of Gusev, further in view of Thomas et al., U.S. PG-Publication No. 2011/0087668 A1.

Claim 21
	Thomas discloses comparing the key to a plurality of other keys stored in association with the plurality of documents in a table. Thomas discloses methods "for identifying clusters of near-duplicate document[s]," wherein "[c]lusters containing documents that are near-duplicates of each other can be created based on similarity constraints defined in terms of [an] edit distance." Clusters are formed by comparing an N-dimensional vector of a document to an N-dimensional vector of another document. Thomas, ¶¶ 10-11. For each document in the corpus, the method "can compute a hash vector based on word count information for the document." Id. at ¶ 13. A "document information data store" (i.e. table) stores "a vector representation of each document in a corpus." Id. at ¶ 14. Thomas discloses that the "similarity of two document[s] can be evaluated by comparing their hash vectors." Id. at ¶ 55.
	Thomas discloses detecting a particular document associated with the respective vector expression to be the duplicate document based on determining the key to be the same as one of the plurality of other keys. If the hash vectors are identical, the two documents "are referred to herein as 'exact' duplicates." Id. at ¶ 56.
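	For illustration only, comparing a key against keys stored in a table and treating an identical key as a duplicate may be sketched as follows. The in-memory dictionary, the function name, and the sample keys are hypothetical; Thomas's data store and hash vectors are not reproduced here.

```python
from typing import Dict, Optional


def register_or_find_duplicate(doc_id: str, key: str,
                               key_table: Dict[str, str]) -> Optional[str]:
    """Return the id of a stored document with the same key, else store this one.

    `key_table` maps a binary key to the id of the first document seen with it.
    """
    existing = key_table.get(key)
    if existing is not None:
        return existing          # identical key found: treat as a duplicate
    key_table[key] = doc_id      # no match: remember this document's key
    return None


# Hypothetical keys (illustrative values only).
table: Dict[str, str] = {}
first = register_or_find_duplicate("doc_a", "1010", table)
second = register_or_find_duplicate("doc_b", "0110", table)
third = register_or_find_duplicate("doc_c", "1010", table)
```

	Here the third document is reported as a duplicate of "doc_a" because both carry the key "1010", and the table is left unchanged by that lookup.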
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify the document similarity determination method of Yih-Zhang-Gusev to incorporate document vector hash comparison to determine exact duplicates as taught by Thomas. One of ordinary skill in the art would be motivated to integrate document vector hash comparison into Yih-Zhang-Gusev, with a reasonable expectation of success, in order to increase efficiency of document processing applications (i.e. "speed up processing") by consolidating known exact duplicate documents. Thomas, ¶ 57.


Response to Arguments
Applicant's arguments filed 3/21/22 have been fully considered but they are not persuasive.
Applicant argues that the cited prior art does not teach "the respective semantic similarity being obtained by increasing or decreasing a corresponding mathematical similarity;" specifically that Yih does not teach both the claimed "semantic similarity" and the claimed "mathematical similarity." Rem. pp. 10-11.
The Examiner disagrees.
The present specification discloses calculating a mathematical similarity "using at least one of a cosine similarity, a Euclidean distance, and/or a Jaccard similarity as the mathematical measure." Spec., ¶ 74. Yih discloses computing similarity scores between two vectors representing text using a similarity function of "a cosine, Jaccard, or any differentiable function." The similarity scores provide "a measurement of the text object's semantic similarity." Yih, ¶¶ 9; 24. Accordingly, the similarity score is analogous to the claimed "mathematical similarity."
The present specification discloses calculating a semantic similarity by increasing or decreasing the mathematical similarity using a non-linear function, wherein "[a]n increase o[r] a decrease level of the mathematical similarity may be determined based on a nonlinear function selected between [a] first nonlinear function and [a] second nonlinear function." Spec., ¶ 75. The specification provides no specific example implementation of these nonlinear functions; rather, the specification merely states that "each of the first nonlinear function and/or the second nonlinear function may be designed, determined and/or selected through empirical study." Id. Thus, the broadest reasonable interpretation of "semantic similarity" is the result of increasing or decreasing a mathematical similarity value using any nonlinear function.
	Gusev discloses the respective semantic similarity being obtained by increasing or decreasing a corresponding mathematical similarity. Gusev discloses using a "trained machine learning model" to "determine similarity" between two embedding vectors. Gusev, 5:21-43. The embedded vectors are generated from text content. Id. at 5:44-50. The method uses "a cosine similarity function … resulting in an indication or measurement as to the similarity" of a content item and a landing page (e.g. text content). Id. at 7:13-37. This indication or measurement is analogous to the claimed "mathematical similarity." Further, Gusev discloses that "a non-linear weighting may be applied to this measurement … such that the closer the two items of content are, the more weight [is] afforded to this measurement." Id. Accordingly, the measurement (i.e. mathematical similarity) is increased using a non-linear weighting function to generate a weighted measurement (i.e. semantic similarity).
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify the text similarity determination method of Yih-Zhang to incorporate non-linear weightings as taught by Gusev. One of ordinary skill in the art would be motivated to integrate non-linear weightings into Yih-Zhang, with a reasonable expectation of success, in order to increase similarity determination accuracy by enabling "individual items of input data" to be "weighted in any given processing node such that the weighted input data plays a greater or lesser role in the overall computation for that processing node." See Gusev, 12:7-30.
	Accordingly, claims 1-21 are rejected under 35 USC § 103.


Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to FRANK D MILLS whose telephone number is (571)270-3172. The examiner can normally be reached M-F 10-6 ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, KAVITA PADMANABHAN, can be reached on (571)272-8352. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/FRANK D MILLS/
Primary Examiner, Art Unit 2176
July 1, 2022