Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Shapira, “Content-Based Data Leakage Detection Using Extended Fingerprinting” (2012) (Shapira), in view of Gupta, US 11,100,087.
With respect to claim 1, Shapira teaches “1. A non-transitory computer-readable storage medium having computer-readable code stored thereon for programming one or more processors to perform steps of: obtaining a file to be checked for Data Loss Prevention (DLP)” on p. 32 
Detection phase – fingerprints of outgoing documents are extracted and compared with confidential fingerprints in the database in order to detect leakage of confidential documents.

 “determining a cryptographic hash of the file and comparing the cryptographic hash to corresponding cryptographic hashes of indexed files” on p. 16: 
Most fingerprinting methods make use of hash functions. In most cases, classic cryptographic functions, such as MD5 (Rivest, 1992) or SHA (U.S. Department of Commerce, 1993) are appropriate.

and on p. 32: 
Detection phase – fingerprints of outgoing documents are extracted and compared with confidential fingerprints in the database in order to detect leakage of confidential documents.
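
For illustration only, the exact-match comparison described in the quoted passages can be sketched as follows. This is a minimal sketch and not Shapira's implementation: the names `file_digest` and `is_exact_match`, the choice of SHA-256, and the in-memory set standing in for the fingerprint database are all assumptions.

```python
import hashlib
from typing import Set


def file_digest(data: bytes) -> str:
    """Hex digest of the raw file bytes (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def is_exact_match(data: bytes, indexed_digests: Set[str]) -> bool:
    """True when the file's digest equals the digest of any indexed file."""
    return file_digest(data) in indexed_digests


# Hypothetical index of two confidential files.
indexed = {file_digest(b"quarterly forecast"), file_digest(b"merger memo")}
```

A single changed byte produces a different digest, so a whole-file hash comparison of this kind only detects byte-identical copies.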

	“responsive to a match between the cryptographic hash and one of the corresponding cryptographic hashes, determining a DLP match and performing an action based thereon” on p. 37 
The input of this processing step is a document d while the output is the “confidentiality score” of d. A document with a “confidentiality score” above some threshold (its optimal value depends on domain and is a tradeoff between false positive and false negative rates) is considered confidential and is detected as leakage. The input document d is fingerprinted using the new fingerprinting method described in Section 4.1.1, resulting in a list of hashes representing the document. Next, a list of documents which contain at least one of these hashes is retrieved from the fingerprints database. The confidentiality score of document d is set to the maximal number of hashes detected in any of the documents in the list.

(action is any action that performs leakage detection, for example); 
“. . . . extracting text from the file and creating an ordered sequence of hashes of variable length chunks of the extracted text” on p. 33
To support the dynamic nature of the dataset, the applied detection method should enable real-time indexing of new documents and modifications to the status of the document's confidentiality. In the proposed method, these operations are linear to the file’s size, since they only require an addition or deletion from a database.
on p. 35 
Hash selection – this step is applied to confidential documents only and modified so that only hashes that appear in less than m non-confidential documents are considered a document’s fingerprint. The aim of Hash Selection is to make a fingerprint of the core confidential content of a confidential document, ignoring common phrases, disclaimers, standard forms, etc. According to our preliminary evaluation, m should be set to 1, i.e., even if a skip-gram appears in a single non-confidential 

Figure 6 ( “baracknetanyahuobama invitnetanyahuobama invitobama white netanyahuobama white invitnetanyahu white housinvitnetanyahu housinvit white housnetanyahu white netanyahu visit white housnetanyahu visit” are all variable length chunks that are in an order corresponding to how they appear in the text); see also Figure 6 (“*In reality the hashes of presented n-grams/skip-grams stored”);  
“and determining the DLP match with one of the indexed files based on comparing the ordered sequence of hashes with corresponding ordered sequence of hashes of the indexed files” on pp. 32-33: 
In modern organizations the document sets that represent confidential and non-confidential information may change from day to day. For example, highly sensitive information about a new product may become public shortly after its launch. Furthermore, in large organizations, enormous amounts of sensitive reports, emails, and presentations are produced daily and therefore it is important to detect leakage attempts in real-time. To support the dynamic nature of the dataset, the applied detection method should enable real-time indexing of new documents and modifications to the status of the document's confidentiality. In the proposed method, these operations are linear to the file’s size, since they only require an addition or deletion from a database.
p. 37 
The input of this processing step is a document d while the output is the “confidentiality score” of d. A document with a “confidentiality score” above some threshold (its optimal value depends on domain and is a tradeoff between false positive and false negative rates) is considered confidential and is detected as leakage. The input document d is fingerprinted using the new fingerprinting method described in Section 4.1.1, resulting in a list of hashes representing the document. Next, a list of documents which contain at least one of these hashes is retrieved from the fingerprints database. The confidentiality score of document d is set to the maximal number of hashes detected in any of the documents in the list.
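
For illustration only, the scoring step described in the passage quoted from p. 37 can be sketched as follows. This is a minimal sketch under stated assumptions, not Shapira's code: the inverted index layout (hash mapped to the identifiers of the confidential documents containing it), the function names, and the de-duplication of the input hashes are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def confidentiality_score(doc_hashes: Iterable[str],
                          index: Dict[str, Set[str]]) -> int:
    """Maximal number of the document's hashes found in any single indexed document."""
    hits = Counter()
    for h in set(doc_hashes):            # assumption: count each distinct hash once
        for doc_id in index.get(h, ()):  # indexed documents containing this hash
            hits[doc_id] += 1
    return max(hits.values(), default=0)


def is_leak(doc_hashes: Iterable[str], index: Dict[str, Set[str]],
            threshold: int) -> bool:
    """A score above the threshold is treated as leakage."""
    return confidentiality_score(doc_hashes, index) > threshold


# Hypothetical index: document "A" holds h1 and h2, document "B" holds h1 and h3.
index = {"h1": {"A", "B"}, "h2": {"A"}, "h3": {"B"}}
```

With this index, an input fingerprinted as `["h1", "h2", "h9"]` shares two hashes with document "A" and one with document "B", so its score is 2.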

p. 38 and Figure 7 
	As can be seen in Figure 7, both methods received the same text segment “Barack Obama has issued an invitation to Israeli Premier Benjamin Netanyahu to visit the White House.” as input. This is actually a rephrasing of the previously fingerprinted “Barack Obama invites Netanyahu for White House visit” (Figure 5). The proposed method detected three matches between these text segments (resulting in a confidentiality score of 3), while full fingerprint did not detect even a single match
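
For illustration only, the reason skip-gram fingerprints can survive rephrasing may be sketched as follows. This is a minimal sketch under stated assumptions and does not reproduce Shapira's figures: the function names, the SHA-256 hash, and the sorting of words inside each gram are assumptions, and the stemming and stop-word handling of the cited method are omitted.

```python
import hashlib
from itertools import combinations
from typing import List, Sequence, Set, Tuple


def skip_grams(tokens: Sequence[str], n: int = 3, k: int = 1) -> List[Tuple[str, ...]]:
    """All n-word subsequences, in text order, that skip at most k words in total."""
    grams = []
    for i in range(len(tokens)):
        # The remaining n-1 words are drawn from the next n-1+k positions.
        window = range(i + 1, min(i + n + k, len(tokens)))
        for rest in combinations(window, n - 1):
            grams.append((tokens[i],) + tuple(tokens[j] for j in rest))
    return grams


def fingerprint(tokens: Sequence[str], n: int = 3, k: int = 1) -> Set[str]:
    """Set of hashes, one per skip-gram (words sorted within a gram, an assumption)."""
    return {
        hashlib.sha256(" ".join(sorted(g)).encode("utf-8")).hexdigest()
        for g in skip_grams(tokens, n, k)
    }


original = ["x", "a", "b", "c"]
rephrased = ["a", "y", "b", "c"]   # an extra word inserted between "a" and "b"
shared_with_skips = fingerprint(original, 3, 1) & fingerprint(rephrased, 3, 1)
shared_without_skips = fingerprint(original, 3, 0) & fingerprint(rephrased, 3, 0)
```

In this toy input, allowing one skip lets the gram (a, b, c) be recovered from the rephrased text, whereas contiguous 3-grams share nothing.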

Shapira further teaches detecting leakage attempts in real-time (p. 32, last 3 lines, and p. 33, lines 1-3) and real-time indexing of dynamic data (p. 35, lines 1-3). 
Shapira does not appear to explicitly teach performing the extraction step “responsive to no match.”
However, Gupta US 11,100,087 teaches “responsive to no match extracting text from the file and creating an ordered sequence of hashes of variable length chunks of the extracted text” (emphasis added) in col. 8:9-17 (tokenization); col. 5:63-col. 6:5 (tokenization of extracted text includes hashing); and Fig. 3, items 310, 316, 318, and 320 (text is extracted from Fig. 3, items 302 and 308; the tokens are of variable length (e.g., 316, 318) and form an ordered sequence). 
Gupta and Shapira are analogous art because they are from the same field of endeavor as Applicant’s claimed invention. 
It would have been obvious to one skilled in the art before the effective filing date of the invention to modify “extracting text from the file and creating an ordered sequence of hashes of variable length chunks of 
Claim 9 and claim 17 are rejected for the same reasons. 
With respect to claim 2, Shapira teaches “2. The non-transitory computer-readable storage medium of claim 1, wherein the determining the DLP match based on the comparing the ordered sequence of hashes utilizes a match score based on a number of the hashes that match and the DLP match is based on the match score being above a threshold” on p. 37 
The input of this processing step is a document d while the output is the “confidentiality score” of d. A document with a “confidentiality score” above some threshold (its optimal value depends on domain and is a tradeoff between false positive and false negative rates) is considered confidential and is detected as leakage.

Claim 10 and claim 18 are rejected for the same reason. 
With respect to claim 3, Shapira teaches “3. The non-transitory computer-readable storage medium of claim 2, wherein the threshold is user-configurable in value and configurable across a different profile of the indexed files” on p. 37 
The input of this processing step is a document d while the output is the “confidentiality score” of d. A document with a “confidentiality score” above some threshold (its optimal value depends on domain and is a tradeoff between false positive and false negative rates) is considered confidential and is detected as leakage.

p. 54 
As can be seen in figures 16-19, the parameters n - (number of words in the n-gram), and k - (number of skips allowed) that yield the best AUC for each scenario are different, therefore they should be adjusted for each specific domain... Therefore, when the properties of a domain are not known in advance, an organization should set k relatively high in order to compensate for a probable wrong choice of n. In general, it seems that setting n to 3 and setting a relatively high k is the best default configuration appropriate for most of the tested domains
p. 73 
The ROC is a graph representation of the tradeoff between the True Positive Rate (TPR) and the False Positive Rate (FPR) for different thresholds.

(“optimal value of threshold” is configurable by a user (i.e., an organization) and can be based on a domain, for example; Examiner finds an organization in a particular domain is an example of a profile). 
Claim 11 is rejected for the same reason. 
With respect to claim 4, Shapira teaches “4. The non-transitory computer-readable storage medium of claim 1, wherein the steps further include responsive to the DLP match based on the comparing the ordered sequences of hashes, performing an action based thereon” on p. 37 
The input of this processing step is a document d while the output is the “confidentiality score” of d. A document with a “confidentiality score” above some threshold (its optimal value depends on domain and is a tradeoff between false positive and false negative rates) is considered confidential and is detected as leakage. The input document d is fingerprinted using the new fingerprinting method described in Section 4.1.1, resulting in a list of hashes representing the document. Next, a list of documents which contain at least one of these hashes is retrieved from the fingerprints database. The confidentiality score of document d is set to the maximal number of hashes detected in any of the documents in the list.

(action is any action that performs leakage detection, for example); see also p. 32, last 3 lines, and p. 33, lines 1-3 (real-time indexing of dynamic data). 
Claim 12 and claim 19 are rejected for the same reason. 



With respect to claim 5, Gupta teaches “The non-transitory computer-readable storage medium of claim 1, wherein the steps further include prior to the obtaining the file, obtaining a lookup table for a tenant associated with a user of the file, wherein the lookup table includes the ordered sequence of hashes indexed to the indexed files” in col. 8:9-33 (library table is lookup table); Fig. 3, items 322 and 324 (ordered sequence); col. 6:16-25 (Examiner finds user profile teaches at least one tenant). The motivation to combine is the same as above.
Claim 13 and claim 20 are rejected for the same reason. 
With respect to claim 6, Gupta teaches “The non-transitory computer-readable storage medium of claim 5, wherein the lookup table is created in an indexing tool, and wherein the indexed files cannot be recreated from data in the lookup table” in col. 5:63-col. 6:5 and col. 8:41-53. The motivation to combine is the same as above. Claim 14 is rejected for the same reason. 
With respect to claim 7, Shapira teaches “7. The non-transitory computer-readable storage medium of claim 1, wherein the file is of a first file type, and wherein the file is determined to match one of the indexed files being a second file type, but having identical text therein” on p. 2 

Content based – detection of leakage by analyzing the file content. 
In this thesis, we describe a content-based method that can be used for detecting data leakage sourcing from inside an organization in general and mitigating the intentional data leakage scenario, in particular.

(content-based detection teaches that the content is analyzed; therefore, any number of file types with identical content can be detected). Claim 15 is rejected for the same reasons. 
With respect to claim 8, Shapira teaches “8. The non-transitory computer-readable storage medium of claim 1, wherein the file is determined to match one of the indexed files having similar text therein” on p. 38:
As can be seen in Figure 7, both methods received the same text segment “Barack Obama has issued an invitation to Israeli Premier Benjamin Netanyahu to visit the White House.” as input. This is actually a rephrasing of the previously fingerprinted “Barack Obama invites Netanyahu for White House visit” (Figure 5). The proposed method detected three matches between these text segments (resulting in a confidentiality score of 3). . .

(Fig. 5 and Fig. 7 have similar text therein and were matched). Claim 16 is rejected for the same reason.  
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ALBERT M PHILLIPS, III whose telephone number is (571)270-3256. The examiner can normally be reached 10a-6:30pm EST M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Mariela D. Reyes, can be reached on (571)270-1006. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application 





/ALBERT M PHILLIPS, III/Primary Examiner, Art Unit 2159