DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on January 28, 2020, June 4, 2020, and September 13, 2021 are being considered by the examiner.

Claim Objections
Claims 1-20 are objected to because of the following informalities:  
In claim 1 at lines 4-5, the phrase “each of the one or more features of” should read “each of the one or more features of each audio frame”.
Claims 2-10 are objected to in light of their dependence from claim 1.
In claim 11 at lines 10-11, “the first set of segments” should read “the first set of audio segments.”  
Claims 12-20 are objected to in light of their dependence from claim 11.
Appropriate correction is required.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:


Claims 11-13 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Non-Patent Literature to Zhang (Y. Zhang and J. R. Glass. “Unsupervised Spoken Keyword Spotting via Segmental DTW on Gaussian Posteriorgrams.” Proceedings of the 2009 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU 2009) IEEE, 2009. 398–403, hereinafter Zhang) in view of Non-Patent Literature to Zhang (Y. Zhang and J. R. Glass, “An inner-product lower-bound estimate for dynamic time warping,” 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 5660-5663, hereinafter Zhang2).

Regarding claim 11, Zhang discloses A computer-implemented method comprising (“completely unsupervised learning framework”; Zhang, pg. 398, Col. 2, para. 5) : segmenting, by a computer, a first audio signal into a first set of one or more audio segments, and a second audio signal into a second set of one or more audio segments (The system included a “462 speaker training set of 3,696 utterances and the common 118 speaker test set of 944 utterances,” thus, including a first audio signal and a second audio signal. “The total size of the vocabulary was 5,851 words. Each utterance was segmented into a series of 25 ms frames {the first and second sets of one or more audio segments} with a 10 ms window shifting (i.e., centi-second analysis); each frame was represented by 13 Mel-Frequency Cepstral Coefficients (MFCCs).”; Zhang, pg. 401, col. 1, para 5.); generating, by the computer, sets of one or more paths for each audio segment in the first set of audio segments, and sets of one or more paths for each audio segment in the second set of audio segments (“As we keep moving the start coordinate, for each keyword, we will have {generate, by a computer} a total of [n−1/R] warping paths, each of which represents a warping between the entire keyword sample {sets of Zhang, pg. 400, col. 2, para 6.). However, Zhang fails to expressly recite calculating, by the computer, based on lower-bound dynamic time-warping algorithm, a similarity score for each path of each audio segment of the first set of audio segments, and for each path of each audio segment of the second set of audio segments; and identifying, by the computer, at least one similar acoustic region between the first set of segments and the second set of audio segments, based upon comparing the similarity scores of each path of each segment of the first set of audio segments against the similarity scores of each path of each segment of the second set of audio segments.
Zhang2 teaches the use of lower bound estimates for dynamic time warping. (Zhang2, Abstract). Regarding claim 11, Zhang2 teaches calculating, by the computer, based on lower-bound dynamic time-warping algorithm, a similarity score for each path of each audio segment of the first set of audio segments, and for each path of each audio segment of the second set of audio segments (“Given two posteriorgram sequences, Q, and S, we can determine a lower-bound {calculating, by a computer, based on lower-bound dynamic time warping algorithm…} of their actual DTW score {a similarity score}” where “all possible warping paths, φ, [can be] considered between Q and S. {for each path of each audio segment of the first set of audio segments, and for each path of each audio segment of the second set of audio segments}”; Zhang2, pg. 5661, Col. 1, paras. 4 and 5); and identifying, by the computer, at least one similar acoustic region between the first set of segments and the second set of audio segments (The system determines the “lower-bound DTW score {similarity score} between two posteriorgrams, Q and S {between the first set of segments and the second set of segments}” and determine “the overall best alignment score, DTW(Q, S) {identifying at least one similar acoustic region}”; Zhang2, pg. 5661, Col. 1, paras. 3 and 4), based upon comparing the similarity scores of each path of each segment of the first set of audio segments against the similarity scores of each path of each segment of the second set of audio segments (The system determines “overall best alignment score, DTW {comparison of similarity scores}” between “two posteriorgram sequences for a speech query, Q,{of each path of each segment of the first set of audio segments} and a speech segment, S {of each path of each segment of the second set of audio segments}.”; Zhang2, pg. 5661, Col. 1, para. 3).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the unsupervised spoken word detection framework of Zhang to incorporate the teachings of Zhang2 to include calculating, by the computer, based on lower-bound dynamic time-warping algorithm, a similarity score for each path of each audio segment of the first set of audio segments, and for each path of each audio segment of the second set of audio segments; and identifying, by the computer, at least one similar acoustic region between the first set of segments and the second set of audio segments, based upon comparing the similarity scores of each path of each segment of the first set of audio segments against the similarity scores of each path of each segment of the second set of audio segments. The use of the lower-bound estimate “prunes away undesirable segments in each utterance,” as recognized by Zhang2 (Zhang2, pg. 5663, col. 1, para. 1).
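For illustration only, the pruning role of a lower-bound estimate can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Zhang2 or from the application: it assumes equal-length posteriorgram sequences, a band constraint of r frames, a local distance of -log(p·q), and an envelope-style bound built from the component-wise maximum of the frames inside the band. All function names are hypothetical.

```python
import math

INF = float("inf")

def inner_product_distance(p, q, eps=1e-10):
    # Local distance between two probability vectors: -log(p . q), floored at 0.
    dot = sum(a * b for a, b in zip(p, q))
    return max(0.0, -math.log(max(dot, eps)))

def banded_dtw(Q, S, r):
    # Accumulated DTW cost for equal-length sequences, restricted to |i - j| <= r.
    n = len(Q)
    assert len(S) == n, "this sketch assumes equal-length sequences"
    cost = [[INF] * n for _ in range(n)]
    for i in range(n):
        for j in range(max(0, i - r), min(n, i + r + 1)):
            d = inner_product_distance(Q[i], S[j])
            if i == 0 and j == 0:
                best = 0.0
            else:
                best = min(
                    cost[i - 1][j] if i > 0 else INF,
                    cost[i][j - 1] if j > 0 else INF,
                    cost[i - 1][j - 1] if i > 0 and j > 0 else INF,
                )
            cost[i][j] = d + best
    return cost[n - 1][n - 1]

def lower_bound(Q, S, r):
    # Envelope bound (assumed form): compare each query frame with the
    # component-wise maximum of the S frames reachable within the band.
    n = len(Q)
    total = 0.0
    for i in range(n):
        window = S[max(0, i - r): min(n, i + r + 1)]
        upper = [max(column) for column in zip(*window)]
        total += inner_product_distance(Q[i], upper)
    return total

def best_match(Q, candidates, r):
    # Run full DTW only on candidates whose lower bound beats the best score so far.
    best_score, best_index, dtw_calls = INF, None, 0
    for index, S in enumerate(candidates):
        if lower_bound(Q, S, r) >= best_score:
            continue  # pruned without computing DTW
        dtw_calls += 1
        score = banded_dtw(Q, S, r)
        if score < best_score:
            best_score, best_index = score, index
    return best_index, best_score, dtw_calls
```

Because every warping path inside the band must align each query frame with some frame in its window, the bound never exceeds the banded DTW cost, so skipping a candidate whose bound is already no better than the current best cannot discard the best match.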

Regarding claim 12, the rejection of claim 11 is incorporated. Zhang discloses all of the elements of the current invention as stated above. However, Zhang fails to expressly recite wherein each path is a fixed-length portion of an audio segment.
The relevance of Zhang2 is described above with relation to claim 11. Regarding claim 12, Zhang2 teaches wherein each path is a fixed-length portion of an audio segment (“the warp will keep local distances within r frames of each other along the entire alignment {fixed length portion of an audio segment}”; Zhang2, pg. 5661, Col. 1, para. 4).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the unsupervised spoken word detection framework of Zhang to incorporate the teachings of Zhang2 to include wherein each path is a fixed-length portion of an audio segment. The use of the lower-bound estimate “prunes away undesirable segments in each utterance,” as recognized by Zhang2 (Zhang2, pg. 5663, col. 1, para. 1).

Regarding claim 13, the rejection of claim 11 is incorporated. Zhang discloses all of the elements of the current invention as stated above. However, Zhang fails to expressly recite further comprising: clustering, by the computer, one or more features of each path of each segment in a similar acoustic region according to a modeling algorithm, thereby generating one or more models for each path; and extracting, by the computer, posterior probabilities for each of the one or more features of extracted from the audio paths according to the one or more models, wherein the similarity score for each respective path is calculated using a model selected for the respective path based on the posterior probability of the respective path.
The relevance of Zhang2 is described above with relation to claim 11. Regarding claim 13, Zhang2 teaches further comprising: clustering, by the computer, one or more features of each path of each segment in a similar acoustic region according to a modeling algorithm, thereby generating one or more models for each path (“The Gaussian posteriorgram is a feature representation of speech frames” thus clustering of said features from each path of each segment “generated from a GMM {in a similar acoustic region according to a modeling algorithm}”; Zhang2, pg. 5661, Col. 1, paras. 2); and extracting, by the computer, posterior probabilities for each of the one or more features of extracted from the audio paths according to the one or more models, (“In our work, a D-mixture {each of the one or more features}, unsupervised GMM, G, is trained from a set of unlabeled speech frames, x1,...,xN. A posterior probability, pji = P(gj |xi), can then be calculated for any speech frame, xi, for each Gaussian component….A speech frame, xi, can then be represented by a D-dimensional posterior probability feature vector.”; Zhang2, pg. 5661, Col. 1, paras. 2-3) wherein the similarity score for each respective path is calculated using a model selected for the respective path based on the posterior probability of the respective path. (“a sliding window {model} with the size equal to the length of the keyword was applied to the test utterance {selected for the respective path based on the posterior probability of the respective path} to constrain the DTW search region” and “a series of DTW matches was performed to locate the best matching segment containing the keyword query” {to calculate the similarity score for each respective path}; Zhang2, pg. 5662, Col. 2, paras. 8).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the unsupervised spoken word detection framework of Zhang to incorporate the teachings of Zhang2 to include further comprising: clustering, by the computer, one or more features of each path of each segment in a similar acoustic region according to a modeling algorithm, thereby generating one or more models for each path; and extracting, by the computer, posterior probabilities for each of the one or more features of extracted from the audio paths according to the one or more models, wherein the similarity score for each respective path is calculated using a model selected for the respective path based on the posterior probability of the respective path. The use of the lower-bound estimate “prunes away undesirable segments in each utterance,” as recognized by Zhang2 (Zhang2, pg. 5663, col. 1, para. 1).
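For illustration only, the posterior probability feature vector quoted from Zhang2 above (p_ji = P(g_j | x_i) for each Gaussian component) can be sketched as follows. This is a minimal sketch assuming a diagonal-covariance GMM whose weights, means, and variances are already trained; the parameter layout and function names are hypothetical and are not drawn from the references or the application.

```python
import math

def gaussian_log_pdf(x, mean, var):
    # Log density of a diagonal-covariance Gaussian evaluated at frame x.
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def posterior_vector(x, weights, means, variances):
    # P(g_j | x) for every component j, normalized in the log domain for stability.
    log_joint = [
        math.log(w) + gaussian_log_pdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(log_joint)
    unnormalized = [math.exp(value - peak) for value in log_joint]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def posteriorgram(frames, weights, means, variances):
    # One D-dimensional posterior vector per speech frame.
    return [posterior_vector(x, weights, means, variances) for x in frames]
```

Each row of the result sums to one, so a frame lying close to the mean of one component yields a vector dominated by that component.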

Regarding claim 15, the rejection of claim 11 is incorporated. Zhang discloses all of the elements of the current invention as stated above. However, Zhang fails to expressly recite wherein comparing the similarity scores further comprises: selecting, by the computer, from the second set of segments a first test segment at a first time index and defined by a first time window; comparing, by the computer, the similarity scores for the paths of the first test segment against the similarity scores for the paths of at least one query segment of the first set of segments, according to the first time window and the first time index; selecting, by the computer, from the 
The relevance of Zhang2 is described above with relation to claim 11. Regarding claim 15, Zhang2 teaches wherein comparing the similarity scores further comprises: selecting, by the computer, from the second set of segments a first test segment at a first time index and defined by a first time window (“a sliding window with the size equal to the length of the keyword was applied to the test utterance to constrain the DTW search region.” “The sliding window gradually moved (one frame forward at a time) from the beginning frame of the test utterance {a first test segment at a first time index and defined by a first time window}.”; Zhang2, pg. 5662, Col. 2, para. 8); comparing, by the computer, the similarity scores for the paths of the first test segment against the similarity scores for the paths of at least one query segment of the first set of segments, according to the first time window and the first time index (“to compare the spoken keyword query {the query segment of the first set of segments} with a test utterance {the second set of segments including the first test segment}” in light of the sliding window {the first time window} and the frame of the test utterance {the first test segment at a first time index}; Zhang2, pg. 5662, Col. 2, para. 8); selecting, by the computer, from the second set of segments a second test segment at a second time index and defined by a second time window (“a sliding window with the size equal to the length of the keyword... gradually moved (one frame forward at a time) from the beginning frame of the test utterance to the end frame” where any frame after the beginning frame can be the second test frame of the second set of segments. The second test segment is at a second time frame {at a second time index} and defined by the sliding window {the second time window}; Zhang2, pg. 5662, Col. 2, para. 8); and comparing, by the computer, the similarity scores for the paths of the second test segment against the similarity scores for the paths of the at least one query segment, according to the second time window and the second time index (“to compare the spoken keyword query {the query segment of the first set of segments} with a test utterance {the second set of segments including the second test segment}” in light of the sliding window {the second time window} and the frame of the test utterance {the second test segment at a second time index}; Zhang2, pg. 5662, Col. 2, para. 8).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the unsupervised spoken word detection framework of Zhang to incorporate the teachings of Zhang2 to include wherein comparing the similarity scores further comprises: selecting, by the computer, from the second set of segments a first test segment at a first time index and defined by a first time window; comparing, by the computer, the similarity scores for the paths of the first test segment against the similarity scores for the paths of at least one query segment of the first set of segments, according to the first time window and the first time index; selecting, by the computer, from the second set of segments a second test segment at a second time index and defined by a second time window; and comparing, by the computer, the similarity scores for the paths of the second test segment against the similarity scores for the paths of the at least one query segment, according to the second time window and the second time index. The use of the lower-bound estimate “prunes away undesirable segments in each utterance,” as recognized by Zhang2 (Zhang2, pg. 5663, col. 1, para. 1).
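For illustration only, the sliding-window comparison described in the cited passage (a window the length of the keyword, advanced one frame at a time across the test utterance) can be sketched as follows. This is a minimal sketch, not the method of Zhang2 or of the application: the scoring function is pluggable, and the frame-wise absolute difference used here is a hypothetical placeholder standing in for a DTW score.

```python
def framewise_distance(query, segment):
    # Placeholder score (assumption): sum of absolute differences, frame by frame.
    return sum(
        abs(a - b)
        for frame_q, frame_s in zip(query, segment)
        for a, b in zip(frame_q, frame_s)
    )

def sliding_window_search(query, utterance, score_fn=framewise_distance):
    # Move a window of len(query) frames one frame at a time over the utterance
    # and return (start_index, score) of the lowest-scoring window, or None
    # when the utterance is shorter than the query.
    width = len(query)
    best = None
    for start in range(len(utterance) - width + 1):
        segment = utterance[start:start + width]
        score = score_fn(query, segment)
        if best is None or score < best[1]:
            best = (start, score)
    return best
```

Each value of `start` plays the role of a time index and the fixed `width` plays the role of the time window, so consecutive iterations correspond to a first test segment, a second test segment, and so on.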

Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Zhang and Zhang2 as applied to claim 11 above, and further in view of Shen (U.S. Pat. App. Pub. No 2019/0205786, hereinafter Shen).

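Regarding claim 16, the rejection of claim 11 is incorporated. Zhang and Zhang2 disclose all of the elements of the current invention as stated above. However, Zhang and Zhang2 fail to expressly recite wherein identifying a similar acoustic region further comprises: identifying, by the computer, a first-level match between a query segment of the first set of audio segments and a test segment of the second set of audio segments, based on determining that a minimum distance value between the similarity scores for the paths of the query segment and the similarity scores for the paths of the test segment satisfies a first-level threshold.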
Shen teaches systems and methods of “real-time classification of time-series data.” (Shen, ¶ [0003]). Regarding claim 16, Shen teaches wherein identifying a similar acoustic region further comprises: identifying, by the computer, a first-level match between a query segment of the first set of audio segments and a test segment of the second set of audio segments (the system includes an “envelope based lower bound determination 540 using a training instance and a query instance” where “one goal of this processing is to determine a computational efficient lower bound to prune the irrelevant time-series instances”; Shen, ¶¶ [0051]), based on determining that a minimum distance value between the similarity scores for the paths of the query segment and the similarity scores for the paths of the test segment satisfies a first-level threshold (“A time-series instance is pruned if its DTW distance from each labeled time-series instance in the dictionary 315 (FIG. 3) is larger than the pruning threshold 313.”; Shen, ¶¶ [0051]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the unsupervised spoken word detection framework of Zhang as modified by the lower bound estimates for dynamic time warping of Zhang2 to incorporate the teachings of Shen to include wherein identifying a similar acoustic region further comprises: identifying, by the computer, a first-level match between a query segment of the first set of audio segments and a test segment of the second set of audio segments, based on determining that a minimum distance value between the similarity scores for the paths of the query segment and the similarity scores for the paths of the test segment satisfies a first-level threshold.  “the real-time process designs an efficient lower bound computation for Shen. (Shen, ¶ [0025]).
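For illustration only, the threshold test recited in claim 16 can be sketched as follows. This is a minimal sketch under stated assumptions: similarity scores are plain numbers, the distance between two scores is their absolute difference, and "satisfies" is read as "is less than or equal to". None of these choices comes from Shen or from the application, and the function names are hypothetical.

```python
def min_score_distance(query_scores, test_scores):
    # Smallest absolute difference between any query-path score and any test-path score.
    return min(abs(q - t) for q in query_scores for t in test_scores)

def is_first_level_match(query_scores, test_scores, threshold):
    # Assumed reading: a match exists when the minimum distance is within the threshold.
    if not query_scores or not test_scores:
        return False
    return min_score_distance(query_scores, test_scores) <= threshold
```

Under the opposite reading of "satisfies" (for example, a similarity that must exceed a floor), only the comparison operator in the last line would change.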


Allowable Subject Matter
Claims 1-10 are objected to as described above in the claim objections, but would be allowable if rewritten to overcome the cited objections.
Claims 14 and 17-20 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter:  
Regarding claim 1, the closest prior art of record Tyagi (U.S. Pat. App. Pub. No. 2017/0301341, hereinafter Tyagi) teaches extracting, by the computer, posterior probabilities for each of the one or more features of extracted from the audio frames according to the one or more models (“For a frame of the one or more frames, the keyword recognizer unit 210 may utilize the first model to determine the first likelihood {posterior probability} for the frame [and] the first likelihood corresponds to a probability that the one or more features {for each of the one or more features} associated with the frame are generated based on a state in the first model {according to the one or more models}.”; Tyagi, ¶¶ [0044]); receiving, by the computer, from a client computer a keyword indicator for a keyword to query in the audio signals (“the application server 106 may receive an input pertaining to one or more keywords that are to be recognized in the speech signal.”; Tyagi, ¶¶ [0034]), the keyword comprising one or more words (“A ‘keyword’ refers to a word in the speech signal that may be of importance to a user.”; Tyagi, ¶¶ [0018]); …calculating, by the computer, for each audio frame containing the keyword, a first similarity score and a second similarity score (“For each of the one or more frames, the application server 106 may determine a first likelihood, [and] a second likelihood… Tyagi, ¶¶ [0036]), the first similarity score and the second similarity score of an audio frame calculated using a model selected for the respective frame based on the posterior probability of the audio frame (“the first likelihood is determined for each of the one or more keywords using respective first models” and “the second likelihood is determined for each of the one or more keywords using respective second models”; Tyagi, ¶¶ [0058], [0060]); storing, by the computer, into a queue, a subset of audio frames having a second similarity score comparatively higher than a corresponding first similarity score, the subset containing a review-threshold amount of audio frames (“After the determination of the first likelihood and the second likelihood, the keyword recognition unit 210 may be configured to determine maxima among the first likelihood and the second likelihood. Further, the keyword recognition unit 210 may be configured to determine minima among the first likelihood and the second likelihood.”; Tyagi, ¶¶ [0061]); and generating, by the computer, a list of audio segments of the audio signals matching the keyword, the list of audio segments containing at least one of the audio frames in the subset (“After, the determination of the first score and the second score for each of the one or more frames, the keyword recognition unit 210 performs a back trace operation in the first model for each of the one or more keywords... to determine whether the keywords are present in the speech signal.”; Tyagi, ¶¶ [0084]). However, Tyagi does not specifically teach generating, by a computer, a plurality of audio frames from a plurality of audio signals; clustering, by the computer, one or more features of each audio frame according to a modeling algorithm, thereby generating one or more models for each frame; [and] receiving, by the computer, from the client computer a named entity indicator for a named entity to be redacted from the query, wherein the computer nullifies the posterior probability of each frame containing the named entity.
Sun (U.S. Pat. No. 9,600,231, hereinafter Sun) does teach a computer-implemented method comprising (“The device or devices performing the ASR process 250 may include an acoustic front end (AFE) 256”; Sun, Col. 5, lines 17-19): generating, by a computer, a plurality of audio frames from a plurality of audio signals (“The AFE may reduce noise in the audio data and divide the digitized audio data into frames”; Sun, Col. 5, lines 25-31); clustering, by the computer, one or more features of each audio frame according to a modeling algorithm (for each of the frames “the AFE determines a number of values, called features, representing the qualities of the audio data, along with a set of those values, called a feature vector, representing the features/qualities of the audio data within the frame.”; Sun, Col. 5, lines 25-31), thereby generating one or more models for each frame (the system may “train an SVM classifier model on the subset of feature dimensions”; Sun, Col. 3, lines 31-35).
However, none of the prior art references of record, either alone or in combination, teaches, suggests, or makes obvious the combination of limitations as recited in the independent claims.
More specifically, the limitation of “receiving, by the computer, from the client computer a named entity indicator for a named entity to be redacted from the query, wherein the computer nullifies the posterior probability of each frame containing the named entity” is not taught by the prior art of record.

Regarding claims 2-10, dependent claims 2-10 are allowable at least in light of their dependency from an allowable base claim.

Regarding claim 14, the elements of claims 11 and 13 are taught by Zhang and Zhang2, as described above. 
However, none of the prior art references of record, either alone or in combination, teaches, suggests, or makes obvious the combination of limitations as recited in the dependent claim.
More specifically, the limitation of “further comprising receiving, by the computer, from a client computer a named entity indicator for a named entity to be redacted from the query, wherein 

Regarding claim 17, the elements of claims 11 and 16 are taught by Zhang, Zhang2, and Shen, as described above. 
However, none of the prior art references of record, either alone or in combination, teaches, suggests, or makes obvious the combination of limitations as recited in the dependent claim.
More specifically, the limitation of “further comprising identifying, by the computer, a second-level match between the query segment of the first set of audio segments and the test segment of the second set of audio segments, based on determining that a number of first-level matches satisfies a second-level threshold” is not taught by the prior art of record.

Regarding claims 18 and 19, dependent claims 18 and 19 are allowable at least in light of their dependency from allowable claim 17.

Regarding claim 20, the elements of claims 11 and 16 are taught by Zhang, Zhang2, and Shen, as described above. 
However, none of the prior art references of record, either alone or in combination, teaches, suggests, or makes obvious the combination of limitations as recited in the dependent claim.
More specifically, the limitation of “further comprising: determining, by the computer, that a number of first-level matches between the query segment and the test segment fails to satisfy a second-level threshold; and selecting, by the computer, a next test segment of the set of second audio segments to compare against the query segment” is not taught by the prior art of record.
As allowable subject matter has been indicated, applicant's reply must either comply with all formal requirements or specifically traverse each requirement not complied with.  See 37 CFR 1.111(b) and MPEP § 707.07(a).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Coorman et al. (U.S. Pat. App. Pub. No. 2005/0182629) disclose determination of similarity between two segments for a speech synthesis system.
Garland et al. (U.S. Pat. App. Pub. No. 2012/0059656) disclose systems and methods for quantifying similarity between spoken content of two segments of audio.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Sean E. Serraguard whose telephone number is (313) 446-6627. The examiner can normally be reached 07:00-17:00 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel C. Washburn can be reached on (571) 272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about 





/Sean E Serraguard/Patent Examiner, Art Unit 2657

/DANIEL C WASHBURN/Supervisory Patent Examiner, Art Unit 2657