DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
The Amendment filed August 31, 2022 has been entered.  Claims 1 – 6, 8 – 15 and 17 – 19 are pending in the application.  Applicant’s amendments to the Specification have overcome each and every objection previously set forth in the Non-Final Office Action mailed June 22, 2022.
Response to Arguments
Applicant’s arguments with respect to claims 1 – 6, 8 – 15 and 17 – 19 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1 – 3, 5 – 6, 8 – 12, 14 – 15 and 17 – 19 are rejected under 35 U.S.C. 103 as being unpatentable over Thomson et al. (US Patent No. 11,210,461), hereinafter Thomson, in view of Gkoulalas-Divanis et al. (US Patent No. 11,217,223), hereinafter Gkoulalas-Divanis.
Regarding claim 1, Thomson discloses a method comprising:
obtaining, by a computing system, first audio data representing one or more initial utterances of a user during an interactive voice session with an interactive voice system (Column 5, lines 12-18, "A caller device 205 can be a telephone, a personal or mobile computing device, such as a smartphone, tablet or notebook computer, a desktop computer, or another device by which the user can communicate with an agent. In some embodiments, a caller device 205 may be any device that supports communication over a network using e-mail, interactive text chats, or voice over Internet protocol (VOIP)."; Column 6, lines 32-35, "The masking system 100 receives caller audio (or another form of communication medium) from the caller device 205 at the ingress media gateway 105, for example, via network 220.");
generating, by the computing system, based on the first audio data, a prediction regarding whether a subsequent utterance of the user during the interactive voice session will contain sensitive information, the subsequent utterance of the user following the one or more initial utterances in time (Abstract, lines 1-3, "A masking system prevents a human agent from receiving sensitive personal information (SPI) provided by a caller during caller-agent communication."; Column 6, lines 45-54, "The ingress media gateway 105 sends caller audio to the real-time redactor 110. Depending on the embodiment, the real-time redactor 110 may apply ASR, NLP, or predictive modeling techniques to determine a likelihood related to whether the caller audio contains SPI. In some embodiments, the real-time redactor 110 determines a likelihood that future caller audio received by the ingress media gateway 105 will include SPI, for example, using NLP techniques to identify prompting phrases such as “My social security number is—.” ");
obtaining, by the computing system, second audio data representing the subsequent utterance of the user (Column 6, lines 35-42, "In one embodiment, the ingress media gateway 105 receives an ongoing stream of caller audio. For example, in the case of a phone call, the ingress media gateway 105 receives a stream of caller audio data while simultaneously sending audio data to the real-time redactor 110, sending masked caller audio data to the egress media gateway 115, and receiving and forwarding agent audio on to the caller device 205.");
determining, by the computing system, based on the prediction, whether to transmit the second audio data (Column 5, lines 22-29, "Once the masking system 100 determines that the caller is providing SPI, the caller audio is masked, i.e., the masking system 100 does not provide the portion of the caller audio stream containing the SPI to the agent (or provides only a subset thereof). Once the masking system 100 determines that the caller is no longer providing SPI, the masking system 100 recommences providing the caller's audio stream to the agent.");
and based on a determination not to transmit the second audio data: determining, by the computing system, based on the first audio data and from a plurality of classes of sensitive information, a class of the sensitive information predicted to be contained in the subsequent utterance of the user (Column 4, lines 61-66, "Examples of reports that relate to user experience include estimates of redaction accuracy, a number of times customer utterances are unrecognizable, a number or SPI events detected, a categorization of the types of SPI events detected, and an average number of words or digits that are redacted."; Column 13, line 65 - Column 14, line 2, “Additionally, the model accuracy may be affected by the number of SPI elements that the masking system 100 is trained to look for and what type of information the SPI contains. Defining a longer list of SPI types may increase the risk of false triggering.”; Defining a list of SPI types and categorizing the types of SPI events detected read on determining a class of the sensitive information predicted to be contained in the subsequent utterance of the user.);
generating, by the computing system, third audio data (Column 8, lines 7-12, "In another embodiment, a comfort signal is played to alert the agent that the caller is speaking or is expected to speak. The comfort signal may also indicate that the caller is inputting DTMF, the caller is providing other data as input, the caller is still on the line, or to convey other indications of call status."),
and the third audio data is based on a voice of the user (Column 8, lines 12-24, “Examples of a comfort signal include a text-to-speech rendering of the caller's voice speaking random digits, a distorted version of the caller's voice, altered such that the speech is unintelligible, a voice signal responsive to the caller's voice, a signal such as the “wah-wah” voice heard in an old Charlie Brown cartoon when adults are speaking, a voice signal that is synthesized to have an average pitch, average range, or spectral characteristics similar to the caller's voice without intelligible content, a distorted version of the caller's voice, scrambled so that the audio segments are out of order, a sequence of random phonemes or digits, typing sounds, a series of tones or music, or white or pink noise.”);
replacing, by the computing system, the second audio data with the third audio data (Column 7, lines 63-66, "The ingress media gateway 105 or egress media gateway 115 may also replace portions of redacted caller audio with a shorter substitute such as “comfort signal” sounds, random DTMF tones, or the like."; Column 8, lines 12-15, "Examples of a comfort signal include a text-to-speech rendering of the caller's voice speaking random digits, a distorted version of the caller's voice, altered such that the speech is unintelligible");
and transmitting, by the computing system, the third audio data (Column 8, lines 5-9, "In one embodiment, the agent hears silence or any unredacted audio while the masking system 100 masks caller audio. In another embodiment, a comfort signal is played to alert the agent that the caller is speaking or is expected to speak.").
Thomson does not specifically disclose: wherein the third audio data represents a replacement utterance in the same class of sensitive information as the determined class.
Gkoulalas-Divanis teaches:
wherein the third audio data represents a replacement utterance in the same class of sensitive information as the determined class (Column 5, lines 25-33, “For expository purposes, the term “direct identifier” generally refers to a data attribute, a word, a token, or a value that can be used alone to identify an individual. A direct identifier can uniquely correspond to an individual, such that it reveals an identity of the corresponding individual when present in data. Examples of direct identifiers include, but are not limited to, person names, social security numbers, national IDs, credit card numbers, phone numbers, medical record numbers, IP addresses, account numbers, etc.”; Column 5, lines 51-56, “Embodiments of the invention provide a method and system for voice de-identification and content de-identification of voice recordings that protects personal identities of speakers delivering speeches recorded in the voice recordings as well as privacy-sensitive personal information included in textual content of the speeches.”; Column 12, lines 41-51, “In one embodiment, the masking and tagging unit 430 processes a direct identifier recognized in textual content by masking (i.e., replacing) the direct identifier in the textual content with a masked value (i.e., replacement value) that is based on a type of the direct identifier. For example, in one embodiment, if the direct identifier recognized in the textual content is a name, the masking and tagging unit 430 replaces the direct identifier in the textual content with a random name (e.g., extracted from a dictionary, extracted from a publicly available dataset such as a voters' registration list, etc.) or a pseudonym (e.g., “Patient1234”).”; The type of direct identifier reads on the class of sensitive information.).
Gkoulalas-Divanis teaches replacing identifying information in utterances with masking information based on the type of identifying information in order to conceal the personal identity of the speaker and provide data privacy (Column 1, lines 13-29, “One embodiment of the invention provides a method for speaker identity and content de-identification under data privacy guarantees. The method comprises receiving input indicative of at least one level of privacy protection the speaker identity and content re-identification is required to enforce, and extracting features corresponding to a first speaker from a first speech delivered by the first speaker and recorded in a first voice recording. The method further comprises recognizing and extracting textual content from the first speech, parsing the textual content to recognize privacy-sensitive personal information corresponding to a first individual, and generating de-identified textual content by performing utility-preserving content de-identification on the textual content to anonymize the privacy-sensitive personal information to an extent that satisfies the at least one level of privacy protection. The de-identified textual content conceals a personal identity of the first individual.”).
Thomson and Gkoulalas-Divanis are considered to be analogous to the claimed invention because they are in the same field of voice systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Thomson to incorporate the teachings of Gkoulalas-Divanis to replace identifying information in utterances with masking information based on the type of identifying information.  Doing so would allow for concealing the personal identity of the speaker and providing data privacy.
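For illustration only, the type-based replacement relied upon from Gkoulalas-Divanis (a masked value chosen according to the type of the direct identifier) may be sketched as follows. The sketch operates on text rather than audio, and the type names, value formats, sample names, and function names are hypothetical and are not taken from either reference.

```python
import random

# Hypothetical helper: a string of n random decimal digits.
def _random_digits(rng, n):
    return "".join(rng.choice("0123456789") for _ in range(n))

# Hypothetical generators keyed by identifier type (the "class" of
# sensitive information). The types and formats are assumptions.
REPLACEMENT_GENERATORS = {
    "ssn": lambda rng: "-".join(
        [_random_digits(rng, 3), _random_digits(rng, 2), _random_digits(rng, 4)]
    ),
    "phone": lambda rng: _random_digits(rng, 10),
    "name": lambda rng: rng.choice(["Alex Doe", "Sam Roe", "Pat Poe"]),
}

def replacement_for(identifier_type, rng=None):
    """Return a substitute value of the same type as the masked identifier."""
    if rng is None:
        rng = random.Random()
    if identifier_type not in REPLACEMENT_GENERATORS:
        raise ValueError(f"no generator for identifier type {identifier_type!r}")
    return REPLACEMENT_GENERATORS[identifier_type](rng)
```

In the proposed combination, a value produced this way would then be rendered as speech (for example, by the text-to-speech comfort signal of Thomson) before transmission; that rendering step is outside this sketch.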
Regarding claim 2, Thomson in view of Gkoulalas-Divanis discloses the method as claimed in claim 1.
Thomson further discloses:
wherein the method further comprises transmitting, by the computing system, the first audio data and the third audio data to the interactive voice system and not transmitting the second audio data to the interactive voice system (Column 5, lines 22-29, "Once the masking system 100 determines that the caller is providing SPI, the caller audio is masked, i.e., the masking system 100 does not provide the portion of the caller audio stream containing the SPI to the agent (or provides only a subset thereof). Once the masking system 100 determines that the caller is no longer providing SPI, the masking system 100 recommences providing the caller's audio stream to the agent."; Column 8, lines 5-9, "In one embodiment, the agent hears silence or any unredacted audio while the masking system 100 masks caller audio. In another embodiment, a comfort signal is played to alert the agent that the caller is speaking or is expected to speak.").
Regarding claim 3, Thomson in view of Gkoulalas-Divanis discloses the method as claimed in claim 1.
Thomson further discloses:
obtaining the first audio data comprises obtaining, by the computing system, the first audio data from the interactive voice system (Column 5, lines 12-18, "A caller device 205 can be a telephone, a personal or mobile computing device, such as a smartphone, tablet or notebook computer, a desktop computer, or another device by which the user can communicate with an agent. In some embodiments, a caller device 205 may be any device that supports communication over a network using e-mail, interactive text chats, or voice over Internet protocol (VOIP)."; Column 6, lines 32-35, "The masking system 100 receives caller audio (or another form of communication medium) from the caller device 205 at the ingress media gateway 105, for example, via network 220.");
obtaining the second audio data comprises obtaining, by the computing system, the second audio data from the interactive voice system (Column 5, lines 12-18, "A caller device 205 can be a telephone, a personal or mobile computing device, such as a smartphone, tablet or notebook computer, a desktop computer, or another device by which the user can communicate with an agent. In some embodiments, a caller device 205 may be any device that supports communication over a network using e-mail, interactive text chats, or voice over Internet protocol (VOIP)."; Column 6, lines 35-42, "In one embodiment, the ingress media gateway 105 receives an ongoing stream of caller audio. For example, in the case of a phone call, the ingress media gateway 105 receives a stream of caller audio data while simultaneously sending audio data to the real-time redactor 110, sending masked caller audio data to the egress media gateway 115, and receiving and forwarding agent audio on to the caller device 205.");
and transmitting the third audio data comprises transmitting, by the computing system, the third audio data to a server system (Column 8, lines 5-9, "In one embodiment, the agent hears silence or any unredacted audio while the masking system 100 masks caller audio. In another embodiment, a comfort signal is played to alert the agent that the caller is speaking or is expected to speak."; Column 6, lines 11-16, "In one embodiment, the masking system 100 includes a telephony server. Using this architecture, a communications link can be implemented to provide an interface between the caller device and the telephony server. For example, a communications link may be a dial-up connection or a two-way wireless communication link.").
Regarding claim 5, Thomson in view of Gkoulalas-Divanis discloses the method as claimed in claim 1.
Thomson further discloses:
generating the prediction comprises determining, by the computing system, a confidence score that indicates a level of confidence that the subsequent utterance will contain the sensitive information (Column 7, lines 35-40, "The real-time redactor 110 generates confidence values related to caller audio and agent audio. The confidence values may represent a predicted likelihood that received or future caller audio contains SPI. The confidence values may be determined based on the outputs of the ASR and NLP modules.");
and determining whether to transmit the second audio data comprises determining, by the computing system, whether to transmit the second audio data based on a comparison of the confidence score and a threshold (Column 7, lines 40-45, "If a confidence value for a portion of the caller audio exceeds a predetermined threshold value, the real-time redactor 110 may send a redaction control signal 230 to the ingress media gateway, indicating what portions of the caller audio should be masked and how long the redaction should last.").
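For illustration only, the comparison of a confidence value against a predetermined threshold described in the cited passage of Thomson may be sketched as follows. The function name, the 0-to-1 score range, and the default threshold of 0.8 are assumptions made for the sketch and do not appear in the reference.

```python
def should_transmit(confidence, threshold=0.8):
    """Decide whether upcoming caller audio is passed through unmasked.

    Audio is masked (not transmitted) when the predicted likelihood of
    sensitive information exceeds the threshold. The score range and
    default threshold are illustrative assumptions.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return confidence <= threshold
```

Under these assumptions, a score of 0.9 results in masking, while a score equal to the threshold does not, because the cited passage refers to a value that "exceeds" the threshold.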
Regarding claim 6, Thomson in view of Gkoulalas-Divanis discloses the method as claimed in claim 1.
Thomson further discloses:
determining, by the computing system, an expected temporal duration of the subsequent utterance (Column 7, line 66 - Column 8, line 2, "In one embodiment, upon detection of SPI in the caller audio stream, the real-time redactor 110 may predict a length of the expected SPI");
and generating, by the computing system, the third audio data based on the expected temporal duration of the subsequent utterance (Column 9, lines 34-47, "As a first example, if a requirement exists that a predetermined number of digits are to be masked, then a process may count digits output from an automatic speech recognizer and end masking once this number of digits has been masked. For example, if the requirement is that at least four digits of a phone number shall be redacted, and the first four digits are played to an agent before masking begins, then the system may restore audio (i.e., end redaction) after the customer has spoken eight digits. In this example, the agent would hear the first four and last two digits of a 10-digit phone number. Similarly, if a requirement exists that a predetermined number of words or seconds are to be masked, then a process may count words or time in seconds and redact as in the previous example for digits.").
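For illustration only, the digit-counting example quoted from Thomson (four digits played, four digits masked, audio restored after the eighth digit) may be sketched as follows. The sketch works on a string of recognized digits rather than on audio, and the function name, parameter names, and mask character are hypothetical.

```python
def mask_digit_stream(digits, play_first=4, mask_count=4, mask_char="*"):
    """Mask a fixed count of digits after an initial unmasked run.

    Digits at positions play_first through play_first + mask_count - 1
    (zero-based) are replaced; all other digits pass through unchanged.
    """
    output = []
    for index, digit in enumerate(digits):
        masked = play_first <= index < play_first + mask_count
        output.append(mask_char if masked else digit)
    return "".join(output)

# A 10-digit number: the first four and last two digits remain audible.
print(mask_digit_stream("5551234567"))  # prints 5551****67
```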
Regarding claim 8, Thomson in view of Gkoulalas-Divanis discloses the method as claimed in claim 1.
Thomson further discloses:
generating the third audio data comprises generating, by the computing system, a spectrogram of the voice of the user; and the method further comprises generating, by the computing system, the third audio data based on the spectrogram of the voice of the user (Column 8, lines 7-24, "In another embodiment, a comfort signal is played to alert the agent that the caller is speaking or is expected to speak. The comfort signal may also indicate that the caller is inputting DTMF, the caller is providing other data as input, the caller is still on the line, or to convey other indications of call status. Examples of a comfort signal include a text-to-speech rendering of the caller's voice speaking random digits, a distorted version of the caller's voice, altered such that the speech is unintelligible, a voice signal responsive to the caller's voice, a signal such as the “wah-wah” voice heard in an old Charlie Brown cartoon when adults are speaking, a voice signal that is synthesized to have an average pitch, average range, or spectral characteristics similar to the caller's voice without intelligible content, a distorted version of the caller's voice, scrambled so that the audio segments are out of order, a sequence of random phonemes or digits, typing sounds, a series of tones or music, or white or pink noise.").
Regarding claim 9, Thomson in view of Gkoulalas-Divanis discloses the method as claimed in claim 1.
Thomson further discloses:
wherein obtaining the second audio data comprises obtaining, by the computing system, the second audio data after generating the prediction regarding whether the subsequent utterance of the user will contain the sensitive information (Column 3, lines 44-50, "The real-time redactor 110 detects or anticipates SPI in caller audio received from the ingress media gateway 105. In some embodiments, the real-time redactor 110 may additionally or alternatively receive agent audio from the ingress media gateway 105, and may use the agent audio to predict whether SPI is likely to be present in upcoming caller audio.").
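For illustration only, anticipating sensitive information in upcoming audio from a prompting phrase, as in the "My social security number is" example quoted from Thomson, may be sketched as follows. The sketch assumes a text transcript of the initial utterances is already available; the second phrase in the list and the function name are hypothetical.

```python
# Prompting phrases, lowercase. The first comes from the example quoted
# from Thomson; the second is a hypothetical addition.
PROMPTING_PHRASES = (
    "my social security number is",
    "my card number is",
)

def predicts_sensitive_followup(transcript):
    """Return True if the transcript contains a known prompting phrase."""
    # Normalize case and collapse runs of whitespace before matching.
    text = " ".join(transcript.lower().split())
    return any(phrase in text for phrase in PROMPTING_PHRASES)
```

Because the prediction is made from the initial utterances alone, it is available before the subsequent utterance is received, which is the ordering recited in claim 9.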
Regarding claim 10, Thomson discloses a computing system (Column 5, lines 12-18, "A caller device 205 can be a telephone, a personal or mobile computing device, such as a smartphone, tablet or notebook computer, a desktop computer, or another device by which the user can communicate with an agent. In some embodiments, a caller device 205 may be any device that supports communication over a network using e-mail, interactive text chats, or voice over Internet protocol (VOIP).") comprising:
obtaining, by a computing system, first audio data representing one or more initial utterances of a user during an interactive voice session with an interactive voice system (Column 6, lines 32-35, "The masking system 100 receives caller audio (or another form of communication medium) from the caller device 205 at the ingress media gateway 105, for example, via network 220.");
and processing circuitry configured to:
generate, based on the first audio data, a prediction regarding whether a subsequent utterance of the user during the interactive voice session will contain sensitive information, the subsequent utterance of the user following the one or more initial utterances in time (Abstract, lines 1-3, "A masking system prevents a human agent from receiving sensitive personal information (SPI) provided by a caller during caller-agent communication."; Column 6, lines 45-54, "The ingress media gateway 105 sends caller audio to the real-time redactor 110. Depending on the embodiment, the real-time redactor 110 may apply ASR, NLP, or predictive modeling techniques to determine a likelihood related to whether the caller audio contains SPI. In some embodiments, the real-time redactor 110 determines a likelihood that future caller audio received by the ingress media gateway 105 will include SPI, for example, using NLP techniques to identify prompting phrases such as “My social security number is—.” ");
obtain second audio data representing the subsequent utterance of the user (Column 6, lines 35-42, "In one embodiment, the ingress media gateway 105 receives an ongoing stream of caller audio. For example, in the case of a phone call, the ingress media gateway 105 receives a stream of caller audio data while simultaneously sending audio data to the real-time redactor 110, sending masked caller audio data to the egress media gateway 115, and receiving and forwarding agent audio on to the caller device 205.");
determine, based on the prediction, whether to transmit the second audio data (Column 5, lines 22-29, "Once the masking system 100 determines that the caller is providing SPI, the caller audio is masked, i.e., the masking system 100 does not provide the portion of the caller audio stream containing the SPI to the agent (or provides only a subset thereof). Once the masking system 100 determines that the caller is no longer providing SPI, the masking system 100 recommences providing the caller's audio stream to the agent.");
and based on a determination not to transmit the second audio data: determine, by the computing system, based on the first audio data and from a plurality of classes of sensitive information, a class of the sensitive information predicted to be contained in the subsequent utterance of the user (Column 4, lines 61-66, "Examples of reports that relate to user experience include estimates of redaction accuracy, a number of times customer utterances are unrecognizable, a number or SPI events detected, a categorization of the types of SPI events detected, and an average number of words or digits that are redacted."; Column 13, line 65 - Column 14, line 2, “Additionally, the model accuracy may be affected by the number of SPI elements that the masking system 100 is trained to look for and what type of information the SPI contains. Defining a longer list of SPI types may increase the risk of false triggering.”; Defining a list of SPI types and categorizing the types of SPI events detected read on determining a class of the sensitive information predicted to be contained in the subsequent utterance of the user.);
generate, by the computing system, third audio data (Column 8, lines 7-12, "In another embodiment, a comfort signal is played to alert the agent that the caller is speaking or is expected to speak. The comfort signal may also indicate that the caller is inputting DTMF, the caller is providing other data as input, the caller is still on the line, or to convey other indications of call status."),
and the third audio data is based on a voice of the user (Column 8, lines 12-24, “Examples of a comfort signal include a text-to-speech rendering of the caller's voice speaking random digits, a distorted version of the caller's voice, altered such that the speech is unintelligible, a voice signal responsive to the caller's voice, a signal such as the “wah-wah” voice heard in an old Charlie Brown cartoon when adults are speaking, a voice signal that is synthesized to have an average pitch, average range, or spectral characteristics similar to the caller's voice without intelligible content, a distorted version of the caller's voice, scrambled so that the audio segments are out of order, a sequence of random phonemes or digits, typing sounds, a series of tones or music, or white or pink noise.”);
replace the second audio data with the third audio data (Column 7, lines 63-66, "The ingress media gateway 105 or egress media gateway 115 may also replace portions of redacted caller audio with a shorter substitute such as “comfort signal” sounds, random DTMF tones, or the like."; Column 8, lines 12-15, "Examples of a comfort signal include a text-to-speech rendering of the caller's voice speaking random digits, a distorted version of the caller's voice, altered such that the speech is unintelligible");
and transmit the third audio data (Column 8, lines 5-9, "In one embodiment, the agent hears silence or any unredacted audio while the masking system 100 masks caller audio. In another embodiment, a comfort signal is played to alert the agent that the caller is speaking or is expected to speak.").
Thomson does not specifically disclose: wherein the third audio data represents a replacement utterance in the same class of sensitive information as the determined class.
Gkoulalas-Divanis teaches:
wherein the third audio data represents a replacement utterance in the same class of sensitive information as the determined class (Column 5, lines 25-33, “For expository purposes, the term “direct identifier” generally refers to a data attribute, a word, a token, or a value that can be used alone to identify an individual. A direct identifier can uniquely correspond to an individual, such that it reveals an identity of the corresponding individual when present in data. Examples of direct identifiers include, but are not limited to, person names, social security numbers, national IDs, credit card numbers, phone numbers, medical record numbers, IP addresses, account numbers, etc.”; Column 5, lines 51-56, “Embodiments of the invention provide a method and system for voice de-identification and content de-identification of voice recordings that protects personal identities of speakers delivering speeches recorded in the voice recordings as well as privacy-sensitive personal information included in textual content of the speeches.”; Column 12, lines 41-51, “In one embodiment, the masking and tagging unit 430 processes a direct identifier recognized in textual content by masking (i.e., replacing) the direct identifier in the textual content with a masked value (i.e., replacement value) that is based on a type of the direct identifier. For example, in one embodiment, if the direct identifier recognized in the textual content is a name, the masking and tagging unit 430 replaces the direct identifier in the textual content with a random name (e.g., extracted from a dictionary, extracted from a publicly available dataset such as a voters' registration list, etc.) or a pseudonym (e.g., “Patient1234”).”; The type of direct identifier reads on the class of sensitive information.).
Gkoulalas-Divanis teaches replacing identifying information in utterances with masking information based on the type of identifying information in order to conceal the personal identity of the speaker and provide data privacy (Column 1, lines 13-29, “One embodiment of the invention provides a method for speaker identity and content de-identification under data privacy guarantees. The method comprises receiving input indicative of at least one level of privacy protection the speaker identity and content re-identification is required to enforce, and extracting features corresponding to a first speaker from a first speech delivered by the first speaker and recorded in a first voice recording. The method further comprises recognizing and extracting textual content from the first speech, parsing the textual content to recognize privacy-sensitive personal information corresponding to a first individual, and generating de-identified textual content by performing utility-preserving content de-identification on the textual content to anonymize the privacy-sensitive personal information to an extent that satisfies the at least one level of privacy protection. The de-identified textual content conceals a personal identity of the first individual.”).
Thomson and Gkoulalas-Divanis are considered to be analogous to the claimed invention because they are in the same field of voice systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Thomson to incorporate the teachings of Gkoulalas-Divanis to replace identifying information in utterances with masking information based on the type of identifying information.  Doing so would allow for concealing the personal identity of the speaker and providing data privacy.
Regarding claim 11, Thomson in view of Gkoulalas-Divanis discloses the computing system as claimed in claim 10.
Thomson further discloses:
wherein the processing circuitry is further configured to transmit the first audio data and the third audio data to the interactive voice system and not transmit the second audio data to the interactive voice system (Column 5, lines 22-29, "Once the masking system 100 determines that the caller is providing SPI, the caller audio is masked, i.e., the masking system 100 does not provide the portion of the caller audio stream containing the SPI to the agent (or provides only a subset thereof). Once the masking system 100 determines that the caller is no longer providing SPI, the masking system 100 recommences providing the caller's audio stream to the agent."; Column 8, lines 5-9, "In one embodiment, the agent hears silence or any unredacted audio while the masking system 100 masks caller audio. In another embodiment, a comfort signal is played to alert the agent that the caller is speaking or is expected to speak.").
Regarding claim 12, Thomson in view of Gkoulalas-Divanis discloses the computing system as claimed in claim 10.
Thomson further discloses:
the processing circuitry is configured to obtain the first audio data from the interactive voice system (Column 5, lines 12-18, "A caller device 205 can be a telephone, a personal or mobile computing device, such as a smartphone, tablet or notebook computer, a desktop computer, or another device by which the user can communicate with an agent. In some embodiments, a caller device 205 may be any device that supports communication over a network using e-mail, interactive text chats, or voice over Internet protocol (VOIP)."; Column 6, lines 32-35, "The masking system 100 receives caller audio (or another form of communication medium) from the caller device 205 at the ingress media gateway 105, for example, via network 220.");
the processing circuitry is configured to obtain the second audio data from the interactive voice system (Column 5, lines 12-18, "A caller device 205 can be a telephone, a personal or mobile computing device, such as a smartphone, tablet or notebook computer, a desktop computer, or another device by which the user can communicate with an agent. In some embodiments, a caller device 205 may be any device that supports communication over a network using e-mail, interactive text chats, or voice over Internet protocol (VOIP)."; Column 6, lines 35-42, "In one embodiment, the ingress media gateway 105 receives an ongoing stream of caller audio. For example, in the case of a phone call, the ingress media gateway 105 receives a stream of caller audio data while simultaneously sending audio data to the real-time redactor 110, sending masked caller audio data to the egress media gateway 115, and receiving and forwarding agent audio on to the caller device 205.");
and the processing circuitry is configured to transmit the third audio data to a server system (Column 8, lines 5-9, "In one embodiment, the agent hears silence or any unredacted audio while the masking system 100 masks caller audio. In another embodiment, a comfort signal is played to alert the agent that the caller is speaking or is expected to speak."; Column 6, lines 11-16, "In one embodiment, the masking system 100 includes a telephony server. Using this architecture, a communications link can be implemented to provide an interface between the caller device and the telephony server. For example, a communications link may be a dial-up connection or a two-way wireless communication link.").
Regarding claim 14, Thomson in view of Gkoulalas-Divanis discloses the computing system as claimed in claim 10.
Thomson further discloses:
the processing circuitry is configured such that, as part of generating the prediction, the processing circuitry determines a confidence score that indicates a level of confidence that the subsequent utterance will contain the sensitive information (Column 7, lines 35-40, "The real-time redactor 110 generates confidence values related to caller audio and agent audio. The confidence values may represent a predicted likelihood that received or future caller audio contains SPI. The confidence values may be determined based on the outputs of the ASR and NLP modules.");
and the processing circuitry is configured such that, as part of determining whether to transmit the second audio data, the processing circuitry determines whether to transmit the second audio data based on a comparison of the confidence score and a threshold (Column 7, lines 40-45, "If a confidence value for a portion of the caller audio exceeds a predetermined threshold value, the real-time redactor 110 may send a redaction control signal 230 to the ingress media gateway, indicating what portions of the caller audio should be masked and how long the redaction should last.").
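For illustration only, the threshold comparison described in the cited passage of Thomson (Column 7, lines 40-45) can be sketched as follows. The names Segment, spi_confidence, should_mask and route, and the threshold value 0.8, are hypothetical and appear neither in Thomson nor in the claims. This is a minimal sketch of comparing a confidence score to a threshold, not the reference's implementation.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A portion of caller audio with a hypothetical SPI confidence score."""
    audio_id: str
    spi_confidence: float  # assumed to lie in [0, 1]


def should_mask(segment: Segment, threshold: float = 0.8) -> bool:
    # Mask only when the confidence value exceeds the threshold
    # (strictly greater than, mirroring "exceeds" in the cited passage).
    return segment.spi_confidence > threshold


def route(segments, threshold: float = 0.8):
    # Label each segment "mask" or "forward" based on the comparison.
    return [
        ("mask" if should_mask(s, threshold) else "forward", s.audio_id)
        for s in segments
    ]
```

Under these assumptions, a segment scored 0.95 would be masked and a segment scored 0.10 would be forwarded to the agent.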
Regarding claim 15, Thomson in view of Gkoulalas-Divanis discloses the computing system as claimed in claim 10.
Thomson further discloses:
determine an expected temporal duration of the subsequent utterance (Column 7, line 66 - Column 8, line 2, "In one embodiment, upon detection of SPI in the caller audio stream, the real-time redactor 110 may predict a length of the expected SPI");
and generate the third audio data based on the expected temporal duration of the subsequent utterance (Column 9, lines 34-47, "As a first example, if a requirement exists that a predetermined number of digits are to be masked, then a process may count digits output from an automatic speech recognizer and end masking once this number of digits has been masked. For example, if the requirement is that at least four digits of a phone number shall be redacted, and the first four digits are played to an agent before masking begins, then the system may restore audio (i.e., end redaction) after the customer has spoken eight digits. In this example, the agent would hear the first four and last two digits of a 10-digit phone number. Similarly, if a requirement exists that a predetermined number of words or seconds are to be masked, then a process may count words or time in seconds and redact as in the previous example for digits.").
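For illustration only, the digit-counting example quoted from Thomson (Column 9, lines 34-47) can be sketched as follows. The function name mask_digits, its parameters, and the "#" placeholder are hypothetical. The sketch operates on a string of recognized digits rather than on audio, which is a simplifying assumption.

```python
def mask_digits(digits: str, play_before: int = 4, mask_count: int = 4,
                placeholder: str = "#") -> str:
    """Pass the first `play_before` digits, mask the next `mask_count`,
    then restore the remainder (i.e., end redaction)."""
    out = []
    for i, d in enumerate(digits):
        if play_before <= i < play_before + mask_count:
            out.append(placeholder)  # digit falls inside the masked window
        else:
            out.append(d)            # digit is passed through unchanged
    return "".join(out)


# With a 10-digit number, the first four and last two digits remain.
masked = mask_digits("5551234567")
```

With the default parameters, masking ends after the eighth digit has been spoken, which matches the quoted example in which the agent hears the first four and last two digits of a 10-digit phone number.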
Regarding claim 17, Thomson in view of Gkoulalas-Divanis discloses the computing system as claimed in claim 10.
Thomson further discloses:
the processing circuitry is configured to generate a spectrogram of the voice of the user; and the processing circuitry is further configured to generate the third audio data based on the spectrogram of the voice of the user (Column 8, lines 7-24, "In another embodiment, a comfort signal is played to alert the agent that the caller is speaking or is expected to speak. The comfort signal may also indicate that the caller is inputting DTMF, the caller is providing other data as input, the caller is still on the line, or to convey other indications of call status. Examples of a comfort signal include a text-to-speech rendering of the caller's voice speaking random digits, a distorted version of the caller's voice, altered such that the speech is unintelligible, a voice signal responsive to the caller's voice, a signal such as the “wah-wah” voice heard in an old Charlie Brown cartoon when adults are speaking, a voice signal that is synthesized to have an average pitch, average range, or spectral characteristics similar to the caller's voice without intelligible content, a distorted version of the caller's voice, scrambled so that the audio segments are out of order, a sequence of random phonemes or digits, typing sounds, a series of tones or music, or white or pink noise.").
Regarding claim 18, Thomson in view of Gkoulalas-Divanis discloses the computing system as claimed in claim 10.
Thomson further discloses:
wherein the processing circuitry is configured such that, as part of obtaining the second audio data, the processing circuitry obtains the second audio data after generating the prediction regarding whether the subsequent utterance of the user will contain the sensitive information (Column 3, lines 44-50, "The real-time redactor 110 detects or anticipates SPI in caller audio received from the ingress media gateway 105. In some embodiments, the real-time redactor 110 may additionally or alternatively receive agent audio from the ingress media gateway 105, and may use the agent audio to predict whether SPI is likely to be present in upcoming caller audio.").
Regarding claim 19, Thomson discloses a computer-readable storage medium comprising instructions that, when executed, cause processing circuitry of a computing system to:
obtain first audio data representing one or more initial utterances of a user during an interactive voice session with an interactive voice system (Column 5, lines 12-18, "A caller device 205 can be a telephone, a personal or mobile computing device, such as a smartphone, tablet or notebook computer, a desktop computer, or another device by which the user can communicate with an agent. In some embodiments, a caller device 205 may be any device that supports communication over a network using e-mail, interactive text chats, or voice over Internet protocol (VOIP)."; Column 6, lines 32-35, "The masking system 100 receives caller audio (or another form of communication medium) from the caller device 205 at the ingress media gateway 105, for example, via network 220.");
generate, based on the first audio data, a prediction regarding whether a subsequent utterance of the user during the interactive voice session will contain sensitive information, the subsequent utterance of the user following the one or more initial utterances in time (Abstract, lines 1-3, "A masking system prevents a human agent from receiving sensitive personal information (SPI) provided by a caller during caller-agent communication."; Column 6, lines 45-54, "The ingress media gateway 105 sends caller audio to the real-time redactor 110. Depending on the embodiment, the real-time redactor 110 may apply ASR, NLP, or predictive modeling techniques to determine a likelihood related to whether the caller audio contains SPI. In some embodiments, the real-time redactor 110 determines a likelihood that future caller audio received by the ingress media gateway 105 will include SPI, for example, using NLP techniques to identify prompting phrases such as “My social security number is—.” ");
and obtain second audio data representing the subsequent utterance of the user (Column 6, lines 35-42, "In one embodiment, the ingress media gateway 105 receives an ongoing stream of caller audio. For example, in the case of a phone call, the ingress media gateway 105 receives a stream of caller audio data while simultaneously sending audio data to the real-time redactor 110, sending masked caller audio data to the egress media gateway 115, and receiving and forwarding agent audio on to the caller device 205.");
determine, based on the prediction, whether to transmit the second audio data (Column 5, lines 22-29, "Once the masking system 100 determines that the caller is providing SPI, the caller audio is masked, i.e., the masking system 100 does not provide the portion of the caller audio stream containing the SPI to the agent (or provides only a subset thereof). Once the masking system 100 determines that the caller is no longer providing SPI, the masking system 100 recommences providing the caller's audio stream to the agent.");
and based on a determination not to transmit the second audio data: determine, by the computing system, based on the first audio data and from a plurality of classes of sensitive information, a class of the sensitive information predicted to be contained in the subsequent utterance of the user (Column 4, lines 61-66, "Examples of reports that relate to user experience include estimates of redaction accuracy, a number of times customer utterances are unrecognizable, a number or SPI events detected, a categorization of the types of SPI events detected, and an average number of words or digits that are redacted."; Column 13, line 65 - Column 14, line 2, “Additionally, the model accuracy may be affected by the number of SPI elements that the masking system 100 is trained to look for and what type of information the SPI contains. Defining a longer list of SPI types may increase the risk of false triggering.”; Defining a list of SPI types and categorizing the types of SPI events detected read on determining a class of the sensitive information predicted to be contained in the subsequent utterance of the user.);
generate, by the computing system, third audio data (Column 8, lines 7-12, "In another embodiment, a comfort signal is played to alert the agent that the caller is speaking or is expected to speak. The comfort signal may also indicate that the caller is inputting DTMF, the caller is providing other data as input, the caller is still on the line, or to convey other indications of call status."),
and the third audio is based on a voice of the user (Column 8, lines 12-24, “Examples of a comfort signal include a text-to-speech rendering of the caller's voice speaking random digits, a distorted version of the caller's voice, altered such that the speech is unintelligible, a voice signal responsive to the caller's voice, a signal such as the “wah-wah” voice heard in an old Charlie Brown cartoon when adults are speaking, a voice signal that is synthesized to have an average pitch, average range, or spectral characteristics similar to the caller's voice without intelligible content, a distorted version of the caller's voice, scrambled so that the audio segments are out of order, a sequence of random phonemes or digits, typing sounds, a series of tones or music, or white or pink noise.”);
replace the second audio data with third audio data (Column 7, lines 63-66, "The ingress media gateway 105 or egress media gateway 115 may also replace portions of redacted caller audio with a shorter substitute such as “comfort signal” sounds, random DTMF tones, or the like."; Column 8, lines 12-15, "Examples of a comfort signal include a text-to-speech rendering of the caller's voice speaking random digits, a distorted version of the caller's voice, altered such that the speech is unintelligible");
and transmit the third audio data (Column 8, lines 5-9, "In one embodiment, the agent hears silence or any unredacted audio while the masking system 100 masks caller audio. In another embodiment, a comfort signal is played to alert the agent that the caller is speaking or is expected to speak.").
Thomson does not specifically disclose: wherein the third audio data represents a replacement utterance in the same class of sensitive information as the determined class.
Gkoulalas-Divanis teaches:
wherein the third audio data represents a replacement utterance in the same class of sensitive information as the determined class (Column 5, lines 25-33, “For expository purposes, the term “direct identifier” generally refers to a data attribute, a word, a token, or a value that can be used alone to identify an individual. A direct identifier can uniquely correspond to an individual, such that it reveals an identity of the corresponding individual when present in data. Examples of direct identifiers include, but are not limited to, person names, social security numbers, national IDs, credit card numbers, phone numbers, medical record numbers, IP addresses, account numbers, etc.”; Column 5, lines 51-56, “Embodiments of the invention provide a method and system for voice de-identification and content de-identification of voice recordings that protects personal identities of speakers delivering speeches recorded in the voice recordings as well as privacy-sensitive personal information included in textual content of the speeches.”; Column 12, lines 41-51, “In one embodiment, the masking and tagging unit 430 processes a direct identifier recognized in textual content by masking (i.e., replacing) the direct identifier in the textual content with a masked value (i.e., replacement value) that is based on a type of the direct identifier. For example, in one embodiment, if the direct identifier recognized in the textual content is a name, the masking and tagging unit 430 replaces the direct identifier in the textual content with a random name (e.g., extracted from a dictionary, extracted from a publicly available dataset such as a voters' registration list, etc.) or a pseudonym (e.g., “Patient1234”).”; The type of direct identifier reads on the class of sensitive information.).
Gkoulalas-Divanis teaches replacing identifying information in utterances with masking information based on the type of identifying information in order to conceal the personal identity of the speaker and provide data privacy (Column 1, lines 13-29, “One embodiment of the invention provides a method for speaker identity and content de-identification under data privacy guarantees. The method comprises receiving input indicative of at least one level of privacy protection the speaker identity and content re-identification is required to enforce, and extracting features corresponding to a first speaker from a first speech delivered by the first speaker and recorded in a first voice recording. The method further comprises recognizing and extracting textual content from the first speech, parsing the textual content to recognize privacy-sensitive personal information corresponding to a first individual, and generating de-identified textual content by performing utility-preserving content de-identification on the textual content to anonymize the privacy-sensitive personal information to an extent that satisfies the at least one level of privacy protection. The de-identified textual content conceals a personal identity of the first individual.”).
Thomson and Gkoulalas-Divanis are considered to be analogous to the claimed invention because they are in the same field of voice systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Thomson to incorporate the teachings of Gkoulalas-Divanis to replace identifying information in utterances with masking information based on the type of identifying information.  Doing so would allow for concealing the personal identity of the speaker and providing data privacy.
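For illustration only, the type-based replacement quoted from Gkoulalas-Divanis (Column 12, lines 41-51) can be sketched as follows. The replacement pools, the function name replace_identifier, and the "[REDACTED]" fallback for unknown types are hypothetical and are not taken from either reference. The sketch works on text values and does not address audio synthesis.

```python
import random

# Hypothetical replacement pools keyed by identifier type (class).
REPLACEMENT_POOLS = {
    "name": ["Alex Doe", "Sam Roe"],
    "phone": ["555-0100", "555-0199"],
}


def replace_identifier(value: str, id_type: str, rng: random.Random) -> str:
    """Return a replacement value of the same type as the identifier."""
    pool = REPLACEMENT_POOLS.get(id_type)
    if pool is None:
        # Assumption: unknown types fall back to a generic marker.
        return "[REDACTED]"
    # Prefer a candidate that differs from the original value.
    candidates = [c for c in pool if c != value] or pool
    return rng.choice(candidates)
```

Under these assumptions, a recognized name is replaced with another name and a recognized phone number with another phone number, so the replacement stays in the same class as the original identifier.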
Claims 4 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Thomson in view of Gkoulalas-Divanis, and further in view of Baker et al. (US Patent Application Publication No. 2020/0196141), hereinafter Baker.
Regarding claim 4, Thomson in view of Gkoulalas-Divanis discloses the method as claimed in claim 1, but does not explicitly disclose: wherein the interactive voice system is a voice assistant system.
Baker teaches:
wherein the interactive voice system is a voice assistant system (Paragraph 0002, lines 1-4, "Voice integration devices, for example, voice assistants such as Amazon Echo or Google Home devices may allow a user to vocally interact with a connected microphone/speaker device."; Paragraph 0005, lines 5-8, "The privacy mode may include mechanically muting or covering up the microphone of the audio device, providing a physical disconnect, or adding interference to obfuscate the audio signal."; Paragraph 0128, lines 1-10, "The voice command to trigger privacy mode may be a command setup by a user. Or, the voice command may be based on a specific keyword. For example, the privacy mode may be automatically engaged in response to the detection of specific keywords such as “bank”, “account”, or “pin” are received at an audio device. Privacy mode may also be engaged when numerical digits are read out loud. In this way, the privacy of verbally spoken credit card, bank account, social security, and/or phone numbers may be maintained, and not transmitted by the audio device.").
Baker teaches masking audio data in a voice assistant system in order to provide a privacy mode that is not susceptible to malicious software (Paragraph 0022, lines 1-7, "This application is directed towards a high-confidence tamper-proof privacy mode for audio devices. The privacy mode may be tamper-proof in that the privacy mode is not able to be compromised by malicious software, for example, by providing a visual indication tied to the hardware that may allow a user to confidently determine whether privacy mode has truly been enabled.").
Thomson, Gkoulalas-Divanis, and Baker are considered to be analogous to the claimed invention because they are in the same field of voice systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Thomson in view of Gkoulalas-Divanis to incorporate the teachings of Baker to mask audio data in a voice assistant system.  Doing so would allow for providing a privacy mode that is not susceptible to malicious software.
Regarding claim 13, Thomson in view of Gkoulalas-Divanis discloses the computing system as claimed in claim 10, but does not explicitly disclose: wherein the interactive voice system is a voice assistant system.
Baker teaches:
wherein the interactive voice system is a voice assistant system (Paragraph 0002, lines 1-4, "Voice integration devices, for example, voice assistants such as Amazon Echo or Google Home devices may allow a user to vocally interact with a connected microphone/speaker device."; Paragraph 0005, lines 5-8, "The privacy mode may include mechanically muting or covering up the microphone of the audio device, providing a physical disconnect, or adding interference to obfuscate the audio signal."; Paragraph 0128, lines 1-10, "The voice command to trigger privacy mode may be a command setup by a user. Or, the voice command may be based on a specific keyword. For example, the privacy mode may be automatically engaged in response to the detection of specific keywords such as “bank”, “account”, or “pin” are received at an audio device. Privacy mode may also be engaged when numerical digits are read out loud. In this way, the privacy of verbally spoken credit card, bank account, social security, and/or phone numbers may be maintained, and not transmitted by the audio device.").
Baker teaches masking audio data in a voice assistant system in order to provide a privacy mode that is not susceptible to malicious software (Paragraph 0022, lines 1-7, "This application is directed towards a high-confidence tamper-proof privacy mode for audio devices. The privacy mode may be tamper-proof in that the privacy mode is not able to be compromised by malicious software, for example, by providing a visual indication tied to the hardware that may allow a user to confidently determine whether privacy mode has truly been enabled.").
Thomson, Gkoulalas-Divanis, and Baker are considered to be analogous to the claimed invention because they are in the same field of voice systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Thomson in view of Gkoulalas-Divanis to incorporate the teachings of Baker to mask audio data in a voice assistant system.  Doing so would allow for providing a privacy mode that is not susceptible to malicious software.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
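For illustration only, the timing rule stated in the preceding paragraph can be sketched as follows. The function names and the sample dates used with them are hypothetical, since the mailing date of this action is not stated here. The sketch assumes that a period of months ends on the same day of the later month, or on that month's last day when no such day exists, and it ignores weekend and holiday adjustments and any extension under 37 CFR 1.136(a). It is not a substitute for docketing the actual due date.

```python
import calendar
from datetime import date


def add_months(d: date, months: int) -> date:
    """Add calendar months, clamping to the last day of a shorter month."""
    idx = d.month - 1 + months
    year = d.year + idx // 12
    month = idx % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def shortened_period_end(mailed: date, first_reply: date = None,
                         advisory_mailed: date = None) -> date:
    """End of the shortened statutory period under the stated rule."""
    three = add_months(mailed, 3)
    six = add_months(mailed, 6)
    # If a first reply is filed within two months and the advisory action
    # is mailed after the three-month period, the period runs to the
    # advisory action mailing date, but never beyond six months.
    if (first_reply is not None and advisory_mailed is not None
            and first_reply <= add_months(mailed, 2)
            and advisory_mailed > three):
        return min(advisory_mailed, six)
    return three
```

For a hypothetical mailing date of November 30, 2022, this sketch returns February 28, 2023 when no reply is filed within two months.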
Any inquiry concerning this communication or earlier communications from the examiner should be directed to James Boggs, whose telephone number is (571)272-2968. The examiner can normally be reached M-F 8:00 AM - 5:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn, can be reached at (571)272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/JAMES BOGGS/Examiner, Art Unit 2657

/DANIEL C WASHBURN/Supervisory Patent Examiner, Art Unit 2657