DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Drawings
The drawings are objected to as failing to comply with 37 CFR 1.84(p)(5) because they do not include the following reference sign mentioned in the description: “communication channels 524” in paragraph 0045, lines 3, 5, and 7-8.
Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
Specification
Applicant is reminded of the proper language and format for an abstract of the disclosure.
The abstract should be in narrative form and generally limited to a single paragraph on a separate sheet within the range of 50 to 150 words in length. The abstract should describe the disclosure sufficiently to assist readers in deciding whether there is a need for consulting the full patent text for details.
The language should be clear and concise and should not repeat information given in the title. It should avoid using phrases which can be implied, such as, “The disclosure concerns,” “The disclosure defined by this invention,” “The disclosure describes,” etc. In addition, the form and legal phraseology often used in patent claims, such as “means” and “said,” should be avoided.
The disclosure is objected to because of the following informalities:
In paragraph 0053, line 1, “prediction engine 204” should read “prediction engine 510”.
In paragraph 0053, line 2, “prediction engine 204” should read “prediction engine 510”.
In paragraph 0053, lines 7-8, “prediction engine 204” should read “prediction engine 510”.
In paragraph 0054, lines 7-8, “current channel is channel” should read “current channel is the channel”.
In paragraph 0059, line 7, “distort of overshadow” should read “distort or overshadow”.
In paragraph 0074, line 6, “may represent be an encoded” should read “may represent an encoded”.
In paragraph 0080, lines 4-5, “category 90N” should read “category 902N”.
In paragraph 0089, line 9, “based to indicate risk” should read “to indicate risk” or “based on risk”.
Appropriate correction is required.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1 – 3, 5 – 12 and 14 – 19 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Thomson et al. (US Patent No. 11,210,461), hereinafter Thomson.
Regarding claim 1, Thomson discloses a method comprising:
obtaining, by a computing system, first audio data representing one or more initial utterances of a user during an interactive voice session with an interactive voice system (Column 5, lines 12-18, "A caller device 205 can be a telephone, a personal or mobile computing device, such as a smartphone, tablet or notebook computer, a desktop computer, or another device by which the user can communicate with an agent. In some embodiments, a caller device 205 may be any device that supports communication over a network using e-mail, interactive text chats, or voice over Internet protocol (VOIP)."; Column 6, lines 32-35, "The masking system 100 receives caller audio (or another form of communication medium) from the caller device 205 at the ingress media gateway 105, for example, via network 220.");
generating, by the computing system, based on the first audio data, a prediction regarding whether a subsequent utterance of the user during the interactive voice session will contain sensitive information, the subsequent utterance following the one or more initial utterances in time (Abstract, lines 1-3, "A masking system prevents a human agent from receiving sensitive personal information (SPI) provided by a caller during caller-agent communication."; Column 6, lines 45-54, "The ingress media gateway 105 sends caller audio to the real-time redactor 110. Depending on the embodiment, the real-time redactor 110 may apply ASR, NLP, or predictive modeling techniques to determine a likelihood related to whether the caller audio contains SPI. In some embodiments, the real-time redactor 110 determines a likelihood that future caller audio received by the ingress media gateway 105 will include SPI, for example, using NLP techniques to identify prompting phrases such as “My social security number is—.” ");
obtaining, by the computing system, second audio data representing the subsequent utterance (Column 6, lines 35-42, "In one embodiment, the ingress media gateway 105 receives an ongoing stream of caller audio. For example, in the case of a phone call, the ingress media gateway 105 receives a stream of caller audio data while simultaneously sending audio data to the real-time redactor 110, sending masked caller audio data to the egress media gateway 115, and receiving and forwarding agent audio on to the caller device 205.");
determining, by the computing system, based on the prediction, whether to transmit the second audio data (Column 5, lines 22-29, "Once the masking system 100 determines that the caller is providing SPI, the caller audio is masked, i.e., the masking system 100 does not provide the portion of the caller audio stream containing the SPI to the agent (or provides only a subset thereof). Once the masking system 100 determines that the caller is no longer providing SPI, the masking system 100 recommences providing the caller's audio stream to the agent.");
and based on a determination not to transmit the second audio data: replacing, by the computing system, the second audio data with third audio data that is based on a voice of the user (Column 7, lines 63-66, "The ingress media gateway 105 or egress media gateway 115 may also replace portions of redacted caller audio with a shorter substitute such as “comfort signal” sounds, random DTMF tones, or the like."; Column 8, lines 12-15, "Examples of a comfort signal include a text-to-speech rendering of the caller's voice speaking random digits, a distorted version of the caller's voice, altered such that the speech is unintelligible");
and transmitting, by the computing system, the third audio data (Column 8, lines 5-9, "In one embodiment, the agent hears silence or any unredacted audio while the masking system 100 masks caller audio. In another embodiment, a comfort signal is played to alert the agent that the caller is speaking or is expected to speak.").
Regarding claim 2, Thomson discloses the method as claimed in claim 1, wherein the method further comprises transmitting, by the computing system, the first audio data and the third audio data to the interactive voice system and not transmitting the second audio data to the interactive voice system (Column 5, lines 22-29, "Once the masking system 100 determines that the caller is providing SPI, the caller audio is masked, i.e., the masking system 100 does not provide the portion of the caller audio stream containing the SPI to the agent (or provides only a subset thereof). Once the masking system 100 determines that the caller is no longer providing SPI, the masking system 100 recommences providing the caller's audio stream to the agent."; Column 8, lines 5-9, "In one embodiment, the agent hears silence or any unredacted audio while the masking system 100 masks caller audio. In another embodiment, a comfort signal is played to alert the agent that the caller is speaking or is expected to speak.").
Regarding claim 3, Thomson discloses the method as claimed in claim 1, wherein:
obtaining the first audio data comprises obtaining, by the computing system, the first audio data from the interactive voice system (Column 5, lines 12-18, "A caller device 205 can be a telephone, a personal or mobile computing device, such as a smartphone, tablet or notebook computer, a desktop computer, or another device by which the user can communicate with an agent. In some embodiments, a caller device 205 may be any device that supports communication over a network using e-mail, interactive text chats, or voice over Internet protocol (VOIP)."; Column 6, lines 32-35, "The masking system 100 receives caller audio (or another form of communication medium) from the caller device 205 at the ingress media gateway 105, for example, via network 220.");
obtaining the second audio data comprises obtaining, by the computing system, the second audio data from the interactive voice system (Column 5, lines 12-18, "A caller device 205 can be a telephone, a personal or mobile computing device, such as a smartphone, tablet or notebook computer, a desktop computer, or another device by which the user can communicate with an agent. In some embodiments, a caller device 205 may be any device that supports communication over a network using e-mail, interactive text chats, or voice over Internet protocol (VOIP)."; Column 6, lines 35-42, "In one embodiment, the ingress media gateway 105 receives an ongoing stream of caller audio. For example, in the case of a phone call, the ingress media gateway 105 receives a stream of caller audio data while simultaneously sending audio data to the real-time redactor 110, sending masked caller audio data to the egress media gateway 115, and receiving and forwarding agent audio on to the caller device 205.");
and transmitting the third audio data comprises transmitting, by the computing system, the third audio data to a server system (Column 8, lines 5-9, "In one embodiment, the agent hears silence or any unredacted audio while the masking system 100 masks caller audio. In another embodiment, a comfort signal is played to alert the agent that the caller is speaking or is expected to speak."; Column 6, lines 11-16, "In one embodiment, the masking system 100 includes a telephony server. Using this architecture, a communications link can be implemented to provide an interface between the caller device and the telephony server. For example, a communications link may be a dial-up connection or a two-way wireless communication link.").
Regarding claim 5, Thomson discloses the method as claimed in claim 1, wherein:
generating the prediction comprises determining, by the computing system, a confidence score that indicates a level of confidence that the subsequent utterance will contain the sensitive information (Column 7, lines 35-40, "The real-time redactor 110 generates confidence values related to caller audio and agent audio. The confidence values may represent a predicted likelihood that received or future caller audio contains SPI. The confidence values may be determined based on the outputs of the ASR and NLP modules.");
and determining whether to transmit the second audio data comprises determining, by the computing system, whether to transmit the second audio data based on a comparison of the confidence score and a threshold (Column 7, lines 40-45, "If a confidence value for a portion of the caller audio exceeds a predetermined threshold value, the real-time redactor 110 may send a redaction control signal 230 to the ingress media gateway, indicating what portions of the caller audio should be masked and how long the redaction should last.").
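For illustration only, the threshold comparison relied on for claim 5 can be sketched as follows. This is a minimal sketch under stated assumptions and is not code taken from Thomson or from the application: the names `Segment`, `should_transmit`, and `route`, the 0.8 threshold, and the single-byte placeholder standing in for a comfort signal are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical container for one portion of caller audio."""
    audio: bytes
    spi_confidence: float  # assumed predicted likelihood (0.0 to 1.0) of SPI


def should_transmit(segment: Segment, threshold: float = 0.8) -> bool:
    """Transmit unless the confidence value exceeds the threshold."""
    return not (segment.spi_confidence > threshold)


def route(segments, threshold=0.8, comfort=b"\x00"):
    """Forward unmasked audio; otherwise substitute a placeholder comfort signal."""
    return [s.audio if should_transmit(s, threshold) else comfort for s in segments]
```

In this sketch a segment whose confidence value equals the threshold is still transmitted, which follows the "exceeds a predetermined threshold value" wording of the cited passage.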
Regarding claim 6, Thomson discloses the method as claimed in claim 1, wherein the method further comprises:
determining, by the computing system, an expected temporal duration of the subsequent utterance (Column 7, line 66 - Column 8, line 2, "In one embodiment, upon detection of SPI in the caller audio stream, the real-time redactor 110 may predict a length of the expected SPI");
and generating, by the computing system, the third audio data based on the expected temporal duration of the subsequent utterance (Column 9, lines 34-47, "As a first example, if a requirement exists that a predetermined number of digits are to be masked, then a process may count digits output from an automatic speech recognizer and end masking once this number of digits has been masked. For example, if the requirement is that at least four digits of a phone number shall be redacted, and the first four digits are played to an agent before masking begins, then the system may restore audio (i.e., end redaction) after the customer has spoken eight digits. In this example, the agent would hear the first four and last two digits of a 10-digit phone number. Similarly, if a requirement exists that a predetermined number of words or seconds are to be masked, then a process may count words or time in seconds and redact as in the previous example for digits.").
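For illustration only, the digit-counting example quoted from Column 9 can be sketched as follows. This is a minimal sketch, not code from Thomson: the function name `mask_digits`, its parameters, and the use of a text string in place of recognized speech are hypothetical assumptions.

```python
def mask_digits(digits: str, played_before_mask: int = 4,
                digits_to_mask: int = 4, placeholder: str = "*") -> str:
    """Mask a fixed count of digits after an initial unmasked run.

    Assumes the digits have already been recognized (e.g., by an ASR step)
    and are supplied one character per spoken digit.
    """
    out = []
    for index, digit in enumerate(digits):
        # Masking starts after the initial run and ends once the count is reached.
        masked = played_before_mask <= index < played_before_mask + digits_to_mask
        out.append(placeholder if masked else digit)
    return "".join(out)
```

With the default values, a 10-digit input leaves the first four and last two digits unmasked, matching the outcome described in the quoted passage.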
Regarding claim 7, Thomson discloses the method as claimed in claim 1, wherein:
the third audio data represents an alternative utterance, the method further comprises: determining, by the computing system, based on the first audio data, a class of the sensitive information (Column 4, lines 61-66, "Examples of reports that relate to user experience include estimates of redaction accuracy, a number of times customer utterances are unrecognizable, a number or SPI events detected, a categorization of the types of SPI events detected, and an average number of words or digits that are redacted.");
and generating, by the computing system, the third audio data, wherein the third audio data represents an utterance containing a replacement utterance in the same class of sensitive information (Column 8, lines 7-16, "In another embodiment, a comfort signal is played to alert the agent that the caller is speaking or is expected to speak. The comfort signal may also indicate that the caller is inputting DTMF, the caller is providing other data as input, the caller is still on the line, or to convey other indications of call status. Examples of a comfort signal include a text-to-speech rendering of the caller's voice speaking random digits, a distorted version of the caller's voice, altered such that the speech is unintelligible, a voice signal responsive to the caller's voice"; Column 14, lines 23-26, "Meanwhile, a comfort signal is played to the agent so the agent knows that the caller is speaking. In the case of DTMF provided by the caller, the comfort signal may be a flat or a random, set of DTMF tones.").
Regarding claim 8, Thomson discloses the method as claimed in claim 1, wherein:
the method further comprises generating, by the computing system, a spectrogram of the voice of the user; and the method further comprises generating, by the computing system, the third audio data based on the spectrogram of the voice of the user (Column 8, lines 7-24, "In another embodiment, a comfort signal is played to alert the agent that the caller is speaking or is expected to speak. The comfort signal may also indicate that the caller is inputting DTMF, the caller is providing other data as input, the caller is still on the line, or to convey other indications of call status. Examples of a comfort signal include a text-to-speech rendering of the caller's voice speaking random digits, a distorted version of the caller's voice, altered such that the speech is unintelligible, a voice signal responsive to the caller's voice, a signal such as the “wah-wah” voice heard in an old Charlie Brown cartoon when adults are speaking, a voice signal that is synthesized to have an average pitch, average range, or spectral characteristics similar to the caller's voice without intelligible content, a distorted version of the caller's voice, scrambled so that the audio segments are out of order, a sequence of random phonemes or digits, typing sounds, a series of tones or music, or white or pink noise.").
Regarding claim 9, Thomson discloses the method as claimed in claim 1, wherein obtaining the second audio data comprises obtaining, by the computing system, the second audio data after generating the prediction regarding whether the subsequent utterance of the user will contain the sensitive information (Column 3, lines 44-50, "The real-time redactor 110 detects or anticipates SPI in caller audio received from the ingress media gateway 105. In some embodiments, the real-time redactor 110 may additionally or alternatively receive agent audio from the ingress media gateway 105, and may use the agent audio to predict whether SPI is likely to be present in upcoming caller audio.").
Regarding claim 10, Thomson discloses a computing system (Column 5, lines 12-18, "A caller device 205 can be a telephone, a personal or mobile computing device, such as a smartphone, tablet or notebook computer, a desktop computer, or another device by which the user can communicate with an agent. In some embodiments, a caller device 205 may be any device that supports communication over a network using e-mail, interactive text chats, or voice over Internet protocol (VOIP).") comprising:
obtaining, by a computing system, first audio data representing one or more initial utterances of a user during an interactive voice session with an interactive voice system (Column 6, lines 32-35, "The masking system 100 receives caller audio (or another form of communication medium) from the caller device 205 at the ingress media gateway 105, for example, via network 220.");
and processing circuitry configured to:
generate, based on the first audio data, a prediction regarding whether a subsequent utterance of the user during the interactive voice session will contain sensitive information, the subsequent utterance following the one or more initial utterances in time (Abstract, lines 1-3, "A masking system prevents a human agent from receiving sensitive personal information (SPI) provided by a caller during caller-agent communication."; Column 6, lines 45-54, "The ingress media gateway 105 sends caller audio to the real-time redactor 110. Depending on the embodiment, the real-time redactor 110 may apply ASR, NLP, or predictive modeling techniques to determine a likelihood related to whether the caller audio contains SPI. In some embodiments, the real-time redactor 110 determines a likelihood that future caller audio received by the ingress media gateway 105 will include SPI, for example, using NLP techniques to identify prompting phrases such as “My social security number is—.” ");
and obtain second audio data representing the subsequent utterance (Column 6, lines 35-42, "In one embodiment, the ingress media gateway 105 receives an ongoing stream of caller audio. For example, in the case of a phone call, the ingress media gateway 105 receives a stream of caller audio data while simultaneously sending audio data to the real-time redactor 110, sending masked caller audio data to the egress media gateway 115, and receiving and forwarding agent audio on to the caller device 205.");
determine, based on the prediction, whether to transmit the second audio data (Column 5, lines 22-29, "Once the masking system 100 determines that the caller is providing SPI, the caller audio is masked, i.e., the masking system 100 does not provide the portion of the caller audio stream containing the SPI to the agent (or provides only a subset thereof). Once the masking system 100 determines that the caller is no longer providing SPI, the masking system 100 recommences providing the caller's audio stream to the agent.");
and based on a determination not to transmit the second audio data: replace the second audio data with third audio data that is based on a voice of the user (Column 7, lines 63-66, "The ingress media gateway 105 or egress media gateway 115 may also replace portions of redacted caller audio with a shorter substitute such as “comfort signal” sounds, random DTMF tones, or the like."; Column 8, lines 12-15, "Examples of a comfort signal include a text-to-speech rendering of the caller's voice speaking random digits, a distorted version of the caller's voice, altered such that the speech is unintelligible");
and transmit the third audio data (Column 8, lines 5-9, "In one embodiment, the agent hears silence or any unredacted audio while the masking system 100 masks caller audio. In another embodiment, a comfort signal is played to alert the agent that the caller is speaking or is expected to speak.").
Regarding claim 11, Thomson discloses the computing system as claimed in claim 10, wherein the method further comprises transmitting, by the computing system, the first audio data and the third audio data to the interactive voice system and not transmitting the second audio data to the interactive voice system (Column 5, lines 22-29, "Once the masking system 100 determines that the caller is providing SPI, the caller audio is masked, i.e., the masking system 100 does not provide the portion of the caller audio stream containing the SPI to the agent (or provides only a subset thereof). Once the masking system 100 determines that the caller is no longer providing SPI, the masking system 100 recommences providing the caller's audio stream to the agent."; Column 8, lines 5-9, "In one embodiment, the agent hears silence or any unredacted audio while the masking system 100 masks caller audio. In another embodiment, a comfort signal is played to alert the agent that the caller is speaking or is expected to speak.").
Regarding claim 12, Thomson discloses the computing system as claimed in claim 10, wherein:
the processing circuitry is configured to obtain the first audio data from the interactive voice system (Column 5, lines 12-18, "A caller device 205 can be a telephone, a personal or mobile computing device, such as a smartphone, tablet or notebook computer, a desktop computer, or another device by which the user can communicate with an agent. In some embodiments, a caller device 205 may be any device that supports communication over a network using e-mail, interactive text chats, or voice over Internet protocol (VOIP)."; Column 6, lines 32-35, "The masking system 100 receives caller audio (or another form of communication medium) from the caller device 205 at the ingress media gateway 105, for example, via network 220.");
the processing circuitry is configured to obtain the second audio data from the interactive voice system (Column 5, lines 12-18, "A caller device 205 can be a telephone, a personal or mobile computing device, such as a smartphone, tablet or notebook computer, a desktop computer, or another device by which the user can communicate with an agent. In some embodiments, a caller device 205 may be any device that supports communication over a network using e-mail, interactive text chats, or voice over Internet protocol (VOIP)."; Column 6, lines 35-42, "In one embodiment, the ingress media gateway 105 receives an ongoing stream of caller audio. For example, in the case of a phone call, the ingress media gateway 105 receives a stream of caller audio data while simultaneously sending audio data to the real-time redactor 110, sending masked caller audio data to the egress media gateway 115, and receiving and forwarding agent audio on to the caller device 205.");
and the processing circuitry is configured to transmit the third audio data to a server system (Column 8, lines 5-9, "In one embodiment, the agent hears silence or any unredacted audio while the masking system 100 masks caller audio. In another embodiment, a comfort signal is played to alert the agent that the caller is speaking or is expected to speak."; Column 6, lines 11-16, "In one embodiment, the masking system 100 includes a telephony server. Using this architecture, a communications link can be implemented to provide an interface between the caller device and the telephony server. For example, a communications link may be a dial-up connection or a two-way wireless communication link.").
Regarding claim 14, Thomson discloses the computing system as claimed in claim 10, wherein:
the processing circuitry is configured such that, as part of generating the prediction, the processing circuitry determines a confidence score that indicates a level of confidence that the subsequent utterance will contain the sensitive information (Column 7, lines 35-40, "The real-time redactor 110 generates confidence values related to caller audio and agent audio. The confidence values may represent a predicted likelihood that received or future caller audio contains SPI. The confidence values may be determined based on the outputs of the ASR and NLP modules.");
and the processing circuitry is configured such that, as part of determining whether to transmit the second audio data, the processing circuitry determines whether to transmit the second audio data based on a comparison of the confidence score and a threshold (Column 7, lines 40-45, "If a confidence value for a portion of the caller audio exceeds a predetermined threshold value, the real-time redactor 110 may send a redaction control signal 230 to the ingress media gateway, indicating what portions of the caller audio should be masked and how long the redaction should last.").
Regarding claim 15, Thomson discloses the computing system as claimed in claim 10, wherein the processing circuitry is further configured to:
determine an expected temporal duration of the subsequent utterance (Column 7, line 66 - Column 8, line 2, "In one embodiment, upon detection of SPI in the caller audio stream, the real-time redactor 110 may predict a length of the expected SPI");
and generate the third audio data based on the expected temporal duration of the subsequent utterance (Column 9, lines 34-47, "As a first example, if a requirement exists that a predetermined number of digits are to be masked, then a process may count digits output from an automatic speech recognizer and end masking once this number of digits has been masked. For example, if the requirement is that at least four digits of a phone number shall be redacted, and the first four digits are played to an agent before masking begins, then the system may restore audio (i.e., end redaction) after the customer has spoken eight digits. In this example, the agent would hear the first four and last two digits of a 10-digit phone number. Similarly, if a requirement exists that a predetermined number of words or seconds are to be masked, then a process may count words or time in seconds and redact as in the previous example for digits.").
Regarding claim 16, Thomson discloses the computing system as claimed in claim 10, wherein:
the third audio data represents an alternative utterance, the processing circuitry is further configured to: determine, based on the first audio data, a class of the sensitive information (Column 4, lines 61-66, "Examples of reports that relate to user experience include estimates of redaction accuracy, a number of times customer utterances are unrecognizable, a number or SPI events detected, a categorization of the types of SPI events detected, and an average number of words or digits that are redacted.");
and generate the third audio data, wherein the third audio data represents an utterance containing a replacement utterance in the same class of sensitive information (Column 8, lines 7-16, "In another embodiment, a comfort signal is played to alert the agent that the caller is speaking or is expected to speak. The comfort signal may also indicate that the caller is inputting DTMF, the caller is providing other data as input, the caller is still on the line, or to convey other indications of call status. Examples of a comfort signal include a text-to-speech rendering of the caller's voice speaking random digits, a distorted version of the caller's voice, altered such that the speech is unintelligible, a voice signal responsive to the caller's voice"; Column 14, lines 23-26, "Meanwhile, a comfort signal is played to the agent so the agent knows that the caller is speaking. In the case of DTMF provided by the caller, the comfort signal may be a flat or a random, set of DTMF tones.").
Regarding claim 17, Thomson discloses the computing system as claimed in claim 10, wherein:
the processing circuitry is further configured to generate a spectrogram of the voice of the user; and the processing circuitry is further configured to generate the third audio data based on the spectrogram of the voice of the user (Column 8, lines 7-24, "In another embodiment, a comfort signal is played to alert the agent that the caller is speaking or is expected to speak. The comfort signal may also indicate that the caller is inputting DTMF, the caller is providing other data as input, the caller is still on the line, or to convey other indications of call status. Examples of a comfort signal include a text-to-speech rendering of the caller's voice speaking random digits, a distorted version of the caller's voice, altered such that the speech is unintelligible, a voice signal responsive to the caller's voice, a signal such as the “wah-wah” voice heard in an old Charlie Brown cartoon when adults are speaking, a voice signal that is synthesized to have an average pitch, average range, or spectral characteristics similar to the caller's voice without intelligible content, a distorted version of the caller's voice, scrambled so that the audio segments are out of order, a sequence of random phonemes or digits, typing sounds, a series of tones or music, or white or pink noise.").
Regarding claim 18, Thomson discloses the computing system as claimed in claim 10, wherein the processing circuitry is configured such that, as part of obtaining the second audio data, the processing circuitry obtains the second audio data after generating the prediction regarding whether the subsequent utterance of the user will contain the sensitive information (Column 3, lines 44-50, "The real-time redactor 110 detects or anticipates SPI in caller audio received from the ingress media gateway 105. In some embodiments, the real-time redactor 110 may additionally or alternatively receive agent audio from the ingress media gateway 105, and may use the agent audio to predict whether SPI is likely to be present in upcoming caller audio.").
Regarding claim 19, Thomson discloses a computer-readable storage medium comprising instructions that, when executed, cause processing circuitry of a computing system to:
obtain first audio data representing one or more initial utterances of a user during an interactive voice session with an interactive voice system (Column 5, lines 12-18, "A caller device 205 can be a telephone, a personal or mobile computing device, such as a smartphone, tablet or notebook computer, a desktop computer, or another device by which the user can communicate with an agent. In some embodiments, a caller device 205 may be any device that supports communication over a network using e-mail, interactive text chats, or voice over Internet protocol (VOIP)."; Column 6, lines 32-35, "The masking system 100 receives caller audio (or another form of communication medium) from the caller device 205 at the ingress media gateway 105, for example, via network 220.");
generate, based on the first audio data, a prediction regarding whether a subsequent utterance of the user during the interactive voice session will contain sensitive information, the subsequent utterance following the one or more initial utterances in time (Abstract, lines 1-3, "A masking system prevents a human agent from receiving sensitive personal information (SPI) provided by a caller during caller-agent communication."; Column 6, lines 45-54, "The ingress media gateway 105 sends caller audio to the real-time redactor 110. Depending on the embodiment, the real-time redactor 110 may apply ASR, NLP, or predictive modeling techniques to determine a likelihood related to whether the caller audio contains SPI. In some embodiments, the real-time redactor 110 determines a likelihood that future caller audio received by the ingress media gateway 105 will include SPI, for example, using NLP techniques to identify prompting phrases such as “My social security number is—.” ");
obtain second audio data representing the subsequent utterance (Column 6, lines 35-42, "In one embodiment, the ingress media gateway 105 receives an ongoing stream of caller audio. For example, in the case of a phone call, the ingress media gateway 105 receives a stream of caller audio data while simultaneously sending audio data to the real-time redactor 110, sending masked caller audio data to the egress media gateway 115, and receiving and forwarding agent audio on to the caller device 205.");
determine, based on the prediction, whether to transmit the second audio data (Column 5, lines 22-29, "Once the masking system 100 determines that the caller is providing SPI, the caller audio is masked, i.e., the masking system 100 does not provide the portion of the caller audio stream containing the SPI to the agent (or provides only a subset thereof). Once the masking system 100 determines that the caller is no longer providing SPI, the masking system 100 recommences providing the caller's audio stream to the agent.");
and based on a determination not to transmit the second audio data: replace the second audio data with third audio data that is based on a voice of the user (Column 7, lines 63-66, "The ingress media gateway 105 or egress media gateway 115 may also replace portions of redacted caller audio with a shorter substitute such as “comfort signal” sounds, random DTMF tones, or the like."; Column 8, lines 12-15, "Examples of a comfort signal include a text-to-speech rendering of the caller's voice speaking random digits, a distorted version of the caller's voice, altered such that the speech is unintelligible");
and transmit the third audio data (Column 8, lines 5-9, "In one embodiment, the agent hears silence or any unredacted audio while the masking system 100 masks caller audio. In another embodiment, a comfort signal is played to alert the agent that the caller is speaking or is expected to speak.").
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 4 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Thomson in view of Baker et al. (US Patent Application Publication No. 2020/0196141), hereinafter Baker.
Regarding claim 4, Thomson discloses the method as claimed in claim 1, but does not explicitly disclose: wherein the interactive voice system is a voice assistant system.
Baker teaches:
wherein the interactive voice system is a voice assistant system (Paragraph 0002, lines 1-4, "Voice integration devices, for example, voice assistants such as Amazon Echo or Google Home devices may allow a user to vocally interact with a connected microphone/speaker device."; Paragraph 0005, lines 5-8, "The privacy mode may include mechanically muting or covering up the microphone of the audio device, providing a physical disconnect, or adding interference to obfuscate the audio signal."; Paragraph 0128, lines 1-10, "The voice command to trigger privacy mode may be a command setup by a user. Or, the voice command may be based on a specific keyword. For example, the privacy mode may be automatically engaged in response to the detection of specific keywords such as “bank”, “account”, or “pin” are received at an audio device. Privacy mode may also be engaged when numerical digits are read out loud. In this way, the privacy of verbally spoken credit card, bank account, social security, and/or phone numbers may be maintained, and not transmitted by the audio device.").
Baker teaches masking audio data in a voice assistant system in order to provide a privacy mode that is not susceptible to malicious software (Paragraph 0022, lines 1-7, "This application is directed towards a high-confidence tamper-proof privacy mode for audio devices. The privacy mode may be tamper-proof in that the privacy mode is not able to be compromised by malicious software, for example, by providing a visual indication tied to the hardware that may allow a user to confidently determine whether privacy mode has truly been enabled.").
Thomson and Baker are considered to be analogous to the claimed invention because they are in the same field of voice systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Thomson to incorporate the teachings of Baker to mask audio data in a voice assistant system.  Doing so would allow for providing a privacy mode that is not susceptible to malicious software.
Regarding claim 13, Thomson discloses the computing system as claimed in claim 10, but does not explicitly disclose: wherein the interactive voice system is a voice assistant system.
Baker teaches:
wherein the interactive voice system is a voice assistant system (Paragraph 0002, lines 1-4, "Voice integration devices, for example, voice assistants such as Amazon Echo or Google Home devices may allow a user to vocally interact with a connected microphone/speaker device."; Paragraph 0005, lines 5-8, "The privacy mode may include mechanically muting or covering up the microphone of the audio device, providing a physical disconnect, or adding interference to obfuscate the audio signal."; Paragraph 0128, lines 1-10, "The voice command to trigger privacy mode may be a command setup by a user. Or, the voice command may be based on a specific keyword. For example, the privacy mode may be automatically engaged in response to the detection of specific keywords such as “bank”, “account”, or “pin” are received at an audio device. Privacy mode may also be engaged when numerical digits are read out loud. In this way, the privacy of verbally spoken credit card, bank account, social security, and/or phone numbers may be maintained, and not transmitted by the audio device.").
Baker teaches masking audio data in a voice assistant system in order to provide a privacy mode that is not susceptible to malicious software (Paragraph 0022, lines 1-7, "This application is directed towards a high-confidence tamper-proof privacy mode for audio devices. The privacy mode may be tamper-proof in that the privacy mode is not able to be compromised by malicious software, for example, by providing a visual indication tied to the hardware that may allow a user to confidently determine whether privacy mode has truly been enabled.").
Thomson and Baker are considered to be analogous to the claimed invention because they are in the same field of voice systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Thomson to incorporate the teachings of Baker to mask audio data in a voice assistant system.  Doing so would allow for providing a privacy mode that is not susceptible to malicious software.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Schmidt et al. (US Patent No. 11,024,295) teaches automatically blocking sensitive data in an audio stream.
Papania-Davis et al. (US Patent No. 10,885,902) teaches replacing sensitive information in conversational audio data with pseudo-language audio data.
Channakeshava et al. (US Patent No. 10,728,384) teaches redacting sensitive information from audio recordings.
Pycko et al. (US Patent No. 9,787,835) teaches suppressing sensitive information during an audio communication.
Chong et al. (US Patent Application Publication No. 2021/0389924) teaches extracting and redacting sensitive information from audio data.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to James Boggs whose telephone number is (571)272-2968. The examiner can normally be reached M-F 8:00 AM - 5:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn, can be reached on (571)272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/JAMES BOGGS/
Examiner, Art Unit 2657

/DANIEL C WASHBURN/
Supervisory Patent Examiner, Art Unit 2657