DETAILED ACTION
Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 2/23/2021 has been entered.
 

Response to Arguments
Applicant’s arguments with respect to claim(s) 1-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 103

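In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.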
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1, 3-11, and 13-20 are rejected under 35 U.S.C. 103 as being unpatentable over Tesch U.S. PAP 2014/0163980 A1 in view of Berckhardt U.S. PAP 2018/0108351.
Regarding claim 1 Tesch teaches a method for suppressing vocal tracks in content upon detection of corresponding words (method receiving, a message input having a set of words or phrases for generating a multimedia message, see par. [0007]), the method comprising:
detecting, during output of content, an utterance, wherein the content comprises a vocal track and at least one additional audio track (a message input having a set of words or phrases for generating a multimedia message. A first media content portion is determined from media content that includes a first audio content portion of a first video content portion and a second media content portion is determined that includes a second audio content portion of a second video content portion, see par. [0007]); 
determining at least one first word in the detected utterance (the first media content portion and the second media content portion correspond to the set of words or phrases of the message input, see par. [0007]); 

comparing the at least one first word with the at least one second word (The selected audio content portion can be associated with a first video content portion, a media content portion can have one actor, object or person within it speaking the same words, but with another voice, see par. [0063]); 
determining, based on the comparing, that the at least one first word matches the at least one second word (a media content portion can have one actor, object or person within it speaking the same words, but with another voice, see par. [0063]); 
and in response to determining that the at least one first word matches the at least one second word, suppressing output of the vocal track of the content (a  media content portion determined by the media component 108 can have audio content in associated with it. The overlay component 106 operates to examine the audio content portions generated from media content and remove, extract, identify, replace and/or combine the audio content portion with a video content portion that the audio content portion is not originally associated with, see par. [0069]).
However Tesch does not teach detecting during output of content, an utterance of a user, wherein the content comprises a vocal track and at least one additional audio track.
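In a similar field of endeavor Berckhardt teaches methods, systems, products, features, services, and other elements directed to media playback, see par. [0001]. Berckhardt teaches detecting a first voice input; determining a first measure of confidence associated with the first voice input; receiving a message, wherein the message comprises a second measure of confidence associated with detection of the first voice input by a network device; determining whether the first measure of confidence is greater than the second measure of confidence; and based on the determination that the first measure of confidence is greater than the second measure of confidence, sending a second voice input to a server. The method may further comprise adjusting a volume of audio playback in response to detecting the voice input. The method may further comprise determining that the audio playback is music playback, and where adjusting the volume of the audio playback in response to detecting the voice input comprises ducking the music playback. The method may further comprise determining that the audio playback is playback of an audio book, and where adjusting the volume of the audio playback in response to detecting the voice input comprises pausing the playback of the audio book, see par. [0030].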
It would have been obvious to one of ordinary skill in the art to combine the teachings of Tesch with the teachings of Berckhardt for the benefit of allowing ducking of music playback in order to improve reliability in voice recognition of a voice command, see par. [0150].
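For illustration only, the sequence of steps recited in claim 1 can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from the application, Tesch, or Berckhardt: the names Playback, normalize, and update_vocal_track are hypothetical, word recognition is assumed to have already produced word lists, and "suppressing" is modeled simply as setting the vocal-track gain to zero.

```python
from dataclasses import dataclass


@dataclass
class Playback:
    """Hypothetical content with a vocal track and one additional audio track."""
    vocal_gain: float = 1.0
    backing_gain: float = 1.0


def normalize(words):
    """Lower-case each word and strip surrounding punctuation before comparing."""
    return [w.strip(".,!?;:'\"").lower() for w in words]


def update_vocal_track(playback, uttered_words, content_words):
    """Suppress the vocal track when the uttered words match the content words.

    uttered_words: words recognized in the detected utterance (the first words).
    content_words: words in the portion of the content that was output while
                   the utterance was spoken (the second words).
    Returns True when a match was found and the vocal track was suppressed.
    """
    first = normalize(uttered_words)
    second = normalize(content_words)
    matched = bool(first) and first == second
    if matched:
        # Assumption: suppression is modeled as zero gain on the vocal track;
        # the additional audio track is left unchanged.
        playback.vocal_gain = 0.0
    return matched
```

In this sketch a non-matching utterance leaves both gains untouched, so output of the vocal track simply continues.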

Regarding claim 3 Tesch teaches the method of claim 1, wherein determining at least one second word included in a portion of the content comprises accessing a transcript of the content 
Regarding claim 4 Tesch teaches the method of claim 1, wherein determining at least one second word included in a portion of the content comprises analyzing an audio component of the content (such as a voice input, or any other input that can provide a word and/or phrase and be received by the input component 80, see par. [0080]). 
Regarding claim 5 Tesch teaches the method of claim 1, wherein: a device is used to output the content (the multimedia message can include audio content portions that are outputted as podcasts corresponding to the message inputs with images and/or video, see par. [0080]); 
the device comprises a microphone (microphone, see par. [0237]); 
and the microphone is used to detect the utterance (The set of inputs 3712 can be received via an input device that can include one or more mechanisms that permit a user to input information to the computing device 3702, such as microphone, see par. [0237]).
Regarding claim 6 Tesch teaches the method of claim 1, wherein determining the at least one first word in the detected utterance comprises: extracting features of the detected utterance (voice recognition, see par. [0237]); 
and comparing the extracted features to a speech database (Indexes can be created with the media content portions, classifications, and corresponding words or phrases using one or more columns of a database table, see par. [0269]). 
Regarding claim 7 Tesch teaches the method of claim 6, wherein extracting features of the detected utterance comprises deriving Mel-frequency cepstral coefficients of the detected utterance (the audio analysis component 3812 recognizes words or phrases within a set of media content, such as by performing a sound analysis on the spectral content of the media content, see par. [0254]). 
Regarding claim 8 Tesch teaches the method of claim 1, wherein the at least one first word comprises a plurality of first words and the at least one second word comprises a plurality of second words (a second set of words or phrases that are different from the first set of words and phrases received by the input component 804 and that further have the same or a similar definition as the first set of words or phrases, see par. [0099]), the method further comprising: 
determining timing information of the plurality of first words, and wherein determining a match further comprises determining that the timing information of the plurality of first words matches timing information of the plurality of second words (selected to correspond with the set of media content… a time period selected to correspond with the set of media content selected to correspond with the set of media content from a personal video or audio stored in a data store, such as a characteristic pertaining to the media content portions, see par. [0084]). 
Regarding claim 9 Tesch teaches the method of claim 1, wherein suppressing output of the vocal track of the content comprises applying a filter to remove a vocal component of the content (The voice filter component 306 is configured to separate the video content portion from the audio content portion so that the different portions are presented as options to a user for selection, see par. [0086]). 
Regarding claim 10 Tesch teaches the method of claim 1, wherein: the content comprises a first version including a vocal component and a second version not including the vocal component (a segment from the movie "Gone with the Wind" could be generated by the media content component 104, in which Clark Gable's role says, "Frankly my dear, I don't give a damn" to Vivien Leigh's role. The music playing in the background could then be removed as one of the audio content portions identified within the media content portion. The overlay component could then overlay another music audio portion instead, which could be stored, generated or communicated thereto, see par. [0074]); 
and suppressing output of the vocal track of the content comprises switching output from the first version of the content to the second version of the content (the overlay component 106 can operate to discern multiple voices or sounds from within a media content portion. The sounds within the media content portion can be distinguished and either removed to overlay another media content portion, see par. [0073]).
Regarding claim 11 Tesch teaches a system for suppressing vocal tracks in content upon detection of corresponding words (An exemplary system comprises a memory that stores computer-executable components and a processor, communicatively coupled to the memory, which is configured to facilitate execution of the computer-executable components, see par. [0006]), the system comprising: 
and control circuitry (a processor, see par. [0006]) configured to:
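determine at least one first word in the detected utterance (a message input having a set of words or phrases for generating a multimedia message. A first media content portion is determined from media content that includes a first audio content portion of a first video content portion and a second media content portion is determined that includes a second audio content portion of a second video content portion, see par. [0007]); 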
determine at least one second word included in a portion of the content that was output at a time when the at least one first word was uttered (the first media content portion and the second media content portion correspond to the set of words or phrases of the message input, see par. [0007]); 
compare the at least one first word with the at least one second word (The selected audio content portion can be associated with a first video content portion, a media content portion can have one actor, object or person within it speaking the same words, but with another voice, see par. [0063]); 
determine, based on the comparing, that the at least one first word matches the at least one second word (a media content portion can have one actor, object or person within it speaking the same words, but with another voice, see par. [0063]); 
and in response to determining that the at least one first word matches the at least one second word, suppress output of the vocal track of the content (a  media content portion determined by the media component 108 can have audio content in associated with it. The overlay component 106 operates to examine the audio content portions generated from media content and remove, extract, identify, replace and/or combine the audio content portion with a video content portion that the audio content portion is not originally associated with, see par. [0069]).

In a similar field of endeavor Berckhardt teaches methods, systems, products, features, services, and other elements directed to media playback, see par. [0001]. Berckhardt teaches detecting a first voice input; determining a first measure of confidence associated with the first voice input; receiving a message, wherein the message comprises a second measure of confidence associated with detection of the first voice input by a network device; determining whether the first measure of confidence is greater than the second measure of confidence; and based on the determination that the first measure of confidence is greater than the second measure of confidence, sending a second voice input to a server. The method may further comprise adjusting a volume of audio playback in response to detecting the voice input. The method may further comprise determining that the audio playback is music playback, and where adjusting the volume of the audio playback in response to detecting the voice input comprises ducking the music playback. The method may further comprise determining that the audio playback is playback of an audio book, and where adjusting the volume of the audio playback in response to detecting the voice input comprises pausing the playback of the audio book, see par. [0030].
It would have been obvious to one of ordinary skill in the art to combine the teachings of Tesch with the teachings of Berckhardt for the benefit of allowing ducking of music playback in order to improve reliability in voice recognition of a voice command, see par. [0150].
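For illustration only, the playback adjustment described in the cited passage of Berckhardt (ducking music, pausing an audio book) can be sketched as follows. This is a minimal sketch, not Berckhardt's implementation: the function name adjust_playback, the string labels "music" and "audiobook", and the duck_factor value are assumptions made for the example.

```python
def adjust_playback(kind, volume, duck_factor=0.25):
    """Return (new_volume, paused) after a voice input has been detected.

    kind: hypothetical label for the audio being played, "music" or "audiobook".
    volume: current playback volume in the range 0.0 to 1.0.
    duck_factor: assumed fraction of the volume kept while ducking.
    """
    if kind == "music":
        # Duck: lower the music volume but keep it playing.
        return volume * duck_factor, False
    if kind == "audiobook":
        # Pause: keep the volume setting and stop playback.
        return volume, True
    # Any other kind of audio is left unchanged in this sketch.
    return volume, False
```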
Regarding claim 13 Tesch teaches the system of claim 11, wherein determining at least one second word included in the portion of the content comprises accessing a transcript of the 
Regarding claim 14 Tesch teaches the system of claim 11, wherein determining at least one second word included in the portion of the content comprises analyzing an audio component of the content (such as a voice input, or any other input that can provide a word and/or phrase and be received by the input component 80, see par. [0080]). 
Regarding claim 15 Tesch teaches the system of claim 11, wherein: 
a device is used to output the content (the multimedia message can include audio content portions that are outputted as podcasts corresponding to the message inputs with images and/or video, see par. [0080]); 
the device comprises a microphone (microphone, see par. [0237]); 
and the microphone is used to detect the utterance (The set of inputs 3712 can be received via an input device that can include one or more mechanisms that permit a user to input information to the computing device 3702, such as microphone, see par. [0237]).
Regarding claim 16 Tesch teaches the system of claim 11, wherein determining the at least one first word in the detected utterance comprises: extracting features of the detected utterance (voice recognition, see par. [0237]); 
and comparing the extracted features to a speech database (Indexes can be created with the media content portions, classifications, and corresponding words or phrases using one or more columns of a database table, see par. [0269]). 
Regarding claim 17 Tesch teaches the system of claim 16, wherein extracting features of the detected utterance comprises deriving Mel-frequency cepstral coefficients of the detected utterance (the audio analysis component 3812 recognizes words or phrases within a set of media content, such as by performing a sound analysis on the spectral content of the media content, see par. [0254]).
Regarding claim 18 Tesch teaches the system of claim 11, wherein the at least one first word comprises a plurality of first words and the at least one second word comprises a plurality of second words (a second set of words or phrases that are different from the first set of words and phrases received by the input component 804 and that further have the same or a similar definition as the first set of words or phrases, see par. [0099]), the method further comprising: 
determining timing information of the plurality of first words, and wherein determining a match further comprises determining that the timing information of the plurality of first words matches timing information of the plurality of second words (selected to correspond with the set of media content… a time period selected to correspond with the set of media content selected to correspond with the set of media content from a personal video or audio stored in a data store, such as a characteristic pertaining to the media content portions, see par. [0084]). 
Regarding claim 19 Tesch teaches the system of claim 11, wherein suppressing output of the vocal track of the content comprises applying a filter to remove a vocal component of the content (The voice filter component 306 is configured to separate the video content portion from the audio content portion so that the different portions are presented as options to a user for selection, see par. [0086]). 
Regarding claim 20 Tesch teaches the system of claim 11, wherein: 
the content comprises a first version including a vocal component and a second version not including the vocal component (a segment from the movie "Gone with the Wind" could be generated by the media content component 104, in which Clark Gable's role says, "Frankly my dear, I don't give a damn" to Vivien Leigh's role. The music playing in the background could then be removed as one of the audio content portions identified within the media content portion. The overlay component could then overlay another music audio portion instead, which could be stored, generated or communicated thereto, see par. [0074]); 
and suppressing output of the vocal track of the content comprises switching output from the first version of the content to the second version of the content (the overlay component 106 can operate to discern multiple voices or sounds from within a media content portion. The sounds within the media content portion can be distinguished and either removed to overlay another media content portion, see par. [0073]).
Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 2 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Tesch U.S. PAP 2014/0163980 A1 in view of Berckhardt U.S. PAP 2018/0108351, further in view of Danieli U.S. Patent No. 7,437,290.

Regarding claim 2 Tesch teaches the method of claim 1, further comprising: subsequent to determining that the at least one first word matches the at least one second word (message components generate a set of words or phrases to be communicated by a client device and/or third party server multimedia message, see par. [0075]; generate other media content corresponding to the textual word or phrase generated within the message inputs or received by the input component, see par. [0076]).
However Tesch in view of Berckhardt does not teach determining at least one additional word included in a portion of the content that was output; comparing the at least one additional word with the utterance; determining, based on comparing the at least one additional word with the utterance, that the at least one additional word does not match the utterance; and in response to determining that the at least one additional word does not match the utterance, continuing output of the vocal track of the content. 
In a similar field of endeavor Danieli teaches that an input audio data stream comprising speech is processed by an automatic censoring filter in either a real-time mode or a batch mode, producing censored speech that has been altered so that undesired words or phrases are either unintelligible or inaudible, see abstract.

comparing the at least one additional word with the utterance (match between a current portion of the input speech and the undesired speech, see col. 11 lines 19-21); 
determining, based on comparing the at least one additional word with the utterance, that the at least one additional word does not match the utterance (decision step 210 determines if there is any match between a current portion of the input speech and the undesired speech, see col. 11 lines 19-21); 
and in response to determining that the at least one additional word does not match the utterance, continuing output of the vocal track of the content (If not, a step 212 provides for passing the input speech unaltered to the output stream, see col. 11 lines 21-22). 
It would have been obvious to one of ordinary skill in the art to combine the invention of Tesch in view of Berckhardt with the teachings of Danieli for the benefit of censoring undesired speech from the output stream, see abstract.
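For illustration only, the match decision cited from Danieli (decision step 210, with step 212 passing non-matching speech through unaltered) can be sketched as follows. This is a minimal sketch, not Danieli's implementation: it operates on already-recognized text portions rather than audio, and the function name censor_stream and the "[censored]" marker are hypothetical stand-ins for whatever alteration renders the speech unintelligible or inaudible.

```python
def censor_stream(portions, undesired, replacement="[censored]"):
    """Return an output stream in which undesired portions are altered.

    portions: recognized portions of the input speech, in order.
    undesired: words or phrases treated as undesired speech.
    replacement: assumed marker standing in for the altered speech.
    """
    undesired_set = {u.lower() for u in undesired}
    output = []
    for portion in portions:
        if portion.lower() in undesired_set:
            # Match found: alter the portion before it reaches the output.
            output.append(replacement)
        else:
            # No match: pass the portion to the output stream unaltered.
            output.append(portion)
    return output
```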
Regarding claim 12 Tesch teaches the system of claim 11, wherein the control circuitry is further configured to: subsequent to determining that the at least one first word matches the at least one second word (message components generate a set of words or phrases to be communicated by a client device and/or third party server multimedia message, see par. [0075]; other media content corresponding to the textual word or phrase generated within the message inputs or received by the input component, see par. [0076]).
However Tesch in view of Berckhardt does not teach determining at least one additional word included in a portion of the content that was output; comparing the at least one additional word with the utterance; determining, based on comparing the at least one additional word with the utterance, that the at least one additional word does not match the utterance; and in response to determining that the at least one additional word does not match the utterance, continuing output of the vocal track of the content. 
In a similar field of endeavor Danieli teaches that an input audio data stream comprising speech is processed by an automatic censoring filter in either a real-time mode or a batch mode, producing censored speech that has been altered so that undesired words or phrases are either unintelligible or inaudible, see abstract.
comparing the at least one additional word with the utterance (match between a current portion of the input speech and the undesired speech, see col. 11 lines 19-21); 
determining, based on comparing the at least one additional word with the utterance, that the at least one additional word does not match the utterance (decision step 210 determines if there is any match between a current portion of the input speech and the undesired speech, see col. 11 lines 19-21); 
and in response to determining that the at least one additional word does not match the utterance, continuing output of the vocal track of the content (If not, a step 212 provides for passing the input speech unaltered to the output stream, see col. 11 lines 21-22). 
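It would have been obvious to one of ordinary skill in the art to combine the invention of Tesch in view of Berckhardt with the teachings of Danieli for the benefit of censoring undesired speech from the output stream, see abstract.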
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Pertinent prior art available on form 892.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Michael Ortiz-Sanchez whose telephone number is (571)270-3711.  The examiner can normally be reached Monday-Friday, 9AM-6PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at 571-272-7453.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-






/MICHAEL ORTIZ-SANCHEZ/
Primary Examiner, Art Unit 2656