DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-20 are pending and have been examined.
Priority
Acknowledgment is made of applicant’s claim for foreign priority under 35 U.S.C. 119 (a)-(d). Receipt is acknowledged of some of the certified copies of papers required by 37 CFR 1.55.
Applicant cannot rely upon the certified copy of the foreign priority application to overcome this rejection because a translation of said application has not been made of record in accordance with 37 CFR 1.55. See MPEP §§ 215 and 216.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 04/09/2020 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.
Drawings
The drawings are objected to because of the following informalities: In Fig. 5, elements 511-16, 521-26, and 531-33 are not recited in the as-filed specification.
Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. The figure or figure 
Claim Objections
Claims 8, 9, and 20 are objected to because of the following informalities:  
Claim 8:
The recitation of “the preset sentence text information” is lacking in antecedent basis, as no such feature is recited in independent claim 1. 
The recitation of “the to-be-replaced sentences and preset audio sample data”, where the limitation is read as “the...preset audio sample”, lacks antecedent basis for “the...preset audio sample”, as no such feature is recited in independent claim 1.
Claims 9 and 20: The claims recite “the client”, which lacks antecedent basis in independent claims 1 and 16, respectively.
Appropriate correction is required.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Regarding claim(s) 1, 11, and 16, the limitation(s) of “extracting”, “recognizing”, “determining”, and “obtaining”, as drafted, are processes that, under the broadest reasonable interpretation, cover performance of the limitations in the mind but for the recitation of generic computer components. More specifically, the limitations read on the mental process of a human listening to audio from a movie and writing down the lines that are heard, identifying which lines were said by a particular character, marking the time in the score at which the lines were said, and reciting all of the movie lines aloud, imitating the original actors except for the character that needs to be changed, for which a new voice is used. The final limitation recites the rules used for identifying different audio segments, which can be applied by a human. The specification at [0004] further states that the method is currently performed manually in the post-processing of a movie. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.

The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional element of using generalized computer components to extract, recognize, determine, and obtain amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claims are not patent eligible.
	
	
With respect to claim(s) 2 and 17, the claim(s) recite(s) “receiving” and “separating”, which reads on a human being asked to change a voice of a character and listening to the movie to write down the lines of the movie in response. No additional limitations are present.

With respect to claim(s) 3, 12, and 18, the claim(s) recite(s) “extracting” and “recognizing”, which reads on a human identifying a specific voice feature of a 

With respect to claim(s) 4 and 19, the claim(s) recite(s) “building”, “inputting”, “training”, and “determining”, which reads on a human writing out the equations for determining whether the audio is or is not related to a particular character, making the calculations and adjusting the equations based on the results to improve the set of equations, and making the final calculation to determine whether the audio is or is not related to the character. No additional limitations are present.

With respect to claim(s) 5 and 13, the claim(s) recite(s) “extracting” and “determining”, which reads on a human identifying which lines to evaluate further based on how long the line is, and identifying time information associated with the line, such as a start time. No additional limitations are present.

With respect to claim(s) 6, the claim(s) recite(s) “extracting” and “correcting”, which reads on a human recognizing different timing information based on a longer line, and writing down new timing information based on the different time recognized. No additional limitations are present.

With respect to claim(s) 7, the claim(s) recite(s) “presetting” and “correcting”, which reads on a human reading transcripts of the audio that includes timing information 

With respect to claim(s) 8, the claim(s) recite(s) “determining” and “generating”, which reads on a human identifying, using the transcript and timing information, which lines need to have the voice changed and speaking the lines out loud using a different voice. No additional limitations are present.

With respect to claim(s) 9, 15, and 20, the claim(s) recite(s) “receiving” and “adjusting”, which reads on a human being asked by a person to change the voice of a character and speaking the character’s lines with a new voice matching the request. No additional limitations are present.

With respect to claim(s) 10, the claim(s) recite(s) “replacing”, which reads on a human reading all of the lines of the video out loud, and using the new voice for the changed character at the time the character is supposed to speak. No additional limitations are present.

With respect to claim(s) 14, the claim(s) recite(s) “presetting”, “extracting”, and “correcting”, which reads on a human reading transcripts of the audio that includes timing information, recognizing other timing information based on how long certain lines are, and adjusting the time written down based on what the human heard using the time 

These claims further do not integrate the judicial exception into a practical application, and they fail to include additional elements that are sufficient to amount to significantly more than the judicial exception.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1-3, 5-18, and 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Gabryjelski et al. (U.S. PG Pub No. 2020/0058289), hereinafter Gabryjelski, in view of Rossano et al. (U.S. PG Pub No. 2016/0021334), hereinafter Rossano, and further in view of Tsang et al. (U.S. Patent No. 8731905), hereinafter Tsang.

(claim 1) An audio file processing method for an electronic device (a method for automatic dubbing of media content, i.e. audio file processing method [0004]), comprising:
(claim 11) An electronic device (an automatic dubbing apparatus, such as a computer system [0005-6]), comprising:
(claim 11) a processor (one or more processors [0006]); and
(claim 11) a memory, the memory storing computer instructions executable by the processor, wherein the processor is configured to execute the computer instruction to perform (a memory storing computer-executable instruction, i.e. memory storing computer instructions, that, when executed, cause a processor, i.e. executable by the processor [0006]):
(claim 16) A non-transitory computer-readable storage medium storing computer program instructions executable by at least one processor to perform (a non-transitory computer-readable medium having instructions, i.e. non-transitory computer-readable storage medium storing…instructions [0007], where the instructions of the memory on the computer system, i.e. computer program instructions, cause a processor to perform processes, i.e. executable by at least one processor to perform [0006]):

extracting at least one audio segment from a first audio file (the audio processing module may extract speeches, i.e. extracting at least one audio segment, from an audio portion of the media content, i.e. a first audio file [0032]);
recognizing at least one to-be-replaced audio segment representing a target role from the at least one audio segment (an original voice or original language of a character, i.e. target role, may be changed so as to use a different voice print or different language in the character’s own voice for dubbing the audio portion of the media content, i.e. to-be-replaced audio segment…from the at least one audio segment [0026:1-5],[0028], where the speeches of a particular voice may be extracted, i.e. recognizing…a target role [0032]), and determining time frame information … in the first audio file (the media content contains metadata that provides location information of the audio portion and the visual portion so that both can be synchronized, i.e. determining time frame information…in the first audio file [0035]); and
obtaining to-be-dubbed audio data for each to-be-replaced audio segment, and replacing data in the to-be-replaced audio segment with the to-be-dubbed audio data according to the time frame information, to obtain a second audio file (the extracted speeches, i.e. each to-be-replaced audio segment, are processed to generate replacement speeches, i.e. obtaining to-be-dubbed audio data, where the extracted speeches are replaced, i.e. replacing data in the to-be-replaced audio segment, with the generated speeches, i.e. with the to-be-dubbed audio data, in the audio portion of the media content to obtain a dubbed audio of the media content, i.e. to obtain a second audio file [0032], where the media content contains metadata that provides location information of the audio portion and the visual portion so that both can be synchronized, i.e. according to the time frame information [0035])....  
While Gabryjelski provides the extraction and replacement of speech, Gabryjelski does not specifically teach the detection of time information for each 
determining time frame information of each to-be-replaced audio segment in the first audio file; and
replacing data in the to-be-replaced audio segment with the to-be-dubbed audio data according to the time frame information…;
wherein the at least one to-be-replaced audio segment is divided from the at least one audio segment based on a structure and a word count in a sentence corresponding to each to-be-replaced audio segment.
Rossano, however, teaches determining time frame information of each to-be-replaced audio segment in the first audio file (the prosody analysis unit compares the original audio, i.e. first audio file, with the generated baseline voice, to identify the exact speech beginning timing and speed of the sound segment, i.e. determining time frame information of each to-be-replaced audio segment [0052]); and
replacing data in the to-be-replaced audio segment with the to-be-dubbed audio data according to the time frame information… (the dubbing unit ‘speaks’ the local language using TTS, using the relevant voice, on top of the video’s audio, i.e. replacing data in the to-be-replaced audio segment with the to-be-dubbed audio data, where additional adjustments are performed to comply with the given timing of the original audio, such as stretching or shrinking the dubbed speech audio, i.e. according to the time frame information [0053-5]);
wherein the at least one to-be-replaced audio segment is divided from the at least one audio segment based on a structure and a word count in a sentence corresponding to each to-be-replaced audio segment (a speech-to-text module can be run to generate text from the video sound track [0060], and a text analysis unit identifies the next subtitle text, which can be one or more lines of text, i.e. based on a structure, where the text is a sentence, i.e. a sentence corresponding to each to-be-replaced audio segment, to be translated into a target language, and then passed to the TTS generation unit, i.e. one to-be-replaced audio segment is divided from the at least one audio segment based on a structure [0042],[0050], and the gaps between words in the sentence that is output by the TTS are also recognized, i.e. word...in a sentence [0053]).
Gabryjelski and Rossano are analogous art because they are from a similar field of endeavor in dubbing audiovisual content. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the extraction and replacement of speech teachings of Gabryjelski with the determination of timing for each sound segment and the identification of subtitle text and sentences as taught by Rossano. The motivation to do so would have been to achieve a predictable result of enabling the comparison of the audio of an original sentence to the synthesized version of the sentence (Rossano [0052]).
While Gabryjelski in view of Rossano provides the recognition of words in a sentence, Gabryjelski in view of Rossano does not specifically teach the recognition of the word count, and thus does not teach
a word count in a sentence....
Tsang, however, teaches a word count in a sentence...(parsing a sentence includes break points at the beginning and end of the sentence, and a recognition of .
Gabryjelski, Rossano, and Tsang are analogous art because they are from a similar field of endeavor in processing language for use with synthesized speech. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the recognition of words in a sentence teachings of Gabryjelski, as modified by Rossano, with the specific counting of the words as taught by Tsang. The motivation to do so would have been to substitute similar elements to achieve a predictable result of enabling the highlighting of text for a reader at the same time the text is spoken aloud (Tsang (2:24-29)).

Regarding claims 2 and 17, Gabryjelski in view of Rossano and Tsang teaches claims 1 and 16, and Gabryjelski further teaches 
receiving a voice replacement request for the target role in a source video file from a client (the user may click an option, i.e. receiving...from a client, to select a voice to customize the voice of a character, i.e. a voice replacement request for the target role [0025:1-5], in a film or game, where the voice may be changed so as to use a different voice print or different language in the character’s own voice for dubbing the audio portion of the media content, i.e. source video content [0026:1-5],[0028]); and
separating the first audio file from the source video file according to the voice replacement request (a bit stream of the media content for which the user wants to customize a voice, i.e. according to the voice replacement request, may be .  

Regarding claims 3, 12, and 18, Gabryjelski in view of Rossano and Tsang teaches claims 1, 11, and 16, and Gabryjelski further teaches
extracting an audio feature of each audio segment (voice characteristics of the speeches, i.e. each audio segment, including parameters such as spectrum, pitch, and tone, are determined by voice analysis, i.e. extracting an audio feature [0048]); and
recognizing, according to the audio feature, the at least one to-be-replaced audio segment representing the target role from the at least one audio segment (the voice characteristics may be used, i.e. according to the audio feature, to cluster speeches to be associated with particular speakers, such as a particular actor, i.e. recognizing... the target role from the at least one audio segment, and speeches of a specific voice are extracted for customization and replacement, i.e. the at least one to-be-replaced audio segment representing the target role [0032],[0047-8]).

Regarding claims 5 and 13, Gabryjelski in view of Rossano and Tsang teaches claims 1 and 11, and Rossano further teaches
extracting the at least one audio segment from the first audio file based on a ... sentence division principle, and determining first candidate time frame information of each audio segment (a speech-to-text module can be run to generate text from the video sound track, i.e. audio segment from the first audio file [0060], and a ; and
 the determining time frame information of each to-be-replaced audio segment in the first audio file comprises:
determining the time frame information according to the first candidate time frame information (the prosody analysis unit compares the original audio with the generated baseline voice to identify the exact speech beginning timing and speed of the sound segment, i.e. determining the time frame information, and the post-processing unit identifies the length of gaps and words in a sound segment, i.e. according to first candidate time frame information [0052-3]).
And Tsang further teaches a short sentence division principle (the document can be parsed into phrasal segments, i.e. a short sentence division principle, based on break points within a sentence (2:37-56),(3:11-21)).
Where the motivation to combine is the same as previously presented.


extracting, from the first audio file, second candidate time frame information based on a long sentence (a speech-to-text module can be run to generate text from the video sound track, i.e. from the first audio file [0060], and a text analysis unit identifies the next subtitle text, which can be one or more lines of text where the text is a sentence, i.e. a long sentence, and the sentence will be translated into a target language and then passed to the TTS generation unit, i.e. extracting [0042],[0050], where the prosody analysis unit compares the original audio with the generated baseline voice to identify the exact speech beginning timing and speed of the sound segment, i.e. second candidate time frame information, and the post-processing unit identifies the length of gaps and words in a sound segment, i.e. determining first candidate time frame information of each audio segment [0052]), 
wherein the determining the time frame information according to the first candidate time frame information comprises:
correcting the first candidate time frame information according to the second candidate time frame information, and determining the time frame information (the prosody analysis unit compares the original audio with the generated baseline voice to identify the exact speech beginning timing and speed of the sound segment, i.e. according to the second candidate time frame information, determining the time frame information, and the post-processing unit identifies adjustments that should be made to the length of gaps in a sound segment in order to comply with the given timing, i.e. correcting the first candidate time frame information [0052-5]).  


Regarding claim 7, Gabryjelski in view of Rossano and Tsang teaches claim 5, and Rossano further teaches
presetting sentence text information comprising third candidate time frame information (the length of the gaps between words is recognized in addition to the length of, i.e. third candidate time frame information, the actual said words, i.e. presetting text information [0057]), 
wherein the determining the time frame information according to the first candidate time frame information comprises:
correcting the first candidate time frame information according to the third candidate time frame information, and determining the time frame information (the prosody analysis unit compares the original audio with the generated baseline voice to identify the exact speech beginning timing and speed of the sound segment, i.e. determining the time frame information [0052], and the post-processing unit identifies the length of the gaps between words in addition to the length of, i.e. third candidate time frame information, the actual said words, and the unit determines the scale by which the length of gaps should be adjusted, which differs from that applied to the length of words, in a sound segment in order to comply with the given timing, i.e. correcting the first candidate time frame information according to the third candidate time frame information [0055-7]).
Where the motivation to combine is the same as previously presented.


determining to-be-replaced sentences corresponding to the to-be-replaced audio segment from the preset sentence text information according to the duration (the speeches, which are extracted speeches of a particular voice to be replaced, i.e. corresponding to the to-be-replaced audio segment [0031], may be converted into texts using a STT module, i.e. determining to-be-replaced sentences, where the characteristics such as speech speed may also be detected at the STT module, i.e. preset sentence text information according to the duration [0057]); and
generating the to-be-dubbed audio data according to the to-be-replaced sentences and preset audio sample data (replacement speeches are generated, such as by a TTS module, i.e. generating the to-be-dubbed audio data, utilizing the extracted speeches, where a STT module converts the speeches into text and identifies speech characteristics, which are used by the TTS module, i.e. according to the to-be-replaced sentences and preset audio sample data [0032],[0063],[0066]).

Regarding claims 9, 15, and 20, Gabryjelski in view of Rossano and Tsang teaches claims 1, 11, and 16, and Gabryjelski further teaches
receiving a processing request for an audio effect from the client (the user may click an option to select a voice to customize the voice of a character, i.e. receiving a processing request [0025:1-5], in a film or game, where the voice may be changed so as to use a different voice print or different language in the character’s own voice for ; and
adjusting the audio effect of the audio data according to the processing request (the extracted speeches are processed to generate replacement speeches, i.e. adjusting the audio effect of the audio data, using a voice print model chosen by the user, i.e. according to the processing request [0026],[0032]).  

Regarding claim 10, Gabryjelski in view of Rossano and Tsang teaches claim 9, and Gabryjelski further teaches
replacing the data in the to-be-replaced audio segment with the adjusted audio data according to the time frame information (the extracted speeches of the voice, i.e. to-be-replaced audio segment, are replaced with the generated replacement speeches, i.e. replacing the data...with the adjusted audio data [0032], where the metadata provides location information of the audio portion and the visual portion so that both can be synchronized, i.e. according to the time frame information [0035]).  

Regarding claim 14, Gabryjelski in view of Rossano and Tsang teaches claim 13, and Rossano further teaches 
presetting sentence text information comprising third candidate time frame information (the length of the gaps between words is recognized in addition to the length of, i.e. third candidate time frame information, the actual said words, i.e. presetting text information [0057]);
extracting, from the first audio file, second candidate time frame information based on a long sentence (a speech-to-text module can be run to generate text from the video sound track, i.e. from the first audio file [0060], and a text analysis unit identifies the next subtitle text, which can be one or more lines of text where the text is a sentence, i.e. a long sentence, and the sentence will be translated into a target language and then passed to the TTS generation unit, i.e. extracting [0042],[0050], where the prosody analysis unit compares the original audio with the generated baseline voice to identify the exact speech beginning timing and speed of the sound segment, i.e. second candidate time frame information, and the post-processing unit identifies the length of gaps and words in a sound segment, i.e. determining first candidate time frame information of each audio segment [0052]); and
 correcting the first candidate time frame information according to the second candidate time frame information and the third candidate time frame information, and determining the time frame information (the prosody analysis unit compares the original audio with the generated baseline voice to identify the exact speech beginning timing and speed of the sound segment, i.e. according to the second candidate time frame information, determining the time frame information, and the post-processing unit identifies adjustments that should be made to the length of gaps and length of actual said words, i.e. according to ...the third candidate time frame information, in a sound segment in order to comply with the given timing, i.e. correcting the first candidate time frame information [0052-7]).  
Where the motivation to combine is the same as previously presented.

Claim(s) 4 and 19 is/are rejected under 35 U.S.C. 103 as being unpatentable over Gabryjelski, in view of Rossano, in view of Tsang, and further in view of Das (U.S. PG Pub No. 2009/0006093), hereinafter Das.

Regarding claims 4 and 19, Gabryjelski in view of Rossano and Tsang teaches claims 3 and 18, and Gabryjelski further teaches
building a binary classification model based on the target role (speeches matching the voice characteristic data of a particular actor or the characteristic parameters of a specific voice, i.e. based on the target role, are classified to be associated with the actor or a voice, i.e. building a binary classification model, and speeches associated with speakers or voices are clustered [0047-8]); and
inputting each audio segment and the audio feature of the audio segment into the binary classification model, ... and determining the at least one to-be replaced audio segment according to a ... result (speeches, i.e. each audio segment, are analyzed, i.e. inputting, using voice characteristic parameters such as spectrum, pitch, and tone, i.e. audio feature, and the characteristic parameters of a specific voice are classified to be associated with the actor or a voice, i.e. binary classification model, and speeches associated with speakers or voices are clustered [0047-8], where the speeches of a voice are extracted for further processing and replacement, i.e. determining the at least one to-be replaced audio segment according to a ... result [0032]).  
While Gabryjelski in view of Rossano and Tsang provides the use of binary analysis to determine whether or not a speech is associated with a particular voice, 
inputting each audio segment and the audio feature of the audio segment into the binary classification model, performing training based on a logistic regression algorithm, and determining the at least one ... audio segment according to a training result.  
Das, however, teaches inputting each audio segment and the audio feature of the audio segment into the binary classification model, performing training based on a logistic regression algorithm, and determining the at least one ... audio segment according to a training result (the system includes a training phase, i.e. performing training, where the system collects voice samples and feature vectors of each voice sample, i.e. inputting each audio segment and the audio feature of the audio segment, for classifying whether the sample is associated with a particular speaker, i.e. binary classification model [0020-0022], where the system can learn the weights of the classifiers, i.e. perform training based on, using linear regression techniques, i.e. logistic regression algorithm, and when the classifier score is low enough based on the learned criteria and training data, it is considered a reliable indication that the person is the purported speaker, i.e. determining the at least one ... audio segment according to a training result [0027]).
Gabryjelski, Rossano, Tsang, and Das are analogous art because they are from a similar field of endeavor in processing speech information. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the use of binary analysis to determine whether or not a speech is .
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: 
McCoy et al. (U.S. PG Pub No. 2015/0199978): Dubbing video content with audio content and matching movement to audio.
Liu et al. (U.S. PG Pub No. 2009/0037179): Using TTS to mimic the voice of a particular speaker.
	
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NICOLE A K SCHMIEDER whose telephone number is (571)270-1474. The examiner can normally be reached 8:00 - 5:00 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir, can be reached on (571) 272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.






/NICOLE A K SCHMIEDER/Examiner, Art Unit 2659      

/PIERRE LOUIS DESIR/Supervisory Patent Examiner, Art Unit 2659