Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Drawings
The drawings were received on 11/21/2019.  These drawings are accepted.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-5, 7, and 10-13 are rejected under 35 U.S.C. 103 as being unpatentable over Todic (US Publication No. 20110288862) in view of Jansson et al. (Publication Title: Singing Voice Separation With Deep U-Net Convolutional Networks).
Claim 1, Todic discloses 

	receiving polyphonic audio data for a media item (Fig. 1, label audio signal, paragraph 28 discloses “the audio signal may include a speech, a song or musical data, a TV signal, etc. and thus may include spoken or sung words and accompanying instrumental music or background noise and outputs the spoken or sung words (e.g., vocals) to an automated speech recognition (ASR) decoder 104. When the input audio signal is a musical song, the spoken or sung words may correspond to lyrics of the song, for example.” Such disclosure indicates the audio signal or audio data is polyphonic.);
	generating, from the polyphonic audio data, a plurality of samples (Paragraph 35 discloses that the audio signal is processed by extracting feature vectors about every 10 ms.), each sample having a predefined maximum length (Paragraph 35);
	using a natural language model (Fig. 1,2, labels 200) trained to predict text probabilities (Fig. 2, label 216 outputs the confidence scores or probabilities of the words or characters of the lyrics. Fig. 1,2, label dictionary database and HMM database.), generating a probability matrix of textual units for a first portion of a first sample of the plurality of samples (Paragraph 70 discloses a probability metric of line duration given a distribution of durations of all lines in the song or audio signal. Paragraph 35 discloses that the ASR decoder uses an HMM database that statistically describes each phoneme in the feature space to obtain an optimal sequence of words from the phonemes that matches the grammar of the audio signal and the corresponding feature vector. Table 1,2,3 includes further portions of the probability matrix.), 
	wherein the probability matrix includes: 

	timing information (Table 1,2,3 shows the phonetics and words or lyrics include a timing, label start time, end time.), and
	respective probabilities of respective textual units at respective times (Paragraph 70 discloses a probability metric of line duration given a distribution of durations of all lines in the song or audio signal. Paragraph 35 discloses that the ASR decoder uses an HMM database that statistically describes each phoneme in the feature space to obtain an optimal sequence of words from the phonemes that matches the grammar of the audio signal and the corresponding feature vector.); 
	identifying, for the first portion of the first sample, a first sequence of textual units using the generated probability matrix (Table 1,2,3 shows identification of phonetics or first sequence of characters. Table 5 shows confidence line for each line, which indicates a probability metric as per paragraph 70,89. Paragraph 90,104 discloses aligning lyrics and audio using the confidence score or confidence line as outputted by label 720 of Fig. 7. Paragraph 77 discloses “The system 200 may further use speech recognition techniques to map expected textual transcriptions of the audio signal to the audio signal. Alternatively, correct lyrics are received and are taken as the textual transcriptions of the vocal elements in the audio signal (so that speech recognition is not needed to determine the textual transcriptions) and a forced alignment of the lyrics can be performed to the audio signal to generate timing boundary information, for example.” By aligning lyrics with the audio signal, the lyrics or first sequence of characters such as shown in Tables 1,2 where the phonetics are matched to the text as well as audio. In other words, the audio aligned to phonetics indicates a first portion of 
Todic discloses generating lyrics from an audio signal using an audio engine performing voice separation or extraction of the vocal data or data representing spoken utterances of words (paragraph 29) and an ASR decoder (Fig. 1,2), but fails to disclose that the generation of lyrics includes using a neural network.
Jansson et al. discloses using a deep U-Net convolutional neural network model for the purpose of voice separation, that is, obtaining a clean vocal signal, for lyric transcription (Section I discloses estimating what the sung melody and accompaniment would sound like in isolation for lyric transcription.).
Todic discloses an audio engine that performs voice separation or vocal extraction (paragraph 29, Fig. 1,2, label audio engine) and Jansson et al. discloses voice separation for lyric transcription using a neural network (Fig. 1, Section 1,3). Hence, it would have been obvious to one skilled in the art before the effective filing date of the application to modify Todic’s audio engine by incorporating a neural network to perform voice separation for lyric transcription as disclosed by Jansson et al., so as to improve the lyric transcription needed for commercial applications such as karaoke (Section I).
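By way of illustration only, and not as a characterization of what Todic, Jansson et al., or the application actually implements, the claim 1 limitations of generating samples of a predefined maximum length and identifying a sequence of textual units from a probability matrix carrying timing information can be sketched as below. The function names, the 10 ms frame step, the blank symbol "_", the greedy per-frame selection, and all numeric values are assumptions made for this sketch.

```python
from typing import List, Tuple

FRAME_STEP_S = 0.010  # assumed 10 ms between rows of the probability matrix


def split_into_samples(audio: List[float], max_len: int) -> List[List[float]]:
    """Cut the audio into consecutive samples of at most max_len values each."""
    return [audio[i:i + max_len] for i in range(0, len(audio), max_len)]


def greedy_decode(
    prob_matrix: List[List[float]],
    units: List[str],
    frame_step: float = FRAME_STEP_S,
    blank: str = "_",
) -> List[Tuple[str, float, float]]:
    """Pick the most probable textual unit in each frame (row), merge
    consecutive repeats, and drop the blank symbol.

    Returns (unit, start_time_in_seconds, probability) tuples.
    """
    decoded = []
    previous = None
    for t, row in enumerate(prob_matrix):
        best = max(range(len(row)), key=row.__getitem__)
        unit = units[best]
        if unit != previous and unit != blank:
            decoded.append((unit, round(t * frame_step, 3), row[best]))
        previous = unit
    return decoded


# Hypothetical matrix: rows are time frames, columns are textual units.
units = ["_", "h", "i"]
prob_matrix = [
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.90, 0.05, 0.05],
    [0.10, 0.10, 0.80],
]
result = greedy_decode(prob_matrix, units)
text = "".join(unit for unit, _, _ in result)
```

In this sketch each row of the matrix corresponds to one time step, so the row index supplies the timing information and the row values supply the respective probabilities of the respective textual units at that time.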
Claim 2, Todic discloses generating lyrics from an audio signal using an audio engine performing voice separation or extraction of the vocal data or data representing spoken utterances of words (paragraph 29) and an ASR decoder (Fig. 1,2), but fails to disclose that the generation of lyrics uses a neural network trained to predict character probabilities, including:
	convolving the first sample;
	downsampling the first sample to reduce a dimension of the first sample; and
	after downsampling the first sample, upsampling an output of the convolution of the first sample to increase the dimension of the first sample.
	Jansson et al. discloses using a deep U-Net convolutional neural network model for the purpose of voice separation, that is, obtaining a clean vocal signal, for lyric transcription (Section I discloses estimating what the sung melody and accompaniment would sound like in isolation for lyric transcription.), including:
	convolving the first sample (Fig. 1, label conv2D of the downsampling performed in the encoder. Section 3.1.2 discloses downsampling the input audio and encoder layers with 2D convolution, wherein the downsampled input audio includes the first sample.);
	downsampling the first sample to reduce a dimension of the first sample (Section 3.1.2 discloses an audio input. A Short Time Fourier Transform is performed on the audio input in order to output samples and spectrograms, and the first sample of the input audio is downsampled. Fig. 1 shows the neural network, with the encoder on the left side, the decoder on the right side, and the convolutional layer, Conv2D, at the bottom.); and
	after downsampling the first sample, upsampling an output of the convolution of the first sample to increase the dimension of the first sample (Fig. 1, label deconv2D layers as the decoder. Section 3 discloses that the encoding is then decoded to the original size of the image by a stack of upsampling layers.).
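For illustration only, the three operations recited in claim 2 (convolving, downsampling to reduce a dimension, then upsampling to restore it) can be sketched in one dimension as below. This is a deliberate simplification: Jansson et al. is cited above for 2D convolution and deconvolution layers operating on spectrograms, whereas the smoothing kernel, the stride of 2, and the nearest-neighbour upsampling used here are assumptions chosen only to keep the sketch short.

```python
from typing import List


def convolve(signal: List[float], kernel: List[float]) -> List[float]:
    """Same-length 1-D convolution with zero padding.

    Computed as a cross-correlation, which is identical to convolution
    for the symmetric kernel used below. Assumes an odd kernel length.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def downsample(signal: List[float], stride: int = 2) -> List[float]:
    """Keep every stride-th value, reducing the length of the signal."""
    return list(signal[::stride])


def upsample(signal: List[float], factor: int = 2) -> List[float]:
    """Repeat each value factor times, increasing the length of the signal."""
    return [value for value in signal for _ in range(factor)]


# Hypothetical 8-value sample and smoothing kernel.
sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
kernel = [0.25, 0.5, 0.25]

encoded = downsample(convolve(sample, kernel))  # encoder side: length 8 -> 4
decoded = upsample(encoded)                     # decoder side: length 4 -> 8
```

The encoder step halves the length of the sample and the decoder step returns it to the original length, which mirrors the reduce-then-restore structure relied upon in the mapping above.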
	Todic discloses an audio engine that performs voice separation or vocal extraction (paragraph 29, Fig. 1,2, label audio engine) and Jansson et al discloses voice separation for lyric transcription using a neural network (Fig. 1, Section 1,3), hence it would be obvious to one skilled in the art before the effective filing date of the 
Claim 3, Todic discloses receiving, from an external source, lyrics corresponding to the media item (Fig. 1,2, label lyrics text); and using the received lyrics and the probability matrix, aligning textual units in the first sequence of textual units with the received lyrics corresponding to the media item (Table 1,2,3 shows the alignment of the first sequence of characters or phonetics with the received lyrics based on the timing information or probability matrix. Paragraph 35 discloses matching words obtained from phonemes to the grammar of the audio signal and the corresponding feature vector using statistical descriptions of each phoneme.).
Claim 4, Todic discloses determining a set of lyrics based on the first sequence of textual units (Table 1,2,3 shows the determined set of lyrics from the phonemes.); and
	storing the set of lyrics in association with the media item (Paragraph 48 discloses memory for storing computing software that performs the functions of the components of Fig. 1. Fig. 1, label synced lyrics, Table 1,2,3 shows the set of lyrics.).
Claim 5, Todic discloses using a language model and at least a portion of the first sequence of textual units, determine a first word in the first portion of the first sample (Paragraph 35 discloses the use of language model to determine grammar of the audio signal matching words obtained from statistical descriptions of phonemes. Table 1,2,3 shows the words corresponding to the phonetics.); and

Claim 7, Todic discloses the received polyphonic audio data includes an instrumental track and a vocal track (Paragraph 28 discloses “the audio signal may include a speech, a song or musical data, a TV signal, etc. and thus may include spoken or sung words and accompanying instrumental music or background noise and outputs the spoken or sung words (e.g., vocals) to an automated speech recognition (ASR) decoder 104. When the input audio signal is a musical song, the spoken or sung words may correspond to lyrics of the song, for example.”).
Claim 10, Todic discloses identifying, from the first sequence of textual units, one or more keywords associated with the media item (Table 1,2,3 shows the phonetics matching the lyrics, wherein one or more keywords pertaining to the specific song or lyrics of the song is shown in such tables. Paragraph 40 discloses an example of identifying one or more keywords.).
Claim 11, Todic discloses determining whether any of the one or more keywords corresponds to a defined set of words (Table 1,2, paragraph 40 describes matching lyrics to keywords or phonetics of keywords. For example, the phonetics for asleep matches the lyric “As I fell Asleep If Fireflies” (Table 1).); and
	in accordance with a determination that a first keyword of the one or more keywords corresponds to the defined set of words, performing an operation on a portion of the sample that corresponds to the first keyword (Paragraph 41,45, Table 2 discloses an operation is performed on a frame of speech corresponding to the phonemes and words such as keywords.).
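As a non-authoritative sketch of the claim 10 and claim 11 limitations, the following shows one way keywords with timing could be compared against a defined set of words and an operation then performed on the corresponding portion of the sample. The operation chosen here (muting), the function names, the sample rate, and the word timings are all hypothetical; neither Todic nor the claims are asserted to specify them.

```python
from typing import Iterable, List, Tuple

TimedWord = Tuple[str, float, float]  # (word, start time in s, end time in s)


def find_keyword_spans(
    timed_words: List[TimedWord], defined_set: Iterable[str]
) -> List[TimedWord]:
    """Return the timed words that belong to the defined set (case-insensitive)."""
    wanted = {word.lower() for word in defined_set}
    return [entry for entry in timed_words if entry[0].lower() in wanted]


def mute_spans(
    samples: List[float], sample_rate: int, spans: List[TimedWord]
) -> List[float]:
    """Zero out the audio values that fall inside each matched time span."""
    out = list(samples)
    for _, start, end in spans:
        first = max(0, round(start * sample_rate))
        last = min(len(out), round(end * sample_rate))
        for i in range(first, last):
            out[i] = 0.0
    return out


# Hypothetical word timings and a one-second sample at 10 values per second.
timed_words = [
    ("as", 0.0, 0.2),
    ("I", 0.2, 0.3),
    ("fell", 0.3, 0.6),
    ("asleep", 0.6, 1.0),
]
spans = find_keyword_spans(timed_words, {"asleep"})
audio = [1.0] * 10
muted = mute_spans(audio, 10, spans)
```

Only the portion of the sample aligned with the matched keyword is altered; the remainder of the sample is left unchanged.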

Claim 12, Todic discloses
	one or more processors (paragraph 48 discloses a processor.); and 
memory storing instructions for execution by the one or more processors, the instructions including instructions for (paragraph 48):
	receiving polyphonic audio data for a media item (Fig. 1, label audio signal, paragraph 28 discloses “the audio signal may include a speech, a song or musical data, a TV signal, etc. and thus may include spoken or sung words and accompanying instrumental music or background noise and outputs the spoken or sung words (e.g., vocals) to an automated speech recognition (ASR) decoder 104. When the input audio signal is a musical song, the spoken or sung words may correspond to lyrics of the song, for example.” Such disclosure indicates the audio signal or audio data is polyphonic.);
	generating, from the polyphonic audio data, a plurality of samples (Paragraph 35 discloses that the audio signal is processed by extracting feature vectors about every 10 ms.), each sample having a predefined maximum length (Paragraph 35);
	using a natural language model (Fig. 1,2, labels 200) trained to predict character probabilities (Fig. 2, label 216 outputs the confidence scores or probabilities of the words or characters of the lyrics. Fig. 1,2, label dictionary database and HMM database.), generating a probability matrix of characters for a first portion of a first sample of the plurality of samples (Paragraph 70 discloses a probability metric of line duration given a distribution of durations of all lines in the song or audio signal. Paragraph 35 discloses that the ASR decoder uses an HMM database that statistically describes each phoneme in the feature space to obtain an optimal sequence of words from the phonemes that matches the grammar of the audio signal and the corresponding feature vector. Table 1,2,3 includes further portions of the probability matrix.), 

	character information (Table 1,2,3 shows the phonetics of the feature vectors matching the words or characters of the lyrics.),
	timing information (Table 1,2,3 shows the phonetics and words or lyrics include a timing, label start time, end time.), and
	respective probabilities of respective characters at respective times (Paragraph 70 discloses a probability metric of line duration given a distribution of durations of all lines in the song or audio signal. Paragraph 35 discloses that the ASR decoder uses an HMM database that statistically describes each phoneme in the feature space to obtain an optimal sequence of words from the phonemes that matches the grammar of the audio signal and the corresponding feature vector.); 
	identifying, for the first portion of the first sample, a first sequence of characters using the generated probability matrix (Table 1,2,3 shows identification of phonetics or first sequence of characters. Table 5 shows confidence line for each line, which indicates a probability metric as per paragraph 70,89. Paragraph 90,104 discloses aligning lyrics and audio using the confidence score or confidence line as outputted by label 720 of Fig. 7. Paragraph 77 discloses “The system 200 may further use speech recognition techniques to map expected textual transcriptions of the audio signal to the audio signal. Alternatively, correct lyrics are received and are taken as the textual transcriptions of the vocal elements in the audio signal (so that speech recognition is not needed to determine the textual transcriptions) and a forced alignment of the lyrics can be performed to the audio signal to generate timing boundary information, for example.” By aligning lyrics with the audio signal, the lyrics or first sequence of characters such as shown in Tables 1,2 where the phonetics are matched to the text as well as audio. In 
Todic discloses generating lyrics from an audio signal using an audio engine performing voice separation or extraction of the vocal data or data representing spoken utterances of words (paragraph 29) and an ASR decoder (Fig. 1,2), but fails to disclose that the generation of lyrics includes using a neural network.
Jansson et al. discloses using a deep U-Net convolutional neural network model for the purpose of voice separation, that is, obtaining a clean vocal signal, for lyric transcription (Section I discloses estimating what the sung melody and accompaniment would sound like in isolation for lyric transcription.).
Todic discloses an audio engine that performs voice separation or vocal extraction (paragraph 29, Fig. 1,2, label audio engine) and Jansson et al. discloses voice separation for lyric transcription using a neural network (Fig. 1, Section 1,3). Hence, it would have been obvious to one skilled in the art before the effective filing date of the application to modify Todic’s audio engine by incorporating a neural network to perform voice separation for lyric transcription as disclosed by Jansson et al., so as to improve the lyric transcription needed for commercial applications such as karaoke (Section I).
Claim 13, Todic discloses 
	receive polyphonic audio data for a media item (Fig. 1, label audio signal, paragraph 28 discloses “the audio signal may include a speech, a song or musical data, a TV signal, etc. and thus may include spoken or sung words and accompanying instrumental music or background noise and outputs the spoken or sung words (e.g., 
	generate, from the polyphonic audio data, a plurality of samples (Paragraph 35 discloses that the audio signal is processed by extracting feature vectors about every 10 ms.), each sample having a predefined maximum length (Paragraph 35);
	using a natural language model (Fig. 1,2, labels 200) trained to predict character probabilities (Fig. 2, label 216 outputs the confidence scores or probabilities of the words or characters of the lyrics. Fig. 1,2, label dictionary database and HMM database.), generating a probability matrix of characters for a first portion of a first sample of the plurality of samples (Paragraph 70 discloses a probability metric of line duration given a distribution of durations of all lines in the song or audio signal. Paragraph 35 discloses that the ASR decoder uses an HMM database that statistically describes each phoneme in the feature space to obtain an optimal sequence of words from the phonemes that matches the grammar of the audio signal and the corresponding feature vector. Table 1,2,3 includes further portions of the probability matrix.), 
	wherein the probability matrix includes: 
	character information (Table 1,2,3 shows the phonetics of the feature vectors matching the words or characters of the lyrics.),
	timing information (Table 1,2,3 shows the phonetics and words or lyrics include a timing, label start time, end time.), and
	respective probabilities of respective characters at respective times (Paragraph 70 discloses a probability metric of line duration given a distribution of durations of all lines in the song or audio signal. Paragraph 35 discloses that the ASR decoder uses an HMM 
	identifying, for the first portion of the first sample, a first sequence of characters using the generated probability matrix (Table 1,2,3 shows identification of phonetics or first sequence of characters. Table 5 shows confidence line for each line, which indicates a probability metric as per paragraph 70,89. Paragraph 90,104 discloses aligning lyrics and audio using the confidence score or confidence line as outputted by label 720 of Fig. 7. Paragraph 77 discloses “The system 200 may further use speech recognition techniques to map expected textual transcriptions of the audio signal to the audio signal. Alternatively, correct lyrics are received and are taken as the textual transcriptions of the vocal elements in the audio signal (so that speech recognition is not needed to determine the textual transcriptions) and a forced alignment of the lyrics can be performed to the audio signal to generate timing boundary information, for example.” By aligning lyrics with the audio signal, the lyrics or first sequence of characters such as shown in Tables 1,2 where the phonetics are matched to the text as well as audio. In other words, the audio aligned to phonetics indicates a first portion of the first sample of the plurality of samples, wherein such is also aligned to text or lyrics or a sequence of characters.). 
Todic discloses generating lyrics from an audio signal using an audio engine performing voice separation or extraction of the vocal data or data representing spoken utterances of words (paragraph 29) and an ASR decoder (Fig. 1,2), but fails to disclose that the generation of lyrics includes using a neural network.
Jansson et al. discloses using a deep U-Net convolutional neural network model for the purpose of voice separation, that is, obtaining a clean vocal signal, for lyric transcription (Section I discloses estimating what the sung melody and accompaniment would sound like in isolation for lyric transcription.).
Todic discloses an audio engine that performs voice separation or vocal extraction (paragraph 29, Fig. 1,2, label audio engine) and Jansson et al. discloses voice separation for lyric transcription using a neural network (Fig. 1, Section 1,3). Hence, it would have been obvious to one skilled in the art before the effective filing date of the application to modify Todic’s audio engine by incorporating a neural network to perform voice separation for lyric transcription as disclosed by Jansson et al., so as to improve the lyric transcription needed for commercial applications such as karaoke (Section I).

Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.
Claims 1-6,9-13 are provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1,3,4,5,6,9,10,11,12,13 of copending application No. 16569372 (reference application). Although the claims at issue are not identical, they are not patentably distinct from each other because the claims of the .  
This is a provisional nonstatutory double patenting rejection because the patentably indistinct claims have not in fact been patented.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LINDA WONG whose telephone number is (571)272-6044.  The examiner can normally be reached on 9-5.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at (571) 272-7453.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/LINDA WONG/
Primary Examiner, Art Unit 2656