Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Drawings
The drawings were received on 9/12/2019.  These drawings are accepted.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3-5, 7-8, and 10-13 are rejected under 35 U.S.C. 103 as being unpatentable over Todic (US Publication No.: 20110288862) in view of Jansson et al (Publication Title: Singing Voice Separation With Deep U-Net Convolutional Networks).
Claim 1, Todic discloses 

	receiving audio data for a media item (Fig. 1, label audio signal, paragraph 28);
	generating, from the audio data, a plurality of samples (Paragraph 35 discloses the audio signal is suppressed by extracting feature vectors about every 10 ms.), each sample having predefined maximum length (Paragraph 35);
	using a natural language model (Fig. 1,2, labels 200) trained to predict character probabilities (Fig. 2, label 216 outputs the confidence scores or probabilities of the words or characters of the lyrics. Fig. 1,2, label dictionary database and HMM database.), generating a probability matrix of characters for a first portion of a first sample of the plurality of samples (Paragraph 70 discloses a probability metric of line duration given a distribution of durations of all lines in the song or audio signal. Paragraph 35 discloses that the ASR decoder uses an HMM database that statistically describes each phoneme in the feature spaces to obtain an optimal sequence of words from the phonemes that matches the grammar of the audio signal and corresponding feature vector. Table 1,2,3 includes further portions of the probability matrix.), 
	wherein the probability matrix includes: 
	character information (Table 1,2,3 shows the phonetics of the feature vectors matching the words or characters of the lyrics.),
	timing information (Table 1,2,3 shows the phonetics and words or lyrics include a timing, label start time, end time.), and
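	respective probabilities of respective characters at respective times (Paragraph 70 discloses a probability metric of line duration given a distribution of durations of all lines in the song or audio signal. Paragraph 35 discloses that the ASR decoder uses an HMM database that statistically describes each phoneme in the feature spaces to obtain an optimal sequence of words from the phonemes that matches the grammar of the audio signal and corresponding feature vector.);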
	identifying, for the first portion of the first sample, a first sequence of characters based on the generated probability matrix (Table 1,2,3 shows identification of phonetics or first sequence of characters based on the ). 
Todic discloses generating lyrics from an audio signal using an audio engine that performs voice separation, or extraction of the vocal data or data representing spoken utterances of words (paragraph 29), and an ASR decoder (Fig. 1,2), but fails to disclose that the generation of lyrics includes using a neural network trained to predict character probabilities, including:
	downsampling the first sample to reduce a dimension of the first sample;
	convolving an output of the downsampling of the first sample; and
	upsampling an output of the convolution of the first sample to increase the dimension of the first sample.
	Jansson et al discloses using a deep U-Net convolutional neural network model for the purpose of voice separation, or extraction of a clean vocal signal, for lyric transcription (Section I discloses estimating what the sung melody and accompaniment would sound like in isolation for lyric transcription.), including 
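downsampling the first sample to reduce a dimension of the first sample (Section 3.1.2 discloses an audio input. A Short Time Fourier Transform is performed on the audio input in order to output samples and spectrograms. Downsampling is performed on the first sample of the input audio. Fig. 1 shows the neural network, with the encoder on the left side, decoder on the right side and convolutional layer at the bottom, Conv2D.);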
	convolving an output of the downsampling of the first sample (Fig. 1, label conv2D of the downsampling performed in the encoder. Section 3.1.2 discloses downsampling the input audio and encoder layers with 2D convolution.); and
	upsampling an output of the convolution of the first sample to increase the dimension of the first sample (Fig. 1, label deconv2D layers as the decoder. Section 3 discloses encoding is then decoded to original size of the image by a stack of upsampling layers.);
	Todic discloses an audio engine that performs voice separation or vocal extraction (paragraph 29, Fig. 1,2, label audio engine) and Jansson et al discloses voice separation for lyric transcription using a neural network (Fig. 1, Section 1,3). Hence, it would have been obvious to one skilled in the art before the effective filing date of the application to modify Todic’s audio engine by incorporating a neural network to perform voice separation for lyric transcription, as disclosed by Jansson et al, so as to improve lyric transcription needed for commercial applications such as karaoke (Section I).
Claim 3, Todic discloses receiving, from an external source, lyrics corresponding to the media item (Fig. 1,2, label lyrics text); and using the received lyrics and the probability matrix, aligning characters in the first sequence of characters with the received lyrics corresponding to the media item (Table 1,2,3 shows the alignment of the first sequence of characters or phonetics with the received lyrics based on the timing information or probability matrix. Paragraph 35 discloses words obtained from 
Claim 4, Todic discloses determining a set of lyrics based on the first sequence of characters (Table 1,2,3 shows the determined set of lyrics from the phonemes.); and
	storing the set of lyrics in association with the media item (Paragraph 48 discloses memory for storing computing software that performs the functions of the components of Fig. 1. Fig. 1, label synced lyrics, Table 1,2,3 shows the set of lyrics.).
	Claim 5, Todic discloses using a language model and at least a portion of the first sequence of characters, determine a first word in the first portion of the first sample (Paragraph 35 discloses the use of language model to determine grammar of the audio signal matching words obtained from statistical descriptions of phonemes. Table 1,2,3 shows the words corresponding to the phonetics.); and
	determining, using the timing information that corresponds to the first portion of the first sample, a time that corresponds to the first word (Table 1,2,3, label start time, end time. Fig. 5 shows the time alignment of the lyrics to audio. (paragraph 77)).
	Claim 7, Todic discloses the received audio data includes an extracted vocal track that has been separated from a media content item (Paragraph 29 discloses extraction of vocal data or vocal track from the audio signal.).
	Claim 8, Todic discloses the received audio data is polyphonic media content item (Paragraph 28 discloses the audio signal can include instrumental music, background noise and spoken or sung words.).
	Claim 10, Todic discloses identifying, from the first sequence of characters, one or more keywords associated with the media item (Table 1,2,3 shows the phonetics matching the lyrics, wherein one or more keywords pertaining to the specific song or 
	Claim 11, Todic discloses determining whether any of the one or more keywords corresponds to a defined set of words (Table 1,2, paragraph 40 describes matching lyrics to keywords or phonetics of keywords. For example, the phonetics for asleep matches the lyric “As I fell Asleep If Fireflies” (Table 1).); and
	in accordance with a determination that a first keyword of the one or more keywords corresponds to the defined set of words, performing an operation on a portion of the sample that corresponds to the first keyword (Paragraph 41,45, Table 2 discloses an operation is performed on a frame of speech corresponding to the phonemes and words such as keywords.).
Claim 12, Todic discloses 
	one or more processors (paragraph 48 discloses a processor.); and 
memory storing instructions for execution by the one or more processors, the instructions including instructions for (paragraph 48):
	receiving audio data for a media item (Fig. 1, label audio signal, paragraph 28);
	generating, from the audio data, a plurality of samples (Paragraph 35 discloses the audio signal is suppressed by extracting feature vectors about every 10 ms.), each sample having predefined maximum length (Paragraph 35);
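	using a natural language model (Fig. 1,2, labels 200) trained to predict character probabilities (Fig. 2, label 216 outputs the confidence scores or probabilities of the words or characters of the lyrics. Fig. 1,2, label dictionary database and HMM database.), generating a probability matrix of characters for a first portion of a first sample of the plurality of samples (Paragraph 70 discloses a probability metric of line duration given a distribution of durations of all lines in the song or audio signal. Paragraph 35 discloses that the ASR decoder uses an HMM database that statistically describes each phoneme in the feature spaces to obtain an optimal sequence of words from the phonemes that matches the grammar of the audio signal and corresponding feature vector. Table 1,2,3 includes further portions of the probability matrix.), 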
	wherein the probability matrix includes: 
	character information (Table 1,2,3 shows the phonetics of the feature vectors matching the words or characters of the lyrics.),
	timing information (Table 1,2,3 shows the phonetics and words or lyrics include a timing, label start time, end time.), and
	respective probabilities of respective characters at respective times (Paragraph 70 discloses a probability metric of line duration given a distribution of durations of all lines in the song or audio signal. Paragraph 35 discloses that the ASR decoder uses an HMM database that statistically describes each phoneme in the feature spaces to obtain an optimal sequence of words from the phonemes that matches the grammar of the audio signal and corresponding feature vector.); 
	identifying, for the first portion of the first sample, a first sequence of characters based on the generated probability matrix (Table 1,2,3 shows identification of phonetics or first sequence of characters based on the ). 
Todic discloses generating lyrics from an audio signal using an audio engine that performs voice separation, or extraction of the vocal data or data representing spoken utterances of words (paragraph 29), and an ASR decoder (Fig. 1,2), but fails to disclose that the generation of lyrics includes using a neural network trained to predict character probabilities, including:
	downsampling the first sample to reduce a dimension of the first sample;
	convolving an output of the downsampling of the first sample; and
	upsampling an output of the convolution of the first sample to increase the dimension of the first sample.
	Jansson et al discloses using a deep U-Net convolutional neural network model for the purpose of voice separation, or extraction of a clean vocal signal, for lyric transcription (Section I discloses estimating what the sung melody and accompaniment would sound like in isolation for lyric transcription.), including 
downsampling the first sample to reduce a dimension of the first sample (Section 3.1.2 discloses an audio input. A Short Time Fourier Transform is performed on the audio input in order to output samples and spectrograms. Downsampling is performed on the first sample of the input audio. Fig. 1 shows the neural network, with the encoder on the left side, decoder on the right side and convolutional layer at the bottom, Conv2D.);
	convolving an output of the downsampling of the first sample (Fig. 1, label conv2D of the downsampling performed in the encoder. Section 3.1.2 discloses downsampling the input audio and encoder layers with 2D convolution.); and
	upsampling an output of the convolution of the first sample to increase the dimension of the first sample (Fig. 1, label deconv2D layers as the decoder. Section 3 discloses encoding is then decoded to original size of the image by a stack of upsampling layers.);
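	Todic discloses an audio engine that performs voice separation or vocal extraction (paragraph 29, Fig. 1,2, label audio engine) and Jansson et al discloses voice separation for lyric transcription using a neural network (Fig. 1, Section 1,3). Hence, it would have been obvious to one skilled in the art before the effective filing date of the application to modify Todic’s audio engine by incorporating a neural network to perform voice separation for lyric transcription, as disclosed by Jansson et al, so as to improve lyric transcription needed for commercial applications such as karaoke (Section I).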
Claim 13, Todic discloses 
	receive audio data for a media item (Fig. 1, label audio signal, paragraph 28);
	generate, from the audio data, a plurality of samples (Paragraph 35 discloses the audio signal is suppressed by extracting feature vectors about every 10 ms.), each sample having predefined maximum length (Paragraph 35);
	using a natural language model (Fig. 1,2, labels 200) trained to predict character probabilities (Fig. 2, label 216 outputs the confidence scores or probabilities of the words or characters of the lyrics. Fig. 1,2, label dictionary database and HMM database.), generating a probability matrix of characters for a first portion of a first sample of the plurality of samples (Paragraph 70 discloses a probability metric of line duration given a distribution of durations of all lines in the song or audio signal. Paragraph 35 discloses that the ASR decoder uses an HMM database that statistically describes each phoneme in the feature spaces to obtain an optimal sequence of words from the phonemes that matches the grammar of the audio signal and corresponding feature vector. Table 1,2,3 includes further portions of the probability matrix.), 
	wherein the probability matrix includes: 
	character information (Table 1,2,3 shows the phonetics of the feature vectors matching the words or characters of the lyrics.),
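	timing information (Table 1,2,3 shows the phonetics and words or lyrics include a timing, label start time, end time.), and
	respective probabilities of respective characters at respective times (Paragraph 70 discloses a probability metric of line duration given a distribution of durations of all lines in the song or audio signal. Paragraph 35 discloses that the ASR decoder uses an HMM database that statistically describes each phoneme in the feature spaces to obtain an optimal sequence of words from the phonemes that matches the grammar of the audio signal and corresponding feature vector.);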
	identify, for the first portion of the first sample, a first sequence of characters based on the generated probability matrix (Table 1,2,3 shows identification of phonetics or first sequence of characters based on the ). 
Todic discloses generating lyrics from an audio signal using an audio engine that performs voice separation, or extraction of the vocal data or data representing spoken utterances of words (paragraph 29), and an ASR decoder (Fig. 1,2), but fails to disclose that the generation of lyrics includes using a neural network trained to predict character probabilities, including:
	downsampling the first sample to reduce a dimension of the first sample;
	convolving an output of the downsampling of the first sample; and
	upsampling an output of the convolution of the first sample to increase the dimension of the first sample.
	Jansson et al discloses using a deep U-Net convolutional neural network model for the purpose of voice separation, or extraction of a clean vocal signal, for lyric transcription (Section I discloses estimating what the sung melody and accompaniment would sound like in isolation for lyric transcription.), including 
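downsampling the first sample to reduce a dimension of the first sample (Section 3.1.2 discloses an audio input. A Short Time Fourier Transform is performed on the audio input in order to output samples and spectrograms. Downsampling is performed on the first sample of the input audio. Fig. 1 shows the neural network, with the encoder on the left side, decoder on the right side and convolutional layer at the bottom, Conv2D.);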
	convolving an output of the downsampling of the first sample (Fig. 1, label conv2D of the downsampling performed in the encoder. Section 3.1.2 discloses downsampling the input audio and encoder layers with 2D convolution.); and
	upsampling an output of the convolution of the first sample to increase the dimension of the first sample (Fig. 1, label deconv2D layers as the decoder. Section 3 discloses encoding is then decoded to original size of the image by a stack of upsampling layers.);
	Todic discloses an audio engine that performs voice separation or vocal extraction (paragraph 29, Fig. 1,2, label audio engine) and Jansson et al discloses voice separation for lyric transcription using a neural network (Fig. 1, Section 1,3). Hence, it would have been obvious to one skilled in the art before the effective filing date of the application to modify Todic’s audio engine by incorporating a neural network to perform voice separation for lyric transcription, as disclosed by Jansson et al, so as to improve lyric transcription needed for commercial applications such as karaoke (Section I).

Allowable Subject Matter
Claims 6 and 9 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Publication Title: Online Singing Voice Separation Using a Recurrent 1-Dimensional U-Net Trained with Deep Feature Losses discloses voice separation using convolutional encoder decoder.
Publication Title: Automatic Recognition Of Lyrics of Singing discloses automatic recognition of lyrics using HMM. 
Publication Title: Spectrogram Channels U-Net: A Source Separation Model Viewing Each Channel as The Spectrogram of each Source discloses sound source separation for automatic lyric transcription includes convolutional encoder decoder that performs downsampling or max pooling layer, convolution of the downsampled input and upsampling. (Fig. 3)
Publication Title: Application of recurrent U-net architecture to speech enhancement discloses a U-Net neural architecture.
US Publication No.: 20190115013 discloses speech recognition using HMM and convolutional neural network.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to LINDA WONG whose telephone number is (571)272-6044.  The examiner can normally be reached on 9-5.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached on (571) 272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.


/LINDA WONG/Primary Examiner, Art Unit 2656