Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Regarding the 35 U.S.C. 103 rejection, Applicant's arguments filed 5/19/2021 have been fully considered but they are not persuasive.
The applicant contends:
A. Todic in view of Jansson does not teach or suggest at least, "generating a probability matrix of characters for a first portion of a first sample of the plurality of samples" that is used to "identify[], for the first portion of the first sample, a first sequence of characters," as required by amended claim 1.
[020143-5114-US, Response to Office Action]
The Office Action states:
Todic discloses ... 
generating a probability matrix of characters for a first portion of a first sample of the plurality of samples (Paragraph 70 discloses a probability metric of line duration given a distribution of durations of all lines in the song or audio signal. Paragraph 35 discloses that the ASR decoder uses an HMM database that statistically describes each phoneme in the feature spaces to obtain an optimal sequence of words from the phonemes that matches the grammar of the audio signal and corresponding feature vector. Tables 1, 2, and 3 include further portions of the probability matrix.) (Office Action at 4).
The cited portion of Todic discloses:
The confidence score engine 216 may also analyze forward (or reverse) recognition results and determine a probability metric of line duration given a distribution of durations of all lines in the song or audio signal. This metric leverages the symmetric notion of modern western songs and computes a probability that a duration of a specific line fits a line duration model for a song or audio signal, for example. Given the duration of each line as determined in the automated alignment process (e.g., taken from the forward and/or reverse alignment), a parametric model of line duration can be estimated by calculating a mean and standard deviation of line duration... Todic, [0070]. 
However, determining a probability metric that "computes a probability that a duration of a specific line fits a line duration model for a song or audio signal" does not teach or suggest "generating a probability matrix of characters for a first portion of a first sample of the plurality of samples," as required by amended claim 1. Instead, Todic synchronizes the lyrics, and, in addition to outputting the determined lyrics, also outputs a probability that represents a "confidence score" for the lyrics (and/or the line length) after determining the lyrics. See, e.g., Todic, [0088] and Figure 2. Todic's teaching of determining a "confidence score" after Todic has already determined the lyrics does not teach or suggest "identifying...a first sequence of characters using the generated probability matrix." 
The other cited portion of Todic, paragraph 35, recites: 
...The ASR decoder 104 may use a Hidden Markov Model (HMM) database 112 that statistically describes each phoneme in the features space (e.g., using MFCC) to obtain an optimal sequence of words from the phonemes that matches the grammar of the audio signal and corresponding feature vector. Todic, [0035].
However, it is not clear from the Office Action whether the Examiner is analogizing the "probability matrix" of claim 1 to Todic's probability metric, or to Todic's HMM. The two are not the same: 
Todic, Figure 2 

The examiner disagrees. The claim recites “… generating a probability matrix of characters for a first portion of a first sample of the plurality of samples … wherein the probability matrix includes character information, timing information and respective probabilities of respective characters at respective times …”. MPEP 2141.02 VI., PRIOR ART MUST BE CONSIDERED IN ITS ENTIRETY, INCLUDING DISCLOSURES THAT TEACH AWAY FROM THE CLAIMS, states:
“A prior art reference must be considered in its entirety, i.e., as a whole, including portions that would lead away from the claimed invention. W.L. Gore & Assoc., Inc. v. Garlock, Inc., 721 F.2d 1540, 220 USPQ 303 (Fed. Cir. 1983), cert. denied, 469 U.S. 851 (1984) …”

Paragraph 70 discloses “ … The confidence score engine 216 may also analyze forward (or reverse) recognition results and determine a probability metric of line duration given a distribution of durations of all lines in the song or audio signal. This metric leverages the symmetric notion of modern western songs and computes a probability that a duration of a specific line fits a line duration model for a song or audio signal, for example. …” 
Paragraph 89 discloses “As another example, at block 720, a probability metric of line duration can be computed and compared to a threshold (e.g., two standard deviations of line duration), at block 722. If the metric is not within the threshold, the line of lyrics is marked as a high confidence line, at block 716.” 
The highlighted portion of the paragraph indicates that the confidence score engine calculates a probability metric of line duration for each line of the lyrics of a song (labels line 1, 2, 3 of Tables 1, 2, 3, and 5). When the probability metric is not within 
B. Todic in view of Jansson does not teach or suggest at least, "wherein the probability matrix includes: character information, timing information, and respective probabilities of respective characters at respective times," as recited in claim 1. 
Moreover, claim 1 expressly recites the contents of the probability matrix. To that end, claim 1 states "wherein the probability matrix includes: character information, timing information, and respective probabilities of respective characters at respective times."
The Office Action states: 
Todic discloses ... wherein the probability matrix includes: character information (Tables 1, 2, and 3 show the phonetics of the feature vectors matching the words or characters of the lyrics.), timing information (Tables 1, 2, and 3 show that the phonetics and words or lyrics include a timing, labels start time, end time.), and respective probabilities of respective characters at respective times (Paragraph 70 discloses a probability metric of line duration given a distribution of durations of all lines in the song or audio signal. Paragraph 35 discloses that the ASR decoder uses an HMM database that statistically describes each phoneme in the feature spaces to obtain an optimal sequence of words from the phonemes that matches the grammar of the audio signal and corresponding feature vector.); (Office Action at 3-4).
But Tables 1, 2, and 3 in Todic do not show the contents of any "matrix" at all, much less the content of a "probability matrix" that is used to "identify[], for the first portion of the first sample, a first sequence of characters." In fact, none of the Tables, or their corresponding textual description, show probabilities at all. 
The Examiner previously cited Todic's "probability metric," as described in paragraph 70 of Todic, and Todic's HMM, as described in paragraph 35, for teaching the claimed "probability matrix." However, Tables 1, 2, and 3 in Todic do not show the contents of Todic's "probability metric" or HMM. Rather, Todic, Table 1, shows Lyric Lines and "Input Lyrics Text (words and corresponding phonetic transcription)"; Table 2 shows Lyric Line, Start Time, "Output lyrics text (words and corresponding phonetic transcription)" and End Time; and Table 3 shows Lyric Line and "Reverse Input Lyrics Text Line (words and corresponding phonetic transcription)." These tables merely show the output of Todic's methodologies for various lyric lines, not the contents of a generated probability matrix.

Table 1-3

The examiner disagrees. Each line of the lyrics of a song is shown in Tables 1-3 along with character information, such as the phonetics of the feature vectors matching the words or characters of the lyrics. Tables 1, 2, and 3 also show the timing of the phonetics and words or lyrics of each line of the lyrics of a song (labels start time, end time). Table 5 shows the forward time duration of each line of the lyrics as well as 
Jansson is not cited for, and does not teach or suggest, the missing limitations of claim 1. also does not teach "identifying... a first sequence of characters using the generated probability matrix." 
Thus, Todic in view of Jansson does not teach or suggest all of the features of claim 1 as amended. Independent claims 12-13 include analogous limitations to claim 1 and are thus also patentable over Todic in view of Jansson for at least the reasons explained above. Accordingly, claims 1, 12-13 and their associated dependent claims are patentable over Todic in view of Jansson.

	The examiner disagrees. The office action below provides further clarification, owing to a typographical error, indicating that Todic discloses the recited limitation “identifying, for the first portion of the first sample, a first sequence of characters using the generated probability matrix”. The applicant merely states that Todic fails to disclose the limitation, but provides no explanation or argument as to why the applicant believes Todic does not disclose the recited limitation. The office action clearly indicates the limitations Jansson discloses and the motivation for combining Todic and Jansson.
	Regarding allowable subject matter, claims 6 and 9, the applicant contends:
Applicant would like to thank the Examiner for noting that claims 6 and 9 would be allowable if rewritten in independent form. The Applicant respectfully requests reconsideration of the objection in light of the remarks above.

The examiner has considered the applicant’s remarks. The objection stands as previously stated in light of the rebuttal above and office action below.
For the reasons indicated in the rebuttal above and in the office action below (adjusted in view of the amendments), Todic in view of Jansson discloses the recited limitations. Please see the office action below.


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3-5, 7-8, 10-16, and 18-19 are rejected under 35 U.S.C. 103 as being unpatentable over Todic (US Publication No.: 20110288862) in view of Jansson et al (Publication Title: Singing Voice Separation With Deep U-Net Convolutional Networks).
Claim 1, Todic discloses 
	at an electronic device (Paragraph 48 discloses computing device) having one or more processors (paragraph 48 discloses a processor.) and memory storing instructions for execution by the one or more processors (paragraph 48):
	receiving audio data for a media item (Fig. 1, label audio signal, paragraph 28);
	generating, from the audio data, a plurality of samples (Paragraph 35 discloses the audio signal is suppressed by extracting feature vectors about every 10 ms.), each sample having a predefined maximum length (Paragraph 35);
	using a natural language model (Fig. 1,2, labels 200) trained to predict character probabilities (Fig. 2, label 216 outputs the confidence scores or probabilities of the 
	wherein the probability matrix includes: 
	character information (Table 1,2,3 shows the phonetics of the feature vectors matching the words or characters of the lyrics.),
	timing information (Table 1,2,3 shows the phonetics and words or lyrics include a timing, label start time, end time.), and
	respective probabilities of respective characters at respective times (Paragraph 70 discloses a probability metric of line duration given a distribution of durations of all lines in the song or audio signal. Paragraph 35 discloses that the ASR decoder uses an HMM database that statistically describes each phoneme in the feature spaces to obtain an optimal sequence of words from the phonemes that matches the grammar of the audio signal and corresponding feature vector.); 
	identifying, for the first portion of the first sample, a first sequence of characters using the generated probability matrix (Table 1,2,3 shows identification of phonetics or first sequence of characters. Table 5 shows confidence line for each line, which indicates a probability metric as per paragraph 70,89. Paragraph 90,104 discloses aligning lyrics and audio using the confidence score or confidence line as outputted by label 720 of Fig. 
Todic discloses generating lyrics from an audio signal using an audio engine performing voice separation or extraction of the vocal data or data representing spoken utterances of words (paragraph 29) and an ASR decoder (Fig. 1,2), but fails to disclose that the generation of lyrics includes using a neural network trained to predict character probabilities, which includes:
	downsampling the first sample to reduce a dimension of the first sample;
	convolving an output of the downsampling of the first sample; and
	upsampling an output of the convolution of the first sample to increase the dimension of the first sample.
	Jansson et al discloses using a deep U-Net convolutional neural network model for the purpose of voice separation, or obtaining a clean vocal signal, for lyric transcription (Section I discloses estimating what the sung melody and accompaniment would sound like in isolation for lyric transcription.) including 

	convolving an output of the downsampling of the first sample (Fig. 1, label conv2D, applied to the downsampling performed in the encoder. Section 3.1.2 discloses downsampling the input audio and an encoder layer with 2D convolution.); and
	upsampling an output of the convolution of the first sample to increase the dimension of the first sample (Fig. 1, label deconv2D layers as the decoder. Section 3 discloses that the encoding is then decoded to the original size of the image by a stack of upsampling layers.);
	Todic discloses an audio engine that performs voice separation or vocal extraction (paragraph 29, Fig. 1,2, label audio engine) and Jansson et al discloses voice separation for lyric transcription using a neural network (Fig. 1, Sections 1 and 3). Hence, it would have been obvious to one skilled in the art before the effective filing date of the application to modify Todic’s audio engine by incorporating a neural network to perform voice separation for lyric transcription, as disclosed by Jansson et al, so as to improve the lyric transcription needed for commercial applications such as karaoke (Section I).
Claim 3, Todic discloses receiving, from an external source, lyrics corresponding to the media item (Fig. 1,2, label lyrics text); and using the received lyrics and the probability matrix, aligning characters in the first sequence of characters with the received lyrics corresponding to the media item (Table 1,2,3 shows the alignment of the 
Claim 4, Todic discloses determining a set of lyrics based on the first sequence of characters (Table 1,2,3 shows the determined set of lyrics from the phonemes.); and
	storing the set of lyrics in association with the media item (Paragraph 48 discloses memory for storing computing software that performs the functions of the components of Fig. 1. Fig. 1, label synced lyrics, Table 1,2,3 shows the set of lyrics.).
	Claim 5, Todic discloses using a language model and at least a portion of the first sequence of characters, determining a first word in the first portion of the first sample (Paragraph 35 discloses the use of a language model to determine the grammar of the audio signal matching words obtained from statistical descriptions of phonemes. Tables 1, 2, and 3 show the words corresponding to the phonetics.); and
	determining, using the timing information that corresponds to the first portion of the first sample, a time that corresponds to the first word (Table 1,2,3, label start time, end time. Fig. 5 shows the time alignment of the lyrics to audio. (paragraph 77)).
	Claim 7, Todic discloses the received audio data includes an extracted vocal track that has been separated from a media content item (Paragraph 29 discloses extraction of vocal data or vocal track from the audio signal.).
	Claim 8, Todic discloses the received audio data is a polyphonic media content item (Paragraph 28 discloses the audio signal can include instrumental music, background noise and spoken or sung words.).

	Claim 11, Todic discloses determining whether any of the one or more keywords corresponds to a defined set of words (Table 1,2, paragraph 40 describes matching lyrics to keywords or phonetics of keywords. For example, the phonetics for asleep matches the lyric “As I fell Asleep If Fireflies” (Table 1).); and
	in accordance with a determination that a first keyword of the one or more keywords corresponds to the defined set of words, performing an operation on a portion of the sample that corresponds to the first keyword (Paragraph 41,45, Table 2 discloses an operation is performed on a frame of speech corresponding to the phonemes and words such as keywords.).
Claim 12, Todic discloses 
	one or more processors (paragraph 48 discloses a processor.); and 
memory storing instructions for execution by the one or more processors, the instructions including instructions for (paragraph 48):
	receiving audio data for a media item (Fig. 1, label audio signal, paragraph 28);
	generating, from the audio data, a plurality of samples (Paragraph 35 discloses the audio signal is suppressed by extracting feature vectors about every 10 ms.), each sample having a predefined maximum length (Paragraph 35);
	using a natural language model (Fig. 1,2, labels 200) trained to predict character probabilities (Fig. 2, label 216 outputs the confidence scores or probabilities of the 
	wherein the probability matrix includes: 
	character information (Table 1,2,3 shows the phonetics of the feature vectors matching the words or characters of the lyrics.),
	timing information (Table 1,2,3 shows the phonetics and words or lyrics include a timing, label start time, end time.), and
	respective probabilities of respective characters at respective times (Paragraph 70 discloses a probability metric of line duration given a distribution of durations of all lines in the song or audio signal. Paragraph 35 discloses that the ASR decoder uses an HMM database that statistically describes each phoneme in the feature spaces to obtain an optimal sequence of words from the phonemes that matches the grammar of the audio signal and corresponding feature vector.); 
	identifying, for the first portion of the first sample, a first sequence of characters using the generated probability matrix (Table 1,2,3 shows identification of phonetics or first sequence of characters. Table 5 shows confidence line for each line, which indicates a probability metric as per paragraph 70,89. Paragraph 90,104 discloses aligning lyrics and audio using the confidence score or confidence line as outputted by label 720 of Fig. 
Todic discloses generating lyrics from an audio signal using an audio engine performing voice separation or extraction of the vocal data or data representing spoken utterances of words (paragraph 29) and an ASR decoder (Fig. 1,2), but fails to disclose that the generation of lyrics includes using a neural network trained to predict character probabilities, which includes:
	downsampling the first sample to reduce a dimension of the first sample;
	convolving an output of the downsampling of the first sample; and
	upsampling an output of the convolution of the first sample to increase the dimension of the first sample.
	Jansson et al discloses using a deep U-Net convolutional neural network model for the purpose of voice separation, or obtaining a clean vocal signal, for lyric transcription (Section I discloses estimating what the sung melody and accompaniment would sound like in isolation for lyric transcription.) including 

	convolving an output of the downsampling of the first sample (Fig. 1, label conv2D, applied to the downsampling performed in the encoder. Section 3.1.2 discloses downsampling the input audio and an encoder layer with 2D convolution.); and
	upsampling an output of the convolution of the first sample to increase the dimension of the first sample (Fig. 1, label deconv2D layers as the decoder. Section 3 discloses that the encoding is then decoded to the original size of the image by a stack of upsampling layers.);
	Todic discloses an audio engine that performs voice separation or vocal extraction (paragraph 29, Fig. 1,2, label audio engine) and Jansson et al discloses voice separation for lyric transcription using a neural network (Fig. 1, Sections 1 and 3). Hence, it would have been obvious to one skilled in the art before the effective filing date of the application to modify Todic’s audio engine by incorporating a neural network to perform voice separation for lyric transcription, as disclosed by Jansson et al, so as to improve the lyric transcription needed for commercial applications such as karaoke (Section I).
Claim 13, Todic discloses 
	receive audio data for a media item (Fig. 1, label audio signal, paragraph 28);

	using a natural language model (Fig. 1,2, labels 200) trained to predict character probabilities (Fig. 2, label 216 outputs the confidence scores or probabilities of the words or characters of the lyrics. Fig. 1,2, label dictionary database and HMM database.), generating a probability matrix of characters for a first portion of a first sample of the plurality of samples (Paragraph 70 discloses a probability metric of line duration given a distribution of durations of all lines in the song or audio signal. Paragraph 35 discloses that the ASR decoder uses an HMM database that statistically describes each phoneme in the feature spaces to obtain an optimal sequence of words from the phonemes that matches the grammar of the audio signal and corresponding feature vector. Tables 1, 2, and 3 include further portions of the probability matrix.), 
	wherein the probability matrix includes: 
	character information (Table 1,2,3 shows the phonetics of the feature vectors matching the words or characters of the lyrics.),
	timing information (Table 1,2,3 shows the phonetics and words or lyrics include a timing, label start time, end time.), and
	respective probabilities of respective characters at respective times (Paragraph 70 discloses a probability metric of line duration given a distribution of durations of all lines in the song or audio signal. Paragraph 35 discloses that the ASR decoder uses an HMM database that statistically describes each phoneme in the feature spaces to obtain an optimal sequence of words from the phonemes that matches the grammar of the audio signal and corresponding feature vector.); 

Todic discloses generating lyrics from an audio signal using an audio engine performing voice separation or extraction of the vocal data or data representing spoken utterances of words (paragraph 29) and an ASR decoder (Fig. 1,2), but fails to disclose that the generation of lyrics includes using a neural network trained to predict character probabilities, which includes:
	downsampling the first sample to reduce a dimension of the first sample;
	convolving an output of the downsampling of the first sample; and

	Jansson et al discloses using a deep U-Net convolutional neural network model for the purpose of voice separation, or obtaining a clean vocal signal, for lyric transcription (Section I discloses estimating what the sung melody and accompaniment would sound like in isolation for lyric transcription.) including 
	downsampling the first sample to reduce a dimension of the first sample (Section 3.1.2 discloses an audio input. A Short Time Fourier Transform is performed on the audio input in order to output samples and spectrograms. Downsampling of the first sample of the input audio. Fig. 1 shows the neural network, with the encoder on the left side, the decoder on the right side and the convolutional layer at the bottom, Conv2D.);
	convolving an output of the downsampling of the first sample (Fig. 1, label conv2D, applied to the downsampling performed in the encoder. Section 3.1.2 discloses downsampling the input audio and an encoder layer with 2D convolution.); and
	upsampling an output of the convolution of the first sample to increase the dimension of the first sample (Fig. 1, label deconv2D layers as the decoder. Section 3 discloses that the encoding is then decoded to the original size of the image by a stack of upsampling layers.);
	Todic discloses an audio engine that performs voice separation or vocal extraction (paragraph 29, Fig. 1,2, label audio engine) and Jansson et al discloses voice separation for lyric transcription using a neural network (Fig. 1, Section 1,3), hence it would be obvious to one skilled in the art before the effective filing date of the application to modify Todic’s audio engine by incorporating a neural network to perform voice separation for lyric transcription as disclosed by Jansson et al so to 
Claim 14, Todic discloses receiving, from an external source, lyrics corresponding to the media item (Fig. 1,2, label lyrics text); and using the received lyrics and the probability matrix, aligning characters in the first sequence of characters with the received lyrics corresponding to the media item (Table 1,2,3 shows the alignment of the first sequence of characters or phonetics with the received lyrics based on the timing information or probability matrix. Paragraph 35 discloses words obtained from phonemes to grammar of the audio signal and corresponding feature vector using statistical descriptions of each phoneme.)
Claim 15, Todic discloses determining a set of lyrics based on the first sequence of characters (Table 1,2,3 shows the determined set of lyrics from the phonemes.); and
	storing the set of lyrics in association with the media item (Paragraph 48 discloses memory for storing computing software that performs the functions of the components of Fig. 1. Fig. 1, label synced lyrics, Table 1,2,3 shows the set of lyrics.).
	Claim 16, Todic discloses using a language model and at least a portion of the first sequence of characters, determining a first word in the first portion of the first sample (Paragraph 35 discloses the use of a language model to determine the grammar of the audio signal matching words obtained from statistical descriptions of phonemes. Tables 1, 2, and 3 show the words corresponding to the phonetics.); and
	determining, using the timing information that corresponds to the first portion of the first sample, a time that corresponds to the first word (Table 1,2,3, label start time, end time. Fig. 5 shows the time alignment of the lyrics to audio. (paragraph 77)).

	Claim 19, Todic discloses the received audio data is a polyphonic media content item (Paragraph 28 discloses the audio signal can include instrumental music, background noise and spoken or sung words.).

Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.
Claims 1 and 3-13 are provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claim 13 of copending Application No. 16691463 in view of Jansson et al (Publication Title: Singing Voice Separation With Deep U-Net Convolutional Networks).
Regarding claims 1, 12, and 13, the limitations of the copending application are similarly recited in this application, but the copending application fails to recite that the generation of lyrics includes using a neural network trained to predict character probabilities, which includes:
	downsampling the first sample to reduce a dimension of the first sample;
	convolving an output of the downsampling of the first sample; and

	Jansson et al discloses using a deep U-Net convolutional neural network model for the purpose of voice separation, or obtaining a clean vocal signal, for lyric transcription (Section I discloses estimating what the sung melody and accompaniment would sound like in isolation for lyric transcription.) including 
	downsampling the first sample to reduce a dimension of the first sample (Section 3.1.2 discloses an audio input. A Short Time Fourier Transform is performed on the audio input in order to output samples and spectrograms. Downsampling of the first sample of the input audio. Fig. 1 shows the neural network, with the encoder on the left side, the decoder on the right side and the convolutional layer at the bottom, Conv2D.);
	convolving an output of the downsampling of the first sample (Fig. 1, label conv2D, applied to the downsampling performed in the encoder. Section 3.1.2 discloses downsampling the input audio and an encoder layer with 2D convolution.); and
	upsampling an output of the convolution of the first sample to increase the dimension of the first sample (Fig. 1, label deconv2D layers as the decoder. Section 3 discloses that the encoding is then decoded to the original size of the image by a stack of upsampling layers.);
	The copending application discloses streaming media content to users including music with lyrical content (paragraph 4) and Jansson et al discloses voice separation for lyric transcription using a neural network (Fig. 1, Section 1,3), hence it would be obvious to one skilled in the art before the effective filing date of the application to modify copending application’s system by incorporating a neural network to perform voice 
Regarding claims 3-11, such claims are anticipated by claims 3-11 of the copending application.
This is a provisional nonstatutory double patenting rejection.

Allowable Subject Matter
Claims 6, 9, 17, and 20 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 


Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached on (571) 272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


/LINDA WONG/Primary Examiner, Art Unit 2656