Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant's arguments filed 1/4/2022 have been fully considered but they are not persuasive. 
The applicant contends that Todic in view of Jansson does not teach or suggest at least “using a neural network trained to predict text probabilities, generating a probability matrix of textual units for a first portion of a first sample of the plurality of samples”, as recited in claim 1.
The examiner disagrees. The applicant’s remarks merely include the office action and parts of the disclosure of Todic. The applicant’s remarks do not explain the difference between the probability matrix as disclosed by Todic and the recited claim language.

A. 	Consideration of limitation “generating a probability matrix of textual units for a first portion of a first sample of the plurality of samples”. 
Such recited language does not specify what constitutes a matrix, textual units, a first portion, or a first sample of the plurality of samples. In the broadest sense of the terminology in light of the specification, the claimed language is interpreted as follows:
textual units: are considered words in the lyrics,
a first portion: some portion of the first sample,
a first sample: a song line or song lyric,
a plurality of samples: multiple song lines,
probability matrix: a matrix is simply a table composed of columns and rows with data; specifically, with regard to the recited claim language, a table of probability scores of textual units for a first portion of a song line or song lyric of multiple song lines.
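For illustration only, the interpretation above (a table of probability scores of textual units over time) can be sketched as a small data structure. The sketch is not drawn from Todic, Jansson et al or the present application; the function names `build_probability_matrix` and `probability_at`, the choice of rows for textual units and columns for times, and all values are hypothetical.

```python
# Illustrative only: names, layout and values are hypothetical and are not
# taken from Todic, Jansson et al, or the application under examination.

def build_probability_matrix(units, frame_times, scores):
    """Return a table with one row per textual unit and one column per time.

    units       -- list of textual units (e.g. words of a song line)
    frame_times -- list of times in seconds, one per column
    scores      -- scores[i][j] is the probability of units[i] at frame_times[j]
    """
    if len(scores) != len(units):
        raise ValueError("one row of scores is required per textual unit")
    for row in scores:
        if len(row) != len(frame_times):
            raise ValueError("one score is required per time")
        if any(p < 0.0 or p > 1.0 for p in row):
            raise ValueError("probabilities must lie in [0, 1]")
    return {
        "units": list(units),
        "times": list(frame_times),
        "probs": [list(row) for row in scores],
    }


def probability_at(matrix, unit, time):
    """Look up the probability of one textual unit at one time."""
    i = matrix["units"].index(unit)
    j = matrix["times"].index(time)
    return matrix["probs"][i][j]


# A made-up 2 x 3 example: two textual units scored at three times.
matrix = build_probability_matrix(
    units=["would", "you"],
    frame_times=[0.0, 0.5, 1.0],
    scores=[[0.9, 0.3, 0.1],
            [0.1, 0.7, 0.2]],
)
```

In this layout the table carries the three kinds of content discussed in this action: the textual units (row labels), the timing information (column labels), and a probability for each unit at each time (cell values).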
Todic discloses a natural language model processing song lyrics or a plurality of samples (Fig. 2, label 216), wherein the processing of a song lyric includes generating probabilities or scores for the line duration of the song lyric, such as the confidence score of the phonemes mapped to words. For example, Table 1 shows each song lyric or line of multiple song lines or lyrics. Table 2 shows, for example, lyric 1 (“would you believe your eyes”) as an example of a first sample; a first portion of the first sample is the portion of the song lyric within a duration time. Table 2 also includes the textual units, such as the phonemes associated with the song lyric.
Paragraph 70 discloses “The confidence score engine 216 may also analyze forward (or reverse) recognition results and determine a probability metric of line duration given a distribution of durations of all lines in the song or audio signal. This metric leverages the symmetric notion of modern western songs and computes a probability that a duration of a specific line fits a line duration model for a song or audio signal, for example.” The highlighted portion of the paragraph discloses the probability metric is calculated for a line duration, such as Table 2, lyric or line 1, with duration from start time to end time as indicated in Table 2. An example of a probability matrix of the textual units for each line of a song with a line duration as 

B. 	Consideration of limitation “using a neural network trained to predict text probabilities, generating” the probability matrix as discussed above. 
The applicant’s remarks are directed merely to the single reference, Todic, as opposed to Todic in view of Jansson et al, as indicated in the office action. As per MPEP Section 2145 IV, one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references.
The limitation merely recites the use of a neural network to generate the probability matrix as discussed above, wherein such neural network is trained to predict text probabilities. The term “generate” is defined as “to be the cause of (a situation, action or physical process)” (https://www.merriam-webster.com/dictionary/generate). Such definition indicates that the use of the neural network (as recited in the claim) is to cause or generate the probability matrix as indicated above. In light of the recited claim language and such definition, Todic, as indicated in the previous office action, discloses an audio encoder and ASR decoder performing speech recognition in order to cause or generate the probability matrix as discussed above. (Figs. 1 and 2 show the ASR decoder and audio encoder with confidence score engine, label 216, that generates the confidence scores mentioned in paragraphs 70-71.) Such system includes a natural language model that is trained to predict text probabilities to generate the probability 
As indicated in the office action, Todic discloses generating lyrics from an audio signal, or lyric transcription, using a system comprising an audio engine performing voice separation or extraction of vocal data or data representing spoken utterances of words (paragraph 29) and an ASR decoder (Fig. 1,2), wherein the confidence score engine is caused by the audio engine and ASR decoder to generate the probability matrix, but fails to disclose that such a system with a natural language model includes using a neural network. 
Although Todic does not disclose that such a system includes a neural network, Jansson et al discloses the use of a deep U-Net convolutional neural network for source or voice separation in order to perform functions such as lyric transcription of a song (Section I). As indicated in the office action, it would be obvious to one skilled in the art to incorporate a neural network to perform voice separation for lyric transcription as disclosed by Jansson et al into Todic so as to improve lyric transcription, which improves the understanding of song lyrics needed for commercial applications such as karaoke.

The applicant contends Todic in view of Jansson does not teach or suggest at least “wherein the probability matrix includes: information about the textual units, timing information, and respective probabilities of respective textual units at respective times”, as recited in claim 1.
The examiner disagrees. The applicant’s remarks merely copy and paste the tables of Todic (Tables 1,2,3) without consideration of the entirety of the reference or of the paragraphs indicated in the office action. The remarks fail to indicate the difference between the prior art reference and the recited claim language. For example, what constitutes information about textual units? What constitutes a textual unit? 
As such, the claim is interpreted in the broadest sense of the terminology in light of the specification. The following is an interpretation of the recited claimed language indicated above: 
information about textual units: associated phonemes of the words or characters associated with each song lyric
timing information: duration time such as start time and end time associated with each song lyric
respective probabilities of respective textual units at respective times: confidence score or probability of a song lyric composed of textual units during the duration time, specifically start time and end time.
Based on such interpretation, Todic discloses Tables 1, 2 and 3, which show each song lyric. Each song lyric includes phonemes associated with words or characters of the song lyric. This shows the information about the textual units. Each song lyric is associated with a duration time (labels start time and end time), or timing information. 
The office action also points to paragraph 70, wherein such paragraph discloses “the confidence score engine 216 … determine a probability metric of line duration 

The applicant contests the provisional rejection of claims 1-6 and 9-13 on the ground of nonstatutory double patenting as being unpatentable over claims 1, 3-6 and 9-13 of copending application No. 16569372.
	The status of such claims under double patenting stands as previously stated until such rejection is overcome.

Conclusion
Based on the rebuttal indicated above, the status of the claims stands as previously stated. A copy of the office action is found below with minor adjustments. Please see the office action below.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-5,7,10-13 are rejected under 35 U.S.C. 103 as being unpatentable over Todic (US Publication No.: 20110288862) in view of Jansson et al (Publication Title: Singing Voice Separation With Deep U-Net Convolutional Networks).
Claim 1, Todic discloses 
	at an electronic device (Paragraph 48 discloses computing device) having one or more processors (paragraph 48 discloses a processor.) and memory storing instructions for execution by the one or more processors (paragraph 48):
	receiving polyphonic audio data for a media item (Fig. 1, label audio signal, paragraph 28 discloses “the audio signal may include a speech, a song or musical data, a TV signal, etc. and thus may include spoken or sung words and accompanying instrumental music or background noise and outputs the spoken or sung words (e.g., vocals) to an automated speech recognition (ASR) decoder 104. When the input audio signal is a musical song, the spoken or sung words may correspond to lyrics of the song, for example.” Such disclosure indicates the audio signal or audio data is polyphonic.);

	using a natural language model (Fig. 1,2, labels 200) trained to predict text probabilities (Fig. 2, label 216 outputs the confidence scores or probabilities of the words or characters of the lyrics. Fig. 1,2, label dictionary database and HMM database.), generating a probability matrix of textual units for a first portion of a first sample of the plurality of samples (Paragraph 70 discloses a probability metric of line duration given a distribution of durations of all lines in the song or audio signal. Paragraph 35 discloses that the ASR decoder uses an HMM database that statistically describes each phoneme in the feature spaces to obtain an optimal sequence of words from the phonemes that matches the grammar of the audio signal and corresponding feature vector. Table 1,2,3 includes further portions of the probability matrix.), 
	wherein the probability matrix includes: 
	information about the textual units (Table 1,2,3 shows the phonetics of the feature vectors matching the words or characters of the lyrics.),
	timing information (Table 1,2,3 shows the phonetics and words or lyrics include a timing, label start time, end time.), and
	respective probabilities of respective textual units at respective times (Paragraph 70 discloses a probability metric of line duration given a distribution of durations of all lines in the song or audio signal. Paragraph 35 discloses that the ASR decoder uses an HMM database that statistically describes each phoneme in the feature spaces to obtain an optimal sequence of words from the phonemes that matches the grammar of the audio signal and corresponding feature vector.); 

Todic discloses generating lyrics from an audio signal using an audio engine performing voice separation or extraction of the vocal data or data representing spoken utterances of words (paragraph 29) and an ASR decoder (Fig. 1,2) but fails to disclose that the generation of lyrics includes using a neural network.
Jansson et al discloses using a deep U-Net convolutional neural network model for the purpose of voice separation, or isolation of a clean vocal signal, for lyric transcription (Section I discloses estimating what the sung melody and accompaniment would sound like in isolation for lyric transcription.).

Claim 2, Todic discloses generating lyrics from an audio signal using an audio engine performing voice separation or extraction of the vocal data or data representing spoken utterances of words (paragraph 29) and an ASR decoder (Fig. 1,2) but fails to disclose that the generation of lyrics includes using a neural network trained to predict character probabilities, which includes:
	convolving the first sample;
downsampling the first sample to reduce a dimension of the first sample; and
	after downsampling the first sample, upsampling an output of the convolution of the first sample to increase the dimension of the first sample.
	Jansson et al discloses using a deep U-Net convolutional neural network model for the purpose of voice separation, or isolation of a clean vocal signal, for lyric transcription (Section I discloses estimating what the sung melody and accompaniment would sound like in isolation for lyric transcription.) including 
convolving the first sample (Fig. 1, label conv2D of the downsampling performed in the encoder. Section 3.1.2 discloses downsampling the input audio and encoder layer 
downsampling the first sample to reduce a dimension of the first sample (Section 3.1.2 discloses an audio input on which a Short Time Fourier Transform is performed in order to output samples and spectrograms, and downsampling of the first sample of the input audio. Fig. 1 shows the neural network, with the encoder on the left side, the decoder on the right side and the convolutional layer at the bottom, Conv2D.); and
	after downsampling the first sample, upsampling an output of the convolution of the first sample to increase the dimension of the first sample (Fig. 1, label deconv2D layers as the decoder. Section 3 discloses encoding is then decoded to original size of the image by a stack of upsampling layers.);
	Todic discloses an audio engine that performs voice separation or vocal extraction (paragraph 29, Fig. 1,2, label audio engine) and Jansson et al discloses voice separation for lyric transcription using a neural network (Fig. 1, Section 1,3); hence it would be obvious to one skilled in the art before the effective filing date of the application to modify Todic’s audio engine by incorporating a neural network to perform voice separation for lyric transcription as disclosed by Jansson et al so as to improve lyric transcription needed for commercial applications such as karaoke (Section I). 
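For illustration only, the three operations recited in claim 2 (convolving, downsampling to reduce a dimension, and upsampling to increase it again) can be sketched on a one-dimensional signal. This is not the network of Jansson et al, which operates on spectrograms with learned two-dimensional layers; the functions `convolve`, `downsample` and `upsample`, the fixed averaging kernel and the factor of 2 are assumptions chosen only to make the change in dimension visible.

```python
# Illustrative only: a 1-D toy with a fixed kernel, not the learned 2-D
# U-Net layers described by Jansson et al. All names and values are assumed.

def convolve(signal, kernel):
    """Sliding dot product over the positions where the kernel fully fits."""
    n = len(signal) - len(kernel) + 1
    return [
        sum(signal[i + k] * kernel[k] for k in range(len(kernel)))
        for i in range(n)
    ]


def downsample(signal, factor=2):
    """Keep every `factor`-th value, reducing the length of the signal."""
    return signal[::factor]


def upsample(signal, factor=2):
    """Repeat each value `factor` times, increasing the length of the signal."""
    return [value for value in signal for _ in range(factor)]


sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
features = convolve(sample, [0.5, 0.5])  # 9 values -> 8 values
reduced = downsample(features)           # 8 values -> 4 values
restored = upsample(reduced)             # 4 values -> 8 values
```

The lengths (8, then 4, then 8) show the reduction and subsequent restoration of the dimension; the restored values are only an approximation of `features`, since downsampling discards information.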
Claim 3, Todic discloses receiving, from an external source, lyrics corresponding to the media item (Fig. 1,2, label lyrics text); and using the received lyrics and the probability matrix, aligning textual units in the first sequence of textual units with the received lyrics corresponding to the media item (Table 1,2,3 shows the alignment of the first sequence of characters or phonetics with the received lyrics based on the timing 
Claim 4, Todic discloses determining a set of lyrics based on the first sequence of textual units (Table 1,2,3 shows the determined set of lyrics from the phonemes.); and
	storing the set of lyrics in association with the media item (Paragraph 48 discloses memory for storing computing software that performs the functions of the components of Fig. 1. Fig. 1, label synced lyrics, Table 1,2,3 shows the set of lyrics.).
Claim 5, Todic discloses using a language model and at least a portion of the first sequence of textual units, determining a first word in the first portion of the first sample (Paragraph 35 discloses the use of a language model to determine the grammar of the audio signal matching words obtained from statistical descriptions of phonemes. Table 1,2,3 shows the words corresponding to the phonetics.); and
	determining, using the timing information that corresponds to the first portion of the first sample, a time that corresponds to the first word (Table 1,2,3, label start time, end time. Fig. 5 shows the time alignment of the lyrics to audio. (paragraph 77)).
	Claim 7, Todic discloses the received polyphonic audio data includes an instrumental track and a vocal track. (Paragraph 28 discloses “the audio signal may include a speech, a song or musical data, a TV signal, etc. and thus may include spoken or sung words and accompanying instrumental music or background noise and outputs the spoken or sung words (e.g., vocals) to an automated speech recognition (ASR) decoder 104. When the input audio signal is a musical song, the spoken or sung words may correspond to lyrics of the song, for example.”)

Claim 11, Todic discloses determining whether any of the one or more keywords corresponds to a defined set of words (Table 1,2, paragraph 40 describes matching lyrics to keywords or phonetics of keywords. For example, the phonetics for asleep matches the lyric “As I fell Asleep If Fireflies” (Table 1).); and
	in accordance with a determination that a first keyword of the one or more keywords corresponds to the defined set of words, performing an operation on a portion of the sample that corresponds to the first keyword (Paragraph 41,45, Table 2 discloses an operation is performed on a frame of speech corresponding to the phonemes and words such as keywords.).
Claim 12, Todic discloses 
	one or more processors (paragraph 48 discloses a processor.); and 
memory storing instructions for execution by the one or more processors, the instructions including instructions for (paragraph 48):
	receiving polyphonic audio data for a media item (Fig. 1, label audio signal, paragraph 28 discloses “the audio signal may include a speech, a song or musical data, a TV signal, etc. and thus may include spoken or sung words and accompanying instrumental music or background noise and outputs the spoken or sung words (e.g., vocals) to an automated speech recognition (ASR) decoder 104. When the input audio 
	generating, from the polyphonic audio data, a plurality of samples (Paragraph 35 discloses that the audio signal is suppressed by extracting feature vectors about every 10 ms.), each sample having a predefined maximum length (Paragraph 35);
	using a natural language model (Fig. 1,2, labels 200) trained to predict character probabilities (Fig. 2, label 216 outputs the confidence scores or probabilities of the words or characters of the lyrics. Fig. 1,2, label dictionary database and HMM database.), generating a probability matrix of characters for a first portion of a first sample of the plurality of samples (Paragraph 70 discloses a probability metric of line duration given a distribution of durations of all lines in the song or audio signal. Paragraph 35 discloses that the ASR decoder uses an HMM database that statistically describes each phoneme in the feature spaces to obtain an optimal sequence of words from the phonemes that matches the grammar of the audio signal and corresponding feature vector. Table 1,2,3 includes further portions of the probability matrix.), 
	wherein the probability matrix includes: 
	character information (Table 1,2,3 shows the phonetics of the feature vectors matching the words or characters of the lyrics.),
	timing information (Table 1,2,3 shows the phonetics and words or lyrics include a timing, label start time, end time.), and
	respective probabilities of respective characters at respective times (Paragraph 70 discloses a probability metric of line duration given a distribution of durations of all lines in the song or audio signal. Paragraph 35 discloses that the ASR decoder uses an HMM database that statistically describes each phoneme in the feature spaces to obtain an 
	identifying, for the first portion of the first sample, a first sequence of characters using the generated probability matrix (Table 1,2,3 shows identification of phonetics or first sequence of characters. Table 5 shows a confidence line for each line, which indicates a probability metric as per paragraphs 70,89. Paragraphs 90,104 disclose aligning lyrics and audio using the confidence score or confidence line as outputted by label 720 of Fig. 7. Paragraph 77 discloses “The system 200 may further use speech recognition techniques to map expected textual transcriptions of the audio signal to the audio signal. Alternatively, correct lyrics are received and are taken as the textual transcriptions of the vocal elements in the audio signal (so that speech recognition is not needed to determine the textual transcriptions) and a forced alignment of the lyrics can be performed to the audio signal to generate timing boundary information, for example.” By aligning lyrics with the audio signal, the lyrics or first sequence of characters are identified, as shown in Tables 1,2, where the phonetics are matched to the text as well as the audio. In other words, the audio aligned to phonetics indicates a first portion of the first sample of the plurality of samples, wherein such is also aligned to text or lyrics or a sequence of characters.). 
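For illustration only, one simple way a sequence of characters could be read out of a table of per-time probabilities is to take the most probable character at each time and merge consecutive repeats. This is a hypothetical sketch; neither Todic nor the present application is asserted to identify the sequence in this manner, and the function name `most_likely_sequence`, the row-per-time layout and the values are assumptions.

```python
# Illustrative only: a greedy read-out over a made-up table. The layout
# (one row per time step, one column per character) is an assumption.

def most_likely_sequence(units, probs):
    """Pick the most probable unit per time step and merge consecutive repeats.

    units -- list of candidate characters, one per column
    probs -- probs[t][i] is the probability of units[i] at time step t
    """
    sequence = []
    for row in probs:
        best = max(range(len(units)), key=lambda i: row[i])
        # Append only when the winning unit changes from the previous step.
        if not sequence or sequence[-1] != units[best]:
            sequence.append(units[best])
    return sequence


units = ["h", "i", "o"]
probs = [
    [0.8, 0.1, 0.1],  # time step 0: "h" is most probable
    [0.7, 0.2, 0.1],  # time step 1: "h" again, merged with the previous step
    [0.1, 0.8, 0.1],  # time step 2: "i" is most probable
]
decoded = most_likely_sequence(units, probs)
```

With these made-up values the read-out is the two-character sequence "h", "i".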
Todic discloses generating lyrics from an audio signal using an audio engine performing voice separation or extraction of the vocal data or data representing spoken utterances of words (paragraph 29) and an ASR decoder (Fig. 1,2) but fails to disclose that the generation of lyrics includes using a neural network.
Jansson et al discloses using a deep U-Net convolution neural network model for the purpose of voice separation or of a clean vocal signal for lyric transcription (Section I 
Todic discloses an audio engine that performs voice separation or vocal extraction (paragraph 29, Fig. 1,2, label audio engine) and Jansson et al discloses voice separation for lyric transcription using a neural network (Fig. 1, Section 1,3); hence it would be obvious to one skilled in the art before the effective filing date of the application to modify Todic’s audio engine by incorporating a neural network to perform voice separation for lyric transcription as disclosed by Jansson et al so as to improve lyric transcription needed for commercial applications such as karaoke (Section I). 
Claim 13, Todic discloses 
	receive polyphonic audio data for a media item (Fig. 1, label audio signal, paragraph 28 discloses “the audio signal may include a speech, a song or musical data, a TV signal, etc. and thus may include spoken or sung words and accompanying instrumental music or background noise and outputs the spoken or sung words (e.g., vocals) to an automated speech recognition (ASR) decoder 104. When the input audio signal is a musical song, the spoken or sung words may correspond to lyrics of the song, for example.” Such disclosure indicates the audio signal or audio data is polyphonic.);
	generate, from the polyphonic audio data, a plurality of samples (Paragraph 35 discloses that the audio signal is suppressed by extracting feature vectors about every 10 ms.), each sample having a predefined maximum length (Paragraph 35);
	using a natural language model (Fig. 1,2, labels 200) trained to predict character probabilities (Fig. 2, label 216 outputs the confidence scores or probabilities of the words or characters of the lyrics. Fig. 1,2, label dictionary database and HMM 
	wherein the probability matrix includes: 
	character information (Table 1,2,3 shows the phonetics of the feature vectors matching the words or characters of the lyrics.),
	timing information (Table 1,2,3 shows the phonetics and words or lyrics include a timing, label start time, end time.), and
	respective probabilities of respective characters at respective times (Paragraph 70 discloses a probability metric of line duration given a distribution of durations of all lines in the song or audio signal. Paragraph 35 discloses that the ASR decoder uses an HMM database that statistically describes each phoneme in the feature spaces to obtain an optimal sequence of words from the phonemes that matches the grammar of the audio signal and corresponding feature vector.); 
	identifying, for the first portion of the first sample, a first sequence of characters using the generated probability matrix (Table 1,2,3 shows identification of phonetics or first sequence of characters. Table 5 shows confidence line for each line, which indicates a probability metric as per paragraph 70,89. Paragraph 90,104 discloses aligning lyrics and audio using the confidence score or confidence line as outputted by label 720 of Fig. 7. Paragraph 77 discloses “The system 200 may further use speech recognition 
Todic discloses generating lyrics from an audio signal using an audio engine performing voice separation or extraction of the vocal data or data representing spoken utterances of words (paragraph 29) and an ASR decoder (Fig. 1,2) but fails to disclose that the generation of lyrics includes using a neural network.
Jansson et al discloses using a deep U-Net convolutional neural network model for the purpose of voice separation, or isolation of a clean vocal signal, for lyric transcription (Section I discloses estimating what the sung melody and accompaniment would sound like in isolation for lyric transcription.).
Todic discloses an audio engine that performs voice separation or vocal extraction (paragraph 29, Fig. 1,2, label audio engine) and Jansson et al discloses voice separation for lyric transcription using a neural network (Fig. 1, Section 1,3), hence it would be obvious to one skilled in the art before the effective filing date of the application to modify Todic’s audio engine by incorporating a neural network to perform voice separation for lyric transcription as disclosed by Jansson et al so to  



Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b). 
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.
Claims 1-6,9-13 are provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1,3,4,5,6,9,10,11,12,13 of copending application No. 16569372 (reference application). Although the claims at issue are not identical, they are not patentably distinct from each other because the claims of the present application are broader than the claims of the copending application and hence are anticipated by the claims of the copending application.
This is a provisional nonstatutory double patenting rejection because the patentably indistinct claims have not in fact been patented.

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  


Any inquiry concerning this communication or earlier communications from the examiner should be directed to LINDA WONG whose telephone number is (571)272-6044. The examiner can normally be reached 9-5.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders, can be reached at (571) 272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/LINDA WONG/Primary Examiner, Art Unit 2655