DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Drawings
The drawings are objected to because:
Element “Inactive Media Repository 122” in Figure 1 is referred to as “anaphora repository 122” and “inactive media repository 122” in the specification.
Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. The figure or figure number of an amended drawing should not be labeled as “amended.” If a drawing figure is to be canceled, the appropriate figure must be removed from the replacement sheet, and where necessary, the remaining figures must be renumbered and appropriate changes made to the brief description of the several views of the drawings for consistency. Additional replacement sheets may be necessary to show the renumbering of the remaining figures. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
Specification
The disclosure is objected to because of the following informalities:
In paragraph 0004, lines 2-3, “a computer receives” should read “a computer that receives”.
In paragraph 0014, line 2, “referring to or replacing” should read “refers to or replaces”.
In paragraph 0033, line 4, “a file any type” should read “a file of any type”.
In paragraph 0034, line 2, “into signal over time” should read “into a signal over time”.
In paragraph 0034, line 9, “neural network..” should read “neural network.”.
In paragraph 0040, line 2, “cannot located” should read “cannot be located”.
In paragraph 0040, line 6, “example, embodiment” should read “example embodiment”.
In paragraph 0042, line 7, “the text to with” should read “the text with”.
In paragraph 0046, lines 4-5, “that may represented” should read “that may be represented”.
In paragraph 0049, line 2, the trademark WI-FI® is used without being cited as a registered trademark.
Appropriate correction is required.
Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claims 2, 9 and 16 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claims contain subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.
Regarding claim 2, the limitation “converting the multimedia data into the plurality of amplitudes using an auto decoder neural network” lacks adequate written description because the claim element defines the invention in functional language specifying a desired result, but the specification does not sufficiently identify how the function is performed or result is achieved.  An auto decoder neural network typically processes data encoded by an auto encoder neural network.  In the specification, Paragraph 0034, lines 2-9, states “According to an example embodiment, the anaphora disambiguation program 110A, 110B may convert the multimedia object into a signal wave by extracting audio data or if the multimedia object incorporates text, by convert the text into audio using an amplitude auto decoder neural network that converts the text into amplitude values and each value is assign a timeframe associated with the time the value was generated. In another embodiment, the anaphora disambiguation program 110A, 110B may convert the multimedia object into text using speech-to-text and then apply a trained neural network that converts text into a plurality of amplitudes over time using the auto decoder neural network.”  However, the specification does not provide any information about how to implement an auto decoder neural network that processes text data directly instead of processing text data that has been encoded with an auto encoder neural network.
Regarding claim 9, the limitation “converting the multimedia data into the plurality of amplitudes using an auto decoder neural network” lacks adequate written description because the claim element defines the invention in functional language specifying a desired result, but the specification does not sufficiently identify how the function is performed or result is achieved.  An auto decoder neural network typically processes data encoded by an auto encoder neural network.  In the specification, Paragraph 0034, lines 2-9, states “According to an example embodiment, the anaphora disambiguation program 110A, 110B may convert the multimedia object into a signal wave by extracting audio data or if the multimedia object incorporates text, by convert the text into audio using an amplitude auto decoder neural network that converts the text into amplitude values and each value is assign a timeframe associated with the time the value was generated. In another embodiment, the anaphora disambiguation program 110A, 110B may convert the multimedia object into text using speech-to-text and then apply a trained neural network that converts text into a plurality of amplitudes over time using the auto decoder neural network.”  However, the specification does not provide any information about how to implement an auto decoder neural network that processes text data directly instead of processing text data that has been encoded with an auto encoder neural network.
Regarding claim 16, the limitation “convert the multimedia data into the plurality of amplitudes using an auto decoder neural network” lacks adequate written description because the claim element defines the invention in functional language specifying a desired result, but the specification does not sufficiently identify how the function is performed or result is achieved.  An auto decoder neural network typically processes data encoded by an auto encoder neural network.  In the specification, Paragraph 0034, lines 2-9, states “According to an example embodiment, the anaphora disambiguation program 110A, 110B may convert the multimedia object into a signal wave by extracting audio data or if the multimedia object incorporates text, by convert the text into audio using an amplitude auto decoder neural network that converts the text into amplitude values and each value is assign a timeframe associated with the time the value was generated. In another embodiment, the anaphora disambiguation program 110A, 110B may convert the multimedia object into text using speech-to-text and then apply a trained neural network that converts text into a plurality of amplitudes over time using the auto decoder neural network.”  However, the specification does not provide any information about how to implement an auto decoder neural network that processes text data directly instead of processing text data that has been encoded with an auto encoder neural network.
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 2, 9 and 16 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claim 2 recites the limitation “generating the signal wave from the plurality of amplitudes”.  There is insufficient antecedent basis for “the plurality of amplitudes” in this limitation.  Claim 2 depends from claim 1, and claim 1 recites “wherein the signal wave is converted, using a direct Fourier transfer, to a plurality of sine waves having a plurality of frequencies and a plurality of amplitudes”.  Since “a plurality of amplitudes” in claim 1 is derived from the signal wave, it does not provide a sufficient antecedent basis for “the plurality of amplitudes” in claim 2 that are used to generate the signal wave.  For examination purposes, the term “the plurality of amplitudes” will be interpreted as referring to different amplitudes than the amplitudes referred to as “a plurality of amplitudes” in claim 1.  Also, the claim 2 limitation “converting the multimedia data into the plurality of amplitudes using an auto decoder neural network” is indefinite because it is not clear how to apply an auto decoder to process multimedia data directly.  For examination purposes, “converting the multimedia data into the plurality of amplitudes using an auto decoder neural network” will be interpreted as using an auto decoder neural network to derive amplitudes from data that has been derived from the multimedia data using an auto encoder neural network.
Claim 9 recites the limitation “generating the signal wave from the plurality of amplitudes”.  There is insufficient antecedent basis for “the plurality of amplitudes” in this limitation.  Claim 9 depends from claim 8, and claim 8 recites “wherein the signal wave is converted, using a direct Fourier transfer, to a plurality of sine waves having a plurality of frequencies and a plurality of amplitudes”.  Since “a plurality of amplitudes” in claim 8 is derived from the signal wave, it does not provide a sufficient antecedent basis for “the plurality of amplitudes” in claim 9 that are used to generate the signal wave.  For examination purposes, the term “the plurality of amplitudes” will be interpreted as referring to different amplitudes than the amplitudes referred to as “a plurality of amplitudes” in claim 8.  Also, the claim 9 limitation “converting the multimedia data into the plurality of amplitudes using an auto decoder neural network” is indefinite because it is not clear how to apply an auto decoder to process multimedia data directly.  For examination purposes, “converting the multimedia data into the plurality of amplitudes using an auto decoder neural network” will be interpreted as using an auto decoder neural network to derive amplitudes from data that has been derived from the multimedia data using an auto encoder neural network.
Claim 16 recites the limitation “generate the signal wave from the plurality of amplitudes”.  There is insufficient antecedent basis for “the plurality of amplitudes” in this limitation.  Claim 16 depends from claim 15, and claim 15 recites “wherein the signal wave is converted, using a direct Fourier transfer, to a plurality of sine waves having a plurality of frequencies and a plurality of amplitudes”.  Since “a plurality of amplitudes” in claim 15 is derived from the signal wave, it does not provide a sufficient antecedent basis for “the plurality of amplitudes” in claim 16 that are used to generate the signal wave.  For examination purposes, the term “the plurality of amplitudes” will be interpreted as referring to different amplitudes than the amplitudes referred to as “a plurality of amplitudes” in claim 15.  Also, the claim 16 limitation “convert the multimedia data into the plurality of amplitudes using an auto decoder neural network” is indefinite because it is not clear how to apply an auto decoder to process multimedia data directly.  For examination purposes, “converting the multimedia data into the plurality of amplitudes using an auto decoder neural network” will be interpreted as using an auto decoder neural network to derive amplitudes from data that has been derived from the multimedia data using an auto encoder neural network.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 8 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Moniz (US Patent No. 10,482,885) in view of Taubman et al. (US Patent No. 9,529,793), hereinafter Taubman.
Regarding claim 1, Moniz discloses a processor-implemented method for anaphora disambiguation, the method comprising:
receiving multimedia data, wherein the multimedia data comprises a plurality of frames (Column 12, lines 5-11, "The AFE 256 may reduce noise in the audio data 111 and divide the digitized audio data 111 into frames representing time intervals for which the AFE 256 determines a number of values (i.e., features) representing qualities of the audio data 111, along with a set of those values (i.e., a feature vector or audio feature vector) representing features/qualities of the audio data 111 within each frame.");
identifying a frame from the plurality of frames having a pronoun (Column 23, lines 63-67, "An NLU component, such as an NER component 262, may identify word(s) that correspond to an entity that is not explicitly mentioned in the utterance.  Such word(s) may correspond to anaphora, exophora, or the like."; Anaphora reads on a pronoun.);
identifying a topic of the frame using a deep neural network (Column 25, lines 39-49, "Various machine learning techniques may be used to perform the training of the anaphora resolver 710 or other components. Models may be trained and operated according to various machine learning techniques. Such techniques may include, for example, inference engines, trained classifiers, etc. Examples of trained classifiers include conditional random fields (CRF) classifiers, Support Vector Machines (SVMs), neural networks (such as deep neural networks and/or recurrent neural networks), decision trees, AdaBoost (short for “Adaptive Boosting”) combined with decision trees, and random forests."; Column 26, lines 22-28, "The anaphora resolver 710 may consider data representing a number of different features to determine which entity should be matched with this word. The context data 604, other data 791, information from the knowledge base 272, user profile 404, NLU data 702, may include any of the data discussed above, which may be considered when determining the appropriate entity."; The context data reads on the topic.);
searching for a frame in a media repository having a highest correlation coefficient with the frame, wherein the frame from the media repository comprises a bag of objects (Column 23, line 61 - Column 24, line 4, "FIG. 7 illustrates an anaphora resolver component 710, which may be incorporated into the NLU module 260 or may be located separately. An NLU component, such as an NER component 262, may identify word(s) that correspond to an entity that is not explicitly mentioned in the utterance. Such word(s) may correspond to anaphora, exophora, or the like. Data 702 including indication(s) of the word(s) may be sent to the anaphora resolver 710. The data 702 may be in the form of an N-best list, for example a post-recognizer N-best list 340, post cross-domain N-best list 360, or other N-best."; Column 26, lines 13-21, "The anaphora resolver 710 may thus associate a word in an utterance with an entity from a previous utterance, media content, or otherwise, using a combination of potential techniques. As part of the determination of which entity to associate with a word, the anaphora resolver 710 may also rank the possible entities to determine which of the various choices is the appropriate entity to match with the ambiguous word(s) and thus complete the population of the post-NLU fields."; The entities read on the objects, the media content reads on the media repository, and ranking the possible entities to choose the appropriate entity reads on searching for a frame having the highest correlation coefficient.);
and resolving the anaphora disambiguation by substituting the pronoun with an object from the bag of objects (Column 26, lines 29-32, "Once an entity has been identified as corresponding to the word(s) of the incoming utterance, the anaphora resolver 710 may output an indicator of the entity (such as an entity ID) in a format useful by a downstream component.").
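For illustration only, the claimed step of searching a media repository for the frame having the highest correlation coefficient, and then substituting the pronoun with an object from that frame's bag of objects, might be sketched as follows. This is a minimal sketch under stated assumptions: it assumes each frame is represented by a numeric feature vector and that the correlation coefficient is a Pearson coefficient, neither of which is confirmed by the claims or by Moniz. The names `pearson`, `best_matching_frame`, `features`, and `objects`, and all data values, are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant vector has no defined correlation; treat as 0
    return cov / math.sqrt(var_a * var_b)


def best_matching_frame(query_features, repository):
    """Return the repository frame whose features correlate most with the query."""
    return max(repository, key=lambda frame: pearson(query_features, frame["features"]))


# Hypothetical repository: each frame carries a feature vector and a bag of objects.
repository = [
    {"features": [2.0, 4.0, 6.0, 8.5], "objects": ["dog", "park"]},
    {"features": [4.0, 3.0, 2.0, 1.0], "objects": ["car", "road"]},
    {"features": [1.0, 1.0, 2.0, 1.0], "objects": ["cup", "table"]},
]
query = [1.0, 2.0, 3.0, 4.0]  # hypothetical features of the frame containing the pronoun

best = best_matching_frame(query, repository)
# Substitute the pronoun with an object from the best frame's bag of objects.
resolved = "Where is it?".replace("it", "the " + best["objects"][0])
```

The sketch takes the first object in the bag as the substitute; how a particular object is chosen from the bag is not specified here and is an assumption of the example.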
Moniz does not specifically disclose:
converting the multimedia data into a signal wave, wherein the signal wave is converted, using a direct Fourier transfer, to a plurality of sine waves having a plurality of frequencies and a plurality of amplitudes.
Taubman teaches:
converting the multimedia data into a signal wave, wherein the signal wave is converted, using a direct Fourier transfer, to a plurality of sine waves having a plurality of frequencies and a plurality of amplitudes (Column 6, lines 53-67, "Acoustic parameters module 308 analyzes the audio signal waveform that corresponds to voice query 304 with techniques that analyze audio signals, for example, conventional techniques such as Fast Fourier Transforms (FFTs). From the analysis, the acoustic parameters describing the voice query are identified. For example, the volume of the audio signal waveform can be determined from the amplitudes of the audio signal waveform. The frequency can be determined from the number of oscillations in the audio signal waveform in a period of time. The pitch can be determined from the frequency that describes the audio signal waveform. Other acoustic parameters can be determined by performing other appropriate mathematical analysis of the audio signal waveform to provide relevant data for the audio signal waveform.").
Taubman teaches using a Fast Fourier Transform to analyze the frequency content of an audio signal waveform in order to resolve ambiguous pronouns in a voice query by associating a concept with the pronoun based on the acoustic parameters (Column 4, lines 1-7, "When a voice query containing a pronoun is received by the search system 112, the pronoun resolution system 122 resolves the ambiguous pronoun by associating a concept with the pronoun based on the acoustic parameters of the received voice query, as described in more detail below with reference to FIG. 4. A concept can be a noun or subject that is referenced by the ambiguous pronoun.").
Moniz and Taubman are considered to be analogous to the claimed invention because they are in the same field of resolving anaphora in natural language processing.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Moniz to incorporate the teachings of Taubman to use a Fast Fourier Transform to analyze the frequency content of an audio signal waveform.  Doing so would allow for resolving ambiguous pronouns in a voice query by associating a concept with the pronoun based on the acoustic parameters.
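For illustration only, the conversion of a signal wave into a plurality of sine waves having frequencies and amplitudes might be sketched as follows. This is a minimal sketch, not the implementation of the applicant or of Taubman: it computes a direct discrete Fourier transform in plain Python, whereas Taubman refers to Fast Fourier Transforms, which compute the same result more efficiently. The function name `dft_components`, the sample rate, and the test signal are hypothetical.

```python
import cmath
import math


def dft_components(signal, sample_rate):
    """Direct O(n^2) discrete Fourier transform of a real-valued signal.

    Returns (frequency_hz, amplitude) pairs for frequency bins 0 through n // 2.
    """
    n = len(signal)
    components = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(signal)
        )
        # Double the interior bins so a unit-amplitude sine reports amplitude 1.0;
        # the DC bin and (for even n) the Nyquist bin are not doubled.
        scale = 1.0 if k == 0 or (n % 2 == 0 and k == n // 2) else 2.0
        components.append((k * sample_rate / n, scale * abs(coeff) / n))
    return components


SAMPLE_RATE = 64  # hypothetical: 64 samples covering one second
signal = [
    0.5 * math.sin(2 * math.pi * 5 * i / SAMPLE_RATE)
    + 0.25 * math.sin(2 * math.pi * 12 * i / SAMPLE_RATE)
    for i in range(SAMPLE_RATE)
]

components = dft_components(signal, SAMPLE_RATE)
# The two strongest sinusoids recovered from the signal wave.
dominant = sorted(components, key=lambda c: c[1], reverse=True)[:2]
```

In this example the two strongest components correspond to the 5 Hz and 12 Hz sine waves used to build the test signal, with amplitudes of approximately 0.5 and 0.25.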
Regarding claim 8, Moniz discloses a computer system for anaphora disambiguation, the computer system comprising:
one or more processors, one or more computer-readable memories, one or more computer-readable tangible storage medium, and program instructions stored on at least one of the one or more tangible storage medium for execution by at least one of the one or more processors via at least one of the one or more memories (Column 13, lines 1-3, “The device performing NLU processing 260 (e.g., server 120) may include various components, including potentially dedicated processor(s), memory, storage, etc.”),
wherein the computer system is capable of performing a method comprising:
receiving multimedia data, wherein the multimedia data comprises a plurality of frames (Column 12, lines 5-11, "The AFE 256 may reduce noise in the audio data 111 and divide the digitized audio data 111 into frames representing time intervals for which the AFE 256 determines a number of values (i.e., features) representing qualities of the audio data 111, along with a set of those values (i.e., a feature vector or audio feature vector) representing features/qualities of the audio data 111 within each frame.");
identifying a frame from the plurality of frames having a pronoun (Column 23, lines 63-67, "An NLU component, such as an NER component 262, may identify word(s) that correspond to an entity that is not explicitly mentioned in the utterance.  Such word(s) may correspond to anaphora, exophora, or the like."; Anaphora reads on a pronoun.);
identifying a topic of the frame using a deep neural network (Column 25, lines 39-49, "Various machine learning techniques may be used to perform the training of the anaphora resolver 710 or other components. Models may be trained and operated according to various machine learning techniques. Such techniques may include, for example, inference engines, trained classifiers, etc. Examples of trained classifiers include conditional random fields (CRF) classifiers, Support Vector Machines (SVMs), neural networks (such as deep neural networks and/or recurrent neural networks), decision trees, AdaBoost (short for “Adaptive Boosting”) combined with decision trees, and random forests."; Column 26, lines 22-28, "The anaphora resolver 710 may consider data representing a number of different features to determine which entity should be matched with this word. The context data 604, other data 791, information from the knowledge base 272, user profile 404, NLU data 702, may include any of the data discussed above, which may be considered when determining the appropriate entity."; The context data reads on the topic.);
searching for a frame in a media repository having a highest correlation coefficient with the frame, wherein the frame from the media repository comprises a bag of objects (Column 23, line 61 - Column 24, line 4, "FIG. 7 illustrates an anaphora resolver component 710, which may be incorporated into the NLU module 260 or may be located separately. An NLU component, such as an NER component 262, may identify word(s) that correspond to an entity that is not explicitly mentioned in the utterance. Such word(s) may correspond to anaphora, exophora, or the like. Data 702 including indication(s) of the word(s) may be sent to the anaphora resolver 710. The data 702 may be in the form of an N-best list, for example a post-recognizer N-best list 340, post cross-domain N-best list 360, or other N-best."; Column 26, lines 13-21, "The anaphora resolver 710 may thus associate a word in an utterance with an entity from a previous utterance, media content, or otherwise, using a combination of potential techniques. As part of the determination of which entity to associate with a word, the anaphora resolver 710 may also rank the possible entities to determine which of the various choices is the appropriate entity to match with the ambiguous word(s) and thus complete the population of the post-NLU fields."; The entities read on the objects, the media content reads on the media repository, and ranking the possible entities to choose the appropriate entity reads on searching for a frame having the highest correlation coefficient.);
and resolving the anaphora disambiguation by substituting the pronoun with an object from the bag of objects (Column 26, lines 29-32, "Once an entity has been identified as corresponding to the word(s) of the incoming utterance, the anaphora resolver 710 may output an indicator of the entity (such as an entity ID) in a format useful by a downstream component.").
Moniz does not specifically disclose:
converting the multimedia data into a signal wave, wherein the signal wave is converted, using a direct Fourier transfer, to a plurality of sine waves having a plurality of frequencies and a plurality of amplitudes.
Taubman teaches:
converting the multimedia data into a signal wave, wherein the signal wave is converted, using a direct Fourier transfer, to a plurality of sine waves having a plurality of frequencies and a plurality of amplitudes (Column 6, lines 53-67, "Acoustic parameters module 308 analyzes the audio signal waveform that corresponds to voice query 304 with techniques that analyze audio signals, for example, conventional techniques such as Fast Fourier Transforms (FFTs). From the analysis, the acoustic parameters describing the voice query are identified. For example, the volume of the audio signal waveform can be determined from the amplitudes of the audio signal waveform. The frequency can be determined from the number of oscillations in the audio signal waveform in a period of time. The pitch can be determined from the frequency that describes the audio signal waveform. Other acoustic parameters can be determined by performing other appropriate mathematical analysis of the audio signal waveform to provide relevant data for the audio signal waveform.").
Taubman teaches using a Fast Fourier Transform to analyze the frequency content of an audio signal waveform in order to resolve ambiguous pronouns in a voice query by associating a concept with the pronoun based on the acoustic parameters (Column 4, lines 1-7, "When a voice query containing a pronoun is received by the search system 112, the pronoun resolution system 122 resolves the ambiguous pronoun by associating a concept with the pronoun based on the acoustic parameters of the received voice query, as described in more detail below with reference to FIG. 4. A concept can be a noun or subject that is referenced by the ambiguous pronoun.").
Moniz and Taubman are considered to be analogous to the claimed invention because they are in the same field of resolving anaphora in natural language processing.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Moniz to incorporate the teachings of Taubman to use a Fast Fourier Transform to analyze the frequency content of an audio signal waveform.  Doing so would allow for resolving ambiguous pronouns in a voice query by associating a concept with the pronoun based on the acoustic parameters.
Regarding claim 15, Moniz discloses a computer program product for anaphora disambiguation, the computer program product comprising:
one or more computer-readable tangible storage medium and program instructions stored on at least one of the one or more tangible storage medium, the program instructions executable by a processor (Column 29, lines 1-7, “Aspects of the disclosed system may be implemented as a computer method or as an article of manufacture such as a memory device or non-transitory computer readable storage medium. The computer readable storage medium may be readable by a computer and may comprise instructions for causing a computer or other device to perform processes described in the present disclosure.”),
the program instructions comprising:
program instructions to receive multimedia data, wherein the multimedia data comprises a plurality of frames (Column 12, lines 5-11, "The AFE 256 may reduce noise in the audio data 111 and divide the digitized audio data 111 into frames representing time intervals for which the AFE 256 determines a number of values (i.e., features) representing qualities of the audio data 111, along with a set of those values (i.e., a feature vector or audio feature vector) representing features/qualities of the audio data 111 within each frame.");
program instructions to identify a frame from the plurality of frames having a pronoun (Column 23, lines 63-67, "An NLU component, such as an NER component 262, may identify word(s) that correspond to an entity that is not explicitly mentioned in the utterance.  Such word(s) may correspond to anaphora, exophora, or the like."; Anaphora reads on a pronoun.);
program instructions to identify a topic of the frame using a deep neural network (Column 25, lines 39-49, "Various machine learning techniques may be used to perform the training of the anaphora resolver 710 or other components. Models may be trained and operated according to various machine learning techniques. Such techniques may include, for example, inference engines, trained classifiers, etc. Examples of trained classifiers include conditional random fields (CRF) classifiers, Support Vector Machines (SVMs), neural networks (such as deep neural networks and/or recurrent neural networks), decision trees, AdaBoost (short for “Adaptive Boosting”) combined with decision trees, and random forests."; Column 26, lines 22-28, "The anaphora resolver 710 may consider data representing a number of different features to determine which entity should be matched with this word. The context data 604, other data 791, information from the knowledge base 272, user profile 404, NLU data 702, may include any of the data discussed above, which may be considered when determining the appropriate entity."; The context data reads on the topic.);
program instructions to search for a frame in a media repository having a highest correlation coefficient with the frame, wherein the frame from the media repository comprises a bag of objects (Column 23, line 61 - Column 24, line 4, "FIG. 7 illustrates an anaphora resolver component 710, which may be incorporated into the NLU module 260 or may be located separately. An NLU component, such as an NER component 262, may identify word(s) that correspond to an entity that is not explicitly mentioned in the utterance. Such word(s) may correspond to anaphora, exophora, or the like. Data 702 including indication(s) of the word(s) may be sent to the anaphora resolver 710. The data 702 may be in the form of an N-best list, for example a post-recognizer N-best list 340, post cross-domain N-best list 360, or other N-best."; Column 26, lines 13-21, "The anaphora resolver 710 may thus associate a word in an utterance with an entity from a previous utterance, media content, or otherwise, using a combination of potential techniques. As part of the determination of which entity to associate with a word, the anaphora resolver 710 may also rank the possible entities to determine which of the various choices is the appropriate entity to match with the ambiguous word(s) and thus complete the population of the post-NLU fields."; The entities read on the objects, the media content reads on the media repository, and ranking the possible entities to choose the appropriate entity reads on searching for a frame having the highest correlation coefficient.);
and program instructions to resolve the anaphora disambiguation by substituting the pronoun with an object from the bag of objects (Column 26, lines 29-32, "Once an entity has been identified as corresponding to the word(s) of the incoming utterance, the anaphora resolver 710 may output an indicator of the entity (such as an entity ID) in a format useful by a downstream component.").
Moniz does not specifically disclose:
program instructions to convert the multimedia data into a signal wave, wherein the signal wave is converted, using a direct Fourier transfer, to a plurality of sine waves having a plurality of frequencies and a plurality of amplitudes.
Taubman teaches:
program instructions to convert the multimedia data into a signal wave, wherein the signal wave is converted, using a direct Fourier transfer, to a plurality of sine waves having a plurality of frequencies and a plurality of amplitudes (Column 6, lines 53-67, "Acoustic parameters module 308 analyzes the audio signal waveform that corresponds to voice query 304 with techniques that analyze audio signals, for example, conventional techniques such as Fast Fourier Transforms (FFTs). From the analysis, the acoustic parameters describing the voice query are identified. For example, the volume of the audio signal waveform can be determined from the amplitudes of the audio signal waveform. The frequency can be determined from the number of oscillations in the audio signal waveform in a period of time. The pitch can be determined from the frequency that describes the audio signal waveform. Other acoustic parameters can be determined by performing other appropriate mathematical analysis of the audio signal waveform to provide relevant data for the audio signal waveform.").
Taubman teaches using a Fast Fourier Transform to analyze the frequency content of an audio signal waveform in order to resolve ambiguous pronouns in a voice query by associating a concept with the pronoun based on the acoustic parameters (Column 4, lines 1-7, "When a voice query containing a pronoun is received by the search system 112, the pronoun resolution system 122 resolves the ambiguous pronoun by associating a concept with the pronoun based on the acoustic parameters of the received voice query, as described in more detail below with reference to FIG. 4. A concept can be a noun or subject that is referenced by the ambiguous pronoun.").
Moniz and Taubman are considered to be analogous to the claimed invention because they are in the same field of resolving anaphora in natural language processing.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Moniz to incorporate the teachings of Taubman to use a Fast Fourier Transform to analyze the frequency content of an audio signal waveform.  Doing so would allow for resolving ambiguous pronouns in a voice query by associating a concept with the pronoun based on the acoustic parameters.
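For illustration only, the kind of Fourier analysis quoted from Taubman above (obtaining frequencies and amplitudes from an audio signal waveform) can be sketched as follows. The sketch evaluates the discrete Fourier transform directly rather than with an FFT, which yields the same coefficients more slowly. The function names, the 400 Hz sample rate, and the test tone are assumptions made for this example and do not appear in Moniz, Taubman, or the application.

```python
import cmath
import math


def dft_components(samples, sample_rate):
    """Return (frequency_hz, amplitude) pairs for bins 0..N/2 of a real signal."""
    n = len(samples)
    components = []
    for k in range(n // 2 + 1):
        # Direct O(N^2) evaluation of the k-th DFT coefficient (not an FFT).
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(samples)
        )
        # Scale so a sinusoid of amplitude A reports A; the DC and Nyquist
        # bins have no mirrored counterpart and are therefore not doubled.
        scale = 1.0 if k == 0 or (n % 2 == 0 and k == n // 2) else 2.0
        components.append((k * sample_rate / n, scale * abs(coeff) / n))
    return components


def dominant_component(samples, sample_rate):
    """Return the (frequency_hz, amplitude) pair with the largest amplitude."""
    return max(dft_components(samples, sample_rate), key=lambda fa: fa[1])


# Hypothetical test signal: a 50 Hz sine of amplitude 0.5, 64 samples at 400 Hz.
SAMPLE_RATE = 400
signal = [0.5 * math.sin(2 * math.pi * 50 * t / SAMPLE_RATE) for t in range(64)]
freq, amp = dominant_component(signal, SAMPLE_RATE)
print(f"dominant component: {freq:.1f} Hz, amplitude {amp:.3f}")
```

Because 50 Hz falls exactly on a DFT bin for this sample rate and length, the sketch recovers a single sine component; a tone between bins would spread its energy across neighboring bins.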
Claims 2, 9 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Moniz in view of Taubman, and further in view of Shen et al. (“Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions”), hereinafter Shen.

Regarding claim 2, as best understood based on the 35 U.S.C. 112(a) and 112(b) issues identified above, Moniz in view of Taubman discloses the method as claimed in claim 1, but does not specifically disclose: wherein converting the multimedia data into the signal wave comprises: converting the multimedia data into the plurality of amplitudes using an auto decoder neural network; and generating the signal wave from the plurality of amplitudes based on a timeframe of each of the plurality of amplitudes.
Shen teaches:
wherein converting the multimedia data into the signal wave comprises:
converting the multimedia data into the plurality of amplitudes using an auto decoder neural network (Section 2.2, lines 35-37, "The decoder is an autoregressive recurrent neural network which predicts a mel spectrogram from the encoded input sequence one frame at a time."; The spectrogram reads on the amplitudes, as the spectrogram is a representation of the frequencies and amplitudes of a signal over time.);
and generating the signal wave from the plurality of amplitudes based on a timeframe of each of the plurality of amplitudes (Section 2.3, lines 1-3, "We use a modified version of the WaveNet architecture from [8] to invert the mel spectrogram feature representation into time-domain waveform samples.").
Shen teaches generating a spectrogram for an input sequence and generating a time-domain waveform from the spectrogram in order to simplify speech synthesis using a neural network trained only on sequences of characters (Abstract, lines 1-6, "This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms."; Section 1, lines 19-23, "Tacotron [12], a sequence-to-sequence architecture [13] for producing magnitude spectrograms from a sequence of characters, simplifies the traditional speech synthesis pipeline by replacing the production of these linguistic and acoustic features with a single neural network trained from data alone.").
Moniz, Taubman, and Shen are considered to be analogous to the claimed invention because they are in the same field of natural language processing.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Moniz in view of Taubman to incorporate the teachings of Shen to generate a spectrogram for an input sequence and generate a time-domain waveform from the spectrogram.  Doing so would allow for simplifying speech synthesis using a neural network trained only on sequences of characters.
Regarding claim 9, as best understood based on the 35 U.S.C. 112(a) and 112(b) issues identified above, Moniz in view of Taubman discloses the computer system as claimed in claim 8, but does not specifically disclose: wherein converting the multimedia data into the signal wave comprises: converting the multimedia data into the plurality of amplitudes using an auto decoder neural network; and generating the signal wave from the plurality of amplitudes based on a timeframe of each of the plurality of amplitudes.
Shen teaches:
wherein converting the multimedia data into the signal wave comprises:
converting the multimedia data into the plurality of amplitudes using an auto decoder neural network (Section 2.2, lines 35-37, "The decoder is an autoregressive recurrent neural network which predicts a mel spectrogram from the encoded input sequence one frame at a time."; The spectrogram reads on the amplitudes, as the spectrogram is a representation of the frequencies and amplitudes of a signal over time.);
and generating the signal wave from the plurality of amplitudes based on a timeframe of each of the plurality of amplitudes (Section 2.3, lines 1-3, "We use a modified version of the WaveNet architecture from [8] to invert the mel spectrogram feature representation into time-domain waveform samples.").
Shen teaches generating a spectrogram for an input sequence and generating a time-domain waveform from the spectrogram in order to simplify speech synthesis using a neural network trained only on sequences of characters (Abstract, lines 1-6, "This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms."; Section 1, lines 19-23, "Tacotron [12], a sequence-to-sequence architecture [13] for producing magnitude spectrograms from a sequence of characters, simplifies the traditional speech synthesis pipeline by replacing the production of these linguistic and acoustic features with a single neural network trained from data alone.").
Moniz, Taubman, and Shen are considered to be analogous to the claimed invention because they are in the same field of natural language processing.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Moniz in view of Taubman to incorporate the teachings of Shen to generate a spectrogram for an input sequence and generate a time-domain waveform from the spectrogram.  Doing so would allow for simplifying speech synthesis using a neural network trained only on sequences of characters.
Regarding claim 16, as best understood based on the 35 U.S.C. 112(a) and 112(b) issues identified above, Moniz in view of Taubman discloses the computer program product as claimed in claim 15, but does not specifically disclose: wherein program instructions to convert the multimedia data into the signal wave comprises: program instructions to convert the multimedia data into the plurality of amplitudes using an auto decoder neural network; and program instructions to generate the signal wave from the plurality of amplitudes based on a timeframe of each of the plurality of amplitudes.
Shen teaches:
wherein program instructions to convert the multimedia data into the signal wave comprises: program instructions to convert the multimedia data into the plurality of amplitudes using an auto decoder neural network (Section 2.2, lines 35-37, "The decoder is an autoregressive recurrent neural network which predicts a mel spectrogram from the encoded input sequence one frame at a time."; The spectrogram reads on the amplitudes, as the spectrogram is a representation of the frequencies and amplitudes of a signal over time.);
and program instructions to generate the signal wave from the plurality of amplitudes based on a timeframe of each of the plurality of amplitudes (Section 2.3, lines 1-3, "We use a modified version of the WaveNet architecture from [8] to invert the mel spectrogram feature representation into time-domain waveform samples.").
Shen teaches generating a spectrogram for an input sequence and generating a time-domain waveform from the spectrogram in order to simplify speech synthesis using a neural network trained only on sequences of characters (Abstract, lines 1-6, "This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms."; Section 1, lines 19-23, "Tacotron [12], a sequence-to-sequence architecture [13] for producing magnitude spectrograms from a sequence of characters, simplifies the traditional speech synthesis pipeline by replacing the production of these linguistic and acoustic features with a single neural network trained from data alone.").
Moniz, Taubman, and Shen are considered to be analogous to the claimed invention because they are in the same field of natural language processing.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Moniz in view of Taubman to incorporate the teachings of Shen to generate a spectrogram for an input sequence and generate a time-domain waveform from the spectrogram.  Doing so would allow for simplifying speech synthesis using a neural network trained only on sequences of characters.
Claims 3 – 4, 10 – 11 and 17 – 18 are rejected under 35 U.S.C. 103 as being unpatentable over Moniz in view of Taubman, and further in view of Biadsy et al. (US Patent Application Publication No. 2022/0122579), hereinafter Biadsy.
Regarding claim 3, Moniz in view of Taubman discloses the method as claimed in claim 1, but does not specifically disclose: wherein each of the plurality of frames have a duration of time, and wherein the duration of time is determined based on a short-term Fourier transform of the signal wave.
Biadsy teaches:
wherein each of the plurality of frames have a duration of time, and wherein the duration of time is determined based on a short-term Fourier transform of the signal wave (Paragraph 0079, lines 1-8, "The base encoder configuration may be similar to other encoders with some variations discussed below. From an example input speech signal sampled at 16 kHz, the encoder may extract 80-dimensional log-mel spectrogram acoustic feature frames over a range of 125-7600 Hz, calculated using a Hann window, 50 ms frame length, 12.5 ms frame shift, and 1024-point Short-Time Fourier Transform (STFT).").
Biadsy teaches basing the frame time on the Short-Time Fourier Transform size in order to convert speech audio in the voice of a speaker to speech audio in a different voice without performing speech recognition (Paragraph 0005, lines 7-16, "The discussion below describes a process of using a model trained using machine learning to convert speech audio in the voice of a speaker to speech audio in a different voice without performing speech recognition. The model receives the speech audio spoken by the speaker and converts the speech audio to a mathematical representation. The model converts the mathematical representation to speech audio in a different voice without performing speech recognition on the speech audio spoken by the speaker.").
Moniz, Taubman, and Biadsy are considered to be analogous to the claimed invention because they are in the same field of natural language processing.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Moniz in view of Taubman to incorporate the teachings of Biadsy to base the frame time on the Short-Time Fourier Transform size.  Doing so would allow for converting speech audio in the voice of a speaker to speech audio in a different voice without performing speech recognition.  
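For illustration only, the framing parameters quoted from Biadsy above (16 kHz sampling, Hann window, 50 ms frame length, 12.5 ms frame shift, 1024-point STFT) imply the sample counts worked out in the sketch below. The periodic form of the Hann window, the zero-padding of each frame to the transform size, the dropping of any trailing partial frame, and all names are assumptions made for this example; Biadsy does not specify them in the quoted passage.

```python
import math

# Parameters quoted from Biadsy, paragraph 0079.
SAMPLE_RATE_HZ = 16_000
FRAME_LENGTH_MS = 50.0
FRAME_SHIFT_MS = 12.5
FFT_SIZE = 1024

frame_length = int(SAMPLE_RATE_HZ * FRAME_LENGTH_MS / 1000)  # 800 samples
frame_shift = int(SAMPLE_RATE_HZ * FRAME_SHIFT_MS / 1000)    # 200 samples
bin_spacing_hz = SAMPLE_RATE_HZ / FFT_SIZE                   # 15.625 Hz


def hann_window(length):
    """Periodic Hann window (assumed variant)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / length) for i in range(length)]


def split_into_frames(samples):
    """Window each full frame and zero-pad it to FFT_SIZE samples."""
    window = hann_window(frame_length)
    frames = []
    # Assumption: a trailing partial frame is dropped rather than padded.
    for start in range(0, len(samples) - frame_length + 1, frame_shift):
        chunk = samples[start:start + frame_length]
        windowed = [s * w for s, w in zip(chunk, window)]
        frames.append(windowed + [0.0] * (FFT_SIZE - frame_length))
    return frames


# One second of hypothetical audio (silence) is enough to count frames.
one_second = [0.0] * SAMPLE_RATE_HZ
frames = split_into_frames(one_second)
print(frame_length, frame_shift, len(frames), bin_spacing_hz)
```

Under these assumptions each frame spans 800 samples, consecutive frames overlap by 600 samples, and one second of audio yields 77 frames.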
Regarding claim 4, Moniz in view of Taubman discloses the method as claimed in claim 1, but does not specifically disclose: separating the signal wave into frames using a spectrogram approach.
Biadsy teaches:
separating the signal wave into frames using a spectrogram approach (Paragraph 0079, lines 1-8, "The base encoder configuration may be similar to other encoders with some variations discussed below. From an example input speech signal sampled at 16 kHz, the encoder may extract 80-dimensional log-mel spectrogram acoustic feature frames over a range of 125-7600 Hz, calculated using a Hann window, 50 ms frame length, 12.5 ms frame shift, and 1024-point Short-Time Fourier Transform (STFT).").
Biadsy teaches extracting spectrogram acoustic feature frames from an input speech signal in order to convert speech audio in the voice of a speaker to speech audio in a different voice without performing speech recognition (Paragraph 0005, lines 7-16, "The discussion below describes a process of using a model trained using machine learning to convert speech audio in the voice of a speaker to speech audio in a different voice without performing speech recognition. The model receives the speech audio spoken by the speaker and converts the speech audio to a mathematical representation. The model converts the mathematical representation to speech audio in a different voice without performing speech recognition on the speech audio spoken by the speaker.").
Moniz, Taubman, and Biadsy are considered to be analogous to the claimed invention because they are in the same field of natural language processing.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Moniz in view of Taubman to incorporate the teachings of Biadsy to extract spectrogram acoustic feature frames from an input speech signal.  Doing so would allow for converting speech audio in the voice of a speaker to speech audio in a different voice without performing speech recognition.
Regarding claim 10, Moniz in view of Taubman discloses the computer system as claimed in claim 8, but does not specifically disclose: wherein each of the plurality of frames have a duration of time, and wherein the duration of time is determined based on a short-term Fourier transform of the signal wave.
Biadsy teaches:
wherein each of the plurality of frames have a duration of time, and wherein the duration of time is determined based on a short-term Fourier transform of the signal wave (Paragraph 0079, lines 1-8, "The base encoder configuration may be similar to other encoders with some variations discussed below. From an example input speech signal sampled at 16 kHz, the encoder may extract 80-dimensional log-mel spectrogram acoustic feature frames over a range of 125-7600 Hz, calculated using a Hann window, 50 ms frame length, 12.5 ms frame shift, and 1024-point Short-Time Fourier Transform (STFT).").
Biadsy teaches basing the frame time on the Short-Time Fourier Transform size in order to convert speech audio in the voice of a speaker to speech audio in a different voice without performing speech recognition (Paragraph 0005, lines 7-16, "The discussion below describes a process of using a model trained using machine learning to convert speech audio in the voice of a speaker to speech audio in a different voice without performing speech recognition. The model receives the speech audio spoken by the speaker and converts the speech audio to a mathematical representation. The model converts the mathematical representation to speech audio in a different voice without performing speech recognition on the speech audio spoken by the speaker.").
Moniz, Taubman, and Biadsy are considered to be analogous to the claimed invention because they are in the same field of natural language processing.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Moniz in view of Taubman to incorporate the teachings of Biadsy to base the frame time on the Short-Time Fourier Transform size.  Doing so would allow for converting speech audio in the voice of a speaker to speech audio in a different voice without performing speech recognition.  
Regarding claim 11, Moniz in view of Taubman discloses the computer system as claimed in claim 8, but does not specifically disclose: separating the signal wave into frames using a spectrogram approach.
Biadsy teaches:
separating the signal wave into frames using a spectrogram approach (Paragraph 0079, lines 1-8, "The base encoder configuration may be similar to other encoders with some variations discussed below. From an example input speech signal sampled at 16 kHz, the encoder may extract 80-dimensional log-mel spectrogram acoustic feature frames over a range of 125-7600 Hz, calculated using a Hann window, 50 ms frame length, 12.5 ms frame shift, and 1024-point Short-Time Fourier Transform (STFT).").
Biadsy teaches extracting spectrogram acoustic feature frames from an input speech signal in order to convert speech audio in the voice of a speaker to speech audio in a different voice without performing speech recognition (Paragraph 0005, lines 7-16, "The discussion below describes a process of using a model trained using machine learning to convert speech audio in the voice of a speaker to speech audio in a different voice without performing speech recognition. The model receives the speech audio spoken by the speaker and converts the speech audio to a mathematical representation. The model converts the mathematical representation to speech audio in a different voice without performing speech recognition on the speech audio spoken by the speaker.").
Moniz, Taubman, and Biadsy are considered to be analogous to the claimed invention because they are in the same field of natural language processing.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Moniz in view of Taubman to incorporate the teachings of Biadsy to extract spectrogram acoustic feature frames from an input speech signal.  Doing so would allow for converting speech audio in the voice of a speaker to speech audio in a different voice without performing speech recognition.
Regarding claim 17, Moniz in view of Taubman discloses the computer program product as claimed in claim 15, but does not specifically disclose: wherein each of the plurality of frames have a duration of time, and wherein the duration of time is determined based on a short-term Fourier transform of the signal wave.
Biadsy teaches:
wherein each of the plurality of frames have a duration of time, and wherein the duration of time is determined based on a short-term Fourier transform of the signal wave (Paragraph 0079, lines 1-8, "The base encoder configuration may be similar to other encoders with some variations discussed below. From an example input speech signal sampled at 16 kHz, the encoder may extract 80-dimensional log-mel spectrogram acoustic feature frames over a range of 125-7600 Hz, calculated using a Hann window, 50 ms frame length, 12.5 ms frame shift, and 1024-point Short-Time Fourier Transform (STFT).").
Biadsy teaches basing the frame time on the Short-Time Fourier Transform size in order to convert speech audio in the voice of a speaker to speech audio in a different voice without performing speech recognition (Paragraph 0005, lines 7-16, "The discussion below describes a process of using a model trained using machine learning to convert speech audio in the voice of a speaker to speech audio in a different voice without performing speech recognition. The model receives the speech audio spoken by the speaker and converts the speech audio to a mathematical representation. The model converts the mathematical representation to speech audio in a different voice without performing speech recognition on the speech audio spoken by the speaker.").
Moniz, Taubman, and Biadsy are considered to be analogous to the claimed invention because they are in the same field of natural language processing.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Moniz in view of Taubman to incorporate the teachings of Biadsy to base the frame time on the Short-Time Fourier Transform size.  Doing so would allow for converting speech audio in the voice of a speaker to speech audio in a different voice without performing speech recognition.  
Regarding claim 18, Moniz in view of Taubman discloses the computer program product as claimed in claim 15, but does not specifically disclose: program instructions to separate the signal wave into frames using a spectrogram approach.
Biadsy teaches:
program instructions to separate the signal wave into frames using a spectrogram approach (Paragraph 0079, lines 1-8, "The base encoder configuration may be similar to other encoders with some variations discussed below. From an example input speech signal sampled at 16 kHz, the encoder may extract 80-dimensional log-mel spectrogram acoustic feature frames over a range of 125-7600 Hz, calculated using a Hann window, 50 ms frame length, 12.5 ms frame shift, and 1024-point Short-Time Fourier Transform (STFT).").
Biadsy teaches extracting spectrogram acoustic feature frames from an input speech signal in order to convert speech audio in the voice of a speaker to speech audio in a different voice without performing speech recognition (Paragraph 0005, lines 7-16, "The discussion below describes a process of using a model trained using machine learning to convert speech audio in the voice of a speaker to speech audio in a different voice without performing speech recognition. The model receives the speech audio spoken by the speaker and converts the speech audio to a mathematical representation. The model converts the mathematical representation to speech audio in a different voice without performing speech recognition on the speech audio spoken by the speaker.").
Moniz, Taubman, and Biadsy are considered to be analogous to the claimed invention because they are in the same field of natural language processing.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Moniz in view of Taubman to incorporate the teachings of Biadsy to extract spectrogram acoustic feature frames from an input speech signal.  Doing so would allow for converting speech audio in the voice of a speaker to speech audio in a different voice without performing speech recognition.
Claims 5, 12 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Moniz in view of Taubman and Biadsy, and further in view of Lee (US Patent No. 11,323,835).
Regarding claim 5, Moniz in view of Taubman and Biadsy discloses the method as claimed in claim 4, but does not specifically disclose: wherein the highest correlation coefficient is based on the spectrogram approach.
Lee teaches:
wherein the highest correlation coefficient is based on the spectrogram approach (Column 18, lines 52-62, "When at least one specific signal for inspecting performance of the speaker or the microphone is detected from the sound signal, the AI device 100 may acquire a first spectrum for the sound signal and a second spectrum for the feedback signal (S115: YES, S120). In various embodiments of the present disclosure, the data to be calculated for a cross-correlation coefficient is not limited to the spectrum, and may be similarly implemented using a spectrogram. The spectrogram is a tool for visualizing and grasping sound or waves, and a combination of waveform and spectrum characteristics.").
Lee teaches calculating a correlation coefficient using data from a spectrogram in order to provide a method for self-inspecting the performance of a sound input/output device (Column 1, lines 41-44, "In addition, an object of the present disclosure is to implement a method of inspecting a sound input/output device capable of self-inspecting the performance of the sound input/output device.").
Moniz, Taubman, Biadsy, and Lee are considered to be analogous to the claimed invention because they are in the same field of natural language processing.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Moniz in view of Taubman and Biadsy to incorporate the teachings of Lee to calculate a correlation coefficient using data from a spectrogram.  Doing so would allow for providing a method for self-inspecting the performance of a sound input/output device.
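For illustration only, selecting a stored frame by highest correlation coefficient over spectrogram data can be sketched as follows. The use of the Pearson coefficient, the representation of each frame as a short list of spectral magnitudes, and all names and values are assumptions made for this example; Lee refers to a cross-correlation coefficient without the quoted passage fixing a formula, and neither Lee nor the application specifies this implementation.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length magnitude vectors."""
    if len(a) != len(b) or not a:
        raise ValueError("vectors must be non-empty and of equal length")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # A constant vector has no defined correlation; treat it as no match.
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def best_match(query, candidates):
    """Return (name, coefficient) for the candidate most correlated with query."""
    scored = ((name, pearson(query, mags)) for name, mags in candidates.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical spectral magnitudes for one query frame and two stored frames.
query = [0.1, 0.9, 0.3, 0.0]
repository = {
    "frame_a": [0.2, 1.8, 0.6, 0.0],  # same spectral shape, twice as loud
    "frame_b": [0.9, 0.1, 0.0, 0.3],  # different spectral shape
}
name, coeff = best_match(query, repository)
print(name, round(coeff, 3))
```

Because the Pearson coefficient is invariant to overall scaling, the louder copy of the query frame is selected over the frame with a different spectral shape.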
Regarding claim 12, Moniz in view of Taubman and Biadsy discloses the computer system as claimed in claim 11, but does not specifically disclose: wherein the highest correlation coefficient is based on the spectrogram approach.
Lee teaches:
wherein the highest correlation coefficient is based on the spectrogram approach (Column 18, lines 52-62, "When at least one specific signal for inspecting performance of the speaker or the microphone is detected from the sound signal, the AI device 100 may acquire a first spectrum for the sound signal and a second spectrum for the feedback signal (S115: YES, S120). In various embodiments of the present disclosure, the data to be calculated for a cross-correlation coefficient is not limited to the spectrum, and may be similarly implemented using a spectrogram. The spectrogram is a tool for visualizing and grasping sound or waves, and a combination of waveform and spectrum characteristics.").
Lee teaches calculating a correlation coefficient using data from a spectrogram in order to provide a method for self-inspecting the performance of a sound input/output device (Column 1, lines 41-44, "In addition, an object of the present disclosure is to implement a method of inspecting a sound input/output device capable of self-inspecting the performance of the sound input/output device.").
Moniz, Taubman, Biadsy, and Lee are considered to be analogous to the claimed invention because they are in the same field of natural language processing.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Moniz in view of Taubman and Biadsy to incorporate the teachings of Lee to calculate a correlation coefficient using data from a spectrogram.  Doing so would allow for providing a method for self-inspecting the performance of a sound input/output device.
Regarding claim 19, Moniz in view of Taubman and Biadsy discloses the computer program product as claimed in claim 18, but does not specifically disclose: wherein the highest correlation coefficient is based on the spectrogram approach.
Lee teaches:
wherein the highest correlation coefficient is based on the spectrogram approach (Column 18, lines 52-62, "When at least one specific signal for inspecting performance of the speaker or the microphone is detected from the sound signal, the AI device 100 may acquire a first spectrum for the sound signal and a second spectrum for the feedback signal (S115: YES, S120). In various embodiments of the present disclosure, the data to be calculated for a cross-correlation coefficient is not limited to the spectrum, and may be similarly implemented using a spectrogram. The spectrogram is a tool for visualizing and grasping sound or waves, and a combination of waveform and spectrum characteristics.").
Lee teaches calculating a correlation coefficient using data from a spectrogram in order to provide a method for self-inspecting the performance of a sound input/output device (Column 1, lines 41-44, "In addition, an object of the present disclosure is to implement a method of inspecting a sound input/output device capable of self-inspecting the performance of the sound input/output device.").
Moniz, Taubman, Biadsy, and Lee are considered to be analogous to the claimed invention because they are in the same field of natural language processing.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Moniz in view of Taubman and Biadsy to incorporate the teachings of Lee to calculate a correlation coefficient using data from a spectrogram.  Doing so would allow for providing a method for self-inspecting the performance of a sound input/output device.
Claims 6, 13 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Moniz in view of Taubman, and further in view of Laxman et al. (US Patent Application Publication No. 2021/0142291), hereinafter Laxman.
Regarding claim 6, Moniz in view of Taubman discloses the method as claimed in claim 1, but does not specifically disclose: identifying a label of the frame using deep neural network.
Laxman teaches:
identifying a label of the frame using deep neural network (Paragraph 0054, lines 7-11, "The tokens in a dialog that constitute a frame (and/or sub-frame) are assigned the label obtained by the concatenation of the frame-type (or slot-type) and its associated intent (if one exists)."; Paragraph 0070, lines 1-8, "FIG. 14 illustrates an example process for implementing a hybrid neural model for a conversational AI first solution that successfully combines goal-orientation and chat-bots, according to some embodiments. In step 1402, process 1400 implements a recursive slot-filling for efficient, data driven mixed-initiative semantics. In step 1404, process 1400 implements a deep neural network for response retrieval over growing conversation spaces.").
Laxman teaches assigning labels to frames in order to implement a neural network conversational AI application to automate communications with an improved level of automation and higher quality customer experience (Paragraph 0003, lines 12-20, "We present a Virtual Business Assistant powered by our groundbreaking MIDGO AI technology that automates multi-point communication, helping with not just automating communication with a customer but also effectively coordinating with the business staff and manager/owner regarding that customer. This approach delivers dramatic improvements in the level of automation together with significantly higher quality of customer experience.").
Moniz, Taubman, and Laxman are considered to be analogous to the claimed invention because they are in the same field of natural language processing.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Moniz in view of Taubman to incorporate the teachings of Laxman to assign labels to frames.  Doing so would allow for implementing a neural network conversational AI application to automate communications with an improved level of automation and higher quality customer experience.
Regarding claim 13, Moniz in view of Taubman discloses the computer system as claimed in claim 8, but does not specifically disclose: identifying a label of the frame using deep neural network.
Laxman teaches:
identifying a label of the frame using deep neural network (Paragraph 0054, lines 7-11, "The tokens in a dialog that constitute a frame (and/or sub-frame) are assigned the label obtained by the concatenation of the frame-type (or slot-type) and its associated intent (if one exists)."; Paragraph 0070, lines 1-8, "FIG. 14 illustrates an example process for implementing a hybrid neural model for a conversational AI first solution that successfully combines goal-orientation and chat-bots, according to some embodiments. In step 1402, process 1400 implements a recursive slot-filling for efficient, data driven mixed-initiative semantics. In step 1404, process 1400 implements a deep neural network for response retrieval over growing conversation spaces.").
Laxman teaches assigning labels to frames in order to implement a neural network conversational AI application to automate communications with an improved level of automation and higher quality customer experience (Paragraph 0003, lines 12-20, "We present a Virtual Business Assistant powered by our groundbreaking MIDGO AI technology that automates multi-point communication, helping with not just automating communication with a customer but also effectively coordinating with the business staff and manager/owner regarding that customer. This approach delivers dramatic improvements in the level of automation together with significantly higher quality of customer experience.").
Moniz, Taubman, and Laxman are considered to be analogous to the claimed invention because they are in the same field of natural language processing.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Moniz in view of Taubman to incorporate the teachings of Laxman to assign labels to frames.  Doing so would allow for implementing a neural network conversational AI application to automate communications with an improved level of automation and higher quality customer experience.
Regarding claim 20, Moniz in view of Taubman discloses the computer program product as claimed in claim 15, but does not specifically disclose: program instructions to identify a label of the frame using deep neural network.
Laxman teaches:
program instructions to identify a label of the frame using deep neural network (Paragraph 0054, lines 7-11, "The tokens in a dialog that constitute a frame (and/or sub-frame) are assigned the label obtained by the concatenation of the frame-type (or slot-type) and its associated intent (if one exists)."; Paragraph 0070, lines 1-8, "FIG. 14 illustrates an example process for implementing a hybrid neural model for a conversational AI first solution that successfully combines goal-orientation and chat-bots, according to some embodiments. In step 1402, process 1400 implements a recursive slot-filling for efficient, data driven mixed-initiative semantics. In step 1404, process 1400 implements a deep neural network for response retrieval over growing conversation spaces.").
Laxman teaches assigning labels to frames in order to implement a neural network conversational AI application to automate communications with an improved level of automation and higher quality customer experience (Paragraph 0003, lines 12-20, "We present a Virtual Business Assistant powered by our groundbreaking MIDGO AI technology that automates multi-point communication, helping with not just automating communication with a customer but also effectively coordinating with the business staff and manager/owner regarding that customer. This approach delivers dramatic improvements in the level of automation together with significantly higher quality of customer experience.").
Moniz, Taubman, and Laxman are considered to be analogous to the claimed invention because they are in the same field of natural language processing.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Moniz in view of Taubman to incorporate the teachings of Laxman to assign labels to frames.  Doing so would allow for implementing a neural network conversational AI application to automate communications with an improved level of automation and higher quality customer experience.
Claims 7 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Moniz in view of Taubman and Laxman, and further in view of Peng et al. (“A Correspondence Variational Autoencoder for Unsupervised Acoustic Word Embeddings”), hereinafter Peng.
Regarding claim 7, Moniz in view of Taubman and Laxman discloses the method as claimed in claim 6.  Laxman further discloses:
generating a vector describing the object wherein the vector comprises the label (Paragraph 0054, lines 7-11, "The tokens in a dialog that constitute a frame (and/or sub-frame) are assigned the label obtained by the concatenation of the frame-type (or slot-type) and its associated intent (if one exists)."; Paragraph 0086, lines 16-18, "It is noted that entities that are detected can be added as features to the word vectors by each level's labeler.").
Laxman teaches adding labels to word vectors in order to implement a neural network conversational AI application to automate communications with an improved level of automation and higher quality customer experience (Paragraph 0003, lines 12-20, "We present a Virtual Business Assistant powered by our groundbreaking MIDGO AI technology that automates multi-point communication, helping with not just automating communication with a customer but also effectively coordinating with the business staff and manager/owner regarding that customer. This approach delivers dramatic improvements in the level of automation together with significantly higher quality of customer experience.").
Moniz, Taubman, and Laxman are considered to be analogous to the claimed invention because they are in the same field of natural language processing.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Moniz in view of Taubman and Laxman to further incorporate the teachings of Laxman to add labels to word vectors.  Doing so would allow for implementing a neural network conversational AI application to automate communications with an improved level of automation and higher quality customer experience.
Moniz in view of Taubman and Laxman does not specifically disclose: generating a vector describing the object wherein the vector comprises a velocity and an acceleration.
Peng teaches:
generating a vector describing the object wherein the vector comprises a velocity and an acceleration (Section 4.1, lines 29-30, "The input features are 13-dimensional mel-frequency cepstral coefficients (MFCCs) plus velocity and acceleration vectors.").
Peng teaches using feature vectors that include velocity and acceleration vectors in order to provide word embeddings that can be used in search, discovery, and indexing systems for low resource languages (Abstract, lines 1-4, "We propose a new unsupervised model for mapping a variable-duration speech segment to a fixed-dimensional representation. The resulting acoustic word embeddings can form the basis of search, discovery, and indexing systems for low- and zero-resource languages.").
Moniz, Taubman, Laxman, and Peng are considered to be analogous to the claimed invention because they are in the same field of natural language processing.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Moniz in view of Taubman and Laxman to further incorporate the teachings of Peng to use feature vectors that include velocity and acceleration vectors.  Doing so would allow for providing word embeddings that can be used in search, discovery, and indexing systems for low resource languages.
Regarding claim 14, Moniz in view of Taubman and Laxman discloses the computer system as claimed in claim 13.  Laxman further discloses:
generating a vector describing the object wherein the vector comprises the label (Paragraph 0054, lines 7-11, "The tokens in a dialog that constitute a frame (and/or sub-frame) are assigned the label obtained by the concatenation of the frame-type (or slot-type) and its associated intent (if one exists)."; Paragraph 0086, lines 16-18, "It is noted that entities that are detected can be added as features to the word vectors by each level's labeler.").
Laxman teaches adding labels to word vectors in order to implement a neural network conversational AI application to automate communications with an improved level of automation and higher quality customer experience (Paragraph 0003, lines 12-20, "We present a Virtual Business Assistant powered by our groundbreaking MIDGO AI technology that automates multi-point communication, helping with not just automating communication with a customer but also effectively coordinating with the business staff and manager/owner regarding that customer. This approach delivers dramatic improvements in the level of automation together with significantly higher quality of customer experience.").
Moniz, Taubman, and Laxman are considered to be analogous to the claimed invention because they are in the same field of natural language processing.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Moniz in view of Taubman and Laxman to further incorporate the teachings of Laxman to add labels to word vectors.  Doing so would allow for implementing a neural network conversational AI application to automate communications with an improved level of automation and higher quality customer experience.
Moniz in view of Taubman and Laxman does not specifically disclose: generating a vector describing the object wherein the vector comprises a velocity and an acceleration.
Peng teaches:
generating a vector describing the object wherein the vector comprises a velocity and an acceleration (Section 4.1, lines 29-30, "The input features are 13-dimensional mel-frequency cepstral coefficients (MFCCs) plus velocity and acceleration vectors.").
Peng teaches using feature vectors that include velocity and acceleration vectors in order to provide word embeddings that can be used in search, discovery, and indexing systems for low resource languages (Abstract, lines 1-4, "We propose a new unsupervised model for mapping a variable-duration speech segment to a fixed-dimensional representation. The resulting acoustic word embeddings can form the basis of search, discovery, and indexing systems for low- and zero-resource languages.").
Moniz, Taubman, Laxman, and Peng are considered to be analogous to the claimed invention because they are in the same field of natural language processing.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Moniz in view of Taubman and Laxman to further incorporate the teachings of Peng to use feature vectors that include velocity and acceleration vectors.  Doing so would allow for providing word embeddings that can be used in search, discovery, and indexing systems for low resource languages.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to James Boggs whose telephone number is (571)272-2968. The examiner can normally be reached M-F 8:00 AM - 5:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn, can be reached on (571)272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/JAMES BOGGS/ Examiner, Art Unit 2657

/DANIEL C WASHBURN/ Supervisory Patent Examiner, Art Unit 2657