DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant’s arguments with respect to claim(s) 1-18 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claim(s) 1, 4, 5, 7, 10, 11, 13, 16, 17 is/are rejected under 35 U.S.C. 103 as being unpatentable over Liao U.S. PAP 2018/0329512 A1 in view of Sak U.S. PAP 2019/0057683 A1.
Regarding claim 1, Liao teaches a method for controlling devices through voice interaction (method for interacting based on multimodal inputs, see abstract), the method comprising: 
identifying, by a controlling device, at least one feature of a target device and an action to be performed on the feature, based on an intent and an object determined from a received voice input (acquiring a plurality of input information including a voice input, performing analysis of the plurality of input information to generate an operation command, where the operation command has operation elements, including an operation object, an operation action, and an operation parameter, see par. [0009-0010]); 
determining, by the controlling device, a correspondence between the feature and the action to be performed using a trained neural network, wherein the trained neural network is pre-trained based on a correspondence between a plurality of prior actions and a plurality of features associated with the target device (second unit may perform logic matching and arbitration selection using a machine learning method so as to determine element information of the operation corresponding to the element types and includes at least one of a convolutional neural network, see par. [0060]); 
comparing, by the controlling device, a current operational state of the feature with an operational threshold of the feature (performing arbitration analysis of the plurality of structured data so as to generate an operation command, see par. [0091]); 
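and performing, by the controlling device, the action on the feature based on the determined correspondence, when the current operational state is within one or more limits of the operational threshold (performing a corresponding cooperation on the operation object based on the operation command, see par. [0011]). 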
However, Liao does not teach wherein the trained neural network is pre-trained to correctly identify the feature and the action by combining a loss function for recognition of the action and a loss function for recognition of the target device that is to perform the action.
In the same field of endeavor, Sak teaches methods, systems, and apparatus for performing speech recognition. In some implementations, acoustic data representing an utterance is obtained. The acoustic data corresponds to time steps in a series of time steps. One or more computers process scores indicative of the acoustic data using a recurrent neural network to generate a sequence of outputs, see abstract. A recurrent neural network model can be used as an end-to-end speech recognition system. The model can be trained to perform a sequence mapping task, with the model estimating an alignment between an input sequence, e.g., frames of audio data, and an output target sequence, e.g., graphemes. The sequence of graphemes that is indicated by the model can then be used to provide a transcription for speech, see par. [0003]. By applying multiple training processes, the speech recognition system's ability to recognize given audio may be fine-tuned and improved, see par. [0031]. The computing system uses the training data to adjust decoder neural network weights from initial to trained values using two training processes: a first training process (A) that minimizes a log likelihood loss function and a second training process (B) that minimizes an expected loss function, see par. [0063].
It would have been obvious to one of ordinary skill in the art to combine the Liao invention with the teachings of Sak for the benefit of improving the speech recognition system’s ability to recognize given audio, see par. [0031].
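For illustration only, the claimed limitation of "combining a loss function for recognition of the action and a loss function for recognition of the target device" can be read as a weighted sum of two classification losses. The following is a minimal sketch under that assumption; it is not taken from Liao, Sak, or the application. The function names, the use of cross-entropy, and the weights are all hypothetical.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-probability the model assigns to the correct class."""
    return -math.log(probs[target_index])


def combined_loss(action_probs, action_target, device_probs, device_target,
                  action_weight=1.0, device_weight=1.0):
    """Weighted sum of an action-recognition loss and a device-recognition loss.

    Assumption: both recognition tasks are classification tasks scored with
    cross-entropy, and "combining" means adding the two weighted losses.
    """
    action_loss = cross_entropy(action_probs, action_target)
    device_loss = cross_entropy(device_probs, device_target)
    return action_weight * action_loss + device_weight * device_loss
```

Under this reading, a single training objective penalizes a wrong action and a wrong target device together, which differs from running two separate training processes one after the other.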

Regarding claim 4, Liao teaches the method of claim 1, further comprising determining, by the controlling device, from an image associated with the target device, and by a convoluting neural network (CNN), the feature of the target device, wherein the CNN is trained to identify features from the target device using at least one training image associated with the target device (the deep learning neural network architecture model is a convolutional neural network configured for voice analysis and image recognition, see par. [0063]; the CNN mainly recognizing a two-dimensional image, and the CNN performs learning through training data, see par. [0065]). 
Regarding claim 5, Liao teaches the method of claim 4, wherein the image comprises at least one of a blueprint of the target device, a drawing of the target device, or a layout of the target device and the method further comprises updating, by the controlling device, the image associated with the target device, when the current operational state is within the limits of the operational threshold (the real scene information may be an image, a picture, a scheme image, a real object image, or an object with a specific shape, see par. [0084]; scene image recognition obtains real scene information inputted by image input module to obtain structured data about a set of operable objects, see par. [0093]). 

Regarding claim 7 Liao teaches a controlling device, comprising: a processor (processor see par. [0131]); and a memory communicatively coupled to the processor and storing instructions that (ram memory, see par. [0131]), when executed by the processor, cause the processor to: 
identify, by a controlling device, at least one feature of a target device and an action to be performed on the feature, based on an intent and an object determined from a received voice input (acquiring a plurality of input information including a voice input, performing analysis of the plurality of input information to generate an operation command, where the operation command has operation elements, including an operation object, an operation action, and an operation parameter, see par. [0009-0010]); 
determine, by the controlling device, a correspondence between the feature and the action to be performed using a trained neural network, wherein the trained neural network is pre-trained based on a correspondence between a plurality of prior actions and a plurality of features associated with the target device (second unit may perform logic matching and arbitration selection using a machine learning method so as to determine element information of the operation corresponding to the element types and includes at least one of a convolutional neural network, see par. [0060]); 
compare, by the controlling device, a current operational state of the feature with an operational threshold of the feature (performing arbitration analysis of the plurality of structured data so as to generate an operation command, see par. [0091]); 
and perform, by the controlling device, the action on the feature based on the determined correspondence, when the current operational state is within one or more limits of the operational threshold (performing a corresponding cooperation on the operation object based on the operation command, see par. [0011]). 
However, Liao does not teach wherein the trained neural network is pre-trained to correctly identify the feature and the action by combining a loss function for recognition of the action and a loss function for recognition of the target device that is to perform the action.
In the same field of endeavor, Sak teaches methods, systems, and apparatus for performing speech recognition. In some implementations, acoustic data representing an utterance is obtained. The acoustic data corresponds to time steps in a series of time steps. One or more computers process scores indicative of the acoustic data using a recurrent neural network to generate a sequence of outputs, see abstract. A recurrent neural network model can be used as an end-to-end speech recognition system. The model can be trained to perform a sequence mapping task, with the model estimating an alignment between an input sequence, e.g., frames of audio data, and an output target sequence, e.g., graphemes. The sequence of graphemes that is indicated by the model can then be used to provide a transcription for speech, see par. [0003]. By applying multiple training processes, the speech recognition system's ability to recognize given audio may be fine-tuned and improved, see par. [0031]. The computing system uses the training data to adjust decoder neural network weights from initial to trained values using two training processes: a first training process (A) that minimizes a log likelihood loss function and a second training process (B) that minimizes an expected loss function, see par. [0063].
It would have been obvious to one of ordinary skill in the art to combine the Liao invention with the teachings of Sak for the benefit of improving the speech recognition system’s ability to recognize given audio, see par. [0031].

Regarding claim 10, Liao teaches the controlling device of claim 7, wherein the instructions, when executed by the processor, further cause the processor to determine, from an image associated with the target device and by a convoluting neural network (CNN), the feature of the target device, wherein the CNN is trained to identify features from the target device using at least one training image associated with the target device (the deep learning neural network architecture model is a convolutional neural network configured for voice analysis and image recognition, see par. [0063]; the CNN mainly recognizing a two-dimensional image, and the CNN performs learning through training data, see par. [0065]). 
Regarding claim 11, Liao teaches the controlling device of claim 10, wherein the image comprises at least one of a blueprint of the target device, a drawing of the target device, or a layout of the target device and the instructions, when executed by the processor, further cause the processor to update the image associated with the target device, when the current operational state is within the limits of the operational threshold (the real scene information may be an image, a picture, a scheme image, a real object image, or an object with a specific shape, see par. [0084]; scene image recognition obtains real scene information inputted by image input module to obtain structured data about a set of operable objects, see par. [0093]). 
Regarding claim 13, Liao teaches a non-transitory computer readable medium having stored thereon instructions for controlling devices through voice interaction comprising executable code which when executed by one or more processors (computer readable recording medium, see par. [0131]), causes the one or more processors to: 
identify, by a controlling device, at least one feature of a target device and an action to be performed on the feature, based on an intent and an object determined from a received voice input (acquiring a plurality of input information including a voice input, performing analysis of the plurality of input information to generate an operation command, where the operation command has operation elements, including an operation object, an operation action, and an operation parameter, see par. [0009-0010]); 
determine, by the controlling device, a correspondence between the feature and the action to be performed using a trained neural network, wherein the trained neural network is pre-trained based on a correspondence between a plurality of prior actions and a plurality of features associated with the target device (second unit may perform logic matching and arbitration selection using a machine learning method so as to determine element information of the operation corresponding to the element types and includes at least one of a convolutional neural network, see par. [0060]); 

compare, by the controlling device, a current operational state of the feature with an operational threshold of the feature (performing arbitration analysis of the plurality of structured data so as to generate an operation command, see par. [0091]); 
and perform, by the controlling device, the action on the feature based on the determined correspondence, when the current operational state is within one or more limits of the operational threshold (performing a corresponding cooperation on the operation object based on the operation command, see par. [0011]). 
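However, Liao does not teach wherein the trained neural network is pre-trained to correctly identify the feature and the action by combining a loss function for recognition of the action and a loss function for recognition of the target device that is to perform the action.
In the same field of endeavor, Sak teaches methods, systems, and apparatus for performing speech recognition. In some implementations, acoustic data representing an utterance is obtained. The acoustic data corresponds to time steps in a series of time steps. One or more computers process scores indicative of the acoustic data using a recurrent neural network to generate a sequence of outputs, see abstract. A recurrent neural network model can be used as an end-to-end speech recognition system. The model can be trained to perform a sequence mapping task, with the model estimating an alignment between an input sequence, e.g., frames of audio data, and an output target sequence, e.g., graphemes. The sequence of graphemes that is indicated by the model can then be used to provide a transcription for speech, see par. [0003]. By applying multiple training processes, the speech recognition system's ability to recognize given audio may be fine-tuned and improved, see par. [0031]. The computing system uses the training data to adjust decoder neural network weights from initial to trained values using two training processes: a first training process (A) that minimizes a log likelihood loss function and a second training process (B) that minimizes an expected loss function, see par. [0063].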
It would have been obvious to one of ordinary skill in the art to combine the Liao invention with the teachings of Sak for the benefit of improving the speech recognition system’s ability to recognize given audio, see par. [0031].

Regarding claim 16, Liao teaches the non-transitory computer-readable medium of claim 13, wherein the executable code, when executed by the processors, further causes the processors to determine, from an image associated with the target device and by a convoluting neural network (CNN), the feature of the target device, wherein the CNN is trained to identify features from the target device using at least one training image associated with the target device (the deep learning neural network architecture model is a convolutional neural network configured for voice analysis and image recognition, see par. [0063]; the CNN mainly recognizing a two-dimensional image, and the CNN performs learning through training data, see par. [0065]). 
Regarding claim 17, Liao teaches the non-transitory computer-readable medium of claim 16, wherein the image comprises at least one of a blueprint of the target device, a drawing of the target device, or a layout of the target device and the executable code, when executed by the processors, further causes the processors to update the image associated with the target device, when the current operational state is within the limits of the operational threshold (the real scene information may be an image, a picture, a scheme image, a real object image, or an object with a specific shape, see par. [0084]; scene image recognition obtains real scene information inputted by image input module to obtain structured data about a set of operable objects, see par. [0093]). 

Claim(s) 2, 3, 8, 9, 14, 15 is/are rejected under 35 U.S.C. 103 as being unpatentable over Liao U.S. PAP 2018/0329512 A1, in view of Sak U.S. PAP 2019/0057683 A1, further in view of Kandur U.S. PAP 2019/0340202 A1.
Regarding claim 2, Liao in view of Sak does not teach the method of claim 1 further comprising: converting, by the controlling device, the received voice input to text; and determining, by the controlling device, each of the intent and the object based on processing of the text by a Long Short Term Memory (LSTM) model.
In a similar field of endeavor, Kandur teaches a device for providing contextual recommendations and a method therefor, see abstract. 
converting, by the controlling device, the voice input received from the user to text (The virtual assistant unit 106 may be configured to receive the voice input from the user, see par. [0064]. The input analyzing unit 202 may be configured to analyze the inputs provided by the user from the at least one of the display unit 104 and the virtual assistant unit 106. In an embodiment, the input analyzing unit 202 uses a natural language processing (NLP) technique to analyze the input. Embodiments herein are explained using the NLP technique to analyze the input, but it may be obvious to a person of ordinary skill in the art that any other text processing techniques may be used for analyzing the input, see par. [0075]); 
and determining, by the controlling device, each of the intent and the object based on processing of the text by a Long Short Term Memory (LSTM) model (The domain-object determination unit 206 may be configured to determine the objects related to the domain determined for the input. On identifying the input provided by the user, the domain-object determination unit 206 analyzes open-corpus/data set of the Domain-specific LM/trained LM. The character-embedding layer forms a vector for each word of the input by processing the characters of the input. The feature vector may include the floating numbers corresponding to the selected plurality of words. The LM engine processes the feature vector using a long short-term memory (LSTM) layer which generates grammatically corrected and domain-related objects, see par. [0079]).
It would have been obvious to one of ordinary skill in the art to combine the teachings of Liao in view of Sak with the Kandur invention for the benefit of generating grammatically corrected and domain-related objects, see par. [0079].
Regarding claim 3 Kandur teaches the method of claim 2, wherein the text is provided to the LSTM model in the form of sequence of words using word embeddings and the LSTM model is trained based on the prior actions and each of the prior actions is associated with a probability of execution (the domain-object determination unit 206 analyzes open-corpus/data set of the Domain-specific LM/trained LM. The data set of the trained LM may include a data set of the domains with respective vocabulary, see par. [0079]).
Regarding claim 8 Liao in view of Sak does not teach the controlling device of claim 7, wherein the instructions, when executed by the processor, further cause the processor to: convert the received voice input to text; and determine each of the intent and the object based on processing of the text by a Long Short Term Memory (LSTM) model. 
In a similar field of endeavor, Kandur teaches a device for providing contextual recommendations and a method therefor, see abstract. 
converting, by the controlling device, the voice input received from the user to text (The virtual assistant unit 106 may be configured to receive the voice input from the user, see par. [0064]. The input analyzing unit 202 may be configured to analyze the inputs provided by the user from the at least one of the display unit 104 and the virtual assistant unit 106. In an embodiment, the input analyzing unit 202 uses a natural language processing (NLP) technique to analyze the input. Embodiments herein are explained using the NLP technique to analyze the input, but it may be obvious to a person of ordinary skill in the art that any other text processing techniques may be used for analyzing the input, see par. [0075]); 
and determining, by the controlling device, each of the intent and the object based on processing of the text by a Long Short Term Memory (LSTM) model (The domain-object determination unit 206 may be configured to determine the objects related to the domain determined for the input. On identifying the input provided by the user, the domain-object determination unit 206 analyzes open-corpus/data set of the Domain-specific LM/trained LM. The character-embedding layer forms a vector for each word of the input by processing the characters of the input. The feature vector may include the floating numbers corresponding to the selected plurality of words. The LM engine processes the feature vector using a long short-term memory (LSTM) layer which generates grammatically corrected and domain-related objects, see par. [0079]).
It would have been obvious to one of ordinary skill in the art to combine the teachings of Liao in view of Sak with the Kandur invention for the benefit of generating grammatically corrected and domain-related objects, see par. [0079].

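Regarding claim 9, Kandur teaches the controlling device of claim 8, wherein the text is provided to the LSTM model in the form of sequence of words using word embeddings and the LSTM model is trained based on the prior actions and each of the prior actions is associated with a probability of execution (the domain-object determination unit 206 analyzes open-corpus/data set of the Domain-specific LM/trained LM. The data set of the trained LM may include a data set of the domains with respective vocabulary, see par. [0079]).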
Regarding claim 14 Liao in view of Sak does not teach the non-transitory computer-readable medium of claim 13, wherein the executable code, when executed by the processors, further causes the processors to: convert the received voice input to text; and determine each of the intent and the object based on processing of the text by a Long Short Term Memory (LSTM) model. 
In a similar field of endeavor, Kandur teaches a device for providing contextual recommendations and a method therefor, see abstract. 
converting, by the controlling device, the voice input received from the user to text (The virtual assistant unit 106 may be configured to receive the voice input from the user, see par. [0064]. The input analyzing unit 202 may be configured to analyze the inputs provided by the user from the at least one of the display unit 104 and the virtual assistant unit 106. In an embodiment, the input analyzing unit 202 uses a natural language processing (NLP) technique to analyze the input. Embodiments herein are explained using the NLP technique to analyze the input, but it may be obvious to a person of ordinary skill in the art that any other text processing techniques may be used for analyzing the input, see par. [0075]); 
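and determining, by the controlling device, each of the intent and the object based on processing of the text by a Long Short Term Memory (LSTM) model (The domain-object determination unit 206 may be configured to determine the objects related to the domain determined for the input. On identifying the input provided by the user, the domain-object determination unit 206 analyzes open-corpus/data set of the Domain-specific LM/trained LM. The character-embedding layer forms a vector for each word of the input by processing the characters of the input. The feature vector may include the floating numbers corresponding to the selected plurality of words. The LM engine processes the feature vector using a long short-term memory (LSTM) layer which generates grammatically corrected and domain-related objects, see par. [0079]).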
It would have been obvious to one of ordinary skill in the art to combine the teachings of Liao in view of Sak with the Kandur invention for the benefit of generating grammatically corrected and domain-related objects, see par. [0079].
Regarding claim 15 Kandur teaches the non-transitory computer-readable medium of claim 14, wherein the text is provided to the LSTM model in the form of sequence of words using word embeddings and the LSTM model is trained based on the prior actions and each of the prior actions is associated with a probability of execution (the domain-object determination unit 206 analyzes open-corpus/data set of the Domain-specific LM/trained LM. The data set of the trained LM may include a data set of the domains with respective vocabulary, see par. [0079]). 

Claim(s) 6, 12, 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Liao U.S. PAP 2018/0329512 A1, in view of Sak U.S. PAP 2019/0057683 A1, further in view of Huang U.S. PAP 2019/0311070 A1.
Regarding claim 6, Liao in view of Sak does not teach the method of claim 1 further comprising: establishing, by the controlling device, non-performance of the action on the feature, when the current operational state is outside the limits of the operational threshold; and outputting, by the controlling device, an alert regarding the non-performance of the action, wherein the alert comprises details associated with the non-performance of the action. 
In the same field of endeavor Huang teaches a method for using a speech signal to augment a visual search, see abstract. Understanding the search intent of the user may improve the accuracy and relevance of visual searches, see par. [0002]. 
establishing, by the controlling device, non-performance of the action on the feature, when the current operational state is outside the limits of the operational threshold (When the confidence values of both the speech search intent and the image search intent are less than T2, the process 320, at block 368, does not have sufficient confidence in either search intent and may display a prompt requesting a search query from the user, see par. [0043]);
and outputting, by the controlling device, an alert regarding non-performance of the action, wherein the alert comprises details associated with the non-performance of the action (Block 320 may or may not combine the speech and image search intents. Block 320 may, for example, display text such as "unable to generate search query" and provide the user with a window in which to manually enter the query, see par. [0043]).
It would have been obvious to one of ordinary skill in the art to combine the Liao in view of Sak invention with the teachings of Huang for the benefit of using speech to improve the accuracy of visual searches, see par. [0002].
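For illustration only, the limitations of claims 1 and 6 relating to the operational threshold (performing the action when the current operational state is within the limits, and otherwise establishing non-performance and outputting an alert with details) can be sketched as follows. This is a minimal sketch of the claim language as recited, not the method of Liao, Sak, or Huang. The class name, function name, numeric state, and alert wording are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OperationalThreshold:
    """Hypothetical representation of a feature's limits as a numeric range."""
    lower: float
    upper: float


def perform_if_within_limits(feature, current_state, threshold, action):
    """Run `action` only when the state is within limits.

    Returns a (performed, alert) pair. The alert is None when the action ran,
    and otherwise carries details about why the action was not performed.
    """
    if threshold.lower <= current_state <= threshold.upper:
        action()
        return True, None
    alert = (
        f"Action on '{feature}' not performed: current state {current_state} "
        f"is outside limits [{threshold.lower}, {threshold.upper}]"
    )
    return False, alert
```

In this sketch the alert text names the feature, the observed state, and the limits, which is one possible reading of "details associated with the non-performance of the action".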
Regarding claim 12, Liao in view of Sak does not teach the controlling device of claim 7, wherein the instructions, when executed by the processor, further cause the processor to: establish non-performance of the action on the feature, when the current operational state is outside the limits of the operational threshold; and output an alert regarding the non-performance of the action, wherein the alert comprises details associated with the non-performance of the action. 
In the same field of endeavor Huang teaches a method for using a speech signal to augment a visual search, see abstract. Understanding the search intent of the user may improve the accuracy and relevance of visual searches, see par. [0002]. 
establishing, by the controlling device, non-performance of the action on the feature, when the current operational state is outside the limits of the operational threshold (When the confidence values of both the speech search intent and the image search intent are less than T2, the process 320, at block 368, does not have sufficient confidence in either search intent and may display a prompt requesting a search query from the user, see par. [0043]);
and outputting, by the controlling device, an alert regarding non-performance of the action, wherein the alert comprises details associated with the non-performance of the action (Block 320 may or may not combine the speech and image search intents. Block 320 may, for example, display text such as "unable to generate search query" and provide the user with a window in which to manually enter the query, see par. [0043]).


Regarding claim 18 Liao in view of Sak does not teach the non-transitory computer-readable medium of claim 13, wherein the executable code, when executed by the processors, further causes the processors to: establish non-performance of the action on the feature, when the current operational state is outside the limits of the operational threshold; and output an alert regarding the non-performance of the action, wherein the alert comprises details associated with the non-performance of the action. 
In the same field of endeavor Huang teaches a method for using a speech signal to augment a visual search, see abstract. Understanding the search intent of the user may improve the accuracy and relevance of visual searches, see par. [0002]. 
establishing, by the controlling device, non-performance of the action on the feature, when the current operational state is outside the limits of the operational threshold (When the confidence values of both the speech search intent and the image search intent are less than T2, the process 320, at block 368, does not have sufficient confidence in either search intent and may display a prompt requesting a search query from the user, see par. [0043]);
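and outputting, by the controlling device, an alert regarding non-performance of the action, wherein the alert comprises details associated with the non-performance of the action (Block 320 may or may not combine the speech and image search intents. Block 320 may, for example, display text such as "unable to generate search query" and provide the user with a window in which to manually enter the query, see par. [0043]).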
It would have been obvious to one of ordinary skill in the art to combine the Liao in view of Sak invention with the teachings of Huang for the benefit of using speech to improve the accuracy of visual searches, see par. [0002].
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. The pertinent prior art is available on form 892.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Michael Ortiz-Sanchez whose telephone number is (571)270-3711.  The examiner can normally be reached on Monday-Friday, 9AM-6PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached on 571-272-7453.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/MICHAEL ORTIZ-SANCHEZ/ Primary Examiner, Art Unit 2656