Notice of Pre-AIA  or AIA  Status
The present application is being examined under the pre-AIA  first to invent provisions. 

Response to Arguments
1.	Applicant’s arguments with respect to claims 1-18 have been considered but are moot because the arguments do not apply to any of the new citations from the current prior art reference combination including new reference Hart et al., US Patent (8,700,392) being used in the current rejection.  See full rejection detail below. 

Claim Rejections - 35 USC § 103
2.	In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
3.	The following is a quotation of pre-AIA  35 U.S.C. 103(a) which forms the basis for all obviousness rejections set forth in this Office action:
(a) A patent may not be obtained though the invention is not identically disclosed or described as set forth in section 102, if the differences between the subject matter sought to be patented and the prior art are such that the subject matter as a whole would have been obvious at the time the invention was made to a person having ordinary skill in the art to which said subject matter pertains. Patentability shall not be negatived by the manner in which the invention was made.

Claims 1-5, 7-11 and 13-18 are rejected under pre-AIA 35 U.S.C. 103(a) as being unpatentable over Hart et al., US Patent (8,700,392), hereinafter “Hart”.

Regarding claim 1 Hart teaches a portable terminal device a computing device 302 [Hart col 6 lines 57-58] comprising: 
a camera that captures images of an operator an image capture element 304 is on the same general side of the computing device 302 as a display element such that when the user 308 is viewing the display element, the image capture element has a viewable area or viewable range 312 that, according to this example, includes the face of the user 308.[Hart col 6 lines 13-15]; 
a microphone that captures a voice of the operator Device 350, in one embodiment, may also analyze the outputs of elements 354, 356, 360 and 362 to determine a relative location (e.g., a distance and/or orientation) of user 308 relative to device 350 to identify at least one of the proper audio capture element (e.g., microphone) 354, 356 and/or image capture element 360, 362 to use to assist in determining the audio content/user input. [Hart col 8 lines 34-40]; 
a controller which is programmed to execute a plurality of operations FIG. 4 illustrates an example process 400 for controlling an interface (or otherwise providing input) using a device such as that described with respect to FIGS. 3(a)-3(c). … the device begins recording audio 404 and capturing video in a direction of a typical user 408, the audio including words spoken by the user. … a minimum amount of motion is detected in the captured video or some other such threshold is reached. … analyze the captured audio for speech information 406 and/or monitor the video for lip motion 410 or some other such motion or gesture as discussed elsewhere herein. In some embodiments, video data is captured and/or analyzed … with voice recognition [Hart col 8, 9 lines 65-67, 1-20]; 
a communication interface that transmits and receives data with an external server The system includes an electronic client device 1202, which can include any appropriate device operable to send and receive requests, messages or information over an appropriate network 1204 and convey information back to a user of the device. [Hart col 20 lines 8-12] the network includes the Internet, as the environment includes a Web server 1206 for receiving requests and serving content in response thereto, although for other networks, …  The illustrative environment includes at least one application server 1208 and a data store 1210. [Hart col 20 lines 25-32]; and 
wherein the controller is further programmed to: when the images are obtained from the camera and the voice is obtained from the microphone, A device in certain embodiments can also utilize image and/or voice recognition to apply context to input provided by a user. [ Hart col 17 lines 35-37] control the communication interface to transmit data The application server 1208 can include any appropriate hardware and software for integrating with the data store 1210 as needed to execute aspects of one or more applications for the client device and handling a majority of the data access and business logic for an application. … The handling of all requests and responses, as well as the delivery of content between the client device 1202 and the application server 1208, can be handled by the Web server 1206. [Hart col 20 lines 42-46], of the obtained images and the obtained voice to the external server, Thus, the user device 714, utilizing captured video data, identifies that Sarah's mouth 710 is moving at substantially the same time the as the audio data containing speech content was captured. [Hart col 13, 14 lines 45-67, 1-5] and when the communication interface receives information from the external server The system includes an electronic client device 1202, which can include any appropriate device operable to send and receive requests, messages or information over an appropriate network 1204 and convey information back to a user of the device. [Hart col 20 lines 8-12] including one or more results identified by the external server based on the transmitted data, The device can identify words contained within the captured audio and/or video data 606 and can compare these words against the input words contained in the state-dependent dictionary 608. 
[Hart col 12 lines 37-40], the device can determine the location of the user's mouth in the captured images, the device can monitor the movement of the user's lips in order to attempt to determine any words, characters or other information spoken by the user [Hart col 6 lines 40-44]  wherein, when the communication interface receives only one result for operation identified in response to the transmitted data by the external server, the controller is further programmed to execute an operation corresponding to the one result, The device can identify words contained within the captured audio and/or video data 606 and can compare these words against the input words contained in the state-dependent dictionary 608. Upon matching 610 a word from the audio/video data with an input word, the input corresponding to the matched input word is processed 614 (e.g., launch browser, create calendar event). [Hart col 12 lines 37-46] (in this example, Hart shows a system that receives a single command, such as launch a browser, and performs only that command)
wherein, when the communication interface receives a plurality of results for operation identified in response to the transmitted data by the external server, the controller is further programmed to: display information corresponding to the plurality of results for operation as a plurality of candidates of operation options, the device may capture audio containing speech and/or video data containing images of persons other than the primary user of the device. For instance, consider the example situation 700 of FIG. 7, wherein there are three people in a meeting, including the primary user 722 of a first computing device 714 (the "user device" in this example) and two other persons (here, Mark 702 and Sarah 704) each (potentially) with their own second device 706, 708, respectively. [Hart col 12 lines 35-43] (the system displays the three possible speakers)
capture additional voice for selecting one from the plurality of results for operation (the Office views the operation here as identifying the speaker in a group) during displaying the information corresponding to the plurality of results for operation, During the meeting, the user device 714 may detect audible speech from the user(s), Mark and/or Sarah, either individually or at the same time. Thus, when the user device 714 detects audible speech, the user device can utilize one or more image algorithms to analyze the captured video data to determine whether the user, Mark or Sarah, provided the audible speech. [Hart col 12 lines 50-55]
Sarah is speaking at this point during the meeting (e.g., saying "Thanks. Quarterly profits are up 10%") 712, and the user and Mark are listening to Sarah. Thus, the user device 714, utilizing captured video data, identifies that Sarah's mouth 710 is moving at substantially the same time the as the audio data containing speech content was captured. The captured video data also indicates that the user 722 and Mark 702 are not speaking at that time because their lips are not moving. Thus, the device associates the speech with Sarah. [Hart col 13 & 14  lines 66-67 & 1-8]  and execute an operation corresponding to the determined one result the display 718 on the user device shows a transcription of a recent portion of the meeting. Because the user device is able to determine which person spoke each word, the device can list that person's name (or other identifier) next to each portion attributed to that person. The display can also include other elements such as a cursor 720 or other element indicating who last spoke and/or the last word that was spoken. [Hart col 14 lines 27-40]  determining identities for the persons and/or devices in the room, or otherwise nearby or within a detectable range, also provides a number of other advantages and input possibilities. For example, if the device knows that one of the nearby people is named "Sarah" then the device can add that name to the sub-dictionary being used in order to more quickly and easily recognize when that name is spoken. Further, if information associated with Sarah, such as a device identifier or email address, is known to the device, such as through contact information, various actions can be performed using that information.[Hart col 14 lines 35-44]  
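
For illustration only, the single-result and plural-result branches recited in claim 1 can be sketched as follows. This sketch is not taken from Hart or from the application. The function name handle_server_results and its callback parameters are hypothetical, and the exact string matching of the additional voice against the displayed candidates is an assumption made to keep the example short.

```python
from typing import Callable, Dict, List, Optional


def handle_server_results(
    results: List[str],
    operations: Dict[str, Callable[[], str]],
    display: Callable[[List[str]], None],
    capture_additional_voice: Callable[[], str],
) -> Optional[str]:
    """Dispatch on the number of results returned by a (hypothetical) server."""
    if not results:
        # No result identified; the claim language does not address this case.
        return None
    if len(results) == 1:
        # Only one result: execute the corresponding operation directly.
        return operations[results[0]]()
    # Plural results: display them as candidate operation options ...
    display(results)
    # ... and capture additional voice while the candidates are displayed.
    spoken = capture_additional_voice().strip().lower()
    for candidate in results:
        if candidate.lower() == spoken:
            return operations[candidate]()
    # The additional voice matched none of the displayed candidates.
    return None
```

In this sketch the display and voice-capture steps are passed in as callables so that the branch logic can be read separately from any particular hardware.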

Hart discloses a user can provide input to a computing device through various combinations of speech, movement, and/or gestures. A computing device can analyze captured audio data and analyze that data to determine any speech information in the audio data. The computing device can simultaneously capture image or video information which can be used to assist in analyzing the audio information. For example, image information is utilized by the device to determine when someone is speaking, and the movement of the person's lips can be analyzed to assist in determining the words that were spoken. Any gestures or other motions can assist in the determination as well. By combining various types of data to determine user input, the accuracy of a process such as speech recognition can be improved, and the need for lengthy application training processes can be avoided. 
Hart also illustrates an example of an environment 1200 for implementing aspects in accordance with various embodiments. As will be appreciated, although a Web-based environment is used for purposes of explanation, different environments may be used, as appropriate, to implement various embodiments. The system includes an electronic client device 1202, which can include any appropriate device operable to send and receive requests, messages or information over an appropriate network 1204 and convey information back to a user of the device. The illustrative environment includes at least one application server 1208 and a data store 1210. 

At the time the invention was made, it would have been obvious to one of ordinary skill in the art to combine the embodiments taught by Hart to design a networked client and server system that transfers audio and images from a portable device to a server containing hardware and software similar to that of the client device, performs voice and lip movement recognition at the server, and transfers the result (the word spoken by the operator) back over the network. Hart itself discloses that the specification and drawings are to be regarded in an illustrative rather than a restrictive sense, and that various modifications and changes may be made thereunto without departing from the broader spirit and scope of the invention. Further, it has been held that matters of well-known design are within the ordinary skill in the art.

Regarding claim 2 Hart teaches everything above (see claim 1).  In addition Hart teaches wherein the displayed information corresponding to the plurality of results is where the speaker can actually be identified automatically and associated with the proper speech. For example, the display 718 on the user device shows a transcription of a recent portion of the meeting [Hart col 14 lines 25-28]

Regarding claim 3 Hart teaches everything above (see claim 2).  In addition Hart teaches further comprising selecting one from the plurality of results from a voice corresponding to the characters or the character string. where the speaker can actually be identified automatically and associated with the proper speech. For example, the display 718 on the user device shows a transcription of a recent portion of the meeting [Hart col 14 lines 25-28]

Regarding claim 4 Hart teaches everything above (see claim 1).  In addition Hart teaches wherein when the images are obtained from the camera and the voice is obtained from the microphone, the controller is further programmed to identify if the operator is a specific operator based on at least one of the obtained images and the obtained voice, and when the operator is identified as the specific operator, the device may capture audio containing speech and/or video data containing images of persons other than the primary user of the device. For instance, consider the example situation 700 of FIG. 7, wherein there are three people in a meeting, including the primary user 722 of a first computing device 714 (the "user device" in this example) and two other persons (here, Mark 702 and Sarah 704) each (potentially) with their own second device 706, 708, respectively. [Hart col 13 lines 35-43]   based on at least one of the obtained images and the obtained voice, The user device in this example has a first image capture element 724 able to capture images of the user 722 and a second image capture element 726 on the other side of the user device able to capture image data including images of Mark 702 and Sarah 704 located on the other side of the user device. During the meeting, the user device 714 may detect audible speech from the user(s), Mark and/or Sarah, either individually or at the same time. Thus, when the user device 714 detects audible speech, the user device can utilize one or more image algorithms to analyze the captured video data to determine whether the user, Mark or Sarah, provided the audible speech. For example, the user device 714 may utilize the one or more image algorithms to identify whether the user's, Mark's or Sarah's, lips were moving at substantially the same time as the audio was captured by at least one of microphones 730, 740, and 742 of the device 714. 
In this example, at least one of audio capture units or microphones 740, 742 capture audio from user 722, and audio capture element or microphone 730, located on the opposite side of the device as user 722, captures audio from Mark 702 and Sarah 704. In the example provided in FIG. 7, Sarah is speaking at this point during the meeting (e.g., saying "Thanks. Quarterly profits are up 10%") 712, and the user and Mark are listening to Sarah. Thus, the user device 714, utilizing captured video data, identifies that Sarah's mouth 710 is moving at substantially the same time the as the audio data containing speech content was captured. [Hart col 13, 14 lines 45-67, 1-5]
 The application server 1208 can include any appropriate hardware and software for integrating with the data store 1210 as needed to execute aspects of one or more applications for the client device and handling a majority of the data access and business logic for an application. … The handling of all requests and responses, as well as the delivery of content between the client device 1202 and the application server 1208, can be handled by the Web server 1206. [Hart col 20 lines 42-46]
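
For illustration only, the speaker attribution relied upon above (associating captured speech with the person whose lips move at substantially the same time) can be sketched as a comparison of time intervals. Hart does not specify an algorithm at this level of detail. The interval representation, the function names overlap and attribute_speech, and the min_fraction threshold are all assumptions, and each person's lip-motion intervals are assumed not to overlap one another.

```python
from typing import Dict, List, Optional, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def overlap(a: Interval, b: Interval) -> float:
    """Length in seconds of the overlap between two time intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def attribute_speech(
    speech: Interval,
    lip_motion: Dict[str, List[Interval]],
    min_fraction: float = 0.5,  # assumed threshold, not from Hart
) -> Optional[str]:
    """Return the person whose lip motion best coincides with the speech."""
    duration = speech[1] - speech[0]
    if duration <= 0:
        return None
    best_name: Optional[str] = None
    best_overlap = 0.0
    for name, intervals in lip_motion.items():
        total = sum(overlap(speech, interval) for interval in intervals)
        if total > best_overlap:
            best_name, best_overlap = name, total
    # Attribute the speech only if enough of it coincides with lip motion.
    if best_overlap / duration >= min_fraction:
        return best_name
    return None
```

With lip-motion intervals for the user, Mark and Sarah as in the FIG. 7 example, a speech segment that falls inside Sarah's lip-motion interval would be attributed to Sarah, and a segment with no coinciding lip motion would be attributed to no one.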

Regarding claim 5 Hart teaches everything above (see claim 1).  In addition Hart teaches wherein the result identified by the external server is based on temporal changes of a lateral size and a vertical size of a lip in the transmitted images. if the device can determine the location of the user's mouth in the captured images, the device can monitor the movement of the user's lips in order to attempt to determine any words, characters or other information spoken by the user. Similar to a user speech model, a user word formation model can be used, developed and/or refined that models how a user forms certain words using the user's lips, tongue, teeth, cheeks or other visible portions of the user's face. [Hart col 6 lines 40-44] sensory inputs are utilized by the device to identify whether the user provided a command: captured audio is utilized to detect an audible command and captured video is utilized to determine that the user (i) said that word based upon image analysis of the user's face and (ii) performed an appropriate gesture. The video of the user's face and gestures may be based upon video captured from the same or different image capture elements.  [Hart col 10, 11 line 67, 1-7] 
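
For illustration only, "temporal changes of a lateral size and a vertical size of a lip" can be read as frame-to-frame differences in lip width and height. The following is a minimal sketch under that reading. It is not taken from Hart, which describes monitoring lip movement generally. The per-frame (width, height) representation, the function names, and the threshold value are assumptions.

```python
from typing import List, Tuple

LipSize = Tuple[float, float]  # (lateral width, vertical height), e.g. in pixels


def lip_size_changes(sizes: List[LipSize]) -> List[Tuple[float, float]]:
    """Frame-to-frame change in lateral and vertical lip size."""
    return [(w1 - w0, h1 - h0) for (w0, h0), (w1, h1) in zip(sizes, sizes[1:])]


def is_lip_moving(sizes: List[LipSize], threshold: float = 2.0) -> bool:
    """True if any frame-to-frame change reaches the (assumed) threshold."""
    return any(
        abs(dw) >= threshold or abs(dh) >= threshold
        for dw, dh in lip_size_changes(sizes)
    )
```

A sequence of such changes could then serve as an input feature to a word-recognition step, although no particular recognition model is implied here.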



Regarding claim 7 Hart teaches an information processing system comprising: 
a portable terminal device the computing device 302 [Hart col 6 line 7]; and a server one application server 1208 and a data store 1210. [Hart col 20 line 31] connected over a network to the portable terminal device an appropriate network 1204 and convey information back to a user of the device. [Hart col 20 lines 10-12], wherein the portable terminal device includes: 
a camera that captures images of an operator an image capture element 304 is on the same general side of the computing device 302 as a display element such that when the user 308 is viewing the display element, the image capture element has a viewable area or viewable range 312 that, according to this example, includes the face of the user 308.[Hart col 6 lines 13-15]; 
a microphone that captures a voice of the operator Device 350, in one embodiment, may also analyze the outputs of elements 354, 356, 360 and 362 to determine a relative location (e.g., a distance and/or orientation) of user 308 relative to device 350 to identify at least one of the proper audio capture element (e.g., microphone) 354, 356 and/or image capture element 360, 362 to use to assist in determining the audio content/user input. [Hart col 8 lines 34-40]; 
a first controller which is programmed to execute a plurality of operations FIG. 4 illustrates an example process 400 for controlling an interface (or otherwise providing input) using a device such as that described with respect to FIGS. 3(a)-3(c). … the device begins recording audio 404 and capturing video in a direction of a typical user 408, the audio including words spoken by the user. … a minimum amount of motion is detected in the captured video or some other such threshold is reached. … analyze the captured audio for speech information 406 and/or monitor the video for lip motion 410 or some other such motion or gesture as discussed elsewhere herein. In some embodiments, video data is captured and/or analyzed … with voice recognition [Hart col 8, 9 lines 65-67, 1-20]; 
a first communication interface that transmits and receives data with the server The system includes an electronic client device 1202, which can include any appropriate device operable to send and receive requests, messages or information over an appropriate network 1204 and convey information back to a user of the device. [Hart col 20 lines 8-12] the network includes the Internet, as the environment includes a Web server 1206 for receiving requests and serving content in response thereto, although for other networks, …  The illustrative environment includes at least one application server 1208 and a data store 1210. [Hart col 20 lines 25-32]; and 
wherein the first controller is further programmed to: when the images are obtained from the camera and the voice is obtained from the microphone, control the first communication interface to transmit data of the obtained images and the obtained voice to the server, sensory inputs are utilized by the device to identify whether the user provided a command: captured audio is utilized to detect an audible command and captured video is utilized to determine that the user (i) said that word based upon image analysis of the user's face and (ii) performed an appropriate gesture. The video of the user's face and gestures may be based upon video captured from the same or different image capture elements.  [Hart col 10, 11 line 67, 1-7] wherein the server includes: 
a second communication interface that receives the data of the obtained images and the obtained voice transmitted to the server The system includes an electronic client device 1202, which can include any appropriate device operable to send and receive requests, messages or information over an appropriate network 1204 and convey information back to a user of the device. [Hart col 20 lines 8-12] the network includes the Internet, as the environment includes a Web server 1206 for receiving requests and serving content in response thereto, although for other networks, …  The illustrative environment includes at least one application server 1208 and a data store 1210. [Hart col 20 lines 25-32] (Figure 12 discusses a networked managed system including an application server to run and that the application server 1208 can include any appropriate hardware and software for integrating with the data store 1210 as needed to execute aspects of one or more applications for the client device); and 
a second controller which is programmed to execute a plurality of operations, The application server provides access control services in cooperation with the data store and is able to generate content such as text, graphics, audio and/or video to be transferred to the user [Hart col 20 lines 47-50]  wherein the second controller is further programmed to: The application server 1208 can include any appropriate hardware and software for integrating with the data store 1210 as needed to execute aspects of one or more applications for the client device and handling a majority of the data access and business logic for an application. … The handling of all requests and responses, as well as the delivery of content between the client device 1202 and the application server 1208, can be handled by the Web server 1206. [Hart col 20 lines 42-46] if the device can determine the location of the user's mouth in the captured images, the device can monitor the movement of the user's lips in order to attempt to determine any words, characters or other information spoken by the user. Similar to a user speech model, a user word formation model can be used, developed and/or refined that models how a user forms certain words using the user's lips, tongue, teeth, cheeks or other visible portions of the user's face. [Hart col 6 lines 40-44]  (Hart discloses that the application server will have the same hardware and software necessary to perform the functions of the client device and that the server can communicate with a client device over a network);
identify one or more of the operations to be executed based on the received voice data and the image data, The device can identify words contained within the captured audio and/or video data 606 and can compare these words against the input words contained in the state-dependent dictionary 608. [Hart col 12 lines 37-40], the device can determine the location of the user's mouth in the captured images, the device can monitor the movement of the user's lips in order to attempt to determine any words, characters or other information spoken by the user [Hart col 6 lines 40-44]  sensory inputs are utilized by the device to identify whether the user provided a command: captured audio is utilized to detect an audible command and captured video is utilized to determine that the user (i) said that word based upon image analysis of the user's face and (ii) performed an appropriate gesture. The video of the user's face and gestures may be based upon video captured from the same or different image capture elements.  [Hart col 10, 11 line 67, 1-7]; and control the second communication interface to transmit an identification of one or more results of the plurality of operations to the portable terminal device, The system includes an electronic client device 1202, which can include any appropriate device operable to send and receive requests, messages or information over an appropriate network 1204 and convey information back to a user of the device. [Hart col 20 lines 8-12]The handling of all requests and responses, as well as the delivery of content between the client device 1202 and the application server 1208, can be handled by the Web server 1206. [Hart col 20 lines 42-46]
wherein, when the first communication interface receives only one result for operation identified in response to the transmitted data by the external server, the controller is further programmed to execute an operation corresponding to the one result, The device can identify words contained within the captured audio and/or video data 606 and can compare these words against the input words contained in the state-dependent dictionary 608. Upon matching 610 a word from the audio/video data with an input word, the input corresponding to the matched input word is processed 614 (e.g., launch browser, create calendar event). [Hart col 12 lines 37-46] (in this example, Hart shows a system that receives a single command, such as launch a browser, and performs only that command)
wherein, when the first communication interface receives a plurality of results for operation identified in response to the transmitted data by the external server, the controller is further programmed to: display information corresponding to the plurality of results for operation as a plurality of candidates of operation options, the device may capture audio containing speech and/or video data containing images of persons other than the primary user of the device. For instance, consider the example situation 700 of FIG. 7, wherein there are three people in a meeting, including the primary user 722 of a first computing device 714 (the "user device" in this example) and two other persons (here, Mark 702 and Sarah 704) each (potentially) with their own second device 706, 708, respectively. [Hart col 12 lines 35-43] (the system displays the three possible speakers)
capture additional voice for selecting one from the plurality of results for operation (the Office views the operation here as identifying the speaker in a group) during displaying the information corresponding to the plurality of results for operation, During the meeting, the user device 714 may detect audible speech from the user(s), Mark and/or Sarah, either individually or at the same time. Thus, when the user device 714 detects audible speech, the user device can utilize one or more image algorithms to analyze the captured video data to determine whether the user, Mark or Sarah, provided the audible speech. [Hart col 12 lines 50-55] 
Sarah is speaking at this point during the meeting (e.g., saying "Thanks. Quarterly profits are up 10%") 712, and the user and Mark are listening to Sarah. Thus, the user device 714, utilizing captured video data, identifies that Sarah's mouth 710 is moving at substantially the same time the as the audio data containing speech content was captured. The captured video data also indicates that the user 722 and Mark 702 are not speaking at that time because their lips are not moving. Thus, the device associates the speech with Sarah. [Hart col 13 & 14  lines 66-67 & 1-8]  and
	execute an operation corresponding to the determined one result the display 718 on the user device shows a transcription of a recent portion of the meeting. Because the user device is able to determine which person spoke each word, the device can list that person's name (or other identifier) next to each portion attributed to that person. The display can also include other elements such as a cursor 720 or other element indicating who last spoke and/or the last word that was spoken. [Hart col 14 lines 27-40]  determining identities for the persons and/or devices in the room, or otherwise nearby or within a detectable range, also provides a number of other advantages and input possibilities. For example, if the device knows that one of the nearby people is named "Sarah" then the device can add that name to the sub-dictionary being used in order to more quickly and easily recognize when that name is spoken. Further, if information associated with Sarah, such as a device identifier or email address, is known to the device, such as through contact information, various actions can be performed using that information.[Hart col 14 lines 35-44]  
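
For illustration only, the server-side step of identifying one or more operations from recognized words, as mapped above to Hart's state-dependent dictionary, can be sketched as a dictionary lookup keyed by device state. The dictionary contents, the state names, and the function name identify_operations are hypothetical. Hart describes comparing words against a state-dependent dictionary but does not prescribe this data structure.

```python
from typing import Dict, List

# Hypothetical state-dependent dictionary: state -> {input word -> operation}.
STATE_DICTIONARY: Dict[str, Dict[str, str]] = {
    "home": {"browser": "launch browser", "calendar": "create calendar event"},
    "browser": {"back": "go back", "close": "close browser"},
}


def identify_operations(recognized_words: List[str], state: str) -> List[str]:
    """Return the operations matched by the recognized words in this state."""
    dictionary = STATE_DICTIONARY.get(state, {})
    results: List[str] = []
    for word in recognized_words:
        operation = dictionary.get(word.lower())
        # Keep each matched operation once, in order of first appearance.
        if operation is not None and operation not in results:
            results.append(operation)
    return results
```

A list with one entry corresponds to the "only one result" branch of the claim, and a list with several entries corresponds to the "plurality of results" branch.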

Hart discloses a user can provide input to a computing device through various combinations of speech, movement, and/or gestures. A computing device can analyze captured audio data and analyze that data to determine any speech information in the audio data. The computing device can simultaneously capture image or video information which can be used to assist in analyzing the audio information. For example, image information is utilized by the device to determine when someone is speaking, and the movement of the person's lips can be analyzed to assist in determining the words that were spoken. Any gestures or other motions can assist in the determination as well. By combining various types of data to determine user input, the accuracy of a process such as speech recognition can be improved, and the need for lengthy application training processes can be avoided. 
Hart also illustrates an example of an environment 1200 for implementing aspects in accordance with various embodiments. As will be appreciated, although a Web-based environment is used for purposes of explanation, different environments may be used, as appropriate, to implement various embodiments. The system includes an electronic client device 1202, which can include any appropriate device operable to send and receive requests, messages or information over an appropriate network 1204 and convey information back to a user of the device. The illustrative environment includes at least one application server 1208 and a data store 1210. 

At the time the invention was made, it would have been obvious to one of ordinary skill in the art to combine the embodiments taught by Hart to design a networked client and server system that transfers audio and images from a portable device to a server containing hardware and software similar to that of the client device, performs voice and lip movement recognition at the server, and transfers the result (the word spoken by the operator) back over the network. Hart itself discloses that the specification and drawings are to be regarded in an illustrative rather than a restrictive sense, and that various modifications and changes may be made thereunto without departing from the broader spirit and scope of the invention. Further, it has been held that matters of well-known design are within the ordinary skill in the art.

Regarding claim 8 Hart teaches everything above (see claim 7).  In addition Hart teaches wherein the displayed information corresponding to the plurality of results is  where the speaker can actually be identified automatically and associated with the proper speech. For example, the display 718 on the user device shows a transcription of a recent portion of the meeting [Hart col 14 lines 25-28]

Regarding claim 9 Hart teaches everything above (see claim 8).  In addition Hart teaches further comprising selecting one from the plurality of results from a voice corresponding to the characters or the character string. where the speaker can actually be identified automatically and associated with the proper speech. For example, the display 718 on the user device shows a transcription of a recent portion of the meeting [Hart col 14 lines 25-28]

Regarding claim 10 Hart teaches everything above (see claim 7).  In addition Hart teaches wherein when the images are obtained from the camera and the voice is obtained from the microphone, the first controller is further programmed to identify if the operator is a specific operator based on at least one of the obtained images and the obtained voice, Mark's or Sarah's, lips were moving at substantially the same time as the audio was captured by at least one of microphones 730, 740, and 742 of the device 714. In this example, at least one of audio capture units or microphones 740, 742 capture audio from user 722, and audio capture element or microphone 730, located on the opposite side of the device as user 722, captures audio from Mark 702 and Sarah 704. In the example provided in FIG. 7, Sarah is speaking at this point during the meeting (e.g., saying "Thanks. Quarterly profits are up 10%") 712, and the user and Mark are listening to Sarah. Thus, the user device 714, utilizing captured video data, identifies that Sarah's mouth 710 is moving at substantially the same time as the audio data containing speech content was captured. [Hart col 13, 14 lines 45-67, 1-5] and when the operator is identified as the specific operator, the obtained images and voice are transmitted to the server. The application server 1208 can include any appropriate hardware and software for integrating with the data store 1210 as needed to execute aspects of one or more applications for the client device and handling a majority of the data access and business logic for an application. … The handling of all requests and responses, as well as the delivery of content between the client device 1202 and the application server 1208, can be handled by the Web server 1206. [Hart col 20 lines 42-46]
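
For illustration only, associating captured speech with the person whose lips were moving at substantially the same time can be sketched as an interval-overlap comparison. This sketch is not taken from Hart: the function name, the interval representation in seconds and the sample timings are assumptions.

```python
def attribute_speech(audio_interval, lip_activity):
    """Return the person whose lip movement overlaps the audio interval the most.

    audio_interval: (start, end) in seconds for a detected speech segment.
    lip_activity: dict mapping a person's name to a list of (start, end)
                  intervals during which that person's lips were moving.
    Returns None when no one's lips moved during the segment.
    """
    start, end = audio_interval
    best_name, best_overlap = None, 0.0
    for name, intervals in lip_activity.items():
        # Sum the portions of each lip-movement interval that fall inside the audio segment.
        overlap = sum(max(0.0, min(end, e) - max(start, s)) for s, e in intervals)
        if overlap > best_overlap:
            best_name, best_overlap = name, overlap
    return best_name


# Hypothetical timings loosely modeled on the FIG. 7 meeting example.
lips = {"Sarah": [(9.8, 12.1)], "Mark": [(3.0, 4.0)], "user": []}
speaker = attribute_speech((10.0, 12.0), lips)
```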

Regarding claim 11 Hart teaches everything above (see claim 7).  In addition Hart teaches wherein the result identified by the server is based on temporal changes of a lateral size and a vertical size of a lip in the transmitted images. if the device can determine the location of the user's mouth in the captured images, the device can monitor the movement of the user's lips in order to attempt to determine any words, characters or other information spoken by the user. Similar to a user speech model, a user word formation model can be used, developed and/or refined that models how a user forms certain words using the user's lips, tongue, teeth, cheeks or other visible portions of the user's face. [Hart col 6 lines 40-44] sensory inputs are utilized by the device to identify whether the user provided a command: captured audio is utilized to detect an audible command and captured video is utilized to determine that the user (i) said that word based upon image analysis of the user's face and (ii) performed an appropriate gesture. The video of the user's face and gestures may be based upon video captured from the same or different image capture elements.  [Hart col 10, 11 line 67, 1-7] 
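
For illustration only, one way to read "temporal changes of a lateral size and a vertical size of a lip" is as frame-to-frame differences in measured lip width and height. The sketch below is an assumption made for explanation; it is taken neither from Hart nor from the application, and the function names, pixel units and threshold are hypothetical.

```python
def lip_size_deltas(frames):
    """Compute frame-to-frame changes in lip size.

    frames: list of (lateral_width, vertical_height) measurements, one per video frame.
    Returns a list of (d_width, d_height) tuples, one per consecutive frame pair.
    """
    return [
        (w1 - w0, h1 - h0)
        for (w0, h0), (w1, h1) in zip(frames, frames[1:])
    ]


def is_mouth_moving(frames, threshold=2):
    """Report movement when any width or height change reaches the (assumed) threshold."""
    return any(
        abs(dw) >= threshold or abs(dh) >= threshold
        for dw, dh in lip_size_deltas(frames)
    )


# Hypothetical measurements in pixels for three consecutive frames.
deltas = lip_size_deltas([(40, 10), (44, 18), (42, 12)])
```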

Regarding claim 13 Hart teaches an information processing the computing device 302 [Hart col 6 line 7] method comprising: 
capturing images of an operator with a camera of a portable terminal device an image capture element 304 is on the same general side of the computing device 302 as a display element such that when the user 308 is viewing the display element, the image capture element has a viewable area or viewable range 312 that, according to this example, includes the face of the user 308.[Hart col 6 lines 13-15]; 
capturing a voice of the operator with a microphone of the portable terminal device Device 350, in one embodiment, may also analyze the outputs of elements 354, 356, 360 and 362 to determine a relative location (e.g., a distance and/or orientation) of user 308 relative to device 350 to identify at least one of the proper audio capture element (e.g., microphone) 354, 356 and/or image capture element 360, 362 to use to assist in determining the audio content/user input. [Hart col 8 lines 34-40]; 
transmitting and receiving data between the portable terminal device and a server connected over a network to the portable terminal device The system includes an electronic client device 1202, which can include any appropriate device operable to send and receive requests, messages or information over an appropriate network 1204 and convey information back to a user of the device. [Hart col 20 lines 8-12] the network includes the Internet, as the environment includes a Web server 1206 for receiving requests and serving content in response thereto, although for other networks, …  The illustrative environment includes at least one application server 1208 and a data store 1210. [Hart col 20 lines 25-32]; wherein 
a first controller of the portable terminal device is programmed to execute a plurality of operations, including: when the images are obtained from the camera and the voice is obtained from the microphone, transmitting the images obtained from the camera and the voice obtained from the microphone to the server sensory inputs are utilized by the device to identify whether the user provided a command: captured audio is utilized to detect an audible command and captured video is utilized to determine that the user (i) said that word based upon image analysis of the user's face and (ii) performed an appropriate gesture. The video of the user's face and gestures may be based upon video captured from the same or different image capture elements.  [Hart col 10, 11 line 67, 1-7]; wherein
a second controller of the server is programmed to execute a plurality of operations, The application server provides access control services in cooperation with the data store and is able to generate content such as text, graphics, audio and/or video to be transferred to the user [Hart col 20 lines 47-50]  including: receiving the data of the obtained images and the obtained voice transmitted  The application server 1208 can include any appropriate hardware and software for integrating with the data store 1210 as needed to execute aspects of one or more applications for the client device and handling a majority of the data access and business logic for an application. … The handling of all requests and responses, as well as the delivery of content between the client device 1202 and the application server 1208, can be handled by the Web server 1206. [Hart col 20 lines 42-46] if the device can determine the location of the user's mouth in the captured images, the device can monitor the movement of the user's lips in order to attempt to determine any words, characters or other information spoken by the user. Similar to a user speech model, a user word formation model can be used, developed and/or refined that models how a user forms certain words using the user's lips, tongue, teeth, cheeks or other visible portions of the user's face. [Hart col 6 lines 40-44]  (Hart discloses that the application server will have the same hardware and software necessary to perform the functions of the client device and that the server can communicate with a client device over a network); wherein 
the second controller of the server is further programmed to: identify one or more of the operations to be executed based on the received voice data and the image data, and transmit an identification of one or more results of the plurality of operations to the portable terminal device, The device can identify words contained within the captured audio and/or video data 606 and can compare these words against the input words contained in the state-dependent dictionary 608. [Hart col 12 lines 37-40], the device can determine the location of the user's mouth in the captured images, the device can monitor the movement of the user's lips in order to attempt to determine any words, characters or other information spoken by the user [Hart col 6 lines 40-44]  sensory inputs are utilized by the device to identify whether the user provided a command: captured audio is utilized to detect an audible command and captured video is utilized to determine that the user (i) said that word based upon image analysis of the user's face and (ii) performed an appropriate gesture. The video of the user's face and gestures may be based upon video captured from the same or different image capture elements.  [Hart col 10, 11 line 67, 1-7] 
wherein the portable terminal device receives information from the server including the one or more results identified by the server, The system includes an electronic client device 1202, which can include any appropriate device operable to send and receive requests, messages or information over an appropriate network 1204 and convey information back to a user of the device. [Hart col 20 lines 8-12] The handling of all requests and responses, as well as the delivery of content between the client device 1202 and the application server 1208, can be handled by the Web server 1206. [Hart col 20 lines 42-46] 
wherein, when the portable terminal device receives only one result for operation identified in response to the transmitted data by the external server, the controller is further programmed to execute an operation corresponding to the one result, The device can identify words contained within the captured audio and/or video data 606 and can compare these words against the input words contained in the state-dependent dictionary 608. Upon matching 610 a word from the audio/video data with an input word, the input corresponding to the matched input word is processed 614 (e.g., launch browser, create calendar event). [Hart col 12 lines 37-46] (in this example Hart clearly shows a system that receives a single command, such as launching a browser, and performs only that command)
wherein, when the portable terminal device receives a plurality of results for operation identified in response to the transmitted data by the external server, the controller is further programmed to: display information corresponding to the plurality of results for operation as a plurality of candidates of operation options, the device may capture audio containing speech and/or video data containing images of persons other than the primary user of the device. For instance, consider the example situation 700 of FIG. 7, wherein there are three people in a meeting, including the primary user 722 of a first computing device 714 (the "user device" in this example) and two other persons (here, Mark 702 and Sarah 704) each (potentially) with their own second device 706, 708, respectively. [Hart col 12 lines 35-43] (the system displays the three possible speakers)
capture additional voice for selecting one from the plurality of results for operation (the Office views the operation here as identifying the speaker in a group) during displaying the information corresponding to the plurality of results for operation, During the meeting, the user device 714 may detect audible speech from the user(s), Mark and/or Sarah, either individually or at the same time. Thus, when the user device 714 detects audible speech, the user device can utilize one or more image algorithms to analyze the captured video data to determine whether the user, [Hart col 12 lines 50-55] 
Sarah is speaking at this point during the meeting (e.g., saying "Thanks. Quarterly profits are up 10%") 712, and the user and Mark are listening to Sarah. Thus, the user device 714, utilizing captured video data, identifies that Sarah's mouth 710 is moving at substantially the same time as the audio data containing speech content was captured. The captured video data also indicates that the user 722 and Mark 702 are not speaking at that time because their lips are not moving. Thus, the device associates the speech with Sarah. [Hart col 13 & 14  lines 66-67 & 1-8]  and
	execute an operation corresponding to the determined one result the display 718 on the user device shows a transcription of a recent portion of the meeting. Because the user device is able to determine which person spoke each word, the device can list that person's name (or other identifier) next to each portion attributed to that person. The display can also include other elements such as a cursor 720 or other element indicating who last spoke and/or the last word that was spoken. [Hart col 14 lines 27-40]  determining identities for the persons and/or devices in the room, or otherwise nearby or within a detectable range, also provides a number of other advantages and input possibilities. For example, if the device knows that one of the nearby people is named "Sarah" then the device can add that name to the sub-dictionary being used in order to more quickly and easily recognize when that name is spoken. Further, if information associated with Sarah, such as a device identifier or email address, is known to the device, such as through contact information, various actions can be performed using that information.[Hart col 14 lines 35-44]  
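
For illustration only, the claimed handling of one result versus a plurality of results can be sketched as the control flow below. This is an explanatory assumption, not an implementation disclosed by Hart or by the application: the function name and the three callbacks (execute, display, capture_voice_choice) are hypothetical.

```python
def handle_server_results(results, execute, display, capture_voice_choice):
    """Dispatch on the number of results returned by the server.

    results: list of candidate operations identified by the server.
    execute: callable that performs one operation and returns its outcome.
    display: callable that shows the candidates to the operator.
    capture_voice_choice: callable that captures additional voice input and
                          returns the selected candidate (or None).
    """
    if not results:
        return None
    if len(results) == 1:
        # Only one result: execute it directly, with no further input.
        return execute(results[0])
    # Several results: show them as candidates, then select one by additional voice input.
    display(results)
    chosen = capture_voice_choice(results)
    if chosen in results:
        return execute(chosen)
    return None


# Hypothetical usage with stand-in callbacks.
shown = []
single = handle_server_results(["launch browser"], lambda op: op, shown.extend, lambda ops: None)
multi = handle_server_results(
    ["launch browser", "create calendar event"],
    lambda op: op,
    shown.extend,
    lambda ops: "create calendar event",
)
```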

Hart discloses a user can provide input to a computing device through various combinations of speech, movement, and/or gestures. A computing device can analyze captured audio data and analyze that data to determine any speech information in the audio data. The computing device can simultaneously capture image or video information which can be used to assist in analyzing the audio information. For example, image information is utilized by the device to determine when someone is speaking, and the movement of the person's lips can be analyzed to assist in determining the words that were spoken. Any gestures or other motions can assist in the determination as well. By combining various types of data to determine user input, the accuracy of a process such as speech recognition can be improved, and the need for lengthy application training processes can be avoided. 
Hart also illustrates an example of an environment 1200 for implementing aspects in accordance with various embodiments. As will be appreciated, although a Web-based environment is used for purposes of explanation, different environments may be used, as appropriate, to implement various embodiments. The system includes an electronic client device 1202, which can include any appropriate device operable to send and receive requests, messages or information over an appropriate network 1204 and convey information back to a user of the device. The illustrative environment includes at least one application server 1208 and a data store 1210. 

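At the time the invention was made, it would have been obvious to one of ordinary skill in the art to combine the embodiments taught by Hart to design a networked client-server system that transfers audio and images from a portable device to a server containing hardware and software similar to that of the client device, performs voice and lip movement recognition at the server, and returns the result of the word spoken by the operator over the network. Hart itself states that the specification and drawings are to be regarded in an illustrative rather than a restrictive sense, and that various modifications and changes may be made thereunto without departing from the broader spirit and scope of the invention. Further, it has been held that matters of well-known design are within the ordinary skill in the art.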


Regarding claim 14 Hart teaches everything above (see claim 13).  In addition Hart teaches wherein the information displayed corresponding to the plurality of results is characters or a character string. where the speaker can actually be identified automatically and associated with the proper speech. For example, the display 718 on the user device shows a transcription of a recent portion of the meeting [Hart col 14 lines 25-28]


 where the speaker can actually be identified automatically and associated with the proper speech. For example, the display 718 on the user device shows a transcription of a recent portion of the meeting [Hart col 14 lines 25-28]

Regarding claim 16 Hart teaches everything above (see claim 14).  In addition Hart teaches wherein when the images are obtained from the camera and the voice is obtained from the microphone, the first controller is further programmed to identify if the operator is a specific operator based on at least one of the obtained images and the obtained voice, Mark's or Sarah's, lips were moving at substantially the same time as the audio was captured by at least one of microphones 730, 740, and 742 of the device 714. In this example, at least one of audio capture units or microphones 740, 742 capture audio from user 722, and audio capture element or microphone 730, located on the opposite side of the device as user 722, captures audio from Mark 702 and Sarah 704. In the example provided in FIG. 7, Sarah is speaking at this point during the meeting (e.g., saying "Thanks. Quarterly profits are up 10%") 712, and the user and Mark are listening to Sarah. Thus, the user device 714, utilizing captured video data, identifies that Sarah's mouth 710 is moving at substantially the same time as the audio data containing speech content was captured. [Hart col 13, 14 lines 45-67, 1-5] and when the operator is identified as the specific operator, the obtained images and voice are transmitted to the server. The application server 1208 can include any appropriate hardware and software for integrating with the data store 1210 as needed to execute aspects of one or more applications for the client device and handling a majority of the data access and business logic for an application. … The handling of all requests and responses, as well as the delivery of content between the client device 1202 and the application server 1208, can be handled by the Web server 1206. [Hart col 20 lines 42-46]

Regarding claim 17 Hart teaches everything above (see claim 13).  In addition Hart teaches wherein the result identified by the server is based on temporal changes of a lateral size and a vertical size of a lip in the transmitted images. if the device can determine the location of the user's mouth in the captured images, the device can monitor the movement of the user's lips in order to attempt to determine any words, characters or other information spoken by the user. Similar to a user speech model, a user word formation model can be used, developed and/or refined that models how a user forms certain words using the user's lips, tongue, teeth, cheeks or other visible portions of the user's face. [Hart col 6 lines 40-44] sensory inputs are utilized by the device to identify whether the user provided a command: captured audio is utilized to detect an audible command and captured video is utilized to determine that the user (i) said that word based upon image analysis of the user's face and (ii) performed an appropriate gesture. The video of the user's face and gestures may be based upon video captured from the same or different image capture elements.  [Hart col 10, 11 line 67, 1-7] 

5.	Claims 6, 12 and 18 are rejected under pre-AIA  35 U.S.C. 103(a) as being unpatentable over Hart in further view of Wang et al., US Patent Application (US 20140214424 A1), hereinafter "Wang".

Regarding claim 6 Hart teaches everything above (see claim 1).  Hart does not teach but Wang teaches further comprising: a speaker, IVI system 100 may include additional items such as a speaker [Wang para 0020];
wherein the controller is further programmed to: restrict output from the speaker, and control the speaker to output audio of a voice associated with the operation based on the information received from the external server when the output from the speaker is restricted. the volume of the vehicle audio output may be lowered based at least in part on the determination of whether any the one or more occupants of the vehicle is speaking. [Wang para 0051] where a user command may be determined. For example, a user command may be determined via control system 308. Such a determination of a user command may be based at least in part on the performed speech recognition and/or voice recognition [Wang para 0060]

Hart discloses a user can provide input to a computing device through various combinations of speech, movement, and/or gestures. A computing device can analyze captured audio data and analyze that data to determine any speech information in the audio data. The computing device can simultaneously capture image or video information which can be used to assist in analyzing the audio information. For example, image information is utilized by the device to determine when someone is speaking, and the movement of the person's lips can be analyzed to assist in determining the words that were spoken. Any gestures or other motions can assist in the determination as well. By combining various types of data to determine user input, the accuracy of a process such as speech recognition can be improved, and the need for lengthy application training processes can be avoided. 
Hart also illustrates an example of an environment 1200 for implementing aspects in accordance with various embodiments. As will be appreciated, although a Web-based environment is used for purposes of explanation, different environments may be used, as appropriate, to implement various embodiments. The system includes an electronic client device 1202, which can include any appropriate device operable to send and receive requests, messages or information over an appropriate network 1204 and convey information back to a user of the device. The illustrative environment includes at least one application server 1208 and a data store 1210. 
The application server 1208 can include any appropriate hardware and software for integrating with the data store 1210 as needed to execute aspects of one or more applications for the client device and handling a majority of the data access and business logic for an application. The application server provides access control services in cooperation with the data store and is able to generate content such as text, graphics, audio and/or video to be transferred to the user, which may be served to the user by the Web server 1206. The handling of all requests and responses, as well as the delivery of content between the client device 1202 and the application server 1208, can be handled by the Web server 1206.

At the time the invention was made, it would have been obvious to one of ordinary skill in the art to combine the teachings of Hart and Wang.  Hart discloses a voice and lip movement recognition system that interprets spoken words to determine a command spoken by the user of a portable device.  Wang teaches restricting the audio output of speakers in a system such as a vehicle or entertainment system, so that the speaker output is lowered while the user is speaking in order to reduce background noise.
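
For illustration only, lowering speaker output while an occupant is speaking, as Wang is cited for above, can be sketched as follows. The class name, method names and volume levels are assumptions made for this example and are not taken from Wang or Hart.

```python
class SpeakerOutput:
    """Minimal model of a speaker whose output is restricted while a user is speaking."""

    def __init__(self, volume=1.0, restricted_volume=0.2):
        self.volume = volume                        # normal output level (assumed 0.0 to 1.0 scale)
        self.restricted_volume = restricted_volume  # ceiling applied while restricted (assumption)
        self.restricted = False

    def set_user_speaking(self, speaking):
        # Restrict output for as long as speech is detected.
        self.restricted = bool(speaking)

    def effective_volume(self):
        # While restricted, never exceed the restricted ceiling.
        if self.restricted:
            return min(self.volume, self.restricted_volume)
        return self.volume


speaker_out = SpeakerOutput(volume=0.8)
speaker_out.set_user_speaking(True)
lowered = speaker_out.effective_volume()
speaker_out.set_user_speaking(False)
restored = speaker_out.effective_volume()
```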


Regarding claim 12 Hart teaches everything above (see claim 7).  Hart does not teach but Wang teaches wherein the portable terminal further comprises: a speaker, IVI system 100 may include additional items such as a speaker [Wang para 0020]; the volume of the vehicle audio output may be lowered based at least in part on the determination of whether any the one or more occupants of the vehicle is speaking. [Wang para 0051] where a user command may be determined. For example, a user command may be determined via control system 308. Such a determination of a user command may be based at least in part on the performed speech recognition and/or voice recognition [Wang para 0060]

Regarding claim 18 Hart teaches everything above (see claim 13).  Hart does not teach but Wang teaches wherein the first controller of the portable terminal is further programmed to: restrict output from a speaker of the portable terminal device, and control the speaker to output audio of a voice associated with the operation based on the information received from the server when the output from the speaker is restricted.  IVI system 100 may include additional items such as a speaker [Wang para 0020]; the volume of the vehicle audio output may be lowered based at least in part on the determination of whether any the one or more occupants of the vehicle is speaking. [Wang para 0051] where a user command may be determined. For example, a user command may be determined via control system 308. Such a determination of a user command may be based at least in part on the performed speech recognition and/or voice recognition [Wang para 0060]
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ROBERT J MICHAUD whose telephone number is (571)270-3981.  The examiner can normally be reached on 8:30 - 5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Patrick Edouard, can be reached on 571-272-7603.   The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/ROBERT J MICHAUD/Examiner, Art Unit 2694