Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

1.) Information Disclosure Statement
The information disclosure statements (IDS) submitted on 03/08/22 and 05/05/21 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

2.) Response to Arguments
Applicant’s arguments with respect to claim(s) 1-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

3.) Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Goetz (US Patent No.: 9973732B1) in view of Sasaki et al. (US Pub No.: 2003/0117505A1).

Regarding Claim 1, Goetz discloses a method implemented by one or more processors (Methods, apparatuses, and systems for facilitating video communications between users in environments including a plurality of imaging devices. A network device 126 can include one or more processors 130 and one or more computer-readable media 132 including a user locator component 134, a user profile component 136, a device selection component 138, an image processing component 140, a context component 142, and a machine learning component 144, Abstract; Column 6, Lines 54-59; Column 9, Line 49 to Column 10, Line 14), the method comprising: 
receiving, at a computing device (See computing device, Figures 1-2; Column 3, Lines 10-49; Column 6, Lines 54-59), a spoken utterance (uttering a command) that is directed to an automated assistant that is accessible via the computing device, wherein the computing device also provides access to a camera (The following example illustrates one use case for conducting a video communication in the environments 302 and 304 utilizing the network device 126. For example, the user 306 (“Alice”) can initiate a conversation with the user 318 (“Bob”) by speaking a wake word (e.g., “Computer . . . ”) and uttering a command (e.g., “Connect me to Bob.”) The imaging device 308 in the environment 302 can capture the audio uttered by the user 306 and can transmit the audio to the network device 126. The network device 126 can determine that the audio represents a request from Alice to initiate a communication with Bob. Based at least in part on one or more user preferences (and/or based at least in part on one or more commands), the network device 126 can determine that the communication is to be a video communication. 
Upon determining the communication is to be a video communication, the network device 126 can access a user profile associated with the user 318 (e.g., “Bob”) to determine that a group of devices is associated with the user profile. The network device 126 can instruct devices associated with the group of devices (e.g., the imaging device 324 and the smart appliance 328) to provide audio data and/or image data to the network device 126 so that the network device 126 can locate the user 318 and/or determine a device to be the primary device for the communication. That is, the network device 126 can receive image data from the imaging device 324 and the smart appliance 328 and can determine that the user 318 is represented in both image data. Column 3, Lines 50-57; Column 15, Lines 11-29; Figure 3; Column 17, Lines 34-50. Also see step 402 in Figure 4); 
determining, based on the spoken utterance (uttering a command), that a user is directing the automated assistant to control the camera according to whether one or more conditions are satisfied (identity and other identifiers related to a user) (The voice-controlled device can be configured to initiate a communication between two users. For instance, a user can issue a voice command to the voice-controlled device to “Connect Alice to Bob” or to “Connect me to Bob.” The voice-controlled device or another device can perform ASR on a captured audio signal to identify the command (“connect”) along with the referenced users (“Alice” and “Bob”). Similarly, a communication can be initiated using a GUI of a computing device or using a gesture-based imaging system. Based on the user requests, the network device (e.g., a server computer remote from a user environment or located at the user environment) can locate the users “Alice” and “Bob,” and can determine a primary device of one or more devices at the respective locations of the users for use in the video communication, for example. In some instances, a location of a user can be continuously tracked and stored in memory at a network device. In such a case, when a user request is received to initiate a communication, the location can be retrieved from memory and provided to initiate a video communication at the respective location, Column 3, Line 59 to Column 4, Line 14; Column 11, Lines 62-67; Column 17, Lines 34-50; Step 402, Figure 4); 
wherein the one or more conditions are described in natural language content of the spoken utterance (In response to receiving this audio signal, the speech-recognition component 202 can begin performing automated speech recognition (ASR) and/or natural language understanding on the audio signal to generate text and identify one or more user voice commands from the generated text. For instance, with reference to FIG. 1, a user request (e.g., from the user 102) can include the speech “Connect Alice to Bob.” As the audio signal representing this sound is uploaded to the speech-recognition component 202, the component 202 can identify the user requesting to initiate a communication between “Alice” and “Bob.” In some instances, the speech-recognition component 202 can include a voice recognition module that can determine an identity of a user based upon analyzing the audio signal including the speech of that user. Thus, in an example where the natural-language command includes “Connect me to Bob,” the speech-recognition component 202 can determine that “me” corresponds to “Alice”, Column 3, Lines 59-67; Column 12, Lines 7-25; Figure 2; Speech recognition component 202, Column 17, Lines 34-50. Step 402, Figure 4); 
determining, based on data that is available to the automated assistant, whether the one or more conditions are satisfied (An environment can include a first imaging device having a first field of view and a second imaging device having a second field of view. When a video communication is to be conducted in the environment, the first imaging device and the second imaging device can provide data to a network device, which can analyze the image data (e.g., using facial detection/recognition techniques) to determine an identity of the user represented in the data. Further, a user can be identified and located by monitoring a radio frequency (RF) signal associated with a device that is carried or worn by a user, and/or by voice recognition techniques. After determining an identity of a user, the network device can determine a user profile associated with the user, which can include preferences associated with conducting a video communication. For example, preferences can be associated with device selection, zoom selection, audio selection, subject framing, image composition, etc. Further, one or more machine learned algorithms can be utilized to determine an optimal view of the user for the video communication based on the fields of view associated with each imaging device, Column 1, Lines 55-65; Column 2, Lines 8-24; Column 4, Lines 4-14, 26-30; Column 6, Lines 60-67; Column 7, Lines 19-32; Figure 1; See user locator component 134, Column 16, Lines 34-38; Column 18, Lines 36-47; Figure 4; Column 19, Lines 30-58); and 
when the one or more conditions are satisfied: causing the camera to capture image data (Data captured by the imaging device 114 and the smart appliance 116 can be provided to the network device 126, which can determine to provide at least a portion of the image data 122 for presentation via the imaging device 108. Similarly, the imaging device 108 can capture data of the user 102 and provide the data to the network device 126, which in turn can provide the data for presentation via the imaging device 114 and/or the smart appliance 116. In some instances, and as illustrated in FIG. 1, the network device 126 can selectively provide data of the environment 106 for presentation via the imaging device 114 (e.g., based on the user 104 facing the imaging device 114). As discussed herein, one or more preferences associated with a user profile associated with the user 104, for example, can determine, at least in part, how data is captured and presented in the context of a video communication, Column 6, Lines 39-41; Figure 1; Column 18, Line 62-Column 19, Line 5; Figure 4).

Goetz does not explicitly teach or disclose causing the image data, captured by the camera when the one or more conditions are satisfied, to be persistently stored as a file at the computing device. Sasaki et al. teach causing image data, captured by a camera when one or more conditions are satisfied, to be persistently stored as a file at a computing device (Sasaki et al. teach a digital camera system that includes an optical system for forming optical images onto an image sensor. The image sensor provides digital images of the optical images to an intermediate memory. A display provides a visual display of selected digital images stored in the intermediate memory. A controller responds to a manual input from a user to initiate long-term storage of selected digital images from the intermediate memory in a long-term memory, Abstract; Figures 1-2B of Sasaki et al. Sasaki et al. teach permanently storing (persistently stored) only the desired images captured by a camera 100 in an external memory 115, Paragraphs 0017, 0019, 0024; Figures 1-2B of Sasaki et al.). It would have been obvious and well-known to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Goetz to cause the image data, captured by the camera when the one or more conditions are satisfied, to be persistently stored as a file at the computing device as taught by Sasaki et al., because this provides a permanent way for a user to access and retrieve the image for future use.

With regard to Claim 2, Goetz and Sasaki et al. disclose the method of claim 1, wherein determining that the user is directing the automated assistant to control the camera according to whether one or more conditions are satisfied includes: accessing, based on the natural language content of the spoken utterance (conversation), current image data that is based on an operation of the camera, and biasing speech recognition processing of audio data, corresponding to the spoken utterance, based on one or more objects (persons of interest) that are present in the current image data (Further, the person of interest component 206 can include functionality to determine a conversation score associated with individual users in an environment to determine whether the person is to be a focus of the video communication, for example. The person of interest component 206 can perform operations for each of the environments 106 and 112. For example, with respect to the environment 106, the person of interest component 206 can determine that the user 102 is the only user in the field of view 110 associated with the imaging device 108, and therefore, the user 102 can be considered to be a person of interest in the environment 106. For example, the person of interest component can utilize face detection algorithms, body detection algorithms, etc., to determine that a face or body is present in image data. Similarly, the person of interest component 206 can determine that the user 104 is the only user in the fields of view 118 and 120, and thus the person of interest component 206 can determine that the user 104 is a person of interest, Column 3, Line 59 to Column 4, Line 3; Column 12, Lines 26-31 and 47-67 of Goetz).

In regard to Claim 3, Goetz and Sasaki et al. disclose the method of claim 1, wherein determining whether the one or more conditions are satisfied includes: processing, in response to receiving the spoken utterance, other audio data that captures audio in an environment (voice recognition techniques using the audio captured) of the computing device or another computing device, and determining whether the other audio data includes one or more audio features that satisfy the one or more conditions (Further, the network device can perform automated speech recognition (ASR) on audio captured from an environment to determine a context of a conversation and/or to determine which user out of a plurality of users is a person of interest for the video communication. Further, as the user moves around his or her environment, the network device can select data from another device to be provided as the imaging stream for the video communication, with the network device determining an optimal source of image data and audio data, as well as how to present the data at a far-end destination of the video communication, Column 1, Lines 55-65; Column 2, Lines 8-24; Column 4, Lines 26-30; Column 6, Lines 60-67; Column 7, Lines 19-32; Figure 1; Column 18, Lines 36-47 and Figure 4 of Goetz. The operation can include determining that there are at least a first user and a second user at the conversation location. At 506, the operation can include determining a first conversation score and a second conversation score associated with the first user and the second user, respectively, Column 19, Lines 30-58; Figure 5 of Goetz).

With regard to Claim 4, Goetz and Sasaki et al. disclose the method of claim 1, wherein determining whether the one or more conditions are satisfied includes: processing, in response to receiving the spoken utterance, other image data that captures one or more visual features of an environment of the computing device or another computing device, and determining whether the one or more visual features satisfy the one or more conditions (An environment can include a first imaging device having a first field of view and a second imaging device having a second field of view. When a video communication is to be conducted in the environment, the first imaging device and the second imaging device can provide data to a network device, which can analyze the image data (e.g., using facial detection/recognition techniques) to determine an identity of the user represented in the data. Further, a user can be identified and located by monitoring a radio frequency (RF) signal associated with a device that is carried or worn by a user, and/or by voice recognition techniques. After determining an identity of a user, the network device can determine a user profile associated with the user, which can include preferences associated with conducting a video communication, Column 1, Lines 55-65; Column 2, Lines 8-24; Column 4, Lines 26-30; Column 6, Lines 60-67; Column 7, Lines 19-32; Figure 1 of Goetz. See step 410, Column 18, Lines 36-47; Figure 4 of Goetz. Also see steps 504 and 506, Column 19, Lines 30-58; Figure 5 of Goetz). 

Regarding Claim 5, Goetz and Sasaki et al. disclose the method of claim 1, wherein causing the camera to capture the image data includes: modifying, based on the natural language content of the spoken utterance (conversations), one or more settings of the camera (panning, tilting to track a user), wherein the image data is captured when the camera is operating according to the one or more settings (In a near-end environment with at least two users, the network device can receive data from one or more imaging devices and analyze the data to determine a conversation score associated with individual users. For example, a conversation score can represent a level of engagement of the user in the communication, and can determine whether the network device should “follow” the user as the user moves about the environment, for example, or whether the user can be emphasized in the video communication. 
For example, a conversation score can be based at least in part on one or more of a location of the user in a field of view of the imaging device, a context of speech of the user (e.g., determined using ASR), movement of the user, preferences associated with a user profile of the user, etc. In some instances, the conversation score can be used to determine a size of image data representing the user on a far-end device of the video communication. As discussed herein, a near-end device can be considered to be a source of data (e.g., an imaging device capturing data), while a far-end device can be considered to be a destination of the data (e.g., a display presenting at least a portion of the data). In some instances, an imaging device can be panned, tilted, or otherwise manipulated to track one or more users based in part on a conversation score. Further, the network device can crop a portion of image data based on a resolution of the data, a location of a person of interest, etc., Column 1, Lines 55-65; Column 2, Lines 8-50 of Goetz).

In regard to Claim 6, Goetz and Sasaki et al. disclose the method of claim 1, wherein determining whether the one or more conditions are satisfied includes: processing, in response to receiving the spoken utterance (as conversation is initiated), application data that indicates a state of an application (frame rates/resolutions) that is accessible via the computing device or another computing device, and determining whether the state of the application satisfies the one or more conditions (As a conversation is initiated, the network device can instruct one or more devices (e.g., imaging devices) to transmit data to the network device to determine which device of the one or more devices is to be a primary device. In some instances, the network device can provide data captured by the near-end imaging device to a far-end display to facilitate the video communication. In some instances, while the primary device is providing data to the network device at a first rate (e.g., first bit rate, first frame rate, first resolution, etc.), other imaging devices can be providing data to the network device at a second rate (e.g., second bit rate, second frame rate, second resolution, etc.). In some cases, the second rate can be lower (or represent lower quality) than the first rate. For example, an imaging device designated as the primary device can be capturing and providing data to the network device at a first rate of 60 frames per second (FPS), while a secondary device can be capturing data and/or providing data to the network device at a second rate of 2 FPS. 
If the network device determines that the secondary device can provide optimal image data representing the user, the network device can instruct the secondary device to increase the rate (e.g., quality) at which data is captured and/or provided by the secondary device to the network device, in advance of transitioning the far-end stream from the primary device to the (current) secondary device, which can be designated as the next primary device, Column 2, Line 51 to Column 3, Line 9 of Goetz).

With regard to Claim 7, Goetz and Sasaki et al. disclose the method of claim 1, wherein the computing device is a portable computing device and the spoken utterance is received while the user is handling the portable computing device (See the portable computing device, such as a smartphone, Column 3, Lines 10-57; Column 5, Line 61 to Column 6, Line 9; Figure 1; Column 21, Line 30 to Column 22, Line 8 of Goetz).

Regarding Claim 8, Goetz and Sasaki et al. disclose the method of claim 7, wherein causing the camera to capture the image data is performed without the user, subsequent to providing the spoken utterance, directly contacting any programmable touch interface of the computing device (See the portable computing device that includes a touch screen for providing input as well as additional functionality, Column 3, Lines 10-57; Column 5, Line 61 to Column 6, Line 9; Figure 1; Column 21, Line 30 to Column 22, Line 8 of Goetz).

With regard to Claim 9, Goetz discloses a method implemented by one or more processors (Methods, apparatuses, and systems for facilitating video communications between users in environments including a plurality of imaging devices. For example, an environment can include first and second imaging devices having associated fields of view providing multiple perspectives of a user. Upon initiating a video communication session, a network device (which includes one or more processors) can receive image data from the imaging devices to determine an identity of a user. A user profile of the user can include preferences for the communication, such as device selection, zoom selection, audio selection, subject framing, and the like. Further, the network device can utilize machine learned algorithms to determine an optimal view of the user for the video communication based on the fields of view associated with each imaging device. Column 6, Lines 54-59; Figures 1-2), the method comprising: 
receiving, at a computing device (Voice controlled device 340, Figure 3; Column 14, Line 47 to Column 15, Line 6 of Goetz), an input from a user (voice/speech from a user), wherein the computing device provides access to an automated assistant (network device) and a camera (imaging device 308) (A user can issue a voice command to the voice-controlled device to “Connect Alice to Bob” or to “Connect me to Bob.” The voice-controlled device or another device can perform ASR on a captured audio signal to identify the command (“connect”) along with the referenced users (“Alice” and “Bob”). Similarly, a communication can be initiated using a GUI of a computing device or using a gesture-based imaging system, Column 3, Lines 50-67 of Goetz. 
The user 306 (“Alice”) can initiate a conversation with the user 318 (“Bob”) by speaking a wake word (e.g., “Computer . . . ”) and uttering a command (e.g., “Connect me to Bob.”) The imaging device 308 in the environment 302 can capture the audio uttered by the user 306 and can transmit the audio to the network device 126. The network device 126 can determine that the audio represents a request from Alice to initiate a communication with Bob. Based at least in part on one or more user preferences (and/or based at least in part on one or more commands), the network device 126 can determine that the communication is to be a video communication, Column 15, Lines 11-24 and Figure 3 of Goetz. Also see step 402, Column 17, Lines 34-50; Figure 4); 
determining, based on the input (spoken commands), that the input is a request for the automated assistant to operate the camera according to one or more conditions (In some cases, a communication can be initiated by a user via a voice command received by the imaging device 108 or 114, the smart appliance 116, or any other devices that can be present in an environment (e.g., the environments 106 and 112). For example, in response to one of the imaging device 108 or 114, or the smart appliance 116 identifying the user 102 and/or 104, and/or in response to the user 102 and/or 104 speaking a predefined wake word, the particular device can begin uploading an audio signal (or image data including an audio signal) representing sound captured in the environment 106 or 112, respectively, up to the network device 126 over the network 128, Column 3, Lines 59-67; Column 11, Lines 62-67. Also see step 402, Column 17, Lines 34-50), wherein the one or more conditions are specified in natural language content of the input (In response to receiving this audio signal, the speech-recognition component 202 can begin performing automated speech recognition (ASR) and/or natural language understanding on the audio signal to generate text and identify one or more user voice commands from the generated text. For instance, with reference to FIG. 1, a user request (e.g., from the user 102) can include the speech “Connect Alice to Bob.” As the audio signal representing this sound is uploaded to the speech-recognition component 202, the component 202 can identify the user requesting to initiate a communication between “Alice” and “Bob”, Column 3, Lines 59-67; Column 12, Lines 7-25; Figure 2 and Column 17, Lines 34-50); 
accessing, based on the one or more conditions, one or more trained machine learning models, wherein the automated assistant accesses the one or more trained machine learning models to assist with identifying one or more features of an environment of the computing device or another computing device (The machine learning component 144 can receive the image data 122 and 124 and determine that the image data 122 represents a “better” image of the user 104 than the image data 124. In some instances, the machine learning component 144 can be trained using training data that has been annotated or scored to indicate an optimal view of a user in a video communication. In some instances, the machine learning component 144 can use any machine learning algorithms, including but not limited to, neural networks, convolutional neural networks, decision forests, etc., Column 2, Lines 4-7; Column 4, Lines 40-54; Column 10, Line 27-Column 11, Line 13; Figure 2); 
processing, using the one or more trained machine learning models (see machine learning component 144), data that characterizes one or more current features of the environment of the computing device or another computing device (The network device 126 can include one or more processors 130 and one or more computer-readable media 132 including a user locator component 134, a user profile component 136, a device selection component 138, an image processing component 140, a context component 142, and a machine learning component 144. The machine learning component 144 can include functionality to determine an optimal view of a user during a communication to provide a seamless communication experience. For example, the machine learning component 144 can receive the image data 122 and 124 and determine that the image data 122 represents a “better” image of the user 104 than the image data 124. In some instances, the machine learning component 144 can be trained using training data that has been annotated or scored to indicate an optimal view of a user in a video communication. In some instances, the machine learning component 144 can use any machine learning algorithms, including but not limited to, neural networks, convolutional neural networks, decision forests, etc., Column 4, Lines 40-54; Column 10, Line 27 to Column 11, Line 13; Figure 2); 
causing the camera to capture image data (Data captured by the imaging device 114 and the smart appliance 116 can be provided to the network device 126, which can determine to provide at least a portion of the image data 122 for presentation via the imaging device 108. Similarly, the imaging device 108 can capture data of the user 102 and provide the data to the network device 126, which in turn can provide the data for presentation via the imaging device 114 and/or the smart appliance 116. In some instances, and as illustrated in FIG. 1, the network device 126 can selectively provide data of the environment 106 for presentation via the imaging device 114 (e.g., based on the user 104 facing the imaging device 114). As discussed herein, one or more preferences associated with a user profile associated with the user 104, for example, can determine, at least in part, how data is captured and presented in the context of a video communication, Column 6, Lines 39-41; Figure 1; Column 18, Line 62-Column 19, Line 5; Figure 4);
determining, based on the data, whether the one or more current features of the environment satisfy the one or more conditions, wherein the one or more conditions are satisfied when the environment of the computing device or the other computing device exhibits one or more specified features (In some instances, a user can be located within a room or a zone associated with an environment, such as a home of the user. In some instances, a user environment can include imaging devices that can image the environment and perform facial recognition to determine that the user (e.g., “Alice” or “Bob”) is in a particular room or zone of an environment. In some instances, it may not be possible to identify a user with certainty, and an environment can monitor other sensor data associated with a user to improve a certainty or confidence level of an identity and/or location of the user, Column 1, Lines 55-65; Column 2, Lines 8-24; Column 4, Lines 4-14, 26-30; Column 6, Lines 60-67; Column 7, Lines 19-32; Figure 1; Column 16, Lines 34-38; Column 18, Lines 36-47; Figure 4; Column 19, Lines 30-58 and Figure 5).  
Goetz does not explicitly disclose that in response to first image data, of the image data, being captured by the camera when the one or more conditions are determined to be satisfied, causing the first image data to be persistently stored as a file at the computing device; and in response to second image data, of the image data, being captured by the camera when the one or more conditions are determined to not be satisfied, causing the second image data to be deleted. Sasaki et al. teach that in response to first image data, of the image data, being captured by the camera when the one or more conditions are determined to be satisfied (desired images), causing the first image data to be persistently stored as a file at the computing device; and in response to second image data, of the image data, being captured by the camera when the one or more conditions are determined to not be satisfied, causing the second image data to be deleted, 
(Sasaki et al. teach a digital camera system that includes an optical system for forming optical images onto an image sensor. The image sensor provides digital images of the optical images to an intermediate memory. A display provides a visual display of selected digital images stored in the intermediate memory. A controller responds to a manual input from a user to initiate long-term storage of selected digital images from the intermediate memory in a long-term memory, Abstract; Figures 1-2B of Sasaki et al. Sasaki et al. teach permanently storing (persistently stored) only the desired images captured by a camera 100 in an external memory 115. Unwanted images (second image data where conditions are not satisfied) are deleted, Paragraphs 0017, 0019, 0024; Figures 1-2B of Sasaki et al. 
It would have been obvious and well-known to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Goetz to cause first image data, captured by the camera when the one or more conditions are satisfied, to be persistently stored as a file at the computing device as taught by Sasaki et al., because this provides a permanent way for a user to access and retrieve the image for future use. It would have also been obvious and well-known to one of ordinary skill in the art before the effective filing date of the claimed invention, in response to second image data, of the image data, being captured by the camera when the one or more conditions are determined to not be satisfied (unwanted images), to cause the second image data to be deleted, since this would free up memory and thereby make more efficient and effective use of the memory, Paragraph 0017 of Sasaki et al.).

In regard to Claim 10, Goetz and Sasaki et al. disclose the method of claim 9, wherein a condition of the one or more conditions is satisfied when a current feature of the environment exhibits a particular property, and wherein processing the data that characterizes the one or more current features includes: assigning a confidence score (confidence level) for a property of the current feature of the environment, wherein the condition is satisfied when the confidence score satisfies a threshold score (As mentioned above, a user can be located within a room or a zone associated with an environment, such as a home of the user. In some instances, a user environment can include imaging devices that can image the environment and perform facial recognition to determine that the user (e.g., “Alice” or “Bob”) is in a particular room or zone of an environment. In some instances, it may not be possible to identify a user with certainty, and an environment can monitor other sensor data associated with a user to improve a certainty or confidence level of an identity and/or location of the user, Column 1, Lines 55-65; Column 2, Lines 8-24; Column 4, Lines 4-14, 26-30; Column 6, Lines 60-67; Column 7, Lines 19-32; Figure 1; Column 16, Lines 34-38; Column 18, Lines 36-47; Figure 4; Column 19, Lines 30-58 and Figure 5 of Goetz).

Regarding Claim 11, Goetz and Sasaki et al. disclose the method of claim 10, wherein determining that the input is the request for the automated assistant to operate the camera according to the one or more conditions includes: biasing, based on the current feature of the environment, a natural language understanding of the input (Further, the person of interest component 206 can include functionality to determine a conversation score associated with individual users in an environment to determine whether the person is to be a focus of the video communication, for example. The person of interest component 206 can perform operations for each of the environments 106 and 112. For example, with respect to the environment 106, the person of interest component 206 can determine that the user 102 is the only user in the field of view 110 associated with the imaging device 108, and therefore, the user 102 can be considered to be a person of interest in the environment 106. For example, the person of interest component can utilize face detection algorithms, body detection algorithms, etc., to determine that a face or body is present in image data. Similarly, the person of interest component 206 can determine that the user 104 is the only user in the fields of view 118 and 120, and thus the person of interest component 206 can determine that the user 104 is a person of interest, Column 3, Line 59 to Column 4, Line 3; Column 12, Lines 26-31 and 47-67 of Goetz).

With regard to Claim 12, Goetz and Sasaki et al. disclose the method of claim 9, further comprising: determining that the input, or another input, includes another request for the automated assistant to cause the image data to be modified (cropping portion of image data etc.); and when the one or more conditions are determined to be satisfied: causing the image data that is captured by the camera to be modified according to the input or the other input (In a near-end environment with at least two users, the network device can receive data from one or more imaging devices and analyze the data to determine a conversation score associated with individual users. For example, a conversation score can represent a level of engagement of the user in the communication, and can determine whether the network device should “follow” the user as the user moves about the environment, for example, or whether the user can be emphasized in the video communication. 
For example, a conversation score can be based at least in part on one or more of a location of the user in a field of view of the imaging device, a context of speech of the user (e.g., determined using ASR), movement of the user, preferences associated with a user profile of the user, etc. In some instances, the conversation score can be used to determine a size of image data representing the user on a far-end device of the video communication. As discussed herein, a near-end device can be considered to be a source of data (e.g., an imaging device capturing data), while a far-end device can be considered to be a destination of the data (e.g., a display presenting at least a portion of the data). In some instances, an imaging device can be panned, tilted, or otherwise manipulated to track one or more users based in part on a conversation score. Further, the network device can crop a portion of image data based on a resolution of the data, a location of a person of interest, etc., Column 1, Lines 55-65; Column 2, Lines 8-50 of Goetz).

Regarding Claim 13, Goetz and Sasaki et al. disclose the method of claim 9, wherein the other request is embodied in the other input provided by the user, and wherein the other input (instruction to pan, tilt and zoom) is received when the camera is capturing the image data (The conversation score can be used to determine a size of image data representing the user on a far-end device of the video communication. As discussed herein, a near-end device can be considered to be a source of data (e.g., an imaging device capturing data), while a far-end device can be considered to be a destination of the data (e.g., a display presenting at least a portion of the data). In some instances, an imaging device can be panned, tilted, or otherwise manipulated to track one or more users based in part on a conversation score. Further, the network device can crop a portion of image data based on a resolution of the data, a location of a person of interest, etc., Column 2, Lines 45-50; Column 20, Lines 6-18 of Goetz).

With regard to Claim 14, Goetz and Sasaki et al. disclose the method of claim 9, wherein causing the camera to capture the image data is performed without the user directly contacting a touch interface (without touching touchscreen and thus controlled using voice commands) of the computing device to start capturing the image data (The environment includes a device configured to receive voice commands from the user and to cause performance of the operations requested via these voice commands. Such a device, which can be known as a “voice-controlled device,” can include one or more microphones for capturing audio signals that represent or are otherwise associated with sound from an environment, including voice commands of the user. The voice-controlled device can also be configured to perform automated speech recognition (ASR) on the audio signals, or can be configured to provide the audio signals to another device (e.g., a device of a network device) for performing the ASR on the audio signals. After the voice-controlled device or another device identifies a voice command of the user, the voice-controlled device or the other device can attempt to direct the requested operation to be performed, Column 3, Lines 10-57; Column 5, Line 61 to Column 6, Line 9; Figure 1 of Goetz. Once a request to establish a communication session from a user profile is determined, image data is received from an imaging device, Claims 1-5 of Goetz).

Regarding Claim 15, Goetz discloses a method implemented by one or more processors (Methods, apparatuses, and systems for facilitating video communications between users in environments including a plurality of imaging devices. For example, an environment can include first and second imaging devices having associated fields of view providing multiple perspectives of a user. Upon initiating a video communication session, a network device (which includes one or more processors) can receive image data from the imaging devices to determine an identity of a user. A user profile of the user can include preferences for the communication, such as device selection, zoom selection, audio selection, subject framing, and the like. Further, the network device can utilize machine learned algorithms to determine an optimal view of the user for the video communication based on the fields of view associated with each imaging device. Column 6, Lines 54-59; Figures 1-2), the method comprising: 
receiving, by a computing device (Voice controlled device 340, Figure 3; Column 14, Line 47 to Column 15, Line 6), a spoken utterance from a user (voice/speech from a user), wherein the computing device provides access to an automated assistant (network device) and a camera (imaging device 308) (A user can issue a voice command to the voice-controlled device to “Connect Alice to Bob” or to “Connect me to Bob.” The voice-controlled device or another device can perform ASR on a captured audio signal to identify the command (“connect”) along with the referenced users (“Alice” and “Bob”). Similarly, a communication can be initiated using a GUI of a computing device or using a gesture-based imaging system, Column 3, Lines 50-67. 
The user 306 (“Alice”) can initiate a conversation with the user 318 (“Bob”) by speaking a wake word (e.g., “Computer . . . ”) and uttering a command (e.g., “Connect me to Bob.”) The imaging device 308 in the environment 302 can capture the audio uttered by the user 306 and can transmit the audio to the network device 126. The network device 126 can determine that the audio represents a request from Alice to initiate a communication with Bob. Based at least in part on one or more user preferences (and/or based at least in part on one or more commands), the network device 126 can determine that the communication is to be a video communication, Column 15, Lines 11-24 and Figure 3. Also see step 402, Column 17, Lines 34-50; Figure 4); 
determining, based on the spoken utterance (spoken commands), that the spoken utterance includes a request for the automated assistant to control the camera (In some cases, a communication can be initiated by a user via a voice command received by the imaging device 108 or 114, the smart appliance 116, or any other devices that can be present in an environment (e.g., the environments 106 and 112). For example, in response to one of the imaging device 108 or 114, or the smart appliance 116 identifying the user 102 and/or 104, and/or in response to the user 102 and/or 104 speaking a predefined wake word, the particular device can begin uploading an audio signal (or image data including an audio signal) representing sound captured in the environment 106 or 112, respectively, up to the network device 126 over the network 128, Column 3, Lines 59-67; Column 11, Lines 62-67. Also see step 402, Column 17, Lines 34-50) wherein the spoken utterance specifies one or more conditions that, when satisfied, causes the automated assistant to initialize performance of an operation that utilizes the camera (In response to receiving this audio signal, the speech-recognition component 202 can begin performing automated speech recognition (ASR) and/or natural language understanding on the audio signal to generate text and identify one or more user voice commands from the generated text, Column 3, Lines 59-67; Column 12, Lines 7-25; Figure 2 and Column 17, Lines 34-50. A conversation score can be based at least in part on one or more of a location of the user in a field of view of the imaging device, a context of speech of the user (e.g., determined using ASR), movement of the user, preferences associated with a user profile of the user, etc. In some instances, the conversation score can be used to determine a size of image data representing the user on a far-end device of the video communication. 
As discussed herein, a near-end device can be considered to be a source of data (e.g., an imaging device capturing data), while a far-end device can be considered to be a destination of the data (e.g., a display presenting at least a portion of the data). In some instances, an imaging device can be panned, tilted, or otherwise manipulated to track one or more users based in part on a conversation score. Further, the network device can crop a portion of image data based on a resolution of the data, a location of a person of interest, etc., Column 1, Lines 55-65; Column 2, Lines 8-50); 
processing, based on the one or more conditions, image data that is generated using the camera in furtherance of determining whether the one or more conditions are satisfied (The network device 126 can include one or more processors 130 and one or more computer-readable media 132 including a user locator component 134, a user profile component 136, a device selection component 138, an image processing component 140, a context component 142, and a machine learning component 144. The machine learning component 144 can include functionality to determine an optimal view of a user during a communication to provide a seamless communication experience. For example, the machine learning component 144 can receive the image data 122 and 124 and determine that the image data 122 represents a “better” image of the user 104 than the image data 124. In some instances, the machine learning component 144 can be trained using training data that has been annotated or scored to indicate an optimal view of a user in a video communication. In some instances, the machine learning component 144 can use any machine learning algorithms, including but not limited to, neural networks, convolutional neural networks, decision forests, etc., Column 4, Lines 40-54; Column 10, Line 27 to Column 11, Line 13; Figure 2);  and 
when the one or more conditions are determined to be satisfied: 
causing the automated assistant to initialize performance of the operation using the camera, wherein initializing the operation causes the camera to capture additional image data (capturing additional data by tracking a user) (Again, as mentioned above, a conversation score can be based at least in part on one or more of a location of the user in a field of view of the imaging device, a context of speech of the user (e.g., determined using ASR), movement of the user, preferences associated with a user profile of the user, etc. In some instances, the conversation score can be used to determine a size of image data representing the user on a far-end device of the video communication. As discussed herein, a near-end device can be considered to be a source of data (e.g., an imaging device capturing data), while a far-end device can be considered to be a destination of the data (e.g., a display presenting at least a portion of the data). In some instances, an imaging device can be panned, tilted, or otherwise manipulated to track one or more users based in part on a conversation score. Further, the network device can crop a portion of image data based on a resolution of the data, a location of a person of interest, etc., Column 1, Lines 55-65; Column 2, Lines 8-50). 
Goetz does not explicitly disclose causing the captured additional image data to be persistently stored, as a file at the computing device, and wherein at least some of the image data, processed in determining whether the one or more conditions are satisfied, is only temporarily stored at the computing device. Sasaki et al. teach of causing the captured additional image data to be persistently stored, as a file at the computing device, and wherein at least some of the image data, processed in determining whether the one or more conditions are satisfied, is only temporarily stored at the computing device,
(Sasaki et al. teach a digital camera system that includes an optical system for forming optical images onto an image sensor. The image sensor provides digital images of the optical images to an intermediate memory. A display provides a visual display of selected digital images stored in the intermediate memory. A controller responds to a manual input from a user to initiate long-term storage of selected digital images from the intermediate memory in a long-term memory, Abstract; Figures 1-2B of Sasaki et al. Sasaki et al. teach permanently storing (persistently stored) desired images captured by a camera 100 in an external memory 115. Unwanted images (second image data where the conditions are not satisfied) are deleted, Paragraphs 0017, 0019, 0024; Figures 1-2B of Sasaki et al. The user has a choice to optionally delete or keep images (temporarily store). A user may change long term storage media after taking a picture so that the image can be stored on an appropriate or preselected memory device (temporarily stored at computing device), Paragraph 0017 and Figures 1-2B of Sasaki et al.
It would have been obvious and well-known to one of ordinary skill in the art before the effective filing date of the claimed invention to enable the teachings of Goetz to cause image data, captured by the camera when the one or more conditions are satisfied, to be persistently stored as a file at the computing device as taught by Sasaki et al., because this provides a permanent way for a user to access and retrieve the image for future use. It would have also been obvious and well-known to one of ordinary skill in the art before the effective filing date of the claimed invention to cause at least some of the image data, processed in determining whether the one or more conditions are satisfied, to be only temporarily stored (deleted or changing of memory device) at the computing device, since this would free up memory and thereby make more efficient and effective use of the memory, Paragraph 0017 of Sasaki et al.).


With regard to Claim 16, Goetz and Sasaki et al. disclose the method of claim 15, wherein the additional image data includes video data, and wherein causing the automated assistant to initialize performance of the operation using the camera includes: causing the camera to capture the video data for a period of time in which the one or more conditions are satisfied (At 412, the operation can include selecting a device as a primary device based at least in part on the person of interest. For example, the operation 412 can include utilizing one or more machine learning algorithms to determine an optimal view of a user designated as a person of interest for a conversation. Further, selecting a device as the primary device can be based on one or more preferences associated with a user profile, for example. In some cases, user preferences can indicate which devices are to be used for particular types of communication (e.g., voice, video, etc.), between particular users, time of the day, a context of the communication, objects present in the field of view associated with an imaging device, restricted areas of an environment, a number of users in an environment, etc. The operation 414 can include processing the data, such as cropping or zooming to frame the person of interest according to an optimal view and/or user preferences, Column 6, Lines 39-41; Figure 1; Column 18, Line 6 to Column 19, Line 5; Figure 4 of Goetz).

In regard to Claim 17, Goetz and Sasaki et al. disclose the method of claim 15, further comprising: identifying, based on the one or more conditions, one or more trained machine learning models, wherein processing the image data is performed using the one or more trained machine learning models (machine learning component), and wherein the one or more trained machine learning models are trained using training data that characterizes environmental features that satisfy the one or more conditions (The network device 126 can include one or more processors 130 and one or more computer-readable media 132 including a user locator component 134, a user profile component 136, a device selection component 138, an image processing component 140, a context component 142, and a machine learning component 144. The machine learning component 144 can include functionality to determine an optimal view of a user during a communication to provide a seamless communication experience. For example, the machine learning component 144 can receive the image data 122 and 124 and determine that the image data 122 represents a “better” image of the user 104 than the image data 124. In some instances, the machine learning component 144 can be trained using training data that has been annotated or scored to indicate an optimal view of a user in a video communication. In some instances, the machine learning component 144 can use any machine learning algorithms, including but not limited to, neural networks, convolutional neural networks, decision forests, etc., Column 4, Lines 40-54; Column 10, Line 27 to Column 11, Line 13; Figure 2 of Goetz).

With regard to Claim 18, Goetz and Sasaki et al. disclose the method of claim 15, further comprising: subsequent to determining that the one or more conditions are satisfied: processing separate image data (image captured by the camera of a user who is not of interest) in furtherance of determining whether the one or more conditions are no longer satisfied (determining that a person is no longer of interest), wherein the separate image data is captured using the camera (In some instances, the user 320 “Eve” can enter the environment 304 and can enter the field of view 326 and utter “Hi, Alice!” However, the user 320 may be moving through the field of view 326, represented by movement 346. Based at least in part on the movement (e.g., speed, direction, length of time in the field of view 326, etc.), and based at least in part on the context of the utterance (e.g., “Hi, Alice!”), the network device 126 can associate a low conversation score with the user 320, and accordingly, can determine that the user 320 is not to be a person of interest. Accordingly, the network device 126 can determine not to provide image data representing the user 320 to the environment 302, Column 15, Lines 48-62 of Goetz).

Regarding Claim 19, Goetz and Sasaki et al. disclose the method of claim 18, further comprising: subsequent to determining that the one or more conditions are satisfied: determining that the one or more conditions are no longer satisfied, and causing, based on the one or more conditions no longer being satisfied, the computing device to store the additional image data and at least a portion of the separate image data (region data and background etc. associated with both satisfactory and unsatisfactory conditions) as an image file (As mentioned above, a user 320 “Eve” can enter the environment 304 and can enter the field of view 326 and utter “Hi, Alice!” However, the user 320 may be moving through the field of view 326, represented by movement 346. Based at least in part on the movement (e.g., speed, direction, length of time in the field of view 326, etc.), and based at least in part on the context of the utterance (e.g., “Hi, Alice!”), the network device 126 can associate a low conversation score with the user 320, and accordingly, can determine that the user 320 is not to be a person of interest. Accordingly, the network device 126 can determine not to provide image data representing the user 320 to the environment 302. In another instance, another user, “Victor,” can be determined to be a person of interest. For example, the user 322 (“Victor”) can enter the environment 304, stand within the field of view 330, and utter “Hi Alice! Let me tell you about my day . . . ”. In some cases, the network device 126 can determine that the user 322 is a person of interest based at least in part on a location of the user 322 within the field of view 330, a speed of the user 322, a length of time in the field of view, a determination that the user 322 is viewing a display or imaging device associated with the smart appliance 328, a context of the utterance, etc. 
That is, the network device 126 can determine that the user 322 is engaging in the conversation, and can seamlessly integrate the user 322 into the conversation, Column 15, Line 48 to Column 16, Line 47 of Goetz).

With regard to Claim 20, Goetz and Sasaki et al. disclose the method of claim 15, wherein causing the automated assistant to initialize performance of the operation using the camera is performed without the user directly contacting a touch interface of the computing device (without touching touchscreen and thus using the voice command) to start capturing the additional image data or stop capturing the additional image data (The environment includes a device configured to receive voice commands from the user and to cause performance of the operations requested via these voice commands. Such a device, which can be known as a “voice-controlled device,” can include one or more microphones for capturing audio signals that represent or are otherwise associated with sound from an environment, including voice commands of the user. The voice-controlled device can also be configured to perform automated speech recognition (ASR) on the audio signals, or can be configured to provide the audio signals to another device (e.g., a device of a network device) for performing the ASR on the audio signals. After the voice-controlled device or another device identifies a voice command of the user, the voice-controlled device or the other device can attempt to direct the requested operation to be performed, Column 3, Lines 10-57; Column 5, Line 61 to Column 6, Line 9; Figure 1 of Goetz. Once a request to establish a communication session from a user profile is determined, image data is received from an imaging device, Claims 1-5 of Goetz).



Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to PRITHAM DAVID PRABHAKHER whose telephone number is (571)270-1128. The examiner can normally be reached Monday to Friday 8:00 am to 5:00 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Lin Ye, can be reached at (571)272-7372. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





Pritham David Prabhakher
Patent Examiner
Pritham.Prabhakher@uspto.gov
/PRITHAM D PRABHAKHER/Primary Examiner, Art Unit 2697