Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claims 1-20 are pending. Claims 1 and 11 are independent.
This Application was published as U.S. 2019/0311718.
Apparent potential priority 5 April 2018.

The instant Application is directed to a virtual assistant method or device for receiving a voice command/query from a user and responding to the user.  The modality of the response, as speech or display, and other characteristics of the response take into account “a current use context” of the device.  The “current use context” is determined from a variety of sensors, such as cameras and microphones, that track parameters such as the location of the user, the direction of his gaze, his distance from the virtual assistant, and the volume of his voice as he utters the command/query.  The device then sets the volume of the output voice or adjusts the size of the output elements on an output display accordingly.  Scott (U.S. 2017/0289766), filed 13 December 2016, is the closest reference identified to date.
Claim Objections
Claims 9 and 19 are objected to because of informalities that may be addressed by the following suggested amendments.

The Claims are missing a “:” which results in an ambiguity in the language.
9. The voice-interaction device of claim 8, 
wherein the context controller is configured to detect: 
the target user's speech volume based at least in part on an amplitude of the input audio signal,
the at least one voice characteristic,
a distance between the target user and the voice-interaction device, and/or
environmental noise.

19. The method of claim 18, further comprising: 
detecting the target user's speech volume based at least in part on an amplitude of the input audio signal, 
detecting the at least one voice characteristic, 
detecting a distance between the target user and the voice-interaction device, and/or
detecting environmental noise.

As is and without the proper “:” mark, the Claims could be interpreted to mean that “the context controller is configured to detect the target user’s speech volume based … on an amplitude … voice characteristics … distance … environmental noise.”  This (1) does not make sense and (2) is not consistent with the supporting Specification.  First, the volume of the user’s voice as he issues the command is based on the amplitude of the input voice; it is not based on the distance between the user and the device or on the noise.  The distance can be 1 foot or 20 feet and the user can be standing there quietly in both situations.  Second, the goal of determining the listed factors is to determine an “OUTPUT” volume and the output volume would depend on the distance or 

Note the supporting portion of the Specification in the instant Application:
[0022] In various embodiments, systems and methods for adaptively adjusting the voice output are also provided. In one embodiment, a voice-interaction device receives audio input signals including a target audio signal, enhances the target audio signal, determines audio characteristics associated with the target audio signal, and modulates the audio output in accordance with the determined characteristics and other available context information. The context controller may determine an appropriate output volume level for the use context, for example, by analyzing a plurality of characteristics of the user's voice and the environment which may include the speech volume in the target audio signal as measured at the device input, environmental noise (e.g., background noise level), distance between the target user and the voice-interaction device (e.g., via estimated time of arrival of speech from user, object tracking from an image sensor), voicing (e.g., whether the speech is whisper, neutral speech, shouting), and other voice characteristics. For example, if target audio signal includes whispered speech commands spoken by a user in a quiet room who is near the voice-interactive device, the context controller may lower the device output volume to deliver voice response output that approximates the input context. In another example, if the target audio signal includes shouted speech commands received in a noisy environment from a target user who is a distance away from the voice-interactive device (e.g., across 

Appropriate correction is required.

Additionally, note that the connector “and/or” means “and or or,” which, because the two alternatives are themselves joined by “or,” reduces to just “or.”  Thus, this connector is interpreted as “or” and plays the same role that an “at least one” in the preamble would have had.  In other words, a reference that teaches any one of the listed limitations teaches the Claim.

According to the above interpretation, Claims 9 and 19 are mapped to Scott alone.  However, the Conclusion section provides a mapping of the Claim to Scott, Raitio, and Shurtz in the case that “and/or” was intended as “and” and all of the limitations need to be mapped.
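The difference between the two readings of “and/or” can be illustrated with a minimal sketch.  The limitation labels below are abbreviations chosen for illustration, the function names are hypothetical, and the set of limitations attributed to Scott simply restates the mapping given in the rejection of Claim 9; the sketch is not itself a finding.

```python
# Limitations listed in Claims 9 and 19 (labels abbreviated for illustration).
LIMITATIONS = [
    "speech volume",
    "voice characteristic",
    "distance",
    "environmental noise",
]

# Limitations that the rejection of Claim 9 maps to Scott alone.
TAUGHT_BY_SCOTT = {"voice characteristic", "distance", "environmental noise"}


def meets_or_reading(taught):
    """'and/or' read as 'or': teaching any one limitation is enough."""
    return any(item in taught for item in LIMITATIONS)


def meets_and_reading(taught):
    """'and/or' read as 'and': every listed limitation must be taught."""
    return all(item in taught for item in LIMITATIONS)


if __name__ == "__main__":
    print("or reading satisfied by Scott alone:", meets_or_reading(TAUGHT_BY_SCOTT))
    print("and reading satisfied by Scott alone:", meets_and_reading(TAUGHT_BY_SCOTT))
```

Under the “or” reading a single reference suffices, while under the “and” reading the untaught speech-volume limitation would require additional references, which is why the alternative mapping is provided in the Conclusion section.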
35 U.S.C. 112(f) Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 
The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. 
Such claim limitation(s) is/are: “input component” and “virtual assistant module” in Claim 1. These limitations are generic in the context of the art, do not refer to any specific structure, and serve only as placeholders for the structure that performs the associated function(s) without providing any information about what that structure is. MPEP 2181 I A says:
For a term to be considered a substitute for "means," and lack sufficient structure for performing the function, it must serve as a generic placeholder and thus not limit the scope of the claim to any specific manner or structure for performing the claimed function. It is important to remember that there are no absolutes in the determination of terms used as a substitute for "means" that serve as generic placeholders. The examiner must carefully consider the term in light of the specification and the commonly accepted meaning in the technological art. Every application will turn on its own facts.
Based on the ordinary skill in the art and description of functions of these components in the Specification, they refer to a camera, touchscreen and processors or a combination of processor and memory or to a combination of software and hardware.
PLEASE NOTE: This is NOT a rejection. Please don’t address it as a rejection. If the Applicant does not agree with the INTERPRETATION, he may argue or amend to replace the terms interpreted under 112(f) with structural terms such as “camera,” “touchscreen,” and “processor” as appropriately supported by the Specification. In the alternative, he may let the interpretation stand if the intent was to include a means plus function limitation in the Claim.
The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
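As a rough illustration only, the three prongs above can be read as a conjunction of three yes/no determinations.  The sketch below is a simplification under stated assumptions: the class and field names are hypothetical, the rebuttable presumptions tied to the word “means” are not modeled, and each determination in practice turns on the facts of the application.  Treating “microphone” as a structural term is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class Limitation:
    """Hypothetical record of the three determinations for one claim limitation."""
    text: str
    uses_means_or_placeholder: bool        # prong (A)
    modified_by_functional_language: bool  # prong (B)
    recites_sufficient_structure: bool     # prong (C) is met when this is False


def interpreted_under_112f(lim: Limitation) -> bool:
    """Return True only when all three prongs are met."""
    return (
        lim.uses_means_or_placeholder
        and lim.modified_by_functional_language
        and not lim.recites_sufficient_structure
    )


if __name__ == "__main__":
    input_component = Limitation(
        text="an input component configured to sense at least one non-audible interaction",
        uses_means_or_placeholder=True,
        modified_by_functional_language=True,
        recites_sufficient_structure=False,
    )
    microphone = Limitation(
        text="a microphone configured to sense sound",
        uses_means_or_placeholder=False,  # assumed structural term, for illustration
        modified_by_functional_language=True,
        recites_sufficient_structure=True,
    )
    for lim in (input_component, microphone):
        print(lim.text, "->", interpreted_under_112f(lim))
```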
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1-2, 8-9, 11-12, and 18-19 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Scott (U.S. 2017/0289766).
Regarding Claim 1, Scott teaches:
1. A voice-interaction device comprising: [Scott, Figure 1, “client device 102.”  Figure 9, “computing device 902.”  Title:  “Digital Assistant Experience based on Presence Detection.”  The context of the use: including the location of the user and other users present in the vicinity is considered in the type of output and:  “[0004] Techniques for digital assistant experience based on presence sensing are described herein. In implementations, a system is able to detect user presence and distance from a reference point, and tailor a digital assistant experience based on distance. The distance, for example, represents a distance from a client device that outputs various elements of a digital assistant experience, such as visual and audio elements. Various other contextual factors may additionally or alternatively be considered in adapting a digital assistant experience.”]
a plurality of input and output components configured to facilitate interaction between the voice-interaction device and a target user, [Scott, Figure 1, “client device 102” includes “sensors 132” and “displays 118, 122.”  Figure 9, “computing device 902” includes “sensors 132” and “input/output interfaces 908.”  “[0122] Input/output interface(s) 908 are representative of functionality to allow a user to enter commands and information to computing device 902, and also allow information to be presented to the user and/or other components or devices using various input/output devices. Examples of input devices include a keyboard, a cursor control device (e.g., a mouse), a microphone (e.g., for voice recognition and/or spoken input), a scanner, touch functionality (e.g., capacitive or other sensors that are configured to detect physical touch), a camera (e.g., which may employ visible or non-visible wavelengths such as infrared frequencies to detect movement that does not involve touch as gestures), and so forth. Examples of output devices include a display device (e.g., a monitor or projector), speakers, a printer, a network card, tactile-response device, and so forth. Thus, the computing device 902 may be configured in a variety of ways as further described below to support user interaction.”]
the plurality of input and output components comprising:
a microphone configured to sense sound and generate an audio input signal; [Scott, Figure 1, “audio sensors 132b” include microphones.  “[0017] According to one or more implementations, techniques described herein are able to receive voice commands and react upon presence, identity and context of one or more people. By way of example, the described techniques can be implemented via a computing device equipped with one or multiple microphones, a screen, and sensors to sense the context of a user. Various sensors are contemplated including for example a camera, a depth sensor, a presence sensor, biometric monitoring devices, and so forth.”  See also [0122].]
a speaker configured to output an audio signal to the target user; and [Scott, Figure 9, “input/output interfaces 908” include “speakers”: “[0122] Input/output interface(s) 908 are representative of functionality to allow a user to enter commands and information to computing device 902, and also allow information to be presented to the user and/or other components or devices using various input/output devices. Examples of input devices include a keyboard, a cursor control device (e.g., a mouse), a microphone (e.g., for voice recognition and/or spoken input), a scanner, touch functionality (e.g., capacitive or other sensors that are configured to detect physical touch), a camera (e.g., which may employ visible or non-visible wavelengths such as infrared frequencies to detect movement that does not involve touch as gestures), and so forth. Examples of output devices include a display device (e.g., a monitor or projector), speakers, a printer, a network card, tactile-response device, and so forth. Thus, the computing device 902 may be configured in a variety of ways as further described below to support user interaction.”] 
an input component configured to sense at least one non-audible interaction from the target user; [Scott, Figure 1, “sensors 132” including “light sensors 132a,” “touch sensors 132c,” and “presence sensors 132d” are directed to input of other than voice/sound/audible interaction.  “[0122] … Examples of input devices include a keyboard, a cursor control device (e.g., a mouse), a microphone (e.g., for voice recognition and/or spoken input), a scanner, touch functionality (e.g., capacitive or other sensors that are configured to detect physical touch), a camera (e.g., which may employ visible or non-visible wavelengths such as infrared frequencies to detect movement that does not involve touch as gestures), and so forth. …”]
a context controller configured to monitor the plurality of input and output components and determine a current use context; and [Scott, Figures 1, 2 and 9, the “digital assistant 126” of Figures 1-2, or the “processing system 904” of Figure 9, performs the functions of the “context controller” of the Claim by receiving “sensor data 202” and generating “user experience 206” which is a method and modality of “output” based on the context provided by “sensor data 202.”  “[0020] Various types of adaptations scenarios are contemplated. For instance, sensors may be used to obtain data for context sensing beyond a simple presence sensor, such as estimating the number of people present, recognizing the identities of people present, detection of distance/proximity to the people, and/or sensing when people approach or walk away from the device and/or other contextual sensors. For instance, different contextual factors can be sensed and/or inferred, such as age and/or gender based on visual information, a state a person is in (e.g., the user is able to see, talk, and so forth). Such contextual factors may be detected in various ways, such as via analysis of user motion, user viewing angle, eye tracking, and so on.”  To the degree that “output components” provide an output to the environment of the “client device 102,” their outputs become part of the context that is monitored by the sensors 132 of device 102.]
a virtual assistant module configured to facilitate voice communications between the voice-interaction device and the target user and configure one or more of the input and output components in response to the current use context. [Scott, Figure 1, “Digital Assistant 126.”  Figure 6 shows that the input components provide more than just voice data and the output components modify the output User Interface to generate a user experience that is custom made for the particular user and his context.  For example, a GUI is used for output under some circumstances/context and an audible response is used under different circumstances/context:  “[0022] Context sensors as noted above may also enable adaptations to the operation of a voice UI, such as responding differently based on whether multiple people are present or a single person, and responding differently based on proximity to a person. For example, when distance from a reference point to the person is relatively small, a graphical UI is considered appropriate and is therefore presented on a display screen. However, when the person is positioned such that the display screen may not be visible and/or the person is not looking at the display screen, the graphical UI may not be helpful in which case the system may utilize audible alerts, voice interaction, and audio responses.”]

Regarding Claim 2, Scott teaches:
2. The voice-interaction device of claim 1, further comprising: 
audio input circuitry configured to receive the audio input signal and generate an enhanced target signal including audio generated by the target user; and [Scott teaches using multiple microphones and conducting beamforming to “enhance” the input signal:  “[0048] Sound: In order to enable interaction with the computer using a speech-based interface, one or multiple microphones representing instances of the sensors 132 can be employed. Using multiple microphones enables the use of sophisticated beamforming techniques to raise the quality of speech recognition and thus the overall interaction experience. Further, when motion information (e.g., angle of arrival information) is available (e.g., from radar information), a beamforming estimate can be used to enhance speech recognition, such as before any speech input is detected.”]
a voice processor configured to detect a voice command in the enhanced target signal; [Scott, Figure 9, “processing system 904.”  The device is a PDA which receives commands primarily through speech interfaces and speech recognition:  “[0074] … Consequently, when the system has identified Bob it may cause the digital assistant 126 to use speech interfaces along with visual information or switch entirely to speech interfaces.”  “[0086] The scenario 500 further depicts an active conversation interface 512, which may be output during an ongoing conversation between a user and the digital assistant. Here, the system provides indications and feedback with respect to the conversation, such as by displaying recognized speech 514, providing suggestions, and/or indicating available voice command options….”  See [0017], [0019] and [0086] for express teaching of “voice commands.”]
wherein the virtual assistant module is configured to execute the detected voice command in accordance with the current use context. [Scott, the general, well-known purpose of a virtual/digital/personal assistant is to execute the command that is input by the user.  Scott uses the context of user, as obtained by the various sensors, to both determine the intent of the input command and the output mode:  “[0031] For example, requests may include spoken or written (e.g., typed text) data that is interpreted through natural language processing capabilities of the digital assistant 126. The digital assistant 126 may interpret various input and contextual clues to infer the user's intent, translate the inferred intent into actionable tasks and parameters, and then execute operations and deploy device services 128 to perform the tasks….”]

Regarding Claim 8, Scott teaches (Note that Scott does not teach the whisper mode which is the intention of this Claim.  But this Claim is broadly stated and Scott teaches the broad language of the Claim.  For more specific references see rejection of Claim 10 which is stated with the appropriate level of particularity.):
8. The voice-interaction device of claim 1, wherein the virtual assistant module [This Claim is mapped to Scott, Figure 6, “Detect User Identity 604.”]
detects a voice interaction from the target user, [Scott, the “client device 102” receives its commands generally by voice:  “[0017] According to one or more implementations, techniques described herein are able to receive voice commands and react upon presence, identity and context of one or more people. By way of example, the described techniques can be implemented via a computing device equipped with one or multiple microphones, a screen, and sensors to sense the context of a user. Various sensors are contemplated including for example a camera, a depth sensor, a presence sensor, biometric monitoring devices, and so forth.”]
determines at least one voice characteristic, and [Scott, user identity is one of the parameters detected by the system so that the “experience” i.e. input and output modalities can be adapted for the particular user as shown in Figure 7.  The user identity is detected from various indicators including his phone device or his biometrics which include his particular voice characteristics.  “[0097] Identity of a user is detected (block 604). The user, for example, moves to a proximity to the client device 102 where a user-specific attribute is detected by one or more of the sensors 132, and used to identify and/or authenticate the user. Different ways of detecting user identity are discussed above, and include various biometric features and techniques.”  “[0068] … Examples of such biometric techniques include facial recognition, voice recognition, gait recognition, and so forth. …” ] 
modulates an output volume according to the determined voice characteristic. [Scott, if the user is deaf the volume of the output is adjusted accordingly.  The output volume is modulated according to the identity of the user; user identity is in turn determined from his voice characteristics.  Thus the output volume is modulated according to the voice characteristics:  “[0023] Context sensors and techniques discussed herein may also be employed to improve accessibility scenarios. For example, the system may detect or be aware that a particular person is partially deaf. In this case, volume level may be adapted when that particular user is present….”]

Regarding Claim 9, Scott teaches: 
9. The voice-interaction device of claim 8, wherein the context controller is configured to detect [Note the Objection above and note that “and/or” is interpreted as “or” or as “at least one” such that a reference that teaches one of the limitations teaches the Claim.  Refer to the Conclusion for a mapping of all of the limitations of Claim 9.]
the target user's speech volume based at least in part on an amplitude of the input audio signal, [Scott does not teach sensing the volume/loudness of the speech of the user as a factor.  “Amplitude of the input signal” is the definition of volume/loudness. Scott modulates the volume of the output speech to suit the particular user and his context.  See rejection of Claim 8.  Scott does not teach that the input volume is sensed or determined as part of the context.]
the at least one voice characteristic, [Scott identifies the users/speakers based on biometrics which include “voice recognition.”  Biometric voice recognition identifies characteristics of the input voice:  “[0068] … Alice is identified and authenticated as being associated with a particular user profile. Examples of such biometric techniques include facial recognition, voice recognition, gait recognition, and so forth….”  Scott also teaches that for better speech recognition the system uses speaker’s identity to take into consideration the characteristics of his voice such as accent:  “[0049] …When the identity of a user is known (such as discussed below), it is possible to apply a different speech recognition model that actually fits the user's accent, language, acoustic speech frequencies, and demographic.”]
a distance between the target user and the voice-interaction device, and/or [Scott teaches that the distance between the user and device is determined in order to determine the method of output as display on a GUI or by voice.  See Figure 3.  “[0066] In at least some implementations, a volume of the audio prompt is adjusted based on Alice's distance from the client device 102. For instance, when Alice first enters the proximity zone 306 and is initially detected, an audio prompt may be relatively loud. However, as Alice continues toward the client device 102 an approaches the proximity zone 304, volume of an audio prompt may be reduced.”  “… In implementations, a system is able to detect user presence and distance from a reference point, and tailor a digital assistant experience based on distance. The distance, for example, represents a distance from a client device that outputs various elements of a digital assistant experience, such as visual and audio elements. ….”  Abstract.  “[0058] … , additional sensors are invoked to detect position, distance, identity, and other characteristics that enable further context-based adaptations of the system behavior….”]
environmental noise. [Scott considers the background noise as a factor in recognizing the input speech of the user:  “[0049] Also, the system (e.g., the client device 102) can disambiguate between multiple sound sources, such as by filtering out the position of a known noise-producing device (e.g., a television) or background noise. ….”  Scott adjusts the volume of output based on noise in the environment:  “[0089] … Volume adjustments may also be made based on proximity and/or ambient noise levels….”]
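
The mapping above distinguishes the input speech volume, which the Claim ties to the amplitude of the input audio signal, from the output volume that Scott adjusts.  As a minimal sketch of one conventional way to estimate a level from amplitude, the root-mean-square of the samples can be expressed in decibels relative to full scale.  The function name and the assumption that samples are normalized to the range [-1.0, 1.0] are illustrative; neither the Application nor Scott is stated to use this particular computation.

```python
import math


def rms_level_dbfs(samples):
    """Estimate a signal level in dBFS from normalized amplitude samples.

    Assumes samples lie in [-1.0, 1.0]; returns -inf for silence or no samples.
    """
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    # 20*log10(rms) is written as 10*log10(mean_square) to avoid a square root.
    return 10.0 * math.log10(mean_square)


if __name__ == "__main__":
    quiet = [0.01, -0.01, 0.01, -0.01]
    loud = [0.5, -0.5, 0.5, -0.5]
    print("quiet input level (dBFS):", rms_level_dbfs(quiet))
    print("loud input level (dBFS):", rms_level_dbfs(loud))
```

A level computed this way depends only on the amplitude of the captured signal, which is consistent with the point made in the Objection that the input speech volume is a separate quantity from distance or environmental noise.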

Claim 11 is an independent method claim with limitations similar to the limitations of Claim 1 and is rejected under a similar rationale.
Regarding Claim 11, Scott teaches:
11. A method comprising: 
facilitating communications between a voice-interaction device and a target user using a plurality of input and output components, including sensing sound to generate an audio input signal, outputting an audio signal to the target user, and sensing at least one non-audible interaction from the target user; [Scott, Figure 1, “client device 102” / “voice interaction device” of the Claim includes “sensors 132,” including “audio sensors 132b” and “presence sensors 132d” and Figure 9, “sensors 132” and “Input/Output Interfaces 908.”  Figures 3-4 and 7 show that the distance of the user from the “client device 102” is taken into consideration.  See 702 and 704.  “[0028] The client device 102 can be embodied as any suitable computing system and/or device such as, by way of example and not limitation, a gaming system, a desktop computer, a portable computer, a tablet or slate computer, a handheld computer such as a personal digital assistant (PDA), a cell phone, a set-top box, a wearable device (e.g., watch, band, glasses, etc.), a large-scale interactivity system, and so forth….”]
monitoring, by a context controller, a current use context of the plurality of input and output components; and [Scott, Figure 1, “sensors 132” and Figure 2, “sensor data” teach the “context controller” of the Claim that is providing the “current use context” of the device’s input/output interface components.  Figure 9, “input/output interfaces 908.”]
adapting one or more of the input and output components to the current use context. [Scott, Figure 2, “user experience 206” is determined by the “digital assistant 126” including the “system behavior manager 130” with the use of context from the input “sensor data 202.”  “[0044] … In operation, the system behavior manager 130 obtains sensor data 202 that may be collected via various sensors 132. The sensor data 202 is analyzed and interpreted by the system behavior manager 130 to determine contextual factors such as user presence, identity, proximity, emotional state, and other factors noted above and below….  System behavior adaptations 204 that correspond to the current context are identified and applied to adapt the user experience 206 accordingly. Generally, the user experience 206 includes different attributes of a digital assistant experience such as audible experience, visual experience, touch-based experience, and combinations thereof. Various types of adaptations of user experience are contemplated, details of which are described above and below.”  See flowcharts of Figures 6 and 7.]

Claim 12 is a method claim with limitations similar to the limitations of Claim 2.
Claim 18 is a method claim with limitations similar to the limitations of Claim 8.
Claim 19 is a method claim with limitations similar to the limitations of Claim 9.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 3, 6, 13, and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Scott.
The only feature not expressly taught by Scott is “digital images,” in Claims 3 and 13, which is suggested by the teaching of “digital assistant.”  Also, the significance of the image being a digital image, if any, is not set forth in the Claims.  Hence the single-reference 103 rejection.  Claims 6 and 16 are taught expressly in their entirety but depend from Claims 3 and 13, respectively.
Regarding Claim 3, Scott teaches and suggests:
3. The voice-interaction device of claim 1, 
wherein the input component comprises an image sensor configured to capture digital images of a field of view; and [Scott, Figure 1 teaches “sensors 132” which include a “camera”:  “[0017] … Various sensors are contemplated including for example a camera, a depth sensor, a presence sensor, biometric monitoring devices, and so forth.”  “[0047] Presence sensing: The physical presence of people (i.e. people nearby the system) may be detected using sensors 132 like pyro-electric infrared sensors, passive infrared (PIR) sensors, microwave radar, microphones or cameras, and using techniques such as Doppler radar, radar using time-of-flight sensing, angle-of-arrival sensing inferred from one or more of Doppler radar or time-of-flight sensing, and so forth.”  A camera inherently has a field of view.  The “digital” nature of the “digital images” is suggested by the fact that the system implements a digital assistant module.  See [0125] to [0127] for hardware implementation.]
wherein the context controller is further configured to analyze the digital images to detect and/or track a position of the target user in relation to the voice-interaction device and determine a current use context based at least in part on the position of the target user. [Scott, the “presence sensors 132d” of Figure 1 include cameras which detect and track the position of various users in the environment as they enter and exit a particular region in order to provide context for the command interpretation task.  See Figures 3-4 also:  “[0050] Position: As noted above, radar or camera-based sensors 132 may provide a position for one or multiple users….”  “[0105] In at least some implementations, a reference point (e.g., the display 308) can be occluded at different distances, such as depending on an angle of approach of a user relative to the reference point. In such as case, a particular sensor (e.g., a camera) can resolve this occlusion, even when another sensor (e.g., radar) may not be able to resolve the occlusion.”  “[0020] … For instance, sensors may be used to obtain data for context sensing beyond a simple presence sensor, such as estimating the number of people present, recognizing the identities of people present, detection of distance/proximity to the people, and/or sensing when people approach or walk away from the device and/or other contextual sensors….” ]

Regarding Claim 6, Scott teaches:
6. The voice-interaction device of claim 3, wherein the context controller is configured to: 
estimate a distance between the target user and the voice-interaction device based at least in part on the relative position of the target user in relation to the voice-interaction device; and [Scott, “… In implementations, a system is able to detect user presence and distance from a reference point, and tailor a digital assistant experience based on distance. The distance, for example, represents a distance from a client device that outputs various elements of a digital assistant experience, such as visual and audio elements….”  Abstract.  See claims 1, 3-6 and 11 which are all related to “distance.”  “[0050] … Distance and/or proximity can also be detected using ultrasonic detection, time-of-flight, radar, and/or other techniques.”]
provide attention-aware output rendering in which output fidelity is determined based at least in part on the target user's distance from the voice-interaction device to facilitate interactions from various distances.  [Scott, “… In implementations, a system is able to detect user presence and distance from a reference point, and tailor a digital assistant experience based on distance. The distance, for example, represents a distance from a client device that outputs various elements of a digital assistant experience, such as visual and audio elements….”  Abstract.  “[0022] … For example, when distance from a reference point to the person is relatively small, a graphical UI is considered appropriate and is therefore presented on a display screen….”  “[0057] …Depending on the distance, the system may operate to switch to sound, use sound and visual UIs, and/or adapt visual UI for distance by changing font size, graphics, colors, level of detail, contrasts and other aspects used for visualization….”]

Claim 13 is a method claim with limitations similar to the limitations of Claim 3.
Claim 16 is a method claim with limitations similar to the limitations of Claim 6.

Claims 4-5, 7, 14-15, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Scott in view of Singh (U.S. 2018/0032300).
Regarding Claim 4, Scott teaches and suggests:
4. The voice-interaction device of claim 3, further comprising 
a display configured to present visual display elements to the target user; [Scott, Figure 1, “display device 118” and “integrated display 122.”  See [0028].  Figure 3, “display 308.”  Figure 9, “Input/output interfaces 908.”  “[0122] … Examples of output devices include a display device (e.g., a monitor or projector), speakers, a printer, a network card, tactile-response device, and so forth….”]
wherein context controller is further configured to analyze the digital images to determine a gaze direction of the target user; and [Scott teaches that the contextual features that it detects include “eye tracking” which teaches “gaze direction” of the Claim.  “[0020] … For instance, different contextual factors can be sensed and/or inferred, such as age and/or gender based on visual information, a state a person is in (e.g., the user is able to see, talk, and so forth). Such contextual factors may be detected in various ways, such as via analysis of user motion, user viewing angle, eye tracking, and so on.”]
wherein the virtual assistant module is further configured to 
direct interactions to the target user through the visual display elements on the display in response to the gaze direction being directed toward the display, and [Scott, Figures 6 and 7 provide flowcharts of “presenting a digital assistant experience at the client device” and adapting this presentation/output according to context data.  This includes deciding whether to use the display or audio modality for output.  “[0055] The output, or more general actuation, is mainly based on two interaction modalities: sound and display. However, these two modalities are tightly interwoven with each other based on the contextual data retrieved by sensors 132 and prior knowledge about the user's habits and situation.”  “[0056] Switching system behavior between multiple output modalities can be illustrated by the following example situations. Adaptations of the system behavior are designed to making information easily accessible in various interaction scenarios and contexts.”  The former teachings and [0057] together suggest that when convenient, the output is presented on a display and one of the contextual factors considered in determining this convenience is whether the user can see the screen which may be determined by eye tracking.]
 direct interactions through the audio output signal and the speaker in response to the gaze direction being directed away from the display. [Scott switches to sound output if context indicates that the person cannot see the display.  “[0057] Adapting to the position: When the contextual information indicates that a person who would like to interact with the system is not able to see the screen, the system behavior manager 130 may be configured to switch to sound output in preference over displaying data visually….”  Context according to [0020] above includes “eye tracking.”  The two teachings together suggest that based on eye tracking the output mode may be changed to audio. ]

Scott includes various teachings that together very strongly suggest that in response to eye tracking the output is negotiated between display and audio.  This is not express in Scott, however.
Singh teaches:
…
a display configured to present visual display elements to the target user; [Singh, Figure 1 showing a number of displays.  “The invention relates to a display control apparatus (1) for dynamically controlling the display of information in a vehicle (3). …. In dependence on a determined gaze direction of the vehicle driver, the at least one electronic processor (13) controls a switching module (16) to cause a first information data set (INF1) displayed on said first display (4) to be displayed on said second display (5). The present invention also relates to a method of controlling the display of information in a vehicle (3). One aim is to shift the driver's view back to the road.”  Abstract.]
wherein context controller is further configured to analyze the digital images to determine a gaze direction of the target user; and [Singh, “[0013] The at least one electronic processor can be configured to receive sensor data from at least one eye tracking sensor; and to determine the gaze direction of vehicle driver in dependence on the received sensor data. Alternatively, the at least one electronic processor can be configured to receive the gaze direction information from an eye tracking apparatus.”  “[0034] The eye tracking apparatus 11 comprises first and second image sensors 17-1, 17-2 each comprising a driver-facing camera. At least one of said first and second image sensors 17-1, 17-2 can comprise an infra-red (or near infra-red) capability for eye-tracking purposes. In a variant, the first and second image sensors 17-1, 17-2 could detect light at a visible wavelength to determine head position and/or eye gaze. The first and second image sensors 17-1, 17-2 are connected to an image processing unit 19 configured to process the image data to generate tracking data DAT1.”]
wherein the virtual assistant module is further configured to direct interactions to the target user through the visual display elements on the display in response to the gaze direction being directed toward the display, and direct interactions through the audio output signal and the speaker in response to the gaze direction being directed away from the display. [Singh, “[0014] The display control apparatus can be configured to record historical data to determine driver behaviour and/or preferences. Based on historical driver behaviour, the display control apparatus can determine if the driver prefers visual or audio output of information. If the driver prefers audio behaviour and the determined gaze direction is not coincident with the first display, the display control apparatus can be configured to output the first information data set in an audio form rather than audio-visual or visual information.”  “22. The display control apparatus as claimed in claim 4, wherein the at least one electronic processor is configured to control an audio device and/or a haptic device to selectively output in an audio form and/or in a haptic form in dependence on the determined gaze direction and optionally on one or more preferences of the vehicle driver.”]
Scott and Singh pertain to voice commands and output of the result of the command or query by display or audio and it would have been obvious to modify the system of Scott which provides for a combination of visual and audio user interfaces that are switched according to context, teaches that the output modality is modified from display to audio when the user cannot see the display, and also provides for eye/gaze tracking, as one of the input parameters to its context detector, with the system of Singh which expressly teaches providing the output to a display or audio output/speaker depending on the location of attention of a user (driver) which is determined from an eye/gaze tracker in order to further accommodate a user who is not looking at the display or cannot look at the display.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
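
For illustration only, the gaze-dependent output selection recited in Claim 4 may be sketched as follows. This is a minimal sketch and is not taken from Scott, Singh, or the instant Specification; the function name select_output_modality and the treatment of an undetermined gaze as "speaker" are assumptions made for the example.

```python
from typing import Optional


def select_output_modality(gaze_on_display: Optional[bool]) -> str:
    """Pick an output channel from a gaze determination.

    gaze_on_display is True when the tracked gaze is directed toward the
    display, False when it is directed away, and None when no gaze could
    be determined (hypothetical convention for this sketch).
    """
    if gaze_on_display is True:
        # Gaze directed toward the display: present visual display elements.
        return "display"
    # Gaze directed away, or unknown: fall back to audio output.
    # Treating "unknown" like "away" is an assumption of this sketch,
    # loosely following Scott [0057] (switch to sound when the user
    # is not able to see the screen).
    return "speaker"
```

Under these assumptions, select_output_modality(True) selects the display, while both select_output_modality(False) and select_output_modality(None) select the speaker.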

Regarding Claim 5, Scott teaches:
5. The voice-interaction device of claim 4, 
wherein the virtual assistant module adjusts a size of the visual display elements as the target user moves relative to the voice-interaction device, [Scott expressly teaches this limitation in the “adapting to the position” category and as shown in Figures 3-7.  “[0057] Adapting to the position: When the contextual information indicates that a person who would like to interact with the system is not able to see the screen, the system behavior manager 130 may be configured to switch to sound output in preference over displaying data visually. The same is true for situations in which a person interacts from farther away. Depending on the distance, the system may operate to switch to sound, use sound and visual UIs, and/or adapt visual UI for distance by changing font size, graphics, colors, level of detail, contrasts and other aspects used for visualization. ...”  Figures 3-4 show keeping track of movements/position of a user:  “1. … and adapting an element of the first digital assistant experience to generate a second digital assistant experience at the client device that is based on a change in a contextual factor that results from the user moving from the first detected distance to the second detected distance from the reference point.”  “[0050] Position: As noted above, radar or camera-based sensors 132 may provide a position for one or multiple users. The position is then used to infer context, e.g. approaching the client device 102, moving away from the client device 102, presence in a different room than the client device 102, and so forth….”  “[0085] A user identified/detail interface 510 represents an expanded visualization that may be provide when the user moves closer to the client device 102 and/or is identified. For instance, the interface 510 can be presented when a user moves from the proximity zone 306 to the proximity zone 304 and is identified and/or authenticated as a particular user. The interface 510 may include various interaction options, customized elements, user-specific information, and so forth. Such details are appropriate when the system detects that the user is more engaged by moving closer, providing input, and so forth….”]
wherein the size of the visual display elements is adjusted to facilitate readability at a distance between the target user and the voice-interaction device. [Scott changes/adapts the “digital assistant experience” / output interface and modality in order to accommodate the user: position and movement of the user or his age and disabilities. “[0057] … Depending on the distance, the system may operate to switch to sound, use sound and visual UIs, and/or adapt visual UI for distance by changing font size, graphics, colors, level of detail, contrasts and other aspects used for visualization. ...”  “[0059] …System behavior may also be adapted by selecting a different output modality, such as to support people with limited eyesight, limited hearing, or use age appropriate user interfaces and vocabulary.”  “[0088] In general, the system is able to transition between different UIs and adapt the UIs dynamically during an ongoing interaction based on changing circumstances. For example, different UI and modalities in response to changes in user proximity, number of users present, user characteristics and ages, availability of secondary device/displays, lighting conditions, user activity, and so forth….”]
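
For illustration only, one way that display-element size could track user distance is sketched below. Scott [0057] states only that font size may be changed depending on distance; the linear scaling rule, the reference distance, the upper clamp, and the name scaled_font_size are all assumptions of this sketch and are not found in Scott or the instant Specification.

```python
def scaled_font_size(
    base_pt: float,
    distance_ft: float,
    reference_ft: float = 2.0,  # hypothetical distance at which base_pt is readable
    max_pt: float = 96.0,       # hypothetical upper bound for the display
) -> float:
    """Scale a font size linearly with viewing distance (assumed rule)."""
    if distance_ft < 0 or reference_ft <= 0:
        raise ValueError("distances must be non-negative and reference positive")
    # Never shrink below the base size when the user is closer than the reference.
    effective_ft = max(distance_ft, reference_ft)
    # Linear scaling keeps the apparent (angular) text size roughly constant.
    return min(base_pt * effective_ft / reference_ft, max_pt)
```

Under these assumptions, a 16-point element stays at 16 points within 2 feet, doubles to 32 points at 4 feet, and is capped at 96 points for very distant users.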

Regarding Claim 7, Scott teaches:
7. The voice-interaction device of claim 3, [Scott, Figure 3, “[0062] FIG. 3 depicts an example scenario 300 which represents different proximity based interaction modalities ….”]
wherein the input component comprises a touch control; [Scott, Figure 1, “touch sensors 132c.”]
wherein the context controller is configured to select a proxemic input modality based at least in part on an analysis of the digital images; and [Scott teaches that based on proximity of the user to the client device, the device will provide touch based interactions / “proxemic input modality” to the user.  See Figure 3, “[0063] At close proximity in the first proximity zone 302 (e.g., a within a 2 foot arc from the client device 102), touchable interactions are available since a user is close enough to touch a display 308 of the client device 102, use input devices of the client device 102, and so forth. Accordingly, the digital assistant 126 may make adaptations to a user interface displayed on the display 308 to support touch and other close proximity interactions….”  The determination of proximity and location is based on data collected from a number of sensors that include “image” creating sensors:  “[0081] … For instance, a motion sensor (e.g., an infrared sensor) can detect user motion and trigger a camera-based sensor to wake and capture image data, such as to identify a user. As a user moves between different proximity zones, for example, sensors may communicate with one another to wake and/or hibernate each other depending on user proximity and position. …”   See [0047] for use of camera and “[0050] Position: As noted above, radar or camera-based sensors 132 may provide a position for one or multiple users. …”  ]
wherein the virtual assistant module is configured to activate the input component if the target user is determined to be in reach of the input component and [Scott, Figure 3, teaches that when the user is close to the device, the input modality is adjusted to permit touch input/output. The communication modes go from Speech/Audio (zone 306, far from device) to Speech/Audio and Visual (zones 302 and 304 which are closer to the device) to Visual and Touch (zone 302 which is closest to the device and touching is possible).  “[0063] …. Accordingly, the digital assistant 126 may make adaptations to a user interface displayed on the display 308 to support touch and other close proximity interactions….” ]
activate voice communications when the target user is out of reach of the voice-interaction device. [Scott, Figure 3.  In this limitation “out of reach” is interpreted to mean “out of touching reach” of the user.  When user is in zones 304 and 306 which is farther from the display 308, “speech” is the mode of communication for both input of commands and output of results for the device.  “[0064] Farther away within the proximity zone 304 (e.g., between a 2 foot and a 3 foot arc from the client device 102), visual interactions are available since the digital assistant 126 determines that a user is likely close enough to be able to see the display 308 clearly. In this case, the digital assistant 126 may make adaptations to accommodate visual interactions and delivery of information visually. Speech may be used in this range also since the user is determined to be not close enough for touch. Still further away in the proximity zone 306 (e.g., between a 3 foot and a 10 foot arc from the client device 102), speech interactions are available since the digital assistant 126 determines that the user is determined to be too far from the display 308 for other modes like touch and visual interaction. In the proximity zone 306, for instance, the digital assistant 126 determines that a user is likely not be able to see the display clearly. Here, the digital assistant 126 may make adaptations to provide audio-based interactions and commands, and/or modify UI to accommodate the distance by using large elements, increasing text size, and reducing details so the information is easier to digest from a distance.”  As the user walks and gets closer or farther, the communication modes available to him/her are adjusted accordingly.  See [0070].]

“[0070] In an example scenario, Alice may ask about a scheduled soccer game while in the proximity zone 306 and receive a voice response because the digital assistant 126 knows Alice's proximity and determines voice is appropriate in the proximity zone 306. As she walks closer and enters the proximity zone 304, the digital assistant 126 recognizes the approach (e.g., change in proximity) and adapts the experience accordingly. For example, when Alice enters the proximity zone 304, the digital assistant 126 may automatically display a map of the soccer game location on the display 308 in response to detection of her approach via the system behavior manager 130.” See also the provision for “touchless gestures” as input:   “[0076] Alternatively or additionally, when Alice is in the proximity zones 304, 306, the digital assistant 126 can present touchless input elements on the display 308 that are capable of receiving user interaction from Alice via touches {sic touchless} gestures recognized by the light sensors 132b.”
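
For illustration only, the zone-to-modality mapping discussed above for Claim 7 may be sketched as follows. The 2, 3, and 10 foot boundaries are the example arcs quoted from Scott [0063]-[0064]; the assignment of modalities to each zone follows the reading given above in this action; the function name available_modalities and the empty result beyond 10 feet are assumptions of this sketch and are not stated in Scott.

```python
from typing import FrozenSet

# Example zone boundaries (feet) quoted from Scott [0063]-[0064].
TOUCH_LIMIT_FT = 2.0    # proximity zone 302
VISUAL_LIMIT_FT = 3.0   # proximity zone 304
SPEECH_LIMIT_FT = 10.0  # proximity zone 306


def available_modalities(distance_ft: float) -> FrozenSet[str]:
    """Map a user's distance from the device to interaction modalities."""
    if distance_ft < 0:
        raise ValueError("distance must be non-negative")
    if distance_ft <= TOUCH_LIMIT_FT:
        # Zone 302: user is within touching reach of the display.
        return frozenset({"touch", "visual"})
    if distance_ft <= VISUAL_LIMIT_FT:
        # Zone 304: too far for touch, close enough to see the display.
        return frozenset({"visual", "speech"})
    if distance_ft <= SPEECH_LIMIT_FT:
        # Zone 306: too far for touch or clear viewing; use speech.
        return frozenset({"speech"})
    # Beyond the quoted zones: no modality assumed (not addressed in the quotes).
    return frozenset()
```

In the Alice scenario of [0070], a distance of 5 feet would fall in zone 306 and yield speech only, and moving to 2.5 feet would enter zone 304 and add the visual modality.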

Claim 14 is a method claim with limitations similar to the limitations of Claim 4.
Claim 15 is a method claim with limitations similar to the limitations of Claim 5.
Claim 17 is a method claim with limitations similar to the limitations of Claim 7.

Claims 10 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Scott in view of Raitio (U.S. 2017/0358301).
Regarding Claim 10, Scott teaches:
10. The voice-interaction device of claim 9,  
wherein the context controller analyzes the characteristics of an input audio signal and modulates the output volume of the voice-interaction device to match a detected use context; [Scott teaches that the output volume depends on and is modified according to the identity of the user and other context such that if the user is deaf or old or far or in presence of other people or presence of noise the volume of the output is adjusted.  “[0023] Context sensors and techniques discussed herein may also be employed to improve accessibility scenarios. For example, the system may detect or be aware that a particular person is partially deaf. In this case, volume level may be adapted when that particular user is present….”  “[0057] … When the person is further away from the system, the system may also adjust the volume of sound output or the clarity of speech synthetization by increasing the overall pitch….”  See also [0057], [0066], [0071]-[0073], [0089], and [0107] for adjustment of output volume due to other contextual factors.]
wherein if the voice is determined to be whisper, the context controller may indicate a lower volume to respond with a corresponding voice output level; and 
wherein if the target user is located distance away, the context controller may indicate a volume adjustment to project the voice output to the target user at a corresponding volume level. [Scott adjusts the volume of the output according to distance of the user from the client device as one of the contextual factors.  See Figures 6-7 and “[0057] … When the person is further away from the system, the system may also adjust the volume of sound output or the clarity of speech synthetization by increasing the overall pitch….”]
Scott teaches that the output volume is modified according to the identity of the person receiving the output and his personal limitations, as well as other characteristics of the environment.  However, Scott does not teach that the device considers the volume of the input command (a whisper mode); that is, it does not teach that the device whispers back if the user whispers his command.

Raitio teaches:
wherein if the voice is determined to be whisper, the context controller may indicate a lower volume to respond with a corresponding voice output level; and [Raitio detects whether the input speech has been whispered and if so outputs the result with whispered synthesized speech.  Title: “DIGITAL ASSISTANT PROVIDING WHISPERED SPEECH.”  This feature is used, for example, if the user is in a library or private place and does not want to disturb others in the environment.  See [0003] for examples of use.  The output is in whisper.  “[0002] The present disclosure relates generally to a digital assistant and, more specifically, to a digital assistant that is capable of detecting a whispered speech input and providing a whispered speech response.”  One characteristic of whisper is lower volume:  “[0245] As described and shown in FIG. 8A, in response to receiving a speech input from user 830, whispered speech determination module 820 can determine whether the speech input includes a whispered speech input. In some examples, whispered speech determination module 820 can make such determination based on one or more spectrum characteristics, such as the amplitude, the energy, the volume, the slope, or a combination thereof….”  Therefore, an output whispered speech will have a lower volume:  “6. wherein the one or more first spectrum characteristics comprise at least one of: a first amplitude, wherein the first amplitude is less than a second amplitude below a threshold frequency, the second amplitude being associated with the non-whispered speech; a first energy, wherein the first energy is less than a second energy below the threshold frequency, the second energy being associated with the non-whispered speech; a first volume, wherein the first volume is less than a second volume by a threshold volume percentage, the second volume being associated with the non-whispered speech .…”]
Scott and Raitio pertain to voice-operated personal digital assistants, and it would have been obvious to combine the whisper feature of Raitio, which provides a whispered output when the input command/query is detected to be whispered, with the system of Scott, which modulates the volume of the output response according to the context of the user, including the user’s distance from the device and the presence of other people in the vicinity, in order to provide the system with one more item of context and create a system that considers the volume of the input voice as a clue to the intent of the user.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
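
For illustration only, the volume-based whisper test quoted from Raitio claim 6 ("a first volume ... less than a second volume by a threshold volume percentage") combined with a lowered response volume may be sketched as follows. This is a simplification: Raitio also relies on other spectrum characteristics that are not modeled here. The function names, the use of RMS amplitude as the volume measure, the 50% threshold, and the two output levels are assumptions of this sketch and are not taken from Raitio or Scott.

```python
import math
from typing import Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square amplitude of an input audio frame."""
    if not samples:
        raise ValueError("samples must be non-empty")
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_whisper(
    samples: Sequence[float],
    normal_rms: float,
    threshold_fraction: float = 0.5,  # hypothetical "threshold volume percentage"
) -> bool:
    """True when input volume is below normal speech by the threshold fraction."""
    return rms(samples) < normal_rms * (1.0 - threshold_fraction)


def response_volume(
    samples: Sequence[float],
    normal_rms: float,
    normal_volume: float = 0.6,   # hypothetical default output level (0..1)
    whisper_volume: float = 0.2,  # hypothetical lowered output level (0..1)
) -> float:
    """Respond quietly to a whispered input, otherwise at the normal level."""
    return whisper_volume if is_whisper(samples, normal_rms) else normal_volume
```

Under these assumptions, an input frame whose RMS amplitude is one tenth of the stored normal-speech level is classified as whispered and answered at the lowered level, while a frame at the normal-speech level is answered at the default level.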

Claim 20 is a method claim with limitations similar to the limitations of Claim 10.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Lee (U.S. 2020/0241824) teaches and suggests:
4. The voice-interaction device of claim 3, further comprising 
a display configured to present visual display elements to the target user; 
wherein context controller is further configured to analyze the digital images to determine a gaze direction of the target user; and [Lee, Figure 1, “interior camera 102.”  Figure 4, “interior camera 302” used for tracking “gaze position” in the “controller 306.”  “[0066] The interior camera 102 constantly monitors the eye movement of driver 110 to predict the driver's line of sight. …”  “[0081] …The controller 306 is further configured to rank content displayed on the p displays 308 according to an attention score determined by the interior camera 302 and head movement sensor 303 which are configured to measure gaze of driver 310 based on the driver's line of sight….”]
wherein the virtual assistant module is further configured to [Lee, “[0055] For example, the disclosed display system may be used in a virtual reality headset, gaming console, mobile device, multiple screen setup or home assistant, or may be used in theatre.”]
direct interactions to the target user through the visual display elements on the display in response to the gaze direction being directed toward the display, and [Lee, Figures 1, 2, or 4, output being provided to one of the displays shown in 108/208/308 according to the direction of gaze of the driver/user 310.  “[0083] Say driver 310 is driving to a destination entered into the navigation display and the modification in this case is to suspend the power to the non-active zone. Driver 310 glances at the navigation display to check the route taken, and after the controller 306 determines that the gaze of driver 310 is genuine, the controller 306 determines and processes the active and non-active zone signal, and transmits a telegram to the controller (not shown) of displays 308. When the telegram is received by the controller of displays 308, the controller of displays 308 identifies that the navigation display belongs to the active zone, and the head up display and displays 3 to p belong to the non-active zone. The controller of displays 308 then proceeds to suspend the power to the non-active zone, while maintaining the navigation display as a colour display with full brightness and contrast as well as maintaining movement of the navigation map as the vehicle advances to the destination. Thus, the navigation display is operated at an enhanced level as compared to the head up display and displays 3 to p.”]
direct interactions through the audio output signal and the speaker in response to the gaze direction being directed away from the display. [Lee.  This limitation is suggested.  Lee provides an audio output in response to eye and head direction as a command or increases the volume of the audio output in response to head and eye tracking.  There is also a teaching that when the driver is looking at the road, the instructions are being provided by audio.  However, again there is no express teaching that looking away leads to audio output.  Figure 4, output being provided to the “audio output” according to the gaze of the user 310.  The head movement and facial expressions of the Driver/User 310 operate ALSO as a command to turn on the audio output of the driving instructions:  “[0084] Driver 310 is now driving with his eyes on the road ahead and is following navigation instructions from the navigation display which are output via the audio output. Driver 310 is confused by an instruction and tilts his head to the side while looking again at the navigation display. The head tilt and confused facial expression are received or detected by interior camera 302 and head movement sensor 303 and processed by controller 306 as a facial gesture command to activate an increase in volume with a repeat of the audio navigation instruction. Thus, operation of the active navigation display comprises modifying the audio output of the navigation display by increasing the volume of the audio output and repeating the audio output. Further, operation of the audio output of the active zone is activated upon the facial gesture command from the driver.”  “[0029] The display(s) may be connected to an audio output, such as a speaker, to allow objects on the display to be accompanied by sound. For example, navigation instructions may be accompanied by audio instructions so that the driver can see as well as hear the turn-by-turn instructions. 
Where the one or more displays are connected to an audio output, and where an object displayed in the active zone is accompanied by audio, operation of the active zone may comprise at least one of: an increased volume of the audio output, or a repeat of the audio output. A combination may involve repeating the audio output by paraphrasing the instruction at an increased volume to advantageously achieve greater clarity of instruction. The line of sight data may determine such operation of the audio output. For example, upon hearing navigation instructions that the driver is unclear of, the driver may look at the navigation display which becomes the active zone and the audio output may be modified accordingly. Alternatively or additionally, operation of the audio output may be activated upon a command from the user.”  “[0033] In another example, a gesture of the driver demonstrating confusion, e.g. eye rolling, head tilting, and/or frowning, may be programmed as a command to repeat an audio turn-by-turn route instruction of a navigation display in the active zone. In this example, the interior camera may determine the driver's line of sight on the navigation display, resulting in diminishing the operation of all other displays which are in the non-active zone. Thereafter, the head movement sensor and/or the interior camera may detect the facial gesture from the driver and the controller may activate a repeat of the route instruction through the audio output.”]

Regarding Claim 9, Scott teaches: 
9. The voice-interaction device of claim 8, wherein the context controller is configured to detect
 the target user's speech volume based at least in part on an amplitude of the input audio signal, [Scott does not teach sensing the volume/loudness of the speech of the user as a factor.  “Amplitude of the input signal” is the definition of volume/loudness.]
the at least one voice characteristic, [Scott identifies the users/speakers based on biometrics which include “voice recognition.”  Biometric voice recognition identifies characteristics of the input voice:  “[0068] … Alice is identified and authenticated as being associated with a particular user profile. Examples of such biometric techniques include facial recognition, voice recognition, gait recognition, and so forth….”  Scott also teaches that for better speech recognition the system uses speaker’s identity to take into consideration the characteristics of his voice such as accent:  “[0049] …When the identity of a user is known (such as discussed below), it is possible to apply a different speech recognition model that actually fits the user's accent, language, acoustic speech frequencies, and demographic.”]
a distance between the target user and the voice-interaction device, [Scott teaches determining the distance between the user and the device in order to select the output modality, either display on a GUI or voice.  See Figure 3.  “[0066] In at least some implementations, a volume of the audio prompt is adjusted based on Alice's distance from the client device 102. For instance, when Alice first enters the proximity zone 306 and is initially detected, an audio prompt may be relatively loud. However, as Alice continues toward the client device 102 and approaches the proximity zone 304, volume of an audio prompt may be reduced.”  “… In implementations, a system is able to detect user presence and distance from a reference point, and tailor a digital assistant experience based on distance. The distance, for example, represents a distance from a client device that outputs various elements of a digital assistant experience, such as visual and audio elements. ….”  Abstract.  “[0058] … , additional sensors are invoked to detect position, distance, identity, and other characteristics that enable further context-based adaptations of the system behavior….”]
and/or environmental noise. [Scott treats background noise as a factor in recognizing the input speech of the user:  “[0049] Also, the system (e.g., the client device 102) can disambiguate between multiple sound sources, such as by filtering out the position of a known noise-producing device (e.g., a television) or background noise. ….”  Scott also adjusts the output volume based on noise in the environment:  “[0089] … Volume adjustments may also be made based on proximity and/or ambient noise levels….”]
Scott modulates the volume of the output speech to suit the particular user and his context.  See rejection of Claim 8.  Scott does not teach that the input volume is sensed or determined as part of context.
Raitio teaches:
9. The voice-interaction device of claim 8, wherein the context controller is configured to detect  [Raitio:  “DIGITAL ASSISTANT PROVIDING WHISPERED SPEECH.”  “[0002] The present disclosure relates generally to a digital assistant and, more specifically, to a digital assistant that is capable of detecting a whispered speech input and providing a whispered speech response.”]
the target user's speech volume based at least in part on an amplitude of the input audio signal,  [Raitio expressly teaches in Figure 8A, “whispered speech determination module 820” and claims 5-6 that a whispered input speech is determined if the volume of input signal is below a threshold volume.  It also expressly teaches that amplitude and volume of the input voice are detected in order to determine whisper.   “[0245] As described and shown in FIG. 8A, in response to receiving a speech input from user 830, whispered speech determination module 820 can determine whether the speech input includes a whispered speech input. In some examples, whispered speech determination module 820 can make such determination based on one or more spectrum characteristics, such as the amplitude, the energy, the volume, the slope, or a combination thereof. …” ]
the at least one voice characteristic, [See Raitio, [0245] to [0247], for the various types of voice characteristics that are obtained by the whisper detection of Raitio.]
…
Scott and Raitio pertain to voice-operated personal digital assistants, and it would have been obvious to combine the whisper feature of Raitio, which detects a whispered input based on the volume of the input voice, with the context factors of Scott, which decide the output volume based on considerations of privacy (who is within earshot of the device), as an added indicator that privacy and a lower output volume are desirable.  (See Raitio, [0003].)  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
Amplitude and energy (which depends on the amplitude) are directly related to volume.  However, Raitio does not set forth this relationship.  A third reference is cited for completeness.  Note that the 3rd reference could have been cited as support for a well-known point of physics.
Shurtz (U.S. 2007/0104337) teaches:
the target user's speech volume based at least in part on an amplitude of the input audio signal, [Shurtz teaches that volume is obtained based on amplitude of the audio signal:  “[0007] The foregoing and other features are accomplished, according the present invention, by providing apparatus monitoring the analog signals that are generated by an audio source. The volume of the audio from speakers will be determined by the amplitude of the analog signals which vary with the volume control at the receiver end. The amplitude of the audio analog signals and peak is detected then digitized. The analog signal volume amplitude determines the amplitude of the digitized signals thereby relating the digitized signal audio volume to the analog signal audio volume.”]
Scott, Raitio, and Shurtz pertain to voice and audio inputs, and it would have been obvious to combine the determination of volume from the amplitude of the audio signal, as taught by Shurtz, with the system of the combination as one method of obtaining volume.  (See Raitio, [0003].)  This combination falls under simple substitution of one known element for another to obtain predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
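For illustration only, the relationship relied upon above (speech volume derived from the amplitude of the input audio signal, and a whisper inferred when that volume falls below a threshold) may be sketched as follows.  This sketch is not taken from Scott, Raitio, or Shurtz.  It assumes samples normalized to the range [-1.0, 1.0], uses the common RMS and dB-relative-to-full-scale conventions, and every function name and the threshold value are hypothetical.

```python
import math
from typing import Sequence

# Hypothetical threshold chosen only for illustration; none of the cited
# references specifies a numeric value.
WHISPER_THRESHOLD_DBFS = -40.0


def rms_amplitude(samples: Sequence[float]) -> float:
    """Root-mean-square amplitude of samples normalized to [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def volume_dbfs(samples: Sequence[float]) -> float:
    """Level in dB relative to full scale; -inf for digital silence."""
    rms = rms_amplitude(samples)
    if rms == 0.0:
        return float("-inf")
    return 20.0 * math.log10(rms)


def is_whisper(samples: Sequence[float],
               threshold_dbfs: float = WHISPER_THRESHOLD_DBFS) -> bool:
    """True when the input is audible but quieter than the threshold.

    Pure silence is not treated as a whisper in this sketch.
    """
    level = volume_dbfs(samples)
    return math.isfinite(level) and level < threshold_dbfs
```

Under these assumptions, an input with an RMS amplitude of 0.5 measures about -6 dB and is not classified as a whisper, while an input with an RMS amplitude of 0.001 measures about -60 dB and is.  A practical detector would likely consider additional spectral characteristics of the kind listed in Raitio, [0245].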

Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARIBA SIRJANI whose telephone number is (571)270-1499. The examiner can normally be reached on 9 to 5, M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir can be reached on 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Fariba Sirjani/
Primary Examiner, Art Unit 2659

Claims:
1. A voice-interaction device comprising: a plurality of input and output components configured to facilitate interaction between the voice-interaction device and a target user, the plurality of input and output components comprising: 
a microphone configured to sense sound and generate an audio input signal; 
a speaker configured to output an audio signal to the target user; and 
an input component configured to sense at least one non-audible interaction from the target user; a context controller configured to monitor the plurality of input and output components and determine a current use context; and 
a virtual assistant module configured to facilitate voice communications between the voice-interaction device and the target user and configure one or more of the input and output components in response to the current use context.

2. The voice-interaction device of claim 1, further comprising: 
audio input circuitry configured to receive the audio input signal and generate an enhanced target signal including audio generated by the target user; and 
a voice processor configured to detect a voice command in the enhanced target signal; 
wherein the virtual assistant module is configured to execute the detected voice command in accordance with the current use context.

3. The voice-interaction device of claim 1, 
wherein the input component comprises an image sensor configured to capture digital images of a field of view; and 
wherein the context controller is further configured to analyze the digital images to detect and/or track a position of the target user in relation to the voice-interaction device and determine a current use context based at least in part on the position of the target user.

4. The voice-interaction device of claim 3, further comprising 
a display configured to present visual display elements to the target user; 
wherein context controller is further configured to analyze the digital images to determine a gaze direction of the target user; and 
wherein the virtual assistant module is further configured to direct interactions to the target user through the visual display elements on the display in response to the gaze direction being directed toward the display, and direct interactions through the audio output signal and the speaker in response to the gaze direction being directed away from the display.

5. The voice-interaction device of claim 4, 
wherein the virtual assistant module adjusts a size of the visual display elements as the target user moves relative to the voice-interaction device, 
wherein the size of the visual display elements is adjusted to facilitate readability at a distance between the target user and the voice-interaction device.

6. The voice-interaction device of claim 3, 
wherein the context controller is configured to: 
estimate a distance between the target user and the voice-interaction device based at least in part on the relative position of the target user in relation to the voice-interaction device; and 
provide attention-aware output rendering in which output fidelity is determined based at least in part on the target user's distance from the voice-interaction device to facilitate interactions from various distances.

7. The voice-interaction device of claim 3, 
wherein the input component comprises a touch control; wherein the context controller is configured to select a proxemic input modality based at least in part on an analysis of the digital images; and 
wherein the virtual assistant module is configured to activate the input component if the target user is determined to be in reach of the input component and activate voice communications when the target user is out of reach of the voice-interaction device.

8. The voice-interaction device of claim 1, wherein the virtual assistant module detects a voice interaction from the target user, determines at least one voice characteristic, and modulates an output volume according to the determined voice characteristic.

9. The voice-interaction device of claim 8, 
wherein the context controller is configured to detect the target user's speech volume based at least in part on an amplitude of the input audio signal, the at least one voice characteristic, a distance between the target user and the voice-interaction device, and/or environmental noise.

10. The voice-interaction device of claim 9, 
wherein the context controller analyzes the characteristics of an input audio signal and modulates the output volume of the voice-interaction device to match a detected use context; 
wherein if the voice is determined to be whisper, the context controller may indicate a lower volume to respond with a corresponding voice output level; and 
wherein if the target user is located distance away, the context controller may indicate a volume adjustment to project the voice output to the target user at a corresponding volume level.

11. A method comprising: 
facilitating communications between a voice-interaction device and a target user using a plurality of input and output components, including sensing sound to generate an audio input signal, outputting an audio signal to the target user, and sensing at least one non-audible interaction from the target user; 
monitoring, by a context controller, a current use context of the plurality of input and output components; and 
adapting one or more of the input and output components to the current use context.

Claim 12 is a method claim with limitations similar to the limitations of Claim 2.
12. The method of claim 11, further comprising: 
generating an enhanced target signal including audio generated by the target user from the audio input signal; 
detecting speech in the enhanced target signal; and 
extracting and executing a voice command from the speech in accordance with the current use context.

Claim 13 is a method claim with limitations similar to the limitations of Claim 3.
13. The method of claim 11, further comprising: acquiring digital images of a field of view; analyzing the acquired images to detect and/or track a relative position of the target user in relation to the voice-interaction device; and determine a current use context based at least in part on the relative position of the target user.

Claim 14 is a method claim with limitations similar to the limitations of Claim 4.
14. The method of claim 13, further comprising a display configured to output information to the target user; and wherein the method further comprises: analyzing a gaze direction of the target user; turning on the display and providing a visual output to the target user in response to the gaze direction being directed toward the display; and turning off the display and providing voice output to the target user through a speaker in response to the gaze direction being directed away from the display.

Claim 15 is a method claim with limitations similar to the limitations of Claim 6.
15. The method of claim 13, further comprising: estimating a distance between the target user and the voice-interaction device based at least in part on the relative position of the target user to the voice-interaction device; and providing attention-aware output rendering in which output fidelity is rendered based at least in part on the target user's distance from the voice-interaction device to facilitate readability from various distances.

Claim 16 is a method claim with limitations similar to the limitations of Claim 7.
16. The method of claim 15, further comprising selecting a proxemic input modality based on an analysis of the acquired images; and wherein the voice-interaction device activates the input component if the target user is determined to be in reach of the input component; and wherein the voice-interaction device utilizes voice input and output when the target user is out of reach of the voice-interaction device.

Claim 17 is a method claim with limitations similar to the limitations of Claims 5 and 7.
17. The method of claim 15, further comprising: adjusting a size of displayed elements as the target user moves relative to the voice-interaction device, wherein the size of the displayed elements is adjusted for readability at the distance between the target user and the voice-interaction device; and wherein the voice-interaction device activates a touch screen input when the target user is determined to be in arm's reach of a touch-enabled interface.

Claim 18 is a method claim with limitations similar to the limitations of Claim 8.
18. The method of claim 11, further comprising detecting a user voice interaction, determining at least one voice characteristic, and modulating an output volume according to the determined voice characteristic.

Claim 19 is a method claim with limitations similar to the limitations of Claim 9.
19. The method of claim 18, further comprising detecting the target user's speech volume based at least in part on an amplitude of the input audio signal, the at least one voice characteristic, a distance between the target user and the voice-interaction device, and/or environmental noise.

Claim 20 is a method claim with limitations similar to the limitations of Claim 10.
20. The method of claim 19, further comprising analyzing the characteristics of an input audio signal and modulating the output volume for a detected use context, including determining whether the voice is a whisper, the context controller may lower the volume to respond with a corresponding voice output level; and wherein if the target user is located distance away, the context controller may adjust the volume to project the voice output to the target user at a corresponding volume level.