Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claims 1-20 are pending. Claims 1 and 11 are independent.  All of the Claims have been amended.
This Application was published as U.S. 2019/0311718.
Apparent potential priority 5 April 2018.

Applicant’s amendments and arguments are considered but are either unpersuasive or moot in view of the new grounds of rejection that, if presented, were necessitated by the amendments to the Claims.
This action is Final.

The instant Application is directed to a virtual assistant method or device for receiving a voice command/query from a user and responding to the user.  The modality of the response, whether spoken or displayed, and other characteristics of the response take into account “a current use context” of the device.  The “current use context” is determined according to a variety of sensors, such as cameras and microphones, that track various parameters such as the location of the user, the direction of his gaze, his distance from the virtual assistant, and the volume of his voice as he utters the command/query.  The device then sets the volume of the output voice or adjusts the size of the output elements on an output display accordingly.
Scott (U.S. 2017/0289766), filed 13 December 2016, is the closest reference identified to date and is quite close to the claimed subject matter.
Applicant added “frame rate” as one of the aspects of the output that is changed according to context (distance of the user).  Frame rate appears once in the Specification.  Further, Scott teaches that the visual aspects of the displayed information (“[0057] … other aspects used for visualization ….”) are changed according to distance and “frame rate” is just one type of visual aspect.  Thus, a combination of Scott with any reference that teaches the “frame rate” to be an aspect of visual presentation teaches this amended language.
Response to Amendments and Arguments
Objections to Claims 9 and 19 are withdrawn in view of the amendments to these claims.
As for 112(f) interpretations, currently only the “virtual assistant module” of Claim 1 is interpreted under 35 U.S.C. 112(f) because it is defined in the Claim only according to its function without reciting any structure. Applicant’s arguments (Response 10) merely repeat the limitation and do not provide any evidence as to which “structure” is included in the “virtual assistant module.”  The “virtual assistant module” impacts or controls the operations of a number of hardware elements, such as the frame rate of the display.  This does not mean that the display is part of the “module.”  At any rate, the module is interpreted as a combination of processor and memory, which is probably what was intended.
As for 102, 103 rejections, Applicant relies on the added limitation of “a display configured to present visual content to the target user; and wherein the virtual assistant module is further configured to adjust a frame rate of video content presented on the display based on the current use context” to Claim 1.  See Response 10-11.
The phrase “frame rate” occurs once at:
[0023] The methods disclosed herein can also be applied to dynamic content such as video output. The device may track whether the user's gaze is directed towards the display. For example, one or more image sensors can be used to track the location of user, including distance from the voice-interaction device, and gaze of the user. The location and distance of the user can be determined through analysis of captured images received from a device camera (e.g., a 3D camera), and the user's gaze can be tracked, for example, by analyzing eye direction and movement. When the user is paying attention from afar, the frame rate of the video stream can be decreased, as well as the resolution of the video stream. When the user approaches the device, both rates can be incremented. This operation is beneficial for devices that operate on a battery, such as portable voice-interaction devices, phones and tablets.

This “frame rate” is not explained in the Specification any further than above.  Accordingly, it is interpreted as playback speed of the video.  For support in the art for such an interpretation, see, for example, Gilson US 20170054822:  “[0069] Block 806 depicts transmitting a request to receive frames that may correspond to points in the content that are temporally separated by an amount of time that may be based at least in part on the playback rate. For example, for normal speed playback, each frame might be temporally separated by approximately 33 milliseconds, assuming a 30 frames-per-second frame rate. At two times normal speed, the request to receive the frames might indicate that the frames should represent points in the content that are approximately 66 milliseconds apart. In some instances, the request may specify that related content, such as audio content associated with video content, is to be excluded. In other instances, the request may specify (or it could be implied) that related content should be included.”
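The arithmetic in the quoted Gilson passage can be illustrated with a brief sketch. The function name and parameters below are hypothetical and appear in neither Gilson nor the instant Specification; the sketch assumes only that the temporal separation between requested frames scales linearly with the playback rate.

```python
def frame_interval_ms(frames_per_second: float, playback_rate: float = 1.0) -> float:
    """Content-time separation, in milliseconds, between consecutive requested frames.

    Hypothetical helper for illustration only; it assumes the separation
    scales linearly with the playback rate.
    """
    if frames_per_second <= 0 or playback_rate <= 0:
        raise ValueError("frames_per_second and playback_rate must be positive")
    return 1000.0 / frames_per_second * playback_rate


normal_speed = frame_interval_ms(30)        # normal-speed playback at 30 fps
double_speed = frame_interval_ms(30, 2.0)   # two times normal speed
print(f"{normal_speed:.1f} ms, {double_speed:.1f} ms")
```

Under these assumptions the sketch yields roughly 33.3 ms and 66.7 ms, in line with the approximate 33 and 66 millisecond figures quoted from Gilson.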

A secondary reference, Kang, is added for teaching of this added feature.
Note that the primary reference Scott already teaches changing the attributes of the image that is being displayed to the user according to context.  “10. A system as described in claim 1, wherein the element of the first digital assistant experience comprises a visual user interface of the digital assistant, and said adapting comprises adapting an aspect of the visual user interface including one or more of changing a font size, a graphic, a color, or a contrast of the visual user interface in dependence upon the change in the contextual factor.”  “[0057] … Depending on the distance, the system may operate to switch to sound, use sound and visual UIs, and/or adapt visual UI for distance by changing font size, graphics, colors, level of detail, contrasts and other aspects used for visualization. …”  Scott teaches that the visual user interface may be showing a video to the user: “[0129] As further illustrated in FIG. 9, the example system 900 enables ubiquitous environments for a seamless user experience when running applications on a personal computer (PC), a television device, and/or a mobile device. Services and applications run substantially similar in all three environments for a common user experience when transitioning from one device to the next while utilizing an application, playing a video game, watching a video, and so on.”
Thus, Scott teaches that the display may be showing video to the user.
Scott teaches that aspects of the image being displayed, such as “font, graphic, color, level of detail, contrast,” are changed according to context.
What Scott lacks is changing the “frame rate” of the video according to context.  

For this aspect a secondary reference is added.

The other independent Claim 11 includes:  “adapting a frame rate of visual content being output through the display based on the current use context.”
Patentability of the other independent Claims is argued based on their similarity to Claim 1. Accordingly, the above provides a reply to those arguments as well.
Patentability of the dependent Claims is argued based on their dependence from their base independent Claims. Accordingly, the above provides a reply to those arguments as well.
35 U.S.C. 112(f) Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. 
Such claim limitation(s) is/are: “virtual assistant module” in Claim 1. This limitation is generic in the context of the art, does not refer to any specific structure, and only serves as a placeholder for the structure that performs the associated function(s) without providing any information about what that structure is. MPEP 2181 I A says:
For a term to be considered a substitute for "means," and lack sufficient structure for performing the function, it must serve as a generic placeholder and thus not limit the scope of the claim to any specific manner or structure for performing the claimed function. It is important to remember that there are no absolutes in the determination of terms used as a substitute for "means" that serve as generic placeholders. The examiner must carefully consider the term in light of the specification and the commonly accepted meaning in the technological art. Every application will turn on its own facts.
Based on the ordinary skill in the art and description of functions of these components in the Specification, they refer to a camera, touchscreen and processors or a combination of processor and memory or to a combination of software and hardware.
PLEASE NOTE: This is NOT a rejection. Please don’t address it as a rejection. If the Applicant does not agree with the INTERPRETATION, he may argue or amend to replace the terms interpreted under 112(f) with structural terms such as “camera,” “touchscreen,” and “processor” as appropriately supported by the Specification. In the alternative, he may let the interpretation stand if the intent was to include a means plus function limitation in the Claim.
The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.
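The three-prong test set out above can be summarized as a minimal sketch. The class, field names, and example values are hypothetical and are offered only as an illustration; each Boolean field stands for a conclusion reached after considering the Claim in light of the Specification, and the sketch is not a substitute for that analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClaimLimitation:
    """Hypothetical record of the conclusions reached for one claim limitation."""
    text: str
    uses_means_or_step: bool            # prong (A), first alternative
    uses_generic_placeholder: bool      # prong (A), second alternative (nonce term)
    has_functional_language: bool       # prong (B), e.g. "configured to"
    recites_sufficient_structure: bool  # negates prong (C) when True


def interpreted_under_112f(lim: ClaimLimitation) -> bool:
    """Return True only when all three prongs (A), (B), and (C) are met."""
    prong_a = lim.uses_means_or_step or lim.uses_generic_placeholder
    prong_b = lim.has_functional_language
    prong_c = not lim.recites_sufficient_structure
    return prong_a and prong_b and prong_c


# Example values are assumptions chosen to mirror the interpretation in this action.
module = ClaimLimitation("virtual assistant module", False, True, True, False)
display = ClaimLimitation("display", False, False, True, True)
print(interpreted_under_112f(module))   # True
print(interpreted_under_112f(display))  # False
```

Reciting sufficient structure in the Claim flips the last field and takes the limitation out of the 112(f) interpretation, which corresponds to option (1) offered to the Applicant above.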


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-3, 6-8, 11-13, and 15-18 are rejected under 35 U.S.C. 103 as being unpatentable over Scott (U.S. 2017/0289766) in view of Kang (U.S. 2018/0227607).
Regarding Claim 1, Scott teaches:
1. A voice-interaction device comprising: [Scott, Figure 1, “client device 102.”  Figure 9, “computing device 902.”  Title:  “Digital Assistant Experience based on Presence Detection.”  The context of use, including the location of the user and of other users present in the vicinity, is considered in determining the type of output:  “[0004] Techniques for digital assistant experience based on presence sensing are described herein. In implementations, a system is able to detect user presence and distance from a reference point, and tailor a digital assistant experience based on distance. The distance, for example, represents a distance from a client device that outputs various elements of a digital assistant experience, such as visual and audio elements. Various other contextual factors may additionally or alternatively be considered in adapting a digital assistant experience.”]
a plurality of input and output components configured to facilitate interaction between the voice-interaction device and a target user, [Scott, Figure 1, “client device 102” includes “sensors 132” and “displays 118, 122.”  Figure 9, “computing device 902” includes “sensors 132” and “input/output interfaces 908.”  “[0122] Input/output interface(s) 908 are representative of functionality to allow a user to enter commands and information to computing device 902, and also allow information to be presented to the user and/or other components or devices using various input/output devices. Examples of input devices include a keyboard, a cursor control device (e.g., a mouse), a microphone (e.g., for voice recognition and/or spoken input), a scanner, touch functionality (e.g., capacitive or other sensors that are configured to detect physical touch), a camera (e.g., which may employ visible or non-visible wavelengths such as infrared frequencies to detect movement that does not involve touch as gestures), and so forth. Examples of output devices include a display device (e.g., a monitor or projector), speakers, a printer, a network card, tactile-response device, and so forth. Thus, the computing device 902 may be configured in a variety of ways as further described below to support user interaction.”]
the plurality of input and output components comprising:
a microphone configured to sense sound and generate an audio input signal; [Scott, Figure 1, “audio sensors 132b” include microphones.  “[0017] According to one or more implementations, techniques described herein are able to receive voice commands and react upon presence, identity and context of one or more people. By way of example, the described techniques can be implemented via a computing device equipped with one or multiple microphones, a screen, and sensors to sense the context of a user. Various sensors are contemplated including for example a camera, a depth sensor, a presence sensor, biometric monitoring devices, and so forth.”  See also [0122].]
a speaker configured to output an audio signal to the target user; and [Scott, Figure 9, “input/output interfaces 908” include “speakers”: “[0122] Input/output interface(s) 908 are representative of functionality to allow a user to enter commands and information to computing device 902, and also allow information to be presented to the user and/or other components or devices using various input/output devices. Examples of input devices include a keyboard, a cursor control device (e.g., a mouse), a microphone (e.g., for voice recognition and/or spoken input), a scanner, touch functionality (e.g., capacitive or other sensors that are configured to detect physical touch), a camera (e.g., which may employ visible or non-visible wavelengths such as infrared frequencies to detect movement that does not involve touch as gestures), and so forth. Examples of output devices include a display device (e.g., a monitor or projector), speakers, a printer, a network card, tactile-response device, and so forth. Thus, the computing device 902 may be configured in a variety of ways as further described below to support user interaction.”] 
input component circuitry configured to sense at least one non-audible interaction from the target user; [Scott, Figure 1, “sensors 132” including “light sensors 132a,” “touch sensors 132c,” and “presence sensors 132d” are directed to input of other than voice/sound/audible interaction.  “[0122] … Examples of input devices include a keyboard, a cursor control device (e.g., a mouse), a microphone (e.g., for voice recognition and/or spoken input), a scanner, touch functionality (e.g., capacitive or other sensors that are configured to detect physical touch), a camera (e.g., which may employ visible or non-visible wavelengths such as infrared frequencies to detect movement that does not involve touch as gestures), and so forth. …”]
a context controller configured to monitor the plurality of input and output components and determine a current use context;  [Scott, Figures 1, 2 and 9, the “digital assistant 126” of Figures 1-2, or the “processing system 904” of Figure 9, performs the functions of the “context controller” of the Claim by receiving “sensor data 202” and generating “user experience 206” which is a method and modality of “output” based on the context provided by “sensor data 202.”  “[0020] Various types of adaptations scenarios are contemplated. For instance, sensors may be used to obtain data for context sensing beyond a simple presence sensor, such as estimating the number of people present, recognizing the identities of people present, detection of distance/proximity to the people, and/or sensing when people approach or walk away from the device and/or other contextual sensors. For instance, different contextual factors can be sensed and/or inferred, such as age and/or gender based on visual information, a state a person is in (e.g., the user is able to see, talk, and so forth). Such contextual factors may be detected in various ways, such as via analysis of user motion, user viewing angle, eye tracking, and so on.”  To the degree that “output components” provide an output to the environment of the “client device 102,” their outputs become part of the context that is monitored by the sensors 132 of device 102.]
a virtual assistant module configured to: 
facilitate voice communications between the voice-interaction device and the target user and configure one or more of the input and output components in response to the current use context; and [Scott, Figure 1, “Digital Assistant 126.”  Figure 6 shows that the input components provide more than just voice data and that the output components modify the output User Interface to generate a user experience that is custom made for the particular user and his context.  For example, a GUI is used for output under some circumstances/contexts and an audible response is used under different circumstances/contexts:  “[0022] Context sensors as noted above may also enable adaptations to the operation of a voice UI, such as responding differently based on whether multiple people are present or a single person, and responding differently based on proximity to a person. For example, when distance from a reference point to the person is relatively small, a graphical UI is considered appropriate and is therefore presented on a display screen. However, when the person is positioned such that the display screen may not be visible and/or the person is not looking at the display screen, the graphical UI may not be helpful in which case the system may utilize audible alerts, voice interaction, and audio responses.”]
	a display configured to present visual content to the target user; and [Scott’s devices include displays.  See Figures 1, 3, 5, and 9.  Figure 1, 118, 122, [0028].  Figure 3, 308, [0063].  Figure 1, “display device 118” and “integrated display 122.”  Figure 3, “display 308.”  Figure 9, “Input/output interfaces 908.”  “[0122] … Examples of output devices include a display device (e.g., a monitor or projector), speakers, a printer, a network card, tactile-response device, and so forth….”]
wherein the virtual assistant module is further configured to adjust a frame rate of video content presented on the display based on the current use context. [Scott teaches that the visualization is changed according to context which includes user: position [0057], presence [0058], identity [0059], emotional state [0060].  Aspects of the image that change are font size, graphics, colors, level of detail, contrast, icons, animations, and “other aspects for visualization” ([0057]).  However, “frame rate” or playback rate of video is not expressly included as an aspect of image that is modified in response to the change in context.]

	Scott does not teach that the playback rate or frame rate of the visual presentation to the user is adjusted based on the use context.
	Kang teaches:
wherein the virtual assistant module is further configured to adjust a frame rate of video content presented on the display based on the current use context. [Kang, Figures 5 and 10-12 showing that the user requests playback and the playback scenario, which includes the “playback speed information”/ “frame rate” of the Claim, depends on context of the user which is a type of “current use context.”  “2. … make a request for the playback scenario information of the video … and receive the one or more pieces of the playback scenario information generated based on the context information ….”  “6. …  wherein the context information includes at least one of an age of the user, a gender of the user, an occupation of the user, a friend of the user, a residence location of the user, a nationality of the user, a hobby of the user, an interest of the user, a location of the electronic device, a size of the display, a resolution of the display, and a frame rate of the display.”  “[0071] According to an embodiment of the present disclosure, the memory 450 may store the playback scenario information of the video. The playback scenario information may include at least one of viewpoint information, zoom information, playback speed information of the video, and audio volume information.”]

Scott and Kang both pertain to voice operated personal digital assistants (see Kang [0031]).  Scott decides the output volume and output image characteristics, such as font size and contrast, based on current context but does not specifically mention “frame rate” of the video playback as one of the features to consider.  Kang teaches a variable playback speed (frame rate) that is based on “context information,” including factors that constitute “current context” such as the identity of the user and the location of the device.  It would have been obvious to combine the variable playback speed of Kang with the system of Scott.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of a known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Regarding Claim 2, Scott teaches:
2. The voice-interaction device of claim 1, further comprising: 
audio input circuitry configured to receive the audio input signal and generate an enhanced target signal including audio generated by the target user; and [Scott teaches using multiple microphones and conducting beamforming to “enhance” the input signal:  “[0048] Sound: In order to enable interaction with the computer using a speech-based interface, one or multiple microphones representing instances of the sensors 132 can be employed. Using multiple microphones enables the use of sophisticated beamforming techniques to raise the quality of speech recognition and thus the overall interaction experience. Further, when motion information (e.g., angle of arrival information) is available (e.g., from radar information), a beamforming estimate can be used to enhance speech recognition, such as before any speech input is detected.”]
a voice processor configured to detect a voice command in the enhanced target signal; [Scott, Figure 9, “processing system 904.”  The device is a PDA which receives commands primarily through speech interfaces and speech recognition:  “[0074] … Consequently, when the system has identified Bob it may cause the digital assistant 126 to use speech interfaces along with visual information or switch entirely to speech interfaces.”  “[0086] The scenario 500 further depicts an active conversation interface 512, which may be output during an ongoing conversation between a user and the digital assistant. Here, the system provides indications and feedback with respect to the conversation, such as by displaying recognized speech 514, providing suggestions, and/or indicating available voice command options….”  See [0017], [0019] and [0086] for express teaching of “voice commands.”]
and
wherein the virtual assistant module is further configured to execute the detected voice command in accordance with the current use context. [Scott, the general, well-known purpose of a virtual/digital/personal assistant is to execute the command that is input by the user.  Scott uses the context of user, as obtained by the various sensors, to both determine the intent of the input command and the output mode:  “[0031] For example, requests may include spoken or written (e.g., typed text) data that is interpreted through natural language processing capabilities of the digital assistant 126. The digital assistant 126 may interpret various input and contextual clues to infer the user's intent, translate the inferred intent into actionable tasks and parameters, and then execute operations and deploy device services 128 to perform the tasks….”]

Regarding Claim 3, Scott teaches and suggests:
3. The voice-interaction device of claim 1, 
wherein the input component circuitry comprises an image sensor configured to capture digital images of a field of view; and [Scott, Figure 1 teaches “sensors 132” which include a “camera”:  “[0017] … Various sensors are contemplated including for example a camera, a depth sensor, a presence sensor, biometric monitoring devices, and so forth.”  “[0047] Presence sensing: The physical presence of people (i.e. people nearby the system) may be detected using sensors 132 like pyro-electric infrared sensors, passive infrared (PIR) sensors, microwave radar, microphones or cameras, and using techniques such as Doppler radar, radar using time-of-flight sensing, angle-of-arrival sensing inferred from one or more of Doppler radar or time-of-flight sensing, and so forth.”  A camera inherently has a field of view.  The “digital” nature of the “digital images” is suggested by the fact that the system implements a digital assistant module.  See [0125] to [0127] for hardware implementation.]
wherein the context controller is further configured to analyze the digital images to detect and/or track a position of the target user in relation to the voice-interaction device and determine the current use context based at least in part on the position of the target user. [Scott, the “presence sensors 132d” of Figure 1 include cameras which detect and track the position of various users in the environment as they enter and exit a particular region in order to provide context for the command interpretation task.  See Figures 3-4 also:  “[0050] Position: As noted above, radar or camera-based sensors 132 may provide a position for one or multiple users….”  “[0105] In at least some implementations, a reference point (e.g., the display 308) can be occluded at different distances, such as depending on an angle of approach of a user relative to the reference point. In such as case, a particular sensor (e.g., a camera) can resolve this occlusion, even when another sensor (e.g., radar) may not be able to resolve the occlusion.”  “[0020] … For instance, sensors may be used to obtain data for context sensing beyond a simple presence sensor, such as estimating the number of people present, recognizing the identities of people present, detection of distance/proximity to the people, and/or sensing when people approach or walk away from the device and/or other contextual sensors….” ]

Regarding Claim 6, Scott teaches:
6. The voice-interaction device of claim 3, wherein the context controller is configured to: 
estimate a distance between the target user and the voice-interaction device based at least in part on audio originating from the target user and sensed by the microphone of the voice-interaction device; and [Scott, “… In implementations, a system is able to detect user presence and distance from a reference point, and tailor a digital assistant experience based on distance. The distance, for example, represents a distance from a client device that outputs various elements of a digital assistant experience, such as visual and audio elements….”  Abstract.  See claims 1, 3-6 and 11 which are all related to “distance.”  “[0050] … Distance and/or proximity can also be detected using ultrasonic detection, time-of-flight, radar, and/or other techniques.”]
decrease the frame rate of the video content based at least in part on the target user's estimated distance from the voice-interaction device.  [Scott, “… In implementations, a system is able to detect user presence and distance from a reference point, and tailor a digital assistant experience based on distance. The distance, for example, represents a distance from a client device that outputs various elements of a digital assistant experience, such as visual and audio elements….”  Abstract.  “[0022] … For example, when distance from a reference point to the person is relatively small, a graphical UI is considered appropriate and is therefore presented on a display screen….”  “[0057] …Depending on the distance, the system may operate to switch to sound, use sound and visual UIs, and/or adapt visual UI for distance by changing font size, graphics, colors, level of detail, contrasts and other aspects used for visualization….”]
Scott does not teach that playback speed or frame rate of the visual content is modified according to distance.
Kang teaches:
decrease the frame rate of the video content based at least in part on the target user's estimated distance from the voice-interaction device. [Kang as shown with respect to Claim 1 changes the playback scenario of the video image according to context (which includes location) and playback scenario changes include a change to playback speed of the video based on context: “5. … wherein the playback scenario information further includes at least one of playback speed information of the video ….”  “[0036] …  the plurality of first electronic devices 110 and 120, and the second electronic device 130 may change a playback state of the video depending on a user input during the playback of the video. …  may change the playback speed of the video or may change the volume (or level) of the audio, depending on the user input.”  Kang teaches that the playback speed is adapted with context and adapting includes both increase and decrease.]
Rationale as provided for Claim 1.  Playback speed (frame rate) is an aspect of the displayed image and Scott changes font size and contrast to make the image more easily viewable as the user moves away.  It would have been obvious to combine Kang to include the playback speed as another factor that is similar to font size and is adjustable with distance of the user from the display.

Regarding Claim 7, Scott teaches:
7. The voice-interaction device of claim 3, [Scott, Figure 3, “[0062] FIG. 3 depicts an example scenario 300 which represents different proximity based interaction modalities ….”]
wherein the input component circuitry comprises a touch control; [Scott, Figure 1, “touch sensors 132c.”]
wherein the context controller is further configured to select a proxemic input modality based at least in part on an analysis of the digital images; and [Scott teaches that based on proximity of the user to the client device, the device will provide touch based interactions / “proxemic input modality” to the user.  See Figure 3, “[0063] At close proximity in the first proximity zone 302 (e.g., a within a 2 foot arc from the client device 102), touchable interactions are available since a user is close enough to touch a display 308 of the client device 102, use input devices of the client device 102, and so forth. Accordingly, the digital assistant 126 may make adaptations to a user interface displayed on the display 308 to support touch and other close proximity interactions….”  The determination of proximity and location is based on data collected from a number of sensors that include “image” creating sensors:  “[0081] … For instance, a motion sensor (e.g., an infrared sensor) can detect user motion and trigger a camera-based sensor to wake and capture image data, such as to identify a user. As a user moves between different proximity zones, for example, sensors may communicate with one another to wake and/or hibernate each other depending on user proximity and position. …”   See [0047] for use of camera and “[0050] Position: As noted above, radar or camera-based sensors 132 may provide a position for one or multiple users. …”]
wherein the virtual assistant module is further configured to activate the input component circuitry in response to the target user being in reach of the input component circuitry and [Scott, Figure 3, teaches that when the user is close to the device, the input modality is adjusted to permit touch input/output. The communication modes go from Speech/Audio (zone 306, far from device) to Speech/Audio and Visual (zones 302 and 304 which are closer to the device) to Visual and Touch (zone 302 which is closest to the device and touching is possible).  “[0063] …. Accordingly, the digital assistant 126 may make adaptations to a user interface displayed on the display 308 to support touch and other close proximity interactions….” ]
activate voice communications when the target user is out of reach of the voice-interaction device. [Scott, Figure 3.  In this limitation “out of reach” is interpreted to mean “out of touching reach” of the user.  When the user is in zones 304 and 306, which are farther from the display 308, “speech” is the mode of communication for both input of commands and output of results for the device.  “[0064] Farther away within the proximity zone 304 (e.g., between a 2 foot and a 3 foot arc from the client device 102), visual interactions are available since the digital assistant 126 determines that a user is likely close enough to be able to see the display 308 clearly. In this case, the digital assistant 126 may make adaptations to accommodate visual interactions and delivery of information visually. Speech may be used in this range also since the user is determined to be not close enough for touch. Still further away in the proximity zone 306 (e.g., between a 3 foot and a 10 foot arc from the client device 102), speech interactions are available since the digital assistant 126 determines that the user is determined to be too far from the display 308 for other modes like touch and visual interaction. In the proximity zone 306, for instance, the digital assistant 126 determines that a user is likely not be able to see the display clearly. Here, the digital assistant 126 may make adaptations to provide audio-based interactions and commands, and/or modify UI to accommodate the distance by using large elements, increasing text size, and reducing details so the information is easier to digest from a distance.”  As the user walks and gets closer or farther, the communication modes available to him/her are adjusted accordingly.  See [0070].]

“[0070] In an example scenario, Alice may ask about a scheduled soccer game while in the proximity zone 306 and receive a voice response because the digital assistant 126 knows Alice's proximity and determines voice is appropriate in the proximity zone 306. As she walks closer and enters the proximity zone 304, the digital assistant 126 recognizes the approach (e.g., change in proximity) and adapts the experience accordingly. For example, when Alice enters the proximity zone 304, the digital assistant 126 may automatically display a map of the soccer game location on the display 308 in response to detection of her approach via the system behavior manager 130.” See also the provision for “touchless gestures” as input:   “[0076] Alternatively or additionally, when Alice is in the proximity zones 304, 306, the digital assistant 126 can present touchless input elements on the display 308 that are capable of receiving user interaction from Alice via touches {sic touchless} gestures recognized by the light sensors 132b.”
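For illustration only, the zone-based selection of interaction modes quoted above from Scott [0063]-[0064] can be sketched as follows. This is a minimal sketch and not code from Scott or from the Application: the names Modality and available_modalities are hypothetical, the 2, 3, and 10 foot boundaries are the example arcs quoted from Scott, treating each boundary as inclusive is an assumption, keeping visual and speech available inside zone 302 is an assumption, and returning no modality beyond 10 feet is an assumption because the quoted passages do not address that range.

```python
from enum import Enum
from typing import Set


class Modality(Enum):
    TOUCH = "touch"
    VISUAL = "visual"
    SPEECH = "speech"


# Example arc boundaries (feet) quoted from Scott [0063]-[0064].
ZONE_302_LIMIT_FT = 2.0   # close enough to touch the display
ZONE_304_LIMIT_FT = 3.0   # close enough to see the display clearly
ZONE_306_LIMIT_FT = 10.0  # speech range


def available_modalities(distance_ft: float) -> Set[Modality]:
    """Return the interaction modes offered at a given user distance.

    Hypothetical helper; inclusive boundaries are an assumption.
    """
    if distance_ft < 0:
        raise ValueError("distance must be non-negative")
    if distance_ft <= ZONE_302_LIMIT_FT:
        # Zone 302: touch is possible; visual and speech are assumed
        # to remain available at this range.
        return {Modality.TOUCH, Modality.VISUAL, Modality.SPEECH}
    if distance_ft <= ZONE_304_LIMIT_FT:
        # Zone 304: visual interaction, with speech also usable.
        return {Modality.VISUAL, Modality.SPEECH}
    if distance_ft <= ZONE_306_LIMIT_FT:
        # Zone 306: too far for touch or visual; speech only.
        return {Modality.SPEECH}
    # Beyond the quoted zones: not addressed in the quoted text (assumption).
    return set()


if __name__ == "__main__":
    for d in (1.0, 2.5, 5.0, 20.0):
        print(d, sorted(m.value for m in available_modalities(d)))
```

In this sketch, touch input is offered only when the user is within the innermost zone, and speech is the only mode offered in zone 306, which mirrors the mapping of “in reach” and “out of reach” applied to Claim 7 above.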

Regarding Claim 8, Scott teaches (Note that Scott does not teach the whisper mode which is the intention of this Claim.  But this Claim is broadly stated and Scott teaches the broad language of the Claim.  For more specific references see rejection of Claim 10 which is stated with the appropriate level of particularity.):
8. The voice-interaction device of claim 1, wherein the virtual assistant module is further configured to: [This Claim is mapped to Scott, Figure 6, “Detect User Identity 604.”]
detect a voice interaction from the target user; [Scott, the “client device 102” receives its commands generally by voice:  “[0017] According to one or more implementations, techniques described herein are able to receive voice commands and react upon presence, identity and context of one or more people. By way of example, the described techniques can be implemented via a computing device equipped with one or multiple microphones, a screen, and sensors to sense the context of a user. Various sensors are contemplated including for example a camera, a depth sensor, a presence sensor, biometric monitoring devices, and so forth.”]
determine at least one voice characteristic based on the voice interaction; and [Scott, user identity is one of the parameters detected by the system so that the “experience” i.e. input and output modalities can be adapted for the particular user as shown in Figure 7.  The user identity is detected from various indicators including his phone device or his biometrics which include his particular voice characteristics.  “[0097] Identity of a user is detected (block 604). The user, for example, moves to a proximity to the client device 102 where a user-specific attribute is detected by one or more of the sensors 132, and used to identify and/or authenticate the user. Different ways of detecting user identity are discussed above, and include various biometric features and techniques.”  “[0068] … Examples of such biometric techniques include facial recognition, voice recognition, gait recognition, and so forth. …”] 
modulate an output volume according to the determined voice characteristic. [Scott, if the user is deaf the volume of the output is adjusted accordingly.  The output volume is modulated according to identity of the user; user identity is in turn determined from his voice characteristics.  Thus the output volume is modulated according to the voice characteristics:  “[0023] Context sensors and techniques discussed herein may also be employed to improve accessibility scenarios. For example, the system may detect or be aware that a particular person is partially deaf. In this case, volume level may be adapted when that particular user is present….”]

Claim 11 is an independent method claim with limitations similar to the limitations of Claim 1 and is rejected under similar rationale.
Regarding Claim 11, Scott teaches:
11. A method comprising: 
monitoring communications between a voice-interaction device and a target user using a plurality of input and output components of the voice-interaction device, [Scott, Figure 1, “client device 102” / “voice interaction device” of the Claim includes “sensors 132,” including “audio sensors 132b” and “presence sensors 132d” and Figure 9, “sensors 132” and “Input/Output Interfaces 908.”  Figures 3-4 and 7 show that distance of the user with the “client device 102” is taken into consideration.  See 702 and 704.  “[0028] The client device 102 can be embodied as any suitable computing system and/or device such as, by way of example and not limitation, a gaming system, a desktop computer, a portable computer, a tablet or slate computer, a handheld computer such as a personal digital assistant (PDA), a cell phone, a set-top box, a wearable device (e.g., watch, band, glasses, etc.), a large-scale interactivity system, and so forth….”]  wherein one of the plurality of output components comprises a display configured to present visual content to the target user; [Scott, Figures 1, 3, 4, 5, and 9 all show devices that include displays.  “[0022] Context sensors as noted above may also enable adaptations to the operation of a voice UI, such as responding differently based on whether multiple people are present or a single person, and responding differently based on proximity to a person. For example, when distance from a reference point to the person is relatively small, a graphical UI is considered appropriate and is therefore presented on a display screen. However, when the person is positioned such that the display screen may not be visible and/or the person is not looking at the display screen, the graphical UI may not be helpful in which case the system may utilize audible alerts, voice interaction, and audio responses.”]
determining, by a context controller of the voice-interaction device, a current use context of the voice-interaction device based on the monitored communications of the plurality of input and output components; and [Scott, Figure 1, “sensors 132” and Figure 2, “sensor data” teach the “context controller” of the Claim that is providing the “current use context” of the device’s input/output interface components.  Figure 9, “input/output interfaces 908.”  The devices shown in Figures 1 and 9 include PDAs with voice interaction.  “[0028] The client device 102 can be embodied as … a personal digital assistant (PDA), a cell phone ….”  See [0017], [0021], [0047], [0048], e.g., for use of microphones.]
adapting a frame rate of visual content being output through the display based on the current use context. [Scott, Figure 2, “user experience 206” is determined by the “digital assistant 126” including the “system behavior manager 130” with the use of context from the input “sensor data 202.”  The user experience includes the visual content that is output to the user and is modified according to context of the user:  “10. A system as described in claim 1, wherein the element of the first digital assistant experience comprises a visual user interface of the digital assistant, and said adapting comprises adapting an aspect of the visual user interface including one or more of changing a font size, a graphic, a color, or a contrast of the visual user interface in dependence upon the change in the contextual factor.”  See [0057] and [0069], [0108], [0140].   “[0044] … In operation, the system behavior manager 130 obtains sensor data 202 that may be collected via various sensors 132. The sensor data 202 is analyzed and interpreted by the system behavior manager 130 to determine contextual factors such as user presence, identity, proximity, emotional state, and other factors noted above and below….  System behavior adaptations 204 that correspond to the current context are identified and applied to adapt the user experience 206 accordingly. Generally, the user experience 206 includes different attributes of a digital assistant experience such as audible experience, visual experience, touch-based experience, and combinations thereof. Various types of adaptations of user experience are contemplated, details of which are described above and below.”  See flowcharts of Figures 6 and 7.]
Scott does not teach that “frame rate” is one of the characteristics of the image that is modified according to context.
Kang teaches:
adapting a frame rate of visual content being output through the display based on the current use context. [Kang, Figures 5 and 10-12 showing that the user requests playback and the playback scenario, which includes the “playback speed information”/ “frame rate” of the Claim, depends on context of the user which is a type of “current use context.”  “2. … make a request for the playback scenario information of the video … and receive the one or more pieces of the playback scenario information generated based on the context information ….”  See also claim 6 of Kang and [0071] for “frame rate” or “playback speed” as part of the “playback scenario.”]
Rationale for combination as provided for Claim 1.

Regarding Claim 12, Scott teaches:
12. The method of claim 11, wherein the plurality of input components further comprises a microphone, and [Scott, microphone is included in the “audio sensors 132b” of Figure 1.  [0017].]
wherein the method further comprises:
sensing sound via the microphone; [Scott, microphone is included in the “audio sensors 132b” of Figure 1.  [0017].]
based on the sensed sound, generate an audio input signal; [Scott, Figure 2, “sensor data 202” including the sound sensed by the microphone is being input to the “digital assistant 126.”  A microphone is a transducer that converts sound to audio signals.]
generating an enhanced target signal, including audio generated by the target user, from the audio input signal; [Scott, enhancement is a known part of audio signal processing and Scott too expressly includes: “[0048] Sound: In order to enable interaction with the computer using a speech-based interface, one or multiple microphones representing instances of the sensors 132 can be employed. Using multiple microphones enables the use of sophisticated beamforming techniques to raise the quality of speech recognition and thus the overall interaction experience. Further, when motion information (e.g., angle of arrival information) is available (e.g., from radar information), a beamforming estimate can be used to enhance speech recognition, such as before any speech input is detected.”]
detecting speech in the enhanced target signal; and [Scott, see [0048] above.  Scott is directed to PDAs which receive spoken commands and perform speech recognition on the received command.]
extracting a voice command from the detected speech; and [Scott, the speech recognition is conducted to extract commands:  “[0017] According to one or more implementations, techniques described herein are able to receive voice commands and react upon presence, identity and context of one or more people….”  [0019], [0064], [0086].]
executing the voice command from the speech in accordance with the current use context. [Scott is directed to PDA and the operation of PDAs is known to include receiving spoken commands and executing the command and providing output response:  “[0019] … When the computing device is in an active state, a digital assistant system operates to process voice commands, and output appropriate graphical user interface (UI) visualizations and/or audible signals to indicate to a user that the digital assistant is ready and able to process voice and/or visual commands and other input. Based on user interaction, the digital assistant can respond to queries, provide appropriate information, offer suggestions, adapt UI visualizations, and takes actions to assist the user depending on the context and sensor data.”]

Claim 13 is a method claim with limitations similar to the limitation of Claim 3.
Regarding Claim 13, Scott teaches:
13. The method of claim 11, further comprising:
acquiring digital images of a field of view captured by an image sensor of the plurality of input components of the voice-interaction device; [Scott, Figure 1 teaches “sensors 132” which include a “camera”.  See [0017] and [0047].]
analyzing the acquired digital images; [Scott, the “presence sensors 132d” of Figure 1 include cameras which detect and track the position of various users in the environment as they enter and exit a particular region in order to provide context for the command interpretation task.  See Figures 3-4 also.  [0020], [0050], [0105].]
tracking a relative position of the target user in relation to the voice-interaction device based on an analysis of the acquired digital images; and [Scott, Figures 3-4.  “[0050] Position: As noted above, radar or camera-based sensors 132 may provide a position for one or multiple users….”  “[0020] … For instance, sensors may be used to obtain data for context sensing beyond a simple presence sensor, such as estimating the number of people present, recognizing the identities of people present, detection of distance/proximity to the people, and/or sensing when people approach or walk away from the device and/or other contextual sensors….”]
determining the current use context based at least in part on the relative position of the target user. [Scott, distance and position of the users are among the main aspects of the “context” of Scott according to which volume and image are adjusted or a decision between sound output or image output is made.  “4. … the contextual factor comprises an estimated viewing distance of the user from a display device of the client device.”   “[0004] Techniques for digital assistant experience based on presence sensing are described herein. In implementations, a system is able to detect user presence and distance from a reference point, and tailor a digital assistant experience based on distance. The distance, for example, represents a distance from a client device that outputs various elements of a digital assistant experience, such as visual and audio elements. ….”  “9. … adapting comprises switching between an audio interaction mode and a visual interaction mode for interaction with the digital assistant.”  “10 . …. adapting comprises adapting an aspect of the visual user interface including one or more of changing a font size, a graphic, a color, or a contrast of the visual user interface in dependence upon the change in the contextual factor.”]

Claim 15 is a method claim with limitations similar to the limitation of Claim 5.
Regarding Claim 15, Scott teaches: 
15.     The method of claim 13, further comprising:
estimating a distance between the target user and the voice-interaction device based at least in part on the relative position of the target user to the voice-interaction device; and [Scott expressly teaches this limitation in the “adapting to the position” category including adapting to “movement of the user” and as shown in Figures 3-7.  See [0050], [0057], [0085] and rejection of Claim 5.]
increasing the frame rate of the visual content  based at least in part on the target user’s estimated distance from the voice-interaction device. [Scott changes/adapts the “digital assistant experience” / output interface and modality in order to accommodate the user: position and movement of the user or his age and disabilities. “[0057] … Depending on the distance, the system may operate to switch to sound, use sound and visual UIs, and/or adapt visual UI for distance by changing font size, graphics, colors, level of detail, contrasts and other aspects used for visualization. ...” ]
Scott teaches that, for example, font size and level of detail are changed based on the distance of the user from the display.
Scott does not teach changing the playback speed or frame rate to accommodate the user.
Kang teaches:
increasing the frame rate of the visual content  based at least in part on the target user’s estimated distance from the voice-interaction device. [Kang as applied to Claim 1 teaches that the playback speed of the video is changed according to the scenario of playback that depends on user selection and/or other contextual information.  Kang teaches that the playback speed is adapted based on distance.  Adapting means increasing or decreasing and therefore “increasing the frame rate” is expressly taught by Kang.]
Rationale for combination as provided for Claim 1.

Regarding Claim 16, Scott teaches and suggests:
16. The method of claim 15, further comprising:
based on the analysis of the acquired digital images; [Scott’s sensors shown in Figure 1 include “presence sensors 132d” and “light sensors 132a” and camera and follow the user and detect his presence or absence and his distance from each device. “[0017] … Various sensors are contemplated including for example a camera, a depth sensor, a presence sensor, biometric monitoring devices, and so forth.”   “[0081] … For instance, a motion sensor (e.g., an infrared sensor) can detect user motion and trigger a camera-based sensor to wake and capture image data, such as to identify a user. As a user moves between different proximity zones, for example, sensors may communicate with one another to wake and/or hibernate each other depending on user proximity and position….”]
activating, by the voice-interaction device, a touch-enabled input component if when the target user is in reach of the input component, [Scott, Figure 1, “sensors 132” include “touch sensors 132c.”  When the user is determined to be close, the system makes touch input active and available:  “[0063] At close proximity in the first proximity zone 302 (e.g., a within a 2 foot arc from the client device 102), touchable interactions are available since a user is close enough to touch a display 308 of the client device 102, use input devices of the client device 102, and so forth. Accordingly, the digital assistant 126 may make adaptations to a user interface displayed on the display 308 to support touch and other close proximity interactions….”]  wherein the activating the touch-enabled input component comprises rendering a user interface button on the display to accept a touch interaction from the target user; and [Scott teaches that the device presents touch interaction elements which at the least strongly suggests “rendering a user interface button on the display”:  “[0075] … The proximity zone 302, for instance, represents a distance at which Alice is close enough to touch the display 308. Accordingly, the digital assistant 126 presents touch interaction elements in addition to other visual and/or audio elements. Thus, Alice can interact with the digital assistant 126 via touch input to touch elements displayed on the display 308,….”]
activating, by  the voice-interaction device, a voice input component and voice output component when the target user is out of reach of the voice-interaction device. [Scott, Figure 1, “sensors 132” include “audio sensors 132b” and microphones.  When the user is too far to use the touch input, the system makes the audio input and output as the available mode: “[0057] …… As the person approaches the system, indications such as icons, animations, and/or audible alerts may be output to signal that different types of interaction are active ….” “[0064] Farther away within the proximity zone 304 (e.g., between a 2 foot and a 3 foot arc from the client device 102), visual interactions are available since the digital assistant 126 determines that a user is likely close enough to be able to see the display 308 clearly. In this case, the digital assistant 126 may make adaptations to accommodate visual interactions and delivery of information visually. Speech may be used in this range also since the user is determined to be not close enough for touch. Still further away in the proximity zone 306 (e.g., between a 3 foot and a 10 foot arc from the client device 102), speech interactions are available since the digital assistant 126 determines that the user is determined to be too far from the display 308 for other modes like touch and visual interaction….”]

Claim 17 is a method claim with limitations similar to the limitation of Claim 7.
Regarding Claim 17, Scott teaches and suggests: 
17. The method of claim 15, further comprising:
adjusting a size of displayed elements on the display as the target user moves relative to the voice-interaction device, wherein the size of the displayed elements is adjusted for readability at the estimated distance between the target user and the voice-interaction device; and [Scott teaches that the input and output modalities both are changed according to context which includes position and distance of the user from the device such that FONT SIZE, e.g., changes according to distance.  “[0057] … Depending on the distance, the system may operate to switch to sound, use sound and visual UIs, and/or adapt visual UI for distance by changing font size, graphics, colors, level of detail, contrasts and other aspects used for visualization. ...”]
wherein the voice-interaction device activates a touch screen input when the target user is determined to be in arm’s reach of a touch-enabled input component of the voice-interaction device.  [Scott, Figure 3, teaches that when the user is close to the device, the input modality is adjusted to permit touch input/output. The communication modes go from Speech/Audio (zone 306, far from device) to Speech/Audio and Visual (zones 302 and 304 which are closer to the device) to Visual and Touch (zone 302 which is closest to the device and touching is possible).  “[0063] …. Accordingly, the digital assistant 126 may make adaptations to a user interface displayed on the display 308 to support touch and other close proximity interactions….”  Scott does not include the phrase “arm’s reach” however this distance is suggested by the collective teachings of Scott because the user has to be able to touch the screen in order to use touch interaction. ]

Claim 18 is a method claim with limitations similar to the limitation of Claim 8.
Regarding Claim 18, Scott teaches:
18.  The method of claim 11, further comprising:
detecting a user voice interaction based on the monitored communications; [Scott, the “client device 102” receives its commands generally by voice:  “[0017] According to one or more implementations, techniques described herein are able to receive voice commands and react upon presence, identity and context of one or more people. By way of example, the described techniques can be implemented via a computing device equipped with one or multiple microphones ….”]
determining at least one voice characteristic based on the user voice interaction; and [Scott, user identity is one of the parameters detected by the system so that the “experience” i.e. input and output modalities can be adapted for the particular user as shown in Figure 7.  The user identity is detected from various indicators including his phone device or his biometrics which include his particular voice characteristics.  [0068] and [0097].]
modulating an output volume for one of the plurality of output components based on the determined voice characteristic. [Scott teaches that if the user is deaf the volume of the output is adjusted accordingly.  The output volume is modulated according to identity of the user; user identity is in turn determined from his voice characteristics.  Thus the output volume is modulated according to the voice characteristics:  “[0023] Context sensors and techniques discussed herein may also be employed to improve accessibility scenarios. For example, the system may detect or be aware that a particular person is partially deaf. In this case, volume level may be adapted when that particular user is present….”]

Claims 4-5 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Scott and Kang in view of Singh (U.S. 20180032300).
Regarding Claim 4, Scott teaches and suggests:
4. The voice-interaction device of claim 3, 
wherein the context controller is further configured to analyze the digital images to determine a gaze direction of the target user; and [Scott teaches that the contextual features that it detects include “eye tracking” which teaches “gaze direction” of the Claim.  “[0020] … For instance, different contextual factors can be sensed and/or inferred, such as age and/or gender based on visual information, a state a person is in (e.g., the user is able to see, talk, and so forth). Such contextual factors may be detected in various ways, such as via analysis of user motion, user viewing angle, eye tracking, and so on.”]
wherein the virtual assistant module is further configured to:
direct interactions to the target user through visual display elements on the display in response to the gaze direction being directed toward the display; and [Scott, Figures 6 and 7 provide flowcharts of “presenting a digital assistant experience at the client device” and adapting this presentation/output according to context data.  This includes deciding whether to use the display or audio modality for output.  “[0055] The output, or more general actuation, is mainly based on two interaction modalities: sound and display. However, these two modalities are tightly interwoven with each other based on the contextual data retrieved by sensors 132 and prior knowledge about the user's habits and situation.”  “[0056] Switching system behavior between multiple output modalities can be illustrated by the following example situations. Adaptations of the system behavior are designed to making information easily accessible in various interaction scenarios and contexts.”  The former teachings and [0057] together suggest that when convenient, the output is presented on a display and one of the contextual factors considered in determining this convenience is whether the user can see the screen which may be determined by eye tracking.]
 direct interactions through the audio output signal and the speaker in response to the gaze direction being directed away from the display. [Scott switches to sound output if context indicates that the person cannot see the display.  “[0057] Adapting to the position: When the contextual information indicates that a person who would like to interact with the system is not able to see the screen, the system behavior manager 130 may be configured to switch to sound output in preference over displaying data visually….”  Context according to [0020] above includes “eye tracking.”  The two teachings together suggest that based on eye tracking the output mode may be changed to audio. ]

Scott includes various teachings that together very strongly suggest that in response to eye tracking the output is negotiated between display and audio.  This is not express in Scott, however.
Kang does not teach this feature.

Singh teaches:
…
wherein the context controller is further configured to analyze the digital images to determine a gaze direction of the target user; and [Singh, “[0013] The at least one electronic processor can be configured to receive sensor data from at least one eye tracking sensor; and to determine the gaze direction of vehicle driver in dependence on the received sensor data. Alternatively, the at least one electronic processor can be configured to receive the gaze direction information from an eye tracking apparatus.”  “[0034] The eye tracking apparatus 11 comprises first and second image sensors 17-1, 17-2 each comprising a driver-facing camera. At least one of said first and second image sensors 17-1, 17-2 can comprise an infra-red (or near infra-red) capability for eye-tracking purposes. In a variant, the first and second image sensors 17-1, 17-2 could detect light at a visible wavelength to determine head position and/or eye gaze. The first and second image sensors 17-1, 17-2 are connected to an image processing unit 19 configured to process the image data to generate tracking data DAT1.”]
wherein the virtual assistant module is further configured to direct interactions to the target user through visual display elements on the display in response to the gaze direction being directed toward the display; and direct interactions through the audio output signal and the speaker in response to the gaze direction being directed away from the display. [Singh, “[0014] The display control apparatus can be configured to record historical data to determine driver behaviour and/or preferences. Based on historical driver behaviour, the display control apparatus can determine if the driver prefers visual or audio output of information. If the driver prefers audio behaviour and the determined gaze direction is not coincident with the first display, the display control apparatus can be configured to output the first information data set in an audio form rather than audio-visual or visual information.”  “22. The display control apparatus as claimed in claim 4, wherein the at least one electronic processor is configured to control an audio device and/or a haptic device to selectively output in an audio form and/or in a haptic form in dependence on the determined gaze direction and optionally on one or more preferences of the vehicle driver.”]
Scott, Kang, and Singh pertain to voice commands and output of the result of the command or query by display or audio and it would have been obvious to modify the system of Scott which provides for a combination of visual and audio user interfaces that are switched according to context, teaches that the output modality is modified from display to audio when the user cannot see the display, and also provides for eye/gaze tracking, as one of the input parameters to its context detector, with the system of Singh which expressly teaches providing the output to a display or audio output/speaker depending on the location of attention of a user (driver) which is determined from an eye/gaze tracker in order to further accommodate a user who is not looking at the display or cannot look at the display.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Regarding Claim 5, Scott teaches:
5. The voice-interaction device of claim 4, 
wherein the virtual assistant module is further configured to adjust a size of the visual display elements based on a movement of the target user relative to the voice-interaction device, [Scott expressly teaches this limitation in the “adapting to the position” category including adapting to “movement of the user” and as shown in Figures 3-7.  “[0057] Adapting to the position: When the contextual information indicates that a person who would like to interact with the system is not able to see the screen, the system behavior manager 130 may be configured to switch to sound output in preference over displaying data visually. The same is true for situations in which a person interacts from farther away. Depending on the distance, the system may operate to switch to sound, use sound and visual UIs, and/or adapt visual UI for distance by changing font size, graphics, colors, level of detail, contrasts and other aspects used for visualization. ...”  Figures 3-4 show keeping track of movements/position of a user:  “1. … and adapting an element of the first digital assistant experience to generate a second digital assistant experience at the client device that is based on a change in a contextual factor that results from the user moving from the first detected distance to the second detected distance from the reference point.”  “[0050] Position: As noted above, radar or camera-based sensors 132 may provide a position for one or multiple users. The position is then used to infer context, e.g. approaching the client device 102, moving away from the client device 102, presence in a different room than the client device 102, and so forth….”  “[0085] A user identified/detail interface 510 represents an expanded visualization that may be provide when the user moves closer to the client device 102 and/or is identified. For instance, the interface 510 can be presented when a user moves from the proximity zone 306 to the proximity zone 304 and is identified and/or authenticated as a particular user. The interface 510 may include various interaction options, customized elements, user-specific information, and so forth. Such details are appropriate when the system detects that the user is more engaged by moving closer, providing input, and so forth….”]
wherein the size of the visual display elements is adjusted to facilitate readability at a distance between the target user and the voice-interaction device. [Scott changes/adapts the “digital assistant experience” / output interface and modality in order to accommodate the user: position and movement of the user or his age and disabilities. “[0057] … Depending on the distance, the system may operate to switch to sound, use sound and visual UIs, and/or adapt visual UI for distance by changing font size, graphics, colors, level of detail, contrasts and other aspects used for visualization. ...”  “[0059] …System behavior may also be adapted by selecting a different output modality, such as to support people with limited eyesight, limited hearing, or use age appropriate user interfaces and vocabulary.”  “[0088] In general, the system is able to transition between different UIs and adapt the UIs dynamically during an ongoing interaction based on changing circumstances. For example, different UI and modalities in response to changes in user proximity, number of users present, user characteristics and ages, availability of secondary device/displays, lighting conditions, user activity, and so forth….”]

Claim 14 is a method claim with limitations similar to the limitations of Claim 4.
Regarding Claim 14, Scott teaches:
14. The method of claim 13, wherein one of the plurality of output components further comprises a speaker and [Scott, Figures 1 and 9: the types of devices shown would all include speakers, particularly the television.  “[0073] Alternatively or additionally, if the system has access to multiple speakers, different speakers can be chosen for output to Bob and Alice, and respective volume levels at the different speakers can be optimized for Bob and Alice.”]
wherein the method further comprises:
analyzing a gaze direction of the target user; [Scott teaches that the contextual features that it detects include “eye tracking” which teaches “gaze direction” of the Claim.  “[0020] … Such contextual factors may be detected in various ways, such as via analysis of user motion, user viewing angle, eye tracking, and so on.”]
turning on the display and providing a visual output to the target user in response to the gaze direction being directed toward the display; and [Scott teaches turning the display on or off (use of the display) depending on the position of the user and his distance from the display.  “[0056] Switching system behavior between multiple output modalities can be illustrated by the following example situations….”  “[0057] Depending on the distance, the system may operate to switch to sound, use sound and visual UIs, and/or adapt visual UI for distance by changing font size, graphics, colors, level of detail, contrasts and other aspects used for visualization. … As the person approaches the system, indications such as icons, animations, and/or audible alerts may be output to signal that different types of interaction are active ….”]
turning off the display and providing a voice output to the target user through the speaker in response to the gaze direction being directed away from the display. [Scott.  If context information determines that the user cannot see the display, then audio would be used.  “[0057] Adapting to the position: When the contextual information indicates that a person who would like to interact with the system is not able to see the screen, the system behavior manager 130 may be configured to switch to sound output in preference over displaying data visually….”  Context according to [0020] above includes “eye tracking.”]
Scott teaches “eye tracking” which teaches the “gaze direction” of the Claim as a type of context information and it also teaches that the context information determines the use of display or speakers.  But Scott does not expressly connect the eye tracking and gaze direction with the switching of the output modalities.
Kang does not teach this feature.
Singh teaches:
…
analyzing a gaze direction of the target user; [Singh teaches eye tracking at [0013] and [0034].]  
turning on the display and providing a visual output to the target user in response to the gaze direction being directed toward the display; and turning off the display and providing a voice output to the target user through the speaker in response to the gaze direction being directed away from the display. [Singh, “[0014] The display control apparatus c…  If the driver prefers audio behaviour and the determined gaze direction is not coincident with the first display, the display control apparatus can be configured to output the first information data set in an audio form rather than audio-visual or visual information.”  “22. … control an audio device and/or a haptic device to selectively output in an audio form and/or in a haptic form in dependence on the determined gaze direction and optionally on one or more preferences of the vehicle driver.”]
Rationale as provided for Claim 4.

Claims 9-10 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Scott and Kang and further in view of Raitio (U.S. 2017/0358301) and Shurtz (U.S. 2007/0104337).
Note that the connector was changed from “or” to “and.”  Therefore, every limitation of the Claim needs to be accounted for.  Note that Shurtz is added to expressly teach the well-known definition of volume.  Volume/power/energy are all based on amplitude.
Regarding Claim 9, Scott teaches: 
9. The voice-interaction device of claim 8, wherein the context controller is further configured to detect: [Note the Objection above and note that “and/or” is interpreted as “or” or as “at least one” such that a reference that teaches one of the limitations teaches the Claim.  Refer to the Conclusion for a mapping of all of the limitations of Claim 9.]
 a speech volume based at least in part on an amplitude of the input audio signal, [Scott does not teach sensing the volume/loudness of the speech of the user as a factor.  “Amplitude of the input signal” is the definition of volume/loudness. Scott modulates the volume of the output speech to suit the particular user and his context.  See rejection of Claim 8.  Scott does not teach that the input volume is sensed or determined as part of context.  However, “speech volume” consideration is suggested by Scott:  “[0021] … In another example, a microphone may be employed to measure loudness of the environment ….”  The “loudness of the environment” includes the voices of the people and also teaches the “voice characteristics” of this claim.  See [0071] and [0072]  where presence of Bob and Alice causes the volume of output to be modified because the “voice characteristics” of the input indicates more people.]
the at least one voice characteristic, [Scott identifies the users/speakers based on biometrics which include “voice recognition.”  Biometric voice recognition identifies characteristics of the input voice:  “[0068] … Alice is identified and authenticated as being associated with a particular user profile. Examples of such biometric techniques include facial recognition, voice recognition, gait recognition, and so forth….”  Scott also teaches that for better speech recognition the system uses speaker’s identity to take into consideration the characteristics of his voice such as accent:  “[0049] …When the identity of a user is known (such as discussed below), it is possible to apply a different speech recognition model that actually fits the user's accent, language, acoustic speech frequencies, and demographic.”]
a distance between the target user and the voice-interaction device, and [Scott teaches that the distance between the user and device is determined in order to determine the method of output as display on a GUI or by voice.  See Figure 3.  “[0066] In at least some implementations, a volume of the audio prompt is adjusted based on Alice's distance from the client device 102. For instance, when Alice first enters the proximity zone 306 and is initially detected, an audio prompt may be relatively loud. However, as Alice continues toward the client device 102 an approaches the proximity zone 304, volume of an audio prompt may be reduced.”  “… In implementations, a system is able to detect user presence and distance from a reference point, and tailor a digital assistant experience based on distance. The distance, for example, represents a distance from a client device that outputs various elements of a digital assistant experience, such as visual and audio elements. ….”  Abstract.  “[0058] … , additional sensors are invoked to detect position, distance, identity, and other characteristics that enable further context-based adaptations of the system behavior….”]
environmental noise. [Scott considers the background noise as a factor in recognizing the input speech of the user:  “[0049] Also, the system (e.g., the client device 102) can disambiguate between multiple sound sources, such as by filtering out the position of a known noise-producing device (e.g., a television) or background noise. ….”  Scott adjusts the volume of output based on noise in the environment:  “[0089] … Volume adjustments may also be made based on proximity and/or ambient noise levels….”]
Scott modulates the volume of the output speech to suit the particular user and his context.  See rejection of Claim 8.  Scott does not expressly teach that the input volume is sensed or determined as part of context.
Kang does not teach the feature.

Raitio teaches:
9. The voice-interaction device of claim 8, wherein the context controller is further configured to detect: [Raitio:  “DIGITAL ASSISTANT PROVIDING WHISPERED SPEECH.”  “[0002] The present disclosure relates generally to a digital assistant and, more specifically, to a digital assistant that is capable of detecting a whispered speech input and providing a whispered speech response.”]
a speech volume based at least in part on an amplitude of the input audio signal,  [Raitio expressly teaches in Figure 8A, “whispered speech determination module 820” and claims 5-6 that a whispered input speech is determined if the volume of input signal is below a threshold volume.  It also expressly teaches that amplitude and volume of the input voice are detected in order to determine whisper.   “[0245] As described and shown in FIG. 8A, in response to receiving a speech input from user 830, whispered speech determination module 820 can determine whether the speech input includes a whispered speech input. In some examples, whispered speech determination module 820 can make such determination based on one or more spectrum characteristics, such as the amplitude, the energy, the volume, the slope, or a combination thereof. …” ]
the at least one voice characteristic, [Raitio, note [0245] to [0247] for all the various types of voice characteristics that are obtained by whisper detection of Raitio.]
…
Scott, Kang, and Raitio pertain to voice operated personal digital assistants and it would have been obvious to combine the whisper feature of Raitio which needs to detect whether the input speech was whispered and therefore detects a whispered input based on volume of input voice with the context factors of Scott which decide the output volume based on considerations of privacy (who is within the earshot of the device) as an added indicator that privacy and lower output volume is desirable.  (See Raitio, [0003].)  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
Volume is directly based on amplitude and energy (which depends on the amplitude).  However, Raitio does not set forth this relationship.  A reference is cited for completeness.  Note that this reference could have been cited as support for a well-known point of physics.
Shurtz teaches:
a speech volume based at least in part on an amplitude of the input audio signal, [Shurtz teaches that volume is obtained based on amplitude of the audio signal:  “[0007] The foregoing and other features are accomplished, according the present invention, by providing apparatus monitoring the analog signals that are generated by an audio source. The volume of the audio from speakers will be determined by the amplitude of the analog signals which vary with the volume control at the receiver end. The amplitude of the audio analog signals and peak is detected then digitized. The analog signal volume amplitude determines the amplitude of the digitized signals thereby relating the digitized signal audio volume to the analog signal audio volume.”]
Scott, Kang, Raitio, and Shurtz pertain to voice and audio inputs and it would have been obvious to combine the determination of volume from the amplitude of the audio signal from Shurtz with the system of the combination as one method of obtaining volume.  (See Raitio, [0003].)  This combination falls under simple substitution of one known element for another to obtain predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Regarding Claim 10, Scott teaches:
10. The voice-interaction device of claim 9, wherein the context controller is further configured to: 
analyze the characteristics of an input audio signal; and [Scott Figures 1 and 2, makes its determinations based on sensor data which include “audio sensors 132a.”]
modulate the output volume of the voice-interaction device to match a detected use context; [Scott teaches that the output volume depends on and is modified according to the identity of the user and other context such that if the user is deaf or old or far or in presence of other people or presence of noise the volume of the output is adjusted.  “[0023] Context sensors and techniques discussed herein may also be employed to improve accessibility scenarios. For example, the system may detect or be aware that a particular person is partially deaf. In this case, volume level may be adapted when that particular user is present….”  “[0057] … When the person is further away from the system, the system may also adjust the volume of sound output or the clarity of speech synthetization by increasing the overall pitch….”  See also [0057], [0066], [0071]-[0073], [0089], and [0107] for adjustment of output volume due to other contextual factors.]
wherein in response to a determination that the input audio signal corresponds to a whisper, the context controller lowers the output volume in the modulating; and 
wherein in response to the target user being located a distance away, the context controller increases the output volume in the modulating to project a voice output to the target user. [Scott adjusts the volume of the output according to distance of the user from the client device as one of the contextual factors.  See Figures 6-7 and “[0057] … When the person is further away from the system, the system may also adjust the volume of sound output or the clarity of speech synthetization by increasing the overall pitch….”]
Scott teaches that the output volume is modified according to identity of the person who is receiving the output and his personal limitations as well as other characteristics of the environment.  However, it does not teach that the device considers the volume of input command which is referred to as a whisper mode meaning that it does not teach that the device whispers back if the user whispers in his command.
Kang does not teach the feature.
Raitio teaches:
wherein in response to a determination that the input audio signal corresponds to a whisper, the context controller lowers the output volume in the modulating; and [Raitio detects whether the input speech has been whispered and if so outputs the result with whispered synthesized speech.  Title: “DIGITAL ASSISTANT PROVIDING WHISPERED SPEECH.”  This feature is used, for example, if the user is in a library or private place and does not want to disturb others in the environment.  See [0003] for examples of use.  The output is in whisper.  “[0002] The present disclosure relates generally to a digital assistant and, more specifically, to a digital assistant that is capable of detecting a whispered speech input and providing a whispered speech response.”  One characteristic of a whisper is lower volume:  “[0245] As described and shown in FIG. 8A, in response to receiving a speech input from user 830, whispered speech determination module 820 can determine whether the speech input includes a whispered speech input. In some examples, whispered speech determination module 820 can make such determination based on one or more spectrum characteristics, such as the amplitude, the energy, the volume, the slope, or a combination thereof….”  Therefore, an output whispered speech will have a lower volume:  “6. wherein the one or more first spectrum characteristics comprise at least one of: a first amplitude, wherein the first amplitude is less than a second amplitude below a threshold frequency, the second amplitude being associated with the non-whispered speech; a first energy, wherein the first energy is less than a second energy below the threshold frequency, the second energy being associated with the non-whispered speech; a first volume, wherein the first volume is less than a second volume by a threshold volume percentage, the second volume being associated with the non-whispered speech .…”]
Scott/Kang and Raitio pertain to voice operated personal digital assistants and it would have been obvious to combine the whisper feature of Raitio which provides a whispered output when the input command/query is detected to be whispered in with the system of Scott which modulates the volume of output response according to the context of the user including user’s distance from the device and presence of other people in the vicinity to provide the system with one more item of context and create a system that considers the volume of input voice as a clue to the intent of the user.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Claims 19 and 20 have limitations similar to the limitations of Claims 9 and 10 but presented in a different order and with different language.  Summarized rejections are provided for these Claims.
Claim 19 is a method claim with limitations somewhat similar to the limitations of Claim 9 with modified language.
19.  The method of claim 18, further comprising:
detecting a speech volume based at least in part on an amplitude of an input audio signal; [Raitio teaches that in response to detecting the voice of the user as whispered input, the volume of the output is reduced.  See [0002], [0245] and claim 6.] [Volume being based on amplitude is taught by Shurtz.]
determining the at least one voice characteristic based on the speech volume; [Raitio teaches detecting whisper which is based on volume.]
detecting a distance between the target user and the voice-interaction device; [Scott teaches that the distance between the user and device is determined in order to determine the method of output as display on a GUI or by voice.  Figure 3, Abstract, [0058], [0066].]
detecting environmental noise; and [Scott considers the background noise as a factor in recognizing the input speech of the user:  [0049] and [0089].]
further modulating the output volume based on the distance between the target user and the voice-interaction device and the environmental noise. [Scott teaches that the output volume depends on and is modified according to the distance of the user and noise:  “[0057] … When the person is further away from the system, the system may also adjust the volume of sound output or the clarity of speech synthetization by increasing the overall pitch….”  “[0089] … Volume adjustments may also be made based on proximity and/or ambient noise levels….”]
Rationale for combining Scott/Kang and Raitio as provided for Claim 9.

Claim 20 is a method claim with limitations somewhat similar to the limitations of Claim 10 with adjustments in the language.
20.	The method of claim 19, further comprising:
determining that the voice is a whisper based on the at least one voice characteristic; and [Raitio detects whisper so it can lower the volume accordingly.  See [0002], [0003], [0245] and claim 6.]
lowering, by the context controller, the output volume to a corresponding whisper level; and [Raitio:  “[0002] The present disclosure relates generally to a digital assistant and, more specifically, to a digital assistant that is capable of detecting a whispered speech input and providing a whispered speech response.” ]
in response to detecting if the target user has moved beyond the distance away from the voice-interaction device, increasing, by the context controller, the output volume to project a voice output to the target user. [Scott responds to distance of the user as context and raises the volume when the user moves farther away.  See Figures 6-7 and “[0057] … When the person is further away from the system, the system may also adjust the volume of sound output or the clarity of speech synthetization by increasing the overall pitch….”]
Rationale for combining Scott/Kang and Raitio as provided for Claim 10.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARIBA SIRJANI whose telephone number is (571)270-1499. The examiner can normally be reached on 9 to 5, M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir can be reached on 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Fariba Sirjani/
Primary Examiner, Art Unit 2659