DETAILED ACTION
This Office Action is in response to the correspondence filed by the applicant on 2/15/2021.
The Amendment filed on 2/15/2021 has been entered.  
Claims 1, 4, 7, and 17-20 have been amended by Applicant.
Claims 2-3 and 14 have been cancelled by Applicant.
Claims 1, 4-13 and 15-20 remain pending in the application of which Claims 1, 19, and 20 are independent.  
Applicant’s amendments and arguments are considered but are either unpersuasive or moot in view of the new grounds of rejection that were necessitated by the amendments to the Claims.   

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority
Receipt is acknowledged of certified copies of papers submitted under 35 U.S.C. 119(a)-(d), which papers have been placed of record in the file.

Response to Arguments
Regarding the rejections under 35 U.S.C. 103, Applicant’s arguments, pages 8-13, have been fully considered, but they are not persuasive.
On pg. 9, Applicant asserts, “However, Engelke does not teach or suggest that the system operating characteristics also include a length of input from a caller”, and on pg. 11 Applicant asserts, “However, Bushey does not teach or suggest also interpreting the length of the input 

However, Examiner respectfully disagrees.  ENGELKE teaches how the duration of the received speech signal impacts the accuracy of the speech recognition (Par 194 – “…accurate for short voice signal durations (e.g., 15-30 seconds) after which accuracy becomes less reliable”), and further teaches controlling the output of the speech recognition results to reflect the duration of the received speech (Par 166 – “an arrow effect 732 represents a long annunciation period …”).
BUSHEY also teaches the limitations.  BUSHEY teaches measuring the utterance length of the user (Par 16 – “Next, as depicted at step 314, the caller input is evaluated for too much speech. If the caller input includes too much speech, such as a speech input exceeding a certain amount of time or number of recognizable phonemes, then processing continues at 340. At processing step 340, two cumulative error counters are incremented. The cumulative error counters represent one way in which confidence values can be assigned to the call.”), and controlling the output (Fig. 3A – “Please use fewer words”).  For at least the reasons above, Examiner maintains the rejections.  Please see the rejections below for more details.

Examiner Notes
MATSUBARA (US 2005/0021341 A1) also discloses a similar method/system that analyzes a voice input and measures different characteristics associated with speech recognition accuracy, including the length of the speech input (Pars 41-55 – “(1) the presence or absence of an attached word at the head of the input sentence … (2) Whether the sound level is high or low … (3) Whether the speaking time is long or short … (4) the speaking timing … (5) likelihood of the pattern … (6) the presence or absence of an attached word at the end of the input sentence”).  MATSUBARA further teaches controlling the output of the speech recognition based on the 


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.



Claims 1, 4-6, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over ENGELKE (US 2017/0206808 A1), and further in view of BUSHEY (US 2006/0115070 A1).

REGARDING Claim 1, ENGELKE discloses an information processing device, comprising: 
first circuitry (ENGELKE Fig. 24; Par 60 – “In other embodiments device 14 may include a computer, a smart phone, a smart tablet, etc., that can facilitate audio communications with other devices.”) configured to: 
acquire, from second circuitry, information associated with accuracy of a speech recognition process on sound information (ENGELKE Par 179 – “As another instance, an HU's device may independently assess the level of non-HU voice signal noise being picked up by an HU device microphone and, if the determined noise level exceeds some threshold value either by itself or in relation to the signal strength of the HU voice signal, may perform some function. … Here, the idea is that as the noise level increases, the likelihood of accurate AVR captioning will typically decrease and therefore more accurate and robust captioning options should be available.”), wherein the information associated with the accuracy of the speech recognition process includes an utterance volume of a user in the sound information (ENGELKE Par 166 – “In at least some embodiments of the present disclosure it is contemplated that volume changes, tone, length of annunciation, pitch, etc., of an HU's voice signal may be sensed by automated software and used to change the appearance of or otherwise visually distinguish transcribed text that is presented to an AU via a device display 18 so that the AU can more fully understand and participate in a richer communication session.”; Par 171 – “For instance, if an HU persistently talks in a volume that is much higher than typical for the HU, a volume indicator 717 may be presented or visually altered in some fashion to indicate the persistent volume. As another example, a volume indicator 715 may be presented above or otherwise spatially proximate any word annunciated with an unusually high volume.”; Par 179 – “As another instance, an HU's device may independently assess the level of non-HU voice signal noise being picked up by an HU device microphone and, if the determined noise level exceeds some threshold value either by itself or in relation to the signal strength of the HU voice signal, may perform some function.”), and utterance clarity of the user in the sound information (ENGELKE Par 164 – “The accuracy values would be provided via the AU device display 18 (see 728 in FIG. 22) and/or the CA workstation display 50. Where an HU device has a display (e.g., a smart phone, a tablet, etc.), the accuracy value(s) may be presented via that display in at least some cases. To this end, see the smart phone type HU device 800 in FIG. 24 where an accuracy rate is displayed at 802 for a call with an AU. It is expected that seeing a low accuracy value would encourage an HU to try to annunciate words more accurately or slowly to improve the value.”), and an utterance length of the user in the sound information (ENGELKE Par 166 – “In at least some embodiments of the present disclosure it is contemplated that volume changes, tone, length of annunciation, pitch, etc., of an HU's voice signal may be sensed by automated software and used to change the appearance of or otherwise visually distinguish transcribed text that is presented to an AU via a device display 18 so that the AU can more fully understand and participate in a richer communication session.”; Par 194 – “It has been recognized that some third party AVR systems available via the internet or the like tend to be extremely accurate for short voice signal durations (e.g., 15-30 seconds) after which accuracy becomes less reliable.”; Par 234 – “In this regard, a system processor receiving an HU voice signal ascertains whether or not the signal includes audio in a range that is typical for human speech during an HU turn and generates a duration of speech value equal to the number of seconds of speech received. Thus, for instance, in a ten second period corresponding to an HU voice signal turn, there may be 3 seconds of silence during which audio is not in the range of typical human speech and therefore the duration of speech value would be 7 seconds. In addition, the processor detects the quantity of captions being generated by an AVR engine. The processor automatically compares the quantity of captions from the AVR with the duration of speech value to ascertain if there is a problem with the AVR engine. Thus, for instance, if the quantity of AVR generated captions is substantially less than would be expected given the duration of speech value, a potential AVR problem may be identified.”); 
control a speech output mode of a result of the speech recognition process (ENGELKE Fig. 22; Par 166 – “In at least some embodiments of the present disclosure it is contemplated that volume changes, tone, length of annunciation, pitch, etc., of an HU's voice signal may be sensed by automated software and used to change the appearance of or otherwise visually distinguish transcribed text that is presented to an AU via a device display 18 so that the AU can more fully understand and participate in a richer communication session. To this end, see, for instance, the two textual effects 732 and 734 in AU device text 730 in FIG. 22 where an arrow effect 732 represents a long annunciation period while a bolded/italicized effect 734 represents an appreciable change in HU voice signal volume. Many other non-textual characteristics of an HU voice signal are contemplated and may be sensed and each may have a different appearance. For instance, pitch, speed of speaking, etc., may all be automatically determined and used to provide effect distinct visual queues along with the transcribed text.”), based on the utterance volume of the user (ENGELKE Fig. 22 Unit 734; Par 166 – “To this end, see, for instance, the two textual effects 732 and 734 in AU device text 730 in FIG. 22 where an arrow effect 732 represents a long annunciation period while a bolded/italicized effect 734 represents an appreciable change in HU voice signal volume.”; Par 179 – “As another instance, an HU's device may independently assess the level of non-HU voice signal noise being picked up by an HU device microphone and, if the determined noise level exceeds some threshold value either by itself or in relation to the signal strength of the HU voice signal, may perform some function.”), the utterance clarity of the user (ENGELKE Par 164 – “The accuracy values would be provided via the AU device display 18 (see 728 in FIG. 22) and/or the CA workstation display 50. Where an HU device has a display (e.g., a smart phone, a tablet, etc.), the accuracy value(s) may be presented via that display in at least some cases. To this end, see the smart phone type HU device 800 in FIG. 24 where an accuracy rate is displayed at 802 for a call with an AU. It is expected that seeing a low accuracy value would encourage an HU to try to annunciate words more accurately or slowly to improve the value.”), and the utterance length of the user (ENGELKE Fig. 22 Unit 734; Par 166 – “To this end, see, for instance, the two textual effects 732 and 734 in AU device text 730 in FIG. 22 where an arrow effect 732 represents a long annunciation period while a bolded/italicized effect 734 represents an appreciable change in HU voice signal volume.”), wherein the type of the result is a [prefix] visual representation (ENGELKE Figs. 22 and 24; Par 173 – “Here, an HU device may also allow an HU to switch back to automated text if an accuracy value 802 exceeds some threshold level. Where HU voice characteristics are detected, those characteristics may be used to visually distinguish text at 804 in at least some embodiments.”; Par 235 – “Here, the AU device that indicates a likely error (e.g., perhaps visually distinguished by a yellow highlight or the like).”);
control an output device to output the result of the speech recognition process (ENGELKE Fig. 24; Par 173 – “Referring again to FIG. 24, in at least some embodiments where an HU device 800 includes a display screen 801, an HU voice text transcription 804 may also be presented via the HU device. Here, an HU viewing the transcribed text could formulate an independent impression of transcription accuracy and whether or not a more robust transcription process (e.g., CA generation of text) is required or would be preferred.”); and
attach the prefix to the output of the result of the speech recognition process (ENGELKE Fig. 22; Par 166 – “In at least some embodiments of the present disclosure it is contemplated that volume changes, tone, length of annunciation, pitch, etc., of an HU's voice signal may be sensed by automated software and used to change the appearance of or otherwise visually distinguish transcribed text that is presented to an AU via a device display 18 so that the AU can more fully understand and participate in a richer communication session. To this end, see, for instance, the two textual effects 732 and 734 in AU device text 730 in FIG. 22 where an arrow effect 732 represents a long annunciation period while a bolded/italicized effect 734 represents an appreciable change in HU voice signal volume. Many other non-textual characteristics of an HU voice signal are contemplated and may be sensed and each may have a different appearance. For instance, pitch, speed of speaking, etc., may all be automatically determined and used to provide effect distinct visual queues along with the transcribed text.”; Fig. 24 – “Current AVR Accuracy: 92%; Line Quality 9/10”; Par 180 – “Here, the HU may present a line quality value as shown at 808 in FIG. 24 for the HU to consider. Similarly, an AU device may present a line quality signal (not illustrated) to the AU to be considered.”; Par 181 – “In some cases system devices may monitor a plurality of different system operating characteristics such as line quality, speaker phone use, non-voice noise level, voice volume level, voice signal pace, etc., and may present one or more “coaching” indications to any one of or a subset of the HU, CA and AU for consideration. 
Here, the coaching indications should help the parties to a call understand if there is something they can do to increase the level of captioning accuracy.”), wherein the prefix is attached based on a noise volume associated with the sound information (ENGELKE Par 179 – “Here, the idea is that as the noise level increases, the likelihood of accurate AVR captioning will typically decrease and therefore more accurate and robust captioning options should be available.”; Par 181 – “In some cases system devices may monitor a plurality of different system operating characteristics such as line quality, speaker phone use, non-voice noise level, voice volume level, voice signal pace, etc., and may present one or more “coaching” indications to any one of or a subset of the HU, CA and AU for consideration.”), and the noise volume is larger than a threshold value (ENGELKE Par 181 – “Here, the coaching indications should help the parties to a call understand if there is something they can do to increase the level of captioning accuracy. Here, in at least some cases only the most impactful coaching indications may be presented and different entities may receive different coaching indications. For instance, where noise at HU location exceeds a threshold level, a noise indicating signal may only be presented to the HU. Where the system also recognizes that line quality is only average, that indication may be presented to the AU and not to the HU while the HU's noise level remains high. If the HU moves to a quieter location, the noise level indication on the HU device may be replaced with a line quality indication. Thus, the coaching indications should help individual call entities recognize communication conditions that they can effect or that may be the cause of or may lead to poor captioning results for the AU.”).

ENGELKE does not explicitly teach the [square-bracketed] limitations.  In other words, ENGELKE teaches outputting characteristics of the received audio signal along with the recognition results via visual representations (e.g., Fig. 22 an arrow effect to indicate a long annunciation period, a bolded effect to indicate voice volume, a yellow highlight for erroneous captions due to unclarity).  Examiner reviewed the specification for the definition of “Prefix”. The 

BUSHEY discloses an information processing device, comprising: 
first circuitry (BUSHEY Fig. 1) configured to: 
acquire, from second circuitry, information associated with accuracy of a speech recognition process on sound information (BUSHEY Par 11 – “A caller could respond “I need to pay my bill.” Microprocessor 132 can interpret the intent or purpose of the call and assign a confidence value to the call based on a set of rules.”; Par 9 – “The microprocessor can recognize caller input, assign confidence values to the received input, and compare the threshold level to the confidence values.”), wherein the information associated with the accuracy of the speech recognition process includes (NOTE ENGELKE already teaches the limitations), and utterance clarity of the user in the sound information (BUSHEY Par 15 – “In other embodiments the call confidence level may be decremented depending on the magnitude of the call response inconsistency. The reduction in a confidence value may vary depending on the type of input received from a caller or the relative position in the call flow where the caller input was requested. Many mathematical formulas could be utilized to gauge caller success or confidence without departing from the scope of the present invention.”; Par 19 – “Where the confidence rating, at decision step 318, is determined to be a medium rating, then as illustrated by step 328 a voice prompt is initiated, such as the illustrated prompt, “I think you said,” followed by a computer filled-in version of what the system thinks the caller said.”), and an utterance length of the user in the sound information (BUSHEY Par 16 – “Next, as depicted at step 314, the caller input is evaluated for too much speech. If the caller input includes too much speech, such as a speech input exceeding a certain amount of time or number of recognizable phonemes, then processing continues at 340. At processing step 340, two cumulative error counters are incremented. The cumulative error counters represent one way in which confidence values can be assigned to the call.”); 
control a speech output mode of a result of the speech recognition process (BUSHEY Fig. 3B – “<Rec_Results>”) and a type of the result (BUSHEY Fig. 3B – “Confirmation: “I think you said …””), based on (NOTE ENGELKE already teaches the limitations), the utterance clarity of the user (BUSHEY Par 19 – “Where the confidence rating, at decision step 318, is determined to be a medium rating, then as illustrated by step 328 a voice prompt is initiated, such as the illustrated prompt, “I think you said,” followed by a computer filled-in version of what the system thinks the caller said. The system also provides a follow-up voice prompt such as, “Is that correct?” This voice prompting is illustrative of how the interactive voice response system may solicit further information and boost or modify a confidence level in a caller response or request. At this stage after the confirmation step 328, processing continues with step 338 and the dialogue with the caller continues and user input is again solicited at step 304.”), and the utterance length of the user (BUSHEY Par 16 – “Next, as depicted at step 314, the caller input is evaluated for too much speech. If the caller input includes too much speech, such as a speech input exceeding a certain amount of time or number of recognizable phonemes, then processing continues at 340. At processing step 340, two cumulative error counters are incremented. The cumulative error counters represent one way in which confidence values can be assigned to the call.”), wherein the type of the result is a [prefix] (BUSHEY Fig. 3B – “I think you said…<Rec_results> Is that Correct?”; Par 19 – “Where the confidence rating, at decision step 318, is determined to be a medium rating, then as illustrated by step 328 a voice prompt is initiated, such as the illustrated prompt, “I think you said,” followed by a computer filled-in version of what the system thinks the caller said.”); and 
control an output device to output the result of the speech recognition process (BUSHEY Fig. 3B – “Determine Confidence Rating 318” -> Low; -> Medium; -> High; Par 21 – “For an initial low confidence event, where the low confidence counter equals 1, as depicted by block 325, the call response system provides an additional prompt, such as, “I'm sorry, I didn't understand.””; Par 19 – “Referring to FIG. 3B, where the confidence rating of the response is determined to be a high confidence value, then processing is forwarded to step 338, and a dialogue with the caller is continued according to normal call support processing. New user input is detected at step 304. Where the confidence rating, at decision step 318, is determined to be a medium rating, then as illustrated by step 328 a voice prompt is initiated, such as the illustrated prompt, “I think you said,” followed by a computer filled-in version of what the system thinks the caller said.”); 
It would have been obvious to one of ordinary skill in the art at the time of filing to modify the method of ENGELKE to include outputting a confidence indication before outputting a speech recognition result, as taught by BUSHEY.
One of ordinary skill would have been motivated to include outputting a confidence indication before outputting a speech recognition result, in order to solicit further information and boost or modify confidence level in a caller response or request (BUSHEY Par 19).


REGARDING Claim 4, ENGELKE in view of BUSHEY discloses the information processing device according to claim 1.
ENGELKE further discloses wherein the information associated with the accuracy of the speech recognition process (ENGELKE Par 179 – “Here, the idea is that as the noise level increases, the likelihood of accurate AVR captioning will typically decrease and therefore more accurate and robust captioning options should be available.”) includes information associated with noise in utterance of the user (ENGELKE Par 181 – “In some cases system devices may monitor a plurality of different system operating characteristics such as line quality, speaker phone use, non-voice noise level, voice volume level, voice signal pace, etc., and may present one or more “coaching” indications to any one of or a subset of the HU, CA and AU for consideration.”), and the sound information includes the noise in the utterance of the user (ENGELKE Par 179 – “As another instance, an HU's device may independently assess the level of non-HU voice signal noise being picked up by an HU device microphone and, if the determined noise level exceeds some threshold value either by itself or in relation to the signal strength of the HU voice signal, may perform some function.”).

REGARDING Claim 5, ENGELKE in view of BUSHEY discloses the information processing device according to claim 4.
ENGELKE further discloses wherein the information associated with the noise includes the noise volume in the sound information (ENGELKE Par 181 – “In some cases system devices may monitor a plurality of different system operating characteristics such as line quality, speaker phone use, non-voice noise level, voice volume level, voice signal pace, etc., and may present one or more “coaching” indications to any one of or a subset of the HU, CA and AU for consideration. Here, the coaching indications should help the parties to a call understand if there is something they can do to increase the level of captioning accuracy. Here, in at least some cases only the most impactful coaching indications may be presented and different entities may receive different coaching indications. For instance, where noise at HU location exceeds a threshold level, a noise indicating signal may only be presented to the HU. Where the system also recognizes that line quality is only average, that indication may be presented to the AU and not to the HU while the HU's noise level remains high. If the HU moves to a quieter location, the noise level indication on the HU device may be replaced with a line quality indication. Thus, the coaching indications should help individual call entities recognize communication conditions that they can effect or that may be the cause of or may lead to poor captioning results for the AU.”).


REGARDING Claim 6, ENGELKE in view of BUSHEY discloses the information processing device according to claim 1, wherein the information associated with the accuracy of the speech recognition process includes a confidence level of the result of the speech recognition process (ENGELKE Par 82 – “For instance, server 30 may assign a confidence factor to each word in the automated text based on how confident the server is that the word has been accurately transcribed. The confidence factors over a most recent number of words (e.g., 100) or a most recent period (e.g., 45 seconds) may be averaged and the average used to assess an overall confidence factor for transcription accuracy.”; BUSHEY also teaches the limitations: Par 11 – “A caller could respond “I need to pay my bill.” Microprocessor 132 can interpret the intent or purpose of the call and assign a confidence value to the call based on a set of rules.”; Par 9 – “The microprocessor can recognize caller input, assign confidence values to the received input, and compare the threshold level to the confidence values.”).


REGARDING Claim 19, ENGELKE in view of BUSHEY discloses an information processing method comprising: performing the functions of Claim 1.  Thus, it is rejected under the same rationale.

REGARDING Claim 20, ENGELKE in view of BUSHEY discloses a non-transitory computer-readable medium having stored thereon computer-executable instructions that, when executed by a processor, cause the processor to execute operations, the operations comprising: performing the functions of Claim 1.  Thus, it is rejected under the same rationale.










Claims 7-8 are rejected under 35 U.S.C. 103 as being unpatentable over ENGELKE in view of BUSHEY, and further in view of BASYE (US 2016/0379638 A1).


REGARDING Claim 7, ENGELKE in view of BUSHEY discloses the information processing device according to claim 6.
ENGELKE in view of BUSHEY does not explicitly teach the rest of the claim limitations.

BASYE discloses a method/system for speech recognition and speech synthesis, wherein the first circuitry is further configured to control the speech output mode based on information associated with the user (BASYE Par 16 – “Offered is a system and method for detecting a speech quality of an utterance using one or more paralinguistic features, for example tone or pitch of voice, whether speech is whining, angry, pleading, etc. The system may then respond to the utterance in a manner that corresponds to the speech quality. For example, when a user whispers a command to a device, the device will not only perform ASR on the command, it will also detect that the command was spoken in whisper. … For example, a spoken command of “play some music” may be interpreted different by a system if spoken in a scream (which may cause the system to play loud music) than if spoke in a whisper (which may cause the system to play softer music). Other embodiments are also possible.”).
It would have been obvious to one of ordinary skill in the art at the time of filing to modify the method of ENGELKE in view of BUSHEY to include controlling the speech output mode based on the information related to a user, as taught by BASYE.
One of ordinary skill would have been motivated to include controlling the speech output mode based on the information related to a user, in order to improve a user experience for human-device interactions (BASYE Par 14).

REGARDING Claim 8, ENGELKE in view of BUSHEY and BASYE discloses the information processing device according to claim 7.
BASYE further discloses wherein the information associated with the user (BASYE Par 16 – “Offered is a system and method for detecting a speech quality of an utterance using one or more paralinguistic features, for example tone or pitch of voice, whether speech is whining, angry, pleading, etc. The system may then respond to the utterance in a manner that corresponds to the speech quality. For example, when a user whispers a command to a device, the device will not only perform ASR on the command, it will also detect that the command was spoken in whisper. … For example, a spoken command of “play some music” may be interpreted different by a system if spoken in a scream (which may cause the system to play loud music) than if spoke in a whisper (which may cause the system to play softer music). Other embodiments are also possible.”) includes at least one of behavior information of the user, posture information of the user, setting information of the user, environmental information around the user (BASYE Par 49 – “The present system is actually configured to detect speech quality/qualities and determine a label corresponding to the detected qualities that may be applied to an utterance in the speech and used for later processing. The speech quality may be based on paralinguistic metrics that describe some quality/feature other than the background audio/noises, distance between the user and a device, etc.”), biometric information of the user, or emotion information of the user (BASYE Par 65 – “Specifically, a static output of a spoken reprimand may be output in response to speech of a certain quality. 
Such as “stop whining” if the speech quality detector 220 determines input speech to be whined, “no need to shout” if the speech quality detector 220 determines input speech to be shouted, “ask nicely” if the speech quality detector 220 detects angry speech, “do you need help” if the speech quality detector 220 determines input speech to be in distress, or other examples. The static output may also be selected based on an indication that ASR or NLU processing failed. For example, if the speech quality detector 220 detects the speech to be whispered and ASR and/or NLU processing failed, the system may output a static response of “please do not whisper, I did not understand.””).
It would have been obvious to one of ordinary skill in the art at the time of filing to modify the method of ENGELKE in view of BUSHEY to include controlling the speech output mode based on the information related to a user, as taught by BASYE.
One of ordinary skill would have been motivated to include controlling the speech output mode based on the information related to a user, in order to improve a user experience for human-device interactions (BASYE Par 14).



Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over ENGELKE in view of BUSHEY, and further in view of SHIN (US 2017/0083281 A1).

REGARDING Claim 9, ENGELKE in view of BUSHEY discloses the information processing device according to claim 1.
ENGELKE in view of BUSHEY is silent as to the rest of the claim limitations.
SHIN discloses a method/system for interaction between a human and a machine using speech recognition, wherein the information associated with the accuracy of the speech recognition process includes an amount of the result of the speech recognition process (SHIN Par 128 – “Furthermore, for example, if a designated fifth keyword is included in the speech of the user, the processor 120 may increase an output amount of information of the content. If a designated sixth keyword is included, the processor 120 may decrease the output amount of information of the content.”; Table 9 – “Fifth keyword: minutely, deeply and the like: Increase output amount of information of content; Sixth keyword: briefly, concisely, and the like: Decrease output amount of information of content.”; Par 129 – “Table 9 is an example in which the first to sixth keywords correspond to various content output schemes. Through the keywords as described above, a user who provides a voice input with respect to the electronic device 101 may be provided with corresponding content as sound with an output scheme according with the intention.”).
It would have been obvious to one of ordinary skill in the art at the time of filing to modify the method of ENGELKE in view of BUSHEY to include an amount of the result of the speech recognition processing, as taught by SHIN.
One of ordinary skill would have been motivated to include an amount of the result of the speech recognition processing, in order to provide corresponding content in an output scheme that is most appropriate for a condition of a user who performs a voice input (SHIN Par 140).




Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over ENGELKE in view of BUSHEY, and further in view of BROWN (US 2015/0185996 A1).

REGARDING Claim 10, ENGELKE in view of BUSHEY discloses the information processing device according to claim 1.
ENGELKE in view of BUSHEY is silent to the rest of the claim limitations.
BROWN discloses a method/system for interaction between a user and a virtual assistant using speech recognition, wherein the first circuitry is further configured to control the speech output mode (BROWN Fig. 10; Pars 99-101 – “An audible manner of output—how a virtual assistant speaks to a user. This may include an accent of the virtual assistant (e.g., English, Australian, etc.), a fluctuation in the virtual assistants speech (e.g., pronouncing a first word of a sentence different than other words of a sentence), how fast words are spoken, and so on. … A language in which a virtual assistant communicates (e.g., Spanish, German, French, English, etc.). This may include a language that is understood by the virtual assistant and/or a language that is spoken or otherwise used to output information by the virtual assistant.
A personality—how a virtual assistant responds to a user. For example, a virtual assistant may act cheerful (e.g., uses predetermined positive language, speaks in a predetermined upbeat tone, etc.), angry (e.g., speaks above a volume threshold, accents particular words, etc.), depressed (e.g., speaks below a word velocity threshold), and so on. In one instance, a virtual assistant may be configured to emulate or mimic how a user interacts with the virtual assistant (e.g., if the user talks fast, the virtual may speak fast; if the user uses text to input, the virtual assistant may output responses in text; etc.).”; Par 138 – “FIG. 10 illustrates an example virtual assistant customization interface 1000 for enabling end-users to configure characteristics of a virtual assistant.  … a drop-down menu 1012 to select a output triggering condition (e.g., present a sports virtual assistant anytime a particular basketball team is playing, present a flight virtual assistant upon arrival at a particular location, etc.), and a drop-down menu 1014 to specify that a particular word corresponds to a word specified in input field 1016 (e.g., specify that “basketball” and “hoops” mean the same thing).”; In other words, each virtual assistant has its own characteristics for communicating with a user.) based on a type of content associated with a usage of the result of the speech recognition process (BROWN Fig. 11; Par 24 – “The different virtual assistants may adapt to different contexts (e.g., conversation context, location of the user, content that is output, calendar events, etc.). The virtual assistants may additionally, or alternatively, interact with each other to carry out tasks for the users, which may be illustrated in conversation user interfaces.”; Par 140 – “FIG. 11 illustrates an example conversation user interface 1100 where a virtual assistant is switched based on user input. 
In this example, an executive assistant virtual assistant is initiated when the conversation user interface 1100 is opened, as illustrated by a conversation item 1102. Here, the user requests “Do I need to pay any bills?,” as illustrated by a conversation item 1104. Based on this information, the executive assistant virtual assistant may determine that a finance virtual assistant is needed, since the executive assistant virtual assistant may not have access to any bill information. Accordingly, the conversation is turned over to the finance virtual assistant, as illustrated by an icon 1106 that indicates that a change in virtual assistants was made. The finance virtual assistant may then answer the question, as illustrated by an icon 1108.”; In other words, based on a type of content (pay any bills? ->a finance related content), a different virtual assistant is invoked. The invoked assistant has its own unique characteristics for interaction.).
It would have been obvious to one of ordinary skill in the art at the time of filing to modify the method of ENGELKE in view of BUSHEY to include a type of content, as taught by BROWN.
One of ordinary skill would have been motivated to include a type of content, in order to efficiently perform a task (BROWN Par 182).



Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over ENGELKE in view of BUSHEY, and further in view of LYREN (US 2014/0359439 A1).

REGARDING Claim 11, ENGELKE in view of BUSHEY discloses the information processing device according to claim 1.
ENGELKE in view of BUSHEY is silent to the rest of the claim limitations.
LYREN discloses a method/system for interaction between a human and a machine using speech recognition, wherein the first circuitry is further configured to control the speech output mode based on an execution frequency of the speech recognition process (LYREN Par 62 – “These expressions are modeled as preferences that are compared to preferences of the user agent. The preferences of the user agent are changed or adjusted to match the preferences of the user. Verbal and nonverbal communication preferences with different emotions of the user agent match the verbal and nonverbal communication preferences with different emotions of the user. This adjusting can occur in real-time while the user interacts with the user agent. Over time, a personality of the user agent more closely matches a personality of the user since verbal and nonverbal communication preferences of the user agent are continuously, continually, or periodically changed to match the verbal and nonverbal communication preferences of the user.”).
It would have been obvious to one of ordinary skill in the art at the time of filing to modify the method of ENGELKE in view of BUSHEY to include a basis of execution frequency, as taught by LYREN.
One of ordinary skill would have been motivated to include a basis of execution frequency, in order to improve performance of the user agent (LYREN Par 157).



Claims 12, 13, and 15 are rejected under 35 U.S.C. 103 as being unpatentable over ENGELKE in view of BUSHEY, and further in view of BEN-DAVID (US 2011/0313762 A1).

REGARDING Claim 12, ENGELKE in view of BUSHEY discloses the information processing device according to claim 1.
ENGELKE in view of BUSHEY is silent to the rest of the claim limitations.

BEN-DAVID discloses a method/system for speech recognition with displaying a confidence level indication, wherein the speech output mode includes a speech output speed of the result of the speech recognition process (BEN-DAVID Par 22 – “In a second embodiment, the marking may be provided by modifying speech synthesized from text by altering one or more parameters of the synthesized speech proportionally to the confidence value. Such marking might be performed by expressive TTS, which would modify the synthesized speech to sound less or more confident. Such effects may be achieved by the TTS system, by modifying parameters like volume, pitch, speech rhythm, speech spectrum etc. or by using a voice dataset recorded with different levels of confidence.”; Par 56 – “In the embodiment shown in FIG. 2A, the confidence indicating component 230 is provided as part of the TTS engine 210. The text to be synthesized may contain in addition to the text itself, mark-up which contains hints to the engine 210 on how to synthesize the speech. Samples of such mark-ups include volume, pitch, and speed or prosody envelope.  … Alternatively, the expressive TTS engine 210 may have preset configurations for different confidence levels, or use different voice data sets for each confidence level. The mark-ups can then just indicate the confidence level of the utterance (e.g. low confidence/high confidence).”).
It would have been obvious to one of ordinary skill in the art at the time of filing to modify the method of ENGELKE in view of BUSHEY to include a speech output speed, as taught by BEN-DAVID.

One of ordinary skill would have been motivated to include a speech output speed, in order to allow a user to distinguish words with a low confidence (BEN-DAVID Par 26).


REGARDING Claim 13, ENGELKE in view of BUSHEY discloses the information processing device according to claim 1.
ENGELKE in view of BUSHEY is silent to the rest of the claim limitations.

BEN-DAVID discloses a method/system for speech recognition with displaying a confidence level indication, wherein the speech output mode includes magnitude of speech output of the result of the speech recognition process (BEN-DAVID Par 22 – “In a second embodiment, the marking may be provided by modifying speech synthesized from text by altering one or more parameters of the synthesized speech proportionally to the confidence value. Such marking might be performed by expressive TTS, which would modify the synthesized speech to sound less or more confident. Such effects may be achieved by the TTS system, by modifying parameters like volume, pitch, speech rhythm, speech spectrum etc. or by using a voice dataset recorded with different levels of confidence.”; Par 56 – “In the embodiment shown in FIG. 2A, the confidence indicating component 230 is provided as part of the TTS engine 210. The text to be synthesized may contain in addition to the text itself, mark-up which contains hints to the engine 210 on how to synthesize the speech. Samples of such mark-ups include volume, pitch, and speed or prosody envelope.  … Alternatively, the expressive TTS engine 210 may have preset configurations for different confidence levels, or use different voice data sets for each confidence level. The mark-ups can then just indicate the confidence level of the utterance (e.g. low confidence/high confidence).”).
It would have been obvious to one of ordinary skill in the art at the time of filing to modify the method of ENGELKE in view of BUSHEY to include a speech magnitude, as taught by BEN-DAVID.
One of ordinary skill would have been motivated to include a speech magnitude, in order to allow a user to distinguish words with a low confidence (BEN-DAVID Par 26).

REGARDING Claim 15, ENGELKE in view of BUSHEY discloses the information processing device according to claim 1.
ENGELKE in view of BUSHEY is silent to the rest of the claim limitations.

 BEN-DAVID discloses a method/system for speech recognition with displaying a confidence level indication, wherein the speech output mode includes sound quality of the result of the speech recognition process (BEN-DAVID Par 23 – “In a third embodiment, the speech output may be synthesized speech with post synthesis effects, such as additive noise, added to indicate confidence values in the speech output.”; Pars 47-49 – “The speech output 202 may be modified in one or more of the following audio methods: … Additive noise whose intensity is inversely proportional to the confidence of the speech may be used;”).
It would have been obvious to one of ordinary skill in the art at the time of filing to modify the method of ENGELKE in view of BUSHEY to include sound quality, as taught by BEN-DAVID.
One of ordinary skill would have been motivated to include sound quality, in order to allow a user to distinguish words with a low confidence (BEN-DAVID Par 26).




Claims 16-18 are rejected under 35 U.S.C. 103 as being unpatentable over ENGELKE in view of BUSHEY, and further in view of YI (US 2014/0303971 A1).

REGARDING Claim 16, ENGELKE in view of BUSHEY discloses the information processing device according to claim 1.
ENGELKE in view of BUSHEY does not explicitly teach the rest of the claim limitations.

YI discloses a method/system for speech recognition, wherein the first circuitry is further configured to prevent the output of the result of the speech recognition process in a form of speech based on a specific condition is satisfied (YI Fig. 5B – “Restrict output of audible data” and “Output Audible and Visible Data”; Par 168 – “FIG. 5B is a flowchart illustrating a method of analyzing an attribute of a voice in accordance with one exemplary embodiment. Hereinafter, description will be given of a method of controlling an output of audible data according to an attribute of a voice with reference to FIGS. 5A and 5B.”; Par 174 – “When the input user's voice is sorted as the whispering by the analyzing unit, the controller 180 may control the audio output module 153 to restrict the output of the audible data (S514′).”; Par 181 – “The controller 180 may restrict the output of the audible data when the volume is below the reference volume (S514′).”; Par 184 – “When the microphone 122 is sensed to be located close to the user, the controller 180 may control the audio output module 153 to restrict the output of the audible data (S514′). Also, when the user is not sensed adjacent to the microphone 122, the controller 180 may control the output unit to output the audible data and the visible data (S514″).”; Par 185 – “That is, the user's intention may be recognized more accurately by way of the relative positions of the user and the mobile terminal as well as the attribute of the voice.”).
It would have been obvious to one of ordinary skill in the art at the time of filing to modify the method of ENGELKE in view of BUSHEY to include restricting the output of the audible data based on a condition, as taught by YI.
One of ordinary skill would have been motivated to include restricting the output of the audible data based on a condition, in order to allow a user to recognize the operation more conveniently (YI Par 31).


REGARDING Claim 17, ENGELKE in view of BUSHEY and YI discloses the information processing device according to claim 16.
YI further discloses wherein the specific condition includes at least one of a condition associated with a line of sight of the user, a condition associated with a position of the user, a display size of the result of the speech recognition process, or a condition associated with a confidence level of the result of the speech recognition process (YI Par 8 – “Here, even under the condition that the output of the audible data has to be restricted (limited) according to statuses of the user and the mobile terminal, such data may disadvantageously be output in the same manner unless the user separately controls it.”; Fig. 5B – “Restrict output of audible data” and “Output Audible and Visible Data”; Par 184 – “When the microphone 122 is sensed to be located close to the user, the controller 180 may control the audio output module 153 to restrict the output of the audible data (S514′). Also, when the user is not sensed adjacent to the microphone 122, the controller 180 may control the output unit to output the audible data and the visible data (S514″).”; Par 185 – “That is, the user's intention may be recognized more accurately by way of the relative positions of the user and the mobile terminal as well as the attribute of the voice.”).
It would have been obvious to one of ordinary skill in the art at the time of filing to modify the method of ENGELKE in view of BUSHEY to include restricting the output of the audible data based on a condition, as taught by YI.
One of ordinary skill would have been motivated to include restricting the output of the audible data based on a condition, in order to allow a user to recognize the operation more conveniently (YI Par 31).

REGARDING Claim 18, ENGELKE in view of BUSHEY and YI discloses the information processing device according to claim 16.
YI further discloses wherein the specific condition includes at least one of a condition in which an operation to instruct the speech recognition process to be reactivated by a user is input or a condition in which an operation to instruct the result of the speech recognition process to be transmitted is input (YI Fig. 6A; Par 191 – “Hereinafter, description will be given of a control method of restricting an output of audible data in response to a movement of a mobile terminal sensed by the sensing unit, with reference to FIG. 6A. In accordance with this exemplary embodiment, the controller 180 may activate the sensing unit 140 when a voice recognition mode is activated or a voice is input through the microphone 122.”; Par 195 – “FIG. 6A illustrates the control method employed to a case where the voice is input after the voice recognition mode is activated and the rotation of the mobile terminal is sensed. The controller 180 may control the output unit 150 to output visible data with restricting the output of audible data.”).
It would have been obvious to one of ordinary skill in the art at the time of filing to modify the method of ENGELKE in view of BUSHEY to include a condition in which ASR activation is input, as taught by YI.


Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JONATHAN C. KIM whose telephone number is (571)272-3327.  The examiner can normally be reached Monday to Friday, 9:00 AM through 5:30 PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir, can be reached at 571-272-7799.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.







/JONATHAN C KIM/
Primary Examiner, Art Unit 2659