DETAILED ACTION

Notice of Pre-AIA or AIA Status
1.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
2.	The information disclosure statement (IDS) submitted on 05/06/2020 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Priority Acknowledgment
3.	Acknowledgment is made of applicant’s claim for foreign priority under 35 U.S.C. 119(a)-(d). The certified copy has been filed in Application 201811094286.9, filed on 09/19/2018 in the China Intellectual Property Office.

Claim Objections
4.	Claim 9 is objected to because of the following informalities: typographical error. Claim 9 recites the limitation of “the sound collecting device” in line 2. “the sound collecting device” should be changed to “a sound collecting device”. Appropriate correction is required.
	Claim 9 recites “the processor” in line 16. “the processor” should be changed to “a processor”. Appropriate correction is required.
	Claim 9 recites “the sound playback device” in lines 17, 18. “the sound playback device” should be changed to “a sound playback device”. Appropriate correction is required.

Claim Interpretations
5.	 The following is a quotation of 35 U.S.C. 112(f): 
(f) ELEMENT IN CLAIM FOR A COMBINATION.—An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

6.	The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked.
	As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph:
(A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as "configured to" or "so that"; and 
(C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function.
	Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function.
	Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material, or acts to entirely perform the recited function.
	Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
	This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder “module” that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.
Such claim limitation(s) is/are: 
 	“an end point detecting module configured to collect a sound in an environment through the sound collecting device in response to a translation task being triggered, and detect whether a user starts speaking based on the collected sound; 
a recognition module configured to enter a voice recognition state in response to detecting the user having started speaking, extract a user voice for the collected sound, determine a source language used by the user based on the extracted user voice, and determining a target language associated with the source language based on a preset language pair; 
 a tail point detecting module configured to detect whether the user has stopped speaking for more than a preset delay duration, and exit the voice recognition state in response to detecting the user having stopped speaking for more than the preset delay duration;
 a translation and voice synthesizing module configured to convert the user voice extracted in the voice recognition state into a target voice of the target language through the processor; and 
 a playback module configured to play the target voice through the sound playback device and trigger the end point detecting module to execute the step of detecting whether the user starts speaking based on the collected sound.” as recited in Claim 9.

	A review of the specification shows that the following appears to be the corresponding structure described in the specification for the 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph limitation: a processor executing portions of software code so as to act structurally as each “module” of the claimed invention, as noted in paragraph [0026] (“the sound collecting device can be, for example, a microphone”), in paragraph [0082] (“extracting the user voice from the collected sound through the processor”), in paragraph [0091] (“convert the user voice extracted in the voice recognition state into a target voice of the target language through the processor”), and in paragraph [00116] (“The translation apparatus described in this embodiment includes a sound collecting device 601, a sound playback device 602, a storage 603, a processor 604, and a computer program stored in the storage 603 and executable in the processor 604.”). Thus, “an end point detecting module” is interpreted as “a microphone”, and each of the other modules in claim 9 is interpreted as a computer processor.
	If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.
For more information, see MPEP § 2173 et seq. and Supplementary Examination Guidelines for Determining Compliance With 35 U.S.C. 112 and for Treatment of Related Issues in Patent Applications, 76 FR 7162, 7167 (Feb. 9, 2011).

Claim Rejections - 35 USC § 103
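7.	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
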
8.	Claims 1, 3, 4, 9, and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Cuthbert et al. (US 2015/0134322 A1) in view of Smus et al. (US 2019/0095430 A1).

	With respect to Claim 1, Cuthbert et al. disclose 
A speech translation method for a speech translation apparatus, wherein the translation apparatus comprising a processor, a sound collecting device electrically coupled to the processor, and a sound playback device electrically coupled to the processor (Cuthbert et al. [0023] a microphone 30, [0038] the speaker icon 50 has an animated outline to create a visual indication that the language translation application is preparing to output a spoken translation from a speaker of the user device 10, [0092] a programmable processor, a computer, or multiple processors or computers), wherein the method comprises: 
 	collecting a sound in an environment through the sound collecting device in response to a translation task being triggered (Cuthbert et al. [0064] the user may use a gesture (e.g., shaking the user device 10) to initiate the process of turning on the microphone), and detecting whether a user starts speaking based on the collected sound through the processor (Cuthbert et al. [0028] When the primary user begins speaking, the user device 10 receives the primary user’s speech and convert that speech into audio data. The user device 10 encodes the speech into an audio signal, which may be, for example, a snippet of relatively high quality audio, such as 16 kHz lossless audio and initiates speech recognition, [0072] Upon receiving a voice input signal from the primary user, the user interface transitions to the voice input state 1708 (state C), in which the language translation application is receiving a voice signal and performing speech recognition);
 	entering a voice recognition state in response to detecting the user having started speaking, extracting a user voice from the collected sound through the processor (Cuthbert et al. [0028] When the primary user begins speaking, the user device 10 receives the primary user’s speech and convert that speech into audio data. The user device 10 encodes the speech into an audio signal, which may be, for example, a snippet of relatively high quality audio, such as 16kHz lossless audio and initiates speech recognition as described below, [0029] Speech recognition involves converting audio data into text representing speech, [0072] Upon receiving a voice input signal from the primary user, the user interface transitions to the voice input state 1708 (state C), in which the language translation application is receiving a voice signal and performing speech recognition), and determining a target language associated with the source language based on a preset language pair (Cuthbert et al. Fig. 2 elements 20 and 40); 
 	exiting the voice recognition state in response to detecting the user having stopped speaking for more than a preset delay duration (Cuthbert et al. [0036] The language translation application may automatically identify endpoints in voice input, [0073] Upon receiving input complete signal (i.e., the language translation application detects a speech endpoint and/or the primary user manually indicates the end of the voice input), the user interface transitions to the prepare translation state 1710 (state D), [0075] if the primary user begins speaking again, the language translation application could return to state 1704 when the delay is less than the predetermined period of time, otherwise the language translation application would go to state 1716), and converting the user voice extracted in the voice recognition state into a target voice of the target language through the processor (Cuthbert et al. [0074] Upon receiving a translation ready signal (e.g., the language translation application receives or generates an audio signal corresponding to a translation of the primary user’s speech into the target language), the user interface transitions to the output translation state 1712 (state E), in which the language translation application is outputting a spoken translation of the primary user’s speech); and
 	 playing the target voice through the sound playback device (Cuthbert et al. Fig. 17 element 1712 Output translation), and returning to the step of detecting whether the user starts speaking based on the collected sound through the processor until the translation task ends (Cuthbert et al. [0017] a primary user (e.g., the owner of a user device) wants to communicates with a participating user who speaks a language different than the language of the primary user... The primary user then begins speaking in a source language (e.g., the primary user’s native language). When the primary user is finished speaking, the language translating application begins to obtain a translation of the primary user’s speech into a target language...This process may repeat for the duration of the exchange between the primary user and the participating user.)
	Cuthbert et al. fail to explicitly teach 
	determining a source language used by the user based on the extracted user voice,
	However, Smus et al. teach
 	determining a source language used by the user based on the extracted user voice (Smus et al. [0060] Each user 105 may have certain particular audio characteristics to his/her speech, such as a particular intonation, frequency, timbre, and/or inflection. Such audio characteristics may be easily detected from an audio signal representation of speech and leveraged to identify the source language of the speech. For example only, a user 105 of the computing device 110 may have a user profile in which her/his preferred language and audio characteristics of speech are stored. Accordingly, when the computing device 110 receives a speech input from the user 105, the audio characteristics can be detected and matched to the user 105 in order to identify the source language of the speech as the preferred language of the user 105),
	Cuthbert et al. and Smus et al. are analogous art because they are from a similar field of endeavor, namely signal processing techniques and applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the steps of speech translation as taught by Cuthbert et al., using the teaching of detecting the audio characteristics of the user as taught by Smus et al., for the benefit of identifying the source language of the user (Smus et al. [0060] Each user 105 may have certain particular audio characteristics to his/her speech, such as a particular intonation, frequency, timbre, and/or inflection. Such audio characteristics may be easily detected from an audio signal representation of speech and leveraged to identify the source language of the speech. For example only, a user 105 of the computing device 110 may have a user profile in which her/his preferred language and audio characteristics of speech are stored. Accordingly, when the computing device 110 receives a speech input from the user 105, the audio characteristics can be detected and matched to the user 105 in order to identify the source language of the speech as the preferred language of the user 105.)
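
For orientation only, the control flow recited in Claim 1 can be sketched as a two-state loop. This is a minimal illustration under stated assumptions and is not taken from the application, Cuthbert et al., or Smus et al.: the frame format, the function `run_translation_task`, the `LANGUAGE_PAIR` table, and the callables passed in are all hypothetical. For brevity the sketch determines the source language when the recognition state is exited.

```python
from enum import Enum, auto
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical preset language pair: each language maps to the other one.
LANGUAGE_PAIR: Dict[str, str] = {"zh": "en", "en": "zh"}


class State(Enum):
    WAITING = auto()      # collecting sound, watching for the start of speech
    RECOGNIZING = auto()  # voice recognition state


def run_translation_task(
    frames: Iterable[Tuple[bool, str]],
    frame_ms: int,
    preset_delay_ms: int,
    identify_language: Callable[[List[str]], str],
    translate_and_synthesize: Callable[[List[str], str, str], str],
    play: Callable[[str], None],
) -> None:
    """Each frame is (is_speech, payload). A real device would receive PCM
    audio and run a voice activity detector to obtain the is_speech flag."""
    state = State.WAITING
    voiced: List[str] = []
    silence_ms = 0
    for is_speech, payload in frames:
        if state is State.WAITING:
            if is_speech:
                # The user started speaking: enter the recognition state.
                state = State.RECOGNIZING
                voiced = [payload]
                silence_ms = 0
        elif is_speech:
            voiced.append(payload)
            silence_ms = 0
        else:
            silence_ms += frame_ms
            if silence_ms > preset_delay_ms:
                # Silence exceeded the preset delay: exit recognition,
                # translate, play, then return to waiting for speech.
                source = identify_language(voiced)
                target = LANGUAGE_PAIR[source]
                play(translate_and_synthesize(voiced, source, target))
                state = State.WAITING
    # Frames exhausted: the translation task has ended.
```

With 100 ms frames and a 200 ms preset delay, the third consecutive silent frame is the first to exceed the delay, so playback is triggered there and the loop resumes waiting for the next utterance.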

 	With respect to Claim 3, Cuthbert et al. in view of Smus et al. teach
(Cuthbert et al. Fig. 2 elements 20 and 40; Smus et al. [0066] the machine translation can be obtained from a machine translation model. When translating between languages, when the source language of the input audio signal is determined, the target language(s) into which the audio signal is to be translated can comprise the other languages previously utilized.)

 	With respect to Claim 4, Cuthbert et al. in view of Smus et al. teach
	wherein the translation apparatus further comprises a display screen electrically coupled to the processor (Cuthbert et al. Figs. 4, 5, 6), after the steps of entering the voice recognition state in response to detecting the user having started speaking and extracting the user voice from the collected sound through the processor (Cuthbert et al. [0028] When the primary user begins speaking, the user device 10 receives the primary user’s speech and convert that speech into audio data. The user device 10 encodes the speech into an audio signal, which may be, for example, a snippet of relatively high quality audio, such as 16 kHz lossless audio and initiates speech recognition) further comprises: 
 	converting the extracted user voice into a corresponding first text and displaying the first text on the display screen (Cuthbert et al. [0018] capturing speech input, converting speech to text, translating text, displaying the translated text (or partial translation text)); 
 	the steps of exiting the voice recognition state in response to detecting the user having stopped speaking for more than the preset delay duration (Cuthbert et al. [0036] The language translation application may automatically identify endpoints in voice input, [0073] Upon receiving input complete signal (i.e., the language translation application detects a speech endpoint and/or the primary user manually indicates the end of the voice input), the user interface transitions to the prepare translation state 1710 (state D), [0075] if the primary user begins speaking again, the language translation application could return to state 1704 when the delay is less than the predetermined period of time, otherwise the language translation application would go to state 1716) and converting the user voice extracted in the voice recognition state into the target voice of the target language through the processor (Cuthbert et al. [0074] Upon receiving a translation ready signal (e.g., the language translation application receives or generates an audio signal corresponding to a translation of the primary user’s speech into the target language), the user interface transitions to the output translation state 1712 (state E), in which the language translation application is outputting a spoken translation of the primary user’s speech) specifically comprises: 
	exiting the voice recognition state in response to detecting the user having stopped speaking for more than the preset delay duration (Cuthbert et al. [0036] The language translation application may automatically identify endpoints in voice input, [0073] Upon receiving input complete signal (i.e., the language translation application detects a speech endpoint and/or the primary user manually indicates the end of the voice input), the user interface transitions to the prepare translation state 1710 (state D), [0075] if the primary user begins speaking again, the language translation application could return to state 1704 when the delay is less than the predetermined period of time, otherwise the language translation application would go to state 1716), translating the first text into a second text of the target language through the processor, and displaying the second text on the display screen (Cuthbert et al. Figs 4, 5, 6, [0047] As illustrated in FIG. 8, the user device 10 displays a sample user interface 800 for receiving voice input from the participating user. The user interface 800 may be displayed, for example, when the language translation application has completed initializing a microphone and is ready to receive voice input in the target language from the participating user. The sample user interface includes a prompt “habla ahora” 810 that indicates that the language translation application is waiting for speech input from the primary user. The prompt 810 is displayed in the lower portion of the user interface 800 that includes the textual translation of the primary user's speech into the target language); 
converting the second text into the target voice through a speech synthesis system (Cuthbert et al. [0042] In the example illustrated in FIG. 6, the audio generator would read the text file generated by the language translator, and use the Spanish-language text to generate audio data that can be played to generate Spanish speech corresponding to the text. The audio data may be generated with one or more indicators to synthesize speech having accent or gender characteristics, Smus et al. [0046] The machine translated text can be processed by text-to-speech model, which outputs an audio representation of the machine translation.)
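
As a reading aid, the ordering of the Claim 4 steps (first text displayed, second text displayed, then speech synthesis) can be sketched as below. This is an assumption-laden illustration only: `speak_with_captions` and its callable parameters are hypothetical names, and nothing here reflects the actual implementation in the application or in either reference.

```python
from typing import Callable


def speak_with_captions(
    user_voice: bytes,
    recognize: Callable[[bytes], str],    # hypothetical speech-to-text step
    translate: Callable[[str], str],      # hypothetical text translation step
    synthesize: Callable[[str], bytes],   # hypothetical text-to-speech step
    display: Callable[[str], None],       # hypothetical display-screen output
) -> bytes:
    # Convert the extracted user voice into the first text and show it.
    first_text = recognize(user_voice)
    display(first_text)
    # Translate into the second text of the target language and show it.
    second_text = translate(first_text)
    display(second_text)
    # Convert the second text into the target voice.
    return synthesize(second_text)
```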

With respect to Claim 9, Cuthbert et al. disclose

	an end point detecting module configured to collect a sound in an environment through the sound collecting device in response to a translation task being triggered (Cuthbert et al. [0064] the user may use a gesture (e.g., shaking the user device 10) to initiate the process of turning on the microphone), and detect whether a user starts speaking based on the collected sound (Cuthbert et al. [0072] Upon receiving a voice input signal from the primary user, the user interface transitions to the voice input state 1708 (state C), in which the language translation application is receiving a voice signal and performing speech recognition); 
a recognition module configured to enter a voice recognition state in response to detecting the user having started speaking, extract a user voice for the collected sound (Cuthbert et al. [0092] processor, [0028] When the primary user begins speaking, the user device 10 receives the primary user’s speech and convert that speech into audio data. The user device 10 encodes the speech into an audio signal, which may be, for example, a snippet of relatively high quality audio, such as 16kHz lossless audio and initiates speech recognition as described below, [0029] Speech recognition involves converting audio data into text representing speech, [0072] Upon receiving a voice input signal from the primary user, the user interface transitions to the voice input state 1708 (state C), in which the language translation application is receiving a voice signal and performing speech recognition), and determining a target language associated with the source language based on a preset language pair (Cuthbert et al. Fig. 2 elements 20 and 40); 
	a tail point detecting module configured to detect whether the user has stopped speaking for more than a preset delay duration (Cuthbert et al. [0036] The language translation application may automatically identify endpoints in voice input, [0073] Upon receiving input complete signal (i.e., the language translation application detects a speech endpoint and/or the primary user manually indicates the end of the voice input), the user interface transitions to the prepare translation state 1710 (state D), [0075] if the primary user begins speaking again, the language translation application could return to state 1704 when the delay is less than the predetermined period of time, otherwise the language translation application would go to state 1716), and exit the voice recognition state in response to detecting the user having stopped speaking for more than the preset delay duration (Cuthbert et al. [0073] Upon receiving input complete signal (i.e., the language translation application detects a speech endpoint and/or the primary user manually indicates the end of the voice input), [0074] Upon receiving a translation ready signal (e.g., the language translation application receives or generates an audio signal corresponding to a translation of the primary user’s speech into the target language), the user interface transitions to the output translation state 1712 (state E), in which the language translation application is outputting a spoken translation of the primary user’s speech);
	a translation and voice synthesizing module configured to convert the user voice extracted in the voice recognition state into a target voice of the target language through the processor (Cuthbert et al. [0073] Upon receiving input complete signal (i.e., the language translation application detects a speech endpoint and/or the primary user manually indicates the end of the voice input), [0074] Upon receiving a translation ready signal (e.g., the language translation application receives or generates an audio signal corresponding to a translation of the primary user’s speech into the target language), the user interface transitions to the output translation state 1712 (state E), in which the language translation application is outputting a spoken translation of the primary user’s speech); and 
 	a playback module configured to play the target voice through the sound playback device and trigger the end point detecting module to execute the step of detecting whether the user starts speaking based on the collected sound (Cuthbert et al. [0017] a primary user (e.g., the owner of a user device) wants to communicates with a participating user who speaks a language different than the language of the primary user... The primary user then begins speaking in a source language (e.g., the primary user’s native language). When the primary user is finished speaking, the language translating application begins to obtain a translation of the primary user’s speech into a target language...This process may repeat for the duration of the exchange between the primary user and the participating user.)
	Cuthbert et al. fail to explicitly teach 
determine a source language used by the user based on the extracted user voice,
	However, Smus et al. teach
 	determine a source language used by the user based on the extracted user voice (Smus et al. [0060] Each user 105 may have certain particular audio characteristics to his/her speech, such as a particular intonation, frequency, timbre, and/or inflection. Such audio characteristics may be easily detected from an audio signal representation of speech and leveraged to identify the source language of the speech. For example only, a user 105 of the computing device 110 may have a user profile in which her/his preferred language and audio characteristics of speech are stored. Accordingly, when the computing device 110 receives a speech input from the user 105, the audio characteristics can be detected and matched to the user 105 in order to identify the source language of the speech as the preferred language of the user 105),
	Cuthbert et al. and Smus et al. are analogous art because they are from a similar field of endeavor, namely signal processing techniques and applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the steps of speech translation as taught by Cuthbert et al., using the teaching of detecting the audio characteristics of the user as taught by Smus et al., for the benefit of identifying the source language of the user (Smus et al. [0060] Each user 105 may have certain particular audio characteristics to his/her speech, such as a particular intonation, frequency, timbre, and/or inflection. Such audio characteristics may be easily detected from an audio signal representation of speech and leveraged to identify the source language of the speech. For example only, a user 105 of the computing device 110 may have a user profile in which her/his preferred language and audio characteristics of speech are stored. Accordingly, when the computing device 110 receives a speech input from the user 105, the audio characteristics can be detected and matched to the user 105 in order to identify the source language of the speech as the preferred language of the user 105.)

With respect to Claim 10, Cuthbert et al. disclose
A translation apparatus, wherein the apparatus comprises a sound collecting device, a sound playback device, a storage, a processor, and a computer program stored in the storage and executable on the processor (Cuthbert et al. [0023] a microphone 30, [0038] the speaker icon 50 has an animated outline to create a visual indication that the language translation application is preparing to output a spoken translation from a speaker of the user device 10, [0092] a programmable processor, a computer, or multiple processors or computers);
 	wherein, the sound collecting device, the sound playback device, and the storage are electrically coupled to the processor (Cuthbert et al. [0092] The term “data processing apparatus” encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processor or computers. The apparatus may include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them, [0023] microphone 30, [0038] speaker icon 50); 
 	when the processor executes the computer program, the following steps are executed:
	collecting a sound in an environment through the sound collecting device in response to a translation task being triggered (Cuthbert et al. [0064] the user may use a gesture (e.g., shaking the user device 10) to initiate the process of turning on the microphone), and detecting whether a user starts speaking based on the collected sound through the processor (Cuthbert et al. [0028] When the primary user begins speaking, the user device 10 receives the primary user’s speech and convert that speech into audio data. The user device 10 encodes the speech into an audio signal, which may be, for example, a snippet of relatively high quality audio, such as 16 kHz lossless audio and initiates speech recognition, [0072] Upon receiving a voice input signal from the primary user, the user interface transitions to the voice input state 1708 (state C), in which the language translation application is receiving a voice signal and performing speech recognition), entering a voice recognition state in response to detecting the user having started speaking, extracting a user voice from the collected sound (Cuthbert et al. [0028] When the primary user begins speaking, the user device 10 receives the primary user’s speech and convert that speech into audio data. The user device 10 encodes the speech into an audio signal, which may be, for example, a snippet of relatively high quality audio, such as 16kHz lossless audio and initiates speech recognition as described below, [0029] Speech recognition involves converting audio data into text representing speech, [0072] Upon receiving a voice input signal from the primary user, the user interface transitions to the voice input state 1708 (state C), in which the language translation application is receiving a voice signal and performing speech recognition), and determining a target language associated with the source language based on a preset language pair (Cuthbert et al. Fig. 2 elements 20 and 40); 
	exiting the voice recognition state in response to detecting the user having stopped speaking for more than a preset delay duration (Cuthbert et al. [0036] The language translation application may automatically identify endpoints in voice input, [0073] Upon receiving input complete signal (i.e., the language translation application detects a speech endpoint and/or the primary user manually indicates the end of the voice input), the user interface transitions to the prepare translation state 1710 (state D), [0075] if the primary user begins speaking again, the language translation application could return to state 1704 when the delay is less than the predetermined period of time, otherwise the language translation application would go to state 1716), and converting the user voice extracted in the voice recognition state into a target voice of the target language (Cuthbert et al. [0074] Upon receiving a translation ready signal (e.g., the language translation application receives or generates an audio signal corresponding to a translation of the primary user’s speech into the target language), the user interface transitions to the output translation state 1712 (state E), in which the language translation application is outputting a spoken translation of the primary user’s speech); and 
  	playing the target voice through the sound playback device (Cuthbert et al. Fig. 17 element 1712 Output translation), and returning to the step of detecting whether the user starts speaking based on the collected sound until the translation task ends (Cuthbert et al. [0017] a primary user (e.g., the owner of a user device) wants to communicates with a participating user who speaks a language different than the language of the primary user... The primary user then begins speaking in a source language (e.g., the primary user’s native language). When the primary user is finished speaking, the language translating application begins to obtain a translation of the primary user’s speech into a target language...This process may repeat for the duration of the exchange between the primary user and the participating user.)
	Cuthbert et al. fail to explicitly teach 
determining a source language used by the user based on the extracted user voice,
 	However, Smus et al. teach 
 	determining a source language used by the user based on the extracted user voice (Smus et al. [0060] Each user 105 may have certain particular audio characteristics to his/her speech, such as a particular intonation, frequency, timbre, and/or inflection. Such audio characteristics may be easily detected from an audio signal representation of speech and leveraged to identify the source language of the speech. For example only, a user 105 of the computing device 110 may have a user profile in which her/his preferred language and audio characteristics of speech are stored. Accordingly, when the computing device 110 receives a speech input from the user 105, the audio characteristics can be detected and matched to the user 105 in order to identify the source language of the speech as the preferred language of the user 105),
 	Cuthbert et al. and Smus et al. are analogous art because they are from a similar field of endeavor in the Signal Processing techniques and applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the steps of speech translation as taught by Cuthbert et al., using teaching of detecting the audio characteristic of the user as taught by Smus et al. for the benefit of identifying the source language of the user (Smus et al. [0060] Each user 105 may have certain particular audio characteristics to his/her speech, such as a particular intonation, frequency, timbre, and/or inflection. Such audio characteristics may be easily detected from an audio signal representation of speech and leveraged to identify the source language of the speech. For example only, a user 105 of the computing device 110 may have a user profile in which her/his preferred language and audio characteristics of speech are stored. Accordingly, when the computing device 110 receives a speech input from the user 105, the audio characteristics can be detected and matched to the user 105 in order to identify the source language of the speech as the preferred language of the user 105.)

9.	Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Cuthbert et al. (US 2015/0134322 A1) in view of Smus et al. (US 2019/0095430 A1) and Barker et al. (US 2012/0041759 A1.)

	With respect to Claim 2, Cuthbert et al. in view of Smus et al. teach 
 	wherein before the step of entering the voice recognition state in response to detecting the user having started speaking further comprises: 
	Cuthbert et al. in view of Smus et al. fail to explicitly teach
 	detecting whether a noise in the environment is greater than a preset noise based on the collected sound through the processor, and outputting prompt information for prompting the user the environment being unsuitable for translations if the noise is greater than the preset noise.  
	However, Barker et al. teach 
 	detecting whether a noise in the environment is greater than a preset noise based on the collected sound through the processor, and outputting prompt information for prompting the user the environment being unsuitable for translations if the noise is greater than the preset noise (Barker et al. [0042] Background noise present in the mobile user's environment can affect the quality of recordings. The mobile replacement-dialogue recording device 20 can detect and provide useful information concerning such background noise. For example, the device can use its microphone in order to determine whether the ambient sound level is unacceptably high, and suggest that the mobile user move to a quieter location if too much noise is detected.)
 	Cuthbert et al., Smus et al. and Barker et al. are analogous art because they are from a similar field of endeavor in the Signal Processing techniques and applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the steps of speech translation as taught by Cuthbert et al., using teaching of detecting the audio characteristic of the user as taught by Smus et al. for the benefit of identifying the source language of the user, using teaching of informing the user as taught by Barker et al. for the benefit of informing the user of the high level of noise and suggesting the user move to a quieter location (Barker et al. [0042] Background noise present in the mobile user's environment can affect the quality of recordings. The mobile replacement-dialogue recording device 20 can detect and provide useful information concerning such background noise. For example, the device can use its microphone in order to determine whether the ambient sound level is unacceptably high, and suggest that the mobile user move to a quieter location if too much noise is detected.)

Allowable Subject Matter
10.	Claim 5 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
	Claims 6-8 are objected to as being dependent upon an objected claim(s) by virtue of their dependency. 
The prior art(s) taken alone or in combination fail(s) to teach the following element(s) in combination with the other recited elements in the claim(s).
	“adjusting the preset delay duration based on a time difference between a time of having detected the user having stopped speaking and a time of the translation instruction being triggered.”
The closest prior art found is as follows. 
a.	Doi (US 2007/0061152 A1). In this reference, Doi discloses a method for detecting the endpoint of the user’s utterance by comparing the silence period with the predetermined time (Doi [0140] Also, in place of executing the translation by detecting the change in the face image information, in the case where the silence period during which the user does not speak exceeds a predetermined time, the recognition result stored in the source language storage unit 121 before start of the silence period can be translated as one unit. As a result, the translation and the speech synthesis can be carried out by appropriately determining the end of the speech, while at the same time minimizing the silence period, thereby further promoting the smooth dialogue.) However, Doi does not disclose a method for adjusting the predetermined time based on a time difference between a time of having detected the user having stopped speaking and a time of the translation instruction being triggered. 
b.	VanBlon et al. (US 2015/0154983 A1.) In this reference, VanBlon et al. disclose a method for ending processing of the audible input sequence by utilizing the expiration of the threshold time (VanBlon et al. [0053] In addition to the foregoing, the UI 300 includes instructions 324 indicating that, should the user wish to close the audible input application and/or end the particular audible input sequence that was being input by the user prior to the pause detected by the device, a command to do so (e.g. automatically) may be input to the device by e.g. removing the device from the user's facial proximity (e.g. a threshold distance away from at least a portion of the user's face). However, note that the instructions 324 may indicate that the application may be closed by still other ways such as e.g. inputting an audible command to close the application and/or end processing of the audible input sequence, engage another application and/or operation of the device for a threshold time to close the application and/or end processing of the audible input sequence (e.g. after expiration of the threshold time), not providing audible input (e.g. providing an audible pause and/or not speaking) within a threshold time to close the application and/or end processing of the audible input sequence (e.g. after expiration of the threshold time), not providing touch input to the display presenting the UI 300 for a threshold time to close the application and/or end processing of the audible input sequence, etc. (e.g. after expiration of the threshold time).) However, VanBlon et al. do not teach a method of adjusting the threshold time in ending processing of the audible input sequence.
c.	Kim et al. (US 2021/0232776 A1.) In this reference, Kim et al. disclose a method for translating the recognized user’s voice into a preset language. Kim et al. apply end point detection to detect end points of the speech and further detect the actual speech section (Kim et al. [0047] At this time, the signal input to the processor 130 may be converted into a more useful form for speech recognition. The processor 130 may convert the input signal from an analog form into a digital form, and may detect start and end points of the speech and further detect the actual speech section/data included in voice data. This is called end point detection (EPD).) However, Kim et al. do not disclose a method for adjusting the preset delay duration based on a time difference between a time of having detected the user having stopped speaking and a time of the translation instruction being triggered in detecting the end points of the speech. 
Conclusion
11.	The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure. See PTO-892.
a.	Furihata et al. (US 2008/0091407 A1.) In this reference, Furihata et al. disclose a method for performing translation from inputted speech. 
b. 	Sakamoto et al. (US 2013/0211818 A1.) In this reference, Sakamoto et al. disclose a method for speech translation. 
c. 	Nagao (US 2008/0077390 A1.) In this reference, Nagao discloses a method for translating the user’s speech. 

12.	Any inquiry concerning this communication or earlier communications from the examiner should be directed to THUYKHANH LE whose telephone number is (571)272-6429.  The examiner can normally be reached on Mon-Fri: 9am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, RICHEMOND DORVIL can be reached on 571-272-7602.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR 




/THUYKHANH LE/Primary Examiner, Art Unit 2658