DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3, 7-11, 13 and 16-17 are rejected under 35 U.S.C. 103 as being unpatentable over Liberman et al. (EP 2 373 016 A2) in view of Huang (CN 110166729).

Claim 1,
Liberman teaches an automatic interpretation method performed by a server communicating with a plurality of terminal devices having a microphone function, a speaker function, a communication function, the automatic interpretation method comprising ([0012-0013] [0018] [0020] MLTV-MCU for translating audio stream into different languages (text and vocal translation); a multipoint control unit (MCU) is used to manage a video communication session (a videoconference); an MCU is a conference controlling entity that is located in a node of a network, in a terminal, or elsewhere; a terminal (endpoint) is an entity on the network, capable of providing real-time, two-way audio and/or audiovisual communication with other terminals or with the MCU): 
receiving a plurality of speech signals uttered by a plurality of users from a plurality of terminal devices; acquiring a plurality of speech energies from the plurality of received speech signals; determining a main speech signal uttered in a current utterance turn among the plurality of speech signals by comparing the plurality of acquired speech energies ([0018] [0024] receiving a plurality of audio streams from the endpoints; measuring the signal energy of each audio stream; determining the main speaker based on the highest signal energy among the plurality of energies; the main speaker may be the conferee whose audio energy level was above the audio energy of the other conferees for a certain percentage of a certain period); and 
transmitting an automatic interpretation result acquired by performing automatic interpretation on the determined main speech signal to the plurality of terminal devices ([0018] [0024] translating the main speaker audio stream and displaying the translation on a plurality of endpoints).
The difference between the prior art and the claimed invention is that Liberman does not explicitly teach terminal devices having wearable function.
Huang teaches terminal devices having wearable function ([pg. 12] mobile terminal is a wearable device).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Liberman with teachings of Huang by modifying the method and system for adding translation in a videoconference as taught by Liberman to include terminal devices having wearable function as taught by Huang for the benefit of using different forms for voice communication (Huang [pg. 12]).

Claim 11 contains subject matter similar to claim 1, and thus is rejected under similar rationale.

Claim 3,
Liberman further teaches the automatic interpretation method of claim 1, wherein the determining of a main speech signal comprises: determining, as the main speech signal, a speech signal having a largest speech energy among the plurality of speech signals ([0024] the main speaker is the conferee whose audio energy level was above the audio energy of the other conferees for a certain percentage of a certain period).
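For illustration only, the energy comparison discussed above can be sketched as follows. This is a minimal sketch and is not drawn from Liberman or from the application: the names signal_energy, select_main_stream, and the endpoint identifiers are hypothetical, energy is assumed to be the mean squared amplitude of one frame, and the comparison is made on a single frame rather than over "a certain percentage of a certain period" as described in Liberman [0024].

```python
def signal_energy(samples):
    """Mean squared amplitude of one frame of audio samples (assumed measure)."""
    if not samples:
        return 0.0
    return sum(s * s for s in samples) / len(samples)


def select_main_stream(streams):
    """Return the id of the stream whose current frame has the largest energy."""
    return max(streams, key=lambda stream_id: signal_energy(streams[stream_id]))


# Hypothetical frames received from three endpoints during one utterance turn.
frames = {
    "endpoint_a": [0.01, -0.02, 0.015, -0.01],
    "endpoint_b": [0.40, -0.35, 0.50, -0.45],
    "endpoint_c": [0.10, -0.12, 0.08, -0.09],
}
main_id = select_main_stream(frames)
```

In this sketch main_id identifies the loudest endpoint for the frame; a practical system would likely smooth the decision over several frames to avoid switching speakers on short bursts.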

Claim 13 contains subject matter similar to claim 3, and thus is rejected under similar rationale.

Claim 7,
Liberman further teaches the automatic interpretation method of claim 1, wherein the transmitting of the automatic interpretation result to the plurality of terminal devices comprises: acquiring a first text data of a first language from the main speech signal using a speech recognizer ([0024-0025] MLTV-MCU identifies an audio stream from the main speaker that it needs to translate, identifies the language of the audio stream, and identifies the language to which the audio stream should be translated; the MLTV-MCU converts the audio stream into a written text; a speech to text engine (STTE) that may convert an audio stream into text); 
acquiring second text data automatically translated to a second language from the first text data using an automatic translator ([0028] after an audio stream has been converted to text by STTE, one embodiment of the MLTV-MCU translates the text by a translation engine (TE) to another language); 
acquiring a synthesized speech of the first language from the first text data and acquiring a synthesized speech of a second language from the second text data, using a speech synthesizer ([0032] text to speech and speech to text engines); and 
transmitting the automatic interpretation result including the first text data, the second text data, the synthesized speech of the first language, and the synthesized speech of the second language to the plurality of terminal devices ([0018] the MLTV-MCU is informed of which audio streams from the one or more received audio streams in a multipoint videoconference need to be translated and the languages into which the different audio streams need to be translated; the MLTV-MCU translates each needed audio stream to one or more desired languages using text to speech and speech to text engines; the MLTV-MCU displays the one or more translations of the one or more audio streams, as written translation or vocal translation, on one or more endpoint screens).

Claim 16 contains subject matter similar to claim 7, and thus is rejected under similar rationale.

Claim 8,
Liberman further teaches the automatic interpretation method of claim 7, wherein the acquiring of the first text data includes acquiring the first text data using an end-to-end speech recognizer capable of performing language identification ([0022] identifying the language of the received audio stream).

Claim 17 contains subject matter similar to claim 8, and thus is rejected under similar rationale.

Claim 9,
Liberman further teaches the automatic interpretation method of claim 7, wherein the transmitting of the automatic translation result including the first text data, the second text data, and the synthesized speech to the plurality of terminal devices comprises: transmitting at least one of the first text data and the synthesized speech of the first language to terminal devices of a user who uses the first language; and transmitting at least one of the second text data and the synthesized speech of the second language to terminal devices of a user who uses the second language ([0018] [0020] determining each user's desired language and outputting the translation in that language; converting the main speaker audio stream to text and translating the text into another language; the information considered includes the conferee's language, the languages into which to translate the conferee's speech, the endpoints whose audio is to be translated to the conferee's language, and the languages into which the conferee desires translation; the translation is presented as a written translation, using subtitles, or as a vocal translation).

Claim 10,
Liberman further teaches the automatic interpretation method of claim 1, wherein, in the receiving of the plurality of speech signals, each speech signal corresponds to a speech section detected according to a speech detection process performed by the plurality of terminal devices ([0081] AD 310 detects and distinguishes between voice and non-voice audio signals).


Claims 2 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Liberman et al. (EP 2 373 016 A2) in view of Huang (CN 110166729) and further in view of Kang et al. (KR 20110061781).

Claim 2,
Liberman further teaches detecting a speech section from each speech signal ([0026] distinguishing between voice and non-voice for each audio stream).
The difference between the prior art and the claimed invention is that neither Liberman nor Huang explicitly teaches calculating the plurality of speech energies by calculating a power spectrum density corresponding to each speech section.
Kang teaches calculating the plurality of speech energies by calculating a power spectrum density corresponding to each speech section ([pg. 3] estimating the power spectral density of the noise in real time in the non-voice section and the speech section).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Liberman and Huang with the teachings of Kang by modifying the method and system for adding translation in a videoconference as taught by Liberman to include calculating the plurality of speech energies by calculating a power spectrum density corresponding to each speech section as taught by Kang for the benefit of improving performance in speech processing by removing dynamic noise included in an input speech in a noisy environment (Kang [pg. 1]).
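For illustration only, one way a speech energy can be obtained from a power spectral density is sketched below. This is a minimal sketch under stated assumptions and is not the method of Kang or of the application: the PSD is assumed to be a simple periodogram of one detected speech section, and the names periodogram and energy_from_psd are hypothetical. A direct DFT is used so the example needs only the standard library.

```python
import cmath
import math


def periodogram(frame):
    """Periodogram PSD estimate: |X[k]|^2 / n for each frequency bin k.

    Uses a direct O(n^2) DFT so that no external library is required.
    """
    n = len(frame)
    psd = []
    for k in range(n):
        x_k = sum(
            frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        psd.append(abs(x_k) ** 2 / n)
    return psd


def energy_from_psd(frame):
    """Frame energy as the sum of the periodogram bins.

    By Parseval's relation this equals the sum of squared samples.
    """
    return sum(periodogram(frame))


# Hypothetical speech section: one cycle of a sampled cosine.
section = [1.0, 0.0, -1.0, 0.0]
section_energy = energy_from_psd(section)
```

Because of Parseval's relation, section_energy matches the time-domain sum of squares for the same section; working in the frequency domain becomes useful when only certain bands, or a noise-compensated spectrum, should contribute to the energy.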

Claim 12 contains subject matter similar to claim 2, and thus is rejected under similar rationale.

Claims 4-6 and 14-15 are rejected under 35 U.S.C. 103 as being unpatentable over Liberman et al. (EP 2 373 016 A2) in view of Huang (CN 110166729) and further in view of Ayrapetian (US 11,277,685).

Claim 4,
Liberman further teaches the automatic interpretation method of claim 1, wherein the determining of a main speech signal comprises: determining, as the main speech signal, a speech signal having a largest speech energy among the plurality of speech signals ([0024] the main speaker is the conferee whose audio energy level was above the audio energy of the other conferees for a certain percentage of a certain period).
The difference between the prior art and the claimed invention is that neither Liberman nor Huang explicitly teaches determining a reference speech signal among the remaining speech signals; and canceling noise of the main speech signal using the reference speech signal.
Ayrapetian teaches determining a reference speech signal among the remaining speech signals; and canceling noise of the main speech signal using the reference speech signal ([claim 3] determining that the second beamformed audio signal has a lowest SNR value of the plurality of beamformed audio signals, wherein the second beamformed audio signal is reference audio data; and generating, using the adaptive filter, the first portion of the output audio data by subtracting the reference audio data from the target audio data).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Liberman and Huang with teachings of Ayrapetian by modifying the method and system for adding translation in a videoconference as taught by Liberman to include determining a reference speech signal among the remaining speech signals; and canceling noise of the main speech signal using the reference speech signal as taught by Ayrapetian for the benefit of improving noise cancellation in a communication session (Ayrapetian [col. 2 line 38]).
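For illustration only, canceling noise of a main signal by subtracting an adaptively filtered reference signal can be sketched as follows. This is a minimal sketch under stated assumptions and is not the implementation of Ayrapetian, which operates on beamformed audio signals selected by SNR: a basic least-mean-squares (LMS) filter is assumed, and the name lms_cancel, the tap count, and the step size are hypothetical choices.

```python
import math


def lms_cancel(primary, reference, taps=4, step=0.05):
    """Subtract an adaptively filtered reference from the primary signal.

    Returns the error signal, which is the primary signal with the part
    that is correlated with the reference removed.
    """
    weights = [0.0] * taps
    output = []
    for n in range(len(primary)):
        # Most recent `taps` reference samples, zero-padded at the start.
        window = [reference[n - i] if n - i >= 0 else 0.0 for i in range(taps)]
        noise_estimate = sum(w * x for w, x in zip(weights, window))
        error = primary[n] - noise_estimate
        # LMS weight update driven by the residual error.
        weights = [w + step * error * x for w, x in zip(weights, window)]
        output.append(error)
    return output


# Hypothetical case: the primary channel contains only a scaled copy of
# the noise observed on the reference channel.
reference_noise = [math.sin(0.3 * n) for n in range(1000)]
primary_signal = [0.5 * r for r in reference_noise]
cleaned = lms_cancel(primary_signal, reference_noise)

head_energy = sum(x * x for x in primary_signal[:50])
tail_energy = sum(x * x for x in cleaned[-50:])
```

In this sketch tail_energy is much smaller than head_energy once the filter has adapted, since the primary channel holds nothing but reference-correlated noise. With real speech on the primary channel, the speech would remain in the output to the extent it is uncorrelated with the reference.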

Claim 5,
Ayrapetian further teaches the automatic interpretation method of claim 4, wherein the determining of the reference speech signal comprises determining, as the reference speech signal, a speech signal having a lowest speech energy or medium speech energy among the plurality of speech signals ([claim 3] determining that the second beamformed audio signal has a lowest SNR value of the plurality of beamformed audio signals, wherein the second beamformed audio signal is reference audio data).

Claim 6,
Liberman further teaches the automatic interpretation method of claim 5, wherein the removing of noise of the main speech signal is selectively performed according to operation commands from the plurality of terminal devices ([0023] the plurality of received audio streams to be translated may be, in one embodiment, a pre-defined number of audio streams with audio energy higher than a certain threshold value; the pre-defined number may be in the range 3 to 5, for example; alternatively, the audio streams to be translated may be audio streams from endpoints that a user requested the MLTV-MCU to translate).

Claim 15 contains subject matter similar to claim 6, and thus is rejected under similar rationale.

Claim 14,
Liberman and Huang teach all the limitations in claim 11. The difference between the prior art and the claimed invention is that neither Liberman nor Huang explicitly teaches wherein the noise canceling processing unit cancels noise of the speech signal of the speaker using a technique of processing signals of two or more channels.
Ayrapetian teaches wherein the noise canceling processing unit cancels noise of the speech signal of the speaker using a technique of processing signals of two or more channels ([Fig. 4] the adaptive noise canceller may include a number of nullformer blocks 418a through 418p. The device 110 may include P number of nullformer blocks 418 where P corresponds to the number of channels, where each channel corresponds to a direction in which the device may focus the nullformers 418 to isolate detected noise).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Liberman and Huang with teachings of Ayrapetian by modifying the method and system for adding translation in a videoconference as taught by Liberman to include wherein the noise canceling processing unit cancels noise of the speech signal of the speaker using a technique of processing signals of two or more channels as taught by Ayrapetian for the benefit of improving noise cancellation in a communication session (Ayrapetian [col. 2 line 38]).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Oh et al. (KR 20210097392) teaches to automatically create meeting minutes of meeting contents together with a real-time interpretation function of a meeting conducted in multiple languages. Accordingly, a conference interpreting device can: bidirectionally interpret speeches in multilingual meetings; display content of a speaker in text with both a spoken language and a translated language; generate meeting minutes including audio with content of a meeting recorded, and text synchronized to the audio; play the audio of the generated meeting minutes, and display the text synchronized with the audio being played; rapidly browse only corresponding content when playing the meeting minutes by setting a bookmark for spoken content; and create and send a QR code for downloading the saved meeting minutes.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHREYANS A PATEL whose telephone number is (571)270-0689. The examiner can normally be reached Monday-Friday 8am-5pm PST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

SHREYANS A. PATEL
Examiner
Art Unit 2657



/SHREYANS A PATEL/               Examiner, Art Unit 2656