DETAILED ACTION

Introduction
1.         This office action is in response to Applicant’s submission filed on 04/04/2019.  Claims 1-20 are pending in the application. As such, Claims 1-20 have been examined. 

Notice of Pre-AIA or AIA Status
2.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Drawings
3.	The drawings filed on 04/04/2019 have been accepted and considered by the Examiner.

Claim Objections
4.	Claim 15 is objected to because of the following informalities:  
In Claim 15 please replace: 
“--a voice detection module, … when the new voice data is detected,;--” with
“--a voice detection module, … when the new voice data is detected;--” in order to remove the extra comma at the end of this limitation. Appropriate correction is required.



Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

5.	The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 

Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  Such claim limitation(s) is/are: “a voice collection unit…to collect…”, “a feature extraction unit…to extract…”, “a scene determination unit…to determine…”, “an orientation control unit to acquire…, and to control…” in claim 10; “a feature extraction module…to extract…”, “a first scene determination module…to input…and determine…” in claim 12; “a sample set establishing module…to acquire…”, “a feature vector set establishing module…to extract…, and establish…”, “a training module…to train…” in claim 13; “a sample acquiring module to acquire…”, “a feature determining module…to determine…”, “a decision tree constructing module…to construct…”, “a second scene determining module…to determine…” in claim 14; “a first voice acquiring module…to acquire…”, “a region dividing module…to divide…”, “a voice detection module…to acquire…”, “an angle matching module…to determine…”, “a first turning control module…to control…” in claim 15; and “a second voice acquiring module…to acquire…”, “a second turning module…to control…”, “a first prediction module…to predetermine…”, “a second prediction module…to predetermine…”, “a third turning module…to control…” in claim 18.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


6.	Claim 20 is rejected under 35 U.S.C. 101 because the claim recites an embodiment of Applicant’s invention directed to “a computer storage medium…” The recitation of the medium in the specification, however, does not exclude non-statutory medium types, because no specific and limiting definition of “a storage medium” is provided. The specification uses open-ended language such as “may be,” “for example,” “any,” and “not limited to,” as in “…computer readable medium can include: any entity or device that can carry the computer program codes, recording medium, USB flash disk, mobile hard disk, hard disk, optical disk, computer storage device, ROM (Read-Only Memory), RAM (Random Access Memory), electrical carrier signal, telecommunication signal…” (See e.g., Specification, p. 27).
  Additionally, variations of the term “storage” are not necessarily considered to limit a media claim to non-transitory embodiments because content may be considered to be stored on a signal during propagation and because many disclosures conflate storage media and signals.  For example, U.S. Patent 6,286,104 discloses: “the methods described herein may be implemented by a series of computer-executable instructions residing on a storage medium such as a carrier wave”.
 Thus, under the broadest reasonable interpretation, the claim(s) as a whole would include non-statutory mediums such as carrier waves.
 “The United States Patent and Trademark Office (USPTO) is obliged to give claims their broadest reasonable interpretation consistent with the specification during proceedings before the USPTO. See In re Zletz, 893 F.2d 319 (Fed. Cir. 1989) (during patent examination the pending claims must be interpreted as broadly as their terms reasonably allow).”
The claim as a whole therefore includes signal-based mediums.  A signal does not fall within one of the four statutory categories of invention (i.e., process, machine, manufacture, or composition of matter) because it is an ephemeral, transient signal and thus is non-statutory.  Since the claim as a whole includes these non-statutory instances, Claim 20 is directed to non-statutory subject matter.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

7.	Claims 1, 2, 4, 10, 11, 12, 19, and 20 are rejected under 35 U.S.C. 102(a)(1) and/or 102(a)(2) as being anticipated by Bernardin et al. (Bernardin, Keni, and Rainer Stiefelhagen. "Audio-visual multi-person tracking and identification for smart environments." Proceedings of the 15th ACM international conference on Multimedia. 2007), hereinafter referred to as BERNARDIN.

With respect to Claim 1, BERNARDIN discloses:
1. A method for controlling camera shooting comprising steps of: 
collecting voice data of a sound source object (See e.g., “…speech detection and speaker identification, coupled with a source localizer using the input from several microphone arrays, deliver precisely localized ID cues whenever a speaker becomes active…” See e.g., BERNARDIN, Abstract, §§ 2, 2.1, 2.2, 2.3, 3, Figs. 1, 4, 5); 
extracting a voice feature based on the voice data of the sound source object (See e.g., “…speech detection and speaker identification, coupled with a source localizer using the input from several microphone arrays, deliver precisely localized ID cues whenever a speaker becomes active…,” “…Cepstral mean subtraction and feature warping are performed on the audio signal to reduce channel, noise and reverberation effects. The output of the speaker ID module is the identity of the speaker as well as the corresponding GMM’s a-posteriori probability for the analyzed segment, which is used as confidence measure…” See e.g., BERNARDIN, Abstract, §§ 2, 2.1, 2.2, 2.3, 3, Figs. 1, 4, 5); 

[Image: media_image1.png]
determining a current voice scene according to the extracted voice feature and a voice feature corresponding to a preset voice scene (See e.g., “…output of the speaker ID module is the identity of the speaker as well as the corresponding GMM’s a-posteriori probability for the analyzed segment, which is used as confidence measure… history of speech source location estimates is kept for the duration of a speech segment. Similarly, for the same time window, a record is kept in the fusion module of the positions of all visually tracked persons. The visual and acoustic tracks are then compared to associate the recognized speaker ID to the best matching person track…” See e.g., BERNARDIN, Abstract, §§ 2, 2.1, 2.2, 2.3, 3, Figs. 1, 4, 5); and 

[Image: media_image2.png]
acquiring a shooting mode corresponding to the current voice scene, and controlling the movement of the camera according to the shooting mode corresponding to the current voice scene (See e.g., “…A set of steerable fuzzy-controlled pan-tilt-zoom cameras serves to smoothly track persons of interest and opportunistically capture facial close-ups for face identification. In parallel, speech segmentation, sound source localization and speaker identification are performed using several far-field microphones and arrays…”, “…the fusion module uses the person track information from the multiple camera tracker as the basis upon which the scene model is updated and association of ID cues is performed. The scene model is composed of a number of active person models, and some optional information such as the position of the entrance door and of the whiteboard. A person model comprises the person’s 3D location, and a histogram of identification cues that were assigned to it over time. This “ID histogram” has as many bins as audio-visually trained in subjects, and the values accumulated in the respective bins are the confidences given by the face or speaker ID modules…” See e.g., BERNARDIN, Abstract, §§ 2, 2.1, 2.2, 2.3, 3, Figs. 1, 4, 5).

With respect to Claim 2, BERNARDIN discloses:
2. The method of claim 1, wherein the voice feature comprises one or more selected from a group consisting of a voice duration, a voice interval duration, a sound source angle, a sound intensity of a voice, or a sound frequency of a voice (See e.g., “…Speech detection and segmentation…by thresholding in the power spectrum. Speech segments of more than 1 second length are extracted…Speaker localization and tracking…by estimating time delays of arrival between microphone pairs using the Phase Transform variant of the Generalized Cross Correlation function… Speaker Identification…component for speaker ID is based on the approach presented in [21]. Speakers are modeled using a 32-component Gaussian Mixture Model (GMM). The inputs to the GMMs are the MFCC coefficients computed on the segmented speech from one audio channel…” See e.g., BERNARDIN, Abstract, §§ 2, 2.1, 2.2, 2.3, 3, Figs. 1, 4, 5). 

With respect to Claim 4, BERNARDIN discloses:
4. The method of claim 1, wherein the step of extracting a voice feature based on the voice data of the sound source object comprises: extracting voice features of a specified amount of the voice data (See e.g., “…Speech detection and segmentation…by thresholding in the power spectrum. Speech segments of more than 1 second length are extracted…Speaker localization and tracking…by estimating time delays of arrival between microphone pairs using the Phase Transform variant of the Generalized Cross Correlation function… Speaker Identification…component for speaker ID is based on the approach presented in [21]. Speakers are modeled using a 32-component Gaussian Mixture Model (GMM). The inputs to the GMMs are the MFCC coefficients computed on the segmented speech from one audio channel…” See e.g., BERNARDIN, Abstract, §§ 2, 2.1, 2.2, 2.3, 3, Figs. 1, 4, 5); 
determining the current voice scene by inputting the specified amount of the voice data into a trained machine learning model (See e.g., “…Speech detection and segmentation…by thresholding in the power spectrum. Speech segments of more than 1 second length are extracted…Speaker localization and tracking…by estimating time delays of arrival between microphone pairs using the Phase Transform variant of the Generalized Cross Correlation function… Speaker Identification…component for speaker ID is based on the approach presented in [21]. Speakers are modeled using a 32-component Gaussian Mixture Model (GMM). The inputs to the GMMs are the MFCC coefficients computed on the segmented speech from one audio channel…inputs to the GMMs are the MFCC coefficients computed on the segmented speech from one audio channel. For each speaker, one set of GMMs is trained offline on a 30 second speech segment. The recognition itself is made on segments of 1 to 5 seconds, with longer segments being broken down into smaller ones, to allow for intermediate identification results. Cepstral mean subtraction and feature warping are performed on the audio signal to reduce channel, noise and reverberation effects. The output of the speaker ID module is the identity of the speaker as well as the corresponding GMM’s a-posteriori probability for the analyzed segment, which is used as confidence measure…” See e.g., BERNARDIN, Abstract, §§ 2, 2.1, 2.2, 2.3, 3, Figs. 1, 4, 5).

With respect to Claim 10, BERNARDIN discloses:
10. A device for controlling camera shooting comprising: 
a voice collection unit, configured to collect voice data of a sound source object (See e.g., “…speech detection and speaker identification, coupled with a source localizer using the input from several microphone arrays, deliver precisely localized ID cues whenever a speaker becomes active…” See e.g., BERNARDIN, Abstract, §§ 2, 2.1, 2.2, 2.3, 3, Figs. 1, 4, 5); 
a feature extraction unit, configured to extract a voice feature based on the voice data of the sound source object (See e.g., “…speech detection and speaker identification, coupled with a source localizer using the input from several microphone arrays, deliver precisely localized ID cues whenever a speaker becomes active…,” “…Cepstral mean subtraction and feature warping are performed on the audio signal to reduce channel, noise and reverberation effects. The output of the speaker ID module is the identity of the speaker as well as the corresponding GMM’s a-posteriori probability for the analyzed segment, which is used as confidence measure…” See e.g., BERNARDIN, Abstract, §§ 2, 2.1, 2.2, 2.3, 3, Figs. 1, 4, 5); 
[Image: media_image1.png]
a scene determination unit, configured to determine a current voice scene according to the extracted voice feature and a voice feature corresponding to a preset voice scene (See e.g., “…output of the speaker ID module is the identity of the speaker as well as the corresponding GMM’s a-posteriori probability for the analyzed segment, which is used as confidence measure… history of speech source location estimates is kept for the duration of a speech segment. Similarly, for the same time window, a record is kept in the fusion module of the positions of all visually tracked persons. The visual and acoustic tracks are then compared to associate the recognized speaker ID to the best matching person track…” See e.g., BERNARDIN, Abstract, §§ 2, 2.1, 2.2, 2.3, 3, Figs. 1, 4, 5); and 

[Image: media_image2.png]
an orientation control unit, configured to acquire a shooting mode corresponding to the current voice scene, and to control movement of the camera according to the shooting mode corresponding to the current voice scene (See e.g., “…A set of steerable fuzzy-controlled pan-tilt-zoom cameras serves to smoothly track persons of interest and opportunistically capture facial close-ups for face identification. In parallel, speech segmentation, sound source localization and speaker identification are performed using several far-field microphones and arrays…”,  “…the fusion module uses the person track information from the multiple camera tracker as the basis upon which the scene model is updated and association of ID cues is performed. The scene model is composed of a number of active person models, and some optional information such as the position of the entrance door and of the whiteboard. A person model comprises the person’s 3D location, and a histogram of identification cues that were assigned to it over time. This “ID histogram” has as many bins as audio-visually trained in subjects, and the values accumulated in the respective bins are the confidences given by the face or speaker ID modules…” See e.g., BERNARDIN, Abstract, §§ 2, 2.1, 2.2, 2.3, 3, Figs. 1, 4, 5).

With respect to Claim 11, BERNARDIN discloses:
11. The device of claim 10, wherein the voice features comprises one or more of a voice duration, a voice interval duration, a sound source angle, a sound intensity of a voice, or a sound frequency of a voice (See e.g., “…Speech detection and segmentation…by thresholding in the power spectrum. Speech segments of more than 1 second length are extracted…Speaker localization and tracking…by estimating time delays of arrival between microphone pairs using the Phase Transform variant of the Generalized Cross Correlation function… Speaker Identification…component for speaker ID is based on the approach presented in [21]. Speakers are modeled using a 32-component Gaussian Mixture Model (GMM). The inputs to the GMMs are the MFCC coefficients computed on the segmented speech from one audio channel…” See e.g., BERNARDIN, Abstract, §§ 2, 2.1, 2.2, 2.3, 3, Figs. 1, 4, 5). 

With respect to Claim 12, BERNARDIN discloses:
12. The device of claim 10, wherein the scene determination unit comprises: a feature extraction module, configured to extract voice features of a specified amount of the voice data (See e.g., “…Speech detection and segmentation…by thresholding in the power spectrum. Speech segments of more than 1 second length are extracted…Speaker localization and tracking…by estimating time delays of arrival between microphone pairs using the Phase Transform variant of the Generalized Cross Correlation function… Speaker Identification…component for speaker ID is based on the approach presented in [21]. Speakers are modeled using a 32-component Gaussian Mixture Model (GMM). The inputs to the GMMs are the MFCC coefficients computed on the segmented speech from one audio channel…” See e.g., BERNARDIN, Abstract, §§ 2, 2.1, 2.2, 2.3, 3, Figs. 1, 4, 5); 
a first scene determining module, configured to input the voice features of the specified amount of the voice data into the trained machine learning model and determine a current voice scene (See e.g., “…Speech detection and segmentation…by thresholding in the power spectrum. Speech segments of more than 1 second length are extracted…Speaker localization and tracking…by estimating time delays of arrival between microphone pairs using the Phase Transform variant of the Generalized Cross Correlation function… Speaker Identification…component for speaker ID is based on the approach presented in [21]. Speakers are modeled using a 32-component Gaussian Mixture Model (GMM). The inputs to the GMMs are the MFCC coefficients computed on the segmented speech from one audio channel…inputs to the GMMs are the MFCC coefficients computed on the segmented speech from one audio channel. For each speaker, one set of GMMs is trained offline on a 30 second speech segment. The recognition itself is made on segments of 1 to 5 seconds, with longer segments being broken down into smaller ones, to allow for intermediate identification results. Cepstral mean subtraction and feature warping are performed on the audio signal to reduce channel, noise and reverberation effects. The output of the speaker ID module is the identity of the speaker as well as the corresponding GMM’s a-posteriori probability for the analyzed segment, which is used as confidence measure…” See e.g., BERNARDIN, Abstract, §§ 2, 2.1, 2.2, 2.3, 3, Figs. 1, 4, 5).

With respect to Claim 19, BERNARDIN discloses:
19. A smart device, comprising: a memory, a processor, and a computer program stored in the memory and executable by the processor, wherein when the processor executes the computer program the steps claimed according to claim 1 are implemented (See e.g., “…acquisition and the processing of information are distributed over a network of computers. A total of eight Pentium IV, 3GHz machines is used…,” See e.g., BERNARDIN, Abstract, §§ 2, 2.1, 2.2, 2.3, 3, Figs. 1, 4, 5).

With respect to Claim 20, BERNARDIN discloses:
20. A computer storage medium, the computer storage medium is stored with a computer program, wherein when the computer program is executed by a processor, the steps claimed according to claim 1 are implemented (See e.g., “…acquisition and the processing of information are distributed over a network of computers. A total of eight Pentium IV, 3GHz machines is used…,” See e.g., BERNARDIN, Abstract, §§ 2, 2.1, 2.2, 2.3, 3, Figs. 1, 4, 5).


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
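A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.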

8.	Claims 5, 6, 13, and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Bernardin et al. (Bernardin, Keni, and Rainer Stiefelhagen. "Audio-visual multi-person tracking and identification for smart environments." Proceedings of the 15th ACM international conference on Multimedia. 2007), in view of Sethi (I. K. Sethi, "Neural implementation of tree classifiers," in IEEE Transactions on Systems, Man, and Cybernetics, vol. 25, no. 8, pp. 1243-1249, Aug. 1995), hereinafter referred to as BERNARDIN and SETHI, respectively.

With respect to Claim 5, BERNARDIN discloses:
5. The method of claim 4, wherein steps of training the machine learning model comprises: acquiring a specified amount of sample voice data, and establishing a sample voice data set based on the sample voice data, wherein the sample voice data is marked with a voice scene, and the number of the sample voice data of each voice scene is no less than an average of the number of the sample voice data of each voice scene (See e.g., “…Speech detection and segmentation…by thresholding in the power spectrum. Speech segments of more than 1 second length are extracted…Speaker localization and tracking…by estimating time delays of arrival between microphone pairs using the Phase Transform variant of the Generalized Cross Correlation function… Speaker Identification…component for speaker ID is based on the approach presented in [21]. Speakers are modeled using a 32-component Gaussian Mixture Model (GMM). The inputs to the GMMs are the MFCC coefficients computed on the segmented speech from one audio channel…inputs to the GMMs are the MFCC coefficients computed on the segmented speech from one audio channel. For each speaker, one set of GMMs is trained offline on a 30 second speech segment. The recognition itself is made on segments of 1 to 5 seconds, with longer segments being broken down into smaller ones, to allow for intermediate identification results. Cepstral mean subtraction and feature warping are performed on the audio signal to reduce channel, noise and reverberation effects. The output of the speaker ID module is the identity of the speaker as well as the corresponding GMM’s a-posteriori probability for the analyzed segment, which is used as confidence measure…” See e.g., BERNARDIN, Abstract, §§ 2, 2.1, 2.2, 2.3, 3, Figs. 1, 4, 5); 
extracting voice features according to the sample voice data, and establishing a feature vector set based on the voice features extracted (See e.g., “…by automatically capturing sample images for each subject at different points in the room using the active cameras, and applying the same alignment and decomposition techniques…” and See e.g., “…Speech detection and segmentation…by thresholding in the power spectrum. Speech segments of more than 1 second length are extracted…Speaker localization and tracking…by estimating time delays of arrival between microphone pairs using the Phase Transform variant of the Generalized Cross Correlation function… Speaker Identification…component for speaker ID is based on the approach presented in [21]. Speakers are modeled using a 32-component Gaussian Mixture Model (GMM). The inputs to the GMMs are the MFCC coefficients computed on the segmented speech from one audio channel…inputs to the GMMs are the MFCC coefficients computed on the segmented speech from one audio channel. For each speaker, one set of GMMs is trained offline on a 30 second speech segment. The recognition itself is made on segments of 1 to 5 seconds, with longer segments being broken down into smaller ones, to allow for intermediate identification results. Cepstral mean subtraction and feature warping are performed on the audio signal to reduce channel, noise and reverberation effects. The output of the speaker ID module is the identity of the speaker as well as the corresponding GMM’s a-posteriori probability for the analyzed segment, which is used as confidence measure…” See e.g., BERNARDIN, Abstract, §§ 2, 2.1, 2.2, 2.3, 3, Figs. 1, 4, 5); 
[training a decision tree] of the sample voice data set (See e.g., “…Speech detection and segmentation…by thresholding in the power spectrum. Speech segments of more than 1 second length are extracted…Speaker localization and tracking…by estimating time delays of arrival between microphone pairs using the Phase Transform variant of the Generalized Cross Correlation function… Speaker Identification…component for speaker ID is based on the approach presented in [21]. Speakers are modeled using a 32-component Gaussian Mixture Model (GMM). The inputs to the GMMs are the MFCC coefficients computed on the segmented speech from one audio channel…inputs to the GMMs are the MFCC coefficients computed on the segmented speech from one audio channel. For each speaker, one set of GMMs is trained offline on a 30 second speech segment. The recognition itself is made on segments of 1 to 5 seconds, with longer segments being broken down into smaller ones, to allow for intermediate identification results. Cepstral mean subtraction and feature warping are performed on the audio signal to reduce channel, noise and reverberation effects. The output of the speaker ID module is the identity of the speaker as well as the corresponding GMM’s a-posteriori probability for the analyzed segment, which is used as confidence measure…” See e.g., BERNARDIN, Abstract, §§ 2, 2.1, 2.2, 2.3, 3, Figs. 1, 4, 5) according to [the feature vector set until an actual output value of the decision tree is the same as an ideal output value, and the training is completed.]
BERNARDIN does not explicitly disclose, but SETHI discloses, [training a decision tree] and [the feature vector set until an actual output value of the decision tree is the same as an ideal output value, and the training is completed] (See e.g., “…three training schemes to incorporate soft decision making in a feedforward network representing a tree classifier…” “…branch adaptive implementation of decision trees because the inner links of the network correspond to tree branches… the partitioning layer is forced to adjust its output during training by varying its gain… node adaptive implementation of decision trees because the neurons in the partitioning layer of the network represent the internal nodes of the decision tree… third scheme is a combination of above two methods in which the outputs of the partitioning layer neurons and the link weights for the AND layer are both adjusted during training… this method as the combined branch and node adaptive implementation. Similar to soft decision trees, the classification decision in all three schemes is made by the AND layer neuron producing the highest output, i.e. the class label associated with the neuron producing the highest output is taken as the tree classifier decision…” See e.g., SETHI, Abstract, §§ I, II, III, Figs. 3, 5).
[Image: media_image3.png]
BERNARDIN and SETHI can be considered analogous art because they are from a similar field of endeavor in natural language processing techniques and applications having pattern recognition and learning tasks.  Thus, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of BERNARDIN with SETHI’s techniques comprising, see e.g., “…training schemes…representing a tree classifier…,” in order to advantageously further compensate for, see e.g., “…sensitive[ness]…” (See e.g., SETHI, Abstract, §§ I, II, III, Figs. 3, 5).

With respect to Claim 6, BERNARDIN discloses:
6. The method of claim 1, wherein the step of determining the current voice scene according to the extracted voice feature and the voice feature corresponding to the preset voice scene comprises: acquiring a specified amount of a sample voice data (See e.g., “…Speech detection and segmentation…by thresholding in the power spectrum. Speech segments of more than 1 second length are extracted…Speaker localization and tracking…by estimating time delays of arrival between microphone pairs using the Phase Transform variant of the Generalized Cross Correlation function… Speaker Identification…component for speaker ID is based on the approach presented in [21]. Speakers are modeled using a 32-component Gaussian Mixture Model (GMM). The inputs to the GMMs are the MFCC coefficients computed on the segmented speech from one audio channel…” See e.g., BERNARDIN, Abstract, §§ 2, 2.1, 2.2, 2.3, 3, Figs. 1, 4, 5); determining a distribution of the sound source angle, a voice duration distribution, and a voice interval time of the sample voice data (See e.g., “…Speech detection and segmentation…by thresholding in the power spectrum. Speech segments of more than 1 second length are extracted…Speaker localization and tracking…by estimating time delays of arrival between microphone pairs using the Phase Transform variant of the Generalized Cross Correlation function… Speaker Identification…component for speaker ID is based on the approach presented in [21]. Speakers are modeled using a 32-component Gaussian Mixture Model (GMM). The inputs to the GMMs are the MFCC coefficients computed on the segmented speech from one audio channel…inputs to the GMMs are the MFCC coefficients computed on the segmented speech from one audio channel. For each speaker, one set of GMMs is trained offline on a 30 second speech segment. The recognition itself is made on segments of 1 to 5 seconds, with longer segments being broken down into smaller ones, to allow for intermediate identification results. 
Cepstral mean subtraction and feature warping are performed on the audio signal to reduce channel, noise and reverberation effects. The output of the speaker ID module is the identity of the speaker as well as the corresponding GMM’s a-posteriori probability for the analyzed segment, which is used as confidence measure…” See e.g., BERNARDIN, Abstract, §§ 2, 2.1, 2.2, 2.3, 3, Figs. 1, 4, 5); [constructing a decision tree according to the distribution] of the sound source angle, the voice duration distribution, and the voice interval time of the sample voice data acquired (See e.g., “…Speech detection and segmentation…by thresholding in the power spectrum. Speech segments of more than 1 second length are extracted…Speaker localization and tracking…by estimating time delays of arrival between microphone pairs using the Phase Transform variant of the Generalized Cross Correlation function… Speaker Identification…component for speaker ID is based on the approach presented in [21]. Speakers are modeled using a 32-component Gaussian Mixture Model (GMM). The inputs to the GMMs are the MFCC coefficients computed on the segmented speech from one audio channel…” See e.g., BERNARDIN, Abstract, §§ 2, 2.1, 2.2, 2.3, 3, Figs. 1, 4, 5); determining a current scene according to [the decision tree constructed] and the voice features acquired (See e.g., “…Speech detection and segmentation…by thresholding in the power spectrum. Speech segments of more than 1 second length are extracted…Speaker localization and tracking…by estimating time delays of arrival between microphone pairs using the Phase Transform variant of the Generalized Cross Correlation function… Speaker Identification…component for speaker ID is based on the approach presented in [21]. Speakers are modeled using a 32-component Gaussian Mixture Model (GMM). 
The inputs to the GMMs are the MFCC coefficients computed on the segmented speech from one audio channel…inputs to the GMMs are the MFCC coefficients computed on the segmented speech from one audio channel. For each speaker, one set of GMMs is trained offline on a 30 second speech segment. The recognition itself is made on segments of 1 to 5 seconds, with longer segments being broken down into smaller ones, to allow for intermediate identification results. Cepstral mean subtraction and feature warping are performed on the audio signal to reduce channel, noise and reverberation effects. The output of the speaker ID module is the identity of the speaker as well as the corresponding GMM’s a-posteriori probability for the analyzed segment, which is used as confidence measure…” See e.g., BERNARDIN, Abstract, §§ 2, 2.1, 2.2, 2.3, 3, Figs. 1, 4, 5).
BERNARDIN does not explicitly disclose, but SETHI discloses [constructing a decision tree according to the distribution] and [the decision tree constructed]
 (See e.g., “…three training schemes to incorporate soft decision making in a feedforward network representing a tree classifier…” “…branch adaptive implementation of decision trees because the inner links of the network correspond to tree branches… the partitioning layer is forced to adjust its  output during training by varying its gain… node adaptive implementation of decision trees because the neurons in the partitioning layer of the network represent the internal nodes of the decision tree… third scheme is a combination of above two methods in which the outputs of the partitioning layer neurons and the link weights for the AND layer are both adjusted during training… this method as the combined branch and node adaptive implementation. Similar to soft decision trees, the classification decision in all three schemes is made by the AND layer neuron producing the highest output, i.e. the class label associated with the neuron producing the highest output is taken as the tree classifier decision…” See e.g., SETHI, Abstract, §§I, II, III, Figs. 3, 5).
BERNARDIN and SETHI can be considered analogous art because they are from a similar field of endeavor in natural language processing techniques and applications having pattern recognition and learning tasks.  Thus, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of BERNARDIN with SETHI’s techniques comprising see e.g., “…training schemes…representing a tree classifier…” in order to advantageously further compensate for see e.g., “…sensitive[ness] to noise and minor variations in the data…led to the use of soft thresholding in decision trees …three neural implementation schemes for tree classifiers, that allow soft thresholding…,” (See e.g., SETHI, Abstract, §§I, II, III, Figs. 3, 5).

With respect to Claim 13, BERNARDIN discloses:
13. The device of claim 10, wherein the scene determination unit comprises: a sample set establishing module, configured to acquire a specified amount of sample voice data, and establish a sample voice data set based on the sample voice data, wherein the sample voice data is marked with voice scenes, and the number of sample voice data of each voice scene is no less than the (See e.g., “…Speech detection and segmentation…by thresholding in the power spectrum. Speech segments of more than 1 second length are extracted…Speaker localization and tracking…by estimating time delays of arrival between microphone pairs using the Phase Transform variant of the Generalized Cross Correlation function… Speaker Identification…component for speaker ID is based on the approach presented in [21]. Speakers are modeled using a 32-component Gaussian Mixture Model (GMM). The inputs to the GMMs are the MFCC coefficients computed on the segmented speech from one audio channel…inputs to the GMMs are the MFCC coefficients computed on the segmented speech from one audio channel. For each speaker, one set of GMMs is trained offline on a 30 second speech segment. The recognition itself is made on segments of 1 to 5 seconds, with longer segments being broken down into smaller ones, to allow for intermediate identification results. Cepstral mean subtraction and feature warping are performed on the audio signal to reduce channel, noise and reverberation effects. The output of the speaker ID module is the identity of the speaker as well as the corresponding GMM’s a-posteriori probability for the analyzed segment, which is used as confidence measure…” See e.g., BERNARDIN, Abstract, §§ 2, 2.1, 2.2, 2.3, 3, Figs. 1, 4, 5); 
a feature vector set establishing module, configured to extract a voice feature according to the sample voice data, and establish a feature vector set based on the extracted voice feature (See e.g., “…by automatically capturing sample images for each subject at different points in the room using the active cameras, and applying the same alignment and decomposition techniques…” and See e.g., “…Speech detection and segmentation…by thresholding in the power spectrum. Speech segments of more than 1 second length are extracted…Speaker localization and tracking…by estimating time delays of arrival between microphone pairs using the Phase Transform variant of the Generalized Cross Correlation function… Speaker Identification…component for speaker ID is based on the approach presented in [21]. Speakers are modeled using a 32-component Gaussian Mixture Model (GMM). The inputs to the GMMs are the MFCC coefficients computed on the segmented speech from one audio channel…inputs to the GMMs are the MFCC coefficients computed on the segmented speech from one audio channel. For each speaker, one set of GMMs is trained offline on a 30 second speech segment. The recognition itself is made on segments of 1 to 5 seconds, with longer segments being broken down into smaller ones, to allow for intermediate identification results. Cepstral mean subtraction and feature warping are performed on the audio signal to reduce channel, noise and reverberation effects. The output of the speaker ID module is the identity of the speaker as well as the corresponding GMM’s a-posteriori probability for the analyzed segment, which is used as confidence measure…” See e.g., BERNARDIN, Abstract, §§ 2, 2.1, 2.2, 2.3, 3, Figs. 1, 4, 5); and 
[a training module, configured to train a decision tree] of the sample voice data set (See e.g., “…Speech detection and segmentation…by thresholding in the power spectrum. Speech segments of more than 1 second length are extracted…Speaker localization and tracking…by estimating time delays of arrival between microphone pairs using the Phase Transform variant of the Generalized Cross Correlation function… Speaker Identification…component for speaker ID is based on the approach presented in [21]. Speakers are modeled using a 32-component Gaussian Mixture Model (GMM). The inputs to the GMMs are the MFCC coefficients computed on the segmented speech from one audio channel…inputs to the GMMs are the MFCC coefficients computed on the segmented speech from one audio channel. For each speaker, one set of GMMs is trained offline on a 30 second speech segment. The recognition itself is made on segments of 1 to 5 seconds, with longer segments being broken down into smaller ones, to allow for intermediate identification results. Cepstral mean subtraction and feature warping are performed on the audio signal to reduce channel, noise and reverberation effects. The output of the speaker ID module is the identity of the speaker as well as the corresponding GMM’s a-posteriori probability for the analyzed segment, which is used as confidence measure…” See e.g., BERNARDIN, Abstract, §§ 2, 2.1, 2.2, 2.3, 3, Figs. 1, 4, 5) according to [the feature vector set, until the actual output value of the decision tree is the same as the ideal output value, and the training is completed.]
BERNARDIN does not explicitly disclose, but SETHI discloses [a training module, configured to train a decision tree] and [the feature vector set, until the actual output value of the decision tree is the same as the ideal output value, and the training is completed]
 (See e.g., “…three training schemes to incorporate soft decision making in a feedforward network representing a tree classifier…” “…branch adaptive implementation of decision trees because the inner links of the network correspond to tree branches… the partitioning layer is forced to adjust its  output during training by varying its gain… node adaptive implementation of decision trees because the neurons in the partitioning layer of the network represent the internal nodes of the decision tree… third scheme is a combination of above two methods in which the outputs of the partitioning layer neurons and the link weights for the AND layer are both adjusted during training… this method as the combined branch and node adaptive implementation. Similar to soft decision trees, the classification decision in all three schemes is made by the AND layer neuron producing the highest output, i.e. the class label associated with the neuron producing the highest output is taken as the tree classifier decision…” See e.g., SETHI, Abstract, §§I, II, III, Figs. 3, 5).
BERNARDIN and SETHI can be considered analogous art because they are from a similar field of endeavor in natural language processing techniques and applications having pattern recognition and learning tasks.  Thus, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of BERNARDIN with SETHI’s techniques comprising see e.g., “…training schemes…representing a tree classifier…” in order to advantageously further compensate for see e.g., “…sensitive[ness] to noise and minor variations in the data…led to the use of soft thresholding in decision trees …three neural implementation schemes for tree classifiers, that allow soft thresholding…,” (See e.g., SETHI, Abstract, §§I, II, III, Figs. 3, 5).

With respect to Claim 14, BERNARDIN discloses:
14. The device of claim 10, wherein the scene determination unit comprises: 
a sample acquiring module, configured to acquire a specified amount of sample voice data (See e.g., “…Speech detection and segmentation…by thresholding in the power spectrum. Speech segments of more than 1 second length are extracted…Speaker localization and tracking…by estimating time delays of arrival between microphone pairs using the Phase Transform variant of the Generalized Cross Correlation function… Speaker Identification…component for speaker ID is based on the approach presented in [21]. Speakers are modeled using a 32-component Gaussian Mixture Model (GMM). The inputs to the GMMs are the MFCC coefficients computed on the segmented speech from one audio channel…” See e.g., BERNARDIN, Abstract, §§ 2, 2.1, 2.2, 2.3, 3, Figs. 1, 4, 5); 
a feature determining module, configured to determine a distribution of the sound source angle, a voice duration distribution, and a voice interval time of the sample voice data (See e.g., “…Speech detection and segmentation…by thresholding in the power spectrum. Speech segments of more than 1 second length are extracted…Speaker localization and tracking…by estimating time delays of arrival between microphone pairs using the Phase Transform variant of the Generalized Cross Correlation function… Speaker Identification…component for speaker ID is based on the approach presented in [21]. Speakers are modeled using a 32-component Gaussian Mixture Model (GMM). The inputs to the GMMs are the MFCC coefficients computed on the segmented speech from one audio channel…inputs to the GMMs are the MFCC coefficients computed on the segmented speech from one audio channel. For each speaker, one set of GMMs is trained offline on a 30 second speech segment. The recognition itself is made on segments of 1 to 5 seconds, with longer segments being broken down into smaller ones, to allow for intermediate identification results. Cepstral mean subtraction and feature warping are performed on the audio signal to reduce channel, noise and reverberation effects. The output of the speaker ID module is the identity of the speaker as well as the corresponding GMM’s a-posteriori probability for the analyzed segment, which is used as confidence measure…” See e.g., BERNARDIN, Abstract, §§ 2, 2.1, 2.2, 2.3, 3, Figs. 1, 4, 5); 
[a decision tree constructing module, configured to construct a decision tree according to the distribution] of the sound source angle, the voice duration distribution, and the voice interval time (See e.g., “…Speech detection and segmentation…by thresholding in the power spectrum. Speech segments of more than 1 second length are extracted…Speaker localization and tracking…by estimating time delays of arrival between microphone pairs using the Phase Transform variant of the Generalized Cross Correlation function… Speaker Identification…component for speaker ID is based on the approach presented in [21]. Speakers are modeled using a 32-component Gaussian Mixture Model (GMM). The inputs to the GMMs are the MFCC coefficients computed on the segmented speech from one audio channel…” See e.g., BERNARDIN, Abstract, §§ 2, 2.1, 2.2, 2.3, 3, Figs. 1, 4, 5); 
a second scene determining module, configured to determine the current voice scene according to [the decision tree constructed] and the voice features extracted (See e.g., “…Speech detection and segmentation…by thresholding in the power spectrum. Speech segments of more than 1 second length are extracted…Speaker localization and tracking…by estimating time delays of arrival between microphone pairs using the Phase Transform variant of the Generalized Cross Correlation function… Speaker Identification…component for speaker ID is based on the approach presented in [21]. Speakers are modeled using a 32-component Gaussian Mixture Model (GMM). The inputs to the GMMs are the MFCC coefficients computed on the segmented speech from one audio channel…inputs to the GMMs are the MFCC coefficients computed on the segmented speech from one audio channel. For each speaker, one set of GMMs is trained offline on a 30 second speech segment. The recognition itself is made on segments of 1 to 5 seconds, with longer segments being broken down into smaller ones, to allow for intermediate identification results. Cepstral mean subtraction and feature warping are performed on the audio signal to reduce channel, noise and reverberation effects. The output of the speaker ID module is the identity of the speaker as well as the corresponding GMM’s a-posteriori probability for the analyzed segment, which is used as confidence measure…” See e.g., BERNARDIN, Abstract, §§ 2, 2.1, 2.2, 2.3, 3, Figs. 1, 4, 5).
BERNARDIN does not explicitly disclose, but SETHI discloses [a decision tree constructing module, configured to construct a decision tree according to the distribution] and [the decision tree constructed] (See e.g., “…three training schemes to incorporate soft decision making in a feedforward network representing a tree classifier…” “…branch adaptive implementation of decision trees because the inner links of the network correspond to tree branches… the partitioning layer is forced to adjust its output during training by varying its gain… node adaptive implementation of decision trees because the neurons in the partitioning layer of the network represent the internal nodes of the decision tree… third scheme is a combination of above two methods in which the outputs of the partitioning layer neurons and the link weights for the AND layer are both adjusted during training… this method as the combined branch and node adaptive implementation. Similar to soft decision trees, the classification decision in all three schemes is made by the AND layer neuron producing the highest output, i.e. the class label associated with the neuron producing the highest output is taken as the tree classifier decision…” See e.g., SETHI, Abstract, §§I, II, III, Figs. 3, 5).
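BERNARDIN and SETHI can be considered analogous art because they are from a similar field of endeavor in natural language processing techniques and applications having pattern recognition and learning tasks.  Thus, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of BERNARDIN with SETHI’s techniques comprising see e.g., “…training schemes…representing a tree classifier…” in order to advantageously further compensate for see e.g., “…sensitive[ness] to noise and minor variations in the data…led to the use of soft thresholding in decision trees …three neural implementation schemes for tree classifiers, that allow soft thresholding…,” (See e.g., SETHI, Abstract, §§I, II, III, Figs. 3, 5).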

9.	Claims 7 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Bernardin et al., (Bernardin, Keni, and Rainer Stiefelhagen. "Audio-visual multi-person tracking and identification for smart environments." Proceedings of the 15th ACM international conference on Multimedia. 2007), in view of Sethi (I. K. Sethi, "Neural implementation of tree classifiers," in IEEE Transactions on Systems, Man, and Cybernetics, vol. 25, no. 8, pp. 1243-1249, Aug. 1995), and further in view of Mizumoto et al., (Mizumoto, T., Nakadai, K., Yoshida, T., Takeda, R., Otsuka, T., Takahashi, T., & Okuno, H. G. (2011, May). Design and implementation of selectable sound separation on the Texai telepresence system using HARK. In 2011 IEEE International Conference on Robotics and Automation (pp. 2130-2137). IEEE.), hereinafter referred to as BERNARDIN, SETHI, and MIZUMOTO.
With respect to Claim 7, BERNARDIN in view of SETHI does not, but MIZUMOTO discloses:
7. The method of claim 1, wherein the step of acquiring the shooting mode corresponding to the current voice scene, and controlling the movement of the camera according to the shooting mode corresponding to the current voice scene comprises: acquiring voice data from a beginning (See 
 e.g., “…a video camera and microphones, the operator looks at and listens to the remote situation around Texai. When a person talks to Texai, the 
localization module detects the direction of the sound, and the /talker node publishes a topic /hark, which consists of time stamp, id, direction-of-arrival, and its power…,” (See e.g., MIZUMOTO, Abstract, §§ I, II, III, Figs. 3, 6); dividing speaking regions according to the voice data acquired, and determining region angles of the speaking regions divided (See e.g. “…When a person talks to Texai, the 
localization module detects the direction of the sound, and the /talker node publishes a topic /hark, which consists of time stamp, id, direction-of-arrival, and its power…specifies two parameters: (1) the center direction of the range to listen to, and (2) the angular width of the range, as shown in the center of Figure 6. From the parameters, the user interface publishes a topic /hark direction which consists of the beginning and the ending angles of user’s interest. Then, a remote user listens to only the sounds from the specified range…,” See e.g., MIZUMOTO, Abstract, §§ I, II, III, Figs. 3, 6); acquiring a sound  source angle of the new voice data when the new voice data are detected (See e.g., “…localization module detects the direction of the sound, and the /talker node publishes a topic /hark, which consists of time stamp, id, direction-of-arrival, and its power…specifies two parameters: (1) the center direction of the range to listen to, and (2) the angular width of the range, as shown in the center of Figure 6. From the parameters, the user interface publishes a topic /hark direction which consists of the beginning and the ending angles of user’s interest. Then, a remote user listens to only the sounds from the specified range…,” See e.g., MIZUMOTO, Abstract, §§ I, II, III, Figs. 3, 6); determining a speaking region to which the sound source angle of the new voice data belongs (See e.g., “…localization module detects the direction of the sound, and the /talker node publishes a topic /hark, which consists of time stamp, id, direction-of-arrival, and its power…specifies two parameters: (1) the center direction of the range to listen to, and (2) the angular width of the range, as shown in the center of Figure 6. From the parameters, the user interface publishes a topic /hark direction which consists of the beginning and the ending angles of user’s interest. 
Then, a remote user listens to only the sounds from the specified range…,” See e.g., MIZUMOTO, Abstract, §§ I, II, III, Figs. 3, 6); controlling a turning angle of the camera according to the region angle of the speaking region (See e.g., “…Through a video camera and microphones, the operator looks at and listens to the remote situation around Texai. When a person talks to Texai, the localization module detects the direction of the sound, and the /talker node publishes a topic /hark, which consists of time stamp, id, direction-of-arrival, and its power. Then, the video conference subscribes the topic and overlays (superimposes) on the video as shown in Figure 6. The direction and the length of line in the center of Figure 6 denotes the direction and the volume of talker, respectively. Next, using two slide bars as shown in the right bottom of Figure3, the operator specifies two parameters: (1) the center direction of the range to listen to, and (2) the angular width of the range, as shown in the center of Figure 6. From the parameters, the user interface publishes a topic /hark direction which consists of the beginning and the ending angles of user’s interest. Then, a remote user listens to only the sounds from the specified range…,” See e.g., MIZUMOTO, Abstract, §§ I, II, III, Figs. 3, 6).
BERNARDIN and SETHI can be considered analogous art because they are from a similar field of endeavor in pattern recognition and processing techniques and applications.  Thus, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of BERNARDIN in view of SETHI with MIZUMOTO’s techniques comprising see e.g., a video-conference implementation with selectable sound separation functions in order to advantageously compensate for “…difficult[ies] recognizing the auditory scene…” by see e.g., “…mode visualizes[ing] the direction-of-arrival of surrounding sounds, while the filter mode provides sounds that originate from the range of directions …,” (See e.g., MIZUMOTO, Abstract, §§ I, II, III, Figs. 3, 6).

With respect to Claim 15, BERNARDIN in view of SETHI does not, but MIZUMOTO discloses:
15. The device of claim 10, wherein the orientation control unit comprises: 

a first voice acquiring module, configured to acquire voice data from a beginning of a video conference to a current moment when a voice scene is the video conference scene (See e.g., “…a video camera and microphones, the operator looks at and listens to the remote situation around Texai. When a person talks to Texai, the localization module detects the direction of the sound, and the /talker node publishes a topic /hark, which consists of time stamp, id, direction-of-arrival, and its power…,” See e.g., MIZUMOTO, Abstract, §§ I, II, III, Figs. 3, 6); 
a region dividing module, configured to divide a speaking region according to the sound source angle acquired of the voice data, and determine an region angle of the speaking region divided (See e.g. “…When a person talks to Texai, the localization module detects the direction of the sound, and the /talker node publishes a topic /hark, which consists of time stamp, id, direction-of-arrival, and its power…specifies two parameters: (1) the center direction of the range to listen to, and (2) the angular width of the range, as shown in the center of Figure 6. From the parameters, the user interface publishes a topic /hark direction which consists of the beginning and the ending angles of user’s interest. Then, a remote user listens to only the sounds from the specified range…,” See e.g., MIZUMOTO, Abstract, §§ I, II, III, Figs. 3, 6); 

a voice detection module, configured to acquire a sound source angle of a new voice data when the new voice data is detected, (See e.g., “…localization module detects the direction of the sound, and the /talker node publishes a topic /hark, which consists of time stamp, id, direction-of-arrival, and its power…specifies two parameters: (1) the center direction of the range to listen to, and (2) the angular width of the range, as shown in the center of Figure 6. From the parameters, the user interface publishes a topic /hark direction which consists of the beginning and the ending angles of user’s interest. Then, a remote user listens to only the sounds from the specified range…,” See e.g., MIZUMOTO, Abstract, §§ I, II, III, Figs. 3, 6); 
(See e.g., “…localization module detects the direction of the sound, and the /talker node publishes a topic /hark, which consists of time stamp, id, direction-of-arrival, and its power…specifies two parameters: (1) the center direction of the range to listen to, and (2) the angular width of the range, as shown in the center of Figure 6. From the parameters, the user interface publishes a topic /hark direction which consists of the beginning and the ending angles of user’s interest. Then, a remote user listens to only the sounds from the specified range…,” See e.g., MIZUMOTO, Abstract, §§ I, II, III, Figs. 3, 6); a first turning control module, configured to control a turning angle of the camera according to the region angle of the speaking region determined (See e.g., “…Through a video camera and microphones, the operator looks at and listens to the remote situation around Texai. When a person talks to Texai, the localization module detects the direction of the sound, and the /talker node publishes a topic /hark, which consists of time stamp, id, direction-of-arrival, and its power. Then, the video conference subscribes the topic and overlays (superimposes) on the video as shown in Figure 6. The direction and the length of line in the center of Figure 6 denotes the direction and the volume of talker, respectively. Next, using two slide bars as shown in the right bottom of Figure3, the operator specifies two parameters: (1) the center direction of the range to listen to, and (2) the angular width of the range, as shown in the center of Figure 6. From the parameters, the user interface publishes a topic /hark direction which consists of the beginning and the ending angles of user’s interest. Then, a remote user listens to only the sounds from the specified range…,” See e.g., MIZUMOTO, Abstract, §§ I, II, III, Figs. 3, 6).
(See e.g., MIZUMOTO, Abstract, §§ I, II, III, Figs. 3, 6).

Allowable Subject Matter
10.	Claims 3, 8, 9, and 16-18 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Conclusion
11.       The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure.  Nishiguchi et al., (Nishiguchi, S., Higashi, K., Kameda, Y., & Minoh, M. (2003, July). A sensor-fusion method for detecting a speaking student. In 2003 International Conference on Multimedia and Expo. ICME'03. Proceedings (Cat. No. 03TH8698) (Vol. 1, pp. I-129). IEEE.) discloses, see e.g., “…detecting the location of the speaker that is a target of automatic video filming in distance learning and lecture archive…a face of a speaking student is filmed in a lecture video…to detect the location of a speaker. An acoustic sensor such as a microphone array is used widely to detect the location of a sound source. However, it is difficult to detect the location of a…” (See e.g., Nishiguchi et al., Abstract). 
Please, see additional references in form PTO-892 for more details.
12.	Any inquiry concerning this communication or earlier communications from the examiner should be directed to Edgar Guerra-Erazo, whose telephone number is (571) 270-3708.  The examiner can normally be reached M-F, 7:30 a.m.-5:00 p.m. EST. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Bhavesh Mehta, can be reached at (571) 272-7453.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. 
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov.  Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. 
/EDGAR X GUERRA-ERAZO/
Primary Examiner, Art Unit 2656