DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant's arguments filed 07/12/2022 with respect to the rejection under 35 U.S.C. 102 have been fully considered but they are not persuasive.
Applicant presents arguments similar to those filed on 12/06/21: “However, there is nothing in Solomon that would fairly teach or suggest to identify each respective user of a plurality of users existing in a respective predetermined angular direction as a speaker whose utterance is to be received in the manner particularly recited, let alone such identification of the respective user existing in the respective predetermined angular direction being based on image and voice obtained in an environment where the plurality of users exist at a time of the utterance. As such, it is simply not possible for Solomon to satisfy the elements particularly recited by amended independent claim 1.” Examiner respectfully disagrees.
First, Solomon teaches identifying each respective user of a plurality of users existing in a respective predetermined angular direction as a speaker (Solomon teaches in [0179], In some examples, the system may track multiple conversations that are occurring simultaneously or otherwise overlapping, and may interact with participants in each conversation as appropriate for each conversation. [0198] FIG. 7 schematically illustrates an example entity tracker 100 that may comprise a component of the intelligent assistant system 20. Entity tracker 100 may be used to determine an identity, position, and/or current status of one or more entities within range of one or more sensors. [0214], By combining the data from the camera with the data from the microphone, the entity tracker 100 may identify the person with a higher confidence value than would be possible using the data from either sensor alone. e.g. in addition to spoken utterances, additional user input data is utilized including image data and context information including data related to an identity, position and status based on received sensory data. Regarding the “angular direction”, Solomon teaches in [0208] The reported entity position 114 for a detected entity may correspond to the entity's geometric center, a particular part of the entity that is classified as being important (e.g., the head of a human), a series of boundaries defining the borders of the entity in three-dimensional space, etc. The position identifier 106 may further calculate one or more additional parameters describing the position and/or orientation of a detected entity, such as a pitch, roll, and/or yaw parameter. In other words, the reported position of a detected entity may have any number of degrees-of-freedom, and may include any number of coordinates defining the position of the entity in an environment.) Examiner notes that at least these parameters are indicative of the claimed “angular direction”.
In response to the amendments, Examiner has applied an additional reference.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-21 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being incomplete for omitting essential steps, such omission amounting to a gap between the steps. See MPEP § 2172.01. The omitted steps in claim 1 are at least those of S52-S54 of Fig. 9 and S71-S74 of Fig. 10. Without the specific steps, the current amendment language in the paragraph “wherein the speaker identification unit identifies…” is unclear, since “the upper limit of the plurality of users” could be interpreted as a variable that depends on the number of the plurality of users. For example, in one session, if there are 5 users, then “the upper limit of the plurality of users” would be 5, according to the claim language. And in another session, if there are 10 users, then “the upper limit of the plurality of users” would become 10. The language could also be interpreted such that the identification of users is performed in order of probability of occurrence of an utterance by each respective user for the 5 users in the first session and for the 10 users in the second session, but the instant claim language would not consider disconnecting or removing the users with the lowest probability of occurrence in order to make room for other users. Therefore, Examiner submits that the current claim language does not distinctly claim the invention, and clarification is required.
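For illustration only, the two readings of “the upper limit of the plurality of users” discussed above can be contrasted in a short Python sketch. The function names, the `prob` values, and the cap of 3 are hypothetical and are not taken from the claims or the specification. The sketch merely shows that a limit which follows the head count never forces a user to be dropped, whereas a fixed limit does.

```python
def identify_variable_limit(users):
    """Reading 1: the 'upper limit' equals however many users are present."""
    limit = len(users)  # 5 users -> limit 5, 10 users -> limit 10
    ranked = sorted(users, key=lambda u: u["prob"], reverse=True)
    return ranked[:limit]  # nobody is ever dropped


def identify_fixed_limit(users, limit):
    """Reading 2: the 'upper limit' is a fixed cap (hypothetical value)."""
    ranked = sorted(users, key=lambda u: u["prob"], reverse=True)
    return ranked[:limit]  # the least likely speakers fall off the end


# Hypothetical session with 5 users and made-up utterance probabilities.
session = [{"name": f"user{i}", "prob": p}
           for i, p in enumerate([0.9, 0.2, 0.6, 0.4, 0.1])]
kept_variable = identify_variable_limit(session)
kept_fixed = identify_fixed_limit(session, limit=3)
```

Under the first reading all 5 users remain identified; under the second only the three most probable speakers do. The claim language as amended does not say which behavior is intended.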
Independent claims 18 and 19 are rejected for the same reasons.
Dependent claims are similarly rejected since they are dependent on the above claims 1, 18, and 19.


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-21 are rejected under 35 U.S.C. 103 as being unpatentable over Solomon et al. (hereinafter Sol; U.S. Patent Application Publication 2018/0232662) in view of Yin et al. (hereinafter Yin; US 2018/0373992).
Regarding Claim 1, Sol discloses:
An information processing apparatus (e.g. system of Figs. 1 and 2) comprising:
a voice acquisition unit configured to obtain the voice in the environment where a plurality of users exist, wherein the voice acquisition unit is implemented via at least one microphone (e.g. note example sensors such as microphones; par 210; Using data received from sensors, the intelligent assistant system may track and/or communicate with one or more users or other entities; par 41; In some examples, the context information 110 may include entity identity data 112, entity position data 114, and entity status data 116 for one or more users or other entities detected simultaneously; par 252)
a speaker identification unit (e.g. person identifier 105, entity identifier 104 of entity tracker 100 of Fig. 7) configured to identify each respective user of the plurality of users existing in a (e.g. note part of entity tracker used to determine an identity, position and/or current status of one or more entities within range of one or more sensors; para 198; note various types of entities in para 204, including “user”) respective predetermined angular direction (e.g. entity tracker determining position and/or orientation of a detected entity, such as pitch, roll and/or yaw parameter; para 208; note these parameters are indicative of the claimed “angular direction”) as a speaker whose utterance is to be received based on an image and the voice obtained in the environment where the respective user exists at a time of the utterance ([0200] Entity tracker 100 receives sensor data from one or more sensors 102, such as sensor A 102A, sensor B 102B, and sensor C 102C, though it will be understood that an entity tracker may be used with any number and variety of suitable sensors. As examples, sensors usable with an entity tracker may include cameras (e.g., visible light cameras, UV cameras, IR cameras, depth cameras, thermal cameras), microphones, … and/or any other sensors or devices that collect and/or store information pertaining to the identity, position, and/or current status of one or more people or other entities; [0214], By combining the data from the camera with the data from the microphone, the entity tracker 100 may identify the person with a higher confidence value than would be possible using the data from either sensor alone; e.g. in addition to spoken utterances, additional user input data is utilized including image data and context information including data related to an identity, position and status based on received sensory data; para 103; contextual data including date, time of day etc; para 119; In some examples, the system may track multiple conversations that are occurring simultaneously or otherwise overlapping, and may interact with participants in each conversation as appropriate for each conversation; par 179; see also various sensors used by entity tracker in para 200; note user pointing and speaking; para 151; and see example of camera suggesting person in a kitchen, microphone suggesting hallway; para 216; and see para 221 as well); and
a semantic analysis unit configured to perform semantic analysis of the utterance of the respective identified speaker existing in the respective predetermined angular direction to output a response to a request of the respective identified speaker ([0230] Accordingly, the entity tracker 100 may use a variety of audio processing techniques to more confidently identify a particular active participant who is engaged in a conversation…with the intelligent assistant system 20. e.g. parser utilizing a plurality of intent templates that may be filled with words or terms received from the voice listener by examining a semantic meaning; para 66, 81 – 83; note also the examples of the system broadcasting a response to the user as well; para 99, entire doc),
wherein the speaker identification unit and the semantic analysis unit are each implemented via at least one processor (e.g. note various components embodied and executed by one or more processors of a computing device; para 48, entire doc).
Sol discloses in [0036] In some examples, data from one or more sensors also may be utilized to process the natural language inputs and/or user intentions. Such data may be processed to generate identity, location/position, status/activity, and/or other information related to one or more entities within range of a sensor. Statistical probabilities based on current and past data may be utilized to generate confidence values associated with entity information, and in [0270], The threshold data 820 may include an entity identification threshold 822, an entity position/location threshold 824, and an entity status threshold 826. Each of these thresholds may be defined as a probability. When an entity identity, location, or status is determined to have a detection probability that exceeds the threshold probability for that entity identity, location, or status, a detection of that entity identity, location, or status may be indicated and/or recorded; and in [0276] At 902 the method 900 may include receiving a set of threshold data. The threshold data may include one or more probability thresholds above which a detection of a user, user location, or user activity may be registered. Sol’s detection probability can correspond to the claimed “probability of occurrence of an utterance”, especially since Applicant’s disclosure also states [0126] At this time, the tracking unit 74 estimates a user having the lowest probability of uttering on the basis of at least any of the image from the imaging unit 71 and the sensing information from the sensing unit 73. For example, on the basis of the image from the imaging unit 71, the tracking unit 74 estimates the user existing at a most distant position from the home agent 20 as the user with the lowest probability of uttering, and terminates the tracking of the face of the user. Therefore, Examiner notes that Sol teaches identification of users based on a respective probability of occurrence of an utterance by each user.
Sol does not explicitly detail wherein the speaker identification unit identifies the plurality of users according to an upper limit of the plurality of users.
Yin discloses an analogous sensor-based decision making system that comprises: accessing, by one or more processors, sensor data that includes information regarding an area; disregarding, by the one or more processors, a portion of the sensor data that corresponds to objects outside of a region of interest; identifying, by the one or more processors, a plurality of objects from the sensor data; assigning, by the one or more processors, a priority to each of the plurality of objects; based on the priorities of the objects, selecting, by the one or more processors, a subset of the plurality of objects (abstract).
Yin teaches wherein the speaker identification unit identifies the plurality of users according to an upper limit of the plurality of users. ([0077] If the number of objects within the region of interest exceeds the size of the fixed-length list, a predetermined number of objects may be selected for inclusion in this list based on their proximity to the autonomous system, their speed, their size, their type (e.g., pedestrians may have a higher priority for collision avoidance than vehicles), or any suitable combination thereof. The predetermined number may correspond to the fixed length of the list of data structures. Filtering objects by priority is termed “object-aware filtering,” because the filtering takes into account attributes of the object beyond just the position of the object. [0078] In example embodiments in which a predetermined number of objects are used as a uniform representation, the predetermined number of objects having the highest priority may be selected for inclusion in the uniform representation. In example embodiments in which a fixed-size image is used as a uniform representation, a predetermined number of objects having the highest priority may be represented in the fixed size image or objects having a priority above a predetermined threshold may be represented in the fixed size image. [0080] In some example embodiments, the threshold priority at which objects will be represented is dynamic. An algorithm to compute the threshold may be rule-based, machine learning-based, or any suitable combination thereof. Input to the algorithm may include one or more factors (e.g., attributes of detected objects, attributes of the autonomous system, attributes of the environment, or any suitable combination thereof). Output from the algorithm may be in the form of a threshold value.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to incorporate the selection and detection of a predetermined number of objects based on priority, as taught by Yin, into the intelligent assistant system of Sol, because doing so would have provided a system with a reduced processing power requirement, resulting in improved efficiency ([0108] of Yin).
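As a minimal sketch of the fixed-length, priority-based selection described in [0077]-[0078] of Yin, the following Python fragment keeps at most a predetermined number of detected objects and prefers those with the highest priority. The class name, the `priority` field, and the sample values are hypothetical; Yin describes priority as depending on proximity, speed, size, and type but the sketch does not attempt to reproduce how Yin computes it.

```python
import heapq
from dataclasses import dataclass


@dataclass
class DetectedObject:
    label: str
    priority: float  # hypothetical score standing in for Yin's priority


def select_fixed_length(objects, max_len):
    """Return at most max_len objects, preferring the highest priority."""
    if len(objects) <= max_len:
        return list(objects)  # the list is not full, so nothing is filtered
    # nlargest returns the max_len highest-priority objects, highest first.
    return heapq.nlargest(max_len, objects, key=lambda o: o.priority)


# Made-up detections; the cap of 2 plays the role of the fixed list length.
detections = [
    DetectedObject("pedestrian", 0.9),
    DetectedObject("vehicle", 0.5),
    DetectedObject("cyclist", 0.7),
]
selected = select_fixed_length(detections, max_len=2)
```

Applied to users instead of objects, the same pattern caps the number of users that are identified, which is the sense in which the fixed list length is read on the claimed “upper limit”.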

Regarding Claim 2, in addition to the elements stated above regarding claim 1, Sol of the combination further discloses:
wherein, in a case where a face of the user detected in the image is being tracked in the angular direction (e.g. entity tracker 100, as detailed above; note use of yaw/pitch/roll; para 208 of Sol) in which a voice session for performing a dialogue with the user is generated (e.g. see example conversation in para 68, 230 of Sol), the speaker identification unit identifies the user as the speaker (e.g. note the entity identifier 104 may record images of a person's face, and associate these images with recorded audio of the person's voice; para 205, see also example in para 216, 221, 227, 228, note losing track of a person if he covers his face; see also entity tracker using techniques to more confidently identify a person who is engaged in a conversation with the intelligent assistant system 20 in para 230 as well of Sol).

Regarding Claim 3, in addition to the elements stated above regarding claim 2, Sol of the combination further discloses:
a tracking unit configured to track the face of the user detected in the image (e.g. using camera data, the entity tracker 100 may identify a particular person; para 232 of Sol); and
a voice session generation unit configured to generate the voice session in the angular direction in which a trigger for starting the dialogue with the user has been detected (e.g. one or more functions activated upon detection of one or more keywords that are spoken by a user; para 296 of Sol; note also context information 110 may be utilized by voice listener 30 when interpreting human speech or activating functions in response to a keyword trigger; para 211 of Sol),
wherein the tracking unit and the voice session generation unit are each implemented via at least one processor (e.g. note various components embodied and executed by one or more processors of a computing device; para 48, entire doc of Sol).

Regarding Claim 4, in addition to the elements stated above regarding claim 3, Sol of the combination further discloses:
wherein the speaker identification unit identifies the speaker based on the image, the voice, and sensing information obtained by sensing in the environment (e.g. note the entity identifier 104 may record images of a person's face, and associate these images with recorded audio of the person's voice; para 205, see also example in para 216, 221, 227, 228 of Sol, note losing track of a person if he covers his face; see also entity tracker using techniques to more confidently identify a person who is engaged in a conversation with the intelligent assistant system 20 in para 220 as well of Sol).

Regarding Claim 5, in addition to the elements stated above regarding claim 4, Sol of the combination further discloses:
wherein the trigger is detected based on at least any of the image, the voice, and the sensing information (e.g. one or more functions activated upon detection of one or more keywords that are spoken by a user; para 296 of Sol; note also context information 110 may be utilized by voice listener 30 when interpreting human speech or activating functions in response to a keyword trigger; para 211 of Sol; note further examples including signals such as gestures captured by a camera, face direction, etc.; para 321, 324 of Sol).

Regarding Claim 6, in addition to the elements stated above regarding claim 5, Sol of the combination further discloses:
wherein the trigger is an utterance of a predetermined word detected from the voice (e.g. activating functions in response to a keyword trigger; para 211 of Sol; further note functions activated upon detection of one or more keywords; para 296 and spoken keywords; para 324, Fig. 21 of Sol).

Regarding Claim 7, in addition to the elements stated above regarding claim 5, Sol of the combination further discloses:
wherein the trigger is predetermined operation detected from the image (e.g. using camera data, the entity tracker 100 may identify a particular person and determine that the person's lips are moving; para 232 of Sol; note also captured video may indicate lip movement of a user that may be used to associate a spoken keyword with the user; para 324 of Sol).

Regarding Claim 8, in addition to the elements stated above regarding claim 3, Sol of the combination further discloses:
wherein, in a case where the trigger has been detected in the angular direction different from the angular direction in which N voice sessions are being generated in a state where the N voice sessions are being generated (e.g. the system may track multiple conversations that are occurring simultaneously or otherwise overlapping, and may interact with participants in each conversation as appropriate for each conversation; para 179 of Sol; and context information, including entity position and status, is used to determine whether a particular commitment should be executed, note further utilization when activating functions in response to a keyword trigger; para 211 of Sol; in other words, tracking multiple conversations of multiple users, responding accordingly in terms of their positional information for the triggers, i.e. different positions, or “angular directions”), the voice session generation unit terminates the voice session estimated to have a lowest probability of occurrence of the utterance out of the N voice sessions (note that each user, location, and activity included in the entity identity data 112, entity position data 114, and entity status data 116 may have an associated estimate of a probability that that user, location, or activity was correctly identified; para 252; and In an environment with multiple users, such indicators also may identify the particular user who is addressing a device; para 324 of Sol; and note filtering sensor data when confidence values are below a threshold; para 227; consistently identify speech from particular people and ignore background noise; para 233; and finally see the examples of aggregating metrics based on speaker ID and keyword confidence in order to rank and select messages; see Fig. 22 and its corresponding description, paras 308+ of Sol).
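For clarity of the record, the behavior recited in claim 8 can be pictured with the following minimal Python sketch. It illustrates the claim language only and is not asserted to be code from Sol or from Applicant's disclosure. The function name, the dictionary layout, and the starting probability of 1.0 for a newly opened session are assumptions made for the example.

```python
def handle_trigger(sessions, direction, max_sessions):
    """Open a voice session for `direction`, evicting the least likely if full.

    `sessions` maps an angular direction (degrees) to a hypothetical estimate
    of the probability that an utterance will occur in that session.
    Returns the direction whose session was terminated, or None.
    """
    if direction in sessions:
        return None  # a session already exists in this direction
    terminated = None
    if len(sessions) >= max_sessions:
        # Terminate the session with the lowest estimated probability.
        terminated = min(sessions, key=sessions.get)
        del sessions[terminated]
    sessions[direction] = 1.0  # assumption: a fresh trigger starts at 1.0
    return terminated


# N = 3 sessions already exist; a trigger arrives from a fourth direction.
active = {0: 0.7, 90: 0.2, 180: 0.5}
dropped = handle_trigger(active, 270, max_sessions=3)
```

In this example the session at 90 degrees has the lowest estimate and is the one terminated, leaving sessions at 0, 180, and 270 degrees.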

Regarding Claim 9, in addition to the elements stated above regarding claim 8, Sol of the combination further discloses:
wherein the voice session generation unit estimates the voice session having the lowest probability of occurrence of the utterance based on at least any of the image, the voice, and the sensing information (e.g.  in addition to spoken utterances, additional user input data is utilized including image data and context information including data related to an identity, position and status based on received sensory data; para 113; see also various sensors used by entity tracker in para 200; Each user, location, and activity included in the entity identity data 112, entity position data 114, and entity status data 116 may have an associated estimate of a probability that that user, location, or activity was correctly identified; para 252)

Regarding Claim 10, in addition to the elements stated above regarding claim 9, Sol of the combination further discloses:
wherein the voice session generation unit terminates the voice session having an earliest utterance detection time, based on the voice (e.g. note confidence decay functions applied to sensor data.. as time has passed… eventually reaching 0%; paras 222 – 225; note requirement of a confidence value exceeding a predetermined threshold; para 233; and see Fig. 16 as well regarding threshold processing, in particular probability thresholds above which a detection of a user/location/activity may be registered; para 276).
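The confidence decay and threshold processing cited above can be sketched as follows. This is a minimal illustration that assumes a linear decay; Sol's paras 222 – 225 are cited only for the proposition that confidence falls over time and eventually reaches 0%, and the particular decay shape, function names, and numeric values here are hypothetical.

```python
def decayed_confidence(initial, elapsed_s, lifetime_s):
    """Linearly decay a confidence value to 0 over lifetime_s seconds.

    The linear shape is an assumption made for this sketch.
    """
    if lifetime_s <= 0:
        raise ValueError("lifetime_s must be positive")
    remaining = max(0.0, 1.0 - elapsed_s / lifetime_s)
    return initial * remaining


def is_detected(confidence, threshold):
    """Register a detection only when confidence exceeds the threshold."""
    return confidence > threshold


# Two hypothetical sessions: the older utterance has decayed further.
recent = decayed_confidence(0.8, elapsed_s=15, lifetime_s=60)
oldest = decayed_confidence(0.8, elapsed_s=60, lifetime_s=60)
```

Under these assumptions the session with the earliest utterance detection time carries the lowest confidence, falls below any positive threshold first, and is therefore the session that would be dropped.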

Regarding Claim 11, in addition to the elements stated above regarding claim 8, Sol of the combination further discloses:
wherein, in a case where the face has been detected in the angular direction different from the angular direction in which M faces are being tracked in a state where the M faces are being tracked, the tracking unit terminates the tracking of the face of the user estimated to have the lowest probability of occurrence of the utterance out of the M faces being tracked (e.g. Each user, location, and activity included in the entity identity data 112, entity position data 114, and entity status data 116 may have an associated estimate of a probability that that user, location, or activity was correctly identified; para 252; note the entity identifier 104 may record images of a person's face, and associate these images with recorded audio of the person's voice; para 205; and In an environment with multiple users, such indicators also may identify the particular user who is addressing a device; para 324; and note filtering sensor data when confidence values are below a threshold; para 227; consistently identify speech from particular people and ignore background noise; para 233; and finally see the examples of aggregating metrics based on speaker ID and keyword confidence in order to rank and select messages; see Fig. 22 and its corresponding description, paras 308+; note confidence decay functions applied to sensor data.. as time has passed… eventually reaching 0%; paras 222 – 225; note requirement of a confidence value exceeding a predetermined threshold; para 233; and see Fig. 16 as well regarding threshold processing, in particular probability thresholds above which a detection of a user/location/activity may be registered; para 276).

Regarding Claim 12, in addition to the elements stated above regarding claim 11, Sol of the combination further discloses:
wherein the tracking unit estimates the user having the lowest probability of occurrence of the utterance based on at least any of the image, and the sensing information (e.g.  in addition to spoken utterances, additional user input data is utilized including image data and context information including data related to an identity, position and status based on received sensory data; para 113; see also various sensors used by entity tracker in para 200; Each user, location, and activity included in the entity identity data 112, entity position data 114, and entity status data 116 may have an associated estimate of a probability that that user, location, or activity was correctly identified; para 252).

Regarding Claim 13, in addition to the elements stated above regarding claim 12, Sol of the combination further discloses:
wherein the tracking unit terminates tracking of the face of the user existing at a most distant position based on the image (e.g. in addition to spoken utterances, additional user input data is utilized including image data and context information including data related to an identity, position and status based on received sensory data; para 113; see also various sensors used by entity tracker in para 200; and note the system tracking as the user changes her location and moves farther away from the first device and corresponding device changes; para 322; note the entity identifier 104 may record images of a person's face, and associate these images with recorded audio of the person's voice; para 205).

Regarding Claim 14, in addition to the elements stated above regarding claim 11, Sol of the combination further discloses:
wherein a number M of the faces tracked by the tracking unit and a number N of the voice sessions generated by the voice session generation unit are same (e.g. the system may track multiple conversations that are occurring simultaneously or otherwise overlapping, and may interact with participants in each conversation as appropriate for each conversation; para 179; note the entity identifier 104 may record images of a person's face, and associate these images with recorded audio of the person's voice; para 205)

Regarding Claim 15, in addition to the elements stated above regarding claim 1, Sol of the combination further discloses:
a voice recognition unit configured to perform voice recognition of the utterance of the identified speaker (e.g. voice listener 30 receives audio data and utilizes speech recognition functionality to translate spoken utterances into text; para 46);
wherein the semantic analysis unit uses a result of the voice recognition on the utterance and performs the semantic analysis  (e.g. parser utilizing a plurality of intent templates that may be filled with words or terms received from the voice listener by examining a semantic meaning; para 66, 81 - 83), and
wherein the voice recognition unit is implemented via at least one processor (e.g. note various components embodied and executed by one or more processors of a computing device; para 48, entire doc).

Regarding Claim 16, in addition to the elements stated above regarding claim 1, Sol of the combination further discloses:
a response generation unit configured to generate a response to the request of the speaker, wherein the response generation unit is implemented via at least one processor (e.g. note example system response; para 154; and note message generated by the system in response to the speech; para 298, 303)

Regarding Claim 17, in addition to the elements stated above regarding claim 1, Sol of the combination further discloses:
an imaging unit configured to obtain the image in the environment, wherein the imaging unit is implemented via at least one camera (e.g. note example sensors such as cameras; par 200).
Claim 18 is rejected under the same grounds as claims 1, 5, 16 and 17 above.

Claim 19 is rejected under the same grounds as claims 1 and 3 above.

Claim 20 is rejected under the same grounds as claims 1 and 3 above.

Regarding Claim 21, in addition to the elements stated above regarding claim 1, Sol of the combination further discloses:
wherein the semantic analysis unit performs the semantic analysis of each utterance of voice obtained from the plurality of users to output respective responses to requests of the plurality of users in the respective predetermined angular direction of each respective identified speaker ([0179], In some examples, the system may track multiple conversations that are occurring simultaneously or otherwise overlapping, and may interact with participants in each conversation as appropriate for each conversation. e.g. parser utilizing a plurality of intent templates that may be filled with words or terms received from the voice listener by examining a semantic meaning; para 66, 81 – 83).
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to THOMAS H MAUNG whose telephone number is (571)270-5690.  The examiner can normally be reached on Monday-Friday, 9am-6pm, EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vivian Chin can be reached on 1-(571) 272-7848.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/THOMAS H MAUNG/           Primary Examiner, Art Unit 2654