DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 3/21/2022 has been entered.
Response to Amendment
In response to the Office action dated 12/21/2021, the applicant filed an RCE on 3/21/2022, amending claims 5-6 and 14-15 and traversing the prior art and 112(b) rejections. Applicant’s arguments have been fully considered and are persuasive with respect to dependent claims 4 and 13. Therefore, claims 1-3, 5-12, and 14-20, as amended by the examiner’s amendment below, are allowable over the prior art for the reasons for allowance provided below.
EXAMINER’S AMENDMENT
The examiner has changed the title of the invention to “METHODS, APPARATUSES, SYSTEMS, DEVICES, AND COMPUTER-READABLE STORAGE MEDIA FOR PROCESSING SPEECH SIGNALS BASED ON HORIZONTAL AND PITCH ANGLES AND DISTANCE OF A SOUND SOURCE RELATIVE TO A MICROPHONE ARRAY”, so as to be more descriptive of the invention.
An examiner’s amendment to the record appears below. Should the changes and/or additions be unacceptable to applicant, an amendment may be filed as provided by 37 CFR 1.312. To ensure consideration of such an amendment, it MUST be submitted no later than the payment of the issue fee.
Authorization for this examiner’s amendment was given in an interview with the attorney on file, Mr. George Zalepa on 5/27/2022.
Amend claims 1, 5, 10, 14, and 19; cancel claims 4 and 13.

As Per Claim 1:

1. A method comprising: 
performing a facial recognition analysis on an image, the image including a user; 
detecting, based on the facial recognition analysis, a period during which the user makes speech sounds, the period including a start point indicative of a time when the user starts to make speech sounds; 
in response to detecting the start point, locating a sound source in an audio signal received by a microphone array; 
determining orientation data of the sound source; and 
based on the period and the orientation data, performing a speech sound start and end point analysis to determine a start point and end point of the speech sounds in the audio signal;  
wherein the orientation data comprises a horizontal angle, a pitch angle, and a distance of the sound source relative to the microphone array.

As Per Claim 4:

Cancel.

As Per Claim 5:

5.  The method of claim 1, the performing the speech sound start and end point analysis comprising: 
based on the orientation data, determining a speech sound receiving range of the microphone array and acquiring the audio signal in the speech sound receiving range; 
calculating a speech sound existence probability of the audio signal in the speech sound receiving range; and
performing the speech sound start and end point analysis according to the speech sound existence probability being greater than a preset probability threshold.

As Per Claim 10:

10.  A non-transitory computer readable storage medium for tangibly storing computer program instructions capable of being executed by a computer processor, the computer program instructions defining the steps of: 
performing a facial recognition analysis on an image, the image including a user; 
detecting, based on the facial recognition analysis, a period during which the user makes speech sounds, the period including a start point indicative of a time when the user starts to make speech sounds; 
in response to detecting the start point, locating a sound source in an audio signal received by a microphone array; 
determining orientation data of the sound source; and 
based on the period and the orientation data, performing a speech sound start and end point analysis to determine a start point and end point of the speech sounds in the audio signal;
wherein the orientation data comprises a horizontal angle, a pitch angle, and a distance of the sound source relative to the microphone array.

As Per Claim 13:

Cancel.

As Per Claim 14:

14. (Currently Amended) The computer readable storage medium of claim 10, the performing the speech sound start and end point analysis comprising: 
based on the orientation data, determining a speech sound receiving range of the microphone array and acquiring the audio signal in the speech sound receiving range; 
calculating a speech sound existence probability of the audio signal in the speech sound receiving range; and
performing the speech sound start and end point analysis according to the speech sound existence probability being greater than a preset probability threshold.

As Per Claim 19:

19. (Previously Presented) An apparatus comprising: 
a processor; and
a storage medium for tangibly storing thereon program logic for execution by the processor, the stored program logic comprising: 
logic, executed by the processor, for performing a facial recognition analysis on an image, the image including a user, 
logic, executed by the processor, for detecting, based on the facial recognition analysis, a period during which the user makes speech sounds, the period including a start point indicative of a time when the user starts to make speech sounds, 
logic, executed by the processor and in response to detecting the start point, for locating a sound source in an audio signal received by a microphone array, 
logic, executed by the processor, for determining orientation data of the sound source, and 
logic, executed by the processor, for, based on the period and the orientation data, performing a speech sound start and end point analysis to determine a start point and end point of the speech sounds in the audio signal;
wherein the orientation data comprises a horizontal angle, a pitch angle, and a distance of the sound source relative to the microphone array.
Allowable Subject Matter
The following is an examiner’s statement of reasons for allowance: Independent claims 1, 10 and 19 recite a method, a computer readable storage medium, and an apparatus, respectively, associated with a user-machine dialog system in which the machine uses image and speech data of the user synchronously to determine appropriate responses to the user. The claimed process begins by performing a facial recognition analysis on the user’s image, based on which it detects a period during which the user makes speech sounds, the period including a start point at which the user begins making speech sounds. In response to detecting the start point, it locates a sound source in the audio signal received by a microphone array. It then determines orientation data comprising a horizontal angle, a pitch angle and a distance of the sound source relative to the microphone array; as Fig. 3 of the disclosure shows, the distance, horizontal angle and pitch angle correspond to the user’s radial distance, azimuthal angle and polar angle in a spherical coordinate system centered at the microphone array. Finally, based on this period and the orientation data, it determines a start point and an end point of the speech sounds in the audio signal.
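For illustration only, the orientation data discussed above can be read as a point in a spherical coordinate system centered at the microphone array. The sketch below is not taken from the application or from Fig. 3: the function name, the axis layout, and the choice to measure the pitch angle as elevation above the horizontal plane of the array are all assumptions. If the pitch angle is instead measured from the vertical axis (a polar angle in the strict sense), the sine and cosine of the pitch angle would be swapped.

```python
import math

def orientation_to_cartesian(distance, horizontal_deg, pitch_deg):
    """Convert (distance, horizontal angle, pitch angle) to (x, y, z).

    Hypothetical helper. Assumes the microphone array sits at the origin,
    the horizontal angle is an azimuth in the x-y plane, and the pitch
    angle is an elevation measured up from that plane.
    """
    azimuth = math.radians(horizontal_deg)
    elevation = math.radians(pitch_deg)
    x = distance * math.cos(elevation) * math.cos(azimuth)
    y = distance * math.cos(elevation) * math.sin(azimuth)
    z = distance * math.sin(elevation)
    return x, y, z
```

Under these assumptions, a source 2 m directly in front of the array at array height (horizontal angle 0, pitch angle 0) maps to the point (2, 0, 0).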
The prior art of record, Nakadai et al. (US 2004/0104702), does teach an interactive human-robot dialog system that uses audio and visual data in sync, defined as the “visuoauditory servo of the robot”. Specifically, in one example according to ¶ 0217: “At a time instant t1, Mr. A” (a user) “enters a visual field of the robot 10. The vision module 30 detects the face” (facial image is recognized) “of MR. A to form a visual event for him”. Then according to ¶ 0218 sentence 1: “At a time instant t2, Mr. A begins to talk” (a period beginning at time “t2” (a start point when the user “begins to talk” or starts to make speech sounds) and ending eventually at time “t4” (¶ 0220) is detected, during which the user makes the speech sounds). Nakadai also uses a “microphone” array consisting of two microphones inserted into the ears of its robot; i.e., according to ¶ 0037 lines 2+: “the audition module receives sounds collected by microphones” (audio signal received by a microphone array) “from external objects as sound sources and extracts pitches from the collected sounds utilizing their harmonic structures to find the directions” (locating the sound source) “in which the sound sources exist” (e.g. of “Mr. A” as he “begins to” “talk” (in response to detecting the start point)).
Finally, according to ¶ 0220 lines 1-3: “At a time instant t4” (based on the period which began at “t2” (start point)) “Mr. A upon moving hides himself into the shade. This causes the vision module 30 to cease forming the visual event” “causes the visual stream for him to break off” (i.e., “t4” becomes designated as the end of the period). Although the visual stream called “visual event” depends on “position (distance r, horizontal angle θ and vertical angle ɸ” (¶ 0150), these parameters are NOT taught to correspond to the user’s spherical coordinates relative to the “microphones” (i.e. microphone array) of the robot. As such, they are not directly associated with the location of the user while he was making a specific sound. Therefore the period in Nakadai between “t2” and “t4” does not correspond to a specific period associated with speech or sound made by the user and will not be exclusively identified with a speech or sound period. It thus will not be an appropriate period in which to “perform audio enhancement” or voice activity detection for the purpose of “speech recognition”, and will not lead to “reduc[ing] the computation load and processing time of said systems” as disclosed in specification ¶ 0003.
Further search did not produce any prior art teaching this combination of features, and therefore these claims are allowable. Claims 2-3, 5-9 (dependent on claim 1), 11-12, 14-18 (dependent on claim 10), and claim 20 (dependent on claim 19) further narrow the scope of their allowed parent claims and are thus allowable under similar rationale.
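By way of a non-limiting illustration, the thresholding recited in dependent claim 5 (speech sound start and end point analysis where the speech sound existence probability exceeds a preset probability threshold) can be sketched as follows. The per-frame probability list, the frame indices standing in for the visually detected period, the default threshold of 0.5, and the function name are all hypothetical; the record does not specify how the probabilities are computed, and restricting the search to the visually detected period is only one possible reading of "based on the period".

```python
def find_speech_endpoints(probabilities, visual_start, visual_end, threshold=0.5):
    """Return (start, end) frame indices of speech, or None if none is found.

    Hypothetical sketch: `probabilities` holds one speech existence
    probability per audio frame, already limited to the receiving range
    chosen from the orientation data. Only frames inside the visually
    detected period [visual_start, visual_end] are examined.
    """
    last = min(visual_end, len(probabilities) - 1)
    active = [
        i for i in range(visual_start, last + 1)
        if probabilities[i] > threshold
    ]
    if not active:
        return None
    # First and last frames above the threshold bound the speech segment.
    return active[0], active[-1]
```

With frame probabilities [0.1, 0.2, 0.9, 0.8, 0.3, 0.7, 0.1] and a visually detected period spanning frames 1 through 6, the sketch returns frames 2 and 5 as the start and end points.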
Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee.  Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARZAD KAZEMINEZHAD whose telephone number is (571)270-5860. The examiner can normally be reached 10:30 am to 11:30 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, DANIEL C WASHBURN, can be reached at (571)272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/Farzad Kazeminezhad/
Art Unit 2657
May 28th 2022.