DETAILED ACTION
Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the America Invents Act (“AIA”).
Non-Final Office Action
This Office Action responds to the filing of the application on May 6, 2020. 
Status of Claims
Claims 1-20 are pending and have been examined. The claim rejections and objections to the Specification are set forth below.
Specification
The disclosure is objected to because of the following informalities: 
On page 1, paragraph [003], second line down, “identifying that human can becomes increasingly complex” should be “identifying that human can become increasingly complex”.
On page 11, paragraph [035], “a current weather” (tenth line down) should be “a current weather condition” and “current or predicted weather” (thirteenth to fifteenth line down) should be “current or predicted weather condition” (note: condition can be another noun that describes the weather e.g., status, reading, etc. but because condition was used later on, that seems most appropriate). 
On page 18, paragraph [050], fourth line down, did Applicant intend “as seen in table 4” to read “as seen in the table in FIG. 4”? If so, that correction should be made.
Appropriate correction is required.


Examiner Suggestions
The Examiner suggests amending the claim limitations noted below to improve the form and consistency of the claims.
As to claim 3, the last claim paragraph beginning “causing the ATM…” should be “cause the ATM…” to be consistent with the other claim paragraphs.
As to claims 3-4, the preambles should recite “when executed by the one or more processors, further cause the computer system to:” or “when executed by the one or more processors, cause the computer system to further:” because these are additional method steps or actions performed by the computer system or one or more processors not recited in claim 1.
As to claims 5 & 13, although implied, it would be clearer if there was antecedent basis for “similarity metrics”, e.g., “a similarity metric for each of the prior responses to form similarity metrics, wherein each similarity metric indicates…” because “the similarity metrics” is recited later on in each claim.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. 
Step 1: The claims are directed to a system (independent claim 1), a non-transitory computer readable medium (independent claim 5), or a method (independent claim 13), each of which falls within at least one statutory category of invention (e.g., process or machine).
Step 2A, Prong One: The Examiner has identified independent “method” claim 13 as the claim that best represents the claimed invention for analysis and is similar to independent “system” claim 1 & “non-transitory computer readable medium” claim 5. The claim recites a method for using passive multifactor authentication to provide access to secure services, which is considered a judicial exception because it falls under the category of certain methods of organizing human activity, such as: “causing a first message to be output by…an interactive [entity/agent] in response to detecting a first user's presence in an environment of the interactive [entity/agent]; capturing, via…the interactive [entity/agent], first data representing a first response to the first message; determining, based on the first data and second data related to prior responses provided by one or more users, a similarity metric for each of the prior responses, wherein the similarity metric indicates a degree of similarity between the first response and each of the prior responses; determining, based on a first similarity metric of the similarity metrics satisfying a predefined authentication condition, a first account associated with first response; and providing, via the interactive [entity/agent], access to one or more services associated with the first account”, which are also commercial or legal interactions (including agreements in the form of contracts; legal obligations; advertising, marketing or sales activities or behaviors; and/or business relations). As a result, the claims are directed to the abstract idea of a human being outputting a first message, capturing data representing a first response to the first message, determining a similarity metric for each of the prior responses, determining a first account, and providing access to one or more services associated with the first account.
If the claim limitations, under the broadest reasonable interpretation, cover methods of organizing human activity but for the recitation of generic computer components, then they fall within the “certain methods of organizing human activity” grouping of abstract ideas. Thus, representative claim 13 recites an abstract idea. This judicial exception directed to certain methods of organizing human activity is also not integrated into a practical application, and the claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception, as analyzed below.
Step 2A, Prong Two: This judicial exception is not integrated into a practical application. Limitations that are not indicative of integration into a practical application include adding the words "apply it" (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely using a computer as a tool to perform an abstract idea (MPEP 2106.05(f)). In particular, the claim only recites the additional elements of one or more processors executing computer program instructions, interacting with a speaker and at least one sensor of an interactive kiosk, to perform all the steps. A plain reading of FIG. 1 as well as its associated descriptions in paragraphs [016]-[026] & [087]-[093] of Applicants’ Specification reveals that the above listed components can be general-purpose, generic or commercially available computing elements or devices programmed to perform the claimed steps. See, e.g., Apps.’ Spec., para. [0025] (“In some embodiments, interactive kiosk 106 may be communicatively coupled to a general purpose computing device, a computer system (e.g., computer system 102), one or more client devices (e.g., client devices 104), one or more databases (e.g., databases 132), or other components.”). Hence, the additional elements of the one or more processors and the interactive kiosk having a speaker and at least one sensor function as generic processors, kiosks, speakers and sensors such that they amount to no more than mere instructions to apply the exception using generic computer components or to implement an abstract idea by merely adding the words “apply it” (or an equivalent) with the judicial exception. Thus, in the claim, the judicial exception is not integrated into a practical application because the limitations are recited at a high level of generality such that they amount to no more than mere instructions to apply the exception using generic computer components.
Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
In addition to the one or more processors and the interactive kiosk having a speaker and at least one sensor of independent claim 13, independent claims 1 & 5 also contain the generic computing components of: a system (claim 1), a computer system (claim 1), an automated teller machine (ATM) with a speaker, a camera and microphone (claim 1), and a non-transitory computer readable medium (claim 5).
Step 2B: Thus, the claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements of the one or more processors (claims 1, 5 & 13), the interactive kiosk having a speaker and at least one sensor (claims 5 & 13), the system (claim 1), the computer system (claim 1), the automated teller machine (ATM) with the speaker, the camera and the microphone (claim 1), and the non-transitory computer readable medium (claim 5) recited in the claims or used to perform the steps listed in the claims amount to no more than mere instructions to implement an abstract idea by adding the words “apply it” (or an equivalent) with the judicial exception. Thus, the additional elements of the instant underlying process, when taken in combination, together do not offer substantially more than the sum of the functions of the elements when each is taken alone. Furthermore, mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Hence, independent claim 13 is not patent eligible, nor are independent claims 1 & 5 based on similar reasoning and rationale.
Dependent claims 2-4, 6-12 & 14-20, when analyzed as a whole, are also held to be patent ineligible under 35 U.S.C. 101 because the additional recited limitations only refine the abstract idea further. For instance:
As to claims 2-4, the limitations of “The system of claim 1, wherein prior to the audio message being output, the computer program instructions, when executed by the one or more processors, cause the computer system to: detect that the user is within the predefined distance of the ATM; cause a first audio message to be output by the speaker, the first audio message comprising a first greeting for the user; capture, via the camera and the microphone, a first video of the environment in connection with the outputting of the first audio message; cause the user to provide authentication information indicating the account; cause a second audio message to be output by the speaker in response to detecting that the user has ceased interacting with the ATM, wherein the second audio message comprises a farewell message for the user; capture, via the camera and the microphone, a second video of the environment in connection with the outputting of the second audio message; store, in association with the account, (i) first data related to a first facial expression of the user and first sounds in the environment responsive to the first audio message based on the first video, and (ii) second data related to a second facial expression of the user and second sounds in the environment responsive to the second audio message based on the second video; and generate training data for training a neural network to identify the user based on the first data and the second data, wherein the training data comprises the first data and the second data, and wherein the prior response data is generated based on the training data” (claim 2), “The system of claim 1, wherein the computer program instructions, when executed by the one or more processors, cause the computer system to: cause, in response to detecting that the user is within the predefined distance of the ATM, a first audio message to be output by the speaker, wherein the first audio message comprises a first greeting for the user; capture, via the camera and microphone, a video of the environment of the ATM in connection with the outputting of the first audio message; detect, based on the captured video, a first response to the first audio message from the user; determine, based on the first response and the prior responses, first similarity scores indicating how similar the first response is to each of the prior responses; determine that the first similarity scores do not satisfy the similarity score threshold; and causing the ATM to request additional authentication information to authenticate the user prior to providing access to the one or more services” (claim 3) and “The system of claim 1, wherein the computer program instructions, when executed by the one or more processors, cause the computer system to: generate training data for training a neural network to recognize the user based on at least one of the user's facial expression or the user's spoken reply to a new audio message, wherein the training data is generated based on the prior response data, the detected response, and a detected additional response to an additional audio message output by the speaker after the user ceases interacting with the ATM; cause the neural network to be trained based on the training data to obtain a trained neural network; provide, to the trained neural network, a subsequently detected response to a first audio message output by the speaker in response to detecting that a first user is within the predefined distance; and obtain, from the trained neural network, an output indicating whether the first user is the user, wherein: for the output indicating that the trained neural network classified the first user as being the user, the ATM is caused to provide the user access to the one or more services, and for the output indicating that the trained neural network is unable to classify the first user as being the user, the ATM is caused to request additional authentication information to authenticate the first user” (claim 4), under the broadest reasonable interpretation, are further refinements of methods of organizing human activity such as commercial interactions, activities and/or data because these limitations describe further steps performed (e.g., detect, cause, capture, store, generate, determine, provide, and obtain steps) in a method for using passive multifactor authentication to provide access to secure services.
As to claims 6 & 14, the limitations of “The non-transitory computer readable medium of claim 5, wherein the first similarity metric satisfying the predefined authentication condition comprises: determining that the first response comprises one of a plurality of responses previously provided by the first user in response to a message output by the interactive kiosk” (claim 6) and “The method of claim 13, wherein the first similarity metric satisfying the predefined authentication condition comprises: determining that the first response comprises one of a plurality of responses previously provided by the first user in response to a message output by the interactive kiosk” (claim 14), under the broadest reasonable interpretation, are further refinements of methods of organizing human activity such as commercial interactions, activities and/or data because these limitations further describe the first similarity metric satisfying the predefined authentication condition in a method for using passive multifactor authentication to provide access to secure services. 
As to claims 7 & 15, the limitations of “The non-transitory computer readable medium of claim 6, wherein the first data comprises at least one of (i) first image data representing one or more images of the first user in the environment in connection with the first message being output, or (ii) first audio data representing sounds detected in the environment in connection with the first message being output, the first similarity metric being determined to satisfy the predefined authentication condition comprises at least one of: determining, based on the second data, that at least one of the one or more images depicts the first user; or determining, based on the second data, that the sounds comprise an audio fingerprint of the first user” (claim 7) and “The method of claim 14, wherein the first data comprises at least one of (i) first image data representing one or more images of the first user in the environment in connection with the first message being output, or (ii) first audio data representing sounds detected in the environment in connection with the first message being output, the first similarity metric being determined to satisfy the predefined authentication condition comprises at least one of: determining, based on the second data, that at least one of the one or more images depicts the first user; or determining, based on the second data, that the sounds comprise an audio fingerprint of the first user” (claim 15), under the broadest reasonable interpretation, are further refinements of methods of organizing human activity such as commercial interactions, activities and/or data because these limitations further describe the first data and the first similarity metric being determined to satisfy the predefined authentication condition in a method for using passive multifactor authentication to provide access to secure services. 
As to claims 8 & 16, the limitations of “The non-transitory computer readable medium of claim 5, wherein the operations further comprise: determining information related to a location of the interactive kiosk; and generating the first message based on the information related to the location” (claim 8) and “The method of claim 13, further comprising: determining information related to a location of the interactive kiosk; and generating the first message based on the information related to the location” (claim 16), under the broadest reasonable interpretation, are further refinements of methods of organizing human activity such as commercial interactions, activities and/or data because these limitations describe further steps (e.g., determining and generating steps) performed in a method for using passive multifactor authentication to provide access to secure services. 
As to claims 9 & 17, the limitations of “The non-transitory computer readable medium of claim 5, wherein the operations further comprise: causing a second message to be output by the speaker in response to determining that the first user ceased interacting with the interactive kiosk; capturing, via the at least one sensor, third data representing a second response to the second message; and generating training data for training a prediction model to recognize the first user based on the first data and the third data, wherein the training data comprises the first data and the third data” (claim 9) and “The method of claim 13, further comprising: causing a second message to be output by the speaker in response to determining that the first user ceased interacting with the interactive kiosk; capturing, via the at least one sensor, third data representing a second response to the second message; and generating training data for training a prediction model to recognize the first user based on the first data and the third data, wherein the training data comprises the first data and the third data” (claim 17), under the broadest reasonable interpretation, are further refinements of methods of organizing human activity such as commercial interactions, activities and/or data because these limitations describe further steps (e.g., causing and generating steps) performed in a method for using passive multifactor authentication to provide access to secure services. 
As to claims 10 & 18, the limitations of “The non-transitory computer readable medium of claim 5, wherein the operations further comprise: causing a second message to be output by the speaker in response detecting a second user's presence in the environment of the interactive kiosk; capturing, via the at least one sensor, third data representing a second response to the second message; and in response to determining that the second user is unable to be authenticated based on the second data and the third data, causing the interactive kiosk to request additional authentication information for authenticating the second user” (claim 10) and “The method of claim 13, further comprising: causing a second message to be output by the speaker in response detecting a second user's presence in the environment of the interactive kiosk; capturing, via the at least one sensor, third data representing a second response to the second message; and in response to determining that the second user is unable to be authenticated based on the second data and the third data, causing the interactive kiosk to request additional authentication information for authenticating the second user” (claim 18), under the broadest reasonable interpretation, are further refinements of methods of organizing human activity such as commercial interactions, activities and/or data because these limitations describe further steps (e.g., causing and capturing steps) performed in a method for using passive multifactor authentication to provide access to secure services. 
As to claims 11 & 19, the limitations of “The non-transitory computer readable medium of claim 5, wherein determining the similarity metric comprises: determining a distance metric indicating the degree of similarity between the first data and the second data, wherein the similarity metric is determined to satisfy the predefined authentication condition based on the distance metric being less than or equal to a distance threshold” (claim 11) and “The method of claim 13, wherein determining the similarity metric comprises: determining a distance metric indicating the degree of similarity between the first data and the second data, wherein the similarity metric is determined to satisfy the predefined authentication condition based on the distance metric being less than or equal to a distance threshold” (claim 19), under the broadest reasonable interpretation, are further refinements of methods of organizing human activity such as commercial interactions, activities and/or data because these limitations further describe determining the similarity metric in a method for using passive multifactor authentication to provide access to secure services. 
As to claims 12 & 20, the limitations of “The non-transitory computer readable medium of claim 5, wherein the operations further comprise: causing a second message to be output by the speaker in response to determining that the first user ceased interacting with the interactive kiosk; capturing, via the at least one sensor, third data representing a second response to the second message; detecting, based on the second data and the third data, a difference between the second response and prior responses of the first user; and generating a flag indicating that suspicious behavior has been detected by the interactive kiosk, wherein the flag is stored in association with the first account” (claim 12) and “The method of claim 13, further comprising: causing a second message to be output by the speaker in response to determining that the first user ceased interacting with the interactive kiosk; capturing, via the at least one sensor, third data representing a second response to the second message; detecting, based on the second data and the third data, a difference between the second response and prior responses of the first user; and generating a flag indicating that suspicious behavior has been detected by the interactive kiosk, wherein the flag is stored in association with the first account” (claim 20), under the broadest reasonable interpretation, are further refinements of methods of organizing human activity such as commercial interactions, activities and/or data because these limitations describe further steps (e.g., causing, capturing, detecting and generating steps) performed in a method for using passive multifactor authentication to provide access to secure services. 
Therefore, the dependent claims further define the abstract idea that is present in their respective independent claims and hence are abstract for at least the reasons presented above.  In addition, the dependent claims do not include additional elements that integrate into a practical application or are sufficient to amount to significantly more than the judicial exception. The additional elements of the instant underlying process, when taken in combination, together do not offer substantially more than the sum of the functions of the elements when each element is taken alone. Thus, the claims as a whole do not amount to significantly more than the abstract idea itself. For these reasons, the dependent claims are also not patent eligible, and as a result, claims 1-20 are not eligible subject matter under 35 U.S.C. 101.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
This application currently names joint inventors. In considering patentability of the claims the Examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Ortiz et al., U.S. Pat. Pub. 2021/0173916 A1 (“Ortiz”)1 in view of Dai et al., CN 206144199 U (“Dai”).
As to claim 1, Ortiz discloses “A system for using passive multifactor authentication to provide access to one or more secure services, comprising:”. See, e.g., Ortiz, Abstract (“Systems, devices, methods, and computer readable media are provided in various embodiments relating to generating a dynamic challenge passphrase data object”); para. [0005] (“The present disclosure generally relates to the field of secure authentication tokens, and more specifically, secure authentication or validation using dynamically generated passphrases.”).
“cause, in response to detecting that a user is within a predefined distance of the ATM, [an authentication challenge from an ATM]”. See, e.g., Ortiz, paras. [0142] (“Alice presents the secured token to an access control device (an automated teller machine (ATM)). The access control device, in response to verifying that the secured token was signed by the trusted entity, generates or selects a passphrase, and uses the passphrase in an authentication challenge (e.g., “Please say the word: Kingfisher).”); [0145]-[0146], [0183] (authentication challenge on an ATM window or at ATM).
“capture, via a camera and microphone of the ATM, video of an environment of the ATM in connection with the outputting of the audio message”. See, e.g., Ortiz, paras. [0029] (“In the embodiments described herein, the dynamically generated passphrase(s), when spoken, require an individual to adjust their features (e.g., facial or auditory) to speak a first set of words (i.e., dynamically generated passphrase(s)) including a plurality of phonemes that are captured in audio and/or video.”); [0031] (“The requesting individual captures a video of themselves speaking the passphrase (e.g., via a mobile phone), and transmits the captured video to the system.”); [0124], [0138]-[0139], [0152] (same, captured video response from user); [0160] (video processing unit which records image and audio data from 2D camera 130 or 3D camera 140); [0167] (same); [0220], [0223], [0226]-[0230], [0234]-[0240] (discussion of camera, which can be 2D/3D); claims 6 & 16 (reciting capturing video data).
“detect, based on the captured video, a response from the user to the audio message, wherein the response comprises a facial expression of the user and a spoken reply from the user”. See, e.g., Ortiz, paras. [0029] (“In the embodiments described herein, the dynamically generated passphrase(s), when spoken, require an individual to adjust their features (e.g., facial or auditory) to speak a first set of words (i.e., dynamically generated passphrase(s)) including a plurality of phonemes that are captured in audio and/or video.”); [0033] (“The system receives the video (e.g., timestamped audio and video track) and extracts features (facial, lips, eyes or otherwise) of the requesting individual saying the plurality of phonemes, and compares the extracted features to reference features of the authenticated individual (e.g., Tom) saying the same plurality of phonemes.”); [0036] (“The features can include facial expressions or characteristics (e.g., eye shape), micro-movements (i.e., movements difficult to see with the human eye), auditory features, and combinations thereof. These features can be extracted from images within the video data, depth image data (e.g., 3-D image data), and facial dot projection mapping data, among others. The features may include facial characteristics including at least one of: lateral and medial position coordinates of both eyes; lateral-position coordinates of lips, a forehead curvature, distances between an ear and the eyes, or a height of nose. For example, a pixel mask can be applied to track these features over multiple frames.”); [0045], [0050], [0052] (facial data); [0051] (“In an example embodiment where extracted features include depth data associated with an individuals face, a facial recognition scanner can be provided in the context of a bike sharing or a smart door lock, which takes a picture or a 3D representation of a face of the individual. This picture or the 3D representation is converted into a feature representation. The individual then utilizes the mobile device to adduce the digitally signed token as a “deposit token”, which is then received in relation to a challenge request mapping the picture or a 3D representation of a face of the individual against the available characteristics of the digitally signed token. If the device is satisfied that the captured picture or a 3D representation of a face of the individual is corroborated by the available characteristics of the digitally signed token, the device may then provision access (e.g., unlocks a bicycle or unlock a door).”); FIGS. 2-4, 7, 10-13, 16-24, 27, 29A, 30-33, 59-60 and their corresponding paragraphs (describing features of a facial recognition system).
“obtain prior response data related to prior responses provided by one or more users to previous audio messages, wherein the prior responses comprising facial expressions of the one or more users and spoken replies from the one or more users”. See, e.g., Ortiz, paras. [0018] (“challenge response data structure”); [0044] (“challenge response data set”); [0045] (“The third party computing device may process the digitally signed token upon receiving a challenge response data set representative of response images asserted as the individual speaking the passphrase. The third party computing device validates the challenge response data set by validating against the facial representation extracted by the model data architecture to establish that the challenged individual speaking the passphrase satisfies an output of the model data architecture at an acceptable confidence threshold value (e.g., a pre-defined value).”); claim 1 (“challenge response data structure”); claim 5 (“response data object”). 
“determine, based on the prior response data and the detected response, a similarity score indicating how similar the detected response is to the prior responses”. See, e.g., Ortiz, claims 6 & 16 (“using the one or more baseline machine learning data model architectures corresponding to the at least one overlapping phoneme or phoneme transition to determine an overall classification similarity score; wherein the provisioning of access to the electronic resource only occurs if the correct response string has been validated against the correct response string and the overall classification similarity score is greater than a pre-defined threshold similarity score.”); [0192], [0399] (similarity matching); [0427] (“In example embodiments, the compared feature vectors may be compared to determine whether the feature vectors are sufficiently similar (e.g., satisfying a threshold indicative of feature similarity). For example, similar to the determination of distances between entry vectors in regards to cluster analysis as described herein, the threshold indicative of feature similarity may be based on a distance or orientation between the two feature vectors. For example, the cosine similarity between the two vectors may be determined, and where the value of the cosine similarity is zero, the two vectors may be orthogonal, indicating that they are not very similar. In example embodiments, the distance may be measured by similarities measures including a Euclidian, or Jaccard distance between the two vectors.”); [0424] (sufficient similarity); claims 2 & 12 (semantic similarity).
“determine that the similarity score between the detected response and one of the prior responses satisfies a similarity score threshold; determine an account associated with the one of the prior responses, wherein the account comprises one or more services accessible via the ATM; and”. See, e.g., Ortiz, claims 6 & 16 (“using the one or more baseline machine learning data model architectures corresponding to the at least one overlapping phoneme or phoneme transition to determine an overall classification similarity score; wherein the provisioning of access to the electronic resource only occurs if the correct response string has been validated against the correct response string and the overall classification similarity score is greater than a pre-defined threshold similarity score.”); [0427] (“In example embodiments, the compared feature vectors may be compared to determine whether the feature vectors are sufficiently similar (e.g., satisfying a threshold indicative of feature similarity). For example, similar to the determination of distances between entry vectors in regards to cluster analysis as described herein, the threshold indicative of feature similarity may be based on a distance or orientation between the two feature vectors. For example, the cosine similarity between the two vectors may be determined, and where the value of the cosine similarity is zero, the two vectors may be orthogonal, indicating that they are not very similar. In example embodiments, the distance may be measured by similarities measures including a Euclidian, or Jaccard distance between the two vectors.”).
“provide, via the ATM, access to the one or more services”. See, e.g., Ortiz, paras. [0142]-[0144] (“Alice presents the secured token to an access control device (an automated teller machine (ATM)). The access control device, in response to verifying that the secured token was signed by the trusted entity, generates or selects a passphrase, and uses the passphrase in an authentication challenge (e.g., “Please say the word: Kingfisher)…Alice then provides a video of her saying: “Kingfisher”. Each of the tokenized parts of the “Kingfisher” are compared against the neural network parameters stored in the secured token and the system determines that it is 99.8% confident that the video is of Alice saying the “Kingfisher” based on her facial features… Alice is given access to her bank account.”); [0145]-[0146], [0183] (authentication challenges given before users are allowed to access ATMs).
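For illustration only, the similarity-score comparison mapped above (see Ortiz, para. [0427] and claims 6 & 16) can be sketched as follows. This is a minimal sketch under stated assumptions and is not code from Ortiz, Dai or the application: the function names, the example feature vectors, the account labels and the 0.9 threshold are all hypothetical, and treating the threshold as satisfied at "greater than or equal to" is likewise an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors.

    A value of zero means the vectors are orthogonal (not similar);
    a value near one means they point in nearly the same direction.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an all-zero vector matches nothing
    return dot / (norm_a * norm_b)


def match_account(detected, prior_responses, threshold=0.9):
    """Return (account, score) for the most similar prior response whose
    score satisfies the threshold, or None if no prior response does.

    prior_responses maps a hypothetical account label to the feature
    vector of a prior response stored for that account.
    """
    best = None
    for account, features in prior_responses.items():
        score = cosine_similarity(detected, features)
        if score >= threshold and (best is None or score > best[1]):
            best = (account, score)
    return best


# Hypothetical usage: two stored prior responses and one detected response.
prior = {"acct-A": [1.0, 0.0], "acct-B": [0.0, 1.0]}
result = match_account([0.9, 0.1], prior)
```

In this sketch, a returned account would correspond to the account whose services are then made accessible, while a return of None would correspond to the case in which no prior response satisfies the threshold.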
However, Ortiz does not specifically or expressly disclose “cause the computer system to: cause, in response to detecting that a user is within a predefined distance of the ATM, an audio message to be output by a speaker of the ATM, wherein the audio message comprises a greeting message for the user” as recited by claim 1.
Dai cures this deficiency. See, e.g., Dai (English translation) (under “Contents of the utility model”) (“Compared with the existing technology, the multifunctional single pavilion, when people is outdoor part position of the first sensor, a first sensor is activated and sends a signal to the electronic door and opening the electronic door, people entering into the withdrawal chamber direction teller machine, ATM direction will trigger greeting sensing horn, greeting sensing horn is activated and speaking such as "welcome, you good" words, makes people feel warm when people transacting business in cash position, this third sensor is activated. This third sensor sends signal to the fourth inductor, the working mode of the fourth inductor is when payee behind someone close contact, a fourth sensor sends signal to the alarm loudspeaker; the alarm horn sends alarm such as "Please keep the safe distance from the payee" statement, it can effectively prompt the payee to prevent occurrence of crime increase the alert and warning the back the personnel and non-payee keep a safe distance when people transact the deposit transaction…the withdrawal chamber keep constant temperature, withdrawal chamber ground warning to remind the person back to keep the safety distance [hence, distance from users and from users to the ATM can be determined with sensors]”); (under “Preferred Embodiment”) (“the multifunctional single pavilion, when people is in the withdrawal chamber 1 outside the first sensor 2 position, the first sensor 2 is activated and sends a signal to the electronic door 3, the electronic door 3 opened, the people enter into the withdrawal chamber 1 ATM 8. 
trend ATM 8 will trigger the greeting induction speaker 7, greeting induction speaker 7 is activated and speaking such as "welcome, you good" words, makes people feel warm when people transacting business in ATM 8 position, the third sensor 5 is activated, the third sensor 5 sends the signal to the fourth inductor 6 and the fourth inductor 6 of the working mode is when the payee behind someone close contact, the fourth sensor 6 sends the signal to the alarm speaker 12. the alarm speaker 12 issues a warning such as "please keep the safe distance from the payee" statement, it can effectively prompt the payee to prevent occurrence of crime increase the alert and warning the back the personnel and non-payee keep a safe distance when people transact the deposit transaction [hence, distance from users and from users to the ATM can be determined].”)
Therefore, it would have been obvious to one of ordinary skill in the art to combine the above disclosures of Ortiz and Dai to teach, suggest and disclose all of the limitations recited by claim 1. The combination supports a conclusion of obviousness because applying Dai’s teaching (e.g., causing, in response to detecting that a user is within a predefined distance of the ATM, an audio message to be output by a speaker of the ATM, wherein the audio message comprises a greeting message for the user) to Ortiz’s system would have yielded predictable results with a reasonable expectation of success. See MPEP 2143. Examiner further submits that the combination of Ortiz and Dai would be particularly advantageous in integrating systems and methods to “present[] [a] secured token to an access control device (an automated teller machine (ATM))” (Ortiz, para. [0142]) with systems and methods for an “ATM…[to] trigger [a] greeting sensing horn…speaking such as "welcome, you good" words, [to make] people feel warm when [they are] transacting business” (Dai, under “Contents of the utility model”), thereby teaching, suggesting and disclosing all of the limitations of claim 1.
As to claim 5, and for the same reasons as above, Ortiz in view of Dai also discloses a “non-transitory computer readable medium storing computer program instructions that, when executed by one or more processors of a computing device, effectuate operations comprising: causing a first message to be output by a speaker of an interactive kiosk in response to detecting a first user's presence in an environment of the interactive kiosk; capturing, via at least one sensor of the interactive kiosk, first data representing a first response to the first message; determining, based on the first data and second data related to prior responses provided by one or more users, a similarity metric for each of the prior responses, wherein the similarity metric indicates a degree of similarity between the first response and each of the prior responses; determining, based on a first similarity metric of the similarity metrics satisfying a predefined authentication condition, a first account associated with first response; and providing, via the interactive kiosk, access to one or more services associated with the first account”. See, e.g., Ortiz, paras. [0219], [04354], claim 20 (describing “non-transitory computer-readable medium”). Also, the recited “similarity metric” is identical to the above-recited “similarity score” from claim 1, the interactive kiosk is identical to the ATM, and the first data can be the captured video/audio data.
As to claim 13, and for the same reasons as above, Ortiz in view of Dai also discloses a “method implemented on one or more processors executing computer program instructions that, when executed, perform the method, the method comprising: causing a first message to be output by a speaker of an interactive kiosk in response to detecting a first user's presence in an environment of the interactive kiosk; capturing, via at least one sensor of the interactive kiosk, first data representing a first response to the first message; determining, based on the first data and second data related to prior responses provided by one or more users, a similarity metric for each of the prior responses, wherein the similarity metric indicates a degree of similarity between the first response and each of the prior responses; determining, based on a first similarity metric of the similarity metrics satisfying a predefined authentication condition, a first account associated with first response; and providing, via the interactive kiosk, access to one or more services associated with the first account”. See notes above for claim 5.
As to claims 2-4, Ortiz in view of Dai also discloses the limitations of “The system of claim 1, wherein prior to the audio message being output, the computer program instructions, when executed by the one or more processors, cause the computer system to: detect that the user is within the predefined distance of the ATM; cause a first audio message to be output by the speaker, the first audio message comprising a first greeting for the user; capture, via the camera and the microphone, a first video of the environment in connection with the outputting of the first audio message; cause the user to provide authentication information indicating the account; cause a second audio message to be output by the speaker in response to detecting that the user has ceased interacting with the ATM, wherein the second audio message comprises a farewell message for the user; capture, via the camera and the microphone, a second video of the environment in connection with the outputting of the second audio message; store, in association with the account, (i) first data related to a first facial expression of the user and first sounds in the environment responsive to the first audio message based on the first video, and (ii) second data related to a second facial expression of the user and second sounds in the environment responsive to the second audio message based on the second video; and generate training data for training a neural network to identify the user based on the first data and the second data, wherein the training data comprises the first data and the second data, and wherein the prior response data is generated based on the training data” (claim 2), “The system of claim 1, wherein the computer program instructions, when executed by the one or more processors, cause the computer system to: cause, in response to detecting that the user is within the predefined distance of the ATM, a first audio message to be output by the speaker, wherein the first audio message comprises a first greeting for the user; capture, via the camera and microphone, a video of the environment of the ATM in connection with the outputting of the first audio message; detect, based on the captured video, a first response to the first audio message from the user; determine, based on the first response and the prior responses, first similarity scores indicating how similar the first response is to each of the prior responses; determine that the first similarity scores do not satisfy the similarity score threshold; and causing the ATM to request additional authentication information to authenticate the user prior to providing access to the one or more services” (claim 3) and “The system of claim 1, wherein the computer program instructions, when executed by the one or more processors, cause the computer system to: generate training data for training a neural network to recognize the user based on at least one of the user's facial expression or the user's spoken reply to a new audio message, wherein the training data is generated based on the prior response data, the detected response, and a detected additional response to an additional audio message output by the speaker after the user ceases interacting with the ATM; cause the neural network to be trained based on the training data to obtain a trained neural network; provide, to the trained neural network, a subsequently detected response to a first audio message output by the speaker in response to detecting that a first user is within the predefined distance; and obtain, from the trained neural network, an output indicating whether the first user is the user, wherein: for the output indicating that the trained neural network classified the first user as being the user, the ATM is caused to provide the user access to the one or more services, and for the output indicating that the trained neural network is unable to classify the first user as being the user, the ATM is caused to request additional authentication information to authenticate the first user” (claim 4). See, e.g., Dai (English translation) (under “Contents of the utility model”) (“Compared with the existing technology, the multifunctional single pavilion, when people is outdoor part position of the first sensor, a first sensor is activated and sends a signal to the electronic door and opening the electronic door, people entering into the withdrawal chamber direction teller machine, ATM direction will trigger greeting sensing horn, greeting sensing horn is activated and speaking such as "welcome, you good" words, makes people feel warm when people transacting business in cash position, this third sensor is activated. This third sensor sends signal to the fourth inductor, the working mode of the fourth inductor is when payee behind someone close contact, a fourth sensor sends signal to the alarm loudspeaker; the alarm horn sends alarm such as "Please keep the safe distance from the payee" statement, it can effectively prompt the payee to prevent occurrence of crime increase the alert and warning the back the personnel and non-payee keep a safe distance when people transact the deposit transaction…the withdrawal chamber keep constant temperature, withdrawal chamber ground warning to remind the person back to keep the safety distance”); (under “Preferred Embodiment”) (“the multifunctional single pavilion, when people is in the withdrawal chamber 1 outside the first sensor 2 position, the first sensor 2 is activated and sends a signal to the electronic door 3, the electronic door 3 opened, the people enter into the withdrawal chamber 1 ATM 8. 
trend ATM 8 will trigger the greeting induction speaker 7, greeting induction speaker 7 is activated and speaking such as "welcome, you good" words, makes people feel warm when people transacting business in ATM 8 position, the third sensor 5 is activated, the third sensor 5 sends the signal to the fourth inductor 6 and the fourth inductor 6 of the working mode is when the payee behind someone close contact, the fourth sensor 6 sends the signal to the alarm speaker 12. the alarm speaker 12 issues a warning such as "please keep the safe distance from the payee" statement, it can effectively prompt the payee to prevent occurrence of crime increase the alert and warning the back the personnel and non-payee keep a safe distance when people transact the deposit transaction.”)
As to claims 6 & 14, Ortiz in view of Dai also discloses the limitations of “The non-transitory computer readable medium of claim 5, wherein the first similarity metric satisfying the predefined authentication condition comprises: determining that the first response comprises one of a plurality of responses previously provided by the first user in response to a message output by the interactive kiosk” (claim 6) and “The method of claim 13, wherein the first similarity metric satisfying the predefined authentication condition comprises: determining that the first response comprises one of a plurality of responses previously provided by the first user in response to a message output by the interactive kiosk” (claim 14). See, e.g., Ortiz, claims 6 & 16 (“using the one or more baseline machine learning data model architectures corresponding to the at least one overlapping phoneme or phoneme transition to determine an overall classification similarity score; wherein the provisioning of access to the electronic resource only occurs if the correct response string has been validated against the correct response string and the overall classification similarity score is greater than a pre-defined threshold similarity score.”); [0192], [0399] (similarity matching); [0427] (“In example embodiments, the compared feature vectors may be compared to determine whether the feature vectors are sufficiently similar (e.g., satisfying a threshold indicative of feature similarity). For example, similar to the determination of distances between entry vectors in regards to cluster analysis as described herein, the threshold indicative of feature similarity may be based on a distance or orientation between the two feature vectors. For example, the cosine similarity between the two vectors may be determined, and where the value of the cosine similarity is zero, the two vectors may be orthogonal, indicating that they are not very similar. 
In example embodiments, the distance may be measured by similarities measures including a Euclidian, or Jaccard distance between the two vectors.”); [0424] (sufficient similarity); claims 2 & 12 (semantic similarity).
As to claims 7 & 15, Ortiz in view of Dai also discloses the limitations of “The non-transitory computer readable medium of claim 6, wherein the first data comprises at least one of (i) first image data representing one or more images of the first user in the environment in connection with the first message being output, or (ii) first audio data representing sounds detected in the environment in connection with the first message being output, the first similarity metric being determined to satisfy the predefined authentication condition comprises at least one of: determining, based on the second data, that at least one of the one or more images depicts the first user; or determining, based on the second data, that the sounds comprise an audio fingerprint of the first user” (claim 7) and “The method of claim 14, wherein the first data comprises at least one of (i) first image data representing one or more images of the first user in the environment in connection with the first message being output, or (ii) first audio data representing sounds detected in the environment in connection with the first message being output, the first similarity metric being determined to satisfy the predefined authentication condition comprises at least one of: determining, based on the second data, that at least one of the one or more images depicts the first user; or determining, based on the second data, that the sounds comprise an audio fingerprint of the first user” (claim 15). See, e.g., Ortiz, paras. [0036], [0182] (video data, depth image data, 3-D image data); [0047] (raw image data); [0139] (image, video or audio data); [0160] (“The video processing unit 111 is configured to record raw image and audio data captured by 2D camera 130 or 3D camera 140. In example embodiments, as described herein, the video processing unit 111 may validate (e.g., validation as described in step two of FIG. 3) the captured raw image and audio data.”).
As to claims 8 & 16, Ortiz in view of Dai also discloses the limitations of “The non-transitory computer readable medium of claim 5, wherein the operations further comprise: determining information related to a location of the interactive kiosk; and generating the first message based on the information related to the location” (claim 8) and “The method of claim 13, further comprising: determining information related to a location of the interactive kiosk; and generating the first message based on the information related to the location” (claim 16). See, e.g., Ortiz, paras. [0037] (“In some embodiments, to avoid deepfake vulnerabilities, the system limits the amount of time available for the requesting individual to provide the requesting data (e.g., the video or the audio recording), requires that the media data is timestamped, or includes embedded location information, etc.”); [0158] (branch location data); [0317] (location where picture is taken or higher security location); [0323] (bike rental center location); [0325] (present at location of terminal, and location tracking that can be corroborated against GPS coordinates, QR codes provided on a door, etc.); [0344] (registration requests may only be permitted in certain locations (e.g., within a branch)); [0406] (location of purchase); [0412]-[0413] (location of transaction and location of purchase).
As to claims 9 & 17, Ortiz in view of Dai also discloses the limitations of “The non-transitory computer readable medium of claim 5, wherein the operations further comprise: causing a second message to be output by the speaker in response to determining that the first user ceased interacting with the interactive kiosk; capturing, via the at least one sensor, third data representing a second response to the second message; and generating training data for training a prediction model to recognize the first user based on the first data and the third data, wherein the training data comprises the first data and the third data” (claim 9) and “The method of claim 13, further comprising: causing a second message to be output by the speaker in response to determining that the first user ceased interacting with the interactive kiosk; capturing, via the at least one sensor, third data representing a second response to the second message; and generating training data for training a prediction model to recognize the first user based on the first data and the third data, wherein the training data comprises the first data and the third data” (claim 17). See sections of Dai cited above for second message and third data from sensor; for training data and prediction model, see, e.g., Ortiz, paras. [0025], [0039]-[0040], [0044], [0164], [0176], [0187], [0197], [0358], [0361], [0368], [0370], [0387]-[0388], [0416]-[0417], [0419], [0421], [0452], claim 16 (describing training data, training examples for machine learning models or deep learning models or neural networks); [0039]-[0040] (model data to predict features – prediction model); [0139] (predict whether image, video or audio data contains Alice (as opposed to another individual) saying the particular phoneme); [0264] (utterances predicted); [0277] (AI algorithm may predict a word spoken by the authenticated individual in the video and predicted word is compared to actual word for matching to generate a match confidence score); [0365] (“In example embodiments where, the model data architecture shown in FIG. 30 is trained to identify the phoneme being spoke in the image. For example, each image of the video may be processed by the segmentation portion 3004 and the classification portion 3006 and the model may predict, at the output of the classification portion 3006, the phoneme present in the processed image. The prediction, as described above, can be in the form of a vector, where each dimension of the vector represents a phoneme. In this way, the model data architecture learns to classify each image as including a phoneme or phoneme transition.”); [0368] (correct prediction of vectors for model data architecture trained to classify images); [0387], [0417] (NLP model data and predicting subsequent words in novels); [0429] (predicted phoneme in model data architecture for each image in a video); [0437] (model data architecture trained to predict one or more features of the authenticating individual saying a passphrase).
As to claims 10 & 18, Ortiz in view of Dai also discloses the limitations of “The non-transitory computer readable medium of claim 5, wherein the operations further comprise: causing a second message to be output by the speaker in response detecting a second user's presence in the environment of the interactive kiosk; capturing, via the at least one sensor, third data representing a second response to the second message; and in response to determining that the second user is unable to be authenticated based on the second data and the third data, causing the interactive kiosk to request additional authentication information for authenticating the second user” (claim 10) and “The method of claim 13, further comprising: causing a second message to be output by the speaker in response detecting a second user's presence in the environment of the interactive kiosk; capturing, via the at least one sensor, third data representing a second response to the second message; and in response to determining that the second user is unable to be authenticated based on the second data and the third data, causing the interactive kiosk to request additional authentication information for authenticating the second user” (claim 18). See, e.g., Ortiz, paras. [0148] (“In another scenario, the authentication via dynamic passphrase is not a substitute for, but rather, an additional layer of security. For example, authentication via a dynamic passphrase may be used in conjunction with username/password authentication, and other types of authentication.”); [0138]-[0144] (Alice example where Alice performs all the steps of the method in claims 10 & 18, which can be used as additional authentication information).
As to claims 11 & 19, Ortiz in view of Dai also discloses the limitations of “The non-transitory computer readable medium of claim 5, wherein determining the similarity metric comprises: determining a distance metric indicating the degree of similarity between the first data and the second data, wherein the similarity metric is determined to satisfy the predefined authentication condition based on the distance metric being less than or equal to a distance threshold” (claim 11) and “The method of claim 13, wherein determining the similarity metric comprises: determining a distance metric indicating the degree of similarity between the first data and the second data, wherein the similarity metric is determined to satisfy the predefined authentication condition based on the distance metric being less than or equal to a distance threshold” (claim 19). See, e.g., Ortiz, paras. [0427] (“In example embodiments, the compared feature vectors may be compared to determine whether the feature vectors are sufficiently similar (e.g., satisfying a threshold indicative of feature similarity). For example, similar to the determination of distances between entry vectors in regards to cluster analysis as described herein, the threshold indicative of feature similarity may be based on a distance or orientation between the two feature vectors. For example, the cosine similarity between the two vectors may be determined, and where the value of the cosine similarity is zero, the two vectors may be orthogonal, indicating that they are not very similar. 
In example embodiments, the distance may be measured by similarities measures including a Euclidian, or Jaccard distance between the two vectors.”); [0278] (“If the score is above a certain threshold, at step 1770, then the person in the video may be determined to be a real person matching the provided identity.”); [0022], [0041], [0045], [0139] (threshold of confidence for response data); [0034] (features passing a threshold); [0132]-[0134], [0175], [0350] (readability threshold); [0209] (age threshold); [0435] (not satisfying a threshold); claims 6 & 16 (“using the one or more baseline machine learning data model architectures corresponding to the at least one overlapping phoneme or phoneme transition to determine an overall classification similarity score; wherein the provisioning of access to the electronic resource only occurs if the correct response string has been validated against the correct response string and the overall classification similarity score is greater than a pre-defined threshold similarity score.”).
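For illustration only, the distance-based condition recited in claims 11 & 19 (a distance metric that is less than or equal to a distance threshold), together with the Euclidean and Jaccard distances named in Ortiz, para. [0427], can be sketched as follows. This is a minimal sketch under stated assumptions and is not code from Ortiz or the application: the function names, the example inputs and the threshold value are hypothetical, and representing features as a set for the Jaccard case is an assumption made for the example.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def jaccard_distance(a, b):
    """One minus the ratio of shared items to all items in two collections."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: two empty collections are treated as identical
    return 1.0 - len(set_a & set_b) / len(union)


def satisfies_condition(distance, distance_threshold):
    """A smaller distance means greater similarity, so the condition is
    satisfied when the distance is less than or equal to the threshold."""
    return distance <= distance_threshold


# Hypothetical usage with made-up feature data and a made-up threshold.
d_euclid = euclidean_distance([0.0, 0.0], [3.0, 4.0])
d_jaccard = jaccard_distance({"a", "b"}, {"b", "c"})
accepted = satisfies_condition(d_euclid, 5.0)
```

Note that a distance threshold runs in the opposite direction from a similarity-score threshold: a similarity score satisfies its threshold by being high, whereas a distance satisfies its threshold by being low.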
As to claims 12 & 20, Ortiz in view of Dai also discloses the limitations of “The non-transitory computer readable medium of claim 5, wherein the operations further comprise: causing a second message to be output by the speaker in response to determining that the first user ceased interacting with the interactive kiosk; capturing, via the at least one sensor, third data representing a second response to the second message; detecting, based on the second data and the third data, a difference between the second response and prior responses of the first user; and generating a flag indicating that suspicious behavior has been detected by the interactive kiosk, wherein the flag is stored in association with the first account” (claim 12) and “The method of claim 13, further comprising: causing a second message to be output by the speaker in response to determining that the first user ceased interacting with the interactive kiosk; capturing, via the at least one sensor, third data representing a second response to the second message; detecting, based on the second data and the third data, a difference between the second response and prior responses of the first user; and generating a flag indicating that suspicious behavior has been detected by the interactive kiosk, wherein the flag is stored in association with the first account” (claim 20). See, e.g., Ortiz, paras. [0453] (“Where method 7300 may result in the provisioning of access to an electronic resource (e.g., online banking account) where authentication is successful, in instances where authentication is not successful (e.g., where the correct response string is not selected or spoken), the system 100 may send an alert to a fraud monitoring squad.”); [0449] (“method 7300 may be implemented in the context of an authentication process to: access an advice center banking resource, change login credentials associated with the banking resource (e.g., authentication may be required to change a password), generally where it is accessed that there is a likelihood of fraud or where there are indicators of exceptional behavior, accessing automated self-service for accounts, for high risk transactions, for account origination and enrollment, and for authentication of infrequent users.”).
Prior Art Made of Record
The following prior art made of record and not relied upon is considered pertinent:
Galitsky, U.S. Pat. Pub. 2021/0174030 A1 – for discussing subject matter similar to the claims, e.g., utterances used to engage in a communicative discourse (Abstract).
Conclusion
Any inquiry concerning this communication or earlier communications from the Examiner should be directed to TIMOTHY T HSIEH whose telephone number is 571-270-3381.  The Examiner can normally be reached on M-F 8am-6pm EST. 
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, Applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. 
If attempts to reach the Examiner by telephone are unsuccessful, the Examiner’s supervisor, RYAN DONLON, can be reached at 571-270-3602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. 
For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. 



/T.T.H./Examiner, Art Unit 3695
September 30, 2022

/CHRISTOPHER BRIDGES/Primary Examiner, Art Unit 3695
10/3/2022

1 Ortiz’s effective filing date is at least as early as December 20, 2019, because an enabling provisional patent application disclosing the same material as this patent publication was filed then, which predates the filing date of the present application of May 6, 2020.