DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Objections
Claims 8 and 20 are objected to because of the following informalities:
In claim 8, “to train the policy network represent” should be “to train the policy network to represent”.
In claims 8 and 20, “structued” should be “structured”.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception (i.e., a law of nature, a natural phenomenon, or an abstract idea) without significantly more. The claimed invention is directed to the concept of mapping textual descriptions to driving data; training a policy based on a correspondence between an observed context and driving behaviors based on the mapping; and determining control behaviors based on observations of a surrounding environment, the mapping, and the policy. This judicial exception is not integrated into a practical application. The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception and do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
The Examiner will further explain in view of the 2019 Revised Patent Subject Matter Eligibility Guidance:
Regarding claim 1, applicant recites A vehicle behavior system for determining driving behaviors for controlling a vehicle, comprising: 
one or more processors; 
a memory communicably coupled to the one or more processors and storing: 
a training module including instructions that when executed by the one or more processors cause the one or more processors to generate, using textual descriptions in combination with driving log snippets, a joint feature space that represents a coordinated mapping between the textual descriptions and the driving log snippets, 
wherein the training module includes instructions to train a policy network to generate identified behaviors from the driving behaviors according to a correspondence between an observed context that is mapped onto the joint feature space and the driving behaviors defined in the joint feature space resulting from at least the textual descriptions; and 
a network module including instructions that when executed by the one or more processors cause the one or more processors to provide a behavior cloning model including at least an encoder, the joint feature space, and the policy network to generate control behaviors from the driving behaviors defined in the joint feature space according to acquired observations of a surrounding environment of the vehicle.
The claim recites a system configured to perform a series of steps and therefore is directed to an apparatus, which satisfies step 1 of the Section 101 analysis. Under the new two-prong inquiry, the claim is eligible at revised step 2A unless: Prong One: the claim recites a judicial exception; and Prong Two: the exception is not integrated into a practical application of the exception. 
The above claim steps are directed to the concept of mapping textual descriptions to driving data; training a policy based on a correspondence between an observed context and driving behaviors based on the mapping; and determining control behaviors based on observations of a surrounding environment, the mapping, and the policy. This is an abstract idea that can be performed by a user mentally or manually and falls within the Mental Processes grouping, since a user can observe and process context and driving data, create correlations between the context and the driving data, and determine what behaviors should be taken in certain contexts accordingly. (Prong one: YES, recites an abstract idea).
Other than reciting the use of one or more processors, nothing in the claim elements precludes the steps from being performed entirely by a human. The use of one or more computing devices is insufficient to amount to significantly more than the judicial exception and does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. (Prong Two: NO, does not recite additional elements that integrate the abstract idea into a practical application similar to that shown in MPEP 2106.05).
Under step 2B, the claimed invention does not recite additional elements that are indicative of an inventive concept. The additional elements, when considered both individually and as an ordered combination, do not amount to significantly more than the abstract idea. The one or more processors are described in at least paragraph [0032] of applicant’s specification as merely general-purpose computers. Therefore, these additional limitations are no more than mere instructions to apply the exception using generic computer components. The recitation of generic processors/computers does not take the above limitations out of the mental processes grouping.
Moreover, the implementation of the abstract idea on generic computers and/or generic computer components does not add significantly more, similar to how the recitation of the computer in Alice amounted to mere instructions to apply the abstract idea on a generic computer. The claims merely invoke the additional elements as tools that are being used in their ordinary capacity. Further, the courts have found that simply limiting the use of the abstract idea to a particular environment does not add significantly more. Thus, taken alone, the additional elements do not amount to significantly more than the above-identified judicial exception (the abstract idea). Looking at the limitations as an ordered combination adds nothing that is not already present when looking at the elements taken individually. There is no indication that the combination of elements improves the functioning of a computer or improves any other technology. Their collective functions merely provide generic computer implementation.
Examiner’s note to help applicant overcome the 101 rejection: a common way for applicants to overcome 101 rejections in vehicle cases is to amend the claims to affirmatively recite control of vehicle actuators or autonomous driving, as such control cannot be performed by a user mentally or manually. For example, applicant could add the limitation, “wherein the one or more processors control autonomous driving of the vehicle based on the generated control behaviors” to the present claim. This would overcome the present 101 rejection.
Of course, this is merely the examiner’s suggestion to move prosecution forward, and it is ultimately up to applicant how applicant wishes to amend the claims.

Regarding claim 2, applicant recites The vehicle behavior system of claim 1, wherein the training module includes instructions to generate the joint feature space including instructions to receive the textual descriptions that describe rules for controlling the vehicle, and receive the driving log snippets that indicate one or more control inputs associated with an observed behavior of the vehicle as observed via associated sensor data of the vehicle depicting a surrounding environment at a time of the observed behavior, and wherein the joint feature space defines a constrained space in which to search the driving behaviors.
However, merely specifying the kind of data processed as part of the mapping and learning process does not change that a user could mentally or manually process the information recited. Therefore, the additional limitations do not serve to integrate the judicial exception into a practical application.

Regarding claim 3, applicant recites The vehicle behavior system of claim 1, wherein the training module includes instructions to generate the joint feature space including instructions to train a log encoder to map the driving log snippets into the joint feature space as log feature vectors and to train a textual encoder to map the textual descriptions into the joint feature space as textual feature vectors that correspond to associated snippets of the driving log snippets.
However, merely specifying that the processed information is stored in the form of vectors does not change the fact that a user could mentally conceptualize such storage of the processed information and perform said processing mentally or manually. The additional limitations therefore do not serve to integrate the judicial exception into a practical application.

Regarding claim 4, applicant recites The vehicle behavior system of claim 3, wherein the training module includes instructions to generate the joint feature space including instructions to enforce correspondence between the driving log snippets and the textual descriptions in the joint feature space by training the log encoder and the textual encoder using a loss function that is based, at least in part, on a similarity metric for comparing the log feature vectors and the textual feature vectors, and wherein the joint feature space regularizes the driving behaviors as lawful and interpretable actions for controlling the vehicle.
However, a user could mentally conceptualize enforcing a mapping between two different sets of observed data, and could mentally conceptualize a cost associated with certain mappings (i.e., a “loss function”) to determine which mappings are most appropriate and indicate the greatest or least amounts of similarity between the sets of data. A user could further use such a mental mapping to mentally decide which driving behaviors are lawful and interpretable. Therefore, the additional limitations do not serve to integrate the judicial exception into a practical application.

Regarding claim 5, applicant recites The vehicle behavior system of claim 3, wherein the log feature vectors and the textual feature vectors provide an encoded representation of the driving log snippets and the textual description according to at least a context and a temporal sequence of actions that define separate behaviors in the joint feature space and provide for associating the textual descriptions with the driving log snippets.
However, a user can mentally encode and map driving observations and textual descriptions with each other based on context and a temporal sequence of actions. Therefore, the additional limitations do not serve to integrate the judicial exception into a practical application.

Regarding claim 6, applicant recites The vehicle behavior system of claim 1, wherein the joint feature space is a vector space identifying the driving behaviors mapped against driving rules defined by the textual descriptions and observed behaviors sampled in the driving log snippets that have been projected into the joint feature space to provide the driving log snippets as interpretable representations of the textual descriptions.
However, a user can mentally conceptualize a mapping of driving rules defined by textual descriptions and observed behaviors of a vehicle to determine which descriptions and which observed behaviors should be matched with each other. Therefore, the additional limitations do not serve to integrate the judicial exception into a practical application.

Regarding claim 7, applicant recites The vehicle behavior system of claim 1, wherein the network module includes instructions to provide the behavior cloning model including instructions to process sensor data from the vehicle using the behavior cloning model to identify behaviors from the joint feature space for controlling the vehicle.
However, a user can mentally analyze sensor data from a vehicle to identify behaviors from a data mapping which could be used for controlling a vehicle. Therefore, the additional limitations do not serve to integrate the judicial exception into a practical application.

Regarding claim 8, applicant recites The vehicle behavior system of claim 1, wherein the training module includes instructions to generate the joint feature space and to train the policy network represent a two-stage training process for the behavior cloning model that is unsupervised, wherein the textual descriptions are structued driving rules from a driver handbook that are provided in a standardized format, wherein the driving log snippets include driver control inputs for controlling the vehicle and sensor data, and wherein the behavior cloning model forms a deep metric learning network.
However, an unsupervised user can mentally associate written driving rules with driver control inputs for a vehicle and sensor data to learn an association between the three data sources. Therefore, the additional limitations do not serve to integrate the judicial exception into a practical application.

Regarding claim 9, applicant recites A non-transitory computer-readable medium for determining driving behaviors for controlling a vehicle and including instructions that when executed by one or more processors cause the one or more processors to: 
generate, using textual descriptions in combination with driving log snippets, a joint feature space that represents a coordinated mapping between the textual descriptions and the driving log snippets; 
train a policy network to generate identified behaviors from the driving behaviors according to a correspondence between an observed context that is mapped onto the joint feature space and the driving behaviors defined in the joint feature space resulting from at least the textual descriptions; and 
provide a behavior cloning model including at least an encoder, the joint feature space, and the policy network to generate control behaviors from the driving behaviors defined in the joint feature space according to acquired observations of a surrounding environment of the vehicle.
The claim recites a non-transitory computer-readable medium including instructions configured to be executed to perform a series of steps and therefore is directed to an apparatus, which satisfies step 1 of the Section 101 analysis. Under the new two-prong inquiry, the claim is eligible at revised step 2A unless: Prong One: the claim recites a judicial exception; and Prong Two: the exception is not integrated into a practical application of the exception.
The above claim steps are directed to the concept of mapping textual descriptions to driving data; training a policy based on a correspondence between an observed context and driving behaviors based on the mapping; and determining control behaviors based on observations of a surrounding environment, the mapping, and the policy. This is an abstract idea that can be performed by a user mentally or manually and falls within the Mental Processes grouping, since a user can observe and process context and driving data, create correlations between the context and the driving data, and determine what behaviors should be taken in certain contexts accordingly. (Prong one: YES, recites an abstract idea).
Other than reciting the use of one or more processors configured to execute instructions stored on a non-transitory computer-readable medium, nothing in the claim elements precludes the steps from being performed entirely by a human. The use of one or more computing devices is insufficient to amount to significantly more than the judicial exception and does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. (Prong Two: NO, does not recite additional elements that integrate the abstract idea into a practical application similar to that shown in MPEP 2106.05).
Under step 2B, the claimed invention does not recite additional elements that are indicative of an inventive concept. The additional elements, when considered both individually and as an ordered combination, do not amount to significantly more than the abstract idea. The one or more processors are described in at least paragraph [0032] of applicant’s specification as merely general-purpose computers. Therefore, these additional limitations are no more than mere instructions to apply the exception using generic computer components. The recitation of generic processors/computers does not take the above limitations out of the mental processes grouping.
Moreover, the implementation of the abstract idea on generic computers and/or generic computer components does not add significantly more, similar to how the recitation of the computer in Alice amounted to mere instructions to apply the abstract idea on a generic computer. The claims merely invoke the additional elements as tools that are being used in their ordinary capacity. Further, the courts have found that simply limiting the use of the abstract idea to a particular environment does not add significantly more. Thus, taken alone, the additional elements do not amount to significantly more than the above-identified judicial exception (the abstract idea). Looking at the limitations as an ordered combination adds nothing that is not already present when looking at the elements taken individually. There is no indication that the combination of elements improves the functioning of a computer or improves any other technology. Their collective functions merely provide generic computer implementation.
Examiner’s note to help applicant overcome the 101 rejection: a common way for applicants to overcome 101 rejections in vehicle cases is to amend the claims to affirmatively recite control of vehicle actuators or autonomous driving, as such control cannot be performed by a user mentally or manually. For example, applicant could add the limitation, “wherein the one or more processors control autonomous driving of the vehicle based on the generated control behaviors” to the present claim. This would overcome the present 101 rejection.
Of course, this is merely the examiner’s suggestion to move prosecution forward, and it is ultimately up to applicant how applicant wishes to amend the claims.

Regarding claim 10, applicant recites The non-transitory computer-readable medium of claim 9, wherein the instructions to generate the joint feature space include instructions to receive the textual descriptions that describe rules for controlling the vehicle, and receive the driving log snippets that indicate one or more control inputs associated with an observed behavior of the vehicle as observed via associated sensor data of the vehicle depicting a surrounding environment at a time of the observed behavior, and wherein the joint feature space defines a constrained space in which to search the driving behaviors.
However, merely specifying the kind of data processed as part of the mapping and learning process does not change that a user could mentally or manually process the information recited. Therefore, the additional limitations do not serve to integrate the judicial exception into a practical application.

Regarding claim 11, applicant recites The non-transitory computer-readable medium of claim 9, wherein the instructions to generate the joint feature space include instructions to train a log encoder to map the driving log snippets into the joint feature space as log feature vectors and to train a textual encoder to map the textual descriptions into the joint feature space as textual feature vectors that correspond to associated snippets of the driving log snippets.
However, merely specifying that the processed information is stored in the form of vectors does not change the fact that a user could mentally conceptualize such storage of the processed information and perform said processing mentally or manually. The additional limitations therefore do not serve to integrate the judicial exception into a practical application.

Regarding claim 12, applicant recites The non-transitory computer-readable medium of claim 11, wherein the instructions to generate the joint feature space include instructions to enforce correspondence between the driving log snippets and the textual descriptions in the joint feature space by training the log encoder and the textual encoder using a loss function that is based, at least in part, on a similarity metric for comparing the log feature vectors and the textual feature vectors, wherein the joint feature space regularizes the driving behaviors as lawful and interpretable actions for controlling the vehicle, and wherein the log feature vectors and the textual feature vectors provide an encoded representation of the driving log snippets and the textual description according to at least a context and a temporal sequence of actions that define separate behaviors in the joint feature space and provide for associating the textual descriptions with the driving log snippets.
However, a user could mentally conceptualize enforcing a mapping between two different sets of observed data, and could mentally conceptualize a cost associated with certain mappings (i.e., a “loss function”) to determine which mappings are most appropriate and indicate the greatest or least amounts of similarity between the sets of data. A user could further use such a mental mapping to mentally decide which driving behaviors are lawful and interpretable. Furthermore, a user can mentally encode and map driving observations and textual descriptions with each other based on context and a temporal sequence of actions. Therefore, the additional limitations do not serve to integrate the judicial exception into a practical application.

Regarding claim 13, applicant recites A method of determining driving behaviors for controlling a vehicle, comprising: 
generating, using textual descriptions in combination with driving log snippets, a joint feature space that represents a coordinated mapping between the textual descriptions and the driving log snippets; 
training a policy network to generate identified behaviors from the driving behaviors according to a correspondence between an observed context that is mapped onto the joint feature space and the driving behaviors defined in the joint feature space resulting from at least the textual descriptions; and 
providing a behavior cloning model including at least an encoder, the joint feature space, and the policy network to generate control behaviors from the driving behaviors defined in the joint feature space according to acquired observations of a surrounding environment of the vehicle.
The claim recites a series of steps and therefore is directed to a process, which satisfies step 1 of the Section 101 analysis. Under the new two-prong inquiry, the claim is eligible at revised step 2A unless: Prong One: the claim recites a judicial exception; and Prong Two: the exception is not integrated into a practical application of the exception. 
The above claim steps are directed to the concept of mapping textual descriptions to driving data; training a policy based on a correspondence between an observed context and driving behaviors based on the mapping; and determining control behaviors based on observations of a surrounding environment, the mapping, and the policy. This is an abstract idea that can be performed by a user mentally or manually and falls within the Mental Processes grouping, since a user can observe and process context and driving data, create correlations between the context and the driving data, and determine what behaviors should be taken in certain contexts accordingly. (Prong one: YES, recites an abstract idea).
Nothing in the claim elements precludes the steps from being performed entirely by a human. (Prong Two: NO, does not recite additional elements that integrate the abstract idea into a practical application similar to that shown in MPEP 2106.05).
Under step 2B, the claimed invention does not recite additional elements that are indicative of an inventive concept. 
Examiner’s note to help applicant overcome the 101 rejection: a common way for applicants to overcome 101 rejections in vehicle cases is to amend the claims to affirmatively recite control of vehicle actuators or autonomous driving, as such control cannot be performed by a user mentally or manually. For example, applicant could add the limitation, “wherein one or more processors control autonomous driving of the vehicle based on the generated control behaviors” to the present claim. This would overcome the present 101 rejection.
Of course, this is merely the examiner’s suggestion to move prosecution forward, and it is ultimately up to applicant how applicant wishes to amend the claims.

Regarding claim 14, applicant recites The method of claim 13, wherein generating the joint feature space includes receiving the textual descriptions that describe rules for controlling the vehicle, and receiving the driving log snippets that indicate one or more control inputs associated with an observed behavior of the vehicle as observed via associated sensor data of the vehicle depicting a surrounding environment at a time of the observed behavior, and wherein the joint feature space defines a constrained space in which to search the driving behaviors.
However, merely specifying the kind of data processed as part of the mapping and learning process does not change that a user could mentally or manually process the information recited. Therefore, the additional limitations do not serve to integrate the judicial exception into a practical application.

Regarding claim 15, applicant recites The method of claim 13, wherein generating the joint feature space includes training a log encoder to map the driving log snippets into the joint feature space as log feature vectors and training a textual encoder to map the textual descriptions into the joint feature space as textual feature vectors that correspond to associated snippets of the driving log snippets.
However, merely specifying that the processed information is stored in the form of vectors does not change the fact that a user could mentally conceptualize such storage of the processed information and perform said processing mentally or manually. The additional limitations therefore do not serve to integrate the judicial exception into a practical application.

Regarding claim 16, applicant recites The method of claim 15, wherein generating the joint feature space includes enforcing correspondence between the driving log snippets and the textual descriptions in the joint feature space by training the log encoder and the textual encoder using a loss function that is based, at least in part, on a similarity metric for comparing the log feature vectors and the textual feature vectors, and wherein the joint feature space regularizes the driving behaviors as lawful and interpretable actions for controlling the vehicle.
However, a user could mentally conceptualize enforcing a mapping between two different sets of observed data, and could mentally conceptualize a cost associated with certain mappings (i.e., a “loss function”) to determine which mappings are most appropriate and indicate the greatest or least amounts of similarity between the sets of data. A user could further use such a mental mapping to mentally decide which driving behaviors are lawful and interpretable. Therefore, the additional limitations do not serve to integrate the judicial exception into a practical application.

Regarding claim 17, applicant recites The method of claim 15, wherein the log feature vectors and the textual feature vectors provide an encoded representation of the driving log snippets and the textual description according to at least a context and a temporal sequence of actions that define separate behaviors in the joint feature space and provide for associating the textual descriptions with the driving log snippets.
However, a user can mentally encode and map driving observations and textual descriptions with each other based on context and a temporal sequence of actions. Therefore, the additional limitations do not serve to integrate the judicial exception into a practical application.

Regarding claim 18, applicant recites The method of claim 13, wherein the joint feature space is a vector space identifying the driving behaviors mapped against driving rules defined by the textual descriptions and observed behaviors sampled in the driving log snippets that have been projected into the joint feature space to provide the driving log snippets as interpretable representations of the textual descriptions.
However, a user can mentally conceptualize a mapping of driving rules defined by textual descriptions and observed behaviors of a vehicle to determine which descriptions and which observed behaviors should be matched with each other. Therefore, the additional limitations do not serve to integrate the judicial exception into a practical application.

Regarding claim 19, applicant recites The method of claim 13, wherein providing the behavior cloning model includes processing sensor data from the vehicle using the behavior cloning model to identify behaviors from the joint feature space for controlling the vehicle.
However, a user can mentally analyze sensor data from a vehicle to identify behaviors from a data mapping which could be used for controlling a vehicle. Therefore, the additional limitations do not serve to integrate the judicial exception into a practical application.

Regarding claim 20, applicant recites The method of claim 13, wherein generating the joint feature space and training the policy network is a two-stage training process that is unsupervised, wherein the textual descriptions are structued driving rules from a driver handbook that are provided in a standardized format, wherein the driving log snippets include driver control inputs for controlling the vehicle and sensor data, and wherein the behavior cloning model forms a deep metric learning network.
However, an unsupervised user can mentally associate written driving rules with driver control inputs for a vehicle and sensor data to learn an association between the three data sources. Therefore, the additional limitations do not serve to integrate the judicial exception into a practical application.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 13, 15 and 17-19 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Dean et al. (US 20190392231 A1), hereinafter referred to as Dean.
Regarding claim 13, Dean discloses A method of determining driving behaviors for controlling a vehicle (See at least Fig. 10 in Dean: Dean discloses that a vehicle is controlled in an autonomous driving mode based on identified semantic meaning at block 1040 [See at least Dean, 0092]), comprising: 
generating, using textual descriptions in combination with driving log snippets, a joint feature space that represents a coordinated mapping between the textual descriptions and the driving log snippets (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. Dean further discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. Dean further discloses that training images containing phrases are used to train the phrase recognition model [See at least Dean, 0070]. Since these training images contain sensed image data, they may be regarded as part of the driving log snippets. The setting of generated driving responses based on textual descriptions/phrases which are in turn correlated with certain image features may be regarded as generation of a joint feature space); 
training a policy network to generate identified behaviors from the driving behaviors (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. Dean further discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. The setting of the generated response based on the semantic meaning may be regarded as training of a policy network) according to a correspondence between an observed context that is mapped onto the joint feature space (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. Also see at least Fig. 5 in Dean: Dean further discloses that, at block 540 of a training process, a phrase recognition model is trained such that, in response to receiving an input image, the model outputs output data indicating whether a phrase of the plurality of phrases is included in the input image [See at least Dean, 0070]. 
The designation of the words and phrases of [Dean, 0052] as those identifiable in an image by a vehicle may therefore be regarded as an observed context) and the driving behaviors defined in the joint feature space resulting from at least the textual descriptions (Dean discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]); and 
providing a behavior cloning model including at least an encoder (See at least Fig. 10 in Dean: Dean discloses that an image of the vehicle’s surrounding environment is processed using a phrase recognition model in order to identify text in the image that is included in a selected phrase list at block 1020 [See at least Dean, 0092]), the joint feature space (See at least Fig. 10 in Dean: Dean discloses that an image of the vehicle’s surrounding environment is processed using a phrase recognition model in order to identify text in the image that is included in a selected phrase list at block 1020, to extract semantic meaning for the text in block 1030, and to execute a vehicle response in block 1040 [See at least Dean, 0092]. It will be appreciated that the joint feature space used to perform this action based on the text phrases is the same image-phrase-response joint feature space as described in [Dean, 0052-0053 and 0070] and earlier in this rejection), and the policy network (See at least Fig. 10 in Dean: Dean discloses that a semantic meaning for the identified text is identified at block 1030, and that the vehicle is then controlled in the autonomous driving mode based on the identified semantic meaning at block 1040 [See at least Dean, 0092]. It will be appreciated that the policy network used to perform this action based on semantic meaning is the same semantic meaning-response policy network as described in [Dean, 0052-0053] and earlier in this rejection) to generate control behaviors from the driving behaviors defined in the joint feature space according to acquired observations of a surrounding environment of the vehicle (See at least Fig. 10 in Dean: Dean discloses that a vehicle is then controlled in an autonomous driving mode based on identified semantic meaning at block 1040 [See at least Dean, 0092]).

Regarding claim 15, Dean discloses The method of claim 13, wherein generating the joint feature space includes training a log encoder to map the driving log snippets into the joint feature space as log feature vectors and training a textual encoder to map the textual descriptions into the joint feature space as textual feature vectors that correspond to associated snippets of the driving log snippets (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. Dean further discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. Dean further discloses that training images containing phrases are used to train the phrase recognition model [See at least Dean, 0070]. The phrase list may be regarded as a textual feature vector and the corresponding lists of vehicle responses and images containing phrases may be regarded as log feature vectors).

Regarding claim 17, Dean discloses The method of claim 15, wherein the log feature vectors and the textual feature vectors provide an encoded representation of the driving log snippets and the textual description according to at least a context and a temporal sequence of actions that define separate behaviors in the joint feature space (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. Dean further discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. It will therefore be appreciated that each pair of a phrase and vehicle response may be regarded as a distinct behavior in the joint feature space, and that further, each response indicates an action that will occur temporally after the act of detecting a phrase) and provide for associating the textual descriptions with the driving log snippets (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. 
Dean further discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. It will therefore be appreciated that each pair of a phrase and vehicle response may be regarded as a textual description associated with a driving log snippet).

Regarding claim 18, Dean discloses The method of claim 13, wherein the joint feature space is a vector space identifying the driving behaviors mapped against driving rules defined by the textual descriptions and observed behaviors sampled in the driving log snippets that have been projected into the joint feature space to provide the driving log snippets as interpretable representations of the textual descriptions (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. Dean further discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. Dean further discloses that training images containing phrases are used to train the phrase recognition model [See at least Dean, 0070]).

Regarding claim 19, Dean discloses The method of claim 13, wherein providing the behavior cloning model includes processing sensor data from the vehicle using the behavior cloning model to identify behaviors from the joint feature space for controlling the vehicle (See at least Fig. 10 in Dean: Dean discloses that an image of the vehicle’s surrounding environment is processed using a phrase recognition model in order to identify text in the image that is included in a selected phrase list at block 1020, to extract semantic meaning for the text in block 1030, and to execute a vehicle response in block 1040 [See at least Dean, 0092]. It will be appreciated that the joint feature space used to perform this action based on the text phrases is the same phrase-response joint feature space as described in [Dean, 0052-0053]).

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 5-7, 9-11 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Dean et al. (US 20190392231 A1) in view of Balakrishnan et al. (US 20190299978 A1), hereinafter referred to as Dean and Balakrishnan, respectively.
Regarding claim 1, Dean discloses A vehicle behavior system for determining driving behaviors for controlling a vehicle (See at least Fig. 10 in Dean: Dean discloses that a vehicle is controlled in an autonomous driving mode based on identified semantic meaning at block 1040 [See at least Dean, 0092]), comprising: 
one or more processors (See at least Fig. 5 and Fig. 10 in Dean: Dean discloses that methods 500 and 1000 may be performed by one or more processors of one or more computing devices [See at least Dean, 0070 and 0092]); 
one or more memories communicably coupled to the one or more processors (See at least Fig. 5 in Dean: Dean discloses that the example flow diagram 500 may be performed by one or more processors of one or more computing devices, such as the processors of server computing devices 210, which means that the flow diagram is stored onboard the one or more processors of the server computing device 210 [See at least Dean, 0070]. Also see at least Fig. 10 in Dean: Dean discloses that the example flow diagram 1000 may be performed by one or more processors of one or more computing devices, such as processors 120 of computing devices 110, in order to control the vehicle in the autonomous driving mode [See at least Dean, 0092]) and storing: 
a training module including instructions that when executed by the one or more processors cause the one or more processors to generate, using textual descriptions in combination with driving log snippets, a joint feature space that represents a coordinated mapping between the textual descriptions and the driving log snippets (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. Dean further discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. Dean further discloses that training images containing phrases are used to train the phrase recognition model [See at least Dean, 0070]. Since these training images contain sensed image data, they may be regarded as part of the driving log snippets. The setting of generated driving responses based on textual descriptions/phrases, which are in turn correlated with certain image features, may be regarded as generation of a joint feature space), 
wherein the training module includes instructions to train a policy network to generate identified behaviors from the driving behaviors (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. Dean further discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. The setting of the generated response based on the semantic meaning may be regarded as training of a policy network) according to a correspondence between an observed context that is mapped onto the joint feature space (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. Also see at least Fig. 5 in Dean: Dean further discloses that, at block 540 of a training process, a phrase recognition model is trained such that, in response to receiving an input image, the model outputs output data indicating whether a phrase of the plurality of phrases is included in the input image [See at least Dean, 0070]. 
The designation of the words and phrases of [Dean, 0052] as those identifiable in an image by a vehicle may therefore be regarded as an observed context) and the driving behaviors defined in the joint feature space resulting from at least the textual descriptions (Dean discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]); and 
a network module including instructions that when executed by the one or more processors cause the one or more processors to provide a behavior cloning model including at least an encoder (See at least Fig. 10 in Dean: Dean discloses that an image of the vehicle’s surrounding environment is processed using a phrase recognition model in order to identify text in the image that is included in a selected phrase list at block 1020 [See at least Dean, 0092]), the joint feature space (See at least Fig. 10 in Dean: Dean discloses that an image of the vehicle’s surrounding environment is processed using a phrase recognition model in order to identify text in the image that is included in a selected phrase list at block 1020, to extract semantic meaning for the text in block 1030, and to execute a vehicle response in block 1040 [See at least Dean, 0092]. It will be appreciated that the joint feature space used to perform this action based on the text phrases is the same image-phrase-response joint feature space as described in [Dean, 0052-0053 and 0070] and earlier in this rejection), and the policy network (See at least Fig. 10 in Dean: Dean discloses that a semantic meaning for the identified text is identified at block 1030, and that the vehicle is then controlled in the autonomous driving mode based on the identified semantic meaning at block 1040 [See at least Dean, 0092]. It will be appreciated that the policy network used to perform this action based on semantic meaning is the same semantic meaning-response policy network as described in [Dean, 0052-0053] and earlier in this rejection) to generate control behaviors from the driving behaviors defined in the joint feature space according to acquired observations of a surrounding environment of the vehicle (See at least Fig. 10 in Dean: Dean discloses that a vehicle is then controlled in an autonomous driving mode based on identified semantic meaning at block 1040 [See at least Dean, 0092]).
	However, Dean does not explicitly teach where the one or more memories are a single memory.
	Balakrishnan does teach where the one or more memories used for training a vehicle are a single memory located onboard the vehicle (Balakrishnan teaches that subsystems may interface with a neural network onboard the autonomous vehicle to train the onboard neural network [See at least Balakrishnan, 0019-0020]). Both Balakrishnan and Dean teach methods for training machine-learning modules of autonomous vehicles. However, only Balakrishnan teaches where the training may occur onboard the vehicle using computational resources of the vehicle.
	It would have been obvious to a person having ordinary skill in the art prior to the effective filing date of the claimed invention to modify the training step of the method of Dean to also allow training to occur locally, using local memory and computational resources of the vehicle, as in Balakrishnan. A person having ordinary skill in the art would appreciate that training a machine-learning controller for a vehicle onboard the vehicle, rather than training it offline and subsequently transmitting it to the vehicle, is an obvious substitution.

Regarding claim 2, Dean in view of Balakrishnan teaches The vehicle behavior system of claim 1, wherein the training module includes instructions to generate the joint feature space including instructions to receive the textual descriptions that describe rules for controlling the vehicle (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. Dean further discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. Dean further discloses that training images containing phrases are used to train the phrase recognition model [See at least Dean, 0070]. The setting of generated driving responses based on textual descriptions/phrases, which are in turn correlated with certain image features, may be regarded as generation of a joint feature space), wherein the joint feature space defines a constrained space in which to search the driving behaviors (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. 
Dean further discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. Dean further discloses that training images containing phrases are used to train the phrase recognition model [See at least Dean, 0070]. It will be appreciated, from [Dean, 0052-0053 and 0070], that the finite phrase list may be regarded as a constraint on the joint feature space). 
However, Dean does not explicitly teach the system wherein the training module includes instructions to receive the driving log snippets that indicate one or more control inputs associated with an observed behavior of the vehicle as observed via associated sensor data of the vehicle depicting a surrounding environment at a time of the observed behavior.
Balakrishnan does teach a training method for an autonomous driving system of a vehicle wherein a training module includes instructions to receive the driving log snippets that indicate one or more control inputs associated with an observed behavior of the vehicle as observed via associated sensor data of the vehicle depicting a surrounding environment at a time of the observed behavior (See at least Fig. 7 in Balakrishnan: Balakrishnan teaches that, as part of process 700, the vehicle state and actions may be continually monitored and evaluated by the vehicle in order to train the neural networks of the vehicle [See at least Balakrishnan, 0053-0054]. Balakrishnan further teaches that the vehicle state may be determined by referencing sensor data, as well as referencing data from external sources such as predetermined maps of a surrounding area [See at least Balakrishnan, 0052]). Both Dean and Balakrishnan teach methods for training autonomous vehicle maneuvers. However, only Balakrishnan explicitly teaches where the training of the maneuvers is based on sensor data of the state of the vehicle collected during driving.
It would have been obvious to a person having ordinary skill in the art, prior to the effective filing date of the claimed invention, to modify the training method of Dean to also utilize vehicle state data gathered by the vehicle itself during driving to determine the vehicle maneuvers. Doing so improves safety by allowing the vehicle to evaluate different maneuvers in real time in order to determine the best maneuvers for a given scenario (With regard to this reasoning, see at least [Balakrishnan, 0052-0054]).

Regarding claim 3, Dean in view of Balakrishnan teaches The vehicle behavior system of claim 1, wherein the training module includes instructions to generate the joint feature space including instructions to train a log encoder to map the driving log snippets into the joint feature space as log feature vectors and to train a textual encoder to map the textual descriptions into the joint feature space as textual feature vectors that correspond to associated snippets of the driving log snippets (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. Dean further discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. Dean further discloses that training images containing phrases are used to train the phrase recognition model [See at least Dean, 0070]. The phrase list may be regarded as a textual feature vector and the corresponding lists of vehicle responses and images containing phrases may be regarded as log feature vectors).

Regarding claim 5, Dean in view of Balakrishnan teaches The vehicle behavior system of claim 3, wherein the log feature vectors and the textual feature vectors provide an encoded representation of the driving log snippets and the textual description according to at least a context and a temporal sequence of actions that define separate behaviors in the joint feature space (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. Dean further discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. It will therefore be appreciated that each pair of a phrase and vehicle response may be regarded as a distinct behavior in the joint feature space, and that further, each response indicates an action that will occur temporally after the act of detecting a phrase) and provide for associating the textual descriptions with the driving log snippets (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. 
Dean further discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. It will therefore be appreciated that each pair of a phrase and vehicle response may be regarded as a textual description associated with a driving log snippet).

Regarding claim 6, Dean in view of Balakrishnan teaches The vehicle behavior system of claim 1, wherein the joint feature space is a vector space identifying the driving behaviors mapped against driving rules defined by the textual descriptions and observed behaviors sampled in the driving log snippets that have been projected into the joint feature space to provide the driving log snippets as interpretable representations of the textual descriptions (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. Dean further discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. Dean further discloses that training images containing phrases are used to train the phrase recognition model [See at least Dean, 0070]).

Regarding claim 7, Dean in view of Balakrishnan teaches The vehicle behavior system of claim 1, wherein the network module includes instructions to provide the behavior cloning model including instructions to process sensor data from the vehicle using the behavior cloning model to identify behaviors from the joint feature space for controlling the vehicle (See at least Fig. 10 in Dean: Dean discloses that an image of the vehicle’s surrounding environment is processed using a phrase recognition model in order to identify text in the image that is included in a selected phrase list at block 1020, to extract semantic meaning for the text in block 1030, and to execute a vehicle response in block 1040 [See at least Dean, 0092]. It will be appreciated that the joint feature space used to perform this action based on the text phrases is the same phrase-response joint feature space as described in [Dean, 0052-0053]).

Regarding claim 9, Dean discloses One or more non-transitory computer-readable media (Dean discloses that information is stored in the memory 130 of the computing devices 110 of a vehicle in order to allow the computing devices to use the phrase recognition model to make driving decisions for the vehicle 100 [See at least Dean, 0071]. Dean further teaches that the phrase recognition model may be trained using processors of a server computing device 210 [See at least Dean, 0070]) for determining driving behaviors for controlling a vehicle (See at least Fig. 10 in Dean: Dean discloses that a vehicle is controlled in an autonomous driving mode based on identified semantic meaning at block 1040 [See at least Dean, 0092]) and including instructions that when executed by one or more processors cause the one or more processors (See at least Fig. 5 and Fig. 10 in Dean: Dean discloses that methods 500 and 1000 may be performed by one or more processors of one or more computing devices [See at least Dean, 0070 and 0092]) to: 
generate, using textual descriptions in combination with driving log snippets, a joint feature space that represents a coordinated mapping between the textual descriptions and the driving log snippets (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. Dean further discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. Dean further discloses that training images containing phrases are used to train the phrase recognition model [See at least Dean, 0070]. Since these training images contain sensed image data, they may be regarded as part of the driving log snippets. The setting of generated driving responses based on textual descriptions/phrases, which are in turn correlated with certain image features, may be regarded as generation of a joint feature space); 
train a policy network to generate identified behaviors from the driving behaviors (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. Dean further discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. The setting of the generated response based on the semantic meaning may be regarded as training of a policy network) according to a correspondence between an observed context that is mapped onto the joint feature space (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. Also see at least Fig. 5 in Dean: Dean further discloses that, at block 540 of a training process, a phrase recognition model is trained such that, in response to receiving an input image, the model outputs output data indicating whether a phrase of the plurality of phrases is included in the input image [See at least Dean, 0070]. 
The designation of the words and phrases of [Dean, 0052] as those identifiable in an image by a vehicle may therefore be regarded as an observed context) and the driving behaviors defined in the joint feature space resulting from at least the textual descriptions (Dean discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]); and 
provide a behavior cloning model including at least an encoder (See at least Fig. 10 in Dean: Dean discloses that an image of the vehicle’s surrounding environment is processed using a phrase recognition model in order to identify text in the image that is included in a selected phrase list at block 1020 [See at least Dean, 0092]), the joint feature space (See at least Fig. 10 in Dean: Dean discloses that an image of the vehicle’s surrounding environment is processed using a phrase recognition model in order to identify text in the image that is included in a selected phrase list at block 1020, to extract semantic meaning for the text in block 1030, and to execute a vehicle response in block 1040 [See at least Dean, 0092]. It will be appreciated that the joint feature space used to perform this action based on the text phrases is the same image-phrase-response joint feature space as described in [Dean, 0052-0053 and 0070] and earlier in this rejection), and the policy network (See at least Fig. 10 in Dean: Dean discloses that a semantic meaning for the identified text is identified at block 1030, and that the vehicle is then controlled in the autonomous driving mode based on the identified semantic meaning at block 1040 [See at least Dean, 0092]. It will be appreciated that the policy network used to perform this action based on semantic meaning is the same semantic meaning-response policy network as described in [Dean, 0052-0053] and earlier in this rejection) to generate control behaviors from the driving behaviors defined in the joint feature space according to acquired observations of a surrounding environment of the vehicle (See at least Fig. 10 in Dean: Dean discloses that a vehicle is then controlled in an autonomous driving mode based on identified semantic meaning at block 1040 [See at least Dean, 0092]).
However, Dean does not explicitly teach where the one or more non-transitory computer-readable media are a single non-transitory computer-readable medium.
However, Balakrishnan does teach where the one or more non-transitory computer-readable media used for training a vehicle are a single non-transitory computer-readable medium located onboard the vehicle (Balakrishnan teaches that subsystems may interface with a neural network onboard the autonomous vehicle to train the onboard neural network [See at least Balakrishnan, 0019-0020 and Claim 19]). Both Balakrishnan and Dean teach methods for training machine-learning modules of autonomous vehicles. However, only Balakrishnan teaches where the training may occur onboard the vehicle using computational resources of the vehicle.
It would have been obvious to anyone of ordinary skill in the art prior to the effective filing date of the claimed invention to modify the training step of the method of Dean to also allow training to occur locally, using local memory and computational resources of the vehicle, as in Balakrishnan. Anyone of ordinary skill in the art will appreciate that it is an obvious substitution for a machine-learning controller for a vehicle to be trained onboard the vehicle rather than being trained offline and subsequently transmitted to the vehicle.

Regarding claim 10, Dean in view of Balakrishnan teaches The non-transitory computer-readable medium of claim 9, wherein the instructions to generate the joint feature space include instructions to receive the textual descriptions that describe rules for controlling the vehicle (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. Dean further discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. Dean further discloses that training images containing phrases are used to train the phrase recognition model [See at least Dean, 0070]. The setting of generated driving responses based on textual descriptions/phrases which are in turn correlated with certain image features may be regarded as generation of a joint feature space), and 
wherein the joint feature space defines a constrained space in which to search the driving behaviors (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. Dean further discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. Dean further discloses that training images containing phrases are used to train the phrase recognition model [See at least Dean, 0070]. It will be appreciated, from [Dean, 0052-0053 and 0070], that the finite phrase list may be regarded as a constraint on the joint feature space).
However, Dean does not explicitly teach the non-transitory computer-readable medium wherein the instructions further include instructions to receive the driving log snippets that indicate one or more control inputs associated with an observed behavior of the vehicle as observed via associated sensor data of the vehicle depicting a surrounding environment at a time of the observed behavior.
However, Balakrishnan does teach a training method for an autonomous driving system of a vehicle that further includes instructions to receive the driving log snippets that indicate one or more control inputs associated with an observed behavior of the vehicle as observed via associated sensor data of the vehicle depicting a surrounding environment at a time of the observed behavior (See at least Fig. 7 in Balakrishnan: Balakrishnan teaches that, as part of process 700, the vehicle state and actions may be continually monitored and evaluated by the vehicle in order to train the neural networks of the vehicle [See at least Balakrishnan, 0053-0054]. Balakrishnan further teaches that the vehicle state may be determined by referencing sensor data, as well as referencing data from external sources such as predetermined maps of a surrounding area [See at least Balakrishnan, 0052]). Both Dean and Balakrishnan teach methods for training autonomous vehicle maneuvers. However, only Balakrishnan explicitly teaches where the training of the maneuvers is based on sensor data of the state of the vehicle collected during driving.
It would have been obvious to anyone of ordinary skill in the art to modify the training method of Dean to also utilize vehicle state data gathered by the vehicle itself during driving to determine the vehicle maneuvers. Doing so improves safety by allowing the vehicle to evaluate different maneuvers in real-time in order to determine the best maneuvers for a given scenario (With regard to this reasoning, see at least [Balakrishnan, 0052-0054]).

Regarding claim 11, Dean in view of Balakrishnan teaches The non-transitory computer-readable medium of claim 9, wherein the instructions to generate the joint feature space include instructions to train a log encoder to map the driving log snippets into the joint feature space as log feature vectors and to train a textual encoder to map the textual descriptions into the joint feature space as textual feature vectors that correspond to associated snippets of the driving log snippets (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. Dean further discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. Dean further discloses that training images containing phrases are used to train the phrase recognition model [See at least Dean, 0070]. The phrase list may be regarded as a textual feature vector and the corresponding lists of vehicle responses and images containing phrases may be regarded as log feature vectors).

Regarding claim 14, Dean discloses The method of claim 13, wherein generating the joint feature space includes receiving the textual descriptions that describe rules for controlling the vehicle (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. Dean further discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. Dean further discloses that training images containing phrases are used to train the phrase recognition model [See at least Dean, 0070]. The setting of generated driving responses based on textual descriptions/phrases which are in turn correlated with certain image features may be regarded as generation of a joint feature space), and 
wherein the joint feature space defines a constrained space in which to search the driving behaviors (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. Dean further discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. Dean further discloses that training images containing phrases are used to train the phrase recognition model [See at least Dean, 0070]. It will be appreciated, from [Dean, 0052-0053 and 0070], that the finite phrase list may be regarded as a constraint on the joint feature space).
However, Dean does not explicitly teach the method further comprising receiving the driving log snippets that indicate one or more control inputs associated with an observed behavior of the vehicle as observed via associated sensor data of the vehicle depicting a surrounding environment at a time of the observed behavior.
However, Balakrishnan does teach a training method further comprising receiving the driving log snippets that indicate one or more control inputs associated with an observed behavior of the vehicle as observed via associated sensor data of the vehicle depicting a surrounding environment at a time of the observed behavior (See at least Fig. 7 in Balakrishnan: Balakrishnan teaches that, as part of process 700, the vehicle state and actions may be continually monitored and evaluated by the vehicle in order to train the neural networks of the vehicle [See at least Balakrishnan, 0053-0054]. Balakrishnan further teaches that the vehicle state may be determined by referencing sensor data, as well as referencing data from external sources such as predetermined maps of a surrounding area [See at least Balakrishnan, 0052]). Both Dean and Balakrishnan teach methods for training autonomous vehicle maneuvers. However, only Balakrishnan explicitly teaches where the training of the maneuvers is based on sensor data of the state of the vehicle collected during driving.
It would have been obvious to anyone of ordinary skill in the art to modify the training method of Dean to also utilize vehicle state data gathered by the vehicle itself during driving to determine the vehicle maneuvers. Doing so improves safety by allowing the vehicle to evaluate different maneuvers in real-time in order to determine the best maneuvers for a given scenario (With regard to this reasoning, see at least [Balakrishnan, 0052-0054]).

Claims 4 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Dean et al. (US 20190392231 A1) in view of Balakrishnan et al. (US 20190299978 A1) in further view of Ye et al. (US 20210295114 A1), hereinafter referred to as Ye.
Regarding claim 4, Dean in view of Balakrishnan teaches The vehicle behavior system of claim 3, wherein the training module includes instructions to generate the joint feature space including instructions to enforce correspondence between the driving log snippets and the textual descriptions in the joint feature space by training the log encoder and the textual encoder (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. Dean further discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. Dean further discloses that training images containing phrases are used to train the phrase recognition model [See at least Dean, 0070]. The correspondence between the images and vehicle behaviors from the driving log snippets and the text phrases is therefore enforced), wherein the joint feature space regularizes the driving behaviors as lawful and interpretable actions for controlling the vehicle (Dean discloses that, for each phrase of the phrase list, the operator may identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. This may be regarded as regularizing the driving behaviors as lawful and interpretable actions for controlling the vehicle).
However, Dean does not explicitly teach the system wherein correspondence between driving log snippets and textual descriptions is enforced using a loss function that is based, at least in part, on a similarity metric for comparing the log feature vectors and the textual feature vectors.
However, Ye does teach a system wherein correspondence between logged image snippets and textual descriptions is enforced using a loss function that is based, at least in part, on a similarity metric for comparing the log feature vectors and the textual feature vectors (See at least Fig. 6 in Ye: Ye teaches that at S107, the system updates the parameters in the image text extraction model based on the text attribute loss function value and the text location loss function value, where the parameters in the image text extraction model include the parameter of each convolutional layer in the backbone network, the parameter of each layer in the feature fusion subnetwork, the parameter of each layer in the classification subnetwork, the parameter of each layer in the bounding box regression subnetwork, and the like, which quantify how close the identified text characteristics are to the actual text to be detected [See at least Ye, 0066]). Both Ye and Dean teach methods for training a text extraction model for images. However, only Ye explicitly teaches where consistency between the extracted features of the text and the actual features of the text is enforced using loss functions.
It would have been obvious to anyone of ordinary skill in the art prior to the effective filing date of the claimed invention to modify the text extraction method of Dean to also utilize loss functions to improve the accuracy of text extraction from images, as in Ye. Doing so optimizes the accuracy of text extraction, as will be appreciated by anyone of ordinary skill in the art.

Regarding claim 12, Dean in view of Balakrishnan teaches The non-transitory computer-readable medium of claim 11, wherein the instructions to generate the joint feature space include instructions to enforce correspondence between the driving log snippets and the textual descriptions in the joint feature space by training the log encoder and the textual encoder (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. Dean further discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. Dean further discloses that training images containing phrases are used to train the phrase recognition model [See at least Dean, 0070]. The correspondence between the images and vehicle behaviors from the driving log snippets and the text phrases is therefore enforced), 
wherein the joint feature space regularizes the driving behaviors as lawful and interpretable actions for controlling the vehicle (Dean discloses that, for each phrase of the phrase list, the operator may identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. This may be regarded as regularizing the driving behaviors as lawful and interpretable actions for controlling the vehicle), and 
wherein the log feature vectors and the textual feature vectors provide an encoded representation of the driving log snippets and the textual description according to at least a context and a temporal sequence of actions that define separate behaviors in the joint feature space (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. Dean further discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. It will therefore be appreciated that each pair of a phrase and vehicle response may be regarded as a distinct behavior in the joint feature space, and that further, each response indicates an action that will occur temporally after the act of detecting a phrase) and provide for associating the textual descriptions with the driving log snippets (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. 
Dean further discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. It will therefore be appreciated that each pair of a phrase and vehicle response may be regarded as a textual description associated with a driving log snippet).
However, Dean does not explicitly teach the non-transitory computer-readable medium wherein correspondence between driving log snippets and textual descriptions is enforced using a loss function that is based, at least in part, on a similarity metric for comparing the log feature vectors and the textual feature vectors.
However, Ye does teach a system wherein correspondence between logged image snippets and textual descriptions is enforced using a loss function that is based, at least in part, on a similarity metric for comparing the log feature vectors and the textual feature vectors (See at least Fig. 6 in Ye: Ye teaches that at S107, the system updates the parameters in the image text extraction model based on the text attribute loss function value and the text location loss function value, where the parameters in the image text extraction model include the parameter of each convolutional layer in the backbone network, the parameter of each layer in the feature fusion subnetwork, the parameter of each layer in the classification subnetwork, the parameter of each layer in the bounding box regression subnetwork, and the like, which quantify how close the identified text characteristics are to the actual text to be detected [See at least Ye, 0066]). Both Ye and Dean teach methods for training a text extraction model for images. However, only Ye explicitly teaches where consistency between the extracted features of the text and the actual features of the text is enforced using loss functions.
It would have been obvious to anyone of ordinary skill in the art prior to the effective filing date of the claimed invention to modify the text extraction method of Dean to also utilize loss functions to improve the accuracy of text extraction from images, as in Ye. Doing so optimizes the accuracy of text extraction, as will be appreciated by anyone of ordinary skill in the art.

Claims 8 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Dean et al. (US 20190392231 A1) in view of Balakrishnan et al. (US 20190299978 A1) in further view of Silver (US 20120267853 A1), hereinafter referred to as Silver.
Regarding claim 8, Dean in view of Balakrishnan teaches The vehicle behavior system of claim 1, wherein the training module includes instructions to generate the joint feature space (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. Dean further discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. Dean further discloses that training images containing phrases are used to train the phrase recognition model [See at least Dean, 0070]. Since these training images contain sensed image data, they may be regarded as part of the driving log snippets. The setting of generated driving responses based on textual descriptions/phrases which are in turn correlated with certain images features may be regarded as generation of a joint feature space) and to train the policy network (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. 
Dean further discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. The setting of the generated response based on the semantic meaning may be regarded as training of a policy network) represent a two-stage training process for the behavior cloning model that is unsupervised (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. Dean further discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. Dean further discloses that training images containing phrases are used to train the phrase recognition model [See at least Dean, 0070]. The training process may therefore be regarded as a two-stage process, where one stage involves setting phrases which the system learns to recognize based on image data and a second stage involves matching those phrases to semantic meanings which correspond to vehicle responses. 
Furthermore, it will be appreciated that since the vehicle is able to execute responses immediately based on semantic meaning without further analysis of context, that the process may be regarded as creation of an unsupervised model), wherein the textual descriptions are structued driving rules provided in a standardized format (Dean discloses that the phrases represent rules and are provided in a standardized format, in the form of text strings stored in a phrase list [See at least Dean, 0052]), and wherein the driving log snippets include driver control inputs for controlling the vehicle (Dean discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. Since these phrases and responses are provided by the operator, they may be regarded as driver control inputs).
However, Dean does not explicitly teach the system wherein the log snippets further include sensor data, and wherein the behavior cloning model forms a deep metric learning network.
However, Balakrishnan does teach an autonomous vehicle training system wherein the log snippets further include sensor data (See at least Fig. 7 in Balakrishnan: Balakrishnan teaches that, as part of process 700, the vehicle state and actions may be continually monitored and evaluated by the vehicle in order to train the neural networks of the vehicle [See at least Balakrishnan, 0053-0054]), and wherein the behavior cloning model forms a deep metric learning network (See at least Fig. 7 in Balakrishnan: Balakrishnan teaches that the process 700 for automatic vehicle navigation using deep reinforcement learning for the training [See at Balakrishnan, 0052]). Both Dean and Balakrishnan teach methods for training autonomous vehicle maneuvers. However, only Balakrishnan explicitly teaches where the training of the maneuvers is based on sensor data of the state of the vehicle collected during driving and where the training employs deep learning.
It would have been obvious to anyone of ordinary skill in the art prior to the effective filing date of the claimed invention to modify the training method of Dean to also utilize vehicle state data gathered by the vehicle itself during driving to determine the vehicle maneuvers and to employ deep learning, as in Balakrishnan. Doing so improves safety by allowing the vehicle to evaluate different maneuvers in real-time with more reliable machine-learning algorithms in order to determine the best maneuvers for a given scenario (With regard to this reasoning, see at least [Balakrishnan, 0052-0054]).
However, Dean does not explicitly teach the system wherein the textual descriptions are from a driver handbook.
However, Silver does teach a scenario in which textual descriptions of road signs are from a driver handbook (Silver teaches that all "DMV" (Department of Motor Vehicle) road signs are described and explained in the player's/driver's handbook [See at least Silver, claim 4]). Both Silver and Dean teach methods for recognizing signs. However, only Silver teaches where the signs may be the same signs described in a handbook.
It would have been obvious to anyone of ordinary skill in the art prior to the effective filing date of the claimed invention to modify the system of Dean so that the road signs stored as textual descriptions are also described in a handbook. Anyone of ordinary skill in the art will appreciate that handbooks are common catalogues of signs available to drivers to help drivers understand road signs, so it would be logical to have the textual descriptions of the signs from at least [Dean, 0052] present in a handbook for this purpose.

Regarding claim 20, Dean discloses The method of claim 13, wherein generating the joint feature space (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. Dean further discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. Dean further discloses that training images containing phrases are used to train the phrase recognition model [See at least Dean, 0070]. Since these training images contain sensed image data, they may be regarded as part of the driving log snippets. The setting of generated driving responses based on textual descriptions/phrases which are in turn correlated with certain images features may be regarded as generation of a joint feature space) and training the policy network (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. 
Dean further discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. The setting of the generated response based on the semantic meaning may be regarded as training of a policy network) is a two-stage training process that is unsupervised (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. Dean further discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. Dean further discloses that training images containing phrases are used to train the phrase recognition model [See at least Dean, 0070]. The training process may therefore be regarded as a two-stage process, where one stage involves setting phrases which the system learns to recognize based on image data and a second stage involves matching those phrases to semantic meanings which correspond to vehicle responses. 
Furthermore, it will be appreciated that, since the vehicle is able to execute responses immediately based on semantic meaning without further analysis of context, the process may be regarded as creation of an unsupervised model), wherein the textual descriptions are structued driving rules provided in a standardized format (Dean discloses that the phrases represent rules and are provided in a standardized format, in the form of text strings stored in a phrase list [See at least Dean, 0052]), wherein the driving log snippets include driver control inputs for controlling the vehicle (Dean discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. Since these phrases and responses are provided by the operator, they may be regarded as driver control inputs).
However, Dean does not explicitly teach the method wherein the log snippets further include sensor data, and wherein the behavior cloning model forms a deep metric learning network.
However, Balakrishnan does teach an autonomous vehicle training system wherein the log snippets further include sensor data (See at least Fig. 7 in Balakrishnan: Balakrishnan teaches that, as part of process 700, the vehicle state and actions may be continually monitored and evaluated by the vehicle in order to train the neural networks of the vehicle [See at least Balakrishnan, 0053-0054]), and wherein the behavior cloning model forms a deep metric learning network (See at least Fig. 7 in Balakrishnan: Balakrishnan teaches that the process 700 for automatic vehicle navigation uses deep reinforcement learning for the training [See at least Balakrishnan, 0052]). Both Dean and Balakrishnan teach methods for training autonomous vehicle maneuvers. However, only Balakrishnan explicitly teaches where the training of the maneuvers is based on sensor data of the state of the vehicle collected during driving and where the training employs deep learning.
It would have been obvious to anyone of ordinary skill in the art prior to the effective filing date of the claimed invention to modify the training method of Dean to also utilize vehicle state data gathered by the vehicle itself during driving to determine the vehicle maneuvers and to employ deep learning, as in Balakrishnan. Doing so improves safety by allowing the vehicle to evaluate different maneuvers in real-time with more reliable machine-learning algorithms in order to determine the best maneuvers for a given scenario (With regard to this reasoning, see at least [Balakrishnan, 0052-0054]).
However, Dean does not explicitly teach the method wherein the textual descriptions are from a driver handbook.
However, Silver does teach a scenario in which textual descriptions of road signs are from a driver handbook (Silver teaches that all "DMV" (Department of Motor Vehicle) road signs are described and explained in the player's/driver's handbook [See at least Silver, claim 4]). Both Silver and Dean teach methods for recognizing signs. However, only Silver teaches where the signs may be the same signs described in a handbook.
It would have been obvious to anyone of ordinary skill in the art prior to the effective filing date of the claimed invention to modify the system of Dean so that the road signs stored as textual descriptions are also described in a handbook. Anyone of ordinary skill in the art will appreciate that handbooks are common catalogues of signs available to drivers to help drivers understand road signs, so it would be logical to have the textual descriptions of the signs from at least [Dean, 0052] present in a handbook for this purpose.

Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Dean et al. (US 20190392231 A1) in view of Ye et al. (US 20210295114 A1), hereinafter referred to as Ye.
Regarding claim 16, applicant recites The method of claim 15, wherein generating the joint feature space includes enforcing correspondence between the driving log snippets and the textual descriptions in the joint feature space by training the log encoder and the textual encoder (Dean discloses that a selected phrase list may be identified, at least initially, manually by an operator in order to focus on text which is most important for an autonomous vehicle to be able to make intelligent and safe driving decisions, and as such, the selected phrase list may include words and phrases related to rules for controlling a vehicle [See at least Dean, 0052]. Dean further discloses that, for each phrase of the phrase list, the operator may also identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. Dean further discloses that training images containing phrases are used to train the phrase recognition model [See at least Dean, 0070]. The correspondence between the images and vehicle behaviors from the driving log snippets and the text phrases is therefore enforced), and 
wherein the joint feature space regularizes the driving behaviors as lawful and interpretable actions for controlling the vehicle (Dean discloses that, for each phrase of the phrase list, the operator may identify a semantic meaning, or an indication of what to do for or how to respond to, that phrase which may be critical to allowing the vehicle to not only identify an item of the phrase list but also to respond to that item without requiring further analysis of the context in every situation [See at least Dean, 0053]. This may be regarded as regularizing the driving behaviors as lawful and interpretable actions for controlling the vehicle).
However, Dean does not explicitly teach the method wherein correspondence between driving log snippets and textual descriptions is enforced using a loss function that is based, at least in part, on a similarity metric for comparing the log feature vectors and the textual feature vectors.
However, Ye does teach a system wherein correspondence between logged image snippets and textual descriptions is enforced using a loss function that is based, at least in part, on a similarity metric for comparing the log feature vectors and the textual feature vectors (See at least Fig. 6 in Ye: Ye teaches that at S107, the system updates the parameters in the image text extraction model based on the text attribute loss function value and the text location loss function value, where the parameters in the image text extraction model include the parameter of each convolutional layer in the backbone network, the parameter of each layer in the feature fusion subnetwork, the parameter of each layer in the classification subnetwork, the parameter of each layer in the bounding box regression subnetwork, and the like, which quantify how close the identified text characteristics are to the actual text to be detected [See at least Ye, 0066]). Both Ye and Dean teach methods for training a text extraction model for images. However, only Ye explicitly teaches where consistency between the extracted features of the text and the actual features of the text is enforced using loss functions.
It would have been obvious to anyone of ordinary skill in the art prior to the effective filing date of the claimed invention to modify the text extraction method of Dean to also utilize loss functions to improve the accuracy of text extraction from images, as in Ye. Doing so optimizes the accuracy of text extraction, as will be appreciated by anyone of ordinary skill in the art.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NAEEM T ALAM whose telephone number is (571)272-5901. The examiner can normally be reached M-F 9:00 am-5:00 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, FADEY JABR, can be reached at (571) 272-1516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/N.T.A./Examiner, Art Unit 3668
/YAZAN A SOOFI/Primary Examiner, Art Unit 3668