DETAILED ACTION

	Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1, 10, and 19 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Choe et al. (US Pub. No. 2014/0324864 A1).
Regarding claim 1, Choe discloses, a video event recognition method, comprising: constructing a video event graph, each event in the video event graph including: M argument roles of the event and respective arguments of the argument roles, M being a positive integer greater than one; (See Choe ¶73, “FIG. 2 shows an exemplary graphical representation of a scene (e.g., including a loading event). The graphical representation of the scene serves as a framework for analysis, extraction, and representation of the visual elements and structure of the scene, such as the ground plane, sky, buildings, moving vehicles, humans, and interactions between those entities. … As illustrated in FIG. 2, using a traffic scene as an example, bottom-up detection includes classification of image patches (such as road, land, and vegetation), detection of moving objects, and representation of events, which generate data-driven candidates for scene content. Top-down hypotheses, on the other hand, are driven by scene models and contextual relations represented by the attribute grammar, such as the traffic scene model and human-vehicle interaction model.”)
acquiring, for a to-be-recognized video, respective arguments of the M argument roles of a to-be-recognized event corresponding to the video; and selecting, according to the arguments acquired, an event from the video event graph as a recognized event corresponding to the video.  (See Choe ¶134, “FIG. 6B illustrates an exemplary search method starting with a relational graph that results from a queried event of a video clip. The event may be modeled as a relational graph (611), the graph may be broken down into subgraphs which may be structured into groups (612), the subgraph groups (e.g., topics) may be indexed (613). These indices, modeled as vectors, may be compared with other indices and stored subgraph groups to determine if two relational graphs match (614).”)
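For illustration only, the three limitations of claim 1 as mapped above (constructing the graph, acquiring arguments, selecting an event) can be pictured with the minimal Python sketch below. The role names, the sample events, and the overlap-count scoring are assumptions made for this sketch; they are taken neither from the claims nor from Choe, which instead compares indexed subgraph vectors (¶134).

```python
from typing import Dict, Set

# Hypothetical video event graph: event name -> {argument role -> arguments}.
EventArgs = Dict[str, Set[str]]

VIDEO_EVENT_GRAPH: Dict[str, EventArgs] = {
    "vehicle_loading": {
        "scene": {"parking lot"},
        "action": {"approach", "open trunk", "load"},
    },
    "swimming": {
        "scene": {"pool"},
        "action": {"swim"},
    },
}


def select_event(acquired: EventArgs, graph: Dict[str, EventArgs]) -> str:
    """Return the stored event whose arguments overlap most with the acquired ones."""

    def overlap(event_args: EventArgs) -> int:
        # Count shared arguments role by role (an assumed scoring rule).
        return sum(
            len(event_args.get(role, set()) & args)
            for role, args in acquired.items()
        )

    return max(graph, key=lambda name: overlap(graph[name]))


# Arguments acquired for a to-be-recognized video (hand-written for the sketch).
acquired_args = {"scene": {"parking lot"}, "action": {"load", "open trunk"}}
recognized = select_event(acquired_args, VIDEO_EVENT_GRAPH)
```

Here each event carries M = 2 argument roles, which satisfies the recited condition that M be greater than one.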

Regarding claim 10, Choe discloses, an electronic device, comprising: at least one processor; and a memory in communication connection with the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to carry out a video event recognition method, which comprises: (See Choe ¶66, “For example, computer devices 130 may include stored programs that implement the algorithms described herein in combination with the one or more processors 136 and information stored in storage 134.”)
constructing a video event graph, each event in the video event graph including: M argument roles of the event and respective arguments of the argument roles, M being a positive integer greater than one; acquiring, for a to-be-recognized video, respective arguments of the M argument roles of a to-be-recognized event corresponding to the video; and selecting, according to the arguments acquired, an event from the video event graph as a recognized event corresponding to the video.  (See the rejection of claim 1 as it is equally applicable for claim 10 as well.)

Regarding claim 19, Choe discloses, a non-transitory computer-readable storage medium comprising instructions, which, when executed by a computer, cause the computer to carry out a video event recognition method, which comprises: (See Choe ¶58,  “Each computer may include and/or access a computer-readable medium embodying software to operate the computer.”)
constructing a video event graph, each event in the video event graph including: M argument roles of the event and respective arguments of the argument roles, M being a positive integer greater than one; acquiring, for a to-be-recognized video, respective arguments of the M argument roles of a to-be-recognized event corresponding to the video; and selecting, according to the arguments acquired, an event from the video event graph as a recognized event corresponding to the video.  (See the rejection of claim 1 as it is equally applicable for claim 19 as well.)

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 2-5, 11-14, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Choe et al. (US Pub. No. 2014/0324864 A1) in view of Lecue et al. (US Pat. No. 10,339,420 B1).
Regarding claim 2, Choe discloses, the method according to claim 1, wherein the M argument roles comprise: a spatial scene argument role, an action argument role, (See Choe ¶73, “FIG. 2, using a traffic scene as an example, bottom-up detection includes classification of image patches (such as road, land, and vegetation), detection of moving objects.”)
Choe discloses the above limitations but fails to disclose the following limitations.
However, Lecue discloses, a person argument role, an object argument role and a related term argument role. (See Lecue 16:45-47, “In some implementations, the information regarding the plurality of entities indicates corresponding identities of one or more of the plurality of event.”
Lecue 7:17-24, “techniques for detecting the type of entity and/or the type characteristic of the entity, such as one or more object recognition techniques, facial recognition techniques, speech recognition techniques, character recognition techniques, and/or the like. The data stream analyzer may train the model using historical data associated with detecting types of entities and/or types of characteristics of the entities (e.g., using past analyses of the plurality of data streams or other data streams) and annotating the common knowledge graph (or other common knowledge graphs) with corresponding identified information associated with entities and/or characteristics of entities.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the event entities such as a person, object and text and their recognized attributes as part of a video knowledge graph, as suggested by Lecue, in Choe’s video event graph using known engineering techniques, with a reasonable expectation of success. The motivation for doing so is to accurately describe an event using additional relevant entities that more fully describe the video events.

Regarding claim 3, Choe and Lecue disclose, the method according to claim 2, wherein the acquiring respective arguments of the M argument roles of the to-be-recognized event corresponding to the video comprises:
performing vision understanding on the video to obtain an argument of the spatial scene argument role, an argument of the action argument role, (See Choe ¶72, “Videos may be analyzed to determine scene elements, to recognize actions, and to extract contextual information, such as time and location, in order to detect events. The various elements, actions, and events can be modeled using a relational graph.”)
an argument of the person argument role and an argument of the object argument role of the to-be-recognized event; and performing text understanding on a text corresponding to the video to obtain an argument of the related term argument role of the to-be-recognized event.  (See Lecue 7:17-21, “techniques for detecting the type of entity and/or the type characteristic of the entity, such as one or more object recognition techniques, facial recognition techniques, speech recognition techniques, character recognition techniques, and/or the like.”)

Regarding claim 4, Choe and Lecue disclose, the method according to claim 3, wherein the performing vision understanding on the video to obtain the argument of the spatial scene argument role, the argument of the action argument role, the argument of the person argument role and the argument of the object argument role of the to-be-recognized event comprises: 
performing spatial scene recognition on the video to obtain the argument of the spatial scene argument role of the to-be-recognized event; (See Choe ¶74, “An example of scene element extraction is now described. In particular, analysis of urban scenes benefits greatly from knowledge of the locations of buildings, roads, sidewalks, vegetation, and land areas. Maritime scenes similarly benefit from knowledge of the locations of water regions, berthing areas, and sky/cloud regions. From video feeds, a background image is periodically learned and it is processed to extract scene elements.”)
performing action recognition on the video to obtain the argument of the action argument role of the to-be-recognized event; (See Choe ¶75, “For action recognition, to describe one example, video from a calibrated sensor may be processed and metadata of target information may be generated by detection, tracking, and classification of targets.”)
performing face recognition on the video to obtain the argument of the person argument role of the to-be-recognized event; and performing generic object recognition on the video to obtain the argument of the object argument role of the to-be-recognized event.  (See Lecue 7:17-21, “techniques for detecting the type of entity and/or the type characteristic of the entity, such as one or more object recognition techniques, facial recognition techniques, speech recognition techniques, character recognition techniques, and/or the like.”)

Regarding claim 5, Choe and Lecue disclose, the method according to claim 3, wherein the performing text understanding on text corresponding to the video to obtain the argument of the related term argument role of the to-be-recognized event comprises:
performing entity recognition and keyword extraction on the text to obtain the argument of the related term argument role of the to-be-recognized event.  (See Lecue 7:17-21, “techniques for detecting the type of entity and/or the type characteristic of the entity, such as one or more object recognition techniques, facial recognition techniques, speech recognition techniques, character recognition techniques, and/or the like.”)

Regarding claim 11, Choe and Lecue disclose, the electronic device according to claim 10, wherein the M argument roles comprise: a spatial scene argument role, an action argument role, a person argument role, an object argument role and a related term argument role.  (See the rejection of claim 2 as it is equally applicable for claim 11 as well.)

Regarding claim 12, Choe and Lecue disclose, the electronic device according to claim 11, wherein the acquiring respective arguments of the M argument roles of the to-be-recognized event corresponding to the video comprises: performing vision understanding on the video to obtain an argument of the spatial scene argument role, an argument of the action argument role, an argument of the person argument role and an argument of the object argument role of the to-be-recognized event; and performing text understanding on a text corresponding to the video to obtain an argument of the related term argument role of the to-be-recognized event.  (See the rejection of claim 3 as it is equally applicable for claim 12 as well.)

Regarding claim 13, Choe and Lecue disclose, the electronic device according to claim 12, wherein the performing vision understanding on the video to obtain the argument of the spatial scene argument role, the argument of the action argument role, the argument of the person argument role and the argument of the object argument role of the to-be-recognized event comprises: performing spatial scene recognition on the video to obtain the argument of the spatial scene argument role of the to-be-recognized event; performing action recognition on the video to obtain the argument of the action argument role of the to-be-recognized event; performing face recognition on the video to obtain the argument of the person argument role of the to-be-recognized event; and performing generic object recognition on the video to obtain the argument of the object argument role of the to-be-recognized event.  (See the rejection of claim 4 as it is equally applicable for claim 13 as well.)

Regarding claim 14, Choe and Lecue disclose, the electronic device according to claim 12, wherein the performing text understanding on text corresponding to the video to obtain the argument of the related term argument role of the to-be-recognized event comprises: performing entity recognition and keyword extraction on the text to obtain the argument of the related term argument role of the to-be-recognized event.  (See the rejection of claim 5 as it is equally applicable for claim 14 as well.)

Regarding claim 20, Choe and Lecue disclose, the non-transitory computer-readable storage medium according to claim 19, wherein the M argument roles comprise: a spatial scene argument role, an action argument role, a person argument role, an object argument role and a related term argument role.  (See the rejection of claim 2 as it is equally applicable for claim 20 as well.)

	Claims 6 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Choe et al. (US Pub. No. 2014/0324864 A1) in view of Lecue et al. (US Pat. No. 10,339,420 B1) and in further view of Jin et al. (US Pub. No. 2020/0233864 A1).
Regarding claim 6, Choe and Lecue disclose, the method according to claim 2,
wherein the selecting, according to the arguments acquired, the event from the video event graph comprises: 
constructing an event graph according to the arguments acquired, the event graph comprising three layers of nodes, wherein the first layer includes one node corresponding to the to-be-recognized event, (See Choe Fig. 2 where there are first layer nodes such as “Event.”)
the second layer includes M nodes corresponding respectively to the argument roles, the number of nodes of the third layer is equal to a sum of the number of the arguments of the M argument roles, (See Choe Fig. 2 where the second layer is for example Vehicle, Human, and Other.)
and the nodes of the third layer correspond respectively to the arguments, (See Choe Fig. 2, where the third layer is for example actions such as Stop, Approach, Open Trunk, Load, and Close Trunk.)
the nodes of the second layer are connected to the node of the first layer, and the nodes of the third layer are respectively connected to the nodes of the second layer corresponding to respective argument roles to which the nodes of the third layer correspond; (See Fig. 2, where the nodes of each layer are connected to each other.)
Choe and Lecue disclose comparing vector graph representations but fail to disclose that the graph could instead be represented as graph embeddings.
However, Jin discloses, and acquiring a graph embedding representation corresponding to the event graph, (See Jin ¶26, “FIG. 1 illustrates a comparison of the two techniques. For any given input graph G with N nodes, conventional node embedding (e.g., node embedding component 120 in FIG. 1) derives N node embedding vectors of K dimensions each that capture the structural properties of the nodes.”
Jin ¶79,  “Node embeddings for consecutive graphs (e.g., days) can be compared to identify abrupt changes of graph structures.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to substitute the graph embedding representation as suggested by Jin for Choe and Lecue’s graph vector representation using known engineering techniques, with a reasonable expectation of success. The motivation, as disclosed by Jin ¶23, is to provide a compact representation of a graph through dimensionality reduction by using node embedding.
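The three-layer structure recited in claim 6 (one event node, M role nodes, and one node per argument) can be pictured with the short Python sketch below. It is offered for illustration only; the node-naming scheme, the function name `build_event_graph`, and the sample arguments are assumptions of the sketch and do not come from the claims, Choe, Lecue, or Jin.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def build_event_graph(arguments: Dict[str, List[str]]) -> Tuple[List[str], List[Edge]]:
    """Build a three-layer graph: event node -> role nodes -> argument nodes."""
    root = "event"                      # layer 1: the to-be-recognized event
    nodes: List[str] = [root]
    edges: List[Edge] = []
    for role, args in arguments.items():
        role_node = f"role:{role}"      # layer 2: one node per argument role
        nodes.append(role_node)
        edges.append((root, role_node))
        for arg in args:
            arg_node = f"arg:{role}:{arg}"  # layer 3: one node per argument
            nodes.append(arg_node)
            edges.append((role_node, arg_node))
    return nodes, edges


# Hand-written arguments for three roles (M = 3), four arguments in total.
nodes, edges = build_event_graph({
    "spatial scene": ["parking lot"],
    "action": ["approach", "load"],
    "object": ["car"],
})
```

With these inputs the third layer holds four nodes, equal to the sum of the argument counts across the three roles, and each third-layer node is connected only to the second-layer node of its own role.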

Regarding claim 15, Choe, Lecue, and Jin disclose, the electronic device according to claim 11, wherein the selecting, according to the arguments acquired, the event from the video event graph comprises: constructing an event graph according to the arguments acquired, the event graph comprising three layers of nodes, wherein the first layer includes one node corresponding to the to-be-recognized event, the second layer includes M nodes corresponding respectively to the argument roles, the number of nodes of the third layer is equal to a sum of the number of the arguments of the M argument roles, and the nodes of the third layer correspond respectively to the arguments, the nodes of the second layer are connected to the node of the first layer, and the nodes of the third layer are respectively connected to the nodes of the second layer corresponding to respective argument roles to which the nodes of the third layer correspond; and acquiring a graph embedding representation corresponding to the event graph, calculating respectively similarities between the graph embedding representation corresponding to the event graph and graph embedding representations corresponding to respective events of the video event graph, and taking the event with the maximum similarity as the recognized event.  (See the rejection of claim 6 as it is equally applicable for claim 15 as well.)
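The final selection step recited above (computing similarities between embeddings and taking the maximum) can be pictured with the following Python sketch, offered for illustration only. The quoted claim language does not name a similarity measure, so cosine similarity is an assumption here, and the embedding values are hand-written placeholders rather than the output of any embedding method of record.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar_event(query: Sequence[float],
                       stored: Dict[str, Sequence[float]]) -> str:
    """Return the stored event whose embedding is most similar to the query."""
    return max(stored, key=lambda name: cosine(query, stored[name]))


# Placeholder embeddings for events already in the video event graph.
stored_embeddings = {
    "vehicle_loading": [0.9, 0.1, 0.0],
    "swimming": [0.0, 0.2, 0.9],
}
best = most_similar_event([0.8, 0.2, 0.1], stored_embeddings)
```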


	Allowable Subject Matter
Claims 7-9 and 16-18 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Regarding claim 7, the method according to claim 6, wherein the acquiring the graph embedding representation corresponding to the event graph comprises: selecting N center nodes from the nodes in the event graph, N being a positive integer greater than one and less than the number of the nodes comprised in the event graph; performing the following processing on each center node: acquiring neighborhood nodes of the center node, the neighborhood nodes being nodes connected to the center node, and determining a vector representation corresponding to a sub-graph composed of the center node and the neighborhood nodes; and inputting the obtained vector representations into a convolutional neural network to obtain the graph embedding representation corresponding to the event graph.  (The disclosed prior art of record fails to disclose the limitation of this claim.)
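For illustration only, the center-node and neighborhood steps recited in claim 7 can be pictured with the Python sketch below. The claim as quoted does not state how the N center nodes are chosen, so the highest-degree-first rule is an assumption of the sketch, as are the function name `neighborhoods` and the sample edges. The later steps (sub-graph vector representation and the convolutional neural network) are not sketched.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def neighborhoods(edges: List[Tuple[str, str]], n_centers: int) -> Dict[str, Set[str]]:
    """Pick n_centers center nodes and return each one's directly connected neighbors."""
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    # Claim 7 requires N greater than one and less than the number of nodes.
    if not 1 < n_centers < len(adjacency):
        raise ValueError("n_centers must be greater than one and less than the node count")
    # Assumed heuristic: highest degree first, ties broken by name for determinism.
    centers = sorted(adjacency, key=lambda node: (-len(adjacency[node]), node))[:n_centers]
    return {center: set(adjacency[center]) for center in centers}


# Hypothetical three-layer event graph given as an undirected edge list.
edges = [
    ("event", "role:action"), ("event", "role:object"),
    ("role:action", "arg:approach"), ("role:action", "arg:load"),
    ("role:object", "arg:car"),
]
subgraphs = neighborhoods(edges, n_centers=2)
```

Each entry of `subgraphs` pairs a center node with its neighborhood nodes, which together form the sub-graph from which a vector representation would then be determined.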

	Regarding claims 8 and 9, these claims are objected to since they depend from objected-to claim 7.
	
	Regarding claims 16-18, these claims are objected to since they contain limitations similar to those of objected-to claims 7-9, respectively.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DAVID PERLMAN whose telephone number is (571) 270-1417. The examiner can normally be reached on Monday - Friday, 10:00am - 6:30pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.  
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Sumati Lefkowitz, can be reached on (571) 272-3638.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/DAVID PERLMAN/Primary Examiner, Art Unit 2662