DETAILED ACTION

Response to Arguments
Applicant's arguments filed 01/21/2021 have been fully considered but they are not persuasive. 

Applicant has argued as follows: The rejections should be withdrawn because Eledath in view of Venetianer fails to disclose or suggest at least "receiving a video feed comprising video from multiple sources, including a moving aerial source ... [and] processing the video feed in real-time as the video feed is received", as recited in independent claims 1, 8, and 15, as amended. Eledath teaches an augmented reality system that processes a field-of-view of a user wearing a camera. Specifically, Eledath teaches processing the camera's video data for visual elements within the scene and augmenting the user's field-of-view with a virtual element based on accessible knowledge. Although Eledath mentions augmenting a user's field-of-view with an aerial view of the user's location, the aerial view is merely a stored image from the accessible knowledge and not a real-time video from a moving aerial source, as recited in the amended claims. 
Examiner’s Response: This limitation is disclosed by the newly cited reference Werling et al. (US Pub. No. 2011/0043627 A1), which discloses using a UAV to obtain streaming surveillance video.
	
Applicant has argued as follows: Further, even though the Office Action points to the person-supported camera of Eledath as disclosing a moving source, Eledath is silent on processing any real-time video from an aerial moving source. 
Examiner’s Response: Eledath discloses real-time processing in ¶41, “For example, the disclosed implementations of AR technology can enable real-time analyze-while-collect modes in which humans are assisted to sift through the chaos of geospatial and semantic contexts of real world locations.” The aerial source of the video is disclosed by combining Eledath with Werling et al. (US Pub. No. 2011/0043627 A1), which discloses using a UAV to obtain streaming surveillance video.

Applicant has argued as follows: Eledath's head-mounted display and sensor system is disclosed as a solution to interpreting user intent based on the user's head motion. Thus, it would not be obvious to use an aerial camera to modify the system of Eledath because an aerial camera would be completely disjointed from the user's field-of-view and subtle eye and/or head motions as they move through their surroundings. 
Finally, as admitted by the Office Action, Eledath does not disclose receiving and processing a video feed from multiple sources, and it does not make sense to modify Eledath's personalized view system to do so, since the Eledath's system is designed to interpret the specific field-of-view and movements unique for the user to which the camera is mounted. 
Examiner’s Response: However, as disclosed in Eledath ¶97, the camera need not be a wearable AR head-mounted camera; it could alternatively be a “fixed location camera, such as “stand-off” cameras that are installed in walls or ceilings, and/or mobile cameras (such as cameras that are integrated with consumer electronic devices, such as desktop computers, laptop computers, smart phones, tablet computers.” 
Similarly, the display need not be an AR head-mounted display, as disclosed in Eledath ¶103, “the augmented view 140 can be selectively presented on one or more different display devices depending, for example, on the user's current context, e.g., where the user has multiple computing devices (e.g., smart phone, tablet, smart watch, AR glasses, etc.), the augmented view 140 including the virtual element(s) 142 may be presented on a display device 138 that the user is currently using or which is relevant to the user's current activity.”
Therefore, Eledath is not limited to a single head-mounted camera, and multiple video sources could be used as part of the surveillance.

Applicant has argued as follows: To cure the defects of Eledath, the Office Action points to Venetianer as teaching a video feed from multiple sources. However, Venetianer describes a video surveillance system design, in which video sensors are placed at specific locations and then calibrated to track the static regions in their respective fields-of-view. Venetianer is silent on any moving source supplying video to the computing system, but rather specifically states that if the video sensor does have motion (e.g., sweeping, zooming, and/or translation), an additional step would be required to obtain video stabilization. Therefore, Venetianer fails to cure the deficiencies of Eledath. 
Examiner’s Response: A moving source supplying video is disclosed by the newly cited reference Werling et al. (US Pub. No. 2011/0043627 A1), which discloses using a UAV to obtain streaming surveillance video.

Applicant has argued as follows: Moreover, Eledath teaches away from both multiple cameras and static cameras as described by Venetianer, since Eledath's body-mounted augmented reality system could not work if combined with the surveillance system of Venetianer. In particular, the stationary video sensors 14 of Venetianer located at particular orientations within an environment could not provide the head motion and movement data required by Eledath's system in order to discern the user's intent because Venetianer's video sensors 14 cannot be worn by the user. 
Examiner’s Response: However, as disclosed in Eledath ¶97, the camera need not be a wearable AR head-mounted camera; it could alternatively be a “fixed location camera, such as “stand-off” cameras that are installed in walls or ceilings, and/or mobile cameras (such as cameras that are integrated with consumer electronic devices, such as desktop computers, laptop computers, smart phones, tablet computers.” 

Applicant has argued as follows: Similarly, Venetianer's system could not use the single head-mounted camera of Eledath because the surveillance system of Venetianer requires each video sensor 14 to be calibrated to recognize a static field-of-view in order to be able to "determine an approximate absolute size and speed of a particular object . .. at various places in the video image provided by the video sensor", as described in [0098] of Venetianer. Thus, Venetianer also teaches away from moving cameras, and there is no suggestion or motivation to combine Eledath with Venetianer. 
Examiner’s Response: However, as disclosed in Eledath ¶97, the camera need not be a wearable AR head-mounted camera; it could alternatively be a “fixed location camera, such as “stand-off” cameras that are installed in walls or ceilings, and/or mobile cameras (such as cameras that are integrated with consumer electronic devices, such as desktop computers, laptop computers, smart phones, tablet computers.” 

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1, 3-5, 7, 8, 10-12, 14, 15, 17-19, 21-27 are rejected under 35 U.S.C. 103 as being unpatentable over Eledath et al. (US Pub. No. 2016/0378861 A1) in view of Venetianer et al. (US Pub. No. 2008/0100704 A1) and in further view of Werling et al. (US Pub. No. 2011/0043627 A1).
Regarding claim 1, Eledath discloses, an apparatus comprising a processor and a memory storing executable instructions that, in response to execution by the processor, cause the apparatus to: (See Eledath ¶152, “includes a plurality of instructions embodied in memory accessible by a processor of at least one of the computing devices.”)
receive a video feed; process the video feed in real-time as the video feed is received, including the apparatus being caused to: (See Eledath ¶36, “Among other things, embodiments of the disclosed technologies can utilize computer vision technologies to generate a semantic understanding of a live view of a real-world environment as depicted in a set of images or video produced by, e.g., a camera.”)
perform object detection and recognition on the video feed to detect and classify objects therein, perform activity recognition to detect and classify activities of at least some of the objects, and output classified objects and classified activities in the video feed; (See Eledath ¶49, “A scene-understanding server (e.g., scene understanding services 220) provides interfaces to modules that recognize classes and specific instances of objects (vehicles, people etc.), locales and activities being performed (e.g., FIGS. 33-34).”)
generate natural language text that describes the video feed from the classified objects and activities; (See Eledath ¶105, “Based on the system 110's semantic 
produce a semantic network including a graph with vertices that represent the classified objects, and edges that connect the vertices and represent semantic relationships between the classified objects, at least some of the semantic relationships corresponding to respective ones of the classified activities; (See Eledath ¶54, “Data collected in the system 110 can be stored and organized for situational awareness, analysis and reasoning by automated algorithms and human users … In a graph representation, nodes represent the objects of interest along with their attributes, and edges between the nodes represent inter-object relationships.” Further see Eledath ¶139, “Other examples of multi-entity relational cues, which may identify two or more entities and a relationship between them, include "person comes out of this vehicle" (where "this" vehicle is identified by a pointing gesture), "vehicle that was parked next to this one last evening") (relationship includes a temporal component and a spatial component).”)
and store the video feed, classified objects and classified activities, natural language text, and semantic network in a knowledge base; (See Eledath ¶97, “The video 122 may be stored in computer memory as a video file and analyzed by the system 110 as disclosed herein.” Further see Eledath ¶105, “the system 110 creates a 
and generate a graphical user interface (GUI) configured to enable queries of the knowledge base, and presentation of selections of the video feed, classified objects and classified activities, natural language text, and semantic network. (See Eledath ¶128, “For example, the reasoning module 1804 may infer based on the user intent and processing performed by the inference module 1812 that there is a need to perform a query on a certain database to find the information the user is looking for. … converts the result of task flow/workflow execution and/or other processing initiated by the reasoning module 1804 into suitable output, e.g., graphical/textual overlays, system-generated natural language, etc., and sends the output to the appropriate output device (e.g., display, speaker), as illustrated by augmented image 1808.”)
Eledath discloses the above limitations but fails to disclose the following limitations. 
However Venetianer discloses, receive a video feed comprising video from multiple sources; (See Venetianer ¶92, “The video sensors 14 provide source video to the computer system 11.  Each video sensor 14 can be coupled to the computer system 11 using, for example, a direct connection (e.g., a firewire digital camera interface) or a network.”
Further see Venetianer ¶97, “In block 21, the video surveillance system is set up as discussed for FIG. 1.  Each video sensor 14 is orientated to a location for video surveillance.  The computer system 11 is connected to the video feeds from the video equipment 14 and 15.”)

wherein the classified activities comprise an interaction between one or more of the classified objects and a geographic area in the video feed; (See Venetianer ¶128, “Activity detectors correspond to a behavior related to an area of the video scene.  They describe how an object might interact with a location in the scene.  FIG. 18 illustrates three exemplary activity detectors.  FIG. 18a represents the behavior of crossing a perimeter in a particular direction using a virtual video tripwire … FIG. 18b represents the behavior of loitering for a period of time on a railway track.  FIG. 18c represents the behavior of taking something away from a section of wall … Other exemplary activity detectors may include detecting a person falling, detecting a person changing direction or speed, detecting a person entering an area, or detecting a person going in the wrong direction.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the multiple video sources and the classification of activities that include an interaction between objects and a geographic location, as suggested by Venetianer, in Eledath’s surveillance apparatus using known engineering techniques, with a reasonable expectation of success. The motivation for doing so is to obtain more information by classifying objects from multiple viewpoints and to more accurately determine many types of activities based on their interactions and locations. 

However, Werling discloses receiving a video feed comprising video from multiple sources, including a moving aerial source; (See Werling ¶32, “The locative video software 140 combines the video frames 132 and the geospatial snapshots 112 to generate the combined video stream 150, as shown in block 144.  The combined video stream 150 provides a viewer, i.e., operator or user, with a combination of both the video frames 132, which may be a real time video stream, and supplemental information, such as the referential geospatial data 122 available from the geospatial data repository 120, … Similar, in a military domain, an operator or analyst viewing video coming from a unmanned aerial vehicle (UAV) can see recent reports regarding significant activities in the area displayed in the video.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the surveillance video from a UAV, as suggested by Werling, among the multiple surveillance video sources of Eledath and Venetianer using known engineering techniques, with a reasonable expectation of success. The motivation for doing so is that a UAV can provide an aerial view and can be repositioned to multiple locations of interest.

Regarding claim 2, Eledath, Venetianer, and Werling disclose, the apparatus of claim 1, wherein at least some of the multiple sources are moving sources. (See 

Regarding claim 3, Eledath, Venetianer, and Werling disclose, the apparatus of claim 1, wherein the apparatus being caused to process the video feed further includes being caused to: geo-register the classified objects with respective geographic locations, (See Eledath ¶49,  “A geo-localization services module 214 processes sensor data to accurately geo-locate the user both indoors and outdoors even in GPS (Global Positioning System) challenged areas.” Further see Eledath ¶98, “One or more location/orientation sensors 118 acquire location/orientation data 126 in order to spatially align or "register" the video 122 with the real world scene 100 so that object detection and/or object recognition algorithms and other computer vision techniques can determine an understanding of the real world scene 100.”)
wherein the GUI is further configured to present an aerial image or map of a scene in the video feed, identifying thereon the classified objects at the respective geographic locations (See Eledath ¶113, “FIG. 12 illustrates a map/scene correlation implementation in which the system 110 correlates a view 1202 of the real world scene with an overhead view of a real or virtual map 1204 of the corresponding geographic area. … For example, the vehicle graphical overlay 1206 on the real world scene 1202 identifies a vehicle in the scene (from which the user can view certain characteristics of the vehicle, such as color or make/model) as well as it's spatial location within the scene 1202, including surrounding people and objects.  The graphical overlay 1208 on the map 1204 identifies the geographic location of the same vehicle; 

However Venetianer discloses, and including respective trajectories of any moving ones of the classified objects, and with the respective trajectories of the moving ones of the classified objects. (See Venetianer ¶169, “In block 62, an activity record is generated for each event occurrence that occurred.  The activity record includes, for example: details of a trajectory of an object;” Further see Venetianer ¶170, “In block 63, output is generated.  The output is based on the event occurrences extracted in block 44 and a direct feed of the source video from block 41.”)

Regarding claim 4, Eledath, Venetianer, and Werling disclose, the apparatus of claim 1, wherein the apparatus being caused to perform object detection and recognition includes being caused to assign respective unique identifiers to the classified objects, and the presentation of selections of the video feed in the GUI includes identifying the classified objects on the video feed and including the respective unique identifiers. (See Fig. 6 which shows identifiers of “Gray Van” and “Jim Jones” within the video feed in the user’s GUI.)

Regarding claim 5, Eledath, Venetianer, and Werling disclose, the apparatus of claim 1, wherein at least some of the objects are moving objects, and the apparatus being caused to perform object detection and recognition includes being caused to detect and classify the moving objects using convolutional neural networks. (See 
However Venetianer discloses, classify the moving objects using motion compensation, (See Venetianer ¶157, “The motion detection technique of block 51 and the change detection technique of block 52 are complimentary techniques, where each technique advantageously addresses deficiencies in the other technique.”
Further see Venetianer ¶158, “As an option, if the video sensor 14 has motion (e.g., a video camera that sweeps, zooms, and/or translates), an additional block can be inserted before blocks between blocks 51 and 52 to provide input to blocks 51 and 52 for video stabilization.  Video stabilization can be achieved by affine or projective global motion compensation.”)
background subtraction (See Venetianer ¶156, “In block 52, objects are detected via change.  Any change detection algorithm for detecting changes from a background model can be used for this block. … As an example, a stochastic background modeling technique, such as dynamically adaptive background subtraction, can be used.”)

Regarding claim 6, Eledath, Venetianer, and Werling disclose, the apparatus of claim 1, wherein the apparatus being caused to perform activity recognition includes being caused to detect and classify at least some of the activities as involving only a single one of the classified objects, multiple ones of the classified objects, or interaction between one or more of the classified objects and a geographic area in the video feed. (See the rejection of claim 1, as it is equally applicable for claim 6 as well.)

Regarding claim 7, Eledath, Venetianer, and Werling disclose, the apparatus of claim 1, wherein the apparatus being caused to generate the GUI includes being caused to generate the GUI configured to enable queries of the knowledge base based on similarity between a user-specified object and one or more of the classified objects in the video feed. (See Eledath ¶131, “For example, in response to a user asking "who is that?" the reasoner 1600 may need to analyze gesture and/or gaze data to determine the person in the scene to whom the user is referring as "that", and then initiate a face recognition algorithm to identify such person, and then initiate a search query to determine additional details about the person (e.g., residence, employment status, etc.).  The dialog boxes 1618, 1620, 1622 illustrate examples of output intents that may be produced by the reasoner 1600.” Also see Figs. 5-7, which show queries by a user and the displayed output in a GUI.)

Regarding claim 8, Eledath, Venetianer, and Werling disclose, a method of intelligent video analysis, the method comprising: receiving a video feed comprising video from multiple sources, including a moving aerial source; processing the video feed in real-time as the video feed is received, including: performing object detection and recognition on the video feed to detect and classify objects therein, performing activity recognition to detect and classify activities of at least some of the objects, and outputting classified objects and classified activities in the video feed, wherein the classified activities comprise an interaction between one or more of the classified objects and a geographic area in the video feed; generating natural language text that describes the video feed from the classified objects and activities; producing a semantic 

Regarding claim 10, Eledath, Venetianer, and Werling disclose, the method of claim 8, wherein processing the video feed further includes geo-registering the classified objects with respective geographic locations, and including respective trajectories of any moving ones of the classified objects, and wherein the GUI is further configured to present an aerial image or map of a scene in the video feed, identifying thereon the classified objects at the respective geographic locations and with the respective trajectories of the moving ones of the classified objects. (See the rejection of claim 3 as it is equally applicable for claim 10 as well.)

Regarding claim 11, Eledath, Venetianer, and Werling disclose, the method of claim 8, wherein performing object detection and recognition includes assigning respective unique identifiers to the classified objects, and the presentation of selections of the video feed in the GUI includes identifying the classified objects on the video feed 

Regarding claim 12, Eledath, Venetianer, and Werling disclose, the method of claim 8, wherein at least some of the objects are moving objects, and performing object detection and recognition includes detecting and classifying the moving objects using motion compensation, background subtraction and convolutional neural networks. (See the rejection of claim 5 as it is equally applicable for claim 12 as well.)

Regarding claim 14, Eledath, Venetianer, and Werling disclose, the method of claim 8, wherein generating the GUI includes generating the GUI configured to enable queries of the knowledge base based on similarity between a user-specified object and one or more of the classified objects in the video feed. (See the rejection of claim 7 as it is equally applicable for claim 14 as well.)

Regarding claim 15, Eledath, Venetianer, and Werling disclose, a non-transitory computer-readable storage medium having computer-readable program code stored therein that in response to execution by a processor, causes an apparatus to: (See Eledath ¶161, “Embodiments may also be implemented as instructions stored using one or more machine-readable media, which may be read and executed by one or more processors.”)
receive a video feed comprising video from multiple sources, including a moving aerial source; process the video feed in real-time as the video feed is received, 

Regarding claim 17, Eledath, Venetianer, and Werling disclose, the computer-readable storage medium of claim 15, wherein the apparatus being caused to process the video feed further includes being caused to: geo-register the classified objects with respective geographic locations, and including respective trajectories of any moving ones of the classified objects, wherein the GUI is further configured to present an aerial image or map of a scene in the video feed, identifying thereon the classified objects at the 

Regarding claim 18, Eledath, Venetianer, and Werling disclose, the computer-readable storage medium of claim 15, wherein the apparatus being caused to perform object detection and recognition includes being caused to assign respective unique identifiers to the classified objects, and the presentation of selections of the video feed in the GUI includes identifying the classified objects on the video feed and including the respective unique identifiers. (See the rejection of claim 4 as it is equally applicable for claim 18 as well.)

Regarding claim 19, Eledath, Venetianer, and Werling disclose, the computer-readable storage medium of claim 15, wherein at least some of the objects are moving objects, and the apparatus being caused to perform object detection and recognition includes being caused to detect and classify the moving objects using motion compensation, background subtraction and convolutional neural networks. (See the rejection of claim 5 as it is equally applicable for claim 19 as well.)

Regarding claim 21, Eledath, Venetianer, and Werling disclose, the computer-readable storage medium of claim 15, wherein the apparatus being caused to generate the GUI includes being caused to generate the GUI configured to enable queries of the knowledge base based on similarity between a user-specified object and one or more 

Regarding claim 22, Eledath, Venetianer, and Werling disclose, the apparatus of claim 1, wherein the moving aerial source is an unmanned aerial vehicle.  (See Werling ¶32, “Similar, in a military domain, an operator or analyst viewing video coming from a unmanned aerial vehicle (UAV) can see recent reports regarding significant activities in the area displayed in the video.”)

Regarding claim 23, Eledath, Venetianer, and Werling disclose, the apparatus of claim 1, wherein the GUI is further configured to present an aerial image or map identifying a geographic location of at least one of the multiple sources.  (See Werling ¶32, “The locative video software 140 combines the video frames 132 and the geospatial snapshots 112 to generate the combined video stream 150, as shown in block 144.  The combined video stream 150 provides a viewer, i.e., operator or user, with a combination of both the video frames 132, which may be a real time video stream, and supplemental information, such as the referential geospatial data 122 available from the geospatial data repository 120.”)

Regarding claim 24, Eledath, Venetianer, and Werling disclose, the method of claim 8, wherein the moving aerial source is an unmanned aerial vehicle. (See the rejection of claim 22 as it is equally applicable for claim 24 as well.)

Regarding claim 25, Eledath, Venetianer, and Werling disclose, the method of claim 8, wherein the GUI is further configured to present an aerial image or map identifying a geographic location of at least one of the multiple sources. (See the rejection of claim 23 as it is equally applicable for claim 25 as well.)

Regarding claim 26, Eledath, Venetianer, and Werling disclose, the computer-readable storage medium of claim 15, wherein the moving aerial source is an unmanned aerial vehicle. (See the rejection of claim 22 as it is equally applicable for claim 26 as well.)

Regarding claim 27, Eledath, Venetianer, and Werling disclose, the computer-readable storage medium of claim 15, wherein the GUI is further configured to present an aerial image or map identifying a geographic location of at least one of the multiple sources. (See the rejection of claim 23 as it is equally applicable for claim 27 as well.)

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to DAVID PERLMAN whose telephone number is (571)270-1417.  The examiner can normally be reached on Monday - Friday; 10:00am - 6:30pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.  
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Sumati Lefkowitz, can be reached at (571) 272-3638.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.


/DAVID PERLMAN/Primary Examiner, Art Unit 2662