DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1, 7, 12 and 20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by “Amazon Rekognition FAQs” (Amazon).

Concerning claim 1, Amazon teaches a computer-implemented method (p. 1: image recognition device using deep neural network models (computer-implemented)) comprising:
generating an input video stream comprising at least one image frame that coincides with detection of activity within a threshold distance of a property (p. 1: extracting context from frames of a live video stream, including detecting activities such as when someone delivers a package to your door (activity within a threshold distance of a property));
generating timing information for the input video stream (p. 14: returning time stamps in milliseconds on the video timeline of the video stream), the timing information comprising a respective time stamp for each image frame of the input video stream (p. 16: frame accurate timecodes provide the exact frame number for a relevant segment of video); 
based on the input video stream and the timing information, obtaining image frames comprising a pre-event image frame that precedes detection of the activity and a post-event image frame that coincides with detection of the activity (pgs. 1 & 5: activities are detected in live video streams, and the service detects persons and tracks them through the video as a person might go in (pre-event image frame) and out of (post-event image frame) the scene);
computing an image score with respect to placement of a candidate item at the property in response to processing the pre-event image frame and the post-event image frame (pgs. 1 & 5: returning a confidence score for each instance of an object found (placement of a candidate item) by analyzing the video to assign labels to detected activities as a person might go in and out of the scene); and 
based on the image score, determining that a first item was delivered to the property or that a second item was removed after being delivered to the property (pgs. 1 & 5: based on the confidence score, determining someone delivers a package to your door).
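For illustration only, the frame-selection limitation mapped above can be sketched as follows. This is a minimal sketch under stated assumptions, not a characterization of the Amazon reference's API or of applicant's implementation: the `Frame` type, the `select_event_frames` function, and the assumption that frames arrive sorted by time stamp are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Frame:
    """Hypothetical frame record: a time stamp plus a stand-in for pixel data."""
    timestamp_ms: int
    label: str


def select_event_frames(
    frames: Sequence[Frame], detection_ms: int
) -> Tuple[Optional[Frame], Optional[Frame]]:
    """Return (pre_event, post_event) for an activity detected at detection_ms.

    pre_event is the latest frame strictly before the detection time;
    post_event is the first frame at or after the detection time.
    Assumes `frames` is sorted by ascending timestamp_ms.
    """
    pre_event: Optional[Frame] = None
    post_event: Optional[Frame] = None
    for frame in frames:
        if frame.timestamp_ms < detection_ms:
            pre_event = frame          # keep overwriting: we want the latest one
        elif post_event is None:
            post_event = frame         # first frame coinciding with/after detection
    return pre_event, post_event


stream = [Frame(0, "f0"), Frame(100, "f1"), Frame(200, "f2"), Frame(300, "f3")]
pre, post = select_event_frames(stream, detection_ms=200)
```

In this sketch the per-frame time stamps are what allow the two frames to be picked out relative to the detection time; either result is `None` when no frame exists on that side of the detection.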

Concerning claim 7, Amazon further teaches the method of claim 1, wherein the first item and the second item are the same item (pgs. 1 & 5: based on the confidence score, determining someone delivers a package to your door).

Concerning claim 12, Amazon teaches a system comprising:
a processing device (p. 15: video stream processor);
a non-transitory machine-readable storage device storing instructions that are executable by the processing device to cause performance of operations (pgs. 1 & 15: deep neural networks are used for a video stream processor to manage analysis of stream video) comprising:
generating an input video stream comprising at least one image frame that coincides with detection of activity within a threshold distance of a property (p. 1: extracting context from frames of a live video stream, including detecting activities such as when someone delivers a package to your door (activity within a threshold distance of a property));
generating timing information for the input video stream (p. 14: returning time stamps in milliseconds on the video timeline of the video stream), the timing information comprising a respective time stamp for each image frame of the input video stream (p. 16: frame accurate timecodes provide the exact frame number for a relevant segment of video); 
based on the input video stream and the timing information, obtaining image frames comprising a pre-event image frame that precedes detection of the activity and a post-event image frame that coincides with detection of the activity (pgs. 1 & 5: activities are detected in live video streams, and the service detects persons and tracks them through the video as a person might go in (pre-event image frame) and out of (post-event image frame) the scene);
computing an image score with respect to placement of a candidate item at the property in response to processing the pre-event image frame and the post-event image frame (pgs. 1 & 5: returning a confidence score for each instance of an object found (placement of a candidate item) by analyzing the video to assign labels to detected activities as a person might go in and out of the scene); and 
based on the image score, determining that a first item was delivered to the property or that a second item was removed after being delivered to the property (pgs. 1 & 5: based on the confidence score, determining someone delivers a package to your door).

Concerning claim 20, Amazon teaches one or more non-transitory machine-readable storage devices storing instructions that are executable by one or more processing devices to cause performance of operations (pgs. 1 & 15: deep neural networks are used for a video stream processor to manage analysis of stream video) comprising:
generating an input video stream comprising at least one image frame that coincides with detection of activity within a threshold distance of a property (p. 1: extracting context from frames of a live video stream, including detecting activities such as when someone delivers a package to your door (activity within a threshold distance of a property));
generating timing information for the input video stream (p. 14: returning time stamps in milliseconds on the video timeline of the video stream), the timing information comprising a respective time stamp for each image frame of the input video stream (p. 16: frame accurate timecodes provide the exact frame number for a relevant segment of video); 
based on the input video stream and the timing information, obtaining image frames comprising a pre-event image frame that precedes detection of the activity and a post-event image frame that coincides with detection of the activity (pgs. 1 & 5: activities are detected in live video streams, and the service detects persons and tracks them through the video as a person might go in (pre-event image frame) and out of (post-event image frame) the scene);
computing an image score with respect to placement of a candidate item at the property in response to processing the pre-event image frame and the post-event image frame (pgs. 1 & 5: returning a confidence score for each instance of an object found (placement of a candidate item) by analyzing the video to assign labels to detected activities as a person might go in and out of the scene); and 
based on the image score, determining that a first item was delivered to the property or that a second item was removed after being delivered to the property (pgs. 1 & 5: based on the confidence score, determining someone delivers a package to your door).

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 2-6, 8-10, and 13-19 are rejected under 35 U.S.C. 103 as being unpatentable over “Amazon Rekognition FAQs” (Amazon) in view of Chen et al. (US 2019/0130580 A1).

Concerning claim 2, Amazon teaches the method of claim 1. Amazon further discloses wherein obtaining image frames comprises:
a first time stamp (p. 16), obtaining a pre-event image frame of an area of interest (AOI) (pgs. 1 & 5: analyzing video to detect activities as a person might go into a scene and determining an object of interest), the property (p. 1);
a second time stamp (p. 16), and a post-event image frame (p. 5). Amazon fails to explicitly disclose obtaining a first event image of an AOI having a boundary that overlaps with the property within a threshold distance from an imaging device at the property; and obtaining a second event image frame of the AOI with respect to the boundary that overlaps with the property.
Chen et al. (hereinafter Chen) teaches a method for applying complex object detection in a video analytics system, comprising:
obtaining a first event image of an AOI having a boundary that overlaps with the property within a threshold distance from an imaging device at the property (¶0264, ¶0269, ¶0413 & fig. 18: determining a bounding region for an object of interest in a video frame in step 1802 overlapping in the scene (property)); and 
obtaining a second event image frame of the AOI with respect to the boundary that overlaps with the property (¶0264, ¶0269, ¶0413 & fig. 18: determining a bounding region for an object of interest in a sequence (second) of video frames overlapping the scene (property)). Given this teaching, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the Amazon invention to include obtaining a first event image frame of an AOI having a boundary that overlaps with the property; and obtaining a second event image frame of the AOI with respect to the boundary that overlaps with the property, as taught by Chen, for the benefit of detecting and tracking objects in a sequence of video frames in a scene (Chen, ¶¶0005-0006).
It should be noted that Amazon and Chen do not specifically disclose an area of interest (AOI) having a boundary that overlaps with the property within a threshold distance from an imaging device at the property. However, because Amazon teaches an AOI that overlaps the property (pgs. 1 & 5), and Chen teaches determining a confidence for detecting an object of interest based on a threshold size of objects of interest considering the distance between the camera and the objects (¶0228), it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to determine a distance between the AOI and the camera, and to apply a threshold to that distance, similar to the threshold size of the object disclosed by Chen, in order to determine if the object is a package left at a door of the property based on the distance from the camera.

Concerning claim 3, Amazon in view of Chen teaches the method of claim 2. Amazon in view of Chen further teaches the pre-event image (Amazon, pgs. 1 & 5); and the AOI coinciding with a field of view of an imaging device used to generate the input video stream (Amazon, pgs. 1, 4 & 15: detecting objects by object bounding boxes in a field of view of a camera of the live video stream). Amazon fails to explicitly teach wherein the AOI includes a pre-event AOI that overlaps a portion of an area depicted in the pre-event image.
Chen further discloses the AOI including a previous image AOI (pre-event AOI) that overlaps a portion of an area depicted in the previous image (pre-event image) (¶0264, ¶0269, ¶0413 & fig. 18: determining a bounding region for an object of interest in a sequence of video frames overlapping the scene (property)). Based on this teaching, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the Amazon invention to include the AOI comprising a pre-event AOI that overlaps a portion of an area depicted in the pre-event image, as taught by Chen, for the benefit of detecting and tracking objects in a sequence of video frames in a scene.

Concerning claim 4, Amazon in view of Chen teaches the method of claim 2. Amazon further teaches the method, wherein obtaining image frames comprises: 
obtaining a post-event image frame that includes an image bounding box (pgs. 1, 5, & 9: the service detects persons and tracks them through the video as the person might go out of the scene (post-event image), and returns a bounding box), wherein the image bounding box: 
is configured as an overlay in the post-event image frame; and outlines the first item, the second item, or both the first item and the second item (pgs. 5 & 9: a bounding box (outlines) is returned for each instance of an object found (first item) detected in an image).

Concerning claim 5, Amazon in view of Chen teaches the method of claim 4. Amazon further teaches the method, wherein processing each of the pre-event image frame and the post-event image frame comprises: 
processing each of the pre-event and post-event image frames using a machine-learning (ML) model that implements a deep-learning algorithm used to train the ML model for package detection based on a plurality of color images (pgs. 1 & 3-5: analyzing video as the person might go in (pre-event image frame) and out of (post-event image frame) the scene using deep learning model trained with labeled ground truth data for detecting when someone delivers a package based on a live video stream of video frames in normal color conditions).

Concerning claim 6, Amazon in view of Chen teaches the method of claim 5. Amazon further teaches the method, comprising:
in response to processing the post-event image frame using the ML model, detecting, from the post-event image frame, that the candidate item was placed at the property (pgs. 1 & 5: the service tracks when a person might go out of a scene and detects when someone delivers a package to your door); and 
in response to detecting that the candidate item was placed at the property (pgs. 1 & 5), generating, using the ML model, the image bounding box in the post-event image frame to outline the candidate item (pgs. 1, 5 & 9: using deep learning model, returning a bounding box for each instance of an object found (candidate item) detected in the image of analyzed video as the person might go out of the scene). Amazon does not disclose generating the image bounding box as an overlay.
Chen further teaches generating the image bounding box as an overlay (¶0167: bounding boxes are overlapped geometrically). Based on this teaching, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the Amazon invention to include generating the image bounding box as an overlay, as taught by Chen, for the benefit of detecting and tracking objects in a sequence of video frames in a scene.

Concerning claim 8, Amazon in view of Chen teaches the method of claim 1. Amazon further teaches the method, wherein processing the pre-event image frame and the post-event image frame comprises:
the pre-event image and the post-event image (pgs. 1 & 5), and extracting, using local feature extraction, a set of features from the foreground region (pgs. 1 & 9: extracting motion-based context and extracting relevant face attributes from the sequence of video frames (foreground region)). Amazon fails to explicitly disclose computing a foreground region of the post-event image based on background modeling applied to the pre-event image.
Chen teaches computing a foreground region of a first image based on background modeling applied to a second image (¶¶0145-0146: detecting foreground pixels from the first image and performing subtraction between the current frame and a background model including the background part of a scene in a sequence of video frames). Based on this teaching, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the Amazon invention to include computing a foreground region of the first image (post-event image) based on background modeling applied to the second image (pre-event image), as taught by Chen, for the benefit of detecting and tracking objects in a sequence of video frames in a scene.
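For illustration only, the subtraction described above can be sketched as simple frame differencing, with the pre-event frame standing in for the background model. This is a simplified stand-in under stated assumptions and is not represented to be Chen's background modeling technique: the function names, the grayscale list-of-lists image format, and the threshold value of 30 are all hypothetical.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]          # grayscale pixel intensities, 0-255
Mask = List[List[bool]]


def foreground_mask(background: Image, current: Image, threshold: int = 30) -> Mask:
    """Mark a pixel as foreground when it differs from the background
    frame by more than `threshold` intensity levels."""
    return [
        [abs(cur - bg) > threshold for bg, cur in zip(bg_row, cur_row)]
        for bg_row, cur_row in zip(background, current)
    ]


def bounding_box(mask: Mask) -> Optional[Tuple[int, int, int, int]]:
    """Return (top, left, bottom, right), inclusive, enclosing every
    foreground pixel, or None when the mask has no foreground."""
    coords = [
        (r, c)
        for r, row in enumerate(mask)
        for c, is_foreground in enumerate(row)
        if is_foreground
    ]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return (min(rows), min(cols), max(rows), max(cols))


# Pre-event frame (background) and post-event frame with a bright object added.
pre_event = [[10, 10, 10, 10], [10, 10, 10, 10], [10, 10, 10, 10]]
post_event = [[10, 10, 10, 10], [10, 200, 210, 10], [10, 10, 10, 10]]

mask = foreground_mask(pre_event, post_event)
box = bounding_box(mask)
```

The resulting box encloses only the pixels that changed between the two frames, which is the region from which features would then be extracted.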

Concerning claim 9, Amazon in view of Chen teaches the method of claim 8. Amazon further teaches the method, wherein computing the image score comprises: 
computing the image score based on the set of features extracted from the foreground region (p. 6: returning a confidence score for each label relying on motion context).

Concerning claim 10, Amazon in view of Chen teaches the method of claim 9. Amazon further teaches the method, wherein computing the image score comprises: 
computing a region-based similarity score that characterizes similarity between respective regions of the pre-event image frame and the post-event image frame (pgs. 1, 5 & 10: returning a confidence score for each comparison (similarity) indicating a likelihood that faces detected in two images are of the same person, where the two images are of a person going in and out of a scene).
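For illustration only, one way a region-based similarity score between two frames could be computed is sketched below. This is an assumed metric (one minus the normalized mean absolute pixel difference over a region) and is not represented to be the scoring used by the Amazon reference or by applicant; the function name, the grayscale image format, and the box convention are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale pixel intensities, 0-255


def region_similarity(pre: Image, post: Image, box: Tuple[int, int, int, int]) -> float:
    """Score similarity of the same region in two frames on a 0.0-1.0 scale.

    `box` is (top, left, bottom, right) with bottom and right exclusive.
    1.0 means the regions are pixel-identical; 0.0 means every pixel
    differs by the full intensity range.
    """
    top, left, bottom, right = box
    diffs = [
        abs(pre[r][c] - post[r][c])
        for r in range(top, bottom)
        for c in range(left, right)
    ]
    if not diffs:
        raise ValueError("region is empty")
    mean_abs_diff = sum(diffs) / len(diffs)
    return 1.0 - mean_abs_diff / 255.0


dark = [[0, 0], [0, 0]]
bright = [[255, 255], [255, 255]]

same_score = region_similarity(dark, dark, (0, 0, 2, 2))
opposite_score = region_similarity(dark, bright, (0, 0, 2, 2))
```

A low score for a region of the area of interest would indicate that the region changed between the pre-event and post-event frames, for example because an item was placed or removed there.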

Claim 13 is the corresponding system to the method of claim 2 and is rejected under the same rationale.

Claim 14 is the corresponding system to the method of claim 3 and is rejected under the same rationale.

Claim 15 is the corresponding system to the method of claim 4 and is rejected under the same rationale.

Claim 16 is the corresponding system to the method of claim 5 and is rejected under the same rationale.

Claim 17 is the corresponding system to the method of claim 6 and is rejected under the same rationale.

Claim 18 is the corresponding system to the method of claim 8 and is rejected under the same rationale.

Concerning claim 19, Amazon in view of Chen teaches the system of claim 18. Amazon further teaches the system, wherein: 
computing the image score comprises: computing the image score based on the set of features extracted from the foreground region (pgs. 1, 5, & 10: returning the confidence score for each attribute based on extracting motion-based context and extracting relevant face attributes); and 
the image score is a region-based similarity score that characterizes similarity between respective regions of the pre-event image frame and the post-event image frame (pgs. 1, 5 & 10).

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over “Amazon Rekognition FAQs” (Amazon) in view of Siminoff (US 2020/0267354 A1).

Concerning claim 11, Amazon teaches the method of claim 1. Amazon fails to explicitly teach the method, wherein: the input video stream is obtained using a doorbell camera and a local frame buffer that is local to the doorbell camera.
Siminoff teaches an input functionality for audio/video recording and communication doorbells, wherein: 
the input video stream is obtained using a doorbell camera (¶0192: an A/V recording and communication doorbell includes a camera) and a local frame buffer that is local to the doorbell camera (¶0192: the camera is always recording and the recorded footage is continuously stored in a rolling buffer of the camera (local)). Given this teaching, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to modify the Amazon invention to include a doorbell camera and a local frame buffer that is local to the doorbell camera, as taught by Siminoff, for the benefit of leveraging the functionality of a video doorbell to interact with home security systems (Siminoff, ¶0006).
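For illustration only, a rolling buffer of the kind described above can be sketched with a fixed-capacity queue that discards the oldest frame as each new one is stored. This is a minimal sketch under stated assumptions and is not represented to be Siminoff's implementation: the `RollingFrameBuffer` class, its methods, and the capacity of three frames are hypothetical.

```python
from collections import deque
from typing import Deque, List, Tuple

TimedFrame = Tuple[int, str]  # (timestamp_ms, stand-in for pixel data)


class RollingFrameBuffer:
    """Fixed-capacity buffer that always holds the most recent frames."""

    def __init__(self, capacity: int) -> None:
        # deque(maxlen=...) drops the oldest entry automatically when full.
        self._frames: Deque[TimedFrame] = deque(maxlen=capacity)

    def push(self, timestamp_ms: int, frame: str) -> None:
        """Store a newly recorded frame."""
        self._frames.append((timestamp_ms, frame))

    def frames_before(self, timestamp_ms: int) -> List[TimedFrame]:
        """Return buffered frames recorded strictly before timestamp_ms."""
        return [f for f in self._frames if f[0] < timestamp_ms]

    def __len__(self) -> int:
        return len(self._frames)


buffer = RollingFrameBuffer(capacity=3)
for ts in (0, 100, 200, 300):
    buffer.push(ts, f"frame@{ts}")

recent = buffer.frames_before(300)
```

Because the camera records continuously into the buffer, frames preceding a detected event remain available locally at the moment the event is detected.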

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JAMES M ANDERSON II whose telephone number is (571)270-1444. The examiner can normally be reached Monday - Friday 10AM-6PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, BRIAN PENDLETON, can be reached on 571-272-7527. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/James M Anderson II/Primary Examiner, Art Unit 2425