Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION
Claims 1 – 20 are pending in this application. Claims 1, 8 and 15 are independent.

CLAIM INTERPRETATION
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.


As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 


Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function.
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.

This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) use(s) a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Said placeholder(s) is/are: "…module…" in at least claims 1 – 7.

Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.


Placeholder: module
Corresponding Structure: (ASIC)/ hardware component on a chip (SoC) (¶ [0087])
Functional Language: "…determine a depth estimation…"



If Inventor(s) (or (pre-AIA) Applicant(s)) does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, Inventor(s) (or (pre-AIA) Applicant(s)) may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.

For more information, see MPEP § 2173 et seq. and Supplementary Examination Guidelines for Determining Compliance With 35 U.S.C. 112 and for Treatment of Related Issues in Patent Applications, 76 FR 7162, 7167 (Feb. 9, 2011).




Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1 – 3, 5 – 10, 12 – 17, 19 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Daniilidis, Konstantinos (US-20200265590-A1, hereinafter simply referred to as Daniilidis) in view of Chakravarty, Punarjay (US-20200041276-A1, hereinafter simply referred to as Chakravarty).

Regarding independent claims 1, 8 and 15, Daniilidis teaches:
A method for estimating ego-motion based on a plurality of input images in a self-supervised system (See at least Daniilidis, ¶ [0012, 0105], FIGS. 1, 2, 6 – 8 and 12 – 15, "…FIG. 1 is a diagram illustrating the use of a convolutional neural network to predict optical flow or egomotion and depth from motion blur in event camera images and to use the predicted values to remove the motion blur from the event camera images…", "…existing networks have been developed with frame-based images in mind, and there does not exist the wealth of labeled data for events as there does for images for supervised training. To these points, we present EV-FlowNet, a novel self-supervised deep learning pipeline for optical flow estimation for event-based cameras…"), comprising: receiving a source image (e.g., (left image) – captured by an event camera in ¶ [0017] and FIG. 6 of Daniilidis) and a target image (e.g., (right image) illustrated depth predicted by the convolutional neural network for the left image in ¶ [0017] and FIG. 6 of Daniilidis) (See at least Daniilidis, ¶ [0012, 0017, 0105], FIGS. 1, 2, 6 – 8 and 12 – 15, "…FIG. 1 is a diagram illustrating the use of a convolutional neural network to predict optical flow or egomotion and depth from motion blur in event camera images and to use the predicted values to remove the motion blur from the event camera images…", "…FIG. 6 (left image) is a scene at night with a flashing streetlight captured by an event camera; FIG. 6 (right image) illustrated depth predicted by the convolutional neural network for the left image…", "…existing networks have been developed with frame-based images in mind, and there does not exist the wealth of labeled data for events as there does for images for supervised training. To these points, we present EV-FlowNet, a novel self-supervised deep learning pipeline for optical flow estimation for event-based cameras. 
In particular, we introduce an image-based representation of a given event stream, which is fed into a self-supervised neural network as the sole input. The corresponding grayscale images captured from the same camera at the same time as the events are then used as a supervisory signal to provide a loss function at training time, given the estimated flow from the network. We show that the resulting network is able to accurately predict optical flow from events only in a variety of different scenes, with performance competitive to image-based networks…"); determining a depth estimation D.sub.t based on the target image (See at least Daniilidis, ¶ [0012, 0017, 0027, 0105], FIGS. 1, 2, 6 – 8 and 12 – 15, "…FIG. 1 is a diagram illustrating the use of a convolutional neural network to predict optical flow or egomotion and depth from motion blur in event camera images and to use the predicted values to remove the motion blur from the event camera images…", "…FIG. 6 (left image) is a scene at night with a flashing streetlight captured by an event camera; FIG. 6 (right image) illustrated depth predicted by the convolutional neural network for the left image…", "…grayscale images taken with the event-based camera that are synchronized with the event images are used to train the convolutional neural network…", "…existing networks have been developed with frame-based images in mind, and there does not exist the wealth of labeled data for events as there does for images for supervised training. To these points, we present EV-FlowNet, a novel self-supervised deep learning pipeline for optical flow estimation for event-based cameras. In particular, we introduce an image-based representation of a given event stream, which is fed into a self-supervised neural network as the sole input. 
The corresponding grayscale images captured from the same camera at the same time as the events are then used as a supervisory signal to provide a loss function at training time, given the estimated flow from the network. We show that the resulting network is able to accurately predict optical flow from events only in a variety of different scenes, with performance competitive to image-based networks…"); determining a depth estimation D.sub.s based on a source image (See at least Daniilidis, ¶ [0012, 0017, 0027, 0080, 0105], FIGS. 1, 2, 6 – 8 and 12 – 15, "…FIG. 1 is a diagram illustrating the use of a convolutional neural network to predict optical flow or egomotion and depth from motion blur in event camera images and to use the predicted values to remove the motion blur from the event camera images…", "…FIG. 6 (left image) is a scene at night with a flashing streetlight captured by an event camera; FIG. 6 (right image) illustrated depth predicted by the convolutional neural network for the left image…", "…grayscale images taken with the event-based camera that are synchronized with the event images are used to train the convolutional neural network…", "…we compare our ego-motion results against SFMLearner by Zhou et al. [22], which learns egomotion and depth from monocular grayscale images, while acknowledging that our loss has access to an additional stereo image at training time…", "…existing networks have been developed with frame-based images in mind, and there does not exist the wealth of labeled data for events as there does for images for supervised training. To these points, we present EV-FlowNet, a novel self-supervised deep learning pipeline for optical flow estimation for event-based cameras. In particular, we introduce an image-based representation of a given event stream, which is fed into a self-supervised neural network as the sole input. 
The corresponding grayscale images captured from the same camera at the same time as the events are then used as a supervisory signal to provide a loss function at training time, given the estimated flow from the network. We show that the resulting network is able to accurately predict optical flow from events only in a variety of different scenes, with performance competitive to image-based networks…").
As set forth above, Daniilidis teaches the foregoing limitations, and further teaches that counting events is a common method for visualizing the event stream and has been shown to be informative in a learning-based framework for regressing 6DoF pose (¶ [0126]).
However, Daniilidis does not expressly disclose determining an ego-motion estimation in a form of a six degrees-of-freedom (6 DOF) transformation between the target image and the source image by inputting the depth estimations (D.sub.t, D.sub.s), the target image, and the source image into a two-stream network architecture trained to output the 6 DOF transformation based at least in part on the depth estimations (D.sub.t, D.sub.s), the target image, and the source image.
Nevertheless, Chakravarty teaches the concept of determining an ego-motion estimation in a form of a six degrees-of-freedom (6 DOF) transformation between the target image (e.g., reconstructed image 208 in ¶ [0038] and FIG. 2 of Chakravarty) and the source image (e.g., RGB image coming from a monocular camera in ¶ [0022] and FIGS. 3 & 4 of Chakravarty) (See at least Chakravarty, ¶ [0022], FIGS. 1 – 7, "…devices of the present disclosure utilize the VAE-GAN as the central machinery in the SLAM algorithm…The system is trained using a regular stereo visual SLAM pipeline, where stereo visual simultaneous localization and mapping (vSLAM) receives a sequence of stereoscopic images, generates the depth maps and corresponding six Degree of Freedom poses as the stereo camera moves through space. Stereo vSLAM trains the VAE-GAN-SLAM algorithm using a sequence of RGB images, the corresponding depth maps for the images, and the corresponding pose vector data for the images. The VAE-GAN is trained to reconstruct the RGB image, the pose vector data for the image, and the depth map for the image while creating a shared latent space representation of the same…") by inputting the depth estimations (D.sub.t, D.sub.s) (e.g., reconstructed depth map 226 in ¶ [0038] & FIGS. 2 & 4 (embedded within #404) of Chakravarty), the target image (e.g., reconstructed image 208 in ¶ [0038] and FIGS. 2 & 4 (embedded within #404) of Chakravarty), and the source image (e.g., RGB image coming from a monocular camera in ¶ [0022] and FIGS. 3 & 4 of Chakravarty) into a two-stream network architecture (e.g., FIG. 4 of Chakravarty) trained to output the 6 DOF transformation based at least in part on the depth estimations (D.sub.t, D.sub.s), the target image, and the source image (See at least Chakravarty, ¶ [0022, 0038, 0043, 0044, 0045, 0046], FIGS. 
1 – 7, "…devices of the present disclosure utilize the VAE-GAN as the central machinery in the SLAM algorithm…The system is trained using a regular stereo visual SLAM pipeline, where stereo visual simultaneous localization and mapping (vSLAM) receives a sequence of stereoscopic images, generates the depth maps and corresponding six Degree of Freedom poses as the stereo camera moves through space. Stereo vSLAM trains the VAE-GAN-SLAM algorithm using a sequence of RGB images, the corresponding depth maps for the images, and the corresponding pose vector data for the images. The VAE-GAN is trained to reconstruct the RGB image, the pose vector data for the image, and the depth map for the image while creating a shared latent space representation of the same…", "…Each of the image decoder 206, the pose decoder 214, and the depth decoder 224 includes a generative adversarial network (GAN) that comprises a GAN generator (see e.g. 404) and a GAN discriminator (see e.g. 408)…", "…the VAE-GAN 201 is trained to generate each of the reconstructed image 208, the reconstructed pose vector data 216, and the reconstructed depth map 226 in tandem…the latent space 230 includes an encoded latent space vector applicable to each of an image, pose vector data of an image, and a depth map of an image…because the VAE-GAN 201 is trained in tandem, the trained VAE-GAN 201 may receive an input image and generate any outer output such as pose vector data based on the input image or a depth map based on the input image…", "…as illustrated in FIG. 3, the pose encoder 312 and the pose decoder 314 have been trained (see FIG. 
2)…The VAE-GAN 301 receives an RGB image 302 at the image encoder 304…", "…the RGB image 302 is a red-green-blue image captured by a monocular camera and provided to the VAE-GAN 301 after the VAE-GAN 301 has been trained…the vehicle controller may implement the result of the VAE-GAN 301 into a SLAM algorithm for computing simultaneous localization and mapping of the vehicle in real-time. The vehicle controller may further provide a notification to a driver, determine a driving maneuver, or execute a driving maneuver based on the results of the SLAM algorithm…", "…the VAE-GAN 301 includes a latent space 330 that is shared by each of an image encoder/decoder, a pose encoder/decoder, and a depth encoder/decoder. The shared latent space 330 enables the VAE-GAN 301 to generate any trained output based on an RGB image 302 (or non-RGB image) as illustrated. The reconstructed pose vector data 316 includes six Degree of Freedom pose data for a monocular camera…").
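
For illustration only, and not as a characterization of either reference or of the claims, a 6 DOF transformation of the kind discussed above can be sketched as three translation components plus three rotation angles assembled into a 4x4 homogeneous matrix. The function names and the roll/pitch/yaw rotation order below are assumptions chosen for the example; neither Daniilidis nor Chakravarty is quoted as using this parameterization.

```python
import math


def pose_to_matrix(tx, ty, tz, roll, pitch, yaw):
    """Build a 4x4 homogeneous transform from a 6 DOF pose.

    Assumed convention (hypothetical, for illustration):
    R = Rz(yaw) * Ry(pitch) * Rx(roll), angles in radians.
    """
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr, tx],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr, ty],
        [-sp, cp * sr, cp * cr, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def transform_point(T, point):
    """Apply the rigid transform T to a 3D point (x, y, z)."""
    x, y, z = point
    return tuple(
        T[i][0] * x + T[i][1] * y + T[i][2] * z + T[i][3] for i in range(3)
    )


if __name__ == "__main__":
    # A quarter turn about the z axis followed by a unit shift along x.
    T = pose_to_matrix(1.0, 0.0, 0.0, 0.0, 0.0, math.pi / 2)
    print(transform_point(T, (1.0, 0.0, 0.0)))
```

Three translations and three rotations give the six degrees of freedom. A pose network of the kind at issue would output six such numbers for a pair of images.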


Regarding dependent claims 2, 9 and 16, Daniilidis modified by Chakravarty above teaches:
wherein the two-stream network architecture (e.g., FIGS. 2 – 4 of Chakravarty) comprises: an appearance stream convolution neural network (CNN) (i.e., merely a "…CNN 310 for processing image data…" in para. [0054] of Applicant’s PG PUB – which is seen to correspond to image encoder 304 (FIG. 3) of Chakravarty) that convolves the source image and the target image; and a structure stream CNN (i.e., merely a "…CNN 310 for processing depth estimate data…" in para. [0054] of Applicant’s PG PUB – which is seen to correspond to depth decoder 324 (FIG. 3) of Chakravarty) that convolves the depth estimations (D.sub.t, D.sub.s) (See at least Chakravarty, ¶ [0022, 0038, 0043, 0044, 0045, 0046, 0052], FIGS. 1 – 7, "…devices of the present disclosure utilize the VAE-GAN as the central machinery in the SLAM algorithm…The system is trained using a regular stereo visual SLAM pipeline, where stereo visual simultaneous localization and mapping (vSLAM) receives a sequence of stereoscopic images, generates the depth maps and corresponding six Degree of Freedom poses as the stereo camera moves through space. Stereo vSLAM trains the VAE-GAN-SLAM algorithm using a sequence of RGB images, the corresponding depth maps for the images, and the corresponding pose vector data for the images. The VAE-GAN is trained to reconstruct the RGB image, the pose vector data for the image, and the depth map for the image while creating a shared latent space representation of the same…", "…Each of the image decoder 206, the pose decoder 214, and the depth decoder 224 includes a generative adversarial network (GAN) that comprises a GAN generator (see e.g. 404) and a GAN discriminator (see e.g. 
408)…", "…the VAE-GAN 201 is trained to generate each of the reconstructed image 208, the reconstructed pose vector data 216, and the reconstructed depth map 226 in tandem…the latent space 230 includes an encoded latent space vector applicable to each of an image, pose vector data of an image, and a depth map of an image…because the VAE-GAN 201 is trained in tandem, the trained VAE-GAN 201 may receive an input image and generate any outer output such as pose vector data based on the input image or a depth map based on the input image…", "…as illustrated in FIG. 3, the pose encoder 312 and the pose decoder 314 have been trained (see FIG. 2)…The VAE-GAN 301 receives an RGB image 302 at the image encoder 304…", "…the RGB image 302 is a red-green-blue image captured by a monocular camera and provided to the VAE-GAN 301 after the VAE-GAN 301 has been trained…the vehicle controller may implement the result of the VAE-GAN 301 into a SLAM algorithm for computing simultaneous localization and mapping of the vehicle in real-time. The vehicle controller may further provide a notification to a driver, determine a driving maneuver, or execute a driving maneuver based on the results of the SLAM algorithm…", "…the VAE-GAN 301 includes a latent space 330 that is shared by each of an image encoder/decoder, a pose encoder/decoder, and a depth encoder/decoder. The shared latent space 330 enables the VAE-GAN 301 to generate any trained output based on an RGB image 302 (or non-RGB image) as illustrated. The reconstructed pose vector data 316 includes six Degree of Freedom pose data for a monocular camera…", "…the GAN discriminator 408 determines a prediction of authenticity for each image, i.e. 
whether the image is a camera image from the actual dataset or a depth map 406 generated by the GAN generator 404…the GAN discriminator 408 is a convolutional neural network configured to categorize images fed to it and the GAN generator 404 is an inverse convolutional neural network…The losses of the GAN generator 404 and the GAN discriminator 408 push against each other to improve the outputs of the GAN…"), wherein the pose module further includes instructions to fuse outputs of the appearance stream CNN and the structure stream CNN into a unified output to produce the 6 DOF transformation (See at least Chakravarty, ¶ [0022, 0038, 0043, 0044, 0045, 0046, 0052], FIGS. 1 – 7, "…devices of the present disclosure utilize the VAE-GAN as the central machinery in the SLAM algorithm…The system is trained using a regular stereo visual SLAM pipeline, where stereo visual simultaneous localization and mapping (vSLAM) receives a sequence of stereoscopic images, generates the depth maps and corresponding six Degree of Freedom poses as the stereo camera moves through space. Stereo vSLAM trains the VAE-GAN-SLAM algorithm using a sequence of RGB images, the corresponding depth maps for the images, and the corresponding pose vector data for the images. The VAE-GAN is trained to reconstruct the RGB image, the pose vector data for the image, and the depth map for the image while creating a shared latent space representation of the same…", "…Each of the image decoder 206, the pose decoder 214, and the depth decoder 224 includes a generative adversarial network (GAN) that comprises a GAN generator (see e.g. 404) and a GAN discriminator (see e.g. 
408)…", "…the VAE-GAN 201 is trained to generate each of the reconstructed image 208, the reconstructed pose vector data 216, and the reconstructed depth map 226 in tandem…the latent space 230 includes an encoded latent space vector applicable to each of an image, pose vector data of an image, and a depth map of an image…because the VAE-GAN 201 is trained in tandem, the trained VAE-GAN 201 may receive an input image and generate any outer output such as pose vector data based on the input image or a depth map based on the input image…", "…as illustrated in FIG. 3, the pose encoder 312 and the pose decoder 314 have been trained (see FIG. 2)…The VAE-GAN 301 receives an RGB image 302 at the image encoder 304…", "…the RGB image 302 is a red-green-blue image captured by a monocular camera and provided to the VAE-GAN 301 after the VAE-GAN 301 has been trained…the vehicle controller may implement the result of the VAE-GAN 301 into a SLAM algorithm for computing simultaneous localization and mapping of the vehicle in real-time. The vehicle controller may further provide a notification to a driver, determine a driving maneuver, or execute a driving maneuver based on the results of the SLAM algorithm…", "…the VAE-GAN 301 includes a latent space 330 that is shared by each of an image encoder/decoder, a pose encoder/decoder, and a depth encoder/decoder. The shared latent space 330 enables the VAE-GAN 301 to generate any trained output based on an RGB image 302 (or non-RGB image) as illustrated. The reconstructed pose vector data 316 includes six Degree of Freedom pose data for a monocular camera…", "…the GAN discriminator 408 determines a prediction of authenticity for each image, i.e. 
whether the image is a camera image from the actual dataset or a depth map 406 generated by the GAN generator 404…the GAN discriminator 408 is a convolutional neural network configured to categorize images fed to it and the GAN generator 404 is an inverse convolutional neural network…The losses of the GAN generator 404 and the GAN discriminator 408 push against each other to improve the outputs of the GAN…").
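
For illustration only, the claimed arrangement of an appearance stream, a structure stream, and a fusion of their outputs into a single 6 DOF result can be sketched as a toy late-fusion network. Everything in the block is an assumption made for the example: the helper names, the single convolution layer per stream, the filter count, and the untrained random weights. It is not the architecture of the application or of Chakravarty.

```python
import random


def conv2d_valid(channels, kernels):
    """Multi-channel 'valid' convolution with ReLU, giving one feature map."""
    h, w = len(channels[0]), len(channels[0][0])
    k = len(kernels[0])
    out = []
    for i in range(h - k + 1):
        row = []
        for j in range(w - k + 1):
            s = 0.0
            for ch, ker in zip(channels, kernels):
                for a in range(k):
                    for b in range(k):
                        s += ch[i + a][j + b] * ker[a][b]
            row.append(max(s, 0.0))  # ReLU
        out.append(row)
    return out


def global_avg(fmap):
    """Average a feature map down to one scalar."""
    return sum(map(sum, fmap)) / (len(fmap) * len(fmap[0]))


def stream_features(channels, filters):
    """One stream: each filter yields a feature map, pooled to a scalar."""
    return [global_avg(conv2d_valid(channels, f)) for f in filters]


def random_filters(rng, n_filters, n_channels, k=3):
    return [
        [[[rng.uniform(-1, 1) for _ in range(k)] for _ in range(k)]
         for _ in range(n_channels)]
        for _ in range(n_filters)
    ]


def make_params(seed=0, n_filters=4):
    """Untrained, randomly initialised weights (hypothetical)."""
    rng = random.Random(seed)
    return {
        "appearance": random_filters(rng, n_filters, 2),
        "structure": random_filters(rng, n_filters, 2),
        "head_w": [[rng.uniform(-1, 1) for _ in range(2 * n_filters)]
                   for _ in range(6)],
        "head_b": [0.0] * 6,
    }


def two_stream_pose(src, tgt, d_src, d_tgt, params):
    """Appearance stream sees the image pair; structure stream sees the depth pair."""
    appearance = stream_features([src, tgt], params["appearance"])
    structure = stream_features([d_src, d_tgt], params["structure"])
    fused = appearance + structure  # late fusion by concatenation
    return [
        sum(w * f for w, f in zip(row, fused)) + b
        for row, b in zip(params["head_w"], params["head_b"])
    ]


if __name__ == "__main__":
    img = [[float(i + j) for j in range(5)] for i in range(5)]
    depth = [[1.0 + 0.1 * i for _ in range(5)] for i in range(5)]
    print(two_stream_pose(img, img, depth, depth, make_params()))
```

The point of the sketch is only the data flow: images and depth maps are convolved separately, and a single head maps the concatenated features to six pose values.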

Regarding dependent claims 3, 10 and 17, Daniilidis modified by Chakravarty above teaches:
a synthesizer module including instructions that when executed by the one or more processors cause the one or more processors to synthesize a predicted image based at least in part on the ego-motion estimation, the depth estimation D.sub.t and the source image (See at least Daniilidis, ¶ [0012, 0017, 0027, 0033, 0076, 0080, 0105], FIGS. 1, 2, 6 – 8 and 12 – 15, "…FIG. 1 is a diagram illustrating the use of a convolutional neural network to predict optical flow or egomotion and depth from motion blur in event camera images and to use the predicted values to remove the motion blur from the event camera images…", "…FIG. 6 (left image) is a scene at night with a flashing streetlight captured by an event camera; FIG. 6 (right image) illustrated depth predicted by the convolutional neural network for the left image…", "…grayscale images taken with the event-based camera that are synchronized with the event images are used to train the convolutional neural network…", "…deblurred event images are comparable to edge maps, and so we apply a photometric stereo loss on the census transform of these images to allow our network to learn metric poses and depths…", "…For comparison against the ground truth, we convert the output of the network, (;), from units of pixels/bin into units of pixel displacement…", "…we compare our ego-motion results against SFMLearner by Zhou et al. [22], which learns egomotion and depth from monocular grayscale images, while acknowledging that our loss has access to an additional stereo image at training time…", "…existing networks have been developed with frame-based images in mind, and there does not exist the wealth of labeled data for events as there does for images for supervised training. To these points, we present EV-FlowNet, a novel self-supervised deep learning pipeline for optical flow estimation for event-based cameras. 
In particular, we introduce an image-based representation of a given event stream, which is fed into a self-supervised neural network as the sole input. The corresponding grayscale images captured from the same camera at the same time as the events are then used as a supervisory signal to provide a loss function at training time, given the estimated flow from the network. We show that the resulting network is able to accurately predict optical flow from events only in a variety of different scenes, with performance competitive to image-based networks…" Also, see at least Chakravarty, ¶ [0022, 0038, 0043, 0044, 0045, 0046, 0052], FIGS. 1 – 7), wherein the memory further stores instructions to compare the predicted image against the target image to determine photometric loss (e.g., photometric stereo loss in ¶ [0033] of Daniilidis) for the self-supervised system and adjust parameters of the self-supervised system to reduce the photometric loss by optimizing an associated loss function (See at least Daniilidis, ¶ [0012, 0017, 0027, 0033, 0076, 0080, 0105], FIGS. 1, 2, 6 – 8 and 12 – 15, "…FIG. 1 is a diagram illustrating the use of a convolutional neural network to predict optical flow or egomotion and depth from motion blur in event camera images and to use the predicted values to remove the motion blur from the event camera images…", "…FIG. 6 (left image) is a scene at night with a flashing streetlight captured by an event camera; FIG. 
6 (right image) illustrated depth predicted by the convolutional neural network for the left image…", "…grayscale images taken with the event-based camera that are synchronized with the event images are used to train the convolutional neural network…", "…deblurred event images are comparable to edge maps, and so we apply a photometric stereo loss on the census transform of these images to allow our network to learn metric poses and depths…", "…For comparison against the ground truth, we convert the output of the network, (;), from units of pixels/bin into units of pixel displacement…", "…we compare our ego-motion results against SFMLearner by Zhou et al. [22], which learns egomotion and depth from monocular grayscale images, while acknowledging that our loss has access to an additional stereo image at training time…", "…existing networks have been developed with frame-based images in mind, and there does not exist the wealth of labeled data for events as there does for images for supervised training. To these points, we present EV-FlowNet, a novel self-supervised deep learning pipeline for optical flow estimation for event-based cameras. In particular, we introduce an image-based representation of a given event stream, which is fed into a self-supervised neural network as the sole input. The corresponding grayscale images captured from the same camera at the same time as the events are then used as a supervisory signal to provide a loss function at training time, given the estimated flow from the network. We show that the resulting network is able to accurately predict optical flow from events only in a variety of different scenes, with performance competitive to image-based networks…" Also, see at least Chakravarty, ¶ [0022, 0038, 0043, 0044, 0045, 0046, 0052, 0056], FIGS. 1 – 7).
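
For illustration only, the comparison of a predicted image against a target image to obtain a photometric loss can be sketched as a mean per-pixel penalty. The block uses the generic Charbonnier form (x^2 + eps^2)^alpha as the penalty; the values of eps and alpha and the function names are assumptions for the example, and equation (10) of Daniilidis is not reproduced here.

```python
def charbonnier(x, eps=1e-3, alpha=0.5):
    """Generic Charbonnier penalty, a smooth approximation of |x|."""
    return (x * x + eps * eps) ** alpha


def photometric_loss(predicted, target, eps=1e-3, alpha=0.5):
    """Mean Charbonnier penalty over per-pixel intensity differences.

    Both images are lists of rows of floats with identical shape.
    """
    if len(predicted) != len(target):
        raise ValueError("images must have the same number of rows")
    total, count = 0.0, 0
    for row_p, row_t in zip(predicted, target):
        if len(row_p) != len(row_t):
            raise ValueError("images must have the same number of columns")
        for p, t in zip(row_p, row_t):
            total += charbonnier(p - t, eps, alpha)
            count += 1
    return total / count


if __name__ == "__main__":
    target = [[0.2, 0.4], [0.6, 0.8]]
    predicted = [[0.25, 0.35], [0.6, 0.9]]
    print(photometric_loss(predicted, target))
```

In a self-supervised setting, the network parameters would be adjusted to reduce this value; no labeled pose or depth is needed because the target image itself supplies the training signal.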

Regarding dependent claims 5, 12 and 19, Daniilidis modified by Chakravarty above teaches:
wherein the self-supervised system is trained using training data that is augmented with noise (See at least Daniilidis, ¶ [0012, 0017, 0027, 0033, 0070, 0076, 0080, 0105, 0131, 0139, 0155], FIGS. 1, 2, 6 – 8 and 12 – 15, "…FIG. 1 is a diagram illustrating the use of a convolutional neural network to predict optical flow or egomotion and depth from motion blur in event camera images and to use the predicted values to remove the motion blur from the event camera images…", "…FIG. 6 (left image) is a scene at night with a flashing streetlight captured by an event camera; FIG. 6 (right image) illustrated depth predicted by the convolutional neural network for the left image…", "…grayscale images taken with the event-based camera that are synchronized with the event images are used to train the convolutional neural network…", "…deblurred event images are comparable to edge maps, and so we apply a photometric stereo loss on the census transform of these images to allow our network to learn metric poses and depths…", "…we apply a Charbonnier loss (10) on the difference between the two images, and vice versa for the right…", "…For comparison against the ground truth, we convert the output of the network, (;), from units of pixels/bin into units of pixel displacement…", "…we compare our ego-motion results against SFMLearner by Zhou et al. [22], which learns egomotion and depth from monocular grayscale images, while acknowledging that our loss has access to an additional stereo image at training time…", "…existing networks have been developed with frame-based images in mind, and there does not exist the wealth of labeled data for events as there does for images for supervised training. To these points, we present EV-FlowNet, a novel self-supervised deep learning pipeline for optical flow estimation for event-based cameras. In particular, we introduce an image-based representation of a given event stream, which is fed into a self-supervised neural network as the sole input. The corresponding grayscale images captured from the same camera at the same time as the events are then used as a supervisory signal to provide a loss function at training time, given the estimated flow from the network. We show that the resulting network is able to accurately predict optical flow from events only in a variety of different scenes, with performance competitive to image-based networks…", "…where log m is the matrix logarithm, and ω̂ converts the vector w into the corresponding skew symmetric matrix…", "…A central moving average filter is applied to the estimated velocities to reduce noise…", "…we follow the KITTI flow 2015 benchmark and report the percentage of points with EE greater than 3 pixels and 5% of the magnitude of the flow vector. Similar to KITTI, 3 pixels is roughly the maximum error observed when warping the grayscale images according to the ground truth flow and comparing against the next image…" Also, see at least Chakravarty, ¶ [0022, 0038, 0043, 0044, 0045, 0046, 0052, 0056], FIGS. 1 – 7).

Regarding dependent claims 6, 13 and 20, Daniilidis modified by Chakravarty above teaches:
wherein the noise comprises random noise patches sized 81×81 to 101×101 (e.g., Charbonnier loss function in ¶ [0070] of Daniilidis – which is well-known to be applied in patch sizes) at a noise augmentation level of 20%-40% coverage (See at least Daniilidis, ¶ [0012, 0017, 0027, 0033, 0070, 0076, 0080, 0105, 0131, 0139, 0155], FIGS. 1, 2, 6 – 8 and 12 – 15, "…FIG. 1 is a diagram illustrating the use of a convolutional neural network to predict optical flow or egomotion and depth from motion blur in event camera images and to use the predicted values to remove the motion blur from the event camera images…", "…FIG. 6 (left image) is a scene at night with a flashing streetlight captured by an event camera; FIG. 6 (right image) illustrated depth predicted by the convolutional neural network for the left image…", "…grayscale images taken with the event-based camera that are synchronized with the event images are used to train the convolutional neural network…", "…deblurred event images are comparable to edge maps, and so we apply a photometric stereo loss on the census transform of these images to allow our network to learn metric poses and depths…", "…we apply a Charbonnier loss (10) on the difference between the two images, and vice versa for the right…", "…For comparison against the ground truth, we convert the output of the network, (;), from units of pixels/bin into units of pixel displacement…", "…we compare our ego-motion results against SFMLearner by Zhou et al. [22], which learns egomotion and depth from monocular grayscale images, while acknowledging that our loss has access to an additional stereo image at training time…", "…existing networks have been developed with frame-based images in mind, and there does not exist the wealth of labeled data for events as there does for images for supervised training. To these points, we present EV-FlowNet, a novel self-supervised deep learning pipeline for optical flow estimation for event-based cameras. In particular, we introduce an image-based representation of a given event stream, which is fed into a self-supervised neural network as the sole input. The corresponding grayscale images captured from the same camera at the same time as the events are then used as a supervisory signal to provide a loss function at training time, given the estimated flow from the network. We show that the resulting network is able to accurately predict optical flow from events only in a variety of different scenes, with performance competitive to image-based networks…", "…where log m is the matrix logarithm, and ω̂ converts the vector w into the corresponding skew symmetric matrix…", "…A central moving average filter is applied to the estimated velocities to reduce noise…", "…we follow the KITTI flow 2015 benchmark and report the percentage of points with EE greater than 3 pixels and 5% of the magnitude of the flow vector. Similar to KITTI, 3 pixels is roughly the maximum error observed when warping the grayscale images according to the ground truth flow and comparing against the next image…" Also, see at least Chakravarty, ¶ [0022, 0038, 0043, 0044, 0045, 0046, 0052, 0056], FIGS. 1 – 7).

Regarding dependent claims 7 and 14, Daniilidis modified by Chakravarty above teaches:
wherein the source image and the target image are both monocular images (See at least Daniilidis, ¶ [0012, 0017, 0027, 0080, 0105], FIGS. 1, 2, 6 – 8 and 12 – 15, "…FIG. 1 is a diagram illustrating the use of a convolutional neural network to predict optical flow or egomotion and depth from motion blur in event camera images and to use the predicted values to remove the motion blur from the event camera images…", "…FIG. 6 (left image) is a scene at night with a flashing streetlight captured by an event camera; FIG. 6 (right image) illustrated depth predicted by the convolutional neural network for the left image…", "…grayscale images taken with the event-based camera that are synchronized with the event images are used to train the convolutional neural network…", "…we compare our ego-motion results against SFMLearner by Zhou et al. [22], which learns egomotion and depth from monocular grayscale images, while acknowledging that our loss has access to an additional stereo image at training time…", "…existing networks have been developed with frame-based images in mind, and there does not exist the wealth of labeled data for events as there does for images for supervised training. To these points, we present EV-FlowNet, a novel self-supervised deep learning pipeline for optical flow estimation for event-based cameras. In particular, we introduce an image-based representation of a given event stream, which is fed into a self-supervised neural network as the sole input. The corresponding grayscale images captured from the same camera at the same time as the events are then used as a supervisory signal to provide a loss function at training time, given the estimated flow from the network. We show that the resulting network is able to accurately predict optical flow from events only in a variety of different scenes, with performance competitive to image-based networks…" Also, see at least Chakravarty, ¶ [0022, 0038, 0043, 0044, 0045, 0046], FIGS. 1 – 7).



Allowable Subject Matter
Dependent claims 4, 11 and 18 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Conclusion
The prior art made of record and not relied upon is considered pertinent to Applicant's disclosure: See the Notice of References Cited (PTO–892).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to IDOWU O OSIFADE whose telephone number is (571) 272-0864. The Examiner can normally be reached on Monday-Friday 8:00am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the Examiner’s Supervisor, Kim Vu, can be reached at (571) 272-3859. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. 
Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/IDOWU O OSIFADE/
Primary Examiner, Art Unit 2666