DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment
Applicant's submission filed on 27 April 2021 has been entered.  Claims 1, 4-6, 9 and 10 have been amended.  Claims 3, 7 and 8 have been canceled.  Claims 1, 2, 4-6 and 9-11 are currently pending and have been considered below.

Response to Arguments
Applicant’s arguments with respect to claim(s) 1, 2, 4-6 and 9-11 have been carefully considered but are moot in view of the new grounds of rejection necessitated by Applicant’s amendments.
The 35 U.S.C. §112(b) rejection of claims 1, 2, 4-6 and 9-11 regarding the term “feature maps” has been withdrawn in view of Applicant’s amendments.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


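Claims 1, 2, 4-6 and 9-11 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.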
Claims 1, 9 and 10 recite the limitation “wherein the hardware processor generates the time observation map based on a result of an inner product of the feature quantities defined for each element of the plurality of elements along the time direction, a position direction in the plurality of first feature maps, and a relationship direction among the plurality of first feature maps, and wherein the inner product of the feature quantity for each element of the plurality of first feature maps is defined as the first weighting value for each element of the plurality of first feature maps belonging to the first group and the plurality of first feature maps belonging to the second group.”  It is unclear what is meant by the limitations “time observation map based on … an inner product of the feature quantities … along the time direction, a position direction and a relationship direction … and the inner product is defined as the first weighting value … belonging to the first group and … belonging to the second group.”  Specifically, it is unclear whether the time, position and relationship directions are combined or fused before a weight is calculated or determined for each frame or time in a sequence of video frames, or whether the term “inner product” corresponds to “a weight” or “weights” at different points in time.
Claims 4 and 5 recite the limitation “each feature quantity of the element included in the element group is linearly embedded.”  It is unclear what is meant by “quantity of the element included in the element group is linearly embedded.”  It is unclear whether the features are embedded in the combined maps or embedded elsewhere.
Claims 2, 6 and 11 are rejected for being dependent on a rejected base claim. 




Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
Claims 1, 2, 4-6, 9, 10 and 11 are rejected under 35 U.S.C. 103 as being unpatentable over A. Piergiovanni, C. Fan, and M. S. Ryoo. Learning latent sub-events in activity videos using temporal attention filters. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), 1 January 2017, hereinafter, “Piergiovanni”, in view of Baradel F, Wolf C, Mille J. Pose-conditioned spatio-temporal attention for human action recognition. arXiv preprint arXiv:1703.10106. 2017 Mar 29, hereinafter, “Baradel”, and further in view of Palanisamy et al., U.S. Publication No. 2020/0139973, hereinafter, “Palanisamy”.

As per claim 1, Piergiovanni discloses an object detection apparatus (Piergiovanni, page 4247, Introduction, activity recognition approaches taking advantage of convolutional neural networks (CNNs) ... image-based object recognition using CNNs … image-based CNN architectures) comprising:
a hardware processor configured to: 

generate, based at least in part on a first group of the plurality of first feature maps calculated at a first time and a second group of the plurality of first feature maps calculated prior to the first time, a time observation map (Piergiovanni, page 4248, Figure 1, Temporal attention filters; Piergiovanni, page 4249, Recognition approach, We design our model as a set of temporal filters, each corresponding to a particular sub-event, placed on top of per-frame (or per-segment) CNN architectures (Figure 1) ... jointly learning latent sub-events composing each activity, underlying CNN parameters, and the activity classifier);
generate a plurality of second feature maps by weighting each of the plurality of first feature maps included in the first group or the second group in accordance with first weighting values indicated in the time observation map (Piergiovanni, page 4248, Figure 1, Fully connected layer; Piergiovanni, page 4250, Recurrent neural networks with temporal filters, learns weights that models how previous LSTM iteration outputs ... our attention filter parameters become the function of previous iteration LSTM outputs); and 
detect an object captured in the input image by using the plurality of second feature maps (Piergiovanni, page 4247, Introduction, binary (or multiclass) classification problem of outputting activity class labels … activity recognition approaches taking advantage of convolutional neural networks (CNNs) ... image-based object recognition using CNNs; Piergiovanni, page 4250, Recurrent neural networks with temporal filters, the recognition system must learn how to dynamically and adaptively adjust locations
Piergiovanni does not explicitly disclose the following limitations as further recited; however, Baradel discloses:
in which feature quantities of at least some elements included in the plurality of first feature maps differ (Baradel, page 6, Section 4. Network architectures and Training, Architectures — The pose network fsk consists of 3 convolutional layers of respective sizes 8 x 3, 8 x 3, 5 x 75. Inputs are of size 20 x 300 x 3 and feature maps are, respectively, 10 x 150, 5 x 75 and 1 x 1 x 1024. Max pooling is employed after each convolutional layer);
in which a first weighting value is defined for each element (Baradel, page 2, Introduction, features learned on pose serve as an input to the soft-attention mechanism, which weights each glimpse output according to an estimated importance; Baradel, page 6, Section 3.3. Temporal Attention, (combined with pose) the spatial attention distributions pt over time t are a good indicator for temporal attention, and stack them into a single vector P, input into the network predicting temporal attention ... This attention is used as weight for adaptive temporal pooling of the features).
It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to modify the teachings of Piergiovanni to include the different feature map input scales and temporal weightings as taught by Baradel in order to enable detection of features at different locations and times (Baradel, Abstract).
Piergiovanni and Baradel do not explicitly disclose the following limitations as further recited; however, Palanisamy discloses:
a time observation map comprising a plurality of elements, in which a first weighting value is defined for each element, wherein the closer the relationship in a time direction between the first group 
wherein the hardware processor generates the time observation map based on a result of an inner product of the feature quantities defined for each element of the plurality of elements along the time direction, a position direction in the plurality of first feature maps, and a relationship direction among the plurality of first feature maps (Palanisamy, ¶0014, The LSTM network then learns a temporal attention weight for each LSTM output at each time step. The learned temporal attention weight is an inner product of the region vector at that time step and the hidden vector at that time step, and reflects a relative importance of that LSTM output at a given frame so that frames that matter the most for learning the correct actions are considered to have higher importance for computing an action output. A softmax function at the LSTM network normalizes the sum of all of the learned temporal attention weights to one, and the LSTM network then generates, at each time step, a weighted output for that time step that is equal to the product of a learned temporal attention weight at that time step and a hidden state vector at that time step), and 
wherein the inner product of the feature quantity for each element of the plurality of first feature maps is defined as the first weighting value for each element of the plurality of first feature maps belonging to the first group and the plurality of first feature maps belonging to the second group 
It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to include the spatial and temporal weights for each time step as taught by Palanisamy in the system of Piergiovanni and Baradel in order to provide a means for controlling an autonomous vehicle (Palanisamy, Abstract).
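For illustration only, the temporal attention computation quoted above from Palanisamy, ¶0014 can be sketched as follows: a per-time-step score is the inner product of the region vector and the hidden vector, a softmax normalizes the scores so they sum to one, and each weighted output is the weight multiplied by the hidden state vector. The function and variable names are hypothetical and do not appear in the reference, and the vectors are assumed to be equal-length lists of floats with one pair per time step. This is a minimal sketch under those assumptions, not the reference's implementation.

```python
import math


def inner(u, v):
    """Inner (dot) product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Normalize scores so they are positive and sum to one."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(region_vectors, hidden_vectors):
    """Return (weights, weighted_outputs), one entry per time step.

    Hypothetical helper: the score at step t is the inner product of
    region_vectors[t] and hidden_vectors[t]; the weighted output at
    step t is weights[t] * hidden_vectors[t].
    """
    scores = [inner(v, h) for v, h in zip(region_vectors, hidden_vectors)]
    weights = softmax(scores)
    weighted = [[w * x for x in h] for w, h in zip(weights, hidden_vectors)]
    return weights, weighted
```

In this sketch, a time step whose region vector aligns more strongly with its hidden vector receives a larger weight, which corresponds to the reference's statement that frames that matter most are given higher importance.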

As per claim 2, Piergiovanni, Baradel and Palanisamy disclose the apparatus according to claim 1, wherein the hardware processor calculates the plurality of first feature maps in which at least either resolutions or scales differ (Baradel, page 6, Section 4. Network architectures and Training, Architectures — The pose network fsk consists of 3 convolutional layers of respective sizes 8 x 3, 8 x 3, 5 x 75. Inputs are of size 20 x 300 x 3 and feature maps are, respectively, 10 x 150, 5 x 75 and 1 x 1 x 1024. Max pooling is employed after each convolutional layer).  The motivation would be the same as above in claim 1.

As per claim 4, Piergiovanni, Baradel and Palanisamy disclose the apparatus according to claim 1, wherein the hardware processor generates the time observation map in which the results of the inner product of the feature quantities defined for each element of the plurality of first feature maps along the time direction are defined as the first weighting value for an element (Palanisamy, ¶0014) that corresponds between a first combined map and a second combined map,

the second combined map being a map in which, for each element group of corresponding elements in the plurality of second feature maps included in the second group, each feature quantity of the element included in the element group is linearly embedded (Palanisamy, ¶0088, the CNN 130 processes the image data 129 to generate an output a feature map; Palanisamy, ¶0089, The feature map is first applied to the spatial attention module 140 of the attention module to generate an output that is then applied to the temporal attention module 160 of the attention module … Temporal attention module 160 can then generate a combined context vector as its output that can then be used by other layers to generate the hierarchical actions 172; Palanisamy, ¶0100; Palanisamy, ¶0101,  Each one of the dotted-line boxes represents the actor-critic network architecture 102 being continuously applied to updated information at different steps within a time series ... each dotted-line box represents processing by the actor-critic network architecture 102 at different instances in time; Palanisamy, ¶0108, The set of region vectors 132-3 are applied to the attention network 134-3 along with a previous hidden state vector ... from the previous stage (at past time step t+1)).  The motivation would be the same as above in claim 1.


generates a third combined map comprising a plurality of elements, in which a weight value during a linear embedding is different than in the first combined map, the third combined map being a map in which a feature quantity of each element thereof included in the element group is linearly embedded for each element group of corresponding elements of the plurality of first feature maps included in the second group (Palanisamy, ¶0088, the CNN 130 processes the image data 129 to generate an output a feature map; Palanisamy, ¶0089, The feature map is first applied to the spatial attention module 140 of the attention module to generate an output that is then applied to the temporal attention module 160 of the attention module … Temporal attention module 160 can then generate a combined context vector as its output that can then be used by other layers), and 
generates a plurality of the second feature maps comprising a plurality of elements by weighting a feature quantity of each element thereof included in the third combined map in accordance with the first weighting values indicated in the time observation map (Palanisamy, ¶0101,  Each one of the dotted-line boxes represents the actor-critic network architecture 102 being continuously applied to updated information at different steps within a time series ... each dotted-line box represents processing by the actor-critic network architecture 102 at different instances in time; Palanisamy, ¶0108, The set of region vectors 132-3 are applied to the attention network 134-3 along with a previous hidden state vector ... from the previous stage (at past time step t+1)).

As per claim 6, Piergiovanni, Baradel and Palanisamy disclose the apparatus according to claim 1, wherein the hardware processor
generates, based on the plurality of first feature maps, a space observation map in which elements having a closer relationship in a space, which is defined by position directions in the plurality 
generates a plurality of third feature maps by weighting each of the plurality of first feature maps in accordance with the second weighting values indicated in the space observation map (Palanisamy, ¶0107, The spatial attention module 140 is applied after the feature extraction layers 130 and before the recurrent layers 150. The spatial attention module 140 can apply spatial attention to learn weights for different areas in an image ... This allows the spatial attention module 140 to add importance to different locations or regions within the image data 129-3), 
generates the time observation map by using the plurality of third feature maps as the plurality of first feature maps (Palanisamy, ¶0115,  can be decided which frames in past observations matter the most. The temporal attention module 160 learns scalar weights ... in different time steps. The weight of each LSTM output wi is defined as an inner product of the feature vector vi 132 and LSTM hidden vector h.i 133 followed by a softmax function), and 
generates the plurality of second feature maps by using the plurality of third feature maps as the plurality of first feature maps (Palanisamy, ¶0116, each learned weight is dependent on the previous time step's information and current state information. The learned weights can be interpreted as the importance of the LSTM output at a given frame. Therefore, the optimizing process can be seen as learning to choose which observations are relatively more important for learning the correct actions. 
wherein the relationship direction among the first feature maps indicates increase or decrease direction of resolution or scale (Palanisamy, ¶0105, Each convolutional kernel generates a first layer output channel that comprises an image having a first resolution. A first max-pooling layer 226 is configured to process each first output channel by applying a maximum value operation to that first output channel to down-scale the corresponding image and generate a down-scaled map having the first resolution. The first max-pooling layer 226 outputs a plurality of second output channels that each comprise an image having a second resolution that is less than the first resolution. A second convolutional layer 228 configured to apply a second bank of convolutional kernels to each of the plurality of second output channels. Each convolutional kernel of the second bank generates a third output channel that comprises an image having a third resolution that is less than the second resolution).  The motivation would be the same as above in claim 1.

Regarding claims 9 and 10:
The reasoning given above in the rejection of claim 1 applies, mutatis mutandis, to the subject matter of claims 9 and 10, which are therefore rejected on the same grounds.

As per claim 11, Piergiovanni, Baradel and Palanisamy disclose a moving object comprising:
the object detection apparatus according to claim 1; and a hardware processor configured to control a driving device based on information indicating a detection result of an object (Palanisamy, .

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to TRACY MANGIALASCHI whose telephone number is (571)270-5189.  The examiner can normally be reached on M-F, 9:30AM TO 6:00PM.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vu Le can be reached on (571) 272-7332.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/TRACY MANGIALASCHI/Examiner, Art Unit 2668
/VU LE/Supervisory Patent Examiner, Art Unit 2668