Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Reasons for Allowance
1. The following is an examiner’s statement of reasons for allowance: the prior art, Ji (US PGPub 20110182469), in view of Jin (US PGPub 20180089562), in view of Medioni (US Patent 9836853), in view of Huang (US PGPub 20180075336), in view of Wang (US PGPub 20190164290), and further in view of Tan (US PGPub 20170243058), failed to disclose: a video action detection method based on a convolutional neural network (CNN), wherein the convolutional neural network comprises a convolutional layer, a common pooling layer, a temporal-spatial pyramid pooling layer and a full connection layer, wherein the outputs of the convolutional neural network comprise a category classification output layer and a time localization calculation result output layer, the video motion detection method comprising: Step 1: in a training phase, performing the following steps: Step 11) inputting a training video in a CNN model to obtain a feature map; Step 12) acquiring segments of different lengths in the training video, and selecting positive samples and negative samples from the actual video action segments (ground truth) as training samples; Step 13) inputting the corresponding feature region of the training samples in the feature map into the temporal-spatial pyramid pooling layer to obtain a feature expression of uniform size; Step 14) inputting the features of the uniform size into the full connection layer, defining a Loss Function, obtaining a loss value; performing backpropagation, adjusting the parameters in the model, and performing training; and Step 15) gradually reducing the learning rate of training; obtaining the trained model when the training loss is no longer falling; and Step 2: in a detection phase, performing the following steps: Step 21) inputting an entire video to be detected into the trained model obtained in Step 15); Step 22) extracting segments of different lengths in the video to-be-detected, acquiring the feature regions of the corresponding segments in the feature layer of the network, and inputting into the temporal-spatial pyramid pooling layer to obtain a feature expression of uniform size; and Step 23) discriminating the features of uniform size, and obtaining a classification confidence based on the category classification output layer; selecting the classification with the highest confidence, and obtaining the category of the action occurring in the video; obtaining a start time and an end time of the action according to time location output from the output layer, thereby fulfilling video action detection, as recited by the independent claim 1.
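
For illustration, the way a pyramid pooling layer can turn segments of different lengths into a feature expression of uniform size (Steps 13 and 22) may be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: it pools along the time axis only, uses max pooling, and the function name temporal_pyramid_pool and the pyramid levels (1, 2, 4) are hypothetical choices made for the example.

```python
def temporal_pyramid_pool(features, levels=(1, 2, 4)):
    """Max-pool a variable-length sequence of feature vectors to a fixed size.

    features: list of T feature vectors (each a list of C floats), T >= 1.
    levels:   assumed pyramid levels; level n splits the time axis into n bins.
    Returns a flat list of length C * sum(levels), independent of T.
    """
    t = len(features)
    channels = len(features[0])
    pooled = []
    for n_bins in levels:
        for b in range(n_bins):
            # Floor for the bin start and ceiling for the bin end,
            # so that no bin is empty even when T < n_bins.
            start = (b * t) // n_bins
            end = -((-(b + 1) * t) // n_bins)
            window = features[start:end]
            for c in range(channels):
                pooled.append(max(frame[c] for frame in window))
    return pooled


# Segments of 3 frames and 10 frames both map to 2 * (1 + 2 + 4) = 14 values.
short_segment = [[1.0, 2.0], [3.0, 0.0], [2.0, 5.0]]
long_segment = [[float(i), 0.0] for i in range(10)]
print(len(temporal_pyramid_pool(short_segment)), len(temporal_pyramid_pool(long_segment)))
```

Because the output length depends only on the channel count and the pyramid levels, the pooled vector can be passed to a fully connected layer regardless of segment length.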

Regarding Claim 1, the closest prior art found, Ji, Jin, Medioni, Huang, Wang and Tan, discloses a video action detection method based on a convolutional neural network (CNN), wherein the convolutional neural network comprises a convolutional layer, a common pooling layer, a temporal-spatial pyramid pooling layer and a full connection layer, wherein the outputs of the convolutional neural network comprise a category classification output layer and a time localization calculation result output layer, the video motion detection method comprising: Step 1: in a training phase, performing the following steps: Step 11) inputting a training video in a CNN model to obtain a feature map; Step 12) acquiring segments of different lengths in the training video, and selecting positive samples and negative samples from the actual video action segments (ground truth) as training samples; Step 13) inputting the corresponding feature region of the training samples in the feature map into the temporal-spatial pyramid pooling layer to obtain a feature expression of uniform size; Step 14) inputting the features of the uniform size into the full connection layer, defining a Loss Function, obtaining a loss value; performing backpropagation, adjusting the parameters in the model, and performing training; and Step 15) gradually reducing the learning rate of training; obtaining the trained model when the training loss is no longer falling; and Step 2: in a detection phase, performing the following steps: Step 21) inputting an entire video to be detected into the trained model obtained in Step 15); Step 22) extracting segments of different lengths in the video to-be-detected, acquiring the feature regions of the corresponding segments in the feature layer of the network, and inputting into the temporal-spatial pyramid pooling layer to obtain a feature expression of uniform size.
Individually, Ji teaches that, in 2D CNNs, 2D convolution is performed at the convolutional layers to extract features from a local neighborhood on feature maps in the previous layer, and that a CNN architecture can be constructed by stacking multiple layers of convolution and subsampling in an alternating fashion. Ji further teaches recognizing human action from one or more video frames by performing 3D convolutions to capture motion information encoded in multiple adjacent frames and extracting features from spatial and temporal dimensions therefrom; generating multiple channels of information from the video frames; combining information from all channels to obtain a feature representation for a 3D CNN model; and applying the 3D CNN model to recognize human actions.
Jin teaches that the CNN hardware architecture can perform a convolution computation operation by prefetching weight data in a deep learning algorithm and using the prefetched weight data when performing the convolution computation operation. That is, the CNN hardware architecture can reduce a convolution operation time (reduce multiplication delay attributable to memory load latency upon convolution operation) by prefetching weight data to be used from the start of a layer to the end of the layer in a block on which a convolution operation is performed. The CNN may include a feature extraction part 110 and a classification part 120. The feature extraction part 110 may include a configuration in which a pair of a convolution layer and a pooling layer is repeated multiple times, for example, N times. The classification part 120 may include at least one fully connected layer. The feature extraction part 110 may include an architecture in which output data processed by a pair of a convolution layer and a pooling layer becomes the input data of a next pair of a convolution layer and a pooling layer. Jin also teaches a convolution neural network (CNN) operation apparatus and a method which are capable of reducing a convolution operation time by passing a convolution operation if an operand is 0. When each of the PEs of the convolution block computes input data, the arbiter of the input unit may provide the Z_F of the corresponding input data, thereby being capable of reducing operation latency.
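
For illustration, the idea of passing (skipping) a convolution operation when an operand is 0 can be sketched in software. This is a minimal sketch under stated assumptions: Jin describes a hardware apparatus, whereas the function dot_with_zero_skip below is a hypothetical software analogue that simply counts how many multiplications are actually performed for one convolution window.

```python
def dot_with_zero_skip(inputs, weights):
    """Multiply-accumulate one convolution window, skipping zero operands.

    Returns (result, multiplications_performed).
    """
    total = 0.0
    multiplies = 0
    for x, w in zip(inputs, weights):
        if x == 0 or w == 0:
            # The product would be 0, so the multiplication is passed over.
            continue
        total += x * w
        multiplies += 1
    return total, multiplies


# Two of the four inputs are zero, so only two multiplications are performed.
print(dot_with_zero_skip([1, 0, 2, 0], [3, 4, 5, 6]))
```

The result is identical to a full multiply-accumulate; only the number of multiplications changes.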
Medioni teaches that a three-dimensional convolutional neural network may include filters that are self-optimized through learning for classification of faces within images. Different three-dimensional convolutional neural networks may be trained for video highlight detection using video segments of different numbers of video frames. For example, a first three-dimensional convolutional neural network may be trained for video highlight detection using video segments of sixteen video frames. A second three-dimensional convolutional neural network may be trained for video highlight detection using video segments of twenty-four video frames. Training of three-dimensional convolutional neural networks for video highlight detection using video segments of other numbers of video frames is contemplated. Input component 106 may be configured to input one or more sets of video segments (e.g., the first set of video segments, the second set of video segments) into one or more/different three-dimensional convolutional neural networks. The three-dimensional convolutional neural network may output one or more sets of spatiotemporal feature vectors corresponding to one or more sets of video segments (e.g., a first set of spatiotemporal feature vectors corresponding to the first set of video segments, a second set of spatiotemporal feature vectors corresponding to the second set of video segments). For example, input component 106 may input set A 610 into the first three-dimensional convolutional neural network. The first three-dimensional convolutional neural network may output a set of spatiotemporal feature vectors corresponding to input set A 610.
For example, the three-dimensional convolutional neural network system may include a first three-dimensional convolutional neural network trained for video highlight detection using video segments of a certain number of video frames (e.g., sixteen video frames) and a second three-dimensional convolutional neural network trained for video highlight detection using video segments of a different number of video frames (e.g., twenty-four video frames).
Huang teaches a convolutional neural network for classifying time series data that uses dynamic context selection. In one example, a method includes receiving a plurality of inputs of different sizes at a convolutional neural network, applying convolution and pooling to each of the inputs to provide a plurality of outputs of different sizes, changing the size of each of the outputs to a selected uniform size, reshaping each of the outputs to a vector, and fully connecting the vectors. This then provides a uniform size for both CNN models to the respective fully connected layers 222, 224. The fully connected layers receive the input and then generate metadata 226, 228 to describe the inputs based on the prior training of the model. Huang also teaches inputs of different durations for a convolutional neural network, and a processor to apply convolution and pooling to each of the inputs to provide a plurality of outputs of different sizes, to change the size of each of the outputs to a selected uniform size, and to reshape each of the outputs to a vector.
Wang teaches that, during training, the loss functions for the semantic image segmentation labeling and the objectness classification labeling may be fused such that the two tasks share a fully convolutional neural network and the two tasks supplement one another in training. Such training improves the performance of the trained fully convolutional network system significantly while reducing complexity. The loss functions may be fused using any suitable technique or techniques such as applying a first weighting to the semantic image segmentation labeling loss function and a second weighting to the objectness classification labeling loss function and summing the weighted loss functions.
The results during training may be compared to ground truth labels (again for both semantic and objectness labels) and the fused loss functions may be minimized over the training. In an embodiment, the two loss functions may be weighted and summed by adder 165 over the training images. The resultant loss values or parameters are provided as training feedback 171 and the fused loss functions may be minimized over the training images and sub-regions from cropping of the training images (as discussed further below) to train and generate semantic image segmentation system 100.
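
For illustration, fusing two loss functions by weighting and summing them over the training images can be sketched as follows. This is a minimal sketch under stated assumptions: the function name fused_loss and the weights w_seg and w_obj are hypothetical, and the per-image loss values are taken as already computed.

```python
def fused_loss(per_image_losses, w_seg=1.0, w_obj=0.5):
    """Weight and sum two per-image losses over a set of training images.

    per_image_losses: list of (segmentation_loss, objectness_loss) pairs.
    w_seg, w_obj:     assumed weightings for the two loss functions.
    """
    return sum(w_seg * seg + w_obj * obj for seg, obj in per_image_losses)


# Two training images: (1.0 * 2.0 + 0.5 * 4.0) + (1.0 * 1.0 + 0.5 * 2.0)
print(fused_loss([(2.0, 4.0), (1.0, 2.0)]))
```

A single scalar of this kind is what a training loop would minimize, so that both tasks contribute to the updates of the shared network.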
Tan teaches selecting a positive sample or a negative sample to be sent to the feature extracting module of the matching model based on the convolutional neural network, and extracting a pair of features corresponding to the pair of gait energy images included in said sample; selecting positive samples and negative samples. Pairs of gait energy images having the same identity are selected as positive samples, and pairs of gait energy images having different identities are selected as negative samples. The selection of the gait energy images should be a selection from gait energy images of different views based on the same probability. First, gait energy images of different views in the gait energy image sequence of the training gait video sequence should have the same probability of being selected, and the matching model based on the convolutional neural network is trained according to the fairly selected various cross-view circumstances. Second, the positive and negative samples are used based on a preset ratio. Since the number of pairs of gait energy images having the same identity is far less than the number of pairs of gait energy images having different identities, if the ratio of the positive samples to negative samples is not limited and the selection is performed according to the natural probability, there would be very few positive samples, which will result in over-fitting of the matching model based on the convolutional neural network in the training process. Preferably, the positive and negative samples may be made to have the same probability of appearance.
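
For illustration, selecting positive and negative pairs at a preset ratio, rather than at their natural frequency, can be sketched as follows. This is a minimal sketch under stated assumptions: the function sample_pair, the parameter positive_ratio, and the (identity, image) tuples are hypothetical and stand in for pairs of gait energy images.

```python
import random


def sample_pair(items, positive_ratio=0.5, rng=random):
    """Draw one training pair as (image_a, image_b, label).

    items: list of (identity, image) tuples; at least two identities, and at
           least one identity with two or more images, are assumed.
    label: 1 for a positive pair (same identity), 0 for a negative pair.
    """
    by_identity = {}
    for identity, image in items:
        by_identity.setdefault(identity, []).append(image)

    if rng.random() < positive_ratio:
        # Positive sample: two different images of the same identity.
        candidates = [k for k, v in by_identity.items() if len(v) >= 2]
        identity = rng.choice(candidates)
        image_a, image_b = rng.sample(by_identity[identity], 2)
        return image_a, image_b, 1

    # Negative sample: one image from each of two different identities.
    id_a, id_b = rng.sample(list(by_identity), 2)
    return rng.choice(by_identity[id_a]), rng.choice(by_identity[id_b]), 0


items = [("p1", "a"), ("p1", "b"), ("p2", "c"), ("p2", "d"), ("p3", "e")]
print(sample_pair(items, rng=random.Random(0)))
```

With positive_ratio set to 0.5, positive and negative samples appear with the same probability even though same-identity pairs are far rarer in the data.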
However, the prior art, Ji, Jin, Medioni, Huang, Wang and Tan, failed to disclose the following subject matter: “discriminating the features of uniform size, and obtaining a classification confidence based on the category classification output layer; selecting the classification with the highest confidence, and obtaining the category of the action occurring in the video; obtaining a start time and an end time of the action according to time location output from the output layer, thereby fulfilling video action detection.”
Therefore, claims 1-6 are allowed.
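
For illustration, the selection recited in Step 23 (choosing the classification with the highest confidence and reading a start time and an end time from the time localization output) can be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the function detect_action and the shape of its input, one (scores, start, end) tuple per candidate segment, are hypothetical.

```python
def detect_action(candidates):
    """Pick the most confident category and its time span.

    candidates: list of (scores, start, end) tuples, one per candidate segment,
                where scores maps category name -> classification confidence
                and start/end are the assumed time localization outputs.
    Returns (category, confidence, start, end), or None if there are no candidates.
    """
    best = None
    for scores, start, end in candidates:
        category = max(scores, key=scores.get)
        if best is None or scores[category] > best[1]:
            best = (category, scores[category], start, end)
    return best


candidates = [
    ({"run": 0.2, "jump": 0.7}, 1.0, 3.5),
    ({"run": 0.9, "jump": 0.05}, 4.0, 9.0),
]
print(detect_action(candidates))  # ('run', 0.9, 4.0, 9.0)
```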

2. Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee. Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JAE UK JEON whose telephone number is (571)270-3649. The examiner can normally be reached on 9am-6pm. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Chat Do, can be reached on 571-272-3721. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/JAE U JEON/Primary Examiner, Art Unit 2193