DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 04/21/2020 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-5, 9-13 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Feichtenhofer et al. ("Convolutional Two-Stream Network Fusion for Video Action Recognition", as cited by applicant in IDS filed 04/21/2020) in view of Zhang et al. ("PolyNet: A Pursuit of Structural Diversity in Very Deep Networks", as cited by applicant in IDS filed 04/21/2020).

[Image: media_image1.png]
Regarding Claim 1, Feichtenhofer teaches a computer-implemented method, comprising: receiving an input at a first layer of a learning network, the input comprising a plurality of images (Page 1, Right Column, Paragraph 2, The two-stream architecture [22] incorporates motion information by training separate ConvNets for both appearance in still images and stacks of optical flow. Indeed, this work showed that optical flow information alone was sufficient to discriminate most of the actions in UCF101.);
extracting, using a first unit of the first layer, first features of the plurality of images from the input in a spatial dimension, the first features characterizing a spatial presentation of the plurality of images (Page 2, Figure 1. Example outputs of the first three convolutional layers from a two-stream ConvNet model [22]. The two networks separately capture spatial (appearance) and temporal information at a fine temporal scale.); extracting, in a temporal dimension, the second features at least characterizing temporal changes across the plurality of images (As seen in Figure 1, the top network shows the neural network extracting temporal features from the plurality of images. The examiner interprets the images as being in a temporal dimension because temporal features are extracted from them. Figure 1 also shows the temporal features passing through multiple layers of the neural network, with temporal information extracted at each layer, which shows the temporal changes across the plurality of images.); and generating a spatial-temporal feature representation of the plurality of images based at least in part on the second features (Page 2, Right Column, Paragraph 2, These components are fed into separate deep ConvNet architectures, to learn spatial as well as temporal information about the appearance and movement of the objects in a scene. Each stream is performing video recognition on its own and for final classification).
Feichtenhofer does not explicitly teach extracting based on a type of a connection between the first unit and a second unit of the first layer and using the second unit, second features of the plurality of images from at least one of the first features and the input. 

[Image: media_image2.png]

[Image: media_image3.png]

Zhang teaches extracting based on a type of a connection between the first unit and a second unit of the first layer and using the second unit, second features of the plurality of images from at least one of the first features and the input (As seen in Figure 4(c), the examiner interprets that the input can be a plurality of images, that Inception F is the first unit, and that Inception G is the second unit. Figure 4(c) shows Inception G (second unit) obtaining the second features from the first features of Inception F (first unit), and Figure 4(d) shows Inception G (second unit) obtaining the second features from the input).
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Zhang to Feichtenhofer in order to extract the second feature based on the type of connection between the two networks. One skilled in the art would have been motivated to modify Feichtenhofer in this manner in order to reduce computational cost and memory demand by exploring structural diversity in designing deep networks. (Zhang, Abstract)
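The distinction the examiner draws between Figure 4(c) and Figure 4(d) of Zhang can be illustrated with a minimal sketch. This is not code from either reference: `unit_f` and `unit_g` are hypothetical toy stand-ins for the Inception F and Inception G blocks, and the identity (skip) term is an assumption based on the residual form of the PolyInception modules. The only point shown is where the second unit takes its input from.

```python
# Toy stand-ins for the two units; these are NOT real Inception blocks.
def unit_f(x):
    """Hypothetical first unit (plays the role of Inception F)."""
    return [2.0 * v for v in x]


def unit_g(x):
    """Hypothetical second unit (plays the role of Inception G)."""
    return [v + 1.0 for v in x]


def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vectors)]


def series_block(x):
    # Series reading of Figure 4(c): the second unit consumes the
    # first unit's output (the "first features").
    first = unit_f(x)
    second = unit_g(first)
    return add(x, first, second)


def parallel_block(x):
    # Parallel reading of Figure 4(d): both units consume the input.
    first = unit_f(x)
    second = unit_g(x)
    return add(x, first, second)


x = [1.0, 2.0]
series_out = series_block(x)
parallel_out = parallel_block(x)
print(series_out, parallel_out)
```

The two blocks use the same units and differ only in wiring, which is why the same input produces different outputs.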
Regarding Claim 2, the combination of Feichtenhofer and Zhang teaches the method of claim 1, where Feichtenhofer further teaches  wherein the plurality of images are images processed by a second layer of the learning network. (Page 2, Right Column, Paragraph 2, These components are fed into separate deep ConvNet architectures, to learn spatial as well as temporal information about the appearance and movement of the objects in a scene. As seen in Figure 1, the figure shows the first three convolutional layers where each network is learning either the spatial feature or the temporal feature from the plurality of images.)
Regarding Claim 3, the combination of Feichtenhofer and Zhang teaches the method of claim 1, while Feichtenhofer teaches a parallel connection in which the second unit extracts the second features from the input (See Figure 1 of Feichtenhofer; the top network extracts the temporal features from the input images.)
Feichtenhofer does not explicitly teach wherein the type of the connection between the first and second units is selected from a group consisting of: a first series connection in which the second unit at least extracts the second features from the first features, a second series connection in which the second unit at least extracts the second features from the input.

[Image: media_image4.png]

Zhang teaches the type of the connection between the first and second units is selected from a group consisting of (Page 4, Right Column, Paragraph 4, In this study, we constructed a number of variant versions of IR 3-6-3. In each version, we choose one stage (from A, B, and C) and replace all the Inception residual units therein with one of the six PolyInception modules introduced above (2-way, 3-way, poly-2, poly-3, mpoly-2, and mpoly-3). The examiner interprets that the study is selecting from one of the PolyInception structures shown in figure 4.)
a first series connection in which the second unit at least extracts the second features from the first features (As seen in Figure 4(c), the examiner interprets Inception G to be the second unit and Inception F to be the first unit. The figure shows a series connection of Inception F and Inception G in which the features from Inception F are passed on to Inception G so that Inception G can extract second features from the first features),
[Image: media_image5.png]
a second series connection in which the second unit at least extracts the second features from the input (As seen in Figure 4(a), the annotated circle shows a series connection in which Inception F can be interpreted as a second unit that is in series with the input and extracts features from the input.)
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Zhang to Feichtenhofer in order to extract the second feature based on the type of connection between the two networks. One skilled in the art would have been motivated to modify Feichtenhofer in this manner in order to reduce computational cost and memory demand by exploring structural diversity in designing deep networks. (Zhang, Abstract)



Regarding Claim 4, the combination of Feichtenhofer and Zhang teaches the method of claim 1, where Feichtenhofer further teaches wherein generating the spatial-temporal feature representation comprises: in response to the type of the connection being a second series connection or a parallel connection (Figure 1 shows the parallel connection between two separate networks configured to extract either temporal or spatial features from the plurality of images), generating the spatial-temporal feature representation by combining the first features and second features (Page 2, Right Col, Paragraph 2, The method first decomposes video into spatial and temporal components by using RGB and optical flow frames. These components are fed into separate deep ConvNet architectures, to learn spatial as well as temporal information about the appearance and movement of the objects in a scene. Each stream is performing video recognition on its own and for final classification, softmax scores are combined by late fusion. The examiner interprets that the spatial and temporal features extracted from the neural networks are then fused such that the resulting spatial-temporal feature can be output to perform video recognition for final classification).
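The late fusion of softmax scores quoted from Feichtenhofer can be sketched as follows. This is an illustrative sketch only: the logits are made-up values, and the equal weighting of the two streams (`temporal_weight=0.5`) is an assumption, since the quoted passage does not state how the scores are weighted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(spatial_logits, temporal_logits, temporal_weight=0.5):
    """Combine per-class softmax scores of two independent streams.

    temporal_weight is a hypothetical parameter; 0.5 gives a plain average.
    """
    s = softmax(spatial_logits)
    t = softmax(temporal_logits)
    return [(1.0 - temporal_weight) * a + temporal_weight * b
            for a, b in zip(s, t)]


# Made-up class scores from each stream for a three-class problem.
spatial_logits = [2.0, 1.0, 0.1]    # appearance stream (RGB frames)
temporal_logits = [0.5, 2.5, 0.2]   # motion stream (optical flow)

fused = late_fusion(spatial_logits, temporal_logits)
predicted = max(range(len(fused)), key=fused.__getitem__)
print(fused, predicted)
```

Each stream classifies on its own, and only the class probabilities are merged at the end, which is what distinguishes late fusion from combining intermediate features.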
Regarding Claim 5, the combination of Feichtenhofer and Zhang teaches the method of claim 1, where Feichtenhofer further teaches wherein generating the spatial-temporal feature representation further comprises: generating the spatial-temporal feature representation further based on the input (Page 2, Right Column, Paragraph 2, These components are fed into separate deep ConvNet architectures, to learn spatial as well as temporal information about the appearance and movement of the objects in a scene. Each stream is performing video recognition on its own and for final classification. The examiner interprets that the input frames fed into the neural network, as shown in Figure 1, are used to generate the spatial-temporal feature so that it can be used to perform video recognition from the input frames.).


[Image: media_image1.png]
Regarding Claim 9, Feichtenhofer teaches a device, comprising: a processing unit; and a memory coupled to the processing unit and having instructions stored thereon which, when executed by the processing unit, causing the device to perform acts comprising (See Acknowledgment section; Feichtenhofer discloses the use of GPUs, and it is inherent that memory is coupled to the GPU.): receiving an input at a first layer of a learning network, the input comprising a plurality of images (Page 1, Right Column, Paragraph 2, The two-stream architecture [22] incorporates motion information by training separate ConvNets for both appearance in still images and stacks of optical flow. Indeed, this work showed that optical flow information alone was sufficient to discriminate most of the actions in UCF101.);
extracting, using a first unit of the first layer, first features of the plurality of images from the input in a spatial dimension, the first features characterizing a spatial presentation of the plurality of images (Page 2, Figure 1. Example outputs of the first three convolutional layers from a two-stream ConvNet model [22]. The two networks separately capture spatial (appearance) and temporal information at a fine temporal scale. ); extracting, in a temporal dimension, the second features at least characterizing temporal changes across the plurality of images (As seen in Figure 1, the top network shows how the neural network is extracting the temporal features from the plurality of images.); and generating a spatial-temporal feature representation of the plurality of images based at least in part on the second features (Page 2, Right Column, Paragraph 2, These components are fed into separate deep ConvNet architectures, to learn spatial as well as temporal information about the appearance and movement of the objects in a scene. Each stream is performing video recognition on its own and for final classification).
Feichtenhofer does not explicitly teach extracting based on a type of a connection between the first unit and a second unit of the first layer and using the second unit, second features of the plurality of images from at least one of the first features and the input.

[Image: media_image2.png]

[Image: media_image3.png]

Zhang teaches extracting based on a type of a connection between the first unit and a second unit of the first layer and using the second unit, second features of the plurality of images from at least one of the first features and the input (As seen in Figure 4(c), the examiner interprets that the input can be a plurality of images, that Inception F is the first unit, and that Inception G is the second unit. Figure 4(c) shows Inception G (second unit) obtaining the second features from the first features of Inception F (first unit), and Figure 4(d) shows Inception G (second unit) obtaining the second features from the input).
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Zhang to Feichtenhofer in order to extract the second feature based on the type of connection between the two networks. One skilled in the art would have been motivated to modify Feichtenhofer in this manner in order to reduce computational cost and memory demand by exploring structural diversity in designing deep networks. (Zhang, Abstract)
Regarding Claim 10, the combination of Feichtenhofer and Zhang teaches the device of claim 9, where Feichtenhofer further teaches wherein the plurality of images are images processed by a second layer of the learning network. (Page 2, Right Column, Paragraph 2, These components are fed into separate deep ConvNet architectures, to learn spatial as well as temporal information about the appearance and movement of the objects in a scene. As seen in Figure 1, the figure shows the first three convolutional layers where each network is learning either the spatial feature or the temporal feature from the plurality of images.)
Regarding Claim 11, the combination of Feichtenhofer and Zhang teaches the device of claim 9, while Feichtenhofer teaches a parallel connection in which the second unit extracts the second features from the input (See Figure 1 of Feichtenhofer; the top network extracts the temporal features from the input images.)
Feichtenhofer does not explicitly teach wherein the type of the connection between the first and second units is selected from a group consisting of: a first series connection in which the second unit at least extracts the second features from the first features, a second series connection in which the second unit at least extracts the second features from the input.

[Image: media_image4.png]

Zhang teaches the type of the connection between the first and second units is selected from a group consisting of (Page 4, Right Column, Paragraph 4, In this study, we constructed a number of variant versions of IR 3-6-3. In each version, we choose one stage (from A, B, and C) and replace all the Inception residual units therein with one of the six PolyInception modules introduced above (2-way, 3-way, poly-2, poly-3, mpoly-2, and mpoly-3). The examiner interprets that the study is selecting from one of the PolyInception structures shown in figure 4.)
a first series connection in which the second unit at least extracts the second features from the first features (As seen in Figure 4(c), the examiner interprets Inception G to be the second unit and Inception F to be the first unit. The figure shows a series connection of Inception F and Inception G in which the features from Inception F are passed on to Inception G so that Inception G can extract second features from the first features),
[Image: media_image5.png]
a second series connection in which the second unit at least extracts the second features from the input (As seen in Figure 4(a), the annotated circle shows a series connection in which Inception F can be interpreted as a second unit that is in series with the input and extracts features from the input.)
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Zhang to Feichtenhofer in order to extract the second feature based on the type of connection between the two networks. One skilled in the art would have been motivated to modify Feichtenhofer in this manner in order to reduce computational cost and memory demand by exploring structural diversity in designing deep networks. (Zhang, Abstract)



Regarding Claim 12, the combination of Feichtenhofer and Zhang teaches the device of claim 9, where Feichtenhofer further teaches wherein generating the spatial-temporal feature representation comprises: in response to the type of the connection being a second series connection or a parallel connection (Figure 1 shows the parallel connection between two separate networks configured to extract either temporal or spatial features from the plurality of images), generating the spatial-temporal feature representation by combining the first features and second features (Page 2, Right Col, Paragraph 2, The method first decomposes video into spatial and temporal components by using RGB and optical flow frames. These components are fed into separate deep ConvNet architectures, to learn spatial as well as temporal information about the appearance and movement of the objects in a scene. Each stream is performing video recognition on its own and for final classification, softmax scores are combined by late fusion. The examiner interprets that the spatial and temporal features extracted from the neural networks are then fused such that the resulting spatial-temporal feature can be output to perform video recognition for final classification).
Regarding Claim 13, the combination of Feichtenhofer and Zhang teaches the device of claim 9, where Feichtenhofer further teaches wherein generating the spatial-temporal feature representation further comprises: generating the spatial-temporal feature representation further based on the input (Page 2, Right Column, Paragraph 2, These components are fed into separate deep ConvNet architectures, to learn spatial as well as temporal information about the appearance and movement of the objects in a scene. Each stream is performing video recognition on its own and for final classification. The examiner interprets that the input frames fed into the neural network, as shown in Figure 1, are used to generate the spatial-temporal feature so that it can be used to perform video recognition from the input frames.).
Regarding Claim 15, Feichtenhofer teaches a computer program product being stored on a computer-readable medium and comprising machine-executable instructions which, when executed by a device, cause the device to (See Acknowledgment section; Feichtenhofer discloses the use of GPUs, and it is inherent that memory is coupled to the GPU.): receive an input at a first layer of a learning network, the input comprising a plurality of images (Page 1, Right Column, Paragraph 2, The two-stream architecture [22] incorporates motion information by training separate ConvNets for both appearance in still images and stacks of optical flow. Indeed, this work showed that optical flow information alone was sufficient to discriminate most of the actions in UCF101.); extracting, using a first unit of the first layer, first features of the plurality of images from the input in a spatial dimension, the first features characterizing a spatial presentation of the plurality of images (Page 2, Figure 1. Example outputs of the first three convolutional layers from a two-stream ConvNet model [22]. The two networks separately capture spatial (appearance) and temporal information at a fine temporal scale.); extracting, in a temporal dimension, the second features at least characterizing temporal changes across the plurality of images (As seen in Figure 1, the top network shows the neural network extracting temporal features from the plurality of images.); and generating a spatial-temporal feature representation of the plurality of images based at least in part on the second features (Page 2, Right Column, Paragraph 2, These components are fed into separate deep ConvNet architectures, to learn spatial as well as temporal information about the appearance and movement of the objects in a scene. Each stream is performing video recognition on its own and for final classification).
Feichtenhofer does not explicitly teach extracting based on a type of a connection between the first unit and a second unit of the first layer and using the second unit, second features of the plurality of images from at least one of the first features and the input.

[Image: media_image2.png]

[Image: media_image3.png]

Zhang teaches extracting based on a type of a connection between the first unit and a second unit of the first layer and using the second unit, second features of the plurality of images from at least one of the first features and the input (As seen in Figure 4(c), the examiner interprets that the input can be a plurality of images, that Inception F is the first unit, and that Inception G is the second unit. Figure 4(c) shows Inception G (second unit) obtaining the second features from the first features of Inception F (first unit), and Figure 4(d) shows Inception G (second unit) obtaining the second features from the input).
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Zhang to Feichtenhofer in order to extract the second feature based on the type of connection between the two networks. One skilled in the art would have been motivated to modify Feichtenhofer in this manner in order to reduce computational cost and memory demand by exploring structural diversity in designing deep networks. (Zhang, Abstract)
Claims 6 and 7 are rejected under 35 U.S.C. 103 as being unpatentable over Feichtenhofer et al. ("Convolutional Two-Stream Network Fusion for Video Action Recognition"as cited by applicant in IDS filed 04/21/2020) in view of Zhang et al. ("PolyNet: A Pursuit of Structural Diversity in Very Deep Networks", as cited by applicant in IDS filed 04/21/2020) in further view of He et al. ("Deep Residual Learning for Image Recognition", as cited by applicant in IDS filed on 04/21/2020).
Regarding Claim 6, the combination of Feichtenhofer and Zhang teaches the method of claim 1, while Feichtenhofer teaches extracting first features from the input (See Figure 1, the examiner interprets the first feature to be spatial features in an image and the spatial stream is used to extract those spatial features from the input) 
However, Feichtenhofer and Zhang do not explicitly teach wherein the input has a first number of dimensions and reducing the dimensions of the input from the first number to a second number;
He teaches wherein the input has a first number of dimensions and reducing the dimensions of the input from the first number to a second number; (Page 6, Right Column, Paragraph 3, The three layers are 1×1, 3×3, and 1×1 convolutions, where the 1×1 layers are responsible for reducing and then increasing (restoring) dimensions, leaving the 3×3 layer a bottleneck with smaller input/output dimensions. The examiner interprets that the input coming in will have a first number of dimensions and the 1x1 convolution layer is responsible to reduce the input to a second number)
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of He to Feichtenhofer and Zhang in order to reduce the dimension size of the input. One skilled in the art would have been motivated to modify Feichtenhofer and Zhang in this manner in order to train deeper neural networks by presenting a residual learning framework. (He, Abstract)
Regarding Claim 7, the combination of Feichtenhofer, Zhang and He teaches the method of claim 6, where Feichtenhofer further teaches generating the spatial-temporal feature representation (Page 2, Right Col, Paragraph 2, The method first decomposes video into spatial and temporal components by using RGB and optical flow frames. These components are fed into separate deep ConvNet architectures, to learn spatial as well as temporal information about the appearance and movement of the objects in a scene. Each stream is performing video recognition on its own and for final classification, softmax scores are combined by late fusion. The examiner interprets that the spatial and temporal features extracted from the neural networks are then fused such that the resulting spatial-temporal feature can be output to perform video recognition for final classification), and He teaches wherein the second features have a third number of dimensions, and further increasing the dimensions of the second features from the third number to a fourth number (Page 6, Right Column, Paragraph 3, The three layers are 1×1, 3×3, and 1×1 convolutions, where the 1×1 layers are responsible for reducing and then increasing (restoring) dimensions, leaving the 3×3 layer a bottleneck with smaller input/output dimensions. The examiner interprets that the second features will have a certain number of dimensions and that, by using the bottleneck architecture described in the prior art, the 1×1 convolution layer can be used to increase the dimensions of the second features to another number.);
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of He to Feichtenhofer and Zhang in order to increase the dimension size of the second feature. One skilled in the art would have been motivated to modify Feichtenhofer and Zhang in this manner in order to train deeper neural networks by presenting a residual learning framework. (He, Abstract)
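The reduce-then-restore role of the 1×1 convolutions quoted from He can be illustrated with a small sketch. It is not He's implementation: the channel counts (8 and 2), the constant weights, and the helper names `conv1x1` and `channels` are hypothetical, and the intermediate 3×3 convolution is omitted so that only the change in channel dimension is visible. A 1×1 convolution is modeled here as a linear map applied to the channel vector at each spatial position.

```python
def conv1x1(feature_map, weights):
    """Apply a 1x1 convolution.

    feature_map: nested lists shaped H x W x C_in.
    weights: nested lists shaped C_out x C_in (one row per output channel).
    Returns nested lists shaped H x W x C_out.
    """
    return [
        [
            [sum(w * c for w, c in zip(out_row, pixel)) for out_row in weights]
            for pixel in row
        ]
        for row in feature_map
    ]


def channels(feature_map):
    """Number of channels at each spatial position."""
    return len(feature_map[0][0])


# Illustrative sizes only: 8 input channels squeezed to 2, then restored to 8.
c_in, c_mid = 8, 2
x = [[[1.0] * c_in for _ in range(3)] for _ in range(3)]   # 3 x 3 x 8 input
reduce_w = [[0.1] * c_in for _ in range(c_mid)]            # 8 -> 2 channels
restore_w = [[0.5] * c_mid for _ in range(c_in)]           # 2 -> 8 channels

reduced = conv1x1(x, reduce_w)           # first number -> second number
restored = conv1x1(reduced, restore_w)   # third number -> fourth number
print(channels(x), channels(reduced), channels(restored))
```

The spatial size stays at 3 × 3 throughout; only the per-position channel count changes, which is the sense of "dimensions" in the quoted passage.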
Claims 8 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Feichtenhofer et al. ("Convolutional Two-Stream Network Fusion for Video Action Recognition", as cited by applicant in IDS filed 04/21/2020) in view of Zhang et al. ("PolyNet: A Pursuit of Structural Diversity in Very Deep Networks", as cited by applicant in IDS filed 04/21/2020) in further view of Li et al. ("Action Recognition by Learning Deep Multi-Granular Spatio-Temporal Video Representation").

Regarding Claim 8, the combination of Feichtenhofer and Zhang teaches the method of claim 1. While Feichtenhofer teaches extracting spatial and temporal features using a first and second unit (Figure 1 shows the parallel connection between two separate networks configured to extract either temporal or spatial features from the plurality of images), Feichtenhofer and Zhang do not explicitly teach: extracting third features of the plurality of images from the first intermediate feature representation in the spatial dimension using a third unit of a third layer in the learning network, the third features characterizing the spatial presentation of the plurality of images, extracting, in a temporal dimension, the fourth features at least characterizing temporal changes across the plurality of images,
and generating the spatial-temporal feature representation based at least in part on the fourth features.

[Image: media_image6.png]

Li teaches extracting third features of the plurality of images from the spatial dimension using a third unit of a third layer in the learning network, the third features characterizing the spatial presentation of the plurality of images, extracting, in a temporal dimension, the fourth features at least characterizing temporal changes across the plurality of images (As seen in Figure 5 of Li, the examiner interprets that the prior art has multiple layers, so any layer can be interpreted as a third or fourth layer. The 3D or 2D CNNs extract the spatial features from the plurality of images, and the LSTM layer extracts the temporal features from the plurality of images.),
and generating the spatial-temporal feature representation based at least in part on the fourth features (As seen in Figure 5 of Li, the extracted features are input into a softmax layer, and the output of the softmax layer is used for sorting in order to determine S*, which is the final action recognition result for the input video.).
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Li to Feichtenhofer and Zhang in order to have a third and fourth unit to extract the spatial-temporal features from the plurality of images and generate a spatial-temporal feature representation. One skilled in the art would have been motivated to modify Feichtenhofer and Zhang in this manner in order to accurately recognize action in videos using a convolutional neural network. (Li, Abstract)

[Image: media_image7.png]
However, Feichtenhofer and Li do not teach generating a first intermediate feature representation for the first layer based at least in part on the second features and based on a type of a connection between the third unit and a fourth unit of the third layer and using the fourth unit, the type of the connection between the third and fourth units being different from the type of the connection between the first and second units. Zhang teaches generating a first intermediate feature representation for the first layer based at least in part on the second features (As seen in Figure 5, with the annotated arrow pointing to the annotated circle, the examiner interprets the first layer to be stage A and the second layer to be stage B. In the figure, stage B learns the features from stage A and then outputs the learned features of stages A and B to stage C. The examiner believes that this process generates the intermediate feature of the first layer based on the second features of stage B.)
[Image: media_image4.png]
based on a type of a connection between the third unit and a fourth unit of the third layer and using the fourth unit, the type of the connection between the third and fourth units being different from the type of the connection between the first and second units (As seen in Figure 4, the examiner interprets that Zhang shows various different connections between units, and it would have been obvious that, by adding the teaching of Zhang, the units can have different types of connections);
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Zhang to Feichtenhofer and Li in order to generate an intermediate feature of the first and second unit and provide different types of connections between the third and fourth units. One skilled in the art would have been motivated to modify Feichtenhofer in this manner in order to reduce computational cost and memory demand by exploring structural diversity in designing deep networks. (Zhang, Abstract)
Regarding Claim 14, the combination of Feichtenhofer and Zhang teaches the device of claim 9. While Feichtenhofer teaches extracting spatial and temporal features using a first and second unit (As seen in Figure 1, it shows the parallel connection between two separate networks configured to extract either temporal or spatial features from the plurality of images), they do not explicitly teach: extracting third features of the plurality of images from the first intermediate feature representation in the spatial dimension using a third unit of a third layer in the learning network, the third features characterizing the spatial presentation of the plurality of images, extracting, temporal dimension, the fourth features at least characterizing temporal changes across the plurality of images,
and generating the spatial-temporal feature representation based at least in part on the fourth features.

    [Image: media_image6.png]

Li teaches extracting third features of the plurality of images from the spatial dimension using a third unit of a third layer in the learning network, the third features characterizing the spatial presentation of the plurality of images extracting, temporal dimension, the fourth features at least characterizing temporal changes across the plurality of images, (As seen in Figure 5 of the prior art, the examiner interprets that the prior art has multiple layers, so any layer can be interpreted as a third or fourth layer. In the 3D or 2D CNNs, the spatial features are extracted from the plurality of images, and in the LSTM layer, the temporal features are extracted from the plurality of images.),
and generating the spatial-temporal feature representation based at least in part on the fourth features (As seen in Figure 5 of the prior art, the extracted features are input into a softmax layer, and the output of the softmax layer is used for sorting in order to determine S*, which is the final action recognition result from the input video.).
	It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Li to Feichtenhofer and Zhang in order to have a third and fourth unit to extract the spatial-temporal features from the plurality of images and generate a spatial-temporal feature representation. One skilled in the art would have been motivated to modify Feichtenhofer and Zhang in this manner in order to accurately recognize actions in videos using a convolutional neural network (Li, Abstract).
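For illustration only, the softmax-and-sort step described above for Li's Figure 5 can be sketched as follows. This is a minimal sketch and is not taken from Li or from the application: the function names, the action labels, and the assumption that the top-ranked label stands in for S* are all hypothetical.

```python
import math
from typing import Dict, List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_actions(logits: Dict[str, float]) -> List[Tuple[str, float]]:
    """Apply softmax to per-action scores and sort them, highest first.

    Assumption: the first entry of the result plays the role of the
    final recognition result (called S* in the discussion above).
    """
    labels = list(logits)
    probs = softmax([logits[label] for label in labels])
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


# Hypothetical per-action scores produced from spatial and temporal features.
ranked = rank_actions({"run": 1.0, "jump": 3.0, "walk": 2.0})
best_label = ranked[0][0]
```

In this sketch the sorting step only orders the softmax outputs; how the cited reference actually derives S* from the sorted scores is not specified here.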
However, they do not teach generating a first intermediate feature representation for the first layer based at least in part on the second features and based on a type of a connection between the third unit and a fourth unit of the third layer and using the fourth unit, the type of the connection between the third and fourth units being different from the type of the connection between the first and second units;

    [Image: media_image7.png]
where Zhang teaches generating a first intermediate feature representation for the first layer based at least in part on the second features (As seen in Figure 5, at the annotated arrow pointing to the annotated circle, the examiner interprets the first layer to be stage A and the second layer to be stage B. From the figure, stage B learns the features from stage A and then outputs the learned features of stages A and B to stage C, and the examiner interprets this process as generating the intermediate feature of the first layer based on the second features of stage B.)
    [Image: media_image4.png]
based on a type of a connection between the third unit and a fourth unit of the third layer and using the fourth unit, the type of the connection between the third and fourth units being different from the type of the connection between the first and second units (As seen in Figure 4, the examiner interprets Zhang as showing various different connections between units, and it would have been obvious that, by adding the teaching of Zhang, the units can have different types of connections).
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Zhang to Feichtenhofer and Li in order to generate an intermediate feature of the first and second units and provide different types of connections between the third and fourth units. One skilled in the art would have been motivated to modify Feichtenhofer in this manner in order to reduce computational cost and memory demand by exploring structural diversity in designing deep networks (Zhang, Abstract).
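For illustration only, the notion of two units being joined by different types of connection can be sketched as follows. This is a minimal sketch and is not taken from Zhang, Feichtenhofer, Li, or the application: the unit functions and the names serial, parallel, and residual_composite are hypothetical, and the toy units merely stand in for spatial and temporal feature extractors.

```python
from typing import Callable, List

Features = List[float]
Unit = Callable[[Features], Features]


def add(a: Features, b: Features) -> Features:
    """Element-wise sum of two equally sized feature lists."""
    return [x + y for x, y in zip(a, b)]


def serial(first: Unit, second: Unit) -> Unit:
    """Serial connection: the second unit consumes the first unit's output."""
    return lambda x: second(first(x))


def parallel(first: Unit, second: Unit) -> Unit:
    """Parallel connection: both units see the same input; outputs are summed."""
    return lambda x: add(first(x), second(x))


def residual_composite(first: Unit, second: Unit) -> Unit:
    """Residual composite: x + first(x) + second(first(x))."""
    def run(x: Features) -> Features:
        fx = first(x)
        return add(add(x, fx), second(fx))
    return run


# Hypothetical toy units standing in for spatial and temporal extractors.
def spatial_unit(x: Features) -> Features:
    return [2.0 * v for v in x]


def temporal_unit(x: Features) -> Features:
    return [v + 1.0 for v in x]


frame_features = [1.0, 2.0]
serial_out = serial(spatial_unit, temporal_unit)(frame_features)
parallel_out = parallel(spatial_unit, temporal_unit)(frame_features)
composite_out = residual_composite(spatial_unit, temporal_unit)(frame_features)
```

The same two units give different outputs under each connection type, which is the sense in which one layer's connection type can differ from another's.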

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HAN D HOANG whose telephone number is (571)272-4344.  The examiner can normally be reached on Monday-Friday 8-5.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Claire X. Wang can be reached on (571) 270-1051.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/HAN HOANG/Examiner, Art Unit 2663            

/CLAIRE X WANG/Supervisory Patent Examiner, Art Unit 2663