DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant’s arguments with respect to claims 1-15 have been considered but are moot in view of the new ground of rejection set forth below in this Office action. The amendments to independent claims 1, 9 and 15 have changed the scope of the claims and necessitated a new rejection under 35 U.S.C. 112 because the claims are indefinite. The claims recite “determining a type of a connection between the first unit and a second unit of the first layer is a first series connection type, a second series connection type, or a parallel connection type”. It is unclear to the examiner whether this step is necessary at all, because the subsequent steps recite “extracting at least the second features from the first features for the first series connection type; extracting at least the second features from the input for the second series connection type; and extracting the second features from the input for the parallel connection type”. These steps imply that the invention extracts the second features for all types of connections, so the step of determining the type of connection does not appear to be relevant.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-15 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Regarding independent claims 1, 9 and 15, each of the independent claims recites, in some variation, “determining a type of a connection between the first unit and a second unit of the first layer is a first series connection type, a second series connection type, or a parallel connection type”. The limitation can be interpreted as selecting or choosing the type of connection. However, the later steps of the claim recite “extracting at least the second features from the first features for the first series connection type; extracting at least the second features from the input for the second series connection type; and extracting the second features from the input for the parallel connection type”. Because the word “and” joins all of the extracting steps, it is unclear whether the invention as claimed needs the step of determining the type of connection at all: the claim appears to extract the second features for every connection type regardless of which type was determined. For the purpose of advancing prosecution, the examiner interprets the type of connection between the first and second units to be a parallel connection, as taught by the prior art of Feichtenhofer.
The remaining dependent claims 2-8 and 10-14 are also rejected by virtue of their dependency on the independent claims.
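For illustration only, the following minimal Python sketch shows one possible reading of the limitation under which the determining step is meaningful: the determined connection type selects exactly one extraction path. All names (`ConnectionType`, `spatial_unit`, `temporal_unit`, `extract_second_features`) and the toy feature computations are hypothetical; they are not taken from the application or from the cited art.

```python
from enum import Enum


class ConnectionType(Enum):
    FIRST_SERIES = 1
    SECOND_SERIES = 2
    PARALLEL = 3


def spatial_unit(frames):
    # Hypothetical stand-in for a spatial extractor: one mean value per frame.
    return [[sum(frame) / len(frame)] for frame in frames]


def temporal_unit(sequence):
    # Hypothetical stand-in for a temporal extractor: mean absolute change
    # between consecutive items of the sequence.
    changes = []
    for prev, curr in zip(sequence, sequence[1:]):
        diffs = [abs(c - p) for p, c in zip(prev, curr)]
        changes.append(sum(diffs) / len(diffs))
    return changes


def extract_second_features(frames, connection):
    # Conditional reading: the determined type selects ONE source for the
    # second (temporal) features.
    first_features = spatial_unit(frames)
    if connection is ConnectionType.FIRST_SERIES:
        return temporal_unit(first_features)  # from the first features
    # Second series and parallel types both extract from the input.
    return temporal_unit(frames)


frames = [[0.0, 0.0], [1.0, 3.0], [3.0, 1.0]]
for connection in ConnectionType:
    print(connection.name, extract_second_features(frames, connection))
```

Under the alternative reading, in which all three extracting steps are always performed, the branch on `connection` would serve no purpose, which is the ambiguity identified in the rejection.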
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-5, 9-13 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Feichtenhofer et al. ("Convolutional Two-Stream Network Fusion for Video Action Recognition" as cited by applicant in IDS filed 04/21/2020) in view of Zhang et al. ("PolyNet: A Pursuit of Structural Diversity in Very Deep Networks", as cited by applicant in IDS filed 04/21/2020).

[Image: media_image1.png]
Regarding Claim 1, Feichtenhofer teaches a computer-implemented method, comprising: receiving an input at a first layer of a learning network, the input comprising a plurality of images; 
extracting, using a first unit of the first layer, first features of the plurality of images from the input in a spatial dimension, the first features characterizing a spatial presentation of the plurality of images (Page 2, Figure 1. Example outputs of the first three convolutional layers from a two-stream ConvNet model [22]. The two networks separately capture spatial (appearance) and temporal information at a fine temporal scale.); extracting second features of the plurality of images from at least one of the first features and the input in a temporal dimension, the second features at least characterizing temporal changes across the plurality of images (As seen in Figure 1, the top network shows the neural network extracting the temporal features from the plurality of images. The examiner interprets the images to be in a temporal dimension because temporal features are being extracted from them. Figure 1 also shows the temporal features passing through multiple layers of the neural network, with temporal information extracted at each layer, showing the temporal changes across the plurality of images.);
determining a type of a connection between the first unit and a second unit of the first layer is a first series connection type, a second series connection type, or a parallel connection type (The examiner interprets Figure 1 of Feichtenhofer as showing a parallel connection type between the first and second units of the first layer of the network.); and extracting the second features from the input for the parallel connection type (See Figure 1 of Feichtenhofer: the top network extracts the temporal features from the input images.); and generating a spatial-temporal feature representation of the plurality of images based at least in part on the second features (Page 2, Right Column, Paragraph 2, These components are fed into separate deep ConvNet architectures, to learn spatial as well as temporal information about the appearance and movement of the objects in a scene. Each stream is performing video recognition on its own and for final classification).
Feichtenhofer does not explicitly teach wherein the second unit extracts the second features, and extracting the second feature comprises: extracting at least the second features from the first features for a first series connection type; extracting at least the second features from the input for a second series connection type; 

[Image: media_image2.png]
Zhang teaches wherein the second unit extracts the second features, and extracting the second feature (Figure 4 shows examples of PolyInception structures in which Inception blocks extract features from a first Inception block.) comprises:
extracting at least the second features from the first features for the first series connection type (As seen in Figure 4c, the examiner interprets Inception G to be the second unit and Inception F to be the first unit. The figure shows a series connection of Inception F and Inception G in which the features from Inception F are passed on to Inception G, so that Inception G can extract the second features from the first features);
 
[Image: media_image3.png]
extracting at least the second features from the input for the second series connection type (As seen in Figure 4a, the annotated circle shows a series connection in which Inception F can be interpreted as a second unit that is in series with the input and extracts the features from the input.);
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Zhang to Feichtenhofer in order to extract the second feature based on the type of connection between the two networks. One skilled in the art would have been motivated to modify Feichtenhofer in this manner in order to reduce computational cost and memory demand by exploring structural diversity in designing deep networks. (Zhang, Abstract)
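The distinction between the series and parallel structures relied on above can be sketched in Python. The sketch assumes, as an interpretation of Zhang's Figure 4, that a series structure applies one block to the output of another and that a parallel structure applies both blocks to the same input, with the results summed together with an identity path. The functions `block_f` and `block_g` are hypothetical scalar stand-ins for Inception blocks and do not reproduce the reference's implementation.

```python
def block_f(x):
    # Hypothetical stand-in for Inception block F.
    return 2.0 * x


def block_g(x):
    # Hypothetical stand-in for Inception block G.
    return x + 1.0


def series_structure(x):
    # Series: G operates on the output of F (identity + F + G applied to F).
    f_out = block_f(x)
    return x + f_out + block_g(f_out)


def parallel_structure(x):
    # Parallel: F and G both operate directly on the input.
    return x + block_f(x) + block_g(x)


print(series_structure(3.0))
print(parallel_structure(3.0))
```

The two structures differ only in what the second block receives: the first block's output in the series case, and the original input in the parallel case.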
Regarding Claim 2, the combination of Feichtenhofer and Zhang teaches the method of claim 1, where Feichtenhofer further teaches wherein the plurality of images are images processed by a second layer of the learning network. (Page 2, Right Column, Paragraph 2, These components are fed into separate deep ConvNet architectures, to learn spatial as well as temporal information about the appearance and movement of the objects in a scene. As seen in Figure 1, the figure shows the first three convolutional layers where each network is learning either the spatial feature or the temporal feature from the plurality of images.)
Regarding Claim 3, the combination of Feichtenhofer and Zhang teaches the method of claim 1, while Feichtenhofer teaches a parallel connection of the parallel connection type in which the second unit extracts the second features from the input (See Figure 1 of Feichtenhofer: the top network extracts the temporal features from the input images.).
Feichtenhofer does not explicitly teach wherein the type of the connection between the first and second units is selected from a group consisting of: a first series connection of the first series connection type in which the second unit at least extracts the second features from the first features, a second series connection of the second series connection type in which the second unit at least extracts the second features from the input,
Zhang teaches the type of the connection between the first and second units is selected from a group consisting of (Page 4, Right Column, Paragraph 4, In this study, we constructed a number of variant versions of IR 3-6-3. In each version, we choose one stage (from A, B, and C) and replace all the Inception residual units therein with one of the six PolyInception modules introduced above (2-way, 3-way, poly-2, poly-3, mpoly-2, and mpoly-3). The examiner interprets that the study is selecting from one of the PolyInception structures shown in figure 4.)
a first series connection of the first series connection type in which the second unit at least extracts the second features from the first features (As seen in Figure 4c, the examiner interprets Inception G to be the second unit and Inception F to be the first unit. The figure shows a series connection of Inception F and Inception G in which the features from Inception F are passed on to Inception G, so that Inception G can extract the second features from the first features),
[Image: media_image3.png]
a second series connection of the second series connection type in which the second unit at least extracts the second features from the input (As seen in Figure 4a, the annotated circle shows a series connection in which Inception F can be interpreted as a second unit that is in series with the input and extracts the features from the input.)
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Zhang to Feichtenhofer in order to extract the second feature based on the type of connection between the two networks. One skilled in the art would have been motivated to modify Feichtenhofer in this manner in order to reduce computational cost and memory demand by exploring structural diversity in designing deep networks. (Zhang, Abstract)
Regarding Claim 4, the combination of Feichtenhofer and Zhang teaches the method of claim 1, where Feichtenhofer further teaches wherein generating the spatial-temporal feature representation comprises: in response to the type of the connection being a second series connection of the second series connection type or a parallel connection of the parallel connection type (Figure 1 shows the parallel connection between two separate networks configured to extract either temporal or spatial features from the plurality of images), generating the spatial-temporal feature representation by combining the first features and second features (Page 2, Right Column, Paragraph 2, The method first decomposes video into spatial and temporal components by using RGB and optical flow frames. These components are fed into separate deep ConvNet architectures, to learn spatial as well as temporal information about the appearance and movement of the objects in a scene. Each stream is performing video recognition on its own and for final classification, softmax scores are combined by late fusion. The examiner interprets that the spatial and temporal features extracted from the neural networks are then fused such that the resulting spatial-temporal feature can be output to perform video recognition for final classification).
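The late fusion of softmax scores described in the cited passage can be sketched as follows. This is a minimal Python illustration that assumes the two streams' class probabilities are combined by simple averaging; the cited passage states only that the scores are combined by late fusion, so the averaging rule, the function names, and the example scores are all hypothetical.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of class scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(spatial_scores, temporal_scores):
    # Assumed fusion rule: average the per-class softmax outputs
    # of the spatial stream and the temporal stream.
    spatial_probs = softmax(spatial_scores)
    temporal_probs = softmax(temporal_scores)
    return [(s + t) / 2.0 for s, t in zip(spatial_probs, temporal_probs)]


# Hypothetical raw class scores from each stream for three action classes.
fused = late_fusion([2.0, 1.0, 0.1], [0.5, 2.5, 0.3])
predicted_class = max(range(len(fused)), key=fused.__getitem__)
print(fused, predicted_class)
```

In this sketch each stream classifies on its own and the combination happens only at the score level, which is what distinguishes late fusion from combining intermediate feature maps.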
Regarding Claim 5, the combination of Feichtenhofer and Zhang teaches the method of claim 1, where Feichtenhofer further teaches wherein generating the spatial-temporal feature representation further comprises: generating the spatial-temporal feature representation further based on the input (Page 2, Right Column, Paragraph 2, These components are fed into separate deep ConvNet architectures, to learn spatial as well as temporal information about the appearance and movement of the objects in a scene. Each stream is performing video recognition on its own and for final classification. The examiner interprets that the input frames fed into the neural network as shown in Figure 1 are used to generate the spatial-temporal feature such that it can be used to perform video recognition from the input frames.).

[Image: media_image1.png]
Regarding Claim 9, Feichtenhofer teaches a device, comprising: a processing unit; and a memory coupled to the processing unit and having instructions stored thereon which, when executed by the processing unit, causing the device to perform acts comprising (See Acknowledgment section: Feichtenhofer discloses the use of GPUs, and it is inherent that memory is coupled to the GPU.): receiving an input at a first layer of a learning network, the input comprising a plurality of images (Page 1, Right Column, Paragraph 2, The two-stream architecture [22] incorporates motion information by training separate ConvNets for both appearance in still images and stacks of optical flow. Indeed, this work showed that optical flow information alone was sufficient to discriminate most of the actions in UCF101.);
extracting, using a first unit of the first layer, first features of the plurality of images from the input in a spatial dimension, the first features characterizing a spatial presentation of the plurality of images (Page 2, Figure 1. Example outputs of the first three convolutional layers from a two-stream ConvNet model [22]. The two networks separately capture spatial (appearance) and temporal information at a fine temporal scale. ); extracting, in a temporal dimension, the second features at least characterizing temporal changes across the plurality of images (As seen in Figure 1, the top network shows how the neural network is extracting the temporal features from the plurality of images.); 
determining a type of a connection between the first unit and a second unit of the first layer is a first series connection type, a second series connection type, or a parallel connection type (The examiner interprets Figure 1 of Feichtenhofer as showing a parallel connection type between the first and second units of the first layer of the network.);
and extracting the second features from the input for the parallel connection type (See Figure 1 of Feichtenhofer: the top network extracts the temporal features from the input images.); and generating a spatial-temporal feature representation of the plurality of images based at least in part on the second features (Page 2, Right Column, Paragraph 2, These components are fed into separate deep ConvNet architectures, to learn spatial as well as temporal information about the appearance and movement of the objects in a scene. Each stream is performing video recognition on its own and for final classification).
Feichtenhofer does not explicitly teach wherein the second unit extracts the second features, and extracting the second feature comprises: extracting at least the second features from the first features for a first series connection type; extracting at least the second features from the input for a second series connection type; 

[Image: media_image2.png]
Zhang teaches wherein the second unit extracts the second features, and extracting the second feature (Figure 4 shows examples of PolyInception structures in which Inception blocks extract features from a first Inception block.) comprises:
extracting at least the second features from the first features for the first series connection type (As seen in Figure 4c, the examiner interprets Inception G to be the second unit and Inception F to be the first unit. The figure shows a series connection of Inception F and Inception G in which the features from Inception F are passed on to Inception G, so that Inception G can extract the second features from the first features);
 
[Image: media_image3.png]
extracting at least the second features from the input for the second series connection type (As seen in Figure 4a, the annotated circle shows a series connection in which Inception F can be interpreted as a second unit that is in series with the input and extracts the features from the input.);
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Zhang to Feichtenhofer in order to extract the second feature based on the type of connection between the two networks. One skilled in the art would have been motivated to modify Feichtenhofer in this manner in order to reduce computational cost and memory demand by exploring structural diversity in designing deep networks. (Zhang, Abstract)
Regarding Claim 10, the combination of Feichtenhofer and Zhang  teaches the device of claim 9, where Feichtenhofer further teaches wherein the plurality of images are images processed by a second layer of the learning network. (Page 2, Right Column, Paragraph 2, These components are fed into separate deep ConvNet architectures, to learn spatial as well as temporal information about the appearance and movement of the objects in a scene. As seen in Figure 1, the figure shows the first three convolutional layers where each network is learning either the spatial feature or the temporal feature from the plurality of images.)
Regarding Claim 11, the combination of Feichtenhofer and Zhang teaches the device of claim 9, while Feichtenhofer teaches a parallel connection in which the second unit extracts the second features from the input (See Figure 1 of Feichtenhofer: the top network extracts the temporal features from the input images.).
Feichtenhofer does not explicitly teach wherein the type of the connection between the first and second units is selected from a group consisting of: a first series connection of the first series connection type in which the second unit at least extracts the second features from the first features, a second series connection of the second series connection type in which the second unit at least extracts the second features from the input,

[Image: media_image2.png]

Zhang teaches the type of the connection between the first and second units is selected from a group consisting of (Page 4, Right Column, Paragraph 4, In this study, we constructed a number of variant versions of IR 3-6-3. In each version, we choose one stage (from A, B, and C) and replace all the Inception residual units therein with one of the six PolyInception modules introduced above (2-way, 3-way, poly-2, poly-3, mpoly-2, and mpoly-3). The examiner interprets that the study is selecting from one of the PolyInception structures shown in figure 4.)
a first series connection of the first series connection type in which the second unit at least extracts the second features from the first features (As seen in Figure 4c, the examiner interprets Inception G to be the second unit and Inception F to be the first unit. The figure shows a series connection of Inception F and Inception G in which the features from Inception F are passed on to Inception G, so that Inception G can extract the second features from the first features),
[Image: media_image3.png]
a second series connection of the second series connection type in which the second unit at least extracts the second features from the input (As seen in Figure 4a, the annotated circle shows a series connection in which Inception F can be interpreted as a second unit that is in series with the input and extracts the features from the input.)
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Zhang to Feichtenhofer in order to extract the second feature based on the type of connection between the two networks. One skilled in the art would have been motivated to modify Feichtenhofer in this manner in order to reduce computational cost and memory demand by exploring structural diversity in designing deep networks. (Zhang, Abstract)
Regarding Claim 12, the combination of Feichtenhofer and Zhang teaches the device of claim 9, where Feichtenhofer further teaches wherein generating the spatial-temporal feature representation comprises: in response to the type of the connection being a second series connection of the second series connection type or a parallel connection of the parallel connection type (Figure 1 shows the parallel connection between two separate networks configured to extract either temporal or spatial features from the plurality of images), generating the spatial-temporal feature representation by combining the first features and second features (Page 2, Right Column, Paragraph 2, The method first decomposes video into spatial and temporal components by using RGB and optical flow frames. These components are fed into separate deep ConvNet architectures, to learn spatial as well as temporal information about the appearance and movement of the objects in a scene. Each stream is performing video recognition on its own and for final classification, softmax scores are combined by late fusion. The examiner interprets that the spatial and temporal features extracted from the neural networks are then fused such that the resulting spatial-temporal feature can be output to perform video recognition for final classification).
Regarding Claim 13, the combination of Feichtenhofer and Zhang teaches the device of claim 9, where Feichtenhofer further teaches wherein generating the spatial-temporal feature representation further comprises: generating the spatial-temporal feature representation further based on the input (Page 2, Right Column, Paragraph 2, These components are fed into separate deep ConvNet architectures, to learn spatial as well as temporal information about the appearance and movement of the objects in a scene. Each stream is performing video recognition on its own and for final classification. The examiner interprets that the input frames fed into the neural network as shown in Figure 1 are used to generate the spatial-temporal feature such that it can be used to perform video recognition from the input frames.).
Regarding Claim 15, Feichtenhofer teaches a computer program product being stored on a computer-readable medium and comprising machine-executable instructions which, when executed by a device, cause the device to (See Acknowledgment section: Feichtenhofer discloses the use of GPUs, and it is inherent that memory is coupled to the GPU.): receive an input at a first layer of a learning network, the input comprising a plurality of images (Page 1, Right Column, Paragraph 2, The two-stream architecture [22] incorporates motion information by training separate ConvNets for both appearance in still images and stacks of optical flow. Indeed, this work showed that optical flow information alone was sufficient to discriminate most of the actions in UCF101.); extracting, using a first unit of the first layer, first features of the plurality of images from the input in a spatial dimension, the first features characterizing a spatial presentation of the plurality of images (Page 2, Figure 1. Example outputs of the first three convolutional layers from a two-stream ConvNet model [22]. The two networks separately capture spatial (appearance) and temporal information at a fine temporal scale.); extracting, in a temporal dimension, the second features at least characterizing temporal changes across the plurality of images (As seen in Figure 1, the top network shows the neural network extracting the temporal features from the plurality of images.);
determining a type of a connection between the first unit and a second unit of the first layer is a first series connection type, a second series connection type, or a parallel connection type (The examiner interprets Figure 1 of Feichtenhofer as showing a parallel connection type between the first and second units of the first layer of the network.);
and extracting the second features from the input for the parallel connection type (See Figure 1 of Feichtenhofer: the top network extracts the temporal features from the input images.); and generating a spatial-temporal feature representation of the plurality of images based at least in part on the second features (Page 2, Right Column, Paragraph 2, These components are fed into separate deep ConvNet architectures, to learn spatial as well as temporal information about the appearance and movement of the objects in a scene. Each stream is performing video recognition on its own and for final classification).
Feichtenhofer does not explicitly teach wherein the second unit extracts the second features, and extracting the second feature comprises: extracting at least the second features from the first features for a first series connection type; extracting at least the second features from the input for a second series connection type; 

[Image: media_image2.png]
Zhang teaches wherein the second unit extracts the second features, and extracting the second feature (Figure 4 shows examples of PolyInception structures in which Inception blocks extract features from a first Inception block.) comprises:
extracting at least the second features from the first features for the first series connection type (As seen in Figure 4c, the examiner interprets Inception G to be the second unit and Inception F to be the first unit. The figure shows a series connection of Inception F and Inception G in which the features from Inception F are passed on to Inception G, so that Inception G can extract the second features from the first features);
 
[Image: media_image3.png]
extracting at least the second features from the input for the second series connection type (As seen in Figure 4a, the annotated circle shows a series connection in which Inception F can be interpreted as a second unit that is in series with the input and extracts the features from the input.);
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Zhang to Feichtenhofer in order to extract the second feature based on the type of connection between the two networks. One skilled in the art would have been motivated to modify Feichtenhofer in this manner in order to reduce computational cost and memory demand by exploring structural diversity in designing deep networks. (Zhang, Abstract)
Claims 6 and 7 are rejected under 35 U.S.C. 103 as being unpatentable over Feichtenhofer et al. ("Convolutional Two-Stream Network Fusion for Video Action Recognition" as cited by applicant in IDS filed 04/21/2020) in view of Zhang et al. ("PolyNet: A Pursuit of Structural Diversity in Very Deep Networks", as cited by applicant in IDS filed 04/21/2020) in further view of He et al. ("Deep Residual Learning for Image Recognition", as cited by applicant in IDS filed on 04/21/2020).
Regarding Claim 6, the combination of Feichtenhofer and Zhang teaches the method of claim 1, while Feichtenhofer teaches extracting first features from the input (See Figure 1: the examiner interprets the first features to be spatial features in an image, and the spatial stream is used to extract those spatial features from the input).
However, they do not explicitly teach wherein the input has a first number of dimensions and reducing the dimensions of the input from the first number to a second number;
He teaches wherein the input has a first number of dimensions and reducing the dimensions of the input from the first number to a second number (Page 6, Right Column, Paragraph 3, The three layers are 1×1, 3×3, and 1×1 convolutions, where the 1×1 layers are responsible for reducing and then increasing (restoring) dimensions, leaving the 3×3 layer a bottleneck with smaller input/output dimensions. The examiner interprets that the incoming input has a first number of dimensions and that the 1×1 convolution layer is responsible for reducing the input to a second number).
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of He to Feichtenhofer and Zhang in order to reduce the dimension size of the input. One skilled in the art would have been motivated to modify Feichtenhofer and Zhang in this manner in order to train deeper neural networks by presenting a residual learning framework. (He, Abstract)
Regarding Claim 7, the combination of Feichtenhofer, Zhang and He teaches the method of claim 6, where Feichtenhofer further teaches generating the spatial-temporal feature representation (Page 2, Right Column, Paragraph 2, The method first decomposes video into spatial and temporal components by using RGB and optical flow frames. These components are fed into separate deep ConvNet architectures, to learn spatial as well as temporal information about the appearance and movement of the objects in a scene. Each stream is performing video recognition on its own and for final classification, softmax scores are combined by late fusion. The examiner interprets that the spatial and temporal features extracted from the neural networks are then fused such that the resulting spatial-temporal feature can be output to perform video recognition for final classification).
He further teaches wherein the second features have a third number of dimensions, and further increasing the dimensions of the second features from the third number to a fourth number (Page 6, Right Column, Paragraph 3, The three layers are 1×1, 3×3, and 1×1 convolutions, where the 1×1 layers are responsible for reducing and then increasing (restoring) dimensions, leaving the 3×3 layer a bottleneck with smaller input/output dimensions. The examiner interprets that the second feature will have a certain number of dimensions and by using the bottleneck architecture described in the prior art, the 1x1 convolution layer can be used to increase the dimensions of the second feature to another number.); 
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of He to Feichtenhofer and Zhang in order to increase the dimension size of the second feature. One skilled in the art would have been motivated to modify Feichenhofer and Zhang in this manner in order to train deeper neural networks by presenting a residual learning framework. (He, Abstract)
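For illustration only, the late fusion of softmax scores cited from Feichtenhofer for claim 7 can be sketched as follows. The cited passage states only that the softmax scores of the two streams are combined by late fusion; the equal-weight averaging used here, the function names, and the logit values are assumptions for the example.

```python
import math

# Illustrative sketch only: averaging is an assumed fusion rule, and the
# logits below are made-up values, not results from any cited reference.

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(spatial_logits, temporal_logits):
    """Average the per-class softmax scores of the two streams."""
    s = softmax(spatial_logits)
    t = softmax(temporal_logits)
    return [(a + b) / 2.0 for a, b in zip(s, t)]


spatial_logits = [2.0, 1.0, 0.1]   # hypothetical RGB-stream scores
temporal_logits = [0.5, 2.5, 0.2]  # hypothetical optical-flow-stream scores

fused = late_fusion(spatial_logits, temporal_logits)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

With these assumed values the spatial stream alone favors class 0 and the temporal stream favors class 1; the fused scores favor class 1.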
Claims 8 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Feichtenhofer et al. ("Convolutional Two-Stream Network Fusion for Video Action Recognition" as cited by applicant in IDS filed 04/21/2020) in view of Zhang et al. ("PolyNet: A Pursuit of Structural Diversity in Very Deep Networks", as cited by applicant in IDS filed 04/21/2020) in further view of Li et al. ("Action Recognition by Learning Deep Multi-Granular Spatio-Temporal Video Representation").
Regarding Claim 8, the combination of Feichtenhofer and Zhang teaches the method of claim 1. While Feichtenhofer teaches extracting spatial and temporal features using a first and second unit (as seen in Figure 1, which shows the parallel connection between two separate networks configured to extract either temporal or spatial features from the plurality of images), they do not explicitly teach: extracting third features of the plurality of images from the first intermediate feature representation in the spatial dimension using a third unit of a third layer in the learning network, the third features characterizing the spatial presentation of the plurality of images, extracting, in the temporal dimension, the fourth features at least characterizing temporal changes across the plurality of images,
and generating the spatial-temporal feature representation based at least in part on the fourth features.

[Figure: media_image4.png]

Li teaches extracting third features of the plurality of images in the spatial dimension using a third unit of a third layer in the learning network, the third features characterizing the spatial presentation of the plurality of images, and extracting, in the temporal dimension, the fourth features at least characterizing temporal changes across the plurality of images (As seen in Figure 5 of the prior art, the examiner interprets that the prior art has multiple layers, so any layer can be interpreted as a third or fourth layer; the 3D or 2D CNNs extract the spatial features from the plurality of images and the LSTM layer extracts the temporal features from the plurality of images),
and generating the spatial-temporal feature representation based at least in part on the fourth features (As seen in Figure 5 of the prior art, the extracted features are input into a softmax layer, and the output of the softmax layer is used for sorting in order to determine S*, which is the final action recognition result for the input video).
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Li to Feichtenhofer and Zhang in order to have a third and fourth unit to extract the spatial and temporal features from the plurality of images and generate a spatial-temporal feature representation. One skilled in the art would have been motivated to modify Feichtenhofer and Zhang in this manner in order to accurately recognize action in videos using a convolutional neural network. (Li, Abstract)
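For illustration only, the pipeline the examiner reads onto Li (per-frame spatial feature extraction, followed by a recurrent layer for temporal features, followed by a softmax layer) can be sketched as follows. This is a heavily simplified sketch under stated assumptions: `spatial_features` is a stand-in for a CNN, `recurrent_step` is a plain tanh recurrence substituted for an LSTM cell, and all names, weights, and frame values are hypothetical. The sorting step that produces S* is not modeled.

```python
import math

# Illustrative sketch only: simple stand-ins replace the CNN and LSTM,
# and all numeric values are assumptions.

def spatial_features(frame):
    """Stand-in for a CNN: summarize one frame as [mean, max] intensity."""
    pixels = [p for row in frame for p in row]
    return [sum(pixels) / len(pixels), max(pixels)]


def recurrent_step(state, features, decay=0.5):
    """Stand-in for an LSTM cell: blend the previous state with new features."""
    return [math.tanh(decay * s + f) for s, f in zip(state, features)]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(v - m) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_clip(frames, class_weights):
    """Run spatial extraction per frame, accumulate over time, then classify."""
    state = [0.0, 0.0]
    for frame in frames:
        state = recurrent_step(state, spatial_features(frame))
    scores = [sum(w * s for w, s in zip(ws, state)) for ws in class_weights]
    return softmax(scores), state


# Three hypothetical 2x2 greyscale frames that brighten over time.
frames = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[0.2, 0.2], [0.2, 0.2]],
    [[1.0, 1.0], [1.0, 1.0]],
]
class_weights = [[1.0, 1.0], [-1.0, -1.0]]  # assumed two-class readout

probs, final_state = classify_clip(frames, class_weights)
_, reversed_state = classify_clip(list(reversed(frames)), class_weights)
```

Because the recurrence carries state from frame to frame, playing the same frames in reverse order yields a different final state, which is the sense in which the recurrent stage captures temporal change rather than only per-frame appearance.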

[Figure: media_image5.png]
However, they do not teach generating a first intermediate feature representation for the first layer based at least in part on the second features and based on a type of a connection between the third unit and a fourth unit of the third layer and using the fourth unit, the type of the connection between the third and fourth units being different from the type of the connection between the first and second units. Zhang teaches generating a first intermediate feature representation for the first layer based at least in part on the second features (As seen in Figure 5, at the annotated arrow pointing to the annotated circle, the examiner interprets the first layer to be stage A and the second layer to be stage B. From the figure, stage B learns the features from stage A and then outputs the learned features of stages A and B to stage C, and the examiner interprets this process as generating the intermediate feature representation of the first layer based on the second features of stage B)
[Figure: media_image2.png]
based on a type of a connection between the third unit and a fourth unit of the third layer and using the fourth unit, the type of the connection between the third and fourth units being different from the type of the connection between the first and second units (As seen in Figure 4, the examiner interprets that Zhang shows various different connections between units, and it would have been obvious that, by adding the teaching of Zhang, the units can each have different types of connections).
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Zhang to Feichtenhofer and Li in order to generate an intermediate feature representation of the first and second units and provide different types of connections between the third and fourth units. One skilled in the art would have been motivated to modify Feichtenhofer and Li in this manner in order to reduce computational cost and memory demand by exploring structural diversity in designing deep networks. (Zhang, Abstract)
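For illustration only, the distinction between connection types relied on above can be sketched with two generic units composed in series or in parallel. This is a generic sketch and does not reproduce the specific structures of Zhang's Figure 4: the unit functions, their arithmetic, and the identity shortcut are all assumptions made for the example.

```python
# Illustrative sketch only: the two units and the identity shortcut are
# hypothetical and are not taken from Zhang or from the application.

def spatial_unit(x):
    """Hypothetical first unit."""
    return [2.0 * v for v in x]


def temporal_unit(x):
    """Hypothetical second unit."""
    return [v + 1.0 for v in x]


def series_connection(x):
    """The second unit consumes the first unit's output."""
    first = spatial_unit(x)
    second = temporal_unit(first)
    return [a + b for a, b in zip(x, second)]  # assumed identity shortcut


def parallel_connection(x):
    """Both units consume the same input; their outputs are summed."""
    first = spatial_unit(x)
    second = temporal_unit(x)
    return [a + b + c for a, b, c in zip(x, first, second)]


x = [1.0, 2.0]
series_out = series_connection(x)      # [4.0, 7.0]
parallel_out = parallel_connection(x)  # [5.0, 9.0]
```

The same two units produce different outputs depending only on how they are wired, which is the property at issue when one layer uses a different connection type from another.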
Regarding Claim 14, the combination of Feichtenhofer and Zhang teaches the device of claim 9. While Feichtenhofer teaches extracting spatial and temporal features using a first and second unit (as seen in Figure 1, which shows the parallel connection between two separate networks configured to extract either temporal or spatial features from the plurality of images), they do not explicitly teach: extracting third features of the plurality of images from the first intermediate feature representation in the spatial dimension using a third unit of a third layer in the learning network, the third features characterizing the spatial presentation of the plurality of images, extracting, in the temporal dimension, the fourth features at least characterizing temporal changes across the plurality of images,
and generating the spatial-temporal feature representation based at least in part on the fourth features.

[Figure: media_image4.png]

Li teaches extracting third features of the plurality of images in the spatial dimension using a third unit of a third layer in the learning network, the third features characterizing the spatial presentation of the plurality of images, and extracting, in the temporal dimension, the fourth features at least characterizing temporal changes across the plurality of images (As seen in Figure 5 of the prior art, the examiner interprets that the prior art has multiple layers, so any layer can be interpreted as a third or fourth layer; the 3D or 2D CNNs extract the spatial features from the plurality of images and the LSTM layer extracts the temporal features from the plurality of images),
and generating the spatial-temporal feature representation based at least in part on the fourth features (As seen in Figure 5 of the prior art, the extracted features are input into a softmax layer, and the output of the softmax layer is used for sorting in order to determine S*, which is the final action recognition result for the input video).
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Li to Feichtenhofer and Zhang in order to have a third and fourth unit to extract the spatial and temporal features from the plurality of images and generate a spatial-temporal feature representation. One skilled in the art would have been motivated to modify Feichtenhofer and Zhang in this manner in order to accurately recognize action in videos using a convolutional neural network. (Li, Abstract)
However, they do not teach generating a first intermediate feature representation for the first layer based at least in part on the second features and based on a type of a connection between the third unit and a fourth unit of the third layer and using the fourth unit, the type of the connection between the third and fourth units being different from the type of the connection between the first and second units.

[Figure: media_image5.png]
Zhang teaches generating a first intermediate feature representation for the first layer based at least in part on the second features (As seen in Figure 5, at the annotated arrow pointing to the annotated circle, the examiner interprets the first layer to be stage A and the second layer to be stage B. From the figure, stage B learns the features from stage A and then outputs the learned features of stages A and B to stage C, and the examiner interprets this process as generating the intermediate feature representation of the first layer based on the second features of stage B)
[Figure: media_image2.png]
based on a type of a connection between the third unit and a fourth unit of the third layer and using the fourth unit, the type of the connection between the third and fourth units being different from the type of the connection between the first and second units (As seen in Figure 4, the examiner interprets that Zhang shows various different connections between units, and it would have been obvious that, by adding the teaching of Zhang, the units can each have different types of connections).
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Zhang to Feichtenhofer and Li in order to generate an intermediate feature representation of the first and second units and provide different types of connections between the third and fourth units. One skilled in the art would have been motivated to modify Feichtenhofer and Li in this manner in order to reduce computational cost and memory demand by exploring structural diversity in designing deep networks. (Zhang, Abstract)
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HAN D HOANG whose telephone number is (571)272-4344. The examiner can normally be reached Monday-Friday 8-5.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Claire X. Wang, can be reached at (571) 270-1051. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/HAN HOANG/Examiner, Art Unit 2663
/CLAIRE X WANG/Supervisory Patent Examiner, Art Unit 2663