DETAILED ACTION

This action is in response to applicant’s amendment/arguments filed on 4/25/2022. This action is made FINAL.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant's arguments filed 4/25/2022 have been fully considered but they are not persuasive.
Applicant argues “…the cited portions of Wang describe individual event identification and not identification and development to of a series of steps associated with a sequence of events within a video. For example, cited FIG. 1 of Wang illustrates identification of individual events, including camera adjustment (a), hair being blow dried (b), and hair “mold[ing]” (c). See Wang, FIG. 1. Applicant respectfully submits that this cited portion shows identification of a singular task and “steps of a process,” as recited in claim 1. For example, if the steps were presented in text form to a robot, the robot would not know what it means to blow dry hair. In sharp contrast, the robot would need steps, such as turning on the hair drier, pointing an end toward a portion of hair, moving the drier so as to not burn a target, etc. Such a series of steps is described as being different from the singular event isolation from the cited portions of Wang in Applicant’s specification, which notes “without context there may be many occurrences listed that may all correspond to a single step, but may provide an overwhelming amount of textual instruction.” Specification, § [0050]. This is the situation, shown in the cited portions of Wang, that the instant application attempts to avoid. Providing a singular label of an action is not equivalent to the claimed “steps of a process performed in a video.”…” [Emphasis Added].
	Examiner respectfully disagrees.
In response to applicant's argument that the references fail to show certain features of applicant’s invention, it is noted that the features upon which applicant relies (i.e., “series of steps associated with a sequence of events”) are not recited in the rejected claim(s).  Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims.  See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993).
Furthermore, applicant appears to interpret the claimed “steps” and “process” too narrowly. Under the Broadest Reasonable Interpretation (BRI), Examiner interprets these recitations to cover any and all steps and any and all processes, whether trivial, mundane, or specific. The claims do not recite any bounds on the “steps” and “process” for which applicant asserts novelty and non-obviousness.
	Therefore, Wang’s disclosed “individual events” are read not only as the claimed “steps” but also as a “series of steps,” owing to the chronology defined by “(a),” “(b),” “(c),” etc. They are also read as the claimed “process,” since the individual events combine to perform a process, whether that process consists of a single event/label or of multiple events/labels.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-30 are rejected under 35 U.S.C. 103 as being unpatentable over Wang, Jingwen, et al. "Bidirectional attentive fusion with context gating for dense video captioning." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
	Consider claim 1, Wang discloses a processor, comprising: one or more circuits to generate, using one or more neural networks (see fig. 2 and section 3.1; “…each video frame is encoded by the 3D CNN…”), text indicating steps of a process performed in a video (see figs. 1 and 2 and sections 3.1 and 3.2; “…a recurrent neural network, specifically LSTM, is leveraged in our captioning module to translate visual input into a sentence…” “…The extracted C3D features are of temporal resolution δ = 16 frames, discretizing the input stream into T = L/δ time steps…” also see fig. 1 “(a)” “(b)” “(c)”).
However, Wang does not explicitly disclose a processor, system, hardware, or method.
Nevertheless, Wang discloses video captioning using a recurrent neural network (see fig. 2 and section 3).
Therefore, it would have been obvious to one of ordinary skill in the art at a time before the effective filing date of the claimed subject matter to combine the neural network for video captioning with existing hardware and processor technology in order to physically realize the discussed implementation and yield a predictable result.
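For illustration only, the discretization quoted from Wang for claim 1 (temporal resolution δ = 16 frames, T = L/δ time steps) can be sketched as follows. The function names are hypothetical, and dropping a trailing partial window through floor division is an assumption; the quoted passage does not say how a remainder is handled.

```python
DELTA = 16  # temporal resolution in frames, as quoted from Wang


def num_time_steps(num_frames: int, delta: int = DELTA) -> int:
    """Return T = L / delta for a stream of L frames.

    Assumption: a trailing partial window is dropped (floor division).
    """
    if num_frames < 0 or delta <= 0:
        raise ValueError("num_frames must be >= 0 and delta must be > 0")
    return num_frames // delta


def step_frame_ranges(num_frames: int, delta: int = DELTA) -> list:
    """Return the half-open frame range [start, end) covered by each time step."""
    steps = num_time_steps(num_frames, delta)
    return [(t * delta, (t + 1) * delta) for t in range(steps)]
```

Under these assumptions, a 48-frame stream is discretized into three time steps covering frames 0-15, 16-31, and 32-47.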
Consider claim 2 as applied to respective claim, Wang as modified discloses the one or more neural networks determine content of individual frames of the video and determine bi-directional context for the determined content (see fig. 2; section 3.1; “…Our proposed bidirectional proposal module also involves a backward pass. The aim of such a procedure is to capture future context, in addition to current event clue for better event proposals.…”).
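For illustration only, the forward and backward passes quoted above can be pictured with a toy stand-in in which each time step receives one summary of the past and one summary of the future. The running mean used here replaces Wang's LSTM sequence encoders, and scalar features replace the C3D features; both substitutions are assumptions made for brevity, and the function names are hypothetical.

```python
from typing import List, Tuple


def running_means(values: List[float]) -> List[float]:
    """Mean of all values seen so far, at each position (toy stand-in for an encoder state)."""
    means = []
    total = 0.0
    for count, value in enumerate(values, start=1):
        total += value
        means.append(total / count)
    return means


def bidirectional_context(features: List[float]) -> List[Tuple[float, float]]:
    """Pair each time step with (past context, future context).

    The forward pass summarizes steps 0..t; the backward pass runs over the
    reversed sequence and so summarizes steps t..end.
    """
    forward = running_means(features)
    backward = running_means(features[::-1])[::-1]
    return list(zip(forward, backward))
```

With features `[1.0, 2.0, 3.0]`, the middle time step receives a past context of 1.5 and a future context of 2.5, so its representation reflects both earlier and later content.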
Consider claim 3 as applied to respective claim, Wang as modified discloses the one or more neural networks utilize the bi-directional context to divide the individual frames into a sequence of logical steps, corresponding to the steps of the process performed in the video (see figs. 1 and 2; section 2; “(a)” “(b)” “(c)” “…dense video captioning generates multiple sentences and grounds them with time locations automatically…” “…The visual features are then fed into our bidirectional sequence encoder (e.g., LSTM). (c) Each hidden state from the forward/backward seq. encoder will be fed into the proposal module. The forward/backward seq. encoders are jointly learned to make proposal predictions…”).
Consider claim 4 as applied to respective claim, Wang as modified discloses the one or more neural networks include one or more convolutional neural networks (CNNs) for determining the content of the individual frames, and one or more attention networks for dividing the individual frames into the sequence of logical steps (see section 3.1 and fig. 2; “…each video frame is encoded by the 3D CNN…” “…Hidden states at boundary of a detected event ( → hn, ← hm) will be served as context vectors for the event. The context vectors and detected event clip features are then fused together and served as visual information input. We detail the fusion methods in Section 3.2.2. (e) The decoder LSTM translates visual input into a sentence.…”).
Consider claim 5 as applied to respective claim, Wang as modified discloses the one or more neural networks are further to infer the text for the steps of the process using the bi-directional context and the determined content of the individual frames corresponding to the steps (see fig. 2 and section 3.2; “…then fed into our bidirectional sequence encoder (e.g., LSTM).…” “Captioning Module”).
Consider claim 6 as applied to respective claim, Wang as modified discloses the one or more circuits are further to translate or compile the text into a format useful for performing the process (see fig. 1; “(a)” “(b)” “(c)” and section 3.2; “…our captioning module to translate visual input into a sentence…”).
Consider claim 7, Wang discloses a system comprising: one or more processors to generate, using one or more neural networks, text indicating steps of a process performed in a video (see fig. 2 and section 3.2; “…a recurrent neural network, specifically LSTM, is leveraged in our captioning module to translate visual input into a sentence…” “…The extracted C3D features are of temporal resolution δ = 16 frames, discretizing the input stream into T = L/δ time steps…” also see fig. 1 “(a)” “(b)” “(c)”).
However, Wang does not explicitly disclose a processor, system, hardware, or method.
Nevertheless, Wang discloses video captioning using a recurrent neural network (see fig. 2 and section 3).
Therefore, it would have been obvious to one of ordinary skill in the art at a time before the effective filing date of the claimed subject matter to combine the neural network for video captioning with existing hardware and processor technology in order to physically realize the discussed implementation and yield a predictable result.
Consider claim 8 as applied to respective claim, Wang as modified discloses the one or more neural networks determine content of individual frames of the video and determine bi-directional context for the determined content (see fig. 2; section 3.1; “…Our proposed bidirectional proposal module also involves a backward pass. The aim of such a procedure is to capture future context, in addition to current event clue for better event proposals.…”).
Consider claim 9 as applied to respective claim, Wang as modified discloses the one or more neural networks utilize the bi-directional context to divide the individual frames into a sequence of logical steps, corresponding to the steps of the process performed in the video (see figs. 1 and 2; section 2; “(a)” “(b)” “(c)” “…dense video captioning generates multiple sentences and grounds them with time locations automatically…” “…The visual features are then fed into our bidirectional sequence encoder (e.g., LSTM). (c) Each hidden state from the forward/backward seq. encoder will be fed into the proposal module. The forward/backward seq. encoders are jointly learned to make proposal predictions…”).
Consider claim 10 as applied to respective claim, Wang as modified discloses the one or more neural networks include one or more convolutional neural networks (CNNs) for determining the content of the individual frames, and one or more attention networks for dividing the individual frames into the sequence of logical steps (see section 3.1 and fig. 2; “…each video frame is encoded by the 3D CNN…” “…Hidden states at boundary of a detected event ( → hn, ← hm) will be served as context vectors for the event. The context vectors and detected event clip features are then fused together and served as visual information input. We detail the fusion methods in Section 3.2.2. (e) The decoder LSTM translates visual input into a sentence.…”).
Consider claim 11 as applied to respective claim, Wang as modified discloses the one or more neural networks are further to infer the text for the steps of the process using the bi-directional context and the determined content of the individual frames corresponding to the steps (see fig. 2 and section 3.2; “…then fed into our bidirectional sequence encoder (e.g., LSTM).…” “Captioning Module”).
Consider claim 12 as applied to respective claim, Wang as modified discloses the one or more processors are further to translate or compile the text into a format useful for performing the process (see fig. 1; “(a)” “(b)” “(c)” and section 3.2; “…our captioning module to translate visual input into a sentence…”).
Consider claim 13, Wang discloses a method comprising: generating, using one or more neural networks, text indicating steps of a process performed in a video (see fig. 2 and section 3.2; “…a recurrent neural network, specifically LSTM, is leveraged in our captioning module to translate visual input into a sentence…” “…The extracted C3D features are of temporal resolution δ = 16 frames, discretizing the input stream into T = L/δ time steps…” also see fig. 1 “(a)” “(b)” “(c)”).
However, Wang does not explicitly disclose a processor, system, hardware, or method.
Nevertheless, Wang discloses video captioning using a recurrent neural network (see fig. 2 and section 3).
Therefore, it would have been obvious to one of ordinary skill in the art at a time before the effective filing date of the claimed subject matter to combine the neural network for video captioning with existing hardware and processor technology in order to physically realize the discussed implementation and yield a predictable result.
Consider claim 14 as applied to respective claim, Wang as modified discloses the one or more neural networks determine content of individual frames of the video and determine bi-directional context for the determined content (see fig. 2; section 3.1; “…Our proposed bidirectional proposal module also involves a backward pass. The aim of such a procedure is to capture future context, in addition to current event clue for better event proposals.…”).
Consider claim 15 as applied to respective claim, Wang as modified discloses the one or more neural networks utilize the bi-directional context to divide the individual frames into a sequence of logical steps, corresponding to the steps of the process performed in the video (see figs. 1 and 2; section 2; “(a)” “(b)” “(c)” “…dense video captioning generates multiple sentences and grounds them with time locations automatically…” “…The visual features are then fed into our bidirectional sequence encoder (e.g., LSTM). (c) Each hidden state from the forward/backward seq. encoder will be fed into the proposal module. The forward/backward seq. encoders are jointly learned to make proposal predictions…”).
Consider claim 16 as applied to respective claim, Wang as modified discloses the one or more neural networks include one or more convolutional neural networks (CNNs) for determining the content of the individual frames, and one or more attention networks for dividing the individual frames into the sequence of logical steps (see section 3.1 and fig. 2; “…each video frame is encoded by the 3D CNN…” “…Hidden states at boundary of a detected event ( → hn, ← hm) will be served as context vectors for the event. The context vectors and detected event clip features are then fused together and served as visual information input. We detail the fusion methods in Section 3.2.2. (e) The decoder LSTM translates visual input into a sentence.…”).
Consider claim 17 as applied to respective claim, Wang as modified discloses the one or more neural networks are further to infer the text for the steps of the process using the bi-directional context and the determined content of the individual frames corresponding to the steps (see fig. 2 and section 3.2; “…then fed into our bidirectional sequence encoder (e.g., LSTM).…” “Captioning Module”).
Consider claim 18 as applied to respective claim, Wang as modified discloses translating the text into a format useful for performing the process (see fig. 1; “(a)” “(b)” “(c)” and section 3.2; “…our captioning module to translate visual input into a sentence…”).
Consider claim 19, Wang discloses a machine-readable medium having stored thereon a set of instructions, which if performed by one or more processors, cause the one or more processors to at least: generate, using one or more neural networks, text indicating steps of a process performed in a video (see fig. 2 and section 3.2; “…a recurrent neural network, specifically LSTM, is leveraged in our captioning module to translate visual input into a sentence…” “…The extracted C3D features are of temporal resolution δ = 16 frames, discretizing the input stream into T = L/δ time steps…” also see fig. 1 “(a)” “(b)” “(c)”).
However, Wang does not explicitly disclose a processor, system, hardware, or method.
Nevertheless, Wang discloses video captioning using a recurrent neural network (see fig. 2 and section 3).
Therefore, it would have been obvious to one of ordinary skill in the art at a time before the effective filing date of the claimed subject matter to combine the neural network for video captioning with existing hardware and processor technology in order to physically realize the discussed implementation and yield a predictable result.
Consider claim 20 as applied to respective claim, Wang as modified discloses the one or more neural networks determine content of individual frames of the video and determine bi-directional context for the determined content (see fig. 2; section 3.1; “…Our proposed bidirectional proposal module also involves a backward pass. The aim of such a procedure is to capture future context, in addition to current event clue for better event proposals.…”).
Consider claim 21 as applied to respective claim, Wang as modified discloses the one or more neural networks utilize the bi-directional context to divide the individual frames into a sequence of logical steps, corresponding to the steps of the process performed in the video (see figs. 1 and 2; section 2; “(a)” “(b)” “(c)” “…dense video captioning generates multiple sentences and grounds them with time locations automatically…” “…The visual features are then fed into our bidirectional sequence encoder (e.g., LSTM). (c) Each hidden state from the forward/backward seq. encoder will be fed into the proposal module. The forward/backward seq. encoders are jointly learned to make proposal predictions…”).
Consider claim 22 as applied to respective claim, Wang as modified discloses the one or more neural networks include one or more convolutional neural networks (CNNs) for determining the content of the individual frames, and one or more attention networks for dividing the individual frames into the sequence of logical steps (see section 3.1 and fig. 2; “…each video frame is encoded by the 3D CNN…” “…Hidden states at boundary of a detected event ( → hn, ← hm) will be served as context vectors for the event. The context vectors and detected event clip features are then fused together and served as visual information input. We detail the fusion methods in Section 3.2.2. (e) The decoder LSTM translates visual input into a sentence.…”).
Consider claim 23 as applied to respective claim, Wang as modified discloses the one or more neural networks are further to infer the text for the steps of the process using the bi-directional context and the determined content of the individual frames corresponding to the steps (see fig. 2 and section 3.2; “…then fed into our bidirectional sequence encoder (e.g., LSTM).…” “Captioning Module”).
Consider claim 24 as applied to respective claim, Wang as modified discloses the instructions if executed further cause the one or more processors to translate or compile the text into a format useful for performing the process (see fig. 1; “(a)” “(b)” “(c)” and section 3.2; “…our captioning module to translate visual input into a sentence…”).
Consider claim 25, Wang discloses a code generation system, comprising: one or more processors to generate, using one or more neural networks, text indicating steps of a process performed in a video; a compiler for generating executable code from the text; and memory for storing network parameters for the one or more neural networks (see fig. 2 and section 3.2; “…a recurrent neural network, specifically LSTM, is leveraged in our captioning module to translate visual input into a sentence…” “…The extracted C3D features are of temporal resolution δ = 16 frames, discretizing the input stream into T = L/δ time steps…” also see fig. 1 “(a)” “(b)” “(c)”).
However, Wang does not explicitly disclose a compiler, processor, system, hardware, or method.
Nevertheless, Wang discloses video captioning using a recurrent neural network (see fig. 2, section 3, and fig. 1 “(a)” “(b)” “(c)”).
Therefore, it would have been obvious to one of ordinary skill in the art at a time before the effective filing date of the claimed subject matter to combine the neural network for video captioning with existing hardware and processor technology in order to physically realize the discussed implementation and yield a predictable result.
Consider claim 26 as applied to respective claim, Wang as modified discloses the one or more neural networks determine content of individual frames of the video and determine bi-directional context for the determined content (see fig. 2; section 3.1; “…Our proposed bidirectional proposal module also involves a backward pass. The aim of such a procedure is to capture future context, in addition to current event clue for better event proposals.…”).
Consider claim 27 as applied to respective claim, Wang as modified discloses the one or more neural networks utilize the bi-directional context to divide the individual frames into a sequence of logical steps, corresponding to the steps of the process performed in the video (see figs. 1 and 2; section 2; “(a)” “(b)” “(c)” “…dense video captioning generates multiple sentences and grounds them with time locations automatically…” “…The visual features are then fed into our bidirectional sequence encoder (e.g., LSTM). (c) Each hidden state from the forward/backward seq. encoder will be fed into the proposal module. The forward/backward seq. encoders are jointly learned to make proposal predictions…”).
Consider claim 28 as applied to respective claim, Wang as modified discloses the one or more neural networks include one or more convolutional neural networks (CNNs) for determining the content of the individual frames, and one or more attention networks for dividing the individual frames into the sequence of logical steps (see section 3.1 and fig. 2; “…each video frame is encoded by the 3D CNN…” “…Hidden states at boundary of a detected event ( → hn, ← hm) will be served as context vectors for the event. The context vectors and detected event clip features are then fused together and served as visual information input. We detail the fusion methods in Section 3.2.2. (e) The decoder LSTM translates visual input into a sentence.…”).
Consider claim 29 as applied to respective claim, Wang as modified discloses the one or more neural networks are further to infer the text for the steps of the process using the bi-directional context and the determined content of the individual frames corresponding to the steps (see fig. 2 and section 3.2; “…then fed into our bidirectional sequence encoder (e.g., LSTM).…” “Captioning Module”).
Consider claim 30 as applied to respective claim, Wang as modified discloses the one or more processors are further to translate or compile the text into a format useful for performing the process (see fig. 1; “(a)” “(b)” “(c)” and section 3.2; “…our captioning module to translate visual input into a sentence…”).

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any response to this Office Action should be faxed to (571) 273-8300 or mailed to:
Commissioner for Patents
P.O. Box 1450
Alexandria, VA 22313-1450

Hand-delivered responses should be brought to:
Customer Service Window
Randolph Building
401 Dulany Street
Alexandria, VA 22314

	Any inquiry concerning this communication or earlier communications from the Examiner should be directed to Fayyaz Alam, whose telephone number is (571) 270-1102. The Examiner can normally be reached on Monday-Friday from 9:30am to 7:00pm.
If attempts to reach the Examiner by telephone are unsuccessful, the Examiner’s supervisor, Sumati Lefkowitz, can be reached on (571) 272-3638. The fax phone number for the organization where this application or proceeding is assigned is (571) 273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free) or 703-305-3028.
Any inquiry of a general nature or relating to the status of this application or proceeding should be directed to the receptionist/customer service whose telephone number is (571) 272-2600.

Fayyaz Alam


May 5, 2022

/FAYYAZ ALAM/
Primary Examiner, Art Unit 2662