DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 9/16/2022 has been entered.
Response to Amendment
The amendments, filed 9/16/2022, have been entered and made of record. Claims 1, 18, and 20 have been amended. Claims 1-26 are pending.
Response to Arguments
Applicant’s arguments in the Remarks filed on 9/16/2022 have been considered but are moot in view of the new ground(s) of rejection.
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
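The three-prong test and the two rebuttable presumptions set out above can be sketched as a short decision procedure. The sketch is illustrative only: the `Limitation` fields and the function names are hypothetical, and whether a limitation recites sufficient structure remains a legal determination that is supplied here as an input rather than computed.

```python
from dataclasses import dataclass


@dataclass
class Limitation:
    """Hypothetical summary of one claim limitation (field names are illustrative)."""
    uses_means_or_step: bool            # literal "means" or "step"
    uses_generic_placeholder: bool      # nonce term such as "unit" or "module"
    has_functional_language: bool       # e.g. "configured to ..."
    recites_sufficient_structure: bool  # structure, material, or acts for the function


def presumed_112f(lim: Limitation) -> bool:
    """Starting presumption: "means"/"step" used with functional language."""
    return lim.uses_means_or_step and lim.has_functional_language


def invokes_112f(lim: Limitation) -> bool:
    """Apply prongs (A), (B), and (C) after considering rebuttal of the presumption."""
    prong_a = lim.uses_means_or_step or lim.uses_generic_placeholder
    prong_b = lim.has_functional_language
    prong_c = not lim.recites_sufficient_structure
    return prong_a and prong_b and prong_c


# Example: a "unit configured to ..." limitation with no "means" and no recited structure.
unit = Limitation(
    uses_means_or_step=False,
    uses_generic_placeholder=True,
    has_functional_language=True,
    recites_sufficient_structure=False,
)
print(presumed_112f(unit), invokes_112f(unit))
```

In this example the presumption starts against 35 U.S.C. 112(f) treatment because the word "means" is absent, but the presumption is rebutted because function is recited without sufficient structure.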
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  Such claim limitation(s) is/are: 
“a receiving unit configured to receive video game data generated during the playing of a video game” in claim 20.
“a first and second module, each module being configured to receive a respective one of the video signal and the audio signal in the video game data and to determine, using a trained model, an event occurring within the video game based on only the respective one of the video signal and the audio signal input to that module” in claim 20.
“a highlight detector configured to detect, based on the events detected by each module, one or more highlight events occurring within the playing of the video game” in claim 20.
“a recording unit configured to generate a recording of the video game gameplay based on the output of the highlight detector” in claim 20.
“the highlight detector is configured to determine an event occurring within the video game based on the output of the first and second modules and the received telemetry signal” in claim 22.
“a plurality of feature extractors, each feature extractor being configured to receive a respective one of the video signal and the audio signal in the received video game data and to generate feature representations of the frames in such respective one of the video signal and the audio signal” in claim 23.
“at least the first and second modules are configured to receive the feature representations generated for a respective one of the video signal and the audio signal” in claim 23.
“a first feature extractor is configured to receive video frames and a second feature extractor is configured to receive audio frames” in claim 24.
“the first module is configured to receive feature representations of video frames and the second module is configured to receive feature representations of audio frames” in claim 24.
“each feature extractor being configured to receive a different signal in the previously generated video game data” in claim 25.
“each clustering unit being configured to receive the feature representations output by a different feature extractor and to use unsupervised learning to sort the received feature representations into a plurality of clusters” in claim 25.
“a labelling unit operable to generate labels for the clusters output by each clustering unit, the labelling unit being configured to generate the labels based on an input from a user, each label indicating an event associated with the frames or corresponding feature representations in a respective cluster” in claim 25.
“a training unit configured to train at least the first and second modules, the training unit being configured to determine a relationship between the frames or feature representations input to the first and second modules and the corresponding labels generated by the labelling unit” in claim 25.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Syed in view of Long and Cheng
Claims 1-8, 10, 12, 13, 18, and 20-25 are rejected under 35 U.S.C. 103 as being unpatentable over Syed et al. (US Pub. No. 2017/0228600; hereinafter Syed) in view of Long et al. (US Pub. No. 2017/0157512; hereinafter Long), further in view of Cheng et al. (US Pub. No. 2014/0328570; hereinafter Cheng).
As per claim 1, Syed teaches a method of generating a recording of video game gameplay, the method comprising: 
receiving video game data generated during the playing of a video game, the video game data comprising at least a video signal and corresponding audio signal, each signal comprising a plurality of frames(“Video input module 212 receives game data (in the form of game video and optionally game metadata) for a particular game video and routes appropriate portions to other functional modules for processing. For example, if game metadata associated with the particular game video is available, the game metadata may be sent directly to watchability scoring and highlight generation module 218. Meanwhile, video imagery and audio of the video at issue may be sent to game video feature extraction module 214” in Para.[0049]); 
inputting each respective one of the video signal and the audio signal in the received video game data into a respective machine learning model, each machine learning model having been trained to identify an event occurring within the video game based on the signals input to that model(“Game video feature extraction module 214 extracts various features from the video imagery and audio content of the video. According to one embodiment, game video feature extraction module 214 extracts several automatic video indexing features from the video imagery using one or more local invariant feature detectors as described in, for example, Tinne Tuytelaars and Krystian Mikolajczyk. “Local invariant feature detectors: a survey,” Found. Trends. Comput. Graph. Vis. vol. 3, no. 3 (July 2008), which is hereby incorporated by reference in its entirety for all purposes” in Para.[0050], “extracted video features (e.g., visual, audio and/or text features extracted by feature extraction module 214) and information regarding possible video games (e.g., game templates/specifications from VGDB 220) are received and compared. According to one embodiment, the video game shown in the video is identified by classifying the frames in a video using classifier models stored in VGDB 220. The classification may be carried out using Support Vector Machine (SVM) classifiers as described in, for example, M. A. Hearst, S. T. Dumais, E. Osman, J. Platt, B. Scholkopf, “Support vector machines,” Intelligent Systems and their Applications, IEEE, vol. 13, no. 4, pp. 18-28, 1998 (hereafter, Hearst, et al.”). Alternatively, Deep Neural Networks as described in, for example, A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” NIPS 2012: Neural Information Processing Systems, Lake Tahoe, Nev. (hereafter, Krizhevsky, et al.), or other classification techniques, such as those described in, for example, Xin Zhang, Yee-Hong Yang, Zhiguang Han, Hui Wang, and/or Chao Gao, “Object class detection: A survey,” ACM Comput. Surv. 46, 1, Article 10 (July 2013) can be used for this purpose. All of the foregoing documents are hereby incorporated by reference in their entirety for all purposes” in Para.[0061], “Once the game being shown in the video is recognized, at block 420, the temporal extent of the video game depicted in the video is determined by detecting the temporal boundaries of the game in the video. In one embodiment, the temporal boundaries may be computed by training in-game and out of game visual classifiers on the video features. Those skilled in the art will appreciate a variety of artificial intelligence or machine learning methods may be used for this purpose, including, but not limited to neural networks, deep networks, random forest classifiers and the like. In one embodiment, one of more of the techniques described in Hearst, et al. and Krizhevsky, et al. may be used” in Para.[0062], “information is collected about the video game through recognition of game levels, maps, objects and locations. In one embodiment, Support Vector Machines (SVMs), as described in Hearst, et al., are used for recognition and detection of such game information” in Para.[0065]); 
determining, based on an output of each machine learning model, whether a highlight event has occurred during the playing of the video game(“Game video labeling and information indexing module 216 is configured to identify the video game represented within the video data and extract various information relating to the status of the video game over time. According to one embodiment, game video labeling and information indexing module 216 module uses game design templates stored within a video game database (VGDB) 220 to recognize the video game being shown in the video data” in Para.[0051], "After extracting the status of the video game and/or game activity, game video labeling and information indexing module 216 indexes the extracted information and stores it within a game video information database 230. A non-limiting example of a simplified database schema for game video information database 230 is described in further detail below with reference to FIG. 7. The extracted information may also be sent to watchability scoring and highlight generation module 218” in Para.[0053]); 
selecting, based on a determination that a highlight event has occurred, at least some of the frames of the video signal and/or the audio signal for inclusion in a recording of the playing of the video game(“watchability scoring and highlight generation module 218 identifies one or more significant portions of the video containing game activity deemed to be of importance and one or more less significant portions of the video deemed to be relatively less important … Watchability scoring and highlight generation module 218 may also generate highlight videos based on watchability scores and/or by performing a separate analysis in relation to player achievements, player scores and/or nearness to completion of game and game level objectives. Module 218 also interacts with the video editing user interface module 260, allowing the user of the system to view and evaluate the game information extracted by system server 210 and provide input in terms of selection of clips, transitions, overlays, and audio in connection with producing a final video output that may be uploaded to YouTube or the like” in Para.[0054]), and 
generating a recording of the video game gameplay that includes the selected video frames and/or the audio frames(“Watchability scoring and highlight generation module 218 may also generate highlight videos based on watchability scores and/or by performing a separate analysis in relation to player achievements, player scores and/or nearness to completion of game and game level objectives. Module 218 also interacts with the video editing user interface module 260, allowing the user of the system to view and evaluate the game information extracted by system server 210 and provide input in terms of selection of clips, transitions, overlays, and audio in connection with producing a final video output that may be uploaded to YouTube or the like” in Para.[0054]).
Syed is silent about inputting each respective one of the video signal and the audio signal in the received video game data into a respective one of a plurality of machine learning models, each machine learning model among the plurality of machine learning models having been specifically trained to identify an event occurring within the video game based on only one of the video signal and the audio signal input to that model.
Long teaches a respective one of a plurality of machine learning models, each machine learning model among the plurality of machine learning models having been specifically trained to identify an event occurring within the video game based on one of the video signal and the audio signal input to that model(“Highlight and special effect servers 270 may utilize machine learning and/or machine vision algorithms to auto-detect particular game events, scenes or moments of interest, and generate highlight videos, with or without highlight effects such as slow-motion or close-up with fly-by camera angles” in Para.[0105], “auto-detection or auto-identification of optimal virtual camera locations may be conducted live during a game play, again by leveraging computer vision and/or other machine learning algorithms to determine highlight metadata based on extracted visual, audio, and/or metadata cues, then identifying critical gaming moments based on the generated highlight metadata, and finally identifying optimal locations for placement of highlight virtual cameras based on the highlight metadata” in Para.[0112], “computer vision and other machine learning algorithms may be applied to the game environment, previously recorded game plays, or live game plays for highlight metadata generation, possibly based on extracted visual, audio, and/or metadata cues, in different embodiments of the present invention. 
Examples of such algorithms include, but are not limited to edge detection, feature extraction, segmentation, object recognition, pose estimation, motion analysis, liner and non-liner transforms in time, spatial, or frequency domains, hypothesis testing, decision trees, neural networks including convolutional neural networks, vector quantization, and many others” in Para.[0113], “intelligent machine learning and/or computer vision algorithms may be first used to extract highlight cues, which in turn assist in the auto-detection of particular game scenes or game moments of interest” in Para.[0118], “Such machine learning algorithms may be trained using historical game play data” in Para.[0134]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Syed with the above teachings of Long in order to enhance the accuracy of identifying events within the video game, thereby improving the viewing experience.
Cheng teaches each machine learning model among the plurality of machine learning models having been specifically trained to identify an event occurring within the video game based on only one of the video signal and the audio signal input to that model(“visual feature detection module 212 quantizes the extracted low-level features by feature type using the visual feature models 236. In some embodiments, the feature models 236 or portions thereof are machine-learned (e.g., from training data in the collection 150) using, e.g., k-means clustering techniques. The visual feature detection module 212 can aggregate the quantized low-level features by feature type, by using, for example, a Bag-of-Words (BoW) model in which a frequency histogram of visual words is computed over the entire length of a video. The visual feature detection module 212 identifies the visual features 220 to the event detection module 228” in Para.[0042], “The audio feature model 238 is manually authored and/or developed using training data and machine learning techniques, in a similar fashion to the visual feature models 236 except that the audio features of the training data are analyzed rather than the visual features, in order to develop the audio feature model 238. The audio feature detection module 214 identifies the detected audio features 222 to the event detection module 228” in Para.[0043]).
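The quantization and Bag-of-Words aggregation described in the passage quoted from Cheng can be sketched as follows. This is a minimal illustration, not Cheng's implementation: the codebook is assumed to have been learned beforehand (for example by k-means clustering), and the function names and sample descriptors are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence

Vector = Sequence[float]


def quantize(descriptor: Vector, codebook: List[Vector]) -> int:
    """Map a low-level feature descriptor to the index of its nearest visual word."""
    return min(range(len(codebook)), key=lambda i: math.dist(descriptor, codebook[i]))


def bag_of_words(descriptors: List[Vector], codebook: List[Vector]) -> List[int]:
    """Count how often each visual word occurs over all descriptors of a video."""
    counts = Counter(quantize(d, codebook) for d in descriptors)
    return [counts[i] for i in range(len(codebook))]


# Assumed two-word codebook and three descriptors extracted from one video.
codebook = [[0.0, 0.0], [1.0, 1.0]]
descriptors = [[0.1, 0.0], [0.9, 1.0], [1.0, 0.8]]
histogram = bag_of_words(descriptors, codebook)
```

An audio feature model could be handled the same way with a separate codebook built from audio descriptors, so that each model sees only its own signal.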
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Syed and Long with the above teachings of Cheng in order to improve the efficiency of event recognition by using individual machine learning models.
As per claim 2, Syed, Long and Cheng teach all of the limitations of claim 1.
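The arrangement addressed in the rejection of claim 1 (one model per signal, a highlight determination based on the model outputs, and selection of frames for the recording) can be sketched as below. The sketch rests on stated assumptions: the threshold functions stand in for trained models, the event labels are hypothetical, the video and audio frames are assumed to be aligned one-to-one, and the rule that both models must report an event is only one possible way to combine the outputs.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[float]  # hypothetical per-frame feature vector


def video_model(frame: Frame) -> str:
    """Stand-in for a model trained on the video signal only."""
    return "action_scene" if frame[0] > 0.8 else "other"


def audio_model(frame: Frame) -> str:
    """Stand-in for a model trained on the audio signal only."""
    return "explosion" if frame[0] > 0.7 else "other"


def detect_highlights(video: List[Frame], audio: List[Frame]) -> List[int]:
    """Return the frame indices at which a highlight event is determined."""
    hits = []
    for i, (v, a) in enumerate(zip(video, audio)):
        # Assumed fusion rule: both single-signal models must report an event.
        if video_model(v) != "other" and audio_model(a) != "other":
            hits.append(i)
    return hits


def select_frames(video: List[Frame], audio: List[Frame],
                  hits: List[int], padding: int = 1) -> List[Tuple[Frame, Frame]]:
    """Keep each highlight frame plus `padding` frames on either side."""
    keep = sorted({j for i in hits
                   for j in range(max(0, i - padding), min(len(video), i + padding + 1))})
    return [(video[j], audio[j]) for j in keep]


video = [[0.1], [0.9], [0.95], [0.2]]
audio = [[0.1], [0.2], [0.9], [0.1]]
hits = detect_highlights(video, audio)
recording = select_frames(video, audio, hits)
```

Here only the third frame satisfies both models, and the recording keeps that frame together with its immediate neighbours.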
Syed is silent about wherein each machine learning model is trained using previously generated video game data generated during a previous playing of the video game, and each machine learning model is trained using semi-supervised learning to determine a relationship between one of the video signal and the audio signal input to that model and corresponding events.
Long teaches wherein each machine learning model is trained using previously generated video game data generated during a previous playing of the video game, and each machine learning model is trained using semi-supervised learning to determine a relationship between one of the video signal and the audio signal input to that model and corresponding events(Para.[0105], [0112], [0113]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Syed with the above teachings of Long in order to enhance the accuracy of identifying events within the video game, thereby improving the viewing experience.
As per claim 3, Syed, Long and Cheng teach all of the limitations of claim 1.
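One possible reading of the semi-supervised training recited in claim 2, consistent with the clustering, labelling, and training units recited in claim 25, is to cluster unlabelled feature representations, have a user label each cluster, and then relate new inputs to those labels. The sketch below is a simplified illustration under that assumption; the k-means routine, the nearest-centroid classifier, and the labels "menu" and "explosion" are hypothetical choices and are not taken from any cited reference.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def nearest(point: Vector, centroids: List[Vector]) -> int:
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))


def kmeans(points: List[Vector], k: int, iterations: int = 10) -> List[Vector]:
    """Unsupervised step: sort feature representations into k clusters."""
    centroids = [list(p) for p in points[:k]]  # deterministic initialisation
    for _ in range(iterations):
        groups: List[List[Vector]] = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for c, members in enumerate(groups):
            if members:  # leave a centroid unchanged if its cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


def build_classifier(centroids: List[Vector],
                     cluster_labels: Dict[int, str]) -> Callable[[Vector], str]:
    """Training step: relate an input feature to the label of its nearest cluster."""
    def predict(feature: Vector) -> str:
        return cluster_labels[nearest(feature, centroids)]
    return predict


# Features from previously generated game data (illustrative values).
features = [[0.0, 0.1], [0.9, 1.0], [0.1, 0.0], [1.0, 0.9]]
centroids = kmeans(features, k=2)
# Labelling step: a user inspects each cluster and names the event it represents.
labels = {0: "menu", 1: "explosion"}
model = build_classifier(centroids, labels)
```

Only the per-cluster labels come from a user; the individual frames are never labelled one by one, which is what makes the scheme semi-supervised in this sketch.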
Syed teaches wherein the video game data further comprises a telemetry signal indicating an in-game event; and wherein the method further comprises determining whether a highlight event has occurred based on an output of each machine learning model and received telemetry signal(“the temporal extent of the video game depicted in the video is determined by detecting the temporal boundaries of the game in the video. In one embodiment, the temporal boundaries may be computed by training in-game and out of game visual classifiers on the video features. Those skilled in the art will appreciate a variety of artificial intelligence or machine learning methods may be used for this purpose, including, but not limited to neural networks, deep networks, random forest classifiers and the like. In one embodiment, one of more of the techniques described in Hearst, et al. and Krizhevsky, et al. may be used” in Para.[0062]).
As per claim 4, Syed, Long and Cheng teach all of the limitations of claim 1.
Syed teaches wherein the video game data further comprises one or more of: i. a haptic signal indicating haptic feedback output at one or more devices being used to play the video game; ii. a motion signal indicating motion of a player; iii. a speech signal comprising the player's speech; iv. a player input signal indicating player inputs received at one or more devices being used to play the video game; and v. a video camera signal comprising a video recording of the player(“one or more significant portions of the video(s) containing game activity deemed to be of importance and one or more less significant portions of the video(s) deemed to be relatively less important may be identified based on various factors including, but not limited to, input received from the user and changes in game status and/or game activity” in Para.[0042]).
As per claim 5, Syed, Long and Cheng teach all of the limitations of claim 1.
Syed is silent about comprising inputting at least some of the frames of the video signal into a video machine learning model among the plurality of machine learning models; and wherein the video machine learning model is trained to identify a type of scene to which each of the at least some frames of the video signal correspond.
Long teaches comprising inputting at least some of the frames of the video signal into a video machine learning model among the plurality of machine learning models; and wherein the video machine learning model is trained to identify a type of scene to which each of the at least some frames of the video signal correspond(“intelligent machine learning and/or computer vision algorithms may be first used to extract highlight cues, which in turn assist in the auto-detection of particular game scenes or game moments of interest” in Para.[0118], Para.[0105], [0112], [0113]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Syed with the above teachings of Long in order to enhance the accuracy of identifying events within the video game, thereby improving the viewing experience.
As per claim 6, Syed, Long and Cheng teach all of the limitations of claim 1.
Syed is silent about comprising inputting at least some of the audio frames of the audio signal into an audio machine learning model among the plurality of machine learning models; wherein the audio machine learning model is trained to identify an audio event to which each of the at least some frames of the audio signal correspond.
Long teaches comprising inputting at least some of the audio frames of the audio signal into an audio machine learning model among the plurality of machine learning models; wherein the audio machine learning model is trained to identify an audio event to which each of the at least some frames of the audio signal correspond(Para.[0105], [0112], [0113]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Syed with the above teachings of Long in order to enhance the accuracy of identifying events within the video game, thereby improving the viewing experience.
As per claim 7, Syed, Long and Cheng teach all of the limitations of claim 6.
Syed is silent about comprising inputting at least some of the frames of the video signal into a video machine learning model among the plurality of machine learning models, wherein the video machine learning model is trained to identify a type of scene to which each of the at least some frames of the video signal correspond: generating respective feature representations of the at least some of the frames of the video signal and the at least some of the frames of the audio signal; and wherein the inputting the at least some of the frames of the video signal, and the inputting the at least some of the audio frames of the audio signal into, respectively, into the video machine learning model and the audio machine learning model comprises inputting the respective feature representations into the respective video machine learning model and the audio machine learning model.
Long teaches comprising inputting at least some of the frames of the video signal into a video machine learning model among the plurality of machine learning models, wherein the video machine learning model is trained to identify a type of scene to which each of the at least some frames of the video signal correspond: generating respective feature representations of the at least some of the frames of the video signal and the at least some of the frames of the audio signal; and wherein the inputting the at least some of the frames of the video signal, and the inputting the at least some of the audio frames of the audio signal into, respectively, into the video machine learning model and the audio machine learning model comprises inputting the respective feature representations into the respective video machine learning model and the audio machine learning model(Para.[0105], [0112], [0113]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Syed with the above teachings of Long in order to enhance the accuracy of identifying events within the video game, thereby improving the viewing experience.
As per claim 8, Syed, Long and Cheng teach all of the limitations of claim 7.
Syed teaches wherein generating the feature representations of the video signal comprises inputting at least some of the frames of the video signal into a pre-trained model(Para.[0061]).
As per claim 10, Syed, Long and Cheng teach all of the limitations of claim 5.
Syed teaches wherein the video machine learning model comprises a neural network(Para.[0061]).
As per claim 12, Syed, Long and Cheng teach all of the limitations of claim 1.
Syed teaches wherein each trained one of the machine learning models is executed locally at a computing device being used to play the video game(Fig. 2, Para.[0037], Para.[0040], Para.[0047]).
As per claim 13, Syed, Long and Cheng teach all of the limitations of claim 1.
Syed teaches comprising: receiving historical data generating during the playing of the video game by one or more other players; determining a correspondence between at least some of the historical data and the at least some of the frames of the video signal and/or the audio signal selected; and displaying the historical data that corresponds with the at least some of the frames of the video signal and/or the audio signal selected, when playing back the recording of video game gameplay(“a simplified database schema 600 for an exemplary video game database (e.g., VGDB 220) in accordance with an embodiment of the present invention. In one embodiment, database schema facilitates content-specific video parsing, editing, creation and/or labeling as it allows system server 210 to understand the content of the video at issue. In the context of the present example, database schema 600 is represented as a set of exemplary database tables, including a game info table 610, a game HUD table 620, a game objective table 630, a game levels/maps table 640 that includes classification models and a game achievements table 650. In addition, information about any players and teams participating in the game (if available) may also be stored in a player table 670 and teams table 680. Player game history may be stored in a player game history table 660. Fields presented in italic text (i.e., game_id, player_id, team_id, game_HUD_id, game_objective_id, game_level_id and game_achievement_id) are those that serve as primary keys. Id values typically represent values that uniquely identify the thing at issue (e.g., the game, the player, the game HUD specs, the teams, etc.) within the system” in Para.[0079]).
As per claim 18, Syed teaches a computer readable medium having computer executable instructions adapted to cause a computer system to perform(“a computer program product, which may include a machine-readable medium having stored thereon instructions, which may be used to program a computer (or other electronic devices) to perform a process. The machine-readable medium may include, but is not limited to, floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs), and magneto-optical disks, ROMs, random access memories (RAMs), erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, flash memory, or other type of media/machine-readable medium suitable for storing electronic instructions. Moreover, embodiments of the present invention may also be downloaded as a computer program product, wherein the program may be transferred from a remote computer to a requesting computer by way of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., a modem or network connection)” in Para.[0026]) and the other limitations in the claim 18 has been discussed in the rejection claim 1 and rejected under the same rationale. 
As per claim 20, the limitations of claim 20 have been discussed in the rejection of claim 1 and are rejected under the same rationale.
As per claim 21, the limitations of claim 21 have been discussed in the rejection of claim 2 and are rejected under the same rationale.
As per claim 22, the limitations of claim 22 have been discussed in the rejection of claim 3 and are rejected under the same rationale.
As per claim 23, Syed, Long and Cheng teach all of the limitations of claim 20.
Syed teaches further comprising: a plurality of feature extractors, each feature extractor being configured to receive a respective one of the video signal and the audio signal in the received video game data and to generate feature representations of the frames in such respective one of the video signal and the audio signal; and wherein at least the first and second modules are configured to receive the feature representations generated for a respective one of the video signal and the audio signal(“game video feature extraction module (e.g., game video feature extraction module 214) in accordance with an embodiment of the present invention. In the context of the present example, game video feature extraction 300 includes one or more of visual and audio visual feature extraction 310, audio feature extraction 320, Optical Character Recognition (OCR) 330 and speech recognition 340 from a video game video” in Para.[0055]).
As per claim 24, Syed, Long and Cheng teach all of the limitations of claim 23.
Syed teaches wherein a first feature extractor is configured to receive video frames and a second feature extractor is configured to receive audio frames; and wherein the first module is configured to receive feature representations of video frames and the second module is configured to receive feature representations of audio frames(“game video feature extraction 300 includes one or more of visual and audio visual feature extraction 310, audio feature extraction 320, Optical Character Recognition (OCR) 330 and speech recognition 340 from a video game video” in Para.[0055]).
As per claim 25, Syed, Long and Cheng teach all of the limitations of claim 23.
Syed teaches comprising: a plurality of feature extractors for receiving previously generated video game data, each feature extractor being configured to receive a different signal in the previously generated video game data; a plurality of clustering units, each clustering unit being configured to receive the feature representations output by a different feature extractor and to use unsupervised learning to sort the received feature representations into a plurality of clusters; a labelling unit operable to generate labels for the clusters output by each clustering unit, the labelling unit being configured to generate the labels based on an input from a user, each label indicating an event associated with the frames or corresponding feature representations in a respective cluster; and a training unit configured to train at least the first and second modules, the training unit being configured to determine a relationship between the frames or feature representations input to the first and second modules and the corresponding labels generated by the labelling unit(Para.[0050], [0051], [0053], [0061], [0062], [0065]).

Syed in view of Long, Cheng and Chaudhuri
Claims 9 and 11 are rejected under 35 U.S.C. 103 as being unpatentable over Syed et al.(USPubN 2017/0228600; hereinafter Syed) in view of Long et al.(USPubN 2017/0157512; hereinafter Long) further in view of Cheng et al.(USPubN 2014/0328570; hereinafter Cheng) further in view of Chaudhuri et al.(USPubN 2018/0174600; hereinafter Chaudhuri).
As per claim 9, Syed, Long and Cheng teach all of the limitations of claim 7.
Syed, Long and Cheng are silent about wherein generating feature representations of the audio signal comprises generating a mel-spectrogram of at least some of the frames of the audio signal.
Chaudhuri teaches wherein generating feature representations of the audio signal comprises generating a mel-spectrogram of at least some of the frames of the audio signal(“the sound filter bank 230 uses a Cascade of Asymmetric Resonators with Fast-Acting Compression (CARFAC) model to extract the features. The CARFAC model is based on a pole-zero filter cascade (PZFC) model of auditory filtering, in combination with a multi-time-scale coupled automatic-gain-control (AGC) network. This mimics features of auditory physiology. In other embodiments, the features may be extracted using another model, such as a spectrogram modified by a mel filter bank. Other methods of extracting features, such as using the raw spectrograms of the audio segments themselves as features, or mel filters, may also be used” in Para.[0077]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Syed, Long and Cheng with the above teachings of Chaudhuri in order to significantly improve the accuracy of audio extraction.
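For illustration only, the "spectrogram modified by a mel filter bank" referred to in Chaudhuri Para.[0077] can be sketched as below. This is a minimal sketch and is not taken from Chaudhuri or from the application: the function names, the common 2595/700 mel-scale formula, the use of triangular filters, and the example parameters (40 bands, 257 frequency bins, 16 kHz sample rate) are all assumptions chosen for the example.

```python
import math

def hz_to_mel(hz):
    # Widely used mel-scale approximation (an assumption; other variants exist).
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filter_bank(n_mels, n_bins, sample_rate):
    """Build n_mels triangular filters over n_bins linearly spaced
    frequency bins covering 0 Hz to the Nyquist frequency."""
    nyquist = sample_rate / 2.0
    top_mel = hz_to_mel(nyquist)
    # n_mels + 2 band edges, evenly spaced on the mel scale.
    edges = [mel_to_hz(top_mel * k / (n_mels + 1)) for k in range(n_mels + 2)]
    bin_hz = [nyquist * b / (n_bins - 1) for b in range(n_bins)]
    bank = []
    for m in range(n_mels):
        left, center, right = edges[m], edges[m + 1], edges[m + 2]
        row = []
        for f in bin_hz:
            rising = (f - left) / (center - left)
            falling = (right - f) / (right - center)
            # Triangle: zero outside [left, right], peak of 1 at center.
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank

def apply_mel(power_frames, bank):
    """Map each power-spectrum frame (one value per frequency bin)
    to one value per mel band."""
    return [[sum(w * p for w, p in zip(row, frame)) for row in bank]
            for frame in power_frames]

# Hypothetical usage: a single frame whose energy sits in one bin (437.5 Hz).
bank = mel_filter_bank(n_mels=40, n_bins=257, sample_rate=16000)
frame = [0.0] * 257
frame[14] = 1.0
mel_frame = apply_mel([frame], bank)[0]
```

In this sketch the power spectrum of each audio frame is assumed to be computed elsewhere (for example by a short-time Fourier transform); only the mel filtering step is shown.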
As per claim 11, Syed, Long and Cheng teach all of the limitations of claim 6.
Syed, Long and Cheng are silent about wherein the audio machine learning model comprises a logistic regression model or a binary classification model.
Chaudhuri teaches wherein the audio machine learning model comprises a logistic regression model or a binary classification model(“The speech detection 232 process may further smooth the raw scores over the multiple segments that are generated by the machine learning model. The goal of the smoothing process is to create a set of binary scores (“speech” or “no speech”) that do not fluctuate at a high frequency. To generate the binary scores, the speech detection 232 process generates an aggregate value for each of a series of consecutive segments in the audio stream based on the raw scores of the segments in each series. The aggregate value is computed using an aggregation function, which may be any statistical or mathematical operation that generates a single value from multiple values of a similar type, such as an average. The output from this process is a set of binary values indicating the temporal positions in the audio portion of the video where speech occurs, and where speech does not occur. In the output, while small gaps of lower raw scores in the audio may be smoothed away and the binary score for such a gap indicates speech, larger gaps will still indicate segments of no speech” in Para.[0081]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Syed, Long and Cheng with the above teachings of Chaudhuri in order to significantly improve the accuracy of audio extraction.
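For illustration only, the smoothing of raw scores into binary "speech" / "no speech" values described in Chaudhuri Para.[0081] can be sketched as below. This is a minimal sketch under stated assumptions and is not Chaudhuri's implementation: the function name, the centered sliding window, the window size, the 0.5 threshold, and the sample scores are hypothetical. The average is used as the aggregation function because Para.[0081] names it as one example.

```python
def smooth_to_binary(raw_scores, window=3, threshold=0.5):
    """Aggregate raw per-segment scores over a series of consecutive
    segments and emit one binary value per segment.

    window and threshold are assumed parameters, not values from the reference.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    half = window // 2
    n = len(raw_scores)
    labels = []
    for i in range(n):
        # Series of consecutive segments centered on segment i (clipped at the ends).
        series = raw_scores[max(0, i - half):min(n, i + half + 1)]
        aggregate = sum(series) / len(series)  # average as the aggregation function
        labels.append(1 if aggregate >= threshold else 0)
    return labels

# Hypothetical raw scores: a one-segment dip inside speech, then a long silence.
raw = [0.9, 0.8, 0.2, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1]
labels = smooth_to_binary(raw, window=3, threshold=0.5)
```

With these sample scores, the short low-score gap at the third segment is smoothed away and labeled as speech, while the longer run of low scores remains labeled as no speech, which mirrors the behavior the quoted passage describes.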
Allowable Subject Matter
Claim 26 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SUNGHYOUN PARK whose telephone number is (571)270-1333. The examiner can normally be reached M - Thur 6:00 am - 4 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, THAI Q TRAN, can be reached at (571)272-7382. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/SUNGHYOUN PARK/Examiner, Art Unit 2484