Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Status
Claims 1-19 are pending.  

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-7, 9-13 and 16-18 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception, i.e., an abstract idea, which is not integrated into a practical application and does not include additional elements that amount to significantly more than the judicial exception.
STEP 1:
The claims are directed to a method/process, which is one of the categories of statutory subject matter.
STEP 2A – Prong 1: 
The claims recite a judicial exception, i.e., an abstract idea, because at least one of the method steps can be performed as a mental process, i.e., thinking that can be performed in the human mind, assisted by pen and paper if necessary.
Claim 1 recites:
receiving, by a processor, the source media data element comprising one or more frames;
applying, by the processor, a machine learning algorithm to predict at least one first Region of Interest (ROI) in one or more of the at least one frames; and
cropping, the one or more frames to generate a new media data element based on the predicted at least one first ROI.
The limitation “cropping, the one or more frames to generate a new media data element based on the predicted at least one first ROI” can be performed by a mental process in the human mind and thus is directed to an abstract idea.  
STEP 2A – Prong 2:
The claim does not recite additional elements that integrate the judicial exception into a practical application.  Simply appending, at a high level of generality, well-understood, routine, conventional activities previously known to the industry to the judicial exception does not integrate the judicial exception into a practical application.  A generic computer (a processor) is simply linked to a particular technological environment or field of use, i.e., generating a new media item.  The generic computer (a processor) does not improve the technological field because no specific processor functions are integrated into the method steps.  Still further, the additional elements do not constitute an improvement in the operation of the generic computer per se.  According to the considerations of Prong 2, it is concluded that a draftsman could simply append a processor to the claimed method steps.
STEP 2B:  
The additional elements do not amount to significantly more than the judicial exception, i.e., the additional elements such as features, limitations and steps, individually or in combination, do not contribute to a patentable inventive concept.  The inventive concept is well-known in the industry, i.e., generating a new media data element.  The present invention can be characterized as automating a well-known manual process.
The courts have ruled that the following computer functions are well-understood, routine and conventional functions when they are claimed in a merely generic manner, e.g., at a high level of generality.  
The following are pertinent to the present invention:
Receiving or transmitting data over a network
Data gathering
Selecting a particular data source or type of data to be manipulated.  
The claims, considered individually or in combination, do not result in a new or improved system for generating a new media data element.
Claims 2-7, 9-13 and 16-18 are also rejected on the basis of failing to integrate the judicial exception into a practical application and/or failing to include additional elements that amount to significantly more than the judicial exception.

Allowable Subject Matter
Claims 8, 14, 15 and 19 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
8. The method of claim 2, wherein the machine learning algorithm is trained to minimize a regression loss function on the plurality of media data elements by at least one of: mean squared error, L1 mean absolute error, log-cosh error and Huber loss error between the predicted coordinates of the ROI and the tagged coordinates of the ROI.
14. The method of claim 5, wherein the machine learning algorithm is a recurrent neural network (RNN), wherein the source media data element comprises at least one sequence of frames, and wherein the applying of the encoder comprises: selecting, by the processor, ‘N’ frames from the at least one frame sequence; and feeding each of the ‘N’ frames to the encoder to receive a sequence of ‘N’ feature vectors.
15. The method of claim 14, wherein at least one layer of the RNN comprises one of: ‘N’ bidirectional long short-term memory (LSTM) units and ‘N’ unidirectional LSTM units.
19. The method of claim 17, wherein the machine learning algorithm is trained to minimize a regression loss function on the plurality of second media data elements by at least one of: mean squared error, L1 mean absolute error, log-cosh error and Huber loss error between the predicted coordinates of the at least one first ROI and the tagged coordinates of at least one of the plurality of second ROIs.
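For illustration only, the four regression losses named in claims 8 and 19 can be sketched as follows. This is a minimal sketch, not the applicant's implementation: the representation of ROI coordinates as a flat list such as (x, y, width, height), the function names, and the Huber threshold `delta` are assumptions that do not appear in the claims.

```python
import math

# Assumption: predicted and tagged ROI coordinates are equal-length lists
# of numbers, e.g. [x, y, width, height]. All names here are hypothetical.

def mse(pred, target):
    """Mean squared error between predicted and tagged coordinates."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def l1_mae(pred, target):
    """L1 mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def log_cosh(pred, target):
    """Mean of log(cosh(residual))."""
    return sum(math.log(math.cosh(p - t)) for p, t in zip(pred, target)) / len(pred)

def huber(pred, target, delta=1.0):
    """Huber loss: quadratic for small residuals, linear beyond delta."""
    total = 0.0
    for p, t in zip(pred, target):
        r = abs(p - t)
        total += 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
    return total / len(pred)
```

Each function returns a single scalar that a training procedure could minimize; the claims only require that at least one of the four be used.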

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1 is/are rejected under 35 U.S.C. 103 as being unpatentable over Xiong (US 2019/0138835) in view of Yen (US 2019/0034734).
Regarding claim 1, Xiong discloses:
receiving, by a processor, the source media data element comprising one or more frames;
Xiong [0014] FIG. 3 illustrates an exemplary data flow in an AI-based image data processing device for asynchronous object ROI detection from a sequence of frames according to various embodiments.

applying, by the processor, a machine learning algorithm to predict at least one first Region of Interest (ROI) in one or more of the at least one frames; and
Xiong [0014] FIG. 3 illustrates an exemplary data flow in an AI-based image data processing device for asynchronous object ROI detection from a sequence of frames according to various embodiments.

cropping, the one or more frames to generate a new media data element based on the predicted at least one first ROI.
Xiong discloses the elements of the claimed invention as noted but does not disclose above limitation.  However, Yen discloses:
Yen [0123] The DL system 926 can perform ROI clipping 927 using the generated ROIs. ROI clipping 927 includes cropping the one or more ROIs from the video frame 914 to generate a cropped video frame (which can also be referred to as a cropped image). A cropped video frame is illustrated as a bolded bounding box within the frame 925 shown in FIG. 9. The cropped video frame includes only the portion of the video frame corresponding to an ROI. In some implementations, if more than one ROI is generated for a video frame, a separate cropped image can be generated for each ROI, resulting in multiple cropped images (or cropped video frames) being generated from the full-sized video frame. Once a cropped video frame is generated, it can then be provided to a deep learning network engine (e.g., deep learning network engine 726 or forensic deep learning network engine 727) for application of a trained neural network (e.g., a deep learning network). Using a cropped portion of the higher resolution video frame 914 (instead of the entire video frame 914) that includes one or more objects of interest reduces the processing time and complexity of the deep network needed to process the video frame. As noted previously, using cropped frames reduces and can even eliminate the problem of classifying small objects. An object in the cropped image is large relative to the frame height, allowing the deep learning network engine to more accurately classify the object, as illustrated in the chart shown in FIG. 5.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Xiong to obtain above limitation based on the teachings of Yen for the purpose of cropping the one or more ROIs from the video frame to generate a cropped video frame (which can also be referred to as a cropped image). 
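For illustration only, the cropping step discussed above can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of Xiong, Yen or the applicant: a frame is assumed to be a list of pixel rows, an ROI is assumed to be an (x, y, width, height) tuple in pixel units, and the function names are hypothetical.

```python
def crop_frame(frame, roi):
    """Return the part of `frame` covered by `roi`, clamped to the frame.

    frame: list of rows, each row a list of pixel values (assumed layout).
    roi:   (x, y, width, height) in pixels (assumed layout).
    """
    x, y, w, h = roi
    height = len(frame)
    width = len(frame[0]) if height else 0
    # Clamp the ROI to the frame so slicing never wraps around.
    x0, y0 = max(0, x), max(0, y)
    x1 = max(x0, min(width, x + w))
    y1 = max(y0, min(height, y + h))
    return [row[x0:x1] for row in frame[y0:y1]]

def crop_frames(frames, rois):
    """Crop each frame with its own predicted ROI to form the new element."""
    return [crop_frame(f, r) for f, r in zip(frames, rois)]
```

The cropped result contains only the portion of each frame that corresponds to its ROI, which matches the cropped video frame described in Yen [0123].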

Claim(s) 2 is/are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Xiong and Yen and further in view of Roberge (US 2019/0139642) and further in view of Kisilev (US 2018/0247405).  
Regarding claim 2, the combination of Xiong and Yen discloses:
receiving, by the processor, a plurality of media data elements;
Xiong [0014] 
The combination of Xiong and Yen does not disclose tagging at least one second ROI for each media data element of the plurality of media data elements.  However, Roberge discloses:
Roberge [0061] An additional objective of the present invention is to facilitate production of a report while simultaneously tagging or categorizing an image or a ROI.   Specifically, the current invention facilitates the use of eye tracking to record gaze data and associate the gaze data with terms from the report. The system can automatically process an image in order to identify a ROI within the image for training a machine learning algorithm.
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the combination of Xiong and Yen to obtain above limitation based on the teachings of Roberge for the purpose of simultaneously tagging or categorizing an image or a ROI.

The combination of Xiong and Yen does not disclose feeding the received media data elements and each of the at least one second ROI to the machine learning algorithm to train the machine learning algorithm to predict the at least one first ROI in one or more of the at least one frames in the source media data element or to predict at least another ROI in another media data element.  However, Kisilev discloses:
Kisilev [0037] The extracted feature maps may be sent from the CNN 104 to both the region proposal network 106 and the multi-attribute net 110. The region proposal network 106 can be trained to generate region of interest (ROI) candidates. For example, the region proposal network 106 can be trained to predict an ROI bounding box coordinates and a bounding box score. For example, the proposal convolutional layer 114 can map the extracted feature maps to a lower-dimensional feature. The proposal class softmax layer 116 can estimate a probability of the ROI bounding box including a tumor for each proposed ROI candidate. The proposal bounding box (bbox) regressor layer 118 can encode the coordinates of the proposed ROI candidates. In some examples, to accommodate for a variety of lesion sizes depending on datasets, the region proposal network may use up to 9 scales of ROI's. For example, the scale of an ROI may range from 16×16 to 1024×1024 in size. Thus, the output of the region proposal network 108 may be any number of regions of interest candidates with associated bounding boxes and bounding box scores.
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the combination of Xiong and Yen to obtain above limitation based on the teachings of Kisilev for the purpose of training a CNN to predict a ROI.

Regarding claim 3, the combination of Xiong, Yen, Roberge and Kisilev discloses wherein the machine learning algorithm comprises at least one of: a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN).
Kisilev [0037] The extracted feature maps may be sent from the CNN 104 to both the region proposal network 106 and the multi-attribute net 110.

Claim(s) 4 is/are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Xiong, Yen, Roberge and Kisilev and further in view of Katz (US 2018/0024643).   
The combination of Xiong, Yen, Roberge and Kisilev discloses the elements of the claimed invention as noted but does not disclose wherein the tagging is carried out in at least one of two perpendicular axes.  However, Katz discloses:
Katz [0034] tagging a frame or capturing a frame from the video, cutting a subset of a video from a video
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the combination of Xiong, Yen, Roberge and Kisilev to obtain above limitation based on the teachings of Katz for the purpose of tagging a frame from a video.

Claim(s) 5 is/are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Xiong and Yen and further in view of Katti (US 2020/0193164).
The combination of Xiong and Yen discloses the elements of the claimed invention as noted but does not disclose further comprising applying an encoder to perform a transformation to at least one frame in the source media data element to produce at least one feature vector, wherein the machine learning algorithm is configured to predict the at least one first ROI based on the produced at least one feature vector.  However, Katti discloses:
Katti [0020] For example, where a scene is represented as a sequence of timestamped frames, the scene retrieval platform generates a frame representation for each timestamped frame, and then feeds the frame representations of the timestamped frames to the trained neural network. Responsive to receiving the frame representations, the trained neural network transforms the frame representations into an embedding vector (e.g., a vector representation). The vector representations of the frame representations may then be concatenated into a “scene vector,” wherein the scene vector comprises a vector representation of the scene. Using the vector representation of the scene, a similarity score may be determined for all scenes of equal length to the scene (based on the vector representation of the scene). The scenes of equal length may then be ranked based on the similarity scores, and presented to a user within a graphical user interface.
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the combination of Xiong and Yen to obtain above limitation based on the teachings of Katti for the purpose of transforming the frame representations into an embedding vector (e.g., a vector representation).
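For illustration only, the data flow quoted from Katti [0020] (frame representation, embedding vector, concatenated scene vector, similarity score, ranking) can be sketched as follows. This is a minimal sketch under stated assumptions: `embed` is a hypothetical stand-in for the trained neural network, and cosine similarity is an assumed choice of score, since the quoted passage does not name one.

```python
import math

def embed(frame_repr):
    # Hypothetical stand-in for the trained network: maps a frame
    # representation (a list of numbers) to a 2-element embedding.
    return [sum(frame_repr) / len(frame_repr), max(frame_repr)]

def scene_vector(frame_reprs):
    """Concatenate the per-frame embeddings into one scene vector."""
    vec = []
    for frame_repr in frame_reprs:
        vec.extend(embed(frame_repr))
    return vec

def cosine_similarity(a, b):
    # Assumed similarity score; the source does not specify a metric.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_scenes(query, candidates):
    """Return candidate indices ordered from most to least similar."""
    q = scene_vector(query)
    scored = [(cosine_similarity(q, scene_vector(c)), i)
              for i, c in enumerate(candidates)]
    return [i for _, i in sorted(scored, key=lambda s: -s[0])]
```

The sketch assumes that all candidate scenes have the same number of frames as the query, consistent with the "scenes of equal length" language in the quoted paragraph.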

Claim(s) 6 is/are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Xiong, Yen and Katti and further in view of Zhang (US 2020/0082573). 
Regarding claim 6, the combination of Xiong, Yen and Katti discloses the elements of the claimed invention as noted but does not disclose wherein training the encoder is unsupervised.  However, Zhang discloses:
Zhang [0102] In this embodiment, training the encoder and the decoder may adopt a process of training a variational auto-encoder, which belongs to an unsupervised learning method and does not need the scene parameters of the training scene to be manually noted in advance, thereby reducing an input of human and material resources in the training process.
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the combination of Xiong, Yen and Katti to obtain above limitation based on the teachings of Zhang for the purpose of not needing the scene parameters of the training scene to be manually noted in advance.

Claim(s) 7 is/are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Xiong, Yen and Katti and further in view of Gottemukkula (US 2020/0302149).  
Regarding claim 7, the combination of Xiong, Yen and Katti discloses the elements of the claimed invention as noted but does not disclose wherein training the encoder is supervised.  However, Gottemukkula discloses:
Gottemukkula [0023] FIG. 4 is a flow diagram of an example process for training an encoder neural network based on a supervised loss function, an unsupervised loss function, or both.
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the combination of Xiong, Yen and Katti to obtain above limitation based on the teachings of Gottemukkula for the purpose of training an encoder neural network based on a supervised loss function.

Claim(s) 9 is/are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Xiong and Yen.
Regarding claim 9, the combination of Xiong and Yen discloses further comprising modifying the predicted at least one first ROI, wherein at least one frame of the new media data element comprises the modified at least one first ROI.
Yen [0123] The DL system 926 can perform ROI clipping 927 using the generated ROIs. ROI clipping 927 includes cropping the one or more ROIs from the video frame 914 to generate a cropped video frame (which can also be referred to as a cropped image). A cropped video frame is illustrated as a bolded bounding box within the frame 925 shown in FIG. 9. The cropped video frame includes only the portion of the video frame corresponding to an ROI. In some implementations, if more than one ROI is generated for a video frame, a separate cropped image can be generated for each ROI, resulting in multiple cropped images (or cropped video frames) being generated from the full-sized video frame. Once a cropped video frame is generated, it can then be provided to a deep learning network engine (e.g., deep learning network engine 726 or forensic deep learning network engine 727) for application of a trained neural network (e.g., a deep learning network). Using a cropped portion of the higher resolution video frame 914 (instead of the entire video frame 914) that includes one or more objects of interest reduces the processing time and complexity of the deep network needed to process the video frame. As noted previously, using cropped frames reduces and can even eliminate the problem of classifying small objects. An object in the cropped image is large relative to the frame height, allowing the deep learning network engine to more accurately classify the object, as illustrated in the chart shown in FIG. 5.
NOTE: See claim 1 for motivation statement.  

Claim(s) 10 is/are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Xiong, Yen and Kisilev and further in view of Mulford (US 2021/0012502).         
Regarding claim 10, the combination of Xiong, Yen and Kisilev discloses comprising modifying the predicted at least one first ROI, 
Kisilev [0037] The extracted feature maps may be sent from the CNN 104 to both the region proposal network 106 and the multi-attribute net 110. The region proposal network 106 can be trained to generate region of interest (ROI) candidates. For example, the region proposal network 106 can be trained to predict an ROI bounding box coordinates and a bounding box score. For example, the proposal convolutional layer 114 can map the extracted feature maps to a lower-dimensional feature.
NOTE:  See claim 2 for motivation statement.  
The combination of Xiong, Yen and Kisilev does not disclose wherein at least one frame of the new media data element is cropped based on the modified at least one first ROI.  However, Mulford discloses:
Mulford [0003]  Each frame in the media may be cropped based on the features identified within the frame. Features detected may include face tracking, object detection and/or recognition, text detection, detection of dominant colors, motion analysis, scene change detection, and image saliency. The identification of multiple salient features in the frame, however, may result in cropping of the media to include other features irrelevant to the viewer. A suboptimal cropping may lead to a need to regenerate the cropped media.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of Xiong, Yen and Kisilev to obtain above limitation based on the teachings of Mulford for the purpose of cropping each frame based on the features identified within the frame. 

Claim(s) 11 is/are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Xiong and Yen and further in view of Kisilev.
Regarding claim 11, the combination of Xiong, Yen and Kisilev discloses wherein at least one frame of the new media data element comprises the predicted at least one ROI.
Kisilev [0037] The extracted feature maps may be sent from the CNN 104 to both the region proposal network 106 and the multi-attribute net 110. The region proposal network 106 can be trained to generate region of interest (ROI) candidates. For example, the region proposal network 106 can be trained to predict an ROI bounding box coordinates and a bounding box score. For example, the proposal convolutional layer 114 can map the extracted feature maps to a lower-dimensional feature. The proposal class softmax layer 116 can estimate a probability of the ROI bounding box including a tumor for each proposed ROI candidate. The proposal bounding box (bbox) regressor layer 118 can encode the coordinates of the proposed ROI candidates. In some examples, to accommodate for a variety of lesion sizes depending on datasets, the region proposal network may use up to 9 scales of ROI's. For example, the scale of an ROI may range from 16×16 to 1024×1024 in size. Thus, the output of the region proposal network 108 may be any number of regions of interest candidates with associated bounding boxes and bounding box scores.
NOTE:  See claim 2 for motivation statement.  

Claim(s) 12 is/are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Xiong, Yen, Roberge and Kisilev and further in view of Kavanau (US 2019/0197332).
Regarding claim 12, the combination of Xiong, Yen, Roberge and Kisilev discloses the elements of the claimed invention as noted but does not disclose wherein the training is based on at least one of: transfer learning and parameters fine tuning.  However, Kavanau discloses:
Kavanau claim 26, The system of claim 21, further comprising: receive an input of the set of 2-dimensional images to the computer processor; receive an input of metrics to the computer processor, the metrics including a circumference to height ratio of the cylindrical topology, a specific circumference based on a desired resolution of the azimuthal angles, number of training cycles, and at least one parameter that impacts a degree of fine-tuning of the results with each training cycle; determination of symmetry of the set of 2-dimensional images and an identity of the azimuthal angles based on the symmetry; and rotational alignment of the input 2-dimensional images by applying the azimuthal angles to the input 2-dimensional images.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of Xiong, Yen, Roberge and Kisilev to obtain above limitation based on the teachings of Kavanau for the purpose of fine-tuning a number of parameters of training cycle(s).  

Claim(s) 13 is/are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Xiong and Yen and further in view of Bugir (US 2015/0161750).
Regarding claim 13, the combination of Xiong and Yen discloses the elements of the claimed invention as noted but does not disclose further comprising selecting a new display aspect ratio for the generated new media data element, wherein the selected new display aspect ratio is different than a display aspect ratio of the received source media data element.  However, Bugir discloses:
Bugir claim 18, The method of claim 17, wherein the intellectual property rights to the new media content define an aspect ratio for the new media content.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of Xiong and Yen to obtain above limitation based on the teachings of Bugir for the purpose of defining an aspect ratio for the new media content. 
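For illustration only, selecting a new display aspect ratio that differs from that of the source, as recited in claim 13, can be sketched as follows. This is a minimal sketch under stated assumptions and is not drawn from Bugir or from the application: the ROI is assumed to be an (x, y, width, height) tuple, the target ratio is given as two positive integers, and the function name is hypothetical.

```python
def crop_window_for_aspect(frame_w, frame_h, roi, target_w, target_h):
    """Return (left, top, width, height) of a crop window with the target
    aspect ratio, as large as the frame allows, centered on the ROI and
    shifted where needed so that it stays inside the frame."""
    x, y, w, h = roi
    # Largest window of ratio target_w:target_h that fits in the frame.
    if frame_w * target_h >= frame_h * target_w:
        win_h = frame_h
        win_w = frame_h * target_w // target_h
    else:
        win_w = frame_w
        win_h = frame_w * target_h // target_w
    # Center on the ROI, then clamp to the frame boundaries.
    cx, cy = x + w // 2, y + h // 2
    left = min(max(cx - win_w // 2, 0), frame_w - win_w)
    top = min(max(cy - win_h // 2, 0), frame_h - win_h)
    return left, top, win_w, win_h
```

For example, a 1920x1080 (16:9) source with an ROI near the right edge yields a 607x1080 window for a 9:16 target. The sketch does not guarantee that an ROI wider or taller than the window is fully contained in it.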

Claim(s) 16 is/are rejected under 35 U.S.C. 103 as being unpatentable over Xiong in view of Yen. 
Regarding claim 16, Xiong discloses:
receiving, by a processor, a source media data element;
Xiong [0014] FIG. 3 illustrates an exemplary data flow in an AI-based image data processing device for asynchronous object ROI detection from a sequence of frames according to various embodiments.

applying, by the processor, a machine learning algorithm to detect at least one object in the received source media data element;
Xiong [0014] FIG. 3 illustrates an exemplary data flow in an AI-based image data processing device for asynchronous object ROI detection from a sequence of frames according to various embodiments.

predicting, by the processor, a ROI in the received source media data element, wherein the ROI is predicted based on the detected at least one object; and
Xiong [0014] FIG. 3 illustrates an exemplary data flow in an AI-based image data processing device for asynchronous object ROI detection from a sequence of frames according to various embodiments.

cropping, by the processor, the received source media data element to generate a new media data element based on the predicted ROI,
Xiong discloses the elements of the claimed invention as noted but does not disclose above limitation.  However, Yen discloses:
Yen [0123] The DL system 926 can perform ROI clipping 927 using the generated ROIs. ROI clipping 927 includes cropping the one or more ROIs from the video frame 914 to generate a cropped video frame (which can also be referred to as a cropped image). A cropped video frame is illustrated as a bolded bounding box within the frame 925 shown in FIG. 9. The cropped video frame includes only the portion of the video frame corresponding to an ROI. In some implementations, if more than one ROI is generated for a video frame, a separate cropped image can be generated for each ROI, resulting in multiple cropped images (or cropped video frames) being generated from the full-sized video frame. Once a cropped video frame is generated, it can then be provided to a deep learning network engine (e.g., deep learning network engine 726 or forensic deep learning network engine 727) for application of a trained neural network (e.g., a deep learning network). Using a cropped portion of the higher resolution video frame 914 (instead of the entire video frame 914) that includes one or more objects of interest reduces the processing time and complexity of the deep network needed to process the video frame. As noted previously, using cropped frames reduces and can even eliminate the problem of classifying small objects. An object in the cropped image is large relative to the frame height, allowing the deep learning network engine to more accurately classify the object, as illustrated in the chart shown in FIG. 5.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Xiong to obtain above limitation based on the teachings of Yen for the purpose of cropping the one or more ROIs from the video frame to generate a cropped video frame (which can also be referred to as a cropped image). 

wherein the generated new media data element is a portion of the source media data element.
Yen [0123] The DL system 926 can perform ROI clipping 927 using the generated ROIs. ROI clipping 927 includes cropping the one or more ROIs from the video frame 914 to generate a cropped video frame (which can also be referred to as a cropped image).

Claim(s) 17 is/are rejected under 35 U.S.C. 103 as being unpatentable over Makkonen (US 2014/0185863) in view of Biswas (US 2021/0065364) and further in view of Xiong.   
Regarding Claim 17, Makkonen discloses:
training, by a processor, a machine learning algorithm to predict at least one first ROI in at least one frame of at least one first media data element, wherein the training comprises: 
Makkonen [0068] In order to refine ROI detection, one may use the training data to detect from the extracted ROI region ROI subregions that are not relevant for processing and to focus OCR to ROI subregions that are of the desired ROI type. ROI subregions are referred here as ROI blocks. In an exemplary case, detection of non-relevant ROI blocks is based on an assumption that interpretation of ROI blocks is a Markov process that has a discrete state space. The training data can be used as sequences of observations to train a Hidden Markov Model and create a state transition graph and state feature vector distributions for ROI blocks. When these are available, Viterbi algorithm may be used to find the most probable path in the graph, i.e. the most probable interpretation for the ROI block. Other statistical models, like Conditional Random Fields may be applied without deviating from the scope of protection.

receiving, by the processor, a plurality of second media data elements; 
tagging at least one second ROI for each of the received plurality of second media data elements; and 
feeding the at least one second ROI to the machine learning algorithm; 
Makkonen discloses the elements of the claimed invention as noted but does not disclose above limitation.  However, Biswas discloses:
Biswas [0045] In order to train the CNN 118, the processor 102 causes the system 100 to slide a first window 208, a second window 210 and a third window 212 in a predefined path simultaneously on each training image such as the training image 202 of a training data set, the binary image containing a highlighted ROI of the each training image such as the binary image 204 containing ROI 218 of the training  image 202, and the labeled image containing a plurality of tags representing one or more features of interest of the each training image such as the labeled image 206 of the training image 202.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Makkonen to obtain above limitation based on the teachings of Biswas for the purpose of using a highlighted ROI as a training image.  

receiving, by a processor, a source media data element of the at least one first media data element; and 
Makkonen discloses the elements of the claimed invention as noted but does not disclose above limitation.  However, Xiong discloses:
Xiong [0014] FIG. 3 illustrates an exemplary data flow in an AI-based image data processing device for asynchronous object ROI detection from a sequence of frames according to various embodiments.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Makkonen to obtain above limitation based on the teachings of Xiong for the purpose of providing data flow in an AI-based image data processing device for asynchronous object ROI detection from a sequence of frames.

applying, by the processor, the trained machine learning algorithm to predict the at least one first ROI in the received source media data element.
Xiong [0014] FIG. 3 illustrates an exemplary data flow in an AI-based image data processing device for asynchronous object ROI detection from a sequence of frames according to various embodiments.

Claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Makkonen, Biswas and Xiong, and further in view of Yen.
Regarding claim 18, the combination of Makkonen, Biswas and Xiong discloses the elements of the claimed invention as noted but does not disclose the further limitation of cropping, by the processor, the received source media data element to generate a new media data element based on the predicted at least one first ROI, wherein the generated new media data element is a subset of the source media data element.  However, Yen discloses:
Yen [0123] The DL system 926 can perform ROI clipping 927 using the generated ROIs. ROI clipping 927 includes cropping the one or more ROIs from the video frame 914 to generate a cropped video frame (which can also be referred to as a cropped image). A cropped video frame is illustrated as a bolded bounding box within the frame 925 shown in FIG. 9. The cropped video frame includes only the portion of the video frame corresponding to an ROI. In some implementations, if more than one ROI is generated for a video frame, a separate cropped image can be generated for each ROI, resulting in multiple cropped images (or cropped video frames) being generated from the full-sized video frame. Once a cropped video frame is generated, it can then be provided to a deep learning network engine (e.g., deep learning network engine 726 or forensic deep learning network engine 727) for application of a trained neural network (e.g., a deep learning network). Using a cropped portion of the higher resolution video frame 914 (instead of the entire video frame 914) that includes one or more objects of interest reduces the processing time and complexity of the deep network needed to process the video frame. As noted previously, using cropped frames reduces and can even eliminate the problem of classifying small objects. An object in the cropped image is large relative to the frame height, allowing the deep learning network engine to more accurately classify the object, as illustrated in the chart shown in FIG. 5.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of Makkonen, Biswas and Xiong to obtain the above limitation based on the teachings of Yen for the purpose of cropping the one or more ROIs from the video frame to generate a cropped video frame (which can also be referred to as a cropped image).
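
For illustration only, the ROI clipping described in Yen [0123], in which a separate cropped image is generated for each ROI of a video frame, can be sketched as follows. This is a minimal sketch and is not code from Yen or from the application: the function name `crop_rois`, the `(x, y, width, height)` box format, the clamping of boxes to the frame, and the nested-list frame representation are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Frame = List[List[int]]          # rows of pixel values (assumed grayscale)
Box = Tuple[int, int, int, int]  # (x, y, width, height), an assumed ROI format


def crop_rois(frame: Frame, rois: Sequence[Box]) -> List[Frame]:
    """Return one cropped frame per ROI; each crop is a subset of the source."""
    height = len(frame)
    width = len(frame[0]) if frame else 0
    crops = []
    for x, y, w, h in rois:
        # Clamp the box to the frame so a crop never reads outside the source.
        x0, y0 = max(0, x), max(0, y)
        x1, y1 = min(width, x + w), min(height, y + h)
        if x0 >= x1 or y0 >= y1:
            continue  # the ROI lies entirely outside the frame; skip it
        crops.append([row[x0:x1] for row in frame[y0:y1]])
    return crops


# A hypothetical 4x6 frame whose pixel value encodes its position
# (row * 10 + column), with two ROIs; the second extends past the right edge.
frame = [[r * 10 + c for c in range(6)] for r in range(4)]
crops = crop_rois(frame, [(1, 1, 3, 2), (4, 0, 5, 2)])
```

Each returned crop contains only pixels of the source frame, which corresponds to the cropped video frame including only the portion of the video frame corresponding to an ROI.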

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ETIENNE PIERRE LEROUX whose telephone number is (571) 272-4022. The examiner can normally be reached Monday through Friday, 8:00 am to 4:30 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Apu Mofiz, can be reached at (571) 272-4080. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/ETIENNE P LEROUX/
Primary Examiner of Art Unit 2161