DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 8 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Peng et al. US-PGPUB No. 2015/0117703 (hereinafter Peng) in view of Yao et al. US-PGPUB No. 2020/0143205 (hereinafter Yao) and Eledath et al. US-PGPUB No. 2016/0378861 (hereinafter Eledath).
Re Claim 1: 
Peng teaches a processing method, performed by at least one processor, for an augmented reality scene, the processing method comprising: 
determining, by the at least one processor, a target video frame in a currently captured video (Peng teaches at Paragraph 0040 that key frames are extracted from input video sequences to summarize and represent the video sequences to help object tracking and recognition. Since only frames that contain detected objects are useful, the frames with no objects detected are simply filtered out and only the frames that contain detected objects are kept). 

determining, by the at least one processor, an object area in the target video frame based on a box selection model;
determining, by the at least one processor, a category of a target object in the object area based on a classification model used to classify an object in the object area; 
obtaining, by the at least one processor, augmented reality scene information associated with the category of the target object; and 
performing, by the at least one processor, augmented reality processing on the object area in the target video frame and the augmented reality scene information, to obtain the augmented reality scene. 
Yao explicitly teaches the claim limitations:
determining, by the at least one processor, an object area in the target video frame based on a box selection model (Yao teaches at FIG. 1 and Paragraph 0035 that the final image 114 has bounding boxes for each of three detected objects and at Paragraph 0095 scoring a plurality of bounding box regions of the image and detecting and classifying objects in the proposed regions using the feature maps);
determining, by the at least one processor, a category of a target object in the object area based on a classification model used to classify an object in the object area (Yao teaches at FIG. 1 and Paragraph 0035 that the final image 114 has bounding boxes for each of three detected objects and at Paragraph 0095 scoring a plurality of bounding box regions of the image and detecting and classifying objects in the proposed regions using the feature maps).  
It would have been obvious to one of ordinary skill in the art before the filing date of the instant application to have combined Yao and Peng to have provided bounding boxes for the 
Peng and Yao suggest the claim limitations:
obtaining, by the at least one processor, augmented reality scene information associated with the category of the target object (Yao teaches at Paragraph 0060 and FIGS. 6-7 that the region proposals are indicated with the rectangular boxes super-imposed over objects in the image); and 
performing, by the at least one processor, augmented reality processing on the object area in the target video frame and the augmented reality scene information, to obtain the augmented reality scene (Yao teaches at Paragraph 0060 and FIGS. 6-7 that the region proposals are indicated with the rectangular boxes super-imposed over objects in the image).
Eledath teaches the claim limitations:
determining, by the at least one processor, an object area in the target video frame based on a box selection model (Eledath teaches at FIG. 5 and Paragraph 0038 that the system 110 creates a link between the image of the person and the person’s name and also creates a link between the image of the person and the car and at Paragraph 0110-0111 that Box 810 explains the graphical overlays 828 summarize the results of the intelligent image analysis by the system 110….This is shown in the image 904 by the bounding box surrounding the man’s face and the text box overlaid on the image 904 indicates that the system 110 provides feedback to let the user know that the user’s inquiry has been received and is being processed…a text label is also overlaid below the bounding box indicating “face detected” and at Paragraph 0133 the system 110 establishes and preserves links between the images and the corresponding text content);
determining, by the at least one processor, a category of a target object in the object area based on a classification model used to classify an object in the object area (Eledath teaches at FIG. 5 and Paragraph 0038 that the system 110 creates a link between the image of the person and the person’s name and also creates a link between the image of the person and the car and at Paragraph 0110-0111 that Box 810 explains the graphical overlays 828 summarize the results of the intelligent image analysis by the system 110….This is shown in the image 904 by the bounding box surrounding the man’s face and the text box overlaid on the image 904 indicates that the system 110 provides feedback to let the user know that the user’s inquiry has been received and is being processed…a text label is also overlaid below the bounding box indicating “face detected” and at Paragraph 0133 the system 110 establishes and preserves links between the images and the corresponding text content);
obtaining, by the at least one processor, augmented reality scene information associated with the category of the target object (Eledath teaches at FIG. 5 and Paragraph 0038 that the system 110 creates a link between the image of the person and the person’s name and also creates a link between the image of the person and the car and at Paragraph 0110-0111 that Box 810 explains the graphical overlays 828 summarize the results of the intelligent image analysis by the system 110….This is shown in the image 904 by the bounding box surrounding the man’s face and the text box overlaid on the image 904 indicates that the system 110 provides feedback to let the user know that the user’s inquiry has been received and is being processed…a text label is also overlaid below the bounding box indicating “face detected” and at Paragraph 0133 the system 110 establishes and preserves links between the images and the corresponding text content); and 
performing, by the at least one processor, augmented reality processing on the object area in the target video frame and the augmented reality scene information, to obtain the augmented reality scene (Eledath teaches at FIG. 5 and Paragraph 0038 that the system 110 creates a link between the image of the person and the person’s name and also creates a link between the image of the person and the car and at Paragraph 0110-0111 that Box 810 explains the graphical overlays 828 summarize the results of the intelligent image analysis by the system 110….This is shown in the image 904 by the bounding box surrounding the man’s face and the text box overlaid on the image 904 indicates that the system 110 provides feedback to let the user know that the user’s inquiry has been received and is being processed…a text label is also overlaid below the bounding box indicating “face detected” and at Paragraph 0133 the system 110 establishes and preserves links between the images and the corresponding text content).
It would have been obvious to one of ordinary skill in the art before the filing date of the instant application to incorporate the augmented reality labels of Eledath into the object detection and classification system of Yao and Peng so as to display the category labels on the objects. One of ordinary skill in the art would have been motivated to display the category label after determining a category of each object using the machine learning model.
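For illustration only, the combined flow mapped above (frames without detections dropped, per Peng; bounding boxes and categories, per Yao; labels overlaid on the boxes, per Eledath) can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any cited reference or of the instant application; `detect_boxes`, `classify`, `SCENE_INFO`, and the box format are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # hypothetical (x1, y1, x2, y2) pixel corners


@dataclass
class Overlay:
    box: Box
    category: str
    scene_info: str


# Hypothetical lookup from object category to augmented reality scene information.
SCENE_INFO: Dict[str, str] = {"car": "car info card", "face": "name label"}


def process_frame(
    frame: object,
    detect_boxes: Callable[[object], List[Box]],
    classify: Callable[[object, Box], str],
) -> List[Overlay]:
    """Detect object areas, classify each one, and attach scene information."""
    overlays = []
    for box in detect_boxes(frame):
        category = classify(frame, box)
        info = SCENE_INFO.get(category)
        if info is not None:  # categories without scene information are skipped
            overlays.append(Overlay(box, category, info))
    return overlays


def process_video(frames, detect_boxes, classify):
    """Return (frame index, overlays) pairs; frames yielding no overlay are dropped."""
    results = []
    for index, frame in enumerate(frames):
        overlays = process_frame(frame, detect_boxes, classify)
        if overlays:
            results.append((index, overlays))
    return results
```

The detector and classifier are passed in as callables so that the sketch does not assume any particular model.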
Re Claim 8: 
Claim 8 recites a system for implementing an augmented reality scene, the system comprising: at least one memory configured to store computer program code; and at least one processor configured to access the at least one memory and operate as instructed by the computer program code, the computer program code including: 

Claim 8 parallels claim 1 in the form of an apparatus claim and is subject to the same rationale of rejection as claim 1.
Moreover, Eledath and Yao further teach the claim limitation of a system for implementing an augmented reality scene, the system comprising: at least one memory configured to store computer program code; and at least one processor configured to access the at least one memory and operate as instructed by the computer program code (Eledath teaches at Paragraph 0125 that software is stored in computer memory, at Paragraph 0145 a processor 412 coupled to the memory 414, and at Paragraph 0152 a plurality of instructions embodied in memory accessible by a processor of at least one of the computing devices, where the instructions are configured to cause the computing system to execute one or more image processing algorithms…augment the scene with a virtual element relating to the correlation between the at least one visual elements extracted from the scene and the knowledge accessible to the computing system. Yao teaches at Paragraph 0071 that the processor 4 is coupled to the image processing chip to drive the processes and at Paragraph 0098 a computer-readable medium having instructions that, when operated on by the computer, cause the computer to perform operations of the method).
Re Claim 15: 
Claim 15 recites one or more non-transitory computer storage mediums storing computer readable instructions, the computer readable instructions, when executed by one or more processors, causing the one or more processors to:
determine a target video frame in a currently captured video; determine an object area in the target video frame based on a box selection model; determine a category of a target object in the object area based on a classification model used to classify an object in the object area; obtain augmented reality scene information associated with the category of the target object; and perform augmented reality processing on the object area in the target video frame and the augmented reality scene information, to obtain an augmented reality scene.
Claim 15 parallels claim 1 in the form of a computer program product claim and is subject to the same rationale of rejection as claim 1.
Moreover, Eledath and Yao further teach the claim limitation of one or more non-transitory computer storage mediums storing computer readable instructions, the computer readable instructions, when executed by one or more processors, causing the one or more processors to [perform the method of claim 1] (Eledath teaches at Paragraph 0125 that software is stored in computer memory, at Paragraph 0145 a processor 412 coupled to the memory 414, and at Paragraph 0152 a plurality of instructions embodied in memory accessible by a processor of at least one of the computing devices, where the instructions are configured to cause the computing system to execute one or more image processing algorithms…augment the scene with a virtual element relating to the correlation between the at least one visual elements extracted from the scene and the knowledge accessible to the computing system. Yao teaches at Paragraph 0071 that the processor 4 is coupled to the image processing chip to drive the processes and at Paragraph 0098 a computer-readable medium having instructions that, when operated on by the computer, cause the computer to perform operations of the method).

Claims 2, 9 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Peng et al. US-PGPUB No. 2015/0117703 (hereinafter Peng) in view of Yao et al. US-PGPUB No. 2020/0143205 (hereinafter Yao) and Eledath et al. US-PGPUB No. 2016/0378861 (hereinafter Eledath).
Re Claim 2: 
Claim 2 encompasses the same scope of invention as claim 1 except for the additional claim limitation that the determining the target video frame in the currently captured video comprises: obtaining, by the at least one processor, all video frames within a preset frame range in the currently captured video; determining, by the at least one processor, feature points in the video frames; determining, by the at least one processor, whether the video is stable according to pixel locations of a corresponding feature point in the respective video frames; and based on the video being determined to be stable, determining, by the at least one processor, the target video frame from among the video frames.
Peng further teaches the claim limitation that the determining the target video frame in the currently captured video comprises: obtaining, by the at least one processor, all video frames within a preset frame range in the currently captured video; determining, by the at least one processor, feature points in the video frames (Peng teaches at Paragraph 0029 that the input video content is processed frame by frame and the frames in which the objects of interest may appear would be recorded with the bounding box of each object on each of these frames…generation of the hint information refers to obtaining a small subset of frames from input video sequences to summarize and represent the video sequences and at Paragraph 0038 that object detection algorithms typically use extracted features and learning algorithms to detect instances of an object category); determining, by the at least one processor, whether the video is stable according to pixel locations of a corresponding feature point in the respective video frames; and based on the video being determined to be stable, determining, by the at least one processor, the target video frame from among the video frames (Peng teaches at Paragraph 0040 that key frames are extracted from input video sequences to summarize and represent the video sequences to help object tracking and recognition. Since only frames that contain detected objects are useful, the frames with no objects detected are simply filtered out and only the frames that contain detected objects are kept).
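For illustration only, the stability determination recited in claim 2 (comparing the pixel locations of a corresponding feature point across the frames in a preset range) can be sketched as follows. This is a minimal sketch, not taken from Peng or from the instant application; the `max_shift` threshold, the Euclidean displacement test, and the choice of the last frame as the target frame are assumptions.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) pixel location of one feature point


def is_stable(tracks: Sequence[Sequence[Point]], max_shift: float = 2.0) -> bool:
    """tracks[i][k] is the pixel location of feature point k in frame i.

    The video is treated as stable when no feature point moves more than
    max_shift pixels between consecutive frames (assumed criterion).
    """
    for prev, curr in zip(tracks, tracks[1:]):
        for (x0, y0), (x1, y1) in zip(prev, curr):
            if ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5 > max_shift:
                return False
    return True


def pick_target_frame(
    tracks: Sequence[Sequence[Point]], max_shift: float = 2.0
) -> Optional[int]:
    """Return the index of the last frame if the range is stable, else None."""
    if len(tracks) == 0 or not is_stable(tracks, max_shift):
        return None
    return len(tracks) - 1
```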
Re Claim 9: 
Claim 9 encompasses the same scope of invention as claim 8 except for the additional claim limitation that the target video frame determining code is further configured to cause the at least one processor to: obtain all video frames within a preset frame range in the currently captured video; determine feature points in the video frames; determine whether the video is stable according to pixel locations of a corresponding feature point in the respective video frames; and based on the video being determined as stable, determining the target video frame from among the video frames.
Claim 9 parallels claim 2 in the form of an apparatus claim and is subject to the same rationale of rejection as claim 2.

Re Claim 16: 
Claim 16 encompasses the same scope of invention as claim 15 except for the additional claim limitation that the computer readable instructions further cause the one or more processor to determine the target video frame in the currently captured video by: obtaining all video frames within a preset frame range in the currently captured video; determining feature points in the video frames; determining whether the video is stable according to pixel locations of a corresponding feature point in the respective video frames; and based on the video being determined to be stable, determining the target video frame from among the video frames.
Claim 16 parallels claim 2 in the form of a computer program product claim and is subject to the same rationale of rejection as claim 2.

Claims 3, 10 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Peng et al. US-PGPUB No. 2015/0117703 (hereinafter Peng) in view of Yao et al. US-PGPUB No. 2020/0143205 (hereinafter Yao), Eledath et al. US-PGPUB No. 2016/0378861 (hereinafter Eledath) and Blumstein-Koren et al. US-PGPUB No. 2012/0229629 (hereinafter Blumstein-Koren).
Re Claim 3: 
Claim 3 encompasses the same scope of invention as claim 1 except for the additional claim limitation that the determining the target video frame in the currently captured video comprises: detecting, by the at least one processor, all target feature points of a current video frame in the currently captured video; calculating and recording, by the at least one 
However, Blumstein-Koren teaches the claim limitation that the determining the target video frame in the currently captured video comprises:
detecting, by the at least one processor, all target feature points of a current video frame in the currently captured video (Blumstein-Koren teaches at Paragraph 0049 the analysis of the AOI within a queried frame or image may be performed by an application server and the extracted query feature vector may include any appearance features of the detected object from the AOI and it may be used for matching with relevant areas in other frames selected from a video stream);
calculating and recording, by the at least one processor, a current mean and a current variance of pixel coordinates of the target feature points (Blumstein-Koren teaches at Paragraph 0049 that the appearance features include mean and/or standard deviation of the object bitmap);
calculating, by the at least one processor, a video frame difference value according to the current mean, the current variance, a previously calculated mean and a previously calculated variance (Blumstein-Koren teaches at Paragraphs 0049-0050 that the analysis process may include scanning all relevant streams for finding a relevant stream which may include a frame or a plurality of frames having an area similar to the selected AOI and similarity may be defined in terms of a distance metric between the query feature vector (mean and standard deviation) and the feature vector extracted from the stream and at Paragraph 0053 selecting only a plurality of frames from the video stream may be performed and at Paragraphs 0057-0059 that the analysis process may include comparing the query feature vector to the feature vector of each of the plurality of frames in order to measure appearance similarity and the analysis process may include determining whether a mismatch exists between the first frame and each one of the plurality of frames….if there is a match, it may indicate that the object still appears in the scene and therefore the search may continue to the next frame or the next frame as defined by a sequence skipping certain frames); and
using, by the at least one processor, the current video frame as the target video frame based on the video frame difference value not satisfying a preset change condition (Blumstein-Koren teaches at Paragraphs 0049-0050 that the analysis process may include scanning all relevant streams for finding a relevant stream which may include a frame or a plurality of frames having an area similar to the selected AOI and similarity may be defined in terms of a distance metric between the query feature vector (mean and standard deviation) and the feature vector extracted from the stream and at Paragraph 0053 selecting only a plurality of frames from the video stream may be performed and at Paragraphs 0057-0059 that the analysis process may include comparing the query feature vector to the feature vector of each of the plurality of frames in order to measure appearance similarity and the analysis process may include determining whether a mismatch exists between the first frame and each one of the plurality of frames….if there is a match, it may indicate that the object still appears in the scene and therefore the search may continue to the next frame or the next frame as defined by a sequence skipping certain frames).
It would have been obvious to one of ordinary skill in the art before the filing date of the instant application to have incorporated Blumstein-Koren’s object detection using the object features such as the mean and standard deviation of the object region bitmap to have characterized the object region feature points and to have determined video frame difference 
Re Claim 10: 
Claim 10 encompasses the same scope of invention as claim 8 except for the additional claim limitation that the target video frame determining code is further configured to cause the at least one processor to: detect all target feature points of a current video frame in the currently captured video; calculate and record a current mean and a current variance of pixel coordinates of the target feature points; calculate a video frame difference value according to the current mean, the current variance, a previously calculated mean and a previously calculated variance; and using the current video frame as the target video frame based on the video frame difference value not satisfying a preset change condition.
Claim 10 parallels claim 3 in the form of an apparatus claim and is subject to the same rationale of rejection as claim 3.
Re Claim 17: 
Claim 17 encompasses the same scope of invention as claim 15 except for the additional claim limitation that the computer readable instructions further cause the one or more processor to determine the target video frame in the currently captured video by: detecting all target feature points of a current video frame in the currently captured video; calculating and recording a current mean and a current variance of pixel coordinates of the target feature points; calculating a video frame difference value according to the current mean, the current variance, a previously calculated mean and a previously calculated variance; and using the current video 
Claim 17 parallels claim 3 in the form of a computer program product claim and is subject to the same rationale of rejection as claim 3.
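For illustration only, the computation recited in claims 3, 10 and 17 (a mean and a variance of feature point pixel coordinates, and a frame difference value derived from the current and previous statistics) can be sketched as follows. The sketch follows the claim language rather than Blumstein-Koren, whose features are described above as the mean and/or standard deviation of the object bitmap. The sum-of-absolute-differences metric and the `threshold` value are assumptions.

```python
from statistics import fmean, pvariance
from typing import Sequence, Tuple

Point = Tuple[float, float]
Stats = Tuple[float, float, float, float]  # (mean_x, mean_y, var_x, var_y)


def coordinate_stats(points: Sequence[Point]) -> Stats:
    """Mean and population variance of the x and y pixel coordinates."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (fmean(xs), fmean(ys), pvariance(xs), pvariance(ys))


def frame_difference(current: Stats, previous: Stats) -> float:
    """Assumed difference measure: sum of absolute differences of the statistics."""
    return sum(abs(c - p) for c, p in zip(current, previous))


def is_target_frame(current: Stats, previous: Stats, threshold: float = 5.0) -> bool:
    """True when the assumed change condition (difference > threshold) is not satisfied."""
    return frame_difference(current, previous) <= threshold
```

A frame whose statistics barely move relative to the previous frame is accepted as the target frame; a large jump in the statistics rejects it.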

Claims 4-7, 11-14 and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Peng et al. US-PGPUB No. 2015/0117703 (hereinafter Peng) in view of Yao et al. US-PGPUB No. 2020/0143205 (hereinafter Yao) and Eledath et al. US-PGPUB No. 2016/0378861 (hereinafter Eledath).
Re Claim 4: 
Claim 4 encompasses the same scope of invention as claim 1 except for the additional claim limitation that the box selection model is obtained by training based on an image set and the processing method further comprises: configuring, by the at least one processor, an initial model; obtaining, by the at least one processor, a predicted image in the image set and description information of the predicted image, the description information comprising target area location information of a target area in the predicted image; analyzing, by the at least one processor, the predicted image using the initial model and determining a predicted image area in the predicted image; and updating the initial model based on difference information between image location information of the predicted image area and the target area location information not satisfying a preset modeling condition.
Yao further teaches the claim limitation that the box selection model is obtained by training based on an image set and the processing method further comprises:
configuring, by the at least one processor, an initial model (Yao teaches at Step 202 of FIG. 2 and Paragraph 0038 pre-training a deep CNN model and at Paragraph 0063 that the convolution layers may be based on a pre-trained CNN model based on pre-classified images);
obtaining, by the at least one processor, a predicted image in the image set and description information of the predicted image (Yao teaches at Paragraph 0062 that the process starts at 402 with receiving a digital image and at Paragraph 0060 and FIGS. 6-7 that the region proposals are indicated with the rectangular boxes super-imposed over objects in the image. Yao teaches at Paragraph 0025 that the respective Hyper Features are computed over input images efficiently and at Paragraph 0027 that the Hyper Features over each input image are extracted….the region proposals are generated),
the description information comprising target area location information of a target area in the predicted image (Yao teaches at Paragraph 0064 that the convolutional neural network may be trained by first training for region proposals using the pre-classified images and then training for object detection using the region proposal training and at Paragraph 0060 and FIGS. 6-7 that the region proposals are indicated with the rectangular boxes super-imposed over objects in the image. Yao teaches at Paragraph 0062 that the process starts at 402 with receiving a digital image and at Paragraph 0064 that the object detection training may use region of interest pooling of the region proposal images and at Paragraph 0066 that region proposals are generated using the combined feature map by scoring bounding box regions of image and the region proposals may be generated by first generating a score for the detection of an object and then generating bounding box regression and at Paragraph 0067 that objects in the proposed regions are detected and classified using the combined feature map. Yao teaches at Paragraph 0025 that the respective Hyper Features are computed over input images efficiently and at Paragraph 0027 that the Hyper Features over each input image are extracted….the region proposals are generated and at Paragraph 0030 that the score is a measure of the probability that the ROI includes an object instance and at Paragraph 0035 that the final image 114 has bounding boxes for each of three detected objects);
analyzing, by the at least one processor, the predicted image using the initial model and determining a predicted image area in the predicted image (Yao teaches at Paragraph 0062 that the process starts at 402 with receiving a digital image and at Paragraph 0060 and FIGS. 6-7 that the region proposals are indicated with the rectangular boxes super-imposed over objects in the image. Yao teaches at Paragraph 0038 that the process of FIG. 2 starts at 202 with pre-training a deep CNN model. Yao teaches at Paragraph 0021 that the resized image is then provided to the network in which the first operation is directly initialized as the first N convolutional layers of a pre-trained CNN model and at Paragraph 0035 that the final image 114 has bounding boxes for each of three detected objects); and
updating the initial model based on difference information between image location information of the predicted image area and the target area location information not satisfying a preset modeling condition (Yao teaches at Paragraph 0030 that the BBR allows the bounding box for each possible object to be refined and at Paragraph 0042 that the region proposal Hyper-Net training is fine-tuned and at Paragraph 0043 that the object detection Hyper-Net is fine-tuned and at Paragraphs 0044-0046 that the unified Hyper-Net is output as the final model…to jointly handle region proposal generation and object detection. Yao teaches at Paragraph 0064 that the convolutional neural network may be trained by first training for region proposals using the pre-classified images and then training for object detection using the region proposal training. Yao teaches at Paragraphs 0041-0045 that a complete object detection Hyper-Net 108 is trained and the region proposal Hyper-Net training is fine-tuned and the object detection Hyper-Net is fine-tuned and the unified Hyper-Net is output as the final model and the region proposals generated in each image are used in ROI pooling of the Hyper-Net for object detection. Yao teaches at Paragraph 0030 the BBR allows the bounding box for each possible object to be refined).
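For illustration only, the update condition recited in claim 4 (updating the initial model when the difference between the predicted image area and the target area location does not satisfy a preset modeling condition) can be sketched as follows. This is a minimal sketch, not Yao's Hyper-Net training procedure; the corner-coordinate difference measure, the `tolerance` value, and the `model_predict` and `update` callables are hypothetical.

```python
from typing import Callable, Iterable, Tuple

Box = Tuple[float, float, float, float]  # hypothetical (x1, y1, x2, y2) pixel corners


def box_difference(predicted: Box, target: Box) -> float:
    """Assumed difference measure: sum of absolute corner-coordinate differences."""
    return sum(abs(p - t) for p, t in zip(predicted, target))


def needs_update(predicted: Box, target: Box, tolerance: float = 4.0) -> bool:
    """True when the assumed modeling condition (difference <= tolerance) is not met."""
    return box_difference(predicted, target) > tolerance


def training_pass(
    model_predict: Callable[[object], Box],
    update: Callable[[object, Box, Box], None],
    samples: Iterable[Tuple[object, Box]],
    tolerance: float = 4.0,
) -> int:
    """Run one pass over (image, target box) samples; return the number of updates."""
    updates = 0
    for image, target in samples:
        predicted = model_predict(image)
        if needs_update(predicted, target, tolerance):
            update(image, predicted, target)  # e.g. one optimizer step in a real trainer
            updates += 1
    return updates
```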
Re Claim 5: 
Claim 5 encompasses the same scope of invention as claim 4 except for the additional claim limitation that the obtaining the description information of the predicted image comprises: displaying, by the at least one processor, the predicted image on a user interface; receiving, by the at least one processor, a box selection operation on the user interface; and obtaining, by the at least one processor, the description information based on location information of a prediction box selected in the box selection operation as the target area location information, wherein the target area location information comprises pixel upper left corner coordinates and pixel lower right corner coordinates of the prediction box.
Yao further teaches the claim limitation that the obtaining the description information of the predicted image comprises: displaying, by the at least one processor, the predicted image on a user interface (Yao teaches at Paragraph 0062 that the process starts at 402 with receiving a digital image and at Paragraph 0060 and FIGS. 6-7 that the region proposals are indicated with the rectangular boxes super-imposed over objects in the image); receiving, by the at least one processor, a box selection operation on the user interface; and obtaining, by the at least one processor, the description information based on location information of a prediction box selected in the box selection operation as the target area location information, wherein the target area location information comprises pixel upper left corner coordinates and pixel lower right corner coordinates of the prediction box (Yao teaches at Paragraph 0020 that the feature maps are different in width, height and depth and at Paragraph 0050 the Hyper Feature is constructed by first reshaping 3D feature maps and at Paragraphs 0066-0067 that region proposals are generated using the combined 3D feature map by scoring bounding box regions of image. The bounding box regression may include location offsets for objects in the combined feature map. Since the feature map is a 3D feature map, the bounding box is a 3D bounding box having a depth dimension. 
Yao teaches at Paragraph 0062 the process starts at 402 with receiving a digital image and Yao teaches at Paragraph 0060 and FIGS. 6-7 that the region proposals are indicated with the rectangular boxes super-imposed over objects in the image).
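For illustration only, the target area location information recited in claim 5 (pixel upper left corner coordinates and pixel lower right corner coordinates of the prediction box) can be derived from a box selection as sketched below. This is a minimal sketch, not taken from Yao or from the instant application; the function name, the dictionary keys, and the image coordinate convention (origin at the top left, y increasing downward) are assumptions.

```python
from typing import Dict, Tuple

Pixel = Tuple[int, int]  # (x, y) with the origin at the top-left of the image


def box_from_selection(start: Pixel, end: Pixel) -> Dict[str, Pixel]:
    """Normalize two opposite corners of a selection into upper-left / lower-right."""
    (x0, y0), (x1, y1) = start, end
    return {
        "upper_left": (min(x0, x1), min(y0, y1)),
        "lower_right": (max(x0, x1), max(y0, y1)),
    }
```

Normalizing the corners means the result is the same whichever direction the selection is dragged in.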
Re Claim 6: 
Claim 6 encompasses the same scope of invention as claim 1 except for the additional claim limitation that the classification model is obtained by training based on an image set and the processing method further comprises: configuring, by the at least one processor, an initial model; obtaining, by the at least one processor, a predicted image in the image set and description information of the predicted image, the description information comprising a classification identifier of the predicted image; analyzing, by the at least one processor, the predicted image using the initial model to obtain a predicted category of the predicted image; and
updating the initial model based on the predicted category of the predicted image being different from a category indicated by the classification identifier.
Yao further teaches the claim limitation that the classification model is obtained by training based on an image set and the processing method further comprises: configuring, by the at least one processor, an initial model (Yao teaches at Step 202 of FIG. 2 and Paragraph 0038 pre-training a deep CNN model and at Paragraph 0063 that the convolution layers may be based on a pre-trained CNN model based on pre-classified images); obtaining, by the at least one processor, a predicted image in the image set and description information of the predicted image, the description information comprising a classification identifier of the predicted image (Yao teaches at Paragraph 0062 that the process starts at 402 with receiving a digital image and at Paragraph 0060 and FIGS. 6-7 that the region proposals are indicated with the rectangular boxes super-imposed over objects in the image and at Paragraph 0064 that the object detection training may use region of interest pooling of the region proposal images and at Paragraph 0066 that region proposals are generated using the combined feature map by scoring bounding box regions of image and the region proposals may be generated by first generating a score for the detection of an object and then generating bounding box regression and at Paragraph 0067 that objects in the proposed regions are detected and classified using the combined feature map. 
Yao teaches at Paragraph 0025 that the respective Hyper Features is computed over input images efficiently and at Paragraph 0027 that the Hyper Features over each input image are extracted….the region proposals are generated), the description information comprising target area location information of a target area in the predicted image (Yao teaches at Paragraph 0062 the process starts at 402 with receiving a digital image and Yao teaches at Paragraph 0060 and FIGS. 6-7 that the region proposals are indicated with the rectangular boxes super-imposed over objects in the image. 
Yao teaches at Paragraph 0064 that the convolutional neural network may be trained by first training for region proposals using the pre-classified images and then training for object detection using the region proposal training. 
Yao teaches at Paragraph 0025 that the respective Hyper Features are computed over input images efficiently, at Paragraph 0027 that the Hyper Features over each input image are extracted … the region proposals are generated, at Paragraph 0030 that the score is a measure of the probability that the ROI includes an object instance, and at Paragraph 0035 that the final image 114 has bounding boxes for each of three detected objects); analyzing, by the at least one processor, the predicted image using the initial model to obtain a predicted category of the predicted image; and
updating the initial model based on the predicted category of the predicted image being different from a category indicated by the classification identifier (Yao teaches at Paragraph 0030 that the BBR allows the bounding box for each possible object to be refined, at Paragraph 0042 that the region proposal Hyper-Net training is fine-tuned, at Paragraph 0043 that the object detection Hyper-Net is fine-tuned, and at Paragraphs 0044-0046 that the unified Hyper-Net is output as the final model … to jointly handle region proposal generation and object detection. 
Yao teaches at Paragraph 0062 that the process starts at 402 with receiving a digital image and at Paragraph 0060 and FIGS. 6-7 that the region proposals are indicated with the rectangular boxes super-imposed over objects in the image. 
Yao teaches at Paragraph 0038 that the process of FIG. 2 starts at 202 with pre-training a deep CNN model. Yao teaches at Paragraph 0021 that the resized image is then provided to the network in which the first operation is directly initialized as the first N convolutional layers of a pre-trained CNN model and at Paragraph 0035 that the final image 114 has bounding boxes for each of three detected objects); and updating the initial model based on difference information between image location information of the predicted image area and the target area location information not satisfying a preset modeling condition (Yao teaches at Paragraph 0030 that the BBR allows the bounding box for each possible object to be refined, at Paragraph 0042 that the region proposal Hyper-Net training is fine-tuned, at Paragraph 0043 that the object detection Hyper-Net is fine-tuned, and at Paragraphs 0044-0046 that the unified Hyper-Net is output as the final model … to jointly handle region proposal generation and object detection. 
Yao teaches at Paragraph 0064 that the convolutional neural network may be trained by first training for region proposals using the pre-classified images and then training for object detection using the region proposal training. 
Yao teaches at Paragraphs 0041-0045 that a complete object detection Hyper-Net 108 is trained, that the region proposal Hyper-Net training is fine-tuned, that the object detection Hyper-Net is fine-tuned, that the unified Hyper-Net is output as the final model, and that the region proposals generated in each image are used in ROI pooling of the Hyper-Net for object detection. Yao teaches at Paragraph 0030 that the BBR allows the bounding box for each possible object to be refined).
Re Claim 7: 
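Claim 7 encompasses the same scope of invention as claim 1, except for the additional claim limitation that the performing the augmented reality processing on the object area in the target video frame and the augmented reality scene information comprises: tailoring, by the at least one processor, the target video frame to obtain a tailored image comprising the object area; performing, by the at least one processor, three-dimensional superimposition on the augmented reality scene information and the tailored image; generating, by the at least one processor, a video frame of the augmented reality scene according to a subsequent image that is captured after the three-dimensional superimposition; and displaying, by the at least one processor, the video frame of the augmented reality scene.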
Yao further teaches the claim limitation that the performing the augmented reality processing on the object area in the target video frame and the augmented reality scene information comprises: tailoring, by the at least one processor, the target video frame to obtain a tailored image comprising the object area (Yao teaches at Paragraph 0062 that the process starts at 402 with receiving a digital image and at Paragraph 0060 and FIGS. 6-7 that the region proposals are indicated with the rectangular boxes super-imposed over objects in the image); performing, by the at least one processor, three-dimensional superimposition on the augmented reality scene information and the tailored image (Yao teaches at Paragraph 0020 that the feature maps are different in width, height and depth, at Paragraph 0050 that the Hyper Feature is constructed by first reshaping 3D feature maps, and at Paragraphs 0066-0067 that region proposals are generated using the combined 3D feature map by scoring bounding box regions of the image. The bounding box regression may include location offsets for objects in the combined feature map. Since the feature map is a 3D feature map, the bounding box is a 3D bounding box having a depth dimension. 
Yao teaches at Paragraph 0062 that the process starts at 402 with receiving a digital image and at Paragraph 0060 and FIGS. 6-7 that the region proposals are indicated with the rectangular boxes super-imposed over objects in the image); generating, by the at least one processor, a video frame of the augmented reality scene according to a subsequent image that is captured after the three-dimensional superimposition; and displaying, by the at least one processor, the video frame of the augmented reality scene (Yao teaches at Paragraph 0020 that the feature maps are different in width, height and depth, at Paragraph 0050 that the Hyper Feature is constructed by first reshaping 3D feature maps, and at Paragraphs 0066-0067 that region proposals are generated using the combined 3D feature map by scoring bounding box regions of the image. The bounding box regression may include location offsets for objects in the combined feature map. Since the feature map is a 3D feature map, the bounding box is a 3D bounding box having a depth dimension. 
Yao teaches at Paragraph 0062 that the process starts at 402 with receiving a digital image and at Paragraph 0060 and FIGS. 6-7 that the region proposals are indicated with the rectangular boxes super-imposed over objects in the image).

Re Claim 11: 
Claim 11 encompasses the same scope of invention as claim 8, except for the additional claim limitation that the box selection model is obtained by training based on an image set and the computer program code further includes: configuration code configured to cause the at least one processor to configure an initial model; obtaining code configured to cause the at least one processor to obtain a predicted image in the image set and description information of the predicted image, the description information comprising target area location information of a target area in the predicted image; analyzing code configured to cause the at least one processor to analyze the predicted image using the initial model and determine a predicted image area in the predicted image; and updating code configured to cause the at least one processor to update the initial model based on difference information between image location information of the predicted image area and the target area location information not satisfying a preset modeling condition.
Claim 11 parallels claim 4 in the form of an apparatus claim and is subject to the same rationale of rejection as claim 4. 
Re Claim 12: 
Claim 12 encompasses the same scope of invention as claim 11, except for the additional claim limitation that the obtaining code is further configured to cause the at least one processor to: display the predicted image on a user interface; receive a box selection operation on the user interface; and obtaining the description information based on location information of a prediction box selected in the box selection operation as the target area location information, wherein the target area location information comprises pixel upper left corner coordinates and pixel lower right corner coordinates of the prediction box.
Claim 12 parallels claim 5 in the form of an apparatus claim and is subject to the same rationale of rejection as claim 5. 
Re Claim 13: 
Claim 13 encompasses the same scope of invention as claim 8, except for the additional claim limitation that the classification model is obtained by training based on an image set and the computer program code further includes: configuration code configured to cause the at least one processor to configure an initial model; obtaining code configured to cause the at least one processor to obtain a predicted image in the image set and description information of the predicted image, the description information comprising a classification identifier of the predicted image; analyzing code configured to cause the at least one processor to analyze the predicted image using the initial model and obtain a predicted category of the predicted image; 
Claim 13 parallels claim 6 in the form of an apparatus claim and is subject to the same rationale of rejection as claim 6. 
Re Claim 14: 
Claim 14 encompasses the same scope of invention as claim 8, except for the additional claim limitation that the augmented reality processing code is further configured to cause the at least one processor to: tailor the target video frame to obtain a tailored image comprising the object area; perform three-dimensional superimposition on the augmented reality scene information and the tailored image; generate a video frame of the augmented reality scene according to a subsequent image that is captured after the three-dimensional superimposition; and display the video frame of the augmented reality scene.
Claim 14 parallels claim 7 in the form of an apparatus claim and is subject to the same rationale of rejection as claim 7. 
Re Claim 18: 
Claim 18 encompasses the same scope of invention as claim 15, except for the additional claim limitation that the box selection model is obtained by training based on an image set; and the computer readable instructions further cause the one or more processor to: configure an initial model; obtain a predicted image in the image set and description information of the predicted image, the description information comprising target area location information of a target area in the predicted image; analyze the predicted image using the initial model and determine a predicted image area in the predicted image; and updating the initial model based on difference information between image location information of the predicted image area and the target area location information not satisfying a preset modeling condition.
Claim 18 parallels claim 4 in the form of a computer program product claim and is subject to the same rationale of rejection as claim 4. 
Re Claim 19: 
Claim 19 encompasses the same scope of invention as claim 18, except for the additional claim limitation that the computer readable instructions further cause the one or more processor to obtain the description information of the predicted image by: displaying the predicted image on a user interface; receiving a box selection operation on the user interface; and obtaining the description information based on location information of a prediction box selected in the box selection operation as the target area location information, wherein the target area location information comprises pixel upper left corner coordinates and pixel lower right corner coordinates of the prediction box. 
Claim 19 parallels claim 5 in the form of a computer program product claim and is subject to the same rationale of rejection as claim 5. 
Re Claim 20: 
Claim 20 encompasses the same scope of invention as claim 15, except for the additional claim limitation that the classification model is obtained by training based on an image set; and the computer readable instructions further cause the one or more processor to: configure an initial model; obtain a predicted image in the image set and description information of the predicted image, the description information comprising a classification identifier of the predicted image; analyze the predicted image using the initial model to obtain a predicted 
Claim 20 parallels claim 6 in the form of a computer program product claim and is subject to the same rationale of rejection as claim 6. 

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JIN CHENG WANG whose telephone number is (571)272-7665. The examiner can normally be reached Mon-Fri 8:00-5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Xiao Wu, can be reached at 571-272-7761. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) 





/JIN CHENG WANG/ Primary Examiner, Art Unit 2613