Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .
DETAILED ACTION
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 10/28/2021 has been entered.
Response to Amendment
Applicant's amendments and remarks submitted 10/28/2021 have been entered and considered, but are not found convincing. Claims 1, 3, 15, 17 have been amended. Claims 22-23 have been added.  In summary, claims 1-23 are pending in the application. 

Response to Arguments
Claim Rejections - 35 U.S.C. 103

Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:

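(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 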
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 

This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  Such claim limitation(s) is/are: 
 “an input that is configured to”; “a selection unit that is configured to”; “an output that is configured to”; “a media unit signature that” in claim 15.
“hybrid representation generator” in claims 16-21.
“an input that is configured to”; “an output that is configured to” are being interpreted to cover the corresponding structure described in the specification paragraph [00343]: “Input and/or output may be any suitable communications component such as a network interface card, universal serial bus (USB) port, disk reader, modem or transceiver that may be operative to use protocols such as are known in the art to communicate either directly, or indirectly, with other elements of the system.”
“Processor 4950 may include at least some out of • Multiple spanning elements 4951(q). • Multiple merge elements 4952(r). • Object detector 4953. • Cluster manager 4954. • Controller 4955. • Selection unit 4956. • Object detection determination unit 4957. • Signature generator 4958. • Movement information unit 4959. • Identifier unit 4960.”
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be 
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
1.	Claims 1-2,4, 15-16, 18, 22 are rejected under 35 U.S.C. 103 as being unpatentable over Mihail Eric, “Fast Object Detection With Fast R-CNN”, posted October, 2018, (“Eric”) in view of Lin, Tsung-Yi, et al. "Feature pyramid networks for object detection." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017 (“Lin”) 
Regarding independent claim 1, Eric teaches a method for generating a hybrid representation of a media unit, the method comprises:
receiving or generating the media unit (see section How Fast R-CNN Works, “Let’s start with our obligatory cute cat and dog photo”)


Fig.1 of Eric
processing the media unit by performing multiple iterations (see first page “In this article, we will continue in the vein of classic object detection papers by discussing Fast R-CNN. Studying this line of region proposal with convolutional network work is rewarding because it allows us to see an iterative refinement on a collection of models, each seeking to address shortcomings in its predecessor.”), wherein at least some of the multiple iterations comprises applying, by spanning elements of the iteration (see pages 3-7, “…Now for each region proposal, we run it through a set of convolutional and max-pooling layers to extract a convolutional feature map: Now, given this feature map, we run each region-of-interest (RoI) through what is called an RoI pooling layer. This layer takes an h x w RoI region and runs max-pooling across a grid of sub-regions within the RoI. The output is a fixed H x W feature map, where H and W are hyperparameters that are constant across all RoIs, regardless of dimension. For example, H x W could be a 7 x 7 square. After we have run our convolutional feature map through the RoI pooling layer, we are guaranteed a fixed-length output regardless of region proposal size. Therefore we can now execute a set of fully-connected layers to get an RoI feature vector”)
selecting, based on an output of the multiple iterations, media unit regions of interest that contributed to the output of the multiple iterations (see pages 3-7, “After we have run our convolutional feature map through the RoI pooling layer, we are guaranteed a fixed-length output regardless of region proposal size. Therefore we can now execute a set of fully-connected layers to get an RoI feature vector. Now we run that RoI feature vector through two sibling output layers: A softmax classifier that outputs probabilities for the K object classes of our training data plus a background class. A bounding box regressor that outputs refined bounding box positions for each of the K object classes… With these class probabilities and refined bounding box coordinates, we can output our final detection results for the original region proposal”); and
providing the hybrid representation (see page 7, “With these class probabilities and refined bounding box coordinates, we can output our final detection results for the original region proposal:”; see Fig.2 of Eric), wherein the hybrid representation comprises shape information regarding shapes of the media unit regions of interest (see Fig.2 of Eric, showing the bounding box of the cat), and a media unit signature that comprises identifiers that identify the media unit regions of interest (see Fig.2 of Eric, showing the classification label “cat” with score 0.8); wherein the shape information comprises polygon (see Fig.2 of Eric, where the bounding box of the cat is a polygon). Eric is understood to be silent on the remaining limitations of claim 1.


Fig.2 of Eric
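For illustration only, the RoI pooling step quoted from Eric above can be sketched as follows. This is a minimal sketch and not code from Eric: the function name `roi_max_pool`, the `(top, left, height, width)` RoI layout, and the integer sub-window boundaries are assumptions made for the example. It shows how an h x w region of arbitrary size is reduced to a fixed H x W grid by max-pooling over sub-regions.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one RoI of a 2-D feature map into a fixed out_h x out_w grid.

    feature_map: list of rows, each a list of numbers.
    roi: (top, left, height, width) in feature-map cells (hypothetical layout);
         the RoI is assumed to lie entirely inside the feature map.
    """
    top, left, h, w = roi
    pooled = []
    for i in range(out_h):
        # Row span of sub-window i; forced to contain at least one row.
        r0 = top + (i * h) // out_h
        r1 = max(r0 + 1, top + ((i + 1) * h) // out_h)
        row = []
        for j in range(out_w):
            # Column span of sub-window j; forced to contain at least one column.
            c0 = left + (j * w) // out_w
            c1 = max(c0 + 1, left + ((j + 1) * w) // out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# A 4 x 4 feature map holding the values 0..15 in row-major order.
fmap = [[4 * r + c for c in range(4)] for r in range(4)]
whole = roi_max_pool(fmap, (0, 0, 4, 4), 2, 2)   # 4 x 4 RoI -> 2 x 2 output
part = roi_max_pool(fmap, (1, 1, 3, 2), 2, 2)    # 3 x 2 RoI -> 2 x 2 output
```

Both calls return a 2 x 2 grid even though the two RoIs differ in size, which is the fixed-length property the quoted passage relies on.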
In the same field of endeavor, Lin teaches receiving or generating the media unit (see section 3. Feature Pyramid Networks, second paragraph “Our method takes a single-scale image of an arbitrary size as input, and outputs proportionally sized feature maps at multiple levels, in a fully convolutional fashion….”); processing the media unit by performing multiple iterations, wherein at least some of the multiple iterations comprises applying, by spanning elements of the iteration, dimension expansion process that are followed by a merge operation (see section 3. Feature Pyramid Networks, part Top-down pathway and lateral connections, second paragraph “Fig. 3 shows the building block that constructs our top-down feature maps. With a coarser-resolution feature map, we upsample the spatial resolution by a factor of 2 (using nearest neighbor upsampling for simplicity). The upsampled map is then merged with the corresponding bottom-up map (which undergoes a 1×1 convolutional layer to reduce channel dimensions) by element-wise addition. This process is iterated until the finest resolution map is generated. To start the iteration, we simply attach a 1×1 convolutional layer on C5 to produce the coarsest resolution map. Finally, we append a 3×3 convolution on each merged map to generate the final feature map, which is to reduce the aliasing effect of upsampling. This final set of feature maps is called {P2, P3, P4, P5}, corresponding to {C2,C3,C4,C5} that are respectively of the same spatial sizes.”); selecting, based on an output of the multiple iterations, media unit regions of interest that contributed to the output of the multiple iterations (section 4.2. Feature Pyramid Networks for Fast RCNN, “Fast R-CNN [11] is a region-based object detector in which Region-of-Interest (RoI) pooling is used to extract features. Fast R-CNN is most commonly performed on a single-scale feature map. To use it with our FPN, we need to assign RoIs of different scales to the 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the Fast R-CNN object detection method of Eric to use feature pyramid networks as taught by Lin, because this modification would take a single-scale image of an arbitrary size as input and output proportionally sized feature maps at multiple levels, in a fully convolutional fashion (section 3. Feature Pyramid Networks, second paragraph of Lin).
Thus, the combination of Eric and Lin teaches a method for generating a hybrid representation of a media unit, the method comprises: receiving or generating the media unit; processing the media unit by performing multiple iterations, wherein at least some of the multiple iterations comprises applying, by spanning elements of the iteration, dimension expansion process that are followed by a merge operation; selecting, based on an output of the multiple iterations, media unit regions of interest that contributed to the output of the multiple iterations; and providing the hybrid representation, wherein the hybrid representation comprises shape information regarding shapes of the media unit regions of interest, and a media unit signature that comprises identifiers that identify the media unit regions of interest; wherein the shape information comprises polygon.
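For illustration only, the iterated upsample-and-merge operation quoted from Lin (the operation mapped above to the claimed dimension expansion followed by a merge) can be sketched as follows. This is a minimal sketch under stated assumptions, not Lin's implementation: the names `upsample2x`, `merge`, and `top_down_pathway` are hypothetical, feature maps are single-channel nested lists, and the 1×1 and 3×3 convolutions described in Lin are omitted.

```python
def upsample2x(fmap):
    """Nearest-neighbor upsampling by a factor of 2 in both spatial dimensions."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def merge(top_down, lateral):
    """Upsample the coarser map, then add the same-size lateral map element-wise."""
    up = upsample2x(top_down)
    if len(up) != len(lateral) or len(up[0]) != len(lateral[0]):
        raise ValueError("lateral map must be exactly twice the size of the coarser map")
    return [[a + b for a, b in zip(up_row, lat_row)]
            for up_row, lat_row in zip(up, lateral)]


def top_down_pathway(bottom_up):
    """bottom_up: feature maps ordered coarsest first (e.g. C5, C4, C3).

    Returns the merged maps in the same order; the iteration starts from the
    coarsest map and repeats until the finest map has been merged.
    """
    merged = [bottom_up[0]]
    for lateral in bottom_up[1:]:
        merged.append(merge(merged[-1], lateral))
    return merged


c5 = [[1]]
c4 = [[1, 2], [3, 4]]
c3 = [[0] * 4 for _ in range(4)]
p5, p4, p3 = top_down_pathway([c5, c4, c3])
```

Each pass doubles the spatial size of the previous output before the addition, so the loop is one expansion step followed by one merge step per pyramid level.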
Regarding claim 2, Eric and Lin teach the method according to claim 1 wherein the selecting of the media regions of interest is executed per segment out of multiple segments of the media unit (see pages 3-7 of Eric, “Like the original R-CNN, the fast version also begins by extracting a set of around 2000 region proposals from the input image: Now for each region proposal, we run it through a set of convolutional and max-pooling layers to extract a convolutional feature map….”; see 3. Feature Pyramid Networks of Lin “Our goal is to leverage a ConvNet’s pyramidal feature hierarchy, which has semantics from low to high levels, and build a feature pyramid with high-level semantics throughout. The resulting Feature Pyramid Network is general purpose and in this paper we focus on sliding window proposers (Region Proposal Network, RPN for short) [29] and region-based detectors (Fast R-CNN) [11]. We also generalize FPNs to instance segmentation proposals in Sec. 6.”; 6. Extensions: Segmentation Proposals). In addition, the same motivation is used as in the rejection of claim 1.
Regarding claim 4, Eric and Lin teach the method according to claim 1 wherein the providing of the hybrid representation of the media unit comprises compressing the shape information of the media unit to provide compressed shape information of the media unit (see pages 6-7 of Eric “Now we run that RoI feature vector through two sibling output layers: A softmax classifier that outputs probabilities for the K object classes of our training data plus a background class. A bounding box regressor that outputs refined bounding box positions for each of the K object classes”; see Fig.2 of Eric, where the label “cat” is considered compressed shape information).
Regarding independent claim 15, Eric teaches a hybrid representation generator for generating a hybrid representation of a media unit, the hybrid representation generator (see section How Fast R-CNN is Built, “As in the case of R-CNN, it is crucial to use a pretrained network to initialize Fast R-CNN. Therefore, when training the system, we begin with a network pretrained on the ImageNet classification challenge. To adapt the model to the detection task, we perform a number of transformations: Modify the pretrained network inputs to accept a list of images and a list of RoIs for those images. Replace the last max-pooling layer of the pretrained network with an RoI pooling layer as discussed in the previous section. Replace the 1000-way ImageNet classification layer with two sibling output layers to enable a multi-task loss for training. The multi-task loss is one of the huge novelties of this work. It allows us to take the original R-CNN, which was a three-stage training pipeline (train convolutional network, train SVM classifiers, and train bounding box regressors) and collapse it into a single stage process.”) comprises: The remainder of claim 15 is similar in scope to claim 1 and is therefore rejected under the same rationale.
Regarding claim 16, Eric and Lin teach the hybrid representation generator according to claim 15. The remainder of claim 16 is similar in scope to claim 2 and is therefore rejected under the same rationale.
Regarding claim 18, Eric and Lin teach the hybrid representation generator according to claim 15. The remainder of claim 18 is similar in scope to claim 4 and is therefore rejected under the same rationale.
Regarding claim 22, Eric and Lin teach the method according to claim 1 wherein the media unit is an image (see Fig.1 of Eric).
2	Claims 3, 8-11,17, 23 are rejected under 35 U.S.C. 103 as being unpatentable over Mihail Eric, “Fast Object Detection With Fast R-CNN”, posted October, 2018, (“Eric”) in view of Lin, Tsung-Yi, et al. "Feature pyramid networks for 
Regarding claim 3, Eric and Lin teach the method according to claim 1 wherein the polygons represent shapes that substantially bound the media unit regions of interest (see Fig.2 of Eric where bounding box of cat is polygon). Both Eric and Lin are understood to be silent on the remaining limitations of claim 3.
In the same field of endeavor, Mikhailov teaches wherein the shape information comprises polygons that represent shapes that substantially bound the media unit regions of interest (¶0028 “Once the objects 103, 105, 107 have been located in the image 101, Polygons 113, 115, 117 may be traced around each object, as indicated at 206 and as illustrated in FIG. 3C. Polygon data 124 may be stored in the memory 120. There are a number of different techniques for tracing the polygons”), wherein a number of edges per polygon of the polygons is based on a shape of a media unit region of interest represented by the polygon (¶0024 as shown in Fig. 3C “The processor 110 may be programmed with instructions that facilitate such operation. In particular, the processor 110 may be programmed with image capture instructions 112 that obtain an image 101 from the image capture device 106 and store the data 122 representing the image 101 or retrieve the stored image data 122 from some other device. The processor 110 may be further programmed with outlining instructions 114 that analyze the image data 122 to locate edges of the objects 103, 105, 107 in the image 101 and generate data 124 representing the corresponding polygons 113, 115, 117. The polygon data 124 may identify, among other things, locations of endpoints of a plurality of line segments that make up each side of each polygon. The locations of the endpoints within the image 101 may be defined with respect to some coordinate system. An origin of the coordinate system may be arbitrarily defined and each location may be identified in terms of a number of pixels horizontally and vertically between the endpoint and the origin.”; ¶0029 “To optimize the number of line segments in the polygon surrounding the object 401, the outline instructions 114 may be configured to implement the following procedure. First, second and third boundary adjacent points may be located within the image data. 
An angle between a first line segment connecting the first and second boundary points and a second line segment connecting the first and third points may be determined. If the angle is less than a threshold value, establishing the third line segment may be associated with a side of the polygon. If the angle is greater than the threshold value, the first line segment may be associated with first side of the polygon and associating the second line segment may be associated with a second side of the polygon adjacent the first side.”), and at least one of the polygons differs from a rectangle (Fig. 3C of Mikhailov where the polygon 117 differs from a rectangle).
Therefore, in the combination of Eric and Lin, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the bounding box of Eric by optimizing the number of line segments in the polygon surrounding the object, as taught by Mikhailov, because this modification would automatically trace polygons around each object (¶0028 of Mikhailov).
Thus, the combination of Eric, Lin and Mikhailov teaches wherein the polygons represent shapes that substantially bound the media unit regions of interest, wherein a number of edges per polygon of the polygons is based on a shape of a media unit region of interest represented by the polygon, and at least one of the polygons differs from a rectangle.
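For illustration only, an angle-threshold reduction of polygon sides of the general kind quoted from Mikhailov ¶0029 can be sketched as follows. This is a simplified variant and not Mikhailov's procedure: the function name `simplify_polygon` is hypothetical, and the sketch measures the turn angle between the two segments meeting at each vertex rather than the angle defined in the quoted passage. It shows how the number of edges that remain depends on the shape being outlined.

```python
import math


def simplify_polygon(points, threshold_deg):
    """Drop vertices at which the closed outline turns by less than threshold_deg.

    points: list of (x, y) vertices of a closed polygon, in order.
    Returns the vertices that are kept, in their original order.
    """
    n = len(points)
    kept = []
    for i in range(n):
        prev_pt = points[i - 1]            # wraps around for i == 0
        cur = points[i]
        next_pt = points[(i + 1) % n]
        # Direction of the incoming and outgoing segments at this vertex.
        a_in = math.atan2(cur[1] - prev_pt[1], cur[0] - prev_pt[0])
        a_out = math.atan2(next_pt[1] - cur[1], next_pt[0] - cur[0])
        turn = abs(math.degrees(a_out - a_in))
        if turn > 180:
            turn = 360 - turn
        if turn >= threshold_deg:
            kept.append(cur)               # a real corner: keep it as a polygon vertex
    return kept


# A square traced with an extra boundary point in the middle of every side.
square_trace = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
square = simplify_polygon(square_trace, 10)

triangle = simplify_polygon([(0, 0), (4, 0), (2, 3)], 10)
```

The square trace collapses to four vertices while the triangle keeps its three, so the edge count follows the outlined shape rather than being fixed at four as with a bounding box.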
Regarding independent claim 8, Eric teaches a non-transitory computer readable medium for generating a hybrid representation of a media unit, the non-transitory computer readable medium stores instructions for: (see section How Fast R-CNN is Built, “As in the case of R-CNN, it is crucial to use a pretrained network to initialize Fast R-CNN. Therefore, when training the system, we begin with a network pretrained on the ImageNet classification challenge. To adapt the model to the detection task, we perform a number of transformations: Modify the pretrained network inputs to accept a list of images and a list of RoIs for those images. Replace the last max-pooling layer of the pretrained network with an RoI pooling layer as discussed in the previous section. Replace the 1000-way ImageNet classification layer with two sibling output layers to enable a multi-task loss for training. The multi-task loss is one of the huge novelties of this work. It allows us to take the original R-CNN, which was a three-stage training pipeline (train convolutional network, train SVM classifiers, and train bounding box regressors) and collapse it into a single stage process.” Where system is considered having a computer which has memory)
receiving or generating the media unit (see section How Fast R-CNN Works, “Let’s start with our obligatory cute cat and dog photo”);
processing the media unit by performing multiple iterations (see first page “In this article, we will continue in the vein of classic object detection papers by discussing Fast R-CNN. Studying this line of region proposal with convolutional network work is rewarding because it allows us to see an iterative refinement on a collection of models, each seeking to address shortcomings in its predecessor.”), wherein at least some of the multiple iterations comprises applying, by spanning elements of the iteration (see pages 3-7, “…Now for each region proposal, we run it through a set of convolutional and max-pooling layers to extract a convolutional feature map: Now, given this feature map, we run each region-of-interest (RoI) through what is called an RoI pooling layer. This layer takes an h x w RoI region and runs max-pooling across a grid of sub-regions within the RoI. The output is a fixed H x W feature map, where H and W are hyperparameters that are constant across all RoIs, regardless of dimension. For example, H x W could be a 7 x 7 square. After we have run our convolutional feature map through the RoI pooling layer, we are guaranteed a fixed-length output regardless of region proposal size. Therefore we can now execute a set of fully-connected layers to get an RoI feature vector”); 
selecting, based on an output of the multiple iterations, media unit regions of interest that contributed to the output of the multiple iterations (see pages 3-7, “After we have run our convolutional feature map through the RoI pooling layer, we are guaranteed a fixed-length output regardless of region proposal size. Therefore we can now execute a set of fully-connected layers to get an RoI feature vector. Now we run that RoI feature vector through two sibling output layers: A softmax classifier that outputs probabilities for the K object classes of our training data plus a background class. A bounding box regressor that outputs refined bounding box positions for each of the K object classes… With these class probabilities and refined bounding box coordinates, we can output our final detection results for the original region proposal”); and 
providing the hybrid representation (see page 7, “With these class probabilities and refined bounding box coordinates, we can output our final detection results for the original region proposal:”; see Fig.2 of Eric), wherein the hybrid representation comprises shape information regarding shapes of the media unit regions of interest (see Fig.2 of Eric, showing the classification label “cat” with score 0.8 and the bounding box of the cat), and a media unit signature that comprises identifiers that identify the media unit regions of interest (see Fig.2 of Eric, showing the classification label “cat” with score 0.8); wherein the shape information comprises polygons that represent shapes that substantially bound the media unit regions of interest (see Fig.2 of Eric, where the bounding box of the cat is a polygon). Eric is understood to be silent on the remaining limitations of claim 8.
In the same field of endeavor, Lin teaches receiving or generating the media unit (see section 3. Feature Pyramid Networks, second paragraph “Our method takes a single-scale image of an arbitrary size as input, and outputs proportionally sized feature maps at multiple levels, in a fully convolutional fashion….”); processing the media unit by performing multiple iterations, wherein at least some of the multiple iterations comprises applying, by spanning elements of the iteration, dimension expansion process that are followed by a merge operation (see section 3. Feature Pyramid Networks, part Top-down pathway and lateral connections, second paragraph “Fig. 3 shows the building block that constructs our top-down feature maps. With a coarser-resolution feature map, we upsample the spatial resolution by a factor of 2 (using nearest neighbor upsampling for simplicity). The upsampled map is then merged with the corresponding bottom-up map (which undergoes a 1×1 convolutional layer to reduce channel dimensions) by element-wise addition. This process is iterated until the finest resolution map is generated. To start the iteration, we simply attach a 1×1 convolutional layer on C5 to produce the coarsest resolution map. Finally, we append a 3×3 convolution on each merged map to generate the final feature map, which is to reduce the aliasing effect of upsampling. This final set of feature maps is called {P2, P3, P4, P5}, corresponding to {C2,C3,C4,C5} that are respectively of the same spatial sizes.”); selecting, based on an output of the multiple iterations, media unit regions of interest that contributed to the output of the multiple iterations (section 4.2. Feature Pyramid Networks for Fast RCNN, “Fast R-CNN [11] is a region-based object detector in which Region-of-Interest (RoI) pooling is used to extract features. Fast R-CNN is most commonly performed on a single-scale feature map. To use it with our FPN, we need to assign RoIs of different scales to the pyramid levels.). Both the corresponding image region size (light orange) and canonical object size (dark orange) are shown.”). 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the Fast R-CNN object detection method of Eric to use feature pyramid networks as taught by Lin, because this modification would take a single-scale image of an arbitrary size as input and output proportionally sized feature maps at multiple levels, in a fully convolutional fashion (section 3. Feature Pyramid Networks, second paragraph of Lin). Both Eric and Lin are understood to be silent on the remaining limitations of claim 8.
In the same field of endeavor, Mikhailov teaches:
 wherein the shape information comprises polygons that represent shapes that substantially bound the media unit regions of interest (¶0028 “Once the objects 103, 105, 107 have been located in the image 101, Polygons 113, 115, 117 may be traced around each object, as indicated at 206 and as illustrated in FIG. 3C. Polygon data 124 may be stored in the memory 120. There are a number of different techniques for tracing the polygons”), wherein a number of edges per polygon of the polygons is based on a shape of a media unit region of interest represented by the polygon (¶0024 as shown in Fig. 3C “The processor 110 may be programmed with instructions that facilitate such operation. In particular, the processor 110 may be programmed with image capture instructions 112 that obtain an image 101 from the image capture device 106 and store the data 122 representing the image 101 or retrieve the stored image data 122 from some other device. The processor 110 may be further programmed with outlining instructions 114 that analyze the image data 122 to locate edges of the objects 103, 105, 107 in the image 101 and generate data 124 representing the corresponding polygons 113, 115, 117. The polygon data 124 may identify, among other things, locations of endpoints of a plurality of line segments that make up each side of each polygon. The locations of the endpoints within the image 101 may be defined with respect to some coordinate system. An origin of the coordinate system may be arbitrarily defined and each location may be identified in terms of a number of pixels horizontally and vertically between the endpoint and the origin.”; ¶0029 “To optimize the number of line segments in the polygon surrounding the object 401, the outline instructions 114 may be configured to implement the following procedure. First, second and third boundary adjacent points may be located within the image data. An angle 
Therefore, in the combination of Eric and Lin, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the bounding box of Eric by optimizing the number of line segments in the polygon surrounding the object, as taught by Mikhailov, because this modification would automatically trace polygons around each object (¶0028 of Mikhailov).
Thus, the combination of Eric, Lin, and Mikhailov teaches a non-transitory computer readable medium for generating a hybrid representation of a media unit, the non-transitory computer readable medium stores instructions for: receiving or generating the media unit; processing the media unit by performing multiple iterations, wherein at least some of the multiple iterations comprises applying, by spanning elements of the iteration, dimension expansion process that are followed by a merge operation; selecting, based on an output of the multiple iterations, media unit regions of interest that contributed to the output of the multiple iterations; and providing the hybrid representation, wherein the hybrid representation comprises shape information regarding shapes of the media unit regions of interest, and a media unit signature that comprises identifiers that identify the media unit regions of interest; wherein the shape information comprises polygons that represent shapes that substantially bound the media unit regions of interest, wherein a number of edges per polygon of the polygons is based on a shape of a media unit region of interest represented by the polygon.
Regarding claim 9, Eric, Lin, and Mikhailov teach the non-transitory computer readable medium according to claim 8 wherein the selecting of the media regions of interest is executed per segment out of multiple segments of the media unit (see pages 3-7 of Eric, “Like the original R-CNN, the fast version also begins by extracting a set of around 2000 region proposals from the input image: Now for each region proposal, we run it through a set of convolutional and max-pooling layers to extract a convolutional feature map….”;see 3. Feature Pyramid Networks of Lin “Our goal is to leverage a ConvNet’s pyramidal feature hierarchy, which has semantics from low to high levels, and build a feature pyramid with high-level semantics throughout. The resulting Feature Pyramid Network is general purpose and in this paper we focus on sliding window proposers (Region Proposal Network, RPN for short) [29] and region-based detectors (Fast R-CNN) [11]. We also generalize FPNs to instance. segmentation proposals in Sec. 6.”; 6. Extensions: Segmentation Proposals) In addition, the same motivation is used as the rejection for claim 8.
Regarding claim 10, Eric, Lin, and Mikhailov teach the non-transitory computer readable medium according to claim 8 wherein at least one of the polygons differs from a rectangle (Fig. 3C of Mikhailov, where the polygon 117 differs from a rectangle). In addition, the same motivation is used as the rejection for claim 8.
Regarding claim 11, Eric, Lin, and Mikhailov teach the non-transitory computer readable medium according to claim 8 wherein the providing of the hybrid representation of the media unit comprises compressing the shape information of the media unit to provide compressed shape information of the media unit (see pages 6-7 of Eric, “Now we run that RoI feature vector through two sibling output layers: A softmax classifier that outputs probabilities for the K object classes of our training data plus a background class. A bounding box regressor that outputs refined bounding box positions for each of the K object classes”; see Fig. 2 of Eric, where the label cat is considered as the compressed shape).
Regarding claim 17, Eric and Lin teach the hybrid representation generator according to claim 15 wherein the polygons represent shapes that substantially bound the media unit regions of interest (see Fig. 2 of Eric, where the bounding box of the cat is a polygon). Both Eric and Lin are understood to be silent on the remaining limitations of claim 17.
In the same field of endeavor, Mikhailov teaches wherein the polygons represent shapes that substantially bound the media unit regions of interest (¶0028 “Once the objects 103, 105, 107 have been located in the image 101, polygons 113, 115, 117 may be traced around each object, as indicated at 206 and as illustrated in FIG. 3C. Polygon data 124 may be stored in the memory 120. There are a number of different techniques for tracing the polygons”), wherein a number of edges per polygon of the polygons is based on a shape of a media unit region of interest represented by the polygon (¶0024 as shown in Fig. 3C “The processor 110 may be programmed with instructions that facilitate such operation. In particular, the processor 110 may be programmed with image capture instructions 112 that obtain an image 101 from the image capture device 106 and store the data 122 representing the image 101 or retrieve the stored image. The processor 110 may be further programmed with outlining instructions 114 that analyze the image data 122 to locate edges of the objects 103, 105, 107 in the image 101 and generate data 124 representing the corresponding polygons 113, 115, 117. The polygon data 124 may identify, among other things, locations of endpoints of a plurality of line segments that make up each side of each polygon. The locations of the endpoints within the image 101 may be defined with respect to some coordinate system. An origin of the coordinate system may be arbitrarily defined and each location may be identified in terms of a number of pixels horizontally and vertically between the endpoint and the origin.”; ¶0029 “To optimize the number of line segments in the polygon surrounding the object 401, the outline instructions 114 may be configured to implement the following procedure. First, second and third boundary adjacent points may be located within the image data.
An angle between a first line segment connecting the first and second boundary points and a second line segment connecting the first and third points may be determined. If the angle is less than a threshold value, establishing the third line segment may be associated with a side of the polygon. If the angle is greater than the threshold value, the first line segment may be associated with first side of the polygon and associating the second line segment may be associated with a second side of the polygon adjacent the first side.”) and at least one of the polygons differs from a rectangle (Fig. 3C of Mikhailov, where the polygon 117 differs from a rectangle). In addition, the same motivation is used as the rejection for claim 3.
Thus, the combination of Eric, Lin and Mikhailov teaches the hybrid representation generator according to claim 15 wherein the polygons represent shapes that substantially bound the media unit regions of interest, wherein a number of edges per polygon of the polygons is based on a shape of a media unit region of interest represented by the polygon; and at least one of the polygons differs from a rectangle.
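For illustration only, the angle-threshold procedure quoted from ¶0029 of Mikhailov can be sketched as follows. This is a minimal sketch under stated assumptions and is not Mikhailov's implementation: the function names `angle_between` and `simplify_boundary`, the radian threshold, and the choice to measure the angle at the first point of the current side are all hypothetical. The sketch shows how the number of retained edges follows from the shape of the traced boundary.

```python
import math


def angle_between(p, q, r):
    """Unsigned angle (radians) at point p between segments p->q and p->r."""
    v1 = (q[0] - p[0], q[1] - p[1])
    v2 = (r[0] - p[0], r[1] - p[1])
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    return abs(math.atan2(cross, dot))


def simplify_boundary(points, threshold_rad):
    """Merge consecutive boundary points into polygon sides.

    Assumption: a point whose angle to the current side is below the
    threshold extends that side; otherwise a new side is started.
    """
    if len(points) < 3:
        return list(points)
    kept = [points[0]]
    anchor, candidate = points[0], points[1]
    for nxt in points[2:]:
        if angle_between(anchor, candidate, nxt) < threshold_rad:
            candidate = nxt          # same side: extend it to the new point
        else:
            kept.append(candidate)   # corner found: close the current side
            anchor, candidate = candidate, nxt
    kept.append(candidate)
    return kept


# Hypothetical boundary samples along three sides of a square.
boundary = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
vertices = simplify_boundary(boundary, threshold_rad=0.1)
```

In this sketch, collinear samples collapse into a single side, so a boundary with more corners yields more vertices and therefore more edges.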
Regarding claim 23, Eric, Lin, and Mikhailov teach the non-transitory computer readable medium according to claim 8 wherein the media unit is an image (see Fig. 2 of Eric).
3.	Claims 5-7 and 19-21 are rejected under 35 U.S.C. 103 as being unpatentable over Mihail Eric, “Fast Object Detection With Fast R-CNN”, posted October, 2018, (“Eric”) in view of Lin, Tsung-Yi, et al. "Feature pyramid networks for object detection." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017. (“Lin”) further in view of Tang, Shijian, and Ye Yuan. "Object detection based on convolutional neural network." International Conference-IEEE–2016. 2015. (“Tang”) further in view of Felzenszwalb, Pedro F., et al. "Object detection with discriminatively trained part-based models." IEEE transactions on pattern analysis and machine intelligence 32.9 (2009): 1627-1645. (“Felzenszwalb”).
Regarding claim 5, Eric and Lin teach the method according to claim 4 comprising: the media unit signature of the media unit to signatures of multiple concept structures to find a matching concept structure that has at least one matching signature that matches to the media unit signature (see Fig. 3 of Eric, where the softmax layer output with a list of labels cat (0.8), dog (0.15), etc., outputs probabilities for the cat class and the dog class).


Fig. 3 of Eric
Eric and Lin are understood to be silent on the remaining limitations of claim 5.
In the same field of endeavor, Tang teaches comparing the media unit signature of the media unit to signatures of multiple concept structures to find a matching concept structure that has at least one matching signature that matches to the media unit signature (3.2. Training procedure, “….Then we divide the background data into four folders 1,2,3,4. For folder 1, the IoU with ground truth are between 0.5 and 0.7, for folder 2, the IoU with ground truth are between 0.3 and 0.5, for folders 3 and 4, the IoU with ground truth are less than 0.3. In the case of positive data, we randomly extract a region from raw image, if the IoU with ground truth is larger than 0.7, it is a positive data with the same class label as the ground truth.”; 3.3. Testing procedure, “where Apred and Agt are the areas included in the predicted and ground truth bounding box, respectively. Then we designate a threshold for IoU, for example 0.5, if the IoU exceeds the threshold, the detection marked as correct detection. Multiple detections of the same object are considered as one correct detection and with others as false detections”).
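For illustration only, the IoU comparison quoted from sections 3.2 and 3.3 of Tang can be sketched as follows. This is a minimal sketch under stated assumptions, not code from Tang: the function names `iou`, `training_bucket`, and `is_correct_detection` are hypothetical, boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the handling of values that fall exactly on a threshold is an assumption because the quoted passage does not specify it.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def training_bucket(value):
    """Map an IoU value to the training folders described in the quote.

    Assumption: boundary values go to the higher-overlap folder.
    """
    if value > 0.7:
        return "positive"
    if value >= 0.5:
        return "background folder 1"
    if value >= 0.3:
        return "background folder 2"
    return "background folders 3-4"


def is_correct_detection(value, threshold=0.5):
    """A detection counts as correct when its IoU exceeds the threshold."""
    return value > threshold


# Hypothetical predicted and ground-truth boxes.
overlap = iou((0, 0, 2, 2), (1, 0, 3, 2))
```

Here the two boxes share half of each box's area, so the IoU is one third, which places the region in background folder 2 and below the 0.5 test threshold.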

In the same field of endeavor, Felzenszwalb teaches calculating higher accuracy shape information that is related to regions of interest of the media unit, wherein the higher accuracy shape information is of higher accuracy than the compressed shape information of the media unit (see section 7.3 Contextual Information, “…Let (D1, . . . , Dk) be a set of detections obtained using k different models (for different object categories) in an image I. Each detection (B, s) ∈ Di is defined by a bounding box B = (x1, y1, x2, y2) and a score s. We define the context of I in terms of a k-dimensional vector c(I) = (σ(s1), . . . , σ(sk)) where si is the score of the highest scoring detection in Di, and σ(x) = 1/(1+exp(−2x)) is a logistic function for renormalizing the scores. To rescore a detection (B, s) in an image I we build a 25-dimensional feature vector with the original score of the detection, the top-left and bottom-right bounding box coordinates, and the image context, g = (σ(s), x1, y1, x2, y2, c(I)). (30) The coordinates x1, y1, x2, y2 ∈ [0, 1] are normalized by the width and height of the image. We use a category specific classifier to score this vector to obtain a new score for the detection. The classifier is trained to distinguish correct detections from false positives by integrating contextual information defined by g.” where score of highest scoring detection is considered higher accuracy shape information), wherein the calculating is based on shape information associated with at least some of the matching signatures (7.1 Bounding Box Prediction, 7.2 Non-Maximum Suppression, “Using the matching procedure from Section 3.2 we usually get multiple overlapping detections for each instance of an object. We use a greedy procedure for eliminating repeated detections via non-maximum suppression. After applying the bounding box prediction method described above we have a set of detections D for a particular object category in an image. 
Each detection is defined by a bounding box and a score. We sort the detections in D by score, and greedily select the highest scoring ones while skipping detections with bounding boxes that are at least 50% covered by a bounding box of a previously selected detection”)
Therefore, in the combination of Eric, Lin and Tang, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the object detection using Fast R-CNN of Eric with the bounding box prediction and the rescoring of detections using contextual information as seen in Felzenszwalb because this modification would lead to a noticeable improvement in the average precision on several categories in the PASCAL datasets (see section 7.3 Contextual Information, last paragraph of Felzenszwalb).
Thus, the combination of Eric, Lin, Tang and Felzenszwalb teaches comparing the media unit signature of the media unit to signatures of multiple concept structures to find a matching concept structure that has at least one matching signature that matches to the media unit signature; and calculating higher accuracy shape information that is related to regions of interest of the media unit, wherein the higher accuracy shape information is of higher accuracy than the compressed shape information of the media unit, wherein the calculating is based on shape information associated with at least some of the matching signatures.
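For illustration only, the contextual rescoring feature vector quoted from section 7.3 of Felzenszwalb can be sketched as follows. This is a minimal sketch, not code from Felzenszwalb: the function names `sigma`, `context_vector`, and `rescoring_features` are hypothetical, and the category-specific classifier that scores the vector is not shown. With k = 20 object categories, the vector has 1 + 4 + 20 = 25 entries, consistent with the 25-dimensional vector in the quoted passage.

```python
import math


def sigma(x):
    """Logistic renormalization from the quoted passage: 1 / (1 + exp(-2x))."""
    return 1.0 / (1.0 + math.exp(-2.0 * x))


def context_vector(top_scores):
    """c(I): renormalized highest detection score for each of the k models."""
    return [sigma(s) for s in top_scores]


def rescoring_features(box, score, top_scores, width, height):
    """g = (sigma(s), x1, y1, x2, y2, c(I)) with coordinates scaled to [0, 1]."""
    x1, y1, x2, y2 = box
    normalized = [x1 / width, y1 / height, x2 / width, y2 / height]
    return [sigma(score)] + normalized + context_vector(top_scores)


# Hypothetical detection in a 200 x 100 image with k = 20 category scores.
features = rescoring_features(
    box=(50, 20, 100, 80),
    score=0.0,
    top_scores=[0.0] * 20,
    width=200,
    height=100,
)
```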
Regarding claim 6, Eric, Lin, Tang and Felzenszwalb teach the method according to claim 5 comprising determining shapes of the media unit regions of interest using the higher accuracy shape information (see section 7.3 Contextual Information of Felzenszwalb “…Let (D1, . . . , Dk) be a set of detections obtained using k different models (for different object categories) in an image I. Each detection (B, s) ∈ Di is defined by a bounding box B = (x1, y1, x2, y2) and a score s. We define the context of I in terms of a k-dimensional vector c(I) = (σ(s1), . . . , σ(sk)) where si is the score of the highest scoring detection in Di, and σ(x) = 1/(1+exp(−2x)) is a logistic function for renormalizing the scores. To rescore a detection (B, s) in an image I we build a 25-dimensional feature vector with the original score of the detection, the top-left and bottom-right bounding box coordinates, and the image context, g = (σ(s), x1, y1, x2, y2, c(I)). (30) The coordinates x1, y1, x2, y2 ∈ [0, 1] are normalized by the width and height of the image. We use a category specific classifier to score this vector to obtain a new score for the detection. The classifier is trained to distinguish correct detections from false positives by integrating contextual information defined by g.”; 8 EMPIRICAL RESULTS of Felzenszwalb “A predicted bounding box is considered correct if it overlaps more than 50% with a ground-truth bounding box, otherwise the bounding box is considered a false positive detection. Multiple detections are penalized. If a system predicts several bounding boxes that overlap with a single ground-truth bounding box, only one prediction is considered 
Regarding claim 7, Eric, Lin, Tang and Felzenszwalb teach the method according to claim 5 wherein for each media unit region of interest, the calculating of the higher accuracy shape information comprises virtually overlaying shapes of corresponding media units of interest of at least some of the matching signatures (7.1 Bounding Box Prediction, 7.2 Non-Maximum Suppression, of Felzenszwalb “Using the matching procedure from Section 3.2 we usually get multiple overlapping detections for each instance of an object. We use a greedy procedure for eliminating repeated detections via non-maximum suppression. After applying the bounding box prediction method described above we have a set of detections D for a particular object category in an image. Each detection is defined by a bounding box and a score. We sort the detections in D by score, and greedily select the highest scoring ones while skipping detections with bounding boxes that are at least 50% covered by a bounding box of a previously selected detection”; 8 EMPIRICAL RESULTS of Felzenszwalb “A predicted bounding box is considered correct if it overlaps more than 50% with a ground-truth bounding box, otherwise the bounding box is considered a false positive detection. Multiple detections are penalized. If a system predicts several bounding boxes that overlap with a single ground-truth bounding box, only one prediction is considered correct, the others are considered false positives. One scores a system by the average precision (AP) of its precision-recall curve across a testset.”). In addition, the same motivation is used as the rejection for claim 5.
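For illustration only, the greedy non-maximum suppression quoted from section 7.2 of Felzenszwalb can be sketched as follows. This is a minimal sketch under stated assumptions, not code from Felzenszwalb: the function names `covered_fraction` and `greedy_nms` are hypothetical, detections are assumed to be `(box, score)` pairs with boxes given as `(x1, y1, x2, y2)`, and coverage is assumed to mean the fraction of the candidate box's area lying inside a previously selected box.

```python
def covered_fraction(box, other):
    """Fraction of `box` area that lies inside `other`."""
    ix1, iy1 = max(box[0], other[0]), max(box[1], other[1])
    ix2, iy2 = min(box[2], other[2]), min(box[3], other[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = (box[2] - box[0]) * (box[3] - box[1])
    return inter / area if area > 0 else 0.0


def greedy_nms(detections, max_cover=0.5):
    """Select detections by descending score, skipping any detection whose
    box is at least `max_cover` covered by an already selected box."""
    selected = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(covered_fraction(box, kept) < max_cover for kept, _ in selected):
            selected.append((box, score))
    return selected


# Hypothetical detections: two overlapping boxes and one separate box.
detections = [
    ((1, 1, 10, 10), 0.8),
    ((0, 0, 10, 10), 0.9),
    ((20, 20, 30, 30), 0.7),
]
kept = greedy_nms(detections)
```

In this sketch, the lower-scoring overlapping box is fully covered by the highest-scoring box and is skipped, while the separate box is retained.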
Regarding claim 19, Eric, Lin, Mikhailov, Tang and Felzenszwalb teach the hybrid representation generator according to claim 18 that is configured to: The remainder of claim 19 is similar in scope to claim 5 and is therefore rejected under the same rationale.
Regarding claim 20, Eric, Lin, Mikhailov, Tang and Felzenszwalb teach the hybrid representation generator according to claim 19 that is configured to: The remainder of claim 20 is similar in scope to claim 6 and is therefore rejected under the same rationale.
Regarding claim 21, Eric, Lin, Mikhailov, Tang and Felzenszwalb teach the hybrid representation generator according to claim 19 that is configured to: The remainder of claim 21 is similar in scope to claim 7 and is therefore rejected under the same rationale.
4.	Claims 12-14 are rejected under 35 U.S.C. 103 as being unpatentable over Mihail Eric, “Fast Object Detection With Fast R-CNN”, posted October, 2018, (“Eric”) in view of Lin, Tsung-Yi, et al. "Feature pyramid networks for object detection." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017. (“Lin”) further in view of Mikhailov, U.S Patent Application Publication No. 20090102835 (“Mikhailov”) further in view of Tang, Shijian, and Ye Yuan. "Object detection based on convolutional neural network." International Conference-IEEE–2016. 2015. (“Tang”) further in view of Felzenszwalb, Pedro F., et al. "Object detection with discriminatively trained part-based models." IEEE transactions on pattern analysis and machine intelligence 32.9 (2009): 1627-1645. (“Felzenszwalb”).
Regarding claim 12, Eric, Lin, and Mikhailov teach the non-transitory computer readable medium according to claim 11 that stores instructions for: the media unit signature of the media unit to signatures of multiple concept structures to find a matching concept structure that has at least one matching signature that matches to the media unit signature (see Fig. 3 of Eric, where the softmax layer output with a list of labels cat (0.8), dog (0.15), etc., outputs probabilities for the cat class and the dog class).


Fig. 3 of Eric
Eric, Lin, and Mikhailov are understood to be silent on the remaining limitations of claim 12.
In the same field of endeavor, Tang teaches comparing the media unit signature of the media unit to signatures of multiple concept structures to find a matching concept structure that has at least one matching signature that matches to the media unit signature (3.2. Training procedure, “….Then we divide the background data into four folders 1,2,3,4. For folder 1, the IoU with ground truth are between 0.5 and 0.7, for folder 2, the IoU with ground truth are between 0.3 and 0.5, for folders 3 and 4, the IoU with ground truth are less than 0.3. In the case of positive data, we randomly extract a region from raw image, if the IoU with ground truth is larger than 0.7, it is a positive data with the same class label as the ground truth.”; 3.3. Testing procedure, “where Apred and Agt are the areas included in the predicted and ground truth bounding box, respectively. Then we designate a threshold for IoU, for example 0.5, if the IoU exceeds the threshold, the detection marked as correct detection. Multiple detections of the same object are considered as one correct detection and with others as false detections”).
Therefore, in the combination of Eric, Lin, and Mikhailov, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the object detection using Fast R-CNN of Eric with the comparing of the IoU with ground truth as seen in Tang because this modification would determine data with the same class label as the ground truth (3.2. Training procedure of Tang). Eric, Lin, Mikhailov and Tang are understood to be silent on the remaining limitations of claim 12.
In the same field of endeavor, Felzenszwalb teaches calculating higher accuracy shape information that is related to regions of interest of the media unit, wherein the higher accuracy shape information is of higher accuracy than the compressed shape information of the media unit (see section 7.3 Contextual Information, “…Let (D1, . . . , Dk) be a set of detections obtained using k different models (for different object categories) in an image I. Each detection (B, s) ∈ Di is defined by a bounding box B = (x1, y1, x2, y2) and a score s. We define the context of I in terms of a k-dimensional vector c(I) = (σ(s1), . . . , σ(sk)) where si is the score of the highest scoring detection in Di, and σ(x) = 1/(1+exp(−2x)) is a logistic function for renormalizing the scores. To rescore a detection (B, s) in an image I we build a 25-dimensional feature vector with the original score of the detection, the top-left and bottom-right bounding box coordinates, and the image context, g = (σ(s), x1, y1, x2, y2, c(I)). (30) The coordinates x1, y1, x2, y2 ∈ [0, 1] are normalized by the width and height of the image. We use a category specific classifier to score this vector to obtain a new score for the detection. The classifier is trained to distinguish correct detections from false positives by integrating contextual information defined by g.” where score of highest scoring detection is considered higher accuracy shape information), wherein the calculating is based on shape information associated with at least some of the matching signatures (7.1 Bounding Box Prediction, 7.2 Non-Maximum Suppression, “Using the matching procedure from Section 3.2 we usually get multiple overlapping detections for each instance of an object. We use a greedy procedure for eliminating repeated detections via non-maximum suppression. After applying the bounding box prediction method described above we have a set of detections D for a particular object category in an image. Each detection is defined by a bounding box and a score. We sort the detections in D by score, and greedily select the highest scoring ones while skipping detections with bounding boxes that are at least 50% covered by a bounding box of a previously selected detection”).
Therefore, in the combination of Eric, Lin, Mikhailov and Tang, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the object detection using Fast R-CNN of Eric with the bounding box prediction and the rescoring of detections using contextual information as seen in Felzenszwalb because this modification would lead to a noticeable improvement in the average precision on several categories in the PASCAL datasets (see section 7.3 Contextual Information, last paragraph of Felzenszwalb).
Thus, the combination of Eric, Lin, Mikhailov, Tang and Felzenszwalb teaches comparing the media unit signature of the media unit to signatures of multiple concept structures to find a matching concept structure that has at least one matching signature that matches to the media unit signature; and calculating higher accuracy shape information that is related to regions of interest of the media unit, wherein the higher accuracy shape information is of higher accuracy than the compressed shape information of the media unit, wherein the calculating is based on shape information associated with at least some of the matching signatures.
Regarding claim 13, Eric, Lin, Mikhailov Tang and Felzenszwalb teach the non-transitory computer readable medium according to claim 12 that stores instructions for determining shapes of the media unit regions of interest using the higher accuracy shape information (see section 7.3 Contextual Information of Felzenszwalb “…Let (D1, . . . , Dk) be a set of detections obtained using k different models (for different object categories) in an image I. Each detection (B, s) ∈ Di is defined by a bounding box B = (x1, y1, x2, y2) and a score s. We define the context of I in terms of a k-dimensional vector c(I) = (σ(s1), . . . , σ(sk)) where si is the score of the highest scoring detection in Di, and σ(x) = 1/(1+exp(−2x)) is a logistic function for renormalizing the scores. To rescore a detection (B, s) in an image I we build a 25-dimensional feature vector with the original score of the detection, the top-left and bottom-right bounding box coordinates, and the image context, g = (σ(s), x1, y1, x2, y2, c(I)). (30) The coordinates x1, y1, x2, y2 ∈ [0, 1] are normalized by the width and height of the image. We use a category specific classifier to score this vector to obtain a new score for the detection. The classifier is trained to distinguish correct detections from false positives by integrating contextual 
Regarding claim 14, Eric, Lin, Mikhailov, Tang and Felzenszwalb teach the non-transitory computer readable medium according to claim 12 wherein for each media unit region of interest, the calculating of the higher accuracy shape information comprises virtually overlaying shapes of corresponding media units of interest of at least some of the matching signatures (7.1 Bounding Box Prediction, 7.2 Non-Maximum Suppression, of Felzenszwalb “Using the matching procedure from Section 3.2 we usually get multiple overlapping detections for each instance of an object. We use a greedy procedure for eliminating repeated detections via non-maximum suppression. After applying the bounding box prediction method described above we have a set of detections D for a particular object category in an image. Each detection is defined by a bounding box and a score. We sort the detections in D by score, and greedily select the highest scoring ones while skipping detections with bounding boxes that are at least 50% covered by a bounding box of a previously selected detection”; 8 EMPIRICAL RESULTS of Felzenszwalb “A predicted bounding box is considered correct if it overlaps more than 50% with a ground-truth bounding box, otherwise the bounding box is considered a false
Contact

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SARAH LE whose telephone number is (571)270-7842. The examiner can normally be reached Monday: 8AM-4:30PM EST, Tuesday: 8 AM-3:30PM EST, Wednesday: 8AM-2:30PM EST, Thursday and Friday off.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Mark Zimmerman can be reached on 571-272-7653. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent 





/SARAH LE/           Primary Examiner, Art Unit 2619