DETAILED ACTION
Notice of Pre-AIA or AIA Status
Claims 1-20 are pending in this application. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Specification
The title of the invention is not descriptive. A new title is required that is clearly indicative of the invention to which the claims are directed.

35 U.S.C. § 112 Sixth Paragraph - Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked.
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function.
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function.
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitations are: “unit” in claims 11-20.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Loui et al. (US PGPub US 2016/0379055 A1), hereinafter referred to as “Loui”, in view of Kwon et al. (US PGPub US 2020/0202615 A1), hereinafter referred to as “Kwon”.
Consider Claims 1 and 11. 
Loui teaches:
1. A cross-domain image comparison method, comprising: / 11. A cross-domain image comparison system, comprising: (Loui: abstract, A method for graph-based spatiotemporal video segmentation and automatic target object extraction in high-dimensional feature space includes using a processor to automatically analyze an entire volumetric video sequence; using the processor to construct a high-dimensional feature space that includes color, motion, time, and location information so that pixels in the entire volumetric video sequence are reorganized according to their unique and distinguishable feature vectors; using the processor to create a graph model that fuses the appearance, spatial, and temporal information of all pixels of the video sequence in the high-dimensional feature space; and using the processor to group pixels in the graph model that are inherently similar and assign the same labels to them to form semantic spatiotemporal key segments. [0020])
1. obtaining two videos in cross-domain, wherein the videos are generated by different types of devices; / 11. an inputting unit, used for obtaining two videos in cross-domain, wherein the videos are generated by different types of devices; (Loui: [0018] Video graph representation according to the method of the present invention is now described. [0019] The method of the present invention ensures both temporal and spatial connectedness of regions. [0020] To detect semantic key-segments in a video sequence, the first step is to create a video-graph representation wherein each pixel is considered as a graph node, and two pixels are connected by an edge based on certain similarity criteria. The method of the present invention embeds all pixel information including color, motion, and location into the graph, and uses this graph model to achieve the clustering task by grouping pixels belonging to the same object together and assigns the same label in the high dimensional feature domain) 
1. obtaining a plurality of semantic segmentation areas from one frame of each of the videos; / 11. a semantic segmentation unit, used for obtaining a plurality of semantic segmentation areas from one frame of each of the videos;  (Loui: [0021] To an extent, how well such a graph is constructed based on the pixels from all frames will decide the accuracy of the segmentation. In order to group similar pixels together, the method of the present invention semantically connects pixels with a weighted edge that describes how likely two pixels belong to the same object regardless of their spatial and temporal locations that appeared in the video sequence. On the other hand, even if two pixels are spatially or temporally related to each other they do not necessarily have to be connected in the feature space. Therefore, it is desirable to perform a pixels reorganization according to individual feature vectors.)
1. obtaining a region of interest pair (ROI pair) according to moving paths of the semantic segmentation areas in the videos; / 11. a ROI unit, used for obtaining a region of interest pair (ROI pair) according to moving paths of the semantic segmentation areas in the videos; (Loui: [0022] FIG. 1. For example, the center node 108 in the middle has 14 edges incident to it by its spatial-temporal neighbors, i.e., pixels from previous frame 102, current frame 104 and next frame 106. The number of neighbors could be tuned as either smaller or larger depending on the object texture, contrast to the environment, etc. In a common situation, relatively larger number of neighbors would result in higher segmentation accuracy since more pixels from the same object would be more likely to be grouped together. [0023])
(Loui: [0030] A region-based graph model according to the present invention is now described. [0031] The volumetric key-segments is a good set of initial entities to search for target moving objects. However, since the processing method is not provided with any prior knowledge on the target's appearance, shape or location information)
1. and obtaining a similarity between the frames according to the correlation and the central points. / 11. and a similarity unit, used for obtaining a similarity between the frames according to the correlation and the central points. (Loui: [0032] Besides the assumption that background usually holds a dominant role, the observed saliency learned from labeled data of segmented objects or sub-objects allows inference of visual and motion cues for learning foreground and background models. To find the target object among all key-segments, the method of the present invention is looking for those most different in motion and appearance relative to the surroundings. The method of the present invention thereby generalizes the pixel level saliency score by expanding it to volumetric key-segments. Similar to region-node graph model clustering, key-segments are described with the combination of color and motion histogram in individual channels to resolve the visual and motion ambiguity due to cluttered surroundings and multiple motions introduced by camera motion and disparity.)
Loui does not teach: 
1. obtaining two bounding boxes and two central points of the ROI pair; / 11. a bounding box unit, used for obtaining two bounding boxes and two central points of the ROI pair; 
Kwon teaches:
1. A cross-domain image comparison method, comprising: / 11. A cross-domain image comparison system, comprising: (Kwon: abstract, A method is disclosed for reconstructing three-dimensional video from two-dimensional video data using particle filtering and thereby generating training data for autonomous vehicles. In one version, the method comprises: receiving a set of annotations associated with a video frame comprising a view of at least a portion of a vehicle, each annotation comprising at least one two-dimensional line; removing at least one outlier from the set of annotations; determining an estimated vehicle model based on the set of annotations; and providing the estimated vehicle model to a driving simulator. [0013]-[0017])
1. obtaining two videos in cross-domain, wherein the videos are generated by different types of devices; / 11. an inputting unit, used for obtaining two videos in cross-domain, wherein the videos are generated by different types of devices; (Kwon: [0042]-[0044], [0045] In this disclosure, we propose a hybrid (human machine) intelligence pipeline 100 for 3D video reconstruction as in FIG. 1. Our approach leverages content diversity from different but related video frames to increase the accuracy of 3D state estimates. [0046] We introduce Popup, a crowd-powered system, that collects annotations of 3D dimension lines atop 2D videos, and then aggregates these annotations using particle filtering to generate 3D state estimates of objects of interest. We validate our method on videos from a publicly available and established dataset of traffic scenes [Ref. 13]. [0047]-[0049] [0116] At 1204, the process 1200 can receive one or more video frames. The one or more video frames can include a view of at least a portion of a vehicle such as a car or truck. In some embodiments, the one or more video frames can include ten or more video frames)
1. obtaining a plurality of semantic segmentation areas from one frame of each of the videos; / 11. a semantic segmentation unit, used for obtaining a plurality of semantic segmentation areas from one frame of each of the videos; (Kwon: [0044] This process of creating 3D scenes from real-world monocular video is called 3D video reconstruction. Generally, manual annotations are necessary at some point of the process to bridge the sensory and semantic gap between 2D and 3D. To efficiently scale up manual annotation, one can benefit from crowd-powered tools that rapidly leverage human effort. [0045]-[0049] [0047] (1) A novel means of aggregating and processing multiple annotations at different frames in videos using particle filtering, which enables more accurate 3D scene reconstruction even with an incomplete annotation set. [0048] (2) Popup, a crowd-powered system that estimates 3D position and orientation of objects from 2D images, using crowdsourced dimension line annotations on objects and their actual dimension lengths. [0117] The instructions can include words and/or figures detailing how to annotate a vehicle. More specifically, the instructions can detail how to crop the vehicle from the video frame by drawing and/or adjusting a two-dimensional bounding box to include the vehicle while excluding as much of the video frame that does not include the vehicle as possible. Additionally, the instructions can detail how to draw two-dimensional annotation lines along three dimensions of the vehicle, for example the length, width, and height of the vehicle.)
1. obtaining a region of interest pair (ROI pair) according to moving paths of the semantic segmentation areas in the videos; / 11. a ROI unit, used for obtaining a region of interest pair (ROI pair) according to moving paths of the semantic segmentation areas in the videos; (Kwon: [0118] In some embodiments, the process 1200 can display a message or other prompt that a user should crop the vehicle from the video frame by drawing and/or adjusting a bounding box around the vehicle. The process 1200 can then proceed to 1216. [0119] At 1216 the process 1200 can receive coordinates associated with a plurality of bounding boxes. Each bounding box can be associated with the video frame as well as a specific user and/or annotation lines to be generated later, as will be explained below. The coordinates can include (x,y) coordinates of each corner of the bounding box. Each user may have drawn a bounding box in response to being displayed a video frame, and the coordinates of each corner of the bounding box can be provided to the process 1200. The process 1200 can then proceed to 1220)
1. obtaining two bounding boxes and two central points of the ROI pair; / 11. a bounding box unit, used for obtaining two bounding boxes and two central points of the ROI pair; (Kwon: [0120] At 1220, the process 1200 can cause an annotation interface to be displayed to the plurality of users. The process 1200 can cause an option to provide annotations and/or cause an option to not provide annotations to be displayed to the plurality of users. The annotation interface can include one or more buttons to allow a worker to draw annotation lines on the video frame, and more specifically, the portion of the video frame included in the bounding box. Each user can then choose to annotate or not annotate the video frame. Briefly referring to FIG. 3B, in some embodiments, the annotation interface can include at least a portion of the annotation interface 318 described above and shown in FIG. 3B. The process 1200 can then proceed to 1224. [0121] At 1224, the process 1200 can receive a set of annotations associated with the one or more video frames. The set of annotations can include two or more subsets of annotations, each subset of annotations being associated with a unique video frame included in the one or more video frames. Each annotation included in the set of annotations can include at least one two-dimensional line. Each two-dimensional line can be associated with a video frame, a dimension (i.e., a first dimension, a second dimension, or a third dimension) of the vehicle included in the video frame, and a user included in the plurality of users.)
1. and obtaining a similarity between the frames according to the bounding boxes and the central points. / 11. and a similarity unit, used for obtaining a similarity between the frames according to the bounding boxes and the central points. (Kwon: [0080] At 712, the process 700 can resample the set N of particles described above. The process 700 can resample the set N of particles in order to help determine a probable location of a vehicle in a current video frame (i.e., a video frame the process 700 is currently processing). [0081] At 720, the process 700 can calculate a probability value (e.g., a weight value) for each of particles. The process 700 can generate a bounding cuboid based on the hypothesis value in three-dimensional space, and project the cuboid onto the video frame. Then, for each two-dimensional line associated with the current video frame, the process 700 can determine a first distance between endpoints of the two-dimensional line and an appropriate pair of edges of the cuboid.)
It would have been obvious before the effective filing date of the claimed invention to one of ordinary skill in the art to modify Loui’s graph-based object segmentation and extraction to leverage the particle filtering algorithm of Kwon to more accurately reconstruct 3D video data. The determination of obviousness is predicated upon the following findings: One skilled in the art would have been motivated to modify Loui’s method and system for semantic-based object segmentation and extraction in order to improve and refine the algorithm for 3D video reconstruction using Kwon’s algorithm. Furthermore, the prior art collectively includes each element claimed (though not all in the same reference), and one of ordinary skill in the art could have combined the elements in the manner explained above using known engineering design, interface and programming techniques, without changing a “fundamental” operating principle of Loui, while the teaching of Kwon continues to perform the same function as originally taught prior to being combined, in order to produce the repeatable and predictable result of creating realistic simulated 3D video data of extracted features. It is for at least the aforementioned reasons that the examiner has reached a conclusion of obviousness with respect to the claim in question.
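For illustration only, the limitation "obtaining two bounding boxes and two central points of the ROI pair" can be pictured with the minimal sketch below. It assumes that each region of interest is available as a collection of (x, y) pixel coordinates and that the bounding box is axis-aligned; the names bounding_box and central_point are hypothetical and are not taken from the claims, Loui, or Kwon.

```python
from typing import Iterable, Tuple

# Assumed box layout for this sketch: (x_min, y_min, x_max, y_max)
Box = Tuple[int, int, int, int]


def bounding_box(pixels: Iterable[Tuple[int, int]]) -> Box:
    """Return the axis-aligned box enclosing every (x, y) pixel of one ROI."""
    pts = list(pixels)
    if not pts:
        raise ValueError("region of interest is empty")
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    return min(xs), min(ys), max(xs), max(ys)


def central_point(box: Box) -> Tuple[float, float]:
    """Return the centre of the box as the midpoint of its extents."""
    x_min, y_min, x_max, y_max = box
    return (x_min + x_max) / 2.0, (y_min + y_max) / 2.0


# One ROI from each video forms the ROI pair; each gets a box and a centre.
roi_a = [(2, 3), (5, 1), (4, 7)]
roi_b = [(10, 10), (14, 12)]
boxes = (bounding_box(roi_a), bounding_box(roi_b))
centres = (central_point(boxes[0]), central_point(boxes[1]))
```

Under these assumptions the two boxes and two centres are the only inputs the later similarity step needs.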

Consider Claims 2 and 12. 
The combination of Loui and Kwon teaches: 
2. The cross-domain image comparison method according to claim 1, wherein one of the videos is captured by a camera and another one of the videos is generated by a computer.
12. The cross-domain image comparison system according to claim 11, wherein one of the videos is captured by a camera and another one of the videos is generated by a computer. (Loui: [0018] Video graph representation according to the method of the present invention is now described. [0019] The method of the present invention ensures both temporal and spatial connectedness of regions. [0020] To detect semantic key-segments in a video sequence, the first step is to create a video-graph representation wherein each pixel is considered as a graph node, and two pixels are connected by an edge based on certain similarity criteria. The method of the present invention embeds all pixel information including color, motion, and location into the graph, and uses this graph model to achieve the clustering task by grouping pixels belonging to the same object together and assigns the same label in the high dimensional feature domain. Kwon: [0044] This process of creating 3D scenes from real-world monocular video is called 3D video reconstruction. Generally, manual annotations are necessary at some point of the process to bridge the sensory and semantic gap between 2D and 3D. To efficiently scale up manual annotation, one can benefit from crowd-powered tools that rapidly leverage human effort. [0045] In this disclosure, we propose a hybrid (human machine) intelligence pipeline 100 for 3D video reconstruction as in FIG. 1. Our approach leverages content diversity from different but related video frames to increase the accuracy of 3D state estimates. [0046] We introduce Popup, a crowd-powered system, that collects annotations of 3D dimension lines atop 2D videos, and then aggregates these annotations using particle filtering to generate 3D state estimates of objects of interest. We validate our method on videos from a publicly available and established dataset of traffic scenes [Ref. 13]. [0047]-[0049] [0116] At 1204, the process 1200 can receive one or more video frames. The one or more video frames can include a view of at least a portion of a vehicle such as a car or truck. In some embodiments, the one or more video frames can include ten or more video frames)

Consider Claims 3 and 13. 
The combination of Loui and Kwon teaches: 
3. The cross-domain image comparison method according to claim 1, wherein in the step of obtaining the semantic segmentation areas, the semantic segmentation areas are obtained via a semantic segmentation model, and the semantic segmentation model is a Fully Convolutional Networks model (FCN model), an U-net model or an efficient neural network model (Enet model).
13. The cross-domain image comparison system according to claim 11, wherein the semantic segmentation unit obtains the semantic segmentation areas via a semantic segmentation model, and the semantic segmentation model is a Fully Convolutional Networks model (FCN model), an U-net model or an efficient neural network model (Enet model). (Kwon: [0047] (1) A novel means of aggregating and processing multiple annotations at different frames in videos using particle filtering, which enables more accurate 3D scene reconstruction even with an incomplete annotation set. [0048] (2) Popup, a crowd-powered system that estimates 3D position and orientation of objects from 2D images, using crowdsourced dimension line annotations on objects and their actual dimension lengths [0128] In some embodiments, any suitable computer readable media can be used for storing instructions for performing the functions and/or processes described herein. For example, in some embodiments, computer readable media can be transitory or non-transitory. For example, non-transitory computer readable media can include media such as magnetic media (such as hard disks, floppy disks, etc.), optical media (such as compact discs, digital video discs, Blu-ray discs, etc.), semiconductor media (such as RAM, Flash memory, electrically programmable read only memory (EPROM), electrically erasable programmable read only memory (EEPROM), etc.), any suitable media that is not fleeting or devoid of any semblance of permanence during transmission, and/or any suitable tangible media. As another example, transitory computer readable media can include signals on networks, in wires, conductors, optical fibers, circuits, or any suitable media that is fleeting and devoid of any semblance of permanence during transmission, and/or any suitable intangible media. 
Loui: [0027] With the video-graph being constructed in the 7D feature space, weight per edge could simply be derived from the Euclidean distance between the endpoints since their location indicates their similarity. The 7D feature descriptors offer a much richer description of appearance and motion. With the novel video-graph model according to the present invention, the same segmentation algorithm can be applied to obtain volumetric key-segments in video sequences. The present invention also adopts a hierarchical implementation owing to the fact that the concept of semantic objects are perceived by an intelligent agent while the algorithm is only capable of recognizing lower level clusters. Using the output from previous hierarchy, higher level objects are progressively individuated with the combination of lower level clusters over time. After the first iteration, the nodes in the graph model become the clusters and the edges are directly inherited from the pixel level edges.)

Consider Claims 4 and 14. 
The combination of Loui and Kwon teaches: 
4. The cross-domain image comparison method according to claim 1, the step of obtaining the similarity between the frames includes: averaging a position similarity degree of the ROI pair, an angle similarity degree of the ROI pair, a size similarity degree of the ROI pair and a contour similarity of the ROI pair with weightings, to obtain the similarity between the frames.
14. The cross-domain image comparison system according to claim 11, wherein the similarity unit includes: a score calculator, used for averaging a position similarity degree of the ROI pair, an angle similarity degree of the ROI pair, a size similarity degree of the ROI pair and a contour similarity degree of the ROI pair with weightings, to obtain the similarity between the frames. (Loui: [0018] Video graph representation according to the method of the present invention is now described. [0019] The method of the present invention ensures both temporal and spatial connectedness of regions. [0020] To detect semantic key-segments in a video sequence, the first step is to create a video-graph representation wherein each pixel is considered as a graph node, and two pixels are connected by an edge based on certain similarity criteria. The method of the present invention embeds all pixel information including color, motion, and location into the graph, and uses this graph model to achieve the clustering task by grouping pixels belonging to the same object together and assigns the same label in the high dimensional feature domain. Kwon: [0080] At 712, the process 700 can resample the set N of particles described above. The process 700 can resample the set N of particles in order to help determine a probable location of a vehicle in a current video frame (i.e., a video frame the process 700 is currently processing). [0081] At 720, the process 700 can calculate a probability value (e.g., a weight value) for each of particles. The process 700 can generate a bounding cuboid based on the hypothesis value in three-dimensional space, and project the cuboid onto the video frame. Then, for each two-dimensional line associated with the current video frame, the process 700 can determine a first distance between endpoints of the two-dimensional line and an appropriate pair of edges of the cuboid.)
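For illustration only, the "averaging ... with weightings" recited in claims 4 and 14 can be read as an ordinary weighted mean of the four similarity degrees. The sketch below assumes each degree is already normalized to the range 0 to 1 and that the weights are non-negative; the function name weighted_similarity and the example weights are hypothetical and do not come from the claims or the cited references.

```python
from typing import Sequence


def weighted_similarity(degrees: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of similarity degrees (position, angle, size, contour)."""
    if len(degrees) != len(weights):
        raise ValueError("one weight is required per similarity degree")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(d * w for d, w in zip(degrees, weights)) / total


# Assumed example values: position, angle, size, contour degrees of one ROI pair.
frame_similarity = weighted_similarity([1.0, 0.5, 0.5, 0.0], [1, 1, 1, 1])
```

Dividing by the sum of the weights keeps the result in the same 0 to 1 range as the inputs, whatever scale the weights use.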

Consider Claims 5 and 15. 
The combination of Loui and Kwon teaches: 
5. The cross-domain image comparison method according to claim 4, wherein the step of obtaining the similarity between the frames further includes: analyzing an Euclidean distance between the central points, to obtain the position similarity degree of the ROI pair.
15. The cross-domain image comparison system according to claim 14, wherein the similarity unit further includes: a position similarity analyzer, used for analyzing an Euclidean distance between the central points, to obtain the position similarity degree of the ROI pair. (Kwon: [0099] For the evaluation of the accuracy of the state estimates, we used two metrics: a distance difference metric, and an angular difference metric. The distance difference metric is the Euclidean distance between the ground truth and the estimate. The angular difference metric corresponds to the smallest angular difference between estimated orientation and the ground truth orientation (Equation 1). Loui: [0027] With the video-graph being constructed in the 7D feature space, weight per edge could simply be derived from the Euclidean distance between the endpoints since their location indicates their similarity. The 7D feature descriptors offer a much richer description of appearance and motion. With the novel video-graph model according to the present invention, the same segmentation algorithm can be applied to obtain volumetric key-segments in video sequences. The present invention also adopts a hierarchical implementation owing to the fact that the concept of semantic objects are perceived by an intelligent agent while the algorithm is only capable of recognizing lower level clusters. Using the output from previous hierarchy, higher level objects are progressively individuated with the combination of lower level clusters over time. After the first iteration, the nodes in the graph model become the clusters and the edges are directly inherited from the pixel level edges.)
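For illustration only, the following sketch shows one way a Euclidean distance between two central points could be turned into a position similarity degree. The claims recite only that the distance is analyzed; the normalization by a frame diagonal and the clamping to the range 0 to 1 are assumptions made here, and position_similarity and frame_diagonal are hypothetical names.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def position_similarity(c1: Point, c2: Point, frame_diagonal: float) -> float:
    """Map the Euclidean distance between two centres to a score in [0, 1].

    Assumption of this sketch: identical centres score 1.0 and centres a full
    frame diagonal apart (or more) score 0.0.
    """
    if frame_diagonal <= 0:
        raise ValueError("frame_diagonal must be positive")
    distance = math.dist(c1, c2)  # Euclidean distance, Python 3.8+
    return max(0.0, 1.0 - distance / frame_diagonal)
```

For example, centres at (0, 0) and (3, 4) are 5 units apart, so with an assumed frame diagonal of 10 the score is 0.5.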

Consider Claims 6 and 16. 
The combination of Loui and Kwon teaches: 
6. The cross-domain image comparison method according to claim 4, wherein the step of obtaining the similarity between the frames further includes: analyzing a relative angle between the bounding boxes, to obtain the angle similarity degree of the ROI pair.
16. The cross-domain image comparison system according to claim 14, wherein the similarity unit further includes: an angle similarity analyzer, used for analyzing a relative angle between the bounding boxes, to obtain the angle similarity degree of the ROI pair. (Kwon: [0065] The second step compares the distance of the length and angle of submitted dimension line annotations from the medians. If a dimension line is outside 1.5x Interquartile Range (IQR) from the median, it is filtered. This is useful for filtering out mistakes, e.g., a height entry mistakenly drawn as a length entry or a line added by mistake and not removed, and to filter out low quality annotations. [0123] To determine if a single two-dimensional line should be removed from the set of annotations, the process 1200 can compare the distance of the length and angle of the two-dimensional line from the median length and median angle of the other two-dimensional lines associated with video frame and the same dimension as the single two-dimensional line. Loui: [0027] With the video-graph being constructed in the 7D feature space, weight per edge could simply be derived from the Euclidean distance between the endpoints since their location indicates their similarity. The 7D feature descriptors offer a much richer description of appearance and motion. With the novel video-graph model according to the present invention, the same segmentation algorithm can be applied to obtain volumetric key-segments in video sequences. The present invention also adopts a hierarchical implementation owing to the fact that the concept of semantic objects are perceived by an intelligent agent while the algorithm is only capable of recognizing lower level clusters.)
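For illustration only, the sketch below shows one possible reading of "a relative angle between the bounding boxes". It assumes each bounding box carries an orientation angle in degrees and treats orientations that differ by 180 degrees as identical, since a rectangle looks the same after a half turn. The mapping of the relative angle to a score and the names relative_angle and angle_similarity are assumptions of this sketch, not taken from the claims or the cited references.

```python
def relative_angle(theta_a: float, theta_b: float) -> float:
    """Smallest angle in degrees (0 to 90) between two box orientations."""
    diff = abs(theta_a - theta_b) % 180.0
    return min(diff, 180.0 - diff)


def angle_similarity(theta_a: float, theta_b: float) -> float:
    """Score 1.0 for parallel boxes, falling linearly to 0.0 at 90 degrees."""
    return 1.0 - relative_angle(theta_a, theta_b) / 90.0
```

Under these assumptions, boxes oriented at 10 and 170 degrees are 20 degrees apart, and boxes at 0 and 45 degrees score 0.5.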

Consider Claims 7 and 17. 
The combination of Loui and Kwon teaches: 
7. The cross-domain image comparison method according to claim 4, wherein the step of obtaining the similarity between the frames further includes: analyzing a diagonal length of each of the bounding boxes, to obtain the size similarity degree of the ROI pair.
17. The cross-domain image comparison system according to claim 14, wherein the similarity unit further includes: a size similarity analyzer, used for analyzing a diagonal length of each of the bounding boxes, to obtain the size similarity degree of the ROI pair. (Kwon: [0065] The second step compares the distance of the length and angle of submitted dimension line annotations from the medians. If a dimension line is outside 1.5x Interquartile Range (IQR) from the median, it is filtered. This is useful for filtering out mistakes, e.g., a height entry mistakenly drawn as a length entry or a line added by mistake and not removed, and to filter out low quality annotations. [0123] To determine if a single two-dimensional line should be removed from the set of annotations, the process 1200 can compare the distance of the length and angle of the two-dimensional line from the median length and median angle of the other two-dimensional lines associated with video frame and the same dimension as the single two-dimensional line. Loui: [0027] With the video-graph being constructed in the 7D feature space, weight per edge could simply be derived from the Euclidean distance between the endpoints since their location indicates their similarity. The 7D feature descriptors offer a much richer description of appearance and motion. With the novel video-graph model according to the present invention, the same segmentation algorithm can be applied to obtain volumetric key-segments in video sequences. The present invention also adopts a hierarchical implementation owing to the fact that the concept of semantic objects are perceived by an intelligent agent while the algorithm is only capable of recognizing lower level clusters.)
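
As a non-authoritative sketch of the limitation recited in claims 7 and 17, a size similarity degree based on bounding-box diagonals could be computed as shown below. The box layout `(x_min, y_min, x_max, y_max)` and the ratio used as the similarity score are assumptions for illustration; neither the claims nor the cited references specify them.

```python
import math


def diagonal_length(box):
    """Diagonal of an axis-aligned box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return math.hypot(x_max - x_min, y_max - y_min)


def size_similarity(box_a, box_b):
    """Ratio of the shorter diagonal to the longer one, in [0, 1].

    Assumed scoring rule: 1.0 means equal diagonals; smaller values
    mean a larger size mismatch between the two boxes of the ROI pair.
    """
    d_a = diagonal_length(box_a)
    d_b = diagonal_length(box_b)
    longest = max(d_a, d_b)
    if longest == 0:
        return 1.0  # two degenerate boxes are treated as identical in size
    return min(d_a, d_b) / longest


# Example: a 3x4 box (diagonal 5) against a 6x8 box (diagonal 10).
score = size_similarity((0, 0, 3, 4), (0, 0, 6, 8))
```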

Consider Claims 8 and 18. 
The combination of Loui and Kwon teaches: 
8. The cross-domain image comparison method according to claim 4, wherein the step of obtaining the similarity between the frames further includes: analyzing two counters of the semantic segmentation areas corresponding the ROI pair, to obtain the contour similarity degree.
18. The cross-domain image comparison system according to claim 14, wherein the similarity unit further includes: a contour similarity analyzer, used for analyzing two counters of the semantic segmentation areas corresponding the ROI pair, to obtain the contour similarity degree. (Loui: [0021] To an extent, how well such a graph is constructed based on the pixels from all frames will decide the accuracy of the segmentation. In order to group similar pixels together, the method of the present invention semantically connects pixels with a weighted edge that describes how likely two pixels belong to the same object regardless of their spatial and temporal locations that appeared in the video sequence. On the other hand, even if two pixels are spatially or temporally related to each other they do not necessarily have to be connected in the feature space. Therefore, it is desirable to perform a pixels reorganization according to individual feature vectors. Kwon: [0044] This process of creating 3D scenes from real-world monocular video is called 3D video reconstruction. Generally, manual annotations are necessary at some point of the process to bridge the sensory and semantic gap between 2D and 3D. To efficiently scale up manual annotation, one can benefit from crowd-powered tools that rapidly leverage human effort. [0045]-[0049] [0047] (1) A novel means of aggregating and processing multiple annotations at different frames in videos using particle filtering, which enables more accurate 3D scene reconstruction even with an incomplete annotation set. [0048] (2) Popup, a crowd-powered system that estimates 3D position and orientation of objects from 2D images, using crowdsourced dimension line annotations on objects and their actual dimension lengths. [0117] The instructions can include words and/or figures detailing how to annotate a vehicle. 
More specifically, the instructions can detail how to crop the vehicle from the video frame by drawing and/or adjusting a two-dimensional bounding box to include the vehicle while excluding as much of the video frame that does not include the vehicle as possible. Additionally, the instructions can detail how to draw two-dimensional annotation lines along three dimensions of the vehicle, for example the length, width, and height of the vehicle.)

Consider Claims 9 and 19. 
The combination of Loui and Kwon teaches: 
9. The cross-domain image comparison method according to claim 8, wherein the counters are resized to be identical size.
19. The cross-domain image comparison system according to claim 18, wherein the contour similarity analyzer resizes the counters to be identical size. (Loui: [0018] Video graph representation according to the method of the present invention is now described. [0019] The method of the present invention ensures both temporal and spatial connectedness of regions. [0020] To detect semantic key-segments in a video sequence, the first step is to create a video-graph representation wherein each pixel is considered as a graph node, and two pixels are connected by an edge based on certain similarity criteria. The method of the present invention embeds all pixel information including color, motion, and location into the graph, and uses this graph model to achieve the clustering task by grouping pixels belonging to the same object together and assigns the same label in the high dimensional feature domain. Kwon: [0080] At 712, the process 700 can resample the set N of particles described above. The process 700 can resample the set N of particles in order to help determine a probable location of a vehicle in a current video frame (i.e., a video frame the process 700 is currently processing). [0081] At 720, the process 700 can calculate a probability value (e.g., a weight value) for each of particles. The process 700 can generate a bounding cuboid based on the hypothesis value in three-dimensional space, and project the cuboid onto the video frame. Then, for each two-dimensional line associated with the current video frame, the process 700 can determine a first distance between endpoints of the two-dimensional line and an appropriate pair of edges of the cuboid.)
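
For illustration only, resizing two contours (recited in the claims as "counters") to an identical size might be sketched as below, assuming each contour is a list of (x, y) points. The function name `resize_contour`, the target `size`, and the per-axis scaling (which does not preserve aspect ratio) are all assumptions of this sketch and are not taken from the claims or the cited references.

```python
def resize_contour(points, size=64.0):
    """Scale a contour so its bounding extent becomes size x size.

    Hypothetical helper: `points` is a list of (x, y) tuples. Each axis
    is scaled independently, so aspect ratio is not preserved.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    min_x, min_y = min(xs), min(ys)
    # Guard against zero width or height (a degenerate contour).
    width = (max(xs) - min_x) or 1.0
    height = (max(ys) - min_y) or 1.0
    return [
        ((x - min_x) / width * size, (y - min_y) / height * size)
        for x, y in points
    ]


# Two rectangular contours of different scale and position.
contour_a = [(10, 10), (20, 10), (20, 30), (10, 30)]
contour_b = [(0, 0), (5, 0), (5, 10), (0, 10)]
resized_a = resize_contour(contour_a)
resized_b = resize_contour(contour_b)
```

Once both contours occupy the same extent, they can be compared point by point or rasterized onto grids of equal dimensions.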

Consider Claims 10 and 20. 
The combination of Loui and Kwon teaches: 
10. The cross-domain image comparison method according to claim 1, wherein the two semantic segmentation areas corresponding the ROI pair are obtained from the two different videos.
20. The cross-domain image comparison system according to claim 11, wherein the two semantic segmentation areas corresponding the ROI pair are obtained from the two different videos. (Loui: [0021] To an extent, how well such a graph is constructed based on the pixels from all frames will decide the accuracy of the segmentation. In order to group similar pixels together, the method of the present invention semantically connects pixels with a weighted edge that describes how likely two pixels belong to the same object regardless of their spatial and temporal locations that appeared in the video sequence. On the other hand, even if two pixels are spatially or temporally related to each other they do not necessarily have to be connected in the feature space. Therefore, it is desirable to perform a pixels reorganization according to individual feature vectors. Kwon: [0044] This process of creating 3D scenes from real-world monocular video is called 3D video reconstruction. Generally, manual annotations are necessary at some point of the process to bridge the sensory and semantic gap between 2D and 3D. To efficiently scale up manual annotation, one can benefit from crowd-powered tools that rapidly leverage human effort. [0045]-[0049] [0047] (1) A novel means of aggregating and processing multiple annotations at different frames in videos using particle filtering, which enables more accurate 3D scene reconstruction even with an incomplete annotation set. [0048] (2) Popup, a crowd-powered system that estimates 3D position and orientation of objects from 2D images, using crowdsourced dimension line annotations on objects and their actual dimension lengths. [0117] The instructions can include words and/or figures detailing how to annotate a vehicle. 
More specifically, the instructions can detail how to crop the vehicle from the video frame by drawing and/or adjusting a two-dimensional bounding box to include the vehicle while excluding as much of the video frame that does not include the vehicle as possible. Additionally, the instructions can detail how to draw two-dimensional annotation lines along three dimensions of the vehicle, for example the length, width, and height of the vehicle.)



Conclusion
The prior art made of record in form PTO-892 and not relied upon is considered pertinent to applicant's disclosure. 
Ogale; Abhijit et al., US 20200174490 A1, Neural Networks for Vehicle Trajectory Planning
Keating; Brett M. et al., US 7702185 B2, Use of Image Similarity in Annotating Groups of Visual Images in a Collection of Visual Images
Any inquiry concerning this communication or earlier communications from the examiner should be directed to TAHMINA ANSARI whose telephone number is 571-270-3379.  The examiner can normally be reached on IFP Flex - Monday through Friday 9 to 5.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, SUMATI LEFKOWITZ can be reached on 571-272-3638.  The fax phone numbers for the organization where this application or proceeding is assigned are 571-273-8300 for regular communications and 571-273-8300 for After Final communications. TC 2600’s customer service number is 571-272-2600.
Any inquiry of a general nature or relating to the status of this application or proceeding should be directed to the receptionist whose telephone number is 571-272-2600.




/Tahmina Ansari/

September 9, 2021
/TAHMINA N ANSARI/ Primary Examiner, Art Unit 2662