Detailed Action
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Allowable Subject Matter
Claims 1 – 20 are allowed.

Ning US PGPub: US 2021/0035187 A1 Feb. 4, 2021.
Ning teaches a method and a device comprising: extracting, by one or more processors, visual elements from an item image of an item; generating, by the one or more processors, an element descriptor for the item based on at least a part of the visual elements; and calculating, by the one or more processors, a compatibility value between the element descriptor and one or more other element descriptors for one or more other items (ABSTRACT, Figs. 2A, 4, 5, 7).
Ning further teaches that generating the item image based on the raw image and the parsed image comprises the steps of: cropping, by the one or more processors, the parsed image to generate a cropped parsed image comprising a cropped item region and a cropped non-item region; filtering out, by the one or more processors, the cropped non-item region from the cropped parsed image to generate a human parsing mask; and overlaying, by the one or more processors, the human parsing mask with the raw image to generate the item image (Figs. 3A, 3B, paragraphs 0015, 0018, 0076).
Ning further discloses a parsed image generated from the raw image of FIG. 3A by the human parsing module 232, wherein the raw image is partitioned into a plurality of human-related regions, such as a face region, a hair region, a left arm region, a right arm region, a dress region, a pants region, a shoes region, etc., and the pixels in each region are assigned the same grayscale value indicating the respective category (Figs. 3A, 3B, 7/702, paragraphs 0076, 0080, 0121).
In Ning, the raw image depicts a girl wearing a dress, pants, and shoes, among which the dress, for example, may be the target item. In general, the acquired raw image may include both a target item region and a non-target item region. The non-target item region may contain, for example, a background region, a face region, a hair region, a left arm region, a right arm region, a left leg region, a right leg region, and additional-item regions in which the additional item is not the target item, to mention a few non-limiting examples (Fig. 3A, paragraph 0073).

Siskind US PGPub: US 2014/0369596 A1 Dec. 18, 2014.
Siskind teaches correlating videos and sentences, wherein testing a video against an aggregate query includes automatically receiving an aggregate query defining participant(s) and condition(s) on the participant(s), and detecting candidate object(s) in the frames of the video. Siskind also describes a method of providing a description of a video that includes generating a candidate description with participant(s) and condition(s) selected from a linguistic model; constructing component lattices for the participant(s) or condition(s); producing an aggregate lattice having nodes that combine component-lattice factors; and determining a score for the video with respect to the candidate description by determining an aggregate score for a path through the aggregate lattice (ABSTRACT, Figs. 1A, 1B, 4, 20, 23, paragraph 0045).
Siskind presents a system that shows how the compositional structure of events, in concert with the compositional structure of language, can interplay with the underlying focusing mechanisms in video action recognition, thereby providing a medium not only for top-down and bottom-up integration but also for multi-modal integration between vision and language (paragraph 0197).
Siskind's general-purpose inference mechanism, which combines bottom-up information from low-level video-feature detectors with top-down information from natural-language semantics, permits performing three tasks: tracking objects engaged in a specific event as described by a sentence, generating a sentence to describe a video clip, and learning word meanings from video clips paired with entire sentences (paragraphs 0245, 0247).

Peyman US PGPub: US 2020/0066405 A1 Feb. 27, 2020.
Peyman teaches a telemedicine system with dynamic imaging. In some embodiments, the telemedicine system comprises a laser imaging and treatment apparatus, and associated systems and methods that allow a physician (e.g., a surgeon) to perform laser surgical procedures on an eye structure or a body surface with a laser imaging and treatment apparatus disposed at a first (i.e., local) location from a control system disposed at a second (i.e., remote) location, e.g., a physician's office (ABSTRACT).
Peyman further discloses a computer-modified image 622 of the person in which glasses 624 have been provided on the person, the iris color 626 of the person's eyes has been changed, and an earring 628 has been added to one of the person's ears. To make the dynamic identity recognition system spoof-proof, other computer-generated anatomical and cosmetic variations can also be made to the person, such as changing the person's skin color, adding a hat on the person's head, adding lipstick, changing hair color, and changing facial hair color (Fig. 24, paragraph 0373).
Peyman also discloses that three-dimensional dynamic data is obtained either from multiple cameras or from mathematical analysis of the digital data obtained from the light field camera system. Advantageously, the rapid variation of the light field camera eliminates the problem that moving objects pose for good static facial recognition. Additional data on the skin and its changes during physiological dynamic imaging can be obtained with an infrared camera or a multispectral camera (paragraph 0390).

Harville US PGPub: US 2005/0094879 A1 May 5, 2005.
Harville teaches a method for visual-based recognition of objects. Depth data for at least a pixel of an image of the object is received, the depth data comprising information relating to a distance from a visual sensor to a portion of the object visible at the pixel. At least one plan-view image is generated based on the depth data, and at least one plan-view template is extracted from the plan-view image (ABSTRACT, Figs. 4, 5, 7, paragraphs 0006, 0045, 0059).

Biswas US PGPub: US 2021/0065364 A1 Mar. 4, 2021.
Biswas teaches pattern recognition by a convolutional neural network to identify one or more features of interest in an input image. The input image may be a two-dimensional image such as an ultrasonography (USG) image, and the one or more features of interest may include one or more anatomical features of a human body, such as lungs, liver, kidneys, abdomen, genitalia, and the like. In an example, the one or more features of interest may include a biological specimen such as an animal, a plant, or a microbiological organism (paragraph 0030).

Commons US Patent: US 8,775,341 B1 Jul. 8, 2014.
Commons teaches intelligent control with hierarchical stacked neural networks, including a system and method of detecting an aberrant message. An ordered set of words within the message is detected. The set of words found within the message is linked to a corresponding set of expected words, the set of expected words having semantic attributes. A set of grammatical structures represented in the message is detected based on the ordered set of words and the semantic attributes of the corresponding set of expected words. A cognitive noise vector comprising a quantitative measure of a deviation between the grammatical structures represented in the message and an expected measure of grammatical structures for a message of that type is then determined (ABSTRACT, Figs. 7, 10, column 20, lines 45 – 62).

The following is the examiner’s statement of reasons for allowance:

Claim 1 and its dependent claims are allowed because the closest prior art, either alone or in combination, fails to anticipate or render obvious a system comprising: one or more computing devices comprising one or more processors and one or more non-transitory storage devices for storing instructions, wherein execution of the instructions by the one or more processors causes the one or more computing devices to: receive, at a neural network architecture comprising a human parsing network, an image comprising at least one human object; utilize a hierarchal graph comprising a plurality of nodes to model the at least one human object, wherein the nodes correspond to anatomical features associated with a human body; generate inference information for the nodes in the hierarchal graph, wherein generating inference information includes: deriving, with the neural network architecture, direct inference information for each of the nodes included in the hierarchal graph; deriving, with the neural network architecture, top-down inference information for at least a portion of the nodes included in the hierarchal graph; and deriving, with the neural network architecture, bottom-up inference information for at least a portion of the nodes included in the hierarchal graph; and generate, with the neural network architecture, parsing results based, at least in part, on the inference information associated with the nodes, in combination with all other limitations in the claim(s) as defined by applicant.

Claim 11 and its dependent claims are allowed because the closest prior art, either alone or in combination, fails to anticipate or render obvious a method comprising: receiving, at a neural network architecture comprising a human parsing network, an image comprising at least one human object; utilizing a hierarchal graph comprising a plurality of nodes to model the at least one human object, wherein the nodes correspond to anatomical features associated with a human body; generating inference information for the nodes in the hierarchal graph, wherein generating inference information includes: deriving, with the neural network architecture, direct inference information for at least a portion of the nodes included in the hierarchal graph; deriving, with the neural network architecture, top-down inference information for at least a portion of the nodes included in the hierarchal graph; and deriving, with the neural network architecture, bottom-up inference information for at least a portion of the nodes included in the hierarchal graph; and generating, with the neural network architecture, parsing results based, at least in part, on the inference information associated with the nodes, in combination with all other limitations in the claim(s) as defined by applicant.

Claim 20 is allowed because the closest prior art, either alone or in combination, fails to anticipate or render obvious a computer program product comprising a non-transitory computer-readable medium, including instructions for causing a computer to: receive, at a neural network architecture comprising a human parsing network, an image comprising at least one human object; utilize a hierarchal graph comprising a plurality of nodes to model the at least one human object, wherein the nodes correspond to anatomical features associated with a human body; generate inference information for the nodes in the hierarchal graph, wherein generating inference information includes: deriving, with the neural network architecture, direct inference information for at least a portion of the nodes included in the hierarchal graph; deriving, with the neural network architecture, top-down inference information for at least a portion of the nodes included in the hierarchal graph; and deriving, with the neural network architecture, bottom-up inference information for at least a portion of the nodes included in the hierarchal graph; and generate, with the neural network architecture, parsing results based, at least in part, on the inference information associated with the nodes, in combination with all other limitations in the claim(s) as defined by applicant.

Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee.  Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”


Contact Information
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NIMESH PATEL whose telephone number is (571)270-1228.  The examiner can normally be reached on Monday thru Friday: 6:30 AM - 3:30 PM EST.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. 

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Rafael Perez-Gutierrez, can be reached at 571-272-7915.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/NIMESH PATEL/Primary Examiner, Art Unit 2642