DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.


Claims 1-5 and 7-22 are rejected under 35 U.S.C. 103 as being unpatentable over US 20190387168 A1 (Smith), in view of Shi, Y., Tian, Y., Wang, Y. and Huang, T., 2017. Sequential deep trajectory descriptor for action recognition with three-stream CNN. IEEE Transactions on Multimedia, 19(7), pp.1510-1520 (Shi) and in further view of Wang, L., Ge, L., Li, R. and Fang, Y., 2017. Three-stream CNNs for action recognition. Pattern Recognition Letters, 92, pp.33-40 (Wang).
Regarding Claims 1, 12-13, 18 and 22, Smith teaches:
A method of controlling an augmented reality (AR) apparatus, the method comprising: acquiring a video; detecting a human body from the acquired video (Smith: an MR environment/system with augmented reality capability comprises visual input 702 and sensory input 706 attached to the wearable device, room camera 704 and room sensory input 704, object recognizers 708a-n, and others; [0092]-[0094], object recognizers may use captured images to recognize and track events and objects and act accordingly, i.e. control AR with predetermined mapping functions on AR; furthermore, the object recognition is performed through machine learning algorithms, and the operations, objects, and others may be displayed on the screen of the wearable device).
Smith does not explicitly teach action prediction/recognition. However, Shi teaches an action recognition system and method (Shi: Figs. 4-5) that takes a video stream and generates three streams: the spatial stream for spatial features (frame-based local feature, which is further illustrated in Fig. 1 of Wang), the temporal stream for short-term motion (video-based local feature), and the sDTD stream for long-term motion (video-based global feature); and fuses the respective features to perform action prediction/recognition.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Smith with action prediction/recognition as further taught by Shi and Wang. The advantage of doing so is to characterize long-term motion in video effectively, consequently facilitating action recognition (Shi: Introduction).
Regarding Claims 2, 14 and 19, Smith as modified teaches all elements of Claims 1, 13 and 18, respectively. Smith as modified further teaches:
The method of claim 1, wherein the acquired video is one or more of a video of the AR apparatus and a user of the AR apparatus captured by a camera distinguished from the AR apparatus, a video captured from a viewpoint of the user of the AR apparatus with a camera connected to the AR apparatus, and a video generated and stored in advance in the AR apparatus (Smith: Fig. 7, images may be taken by room cameras (3rd party) or camera associated with the wearable device).
Regarding Claims 3 and 15, Smith as modified teaches all elements of Claims 1 and 13, respectively. Smith as modified further teaches:
The method of claim 1, wherein in response to the acquired video being a video of the AR apparatus and a user of the AR apparatus captured by a camera distinguished from the AR apparatus, the detecting of the human body comprises: recognizing the user corresponding to the detected human body; associating the user with the AR apparatus based on an AR user database (DB); and establishing a communication with the AR apparatus (Smith: [0092]-[0094], the MR system may recognize a human pose, a person, among other objects, and supplement objects with semantic information to give life to the objects, which implies that the MR system associates the objects with an associated user DB and establishes a communication with the AR apparatus based on the recognition and the DB).
Regarding Claims 4, 16 and 20, Smith as modified teaches all elements of Claims 1, 13 and 18, respectively. Smith as modified further teaches:

Regarding Claim 5, Smith as modified teaches all elements of Claim 1. Smith as modified further teaches:
The method of claim 1, wherein the performing of the action prediction comprises: acquiring a frame-based local feature image, a video-based local feature image and a video-based global feature image from an image frame of the acquired video; acquiring action classification results by performing an action classification with regard to the human body actions based on any two or any combination of a first action classification scheme of using the frame-based local feature image and a human body pose feature, a second action classification scheme of using the video-based local feature image, and a third action classification scheme of using the video-based global feature image and the video-based local feature image; and fusing the action classification results and performing the action prediction (Shi: Figs. 4-5 and Wang: Fig. 1).
Regarding Claim 7, Smith as modified teaches all elements of Claims 1/5. Smith as modified further teaches:
(extracting the object (ROI) is the first step of object recognition, and the choice of neural network configuration, whether a CNN, DNN, CNN-RNN, etc., depends on the application and the desired performance and cost, as further illustrated by Shi, Wang, Liang, etc.).
Regarding Claim 8, Smith as modified teaches all elements of Claims 1/5. Smith as modified further teaches:
The method of claim 5, wherein the acquiring of the frame-based local feature image, the video-based local feature image and the video-based global feature image from the image frame of the video comprises performing an action localization on a plurality of frame-based local feature images including the frame-based local feature 
Regarding Claim 9, Smith as modified teaches all elements of Claims 1/5/8. Smith as modified further teaches:
The method of claim 8, wherein the performing of the action localization comprises performing the action localization with a fully connected (FC) network that comprises a first FC branch that determines which ROI candidate includes the human body and a second FC branch that determines a position of a box including the human body (Shi: Figs. 4-5).
Regarding Claims 10, 17 and 21, Smith as modified teaches all elements of Claims 1, 13 and 18 respectively. Smith as modified further teaches:
The method of claim 1, wherein the performing of the action prediction comprises: acquiring a video-based local feature image from an image frame of the video; extracting a first feature associated with a human body pose action and a second feature associated with an interactive action from the video-based local feature image with a first 3D CNN having a human body pose action as a classification label and a second 3D CNN having an interactive action as a classification label; and fusing the first feature and the second feature and acquiring an action classification result (Wang: Fig. 2, feature data is augmented, i.e. labeled).
Regarding Claim 11, Smith as modified teaches all elements of Claims 1/10. Smith as modified further teaches:
The method of claim 10, wherein the first 3D CNN is trained in advance with a loss function that classifies pose actions with a plurality of labels in mutually exclusive neural networks are trained in supervised learning or untrained in un-supervised learning, which is known).
Allowable Subject Matter
Claim 6 is objected to as being dependent upon a rejected base claim, but is potentially allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ZHITONG CHEN whose telephone number is (571) 270-1936.  The examiner can normally be reached on M-F 9:30am - 5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, Applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Yuwen Pan, can be reached at 571-272-7855.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should 

/ZHITONG CHEN/
Primary Examiner, Art Unit 2649