DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant's arguments and amendments filed March 26, 2021 (herein “Amendment”) regarding the rejection of claims 1, 8, 9, 16 and claims depending therefrom under 35 U.S.C. 103 have been fully considered and are persuasive to the extent that the rejection rationale has been updated to address the new amendments to the independent claims; however, the combination of prior art cited in the 35 U.S.C. 103 rejection is constructively maintained. Applicant contends on page 11 of the Amendment that the rationale for rejecting the “version number of the application” (see Non-Final Action dated December 31, 2020 on page 32), which cites para. [0149] of Cross and the “Java Speech Grammar Format” (JSGF) disclosed therein, does not teach “an application” because JSGF is a “textual representation of grammar in speech recognition.” Notwithstanding that the broadest reasonable interpretation of “application” could include a code snippet written using JSGF version 1.0 such as set forth in para. [0164], at least other portions of the cited paragraphs teach an application, including the JSGF syntax instructing that when a particular event is triggered, a matched search result should use Google™ (note the syntax public <search-action-final> = <action> and <action> = <click> | <map> | <google>). Para. [0167] of Cross details this aspect of the code snippet defined by JSGF, and therefore, at least the reference to based on at least the element information,” (new amendment is given in italics) especially since the code snippet not only defines use of the Google™ search application for searches, but also specifies that such use of the application is triggered when a particular event of the webpage (such as a click) is triggered.


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-3, 6, 8-11, 14, and 16-18 are rejected under 35 U.S.C. 103 as being unpatentable over Orr et al. (US 2017/0068670 A1, herein “Orr”) in view of Jing (WO 2017028601 A1, with reference to the EPO Machine English Translation thereof, herein “Jing”), further in view of Cai et al. (US 2013/0014002 A1, herein “Cai”), further in view of Cross (US 2008/0228494 A1, herein “Cross”).
Regarding claim 1, Orr teaches a method for controlling a page, comprising (Orr Abstract, process for operating a digital assistant in a media environment including media items that are displayed, where figs. 6A-6K show that the media is shown in a page format): 
receiving voice information from a terminal (Orr para. [0102], audio input is received, the audio input being a media-related request in the form of a user utterance (voice information) from a microphone of the media device (terminal)) and element information of at least one element in a displayed page (Orr para. [0147], the media-related request is a search request based on a media item (at least one element) on which the user interface is focused on the current display of user interface 602 (page) – see fig. 6A, cursor 609), the element information comprising an element identifier, element content information, position information of an element on the displayed page (Orr para. [0147], media item 611 on which the user interface is focused via cursor 609 is a movie with a title (element identifier) and one or more parameter values such as movie director, and names of actors starring in the movie (element content information), where the cursor is positioned on a media item, and the cursor’s position on a media item is used as context for determining user intent in their voice command, thus since the media item corresponds to the claimed element, and the cursor position is known to correspond to a media item for disambiguating a voice command, then cursor position is position information of the media item); 
performing voice recognition on the voice information to acquire a voice recognition result (Orr para. [0064], speech input from user is received by the I/O processing module which forwards it to a speech-to-text (STT) processing module to convert the speech to text (voice recognition result)), in response to determining the voice information being used for controlling the displayed page (Orr paras. [0065], [0072], [0081] and [0084], the natural language processing which is part of the STT, determines an actionable intent from the sequence of words in the user utterance, the actionable intent identifying the task that the user intends the digital assistant to perform, such as a media search (which subsequently controls the listing of media on the displayed screen (page))); 
matching the voice recognition result with the element content information of the at least one element in the displayed page (Orr paras. [0084] and [0147], when user’s utterance contains insufficient information to complete the structured query, such as a parameter {media title} is missing, then the natural language processing module populates this parameter (matches the structured query which is the result of the voice recognition) to received contextual information (element content information) such as a title currently playing on the media device (thus in the displayed page), or as disclosed in para. [0147], the media title of where the cursor currently is positioned determines what “this” in the user query corresponds to); and 
generating page control information in response to determining successfully matching the voice recognition result with the element content information of the at least one element (Orr para. [0147], when it is determined that the user utterance is a media-related request to obtain an alternative set of media items similar to the media item currently in the position where the cursor is located, then a third set of media items is obtained (page control information) for display).
Orr does not explicitly teach simultaneously from a terminal, the displayed page being a page displayed in the terminal when the voice information is sent by the terminal.
Orr further does not explicitly teach and sending the page control information to the terminal to allow the terminal to control the displayed page based on the page control information, the page control information comprising a click operation for the element with matched element content information in the displayed page, an element identifier of the element with the matched element content information and position information of the element with the matched element content information on the displayed page.
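Orr still further does not explicitly teach that the element information also comprises “and at least one of: an application name of an application corresponding to the displayed page or a version number of the application corresponding to the displayed page.”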
Finally, Orr does not explicitly teach that the generating of the page control information is based on at least the element information.
Jing teaches simultaneously from a terminal, the displayed page being a page displayed in the terminal when the voice information is sent by the terminal (Jing paras. 252 and 271, in step S10, the voice instruction sent by the voice input device is received and the parameter information of all controllable control objects on the current display page of the smart terminal are collected).
Jing teaches and sending the page control information to the terminal to allow the terminal to control the displayed page based on the page control information (Jing paras. 493, 372, 502, 745, the controllable control object is obtained to implement control operations according to matched text information from the user voice input, where the control operations include how the control object is displayed such as expanding a drop down list and displaying the content of the drop down list), the page control information comprising a click operation for the element with matched element content information in the displayed page (Jing paras. 493, 372, 502, 739, 745, control operations to be implemented corresponding to the controllable control object which is the controllable control object corresponding to the matched text information, where para. 745 gives an example of the controllable control object being a link on the webpage and the operation that is executed being a jump to the webpage corresponding to the link (click operation), as well as an example of an OK button’s operation that is executed being the corresponding operation of the OK button (thus triggering a button click)), an element identifier of the element with the matched element content information (Jing para. 493, control identifier of the controllable control object corresponding to the matched text information) on the displayed page (Jing para. 463, the controllable control object is an object on the current display page of the smart terminal).
Cai teaches and position information of the element (Cai fig. 1, paras. [0063], [0101], position information of a DOM node is extracted from the DOM structure).
Cross teaches and at least one of: an application name of an application corresponding to the displayed page or a version number of the application corresponding to the displayed page (where the claim only requires “at least one,” Cross teaches in paras. [0145], [0149]-[0155], [0164]-[0167], that the matched search result appends additional web content into the DOM that is rendered as a result, and where such additional web content includes a VoiceXML action grammar which can be an event expression comprised of the information shown in paras. [0149] and [0164], where the example in para. [0164] includes or references code that is executed when a VoiceXML event is triggered, where the executed code is shown in para. [0166], and described in para. [0167] as specifying that an event can contain a value of click, map or google, where if the value is “google” then a new window is opened on the browser and a search is performed using Google™ (an application); accordingly, the references in the code snippets (including the element information) showing “google” is an application name of an application (google) corresponding to the link on the displayed page) and based on at least the element information (Cross paras. [0164]-[0167], the multimodal browser performs an action (page control information) in dependence on the action identifier that is part of the VoiceXML element shown in paras. [0164]-[0165]).
Therefore, taking the teachings of Orr and Jing together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the digital assistant and operations thereof as specifically cited to above in Orr with the voice instruction information provided along with parameter information of control objects on a current display page as disclosed in the specific passages cited to above in Jing at least because doing so would allow for entire voice control of an intelligent terminal, ensuring that a user never loses remote control of a smart device such as a smart TV (see Jing paras. 136 and 26).
Further, taking the teachings of Orr and Cai together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the digital assistant and operations thereof as specifically cited to above in Orr with the position information within the DOM as disclosed in the specific passages cited to above in Cai at least because doing so would allow for accurately locating information and obtaining an accurate extraction result with good robustness even after content of a web page is updated and the structure of the web page is changed (Cai para. [0008]).
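Still further, taking the teachings of Orr and Cross together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the digital assistant and operations thereof as specifically cited to above in Orr with the additional web content information as disclosed in the specific passages cited to above in Cross at least because doing so would allow for speech-enabled searching of content regardless of whether the content is speech-enabled (see Cross para. [0168]).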
Regarding claims 2 and 10, Orr teaches wherein the matching the voice recognition result with the element content information of the at least one element comprises: calculating, for element content information of each element among the at least one element, a similarity between the voice recognition result and the element content information of the element (Orr paras. [0151] and [0153], media request in text form (from user utterance which as disclosed above in the rejection rationale for claim 1, is translated to text (voice recognition result) via natural language processing in STT), can have fuzzy string matching performed including the determination of a shortest edit distance (calculating a similarity), where an alternate value (as a recognition result) can be found (successful matching) based on a parameter value that has the shortest edit distance among a plurality of media (element) -related parameter values (element content information) in a data structure), and determining successfully matching of the voice recognition result with the element based on the calculated similarity (Orr para. [0153], a particular parameter value is found (matching) to have the shortest edit distance among a plurality of media-related parameter values with the text version of the user utterance as a media request, and where para. [0155] teaches that the alternative parameter values that likely represent the actual intent of the user are displayed to the user, thus some “success” is attributed as these alternative parameter values are considered likely enough that they are presented to the user).
Regarding claims 3 and 11, Orr teaches wherein the calculating a similarity between the voice recognition result and the element content information of the element, and determining successfully matching of the voice recognition result with the element based on the calculated similarity comprises: calculating a first edit distance between the voice recognition result and the element content information of the element (Orr paras. [0151]-[0153], fuzzy string matching including finding a parameter value (element content information) for a media item (element) to be returned as a result of a media search and then later displayed on the display unit, that has the shortest edit distance (thus an edit distance is calculated/determined) to the user uttered media request term, in the given example, user media request (for media items like movies (element))  “Chris Rucker”, as text returned from the STT (voice recognition result) has an edit distance determined with a plurality of media-related parameter values in a data structure (element content information – as actors names are a parameter value with movie results) to find a shortest edit distance); 
determining whether the first edit distance is greater than a preset first threshold (Orr para. [0153], the shortest edit distance is determined within a predetermined value (threshold); finding that an edit distance falls within that value is a determination that the edit distance is not greater than the threshold, and thus a determination of whether it is greater); and 
determining the successfully matching the voice recognition result with the element content information of the element in response to determining the first edit distance being not greater than the first threshold (Orr paras. [0153]-[0155], the media (element)-related parameter (element content information) from a plurality of potential media-related parameters that has a shortest edit distance (edit distance not greater) within a predetermined threshold (than the first threshold), where media items (element) with the parameter (element content information) of the parameter found to have the shortest edit distance, are presented to the user as the search results (thus a successful match)).
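By way of illustration only, the edit-distance comparison mapped above for claims 3 and 11 can be sketched as follows. The sketch is not taken from Orr or from the application; the function names, the threshold value of 2, and the candidate strings other than the “Chris Rucker” request noted above are hypothetical and are chosen solely to show a distance being computed and then tested against a preset threshold.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def match_element(recognized: str, contents: list[str], threshold: int):
    """Return the closest content string when its distance is not greater
    than the threshold; otherwise return None (no successful match)."""
    if not contents:
        return None
    best = min(contents, key=lambda c: edit_distance(recognized, c))
    return best if edit_distance(recognized, best) <= threshold else None


# Hypothetical candidate values standing in for element content information.
candidates = ["Chris Tucker", "Chris Rock"]
result = match_element("Chris Rucker", candidates, threshold=2)
```

Under these assumptions the request text differs from the first candidate by a single substitution, so it falls within the threshold and is treated as a successful match.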
Regarding claims 6 and 14, Orr teaches wherein the element information of the at least one element is stored in a tree structure (Orr paras. [0074] and [0075], ontology 460 being a hierarchical structure containing many nodes in a tree-structure as shown in fig. 4B, reference 460 and fig. 4C, where nodes of the ontology represent a property (element information) relevant to an actionable intent or other property (where properties of a media domain include the movie title (element))); and 
the matching the voice recognition result with the element content information of the at least one element comprises: traversing each subnode of the tree structure (Orr paras. [0080]-[0081], natural language processing module determines what nodes of the ontology are implicated by the words in the token sequence, thus searching each node in the ontology to find those matching words in the token sequence); and 
matching the element content information of the element represented by the each subnode and the voice recognition result (Orr paras. [0080]-[0081], [0084], fig. 4C, when the user says “find me other seasons of this TV series,” which is translated to text via the STT process, and a structured query therefrom, the actionable intent of “media search” is determined, and a “media” node from the ontology is determined (tree), then from the media node, a parameter “media title” (element content information) is matched to the user inquiry including “this TV series”).
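By way of illustration only, the traversal-and-match limitation addressed for claims 6 and 14 can be sketched as follows. The sketch is not drawn from Orr or from the application; the Node type, the identifiers, the sample titles, and the case-insensitive containment rule used for matching are all hypothetical assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node holding element information."""
    identifier: str
    content: str = ""
    children: list["Node"] = field(default_factory=list)


def iter_subnodes(root: Node):
    """Yield every subnode below root in pre-order (root itself excluded)."""
    stack = list(reversed(root.children))
    while stack:
        node = stack.pop()
        yield node
        stack.extend(reversed(node.children))


def find_matches(root: Node, recognized: str) -> list[str]:
    """Return identifiers of subnodes whose content appears in the
    recognized text, compared case-insensitively (an assumed rule)."""
    needle = recognized.casefold()
    return [n.identifier for n in iter_subnodes(root)
            if n.content and n.content.casefold() in needle]


# Hypothetical page tree: one row with two items, plus one sibling item.
page = Node("root", children=[
    Node("row-1", children=[Node("item-1", "The Martian"),
                            Node("item-2", "Gravity")]),
    Node("item-3", "Interstellar"),
])
matches = find_matches(page, "play the martian")
```

Each subnode is visited once, and only nodes carrying content are compared against the recognized text.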
Regarding claim 8, Orr teaches a method for controlling a page, comprising (Orr Abstract, process for operating a digital assistant in a media environment including media items that are displayed, where figs. 6A-6K show that the media is shown in a page format): 
sending (Orr para. [0038], digital assistant client provides (sending)), in response to receiving voice information from a user (Orr para. [0102], audio input is received, the audio input being a media-related request in the form of a user utterance (voice information) from a microphone), the voice information and element information of at least one element in a displayed page to a server (Orr paras. [0038]-[0039], digital assistant client provides (sending) contextual information and user input (the voice information) to a digital assistant server, where para. [0147] teaches the user input/utterance to be a media-related request is a search request based on a media item (at least one element) on which the user interface is focused on the current display of user interface 602 (displayed page) – see fig. 6A, cursor 609), the element information comprising an element identifier, element content information (Orr para. [0147], media item 611 on which the user interface is focused via cursor 609 is a movie with a title (element identifier) and one or more parameter values such as movie director, and names of actors starring in the movie (element content information)), position information of an element on the displayed page (Orr para. [0147], media item 611 on which the user interface is focused via cursor 609 is a movie with a title (element identifier) and one or more parameter values such as movie director, and names of actors starring in the movie (element content information), where the cursor is positioned on a media item, and the cursor’s position on a media item is used as context for determining user intent in their voice command, thus since the media item corresponds to the claimed element, and the cursor position is known to correspond to a media item for disambiguating a voice command, then cursor position is position information of the media item); 
receiving page control information (Orr paras. [0149]-[0150], when there is at least one media item in the third set of media items, then the third set of media items is displayed by replacing the primary set of media items with the third set of media items, where the replacing is a page control information under a broadest reasonable interpretation of “page control information,” the information that controls the display unit to display in accordance with updating the primary set of media items with the third set of media items), the page control information being generated after the determines (Orr para. [0147], when it is determined that the user utterance is a media-related request to obtain an alternative set of media items similar to the media item currently in the position where the cursor is located, then a third set of media items is obtained (page control information) for display) the voice information being used for controlling the displayed page (Orr paras. [0065], [0072], [0081] and [0084], the natural language processing which is part of the STT, determines an actionable intent from the sequence of words in the user utterance, the actionable intent identifying the task that the user intends the digital assistant to perform, such as a media search (which subsequently controls the listing of media on the displayed screen (page))) and a voice recognition result of the voice information (Orr para. [0064], speech input from user is received by the I/O processing module which forwards it to a speech-to-text (STT) processing module to convert the speech to text (voice recognition result)) matching the element (Orr paras. [0065], [0072], [0081] and [0084], the natural language processing which is part of the STT, determines an actionable intent from the sequence of words in the user utterance, the actionable intent identifying the task that the user intends the digital assistant to perform, such as a media search (which subsequently controls the listing of media (thus matching the element content information) on the displayed screen (in the displayed page))); and 
determining a position of the element with the matched element content information on the displayed page based on the element identifier of the element with the matched element content information, in the determined position (Orr para. [0147], when it is determined that the user utterance is a media-related request to obtain an alternative set of media items similar to the media item with parameter values (element identifier) currently in the position where the cursor is located (a position of the element with the matched element content information)).
	Orr does not explicitly teach that the page control information is received from the server, although Orr does teach that page control information is received. Similarly, Orr does not specifically teach that it is specifically the “server” that determines the voice information being used for controlling the displayed page, although Orr does teach that the digital assistant system determines the voice information being used for controlling the displayed page. However, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention for the receiving page control information to be “from the server” as claimed, and that the server determines the voice information being used for controlling the displayed page, since Orr teaches in para. 
	Further, Orr does not explicitly teach simultaneously.
Still further, Orr does not explicitly teach “the page control information comprising a click operation for the element with matched element content information in the displayed page, an element identifier of the element with the matched element content information and position information of the element with the matched element content information on the displayed page;” or “and the position information of the element with the matched element content information, and executing the click operation.”
Orr still further does not explicitly teach that the element information also comprises “and at least one of: an application name of an application corresponding to the displayed page or a version number of the application corresponding to the displayed page.”
Finally, Orr does not explicitly teach that the generating of the page control information is based on at least the element information.

Jing teaches simultaneously (Jing paras. 252 and 271, in step S10, the voice instruction sent by the voice input device is received and the parameter information of all controllable control objects on the current display page of the smart terminal are collected).
Jing teaches the page control information comprising a click operation for the element with matched element content information in the displayed page (Jing paras. 493, 372, 502, 745, the controllable control object is obtained to implement control operations according to matched text information from the user voice input, where the control operations include how the control object is displayed such as expanding a drop down list and displaying the content of the drop down list, where para. 745 gives an example of the controllable control object being a link on the webpage and the operation that is executed being a jump to the webpage corresponding to the link (click operation), as well as an example of an OK button’s operation that is executed being the corresponding operation of the OK button (thus triggering a button click)), an element identifier of the element with the matched element content information (Jing para. 493, control identifier of the controllable control object corresponding to the matched text information) on the displayed page (Jing para. 463, the controllable control object is an object on the current display page of the smart terminal).
	Jing further teaches with the matched element content information (Jing para. 493, the controllable control object corresponding to the matched text information), and executing the click operation (Jing para. 745, an example of the controllable control object being a link on the webpage and the operation that is executed being a jump to the webpage corresponding to the link (click operation), as well as an example of an OK button’s operation that is executed being the corresponding operation of the OK button (thus triggering a button click)).
Cai teaches the page control information comprising position information of the element (Cai fig. 1, paras. [0063], [0101], position information of a DOM node is extracted from the DOM structure; Cai paras. [0103]-[0107], a DOM structure can be traversed to obtain the position information of particular information to be extracted in the DOM).
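By way of illustration only, obtaining position information for a node by traversing a DOM-like tree, as characterized above with respect to Cai, can be sketched as follows. The sketch is not Cai's method; the sample markup, the element ids, and the representation of position as a list of child indices are hypothetical assumptions, and Python's standard xml.etree.ElementTree module stands in for a browser DOM.

```python
import xml.etree.ElementTree as ET


def node_position(root, target):
    """Return the child-index path from root to target, or None when the
    target is not found beneath root. An empty list means root is target."""
    if root is target:
        return []
    for index, child in enumerate(root):
        sub_path = node_position(child, target)
        if sub_path is not None:
            return [index] + sub_path
    return None


# Hypothetical page markup with two link elements.
page = ET.fromstring(
    "<html><body><div>"
    "<a id='home'>Home</a><a id='ok'>OK</a>"
    "</div></body></html>"
)
target = page.find(".//a[@id='ok']")
position = node_position(page, target)
```

Here the path records, at each level of the tree, which child leads toward the element of interest.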
Cross teaches and at least one of: an application name of an application corresponding to the displayed page or a version number of the application corresponding to the displayed page (where the claim only requires “at least one,” Cross teaches in paras. [0145], [0149]-[0155], [0164]-[0167], that the matched search result appends additional web content into the DOM that is rendered as a result, and where such additional web content includes a VoiceXML action grammar which can be an event expression comprised of the information shown in paras. [0149] and [0164], where the example in para. [0164] includes or references code that is executed when a VoiceXML event is triggered, where the executed code is shown in para. [0166], and described in para. [0167] as specifying that an event can contain a value of click, map or google, where if the value is “google” then a new window is opened on the browser and a search is performed using Google™ (an application); accordingly, the references in the code snippets (including the element information) showing “google” is an application name of an application (google) corresponding to the link on the displayed page) and based on at least the element information (Cross paras. [0164]-[0167], the multimodal browser performs an action (page control information) in dependence on the action identifier that is part of the VoiceXML element shown in paras. [0164]-[0165]).
Therefore, taking the teachings of Orr and Jing together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the digital assistant and operations thereof as specifically cited to above in Orr with the voice instruction information provided along with parameter information of control objects on a current display page as disclosed in the specific passages cited to above in Jing at least because doing so would allow for entire voice control of an intelligent terminal, ensuring that a user never loses remote control of a smart device such as a smart TV (see Jing paras. 136 and 26).
Further, taking the teachings of Orr and Cai together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the digital assistant and operations thereof as specifically cited to above in Orr with the position information within the DOM as disclosed in the specific passages cited to above in Cai at least because doing so would allow for accurately locating information and obtaining an accurate extraction result with good robustness even after content of a web page is updated and the structure of the web page is changed (Cai para. [0008]).
Still further, taking the teachings of Orr and Cross together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the digital assistant and operations thereof as specifically cited to above in Orr with the additional web content information as disclosed in the specific passages cited to above in Cross at least because doing so would allow for speech-enabled searching of content regardless of whether the content is speech-enabled (see Cross para. [0168]).
Regarding claim 9, Orr teaches an apparatus for controlling a page, comprising (Orr fig. 1, para. [0018], Abstract, figs. 6A-6K system for operating a digital assistant, where the system executes a process for operating a digital assistant in a media environment including media items that are displayed, where figs. 6A-6K show that the media is shown in a page format): 
at least one processor (Orr paras. [0020], [0026], and [0030], digital assistant “DA” client and server including a media device, where both the media device and the server have one or more processors/processing modules); and 
a memory storing instructions, which when executed by the at least one processor, cause the at least one processor to perform operations, the operations comprising (Orr paras. [0020], and [0034], digital assistant “DA” client and server including a media device where both the media device and server have a memory storing instructions for performing the processes disclosed therein): 
receiving voice information (Orr para. [0102], audio input is received, the audio input being a media-related request in the form of a user utterance (voice information) from a microphone of the media device) and element information of at least one element in a displayed page (Orr para. [0147], the media-related request is a search request based on a media item (at least one element) on which the user interface is focused on the current display of user interface 602 (page) – see fig. 6A, cursor 609), the element information comprising an element identifier, element content information,  position information of an element on the displayed page (Orr para. [0147], media item 611 on which the user interface is focused via cursor 609 is a movie with a title (element identifier) and one or more parameter values such as movie director, and names of actors starring in the movie (element content information), where the cursor is positioned on a media item, and the cursor’s position on a media item is used as context for determining user intent in their voice command, thus since the media item corresponds to the claimed element, and the cursor position is known to correspond to a media item for disambiguating a voice command, then cursor position is position information of the media item); 
performing voice recognition on the voice information to acquire a voice recognition result (Orr para. [0064], speech input from user is received by the I/O processing module which forwards it to a speech-to-text (STT) processing module to convert the speech to text (voice recognition result)), in response to determining the voice information being used for controlling the displayed page (Orr paras. [0065], [0072], [0081] and [0084], the natural language processing which is part of the STT, determines an actionable intent from the sequence of words in the user utterance, the actionable intent identifying the task that the user intends the digital assistant to perform, such as a media search (which subsequently controls the listing of media on the displayed screen (page)));
matching the voice recognition result with the element content information of the at least one element in the displayed page (Orr paras. [0084] and [0147], when user’s utterance contains insufficient information to complete the structured query, such as a parameter {media title} is missing, then the natural language processing module populates this parameter (matches the structured query which is the result of the voice recognition) to received contextual information (element content information) such as a title currently playing on the media device (thus in the displayed page), or as disclosed in para. [0147], the media title of where the cursor currently is positioned determines what “this” in the user query corresponds to); and 
generating page control information (Orr para. [0147], when it is determined that the user utterance is a media-related request to obtain an alternative set of media items similar to the media item currently in the position where the cursor is located, then a third set of media items is obtained (page control information) for display).
Orr does not explicitly teach simultaneously from a terminal, the displayed page being a page displayed in the terminal when the voice information is sent by the terminal.
Orr further does not explicitly teach and sending the page control information to the terminal to allow the terminal to control the displayed page based on the page control information, the page control information comprising a click operation for the element with matched element content information in the displayed page, an element identifier of the element with the matched element content information and position information of the element with the matched element content information on the displayed page.
Orr still further does not explicitly teach that the element information also comprises “and at least one of: an application name of an application corresponding to the displayed page or a version number of the application corresponding to the displayed page.”
Finally, Orr does not explicitly teach that the generating of the page control information is based on at least the element information.
Jing teaches simultaneously from a terminal, the displayed page being a page displayed in the terminal when the voice information is sent by the terminal (Jing paras. 252 and 271, in step S10, the voice instruction sent by the voice input device is received and the parameter information of all controllable control objects on the current display page of the smart terminal are collected).
Jing teaches and sending the page control information to the terminal to allow the terminal to control the displayed page based on the page control information (Jing paras. 493, 372, 502, 745, the controllable control object is obtained to implement control operations according to matched text information from the user voice input, where the control operations include how the control object is displayed such as expanding a drop down list and displaying the content of the drop down list), the page control information comprising a click operation for the element with matched element content information in the displayed page (Jing paras. 493, 372, 502, 739, 745 control operations to be implemented corresponding to the controllable control object which is the controllable control object corresponding to the matched text information, where para 745 gives an example of the controllable control object being a link on the webpage and the operation that is executed being a jumping to the webpage corresponding to the link (click operation), as well as an example of an OK button’s operation that is executed being the corresponding operation of the OK button (thus triggering a button click)) an element identifier of the element with the matched element content information with the matched element content information (Jing para. 493, control identifier of the controllable control object corresponding to the matched text information) on the displayed page (Jing para. 463, the controllable control object is an object on the current display page of the smart terminal).
Cai teaches and position information of the element (Cai fig. 1, paras. [0063], [0101], position information of a DOM node is extracted from the DOM structure).
Cross teaches and at least one of: an application name of an application corresponding to the displayed page or a version number of the application corresponding to the displayed page (where the claim requires only “at least one” of the alternatives, Cross teaches in paras. [0145], [0149]-[0155] and [0164]-[0167] that the matched search result appends additional web content into the DOM that is rendered as a result, where such additional web content includes a VoiceXML action grammar, which can be an event expression comprising the information shown in paras. [0149] and [0164]; the example in para. [0164] includes or references code that is executed when a VoiceXML event is triggered, the executed code is shown in para. [0166], and para. [0167] describes that code as specifying that an event can contain a value of click, map or google, where if the value is “google” then a new window is opened in the browser and a search is performed using Google™ (an application); accordingly, the references in the code snippets (including the element information) showing “google” are an application name of an application (Google) corresponding to the link on the displayed page) and based on at least the element information (Cross paras. [0164]-[0167], the multimodal browser performs an action (page control information) in dependence on the action identifier that is part of the VoiceXML element shown in paras. [0164]-[0165]).
Therefore, taking the teachings of Orr and Jing together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the digital assistant and operations there of as specifically cited to above in Orr with the voice instruction information provided along 
Further, taking the teachings of Orr and Cai together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the digital assistant and operations thereof as specifically cited to above in Orr with the position information within the DOM as disclosed in the specific passages cited to above in Cai at least because doing so would allow for accurately locating information and obtaining an accurate extraction result with good robustness even after content of a web page is updated and the structure of the web page is changed (Cai para. [0008]).
Still further, taking the teachings of Orr and Cross together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the digital assistant and operations thereof as specifically cited to above in Orr with the additional web content information as disclosed in the specific passages cited to above in Cross at least because doing so would allow for speech-enabled searching of content regardless of whether the content is speech-enabled (see Cross para. [0168]).
Regarding claim 16, Orr teaches an apparatus for controlling a page, comprising (Orr fig. 1, para. [0018], Abstract, figs. 6A-6K system for operating a digital assistant, where the system executes a process for operating a digital assistant in a media environment including media items that are displayed, where figs. 6A-6K show that the media is shown in a page format): 
at least one processor (Orr paras. [0020], [0030], and [0026] and digital assistant “DA” client and server including a media device, where both the media device and the server have one or more processors/processing modules); and 
a memory storing instructions, which when executed by the at least one processor, cause the at least one processor to perform operations, the operations comprising (Orr paras. [0020], and [0034], digital assistant “DA” client and server including a media device where both the media device and server have a memory storing instructions for performing the processes disclosed therein): 
sending (Orr para. [0038], digital assistant client provides (sending)), in response to receiving voice information from a user (Orr para. [0102], audio input is received, the audio input being a media-related request in the form of a user utterance (voice information) from a microphone), the voice information and element information of at least one element in a displayed page to a server (Orr paras. [0038]-[0039], digital assistant client provides (sending) contextual information and user input (the voice information) to a digital assistant server, where para. [0147] teaches the user input/utterance to be a media-related request is a search request based on a media item (at least one element) on which the user interface is focused on the current display of user interface 602 (displayed page) – see fig. 6A, cursor 609), the element information comprising an element identifier, element content information (Orr para. [0147], media item 611 on which the user interface is focused via cursor 609 is a movie with a title (element identifier) and one or more parameter values such as movie director, and names of actors starring in the movie (element content information)), position information of an element on the displayed page (Orr para. [0147], media item 611 on which the user interface is focused via cursor 609 is a movie with a title (element identifier) and one or more parameter values such as movie director, and names of actors starring in the movie (element content information), where the cursor is positioned on a media item, and the cursor’s position on a media item is used as context for determining user intent in their voice command, thus since the media item corresponds to the claimed element, and the cursor position is known to correspond to a media item for disambiguating a voice command, then cursor position is position information of the media item); 
receiving page control information (Orr paras. [0149]-[0150], when there is at least one media item in the third set of media items, then the third set of media items is displayed by replacing the primary set of media items with the third set of media items, where the replacing is a page control information under a broadest reasonable interpretation of “page control information,” the information that controls the display unit to display in accordance with updating the primary set of media items with the third set of media items), the page control information being generated after the determines (Orr para. [0147], when it is determined that the user utterance is a media-related request to obtain an alternative set of media items similar to the media item currently in the position where the cursor is located, then a third set of media items is obtained (page control information) for display) the voice information being used for controlling the displayed page (Orr paras. [0065], [0072], [0081] and [0084], the natural language processing which is part of the STT, determines an actionable intent from the sequence of words in the user utterance, the actionable intent identifying the task that the user intends the digital assistant to perform, such as a media search (which subsequently controls the listing of media on the displayed screen (page))) and a voice recognition result of the voice information (Orr para. [0064], speech input from user is received by the I/O processing module which forwards it to a speech-to-text (STT) processing module to convert the speech to text (voice recognition result)) matching the element content information of the at least one element (Orr paras. [0065], [0072], [0081] and [0084], the natural language processing which is part of the STT, determines an actionable intent from the sequence of words in the user utterance, the actionable intent identifying the task that the user intends the digital assistant to perform, such as a media search (which subsequently controls the listing of media (thus matching the element content information) on the displayed screen)); and 
determining a position of the element with the matched element content information on the displayed page based on the element identifier of the element with the matched element content information in the determined position (Orr para. [0147], when it is determined that the user utterance is a media-related request to obtain an alternative set of media items similar to the media item with parameter values (element identifier) currently in the position where the cursor is located (a position of the element with the matched element content information)).
Orr does not explicitly teach that the page control information is received from the server, although Orr does teach that page control information is received. Similarly, Orr does not teach that it is specifically the “server” that determines the voice information being used for controlling the displayed page, although Orr does teach that the digital assistant system determines the voice information being used for controlling the displayed page. However, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention for the receiving of page control information to be “from the server” as claimed, and for the server to determine the voice information being used for controlling the displayed page, since Orr teaches in para. [0052] that the modules and functions of the digital assistant system as a whole can have various configurations and arrangements between the client side and server side, and where the set of functions of the digital assistant system is a closed set of functions 
Further, Orr does not explicitly teach simultaneously.
Still further, Orr does not explicitly teach “the page control information comprising a click operation for the element with matched element content information in the displayed page and an element identifier of the element with the matched element content information and position information of the element with the matched element content information;” or “executing the click operation and the position information of the element with the matched element content information.”
Orr still further does not explicitly teach that the element information also comprises “and at least one of: an application name of an application corresponding to the displayed page or a version number of the application corresponding to the displayed page.”
Finally, Orr does not explicitly teach that the page control information is generated based on at least the element information.
Jing teaches simultaneously (Jing paras. 252 and 271, in step S10, the voice instruction sent by the voice input device is received and the parameter information of all controllable control objects on the current display page of the smart terminal are collected).
Jing teaches the page control information comprising a click operation for the element with matched element content information in the displayed page (Jing paras. 493, 372, 502, 745, the controllable control object is obtained to implement control operations according to matched text information from the user voice input, where the control operations include how the control object is displayed such as expanding a drop down list and displaying the content of the drop down list, where para 745 gives an example of the controllable control object being a link on the webpage and the operation that is executed being a jumping to the webpage corresponding to the link (click operation), as well as an example of an OK button’s operation that is executed being the corresponding operation of the OK button (thus triggering a button click)) an element identifier of the element with the matched element content information with the matched element content information (Jing para. 493, control identifier of the controllable control object corresponding to the matched text information) on the displayed page (Jing para. 463, the controllable control object is an object on the current display page of the smart terminal).
Jing further teaches and executing the click operation (Jing para. 745, an example of the controllable control object being a link on the webpage and the operation that is executed being a jumping to the webpage corresponding to the link (click operation), as well as an example of an OK button’s operation that is executed being the corresponding operation of the OK button (thus triggering a button click)) with the matched element content information (Jing para. 493, the controllable control object corresponding to the matched text information).
Cai teaches the page control information comprising position information of the element (Cai fig. 1, paras. [0063], [0101], position information of a DOM node is extracted from the DOM structure).
Cai further teaches and the position information of the element (Cai paras. [0103]-[0107], a DOM structure can be traversed to obtain the position information of particular information to be extracted in the DOM).
Cross teaches and at least one of: an application name of an application corresponding to the displayed page or a version number of the application corresponding to the displayed page (where the claim requires only “at least one” of the alternatives, Cross teaches in paras. [0145], [0149]-[0155] and [0164]-[0167] that the matched search result appends additional web content into the DOM that is rendered as a result, where such additional web content includes a VoiceXML action grammar, which can be an event expression comprising the information shown in paras. [0149] and [0164]; the example in para. [0164] includes or references code that is executed when a VoiceXML event is triggered, the executed code is shown in para. [0166], and para. [0167] describes that code as specifying that an event can contain a value of click, map or google, where if the value is “google” then a new window is opened in the browser and a search is performed using Google™ (an application); accordingly, the references in the code snippets (including the element information) showing “google” are an application name of an application (Google) corresponding to the link on the displayed page) and based on at least the element information (Cross paras. [0164]-[0167], the multimodal browser performs an action (page control information) in dependence on the action identifier that is part of the VoiceXML element shown in paras. [0164]-[0165]).

Further, taking the teachings of Orr and Cai together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the digital assistant and operations thereof as specifically cited to above in Orr with the position information within the DOM as disclosed in the specific passages cited to above in Cai at least because doing so would allow for accurately locating information and obtaining an accurate extraction result with good robustness even after content of a web page is updated and the structure of the web page is changed (Cai para. [0008]).
Still further, taking the teachings of Orr and Cross together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the digital assistant and operations thereof as specifically cited to above in Orr with the additional web content information as disclosed in the specific passages cited to above in Cross at least because doing so would allow for speech-enabled searching of content regardless of whether the content is speech-enabled (see Cross para. [0168]).
Regarding claim 17, Orr teaches a computer readable storage medium storing a computer program, wherein the program, when executed by a processor, cause the processor to perform the method according to claim 1 (Orr paras. [0052], [0054] and [0057], components of the digital assistant system include a memory that includes a computer readable medium with software instructions for execution by one or more processors).
Regarding claim 18, Orr teaches a computer readable storage medium storing a computer program, wherein the program, when executed by a processor, cause the processor to perform the method according to claim 8 (Orr paras. [0052], [0054] and [0057], components of the digital assistant system include a memory that includes a computer readable medium with software instructions for execution by one or more processors).
Claims 4 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Orr in view of Jing in view of Cai in view of Cross, as set forth above regarding claim 3 from which claim 4 depends, and as set forth above regarding claim 11 from which claim 12 depends, further in view of Li et al., (US 2008/0104056 A1, herein “Li”).
Regarding claims 4 and 12, Orr teaches the claimed wherein the calculating a similarity between the voice recognition result and the element content information of the element, and determining successfully matching of the voice recognition result with the element based on the calculated similarity further comprises, the claimed element content information of the element, and voice recognition result as set forth above in the rejection rationales for claims 1 and 11. Orr does not teach the remainder of the limitations of claims 4 and 12.
Li teaches calculating a second edit distance between a pronunciation corresponding to the result and a pronunciation corresponding to the information of the element (Li paras. [0062], [0075] and [0080], as a later step in a process that generates a probability of a candidate word sequence given an input query, the edit distance between a phonetic description of the input word (pronunciation corresponding to the result) and a candidate term (pronunciation corresponding to the information of the element) is determined after a step 704 where a first edit distance is determined) in response to determining the first edit distance being greater than the first threshold (Li para. [0075], in an earlier step 704, another (first) edit distance is calculated, where the edit distance is compared to a threshold, and if it exceeds the threshold then a feature is set to 1, and the process continues on eventually to step 716 where the phonetic edit distance is calculated (thus in response to the first edit distance determination)); 
determining whether the second edit distance is greater than a preset second threshold (Li para. [0080], the phonetic edit distance is compared to a threshold, and if it exceeds the threshold, then the phonetic feature is set to 1); 
determining the successfully matching the voice recognition result with the element in response to determining the second edit distance being not greater than the second threshold (Li paras. [0080], [0084], [0062]-[0063] and [0071], when the phonetic edit distance is not greater than a threshold, the phonetic feature is set to a zero value, where the phonetic feature contributes towards a probability result (high probability indicating a success) for matching between a candidate and an input word, and where, under equation 13 in para. [0062], a zero value for the phonetic feature leads to a higher probabilistic result); and 
determining unsuccessfully matching the voice recognition result with the element in response to determining the second edit distance being greater than the second threshold (Li paras. [0080], [0084], [0062]-[0063] and [0071], when the phonetic edit distance is greater than a threshold, the phonetic feature is set to a one value, where the phonetic feature contributes towards a probability result (low probability indicating unsuccessfulness) for matching between a candidate and an input word, and where, under equation 13 in para. [0062], a one value for the phonetic feature leads to a lower probabilistic result).
Therefore, considering Orr and Li together as a whole, it would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have modified the media query analysis and result display disclosed in Orr with the edit distances and consideration thereof for probability of a candidate match as disclosed in Li at least because doing so would improve candidate query corrections for an input query (Li para. [0014]).
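For illustration only, the two-threshold comparison recited in claims 4 and 12 can be sketched as below. This is a minimal Python sketch under stated assumptions and is not code from Orr, Li, or the application. The function names, the threshold values, and the toy pronunciation mapping `toy_pronunciation` are hypothetical, and treating a first edit distance at or below the first threshold as an immediate match is an assumption about the parent claims. The sketch also returns a hard yes/no decision, whereas Li as cited above uses the thresholded distances as binary features contributing to a probability.

```python
from typing import Callable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def matches(result: str,
            content: str,
            to_pronunciation: Callable[[str], str],
            first_threshold: int,
            second_threshold: int) -> bool:
    """Return True when the voice recognition result matches the element content."""
    # Assumption: a small first (text) edit distance already counts as a match.
    if edit_distance(result, content) <= first_threshold:
        return True
    # First distance exceeded the first threshold: compare pronunciations.
    second = edit_distance(to_pronunciation(result), to_pronunciation(content))
    # Match only when the second distance is not greater than the second threshold.
    return second <= second_threshold


def toy_pronunciation(text: str) -> str:
    """Hypothetical stand-in for a real phonetic transcription."""
    return text.lower().replace("ph", "f").replace("c", "k")
```

In this sketch, a recognition result such as "fone kase" fails the text comparison against the element content "phone case" when the first threshold is 1, but matches at the pronunciation stage because both strings map to the same toy phonetic form.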

Conclusion
Applicant's amendment necessitated any new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHELLE M KOETH whose telephone number is (571)272-5908.  The examiner can normally be reached on Monday-Friday, 9:30a-7p.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta can be reached on 571-272-7453.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


MICHELLE M. KOETH
Primary Examiner
Art Unit 2656