DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority
Acknowledgment is made of applicant's claim for foreign priority based on an application filed in Japan on 02 July 2019. It is noted, however, that applicant has not filed a certified copy of the JP 2019-123748 application as required by 37 CFR 1.55.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-21 are rejected under 35 U.S.C. 103 as being unpatentable over Sato et al. (U.S. Patent Application Publication 2010/0182501) in view of Ooshima (U.S. Patent Application Publication 2016/0292898).
Regarding claim 1, Sato et al. discloses an image processing apparatus comprising: a selection unit configured to select, from a moving image including a plurality of frames, a part of the moving image (Figs. 1, 2, 6, 8, 20, and 23; paragraphs [0203]-[0218] – overall flow of information processing method – obtain moving picture data (step S101) – the moving picture data acquisition unit 101 transfers the obtained moving picture data to the moving picture analysis unit 103 and the audio extraction unit 105 – the audio extraction unit 105 extracts audio data from the moving picture data transferred from the moving picture data acquisition unit 101 (step S103), and transfers the obtained audio data to the comic display data generation unit 111 – the moving picture analysis unit 103 analyzes the moving picture data transferred from the moving picture data acquisition unit 101, and generates moving picture metadata, i.e., metadata relating to feature quantities characterizing a moving picture corresponding to the transferred moving picture data (step S105) – the digest score calculation unit 151 of the comic display conversion unit 107 calculates digest scores of all of the images (frame images) constituting the moving picture based on the transferred moving picture metadata – the thumbnail number determination unit 153 selects representing frame images used as thumbnail images by using the transferred digest scores and the moving picture metadata (step S107) – generate comic display data (step S119)); an extraction unit configured to extract a voice during a predetermined time corresponding to the selected part in the moving image (Figs. 1, 2, 6, 8, 20, and 23; paragraph [0119] – the audio analysis unit 139 performs a classification processing on the audio data so as to determine whether a sound is a speech, a laughter, a cheering such as “wow”, a clapping sound such as “bang”, an applause such as clapping sound, and music – this classification processing on the audio data can be executed by referencing, for example, an audio analysis database and the like previously stored in the storage unit 117 and the like and executing an audio analysis program and the like; paragraphs [0203]-[0218] – overall flow of information processing method – the moving picture data acquisition unit 101 transfers the obtained moving picture data to the moving picture analysis unit 103 and the audio extraction unit 105 – the audio extraction unit 105 extracts audio data from the moving picture data transferred from the moving picture data acquisition unit 101 (step S103), and transfers the obtained audio data to the comic display data generation unit 111 – the digest score calculation unit 151 of the comic display conversion unit 107 calculates digest scores of all of the images (frame images) constituting the moving picture based on the transferred moving picture metadata – the thumbnail number determination unit 153 selects representing frame images used as thumbnail images by using the transferred digest scores and the moving picture metadata (step S107) – the comic display data generation unit 111 generates audio data used for displaying the comic based on the audio data transferred from the audio extraction unit 105 and the frame information transferred from the comic display conversion unit 107 – further, ; and a combination unit configured to combine a character string based on a voice extracted by the extraction unit, with the part of the moving image selected by the selection unit or a frame among frames corresponding to the part (Figs. 1, 2, 6, 8, 20, and 23; paragraph [0191] – Fig. 20 shows a relationship between audio metadata and speech balloons; paragraphs [0203]-[0218] – overall flow of information processing method – the moving picture data acquisition unit 101 transfers the obtained moving picture data to the moving picture analysis unit 103 and the audio extraction unit 105 – the audio extraction unit 105 extracts audio data from the moving picture data transferred from the moving picture data acquisition unit 101 (step S103), and transfers the obtained audio data to the comic display data generation unit 111 – the digest score calculation unit 151 of the comic display conversion unit 107 calculates digest scores of all of the images (frame images) constituting the moving picture based on the transferred moving picture metadata – the thumbnail number determination unit 153 selects representing frame images used as thumbnail images by using the transferred digest scores and the moving picture metadata (step S107) – the comic display data generation unit 111 generates audio data used for displaying the comic based on the audio data transferred from the audio extraction unit 105 and the frame information . However, Sato et al. fails to disclose wherein based on that a voice extracted by the extraction unit satisfies a predetermined condition corresponding to a mixed voice, the combination unit combines a character string prepared in advance with the selected part or the frame.
Referring to the Ooshima reference, Ooshima discloses an image processing apparatus comprising: a combination unit configured to combine a character string based on a voice extracted by the extraction unit, with the part of the moving image selected by the selection unit or a frame among frames corresponding to the part, wherein based on that a voice extracted by the extraction unit satisfies a predetermined condition corresponding to a mixed voice, the combination unit combines a character string prepared in advance with the selected part or the frame (paragraph [0016] – it is necessary to select a voice of a person who is present in a frame image to be combined with a character string corresponding to the voice from voices extracted from the moving image – however, in a case where plural persons are present in the moving image, a considerable effort is necessary for determining which person a voice belongs to, or for selecting a desired voice from plural voices of the person; paragraph [0018] .
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have extracted a voice that was part of a mixed voice as disclosed by Ooshima in the apparatus disclosed by Sato et al. in order to properly create a comic-like image with the correct dialogue when multiple people are speaking in the video and/or there is background noise.
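For illustration, the conditional recited in claims 1 and 2 can be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation taken from either reference: the names `VoiceSegment`, `is_mixed_voice`, `caption_for`, and `PREPARED_STRING` are hypothetical, and the "predetermined condition corresponding to a mixed voice" is assumed here to be a simple count of overlapping speakers.

```python
from dataclasses import dataclass

# Hypothetical placeholder text; neither reference specifies this string.
PREPARED_STRING = "(several people talking)"


@dataclass
class VoiceSegment:
    """Voice extracted for the predetermined time around the selected part."""
    transcript: str     # text from a speech recognizer (assumed to be given)
    speaker_count: int  # number of overlapping speakers (assumed to be given)


def is_mixed_voice(segment: VoiceSegment, max_speakers: int = 1) -> bool:
    """Assumed form of the predetermined condition: too many speakers overlap."""
    return segment.speaker_count > max_speakers


def caption_for(segment: VoiceSegment) -> str:
    """Choose the character string to combine with the selected part or frame."""
    if is_mixed_voice(segment):
        # Condition satisfied: use the character string prepared in advance.
        return PREPARED_STRING
    # Condition not satisfied: use a string indicating the voice contents.
    return segment.transcript
```

Under these assumptions, a single-speaker segment yields its own transcript, while a segment with several overlapping speakers yields the prepared string.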
Regarding claim 2, Sato et al. in view of Ooshima discloses all of the limitations as previously discussed with respect to claim 1 including that the combination unit combines, based on that a voice extracted by the extraction unit does not satisfy the predetermined condition, a character string indicating contents of the voice with the selected part or the frame (Sato et al.: Figs. 1, 2, 6, 8, 20, and 23; paragraph [0119] – the audio analysis unit 139 performs a classification processing on the audio data so as to determine whether a sound is a speech, a laughter, a cheering such as “wow”, a clapping sound such as “bang”, an applause such as clapping sound, and music – this classification processing on the audio data can be executed by referencing, for example, an audio analysis database and the like previously stored in the storage unit 117 and the like and executing an audio analysis program and the like; paragraphs [0203]-[0218] – overall flow of information processing method – the moving picture data acquisition unit 101 transfers the obtained moving picture data to the moving picture analysis unit 103 and the audio extraction unit 105 – the audio extraction unit 105 extracts audio data from the moving picture data transferred from the moving picture data acquisition unit 101 (step S103), and transfers the obtained audio data to the comic display data generation unit 111 – the digest score calculation unit 151 of the comic display conversion unit 107 calculates digest scores of all of the images (frame images) constituting the moving picture based on the transferred moving picture metadata – the thumbnail number determination unit 153 selects representing frame images used as thumbnail images by using the transferred digest scores and the moving .
Regarding claim 3, Sato et al. in view of Ooshima discloses all of the limitations as previously discussed with respect to claim 1 including that the image processing apparatus further comprises: a conversion unit configured to convert a voice extracted by the extraction unit into a character string, wherein in a case where the voice extracted by the extraction unit does not satisfy the predetermined condition, the combination unit combines a character string acquired by conversion by the conversion unit with the selected part of the frame (Sato et al.: Figs. 1, 2, 6, 8, 20, and 23; paragraph [0119] – the audio analysis unit 139 performs a classification processing on the audio data so as to determine whether a sound is a speech, a laughter, a cheering such as “wow”, a clapping sound such as “bang”, an applause such as clapping sound, and music – this classification processing on the audio data can be executed by referencing, for example, an audio analysis database and the like previously stored in the storage unit 117 and the like and executing an audio analysis program and the like; paragraphs [0203]-[0218] – overall flow of information processing method – the moving . 
Regarding claim 4, Sato et al. in view of Ooshima discloses all of the limitations as previously discussed with respect to claims 1 and 3 including that the image processing apparatus further comprises: a determination unit configured to determine a theme of the image by analyzing the selected image, wherein the combination unit combines a character string which is based on the theme determined by the determination unit, in a case where the voice extracted by the extraction unit satisfies the predetermined condition (Sato et al.: Figs. 1, 2, 6, 8, 20, and 23; paragraph [0119] – the audio analysis unit 139 performs a classification processing on the audio data so as to determine whether a sound is a speech, a laughter, a cheering such as “wow”, a clapping sound such as “bang”, an applause such as clapping sound, and music – this classification processing on the audio data can be executed by referencing, for example, an audio analysis database and the like previously stored in the storage unit 117 and the like and executing an audio analysis program and the like – the theme is based on the metadata and mood; paragraphs [0203]-[0218] – overall flow of information processing method – the moving picture data acquisition unit 101 transfers the obtained moving picture data to the moving picture analysis unit 103 and the audio extraction unit 105 – the audio extraction unit 105 extracts audio data from the moving picture data transferred from the moving picture data acquisition unit 101 (step S103), and transfers the obtained audio data to the comic display data generation unit 111 – the digest score calculation unit 151 of the comic display conversion unit 107 calculates digest scores of all of the images (frame images) constituting the moving picture based on the transferred moving picture metadata – the thumbnail number determination unit 153 selects representing frame images used as thumbnail images by using the transferred digest scores and the moving picture metadata (step S107) – the comic display data generation unit 111 generates audio data used for displaying the comic based on the audio data transferred from the audio extraction unit 105 and the frame information transferred from the comic display .
Regarding claim 5, Sato et al. in view of Ooshima discloses all of the limitations as previously discussed with respect to claims 1, 3, and 4 including that the image processing apparatus further comprises: a table in which each value of a theme of an image and a character string are associated with each other; and a search unit configured to search for a character string corresponding to the theme determined by the determination unit in the table, wherein the combination unit uses a character string searched by the search unit (Sato et al.: Figs. 1, 2, 6, 8, 20, and 23; paragraph [0119] – the audio analysis unit 139 performs a classification processing on the audio data so as to determine whether a sound is a speech, a laughter, a cheering such as “wow”, a clapping sound such as “bang”, an applause such as clapping sound, and music – this classification processing on the audio data can be executed by referencing, for example, an audio analysis database and the like previously stored in the storage unit 117 and the like and executing an audio analysis program and the like – the theme is based on the metadata and mood; paragraphs [0203]-[0218] – overall flow of information processing method – the moving picture data acquisition unit 101 transfers the obtained moving picture data to the moving picture analysis unit 103 and the audio extraction unit 105 – the audio extraction unit 105 extracts audio data from the moving picture data . 
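For illustration, the table and search unit recited in claim 5 can be sketched as a keyed lookup. This is a minimal sketch under stated assumptions: the theme values, the associated strings, and the names `THEME_TABLE` and `search_string` are hypothetical and are not drawn from Sato et al. or Ooshima.

```python
# Hypothetical table associating each theme value with a character string.
THEME_TABLE = {
    "birthday": "Happy birthday!",
    "travel": "What a view!",
    "sports": "Go, go, go!",
}


def search_string(theme: str, table: dict = THEME_TABLE, default: str = "") -> str:
    """Search the table for the character string associated with a theme.

    Returns `default` when the theme has no entry (an assumed fallback).
    """
    return table.get(theme, default)
```

Under these assumptions, the string returned by `search_string` is the one the combination unit would use when the mixed-voice condition is satisfied.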
Regarding claim 6, Sato et al. in view of Ooshima discloses all of the limitations as previously discussed with respect to claim 1 including that the selection unit selects the frame, and selects the part of the moving image based on the selected frame, and the extraction unit extracts the voice based on the selected part, and the combination unit combines the character string with the selected frame (Sato et al.: Figs. 1, 2, 6, 8, 20, and 23; paragraph [0119] – the audio analysis unit 139 performs a classification processing on the audio data so as to determine whether a sound is a speech, a laughter, a cheering such as “wow”, a clapping sound such as “bang”, an .
Regarding claim 7, Sato et al. in view of Ooshima discloses all of the limitations as previously discussed with respect to claims 1 and 6 including that the selection unit selects a part of the moving image corresponding to the predetermined time before or after the selected frame (Sato et al.: Figs. 1, 2, 6, 8, 20, and 23; paragraph [0119] – the audio analysis unit 139 performs a classification processing on the audio data so as to determine whether a sound is a speech, a laughter, a cheering such as “wow”, a clapping sound such as “bang”, an applause such as clapping sound, and music – this classification processing on the audio data can be executed by referencing, for example, an audio analysis database and the like previously stored in the storage unit 117 and the like and executing an audio analysis program and the like – the theme is based on the metadata and mood; paragraphs [0203]-[0218] – overall flow of information processing method – the moving picture data acquisition unit 101 transfers the obtained moving picture data to the moving picture analysis unit 103 and the audio extraction unit 105 – the audio extraction unit 105 extracts audio data from the moving picture data transferred from the moving picture data acquisition unit 101 (step S103), and transfers the obtained audio data to the comic display data generation unit 111 – the digest score calculation unit 151 of the comic display conversion unit 107 calculates digest scores of all of the images (frame images) constituting the moving picture based on the transferred moving picture metadata – the thumbnail number determination unit 153 selects representing frame images used as thumbnail images by using the transferred digest scores and the moving picture metadata (step S107) – the comic display .
Regarding claim 8, Sato et al. discloses an image processing method comprising: a selection step of selecting, from a moving image including a plurality of frames, a part of the moving image (Figs. 1, 2, 6, 8, 20, and 23; paragraphs [0203]-[0218] – overall flow of information processing method – obtain moving picture data (step S101) – the moving picture data acquisition unit 101 transfers the obtained moving picture data to the moving picture analysis unit 103 and the audio extraction unit 105 – the audio extraction unit 105 extracts audio data from the moving picture data transferred from the moving picture data acquisition unit 101 (step S103), and transfers the obtained audio data to the comic display data generation unit 111 – the moving picture analysis unit 103 analyzes the moving picture data transferred from the moving picture data acquisition unit 101, and generates moving picture metadata, i.e., metadata relating to feature quantities characterizing a moving picture corresponding to the transferred moving picture data (step S105) – the digest score calculation unit 151 of the comic display conversion unit 107 calculates digest scores of all of the images (frame images) constituting the moving picture based on the transferred ; an extraction step of extracting a voice during a predetermined time corresponding to the selected part in the moving image (Figs. 1, 2, 6, 8, 20, and 23; paragraph [0119] – the audio analysis unit 139 performs a classification processing on the audio data so as to determine whether a sound is a speech, a laughter, a cheering such as “wow”, a clapping sound such as “bang”, an applause such as clapping sound, and music – this classification processing on the audio data can be executed by referencing, for example, an audio analysis database and the like previously stored in the storage unit 117 and the like and executing an audio analysis program and the like; paragraphs [0203]-[0218] – overall flow of information processing method – the moving picture data acquisition unit 101 transfers the obtained moving picture data to the moving picture analysis unit 103 and the audio extraction unit 105 – the audio extraction unit 105 extracts audio data from the moving picture data transferred from the moving picture data acquisition unit 101 (step S103), and transfers the obtained audio data to the comic display data generation unit 111 – the digest score calculation unit 151 of the comic display conversion unit 107 calculates digest scores of all of the images (frame images) constituting the moving picture based on the transferred moving picture metadata – the thumbnail number determination unit 153 selects representing frame images used as thumbnail images by using the transferred digest scores and the moving ; and a combination step of combining a character string based on a voice extracted at the extraction step, with the part of the moving image selected at the selection step or a frame among frames corresponding to the part (Figs. 1, 2, 6, 8, 20, and 23; paragraph [0191] – Fig. 20 shows a relationship between audio metadata and speech balloons; paragraphs [0203]-[0218] – overall flow of information processing method – the moving picture data acquisition unit 101 transfers the obtained moving picture data to the moving picture analysis unit 103 and the audio extraction unit 105 – the audio extraction unit 105 extracts audio data from the moving picture data transferred from the moving picture data acquisition unit 101 (step S103), and transfers the obtained audio data to the comic display data generation unit 111 – the digest score calculation unit 151 of the comic display conversion unit 107 calculates digest scores of all of the images (frame images) constituting the moving picture based on the transferred moving picture metadata – the thumbnail number determination unit 153 selects representing frame images used as thumbnail . However, Sato et al. fails to disclose wherein based on that a voice extracted at the extraction step satisfies a predetermined condition corresponding to a mixed voice, at the combination step, a character string prepared in advance is combined with the selected part or the frame.
Referring to the Ooshima reference, Ooshima discloses an image processing method comprising: a combination step of combining a character string based on a voice extracted at the extraction step, with the part of the moving image selected at the selection step or a frame among frames corresponding to the part, wherein based on that a voice extracted at the extraction step satisfies a predetermined condition corresponding to a mixed voice, at the combination step, a character string prepared in advance is combined with the selected part or the frame (paragraph [0016] – it is necessary to select a voice of a person who is present in a frame image to be combined with a character string corresponding to the voice from voices .
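Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have extracted a voice that was part of a mixed voice as disclosed by Ooshima in the method disclosed by Sato et al. in order to properly create a comic-like image with the correct dialogue when multiple people are speaking in the video and/or there is background noise.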
Regarding claim 9, Sato et al. in view of Ooshima discloses all of the limitations as previously discussed with respect to claim 8 including that, at the combination step, based on that a voice extracted at the extraction step does not satisfy the predetermined condition, a character string indicating contents of the voice is combined with the selected part or the frame (Sato et al.: Figs. 1, 2, 6, 8, 20, and 23; paragraph [0119] – the audio analysis unit 139 performs a classification processing on the audio data so as to determine whether a sound is a speech, a laughter, a cheering such as “wow”, a clapping sound such as “bang”, an applause such as clapping sound, and music – this classification processing on the audio data can be executed by referencing, for example, an audio analysis database and the like previously stored in the storage unit 117 and the like and executing an audio analysis program and the like; paragraphs [0203]-[0218] – overall flow of information processing method – the moving picture data acquisition unit 101 transfers the obtained moving picture data to the moving picture analysis unit 103 and the audio extraction unit 105 – the audio extraction unit 105 extracts audio data from the moving picture data transferred from the moving picture data acquisition unit 101 (step S103), and transfers the obtained audio data to the comic display data generation unit 111 – the digest score calculation unit 151 of the comic display conversion unit 107 calculates digest scores of all of the images (frame images) constituting the moving picture based on the transferred moving picture metadata – the .
Regarding claim 10, Sato et al. in view of Ooshima discloses all of the limitations as previously discussed with respect to claim 8 including that the image processing method further comprises: a conversion step of converting a voice extracted at the extraction step into a character string, wherein in a case where the voice extracted at the extraction step does not satisfy the predetermined condition, at the combination step, a character string acquired by conversion at the conversion step is combined with the selected part of the frame (Sato et al.: Figs. 1, 2, 6, 8, 20, and 23; paragraph [0119] – the audio analysis unit 139 performs a classification processing on the audio data so as to determine whether a sound is a speech, a laughter, a cheering such as “wow”, a clapping sound such as “bang”, an applause such as clapping sound, and music – this classification processing on the audio data can be executed by referencing, for example, an audio analysis database and the like previously stored in the storage unit 117 and the like and executing an audio . 
Regarding claim 11, Sato et al. in view of Ooshima discloses all of the limitations as previously discussed with respect to claims 8 and 10 including that the image processing method further comprises: a determination step of determining a theme of the image by analyzing the selected image, wherein, at the combination step, a character string which is based on the theme determined at the determination step is combined, in a case where the voice extracted at the extraction step satisfies the predetermined condition (Sato et al.: Figs. 1, 2, 6, 8, 20, and 23; paragraph [0119] – the audio analysis unit 139 performs a classification processing on the audio data so as to determine whether a sound is a speech, a laughter, a cheering such as “wow”, a clapping sound such as “bang”, an applause such as clapping sound, and music – this classification processing on the audio data can be executed by referencing, for example, an audio analysis database and the like previously stored in the storage unit 117 and the like and executing an audio analysis program and the like – the theme is based on the metadata and mood; paragraphs [0203]-[0218] – overall flow of information processing method – the moving picture data acquisition unit 101 transfers the obtained moving picture data to the moving picture analysis unit 103 and the audio extraction unit 105 – the audio extraction unit 105 extracts audio data from the moving picture data transferred from the moving picture data acquisition unit 101 (step S103), and transfers the obtained audio data to the comic display data generation unit 111 – the digest score calculation unit 151 of the comic display conversion unit 107 calculates digest scores of all of the images (frame images) constituting the moving picture based on the transferred moving picture metadata – the thumbnail number determination unit 153 selects representing frame images used as thumbnail images by using the transferred digest scores and the moving picture metadata (step S107) – the comic display data generation unit 111 generates audio data .
Regarding claim 12, Sato et al. in view of Ooshima discloses all of the limitations as previously discussed with respect to claims 8, 10, and 11 including that the image processing method further comprises: a search step of searching for a character string corresponding to the theme determined at the determination step in a table in which each value of a theme of an image and a character string are associated with each other, wherein, at the combination step, a character string searched at the search step is used (Sato et al.: Figs. 1, 2, 6, 8, 20, and 23; paragraph [0119] – the audio analysis unit 139 performs a classification processing on the audio data so as to determine whether a sound is a speech, a laughter, a cheering such as “wow”, a clapping sound such as “bang”, an applause such as clapping sound, and music – this classification processing on the audio data can be executed by referencing, for example, an audio analysis database and the like previously stored in the storage unit 117 and the like and executing an audio analysis program and the like – the theme is based on the metadata and mood; paragraphs [0203]-[0218] – overall flow of information processing method – the moving picture data acquisition unit 101 transfers the obtained moving picture data to the moving . 
Regarding claim 13, Sato et al. in view of Ooshima discloses all of the limitations as previously discussed with respect to claim 8 including that at the selection step, the frame is selected, and the part of the moving image based on the selected frame is selected, and at the extraction step, the voice based on the selected part is extracted and at the combination step, the character string is combined with the selected frame (Sato et al.: Figs. 1, 2, 6, 8, 20, and 23; paragraph [0119] – the audio analysis unit 139 performs a .
Regarding claim 14, Sato et al. in view of Ooshima discloses all of the limitations as previously discussed with respect to claims 8 and 13 including that at the selection step, a part of the moving image corresponding to the predetermined time before or after the selected frame is selected (Sato et al.: Figs. 1, 2, 6, 8, 20, and 23; paragraph [0119] – the audio analysis unit 139 performs a classification processing on the audio data so as to determine whether a sound is a speech, a laughter, a cheering such as “wow”, a clapping sound such as “bang”, an applause such as clapping sound, and music – this classification processing on the audio data can be executed by referencing, for example, an audio analysis database and the like previously stored in the storage unit 117 and the like and executing an audio analysis program and the like – the theme is based on the metadata and mood; paragraphs [0203]-[0218] – overall flow of information processing method – the moving picture data acquisition unit 101 transfers the obtained moving picture data to the moving picture analysis unit 103 and the audio extraction unit 105 – the audio extraction unit 105 extracts audio data from the moving picture data transferred from the moving picture data acquisition unit 101 (step S103), and transfers the obtained audio data to the comic display data generation unit 111 – the digest score calculation unit 151 of the comic display conversion unit 107 calculates digest scores of all of the images (frame images) constituting the moving picture based on the transferred moving picture metadata – the thumbnail number . 
Regarding claim 15, Sato et al. discloses a non-transitory computer readable storage medium storing a program for causing a computer to perform an image processing method comprising: a selection step of selecting, from a moving image including a plurality of frames, a part of the moving image (Figs. 1, 2, 6, 8, 20, and 23; paragraphs [0203]-[0218] – overall flow of information processing method – obtain moving picture data (step S101) – the moving picture data acquisition unit 101 transfers the obtained moving picture data to the moving picture analysis unit 103 and the audio extraction unit 105 – the audio extraction unit 105 extracts audio data from the moving picture data transferred from the moving picture data acquisition unit 101 (step S103), and transfers the obtained audio data to the comic display data generation unit 111 – the moving picture analysis unit 103 analyzes the moving picture data transferred from the moving picture data acquisition unit 101, and generates moving picture metadata, i.e., metadata relating to feature quantities characterizing a moving picture corresponding to the transferred moving picture data (step S105)); an extraction step of extracting a voice during a predetermined time corresponding to the selected part in the moving image (Figs. 1, 2, 6, 8, 20, and 23; paragraph [0119] – the audio analysis unit 139 performs a classification processing on the audio data so as to determine whether a sound is a speech, a laughter, a cheering such as “wow”, a clapping sound such as “bang”, an applause such as clapping sound, and music – this classification processing on the audio data can be executed by referencing, for example, an audio analysis database and the like previously stored in the storage unit 117 and the like and executing an audio analysis program and the like; paragraphs [0203]-[0218] – overall flow of information processing method – the moving picture data acquisition unit 101 transfers the obtained moving picture data to the moving picture analysis unit 103 and the audio extraction unit 105 – the audio extraction unit 105 extracts audio data from the moving picture data transferred from the moving picture data acquisition unit 101 (step S103), and transfers the obtained audio data to the comic display data generation unit 111 – the digest score calculation unit 151 of the comic display conversion unit 107 calculates digest scores of all of the images (frame images) constituting the moving picture based on the transferred moving picture metadata); and a combination step of combining a character string based on a voice extracted at the extraction step, with the part of the moving image selected at the selection step or a frame among frames corresponding to the part (Figs. 1, 2, 6, 8, 20, and 23; paragraph [0191] – Fig. 20 shows a relationship between audio metadata and speech balloons; paragraphs [0203]-[0218] – overall flow of information processing method – the moving picture data acquisition unit 101 transfers the obtained moving picture data to the moving picture analysis unit 103 and the audio extraction unit 105 – the audio extraction unit 105 extracts audio data from the moving picture data transferred from the moving picture data acquisition unit 101 (step S103), and transfers the obtained audio data to the comic display data generation unit 111 – the digest score calculation unit 151 of the comic display conversion unit 107 calculates digest scores of all of the images (frame images) constituting the moving picture based on the transferred moving picture metadata).
However, Sato et al. fails to disclose wherein based on that a voice extracted at the extraction step satisfies a predetermined condition corresponding to a mixed voice, at the combination step, a character string prepared in advance is combined with the selected part or the frame.
Referring to the Ooshima reference, Ooshima discloses an image processing method comprising: a combination step of combining a character string based on a voice extracted at the extraction step, with the part of the moving image selected at the selection step or a frame among frames corresponding to the part, wherein based on that a voice extracted at the extraction step satisfies a predetermined condition corresponding to a mixed voice, at the combination step, a character string prepared in advance is combined with the selected part or the frame (paragraph [0016] – it is necessary to select a voice of a person who is present in a frame image to be combined with a character string corresponding to the voice from voices extracted from the moving image – however, in a case where plural persons are present in the moving image, a considerable effort is necessary for determining which person a voice belongs to, or for selecting a desired voice from plural voices of the person; paragraph [0018] – a voice extraction section that extracts a voice from the moving image, a voice recognition section that converts the voice into character string data by voice recognition, an association section that generates information of association between the central person and a voice of the central person; paragraph [0030] – it is preferable that the association section determines the gender and age of the central person from a person region of a frame image where the central person is present, determines, from the pitch of the voice of the central person, the gender and age of a person corresponding to the voice, and generates the association information so that the gender and age of the central person match the gender and age of the person corresponding to the voice; paragraph [0070] – further, in a case where a voice is included in a moving image, a frame image may be extracted from the moving image before or after a time point (time code) when the volume of the voice is larger than a certain standard or the voice becomes louder than other scene; paragraph [0083] – detects periods with no character string data, pieces of character string data before and after the time period as different pieces of character string data and stores the result in the storage section 42).

Regarding claim 16, Sato et al. in view of Ooshima discloses all of the limitations as previously discussed with respect to claim 15 including that in the image processing method, at the combination step, based on that a voice extracted at the extraction step does not satisfy the predetermined condition, a character string indicating contents of the voice is combined with the selected part or the frame (Sato et al.: Figs. 1, 2, 6, 8, 20, and 23; paragraph [0119] – the audio analysis unit 139 performs a classification processing on the audio data so as to determine whether a sound is a speech, a laughter, a cheering such as “wow”, a clapping sound such as “bang”, an applause such as clapping sound, and music – this classification processing on the audio data can be executed by referencing, for example, an audio analysis database and the like previously stored in the storage unit 117 and the like and executing an audio analysis program and the like; paragraphs [0203]-[0218] – overall flow of information processing method – the moving picture data acquisition unit 101 transfers the obtained moving picture data to the moving picture analysis unit 103 and the audio extraction unit 105 – the audio extraction unit 105 extracts audio data from the moving picture data transferred from the moving picture data acquisition unit 101 (step S103), and transfers the obtained audio data to the comic display data generation unit 111).
Regarding claim 17, Sato et al. in view of Ooshima discloses all of the limitations as previously discussed with respect to claim 15 including that the image processing method further comprises: a conversion step of converting a voice extracted at the extraction step into a character string, wherein in a case where the voice extracted at the extraction step does not satisfy the predetermined condition, at the combination step, a character string acquired by conversion at the conversion step is combined with the selected part or the frame (Sato et al.: Figs. 1, 2, 6, 8, 20, and 23; paragraph [0119] – the audio analysis unit 139 performs a classification processing on the audio data so as to determine whether a sound is a speech, a laughter, a cheering such as “wow”, a clapping sound such as “bang”, an applause such as clapping sound, and music).
Regarding claim 18, Sato et al. in view of Ooshima discloses all of the limitations as previously discussed with respect to claims 15 and 17 including that the image processing method further comprises: a determination step of determining a theme of the image by analyzing the selected image, wherein, at the combination step, a character string which is based on the theme determined at the determination step is combined, in a case where the voice extracted at the extraction step satisfies the predetermined condition (Sato et al.: Figs. 1, 2, 6, 8, 20, and 23; paragraph [0119] – the audio analysis unit 139 performs a classification processing on the audio data so as to determine whether a sound is a speech, a laughter, a cheering such as “wow”, a clapping sound such as “bang”, an applause such as clapping sound, and music – this classification processing on the audio data can be executed by referencing, for example, an audio analysis database and the like previously stored in the storage unit 117 and the like and executing an audio analysis program and the like – the theme is based on the metadata and mood; paragraphs [0203]-[0218] – overall flow of information processing method – the moving picture data acquisition unit 101 transfers the obtained moving picture data to the moving picture analysis unit 103 and the audio extraction unit 105 – the audio extraction unit 105 extracts audio data from the moving picture data transferred from the moving picture data acquisition unit 101 (step S103), and transfers the obtained audio data to the comic display data generation unit 111 – the digest score calculation unit 151 of the comic display conversion unit 107 calculates digest scores of all of the images (frame images) constituting the moving picture based on the transferred moving picture metadata).
Regarding claim 19, Sato et al. in view of Ooshima discloses all of the limitations as previously discussed with respect to claims 15, 17, and 18 including that the image processing method further comprises: a search step of searching for a character string corresponding to the theme determined at the determination step in a table in which each value of a theme of an image and a character string are associated with each other, wherein, at the combination step, a character string searched at the search step is used (Sato et al.: Figs. 1, 2, 6, 8, 20, and 23; paragraph [0119] – the audio analysis unit 139 performs a classification processing on the audio data so as to determine whether a sound is a speech, a laughter, a cheering such as “wow”, a clapping sound such as “bang”, an applause such as clapping sound, and music – this classification processing on the audio data can be executed by referencing, for example, an audio analysis database and the like previously stored in the storage unit 117 and the like and executing an audio analysis program and the like).
Regarding claim 20, Sato et al. in view of Ooshima discloses all of the limitations as previously discussed with respect to claim 15 including that at the selection step, the frame is selected, and the part of the moving image based on the selected frame is selected, and at the extraction step, the voice based on the selected part is extracted and at the combination step, the character string is combined with the selected frame (Sato et al.: Figs. 1, 2, 6, 8, 20, and 23; paragraph [0119] – the audio analysis unit 139 performs a classification processing on the audio data so as to determine whether a sound is a speech, a laughter, a cheering such as “wow”, a clapping sound such as “bang”, an applause such as clapping sound, and music – this classification processing on the audio data can be executed by referencing, for example, an audio analysis database and the like previously stored in the storage unit 117 and the like and executing an audio analysis program and the like – the theme is based on the metadata and mood; paragraphs [0203]-[0218] – overall flow of information processing method – the moving picture data acquisition unit 101 transfers the obtained moving picture data to the moving picture analysis unit 103 and the audio extraction unit 105 – the audio extraction unit 105 extracts audio data from the moving picture data transferred from the moving picture data acquisition unit 101 (step S103), and transfers the obtained audio data to the comic display data generation unit 111 – the digest score calculation unit 151 of the comic display conversion unit 107 calculates digest scores of all of the images (frame images) constituting the moving picture based on the transferred moving picture metadata – the thumbnail number determination unit 153 selects representing frame images used as thumbnail images by using the transferred digest scores and the moving picture metadata (step S107) – the comic display data generation unit 111 generates audio data used for displaying the comic).
Regarding claim 21, Sato et al. in view of Ooshima discloses all of the limitations as previously discussed with respect to claims 15 and 20 including that at the selection step, a part of the moving image corresponding to the predetermined time before or after the selected frame is selected (Sato et al.: Figs. 1, 2, 6, 8, 20, and 23; paragraph [0119] – the audio analysis unit 139 performs a classification processing on the audio data so as to determine whether a sound is a speech, a laughter, a cheering such as “wow”, a clapping sound such as “bang”, an applause such as clapping sound, and music – this classification processing on the audio data can be executed by referencing, for example, an audio analysis database and the like previously stored in the storage unit 117 and the like and executing an audio analysis program and the like – the theme is based on the metadata and mood; paragraphs [0203]-[0218] – overall flow of information processing method – the moving picture data acquisition unit 101 transfers the obtained moving picture data to the moving picture analysis unit 103 and the audio extraction unit 105 – the audio extraction unit 105 extracts audio data from the moving picture data transferred from the moving picture data acquisition unit 101 (step S103), and transfers the obtained audio data to the comic display data generation unit 111).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Masutani (U.S. Patent Application Publication 2011/0317984).
Sato et al. (U.S. Patent Application Publication 2013/0028571).
Kurata et al. (U.S. Patent Application Publication 2013/0086458).

Any inquiry concerning this communication or earlier communications from the examiner should be directed to HEATHER R JONES whose telephone number is (571)272-7368.  The examiner can normally be reached on Mon. - Fri.: 9:00am - 5:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, William Vaughn, can be reached on (571)272-3922. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/HEATHER R JONES/
Primary Examiner, Art Unit 2481

August 13, 2021