MediaGnosis IEEE ICIP2023 Industry Seminar

MediaGnosis:
The next-generation media processing artificial intelligence
Ryo Masumura, NTT Corporation, Japan

1
Copyright NTT CORPORATION
Self-introduction
■ Ryo Masumura, Ph.D.
■ Biography
• 2011.04: Joined Nippon Telegraph and Telephone Corporation (NTT)
• 2015.04-2016.09: Ph.D. student, Tohoku University
• Now: Distinguished Research Scientist at NTT Human Information Labs.
■ Research topics
• Speech processing (speech recognition, classification, etc.)
• Natural language processing (classification, generation, etc.)
• Computer vision (classification, detection, captioning, etc.)
• Crossmodal processing (joint modeling, etc.)
My goal: Establishing general-purpose media processing AI

2
Overview of my presentation
■ Present “MediaGnosis,” the next-generation media-processing AI being developed at NTT Corporation
■ What is MediaGnosis?
• How does MediaGnosis differ from general AIs?
• What technology is the key enabler of MediaGnosis?
■ What is possible with MediaGnosis?
• App. 1: multi-modal conversation sensing application
• App. 2: multi-modal personality factor measurement application

3
What is MediaGnosis?

4
Overview of MediaGnosis
■ MediaGnosis is a multi-modal foundation model that handles various functions and modalities within a single brain in an integrated manner
■ The name “MediaGnosis” originates from the idea of “treating all sorts of media (records of information) as gnosis (knowledge) in an integrated manner, like humans, and making a diagnosis (judgment) based on that knowledge”
[Diagram: a multi-modal foundation model spanning speech and audio processing, image and video processing, natural language processing, and cross-modal processing]

5
Problems in general media processing AIs
■ General AIs work with an independent brain for each function
■ Knowledge acquired in each general AI function is not mutually utilized
[Diagram: separate models for speech recognition, LLM, first impression recognition, face recognition, speaker recognition, emotion recognition, gender and age estimation, and XXX. Caption: They have an independent brain for each of speech recognition, face recognition, and emotion recognition]

6
Problems in general media processing AIs
■ Difficult to combine multiple functions in a complex manner
■ While the market is demanding new services that use multiple modalities and multiple AI functions, this difficulty is a major bottleneck in service development
[Diagram: a service with multiple AI functions has to combine speech recognition provided by corp. A, face recognition provided by corp. B, emotion recognition provided by corp. C, and attribute estimation provided by corp. D, which arrive through differing interfaces (REST API using cloud, WebSocket API using cloud, standalone Python module, WebAssembly). Caption: Difficult to combine multiple technologies individually built on different concepts]

7
Strength of MediaGnosis
■ Store cross-functional knowledge in a unified multi-modal foundation model and process various types of information with the shared knowledge
■ Even if the training data for each function is limited, sharable knowledge allows for efficient learning and growth
[Diagram: functions shown are speech recognition, LLM, first impression recognition, face recognition, speaker recognition, emotion recognition, gender and age estimation, and XXX]
※ This is an example showing only some of the functions

8
Strength of MediaGnosis
■ Can provide various functions in an all-in-one manner and realize complex AI inference that combines multiple modalities and tasks
■ Easily offer new services that use multiple modalities and multiple AI functions
[Diagram: inputs and outputs connected through speech recognition, gender and age estimation, translation, LLM, emotion recognition, and first impression recognition]
※ This is an example showing only some of the functions

9
Key Technology in MediaGnosis
■ Multi-modal foundation modeling based on our in-house multi-modal or multi-task joint modeling techniques
■ Joint modeling enables knowledge sharing by representing various AI functions in a unified model structure and performing co-learning
[Diagram: modules for speech and audio understanding, image and video understanding, natural language understanding, crossmodal understanding, text generation (e.g., “It is sunny today”), emotion classification (happy, sad, neutral), and attribute classification (male, female; elder, adult, child)]
Annotations on the diagram:
• Can be shared between speech recognition, speaker recognition, speech-based emotion recognition, etc.
• Can be shared between speech-based attribute estimation, face-based attribute estimation, audio-visual attribute estimation, etc.
• Can be shared between all AI functions
※ This is an example showing only some of the modules
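
The knowledge-sharing idea on this slide can be illustrated with a toy sketch. The code below is not MediaGnosis code: the scalar "encoder", the two heads, and all numbers are hypothetical, chosen only to show that a training step for one function (emotion) changes the shared module and therefore also changes what another function (attribute) computes.

```python
# Illustrative only: a tiny shared encoder with two task-specific heads.
# All names and values are hypothetical, not taken from MediaGnosis.

def encode(shared_w, x):
    """Shared 'understanding' module: element-wise scaling of the input."""
    return [w * xi for w, xi in zip(shared_w, x)]

def head(head_w, features):
    """Task-specific head: a dot product over the shared features."""
    return sum(w * f for w, f in zip(head_w, features))

def sgd_step(shared_w, head_w, x, target, lr=0.1):
    """One squared-error gradient step updating both the head and the shared encoder."""
    feats = encode(shared_w, x)
    err = head(head_w, feats) - target
    new_head = [hw - lr * 2 * err * f for hw, f in zip(head_w, feats)]
    new_shared = [sw - lr * 2 * err * hw * xi
                  for sw, hw, xi in zip(shared_w, head_w, x)]
    return new_shared, new_head

shared = [1.0, 1.0]
emotion_head = [0.5, -0.5]
attribute_head = [1.0, 1.0]
x = [1.0, 2.0]

# Attribute output before and after training ONLY the emotion task.
before = head(attribute_head, encode(shared, x))
shared, emotion_head = sgd_step(shared, emotion_head, x, target=1.0)
after = head(attribute_head, encode(shared, x))
print(before, after)
```

In a real joint model the shared part would be a deep network and the update would come from batches of several tasks, but the coupling through shared parameters is the same.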

10
Training in MediaGnosis
■ MediaGnosis jointly uses various datasets to train the architecture
■ Utilize both unpaired datasets (text-only, speech-only, and image-only datasets) and input-output paired datasets
[Diagram: training datasets (datasets for speech recognition, face emotion recognition, machine translation, etc., together with text-only, speech-only, and image-only datasets) feed the modules for speech and audio understanding, image and video understanding, natural language understanding, crossmodal understanding, text generation, emotion classification (happy, sad, neutral), and attribute classification (male, female; elder, adult, child)]
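
As a rough illustration of training on several datasets at once, the sketch below interleaves examples from paired and unpaired toy datasets. The slides do not say how batches are actually scheduled, so the round-robin order, the dataset names, and the example contents are all assumptions.

```python
from itertools import cycle, islice

# Hypothetical toy datasets; each entry is (input, target).
# The unpaired text-only dataset has no target.
datasets = {
    "speech_recognition": [("audio_1", "hello"), ("audio_2", "good morning")],
    "face_emotion": [("image_1", "happy")],
    "translation": [("konnichiwa", "hello"), ("arigatou", "thank you")],
    "text_only": [("it is sunny today", None)],
}

def interleave(datasets):
    """Yield (task, example) round-robin across datasets, cycling each one."""
    streams = {task: cycle(examples) for task, examples in datasets.items()}
    while True:
        for task, stream in streams.items():
            yield task, next(stream)

# Take the first eight training examples from the mixed stream.
schedule = list(islice(interleave(datasets), 8))
for task, example in schedule:
    print(task, example)
```

Each yielded task name would decide which modules and which loss (supervised for paired data, self-supervised for unpaired data) are used for that example.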

11
Inference in MediaGnosis
■ Single-modal and multimodal functions can be implemented by extracting only the modules needed for the target function, without using all modules
• Case 1: Speech recognition uses speech and audio understanding, cross-modal understanding, and text generation
• Case 2: Face-based gender and age estimation uses image and video understanding, cross-modal understanding, and attribute classification (male, female; elder, adult, child)
• Case 3: Audio-visual emotion recognition uses speech and audio understanding, image and video understanding, cross-modal understanding, and emotion classification (happy, sad, neutral)
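
The three cases can be written down as a small lookup, which makes the "extract only what you need" idea concrete. The module names follow the slide, but the identifiers and helper functions are hypothetical and say nothing about how MediaGnosis is packaged.

```python
# Module names follow the slide; the identifiers themselves are hypothetical.
ALL_MODULES = {
    "speech_audio_understanding",
    "image_video_understanding",
    "natural_language_understanding",
    "crossmodal_understanding",
    "text_generation",
    "emotion_classification",
    "attribute_classification",
}

FUNCTIONS = {
    "speech_recognition": [
        "speech_audio_understanding", "crossmodal_understanding", "text_generation",
    ],
    "face_gender_age_estimation": [
        "image_video_understanding", "crossmodal_understanding",
        "attribute_classification",
    ],
    "audio_visual_emotion_recognition": [
        "speech_audio_understanding", "image_video_understanding",
        "crossmodal_understanding", "emotion_classification",
    ],
}

def modules_for(function_name):
    """Return the modules that must be loaded for one function."""
    return FUNCTIONS[function_name]

def unused_modules(function_name):
    """Return the modules that can be left out when deploying one function."""
    return ALL_MODULES - set(FUNCTIONS[function_name])

# Modules that appear in every one of the three cases.
shared_by_all = set.intersection(*(set(m) for m in FUNCTIONS.values()))
print(shared_by_all)
```

For these three cases the only module common to all of them is cross-modal understanding.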

12
Our major technical point (1)
■ Cross-modal Transformer-based joint modeling [Masumura+ INTERSPEECH2022] [Takashima+ INTERSPEECH2022]
■ Jointly model multiple single-modal tasks and multimodal tasks using a shared cross-modal Transformer architecture
• Speech, text, and speech-text cross-modal joint modeling for text generation [Masumura+ INTERSPEECH2022]
• Speech, image, and speech-image cross-modal joint modeling for classification [Takashima+ INTERSPEECH2022]

13
■ Grounding other modalities into a text-based LM for cross-modal modeling [Masumura+ SLT2019] [Masumura+ INTERSPEECH2020] [Masumura+ EUSIPCO2023]
■ Leverage knowledge gained from large amounts of text to improve speech or image processing with limited training data
• Text-based LM improves speech recognition
• Text-based LM improves image captioning [Masumura+ EUSIPCO2023]

14
■ Self-supervised representation learning for multi-domain joint modeling [Masumura+ SLT2021] [Ihori+ ICASSP2021] [Tanaka+ INTERSPEECH2022] [Tanaka+ ICASSP2023]
■ Utilize unpaired data collected from a variety of domains for self-supervised learning
• Domain adversarial self-supervised representation learning, considering multi-domain datasets [Tanaka+ INTERSPEECH2022]
• Cross-lingual self-supervised representation learning, considering cross-lingual datasets [Tanaka+ ICASSP2023]

15
■ Special-token-based inference control in joint modeling [Ihori+ INTERSPEECH2021] [Tanaka+ INTERSPEECH2021] [Ihori+ COLING 2022] [Orihashi+ ICIP2022]
■ Control the output generation style at inference while jointly training multiple tasks
• Switching-token-based inference control for text-style conversion [Ihori+ INTERSPEECH2021]
• Style-token-based inference control for speech recognition [Tanaka+ INTERSPEECH2021]
• Auxiliary-token-based inference control for scene-text recognition [Orihashi+ ICIP2022]
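
A minimal sketch of the general idea behind token-based control: every training source is prefixed with a special token naming its style, so a single jointly trained model can be steered at inference by choosing the token. The token names `<verbatim>` and `<clean>`, the example sentences, and the memorising stand-in model are all hypothetical; the cited papers use trained neural sequence models.

```python
def build_training_pairs(corpora):
    """Prefix each source with its corpus token so one model sees all styles."""
    pairs = []
    for token, examples in corpora.items():
        for src, tgt in examples:
            pairs.append(([token] + src.split(), tgt))
    return pairs

class LookupModel:
    """Stand-in for a trained sequence model: it only memorises its pairs."""

    def __init__(self, pairs):
        self.table = {tuple(src): tgt for src, tgt in pairs}

    def generate(self, text, control_token):
        """Select the output style by prepending the control token."""
        return self.table[tuple([control_token] + text.split())]

# Two hypothetical styles for the same spoken input.
corpora = {
    "<verbatim>": [("um i think so", "um i think so")],
    "<clean>": [("um i think so", "I think so.")],
}
model = LookupModel(build_training_pairs(corpora))

print(model.generate("um i think so", "<verbatim>"))
print(model.generate("um i think so", "<clean>"))
```

The same input yields a different output purely because of the leading token, which is the behaviour the slide describes as inference control.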

16
■ Joint inference of multiple kinds of information with auto-regressive joint modeling [Masumura+ INTERSPEECH2022] [Ihori+ INTERSPEECH2022] [Makishima+ INTERSPEECH2023]
■ Jointly generate the outputs of multiple functions in one inference pass and consider the relationships between those functions
• Joint inference of multiple talkers’ gender, age, and transcriptions [Masumura+ INTERSPEECH2022]
• Joint inference of dual-style outputs: jointly generate spoken text and written text [Ihori+ INTERSPEECH2022]
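
One common way to let an auto-regressive model emit several kinds of output in a single pass is to serialize them into one token sequence. The sketch below shows only that serialization and its inverse; the tag format (`<female>`, `<adult>`, `<sep>`) is an assumption for illustration and is not taken from the cited papers.

```python
def serialize(talkers):
    """Flatten per-talker gender, age, and transcription into one token list."""
    tokens = []
    for t in talkers:
        tokens += [f"<{t['gender']}>", f"<{t['age']}>"]
        tokens += t["text"].split()
        tokens.append("<sep>")  # hypothetical end-of-talker marker
    return tokens

def parse(tokens):
    """Recover the structured outputs from a generated token list."""
    talkers, current = [], []
    for tok in tokens:
        if tok == "<sep>":
            gender, age, *words = current
            talkers.append({
                "gender": gender.strip("<>"),
                "age": age.strip("<>"),
                "text": " ".join(words),
            })
            current = []
        else:
            current.append(tok)
    return talkers

talkers = [
    {"gender": "female", "age": "adult", "text": "it is sunny today"},
    {"gender": "male", "age": "elder", "text": "yes it is"},
]
sequence = serialize(talkers)
print(sequence)
print(parse(sequence))
```

Because the attributes and the words live in one sequence, each generated token can condition on everything emitted before it, which is how the relationships between functions can be taken into account.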

17
■ Multi-modal end-to-end modeling for challenging domains [Yamazaki+ AAAI2022] [Hojo+ INTERSPEECH2023]
■ Enhance multi-modal scene-context awareness and conversation-context awareness
• Audio-visual scene-aware dialog modeling with crossmodal transformer [Yamazaki+ AAAI2022]
• Audio-visual conversation understanding modeling with crossmodal transformer [Hojo+ INTERSPEECH2023]
• Handle conversation context between a seller and a buyer
• Handle conversation histories and conversation video scenes

18
■ Joint modeling of recognition and generation modules [Masumura+ APSIPA2019] [Masumura+ INTERSPEECH2019] [Makishima+ INTERSPEECH2022]
■ Consider speech generation or image generation to properly recognize speech or image information
• Consider generation criteria in training (reconstruction criterion in training) [Makishima+ INTERSPEECH2022]
• Consider generation criteria in inference (reconstruction criterion in inference)
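
A generic way to write a training objective with a reconstruction criterion is sketched below. This is an illustrative form, not the exact loss of the cited papers: $X$ is the input (speech or image), $Y$ its reference label, $\hat{Y}$ the recognizer output, $\theta$ and $\phi$ the recognizer and generator parameters, and $\lambda$ a weighting hyperparameter. All of these symbols are assumptions introduced here.

```latex
\mathcal{L}(\theta, \phi)
  = \underbrace{\mathcal{L}_{\mathrm{rec}}\bigl(Y \mid X;\, \theta\bigr)}_{\text{recognition loss}}
  + \lambda \,
    \underbrace{\mathcal{L}_{\mathrm{gen}}\bigl(X \mid \hat{Y};\, \phi\bigr)}_{\text{reconstruction loss}},
\qquad \lambda \ge 0
```

With $\lambda = 0$ this reduces to ordinary recognition training; a positive $\lambda$ additionally rewards recognition outputs from which the input can be regenerated.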

19
■ Training techniques to jointly use unpaired and paired data [Masumura+ ICASSP2020] [Takashima+ APSIPA2020] [Orihashi+ INTERSPEECH2020] [Suzuki+ ICIP2023]
■ Unpaired data can be used for semi-supervised learning in the training phase and for unsupervised adaptation in the run-time phase
• Semi-supervised learning for sequence-to-sequence modeling, using unlabeled data [Masumura+ ICASSP2020]
• Online domain adaptation for transformer-based object detection, using unlabeled data [Suzuki+ ICIP2023]

20
What is possible with MediaGnosis?

21
Our vision in industry fields
■ Provide any solutions across multimodal and multiple domains using MediaGnosis (our multimodal foundation model)
[Diagram: an “AI Anywhere powered by” banner surrounded by four fields (Smart Communication, Smart Office, Smart City, Smart XXX), with examples such as communication practice, auto customer service, AI collaboration, AI-based RPA, city robots, and monitoring]

22
Applications powered by MediaGnosis
■ We have developed various single-modal and multi-modal applications
■ This talk picks up two multi-modal applications for smart communication fields

23
App. 1: Remote communication support
■ Facilitate a remote meeting by sensing multimodal multi-party information
■ We often encounter stumbling blocks…
• I can’t sense the attitude of the other participants…
• Only a limited number of people speak…
• The discussion is deadlocked…
■ It is important to ensure that such meetings proceed smoothly

24
Demo

25
Analysis sheet
■ Sense various aspects of a meeting and visualize its changing state
• Speaker’s audio-visual emotion
• Listeners’ visual emotion
• Audio-visual multiparty analyses
• Transcriptions (can be translated)
■ Help to improve the overall quality of the meeting

26
App. 2: Personality measurement
■ Measure your personality and help you discover your potential charm by sensing multi-modal information
■ Select a situation and perform a role-play (e.g., apologize to your lover)
■ Analyze your responses, express them numerically, and classify them into categories
[Flow: select a situation that you like, perform role-playing in the selected situation for about 60 s, then receive your analysis result]

27
Demo

28
Analysis sheet
■ Explain your popularity factors, which represent the most charming points of a person, and also give advice about how to further enhance your charm
■ Represent how your popularity factors differ from the averages
[Figure labels: “Energetic compared with averaged ones”; “Compare the differences with others in various aspects”]

29
Summary

30
Summary
■ MediaGnosis is a multi-modal foundation model that can handle various functions and modalities within a single brain in an integrated manner
■ Can provide various functions in an all-in-one manner and realize complex AI inference that combines multiple modalities and tasks
■ Aim to provide any solutions across multimodal and multiple domains
[Diagram: a multi-modal foundation model spanning speech and audio processing, image and video processing, natural language processing, and cross-modal processing]

31
References
[Masumura+ INTERSPEECH2022] Ryo Masumura, Yoshihiro Yamazaki, Saki Mizuno, Naoki Makishima, Mana Ihori, Mihiro Uchida, Hiroshi Sato,
Tomohiro Tanaka, Akihiko Takashima, Satoshi Suzuki, Shota Orihashi, Takafumi Moriya, Nobukatsu Hojo and Atsushi Ando, "End-to-End Joint
Modeling of Conversation History-Dependent and Independent ASR Systems with Multi-History Training", In Proc. Annual Conference of the
International Speech Communication Association (INTERSPEECH), pp.3218-3222, 2022.
[Takashima+ INTERSPEECH2022] Akihiko Takashima, Ryo Masumura, Atsushi Ando, Yoshihiro Yamazaki, Mihiro Uchida and Shota Orihashi,
"Interactive Co-Learning with Cross-Modal Transformer for Audio-Visual Emotion Recognition", In Proc. Annual Conference of the International
Speech Communication Association (INTERSPEECH), pp.4740-4744, 2022.
[Masumura+ SLT2019] Ryo Masumura, Mana Ihori, Tomohiro Tanaka, Atsushi Ando, Ryo Ishii, Takanobu Oba, Ryuichiro Higashinaka,
"Improving Speech-Based End-of-Turn Detection via Cross-Modal Representation Learning with Punctuated Text Data", In Proc. IEEE Automatic
Speech Recognition and Understanding Workshop (ASRU), pp.1062-1069, 2019.
[Masumura+ INTERSPEECH2020] Ryo Masumura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi,
"Phoneme-to-Grapheme Conversion Based Large-Scale Pre-Training for End-to-End Automatic Speech Recognition", In Proc. Annual
Conference of the International Speech Communication Association (INTERSPEECH), pp.2822-2826, 2020.
[Masumura+ EUSIPCO2023] Ryo Masumura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi, "Text-to-Text
Pre-Training with Paraphrasing for Improving Transformer-based Image Captioning", In Proc. European Signal Processing Conference
(EUSIPCO), pp.516-520, 2023.
[Masumura+ SLT2021] Ryo Masumura, Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Akihiko Takashima and Shota Orihashi, "Large-Context
Conversational Representation Learning: Self-Supervised Learning for Conversational Documents", In Proc. IEEE Spoken Language Technology
Workshop (SLT), pp.1012-1019, 2021.

32
References
[Ihori+ ICASSP2021] Mana Ihori, Naoki Makishima, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi, Ryo Masumura, "MAPGN: MAsked
Pointer-Generator Network for Sequence-to-Sequence Pre-training", In Proc. International Conference on Acoustics, Speech, and Signal
Processing (ICASSP), pp.7563-7567, 2021.
[Tanaka+ INTERSPEECH2022] Tomohiro Tanaka, Ryo Masumura, Hiroshi Sato, Mana Ihori, Kohei Matsuura, Takanori Ashihara and Takafumi
Moriya, "Domain Adversarial Self-Supervised Speech Representation Learning for Improving Unknown Domain Downstream Tasks", In Proc.
Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.1066-1070, 2022.
[Tanaka+ ICASSP2023] Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Hiroshi Sato, Taiga Yamane, Takanori Ashihara, Kohei Matsuura, Takafumi
Moriya, "Leveraging Language Embeddings for Cross-Lingual Self-Supervised Speech Representation Learning", In Proc. International
Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023.
[Ihori+ INTERSPEECH2021] Mana Ihori, Naoki Makishima, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi and Ryo Masumura, "Zero-Shot
Joint Modeling of Multiple Spoken-Text-Style Conversion Tasks using Switching Tokens", In Proc. Annual Conference of the International
Speech Communication Association (INTERSPEECH), pp.776-780, 2021.
[Tanaka+ INTERSPEECH2021] Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Akihiko Takashima, Shota Orihashi and Naoki Makishima, "End-to-
End Rich Transcription-Style Automatic Speech Recognition with Semi-Supervised Learning", In Proc. Annual Conference of the International
Speech Communication Association (INTERSPEECH), pp.4458-4462, 2021.
[Ihori+ COLING 2022] Mana Ihori, Hiroshi Sato, Tomohiro Tanaka, Ryo Masumura, "Multi-Perspective Document Revision", In Proc. International
Conference on Computational Linguistics (COLING), pp.6128-6138, 2022.
[Orihashi+ ICIP2022] Shota Orihashi, Yoshihiro Yamazaki, Mihiro Uchida, Akihiko Takashima, Ryo Masumura, "Fully Sharable Scene Text
Recognition Modeling for Horizontal and Vertical Writing", In Proc. International Conference on Image Processing (ICIP), pp.2636-2640, 2022.

33
References
[Ihori+ INTERSPEECH2022] Mana Ihori, Hiroshi Sato, Tomohiro Tanaka, Ryo Masumura, Saki Mizuno, Nobukatsu Hojo, "Transcribing Speech as
Spoken and Written Dual Text Using an Autoregressive Model", In Proc. Annual Conference of the International Speech Communication
Association (INTERSPEECH), pp.461-465, 2023.
[Makishima+ INTERSPEECH2023] Naoki Makishima, Keita Suzuki, Satoshi Suzuki, Atsushi Ando, Ryo Masumura, "Joint Autoregressive Modeling
of End-to-End Multi-Talker Overlapped Speech Recognition and Utterance-level Timestamp Prediction", In Proc. Annual Conference of the
International Speech Communication Association (INTERSPEECH), 2023.
[Yamazaki+ AAAI2022] Yoshihiro Yamazaki, Shota Orihashi, Ryo Masumura, Mihiro Uchida, Akihiko Takashima, "Audio Visual Scene-Aware
Dialog Generation with Transformer-based Video Representations", In Proc. DSTC Workshop at AAAI Conference on Artificial Intelligence (AAAI),
No.35, 2022.
[Hojo+ INTERSPEECH2023] Nobukatsu Hojo, Saki Mizuno, Satoshi Kobashikawa, Ryo Masumura, Mana Ihori, Hiroshi Sato, Tomohiro Tanaka,
"Audio-Visual Praise Estimation for Conversational Video based on Synchronization-Guided Multimodal Transformer", In Proc. Annual
Conference of the International Speech Communication Association (INTERSPEECH), pp.2663-2667, 2023.
[Masumura+ APSIPA2019] Ryo Masumura, Yusuke Ijima, Satoshi Kobashikawa, Takanobu Oba, Yushi Aono, "Can We Simulate Generative
Process of Acoustic Modeling Data? Towards Data Restoration for Acoustic Modeling", In Proc. Asia-Pacific Signal and Information Processing
Association Annual Summit and Conference (APSIPA ASC), pp.655-661, 2019.

34
References
[Masumura+ INTERSPEECH2019] Ryo Masumura, Hiroshi Sato, Tomohiro Tanaka, Takafumi Moriya, Yusuke Ijima, Takanobu Oba, "End-to-End
Automatic Speech Recognition with a Reconstruction Criterion Using Speech-to-Text and Text-to-Speech Encoder-Decoders", In Proc. Annual
Conference of the International Speech Communication Association (INTERSPEECH), pp.1606-1610, 2019.
[Makishima+ INTERSPEECH2022] Naoki Makishima, Satoshi Suzuki, Atsushi Ando and Ryo Masumura, "Speaker consistency loss and step-wise
optimization for semi-supervised joint training of TTS and ASR using unpaired text data", In Proc. Annual Conference of the International
Speech Communication Association (INTERSPEECH), pp.526-530, 2022.
[Masumura+ ICASSP2020] Ryo Masumura, Mana Ihori, Akihiko Takashima, Takafumi Moriya, Atsushi Ando, Yusuke Shinohara, "Sequence-level
Consistency Training for Semi-Supervised End-to-End Automatic Speech Recognition", In Proc. International Conference on Acoustics, Speech,
and Signal Processing (ICASSP), pp.7049-7053, 2020.
[Takashima+ APSIPA2020] Akihiko Takashima, Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Shota Orihashi, Ryo Masumura, "Unsupervised
Domain Adversarial Training in Angular Space for Facial Expression Recognition", In Proc. Asia-Pacific Signal and Information Processing
Association Annual Summit and Conference (APSIPA ASC), pp.1054-1059, 2020.
[Orihashi+ INTERSPEECH2020] Shota Orihashi, Mana Ihori, Tomohiro Tanaka, Ryo Masumura, "Unsupervised Domain Adaptation for Dialogue
Sequence Labeling Based on Hierarchical Adversarial Training", In Proc. Annual Conference of the International Speech Communication
Association (INTERSPEECH), pp.1575-1579, 2020.
[Suzuki+ ICIP2023] Satoshi Suzuki, Taiga Yamane, Naoki Makishima, Keita Suzuki, Atsushi Ando, Ryo Masumura, "ONDA-DETR: Online Domain
Adaptation for Detection Transformers with Self-Training Framework", In Proc. International Conference on Image Processing (ICIP), pp.1780-
1784, 2023.
