MediaGnosis IEEE ICIP2023 Industry Seminar

Ryo Masumura
October 11, 2023

Transcript

  1. MediaGnosis:
    The next-generation media processing artificial intelligence
    Ryo Masumura, NTT Corporation, Japan

  2.
    Copyright NTT CORPORATION
    Self-introduction
     Ryo Masumura, Ph.D.
     Biography
    • 2011.04: Joined Nippon Telegraph and Telephone Corporation (NTT)
    • 2015.04-2016.09: Ph.D. Student, Tohoku University
    • Now: Distinguished research scientist at NTT Human Information Labs.
     Research Topics
    • Speech processing (speech recognition, classification, etc.)
    • Natural language processing (classification, generation, etc.)
    • Computer vision (classification, detection, captioning, etc.)
    • Crossmodal processing (joint modeling, etc.)
    My goal: establishing general-purpose media processing AI

  3.
    Overview of my presentation
     Present next-generation media-processing AI “MediaGnosis,” being
    developed at NTT Corporation
     What is MediaGnosis?
    • How does MediaGnosis differ from general AIs?
    • What technology is the key enabler of MediaGnosis?
     What is possible with MediaGnosis?
    • App. 1: multi-modal conversation sensing application
    • App. 2: multi-modal personality factor measurement application

  4.
    What is MediaGnosis?

  5.
    Overview of MediaGnosis
     MediaGnosis is a multi-modal foundation model that can handle various
    functions and modalities within a single brain in an integrated manner
     “MediaGnosis” originates from the idea of “treating all sorts of media (records of
    information) as gnosis (knowledge) in an integrated manner like humans and making
    a diagnosis (judgment) based on that knowledge”
    [Figure: multi-modal foundation model spanning speech and audio processing, image and video processing, natural language processing, and cross-modal processing]

  6.
    Problems in general media processing AIs
     General AIs work with an independent brain for each function
     Knowledge acquired in each general AI function is not mutually utilized
    [Figure: separate models for speech recognition, LLM, first impression recognition, face recognition, speaker recognition, emotion recognition, gender and age estimation, and others (XXX). They have an independent brain for each of speech recognition, face recognition, and emotion recognition.]

  7.
    Problems in general media processing AIs
     Difficult to combine multiple functions in a complex manner
     While the market is demanding new services that use multiple modalities and multiple AI
    functions, this difficulty is a major bottleneck in service development
    [Figure: a service with multiple AI functions has to combine speech recognition provided by corp. A, face recognition provided by corp. B, emotion recognition provided by corp. C, and attribute estimation provided by corp. D, delivered in different forms (REST API using cloud, WebSocket API using cloud, standalone Python module, WebAssembly). It is difficult to combine multiple technologies individually built on different concepts.]

  8.
    Strength of MediaGnosis
     Store cross-functional knowledge in a unified multi-modal foundation
    model and process various types of information with the shared knowledge
     Even if the training data for each function is limited, sharable knowledge allows for
    efficient learning and growth
    [Figure: one shared model serving speech recognition, LLM, first impression recognition, face recognition, speaker recognition, emotion recognition, gender and age estimation, and others (XXX)]
    ※ This is an example showing only some of the functions.

  9.
    Strength of MediaGnosis
     Can provide various functions in an all-in-one manner and
    realize complex AI inference combining multiple modalities and multiple tasks
     Easily offer new services that use multiple modalities and multiple AI functions
    [Figure: inputs pass through one model to outputs; functions shown: speech recognition, gender and age estimation, translation, LLM, emotion recognition, first impression recognition]
    ※ This is an example showing only some of the functions.

  10.
    Key Technology in MediaGnosis
     Multi-modal foundation modeling based on our in-house multi-modal
    or multi-task joint modeling techniques
     Joint modeling enables knowledge sharing by representing various AI functions
    in a unified model structure and performing co-learning
    [Figure: modules of the unified model: Speech and Audio Understanding, Image and Video Understanding, Natural Language Understanding, Crossmodal Understanding, Text Generation (e.g., “It is sunny today”), Emotion Classification (happy, sad, neutral), Attribute Classification (male, female; elder, adult, child)]
    Annotations in the figure:
    • Can be shared between speech recognition, speaker recognition, speech-based emotion recognition, etc.
    • Can be shared between speech-based attribute estimation, face-based attribute estimation, audio-visual attribute estimation, etc.
    • Can be shared between all AI functions
    ※ This is an example showing only some of the modules.
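A minimal sketch of the knowledge-sharing idea on this slide, assuming that each function is a path through module objects held in one registry. The module names, the wiring of each function, and the update counter are illustrative assumptions, not the actual MediaGnosis implementation.

```python
from dataclasses import dataclass


@dataclass
class SharedModule:
    """Stand-in for a trainable module; it counts updates instead of learning."""
    name: str
    updates: int = 0

    def update(self) -> None:
        self.updates += 1


# Hypothetical module names, loosely following the labels in the figure.
modules = {name: SharedModule(name) for name in (
    "speech_audio_understanding",
    "image_video_understanding",
    "crossmodal_understanding",
    "text_generation",
    "emotion_classification",
)}

# Each function is a path through the shared modules (an assumed wiring).
functions = {
    "speech_recognition": [
        "speech_audio_understanding", "crossmodal_understanding", "text_generation"],
    "speech_emotion_recognition": [
        "speech_audio_understanding", "crossmodal_understanding", "emotion_classification"],
    "face_emotion_recognition": [
        "image_video_understanding", "crossmodal_understanding", "emotion_classification"],
}


def co_learning_step(function_name: str) -> None:
    """One training step for a function updates every module on its path."""
    for module_name in functions[function_name]:
        modules[module_name].update()


# One step per function: modules on several paths receive several updates.
for name in functions:
    co_learning_step(name)

update_counts = {name: module.updates for name, module in modules.items()}
```

In this toy setup the cross-modal module is updated by every function, while the text-generation head is updated only by speech recognition, which is the sense in which knowledge in shared modules is reused across functions.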

  11.
    Training in MediaGnosis
     MediaGnosis jointly utilizes various datasets to train the architecture
     Utilize both unpaired datasets (text-only, speech-only, image-only datasets) and
    input-output paired datasets
    [Figure: training datasets (dataset for speech recognition, dataset for face emotion recognition, dataset for machine translation, text-only dataset, speech-only dataset, image-only dataset, etc.) feed the same modules: Speech and Audio Understanding, Image and Video Understanding, Natural Language Understanding, Crossmodal Understanding, Text Generation (e.g., “It is sunny today”), Emotion Classification (happy, sad, neutral), Attribute Classification (male, female; elder, adult, child)]
    ※ This is an example showing only some of the modules.
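One way to picture joint use of several datasets is a training schedule that draws each step from one dataset and picks the objective by dataset type. The sketch below is an assumption for illustration: the dataset sizes, the size-proportional sampling, and the mapping of unpaired data to a self-supervised objective are not stated on the slide.

```python
import random
from collections import Counter

# Hypothetical datasets: name -> (kind, number of examples).
DATASETS = {
    "speech_recognition": ("paired", 300),
    "face_emotion_recognition": ("paired", 100),
    "machine_translation": ("paired", 200),
    "text_only": ("unpaired", 400),
}


def sample_schedule(datasets, num_steps, seed=0):
    """Pick one dataset per training step, proportionally to dataset size."""
    rng = random.Random(seed)
    names = list(datasets)
    weights = [datasets[name][1] for name in names]
    return rng.choices(names, weights=weights, k=num_steps)


def objective_for(kind):
    """Assumed rule: paired data is supervised, unpaired data self-supervised."""
    return "supervised" if kind == "paired" else "self_supervised"


schedule = sample_schedule(DATASETS, num_steps=1000)
counts = Counter(schedule)
objectives = {name: objective_for(DATASETS[name][0]) for name in DATASETS}
```

Size-proportional sampling is only one option; temperature-scaled or uniform sampling would change how often small datasets are visited.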

  12.
    Inference in MediaGnosis
     Possible to implement single-modal functions and multimodal functions by
    extracting modules for the target function without using all modules
    • Case 1, speech recognition: Speech and Audio Understanding → Cross-modal Understanding → Text Generation (e.g., “It is sunny today”)
    • Case 2, face-based gender and age estimation: Image and Video Understanding → Cross-modal Understanding → Attribute Classification (male, female; elder, adult, child)
    • Case 3, audio-visual emotion recognition: Speech and Audio Understanding + Image and Video Understanding → Cross-modal Understanding → Emotion Classification (happy, sad, neutral)
    ※ This is an example showing only some of the modules.
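The three cases on this slide can be sketched as a lookup from a target function to the modules it needs, so that a deployment loads only that subset. The identifiers and the encoder/fusion/head split are hypothetical; the slide does not describe the actual extraction mechanism.

```python
# All modules named in the figure (identifiers are hypothetical).
ALL_MODULES = {
    "speech_audio_understanding",
    "image_video_understanding",
    "natural_language_understanding",
    "crossmodal_understanding",
    "text_generation",
    "emotion_classification",
    "attribute_classification",
}

# Module paths for the three cases on the slide.
FUNCTION_PATHS = {
    "speech_recognition": {
        "encoders": ["speech_audio_understanding"],
        "fusion": "crossmodal_understanding",
        "head": "text_generation",
    },
    "face_gender_age_estimation": {
        "encoders": ["image_video_understanding"],
        "fusion": "crossmodal_understanding",
        "head": "attribute_classification",
    },
    "audio_visual_emotion_recognition": {
        "encoders": ["speech_audio_understanding", "image_video_understanding"],
        "fusion": "crossmodal_understanding",
        "head": "emotion_classification",
    },
}


def extract_modules(function_name):
    """Return only the modules that the target function needs."""
    path = FUNCTION_PATHS[function_name]
    needed = set(path["encoders"]) | {path["fusion"], path["head"]}
    missing = needed - ALL_MODULES
    if missing:
        raise KeyError(f"unknown modules: {sorted(missing)}")
    return needed


asr_modules = extract_modules("speech_recognition")
unused_for_asr = ALL_MODULES - asr_modules
```

For speech recognition, three of the seven modules are selected and the remaining four never need to be loaded.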

  13.
    Our major technical point (1)
     Cross-modal Transformer-based joint modeling
    [Masumura+ INTERSPEECH2022] [Takashima+ INTERSPEECH2022]
     Jointly model multiple single-modal tasks and multimodal tasks using a shared
    cross-modal transformer architecture
    • Speech, text, and speech-text cross-modal joint modeling for text generation [Masumura+ INTERSPEECH2022]
    • Speech, image, and speech-image cross-modal joint modeling for classification [Takashima+ INTERSPEECH2022]

  14.
    Our major technical point (2)
     Grounding other modalities in a text-based LM for cross-modal modeling
    [Masumura+ SLT2019] [Masumura+ INTERSPEECH2020] [Masumura+ EUSIPCO2023]
     Leverage knowledge gained from large amounts of text to improve speech
    or image processing with limited training data
    • Text-based LM improves image captioning [Masumura+ EUSIPCO2023]
    • Text-based LM improves speech recognition [Masumura+ INTERSPEECH2020]

  15.
    Our major technical point (3)
     Self-supervised representation learning for multi-domain joint modeling
    [Masumura+ SLT2021] [Ihori+ ICASSP2021] [Tanaka+ INTERSPEECH2022] [Tanaka+ ICASSP2023]
     Utilize unpaired data collected from a variety of domains for self-supervised learning
    • Domain adversarial self-supervised representation learning, considering multi-domain datasets [Tanaka+ INTERSPEECH2022]
    • Cross-lingual self-supervised representation learning, considering cross-lingual datasets [Tanaka+ ICASSP2023]

  16.
    Our major technical point (4)
     Special token based inference control in joint modeling
    [Ihori+ INTERSPEECH2021] [Tanaka+ INTERSPEECH2021] [Ihori+ COLING 2022] [Orihashi+ ICIP 2022]
     Control output generation style in inference while jointly training multiple tasks
    • Style token based inference control for speech recognition [Tanaka+ INTERSPEECH2021]
    • Switching token based inference control for text-style conversion [Ihori+ INTERSPEECH2021]
    • Auxiliary token based inference control for scene-text recognition [Orihashi+ ICIP2022]
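A toy illustration of token-based inference control, assuming the control token is placed right after the start symbol of the decoder input. The token names and the `toy_decode` stand-in are hypothetical; the cited papers define their own token sets and use trained models rather than rules.

```python
# Hypothetical control tokens for two output styles.
CONTROL_TOKENS = {"<spoken>", "<written>"}
FILLERS = {"um", "uh"}


def build_decoder_prefix(control_token, bos="<s>"):
    """Decoder input starts with BOS followed by the control token."""
    if control_token not in CONTROL_TOKENS:
        raise ValueError(f"unknown control token: {control_token}")
    return [bos, control_token]


def toy_decode(prefix, source_words):
    """Rule-based stand-in for a jointly trained decoder.

    The same decoder serves both styles; the control token in the
    prefix selects which style is produced.
    """
    style = prefix[1]
    if style == "<written>":
        return [word for word in source_words if word not in FILLERS]
    return list(source_words)


words = ["um", "it", "is", "sunny", "today"]
spoken = toy_decode(build_decoder_prefix("<spoken>"), words)
written = toy_decode(build_decoder_prefix("<written>"), words)
```

The point of the sketch is the interface: one model, one decoding routine, and a single prefix token that switches the output style at inference time.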

  17.
    Our major technical point (5)
     Joint inference of multiple types of information with auto-regressive joint modeling
    [Masumura+ INTERSPEECH2022] [Ihori+ INTERSPEECH2023] [Makishima+ INTERSPEECH2023]
     Jointly generate the outputs of multiple functions in one inference pass and consider the
    relationships between the functions
    • Joint inference of multi-talker gender, age, and transcriptions [Masumura+ INTERSPEECH2022]
    • Joint inference of dual-style outputs: jointly generate spoken text and written text [Ihori+ INTERSPEECH2023]
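When several outputs are generated as one auto-regressive token sequence, the sequence has to be split back into per-function results. The sketch below assumes a serialization of the form gender token, age token, transcription words, separator; this format and the token names are hypothetical and are not taken from the cited papers.

```python
from dataclasses import dataclass

GENDERS = {"<male>", "<female>"}
AGES = {"<elder>", "<adult>", "<child>"}
SEP = "<sep>"  # hypothetical speaker-change token


@dataclass
class TalkerResult:
    gender: str
    age: str
    transcription: str


def parse_segment(segment):
    """Turn one talker's tokens into a structured result."""
    if len(segment) < 2 or segment[0] not in GENDERS or segment[1] not in AGES:
        raise ValueError(f"malformed segment: {segment}")
    return TalkerResult(
        gender=segment[0].strip("<>"),
        age=segment[1].strip("<>"),
        transcription=" ".join(segment[2:]),
    )


def parse_joint_output(tokens):
    """Split a jointly generated token sequence into per-talker results."""
    results, segment = [], []
    for token in list(tokens) + [SEP]:
        if token != SEP:
            segment.append(token)
            continue
        if segment:
            results.append(parse_segment(segment))
        segment = []
    return results


generated = ["<male>", "<adult>", "it", "is", "sunny", SEP,
             "<female>", "<child>", "hello"]
talkers = parse_joint_output(generated)
```

Because the attribute tokens precede the words in this assumed format, an auto-regressive decoder would condition each transcription on the gender and age it has already predicted, which is one way the relationship between functions can be taken into account.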

  18.
    Our major technical point (6)
     Multi-modal end-to-end modeling for challenging domains
    [Yamazaki+ AAAI2022] [Hojo+ INTERSPEECH2023]
     Enhance multi-modal scene-context awareness, conversation-context awareness
    • Audio-visual scene-aware dialog modeling with crossmodal transformer [Yamazaki+ AAAI2022]
    • Audio-visual conversation understanding modeling with crossmodal transformer [Hojo+ INTERSPEECH2023]
    Figure notes: handle conversation context between a seller and a buyer; handle conversation histories and the conversation video scene

  19.
    Our major technical point (7)
     Joint modeling of recognition and generation modules
    [Masumura+ APSIPA2019] [Masumura+ INTERSPEECH2019] [Makishima+ INTERSPEECH2022]
     Consider speech generation or image generation to properly recognize speech or
    image information
    • Consider generation criteria in training (reconstruction criterion in training) [Makishima+ INTERSPEECH2022]
    • Consider generation criteria in inference (reconstruction criterion in inference) [Masumura+ INTERSPEECH2019]

  20.
    Our major technical point (8)
     Training techniques to jointly use unpaired and paired data
    [Masumura+ ICASSP2020] [Takashima+ APSIPA2020] [Orihashi+ INTERSPEECH2020] [Suzuki+ ICIP2023]
     Unpaired data can be utilized for semi-supervised learning in a training phase,
    and utilized for unsupervised adaptation in a run-time phase
    • Semi-supervised learning for sequence-to-sequence modeling, using unlabeled data [Masumura+ ICASSP2020]
    • Online domain adaptation for transformer-based object detection, using unlabeled data [Suzuki+ ICIP2023]

  21.
    What is possible with MediaGnosis?

  22.
    Our vision in industry fields
     Provide solutions across multiple modalities and multiple domains
    using MediaGnosis (our multimodal foundation model)
    [Figure: “AI Anywhere,” powered by MediaGnosis, across Smart Communication, Smart Office, Smart City, and Smart XXX; examples shown: communication practice, auto customer service, city robots, monitoring, AI collaboration, AI-based RPA]

  23.
    Applications powered by MediaGnosis
     Have developed various single-modal and multi-modal applications
     Pick up two multi-modal applications for smart communication fields

  24.
    App. 1: Remote communication support
     Facilitate a remote meeting by sensing multimodal multi-party information
     We often encounter stumbling blocks…
    • I can’t sense the attitude of the other participants…
    • Only a limited number of people speak…
    • The discussion is deadlocked…
     It is important to ensure that such meetings proceed smoothly

  25.
    Demo

  26.
    Analysis sheet
     Sense various aspects of a meeting and visualize its changing state
     Help to improve the overall quality of the meeting
    [Figure: analysis sheet panels: speaker’s audio-visual emotion, listeners’ visual emotion, audio-visual multiparty analyses, transcriptions (can be translated)]

  27.
    App. 2: Personality measurement
     Measure your personality and help you discover your potential charm
    by sensing multi-modal information
     Select a situation and perform a role-play (e.g., apologize to your lover)
     Analyze your responses, express them numerically, and classify them into categories
    Steps: (1) select a situation that you like; (2) perform a role-play in the selected situation for about 60 s; (3) receive your analysis result

  28.
    Demo

  29.
    Analysis sheet
     Explain your popularity factors, which represent the most charming points
    of a person, and also give advice about how to further enhance your charm
     Represent how your popularity factors differ from the averages
    • Example result: energetic compared with the average
    • Compare your differences from others in various aspects

  30.
    Summary

  31.
    Summary
     MediaGnosis is a multi-modal foundation model that can handle various
    functions and modalities within a single brain in an integrated manner
     Can provide various functions in an all-in-one manner and realize complex AI inference
    combining multiple modalities and multiple tasks
     Aim to provide solutions across multiple modalities and multiple domains
    [Figure: multi-modal foundation model spanning speech and audio processing, image and video processing, natural language processing, and cross-modal processing]

  32.
    References
    [Masumura+ INTERSPEECH2022] Ryo Masumura, Yoshihiro Yamazaki, Saki Mizuno, Naoki Makishima, Mana Ihori, Mihiro Uchida, Hiroshi Sato,
    Tomohiro Tanaka, Akihiko Takashima, Satoshi Suzuki, Shota Orihashi, Takafumi Moriya, Nobukatsu Hojo and Atsushi Ando, "End-to-End Joint
    Modeling of Conversation History-Dependent and Independent ASR Systems with Multi-History Training", In Proc. Annual Conference of the
    International Speech Communication Association (INTERSPEECH), pp.3218-3222, 2022.
    [Takashima+ INTERSPEECH2022] Akihiko Takashima, Ryo Masumura, Atsushi Ando, Yoshihiro Yamazaki, Mihiro Uchida and Shota Orihashi,
    "Interactive Co-Learning with Cross-Modal Transformer for Audio-Visual Emotion Recognition", In Proc. Annual Conference of the International
    Speech Communication Association (INTERSPEECH), pp.4740-4744, 2022.
    [Masumura+ SLT2019] Ryo Masumura, Mana Ihori, Tomohiro Tanaka, Atsushi Ando, Ryo Ishii, Takanobu Oba, Ryuichiro Higashinaka,
    "Improving Speech-Based End-of-Turn Detection via Cross-Modal Representation Learning with Punctuated Text Data", In Proc. IEEE Automatic
    Speech Recognition and Understanding Workshop (ASRU), pp.1062-1069, 2019.
    [Masumura+ INTERSPEECH2020] Ryo Masumura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi,
    "Phoneme-to-Grapheme Conversion Based Large-Scale Pre-Training for End-to-End Automatic Speech Recognition", In Proc. Annual
    Conference of the International Speech Communication Association (INTERSPEECH), pp.2822-2826, 2020.
    [Masumura+ EUSIPCO2023] Ryo Masumura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi, "Text-to-Text
    Pre-Training with Paraphrasing for Improving Transformer-based Image Captioning", In Proc. European Signal Processing Conference
    (EUSIPCO), pp.516-520, 2023.
    [Masumura+ SLT2021] Ryo Masumura, Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Akihiko Takashima and Shota Orihashi, "Large-Context
    Conversational Representation Learning: Self-Supervised Learning for Conversational Documents", In Proc. IEEE Spoken Language Technology
    Workshop (SLT), pp.1012-1019, 2021.

  33.
    References
    [Ihori+ ICASSP2021] Mana Ihori, Naoki Makishima, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi, Ryo Masumura, "MAPGN: MAsked
    Pointer-Generator Network for Sequence-to-Sequence Pre-training", In Proc. International Conference on Acoustics, Speech, and Signal
    Processing (ICASSP), pp.7563-7567, 2021.
    [Tanaka+ INTERSPEECH2022] Tomohiro Tanaka, Ryo Masumura, Hiroshi Sato, Mana Ihori, Kohei Matsuura, Takanori Ashihara and Takafumi
    Moriya, "Domain Adversarial Self-Supervised Speech Representation Learning for Improving Unknown Domain Downstream Tasks", In Proc.
    Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.1066-1070, 2022.
    [Tanaka+ ICASSP2023] Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Hiroshi Sato, Taiga Yamane, Takanori Ashihara, Kohei Matsuura, Takafumi
    Moriya, "Leveraging Language Embeddings for Cross-Lingual Self-Supervised Speech Representation Learning", In Proc. International
    Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023.
    [Ihori+ INTERSPEECH2021] Mana Ihori, Naoki Makishima, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi and Ryo Masumura, "Zero-Shot
    Joint Modeling of Multiple Spoken-Text-Style Conversion Tasks using Switching Tokens", In Proc. Annual Conference of the International
    Speech Communication Association (INTERSPEECH), pp.776-780, 2021.
    [Tanaka+ INTERSPEECH2021] Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Akihiko Takashima, Shota Orihashi and Naoki Makishima,
    "End-to-End Rich Transcription-Style Automatic Speech Recognition with Semi-Supervised Learning", In Proc. Annual Conference of the International
    Speech Communication Association (INTERSPEECH), pp.4458-4462, 2021.
    [Ihori+ COLING 2022] Mana Ihori, Hiroshi Sato, Tomohiro Tanaka, Ryo Masumura, "Multi-Perspective Document Revision", In Proc. International
    Conference on Computational Linguistics (COLING), pp.6128-6138, 2022.
    [Orihashi+ ICIP2022] Shota Orihashi, Yoshihiro Yamazaki, Mihiro Uchida, Akihiko Takashima, Ryo Masumura, "Fully Sharable Scene Text
    Recognition Modeling for Horizontal and Vertical Writing", In Proc. International Conference on Image Processing (ICIP), pp.2636-2640, 2022.

  34.
    References
    [Ihori+ INTERSPEECH2023] Mana Ihori, Hiroshi Sato, Tomohiro Tanaka, Ryo Masumura, Saki Mizuno, Nobukatsu Hojo, "Transcribing Speech as
    Spoken and Written Dual Text Using an Autoregressive Model", In Proc. Annual Conference of the International Speech Communication
    Association (INTERSPEECH), pp.461-465, 2023.
    [Makishima+ INTERSPEECH2023] Naoki Makishima, Keita Suzuki, Satoshi Suzuki, Atsushi Ando, Ryo Masumura, "Joint Autoregressive Modeling
    of End-to-End Multi-Talker Overlapped Speech Recognition and Utterance-level Timestamp Prediction", In Proc. Annual Conference of the
    International Speech Communication Association (INTERSPEECH), pp.2913-2917, 2023.
    [Yamazaki+ AAAI2022] Yoshihiro Yamazaki, Shota Orihashi, Ryo Masumura, Mihiro Uchida, Akihiko Takashima, "Audio Visual Scene-Aware
    Dialog Generation with Transformer-based Video Representations", In Proc. DSTC Workshop at AAAI Conference on Artificial Intelligence (AAAI),
    No.35, 2022.
    [Hojo+ INTERSPEECH2023] Nobukatsu Hojo, Saki Mizuno, Satoshi Kobashikawa, Ryo Masumura, Mana Ihori, Hiroshi Sato, Tomohiro Tanaka,
    "Audio-Visual Praise Estimation for Conversational Video based on Synchronization-Guided Multimodal Transformer", In Proc. Annual
    Conference of the International Speech Communication Association (INTERSPEECH), pp.2663-2667, 2023.
    [Masumura+ APSIPA2019] Ryo Masumura, Yusuke Ijima, Satoshi Kobashikawa, Takanobu Oba, Yushi Aono, "Can We Simulate Generative
    Process of Acoustic Modeling Data? Towards Data Restoration for Acoustic Modeling", In Proc. Asia-Pacific Signal and Information Processing
    Association Annual Summit and Conference (APSIPA ASC), pp.655-661, 2019.

  35.
    References
    [Masumura+ INTERSPEECH2019] Ryo Masumura, Hiroshi Sato, Tomohiro Tanaka, Takafumi Moriya, Yusuke Ijima, Takanobu Oba, "End-to-End
    Automatic Speech Recognition with a Reconstruction Criterion Using Speech-to-Text and Text-to-Speech Encoder-Decoders", In Proc. Annual
    Conference of the International Speech Communication Association (INTERSPEECH), pp.1606-1610, 2019.
    [Makishima+ INTERSPEECH2022] Naoki Makishima, Satoshi Suzuki, Atsushi Ando and Ryo Masumura, "Speaker consistency loss and step-wise
    optimization for semi-supervised joint training of TTS and ASR using unpaired text data", In Proc. Annual Conference of the International
    Speech Communication Association (INTERSPEECH), pp.526-530, 2022.
    [Masumura+ ICASSP2020] Ryo Masumura, Mana Ihori, Akihiko Takashima, Takafumi Moriya, Atsushi Ando, Yusuke Shinohara, "Sequence-level
    Consistency Training for Semi-Supervised End-to-End Automatic Speech Recognition", In Proc. International Conference on Acoustics, Speech,
    and Signal Processing (ICASSP), pp.7049-7053, 2020.
    [Takashima+ APSIPA2020] Akihiko Takashima, Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Shota Orihashi, Ryo Masumura, "Unsupervised
    Domain Adversarial Training in Angular Space for Facial Expression Recognition", In Proc. Asia-Pacific Signal and Information Processing
    Association Annual Summit and Conference (APSIPA ASC), pp.1054-1059, 2020.
    [Orihashi+ INTERSPEECH2020] Shota Orihashi, Mana Ihori, Tomohiro Tanaka, Ryo Masumura, "Unsupervised Domain Adaptation for Dialogue
    Sequence Labeling Based on Hierarchical Adversarial Training", In Proc. Annual Conference of the International Speech Communication
    Association (INTERSPEECH), pp.1575-1579, 2020.
    [Suzuki+ ICIP2023] Satoshi Suzuki, Taiga Yamane, Naoki Makishima, Keita Suzuki, Atsushi Ando, Ryo Masumura, "ONDA-DETR: Online Domain
    Adaptation for Detection Transformers with Self-Training Framework", In Proc. International Conference on Image Processing (ICIP),
    pp.1780-1784, 2023.
