Extend the use of supplemental variables in GDA by applying machine learning to the free text descriptive response portion and combining it with MCA analysis

Extend the use of supplemental variables in GDA
by applying machine learning to the free text
descriptive response portion and combining it with
MCA analysis ver1.0
CARME2023 09/28 Room2
11:00-12:30
Kazuo Fujimoto [email protected]
Project Researcher
Institute for Mathematics and Computer Science
Tsuda University

View Slide

Very short self-introduction: After CARME…
After CARME2015,
this translated book
was published.
After CARME2019!

View Slide

So, after CARME2023…
• Not decided …
2023/9/28 CARME2023@University of Bonn 3

View Slide

Abstract
The practice of linking the distribution of individuals within the space
revealed by MCA with qualitative surveys has been mentioned in the book [1]
and practiced in research activity [2]. In Japan, KH Coder [3] as a text
analysis tool has been remarkably popularized and used in many social
surveys.
It is possible to link this text analysis with the selected answers using
functions within KH Coder. Our first attempt as a mixed research method is
to use this functionality.
The next step is to add the frequently occurring words (important words)
obtained at this stage to the individual coordinates as supplementary variables
in the MCA and to analyze them by a GDA method [4].
In this report, as a further step, we present an example [5] in which frequently
occurring words (important words) were tagged as positive/negative by a
machine learning process and analyzed as supplementary variables.
This approach extends the use of supplementary variables in GDA.

View Slide

References
• [1] Le Roux, Brigitte, & Henry Rouanet. 2010. "Multiple Correspondence
Analysis." Quantitative Applications in the Social Sciences 163. Thousand Oaks,
Calif.: Sage Publications. "Between quantity and quality, there is geometry." (p. 1)
• [2] Tony Bennett, Mike Savage, Elizabeth Silva, Alan Warde, Modesto Gayo-Cal,
and David Wright. "Culture, Class, Distinction." 2009, 2010. Routledge.
• [3] KH Coder, https://khcoder.net/en/
• [4] Following [1] and using the GDAtools package for R: Robette N. (2023),
GDAtools: Geometric Data Analysis in R, version 2.0,
https://nicolas-robette.github.io/GDAtools/
• [5] Kazuo Fujimoto and Kazuya Ohata, "Development of a method for analyzing
participant satisfaction survey data that combines MCA and Aspect Based
Sentiment Analysis." (in Japanese), NLP2023
• https://www.anlp.jp/proceedings/annual_meeting/2023/pdf_dir/Q1-11.pdf
• (English version)
https://419kfj.sakura.ne.jp/db/wp-content/uploads/2023/09/nlp2023−article_01−13v1.1_eng.pdf

View Slide

Software related references
• Higuchi, Koichi 2017 "A Two-Step Approach to Quantitative Content Analysis: KH Coder Tutorial
using Anne of Green Gables (Part II)" Ritsumeikan Social Sciences Review 53(1): 137-147. [PDF
File] https://khcoder.net/en/
• Robette N. (2023), GDAtools: Geometric Data Analysis in R, version 2.0,
https://nicolas-robette.github.io/GDAtools/
• RStudio Team (2020). RStudio: Integrated Development for R. RStudio, PBC, Boston, MA
URL http://www.rstudio.com/.
• R Core Team (2023). _R: A Language and Environment for Statistical Computing_. R Foundation
for Statistical Computing, Vienna, Austria.
• Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A,
Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D,
Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the
tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686.

View Slide

Notice and Apology
• Because of a problem with the presenter's application, permission to reuse
the raw data was not granted. The graphs and other information in the
following report are therefore taken from the report for the Association for
Natural Language Processing in 2023/03, and no new analysis was conducted.
• Referenced reports
• Kazuo Fujimoto and Kazuya Ohata, “Development of a method for analyzing
participant satisfaction survey data that combines MCA and Aspect Based
Sentiment Analysis.”(in Japanese), NLP2023
(https://www.anlp.jp/proceedings/annual_meeting/2023/pdf_dir/Q1-11.pdf)
(English version)

View Slide

Outline of my presentation
• Characteristics of the data (congratulatory response)
• Challenge:
• How can we extract improvement measures and issues when most of the responses are "good"?
• Step 0: Exploratory Data Analysis (EDA), MCA, and basic text mining, run separately.
• Step 1: Focus on free text responses. Linking text mining and MCA.
• Step 2: Focus on the ambiguity of the most frequently used key words and phrases. Adding tags
(positive/negative/none) by machine learning (ABSA: Aspect Based Sentiment Analysis).
• Step 3: Project the tagged words onto the MCA individual map.
• Issue: the individuals who used the important tagged words can be plotted on the whole
individual map, but the amount of tagging depends on the dictionary used by the machine learning.
• Also, the MCA map is very biased to begin with, so we would like to deepen the analysis by
utilizing CSA.

View Slide

Schematic overview of this report
• Projecting tagged extracted words as supplementary variables into the MCA
result space.
• Our trial is an attempt to create supplementary variables by text mining and
machine learning tagging, to plot them in the individual space, and so to
develop another mixed research method.
* Le Roux, Brigitte, & Henry Rouanet. 2010. "Multiple Correspondence
Analysis", chapter 1
[Figure labels: "Famous phrases*"; "MCA and mixed research methods"]

View Slide

Data Structure
ID    Var1  Var2  …  Varn   Open-ended free text answer parts
1     :     :     …  :      …
2     :     :     …  :      …
3
:
m-3
m-2
m-1
m

View Slide

Step 0 MCA and Text Mining Separately
ID    Var1  Var2  …  Vark   Free text parts
1     :     :     …  :      …
2     :     :     …  :      …
3
:
N-3
N-2
N-1
N
• Var1 … Vark: Specific MCA.
• Free text parts: text mining by KH Coder. One variable and its categories can be
plotted in a co-occurrence network and in a CA plot with words.
• The two analyses are run separately; their mutual relations are examined
by KWIC concordance.
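A KWIC (keyword in context) concordance lists every occurrence of a word together with its neighbouring words, which is how the meaning of a frequent word is checked against the original answers. The following is only a minimal sketch of the idea, not KH Coder's implementation: the `kwic` function, the sample answers, and the whitespace-style tokenization are assumptions for illustration (Japanese text would need morphological analysis first).

```python
import re


def kwic(texts, keyword, width=3):
    """Return (doc_id, left, keyword, right) for every hit of `keyword`.

    `width` is the number of context tokens kept on each side.
    """
    hits = []
    for doc_id, text in enumerate(texts):
        # Naive tokenization: lower-case word characters only (assumption).
        tokens = re.findall(r"\w+", text.lower())
        for i, tok in enumerate(tokens):
            if tok == keyword:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                hits.append((doc_id, left, tok, right))
    return hits


# Hypothetical free-text answers, not the survey data.
answers = [
    "The exercise time was too short",
    "I had enough time to ask questions",
    "Good content",
]

for doc_id, left, kw, right in kwic(answers, "time"):
    print(f"{doc_id}: {left:>20} [{kw}] {right}")
```

Reading the two hits side by side shows the same word "time" used once in a complaint and once in praise, which is the kind of ambiguity the later steps address.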

View Slide

Step 1 MCA and Frequent Words as supplementary variables
ID    Free text parts   Word1  Word2  Word3  …  Wordk
1     :                 1      …      …         …
2     :                 0
3                       1
N-3                     0
N-2                     1
N-1                     1
N                       0
• Specific MCA and SDA
• Interpret the words using the KWIC of Step 0
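The Word1 … Wordk columns can be read as 0/1 indicators, and a word used as a supplementary variable can be located in the cloud of individuals at the mean point of the respondents who used it. The sketch below illustrates that reading in plain Python; it is not the GDAtools code used in the study. The coordinates, tokens, and the function names `word_indicators` and `mean_points` are hypothetical.

```python
def word_indicators(token_lists, words):
    """One row per respondent, one 0/1 column per word (1 = word was used)."""
    rows = []
    for tokens in token_lists:
        used = set(tokens)
        rows.append([1 if w in used else 0 for w in words])
    return rows


def mean_points(coords, indicators, words):
    """Mean point (barycenter) of the individuals who used each word."""
    result = {}
    for j, word in enumerate(words):
        selected = [c for c, row in zip(coords, indicators) if row[j] == 1]
        if not selected:
            continue  # nobody used this word, so there is no point to plot
        n_axes = len(selected[0])
        result[word] = tuple(
            sum(c[a] for c in selected) / len(selected) for a in range(n_axes)
        )
    return result


# Hypothetical inputs: individual coordinates on two MCA axes
# (computed elsewhere) and the tokenized free text of four respondents.
coords = [(0.5, 0.25), (-0.5, 0.75), (1.5, -0.25), (-1.5, -0.75)]
tokens = [["time", "exercise"], ["time"], ["exercise", "content"], []]
words = ["time", "exercise", "content"]

X = word_indicators(tokens, words)
points = mean_points(coords, X, words)
print(X)
print(points)
```

A respondent with no free text (the last row) gets all zeros and contributes to no word's mean point, which matches the observation that non-responders cannot inform this step.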

View Slide

Using the KWIC of Step 0
• We found ambiguous meanings among the frequent words.
• So we took another approach, as follows:
• Attach a p or n tag to each word, where p means "positive" and n means "negative".
• This tagging is done with Aspect Based Sentiment Analysis (ABSA).
• After tagging the words, build a data frame of them as supplementary variables.
• Overlay them on the individual space generated by MCA.

View Slide

Step 2 MCA and Tagged Frequent Words as supplementary variables
ID    Free text parts   Word1/p  Word1/n  Word2/p  …  Wordk/n
1     :                 1        …        …           …
2     :                 0
3                       1
N-3                     0
N-2                     1
N-1                     1
N                       0
• Specific MCA and SDA
• Interpret the words using the KWIC of Step 0
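The tagged table differs from the Step 1 table only in that each word is split into a /p and a /n column. A minimal sketch of building those columns follows, assuming the ABSA step yields, per respondent, a list of (word, polarity) pairs. That output format, the sample pairs, and the name `tagged_indicators` are assumptions for illustration, not the format of the tool actually used.

```python
def tagged_indicators(tagged_answers, words, polarities=("p", "n")):
    """Build 0/1 columns named 'word/p' and 'word/n' from (word, tag) pairs.

    Pairs whose tag is not in `polarities` (for example "none") are ignored.
    """
    columns = [f"{w}/{t}" for w in words for t in polarities]
    rows = []
    for pairs in tagged_answers:
        present = {f"{w}/{t}" for w, t in pairs}
        rows.append([1 if col in present else 0 for col in columns])
    return columns, rows


# Hypothetical ABSA output for four respondents.
tagged = [
    [("exercise", "p"), ("time", "n")],
    [("time", "p")],
    [("exercise", "n"), ("exercise", "p")],
    [("content", "none")],
]

columns, rows = tagged_indicators(tagged, ["time", "exercise"])
print(columns)
for row in rows:
    print(row)
```

The third respondent shows why the split matters: the same word can be tagged both ways within one answer, and separate columns keep both facts.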

View Slide

Step 0 and Step 1

View Slide

Characteristics of the data and Challenge
• Characteristics of the data (congratulatory response)
• Responses were selected on a 5-point scale
• Mostly 5 or 4 responses. Average is ….
• The seminar was an information security workshop, and participants were
highly motivated.
• Challenge: How can we extract improvement measures and issues
when most of the responses are "good"?
• Based on these results, if it is sufficient to summarize that the event was a
success, then there is nothing to say.
• However, it is necessary to identify issues that need to be addressed in order to
make the event even better.

View Slide

Step 0: Exploratory Data Analysis (EDA) and
MCA
• Number of respondents: 2001
• Confirmation of the relationship between satisfaction and responses.
• Analysis of the distribution of data by MCA confirms the trend among
dissatisfied respondents.
• Responses that could lead to improvement (free text responses) are not found
in the dissatisfied response group.
• An analysis of the free-response statements of the satisfied respondent group is
needed.

View Slide

Pairs display of Skill improved and Understanding
• A large portion of "understanding" is accounted for by "skills: improved".
• Note: "Don't understand" responses and skill improvement are not related.
• Congratulatory responses
• "That wasn't so bad, was it?" (polite responses)
• Involvement / self-identification confirmation responses
• "As long as you participated, there should be results."
• There are issues to be clarified here.
[Figure: pairs plot of "skills: improved" (very improved, improved, no change, Don't know, NA)
against "understanding" (understand well, understand, Don't understand some, Don't understand many, NA)]

View Slide

These three questions are biased toward positive.
Instructor's explanation and others, focusing on understanding
Seen in this way, responses about "instructor explanation", "support", and
"response" are considered to be uninformative with respect to "understanding".
[Figure panels: Understanding; instructor explanation; support; responses. Axis: ← Positive / Negative →]

View Slide

Step 1: Focus on free text responses.
Linking text mining and MCA
Respondents with extremely low satisfaction did not answer the
open-ended (free text) questions either.
Therefore, their responses cannot be used to explore areas for
improvement in the workshops.

View Slide

Space generation by MCA
(speMCA with only NA excluded)
[Figure: MCA map; annotations: "Completely disagree." and "Clustering of response patterns"]

View Slide

Number of responses and response rate to open-ended
free text questions (Q15-2, Q20, Q22)
• Answered all three questions: 223
(14.8% of the 1506 respondents who answered at least one)
• Reasons for "understand" responses (Q15-2):
• 742+348+87+223=1400
• 70.0% of the 2001 respondents
• Course environment (Q20):
• 73+348+223+9=653
• 32.6%
• Other overall impressions (Q22):
• 24+87+223+9=343
• 17.1%
[Figure: Venn diagram of the three questions: Reasons for "understand"; Course environment; overall impressions]
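The sums on this slide can be rechecked from the Venn regions. The sketch below assumes the region assignment implied by which sums each number appears in (for example, 348 appears in both the Q15-2 and Q20 sums, so it is taken as "Q15-2 and Q20 only"), and takes the 2001 respondents from the Step 0 slide as the denominator. The variable names are illustrative.

```python
# Venn regions, keyed by the set of questions answered (assumed assignment).
regions = {
    frozenset({"Q15-2"}): 742,
    frozenset({"Q15-2", "Q20"}): 348,
    frozenset({"Q15-2", "Q22"}): 87,
    frozenset({"Q15-2", "Q20", "Q22"}): 223,
    frozenset({"Q20"}): 73,
    frozenset({"Q20", "Q22"}): 9,
    frozenset({"Q22"}): 24,
}
N_RESPONDENTS = 2001  # number of respondents reported in Step 0


def answered(question):
    """Number of respondents who answered `question`, alone or with others."""
    return sum(n for qs, n in regions.items() if question in qs)


totals = {q: answered(q) for q in ("Q15-2", "Q20", "Q22")}
rates = {q: round(100 * n / N_RESPONDENTS, 1) for q, n in totals.items()}

# Respondents who answered at least one free-text question.
any_answer = sum(regions.values())
all_three = regions[frozenset({"Q15-2", "Q20", "Q22"})]
all_three_share = round(100 * all_three / any_answer, 1)

print(totals)
print(rates)
print(any_answer, all_three_share)
```

Under these assumptions the per-question rates are relative to all 2001 respondents, while the 14.8% for "all three" is reproduced only when the denominator is the 1506 respondents who answered at least one free-text question.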

View Slide

Step 2

View Slide

Step 2: Focus on ambiguity of frequently
used key words and phrases.
• Tag these (positive/negative/none) by machine learning (ABSA).
• Words with both p/n occurrences
• time, exercise, content, knowledge, terminology, explanation, training, lecture
• Negative words, top 5: ('time', 84), ('exercise', 72), ('content', 52), ('knowledge',
44), ('term', 30)
• Positive words, top 5: ('exercise', 120), ('content', 94), ('explanation', 78),
('training', 49), ('lecture', 37)
• The table on the next page shows the "extracted words" list without
the p/n tag. Frequent words detected by the aspect-based sentiment
analysis are marked in it.
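As a small worked example of why the tags matter, the two top-5 lists above can be compared directly. The sketch uses only the counts shown on this slide; words outside a top-5 list have no published count here, so only the words present in both lists can be compared. The variable names are illustrative.

```python
# Counts copied from the top-5 lists on this slide.
negative_top5 = {"time": 84, "exercise": 72, "content": 52, "knowledge": 44, "term": 30}
positive_top5 = {"exercise": 120, "content": 94, "explanation": 78, "training": 49, "lecture": 37}

# Words that appear in both top-5 lists.
in_both = sorted(set(negative_top5) & set(positive_top5))

# Share of negative tags among the p/n-tagged occurrences of each such word.
negative_share = {
    w: round(negative_top5[w] / (negative_top5[w] + positive_top5[w]), 3)
    for w in in_both
}

print(in_both)
print(negative_share)
```

For both words, more than a third of the tagged occurrences are negative, so treating the untagged word as a single supplementary category would mix two opposite readings.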

View Slide

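Words with a high number of occurrences with
ambiguous usage
• Time
• Exercise
• Content
• Knowledge
• Explanation
• Training
• Specific terms
• Lecture
No.  Extracted word (Term)    Frequency   No.  Extracted word (Term)    Frequency
1    理解 (understanding)      583         21   流れ (flow)               148
2    時間 (time)               516         22   発生 (occurrence)         143
3    思う (think)              488         23   ありがとう (thank you)     141
4    インシデント (incident)    387         24   研修 (training)           135
5    演習 (exercise)           363         25   非常 (very)               134
6    内容 (content)            351         26   勉強 (study)              134
7    対応 (response)           316         27   業務 (work duties)        130
8    知識 (knowledge)          307         28   行う (carry out)          127
9    感じる (feel)             254         29   解析 (analysis)           124
10   ログ (log)                208         30   情報 (information)        124
11   事前 (in advance)         183         31   具体 (concrete)           122
12   実際 (actual)             182         32   用語 (term)               119
13   説明 (explanation)        175         33   難しい (difficult)        107
14   学習 (learning)           171         34   グループ (group)          98
15   部分 (part)               168         35   参加 (participation)      98
16   多い (many)               160         36   分かる (understand)       98
17   もう少し (a little more)   154         37   自分 (oneself)            95
18   受講 (attending)          153         38   良い (good)               94
19   セキュリティ (security)    151         39   講義 (lecture)            93
20   報告 (report)             149         40   必要 (necessary)          93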

View Slide

Response patterns for each question
[Figure panels: Skill improved; Understanding; Explanation of lecturer;
Adequate Speed ?; Supports; Responses to Questions]

View Slide

Step 3: Project the tagged words onto the
MCA individual map.
"Explanation" and "content" are characterized
by negative expressions (successfully separated).

View Slide

Interim Summary and Future Issues
• As indicated above, the results suggest that feeding words extracted from the
free-text responses by text mining into MCA as supplementary variables makes it
possible to analyze the free-text portion together with the categorical
variables.
• The results also suggest that text mining can be used not only to extract words
but also, by tagging them with machine learning, to enable more detailed analysis.
• The key issue to be addressed is whether workshop participants can be
encouraged to respond to the free-text questions.
• Since the distribution of congratulatory responses is highly skewed, we
would like to deepen the analysis by using CSA and other methods.

View Slide

Summary by charts
[Figure: workflow chart]
• Questionnaire (ID, Var1, Var2, …, Varn): MCA, giving the MCA Map.
• Free text answers (open-ended free text answer parts): KH Coder / text mining, giving
the frequency list of words, the KWIC concordance, the Co-Occurrence Network Map,
and the CA Map.
• Step 0: Analysis separately. Analysis of the text by KWIC, referring to the maps.
• Step 1: SDA with supplementary variables.
• Step 2: ABSA / machine learning, then tagged GDA/SDA.

View Slide

Acknowledgments
• This paper would not have been possible without the machine learning
(ABSA) run by Kazuya Ohata, a 2022 RA at NICT; thank you again for
the co-authored paper and poster session presentation at the March
2023 Natural Language Processing Conference NLP2023.
• The presenter's research on multiple correspondence analysis is
also supported by Grant-in-Aid for Scientific Research (KAKENHI),
20K02162 "Research on Categorical Data Analysis Methods Focusing
on Geometric Arrangement of Data". We would like to express our
gratitude for the support.
https://kaken.nii.ac.jp/ja/grant/KAKENHI-PROJECT-20K02162/

View Slide

Thank you for your attention.
Questions and suggestions are
welcome.
[email protected]

View Slide

MEMO

View Slide
