Data Science, Explained

Curious to know exactly what’s happening to improve your Community Voice Word Clouds? Read on!

The Baby with the Bath Water

The biggest challenge in creating a useful word cloud is throwing out the unimportant words without inadvertently throwing out the baby with the bath water. Determining which words are important is both a judgment call and domain-specific. For example, are the words “school” or “student” important? It depends. For our clients (K-12 school districts in the US), words like “school” and “student” dominate the word cloud without adding any real value to the visualization. These words are not your typical unimportant words like “the” or “and” (called “stop words” in the field of NLP) that can easily be filtered out of the text data using predefined lists.
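For the generic stop words, a predefined list really is all it takes. Here is a minimal sketch in Python using scikit-learn’s built-in English list (shown for illustration; our actual pipeline may use a different list):

```python
# Generic stop words can be dropped with a predefined list;
# scikit-learn ships one such English list out of the box.
from sklearn.feature_extraction import text

def drop_stop_words(tokens):
    """Keep only tokens that are not on the generic stop-word list."""
    return [t for t in tokens if t.lower() not in text.ENGLISH_STOP_WORDS]

print(drop_stop_words(["The", "school", "bus", "was", "late", "and", "crowded"]))
# -> ['school', 'bus', 'late', 'crowded']
```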

Fortunately, there is an ancient numerical statistic that can be used to tackle this problem. TF-IDF (Term Frequency – Inverse Document Frequency) has been around for nearly 50 years, which is certainly ancient in the world of AI. It measures the importance of a word in the context of a collection of documents, which is the key. The TF-IDF algorithm allows us to leverage years of historical data to systematically identify and remove the less important words from the text data so that the more meaningful words rise to the surface. We injected our K-12 education expertise to slightly tweak the algorithm and carefully set a threshold that throws out the junk without losing data that could reveal important trends.
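For the curious, here is a minimal sketch of the core idea using scikit-learn. The three toy dialogues and the 1.2 cutoff are invented for illustration; in production the statistic is computed over years of historical dialogues and the threshold is carefully tuned:

```python
# A word that appears in nearly every dialogue (like "school") earns a
# low inverse-document-frequency weight, flagging it as uninformative.
from sklearn.feature_extraction.text import TfidfVectorizer

dialogues = [
    "the school bus was late again",
    "bullying at school must stop",
    "our school needs more crossing guards",
]

vectorizer = TfidfVectorizer()
vectorizer.fit(dialogues)

IDF_CUTOFF = 1.2  # made-up threshold, for illustration only
for word, idf in sorted(zip(vectorizer.get_feature_names_out(),
                            vectorizer.idf_), key=lambda pair: pair[1]):
    status = "drop" if idf < IDF_CUTOFF else "keep"
    print(f"{word:>9}  idf={idf:.2f}  {status}")
# "school" scores idf=1.00 and is dropped; every other word is kept.
```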

These unimportant words are not only domain-specific, however; they are also client-specific. For example, the name of the school district would perpetually litter the word cloud, but that name is unique to each client. Each client also uses the Let’s Talk application a bit differently, and depending on how they use it, the judgment call of whether or not a word is important can change. To generate a meaningful word cloud for each client, we therefore run the TF-IDF algorithm separately for each client, on their unique collection of dialogues.
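Conceptually, the per-client step looks like the sketch below: the same statistic is computed on each client’s own dialogues, so a district’s name surfaces as “junk” for that client without affecting anyone else. (The district names and data here are purely hypothetical.)

```python
# Fit the TF-IDF filter separately on each client's dialogues so that
# client-specific noise (like the district's own name) is dropped.
from sklearn.feature_extraction.text import TfidfVectorizer

clients = {
    "district_a": ["springfield families love the new gym",
                   "springfield bus routes changed again",
                   "thank you springfield teachers"],
    "district_b": ["lakeview lunch menu feedback",
                   "crossing guard needed at lakeview elementary",
                   "lakeview report cards arrived late"],
}

IDF_CUTOFF = 1.2  # illustrative value only

for client, dialogues in clients.items():
    vec = TfidfVectorizer().fit(dialogues)
    dropped = [w for w, idf in zip(vec.get_feature_names_out(), vec.idf_)
               if idf < IDF_CUTOFF]
    print(client, "->", dropped)
# district_a -> ['springfield']
# district_b -> ['lakeview']
```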

Don’t Be a Lemma

To help reconstitute the fragmented data that often results from different forms of the same word (e.g. teach, teaching, teaches, taught), we implemented a lemmatization process. Not to be confused with “stemming,” which simply chops off parts of a word, lemmatization is the process of reducing each word to its dictionary form, or “lemma.” Whereas the stemming process does not necessarily result in an actual recognizable word, the lemma of any given word is always a real word.
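Here is a quick illustration of the difference using NLTK (one possible toolkit, shown for illustration rather than as our exact stack):

```python
# Stemming chops; lemmatization maps each word to a real dictionary form.
import nltk
nltk.download("wordnet", quiet=True)

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["teaching", "taught", "bullying"]:
    print(f"{word:>9}  stem: {stemmer.stem(word):<7} "
          f"lemma: {lemmatizer.lemmatize(word, pos='v')}")
#  teaching  stem: teach   lemma: teach
#    taught  stem: taught  lemma: teach   <- the stemmer misses the irregular form
#  bullying  stem: bulli   lemma: bully   <- "bulli" is not a real word
```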

To be worthwhile, however, the lemmatization process must also consider the context of the word because, you know, sometimes words have two meanings. For instance, the word “writing” can be a noun or a verb. Used as a verb, the lemma is “write.” But used as a noun, the lemmatization process would leave the word “writing” unchanged. Identifying the part of speech of any given word in the dataset requires a separate AI model, one trained for this purpose that considers the surrounding words, punctuation, and so on.
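As a sketch of how this looks in code, here is the “writing” example run through spaCy, whose part-of-speech tagger is one such trained model (we are not claiming it is the exact model in our pipeline):

```python
# The tagger looks at the surrounding words to decide noun vs. verb,
# and the lemma follows from that decision.
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

for sentence in ["She is writing a letter.", "Her writing is beautiful."]:
    doc = nlp(sentence)
    for token in doc:
        if token.text == "writing":
            print(f"{sentence!r}: tagged {token.pos_}, lemma {token.lemma_!r}")
# 'She is writing a letter.': tagged VERB, lemma 'write'
# 'Her writing is beautiful.': tagged NOUN, lemma 'writing'
```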

Once each word is tagged with the appropriate part of speech, we initiate the lemmatization process, which gives us the root form of each word so that words like “bully,” “bullying,” and “bullied” are unified under the common root word “bully,” instead of being scattered in small pieces across different parts of the word cloud. Combined with the filtering of unimportant words, this allows the word cloud to reveal important trends in the data that might otherwise have been missed.

No Time to Get Sentimental

We didn’t stop there. While our existing word cloud is certainly vivid, the rainbow of colors is nothing but color and fury. Instead, we wanted to use the color axis to add a meaningful, intuitive dimension to the word cloud. The updated word cloud offers two views: one version uses color to visualize the average sentiment of the dialogues containing the relevant word/phrase; the other uses color to visualize the average intensity of emotion.

The sentiment view helps to answer the questions: What is the community upset about? What is the community happy about? But as is often the case, the tyranny of averages can rear its ugly head. We quickly saw that some keywords would appear to be neutral in the sentiment word cloud because half the community was happy and half the community was upset. In fact, it is precisely these kinds of divisive issues that educators need to be most aware of. And so, we created the emotional intensity view of the word cloud. Instead of averaging positive and negative sentiment, which can cancel out, we average emotional intensity (regardless of whether the emotion is negative or positive). All of a sudden, different insights were popping out of the word cloud, ones that we hope will be useful to educators.
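A toy example shows why the two views can tell very different stories. Suppose four dialogues mention the same keyword, with sentiment scored on a -1 to +1 scale (numbers invented for illustration):

```python
# Two strongly positive and two strongly negative dialogues about
# the same keyword: the average sentiment cancels to zero, while
# the average intensity reveals a hot-button issue.
sentiments = [0.9, -0.8, 0.85, -0.95]

avg_sentiment = sum(sentiments) / len(sentiments)
avg_intensity = sum(abs(s) for s in sentiments) / len(sentiments)

print(f"average sentiment: {avg_sentiment:+.2f}")  # +0.00 -> looks neutral
print(f"average intensity: {avg_intensity:.2f}")   #  0.88 -> clearly divisive
```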

Not long ago, AI-powered sentiment analysis was somewhat underwhelming, but the field of Natural Language Processing has made some eyebrow-raising advancements in just the last few years. If you haven’t heard about the cutting-edge general language models developed by tech giants like Google and OpenAI (e.g. “BERT” and “GPT-3”) yet, you will soon. (Oh wait, you just did.) Although there are a lot of overblown claims about these general language models, they are also quite impressive and represent major milestones in the field of NLP. They have dramatically improved a computer’s ability to interpret text, which in turn has made sentiment analysis noticeably better. Our new word cloud leverages a customized BERT-based sentiment analysis model to generate sentiment scores for all communications going in and out of our Let’s Talk application. Over time, the model will be progressively tuned to stay in sync with the K-12 education domain specifically, which has its own vocabulary and context.
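To give a flavor of what this looks like, here is a hedged sketch using an off-the-shelf BERT-variant from Hugging Face; our production model is a customized one, so treat the model name below as a stand-in:

```python
# Score the sentiment of short messages with a pretrained
# BERT-family model (a public stand-in for the production model).
from transformers import pipeline

sentiment = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

messages = [
    "The new crossing guard has been wonderful.",
    "The bus was late for the third day in a row.",
]
for message in messages:
    print(message, "->", sentiment(message)[0])
# -> {'label': 'POSITIVE', 'score': 0.99...} / {'label': 'NEGATIVE', ...}
```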

Signal or the Noise?

Educators know first-hand that, sometimes, a small group of people can make a whole lot of noise. The traditional word cloud sums up the occurrences of each word, but to help separate the signal from the noise, our new word cloud counts the number of unique dialogues containing each word or phrase. That way, if one person writes in with a novel-length diatribe about a particular issue, it does not have a disproportionate effect on the word cloud. Brevity is a virtue. The members of the community who are concise and to the point should not have less of a voice than those who send messages with many, many words.
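In code, the change is as simple as counting each word once per dialogue. A minimal sketch with toy data:

```python
# Count distinct dialogues per word instead of raw occurrences,
# so one long diatribe counts the same as one brief mention.
from collections import Counter

dialogues = [
    "bus late " * 20 + "bus late",    # one person's lengthy diatribe
    "the bus was late this morning",  # a second, concise dialogue
    "loved the science fair",
]

raw = Counter(w for d in dialogues for w in d.split())
by_dialogue = Counter(w for d in dialogues for w in set(d.split()))

print("raw occurrences of 'bus':", raw["bus"])            # 22
print("dialogues mentioning 'bus':", by_dialogue["bus"])  # 2
```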