Welcome to the sixth blog of the technology aided gut (TAG) checks series. So far in this series, we have focused on the tools and techniques of a just-in-time learning strategy. We will now switch gears and show how, with very little effort, we can use TAG checks to draw simple yet (occasionally) profound conclusions about data - big and small.
As we delve into the details of TAG checks in the next several blogs, we will be using web programming tools and techniques to gather, process and analyze data. While we will try to be as comprehensive as possible in our explanations, they may not always be as detailed as we would like them to be. This forum, after all, is a blog and not a training tutorial. We hope that by applying the just-in-time learning strategy we have discussed so far in the series, you will be able to supplement what we miss in our explanations. Our goal for the overall series has been to empower you. We hope the first part of the series has made you an empowered self-learner.
The second part of the series will make you an empowered and savvy data consumer, a development professional who can confidently rely on the story the data tells to accomplish her tasks.
For the readers who are just joining in, we suggest that you become somewhat familiar with the just-in-time learning strategy by skimming the series so far.
Define quantifiable gut feelings
We all have gut feelings. Here are a few examples:
Junaid is avoiding me.
Michelle is going to get the promotion.
I am not going to get this job that I applied for.
Rob is a “negative” person.
Rochelle is most definitely the manager’s favorite.
Sina is a polite person.
All of the above assertions are personal. That is the essence of a gut feeling. Also, none of them appears on the surface to be evidence based (hence we use the term “feelings”). However, if we dig deeper, we will see that there are nuggets of evidence that support the assertions. For example:
For the first assertion above, the person may think Junaid is avoiding her because
Junaid is a colleague who is usually very prompt at responding to emails but has not been so recently.
He is not on a mission.
Other people have received responses from Junaid.
For the fourth assertion above, Rob may be a “negative” person because
His email responses always start with what he cannot do for you.
His email response time is consistently long for a colleague who is junior.
His email response time is consistently short for a colleague who is senior.
(We leave the others as an exercise for our readers)
As one can quickly glean from the above, the evidence is not enough to support the assertions beyond reasonable doubt, but it is strong enough that the assertions should not be dismissed as mere feelings. These are the areas where we will do the TAG checks. We call our approach the art of unscience. One note: unscience (at least about emails) is supported by hard science. For example, here's a great paper on how email usage reflects attentional differences due both to personal propensities and to work demands and relationships. In the process, the paper also looks at features of email messages that influenced attention to the message. For example, messages with social content were significantly more likely to receive an immediate response, even though respondents rated them as less important than work-related messages.
We now take the example of “negative Rob” further. In order to do our TAG check on Rob, we first need to define negativity so that we can count the evidence in some form. We can define a negativity score for his emails as follows:
If an email contains more than 60% negative terms in the first 20% of the content, increase the negativity score.
If the average email response time to junior colleagues is 60% higher than that to senior colleagues, increase the negativity score.
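To make the two rules concrete, here is a minimal Python sketch. It is only an illustration under our own assumptions: the word list NEGATIVE_TERMS, the function names, and the response times in hours are all hypothetical, and we read the first rule as "more than 60% of the email's negative terms fall in the first 20% of its words," which is one of several possible readings.

```python
import re
from statistics import mean

# Hypothetical word list; a real check would need a carefully chosen lexicon.
NEGATIVE_TERMS = {"cannot", "can't", "unable", "no", "not", "never", "won't"}

def opening_is_negative(body, opening_share=0.2, threshold=0.6):
    """Rule 1 (our reading): more than `threshold` of the email's negative
    terms appear within the first `opening_share` of its words."""
    words = re.findall(r"[a-z']+", body.lower())
    hits = [i for i, word in enumerate(words) if word in NEGATIVE_TERMS]
    if not hits:
        return False
    cutoff = len(words) * opening_share
    early = sum(1 for i in hits if i < cutoff)
    return early / len(hits) > threshold

def slower_for_juniors(junior_hours, senior_hours, margin=0.6):
    """Rule 2: the average response time to junior colleagues is at least
    `margin` (60%) higher than the average to senior colleagues."""
    if not junior_hours or not senior_hours:
        return False
    return mean(junior_hours) >= (1 + margin) * mean(senior_hours)

def negativity_score(bodies, junior_hours, senior_hours):
    """Add one point per email that opens negatively, plus one point if
    juniors wait noticeably longer than seniors for a response."""
    score = sum(1 for body in bodies if opening_is_negative(body))
    if slower_for_juniors(junior_hours, senior_hours):
        score += 1
    return score
```

The thresholds are parameters on purpose: what counts as "negative" is a judgment call, and a TAG check should make that call easy to see and easy to change.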
The example is intentionally provocative. There will be several organizational and legal issues preventing this type of analysis from ever happening in an organization. Rob will not like it. His bosses will fear it. While there can be a philosophical debate about whether this type of analysis should be not only allowed but embraced by organizations that pride themselves on being transparent, for our purposes we will choose unscience that is far less controversial and much simpler to define and count.
Count and Show (and Tell)
There is no shortage of sophisticated statistical tools for data analysis and understanding. Their interfaces make them very easy to use. Their outputs are aesthetically appealing. However, the output is hard to interpret accurately without a sound statistics background.
We want TAG checks to be simple. So we go back to one of the fundamental mechanics of data inspection: counting frequencies. The frequency of occurrences of a pattern in the data can give us useful insights. The patterns do not even have to be complex to be meaningful. For example, if your data is in the form of natural language documents, then by simply counting the words and looking at their frequencies you can quickly assess what the emerging theme is and how best to categorize the document. If it is someone else's document, you can find out its theme without reading it thoroughly.
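Counting words takes only a few lines of code. The sketch below is a minimal Python illustration, not the tool we used; the function name word_frequencies and the tiny STOP_WORDS list are our own assumptions, and a real check would use a fuller stop-word list.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; common words like these say little
# about a document's theme, so we leave them out of the count.
STOP_WORDS = {"a", "an", "and", "as", "be", "is", "it", "of", "the", "to", "we"}

def word_frequencies(text, top=10):
    """Return the `top` most frequent words in `text` as (word, count) pairs."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return counts.most_common(top)

sample = "We want TAG checks to be simple. TAG checks count patterns. Counting is simple."
for word, count in word_frequencies(sample, top=3):
    print(word, count)
```

Note that "count" and "counting" are tallied separately here. Whether to merge such variants is part of defining the pattern, and it is a choice you make, not the tool.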
After computing the frequencies we can display them as tag clouds. Tag clouds are a simple and efficient way to visually highlight dominating data patterns. For example, here is a word frequency distribution and a tag cloud of one of our blogs.
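The idea behind a tag cloud is simply to let the font size follow the frequency. Here is a minimal Python sketch that turns word counts into HTML; the function name tag_cloud_html, the pixel range, and the linear scaling are our own assumptions for illustration, and real tag cloud tools typically add layout and color on top of this.

```python
from html import escape

def tag_cloud_html(frequencies, min_px=12, max_px=48):
    """Build a bare-bones tag cloud: one <span> per word, with the font size
    scaled linearly between `min_px` and `max_px` according to its count."""
    if not frequencies:
        return ""
    low = min(frequencies.values())
    high = max(frequencies.values())
    spread = (high - low) or 1  # avoid dividing by zero when all counts match
    spans = []
    for word in sorted(frequencies):
        size = min_px + (frequencies[word] - low) * (max_px - min_px) / spread
        spans.append(
            f'<span style="font-size:{round(size)}px">{escape(word)}</span>'
        )
    return " ".join(spans)

print(tag_cloud_html({"data": 5, "gut": 1, "tag": 3}))
```

Saving the printed output in an .html file and opening it in a browser shows the most frequent word in the largest type.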
TAG check in Action
During the plagiarism controversy surrounding Melania Trump’s speech at the US Republican Party Convention earlier this year, one of the authors of this blog analyzed the frequencies of specific words in the convention speeches of Laura Bush (2000), Cindy McCain and Michelle Obama (2008), and Melania Trump (2016) to get a rough idea of what our potential first ladies say in their first national “job interviews.” Despite their diverse backgrounds and life experiences, when it came to the convention speeches the words (and by extension the themes) were very homogeneous. Democrats and Republicans both like to believe that their candidates and their wives are different from the other side, but in reality the country expects the first wives to be the same, and that’s what the wives end up projecting.
Of course the patterns we want to count can be more complex. But as long as they are well defined, they can be programmatically detected and counted. Here’s an example of a more complex analysis.
Advanced TAG checks
During the early stage of the 2016 US election cycle, when the campaigns were focused on (at least the projection of) transparency, policy robustness, and governance competency, a candidate, Jeb Bush, voluntarily released his official emails. Collectively, those emails captured the thoughts and actions of a man who at that point had every chance to become the next US President. Naturally, we wanted to peek into Governor Jeb Bush’s mind or rather analyze his 3.4 GB email text (600 times bigger than the Bible!). Specifically, we wanted to find out what/who influences him the most (here is an article with the gory details including some of the code).
In general, the people you respond to more are the people who are “close” to you, and those close people most likely have some influence over your actions. Since emails formed the basis of this unscience, we needed to define them precisely as well. For our analysis we clumped all the parts of an email (forwards, replies, history, etc.) together and considered them as one email. This will result in double counting in many cases, but since we only looked at the flow of the emails and not their content for our analysis, this redundancy did not matter.
We counted the people and the companies that had the most email exchanges with the Governor, specifically the ones the Governor wrote to. From those counts we came up with the “Jeb Influencers”.
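Counting whom someone writes to is again a frequency count, this time over email headers instead of words. The Python sketch below is our own minimal illustration and not the code from the analysis; the function name top_recipients is hypothetical, and it assumes each email is available as raw text with standard From, To and Cc headers.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses

def top_recipients(raw_messages, sender, top=10):
    """Count how often `sender` wrote to each address (To and Cc headers)
    and return the `top` most frequent recipients as (address, count) pairs."""
    counts = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        senders = [addr.lower() for _, addr in getaddresses(msg.get_all("From", []))]
        if sender.lower() not in senders:
            continue  # only count emails the sender wrote, not ones received
        recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
        counts.update(addr.lower() for _, addr in recipients if addr)
    return counts.most_common(top)
```

Because only the headers are read, quoted replies and forwarded history inside the body do not change the count. Grouping addresses by domain (the part after the @) would give a rough count per company.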
Before we say goodbye - till the next Blog
In this blog we went into the details of TAG checks. We used two examples from US politics. The mechanics of TAG checking involve:
Defining a pattern
Looking for that pattern in the data and counting the frequency of its occurrence
Displaying the frequencies using a tag cloud
If words are enough for you then ...
Here are some tools
If you are interested in generating tag clouds of your own, here is a web app that can generate tag clouds from PDFs, web pages, text files, Google Docs and even images (for Docs, privacy should be set to public). It allows you to filter words and download the cloud. It works best if you paste the text or the link of a Google document.
If you are a Google Docs user, there is a great Google Docs add-on to generate tag clouds.
(Full disclosure: both were written by the authors and both are free.)
If you need/want to define your own patterns ...
In the next few blogs we will teach you how to write your own TAG checkers. It will involve a bit of programming, but we will make it easy. For now, if you are interested, make sure you have a Gmail account. That’s it!
Parting notes - (mostly for the benefit of the now angry data scientist readers…)
TAG checking is not data science. We did not use advanced data analysis or text mining AI tools to look at the data and then tell you a story or give you our insights. We believe we do not need to. You are very much capable of getting your own story from the data, if we can just summarize it nicely for you.