Can deep NLP help solve the misinformation problem?

Michael Mahoney
11 min read · Apr 15, 2021

Introduction:

In short, I believe it can, but more work needs to be done. Let me explain.

I recently delved into the world of deep NLP in an attempt to glean insight into what I consider to be one of the largest problems facing modern society: misinformation. I would love to explore the colorful history of misinformation in human society — as it has always been around — but instead, I will leave you with a statement. The rest is up to you.

Misinformation has been a tool used in human history for as long as it’s been written down.

So if it’s always been around, what makes it so dangerous in modern times? I don’t have an academic answer for you. As research continues to develop on this topic over the next half-decade, I’m sure terms like “social media”, “machine learning”, “NLP”, and “propaganda” will end up as key terms.

From where I stand, modern misinformation has the ability to subvert the truth in ways that spur people to action in a mass-produced fashion that is both incredible in scale and of increasing potency.

There are a lot of terms here but the one I want to focus on is potency. Misinformation coming from the village idiot doesn’t worm its way into the hearts of passers-by. But what about misinformation coming from your neighbor? Your friends? Your parents? Your kids? We lose clarity of truth when falsehoods are perpetuated by those we trust.

So what do we do about it?

The first thing to understand is that the solution isn’t going to be easy. Like all great undertakings, any true solution is going to be a multi-faceted approach with wide-ranging societal implications. But, I believe a crucial component of any attempt at solving the problem at large begins with the ability to tell the difference between true and false.

This brings us to the main question of this article. Can neural network modeling help us tell the difference between what is true and false? The remainder of this paper will be focused on my attempt to explore this question. The data I used is largely focused on the political sphere. All the data used was pulled from Politifact.com. There are a couple of reasons I chose this site.

  1. Politifact reviews short statements: the average statement is 18 words.
  2. Politifact writes short articles reviewing every quote, exploring the truth of each statement with a reasonable degree of nuance.
  3. They have a bank of around 19,000 reviewed quotes.
  4. Their site was easy to scrape (a rough sketch of the scraping step follows this list).
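
For the curious, the scraping step looks roughly like the sketch below. The listing URL and CSS selector are illustrative placeholders rather than a verified map of Politifact’s markup; the actual scraper lives in the repo linked later in this post.

import requests
from bs4 import BeautifulSoup

# placeholder URL and selector; adjust to the site's real structure
page = requests.get('https://www.politifact.com/factchecks/list/?page=1')
soup = BeautifulSoup(page.text, 'html.parser')
statements = [tag.get_text(strip=True) for tag in soup.select('.m-statement__quote')]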

Data Overview

On to the data. In order to keep this article a reasonable length, we’re going to do this lightning round style.

If you want to see the entire project here’s the repo:

https://github.com/minthammock/cap-stone

There’s also an app where you can interact with the data.

https://capstone-dash-app.herokuapp.com

Part of the reason I chose Politifact was the granular scale of truth values. They rate statements in six different categories:

  1. True
  2. Mostly-True
  3. Half-True
  4. Barely-True
  5. False
  6. Pants-Fire (false statements that are outlandish for one reason or another)

Our data distribution is the following:

To handle this imbalance we use scikit-learn’s class-weight utility. With only 19,000 entries in an NLP task, I opted not to undersample. I also chose not to oversample: given the sparse nature of NLP tasks, I feared certain words would become massively over-represented, which would lead to overfitting.
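
Here’s a minimal sketch of that class-weighting step, assuming a labels array with one Politifact rating per quote (the variable names are placeholders, not the exact code from the repo).

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# labels: one truth rating per quote, e.g. df['truth_value'].values
classes = np.unique(labels)
weights = compute_class_weight('balanced', classes=classes, y=labels)
# for Keras, key the weights by the integer class indices (in the order of `classes`)
class_weights = {i: w for i, w in enumerate(weights)}
# then pass model.fit(..., class_weight=class_weights)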

Here’s a nice sample of our corpus. The statements have been tokenized and lemmatized at this point.

The political nature of our corpus is immediately clear, as is the largest word, “say.” If you care to take a stroll down Politifact.com way, you’ll see why this is the case: hearsay is a popular trend in national and local politics.
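
If you want to draw a similar word cloud yourself, here’s a rough sketch using the wordcloud package, assuming all_tokens is a list of lemmatized token lists such as the “all” entry returned by the lemmatize() function below.

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# all_tokens: a list of lemmatized token lists, e.g. tokenPartition['all']
text = ' '.join(word for tokens in all_tokens for word in tokens)
cloud = WordCloud(width=800, height=400, background_color='white').generate(text)
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()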

If you want to reproduce this process, take the two functions below and have a look at the docstrings. The second returns a dictionary that acts as a partition over the column of your choosing, plus an “all” key containing the lemmatized tokens (with stopwords removed) for the entire input df.

The first function is a helper function that creates a partition.

import string

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize


def data_partition(df, partitionColumn):
    '''
    This function partitions a pd.DataFrame object by a given column. It should
    only be used on categorical columns but will work in any case.

    Parameters:
        df - pd.DataFrame. The dataframe object you wish to partition.
        partitionColumn - string. The string name of the column you wish to
            partition the df by.

    Returns:
        partition - dictionary. The keys are the unique values of the partition
            column. The values are the DataFrame entries that correspond to that
            value of the partition column. This performs a df.loc operation.
    '''
    # find all values of the partition column
    uniqueValues = df[partitionColumn].unique()
    # create the partition dict
    partition = {}
    # loop through the partition column values and slice the df with each value
    for value in uniqueValues:
        part = df.loc[df[partitionColumn] == value]
        partition[value] = part
    return partition


def lemmatize(df, textColumn, partitionColumn, stopwordsList=None):
    '''
    This function partitions a dataframe that includes a text corpus and returns
    a dictionary of the tokenized and lemmatized text sequences where the keys
    are the unique entries of the partition. The lemmatization is done using
    NLTK and the WordNetLemmatizer class.

    Parameters:
        df - pd.DataFrame. The dataFrame that houses the corpus and any
            other supporting columns of information.
        textColumn - string. The string name of the column which holds the
            sequences of corpus text.
        partitionColumn - string. The string name of the target value column.
            This column will be used to create the partition for the return
            dictionary.
        stopwordsList - list. The list of values which will be filtered out of
            the considered tokens of the text sequences. Default is None.

    Returns:
        tokenPartition - dictionary. A dictionary where the keys are the unique
            values of the partition column (plus 'all' for the full corpus) and
            the values are the tokenized and lemmatized text sequences.
    '''
    # treat a missing stopwords list as an empty one
    if stopwordsList is None:
        stopwordsList = []
    # run the data_partition function. See its docstring for more info
    partitionList = data_partition(df, partitionColumn)
    # create housing dictionary for the return partition
    tokenPartition = {}
    # loop through the partition, remove stopwords and lemmatize remaining tokens
    for truthValue, part in partitionList.items():
        lemmatizer = WordNetLemmatizer()
        # create tokens with the NLTK word_tokenize function
        part_tokens = part[textColumn].map(word_tokenize)
        part_tokens_lemmatized = []
        for text in part_tokens:
            temp = []
            for word in text:
                # clean the tokens of whitespace and quotation marks
                word = word.strip(string.punctuation + ' ' + "'" + '"')
                # drop stopwords and punctuation tokens that survived the strip cleaning
                if word not in string.punctuation and word not in ['.', ',', "'", '"', 's'] and word not in stopwordsList:
                    temp.append(lemmatizer.lemmatize(word.lower()))
            part_tokens_lemmatized.append(temp)
        # serialize the entry in the partition column
        tokenPartition[truthValue] = part_tokens_lemmatized
    # run the same process as before but over the entire dataFrame
    lemmatizer = WordNetLemmatizer()
    all_tokens = df[textColumn].map(word_tokenize)
    all_tokens_lemmatized = []
    for text in all_tokens:
        temp = []
        for word in text:
            word = word.strip(string.punctuation + ' ')
            if word not in string.punctuation and word not in ['.', ',', "'", '"', 's'] and word not in stopwordsList:
                temp.append(lemmatizer.lemmatize(word.lower()))
        all_tokens_lemmatized.append(temp)
    # serialize the entire df as 'all'
    tokenPartition['all'] = all_tokens_lemmatized
    return tokenPartition
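
As a quick usage example (the column names here are hypothetical; substitute whatever your scraped DataFrame actually uses):

from nltk.corpus import stopwords

tokenPartition = lemmatize(df, 'statement', 'truth_value',
                           stopwordsList=stopwords.words('english'))
tokenPartition['all']      # lemmatized tokens for the whole corpus
tokenPartition['false']    # lemmatized tokens for quotes rated "false"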

Brief NLP Discussion

With a high-level overview of our quotes corpus in hand, let’s dig a little deeper. In the full write-up notebook I explore several NLP models to understand more about our data.

  1. Bi-Gram Investigations
  2. Word2Vec Models
  3. LIME Explanations
  4. Latent Dirichlet Models

While interesting, I will leave the bottom three out of this blog for brevity. You can check out the app or the full write-up for more info on these models.

The Bi-grams give considerable insight all by themselves. Here are the top 25 from the corpus at large.

  1. health care
  2. united states
  3. donald trump
  4. barack obama
  5. hillary clinton
  6. president barack
  7. president obama
  8. social security
  9. says president
  10. new york
  11. scott walker
  12. joe biden
  13. says donald
  14. last year
  15. health insurance
  16. tax cut
  17. mitt romney
  18. photos show
  19. obama administration
  20. illegal immigrant
  21. supreme court
  22. says u.s
  23. income tax
  24. says hillary
  25. new jersey

The pairs fall into three general categories: social issues, people, and hearsay mentions. Rather remarkably, people take center stage more often than political issues. While health care is the top pair, it is more the exception than the rule. President Obama, President Trump, and Secretary Clinton hold 3 of the top 5 and 5 of the top 10 spots.

As you might be able to guess, there’s somewhat of a correlation between individual people — and mentions of them — appearing in quotes and the underlying truth value of said quotes. Take a look.

Hearsay mentions and tagging individuals appear much more commonly in the less truthful categories. Oddly enough, when inspecting the rankings, social issues don’t show a distinct stratification, and quotes mentioning them are more or less equally spread across all truth values.

The images show the top 25 word pairs in each truth category, but in my research, I inspected the top 50 word pairs and the conclusions drawn here hold farther down the ladder.

So it would seem that we have identified something that might set apart our less-truthful quotes, right? The bad news is that the sheer volume of mentions of highly visible political figures pollutes all truth categories, making specific classification of any given quote difficult.
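
For reference, bigram counts like the ones above can be produced with a few lines. This is a simple sketch rather than the exact code from the repo, and it assumes the token lists returned by lemmatize().

from collections import Counter

def top_bigrams(token_lists, n=25):
    '''Count adjacent word pairs across a list of token lists.'''
    counts = Counter()
    for tokens in token_lists:
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(n)

# e.g. the corpus-wide pairs, using the 'all' key returned by lemmatize()
top_bigrams(tokenPartition['all'])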

Modeling

This brings us to the modeling portion of the project. I’m not going to lie, the modeling didn’t turn up the results I had hoped for when beginning this project, but that doesn’t mean we finished empty-handed.

The notebook for the project includes the full CRISP-DM approach and an iterative progression through various modeling techniques. With various approaches to text vectorization, I tried the following classes of models:

  1. Dummy Classifier from scikit-learn (this was the baseline)
  2. Naïve Bayes
  3. Random Forest
  4. Bi-directional LSTM
  5. Multi-head Attention

The winner was the Bi-directional LSTM with basic index-based vectorization and a 150-dimension embedding layer — see the notebook for all the nitty-gritty details.
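
For the curious, a minimal Keras sketch of that architecture might look like the following. The vocabulary size, sequence length, and LSTM width are assumptions; only the 150-dimension embedding and the six output labels come from the project itself.

from tensorflow.keras import layers, models

VOCAB_SIZE = 7000   # assumption: roughly the number of unique lemmatized tokens
MAX_LEN = 30        # assumption: statements average about 18 words

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 150, input_length=MAX_LEN),
    layers.Bidirectional(layers.LSTM(64)),   # LSTM width is an assumption
    layers.Dense(6, activation='softmax'),   # one output per truth category
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])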

Here is a confusion matrix from one of the winning model’s training cycles.

The overall accuracy ended up just shy of 28%. For reference, the dummy classifier scored 17%. Mathematically, there’s something to be said for scoring better than the baseline. Personally, I find the result disappointing, albeit not that surprising. For those of you interested in the NLP world, it turns out that 19,000 samples are woefully inadequate for most text analysis, this one included. This is mostly due to the incredibly sparse nature of text representations. What do I mean? In this example, there were just under 7,000 unique words after removing stopwords and lemmatizing the rest. Since each word gets its own unique vector representation for the modeling process, to use tabular data as an analogy, our data would have 7,000 columns. That’s a large vector space. Couple this with a limited representation of any given word across the entire corpus and you are left with model decision boundaries that are sometimes sensitive to the 30th decimal place! The overall result is a razor-thin difference between the predicted probabilities for the classes, and thus, high model confusion. Looking above, this can be seen in the low percentages in most decision squares and in the existence of a catch-all prediction — the “barely-true” label in this case.

I can hear some of you asking, “What about the ‘pants-fire’ case?!”

Ah yes, this is the silver lining of the project. The best model — and most of the models I considered — was able to pick out the “pants-fire” column with, dare I say, reasonable accuracy? It would seem that even with our model’s high confusion, there is something about outlandish lies that sets them apart from the other truth categories in a way that is conducive to being classified.

Something I would also like to point out, because it gives me hope, is the set of underlying logical associations the model creates between our truth categories. When the model guesses incorrectly, it tends to do so in a way such that its guesses are what I would consider the closest relatives of the correct truth value.

Looking at the “true” column, the two most guessed categories are the “true” and “mostly-true” labels. Conversely, the two least guessed labels are the “pants-fire” and “false” labels. This is an important distinction! Yes, the model isn’t classifying true quotes with production-worthy accuracy, but the numbers from the confusion matrix do suggest that, on some level, the model can tell the difference between true and false. What I hope to achieve going forward from this post is to widen the decision boundaries so our models can capitalize on the learning demonstrated in the confusion matrix.

This won’t be an easy task. I imagine more data, better preprocessing and new modeling approaches will be needed to make this goal a reality.

I’ll end this blog by showing some of the label collapsing that was considered. Ultimately, I would love for a model to be able to classify statements with a high level of nuance. At this point, however, the performance isn’t good enough to rule out collapsing our truth labels in hopes that the model performs better with fewer choices.

It did…kinda.

In general, I conclude that the model isn’t worthy of a true production environment even after collapsing labels. This exercise did confirm my suspicion that the “false” labels are learned more fundamentally than both the “true” and grey-area labels (everything that’s not true or false). This can be seen in the difference between the final two images: the false category in the second-to-last image has higher precision than the true category in the last image. Furthermore, 75% recall on the false classes isn’t too shabby considering the considerable imbalance that comes from combining labels (I did recalculate the class weights for each scenario).
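
For concreteness, one plausible version of the collapse is just a label mapping like the sketch below; the exact groupings used in the notebook may differ, and the column names are placeholders.

# hypothetical mapping into true / false / grey-area buckets
collapse_map = {
    'true': 'true',
    'mostly-true': 'grey-area',
    'half-true': 'grey-area',
    'barely-true': 'grey-area',
    'false': 'false',
    'pants-fire': 'false',
}
df['collapsed_label'] = df['truth_value'].map(collapse_map)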

Conclusion:

I’ll make this short and sweet as I’m sure you’re sick of reading by this point (because I’m sick of writing).

  1. Our data set was scraped from Politifact.com and is highly political in nature. It includes roughly 19,000 quotes of various truth values.
  2. From our dataset, we’ve seen that political figures are more likely to be associated with the “pants-fire”, “false”, and “barely-true” labels, whereas social issues are more or less uniform across all truth categories.
  3. Neural networks do appear to be able to create logical associations between the truth labels. Our model is much more likely to mislabel a “true” quote as “mostly-true” than it is to mislabel it as “pants-fire” or “false.”
  4. False quotes appear to be easier to classify than true quotes, which in turn are easier to classify than the “barely-true”, “half-true”, and “mostly-true” categories.
  5. Currently, no model in my research is what I would consider production ready. But! I think better performance is very possible with more data and more powerful models!

