What if your computer could read?

Today we probably type more than we talk, and an enormous amount of text has been generated every single minute since the digital era began. Data is the new oil (and even more valuable for those who know how to blend it). However, the human brain is not good at reading large amounts of text. Our eyes have a limited field of view, which prevents a typical person from reading more than 200–300 words per minute. According to “The Guinness World Record Book”, Maria Teresa Calderon can read more than 50,000 words per minute, about 166 times faster than the rest of us (Maria could finish reading this blog by the time we reach the end of the first two lines of this paragraph). Wouldn't it be great if all of us could beat Maria at reading with a few lines of Python code? The answer is yes! It would be great, and it is possible with some help from the free software fastText and a little understanding of Python scripts.

The first task is basic Natural Language Processing (NLP): we apply a technique called “text classification”, in which every line of text is categorised into one of a set of assigned groups. For example, the banking industry offers countless financial services that require written statements from customers. A text classification module (a classifier) can read those statements and categorise each of them into one of the bank's service categories. However, machines need to be taught before they can understand human language.

How do you start training your machine? In this session, we provide a quick-and-dirty walkthrough written in Python. First, open your Python shell and install the following dependencies: pandas, fastText, and scikit-learn.

import pandas as pd
path_to_file = './data/consumer_complaints.csv'
df = pd.read_csv(path_to_file, usecols=[1,5], \
    dtype={'consumer_complaint_narrative': object})

The US Consumer Finance Complaints data frame (df) contains two important columns. We explore [product], which represents the available bank services, and teach our machine to learn from customers' complaints using [consumer_complaint_narrative]. Before going too deep into this data frame, we should know how many products are mentioned in the complaints and how many complaints there are for each product.

from collections import Counter
# count how many complaints mention each product
counts = Counter(df['product'])
# Counter({'Debt collection': 17552, 'Mortgage': 14919, 'Credit reporting': 12526, 'Credit card': 7929, 'Bank account or service': 5711, 'Consumer Loan': 3678, 'Student loan': 2128,
#          'Prepaid card': 861, 'Payday loan': 726, 'Money transfers': 666, 'Other financial service': 110})
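Before training, it can help to turn these counts into shares, because the classes are far from balanced. The sketch below re-enters the counts printed above by hand so that it runs without the CSV file. The variable names and the "always guess the largest class" baseline are illustrative additions, not part of the original walkthrough.

```python
from collections import Counter

# Counts copied from the Counter output above.
counts = Counter({
    'Debt collection': 17552, 'Mortgage': 14919, 'Credit reporting': 12526,
    'Credit card': 7929, 'Bank account or service': 5711, 'Consumer Loan': 3678,
    'Student loan': 2128, 'Prepaid card': 861, 'Payday loan': 726,
    'Money transfers': 666, 'Other financial service': 110,
})

total = sum(counts.values())
shares = {product: n / total for product, n in counts.items()}

# One row per product: name, number of complaints, share of all complaints.
for product, n in counts.most_common():
    print(f'{product:<26}{n:>7}{shares[product]:>8.1%}')

# A classifier that always answers with the largest class would be right
# this often on data with the same class mix.
top_product, top_count = counts.most_common(1)[0]
baseline_accuracy = top_count / total
print(total, top_product, round(baseline_accuracy, 3))
```

Any model worth keeping should beat that baseline by a wide margin.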

The following are examples of “Debt collection” complaints from customers.

for complaint in df[df['product'] == 'Debt collection'].consumer_complaint_narrative.sample(3, random_state=1):
    print(complaint)
    print('_' * 40)  # separator between complaints
# Had 4 phone calls in one day to my cell phone about debt collecting.
# They are asking to talk to a XXXX XXXX ... ... Not me ... .Never heard of him. They got the wrong number! I keep explaining to them you got the wrong number and they get very rude!
# ________________________________________
# My sister provided Hyundai Motor Finance my phone # while hers was not working. I received a call from their XXXX number and when advised my sister was not available and asked who         was calling. Female declined to identify herself or her company. I advised that the cell phone being called belongs to me and they no longer have my permission to dial my number manually or via their automated dialer. Female then hung up on me. My sister took care of the past due payment ( was just an oversight ) and we assumed everything was good. payment rec by HMF on XXXX/XXXX/15. On XXXX/XXXX/15 I recevied another call from HMF. I had my sister call back and they advised her account current and no record of call. Advised could furnish proof of call and requested again that they no longer call me.Rep said nothing he can do. Sister requested supervisor who told her the only way they would guarantee no calls to my phone is if she revokes her permission for ANY phone contact. My sister said she still wants to be contacted, just not at my phone number. Supervisor said nothing he can do. Also said that there were no automated calls out and that the only way she would have been dialed after she revoked permission, would be a manual dial. I called back in and advised that it is my cell and I revoke authorization, that they absolutely were not to manually/auto call my number any further. Ultimately they were calling out to an unauthorized party about a current account ... manually. Intentional harrassment. I received another call today from the XXXX number and a 28 second long dead air voicemail. Sister called back in and of course they had no record of call and were unhelpful again. I called back in, they said i had to get in touch with consumer affair department and because I 'm not a customer, just an innocent third party that they are harrassing that I would have to communicate with them via mail. 
# Finally agreed to send me over via phone so solution could be expedited, but they kept just transferring me back to customer service. Eventually male rep told me that there was nothing I could do about the phone calls.
# ________________________________________
# This item claimed by Action Collections on behalf of Benchmark Apartments from XX/XX/XXXX in the amount of {$2100.00} ( originally was {$1200.00} until I filed a dispute then was increased as of XX/XX/XXXX to {$2100.00} ) is inaccurate because the Benchmark Apartments charged me for inaccurate charges, falsified an important document and then immediately sent me to Action Collections with intentions of receiving illegitimate payment, whilst destroying my credit. This has went on for 7 years and has prevented me from having perfect credit with this one exception to my credit report. I worked with the XXXX as a mediator between myself and the XXXX as well spoke with Attorney General in XXXX, Id whom both agreed this collection was incredulous and a mistake. The XXXX were willing to bring amount owed to {$300.00} through XXXX, however Action Collections still had the collection and were completely unwilling to work with me. Action Collections shredded all the documents I submitted one which was an original move in/out document, which Benchmark falsified and I tried informing Action Collections they forced my signature and I had the original document, they asked me to send in all my receipts and that document. I made copies of everything but submitted in hopes they would see that XXXX never had intentions of performing a move out walk through to verify I had cleaned the apartment and previous damage to apartment before I moved in. XXXX went and had their own cleaning crew come in and then sent me an invoice for cleaning charges, inaccurate rent amount and miscellaneous charges that were never agreed to in contract. I had receipts to show I had cleaned the apartment in hopes of getting my deposit returned and they never met me after moving out. They immediately sent me to collections when I questioned the invoice that was sent. I am requesting the item be removed completely and as soon as possible from my credit report because it is affecting my life negatively. 
# Enclosed are the following attachments ( exhibits ) : ( Exhibit XXXX ) copies of receipts from cleaning services I paid for per my rental contract when moving out the XXXX apartments. I made several attempts to check out with a manager by calling and going down to office but was never returned any phone calls. Finally, I was called back by someone in their office only to obtain my address and was informed that the apartment had been rented. In addition was notified the apartment had been rented. This was important because under the condition I have professional cleaning services done, have proper walk out to show they were done and provide receipts, if the apartment rented prior to end of month I would not be responsible for the {$650.00} rent and I would receive my deposit back of {$300.00}. Instead of my deposit I received a lengthy itemized invoice for inaccurate charges and charging me rent, which was not the correct rent amount in the agreement, ( Exhibit XXXX ). The rent amount in invoice was inaccurate from rental contract ( Exhibit XXXX ). There were also charges for professional services I had already had paid for to have cleaned. I demanded to know why these charges were invoiced to me but never heard back from anyone and then after two months of no answer and or hanging up on me, I received a notice from Action Collections that I was to pay them and my account was turned over to them. I pleaded with Action Collections as well contacted XXXX because I felt I was handled wrongly and never had a chance to hear from Benchmark ( Exhibits XXXX, XXXX, XXXX ). In addition, XXXX XXXX sent a falsified move-in sheet ( Exhibit XXXX ) with my name at top and the signing manager at bot
# ________________________________________

Now comes the modelling part. First, the data is partitioned into two sets: a training set and a test set. The training set is used to teach the machine to understand human language, and the test set helps us measure how smart the machine is.

from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=.2, random_state=9)
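To make the split less of a black box, here is a rough standard-library sketch of a shuffled hold-out split. It is not scikit-learn's implementation and will not pick the same rows as `train_test_split`. The function name `simple_split` and the rounding-up of the test size are assumptions made for illustration.

```python
import math
import random

def simple_split(rows, test_size=0.2, seed=9):
    """Shuffle the rows and hold out a fraction of them as a test set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)   # seeded, so the split is repeatable
    n_test = math.ceil(test_size * len(rows))
    return rows[n_test:], rows[:n_test]

# 66,806 is the total of the complaint counts shown earlier.
train_rows, test_rows = simple_split(range(66806))
print(len(train_rows), len(test_rows))
```

Holding out 20% of 66,806 rows leaves 13,362 rows for testing, which matches the number of test examples reported later in this post.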

To handle unstructured text data, I use fastText for text classification (it is not easy to install, but trust me, it’s worth your time!). With fastText, the format of the input data has a huge impact on training, so we convert the input data into the format fastText requires.

def make_fastText_input(df, fname):
    # Assumes df has an integer `category_id` column (one id per product)
    # and no missing narratives.
    texts = df.consumer_complaint_narrative.values
    cat_id = df.category_id.values
    with open(fname, 'w', encoding='utf-8') as f:
        for idx in range(len(df)):
            # one complaint per line: '__label__<id> <text></s>'
            f.write('__label__' + str(cat_id[idx]) + ' ' + texts[idx].replace('\n', '') + '</s>\n')

path_to_input = './data/customer_conplaint_fastText_input'
make_fastText_input(df=train, fname=path_to_input)
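The line format is easier to see on a single record. The helper below is a hypothetical stand-alone variant of the writer above, not part of fastText: it builds one `__label__<id> <text></s>` line. Unlike the function above, it replaces line breaks with a space so that the words on either side do not run together, which is a small assumed tweak.

```python
def to_fasttext_line(category_id, text):
    """Format one labelled example as a single fastText training line."""
    # Collapse line breaks and repeated spaces: one example per line.
    flat_text = ' '.join(text.split())
    return f'__label__{category_id} {flat_text}</s>\n'

line = to_fasttext_line(6, 'i didnt make\na inquiry for this bank')
print(repr(line))
```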

The input file (‘./data/customer_conplaint_fastText_input’) must look similar to the following lines.

__label__2 My mortgage is owned by XXXX and serviced by PNC Mortgage. My Loan # XXXX. I have been working with PNC for XXXX years now to get a Making home affordable loan     modification to save my home from forclosure with no success. I have reviewed all the information and based on guidelines should qualify. PNC has been giving me the run around. I have given them all needed documents and they close my file and make me start over. They keep blaming it on the investor guidelines buy wont allow me to speak with anyone from XXXX. I am never given clear denial reasons. Please help me save my home from forclosure. I am stressing out over this long process that leads no where. I have once again submitted another making home affordable application on XXXX/XXXX/15. </s>

__label__6 i didnt make a inquiry for this bank

Here __label__0, __label__1, …, __label__10 stand for ‘Debt collection’, ‘Consumer Loan’, ‘Mortgage’, ‘Credit card’, ‘Credit reporting’, ‘Student loan’, ‘Bank account or service’, ‘Payday loan’, ‘Money transfers’, ‘Other financial service’ and ‘Prepaid card’ respectively. The training data is ready, so let's start training.

# note: recent releases of the official Python bindings are imported as `fasttext`
import fastText as fT
classifier = fT.train_supervised(input=path_to_input)
# Read 10M words
# Number of words:  118348
# Number of labels: 11
# Progress: 100.0% words/sec/thread: 4515701 lr:  0.000000 loss: 0.865297 ETA:   0h 0m

We can test the machine's performance by picking a random complaint from the test set.

# pick one complaint from test set
test_01 = test.consumer_complaint_narrative.sample(1, random_state=10).values[0].replace('\n', '')
# I have been receiving calls from an unknown debt collector. They have not given me any information as to who they are, a physical address, and call me day and night, after I have     repeatedly asked to only contact me through mail. Now they have called my work without my permission. I do n't even know how they got my work number, and jeopardizing my job by doing so.
# test session
classifier.predict(test_01)
# (('__label__0',), array([0.85070831]))
# create your own complaint
my_complaint = 'I need a loan asap, show me the money'
# your result
classifier.predict(my_complaint)
# (('__label__5',), array([0.62615275]))
# lol, it said 'Student Loan'

It predicts that ‘test_01’ is a ‘Debt collection’ complaint with a probability of 0.85, and that is the correct answer! Keep calm and test more…
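Translating `__label__0` back into ‘Debt collection’ by hand gets tedious. A small lookup, assuming the label order listed earlier in this post (the names `PRODUCTS` and `label_to_product` are made up for this sketch), could look like this:

```python
# Product names in label order, as listed earlier in this post.
PRODUCTS = [
    'Debt collection', 'Consumer Loan', 'Mortgage', 'Credit card',
    'Credit reporting', 'Student loan', 'Bank account or service',
    'Payday loan', 'Money transfers', 'Other financial service',
    'Prepaid card',
]

def label_to_product(label):
    """Turn a fastText label such as '__label__5' into a product name."""
    prefix = '__label__'
    if not label.startswith(prefix):
        raise ValueError(f'not a fastText label: {label!r}')
    return PRODUCTS[int(label[len(prefix):])]

print(label_to_product('__label__0'))
print(label_to_product('__label__5'))
```

The first element of the tuple returned by `predict` is itself a tuple of labels, so the label to decode is its first item.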

# making test file
path_to_test = './data/customer_conplaint_fastText_test'
make_fastText_input(df=test, fname=path_to_test)
# testing the test file
classifier.test(path=path_to_test)
# (13362, 0.8083370752881305, 0.8083370752881305)
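The tuple returned by fastText's `test` is (number of examples, precision at 1, recall at 1). With exactly one label per complaint, the two scores are equal and amount to plain accuracy. The sketch below recomputes that score on a made-up handful of labels. The function `precision_at_1` and the toy lists are illustrative only, while the last lines use the figures from the result tuple above.

```python
def precision_at_1(true_labels, predicted_labels):
    """Fraction of examples whose top prediction equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError('label lists must have the same length')
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)

# Toy labels, invented for illustration.
truth = ['__label__0', '__label__2', '__label__5', '__label__0', '__label__4']
guess = ['__label__0', '__label__2', '__label__1', '__label__0', '__label__0']
print(precision_at_1(truth, guess))

# Figures from the result tuple above: how many complaints were correct?
n_examples, p_at_1 = 13362, 0.8083370752881305
n_correct = round(n_examples * p_at_1)
print(n_correct)
```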

Roughly speaking, the model classified the complaints in the test set with 80.83% accuracy (about 10,800 out of 13,362). So how fast can a computer read?

import time
start = time.time()
result = classifier.test(path=path_to_test)
end = time.time()
print(end - start)
# 0.3329501152038574

It takes 0.33 seconds to read 13,362 complaints, or around 2,526,411 words in total, so the classifier can read more than 455 million words per minute (on my laptop*). That is more than a million times faster than an ordinary person, or about nine thousand times faster than Maria.
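The arithmetic behind these figures can be checked in a few lines. All inputs are numbers quoted in this post; only the variable names are new.

```python
total_words = 2_526_411                # words in the test file
elapsed_seconds = 0.3329501152038574   # timing printed above
human_wpm = 300                        # upper end of 200-300 words per minute
maria_wpm = 50_000                     # record quoted in the introduction

# Scale the measured throughput from words per second to words per minute.
machine_wpm = total_words / elapsed_seconds * 60

print(f'{machine_wpm:,.0f} words per minute')
print(f'{machine_wpm / human_wpm:,.0f} times a typical reader')
print(f'{machine_wpm / maria_wpm:,.0f} times Maria')
```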

To conclude, Natural Language Processing (NLP) is about making a computer understand human language. The task we just finished, text classification, is a supervised learning approach, and supervised learning is a subset of machine learning, which mainly focuses on developing machines that perform tasks like the one demonstrated above. By the way, more complicated state-of-the-art methods such as RNN and LSTM, which give more precise results, are coming in the next blog, so stay tuned…
