So here is a quick tutorial on how to construct the graph shown above, showing the distribution of messages over a given time period. First of all we must obtain the dataset we are going to use, which for this tutorial is the chat history obtained from WhatsApp. This can be obtained following these instructions or for iMessages or FaceTime data you can use this (paid) software.This tutorial is posted with permission by Prometheus09 on Reddit. It was originally posted as an album on Imgur.
Importing the Message into Excel
After exporting the chat history from WhatsApp you will have a basic text file. Open the file with excel so that you can save the messages as a .csv file (this is what is imported into R). Opening the text file in excel will bring up the Text Import Wizard. Choose the options: Delimited > Next > Space > Next > General > Finish. This will put characters between each space in the text file into a separate cell in the .csv file. More importantly, it will put all the dates, the times, sender and messages in their own columns. Unfortunately this will break the words of each message into separate columns.
Cleaning the Message Data
To clean up the csv, remove all of the columns that contain data from the messages and put a column name at the top of the remaining columns (e.g. date, time, morning, sender). In the morning and sender columns we also want to replace (ctrl + H) the colons ‘:’ with an empty space so they are removed from the .csv file. We will be left with the image below.One thing to be aware of is that messages that were split across multiple paragraphs in WhatsApp (i.e. a return was used to split the message) will be split across multiple rows in the .csv file. It’s best to just to delete these extra rows. They are easy to identify if you just sort the document by the date column (either ascending or descending) which will bring them to the top/bottom of the document.
Working with R
Now that we have a nice clean .csv file, let’s save it as messages.csv and import it into R (download from R-Project). You can also learn more about R with the Cookbook-R introduction.
In R, use the read .csv function and assign the .csv file to a data frame in R.dat <- read.csv("~/Dropbox/Public/messages.csv)", header = TRUE) |
-
dat
Shows the data frame
-
head(data)
Show only the first 6 rows
-
str(dat)
Shows the structure of the data frame
Congratulations, the hard part is done now; the data is in R and you can even determine how many texts a person has sent right now using the table() function.
table(dat$sender) ##Crosstabulation of sender variable in the dat dataframe |
From this we can see I have sent 6 messages and she has sent 8.
Making New Variables
It’s time too construct some new variables to add onto the data frame: iso, month, day, hour. Also, change the variable date from a factor to a date. The iso will be POSIXlt class variable telling us the date and time each message was sent. For example, we are pasting the date, time and morning variables together (paste()) and then telling R that this variable represents time (strptime()).
dat$iso <- paste(dat$date, dat$time, dat$morning, sep=" ") dat$iso <- strptime(dat$iso, "%d/%m/%Y %I:%M:%S %p") ?strptime ##use to get help, pay attention to the format of the dates. |
Now to make the new variables month, day, hour using the functions months(), weekdays() and format().
dat$month <- months(dat$iso, abbreviate = TRUE) dat$day <- weekdays(dat$iso, abbreviate = TRUE) dat$hour <- format(dat$iso, "%H") |
Finally, tell R that the variable date is a date and not a factor.
dat$date <- as.Date(as.character(dat$date), format = "%d/%m/%y") |
Success! Now we can use this data frame to plot our graphs.
Graphing With GGPlot
For this we will be using ggplot2. To install and load these packages, use the following:install.packages('ggplot2') library(ggplot2) |
The format for ggplot is pretty easy and very customizable. Take a look at this page to get a general idea of the things we can do when plotting distributions. The general format for plots we are doing will be:
ggplot(dat, aes(x = days, fill = sender)) + geom_histogram(binwidth = 7, position = 'dodge') |
This tells R to make a plot using the function ggplot. The data comes from the data frame ‘dat’. The aesthetics consist of ‘x’ the variable that goes along the x axis of the graph, which will consist of the number of messages sent each day, and ‘fill’ which will ggplot to colour each bar by sender. We use the geom_histogram() function to indicate that we want to plot a histogram. Binwidth determines the width of each bar (in this case 7 days) while position dodge is saying to plot each ‘senders’ number of messages next to each, in comparison to the previous graph where there were stacked on top of each. To have a stacked graph don’t use position.
So now to look at some graphs I prepared earlier…
ggplot(dat, aes(x = hour, fill = sender)) + geom_histogram() |
This graph is interpreted as each bar is the total number of messages sent while the colour are proportion sent by me and her (i.e. she sent more than me).
ggplot(dat, aes(x = month, fill = sender)) + geom_histogram(position = 'dodge') |
And that’s all! For any other type of message, the concept is the same: Import a file into R that contains a variable that contains date/time and a variable for sender.
If you want to look at character length of messages, you need to import the original text file into excel so that the messages are not broken up. For example, rather than using the option ‘space’ you can use other ‘:’, though this will break up time. To avoid this, in the text file you can replace colons next to sender and morning with a unique character such as ‘|’, and then separate the text files based on this character. Counting the character length in each message can be done using the stri_length() function from the stringi package.
Skype and FaceTime data are a little bit different and instead what you are focusing on is the duration of the call and the date the call was place.
Hi Mario,
Thank you for such a clear presentation you have helped me a lot. I am trying to do a similar research analysis on a group chat but the problem is when I try to import into excel the data tends to become very messy as people tend to post really long messages which when I try to import they are being entered in the first column which contains date. Now having to remove the text from the first column is becoming a challenge as I have 500 such texts out of 3200 rows. If you know a quicker way around it may you please assist.
Your response will be greatly appreciated,
regards,
Allen
What did you do with the Emoji?