To get some practice using Python for data analysis, and to get more familiar with a few data visualization tools, I’ve decided to do a series of projects on random datasets and see what I can get out of them. For my first project, I’m taking a look at the data dump that Facebook gives you.
- Download all my data from Facebook, figure out what’s interesting about it, analyze it, and visualize it.
There will be two main dimensions to the analysis: a look at the general type and structure of data that Facebook keeps, and a dive into my own personal data to see what’s really going on there.
I graduated with degrees in history and sociology, so I’m incapable of starting a project without developing some research questions. Though they’ll be refined once I get into the data, initially, I’m looking to find out:
- What are the main types of data Facebook has (interests, location, advertising, interaction history, etc.)?
- How extensive is this data, exactly? How far back does it go? What gets kept around?
- What can I find out about myself (or someone else, theoretically) with unfettered access to this data dump?
- Interesting sub-question: if I were to reconstruct my life using only Facebook data, what would it look like, and how closely would it match my real life?
Downloading your Facebook data is pretty easy, though the files aren’t tiny. That’s probably because you’re not just getting text data and code—you’re getting every picture and video on your profile.
I opted to download all my data in both the JSON and HTML formats, just for kicks. Opening up the unzipped JSON folder with Jupyter Notebook reveals folders with names like “friends,” “ads,” and “location_history,” among others. The JSON files within the folders are, as you would expect, full of information on you in raw text format.
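To get a quick feel for how the JSON export is laid out, something like the sketch below could list each top-level folder and count the JSON files inside it. The folder name `facebook-export-json` and the helper `summarize_export` are made up for illustration; point the path at wherever you unzipped your own archive.

```python
from collections import Counter
from pathlib import Path


def summarize_export(root):
    """Count the .json files under each top-level folder of the export."""
    root = Path(root)
    counts = Counter()
    for path in root.rglob("*.json"):
        # The first path component below root is the category folder.
        relative = path.relative_to(root)
        top = relative.parts[0] if len(relative.parts) > 1 else "(top level)"
        counts[top] += 1
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical path; replace it with the location of the unzipped archive.
    export_dir = Path("facebook-export-json")
    if export_dir.is_dir():
        for folder, n in sorted(summarize_export(export_dir).items()):
            print(f"{folder}: {n} JSON file(s)")
```

This only looks at file names, not file contents, so it works no matter how the individual JSON files are structured.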
The HTML folder is a bit more user-friendly, so it should be good for answering some of my initial questions and getting an idea of what I should be looking for. The file structure is identical—same folder names and everything—but opening the HTML files gives you clear human-speak language in a nice format.
What are the main types of data?
Okay, first question: what’s in the box? (“WHAAAAT’S IN THE BOOOOX?!”) Some big categories jump out right away:
- Profile data: about you, pages, profile information
- Friend data: friends, groups, following and followers
- Timeline/interaction data: comments, events, likes and reactions, messages, photos and videos, posts
- Advertising data: ads
- Location/security: location history, security and login information
- Stuff you looked for: search history, saved items
Upon inspection, it turns out that this is pretty much your entire timeline. The lion’s share of data here is your posts, comments, likes, photos, videos, and messages, which is all stuff you’ve generated and which has stuck around as a mini-history of you. This stuff will be interesting to look at from a behavioral/statistical standpoint. What’s my posting frequency by year? Average post length? Of course, I could get one of those funky little Facebook analysis apps to do this for me, but the whole point here is to analyze it myself.
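As a rough idea of what the posting-frequency and post-length questions might look like in code, here’s a minimal sketch. It assumes each post is a dict with a Unix `timestamp` (in seconds) and its text under `data[i]["post"]`. That layout is an assumption for illustration, so check it against your own files before trusting it. The sample records are made up, and the helper names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone
from statistics import mean


def post_text(post):
    """Return the first text found in a post's "data" list, or "" if none."""
    for item in post.get("data", []):
        if "post" in item:
            return item["post"]
    return ""


def posts_per_year(posts):
    """Count posts by calendar year (UTC) of their Unix timestamp."""
    years = Counter()
    for p in posts:
        year = datetime.fromtimestamp(p["timestamp"], tz=timezone.utc).year
        years[year] += 1
    return dict(years)


def average_post_length(posts):
    """Mean character length over the posts that actually contain text."""
    lengths = [len(post_text(p)) for p in posts if post_text(p)]
    return mean(lengths) if lengths else 0.0


# Made-up sample records in the assumed layout, not real data.
sample = [
    {"timestamp": 1262304000, "data": [{"post": "Hello"}]},
    {"timestamp": 1293840000, "data": [{"post": "Happy new year"}]},
    {"timestamp": 1300000000, "data": []},  # a post with no text
]

print(posts_per_year(sample))
print(average_post_length(sample))
```

Posts without text (photo-only posts, say) still count toward the yearly totals but are left out of the average length, which seemed like the less misleading choice.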
What I’m really looking forward to here is the stuff I didn’t generate, because I don’t know exactly what’s in there. Do they know everywhere I’ve been? Do they keep a record of every login? What’s my advertising profile like? And search history? Yeah, I generated that, but are they keeping a record of every single search I’ve ever done? This is where it gets interesting, mostly because it’s a little creepy.
Over the next few weeks, I’m going to go piece by piece, looking at the data with both human eyes and some Python code. With any luck, I’ll produce some interesting data and some kind of data visualization on everything I look at. Coming up:
- A personal analysis
  - My posting history
  - My comment/reaction history
  - Differences between my messages and posts
  - I noticed that the JSON files included a fair bit of photo metadata, so I’ll see what I can do with that.
- Creep factor
  - My location history (Facebook also logs IP addresses in a separate location, so it probably goes back a long way)
  - My search history
  - Advertising: what are they looking at? Who is advertising to me?