Analyzing Facebook Data: Part 1

To get some practice using Python for data analysis, and to get more familiar with a few data visualization tools, I’ve decided to do a series of projects on random datasets and see what I can get out of them. For my first project, I’m taking a look at the data dump that Facebook gives you.

The mission:

  • Download all my data from Facebook, figure out what’s interesting about it, analyze it, and visualize it.

There will be two main dimensions to the analysis: a look at the general type and structure of data that Facebook keeps, and a delve into my own personal data to see what’s really going on there.


I graduated with degrees in history and sociology, so I’m incapable of starting a project without developing some research questions. Though they’ll be refined once I get into the data, initially, I’m looking to find out:

  • What are the main types of data Facebook has (interests, location, advertising, interaction history, etc.)?
  • How extensive is this data, exactly? How far back does it go? What gets kept around?
  • What can I find out about myself (or someone else, theoretically) with unfettered access to this data dump?
    • Interesting sub-question: if I were to reconstruct my life using only Facebook data, what would it look like, and how closely would it match my real life?

Getting started

Downloading your Facebook data is pretty easy, though the files aren’t tiny. That’s probably because you’re not just getting text data and code—you’re getting every picture and video on your profile.

I opted to download all my data in both the JSON and HTML formats, just for kicks. Opening up the unzipped JSON folder with Jupyter Notebook reveals folders with names like “friends,” “ads,” and “location_history,” among many others. The JSON files within the folders are, as you would expect, full of information on you in raw text format.
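To get a quick map of what is in the JSON download, something like the following sketch could work. It only assumes that the export has been unzipped into a single directory with one folder per category, as described above; the function names `inventory_export` and `top_level_keys` are my own, and nothing here depends on Facebook's actual field names.

```python
import json
from collections import Counter
from pathlib import Path


def inventory_export(root):
    """Count the JSON files under each top-level folder of an unzipped export."""
    root = Path(root)
    counts = Counter()
    for path in root.rglob("*.json"):
        # The first path component below the root is treated as the category folder.
        relative = path.relative_to(root)
        top = relative.parts[0] if len(relative.parts) > 1 else "(root)"
        counts[top] += 1
    return dict(counts)


def top_level_keys(json_path):
    """Describe the top level of one JSON file without assuming its schema."""
    with open(json_path, encoding="utf-8") as handle:
        data = json.load(handle)
    if isinstance(data, dict):
        return sorted(data.keys())
    if isinstance(data, list):
        return [f"<list of {len(data)} items>"]
    return [f"<{type(data).__name__}>"]
```

Calling `inventory_export` on the unzipped folder gives a rough sense of which categories hold the most files, and `top_level_keys` is a cheap way to peek at a file before deciding how to parse it.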

The HTML folder is a bit more user-friendly, so it should be good for answering some of my initial questions and getting an idea of what I should be looking for. The file structure is identical—same folder names and everything—but opening the HTML files gives you clear human-speak language in a nice format.

What are the main types of data?

Okay, first question: what’s in the box? (“WHAAAAT’S IN THE BOOOOX?!”) Some big categories jump out right away:

  • Profile data: about you, pages, profile information
  • Friend data: friends, groups, following and followers
  • Timeline/interaction data: comments, events, likes and reactions, messages, photos and videos, posts
  • Advertising data: ads
  • Location/security: location history, security and login information
  • Stuff you looked for: search history, saved items

Upon inspection, it turns out that this is pretty much your entire timeline. The lion’s share of data here is your posts, comments, likes, photos, videos, and messages, which is all stuff you’ve generated and which has stuck around as a mini-history of you. This stuff will be interesting to look at from a behavioral/statistical standpoint. What’s my posting frequency by year? Average post length? Of course, I could get one of those funky little Facebook analysis apps to do this for me, but the whole point here is to analyze it myself.
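As a rough sketch of the posting-frequency and post-length questions, here is one way the numbers could be computed once the posts have been pulled out of the JSON. The input shape is an assumption on my part: a list of `(timestamp, text)` pairs with the timestamp in Unix epoch seconds. The real export would need its own loading step to get into that form, and `summarize_posts` is a hypothetical helper name.

```python
from collections import defaultdict
from datetime import datetime, timezone


def summarize_posts(posts):
    """Return {year: (post_count, average_length_in_characters)}.

    `posts` is an iterable of (timestamp, text) pairs. The timestamp is
    assumed to be Unix epoch seconds; that shape is an assumption, not
    something taken from the export itself.
    """
    lengths_by_year = defaultdict(list)
    for timestamp, text in posts:
        # Convert in UTC so the year does not depend on the local time zone.
        year = datetime.fromtimestamp(timestamp, tz=timezone.utc).year
        lengths_by_year[year].append(len(text))
    return {
        year: (len(lengths), sum(lengths) / len(lengths))
        for year, lengths in sorted(lengths_by_year.items())
    }


# Made-up sample data, just to show the call.
sample_posts = [
    (1262304000, "hello"),     # 2010-01-01 UTC
    (1262390400, "hi!"),       # 2010-01-02 UTC
    (1514764800, "new year"),  # 2018-01-01 UTC
]
summary = summarize_posts(sample_posts)
```

Grouping in plain dictionaries keeps the sketch dependency-free; the same result could be fed into a plotting library later for the visualization side.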

What I’m really looking forward to here is the stuff I didn’t generate, because I don’t know exactly what’s in there. Do they know everywhere I’ve been? Do they keep a record of every login? What’s my advertising profile like? And search history? Yeah, I generated that, but are they keeping a record of every single search I’ve ever done? This is where it gets interesting, mostly because it’s a little creepy.

Over the next few weeks, I’m going to go piece by piece, looking at the data with both human eyes and some Python code. With any luck, I’ll produce some interesting data and some kind of data visualization on everything I look at. Coming up:

  • A personal analysis
    • My posting history
    • My comment/reaction history
    • Differences between my messages and posts
  • Photos
    • I noticed that the JSON files included a fair bit of photo metadata, so I’ll see what I can do with that.
  • Creep factor
    • My location history (Facebook also logs IP addresses in a separate location, so it probably goes back a long way)
    • My search history
    • Advertising: what are they looking at? Who is advertising to me?

MIT, micromasters, and disrupting higher ed

For the past few months I’ve been taking courses with MIT’s Data, Economics, and Development Policy micromasters program. Having completed two of the five courses, I’ve absorbed a lot of very interesting material, some of which will almost certainly be making it onto this page in the future.

Beyond the content, though, what impresses me most is that MIT is offering these courses on an audit basis to anyone who wants to try them, with a potential path to actually attending the school and earning a full master’s. Even more fascinating is that there are no set academic prerequisites for entering the full master’s program beyond completing the micromasters online. Of course, not all applicants who go through the MM will be accepted, but the raw idea is worth digging into.

MOOCs are edging us towards a path to college admissions that removes significant guesswork from both sides of the equation. On the institution’s side, the need for credentials to enter universities is based on the idea that past performance is a reliable indicator of future results; on the student’s side, the assumption is that acceptance is a fairly reliable indicator that the university approximately matches their capabilities. Students may also operate under the assumption that they will find the coursework useful in some way, that they will be interested in their chosen field, and that they won’t for some reason be inspired to drop out.

Of course, past performance is never a guarantee (thanks, Nassim Taleb), and past performance in a different environment is even less of one. People drop out or fail either because they are poorly matched or because the reality did not meet their expectations in some way. MIT’s system presents the material at the actual level of difficulty a student would encounter in the full program and asks for some financial commitment (100-1000 dollars, dependent on income, if you choose to take the micromasters track rather than simply auditing, which is free). Together, these essentially correct the information asymmetry.

On the university side, it gets a lot easier to select students that you know will fit you at an academic and, to some extent, a cultural level, without doing a lot of research into their performance in regional spelling bees. This could serve to significantly reduce the time spent per student in the admissions office. In short, MIT knows how you’re doing in their course, regardless of past performance, and the student has a fairly good idea of what they’re getting into.

In the rosiest view, I could see this system correcting for a lot of issues that currently plague higher education. Low- to no-cost auditing with light credentialling attached would be a way to admit students with real ability who don’t have the formal paperwork in hand. It might improve dropout rates by exposing students to the product before they make the decision to invest. It would certainly help universities by creating a self-selection process based on actual observed performance rather than educated guesswork, making admissions less of a bureaucratic necessity and more of a competitive process. It could reduce administrative overhead in the long run, create richer technological integration in a stubbornly old-fashioned higher education system, and possibly create a clearer market for universities based on instruction and utility rather than the experiences and amenities that tend to be pushed in promotional materials.

Nonetheless, as Tyler Cowen laments in The Complacent Class, universities are already bubbles of like-minded, like-backgrounded individuals. This could get better with self-selection opening up admissions to students who would not traditionally be able to enter, but it might also get worse, as the courses may tend to promote over-matching by appealing to those with a certain worldview and background. Something like Jonathan Haidt’s Heterodox Academy project would have to figure into the creation of these courses in order to ensure that diversity in all respects would continue to thrive and that we wouldn’t self-select into even more polarized groups than we already do.

Additionally, those without the time or access to participate would be at a disadvantage, as they wouldn’t be able to rely on the relatively less labor-intensive traditional admissions process to make their transition at the speed they need to. Of course, optimizing for university admissions already tends to be a process that starts around middle school and intensifies throughout high school. Optimistically, self-selection playing a role in the admission process would enable those without the resources to invest in resume-building activities to prove their skills; pessimistically, it would become one more hoop for already-disadvantaged students to jump through. That’s why, if this were to be implemented in any form, it would need to be clearly separated from other criteria, or even, in a very libertarian implementation of the idea, used as the only criterion. Otherwise it could become simply another time suck or long-form standardized test weighing down admissions even more.

A host of other pros and cons exist, but I am inclined to lean strongly towards the pros; done right, this could reduce a lot of costs, both financial and social, on both sides. I think in the future we will be seeing a lot more of this trend being built into MOOCs; I doubt they will continue to exist solely as nice, cheap ways for smart people to get smarter. They’ll be turned towards credentialling and program integration eventually, which I think will vastly improve the quality of the market. And of course, at this stage in the existence of the traditional university, disruption in any form that raises efficiency and lowers costs is welcome.