danah boyd
WWW 2010
April 29, 2010
[This is a rough unedited crib of the actual talk]
Citation: boyd, danah. 2010. "Privacy and Publicity in the Context of Big Data." WWW. Raleigh, North Carolina, April 29.
INTRODUCTION
Unless you've been hiding under a rock, you've witnessed all sorts of grumblings about privacy issues in relation to social media. Sometimes, this comes in the form of complete panic. "OMG, kids these days! What are they putting up online!?!?" At other times, we hear this issue emerge in relation to security issues: data retention, identity theft, phishing, etc. Privacy concerns emerge whenever we talk about government surveillance or behavioral advertising.
Privacy concerns are not new; people have been talking about privacy - or the lack thereof - forever. So what's different now? The difference is Big Data.
While databases have been aggregating and chomping on data for over a century, the Internet has created unprecedented opportunities for people to produce and share data, interact with and remix data, aggregate and organize data. Data is the digital air in which we breathe and countless efforts are being put into trying to make sense of all of the data swirling around. When we talk about privacy and publicity in a digital age, we can't avoid talking about data. We can't avoid talking about how data is produced, stored, shared, consumed, aggregated. Privacy is completely intermingled with Big Data.
For the purpose of this talk, when I talk about Big Data, I am talking about the kinds of data that marketers and researchers and business folks are currently salivating over. Data about people, their activities, their interactions, their behaviors. Data that sits at the foundation of Facebook, Twitter, Google, and every social media tool out there. Big Data is also at the center of many of the conversations here at WWW. Many of you are in the business of mucking around with Big Data. Others of you are building tools that leverage Big Data. All of you are producing content that is feeding into Big Data.
We've entered an era where data is cheap, but making sense of it is not. Many of you are sitting on terabytes of data about human interactions. The opportunities to scrape data - or more politely, leverage APIs - are also unprecedented. And folks are buzzing around wondering what they can do with all of the data they've got their hands on. But in our obsession with Big Data, we've forgotten to ask some of the hard critical questions about what all this data means and how we should be engaging with it. Thinking about privacy and publicity is just one way of getting at the hard questions brought about by Big Data but one that I think is essential.
To get at privacy issues, we need to back up and talk about methodology. Talking about methodology with respect to Big Data is critically important and WWW is the perfect conference to begin some of these critical conversations. Methodology shapes how we engage with Big Data. But thinking about methodology and ethics also gives us a critical framework with which to think about Big Data.
METHODOLOGICALLY SOUND APPROACHES TO BIG DATA
I am an ethnographer. What that means at the coarsest level is that I'm in the business of mapping culture. I didn't start out here. I started out in computer science and only got interested in cultural practices after I began visualizing digital social networks. As someone who was visualizing data, I fell deeply in love with Big Data. I loved the patterns that could be uncovered, but I also quickly worked out that the patterns introduced more questions than answers. I made a methodological shift to qualitative social science because I wanted to understand _why_ people produced the data that I was seeing. In this shift, I began to reflect critically on what Big Data could and could not tell us about human behavior.
In a recent essay on Big Data, Scott Golder began by quoting George Homans, a famous sociologist: "The methods of social science are dear in time and money and getting dearer every day."
Social scientists have long complained about the challenges of getting access to data. Historically speaking, collecting data has been hard, time consuming, and resource intensive. Much of the enthusiasm surrounding Big Data stems from the opportunity of having easy access to massive amounts of data with the click of a finger. Or, in Vint Cerf's words, "We never, ever in the history of mankind have had access to so much information so quickly and so easily." Unfortunately, what gets lost in this excitement is a critical analysis of what this data is and what it means.
Many of you in the room are approaching Big Data from a computational perspective, but it's absolutely crucial that you understand that you're dealing with data about people. This is where social science theory and analysis can help. Social scientists are in the business of collecting, aggregating, and analyzing social data. And their analysis is tightly tied to their methodologies. Using social science logic, I want to discuss five things that everyone working with Big Data must understand:
1) Bigger Data are Not Always Better Data
2) Not All Data are Created Equal
3) What and Why are Different Questions
4) Be Careful of Your Interpretations
5) Just Because It is Accessible Doesn’t Mean Using It is Ethical
#1: Bigger Data are Not Always Better Data
Big Data is exciting, but quality matters more than quantity. And to know the quality, you must know the limits of your data.
One of the most important methodological considerations when working with data is sampling. Sampling is crucial to all social sciences, including ethnography. How you sample your data affects what claims you can make. You need to know the sample to extrapolate arguments about what the data says. If you're trying to make claims about representativeness, your ultimate sample should be a random one. If you're trying to make topological claims, you need to sample for diversity not representativeness. (This, by the way, is one of the core reasons why ethnographers don't make representativeness claims; we sample to try to map out the entire terrain, not to quantify frequencies.)
In methodological utopia, you want to have access to the whole population so that you can make wise decisions about how to sample. Given a whole data set, a scholar with questions about frequencies would easily be able to get a representative random sample. Given a whole data set, a scholar with questions about diversity would be able to account for outliers. Historically, it was extremely rare to have access to the whole of anything and so scholars had to find best approximations.
Big Data introduces the possibility of whole datasets. But Big Data isn’t always a whole data set. Twitter has all of Twitter. But most researchers don’t have all of Twitter. At best, they’ve got all of the public tweets. More likely, they have the stream of public tweets from the public timeline. These tweets aren’t even random.
Lately, I’ve been reading a lot of articles where scholars argue that their sample is valid because they have millions of tweets, without accounting for what those tweets are. Big-ness and whole-ness are NOT the same. If you’re trying to understand topical frequency of tweets and Twitter removes all tweets that contain problematic words, your analysis will be wrong regardless of how many tweets you have.
Sampling also requires that you work out your biases at every step. Are certain types of people more likely to be under or over represented? If so, what does that mean? Let's again assume that you have every public tweet ever sent on Twitter. If you randomly sample those, you do NOT have a random sample of users. You have a random sample of public tweets. This is simple: not all accounts tweet at the same frequency so randomly sampling tweets oversamples accounts that produce more tweets. Let’s say you can determine all Twitter accounts and you randomly sample across those. In doing so, you are not randomly sampling Twitter users; you are randomly sampling accounts. Some users have multiple accounts. Others don’t have accounts but read Twitter actively.
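To make this concrete, here is a minimal Python sketch using entirely made-up accounts and tweet counts. It only illustrates the bias described above: a random sample of tweets is not a random sample of accounts, because prolific accounts dominate any tweet-level sample.

```python
import random
from collections import Counter

# Hypothetical corpus: each tweet is tagged with the account that posted it.
# Account "a" tweets 100 times as often as account "c".
tweets = ["a"] * 1000 + ["b"] * 100 + ["c"] * 10

# A random sample of *tweets* is not a random sample of *accounts*:
# accounts that produce more tweets are overrepresented in proportion to their output.
tweet_sample = random.sample(tweets, 200)
print(Counter(tweet_sample))  # heavily skewed toward "a"

# A random sample of *accounts* weights each account equally,
# but it is still a sample of accounts, not of the people behind them.
accounts = sorted(set(tweets))
account_sample = random.sample(accounts, 2)
print(account_sample)
```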
You need to know your dataset. Just because you’re seeing millions and millions of pieces of data doesn’t mean that your data is random or representative of anything. To make claims about your data, you need to know where the data comes from.
#2: Not All Data Are Created Equal
This brings us to the second key issue: not all data are created equal. Because of the big-ness of Big Data, many who work with it believe that it is the best data out there. Those who argue that Big Data will render other approaches to data collection useless astonish me. This usually stems from an arrogant belief that Big Data is "pure" data. Big Data is valuable, but it has its limitations - it can only reveal certain things and it's outright dangerous to assume that it says more than it does.
This issue keeps coming up in relation to social networks. People from diverse disciplines are analyzing social networks using diverse methodological and analytical approaches. But it kills me when those working with Big Data think that the data they’ve collected from Facebook or through cell phone records are more “accurate” than those collected by sociologists. They’re extremely valuable networks, but they’re different networks. And those differences need to be understood.
Historically speaking, when sociologists were the only ones interested in social networks, data about social networks was collected through surveys, interviews, observations, and experiments. Using this data, they went on to theorize about people's personal networks. Debates rage about how to measure people's networks and whether certain approaches create biases and how to account for them.
Big Data introduces two new popular types of social networks derived from data traces: articulated social networks and behavioral social networks. Articulated networks are those that result from the typically public articulation of social connections, as in the public list of people’s Friends on Facebook. Behavioral networks are those that are derived from communication patterns and cell coordinates. Each of these networks is extraordinarily interesting, but they are NOT the same as what sociologists have historically measured or theorized.
Keep in mind that people list people in their contact lists that they don't really know or don't really like. And they communicate with others that they don't really know or don't really like. This is not inherently a bad thing, but when all connections result in an edge, you need to ask yourself what that edge means. And you need to ask why some edges that are meaningful are missing, either because people don't connect with everyone they care about regularly or because they can't articulate people who aren't present on a particular service. You cannot analyze Facebook Friends lists and say that you’ve analyzed a person’s social network. You haven’t. You’ve analyzed their Facebook Friends list.
It’s a very good idea to take theories about personal networks and see if they also apply to behavioral or articulated networks. We’ve already seen that homophily does. But you cannot assume that because something is different in articulated or behavioral networks, that it’s wrong in personal networks. Consider tie strength. The person that I list as my Top Friend on MySpace may not be my closest friend because I probably chose that person for all sorts of political reasons. I may spend a lot more time with my collaborators than my mother but that doesn’t mean that my mother is less important. Measuring tie strength through frequency or public articulation misses the point: tie strength is about a mental model of importance and signals trust and reliability and dependence.
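As a toy illustration, consider the sketch below; the names and message counts are invented, and the only point is that an articulated Friends list and a behavioral communication log yield different edges, and that neither edge set measures how much a tie matters.

```python
from collections import Counter

# Articulated network: who a user publicly lists as a Friend (invented data).
articulated = {"mom", "collaborator", "old_classmate", "coca_cola"}

# Behavioral network: who that same user actually exchanges messages with, and how often.
message_log = ["collaborator"] * 40 + ["coworker"] * 25 + ["mom"] * 3
behavioral = Counter(message_log)

# The two edge sets disagree in both directions.
print(articulated - set(behavioral))  # listed but never messaged: old_classmate, coca_cola
print(set(behavioral) - articulated)  # messaged often but never listed: coworker

# And frequency is not importance: "mom" has the lowest message count,
# which says nothing about how much that tie actually matters.
print(behavioral.most_common())
```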
Data is not generic. It doesn't say generic things simply because you can model it, graph it, or compute it. You need to understand the meaning behind the representations to understand what it can and cannot say. Not all data is equivalent even if it can be represented similarly.
#3: What and Why are Different Questions
Nobody loves Big Data better than marketers. And nobody misinterprets Big Data better than marketers. They do so because they think that What answers questions of Why. My favorite moment came when I was on a panel where a brand marketer from Coca-Cola proudly announced that they had lots of followers on MySpace. I couldn't help but burst out laughing. Coincidentally, I had noticed that Coca-Cola was quite popular as a "Friend" and so I had started poking around to figure out why. After interviewing a few people, I found the answer: Those who were linking to coke were making an identity statement, but it wasn't the fizzy beverage that they were referring to.
Analyzing traces of people's behaviors and interactions is an extremely important research task. But it's only the first step to understanding social dynamics. You can count until you're blue in the face, but unless you actually talk to people, you're not going to know why they do what they do. What and Why are different questions. If you want to work with Big Data, you need to know which questions you can answer and which ones you can't. And projecting Why into What based on your own guesses is methodologically irresponsible. You can motivate your analysis through whatever you’d like, but if you’re going to make claims about data, you better be sure that you know what you’re measuring.
All people – not just researchers – are dreadful at actually understanding that What might have different explanations when you ask Why. This is why providing data about What can often backfire in the most amusing ways when people assume it answers Why. Consider Cobot, a software agent that once trolled LambdaMOO, collecting data about all of the netizens who resided there. Upset that Cobot wasn't giving back to the community, members demanded that Cobot's owner make it useful. So Cobot was reprogrammed to answer questions about the data it collected. It didn't take long for people to start asking who they talked to the most. And then to ask who the people they were friends with talked to. Sure enough, once Bob learned that his most frequent interlocutor Alice talked to Carol more than him, Bob was irate and stopped talking to Alice. Many in the community took the behavioral information as indicative of relationship value. And the community fell apart. Researchers aren’t the only ones who consistently mistake frequency of contact for meaningful tie strength, or who think that seeing patterns explains the reasons behind them.
#4: Be Careful of Your Interpretations
Every act of data analysis involves interpretation, regardless of how big or mathematical your data is. There's a mistaken belief that qualitative researchers are in the business of interpreting stories and quantitative researchers are in the business of producing facts. All of us are interpreting data. As computational scientists have started engaging in acts of social science, they’ve come to believe that they’re in the business of facts and not doing any interpretation. You can build a model that is mathematically sound but the moment you try to understand what it means, you're engaging in an act of interpretation. You can execute an experiment that is structurally sound, but the moment you try to understand the results, you're engaging in an act of interpretation. And how you design what you measure also stems from interpretation.
Misinterpretations are beautifully displayed when people try to implement findings into systems. My favorite example of this occurred when Friendster decided to implement Robin Dunbar’s work. Analyzing gossip practices in humans (and grooming habits in monkeys), Dunbar found that people could only actively maintain 150 relationships at any time. In other words, the maximum size of a person's personal network at any point in their life should be 150. Unfortunately, Friendster mistakenly believed that people were modeling their personal networks on the site so they took that to mean that no one should have a Friends list greater than 150. So they capped it. D'oh!
Interpretation is the hardest part of doing data analysis. And no matter how big your data is, if you don't understand the limits of it, if you don't understand your own biases, you will misinterpret it. This is precisely why social scientists have been so obsessed with methodology. So if you want to understand Big Data, you need to begin by understanding the methodological processes that go into analyzing social data.
ETHICS OF DOING BIG DATA RESEARCH
Connected to methodology are questions of ethics… And here’s where privacy and publicity come into play.
The #1 destabilizer of privacy today stems from our collective obsession with Big Data. The biases and misinterpretations that are present in the analysis and use of Big Data are fundamentally affecting people's lives. The Uncertainty Principle doesn't just apply to physics. The more you try to formalize and model social interactions, the more you disturb the balance of them. Our collective tendency to treat social data as an abstractable entity rather than soylent green puts people at risk. If you don't understand what the data is or where it comes from, how you use it can be deeply problematic. And when you implement new features based on misinterpretations, you can hurt people.
Helen Nissenbaum has long reminded us that privacy is about context. So is the interpretation of Big Data. Methodology is about working out the context in which data is collected, aggregated, and analyzed. It's about making a best guess about how your presence and analysis will affect the people being observed or used. This is precisely why we talk about ethics.
This leads us to the biggest methodological danger zone presented by our collective obsession with Big Data: Just because data is accessible doesn't mean that using it is ethical.
It terrifies me when those who are passionate about Big Data espouse the right to collect, aggregate, and analyze anything that they can get their hands on. In short, if it's accessible, it's fair game. To get here, we've perverted "public" to mean "accessible by anyone under any conditions at any time and for any purpose." We've stripped content out of context, labeled it data, and justified our actions by the fact that we had access to it in the first place. Alarm bells should be ringing because the cavalier attitudes around accessibility and Big Data raise serious ethical issues. What's at stake is not whether or not something is possible, but what the unintended consequences of doing something are.
WHAT IS PRIVACY?
Privacy is not about control over data nor is it a property of data. It's about a collective understanding of a social situation's boundaries and knowing how to operate within them. In other words, it’s about having control over a situation. It's about understanding the audience and knowing how far information will flow. It’s about trusting the people, the situation, and the context. People seek privacy so that they can make themselves vulnerable in order to gain something: personal support, knowledge, friendship, etc.
People feel as though their privacy has been violated when their expectations are shattered. This classically happens when a person shares something that wasn’t meant to be shared. This is what makes trust an essential part of privacy. People trust each other to maintain the collectively understood sense of privacy and they feel violated when their friends share things that weren't meant to be shared.
Understanding the context is not just about understanding the audience. It’s also about understanding the environment. Just as people trust each other, they also trust the physical setting. And they blame the architecture when they feel as though they were duped. Consider the phrase "these walls have ears" which dates back to at least Chaucer. The phrase highlights how people blame the architecture when it obscures their ability to properly interpret a context.
Consider this in light of grumblings about Facebook's approach to privacy. The core privacy challenge is that people believe that they understand the context in which they are operating; they get upset when they feel as though the context has been destabilized. They get upset and blame the technology.
What's interesting with technology is that unlike physical walls, social media systems DO have ears. And they're listening to, recording, and reproducing all that they hear. Often out of context. This is why we’re seeing a constant state of confusion about privacy.
Big Data isn’t arbitrary data; it’s data about people’s lives, data that is produced through their interactions with others, data that they don’t normally see let alone know is being shared. The process of sharing it and using it and publicizing it is a violation of privacy. Our obsession with Big Data threatens to destabilize social situations and we need to consider what this means. To get at this, I want to argue five points:
1) Security Through Obscurity Is a Reasonable Strategy
2) Not All Publicly Accessible Data is Meant to be Publicized
3) People Who Share PII Aren’t Rejecting Privacy
4) Aggregating and Distributing Data Out of Context is a Privacy Violation
5) Privacy is Not Access Control
#1: Security Through Obscurity Is a Reasonable Strategy
People do many things in public spaces that are not recorded. They have conversations in parks, swim in oceans, and do cartwheels down roads. How they act in public spaces depends on the context. They reasonably assume that what they do in public is ephemeral and that no one is witnessing their acts unless they're present. Mediating technologies change the equation. Surveillance cameras record those cartwheels, mobile phones trace when people are in parks, and cameras capture when people are swimming in oceans. When people know that they're being recorded, their behavior changes. When they know they're being amplified, their behavior changes. Why? Because technologies that seek to record or amplify change the situation. Still, people do what they do and technology fades into the background.
In mediated settings like Facebook, recording and amplifying are now default. The very act of interacting with these systems involves accounting for the role of technology. As people make sense of each new system, they interpret the situation and try to act appropriately. When the system changes, when the context changes, people must adjust. But each transition can have consequences.
People's encounters with social systems rely on their interpretation of the context. And they've come to believe that, even when their data is recorded, they're relatively obscure, just like they're obscure when they're in the ocean. And generally, that's pretty true. Just because technology can record things doesn't mean that it brings attention to them. So people rely on being obscure, even when technology makes that really uncertain.
You may think that they shouldn't rely on being obscure, but asking everyone to be paranoid about everyone else in the world is a very very very unhealthy thing. People need to understand the context and they need to have a sense of boundaries to cope. Even in public situations, people regularly go out of their way to ignore others, to give them privacy in a public setting. Sociologist Erving Goffman refers to this as "civil inattention." You may be able to stare at everyone who walks by but you don't. And in doing so, you allow people to maintain obscurity. What makes the Internet so different? Why is it OK to demand the social right to stare at everyone just because you can?
#2: Not All Publicly Accessible Data is Meant to be Publicized
People make data publicly accessible because they want others to encounter it. Some may hope that their content is widely distributed, but many more figure that it will only be consumed by the appropriate people. They don't want to lock it down because they want it to be accessible in the right context. Making content publicly accessible is not equal to asking for it to be distributed, aggregated, or otherwise scaled.
Paparazzi make celebrities' lives a living hell. They argue that they have the right to photograph and otherwise stalk a celebrity whenever that person is "in public." As a result, celebrities are often reclusive, staying at home where they can't be bothered or actively seeking protection when they leave the house.
When we argue for the right to publicize any data that is publicly accessible, we are arguing that everyone deserves the right to be stalked like a celebrity. Even with the money and connections to actually maintain some kind of control, many celebrities go crazy or even die trying to navigate paparazzi. What might be the psychological consequences of treating everyone this way?
#3: People Who Share PII Aren’t Rejecting Privacy
Historically, our conversations about privacy centered on "personally identifiable information" or PII. When we're thinking about governments and corporations, we usually resort back to PII. But people regularly share their name or other identifying information with others for all sorts of legitimate reasons. They almost always share PII when engaging in a social interaction. Social media is all about social interactions so, not surprisingly, people are sharing PII all the time. What they care about, what they're concerned about is PEI: Personally Embarrassing Information. That's what they're trying to maintain privacy around.
Too many people working with Big Data assume that people who give out PII want their data to be aggregated and shared widely. But this isn’t remotely true. And they certainly don’t want PEI mixed with PII and spread widely. They share data in context and, by and large, they want it to remain in context.
#4: Aggregating and Distributing Data Out of Context is a Privacy Violation
I've said it before and I'm going to say it again: Context Matters. There are two kinds of content that we focus on when we think about Big Data - that which is shared explicitly and that which is implicitly derived. There's a nice parallel here to what sociologist Erving Goffman describes as that which is given and that which is given off. When people share something explicitly, they assess the situation and its context and choose what to share. When they produce implicit content, they're living and breathing the situation without necessarily being conscious of it. Context still matters. It shapes the data that’s produced and what people’s expectations are.
When you take content produced explicitly or implicitly out of its context, you're violating social norms. When you aggregate people’s content or redistribute it without their consent, you're violating their privacy. At some level, we know this. This is precisely why we force people to sign contracts in the form of Terms of Use that take away their right to demand contextual integrity. This may be legal, but is it ethical? Is it healthy? What are the consequences?
#5: Privacy is Not Access Control
When we talk about privacy in technical circles, it's hard to get past the technical issue: How does one represent privacy? We have a long history of thinking of content as public or private, of representing privacy through numerical sequences like 700. But this collapses two things: privacy and accessibility. File permissions are about articulating who can and cannot access something. Privacy is about understanding the social conditions and working to manage the situation. Limiting access can be one mechanism in one's effort to maintain privacy, but it is not privacy itself. Privacy settings aren't privacy settings; they're accessibility settings. Privacy settings should be about defining the situation and communicating one's sense of the situation to others.
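As a small illustration of that collapse, here is a sketch, assuming a Unix-like system, of what mode bits such as 700 actually encode. The filename and contents are invented.

```python
import os
import stat
import tempfile

# A file locked down to its owner, Unix-style "700" (hypothetical content).
path = os.path.join(tempfile.gettempdir(), "journal_entry.txt")
with open(path, "w") as f:
    f.write("a post meant for three close friends")
os.chmod(path, 0o700)

mode = stat.S_IMODE(os.stat(path).st_mode)
print(oct(mode))  # 0o700 on Unix: the owner may read, write, and execute; nobody else may

# The mode bits answer exactly one question: who *can* access the bytes.
# They say nothing about the situation itself: who the post was written for,
# what readers are expected to do with it, or how far it should travel.
# That expectation is the privacy part, and it has no bit field.
```

The access question fits in three octal digits; the contextual question does not.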
In LiveJournal, it's common for participants to lock a post and then write at the top of the post everyone who has access to it. This process is context setting; it's letting everyone who can see the post understand the situation in which the post is being produced and who is expected to be in the conversation. It's dangerous to read the accessibility settings and assume that this conveys the privacy expectations. Unfortunately, because access controls are so common, we've lost track of the fact that accessibility and privacy are not the same things. And privacy settings don't address the core problem.
… And Publicity Twists it All
All five of these issues present ethical questions for Big Data. Just because we can rupture obscurity, should we? Just because we can publicize content, should we? Just because we can leverage PII, should we? Just because we can aggregate and redistribute data, should we? While I’ve laid out some challenging issues, the answers to these questions aren't clear.
Social norms can change and are changing, but that doesn't mean that privacy has been thrown out the door. People care deeply about privacy, care deeply about maintaining context. But they also care about publicity, or the right to walk out in public and be seen. Technology has provided new opportunities for people to actively seek to distribute their content. They can and should have a right to leverage technology to demand a presence in public. And technology that helps them scale is beneficial. The problem is that it's hard to differentiate between publicly accessible data that is meant to be widely distributed and that which is meant to simply be accessible. It's hard to distinguish between the content that people want to share to be aggregated for their own gains and that which is never meant for any such thing. It's hard to distinguish between PII that is shared for social purposes and that which is shared as a self-branding exercise.
This goes back to our methodological conundrum with Big Data. Not all data are created equal and it's really hard to make reasonable interpretations from 30,000 feet without understanding the context in which content is produced and shared. Treating data as arbitrary bytes is bound to get everyone into trouble. So we’re stuck with an ethical conundrum: do we err on the side of making sure that we care for those who are most likely to be hurt or do we accept the costs of exposing people?
With this in mind, let's talk about Facebook.
FACEBOOK'S PRIVACY HICCUPS
When Facebook first launched in 2004, it started as a niche social network site that was only accessible to those privileged enough to have a Harvard.edu email address. As it spread to other universities, it built its reputation on being a closed system. People trusted the service because they felt it provided boundaries that helped people navigate social norms. As it grew, it was interpreted as the anti-MySpace. While MySpace was all about publicly accessible content, Facebook was closed and intimate, a more genuine "place for friends." As I roamed the United States interviewing teens and others, I was continuously told that Facebook was more private. For some, that was the precise reason that they loved the site.
First impressions matter and people will go to great lengths to twist any new information that they receive to fit their first impression rather than trying to alter it. To this day, many average people still associate Facebook with privacy. They believe that they understand how their information flows on Facebook and believe that they understand the tools that allow them to control what is going on. Unfortunately, their confidence obscures the fact that most don't actually understand the privacy settings or what they mean. It's precisely this mismatch that I want to discuss.
During its tenure, Facebook has made a series of moves that have complicated people's understanding of context, resulting in numerous outpourings of frustration over privacy.
The first major hiccup concerned the launch of the News Feed. When Facebook introduced the News Feed in 2006, people were initially outraged. Why? Every bit of content that the News Feed displayed was already accessible. But the News Feed took implicit content and publicized it in a new way. If you were stalking someone and they changed their relationship status from "In a Relationship" to "Single," you might have noticed. But with the News Feed, this change was announced in a stream to all of your Friends. Accordingly, you could see every move that your Friends made on Facebook. Publicizing accessible data was a game changer. People got upset because it changed the context and created uncountable embarrassing moments for people. Their loud anger forced Facebook to create tools so that users could choose what was and was not shared via the News Feed.
This is not to say that the News Feed wasn't a success. It has proven to be a key feature in Facebook, but its introduction forced users to adjust, to reinterpret the context of Facebook. They changed their behavior. They learned to live with the News Feed, to produce content with the News Feed in mind. They stopped changing profile content that would result in implicit updates. They removed many implicit updates from their settings. Users who joined after 2006 took the News Feed for granted and developed a set of norms with it as a given. That transition was rocky but it was possible because everyone could see that it was happening. The feature was so visible that people could learn to work with it.
The second hiccup had a different trajectory. This was Beacon, an advertisement system launched in 2007 that allowed external websites to post information to users' News Feed based on their activities with the external website. So, imagine that you're surfing Blockbuster and decide to rent Kill Bill. This tool would post a message from you to your Friends on the News Feed noting that you rented Kill Bill. Most people had no idea how this was happening. Those who figured it out learned the hard way. Beacon was disconcerting because it made individual people vessels for advertising to their Friends. It took their implicit actions on other websites and posted them. While users could turn Beacon off, they were opted in by default. And it wasn’t nearly as visible as the News Feed. Most people only learned that this was happening when a true problem occurred and they had to clean up the mess.
In 2008, a class action lawsuit was filed against Facebook and its partners. One of the examples given during this case was of a young man who purchased a diamond ring on a website only to learn that his Facebook announced his purchase, thereby wrecking his plans of a romantic proposal to his girlfriend. Beacon was dismantled last September and in October, Facebook settled the case. Regulatory involvement here stemmed from the fact that, unlike News Feed, people didn’t see what was happening with Beacon, didn’t understand where the data was coming from.
The third notable hiccup came in December when Facebook decided to invite users to change their privacy settings. The first instantiation of the process asked users to consider various types of content and choose whether to make that content available to "Everyone" or to keep their old settings. The default new choice was "Everyone." Many users encountered this pop-up when they logged in and just clicked on through because they wanted to get to Facebook itself. In doing so, these users made most of their content visible to "Everyone", many without realizing it. When challenged by the Federal Trade Commission, Facebook proudly announced that 35% of users had altered their privacy settings when they had encountered this popup. They were proud of this because, as research has shown, very few people actually change the defaults. But this means that 65% of users made most of their content publicly accessible.
Facebook is highly incentivized to encourage people to make their data more publicly accessible. But most people would not opt-in to such a change if they understood what was happening. As a result, Facebook’s initial defaults were viewed as deceptive by regulators in Canada and Europe. I interviewed people about their settings. Most had no idea that there was a change. I asked them to describe what their privacy settings were and then asked them to look at them with me; I was depressed to learn that these never matched. (Notably, everyone that I talked to changed their settings to more private once they saw what their settings did.)
Facebook has slowly dismantled the protective walls that made users trust Facebook. Going public is not inherently bad - there are plenty of websites out there where people are even more publicly accessible by default. But Facebook started out one way and has slowly changed, leaving users either clueless or confused or outright screwed. This is fundamentally how contexts get changed in ways that make people's lives really complicated. Facebook users are the proverbial boiling frog - they jumped in when the water was cold but the water has slowly been heating up and some users are getting cooked.
Just last week, Facebook introduced two new features to connect Facebook data with external websites: Social Plugins and Instant Personalizer. Unlike Beacon, this system is more about bringing publicly accessible Facebook data to the third party websites; permission is required for the external websites to post back to Facebook. But if you go to Pandora or CNN, you might find that your Friends also like the music you’re listening to or have shared articles on CNN. You too can click "Like" on various websites and report back to Facebook the content you value. And slowly but surely, data about your tastes and interactions across the web are being aggregated along with your profile and connected to others who you know.
The goal is to give people a more personal web experience. But users don’t understand how it works, let alone how to truly turn it off. And Facebook doesn’t make it easy to opt out entirely; you have to opt out of each partner site individually, both on Facebook and on the partner site. And your friends might still leak your information. Social Plugins and Instant Personalizer aren’t inherently bad things, but they rely on people making their data public. They rely on the December changes that no one understood. And for this reason, all sorts of people are making their content extremely accessible without knowing it, without choosing to do it, and without understanding the consequences.
Healthy social interaction depends on effectively interpreting a social situation and knowing how to operate accordingly. This, along with an understanding of how information flows, is central to the process of privacy. When people cannot get a meaningful read on what's happening, they are likely to make numerous mistakes that are socially costly. Social Plugins and Instant Personalizer are more like Beacon than like News Feed. They’re not shoved in people’s faces; people don’t understand what’s happening and don’t know how to adjust.
Facebook does a great job of giving people lots of settings for adjusting content's visibility, but they do a terrible job of making them understandable. Even when they inform people that change is underway, they opt people in by default rather than doing the work of convincing people that a new feature might be valuable to them. The opt-out norm in Facebook - and on many other sites - is not in the best interest of people; it's in the best interest of companies.
Facebook could go further in helping people understand how visible their content is. When you post a status update, you see a little lock icon that tells you which groups can see that particular content. They could easily let you see all of the individuals that this includes, or at least the number of people that can see the content. If you knew the magnitude of visibility, you might think differently about what you share. In fact, this precise feature is used by corporate Exchange servers to keep people from spamming mailing lists. When I write a message in Outlook to a mailing list on our internal server, I'm told that the post will go out to 10,432 people. This inevitably makes me think twice. But I fear that Facebook doesn’t want people to think twice.
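To show how cheap such a nudge would be to compute, here is a rough sketch over a toy friend graph. The names, the graph, and the simplified visibility semantics are all invented for illustration; this is not Facebook's actual model.

```python
# Toy friend graph (invented): user -> set of that user's friends.
friends = {
    "me":    {"alice", "bob"},
    "alice": {"me", "carol", "dave"},
    "bob":   {"me", "carol", "erin", "frank"},
}

def audience_size(user, setting):
    """Rough count of who could see a post, under simplified visibility semantics."""
    direct = friends.get(user, set())
    if setting == "friends":
        return len(direct)
    if setting == "friends_of_friends":
        reach = set(direct)
        for friend in direct:
            reach |= friends.get(friend, set())
        reach.discard(user)
        return len(reach)
    raise ValueError("unknown setting")

# Surfacing this number next to the lock icon would be the Outlook-style nudge.
print(audience_size("me", "friends"))             # 2
print(audience_size("me", "friends_of_friends"))  # 6
```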
Facebook could also tell you all of the services that have accessed your data through their APIs and all of the accounts that have actually looked at any particular item of content. People do actually want this feature. Countless teens installed what were actually phishing programs into their MySpaces because the service promised to tell them who was looking at their profile. They want the feature, but it’s not available to them, because it’s not in any company’s best interest: it is more likely to stifle participation than encourage it.
When Facebook makes a change that’s in people’s faces, people react extremely negatively. When it makes a change that’s not as visible, people don’t understand what’s happening until it’s too late. That’s a dangerous cycle to get into, especially when you think of all of the third parties who are engaged in exposing people without them realizing it.
WALKING A BIG DATA TIGHTROPE
Many of you are playing with Facebook's data. Others of you wish to be. The Social APIs are exciting and there are so many possibilities. Facebook has one of the most fascinating sets of Big Data out there. It reveals traces of people's behavior around the globe and provides the richest articulated social network out there. But you're also playing with fire. Much of the data that is publicly accessible was not meant for you to be chomping away at. And distinguishing between what should be publicized and what shouldn't be is impossible. People are engaging with Facebook as individuals and in small groups, but the aggregate of their behavior is removed from that context. People are delighted by the information about their Friends that provides better context, but completely unaware of how their behaviors shape what their Friends see. This creates challenging ethical questions that are not going to be easy to untangle.
People don't seek privacy because they have something to hide; they hide because they want to maintain privacy. They seek privacy because they are social creatures who want to understand the context and manage information accordingly. They seek privacy because they want to be socially appropriate and make themselves vulnerable to those around them. People hide in plain sight all the time, but this is getting trickier and trickier with each new technology.
Technology creates all sorts of new mechanisms by which we can walk out into public and engage, share, connect. It also creates fascinating new opportunities for researchers to get access to data. But these advantages are not without their complications. It's easy to swing to extremes, preaching about the awesomeness of all of these new technologies or condemning them as evil. But we know that reality is much more complicated and that the pros and cons are intricately intertwined. Teasing out how to walk the tightrope of privacy and publicity is going to be a critical challenge of our era.
As I conclude, let me bring Larry Lessig into the picture for a moment. A decade ago, Larry published his seminal book "Code" where he argued that change is regulated by four factors: the market, law, social norms, and architecture or code. The changes that we're facing with privacy and publicity have been brought about because of changes in architecture thanks to code. It is possible to do things today that were never previously available. As we've seen, this has introduced all sorts of new market opportunities and we're watching as the market is pushing privacy and publicity in one direction. As I've outlined today, social norms are much more messy and not even remotely stabilized. To date, the law has been relatively uninvolved in what's happening. This won't last. We're already seeing grumblings in Europe and Canada and, to a lesser degree, the US about what is unfolding. But where the law will fall on these issues is quite unclear. As technologists, you need to be aware of these other regulatory actors and the ways in which they are part of the ecology, part of trying to balance things out.
As a community, WWW is the home of numerous standards bodies, Big Data scholars, and developers. You have the technical and organizational chops to shape the future of code, the future of business, the direction law goes. But you cannot just assume that social norms will magically disappear overnight. What you choose to build and how you choose to engage with Big Data matters. What is possible is wide open, but so are the consequences of your decisions. As you're engaging with these systems, I need you to remember what the data you're chewing on is. Never forget that Big Data is soylent green. Big Data is made of people. People producing data in a context. People producing data for a purpose. Just because it's technically possible to do all sorts of things with that data doesn't mean that it won't have consequences for the people it's made of. And if you expose people in ways that cause harm, you will have to live with that on your conscience.
Privacy will never be encoded in zeros and ones. It will always be a process that people are navigating. Your challenge is to develop systems and do analyses that balance the complex ways in which people are negotiating these systems. You are shaping the future. I challenge you to build the future you want to inhabit.
Thank you!