by Rachel Shadoan

We live in an age of unprecedented access to information. The Internet is an ocean of it, containing nearly limitless depths for learning and connecting. This is a tremendous boon for humanity: our understanding of reality is constructed from the information we are exposed to, and the Internet provides the opportunity for constructing an understanding that is richer and more complete than we could ever hope to construct without it.

If our understanding of reality can only be constructed from the information we are exposed to, however, controlling access to information is tantamount to controlling reality. Access to information on the Internet is controlled by inscrutable algorithms owned by companies that are legally obligated to prioritize profit, which may or may not align with the best interests of humanity. In short, our reality is increasingly constructed by algorithms—and we have no way to inspect the rules being used to construct it.

In a perfect “Don’t Be Evil” world, the opacity of the algorithms controlling our access to information would not be cause for alarm. As it stands, we live in a world where the companies controlling these algorithms use them to filter information out of the reach of entire countries—and that’s just the “curation” that we know about.

[Image: two women using a computer together. Credit: #wocintech]

Without algorithm transparency, we have no way of knowing why we are shown certain information, and, crucially, what information we are not seeing. Without algorithm transparency, we have no frame of reference for critiquing our own understanding, no way to know if the information used to construct our conception of reality is in any way representative. Opaque algorithms cripple our ability to reason about what we know.

This problem will not resolve itself overnight, no matter how vital algorithm transparency is to the future of thinking. There are many reasons opaque algorithms are the status quo. The kind of complex algorithms that act as portals to the Internet often represent a sizable investment; maintaining their opacity stymies competitors. Opaque algorithms are more difficult to exploit; opacity can discourage spam. Opacity is highly convenient and will not be easily vanquished. As a result, it is necessary to be able to investigate and reason about opaque algorithms. Let’s begin with a closer look at algorithms and how they’re constructed.

What is an algorithm?

In the most general terms, an algorithm is a finite series of steps which, when executed, produces outputs. Algorithms can be very simple, like those used to add numbers, or very complex, like the ones that make up Google search; both complete a series of steps, and produce outputs. Algorithms may take inputs—numbers or search terms, for example—but some algorithms require no inputs at all. While we typically associate algorithms with computing, many of the analogue activities we engage in every day, like making a sandwich or washing laundry, are also algorithms.

For a laundry-washing algorithm, the input is dirty laundry and the output is clean, wet laundry. If we’re washing laundry by hand, the steps of the algorithm are:

  • Place dirty laundry in the sink.
  • Fill the sink with water and add detergent.
  • Agitate the laundry in the water until it reaches the desired level of cleanliness.
  • Drain the water from the sink.
  • Rinse the laundry.
  • Squeeze out excess water.

In addition to the inputs, outputs, and well-defined steps, algorithms may have control surfaces that allow you to alter the way steps are carried out, as well as internal state that keeps track of which step to execute next. In our laundry washing example, the temperature of the water and the amount of detergent are control surfaces.

The internal state of the algorithm could be stored in two ways. The simplest state would be the number of the last executed step; with that information, we know which step to execute next. A more complicated state would be observations about the washing environment from which we can deduce what step we are on: whether the laundry is wet or dry, the fill level of the water, the cleanliness of the water, whether the water level is rising or falling. If the sink is full of water, the laundry is wet, and the water is dirty, then we are in step 3 of the algorithm.
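
To make these pieces concrete, here is the hand-washing algorithm sketched in Python. All of the names are invented for illustration; the point is to show where the inputs, the control surfaces, and the internal state live.

    def wash_laundry(dirty_laundry, water_temperature="warm", detergent_scoops=1):
        # Input: dirty_laundry.
        # Control surfaces: water_temperature, detergent_scoops.
        # Internal state: `step`, the number of the last executed step.
        step = 0

        sink = list(dirty_laundry)            # step 1: place dirty laundry in the sink
        step = 1

        water = {"level": "full",             # step 2: fill the sink, add detergent
                 "temperature": water_temperature,
                 "detergent": detergent_scoops}
        step = 2

        laundry = [item.replace("dirty", "clean") for item in sink]  # step 3: agitate
        step = 3

        water["level"] = "empty"              # step 4: drain the water from the sink
        step = 4
        step = 5                              # step 5: rinse the laundry
        step = 6                              # step 6: squeeze out excess water

        return laundry                        # output: clean, wet laundry

    print(wash_laundry(["dirty shirt", "dirty socks"], water_temperature="hot"))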

Now that we’re more familiar with algorithms, let’s look at what it means for an algorithm to be transparent.

What does it mean for an algorithm to be transparent?

A transparent algorithm is one that facilitates scrutiny of itself. Specifically, a transparent algorithm allows a user to examine:

Inputs. A transparent algorithm allows a user to view and interrogate all of its inputs. It should be clear why any particular piece of input is necessary. In our laundry algorithm, it is intuitive that dirty laundry is required as an input. If the algorithm required us to input mayonnaise in addition to dirty laundry, that would require significant justification in order to be transparent.

Control surfaces. If there are settings to control the way an algorithm executes its steps, those settings should be clearly identified and the resulting impact on outputs clearly described. If we set the water temperature to “hot” in our washing algorithm, the water should indeed be hot; it also should be made clear that hot water produces cleaner laundry but may result in fading or shrinkage.

Assumptions and models the algorithm uses. Algorithms make certain assumptions about the inputs they will receive, as well as what the user wants. These assumptions need to be described in detail, so that users can evaluate whether the algorithm is producing results in line with their needs. Our laundry algorithm assumes that the water from the tap is clean, and that rinsing the laundry with water from the tap will not render it dirty again.

Justification for outputs produced. For any given output, a user needs to be able to answer the question, “Why was this output produced from the inputs I provided?” If we put dirty laundry in the sink and executed our washing algorithm, it should produce clean laundry. If it instead produces a sack of wet weasels, we should be able to investigate why it produced a sack of wet weasels rather than clean laundry.

Algorithm steps and internal state. The processes that the algorithm executes and its internal state must be open to the user. Internal state is arguably the most difficult part of an algorithm to make transparent to a user. While state is fairly straightforward to conceptualize in simple algorithms, it becomes increasingly difficult as complexity grows. This is especially true of machine learning algorithms whose internal state is constructed over time rather than programmed as a fixed set of rules.
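
As a thought experiment, here is what a toy algorithm might look like if it tried to meet that standard: it exposes its inputs and settings in its signature, records each step it takes, and returns a justification alongside its output. This is a hypothetical sketch, not any real service’s interface.

    def transparent_sort(items, reverse=False):
        # Inputs and control surfaces are visible in the signature;
        # `trace` records the steps taken; `why` justifies the output.
        trace = [f"received inputs: {items!r}",
                 f"control surface 'reverse' set to {reverse}",
                 "sorting items by their natural ordering"]
        output = sorted(items, reverse=reverse)
        why = (f"produced {output!r} because the inputs were arranged in "
               f"{'descending' if reverse else 'ascending'} order")
        return {"output": output, "trace": trace, "why": why}

    report = transparent_sort([3, 1, 2])
    print(report["why"])
    for line in report["trace"]:
        print(" -", line)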

Of course, most of the algorithms that we use to access information fall far short of this standard of transparency. In that case, how do we figure out as much as possible about an opaque algorithm?

How can I reason about an opaque algorithm?

When we reason about an opaque algorithm, the big question we are trying to answer is “Why are you showing me this piece of information?” Since we cannot answer that question directly, we need to ask other questions to find clues about what the algorithm might be using to choose what to show us.

1. How does this service make money?

As I mentioned previously, most of the algorithms controlling access to information, especially on the Internet, are owned by for-profit entities. Those algorithms exist to make money for the companies that own them. If we are not paying for a service, it is safe to assume that we are the product: an audience to be packaged and sold to advertisers.

Advertisements appear both separate from the content, such as Google AdWords and Facebook’s sidebar ads, and blended into it as sponsored content, such as Twitter’s “Promoted Tweets” and Amazon’s “Sponsored Products”. The FTC requires that sponsored content be labeled; look carefully for such disclosures, as there is incentive to make them only as obvious as is legally required.

Once you have determined how a company is bringing in revenue, ask yourself, “What could this company be doing to increase the amount of money they bring in?” Assume that their algorithms will be tailored to maximize revenue.

2. What information does this service have about me?

Facebook has a wealth of information about us—when we were born, where we were born, when we moved to a new city, where we went to college, how many siblings we have, the names of our children—the list goes on and on. Every detail we provide through our user profile, timeline events, and other settings is a piece of information that can be used as input to Facebook’s algorithms. However, any website you connect to has at least a little information about you. When you navigate to a URL, your browser makes an HTTP request, which includes details about the browser and operating system you are using to access the website. Your IP address, which tells the server where to send the content you requested, is sent with the HTTP request. An IP address gives an approximate geographic location, from which services often infer your language. Google and Twitter, for instance, both filter available results based on geographic location.
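
You can watch this happen with a few lines of Python’s standard library: run this minimal server, point your browser at http://localhost:8000/, and see what your browser volunteers. (The server, port, and handler name are just for demonstration.)

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class NosyHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            ip, _port = self.client_address          # where the reply gets sent
            user_agent = self.headers.get("User-Agent", "unknown")
            print(f"IP address:  {ip}")
            print(f"Browser/OS:  {user_agent}")
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"Check the server's console for what you just revealed.")

    HTTPServer(("localhost", 8000), NosyHandler).serve_forever()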

3. What ways can I interact with the content on this service?

In addition to the static information about you, services also have access to the history of your interactions. This is especially true of any service that requires you to log in, but your IP address, which is often consistent if you connect from the same places, can also be used to track your interaction history. Many websites also use cookies to uniquely identify users and track their interactions with the site. Any interaction you can make with a website can be logged and then used as input to the algorithms that select the content to display. Twitter’s “While You Were Away” feature appears to work in this manner: based on the tweets you have interacted with in the past, it generates a list of similar tweets posted since you last used the service. Even clicks and views count as interactions; Google Search will rank results higher if you have clicked on them in the past.
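
A bare-bones sketch of that feedback loop, with an invented logging scheme: every click is recorded, and recorded clicks nudge future rankings. Real services track far more than this.

    from collections import Counter

    click_log = Counter()   # one user's interaction history, keyed by URL

    def record_click(url):
        click_log[url] += 1

    def rank_results(results):
        # Results you have clicked before float toward the top.
        return sorted(results, key=lambda url: click_log[url], reverse=True)

    record_click("example.com/recipes")
    record_click("example.com/recipes")
    print(rank_results(["example.com/news", "example.com/recipes"]))
    # -> ['example.com/recipes', 'example.com/news']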

4. What information does this service have about its content?

A service doesn’t only have information about you—it has information about all the content appearing on its servers. Facebook has the text of every post, as well as the timestamp of when it was posted, who posted it, and at least an approximate geographic location of where it was posted. Amazon has a wealth of information about each product it sells, from the size of the product to the materials it’s made of to whether or not people liked it. It also has not only your history of interaction with each product, but everyone else’s as well. This information is often used to calculate the similarity between pieces of content, which in turn is used to suggest content similar to what you are currently interacting with or have interacted with positively in the past. Amazon’s “Customers Who Viewed This Item Also Viewed” list and Pinterest’s “Suggested Pins” both use the interaction history of other users to suggest content you may like, while Facebook’s News Feed appears to use the content and poster of status updates that you have liked or commented on to choose what to show you next time.
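
Here is a toy version of that idea, using invented viewing histories: count how often two products show up in the same person’s history, and recommend the most frequent companions. Production recommenders are far more sophisticated, but the raw material (everyone’s interaction history) is the same.

    from collections import Counter
    from itertools import combinations

    view_histories = {
        "alice": {"kettle", "teapot", "mug"},
        "bob":   {"kettle", "mug"},
        "carol": {"teapot", "mug", "spoon"},
    }

    # Count how often each pair of products was viewed by the same person.
    co_views = Counter()
    for items in view_histories.values():
        for a, b in combinations(sorted(items), 2):
            co_views[(a, b)] += 1

    def also_viewed(product, top_n=3):
        scores = Counter()
        for (a, b), count in co_views.items():
            if a == product:
                scores[b] += count
            elif b == product:
                scores[a] += count
        return [item for item, _ in scores.most_common(top_n)]

    print(also_viewed("kettle"))   # -> ['mug', 'teapot']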

Once you have developed a list of all the things an algorithm could be using to curate the information it shows you, you can start changing the information the algorithm has access to and see what changes. Here are some ideas to get you started:

Compare the Google search results from the same search string:

  • while logged in to your Google account versus logged out
  • while logged out of your Google account but on your home network, versus using a VPN or proxy server to obscure your IP address
  • while logged in to your Google account, after repeating the same search 20 times, and clicking on the 10th search result each time

Compare your Facebook News Feed and ads:

  • immediately before changing your gender versus immediately after
  • immediately before changing your relationship status versus immediately after
  • before clicking “Show me less like this” on a story versus after

Compare who appears in your “While You Were Away” feed:

  • after one week of favoriting a significant number of tweets from an account you follow but have never favorited
  • after one week of favoriting a significant number of tweets from an account you follow and favorite all the time
  • after one week of favoriting every picture with a particular hashtag tweeted by someone you follow
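
However you run these experiments, record what you see. Even a few lines of Python can quantify the difference between two variants, assuming you’ve copied each result list into the script by hand (the URLs below are placeholders):

    logged_in  = ["a.com", "b.com", "c.com", "d.com"]   # results seen logged in
    logged_out = ["b.com", "a.com", "e.com", "d.com"]   # results seen logged out

    shared = set(logged_in) & set(logged_out)
    print(f"shared results: {len(shared)} of {len(logged_in)}")
    print(f"only when logged in:  {sorted(set(logged_in) - set(logged_out))}")
    print(f"only when logged out: {sorted(set(logged_out) - set(logged_in))}")

    # Rank changes among the shared results:
    for url in sorted(shared):
        moved = logged_out.index(url) - logged_in.index(url)
        if moved:
            direction = "down" if moved > 0 else "up"
            print(f"{url} moved {abs(moved)} position(s) {direction} when logged out")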

We can’t know for sure what happens inside the black box of an opaque algorithm, but with careful consideration and some experimentation, we can make some good guesses. Until algorithm transparency is the status quo, those educated guesses are among our best tools to ensure we understand what information is being delivered to us “automagically” by algorithm—and what information is missing.

Rachel Shadoan is the co-founder and CEO of Akashic Labs, a Portland-based research and development consultancy, where she specializes in combining data science and UX research methodologies to provide rich and accurate answers to technology’s pressing questions. Her hobbies include distributed systems, gardening, and live-tweeting Mister Rogers’ Neighborhood.