by Valerie Collins and Nicole Contaxis
Registration forms are easy to ignore. They are filled out and whisked away, yet they invade the most personal aspects of our lives. From bank applications, clinical intake forms, university or job or housing applications, and research surveys on any topic to registration for military service, the local gym, or a social media app, these forms of data collection are ubiquitous. The majority of forms are so familiar that they may seem interchangeable. Name, address, phone number, age: these are common facets of this type of data collection instrument, and providing this information doesn’t usually require too much thought.
When we move into other common pieces of data collection, such as race, ethnicity, and gender, it may become more difficult to provide an accurate response within the restrictions of the form. A poorly designed survey or registration form has the capacity to erase identities and perpetuate tensions between how the data-collector perceives the world and how the subject of data collection perceives themself.
Governments, researchers, and businesses employ registration or survey forms to collect information locally, statewide, nationally, and internationally. While a survey generally requests information from individuals in specific populations on a given topic of interest, registration forms request information from an individual in order to be able to provide a service to that person. The data from either of these forms can allow subsequent users of the data to understand demographic change and other population-level phenomena.
For those of us who do not have difficulties finding the correct box to check (the authors generally find White/Caucasian and Woman/Female on forms that require this information), it can be easy to ignore the social and political dynamics that shape the creation, dissemination, and completion of such a form, and the subsequent analysis and reporting of the data. If one’s identity is not generally or accurately described in a registration or survey form, however, these dynamics can become more apparent.
A survey or registration form is not inert; it plays an active role in data collection. In some cases, the process of data collection can perpetuate quiet discrimination against minority populations. If data-collectors only ever create survey or registration forms with five racial categories and two gender categories, for example, then the form implicitly asserts to those filling it out that these are the limits of identity. A form that overly restricts the available categories, or that fails to ask relevant questions, can also strip intersectional detail from the data, and thus damage the conclusions drawn from it.
Data are important for understanding a population, but the collection of data is rarely straightforward. Data collection can be an opaque process, starting with the decisions about which questions to include in a form and, perhaps just as importantly, how to frame questions that address identity. For example, we can say that all U.S. WWI draft cards collected data on race. But if we wanted to investigate race and the military in WWI, we would have to know that there were three draft registrations during the war. The first registration card (June 5, 1917) included an open-ended question on race: “Specify which?” The second registration card (June 5, 1918) changed this to a closed-ended question: “White, Indian, Negro.” The third registration card (September 12, 1918) changed these categories to: “White, Negro, Indian (specify citizen or non-citizen) and Oriental.” The specific wording of other questions changed as well, but this shift in racial categories over the course of a single war shows how constructions of identity reflect their time, and how they complicate long-term comparisons of data.
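To make the difficulty concrete, the hypothetical sketch below attempts to harmonize answers from the three registrations into a single scheme. The per-card category lists come from the cards described above; the harmonized labels and the mapping choices are our own invention for illustration, not an official crosswalk.

```python
# Hypothetical sketch of harmonizing race data across the three WWI draft
# registrations. Category lists follow the cards described above; the
# harmonized labels and mappings are invented for illustration.

CATEGORIES_BY_DRAFT = {
    "1917-06-05": None,  # open-ended: "Specify which?" (free text)
    "1918-06-05": ["White", "Indian", "Negro"],
    "1918-09-12": ["White", "Negro", "Indian (citizen)",
                   "Indian (non-citizen)", "Oriental"],
}

# Any single harmonized scheme forces lossy choices: free-text answers
# from the first draft must be coded after the fact, the citizenship
# distinction from the third draft disappears, and "Oriental" has no
# counterpart at all in the second draft.
CROSSWALK = {
    ("1918-06-05", "White"): "White",
    ("1918-09-12", "White"): "White",
    ("1918-06-05", "Negro"): "Black",
    ("1918-09-12", "Negro"): "Black",
    ("1918-06-05", "Indian"): "American Indian",
    ("1918-09-12", "Indian (citizen)"): "American Indian",
    ("1918-09-12", "Indian (non-citizen)"): "American Indian",
    ("1918-09-12", "Oriental"): "Asian",
}

def harmonize(draft: str, answer: str) -> str:
    """Map a recorded answer onto the common scheme, or flag it."""
    return CROSSWALK.get((draft, answer), f"UNMAPPABLE: {answer!r} ({draft})")

print(harmonize("1918-09-12", "Indian (non-citizen)"))  # American Indian
print(harmonize("1917-06-05", "Filipino"))  # UNMAPPABLE: free text, uncoded
```

Even in this toy version, the harmonized labels are themselves a modern editorial choice, which is exactly the point: any long-term comparison quietly encodes someone's decisions about identity.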

The proliferation of digital technology and the increasing sophistication of online tracking mean that new avenues of large-scale collection of individual and population data are available to those with the money and interest. Our focus in this article is restricted to instances where a data collector formulates a question and an individual responds, such as the U.S. Census. The Census was designed to decide how to allocate representation in the federal government across the nation. Over time, it has also come to serve as a way to capture the demographics of the U.S. as they have changed. This type of data collection provides a unique look into the dynamics at work, both in the deliberate construction of categories around identity and in the impact this kind of data can have.
In the U.S. decennial census, the Office of Management and Budget acknowledges that “Asian Indians, for example, were counted as ‘Hindus’ in censuses from 1920 to 1940, as ‘White’ from 1950 to 1970, and as ‘Asians or Pacific Islanders’ in 1980 and 1990.” As consumers of the data, we have to acknowledge our own assumptions about what categories like “White” or “Caucasian” represent when we see data reported out, and we also have to question what they represent when we include them on any kind of data collection form. Because policy-makers, researchers, and many organizations make decisions based on the data the Census collects, the questions used in its data collection have large ramifications.

A survey or registration form is designed with the aim of collecting information. Since a poorly designed questionnaire can limit subsequent use and analysis of the data, one doesn’t have to look very far to find material on how to avoid creating one. A poorly designed questionnaire can fail to elicit the responses that its creators were prepared to analyze. There are also contexts in which it is inappropriate, irrelevant, or illegal to ask certain questions about a person. Thus, for us, a poorly designed form that erases identity is one that: (1) asks a relevant question poorly; or (2) fails to ask a question relevant to the form’s purpose.
An individual filling out a survey or registration form can only answer the questions they are asked, which means that they may find themselves at odds with a form that does not allow them to provide an accurate answer. A closed-ended question may not include all relevant options, or the options may be poorly worded. An open-ended question can be harder to analyze and may be left out because of the extra burden it places on both participant and collector. Of course, there is no perfect way to resolve this tension. Comparing data over time is far easier with a controlled answer set, but limiting the answers will blur some facets of a person’s identity. To give a registration form example, Spotify currently allows users to designate themselves as one of three genders: (1) male; (2) female; (3) non-binary (Spotify). This is a clear improvement over having only two gender options, but it still limits how people can identify themselves. One also wonders why a digital music service needs this information.
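One common mitigation is to pair a closed list with a self-describe option and an explicit way to decline. The minimal sketch below illustrates that pattern; the option labels and field names are our own assumptions, not a recommendation from any standard or from Spotify.

```python
# Minimal sketch of a gender question that pairs a closed list with a
# self-describe option. The option labels and field names are
# illustrative assumptions, not an official or recommended scheme.
from dataclasses import dataclass
from typing import Optional

GENDER_OPTIONS = [
    "Woman",
    "Man",
    "Non-binary",
    "Prefer to self-describe",  # unlocks a free-text field
    "Prefer not to say",
]

@dataclass
class GenderResponse:
    choice: str
    self_description: Optional[str] = None  # only used with self-describe

def validate(resp: GenderResponse) -> GenderResponse:
    """Reject unknown options and empty self-descriptions."""
    if resp.choice not in GENDER_OPTIONS:
        raise ValueError(f"unknown option: {resp.choice!r}")
    if resp.choice == "Prefer to self-describe" and not resp.self_description:
        raise ValueError("self-describe selected but no description given")
    return resp

validate(GenderResponse("Prefer to self-describe", "genderfluid"))
```

A free-text answer is harder to tabulate than a fixed code, but it records respondents in their own words rather than forcing a misrepresentation; whether the question should be asked at all is a separate design decision.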

Limiting possible answers or failing to ask relevant questions can also lead to more insidious gaps in data collection. Data collection will remain opaque unless we question the process, and an example of all of this is described in the 1988 final report of the Veterans Administration (VA) Advisory Committee on Native American Veterans. Tasked with reviewing how well the VA was serving Native American veterans, the report matter-of-factly outlines the agency’s failure to even gather statistical data on them. As the report states, the lack of “accurate and comprehensive statistical data on Native American veterans at local, regional, and national levels severely limits veteran identification, needs assessment, determination of utilization rates, and program planning” (p. 2). Without that population data in hand, the VA could not fulfill its mission with respect to Native American veterans. The absence of data even prevented recognition of these veterans and their service.
The committee identified three data sources collected directly by the VA, as well as three more held by other federal agencies that the VA could request access to. The VA data sources included the Application for Medical Benefits, the Annual Patient Census, and the Patient Treatment Files. Despite these multiple avenues of collection, the VA sources not only missed veterans who never sought VA services, but also never directly asked about race; instead, clinical staff assigned race based on patients’ surnames or on stereotypes about appearance. The data sources identified at other federal agencies included the Bureau of the Census, the Indian Health Service, and the National Medical Expenditure Survey; however, except for the National Medical Expenditure Survey, these sources did not ask for veteran status when surveying Native American populations.
This failure to gather both pieces of information simultaneously (veteran and Native American) meant that this population went unrepresented at the federal level. These absent data sit at the center of an intersection of underserved populations, illustrating how specific identities can be erased in data collection and in policy creation. Multiple agencies, not just the U.S. Census Bureau, failed to collect valuable information on Native American veterans. At the federal level, employees’ hands may be tied, depending on whether a policy or directive guides the language of a question or whether the question can even be asked. While a federal committee was able to uncover the absence of data on Native American veterans at that time, there is no way to say how many similar absences exist at the federal level for other populations. Those populations may not be receiving the recognition, support, or funding that they need.
Here, one can begin to see how tensions between the data collector and the subjects of data collection affect the way we understand our world. Native American veterans know that they exist and have needs, but due to the legal and bureaucratic structure of the relevant agencies at the time, the available data collection methods overlooked them. In this case, the result of data collection is more than an understanding of the world; it helps determine the funding and support of a particular group. It is important to remember that data collection programs, and the resulting reports, can affect the lives of people in concrete ways. Thus, we need to question how data, particularly demographic data, have been collected. Were they self-reported, or did researchers assign categories based on observation? What language was used, and could the question or response be misconstrued? Did the definitions behind categories change over time for data collected repeatedly, as with the Census? If the form only accounted for three religions, has this decision been justified, or was there an unexamined assumption that only a few major religions would be represented in the population, or that those would be the only ones important for analysis?
If we accept that data-collectors and the subjects of data collection can approach a given topic of interest from different perspectives, then it is also important to understand how this divergence plays out and why it matters. In the situation that we have been describing, the “topic of interest” is the usage of Veterans Administration programs by Native American veterans. To approach this topic, the usage of services, the data usually collected are: who uses the service, when they use it, and why. By addressing who is using the service, program administrators look for populations that are underserved or over-represented, which usually means collecting demographic information. Erasure of identities thus comes into play when program administrators approach their data collection process with preconceived notions of the various identities that may be in the pool of respondents.
The form creator may not be aware of their assumptions, or may be blocked by their employer’s policies. In the former case, the form designer may be so accustomed to only ever seeing two categories for gender that they don’t stop to question it. In the latter case, a federal agency may be restricted by a directive or a policy to only use specific, agreed-upon racial categories, while a local business may copy those categories for their use simply because they are the “official” ones. Program administrators and data collectors may wish to restrict the variables that they have to collect and analyze. To do this, they may limit the number of categories they will accept for a question, and anything else becomes “other.” In this way, the data collector creates a depiction of the world that contains fewer complexities.
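To illustrate, here is a small hypothetical sketch of that bucketing, with invented categories and responses; the point is only how quickly detail disappears.

```python
# Sketch of an "accept these categories, bucket the rest" pipeline.
# The accepted set and the responses are invented for illustration.
from collections import Counter

ACCEPTED = {"White", "Black", "Asian"}

responses = ["White", "Lakota", "Black", "Hmong", "Samoan", "Asian", "Lakota"]

# Anything outside the accepted set is collapsed into "Other".
tabulated = Counter(r if r in ACCEPTED else "Other" for r in responses)
print(tabulated)
# Counter({'Other': 4, 'White': 1, 'Black': 1, 'Asian': 1})
```

In this toy tally, “Other” is the largest group, yet the resulting report can say nothing about who those respondents actually were.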
Although it is understandable that a data collector would like to simplify their work, this puts the subjects of collection in a precarious position. The respondent, when faced with a question that does not provide an adequate answer, has a choice: (a) decline to answer or abandon the form, allowing yet another voice to fall out of the data collection, or (b) knowingly select an option that does not accurately represent them, forced by the survey design to misrepresent themselves. An individual may, understandably, object or feel that it is dangerous for them to divulge some information about themselves.
In some cases, a form may bungle a relatively innocuous question. If options for “renting” are not included on a survey inquiring about housing, the oversight might be embarrassing for the survey creators. When the question concerns topics such as race, gender, ethnicity, or religion, the effect is more insidious: the existence or importance of a particular population is denied, or the language used in the question is inappropriate. When this is multiplied across the entirety of a person’s experience, say when applying for jobs, mortgages, housing, benefits, local services, online accounts, or social media services, these innocuous-seeming questions seed structural discrimination and perpetuate it.
The identities most at risk are those which are not the default cultural assumption of human experience: cis, white, straight, fully abled. The more identities that a person carries outside of the default cultural assumption, the more likely it is that a failure in data collection will elide a component of that person’s identity or the entirety of it. Yet, when these data can have implications for the way people understand the world and the way policy is drafted, they need to be collected correctly. Data hold power, and when a governing body collects data, absences in data collection can affect the lived experience of anyone affected by its policy. In other words, it can affect everyone.
Regardless of how it occurs, this absence can allow those in power to justify ignoring the needs of individuals overlooked or overwritten in the data collection process. To return to the example of Native American veterans, what we see here is an assumption that a person who inhabits one type of identity (race, ethnicity, gender) cannot also inhabit another (the profession of a soldier). It is a failure to believe that these two identities can exist in one person, and this absence in the data allows mainstream, embedded discrimination to persist. It also lets us avoid the necessary recognition of the men and women who, despite the long history of oppression they have faced in America, fought for that country when called.
Data are usually seen as an objective way of representing elements of the world, which can then be analyzed to discover relationships or truths. Yet we sometimes say that the data “lie,” instead of addressing the underlying issues of their original collection. An organization collecting data may not be able or allowed to require that a participant disclose information about themself, or it may not be able to verify a respondent’s answer. Gaps and absences in data appear in these situations as well, and impact the final count. When money is disbursed, policy is enacted, or even general public perception is swayed by the results of data collection, then it matters if certain experiences are not reflected in the data. How can we make an accurate judgment about the reality of a situation when data consumers aren’t aware that collectors did not gather relevant information?
With the increased capacity to collect, store, and manipulate data, more data are being collected at ever-increasing rates. There is a promise inherent in this relentless gathering: if we can collect enough data points, account for all of the variables, and track their changes over time, then we can create an actionable model of the world. With a complete and accurate picture of the world, we can point to hidden “truths” and predict how small changes might affect the whole, whether that means retaining business customers, building better models to prevent plane crashes, changing factors to improve the health of a community, or addressing racial inequalities as they are expressed through public and private services. Data collection is a complicated process, far more complicated than we can address in this article. The nature of the process, however, does not change the need to ask how data are acquired, what assumptions are built into their collection, and who or what may be missing from them.
Sources
National Archives. (2016). World War I Draft Registration Cards. Retrieved from https://www.archives.gov/research/military/ww1/draft-registration on 13 December 2016.
For digitized copies of these draft cards, see: Ancestry.com. U.S., World War I Draft Registration Cards, 1917-1918 [database on-line]. Provo, UT, USA: Ancestry.com Operations Inc, 2005. http://search.ancestry.com/search/db.aspx?dbid=6482#Cards
Office of Management and Budget. (1995). Standards for the Classification of Federal Data on Race and Ethnicity. Retrieved from https://www.whitehouse.gov/omb/fedreg_race-ethnicity on 13 December 2016.
Spotify. (2016). Retrieved from https://www.spotify.com/us/signup/ on 13 December 2016.
United States. Veterans Administration. Advisory Committee on Native American Veterans. (1988). Final report. Washington, D.C. Retrieved from https://catalog.hathitrust.org/Record/001299927
Valerie Collins works as an archivist in Minnesota. She handles digital records primarily, but is also involved in local data management and curation efforts.
Nicole Contaxis works in data management and data sharing in New York City. She thinks too much about technology history.