Metadata? Whadja think?

Wed 14 October 2015 by Rick Gilmore


Databrary has received generous funding from NSF and NICHD to foster increased data sharing and reuse. In the almost four years I've worked on the project, I've learned a great deal about data repositories and about open and transparent research practices. One thing has consistently stood out: successful repositories attract research communities around datasets when the data have useful, searchable metadata. It's not enough to store and share materials and data openly. Users must be able to search for and find the information they need to ask a particular question. I suggest that those of us interested in fostering meaningful data sharing and reuse should urge our colleagues to start collecting and reporting metadata that will make it easier for future researchers to build on our hard work.

In the behavioral sciences, the critical metadata falls into just a few categories, most of it reported in methods sections.

  • Who. Behavioral scientists want to know who (or what species) was tested, along with their ages, sex or gender, race/ethnicity (or strain), native language(s), and possibly health status. Some research programs require even more specialized participant-level metadata, but the items I mention form the core.
  • What. What did participants do? What tasks did they perform? What measures were taken from them? In some subfields there are well-established conceptual ontologies, but in others, efforts to establish them have largely stalled. We should work harder to report task- and measure-related metadata in standardized forms, not just in the methods sections of our manuscripts. Those of us who present visual or audio materials to participants should also store and share actual examples of these materials, not just static depictions suitable for paper-based journal articles.
  • Where and When. Where were participants observed? This usually means the setting (e.g., home, lab, park, school), but could also mean geographic location. When were participants observed? We know that many behaviors vary by time of day or season of the year. Those data aren't regularly collected and shared now, but easily could be.

Subfields might expand on these, of course. But, to my mind, openly shared data should include these types of critical metadata elements in searchable formats. I'm pleased to say that Databrary does this now, and as a result, it is possible to search for individual data collection sessions that meet particular criteria. Going forward, that design decision will make it much easier to find (video) data that meet the conditions needed for a specific reuse case. I suggest that reporting similar subject, task, and setting metadata would serve the same purpose for researchers seeking to repurpose or reanalyze flat-file or physiological data.
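
To make the idea concrete, here is a minimal sketch in Python of what a searchable who/what/where/when record for a single session might look like, along with a simple filter over a list of such records. Everything here is an assumption for illustration: the `SessionMetadata` class, its field names, the `find_sessions` helper, and the three sample records are all made up, and none of it reflects Databrary's actual schema or search interface.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class SessionMetadata:
    """Hypothetical per-session record (not Databrary's actual schema)."""
    session_id: str
    # Who
    species: str
    age_months: float
    sex: str
    native_languages: Tuple[str, ...]
    # What
    tasks: Tuple[str, ...]
    measures: Tuple[str, ...]
    # Where and when
    setting: str
    observed_at: datetime


def find_sessions(
    sessions: Iterable[SessionMetadata],
    *,
    setting: Optional[str] = None,
    task: Optional[str] = None,
    min_age_months: Optional[float] = None,
    max_age_months: Optional[float] = None,
) -> List[SessionMetadata]:
    """Return sessions matching every criterion that was supplied."""
    matches = []
    for s in sessions:
        if setting is not None and s.setting != setting:
            continue
        if task is not None and task not in s.tasks:
            continue
        if min_age_months is not None and s.age_months < min_age_months:
            continue
        if max_age_months is not None and s.age_months > max_age_months:
            continue
        matches.append(s)
    return matches


# Invented sample records, for illustration only.
SESSIONS = [
    SessionMetadata("s01", "human", 12, "female", ("English",),
                    ("free play",), ("video",),
                    "home", datetime(2015, 3, 2, 10, 30)),
    SessionMetadata("s02", "human", 36, "male", ("Spanish", "English"),
                    ("free play", "picture naming"), ("video", "audio"),
                    "lab", datetime(2015, 6, 15, 14, 0)),
    SessionMetadata("s03", "human", 18, "male", ("English",),
                    ("free play",), ("video",),
                    "home", datetime(2015, 9, 20, 16, 15)),
]

# Home-based free play sessions with participants aged 24 months or younger.
home_play = find_sessions(SESSIONS, setting="home", task="free play",
                          max_age_months=24)
print([s.session_id for s in home_play])  # ['s01', 's03']
```

The point of the sketch is that once these few fields are recorded in a structured form rather than buried in a methods section, a query like "home-based free play with children under two" becomes a simple filter.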

In a future post, I'll talk about how one can go too far in requiring researchers to report metadata, which can make it much harder for them to share. But, for now, I'm taking a Goldilocks approach. Who, what, where, and when seem 'just right'.

