As noted on the main index page, this information is very important, particularly if you wish to collaborate with someone else on a project – and even if not, for your own sake and the soundness of your mind, do yourself the favor of taking (some of) the advice offered on this page to heart…
While working on an “old dataset” (doesn't everybody have those shelved treasures hidden somewhere in the attic, that pop up every now and then when a secondary, tertiary, or even more remote re-analysis of some old data has to be done?), I once again was faced with a situation I have come to dread:
The person who collected the data initially couldn't make up their mind about how to enter (code) the data in a systematic way, which makes batched (scripted/automated) analysis difficult (and sometimes impossible), and also greatly increases the chance of making an error (for instance if data is coded arbitrarily or, even worse, ambiguously!)
It might take you a while to get accustomed to a more rigorous way of coding data, but I promise that you will have the following benefits (if done properly at least):
less time spent looking up facts in data sheets
easier communication of the coding scheme (incl. taking notes about it in written form!)
writing scripts becomes a charm
the likelihood of mistaking one setting for another is diminished
(less embarrassment when it comes to showing your data to someone else)
And while the last reason is of course the least important (when it comes to producing good results, at least), it might be the strongest motivator after all! ;)
Take a moment (or two) and think about (and decide on) the following items:
what kind of information do you want to code? (e.g. number of correct responses)
what part of the data needs to be “anonymized”? (i.e. could people infer from some of the items who the subject was/is?)
how will the data be stored? (e.g. as one or several Excel workbooks, an SPSS SAV file, or a Matlab MAT file?)
how is the data collected? (manually, as part of a logging procedure in a program, GSR/ECG recording, etc.)
will there be “sub-sets” of the data? (do you want to share, say, part of the behavioral data with a collaborator but not the entire dataset)
There is no golden rule, but I would rather advise storing all data that is collected (versus dropping data prematurely, only to discover later that it is needed). With hard-disk space being abundantly available these days, there is really no excuse for removing potentially interesting data from a dataset simply to “save space” (especially since Excel has the feature of hiding columns, which makes data much easier to handle for later analyses).
It is (hopefully) obvious that NONE of your general data files should contain clearly identifying data (names, addresses, email addresses). But even so, you sometimes have to be careful with what else is coded. For instance, if you are running a study with HIV-positive smokers, you might not want to record the HIV status directly in your data. Instead you could enter a serial number and store that information separately, so that you don't need to remove/mask this portion when you share your data with someone else.
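As a sketch of this serial-number idea (all file contents, field names, and values below are invented for illustration; shown in Python rather than Matlab for brevity), the shareable data file carries only a neutral serial number, while a separately stored key table maps serial numbers to the sensitive attributes:

```python
# "key table" -- stored separately (and securely), never shared
key_rows = [
    {"serial": "001", "name": "Jane Doe", "hiv_status": "positive"},
    {"serial": "002", "name": "John Roe", "hiv_status": "negative"},
]

# "data file" -- safe to share, references subjects only by serial number
data_rows = [
    {"serial": "001", "trial": 1, "correct": 1},
    {"serial": "002", "trial": 1, "correct": 0},
]

# only when an analysis really needs the sensitive field, join via the serial
key = {row["serial"]: row for row in key_rows}
for row in data_rows:
    row["hiv_status"] = key[row["serial"]]["hiv_status"]

print(data_rows[0]["hiv_status"])  # positive
```

Note that sharing then simply means handing over the data file as-is; no masking pass is needed.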
How to store the data?
Having received training in how to create databases, I know of three major approaches:
a flat approach: data is stored in a format where redundant data reappears in multiple records, e.g. if a subject undergoes 20 trials, you store the subject/session number along with all other data in each of the 20 records → data overhead, and a bit outdated
a split approach: data is stored separately for different “units”, such as sessions; this requires less overhead (as unique information, such as the subject and session number, doesn't need to be repeated over and over again)
a reference approach: data is stored in multiple tables, and a unique identifier (e.g. subject and session number) is used to reference between tables → the way to go! (IMHO, at least)
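A toy sketch of the reference approach (table and field names invented; shown in Python rather than Matlab for brevity): subject-level information lives in one table, per-trial data in another, and a (subject, session) key references between them.

```python
# subject-level table: one entry per (subject, session) key
subjects = {
    (1, 1): {"age": 24, "sex": 2},
    (2, 1): {"age": 31, "sex": 1},
}

# per-trial table: only the key plus trial-specific data, no repetition
trials = [
    {"subject": 1, "session": 1, "trial": 1, "rt": 0.61},
    {"subject": 1, "session": 1, "trial": 2, "rt": 0.58},
    {"subject": 2, "session": 1, "trial": 1, "rt": 0.72},
]

# subject-level info is looked up via the key whenever it is needed
for t in trials:
    t["age"] = subjects[(t["subject"], t["session"])]["age"]

print(trials[0]["age"])  # 24
```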
If all your data comes from one “source” (e.g. is entered manually into Excel), this is of less importance. But if you are using several different sources (e.g. self-report ratings from a logfile written by a stimulus presentation program, such as E-Prime, combined with per-trial physiological measures, such as GSR peak-to-peak response), you should think about how you are going to combine this data in a way that no errors occur during the procedure!
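One way to guard against such merging errors (a hedged Python sketch; field names and values are made up) is to combine sources by an explicit (subject, trial) key instead of relying on row order, and to refuse to proceed if the sources don't line up:

```python
ratings = {(1, 1): 4, (1, 2): 2}        # (subject, trial) -> self-report rating
gsr     = {(1, 1): 0.83, (1, 2): 0.12}  # (subject, trial) -> GSR peak-to-peak

# refuse to merge silently if the two sources do not cover the same trials
if set(ratings) != set(gsr):
    raise ValueError("ratings and GSR records do not match up!")

merged = [
    {"subject": s, "trial": t, "rating": ratings[(s, t)], "gsr": gsr[(s, t)]}
    for (s, t) in sorted(ratings)
]
print(merged[0])
```

Merging by key rather than by position means a missing or extra trial in one source raises an error immediately, instead of silently shifting all subsequent rows.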
Sometimes you wish to store only parts of the data (for simpler processing due to a reduced amount of data, or for sharing with a collaborator). In either case, you should make sure that it is easy to separate required from non-required information by coding different properties separately in each record.
in a table, add a header row with the names of fields using “human-readable” names (no overly shortened acronyms, but also not too long text!)
do NOT use spaces in field names (spaces make reading plain text files very difficult for humans, as it can be ambiguous whether two words are separated by the field separator, usually the TAB character, hexadecimal value 0x09, or the space character, hexadecimal value 0x20); if you concatenate several words, use CamelCase (as a suggestion) or separate words by underscores
try to establish a unique order of fields (columns) that is consistent across all your data files (containing the same kind of information!); while scripts can usually work with variable field orders, inconsistent ordering is a common source of errors when entering/altering/looking up data!
wherever suitable, code properties with few unique values (such as subject sex or group) as numbers, not text; e.g. if you have three different types of trials, think about having a “condition” property per record that has a value of either 1, 2, or 3; naturally you must keep a record of which number refers to which condition! The advantage is that languages such as Matlab have more powerful operators available (a simple == is enough instead of using strcmpi, for instance); if you do use text tokens (e.g. female), make sure that the spelling is consistent
if, for some reason, your field or value names differ between subjects, at least try to make them follow a clear pattern, so that a script (and another human being!) can later use the coded information via a pattern rather than a complete lookup table
try to avoid coding several pieces of information into one “property” (Excel worksheet/Matlab/SPSS column); instead, separate it into as many columns as needed. For instance, if in your logfiles (where this sometimes is unavoidable) one trial is identified as 'hm_12_left_nofix' and another as 'fm_17_right_fix', separate along the underscores into different columns and, if useful for further analysis, replace unique text identifiers with numbers (see above)
unless required for some reason, do not code the same information in several places (e.g. do not insert the session number as a string into another property simply to have it “there”, unless you need to have it there!)
although you do know NOW to which study/project a particular file belongs, you might, later, stumble across it in a different context; use an (easy-to-remember and descriptive) acronym as a leading string in filenames (and other appropriate places, such as subject identifiers, folder names, etc.) to help you keep track of things (and, of course, keep this string consistent across all uses!)
if numbers are used in filenames, make sure to pad them with leading zeros, e.g. do not store files such as pt9_react.xls alongside pt142_react.xls, but instead rename the first to pt009_react.xls (so that alphabetical and numerical sorting agree)
do NOT use spaces or any special characters, such as brackets, ampersand, greater/less than sign, etc. anywhere in your information unless you absolutely have to!!! I cannot stress this enough!
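The numeric condition coding suggested above can be sketched as follows (in Python rather than Matlab; the condition labels and data are invented for illustration):

```python
# keep ONE documented lookup of code -> meaning...
CONDITIONS = {1: "neutral", 2: "fearful", 3: "happy"}

# ...and store only the numeric codes per record
condition = [1, 2, 2, 3, 1, 2]
rt        = [0.61, 0.58, 0.72, 0.55, 0.66, 0.59]

# selecting trials by condition is a simple numeric comparison,
# no (case-sensitive or strcmpi-style) string matching required
fearful_rts = [r for c, r in zip(condition, rt) if c == 2]
print(fearful_rts)  # [0.58, 0.72, 0.59]
```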
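Splitting combined identifiers such as 'hm_12_left_nofix' into separate columns might look like this (a Python sketch; the token meanings and the numeric codes are assumptions, not taken from any real logfile format):

```python
# assumed numeric codes for the text tokens (document such mappings!)
SIDE = {"left": 1, "right": 2}
FIX  = {"nofix": 0, "fix": 1}

def split_trial_id(trial_id):
    """Split an assumed 'group_number_side_fixation' identifier
    into separate, partly numeric fields."""
    group, number, side, fix = trial_id.split("_")
    return {"group": group, "number": int(number),
            "side": SIDE[side], "fixation": FIX[fix]}

print(split_trial_id("hm_12_left_nofix"))
# {'group': 'hm', 'number': 12, 'side': 1, 'fixation': 0}
```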
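The leading-zero advice can be sketched like this (Python; the 'pt' prefix follows the filename example above, and the width of three digits is just illustrative):

```python
ids = [9, 17, 142]

# zero-pad to a fixed width when building filenames
names = ["pt%03d_react.xls" % i for i in ids]
print(names)  # ['pt009_react.xls', 'pt017_react.xls', 'pt142_react.xls']

# with padding, alphabetical sorting equals numerical sorting
assert sorted(names) == names
```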
| Type of use | The Good, | the Bad, | and the Ugly… |
| folder name | | either of 'hg' (too short) or 'hand gestures with moco s91 (2005-1114)' (too long, spaces, what the f#%& is moco? and s91? is that a date?) | 'hgmcs91-05-11-14' (you WILL need some other record later, which a good name would have prevented) |