processing_stream_-_coding_guidelines
— | processing_stream_-_coding_guidelines [2010/06/23 20:47] (current) – created jochen | ||
---|---|---|---|
+ | ====== Coding guidelines ====== | ||
+ | As noted on the main index page, **this information is very important**, | ||
+ | ===== Motivation ===== | ||
+ | While working on an "old dataset" | ||
+ | |||
+ | The person who collected the data initially couldn' | ||
+ | |||
+ | It might take you a while to get accustomed to a more rigorous way of coding data, but I promise that you will have the following benefits (if done properly at least): | ||
+ | * less time to look up facts in data sheets | ||
+ | * easier communication of coding scheme (incl. taking notes about it in written form!) | ||
+ | * writing scripts becomes a breeze | |
+ | * likelihood of mistaking one setting for another one is diminished | ||
+ | * (less embarrassment when it comes to showing your data to someone else) | ||
+ | |||
+ | And while the last reason is of course the least important (when it comes to producing good results, at least), it might be the strongest motivator after all! ;) | ||
+ | |||
+ | ===== Requirements ===== | ||
+ | Take a moment (or two) and think about (and decide on) the following items: | ||
+ | |||
+ | * what kind of information do you want to code? (e.g. number of correct responses) | ||
+ | * what part of the data needs to be " | ||
+ | * how will the data be stored? (e.g. as one or several Excel workbooks, an SPSS SAV file, or a Matlab MAT file?) | ||
+ | * how is the data collected? (manually, as part of a logging procedure in a program, GSR/ECG recording, etc.) | ||
+ | * will there be " | ||
+ | |||
+ | ==== What information? | ||
+ | There is no golden rule, but I would advise storing **all** data that is collected (rather than dropping data prematurely, only to discover later that it is needed). With hard disk space abundantly available these days, there is really no excuse for removing potentially interesting data from a dataset simply to "save space" (especially since Excel lets you hide columns, which makes the data much easier to handle in later analyses). | |
+ | |||
+ | ==== Anonymization ==== | ||
+ | It is (hopefully) obvious that **NONE** of your general data files should contain **clearly** identifiable data (names, addresses, email addresses). Even then, you sometimes have to be careful about what else is coded. For instance, if you are running a study with HIV-positive smokers, you might not want to state the HIV status directly in your data. Instead, you could enter a serial number and store that information separately, so that you don't need to remove or mask this portion when you share your data with someone else. | |
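A minimal sketch of the serial-number idea, assuming the data is available as a list of records. The field names, the example values, and the helper `split_sensitive` are all hypothetical and only illustrate the separation:

```python
# Hypothetical records as they might come out of data collection.
raw_records = [
    {"name": "Jane Doe", "email": "jane@example.org",
     "hiv_status": "positive", "cigarettes_per_day": 12},
    {"name": "John Roe", "email": "john@example.org",
     "hiv_status": "negative", "cigarettes_per_day": 20},
]

# Assumption: these are the fields that must never appear in shared files.
SENSITIVE_FIELDS = ("name", "email", "hiv_status")


def split_sensitive(records, sensitive_fields=SENSITIVE_FIELDS):
    """Return (shareable, key_table), linked only by a serial number."""
    shareable, key_table = [], []
    for serial, record in enumerate(records, start=1):
        public = {"serial": serial}
        private = {"serial": serial}
        for field, value in record.items():
            # Route each field to the private or the public table.
            target = private if field in sensitive_fields else public
            target[field] = value
        shareable.append(public)
        key_table.append(private)
    return shareable, key_table


shareable, key_table = split_sensitive(raw_records)
```

Only `shareable` would go to a collaborator; `key_table` stays in a separate, protected location.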
+ | |||
+ | ==== How to store the data? ==== | ||
+ | Speaking as someone who has received training in how to create databases, I see three major approaches: | |
+ | * a flat approach: data is stored in a format where redundant data reappears in multiple records, e.g. if a subject undergoes 20 trials you store the subject/ | ||
+ | * a split approach: data is stored for different " | ||
+ | * a reference approach: data is stored in multiple tables, and a unique identifier (e.g. subject and session number) is used to reference between tables -> the way to go! (//IMHO at least//) | ||
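The reference approach can be sketched roughly as follows. The two tables, their field names, and the helper `join_on_subject` are assumptions made for illustration; here the subject number alone serves as the identifier:

```python
# Hypothetical per-subject table: each subject appears exactly once.
subjects = [
    {"subject_id": 1, "sex": 1, "group": 2},
    {"subject_id": 2, "sex": 2, "group": 1},
]

# Hypothetical per-trial table: refers to subjects only via subject_id.
trials = [
    {"subject_id": 1, "session": 1, "trial": 1, "correct": 1},
    {"subject_id": 1, "session": 1, "trial": 2, "correct": 0},
    {"subject_id": 2, "session": 1, "trial": 1, "correct": 1},
]


def join_on_subject(trials, subjects):
    """Attach the per-subject fields to every trial record via subject_id."""
    by_id = {s["subject_id"]: s for s in subjects}
    return [{**by_id[t["subject_id"]], **t} for t in trials]


# The flat view is produced on demand and never stored redundantly.
flat = join_on_subject(trials, subjects)
```

Storing the subject information once and joining when needed means a correction (say, a wrongly entered group) has to be made in only one place.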
+ | |||
+ | ==== Data collection ==== | ||
+ | If all your data comes from one " | ||
+ | |||
+ | ==== Sub-sets ==== | ||
+ | Sometimes you wish to store only parts of the data (for simpler processing due to the reduced amount of data, or for sharing with a collaborator). In either case, you should make sure that it is easy to separate required from non-required information by coding different properties separately in each record. | |
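As a small sketch of why separately coded properties help: when each property has its own field, a sub-set is a simple filter. The records and the helper `select` below are hypothetical:

```python
# Hypothetical records with one field per property.
records = [
    {"subject_id": 1, "session": 1, "trial_type": 1, "correct": 1},
    {"subject_id": 1, "session": 2, "trial_type": 2, "correct": 0},
    {"subject_id": 2, "session": 1, "trial_type": 1, "correct": 1},
]


def select(records, **criteria):
    """Keep the records whose fields equal all of the given criteria."""
    return [
        r for r in records
        if all(r.get(field) == value for field, value in criteria.items())
    ]


session_one = select(records, session=1)
type_one_subject_one = select(records, subject_id=1, trial_type=1)
```

Had session and trial type been packed into a single text field, the same selection would need string parsing first.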
+ | |||
+ | ===== Coding rules ===== | ||
+ | ==== General advice ==== | ||
+ | * in a table, add a header row with the names of fields using " | ||
+ | * do **not** use spaces in field names (this makes reading plain text files very difficult for humans, as it can be ambiguous whether two words are separated by the field separator, usually the TAB-code hexadecimal value 0x09, or the space-code hexadecimal value 0x20), if you concatenate several words, use [[http:// | ||
+ | * try to establish a fixed order of fields (columns) that is consistent across all your data files containing the same kind of information (while scripts can usually work with variable fields, this is a common source of errors when entering/ | |
+ | * wherever suitable, code few unique values (such as subject sex or group) as numbers not text, e.g. if you have three different types of trials, think about having a " | ||
+ | * if, for some reason, your field or value names differ between subjects, at least try to make them follow a clear pattern, so that a script (and another human being!) can later retrieve the coded information via that pattern rather than a complete lookup table | |
+ | * try to avoid coding several pieces of information into one " | ||
+ | * unless required for some reason, do **not** code the same information in several places (e.g. do not insert the session number as a string into another property simply to have it " | ||
+ | * although you do know **NOW** to which study/ | ||
+ | * if numbers are used in filenames, make sure to fill them up with leading zeros, e.g. do not store files such as '' | ||
+ | * **do //NOT// use spaces or any special characters, such as brackets, ampersand, greater/ | ||
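The two filename rules can be sketched as below, assuming they apply to file and folder names. The naming scheme `subjNNN_sessNN`, the set of allowed characters, and both helper functions are illustrative assumptions, not a prescribed convention:

```python
import re


def data_filename(subject, session, extension="txt"):
    """Build a name such as 'subj003_sess01.txt' with zero-padded numbers."""
    return f"subj{subject:03d}_sess{session:02d}.{extension}"


# Assumption: only letters, digits, dot, underscore and hyphen are allowed.
SAFE_NAME = re.compile(r"[A-Za-z0-9._-]+")


def is_safe_name(name):
    """True if the name contains no spaces or special characters."""
    return SAFE_NAME.fullmatch(name) is not None


# With leading zeros, alphabetical order equals numerical order.
names = [data_filename(subject, 1) for subject in (10, 2, 1)]
ordered = sorted(names)
```

Without padding, a plain alphabetical sort would place subject 10 before subject 2, which is exactly the kind of surprise a script should not have to work around.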
+ | |||
+ | ===== Examples ===== | ||
+ | ^ Type of use ^ The Good, ^ the Bad, ^ and the Ugly... | ||
+ | | folder name | ''' |
processing_stream_-_coding_guidelines.txt · Last modified: 2010/06/23 20:47 by jochen