Why Coding?
Steps in data management
1. Prepare the data collection instrument and collect the data;
2. Prepare the data dictionary or codebook;
3. Prepare the data matrix worksheets;
4. Prepare instructions for data entry and data analysis.
Working with original data, however, can be very cumbersome, whether it is hundreds of mailed questionnaires, figures on yearly accident rates for the fifty states, or observations of classroom behavior of school children. For this reason, data are often coded.
Codes allow the researcher to reduce large quantities of information into a form that can be more easily handled, especially by computer programs. Not all data need to be coded. For example, the accident rates for the fifty states would not be coded, but each state could be assigned a number (1 through 50) instead of using the state name. There are also content analysis computer programs that help researchers to code textual data for qualitative or quantitative analysis.
2. How long have you been an employee in this company? _______years
3. How many county-sponsored training sessions have you attended? _____
4. What is your job classification?
_____Management
_____Technical
_____Administrative
_____Clerical
5. Is your position
_____supervisory
_____non-supervisory
6. Sex
_____male
_____female
7. In what area would you like to receive additional training? ___________
Many computer programs have limits on the way data can be entered, stored, and retrieved. These limits should be reflected in the codebook. For example, the names of your variables often cannot exceed eight characters. Use short variable names, preferably all letters. You generally can use numbers as well as letters in variable names, but you cannot use spaces, punctuation, or other special characters.
The variable names you assign to the data should reflect the nominal definitions of the variables themselves, such as "age," "jobclass," "seniority," and so forth. You may want to adopt a rule such as using only lower case letters for any alphanumeric data that you enter, or only uppercase letters. This will make typing variable names easier later when you must tell the computer program which variables to analyze.
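The naming rules above can be checked mechanically before data entry begins. The following is a minimal sketch in Python, assuming an eight-character limit and names made up only of letters and digits; the function names are hypothetical and the exact limits depend on the program you use.

```python
import re

# Assumed limit: the text says names "often cannot exceed eight characters".
MAX_NAME_LENGTH = 8

def is_valid_variable_name(name: str, max_length: int = MAX_NAME_LENGTH) -> bool:
    """Return True if the name uses only letters and digits and fits the limit."""
    if not 1 <= len(name) <= max_length:
        return False
    # No spaces, punctuation, or other special characters are allowed.
    return re.fullmatch(r"[A-Za-z0-9]+", name) is not None

def normalize_variable_name(name: str) -> str:
    """Apply one case rule (here: lower case) so names are typed consistently."""
    return name.lower()

for candidate in ["age", "jobclass", "job class", "seniority", "train-1"]:
    print(candidate, is_valid_variable_name(candidate))
```

Note that "seniority" has nine characters, so under an eight-character limit it would have to be shortened.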
Data can be stored in many ways. The most common form for variables is numeric data, consisting only of numbers. Usually this allows for fractions to be stored as decimals, for example, 2.3 or 0.888.
Data can also be stored as letters, called alpha-numeric format. This allows the variable to be stored as either letters or numbers or a combination of the two. For example, you could store first names, such as "Amy," "Brad," "Caroline," etc. or combinations such as apartment numbers (102b), or license plate numbers (3XGJ429), etc.
In neither case should data ever be entered with spaces, punctuation marks, or any special characters of any kind. Large numbers should not have any commas placed in them; names should not have any periods, dashes, quotation marks, etc.
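As an illustration of this rule, raw responses can be cleaned before they are entered. This is a rough sketch in Python; the helper names are hypothetical, and it assumes numeric values are non-negative and use a period as the decimal point.

```python
import re

def clean_numeric(raw: str) -> str:
    """Keep only digits and the decimal point, e.g. '1,250' becomes '1250'."""
    return re.sub(r"[^0-9.]", "", raw)

def clean_alphanumeric(raw: str) -> str:
    """Keep only letters and digits, dropping spaces, periods, dashes, and quotes."""
    return re.sub(r"[^A-Za-z0-9]", "", raw)

print(clean_numeric("1,250"))          # commas removed from a large number
print(clean_alphanumeric("3XGJ-429"))  # dash removed from a license plate
```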
The codebook tells the coder how each questionnaire will be coded for data entry. For each variable it specifies the question on the questionnaire from which the data are taken, the variable name, the operational definition of the variable, the coding options, the columns the variable occupies, and the type of variable (numeric or alpha-numeric).
Example: Quality of Work Life Codebook
| Q. No. | Variable Name | Operational Definition | Coding | Col. | Type |
|  | ID | Questionnaire Number | 001-999 | 1-3 | num |
| 1 | DIVISION | Name of Division where you work? | Planning=1 Traffic=2 Engineering=3 Enforcement=4 missing=9 | 4 | num |
| 2 | LENGTH | How long have you been an employee in this company? | 01-98 missing=99 | 5-6 | num |
| 3 | TRAINING | How many county-sponsored training sessions have you attended? | 00-98 missing=99 | 7-8 | num |
| 4 | JOBCLASS | What is your job classification? Management, Technical, Administrative, Clerical | Management=1 Technical=2 Administrative=3 Clerical=4 missing=9 | 9 | num |
| 5 | SUPER | Is your position supervisory or non-supervisory? | non-supervisory=0 supervisory=1 missing=9 | 10 | num |
| 6 | SEX | Sex: male, female | male=0 female=1 missing=9 | 11 | num |
| 7 | NEEDS | In what area would you like to receive additional training? | supervising=1 budgeting=2 computers=3 personnel=4 other=5 missing=9 | 12 | num |
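One way to make a codebook like the example above usable during data entry is to write it down as a lookup table. The sketch below is illustrative only: the coding schemes are transcribed from the example codebook, while the dictionary layout and the `encode` helper are assumptions introduced here.

```python
from typing import Optional

# Coding schemes transcribed from the example codebook; the structure of this
# dictionary and the encode() helper are hypothetical, not part of the source.
CODEBOOK = {
    "DIVISION": {"codes": {"Planning": 1, "Traffic": 2, "Engineering": 3, "Enforcement": 4}, "missing": 9},
    "JOBCLASS": {"codes": {"Management": 1, "Technical": 2, "Administrative": 3, "Clerical": 4}, "missing": 9},
    "SUPER": {"codes": {"non-supervisory": 0, "supervisory": 1}, "missing": 9},
    "SEX": {"codes": {"male": 0, "female": 1}, "missing": 9},
}

def encode(variable: str, response: Optional[str]) -> int:
    """Translate a written response into its numeric code."""
    entry = CODEBOOK[variable]
    # A blank or absent answer receives the missing-data code.
    if response is None or not response.strip():
        return entry["missing"]
    key = response.strip()
    if key not in entry["codes"]:
        raise ValueError(f"{key!r} is not a listed response for {variable}")
    return entry["codes"][key]

print(encode("DIVISION", "Engineering"))
print(encode("SEX", ""))
```

Raising an error for an unlisted response, rather than silently coding it as missing, is a design choice: it forces the coder to decide whether the codebook needs a new category.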
Tips on Coding:
On a scale of attitudes about work: 5=Very satisfied
On a survey of where city residents live: Central=1
On a survey of college majors: Business=1
2. Use zero and one to code variables with binary response categories, such as:
Are you a supervisor? No=0 Yes=1
Sex: Male=0 Female=1
Are you at headquarters or in the field? Headquarters=0 Field=1
(Be sure to use the number zero, and not the letter "O"; and the number one, not the lowercase letter "l").
3. The same data can be coded in more than one way. For example, the following data on what materials the library should acquire can be coded in two different ways:
Data: books on the middle ages
Code for subject matter, e.g.: History
Code for type of material, e.g.: reference works
4. One question on a questionnaire can yield more than one variable. For example: What type of training would you like to receive?
_____supervising _____budgeting _____computers _____personnel
This can be coded as one variable:
TRAINING
Or as two variables, indicating first and second choices:
TRAIN1, TRAIN2
Or as four variables, indicating a yes/no preference for each type:
TSUPER, TBUDGET, TCOMPUT, TPERS
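The three coding schemes for this question can be sketched in Python as follows. The variable names come from the text; the numeric codes (supervising=1, budgeting=2, computers=3, personnel=4, missing=9) are an assumption borrowed from the NEEDS variable in the example codebook, and the function names are hypothetical.

```python
# Assumed codes, borrowed from the NEEDS variable in the example codebook.
TRAINING_CODES = {"supervising": 1, "budgeting": 2, "computers": 3, "personnel": 4}
MISSING = 9

def code_single(choices):
    """One variable: the code of the first choice, or the missing code."""
    return {"TRAINING": TRAINING_CODES[choices[0]] if choices else MISSING}

def code_first_second(choices):
    """Two variables: first and second choices, padded with the missing code."""
    codes = [TRAINING_CODES[c] for c in choices[:2]]
    codes += [MISSING] * (2 - len(codes))
    return {"TRAIN1": codes[0], "TRAIN2": codes[1]}

def code_yes_no(choices):
    """Four variables: 1 if the type was checked, 0 if it was not."""
    return {
        "TSUPER": int("supervising" in choices),
        "TBUDGET": int("budgeting" in choices),
        "TCOMPUT": int("computers" in choices),
        "TPERS": int("personnel" in choices),
    }

answer = ["computers", "budgeting"]  # a respondent's choices, in order
print(code_single(answer))
print(code_first_second(answer))
print(code_yes_no(answer))
```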
The researcher has to try to anticipate how the data will look. A good idea of this can be gained from doing a pilot test of the instrument, and a dry run of the data collection process. It is important to be sure to leave enough columns to properly code the information for each variable, and to provide enough variables to capture all the richness, complexity, and variety of data that has been collected.
If a sample of college students is asked about barriers they encounter in attempting to use the campus library, will students be asked to list the one main barrier, to rank order all the barriers, or to choose only the barriers relevant to them? And what if the students do not follow the instructions? Depending on what shape the data come in, the researcher will have to decide how to code this information, using one, two, or many variables.
Example: Data Entry Worksheet for the Quality of Work Life Codebook
Id 1-3 | Division 4 | Length 5-6 | Training 7-8 | Jobclass 9 | Super 10 | Sex 11 | Needs 12 |
001 | 3 | 22 | 15 | 4 | 0 | 1 | 4 |
002 | 1 | 01 | 03 | 2 | 1 | 0 | 1 |
003 | 2 | 09 | 99 | 3 | 0 | 0 | 3 |
Each single numeral or character that is entered into a computer program takes up one column of space. Each datum can be found by knowing its location by column number in the matrix.
Columns 1 through 3 taken together represent the questionnaire ID number.
Column 4 represents the division worked in.
Columns 5-6 represent the length of time employed.
Columns 7-8 represent the number of training classes taken (note that the information on number of classes taken is missing for person number 003).
Column 9 represents the person's job classification.
Column 10 indicates whether the person is a supervisor or not.
Column 11 indicates whether the person is male or female.
Column 12 indicates what type of training the person wants in the future.
Each record, case, questionnaire, or other unit of analysis is represented by a single row of data across the matrix. For example, person 001 is found in row 1; person 002 in row 2; and person 003 in row 3.
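Locating each datum by its column numbers can be sketched in a few lines of Python. The column positions below are taken from the example codebook; the record string assumes person 001 is typed without separators, and the names `LAYOUT` and `parse_record` are hypothetical.

```python
# Column positions (1-based, inclusive) taken from the example codebook.
LAYOUT = {
    "ID": (1, 3),
    "DIVISION": (4, 4),
    "LENGTH": (5, 6),
    "TRAINING": (7, 8),
    "JOBCLASS": (9, 9),
    "SUPER": (10, 10),
    "SEX": (11, 11),
    "NEEDS": (12, 12),
}

def parse_record(line: str) -> dict:
    """Slice one fixed-column record into its named variables."""
    record = {}
    for name, (start, end) in LAYOUT.items():
        # Python slices are 0-based and exclude the end index.
        record[name] = line[start - 1:end]
    return record

# Person 001 from the worksheet, assumed to be typed as twelve characters.
print(parse_record("001322154014"))
```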
Each record must be entered in exactly the same way. If each variable is entered in the same fixed columns for every record, this is referred to as fixed-field format. If data are missing for a record on any of the variables, something must still be entered into that field. Usually this is a number indicating that the data are missing. For a one-column field, use the number 9; for a two-column field, use 99; and so forth. Just make sure that "9" or "99" is not also a valid response. In that case, use some other number; some computer programs will allow you to use a period (".") as a placeholder that is also an indicator of missing data.
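A small sketch of how a record might be written out in fixed-field format, filling missing values with 9s as described above. The field widths follow the example codebook; padding short numbers with leading zeros and the helper names are assumptions made for this illustration.

```python
# Field widths in column order, following the example codebook.
FIELD_WIDTHS = {
    "ID": 3, "DIVISION": 1, "LENGTH": 2, "TRAINING": 2,
    "JOBCLASS": 1, "SUPER": 1, "SEX": 1, "NEEDS": 1,
}

def missing_code(width: int) -> int:
    """9 for a one-column field, 99 for a two-column field, and so forth."""
    return int("9" * width)

def format_record(values: dict) -> str:
    """Build one fixed-field line; None marks a missing value."""
    parts = []
    for name, width in FIELD_WIDTHS.items():
        value = values.get(name)
        if value is None:
            value = missing_code(width)
        text = str(value).zfill(width)  # assumed: pad with leading zeros
        if len(text) > width:
            raise ValueError(f"{name}={value} does not fit in {width} column(s)")
        parts.append(text)
    return "".join(parts)

# Person 003: the number of training sessions is missing.
person_003 = {"ID": 3, "DIVISION": 2, "LENGTH": 9, "TRAINING": None,
              "JOBCLASS": 3, "SUPER": 0, "SEX": 0, "NEEDS": 3}
print(format_record(person_003))
```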
When you ask the computer, for example, to compute the average length of time employed for all the employees in your survey, the computer will look in columns 5-6 of each record. It will take whatever it finds there, and attempt to compute an average. It is important, therefore, that all length of employment data be in columns 5-6 for every record, and that no other type of data be in columns 5-6. The computer will disregard missing data codes (i.e., values of "99") in computing the average, provided the program has been told that 99 is the missing data code for that variable.
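The calculation described above can be imitated in Python to show why column positions and missing data codes matter. This is only a sketch: the three records are the worksheet rows, assumed to be typed as twelve-character lines with leading zeros, and `mean_excluding_missing` is a hypothetical helper, not a function of any statistical package.

```python
def mean_excluding_missing(records, start, end, missing):
    """Average the values found in columns start..end, skipping the missing code."""
    values = []
    for line in records:
        value = int(line[start - 1:end])  # columns are 1-based and inclusive
        if value != missing:
            values.append(value)
    if not values:
        return None  # every record was missing on this variable
    return sum(values) / len(values)

# Persons 001-003 from the worksheet, assumed to be zero-padded.
records = ["001322154014", "002101032101", "003209993003"]

print(mean_excluding_missing(records, 5, 6, 99))  # LENGTH, columns 5-6
print(mean_excluding_missing(records, 7, 8, 99))  # TRAINING, columns 7-8
```

If the missing code were not excluded, the 99 entered for person 003 would be averaged in as if that person had attended 99 training sessions.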
Many computer programs have a limitation of a total
of 80 columns of data per record. This is a holdover from when data were
punched on cardboard cards that were fed into card readers, rather than
entering data directly into the computer. If your data require more than
80 columns, you will have to construct additional data matrices to record
the remainder of the information for each record.
There are a number of statistical, spreadsheet, and database programs that can be used for data entry. Most programs will save the data and allow it to be output as a plain text or ASCII file, which is accepted by most statistical programs, such as SAS, SPSS, or Stata. Most of these programs are available in a desktop version, and many also come in cheaper student versions, such as Student Stata and Mystat.
There are also a number of stand-alone products such as DataPerfect, which can be easily programmed to look just like the data collection instrument, making data entry quite easy and eliminating the need for a data entry matrix to be filled in. These programs also have built-in safeguards, so that, for example, alpha-numeric data cannot be entered into a variable that is for numeric data only; data are constrained to a limited number of columns so that four digits can't be entered into a three-digit variable; etc.