How to keep format your data

Data Courtesy
Today there are a multitude of powerful and convenient programs that are used for storing and manipulating data. These include data base programs (Access, Foxpro, dBase, Paradox), spreadsheets (Excel, QuatraPro, Lotus), and word processing packages (Word, Word Perfect). All of these programs are powerful tools for data manipulation. When the intention is to perform statistical analysis with the data stored in these programs, then how easily the data can be imported into a statistical package such as SAS or SPSS should be considered. What may appear to be a well laid-out data set may be a challenge to re-format for statistical analysis, may be a very time-consuming process and may result in data errors in the process. Prior to starting data entry researchers should consider how they can set up their data to facilitate statistical analysis. It is easiest to consider this prior to setting up the spreadsheet and entering the data.
It is recommended that a data base program such as Microsoft Access, which is part of the Microsoft Office Professional Package, Foxpro, dBase, or Pardox be used for data entry where possible. With these programs the structure of columns is pre-defined and limits and checks can put on values that are be entered. Data from such data base programs can easily be imported directly into a statistical program or be exported to a format which can easily be imported.
Because it is included with the standard Microsoft Office program and because of its ease of use, Microsoft Excel is often used for entering and storing research data. Excel is a spreadsheet program which allows users to easily manipulate their data and perform simple calculations, including some statistical summaries. However, these statistical functions are limited in scope and need to be used carefully to avoid error. In order to perform statistical analysis the excel data will need to be imported into a statistical program. Despite the apparent clarity and attractiveness of a spreadsheet importing the data into a statistical program may not be a straightforward task. The complexity of this process can make it the most time-consuming part of a statistical analyses.
Statistical programs require data to be set up in a specific format. The following is a list of recommendations/cautions for using spreadsheets such as Microsoft Excel for entering and storing research data that will be used in statistical analysis:
Worksheet setup Data should be set up with rows representing cases and columns representing fields (variables).
Headings The first row, and only the first row, can be used for column labels/headings.
Headings should be unique and consistent with SAS/SPSS variable (field, column) naming conventions. Column headings should begin with an alphabetic character followed by alphabetic characters, numbers or an “_”. A space between two words will be converted to “_”.
Column Consistency Within columns data should be of a consistent type – numeric, character, or date. SAS and SPSS read the first few rows and use the format of those rows to format the columns.
Numeric Columns If the statistical program decides that the column is numeric it will treat any non-numeric data as missing. Where there are missing values in the first few columns use a unique numerical value , for example -99, to indicate that the data is missing. That same value may be used for all missing data in that column. Separate unique values can be used for other values for which there is no number. For example data that is below a detectable level might be assigned a value of -77.
Character (text) Columns If the statistical program decides that the column is character it will set any numbers to missing, unless they are pre-specified as character values. The length of a character column will also be determined from the first few cases. Hence longer data strings that are further down in the spreadsheet will be truncated. One way to deal with this is to use a long series of ‘X’s in one of the first few rows that are missing to set the length of the column.
Where text is used in a column, spelling should be consistent. For example, if a character string is used to identify the gender of the subject and various forms are used such as “m”, “M”,”_m” “male”,”Male”, “MAle”, “MALE” for males, one has to convert these to a single consistent format before the analysis can be performed. This can become even more challenging when variations of spelling for names or conditions are used which are not familiar to the individual performing the statistical analysis
Date Columns Columns containing dates should be formatted as dates. Care should be taken to distinguish between the American and European formats for date. The format will be set as one format but Excel allows the entry of both. For example Excel will allow both 3/19/2010 and 19/3/2010. The one will be set as a date and the other as a character that will become missing when imported into a statistical program. Date formats can be set up so that the three characters for the month are seen once the date is entered. This allows one to easily see that a date has been entered and also to differentiate between dates such as 3/4/2010 and 4/3/2010.
Time Columns Columns containing times should be set up and entered safely. One way of dealing safely with times is to have one column for the hour and a second for minutes.
Subject Data Data for a subject assessment should not be spread across multiple rows. If spread across multiple rows, a statistical program will read multiple observations for that subject.
Underscores Underscores (“_”) are characters. They may not appear in a column heading or in a character data string. However, if there, they will show up as part of the character string.
Composite Variables Avoid putting multiple responses separated by commas or other characters in a single column eg entering age and sex into a single column. The process of extracting the separate responses can be time consuming, challenging, and lead to error. Create separate columns for situations where there are multiple numeric variables. For example, if comorbidities are to be listed and a subject has two or three comorbidities, instead of a single column with conditions separated by commas, one could create three columns and list each of the three comorbidities in three separate columns. Alternatively, one could create columns (fields, variables) for each of the common comorbidites and note the presence or absence of each comorbidity plus have a separate column for “other” comorbidities.
Calculated Variables If we have the algorithm it is probably easiest to calculate new variables in the statistical program. For example, with weight and height, BMI can easily be calculated.
Summary calculations & Graphs Summary calculations and graphs store in the spreadsheet should be removed. Summary calculations end up being treated as additional subjects.
Hidden Columns Hidden columns are not lost. They are only hidden from view within Excel. If you do not want these columns to be seen by anyone, then delete them.
Excel Formatting Excel Formatting (colours, shading, fonts) are for improving the appearance of the spreadsheet. They do not affect the data stored in the spreadsheet. If you are trying to communicate information for the analysis it is better to do so by adding an indicator column. For example if a ‘strike over’ font is used to indicate ineligible subjects, this will be lost when the data is imported into a statistical program. It is better to have a separate column to indicate which subjects are eligible and ineligible.
Empty Rows & Columns Through various manipulations and modifications a spreadsheet the size of the spreadsheet (number of columns &rows) can exceed the number of variables & cases. If these are not deleted they become extra variables and data points.

Add a New Comment
Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License