This is an old revision of the document!


Getting your Data Structure Right for SOFA

Data Format

Most often, you enter some data into SOFA and everything Just Works. You don't have to think much about your data structure. But sometimes you want to analyse one variable by another, e.g. height by gender, and SOFA doesn't seem to allow it. Or you want to see whether there is a difference between, for example, different years, and there seems to be no way of doing it. Or perhaps you want to do a paired t-test and you can't get the correct results.

If you have trouble analysing your variables in SOFA Statistics, check that:

  1. Your data is structured the right way for the analysis you want. For example, if SOFA needs a column for year and a column for score, there will be a problem if your data has a column for 2015 score and a column for 2016 score.
  2. Any variables you need to analyse as numbers, e.g. for correlation analyses or histograms, have actually been entered/imported as numeric data, not as text.

Structuring data for analysis

The first step is to think about what you want to find out about the data. Here are some examples.

Types of SOFA Statistics analysis

Analysing One Variable "By" Another

The By variable must be a single variable with different values in it (long format), not one column per option (wide format). See http://www.theanalysisfactor.com/wide-and-long-data/.

E.g.

By Gender

The long format is good and the wide format is bad for this purpose.

By Year

Once again, the long format is good and the wide format is bad.
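To illustrate why a single By variable is convenient, here is a minimal Python sketch. It is not SOFA's own code; the function name mean_by and the sample values are purely illustrative. With long-format rows, any "measure by group" summary only needs the name of the measure column and the name of the By column.

```python
from collections import defaultdict
from statistics import mean

# Illustrative long-format data: one row per person, one column per variable.
rows = [
    {"gender": "Male", "height": 186},
    {"gender": "Female", "height": 167},
    {"gender": "Male", "height": 179},
    {"gender": "Female", "height": 170},
]


def mean_by(rows, measure, by):
    """Return {group value: mean of measure} for long-format rows."""
    groups = defaultdict(list)
    for row in rows:
        # The By column decides which group each measurement belongs to.
        groups[row[by]].append(row[measure])
    return {group: mean(values) for group, values in groups.items()}


height_by_gender = mean_by(rows, measure="height", by="gender")
print(height_by_gender)
```

With wide data (a Male column and a Female column) there is no single column to group by, which is why the By analyses cannot be set up.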

Relationships between two different variables

E.g. looking at linear correlation:

Age  Weight
56   86
22   55
...

In the appropriate SOFA dialog you would select one variable as A and the other as B.

Difference between two "paired" variables

E.g. looking to see if there is a difference between fuel consumption before a fuel gadget was added and afterwards:

NB each row would be the data for one vehicle (or one type of vehicle etc depending on what was being studied).

Consumption (before)    Consumption (after)
12.5                    11.7
16.1                    16.0
...

Or a difference in weight before and after a diet:

NB each row would be the data for one person.

Weight  Post-diet Weight
87      90
59      59
...

In the appropriate SOFA dialog you would select one variable as A and the other as B.
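The reason paired data must keep both measurements on the same row can be sketched in Python. This is only an illustration of the usual paired t statistic (mean of the per-row differences divided by its standard error), not a description of how SOFA computes it. The function name paired_t is hypothetical, and only the first two value pairs come from the fuel example above; the others are made up.

```python
from math import sqrt
from statistics import mean, stdev

# Each tuple is one vehicle: (consumption before, consumption after).
# The last two pairs are invented for illustration.
pairs = [(12.5, 11.7), (16.1, 16.0), (14.2, 13.5), (10.8, 10.9)]


def paired_t(pairs):
    """Return the paired t statistic for (before, after) rows."""
    # The difference is taken within each row, so the two values in a row
    # must belong to the same vehicle (or person).
    diffs = [after - before for before, after in pairs]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n))


t_statistic = paired_t(pairs)
print(round(t_statistic, 3))
```

If the before and after values were stacked in one column, or sorted independently, the row-by-row differences would no longer describe the same subject.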

Restructuring your data

The most common problem is that the data for different groups is stored in different variables (wide format) instead of in one variable with a separate group column (long format).

E.g. height data with one column per gender:

Male  Female
186   167
179   170
...

The easiest way to handle this might be to change the data in a spreadsheet and import it in the restructured form.

  1. Insert a group-by column (Gender).
  2. Transfer the first variable (Male): rename its column to the measure (Height) and fill the group-by column (Gender) with Male for those rows.
  3. Transfer the second variable: paste the Female height values below the existing Height values and fill the Gender column with Female for those rows.
  4. Delete the variable that is no longer needed (Female in this case).

NB You could have used 1 for Male and 2 for Female if you preferred and added value labels to Gender once the data was imported into SOFA Statistics. See Setting variable details e.g. labels

The same process can be used if there are multiple groups e.g. countries instead of genders.
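If you would rather script the restructuring than do it by hand in a spreadsheet, the steps above can be sketched in Python using only the standard library. The function names wide_to_long and to_csv_text are hypothetical, and the sample values are illustrative. The sketch stacks every group column into one measure column plus one group-by column, so it works for any number of groups.

```python
import csv
import io


def wide_to_long(wide_rows, group_name, measure_name):
    """Turn one-column-per-group rows into (group, measure) rows."""
    long_rows = []
    for row in wide_rows:
        for group, value in row.items():
            if value is None or value == "":
                continue  # skip empty cells rather than inventing values
            long_rows.append({group_name: group, measure_name: value})
    return long_rows


def to_csv_text(long_rows, fieldnames):
    """Return the long-format rows as CSV text ready to save and import."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(long_rows)
    return buffer.getvalue()


# Wide format: one column per gender.
wide = [
    {"Male": 186, "Female": 167},
    {"Male": 179, "Female": 170},
]

long_rows = wide_to_long(wide, group_name="Gender", measure_name="Height")
csv_text = to_csv_text(long_rows, fieldnames=["Gender", "Height"])
print(csv_text)
```

The resulting CSV text has one Gender column and one Height column, which is the long format needed for a By analysis.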

Numbers stored in a text variable

If you imported your data into SOFA from a spreadsheet, the solution is probably to change the appropriate column data types to numeric and reimport the data. SOFA tries to warn you if it doesn't detect enough numeric variables for the analysis you are conducting e.g. you need at least two numeric variables to conduct a Pearson's R linear correlation analysis.
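Before reimporting, it can help to find the cells that stop a column from being numeric. The Python sketch below is a rough pre-check only. It does not reproduce how SOFA decides a column's type on import, and the function name non_numeric_values and the sample column are hypothetical.

```python
def non_numeric_values(values):
    """Return the values that cannot be read as numbers."""
    bad = []
    for value in values:
        text = str(value).strip()
        if text == "":
            continue  # treat blank cells as missing, not as bad data
        try:
            # float() is lenient: it also accepts forms such as "1e5" or "nan".
            float(text)
        except ValueError:
            bad.append(value)
    return bad


# Illustrative column with a typo ("4l" with a letter l) and a text placeholder.
column = ["56", "22", "4l", " 38 ", "n/a", ""]
problems = non_numeric_values(column)
print(problems)
```

Any values reported here would need to be corrected or blanked in the spreadsheet before the column can be treated as numeric.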

help/data_structure.1427605598.txt.gz · Last modified: 2015/03/29 01:06 by admin