Workshop 2 - Part 2

Import, Export External Data


Example Datafiles

I have created a simple dataset for us to work with. The files are available via Dropbox at https://www.dropbox.com/sh/vlo5bzrl5ayo1bk/AADD0WieyuEdyGwiveuCoRr-a?dl=0

The data files include:

  • XLS - the older 1997-2003 EXCEL file format
  • XLSX - the newer EXCEL file workbook format
  • CSV - a comma delimited dataset
  • xxx_tab.TXT - a TAB delimited dataset
  • SAV - SPSS file format
  • and R scripts (xxx.R) for Workshops 1 and 2 so far...

We will work through importing each of these datafile types.


The dataset description

The dataset we'll be working with here has 8 variables and 20 cases (or 20 subjects).

SubjectID Age WeightPRE WeightPOST Height SES GenderSTR GenderCoded
1 45 150 145 5.6 2 m 1
2 50 167 166 5.4 2 f 2
3 35 143 135 5.6 2 F 2
4 44 216 201 5.6 2 m 1
5 32 243 223 6.0 2 m 1
6 48 165 145 5.2 2 f 2
7 50 132 132 5.3 2 m 1
8 51 110 108 5.1 3 f 2
9 46 167 158 5.5 2 NA
10 35 190 200 5.8 1 Male 1
11 36 230 210 6.2 1 m 1
12 40 200 195 6.1 1 f 2
13 45 180 185 5.9 3 f 2
14 52 240 220 6.5 2 m 1
15 24 250 240 6.4 2 M 1
16 35 175 174 5.8 2 F 2
17 51 220 221 6.3 2 m 1
18 43 230 215 2.6 2 m 1
19 36 190 180 5.7 1 female 2
20 44 260 240 6.4 3 male 1

IMPORT a basic CSV (comma delimited) or TAB (TAB delimited) text data file

Let's import a simple data file in CSV (comma delimited) format. This is the most common simple format used for exchanging data between different software/data handling packages.

In basic R the following functions are useful for reading in or importing in datasets:

  • read.table()
  • read.csv()
  • read.delim()

For each of these files you need to specify the data file, and possibly also supply whether the dataset has a header row, what the separator character is and more options. To see complete details see help(read.table)

For each example below, I am providing the specific directory path. However, if these datafiles are located in your current working directory you can just type the filename without the full path.

Option 1 - use read.table()

The basic code is read.table(file="xxx", header=TRUE, sep=",") but these options can be changed as needed if your file does not have a header row or if your file has a separator other than a comma - some separators can be spaces or tabs.

> data.rt <- read.table(file="C:/MyGithub/A_Series_of_R_Workshops/datasets/Dataset_01_comma.csv",
+                       header=TRUE,sep=",")
> data.rt
   SubjectID Age WeightPRE WeightPOST Height SES GenderSTR GenderCoded
1          1  45       150        145    5.6   2         m           1
2          2  50       167        166    5.4   2         f           2
3          3  35       143        135    5.6   2         F           2
4          4  44       216        201    5.6   2         m           1
5          5  32       243        223    6.0   2         m           1
6          6  48       165        145    5.2   2         f           2
7          7  50       132        132    5.3   2         m           1
8          8  51       110        108    5.1   3         f           2
9          9  46       167        158    5.5   2                    NA
10        10  35       190        200    5.8   1      Male           1
11        11  36       230        210    6.2   1         m           1
12        12  40       200        195    6.1   1         f           2
13        13  45       180        185    5.9   3         f           2
14        14  52       240        220    6.5   2         m           1
15        15  24       250        240    6.4   2         M           1
16        16  35       175        174    5.8   2         F           2
17        17  51       220        221    6.3   2         m           1
18        18  43       230        215    2.6   2         m           1
19        19  36       190        180    5.7   1    female           2
20        20  44       260        240    6.4   3      male           1

Option 2 - use read.csv()

Option 2 assumes that there is a header row and the datafile uses a comma separator.

> data.csv <- read.csv(file="C:/MyGithub/A_Series_of_R_Workshops/datasets/Dataset_01_comma.csv")
> data.csv
   SubjectID Age WeightPRE WeightPOST Height SES GenderSTR GenderCoded
1          1  45       150        145    5.6   2         m           1
2          2  50       167        166    5.4   2         f           2
3          3  35       143        135    5.6   2         F           2
4          4  44       216        201    5.6   2         m           1
5          5  32       243        223    6.0   2         m           1
6          6  48       165        145    5.2   2         f           2
7          7  50       132        132    5.3   2         m           1
8          8  51       110        108    5.1   3         f           2
9          9  46       167        158    5.5   2                    NA
10        10  35       190        200    5.8   1      Male           1
11        11  36       230        210    6.2   1         m           1
12        12  40       200        195    6.1   1         f           2
13        13  45       180        185    5.9   3         f           2
14        14  52       240        220    6.5   2         m           1
15        15  24       250        240    6.4   2         M           1
16        16  35       175        174    5.8   2         F           2
17        17  51       220        221    6.3   2         m           1
18        18  43       230        215    2.6   2         m           1
19        19  36       190        180    5.7   1    female           2
20        20  44       260        240    6.4   3      male           1

Option 3 - use read.delim()

Option 3 assumes that there is a header row and the datafile uses a TAB separator.

> data.tab <- read.delim(file="C:/MyGithub/A_Series_of_R_Workshops/datasets/Dataset_01_tab.txt")
> data.tab
   SubjectID Age WeightPRE WeightPOST Height SES GenderSTR GenderCoded
1          1  45       150        145    5.6   2         m           1
2          2  50       167        166    5.4   2         f           2
3          3  35       143        135    5.6   2         F           2
4          4  44       216        201    5.6   2         m           1
5          5  32       243        223    6.0   2         m           1
6          6  48       165        145    5.2   2         f           2
7          7  50       132        132    5.3   2         m           1
8          8  51       110        108    5.1   3         f           2
9          9  46       167        158    5.5   2                    NA
10        10  35       190        200    5.8   1      Male           1
11        11  36       230        210    6.2   1         m           1
12        12  40       200        195    6.1   1         f           2
13        13  45       180        185    5.9   3         f           2
14        14  52       240        220    6.5   2         m           1
15        15  24       250        240    6.4   2         M           1
16        16  35       175        174    5.8   2         F           2
17        17  51       220        221    6.3   2         m           1
18        18  43       230        215    2.6   2         m           1
19        19  36       190        180    5.7   1    female           2
20        20  44       260        240    6.4   3      male           1

IMPORT data from an EXCEL file XLS and XLSX formats

We will use the readxl package. Be sure to install this package install.package("readxl") and then load it using library(readxl).

Here is code for reading in a XLS format

> data.xls <- read_excel("C:/MyGithub/A_Series_of_R_Workshops/datasets/Dataset_01.xls", sheet=1)
> data.xls
   SubjectID Age WeightPRE WeightPOST Height SES GenderSTR GenderCoded
1          1  45       150        145    5.6   2         m           1
2          2  50       167        166    5.4   2         f           2
3          3  35       143        135    5.6   2         F           2
4          4  44       216        201    5.6   2         m           1
5          5  32       243        223    6.0   2         m           1
6          6  48       165        145    5.2   2         f           2
7          7  50       132        132    5.3   2         m           1
8          8  51       110        108    5.1   3         f           2
9          9  46       167        158    5.5   2      <NA>          NA
10        10  35       190        200    5.8   1      Male           1
11        11  36       230        210    6.2   1         m           1
12        12  40       200        195    6.1   1         f           2
13        13  45       180        185    5.9   3         f           2
14        14  52       240        220    6.5   2         m           1
15        15  24       250        240    6.4   2         M           1
16        16  35       175        174    5.8   2         F           2
17        17  51       220        221    6.3   2         m           1
18        18  43       230        215    2.6   2         m           1
19        19  36       190        180    5.7   1    female           2
20        20  44       260        240    6.4   3      male           1

Here is code for reading in a XLSX format

> data.xlsx <- read_excel("C:/MyGithub/A_Series_of_R_Workshops/datasets/Dataset_01.xlsx", sheet=1)
> data.xlsx
   SubjectID Age WeightPRE WeightPOST Height SES GenderSTR GenderCoded
1          1  45       150        145    5.6   2         m           1
2          2  50       167        166    5.4   2         f           2
3          3  35       143        135    5.6   2         F           2
4          4  44       216        201    5.6   2         m           1
5          5  32       243        223    6.0   2         m           1
6          6  48       165        145    5.2   2         f           2
7          7  50       132        132    5.3   2         m           1
8          8  51       110        108    5.1   3         f           2
9          9  46       167        158    5.5   2      <NA>          NA
10        10  35       190        200    5.8   1      Male           1
11        11  36       230        210    6.2   1         m           1
12        12  40       200        195    6.1   1         f           2
13        13  45       180        185    5.9   3         f           2
14        14  52       240        220    6.5   2         m           1
15        15  24       250        240    6.4   2         M           1
16        16  35       175        174    5.8   2         F           2
17        17  51       220        221    6.3   2         m           1
18        18  43       230        215    2.6   2         m           1
19        19  36       190        180    5.7   1    female           2
20        20  44       260        240    6.4   3      male           1

IMPORT data from a SPSS file in SAV format

In order to read datafiles from SPSS, other statistical software, and many other datafile types. To see detailed info look at help for the foreign package. Remember to install.packages("foreign") and load the package using library(foreign). Run the following code to see details on this package and all the formats supported:

help(package = "foreign")

We will use this package to read in an SPSS datafile in the SAV format. The function we will use is read.spss(). See more detailed information on the various options by running help(read.spss). For now we will specify the filename and tell the function to make sure the file is output as a data frame object using the to.data.frame option set to TRUE.

> data.spss <- read.spss(file = "C:/MyGithub/A_Series_of_R_Workshops/datasets/Dataset_01.sav",
+                        to.data.frame=TRUE)
> data.spss
   SubjectID Age WeightPRE WeightPOST Height            SES
1          1  45       150        145    5.6 Average Income
2          2  50       167        166    5.4 Average Income
3          3  35       143        135    5.6 Average Income
4          4  44       216        201    5.6 Average Income
5          5  32       243        223    6.0 Average Income
6          6  48       165        145    5.2 Average Income
7          7  50       132        132    5.3 Average Income
8          8  51       110        108    5.1    High Income
9          9  46       167        158    5.5 Average Income
10        10  35       190        200    5.8     Low Income
11        11  36       230        210    6.2     Low Income
12        12  40       200        195    6.1     Low Income
13        13  45       180        185    5.9    High Income
14        14  52       240        220    6.5 Average Income
15        15  24       250        240    6.4 Average Income
16        16  35       175        174    5.8 Average Income
17        17  51       220        221    6.3 Average Income
18        18  43       230        215    6.2 Average Income
19        19  36       190        180    5.7     Low Income
20        20  44       260        240    6.4    High Income
                  GenderSTR GenderCoded
1  m                               Male
2  f                             Female
3  F                             Female
4  m                               Male
5  m                               Male
6  f                             Female
7  m                               Male
8  f                             Female
9                                  <NA>
10 Male                            Male
11 m                               Male
12 f                             Female
13 f                             Female
14 m                               Male
15 M                               Male
16 F                             Female
17 m                               Male
18 m                               Male
19 female                        Female
20 male                            Male

Create some new variables and save the output

In the datafile we have weights measures at 2 time points and we have height. We can use this data to compute BMI. Since weight is in pounds and height is in inches, we can use the following formula:

BMI_PRE=(WeightPRE*703)/((Height*12)*(Height*12))

From here let's work with data.csv. Since we have weights and height we can compute BMI. Let's do that here with weights in pounds and height in decminal feet which we'll convert to inches in the formula given here. You'll notice that I'm selecting the variables using the $ dollar sign. I'm also creating 2 NEW variables bmiPRE and bmiPOST. By creating them on the left side of the <- and using the $ this automatically APPENDS these new variables to the exisiting data frame data.csv. When we do this the data frame data.csv will go from having 8 variables to 9 and then to 10. Watch the global environment window as you run each line of code below.

> data.csv$bmiPRE <- (data.csv$WeightPRE*703)/((data.csv$Height*12)**2)

And we'll do it again for the POST weights:

BMI_POST=(WeightPOST*703)/((Height*12)*(Height*12))

> data.csv$bmiPOST <- (data.csv$WeightPOST*703)/((data.csv$Height*12)**2)

Isn't there an easier way besides using $?

So, yes, it is a pain to have to type in the data frame followed by a dollar sign $ and then the variable name. If you know for sure you're going to mainly be working with one data frame, you can ATTACH the variables inside data frame to your current environment so you can access the variables withouth having to type the name of the data frame and $ each time. For more info see this blog post at R-boggers http://www.r-bloggers.com/to-attach-or-not-attach-that-is-the-question/

Once we attach the dataset, you can call the variables directly. See example below to compute the change in BMI from PRE-to-POST and then find the mean of these differences.

> attach(data.csv)
> 
> diff <- bmiPOST - bmiPRE
> mean(diff)
[1] -1.598245
> detach(data.csv)

ALWAYS remember to DETACH your data frame when finished.

Now that we have a new variable created the diff object, it is sitting in the global environment not attached to the original data frame. We can add it to the data frame data.csv as follows:

> data.csv$diff <- diff

Now that we've updated our dataset, let's save it using the basic save() function - we can save it as a R formatted file xxx.RData

EXPORT or SAVE the updated data

We can save it out as a RData file using the save() function.

> save(data.csv, 
+      file="C:/MyGithub/A_Series_of_R_Workshops/datasets/datacsv.RData")

Save the data out in a delimited format. First we'll do a comma delimited CSV file using write.csv().

> write.csv(data.csv, 
+           file="C:/MyGithub/A_Series_of_R_Workshops/datasets/datacsv.csv")

Next we'll do a TAB delimited text file using the write.table().

> write.table(data.csv, 
+           file="C:/MyGithub/A_Series_of_R_Workshops/datasets/datacsv.txt",
+           sep="\t")