Is a task repetitive, boring, and/or error-prone? Script it!#

Computers excel at some tasks: they easily find a typo in thousands of file names, don’t get tired and forget to click an option or accidentally mislabel a file, and happily repeat the same operations many, many times. Whenever possible, have computers do the repetitive and boring parts of dataset QC. If you find yourself dreading part of the preprocessing because it is tedious, step back and see if it can be automated.

In addition to reducing researcher annoyance (which is of value in itself!), well-designed scripts can reduce how many errors occur and how quickly they’re found (and corrected). For example, if a data file has a typo in its name a script will report the file as missing, while a person may click the file without noticing the typo. Scripts can also make it faster and easier to rerun an analysis or conversion if needed, such as to change a parameter.

When possible, have scripts read and write files directly from the intended permanent storage locations. For example, we store some files in box. These files can be transferred by clicking options in the box web GUI, but this is slow and it is easy to accidentally select the wrong file. Instead, preprocessing scripts can use boxr functions to retrieve and upload files directly. Box integration is not included here, but see this DMCC example for an example.

Example: Eprime to csv conversion#

Background and motivation#

Eprime is a commercial program for presenting stimuli and experiments. Eprime saves each participant’s data in a proprietary format and non-human-readable plain text. We create a csv version of the output as quickly as possible after data collection, both to check that the participant’s data is usable, and to have it in a long-term accessible format. (I suggest doing such a conversion for all non-standard file formats, not just eprime; store critical data in formats like nifti and text whenever possible.) This conversion task is a prime target for scripting: it must be done often, is annoying and slow to do “manually” (requiring clicking through a GUI and having access to a software license), and its accuracy can be tested algorithmically (e.g., by counting trial types).

The example uses eprime files from a heart beat counting task like that described in Pollatos, Traut-Mattausch, and Schandry (2009). The task starts with a five-minute baseline period, followed by three timed trials during which the participant is asked to count their heart beats. After each trial participants verbally report how many beats they counted and their confidence in the count. The same trial order is used for all participants: the first is 25 seconds, second 35 seconds, and third 45 seconds.

Implementation of the conversion script#

The goal is to convert the eprime data file for a participant into a human and computer-readable csv. We also want to confirm that the task was presented correctly, in this case, that the three trials, initial baseline, and final rest periods are present, and in the expected order and durations. Since this example is Eprime and R, we will use eMergeR library functions for parsing information out of the eprime “text recovery” file. (For long-term stability I suggest keeping code dependencies at the absolute minimum, but it is also generally inadvisable to reimplement complex procedures like eprime file parsing, given the time and effort required, plus likelihood of introducing errors.)

Startup and a look at the input data#

I generally suggest starting each script by loading any needed libraries, clearing R’s memory, setting options, and defining needed variables. For the demo we will convert files for four “participants”: demoSub1, demoSub2, demoSub3, and demoSub4.

knitr::opts_chunk$set(message=FALSE);  # don't print warning messages
library(eMergeR);   # for edatR(), as.data.frame.edat(). https://github.com/AWKruijt/eMergeR

# define some variables to be used later in this script
# I strongly recommend explicitly setting complete paths (e.g., in.path <- 'd:/projects/demo/")
# at the top of a script rather than using setwd() or similar, but the notebook requires relative paths.
in.path <- "example1files/";    # path to input file directory
out.path <- "example1files/output/";   # directory in which the converted files would be written

sub.ids <- paste0("demoSub", 1:4);  # declare vector of subject IDs
Loading required package: plyr

A conversion function#

Next is a function to carry out the conversion. This type of function could be stored in a separate file which is sourced at the top of the file with the other startup code. Note that the function includes several checks that the expected values are present. This is valuable for catching errors or changes as quickly as possible, such as if the task presentation script was accidentally changed partway through the study.

do.convert <- function(sub.id) {    # sub.id <- "demoSub1";
  out.fname <- paste0("interoception-", sub.id, "-converted.csv");  # name of the output file to be made
  # in a full script this would be a good place to check if the out.fname file already exists

  fname <- paste0("interoception_", sub.id, ".txt");   # build expected input filename
  if (file.exists(paste0(in.path, fname))) {   # file exists, so read it and convert.
    print(paste("loading file for", sub.id, "...."));  # a getting-started message

    e.in <- edatR(paste0(in.path, fname));     # eMergeR function for reading text recovery files
    e.tbl <- as.data.frame.edat(e.in, simplify=TRUE);   # initial conversion
    
    found.error <- FALSE;   # error-found flag to avoid using stop() or nested ifs
    # check that some key fields are present and have the expected values
    if (!identical(e.tbl$Procedure, c("baselineProc", "rest", "trigger", "rest", "trigger", "rest", "trigger", "rest"))) { 
      print("ERROR: mismatch in $Procedure!"); 
      found.error <- TRUE;  # change flag
    }

    # the eMergerR functions put the initial baseline session into e.tbl, but not all 
    # of its timing info. This next bit of code confirms that the fields are indeed missing 
    # from e.tbl, retrieves them from the trial_info part of the eMergerR object, checks them
    # and finally adds them to the correct places in e.tbl.
    if (!is.na(e.tbl$black.OnsetTime[1]) | !is.na(e.tbl$black.OffsetTime[1])) { 
      print("ERROR: have baseline onsets or offsets??"); 
      found.error <- TRUE;  # change flag
    }
    
    # check that the baseline period is of the expected duration
    tmp.vec <- e.in$trial_info$`1`; 
    # times in msec, so divide by 1000 for seconds, and 60 for minutes
    if ((as.numeric(tmp.vec["baseline5min.OffsetTime"]) - as.numeric(tmp.vec["baseline5min.OnsetTime"]))/60000 != 5) { 
      print("ERROR: the baseline was not 5 minutes"); 
      found.error <- TRUE;  # change flag
    } 
      
    if (found.error == FALSE) {   # all correct, so store values and write
      e.tbl$black.OnsetTime[1]  <- tmp.vec["baseline5min.OnsetTime"];    # add missing values to e.tbl
      e.tbl$black.OffsetTime[1] <- tmp.vec["baseline5min.OffsetTime"];
      
      # now have all the information, so write as a csv (commented for demo)
      # write.csv(e.tbl, paste0(out.path, out.fname), row.names=FALSE);
      
      if (file.exists(paste0(out.path, out.fname))) {   # confirm file present and print a success message
        print(paste("success! have", out.fname, "on disk."));
      } else { 
        print(paste("ERROR:", out.fname, "not made!")); 
      }
    }
  } else { print(paste0("ERROR: missing input file ", in.path, fname, "!")); }
}

Using the conversion function#

Then we can call the function to convert each participant’s data. Notice the success and error messages that are printed.

do.convert("demoSub1");
[1] "loading file for demoSub1 ...."
Warning message in .parse_file(txt_filename):
“input string '<ff><fe>' cannot be translated to UTF-8, is it valid in 'ANSI_X3.4-1968'?”
Warning message in .parse_file(txt_filename):
“input string '<ff><fe>' cannot be translated to UTF-8, is it valid in 'ANSI_X3.4-1968'?”
Warning message in .parse_file(txt_filename):
“input string '<ff><fe>' cannot be translated to UTF-8, is it valid in 'ANSI_X3.4-1968'?”
Warning message in .parse_file(txt_filename):
“input string '<ff><fe>' cannot be translated to UTF-8, is it valid in 'ANSI_X3.4-1968'?”
Warning message in .parse_file(txt_filename):
“input string '<ff><fe>' cannot be translated to UTF-8, is it valid in 'ANSI_X3.4-1968'?”
[1] "success! have interoception-demoSub1-converted.csv on disk."
do.convert("demoSub2"); 
[1] "ERROR: missing input file example1files/interoception_demoSub2.txt!"
do.convert("demoSub3"); 
[1] "loading file for demoSub3 ...."
[1] "success! have interoception-demoSub3-converted.csv on disk."
do.convert("demoSub4"); 
[1] "loading file for demoSub4 ...."
[1] "ERROR: the baseline was not 5 minutes"

Examining the output#

The conversion was successful for demoSubs 1 and 3, but not 2 or 4, why?

Looking at the names of the input files gives one answer:

print(list.files(in.path));  # print the names of the files in the input directory
[1] "interoception_demoSub1.txt" "interoception_demoSub3.txt"
[3] "interoception_demoSub4.txt" "interoception_demoSvb2.txt"
[5] "output"                    

For demoSub4, I edited the timing values in the raw eprime text recovery to have a too-long baseline period. What happens if you edit the conversion code to remove the test of baseline period duration? If particular experimental values are important for determining if the session went properly, test for these values as early and often as possible.

# to see the created csvs:
# read.csv(paste0(out.path, "interoception-demoSub1-converted.csv"));
read.csv(paste0(out.path, "interoception-demoSub3-converted.csv"));

# it can be easier to run a group of subjects with a loop:
# for (sid in 1:length(sub.ids)) { do.convert(sub.ids[sid]); }
A data.frame: 8 × 12
timingProceduretimetypetiming.Sampleblack.OnsetTimeblack.Durationblack.OffsetTimestartChime.OnsetTimestartChime.OffsetTimeendChime.OnsetTimeendChime.OffsetTime
<int><chr><int><chr><int><int><int><int><int><int><int><int>
1baselineProc300000baseline 1 17769 NA317769 NA NA NA NA
2rest 6000060 rest 250719760000567197 NA NA NA NA
3trigger 2500025 seconds3 NA NA NA567236592236592254593754
4rest 3000030 rest 459377830000623778 NA NA NA NA
5trigger 3500035 seconds5 NA NA NA623796658796658813660313
6rest 3000030 rest 666034030000690340 NA NA NA NA
7trigger 4500045 seconds7 NA NA NA690358735358735374736874
8rest 6000060 rest 873689760000796897 NA NA NA NA