Is a task repetitive, boring, and/or error-prone? Script it!#

Computers excel at some tasks: they easily find a typo in thousands of file names, don’t get tired and forget to click an option or accidentally mislabel a file, and happily repeat the same operations many, many times. Whenever possible, have computers do the repetitive and boring parts of dataset QC. If you find yourself dreading part of the preprocessing because it is tedious, step back and see if it can be automated.

In addition to reducing researcher annoyance (which is of value in itself!), well-designed scripts can reduce how many errors occur and speed up how quickly they are found (and corrected). For example, if a data file has a typo in its name, a script will report the file as missing, while a person may click the file without noticing the typo. Scripts can also make it faster and easier to rerun an analysis or conversion if needed, such as to change a parameter.

When possible, have scripts read files directly from, and write them directly to, the intended permanent storage locations. For example, we store some files in Box. These files can be transferred by clicking options in the Box web GUI, but this is slow and it is easy to accidentally select the wrong file. Instead, preprocessing scripts can use boxr functions to retrieve and upload files directly. Box integration is not included here, but see this DMCC example.

Example: Eprime to csv conversion#

Background and motivation#

Eprime is a commercial program for presenting stimuli and running experiments. Eprime saves each participant’s data in a proprietary format and as plain text that is not easily human-readable. We create a csv version of the output as quickly as possible after data collection, both to check that the participant’s data is usable and to have it in a long-term accessible format. (I suggest doing such a conversion for all non-standard file formats, not just eprime; store critical data in formats like nifti and text whenever possible.) This conversion task is a prime target for scripting: it must be done often, is annoying and slow to do “manually” (requiring clicking through a GUI and having access to a software license), and its accuracy can be tested algorithmically (e.g., by counting trial types).

The example uses eprime files from a heart beat counting task like that described in Pollatos, Traut-Mattausch, and Schandry (2009). The task starts with a five-minute baseline period, followed by three timed trials during which the participant is asked to count their heart beats. After each trial participants verbally report how many beats they counted and their confidence in the count. The same trial order is used for all participants: the first is 25 seconds, second 35 seconds, and third 45 seconds.

Implementation of the conversion script#

The goal is to convert the eprime data file for a participant into a human- and computer-readable csv. We also want to confirm that the task was presented correctly: in this case, that the three trials, initial baseline, and final rest periods are present, in the expected order, and with the expected durations. Since this example uses Eprime and R, we will use eMergeR library functions for parsing information out of the eprime “text recovery” file. (For long-term stability I suggest keeping code dependencies at the absolute minimum, but it is also generally inadvisable to reimplement complex procedures like eprime file parsing, given the time and effort required, plus the likelihood of introducing errors.)

Startup and a look at the input data#

I generally suggest starting each script by loading any needed libraries, clearing R’s memory, setting options, and defining needed variables. For the demo we will convert files for four “participants”: demoSub1, demoSub2, demoSub3, and demoSub4.

knitr::opts_chunk$set(message=FALSE);  # suppress messages in knitr output (warnings are controlled separately by warning=)
library(eMergeR);   # for edatR(), as.data.frame.edat(). https://github.com/AWKruijt/eMergeR

# define some variables to be used later in this script
# I strongly recommend explicitly setting complete paths (e.g., in.path <- "d:/projects/demo/")
# at the top of a script rather than using setwd() or similar, but the notebook requires relative paths.
in.path <- "example1files/";    # path to input file directory
out.path <- "example1files/output/";   # directory in which the converted files would be written

sub.ids <- paste0("demoSub", 1:4);  # declare vector of subject IDs

Loading required package: plyr

A conversion function#

Next is a function to carry out the conversion. This type of function could be stored in a separate file which is sourced at the top of the file with the other startup code. Note that the function includes several checks that the expected values are present. This is valuable for catching errors or changes as quickly as possible, such as if the task presentation script was accidentally changed partway through the study.

do.convert <- function(sub.id) {    # sub.id <- "demoSub1";
  out.fname <- paste0("interoception-", sub.id, "-converted.csv");  # name of the output file to be made
  # in a full script this would be a good place to check if the out.fname file already exists

  fname <- paste0("interoception_", sub.id, ".txt");   # build expected input filename
  if (file.exists(paste0(in.path, fname))) {   # file exists, so read it and convert.
    print(paste("loading file for", sub.id, "...."));  # a getting-started message

    e.in <- edatR(paste0(in.path, fname));     # eMergeR function for reading text recovery files
    e.tbl <- as.data.frame.edat(e.in, simplify=TRUE);   # initial conversion
    
    found.error <- FALSE;   # error-found flag to avoid using stop() or nested ifs
    # check that some key fields are present and have the expected values
    if (!identical(e.tbl$Procedure, c("baselineProc", "rest", "trigger", "rest", "trigger", "rest", "trigger", "rest"))) { 
      print("ERROR: mismatch in $Procedure!"); 
      found.error <- TRUE;  # change flag
    }

    # the eMergeR functions put the initial baseline session into e.tbl, but not all 
    # of its timing info. This next bit of code confirms that the fields are indeed missing 
    # from e.tbl, retrieves them from the trial_info part of the eMergeR object, checks them,
    # and finally adds them to the correct places in e.tbl.
    if (!is.na(e.tbl$black.OnsetTime[1]) | !is.na(e.tbl$black.OffsetTime[1])) { 
      print("ERROR: have baseline onsets or offsets??"); 
      found.error <- TRUE;  # change flag
    }
    
    # check that the baseline period is of the expected duration
    tmp.vec <- e.in$trial_info$`1`; 
    # times in msec, so divide by 1000 for seconds, and 60 for minutes
    if ((as.numeric(tmp.vec["baseline5min.OffsetTime"]) - as.numeric(tmp.vec["baseline5min.OnsetTime"]))/60000 != 5) { 
      print("ERROR: the baseline was not 5 minutes"); 
      found.error <- TRUE;  # change flag
    } 
      
    if (found.error == FALSE) {   # all correct, so store values and write
      e.tbl$black.OnsetTime[1]  <- tmp.vec["baseline5min.OnsetTime"];    # add missing values to e.tbl
      e.tbl$black.OffsetTime[1] <- tmp.vec["baseline5min.OffsetTime"];
      
      # now have all the information, so write as a csv (commented for demo)
      # write.csv(e.tbl, paste0(out.path, out.fname), row.names=FALSE);
      
      if (file.exists(paste0(out.path, out.fname))) {   # confirm file present and print a success message
        print(paste("success! have", out.fname, "on disk."));
      } else { 
        print(paste("ERROR:", out.fname, "not made!")); 
      }
    }
  } else { print(paste0("ERROR: missing input file ", in.path, fname, "!")); }
}
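
The comment near the top of do.convert() mentions checking whether the output file already exists. One way that check might look is sketched below. The helper name `needs.conversion` and the temporary demo directory are illustrative assumptions, not part of the original script; only the output file naming is copied from do.convert().

```r
# hypothetical helper: returns TRUE if the conversion still needs to be run for this subject
needs.conversion <- function(sub.id, out.path) {
  out.fname <- paste0("interoception-", sub.id, "-converted.csv");   # same naming as do.convert()
  if (file.exists(file.path(out.path, out.fname))) {
    print(paste("skipping", sub.id, "since", out.fname, "already exists."));
    return(FALSE);
  }
  return(TRUE);
}

# demonstration in a temporary directory, so no real files are touched
demo.path <- file.path(tempdir(), "convert_demo");
dir.create(demo.path, showWarnings=FALSE);
file.create(file.path(demo.path, "interoception-demoSub1-converted.csv"));   # pretend demoSub1 is done

todo <- c(needs.conversion("demoSub1", demo.path), needs.conversion("demoSub3", demo.path));
print(todo);
```

Skipping (rather than silently overwriting) already-converted files makes it safe to rerun the script over the full subject list whenever new participants are added.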

Using the conversion function#

Then we can call the function to convert each participant’s data. Notice the success and error messages that are printed.

do.convert("demoSub1");

[1] "loading file for demoSub1 ...."

Warning message in .parse_file(txt_filename):
“input string '<ff><fe>' cannot be translated to UTF-8, is it valid in 'ANSI_X3.4-1968'?”

Warning message in .parse_file(txt_filename):
“input string '<ff><fe>' cannot be translated to UTF-8, is it valid in 'ANSI_X3.4-1968'?”

Warning message in .parse_file(txt_filename):
“input string '<ff><fe>' cannot be translated to UTF-8, is it valid in 'ANSI_X3.4-1968'?”

Warning message in .parse_file(txt_filename):
“input string '<ff><fe>' cannot be translated to UTF-8, is it valid in 'ANSI_X3.4-1968'?”

Warning message in .parse_file(txt_filename):
“input string '<ff><fe>' cannot be translated to UTF-8, is it valid in 'ANSI_X3.4-1968'?”

[1] "success! have interoception-demoSub1-converted.csv on disk."

do.convert("demoSub2"); 

[1] "ERROR: missing input file example1files/interoception_demoSub2.txt!"

do.convert("demoSub3"); 

[1] "loading file for demoSub3 ...."
[1] "success! have interoception-demoSub3-converted.csv on disk."

do.convert("demoSub4"); 

[1] "loading file for demoSub4 ...."
[1] "ERROR: the baseline was not 5 minutes"

Examining the output#

The conversion was successful for demoSubs 1 and 3, but not 2 or 4. Why?

Looking at the names of the input files gives one answer:

print(list.files(in.path));  # print the names of the files in the input directory

[1] "interoception_demoSub1.txt" "interoception_demoSub3.txt"
[3] "interoception_demoSub4.txt" "interoception_demoSvb2.txt"
[5] "output"                    
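
Comparing the expected file names against the directory listing makes this kind of mismatch explicit. This is a minimal sketch: the listing printed above is typed in as a vector (`found.files`) in place of a `list.files(in.path)` call so that the code runs on its own, and the variable names are illustrative.

```r
# expected input file names, built the same way as in do.convert()
sub.ids <- paste0("demoSub", 1:4);
expected.files <- paste0("interoception_", sub.ids, ".txt");

# file names as printed by list.files(in.path) above (hard-coded for this sketch)
found.files <- c("interoception_demoSub1.txt", "interoception_demoSub3.txt",
                 "interoception_demoSub4.txt", "interoception_demoSvb2.txt", "output");

missing.files <- setdiff(expected.files, found.files);      # expected but not present
unexpected.files <- setdiff(found.files, expected.files);   # present but not expected

print(missing.files);
print(unexpected.files);
```

Here `missing.files` contains interoception_demoSub2.txt and `unexpected.files` contains interoception_demoSvb2.txt (plus the output directory), which points directly at the typo.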

For demoSub2, the input file name contains a typo (demoSvb2), so the script correctly reported the expected file as missing. For demoSub4, I edited the timing values in the raw eprime text recovery file to have a too-long baseline period. What happens if you edit the conversion code to remove the test of baseline period duration? If particular experimental values are important for determining if the session went properly, test for these values as early and often as possible.

# to see the created csvs:
# read.csv(paste0(out.path, "interoception-demoSub1-converted.csv"));
read.csv(paste0(out.path, "interoception-demoSub3-converted.csv"));

# it can be easier to run a group of subjects with a loop:
# for (sid in seq_along(sub.ids)) { do.convert(sub.ids[sid]); }

A data.frame: 8 × 12
timing	Procedure	time	type	timing.Sample	black.OnsetTime	black.Duration	black.OffsetTime	startChime.OnsetTime	startChime.OffsetTime	endChime.OnsetTime	endChime.OffsetTime
<int>	<chr>	<int>	<chr>	<int>	<int>	<int>	<int>	<int>	<int>	<int>	<int>
1	baselineProc	300000	baseline	1	17769	NA	317769	NA	NA	NA	NA
2	rest	60000	60 rest	2	507197	60000	567197	NA	NA	NA	NA
3	trigger	25000	25 seconds	3	NA	NA	NA	567236	592236	592254	593754
4	rest	30000	30 rest	4	593778	30000	623778	NA	NA	NA	NA
5	trigger	35000	35 seconds	5	NA	NA	NA	623796	658796	658813	660313
6	rest	30000	30 rest	6	660340	30000	690340	NA	NA	NA	NA
7	trigger	45000	45 seconds	7	NA	NA	NA	690358	735358	735374	736874
8	rest	60000	60 rest	8	736897	60000	796897	NA	NA	NA	NA
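
The converted table can itself be tested against the task design (trials of 25, 35, and 45 seconds, in that order). The sketch below types in the three trigger rows from the table above rather than reading the csv, so that it runs on its own. It assumes that the interval between startChime.OnsetTime and startChime.OffsetTime corresponds to the counting period, which matches the time column in this output but is not confirmed elsewhere; the variable names and the 1 msec tolerance are arbitrary choices.

```r
# the three trigger (counting) rows from the demoSub3 table above, typed in by hand
trial.tbl <- data.frame(
  Procedure = c("trigger", "trigger", "trigger"),
  time = c(25000, 35000, 45000),
  startChime.OnsetTime = c(567236, 623796, 690358),
  startChime.OffsetTime = c(592236, 658796, 735358)
);

expected.msec <- c(25, 35, 45) * 1000;   # trial durations in order, from the task description

# check 1: the time column lists the trials in the expected order
order.ok <- isTRUE(all.equal(trial.tbl$time, expected.msec));

# check 2: recorded onset-to-offset intervals agree with the time column
# (assumption: this interval corresponds to the trial duration)
recorded.msec <- trial.tbl$startChime.OffsetTime - trial.tbl$startChime.OnsetTime;
timing.ok <- all(abs(recorded.msec - trial.tbl$time) <= 1);   # 1 msec tolerance is arbitrary

print(c(order.ok=order.ok, timing.ok=timing.ok));
```

Both checks pass for these rows. In a full script, the same tests could be run on the data frame returned by read.csv() for each converted file.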

ISMRM'22 QC Book
