Random-sampling reads in a FASTQ file with BASH one-liner

Recently I’ve been entertained by the powerfulness of awk commands in bioinformatics. In this post, I’m going to explain the underlying principles in this awk-extensive BASH one-liners written for random-sampling of FASTQ reads, step-by-step.

1. Random-sampling single-read FASTQ

The one-liner is given as below. Note that I fixed some parts in the code which might cause some logical errors.

cat single.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) {printf("\n");} else { printf("\t");} }' | awk -v k=10000 'BEGIN{srand(systime() + PROCINFO["pid"])}{s=NR<=k?NR:int(rand()*NR);if(s<k)R[s]=$0}END{for(i in R)print R[i]}' | awk -F"\t" '{print $1"\n"$2"\n"$3"\n"$4 > "single_sub.fastq"}'

It is pretty illegible, huh? Nevertheless, believe me, the code is not as much as complex as you think. Before looking into the details, let’s split up the one-liner into four parts.

1.1 Printing out the contents of FASTQ into the pipe

cat single.fastq

This part is straightfoward. If you have gzipped fastq file (*.fastq.gz), use zcat instead.

1.2 Squishing every four lines into a single line

awk '{
  printf("%s", $0);
  if (NR % 4 == 0) {printf("\n")} else {printf("\t")}

The code above is a prettified version of the first awk command. The body block of the command first prints the line it fetched without a newline character, so every line of the file should be concatenated if the following if clause were not presented.  You can easily notice the role of if clause. It prints newline character between every four lines, otherwise it prints tab character.  

I’ll show an toy example to explain the result of the first awk command. Assume a simple FASTQ file with two reads:

$ cat test.fastq

Running the awk command above gives squished FASTQ file as below:

$ awk '{printf("%s", $0); if (NR % 4 == 0) {printf("\n")} else {printf("\t")}}' test.fastq
@read1  ATCGG   +length=5       <AAAA
@read2  AAAAA   +length=5       <AA77

1.3 Reservoir sampling in array R

awk -v k=10000 
'BEGIN {srand(systime() + PROCINFO["pid"])}
  s = NR <= k ? NR : int(rand() * NR);
  if (s < k) R[s] = $0
END {for (i in R) print R[i]}'

The second awk command implements reservoir sampling. Since here I don’t explain the mathematical justification of the reservoir sampling, see this article if you are interested. In short, it is an \(O(n)\) algorithm to sample \(k\) items from \(n\) items, which maintains an array \(R\) of length \(k\) while examining the items one by one. \(R\) is initialized with the first \(k\) items. For \(x\)th item(\(x >= k\)), a random index \(s\) between 0 through \(x\) is sampled. If \(s < k\), the the \(s\)th is assigned to \(R[x]\).

The awk code straightforwardly implements this algorithm. Can you follow the logic inside the code?

awk -v k=10000 
'BEGIN {srand(systime() + PROCINFO["pid"])}  # Just set random seeds.
  # The first k lines (NR <= k case) will be stored directly in R[s],
  # and if randomly sampled index (s) is less than k,
  # the sth item in the reservoir is replaced with current line.
  s = NR <= k ? NR : int(rand() * NR);
  if (s < k) R[s] = $0
END {for (i in R) print R[i]}'  # Print out all the lines in reservoir to the pipe.

1.4 Restoring the FASTQ format

awk -F"\t" '{print $1"\n"$2"\n"$3"\n"$4 > "single_sub.fastq"}'

The last part is simple. It separates each line with tab character, joins fields with newline character and prints.

2. Random-sampling paired-end FASTQ

The one-liner is given as below.

paste forward.fastq reverse.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t");} }' | awk -v k=10000 'BEGIN{srand(systime() + PROCINFO["pid"]);}{s=NR<=k?NR:int(rand()*NR);if(s<k)R[s]=$0}END{for(i in R)print R[i]}' | awk -F"\t" '{print $1"\n"$3"\n"$5"\n"$7 > "forward_sub.fastq";print $2"\n"$4"\n"$6"\n"$8 > "reverse_sub.fastq"}'

For paired-end case, the core logic that uses reservoir sampling is just the same as single-read case. This case, the one-liner squishes a pair of reads with paste command, which concatenates two files line by line. The paste command works like:

$ cat test.read1.fastq
$ cat test.read2.fastq
$ paste test.read1.fastq test.read2.fastq
@read1.1        @read1.2
+length=5       +length=5
@read2.1        @read2.2
+length=5       +length=5
<AA77   <35A9

so that the read pairs are on the same line. Because two paired-end FASTQ files is merged like this, you can apply the same command as single-read case! Note that for gzipped FASTQ files, use paste <(zcat read1.fastq.gz) <(zcat read2.fastq.gz) instead.

3. Snakemake wrappers for FASTQ-subsampling

I added this one-liner commands in my collection of snakemake wrappers as subsample-fastq (since I pursue imperative-names of my custom snakemake rules). If you are too lazy to write or just copy-and-paste those one-liners, but you want to integrate fastq-subsampling workflow to your analysis pipeline, just use it by writing a snakemake rule as below!

rule subsample-fastq:
        # Required input.
        reads = ['data/{sample}/{sample}.fastq.read1.gz', 'data/{sample}/{sample}.fastq.read2.gz']
    threads: 1  # No more than 1 threads will be used.
        # Required parameters.
        k = 10000  # Number of sampled reads.
    log: 'logs/fastq_subsampling/{sample}.log'