Scalable FASTQ QC: Merging, Automation and MultiQC Reporting

Learn how to merge hundreds of FASTQ files, automate FastQC execution in parallel, and aggregate all results into a single interactive MultiQC report.

Introduction

You just got your sequencing data back. Instead of a few tidy files, you’re staring at 300 FASTQ files. Running FastQC on each one means 300 HTML reports to open manually. Click, scroll, close. Click, scroll, close. Repeat 298 more times. I once spent an entire afternoon doing exactly this before someone told me about MultiQC.

This guide shows you how to:

- merge those files intelligently,
- automate FastQC across your entire dataset,
- aggregate everything into one clean report.

We’ll go from chaos to a reproducible workflow you can use on every project.

Key Takeaways

- Modern Illumina sequencing often splits output into many files per sample (lanes, tiles, legacy demultiplexing via CASAVA/bcl2fastq). This is normal, not an error.
- FASTQ files from the same sample and read direction (R1 or R2) can be safely merged with cat BEFORE alignment.
- Always verify your merge with line counts (zcat file | wc -l) or stream checksums to catch silent corruption.
- FastQC can be parallelized with the -t flag, GNU parallel, or simple bash loops.
- MultiQC aggregates hundreds of FastQC reports into one interactive HTML dashboard.
- Ask your sequencing facility about --no-lane-splitting (bcl2fastq2/BCL Convert) to avoid fragmentation at the source.
- A standardized, documented QC pipeline saves hours on every project and prevents rookie mistakes.

Before You Start

Software Requirements:

- Linux/macOS terminal (or WSL on Windows)
- FastQC (v0.11.9 or later)
- MultiQC (v1.14 or later)
- Python 3.6+ (required for MultiQC)
- Basic bash/command-line knowledge

Key Terms:

- FASTQ: Text-based format for storing biological sequences and their quality scores. Each read consists of 4 lines: identifier, sequence, separator (+) and quality scores (ASCII-encoded).
- R1/R2 Files: Paired-end sequencing produces two files per sample: R1 (forward reads) and R2 (reverse reads). These must be kept synchronized. R2 typically has lower quality due to sequencing chemistry.
- Lanes: Physical divisions on an Illumina flow cell. Samples are often split across multiple lanes for throughput, creating multiple files per sample that need to be merged.
- FastQC: Widely used tool that generates quality metrics for sequencing data, producing an HTML report with visualizations for each input file. It does not aggregate across files.
- MultiQC: Aggregation tool that combines outputs from FastQC and 150+ other bioinformatics tools into a single interactive HTML report. (The solution to the “300 reports” problem.)
- Demultiplexing: Process of separating pooled samples based on their unique barcode sequences after sequencing. This happens before you receive your data.
- bcl2fastq: Illumina’s software for converting raw BCL files to FASTQ format. The --no-lane-splitting option can prevent file fragmentation.

Why So Many Files?

1. Sequencing output is fragmented. Historically, Illumina software (CASAVA, older bcl2fastq) split data by tiles, lanes, or file size limits. Old habits die hard, and many facilities still use these legacy configurations.
2. FastQC treats every file independently. There’s no built-in aggregation. 300 input files = 300 output reports. It scales linearly with your suffering.
3. Labs inherit outdated practices. Directory structures and workflows from 2015 are still running in 2025. Nobody wants to touch “the pipeline that works.”

Verifying Your Installation

Before we start, let’s make sure both FastQC and MultiQC are actually installed.
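A quick way to confirm that both tools are on your PATH in one go is sketched here. This is a minimal illustration: check_tools is a hypothetical helper name, not part of FastQC or MultiQC, and it only reports whether each command can be found, not which version is installed.

```bash
#!/bin/bash
# check_tools is a hypothetical helper: it reports whether each named
# command is on PATH and returns non-zero if any of them is missing.
check_tools() {
    local tool missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

check_tools fastqc multiqc || echo "Install the missing tools before continuing."
```

If anything is reported as missing, install it first; otherwise carry on with the version checks.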
Check FastQC:

```bash
fastqc --version
# FastQC v0.12.1
```

Check MultiQC:

```bash
multiqc --version
# multiqc, version 1.21
```

If you need to install them:

```bash
# Conda (recommended for reproducibility)
conda install -c conda-forge -c bioconda fastqc multiqc

# Or separately with pip/apt
pip install multiqc
sudo apt-get install fastqc   # Ubuntu/Debian
brew install fastqc           # macOS
```

Why Do We Need to Merge FASTQ Files?

When your sample is split across multiple lanes, those lanes are technical replicates of the same biological material rather than separate experiments. Merging brings all that data together so downstream tools (e.g., aligners and variant callers) see the complete picture for each sample. Without merging, you end up managing hundreds of files and paying unnecessary I/O overhead on each of them.

Some basic rules of thumb:

- Merge R1 with R1, R2 with R2 (never mix them!), and concatenate lanes in the same order for both so read pairs stay in sync
- Merge AFTER demultiplexing, BEFORE alignment
- Same sample only

Basic merge with cat:

```bash
cat sample001_*_R1_*.fastq.gz > sample001_R1_merged.fastq.gz
cat sample001_*_R2_*.fastq.gz > sample001_R2_merged.fastq.gz
```

This works because the gzip format allows several compressed streams to be concatenated; tools that read gzip treat the result as one file.

Alternative: zcat for better compression

A Biostars user noted that directly concatenating gzipped files can produce larger output. For better compression efficiency:

```bash
# Decompress, concatenate, recompress for smaller file size
zcat sample001_L00*_R1_*.fastq.gz | gzip > sample001_R1_merged.fastq.gz
```

Batch merge script:

```bash
#!/bin/bash
mkdir -p merged

# Sample name = everything before the first underscore in the file name
SAMPLES=$(ls *_R1_*.fastq.gz | cut -d'_' -f1 | sort -u)

for SAMPLE in $SAMPLES; do
    echo "Merging $SAMPLE..."
    cat "${SAMPLE}"_*_R1_*.fastq.gz > "merged/${SAMPLE}_R1.fastq.gz"
    cat "${SAMPLE}"_*_R2_*.fastq.gz > "merged/${SAMPLE}_R2.fastq.gz"
done
echo "Done!"
```

How to verify the merge worked (do not skip!)

Never skip verification. Silent corruption will ruin your downstream analysis, and you won’t know until days later when your alignment fails or produces garbage.
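Because R1 and R2 must stay synchronized, it is also worth confirming that a merged pair still contains the same number of reads. The sketch below assumes gzipped FASTQ files with exactly 4 lines per read; count_reads and check_pair are hypothetical helper names introduced for illustration.

```bash
#!/bin/bash
# Count reads in a gzipped FASTQ file (assumes 4 lines per read).
count_reads() {
    local lines
    lines=$(gzip -dc "$1" | wc -l)
    echo $(( lines / 4 ))
}

# Compare the read counts of an R1/R2 pair.
# Prints a summary and returns non-zero on a mismatch.
check_pair() {
    local r1 r2
    r1=$(count_reads "$1")
    r2=$(count_reads "$2")
    if [ "$r1" -eq "$r2" ]; then
        echo "OK: $r1 read pairs"
    else
        echo "MISMATCH: R1=$r1 R2=$r2"
        return 1
    fi
}

# Hypothetical usage with the merged file names from this guide:
# check_pair sample001_R1_merged.fastq.gz sample001_R2_merged.fastq.gz
```

Matching counts do not prove the pairs are in the same order, but a mismatch is a clear sign that lanes were dropped or mixed up during the merge.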
Method 1: Line count verification (recommended)

```bash
# Count lines in the original files (zcat decompresses to stdout)
zcat sample001_L001_R1_*.fastq.gz sample001_L002_R1_*.fastq.gz | wc -l
# Output: 40000000

# Count lines in the merged file
zcat sample001_R1_merged.fastq.gz | wc -l
# Output: 40000000

# These numbers MUST match. If they don't, your merge failed.
```

Method 2: Stream checksum verification

```bash
# Checksum the concatenated stream from the originals
cat sample001_L00*_R1_*.fastq.gz | md5sum
# Output: a1b2c3d4e5f6...  -

# Checksum the merged file
md5sum sample001_R1_merged.fastq.gz
# Output: a1b2c3d4e5f6...  sample001_R1_merged.fastq.gz

# These checksums MUST be identical.
```

This comparison only holds for merges made with plain cat. If you recompressed with zcat | gzip, the compressed bytes differ, so compare checksums of the decompressed streams instead.

Method 3: Gzip integrity check

```bash
gzip -t sample001_R1_merged.fastq.gz && echo "OK" || echo "CORRUPTED"
```

Running FastQC at Scale

There are several ways to run FastQC efficiently. Here are four options.

Option 1: Single file (the slow way)

```bash
fastqc sample001_R1.fastq.gz
# Output: sample001_R1_fastqc.html and sample001_R1_fastqc.zip
```

Option 2: Built-in threading with the -t flag

FastQC can process multiple files in parallel using its -t option. The output directory must exist before you run it:

```bash
mkdir -p qc_results
fastqc -t 8 -o ./qc_results/ *.fastq.gz
# Started analysis of sample001_R1.fastq.gz
# Approx 5% complete for sample001_R1.fastq.gz
# ...
# Analysis complete for sample001_R1.fastq.gz
```

Option 3: GNU parallel (more control)

For large datasets, GNU parallel gives you more control over how jobs are scheduled:

```bash
# Install parallel if needed: sudo apt-get install parallel
mkdir -p qc_results
ls *.fastq.gz | parallel -j 8 fastqc -o ./qc_results/ {}
```

Option 4: Simple bash loop (beginner-friendly)

If you’re just starting out, a basic loop with progress reporting works fine:

```bash
#!/bin/bash
mkdir -p qc_results

FILES=(*.fastq.gz)
TOTAL=${#FILES[@]}
COUNT=0

for FILE in "${FILES[@]}"; do
    COUNT=$((COUNT + 1))
    echo "[$COUNT/$TOTAL] Processing $FILE..."
    fastqc -q -o qc_results/ "$FILE"
done
echo "FastQC complete!"
```

Aggregating with MultiQC

This is where the magic happens.
MultiQC scans a directory for FastQC outputs and combines them into a single interactive report.

```bash
cd qc_results/
multiqc .

# Example output:
# /// MultiQC v1.21
# | multiqc | Search path : /home/user/project/qc_results
# | searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 48/48
# | fastqc | Found 48 reports
# | multiqc | Report : multiqc_report.html
# | multiqc | Data : multiqc_data
# | multiqc | MultiQC complete
```

48 reports → 1 interactive HTML file.

Custom output with title and comments:

```bash
multiqc . \
    --filename my_project_qc \
    --title "WGS Batch Dec 2025" \
    --comment "24 samples, 30X coverage"
```

Complete Workflow Script

Note for beginners: If you’re new to bash, work through the sections above first to understand each step. This script ties everything together into a single, reproducible pipeline.

```bash
#!/bin/bash
# fastq_qc_pipeline.sh
# Complete FASTQ QC workflow: merge, verify, QC, aggregate

set -e  # Exit on error

INPUT_DIR="./raw_reads"
MERGED_DIR="./merged"
QC_DIR="./qc_results"
THREADS=8

echo "=== FASTQ QC Pipeline ==="
echo "Input: $INPUT_DIR"
echo "Threads: $THREADS"
echo ""

# Create directories
mkdir -p "$MERGED_DIR" "$QC_DIR"

# Step 1: Merge FASTQ files by sample
# (the ../ paths assume INPUT_DIR is a direct subdirectory of the working directory)
echo "[1/4] Merging FASTQ files..."
cd "$INPUT_DIR"
SAMPLES=$(ls *_R1_*.fastq.gz 2>/dev/null | cut -d'_' -f1 | sort -u)
for SAMPLE in $SAMPLES; do
    echo "  Merging $SAMPLE..."
    cat "${SAMPLE}"_*_R1_*.fastq.gz > "../$MERGED_DIR/${SAMPLE}_R1.fastq.gz"
    cat "${SAMPLE}"_*_R2_*.fastq.gz > "../$MERGED_DIR/${SAMPLE}_R2.fastq.gz"
done
cd ..

# Step 2: Verify merge integrity
echo "[2/4] Verifying file integrity..."
for FILE in "$MERGED_DIR"/*.fastq.gz; do
    if gzip -t "$FILE" 2>/dev/null; then
        echo "  OK: $(basename "$FILE")"
    else
        echo "  ERROR: $(basename "$FILE") is corrupted!"
        exit 1
    fi
done

# Step 3: Run FastQC
echo "[3/4] Running FastQC..."
fastqc -t "$THREADS" -o "$QC_DIR" "$MERGED_DIR"/*.fastq.gz

# Step 4: Generate MultiQC report
echo "[4/4] Generating MultiQC report..."
multiqc "$QC_DIR" -o "$QC_DIR" --filename final_qc_report

echo ""
echo "=== Pipeline Complete ==="
echo "Report: $QC_DIR/final_qc_report.html"
```

This gives you a solid, reproducible QC foundation. For production environments where you need real-time visibility into complex pipelines at scale, error monitoring tools designed for bioinformatics (like Tracer.cloud) can help you catch failures before they propagate through your entire analysis.

Common Mistakes

These come from real Biostars threads and community forums, so I wouldn’t be surprised if you’ve made a couple yourself. I know I have.

Mistake 1: Merging R1 with R2

```bash
# WRONG - breaks paired-end data!
cat sample_R1.fastq.gz sample_R2.fastq.gz > sample_merged.fastq.gz

# CORRECT - keep R1 and R2 separate
cat sample_*_R1_*.fastq.gz > sample_R1.fastq.gz
cat sample_*_R2_*.fastq.gz > sample_R2.fastq.gz
```

This mistake will silently destroy your paired-end analysis. Aligners expect R1 and R2 files to have matching read pairs in the same order.

Mistake 2: Not verifying after merge

As shown in the verification section above, always verify with line counts:

```bash
# Before merge
zcat original_L00*_R1_*.fastq.gz | wc -l

# After merge
zcat merged_R1.fastq.gz | wc -l

# Numbers must match!
```

Silent corruption happens more often than you think, especially with network file systems or interrupted transfers.

Mistake 3: Running FastQC before merging

You’ll create hundreds of unnecessary reports. Merge first, QC second.
Your MultiQC report will also be cleaner: per-lane variation can look like quality problems when it’s just normal technical noise.

Mistake 4: Not checking both R1 and R2

R2 quality is often worse than R1 (sequencing chemistry degrades over the run). This is normal! But you need to check both files and compare them in MultiQC. If R2 looks significantly worse than usual, you may need trimming or re-sequencing.

Mistake 5: Skipping gzip integrity checks

```bash
# WRONG - will not catch silent corruption
fastqc sample.fastq.gz

# CORRECT - always verify integrity
gzip -t sample.fastq.gz && echo "OK" || echo "CORRUPTED"
```

Corrupted gzip files will cause cryptic errors in downstream tools that are nearly impossible to diagnose.

Mistake 6: Not requesting consolidated output from your facility

From the Biostars thread: “Rather than accepting hundreds of individual files per sample, request that sequencing facilities use bcl2fastq with --no-lane-splitting or --fastq-cluster-count 0.” This prevents the problem at the source.

Conclusion

No more clicking through 300 HTML files. No more guessing which lane had the weird quality dip. One merged file per sample, one MultiQC report for everything.

To summarize the workflow:

1. Merge by sample and read direction
2. Verify integrity with line counts (zcat | wc -l) or checksums
3. Run FastQC in parallel
4. Aggregate with MultiQC
5. Document and move on

What QC nightmares have you run into? Drop them below and maybe we can save someone else the headache.

References

- [FastQC Documentation](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
- [MultiQC Documentation](https://multiqc.info/)
- [Seqera MultiQC](https://seqera.io/multiqc/)
- [Biostars: Running FastQC on multiple files](https://www.biostars.org/p/141797/)
- [Illumina: Concatenating FASTQ files](https://knowledge.illumina.com/software/cloud-software/software-cloud-software-reference_material-list/000002035)
- [nf-core pipelines](https://nf-co.re/)
