Methylation¶

Whole genome bisulfite sequencing is supported using the bismark2 pipeline. It can be turned on by setting analysis to wgbs-seq.

Example run¶

1. (skip if installed) Install bismark and bowtie2 references¶

bcbio_nextgen.py upgrade \
-u skip \
--genomes hg38 \
--aligners bowtie2 \
--aligners bismark

2. (optional) Install phage Lambda genome¶

phageL is often added to WGBS libraries as a spike-in to control the conversion ratio (phage DNA is unmethylated)

genome: https://www.ncbi.nlm.nih.gov/nuccore/J02459.1?report=fasta > lambda.fa
transcriptome: https://www.ncbi.nlm.nih.gov/nuccore/J02459.1?report=genbank (Send to -> Complete Record -> File > GFF3 > lambda.gff3)

gffread lambda.gff3 -T -o lambda.gtf

bcbio_setup_genome.py \
-f lambda.fa \
-g lambda.gtf \
-i seq bwa bowtie2 bismark \
-n phage \
-b lambda \
--buildversion J02459_1 \
-c 1

2. Create bcbio project structure and download input fastq files¶

This example run is based on data from Encode

mkdir wgbs_example
cd wgbs_example
mkdir config input final work
cd input
wget -c ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR423/008/SRR4235788/SRR4235788_1.fastq.gz
wget -c ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR423/008/SRR4235788/SRR4235788_2.fastq.gz

3. Create wgbs_example/config/bcbio.yaml¶

details:
- algorithm:
    aligner: bismark
  analysis: wgbs-seq
  description: NA12878BS
  genome_build: hg38
  files:
  - /path/to/wgbs_example/input/SRR4235788_1.fastq.gz
  - /path/to/wgbs_example/input/SRR4235788_2.fastq.gz
resources:
  trim_galore:
    options: ["--clip_r1 4", "--clip_r2 4", "--three_prime_clip_r1 4", "--three_prime_clip_r2 4"]
  bismark:
    bismark_threads: 4
    bowtie_threads: 2
upload:
  dir: ../final

4. Create and launch bcbio start script wgbs_example/work/bcbio.sh (O2 slurm example)¶

#!/bin/bash

# https://slurm.schedmd.com/sbatch.html
# https://wiki.rc.hms.harvard.edu/display/O2

#SBATCH --partition=priority        # Partition (queue) priority
#SBATCH --time=5-00:00              # Runtime in D-HH:MM format, 10:00:00 for hours
#SBATCH --job-name=wgbs             # Job name
#SBATCH -c 8		              	# cores
#SBATCH --mem=50G                  # Memory
#SBATCH --output=project_%j.out     # File to which STDOUT will be written, including job ID
#SBATCH --error=project_%j.err      # File to which STDERR will be written, including job ID
#SBATCH --mail-type=NONE            # Type of email notification (BEGIN, END, FAIL, ALL)

date
bcbio_nextgen.py ../config/bcbio.yaml -n 32
date

Parameters¶

Supported Kits¶

Name	directional?	TrimR1_5’	TrimR1_3’	TrimR2_5’	TrimR2_3’
accelngs	yes	10 nt	10 nt	19 nt	5
nebemseq	yes	5 nt	5	11	5
truseq	no	8 nt	8	8	8

In the algorithm section of the yaml:

aligner: bismark
kit: accelngs, nebemseq, truseq; setting a kit automatically applies the proper trimming options.

In the resources section of the yaml:

trim_galore trimming options:

resources:
  trim_galore:
    options: ["--clip_r1 8", "--clip_r2 8", "--three_prime_clip_r1 8", "--three_prime_clip_r2 8"]

bismark --non-directional option (default is directional mode and you don’t have to specify it) and threading options (use with caution, see benchmarking notes below; by default bcbio is trying to calculate the optimal number of bismark threads with this function; if you alter bismark_threads = X, request 5-7X cores for bcbio; some samples could be processed much faster, but some could fail, reduce threads then). By defaut, we are using --local --maxins 1000 alignment options (after benchmarking using nebemseq kit).

resources:
  bismark:
    options: ["--non_directional"]
    bismark_threads: 4
    bowtie_threads: 2

deduplicate_bismark options:

resources:
  deduplicate_bismark:
    options: ["--barcode"]

The following configs for the truseq kit are equivalent:

details:
- analysis: wgbs-seq
  genome_build: hg38
  algorithm:
    aligner: bismark
    kit: truseq

details:
- analysis: wgbs-seq
  genome_build: hg38
  algorithm:
    aligner: bismark
resources:
  trim_galore:
    options: ["--clip_r1 8", "--clip_r2 8", "--three_prime_clip_r1 8", "--three_prime_clip_r2 8"]
  bismark:
    options: ["--non_directional"]

Output¶

Project directory¶

multiqc - QC report, including information from Bismark for all samples (alignment rates, deduplication, M-Bias)

Sample directory¶

sample-bam_report.txt - bismark alignment report
sample-deduplication_report.txt
sample-ready.bam - sorted bam (even though we are using the unsorted bam throughout the pipeline).
bismark - Bismark output
bismark/sample.html - Bismark processing report

Steps¶

= a step is repeated for every sample

bcbio.yaml

details:
- algorithm:
    aligner: bismark
  analysis: wgbs-seq
  description: rep1
  files:
  - /path/to/ENCSR481JIW_rep1_R1.fastq.gz
  - /path/to/ENCSR481JIW_rep1_R2.fastq.gz
  genome_build: hg38
- algorithm:
    aligner: bismark
  analysis: wgbs-seq
  description: rep2
  files:
  - /path/to/ENCSR481JIW_rep2_R1.fastq.gz
  - /path/to/ENCSR481JIW_rep2_R2.fastq.gz
  genome_build: hg38
upload:
  dir: ../final

wgbsseqpipeline function

* Read trimmming

trim_galore  \
--cores 4 \
--length 30 \
--quality 30 \
--fastqc \
--paired \
-o /path/to/work/bcbiotx \
/path/to/ENCSR481JIW_rep1_R1.fastq.gz \ 
/path/to/ENCSR481JIW_rep1_R2.fastq.gz

* Bismark align

bismark \
--bowtie2 \
--temp_dir /path/to/work/bcbiotx/ \
--gzip \
--parallel 5 \
-o /path/to/work/bcbiotx \
--unmapped \
/path/to/genomes/Hsapiens/hg38/bismark/ \
-1 /path/to/work/trimmed/rep1/ENCSR481JIW_rep1_R1_val_1.fq.gz \
-2 /path/to/work/trimmed/rep1/ENCSR481JIW_rep1_R2_val_2.fq.gz 

* deduplicate bismark

deduplicate_bismark \
--output_dir /path/to/work/dedup/rep1 \
/path/to/work/align/rep1/rep1.bam

* bismark_calling bismark_methylation_extractor

bismark_methylation_extractor \
--no_overlap \
--comprehensive \
--cytosine_report \
--genome_folder /path/to/genomes/Hsapiens/hg38/bismark/ \
--merge_non_CpG \
--multicore 1 \
--buffer_size 5G \
--bedGraph \
--gzip /path/to/work/dedup/rep1/rep1.deduplicated.bam

* bismark2report

bismark2report \
--alignment_report /path/to/work/align/rep1/rep1_bismark/ENCSR481JIW_rep1_R1_val_1_bismark_bt2_PE_report.txt \
-o /path/to/work/cpg/rep1/bcbiotx/tmpypivttaa/rep1.html \
--mbias_report /path/to/work/cpg/rep1/rep1.deduplicated.M-bias.txt

* generate QC metrics; samtools sort

samtools sort -@ 16 -m 3276M -O BAM  \
-T /path/to/work/bcbiotx/tmpj82rrnlc/rep1.sorted-sort \
-o /path/to/work/bcbiotx/tmpj82rrnlc/rep1.sorted.bam \
/path/to/work/align/rep1/rep1.bam

* samtools index

samtools \
index -@ 16 \
/path/to/work/align/rep1/rep1.sorted.bam \
/path/to/work/bcbiotx/tmpr02on2ol/rep1.sorted.bam.bai

* samtools stats

samtools stats -@ 16 \
/path/to/work/align/rep1/rep1.sorted.bam > \
/path/to/work/bcbiotx/tmp5e6gerdb/rep1.txt

* downsample for fastqc

samtools view -O BAM -@ 16 \
-o /path/to/work/bcbiotx/tmp3z8btzek/rep1.sorted-downsample.bam \
-s 42.735 \
/path/to/work/align/rep1/rep1.sorted.bam

* fastqc

fastqc \
-d /path/to/work/qc/rep1/bcbiotx \
-t 16 \
--extract \
-o /path/to/work/qc/rep1/bcbiotx \
-f bam /path/to/work/qc/rep1/rep1.sorted-downsample.bam

* samtools idxstats

samtools idxstats /path/to/work/align/rep1/rep1.sorted.bam > /path/to/work/bcbiotx/rep1-idxstats.txt

* multiqc summary

multiqc -c /path/to/work/qc/multiqc/multiqc_config.yaml \
-f -l \
/path/to/work/qc/multiqc/list_files.txt \
-o /path/to/work/bcbiotx/

upload sample files to final; also see here
upload project file to final; also see here

Benchmarking¶

There is an extensive discussion on Bismark and trim_galore performance, Bismark github. We ran a test with NA12878 nebemseq data, 125 mln reads (72.5mln read pairs). We tested performance of bismark/bcbio using --parallel (bismark workers) and -p (bowtie threads) bismark settings. We measured the performance only of the alignment step using bcbio-nextgen-commands log timecodes. 16/2/100G RAM was an optimal parameters set, with other having 5X-10X longer runtimes. When running a cohort of samples ~50% passed with 16/2/100G, some processed broken bam files, re-running with 8/2/100G or 4/2/100G solved the issue, see more info here. For Lambda Phage genome we re-used trimming step results and 4/2/30G settings. The allocated RAM depends on the sequencing depth. For ~40GB reads input (r1.fq.gz + r2.fq.gz) we used 50G RAM/4/2 bismark/8threads total. When running out of RAM, bismark could fail without giving a warning, and pipeline continues, which causes lower % mapping efficiency. When it happens, you’ll see an error in the end of bcbio stderr from slurm. Use less cores and more RAM than, say 4/2/50G instead of 16/2/50G.

bcbio.yaml:

details:
- algorithm:
    aligner: bismark
    kit: nebemseq
  analysis: wgbs-seq
  description: NA12878_1
  files:
  - /path/to/NA12878_1.fq.gz
  - /path/to/NA12878_2.fq.gz
  genome_build: hg38
  metadata:
    batch: NA12878_1-batch
    phenotype: tumor
fc_name: bcbio
resources:
  bismark:
    bismark_threads: 4
    bowtie_threads: 2
upload:
  dir: ../final

Tests:

test_N	bismark_threads	bowtie_threads	time	bcbio -n	RAM
01	1	2	14h 42min	16	50G
02	1	4	>3.5 days	16	50G
03	1	8	1d 21h	16	50G
04	1	16	1d 15h	32	50G
05	1	32	>3day 14h	64	100G
06	2	2	2days 4h	16	50G
07	4	2	13h	16	50G
08	8	2	3h 34 min	16	50G
09	16	2	2h 42 min	32	100G
10	32	2	7h 40 min	64	250G
11	2	4	1d 23h	16	50G
12	2	8	11h 30 min	16	50G
13	2	16	11h 45 min	32	50G
14	2	32	>3days 14h	64	100G
15	4	2	4h 4 min	16	50G
16	4	4	15 h	16	50G
17	4	8	6 h	32	50G
18	4	16	15h	64	100G
19	4	32	1d 2h	128	200G
20	24	2	8h	48	192G

ipython parallelization is not implemented for wgbs-seq, use 1 node / multicore jobs.