Methylation¶
Whole genome bisulfite sequencing is supported using
the bismark2 pipeline.
It can be turned on by setting analysis
to wgbs-seq
.
Example run¶
1. (skip if installed) Install bismark and bowtie2 references¶
bcbio_nextgen.py upgrade \
-u skip \
--genomes hg38 \
--aligners bowtie2 \
--aligners bismark
2. (optional) Install phage Lambda genome¶
phageL is often added to WGBS libraries as a spike-in to control the conversion ratio (phage DNA is unmethylated)
genome: https://www.ncbi.nlm.nih.gov/nuccore/J02459.1?report=fasta > lambda.fa
transcriptome: https://www.ncbi.nlm.nih.gov/nuccore/J02459.1?report=genbank (Send to -> Complete Record -> File > GFF3 > lambda.gff3)
gffread lambda.gff3 -T -o lambda.gtf
bcbio_setup_genome.py \
-f lambda.fa \
-g lambda.gtf \
-i seq bwa bowtie2 bismark \
-n phage \
-b lambda \
--buildversion J02459_1 \
-c 1
2. Create bcbio project structure and download input fastq files¶
This example run is based on data from Encode
mkdir wgbs_example
cd wgbs_example
mkdir config input final work
cd input
wget -c ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR423/008/SRR4235788/SRR4235788_1.fastq.gz
wget -c ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR423/008/SRR4235788/SRR4235788_2.fastq.gz
3. Create wgbs_example/config/bcbio.yaml¶
details:
- algorithm:
aligner: bismark
analysis: wgbs-seq
description: NA12878BS
genome_build: hg38
files:
- /path/to/wgbs_example/input/SRR4235788_1.fastq.gz
- /path/to/wgbs_example/input/SRR4235788_2.fastq.gz
resources:
trim_galore:
options: ["--clip_r1 4", "--clip_r2 4", "--three_prime_clip_r1 4", "--three_prime_clip_r2 4"]
bismark:
bismark_threads: 4
bowtie_threads: 2
upload:
dir: ../final
4. Create and launch bcbio start script wgbs_example/work/bcbio.sh (O2 slurm example)¶
#!/bin/bash
# https://slurm.schedmd.com/sbatch.html
# https://wiki.rc.hms.harvard.edu/display/O2
#SBATCH --partition=priority # Partition (queue) priority
#SBATCH --time=5-00:00 # Runtime in D-HH:MM format, 10:00:00 for hours
#SBATCH --job-name=wgbs # Job name
#SBATCH -c 8 # cores
#SBATCH --mem=50G # Memory
#SBATCH --output=project_%j.out # File to which STDOUT will be written, including job ID
#SBATCH --error=project_%j.err # File to which STDERR will be written, including job ID
#SBATCH --mail-type=NONE # Type of email notification (BEGIN, END, FAIL, ALL)
date
bcbio_nextgen.py ../config/bcbio.yaml -n 32
date
Parameters¶
Supported Kits¶
Name |
directional? |
TrimR1_5’ |
TrimR1_3’ |
TrimR2_5’ |
TrimR2_3’ |
---|---|---|---|---|---|
accelngs |
yes |
10 nt |
10 nt |
19 nt |
5 |
nebemseq |
yes |
5 nt |
5 |
11 |
5 |
truseq |
no |
8 nt |
8 |
8 |
8 |
In the algorithm
section of the yaml:
aligner
:bismark
kit
:accelngs
,nebemseq
,truseq
; setting a kit automatically applies the proper trimming options.
In the resources
section of the yaml:
trim_galore trimming options:
resources:
trim_galore:
options: ["--clip_r1 8", "--clip_r2 8", "--three_prime_clip_r1 8", "--three_prime_clip_r2 8"]
bismark
--non-directional
option (default is directional mode and you don’t have to specify it) and threading options (use with caution, see benchmarking notes below; by default bcbio is trying to calculate the optimal number of bismark threads with this function; if you alter bismark_threads = X, request 5-7X cores for bcbio; some samples could be processed much faster, but some could fail, reduce threads then). By defaut, we are using--local --maxins 1000
alignment options (after benchmarking using nebemseq kit).
resources:
bismark:
options: ["--non_directional"]
bismark_threads: 4
bowtie_threads: 2
deduplicate_bismark options:
resources:
deduplicate_bismark:
options: ["--barcode"]
The following configs for the truseq
kit are equivalent:
details:
- analysis: wgbs-seq
genome_build: hg38
algorithm:
aligner: bismark
kit: truseq
details:
- analysis: wgbs-seq
genome_build: hg38
algorithm:
aligner: bismark
resources:
trim_galore:
options: ["--clip_r1 8", "--clip_r2 8", "--three_prime_clip_r1 8", "--three_prime_clip_r2 8"]
bismark:
options: ["--non_directional"]
Output¶
Project directory¶
multiqc - QC report, including information from Bismark for all samples (alignment rates, deduplication, M-Bias)
Sample directory¶
sample-bam_report.txt - bismark alignment report
sample-deduplication_report.txt
sample-ready.bam - sorted bam (even though we are using the unsorted bam throughout the pipeline).
bismark - Bismark output
bismark/sample.html - Bismark processing report
Steps¶
= a step is repeated for every sample
bcbio.yaml
details:
- algorithm:
aligner: bismark
analysis: wgbs-seq
description: rep1
files:
- /path/to/ENCSR481JIW_rep1_R1.fastq.gz
- /path/to/ENCSR481JIW_rep1_R2.fastq.gz
genome_build: hg38
- algorithm:
aligner: bismark
analysis: wgbs-seq
description: rep2
files:
- /path/to/ENCSR481JIW_rep2_R1.fastq.gz
- /path/to/ENCSR481JIW_rep2_R2.fastq.gz
genome_build: hg38
upload:
dir: ../final
-
trim_galore \ --cores 4 \ --length 30 \ --quality 30 \ --fastqc \ --paired \ -o /path/to/work/bcbiotx \ /path/to/ENCSR481JIW_rep1_R1.fastq.gz \ /path/to/ENCSR481JIW_rep1_R2.fastq.gz
-
bismark \ --bowtie2 \ --temp_dir /path/to/work/bcbiotx/ \ --gzip \ --parallel 5 \ -o /path/to/work/bcbiotx \ --unmapped \ /path/to/genomes/Hsapiens/hg38/bismark/ \ -1 /path/to/work/trimmed/rep1/ENCSR481JIW_rep1_R1_val_1.fq.gz \ -2 /path/to/work/trimmed/rep1/ENCSR481JIW_rep1_R2_val_2.fq.gz
-
deduplicate_bismark \ --output_dir /path/to/work/dedup/rep1 \ /path/to/work/align/rep1/rep1.bam
* bismark_calling bismark_methylation_extractor
bismark_methylation_extractor \ --no_overlap \ --comprehensive \ --cytosine_report \ --genome_folder /path/to/genomes/Hsapiens/hg38/bismark/ \ --merge_non_CpG \ --multicore 1 \ --buffer_size 5G \ --bedGraph \ --gzip /path/to/work/dedup/rep1/rep1.deduplicated.bam
-
bismark2report \ --alignment_report /path/to/work/align/rep1/rep1_bismark/ENCSR481JIW_rep1_R1_val_1_bismark_bt2_PE_report.txt \ -o /path/to/work/cpg/rep1/bcbiotx/tmpypivttaa/rep1.html \ --mbias_report /path/to/work/cpg/rep1/rep1.deduplicated.M-bias.txt
* generate QC metrics; samtools sort
samtools sort -@ 16 -m 3276M -O BAM \ -T /path/to/work/bcbiotx/tmpj82rrnlc/rep1.sorted-sort \ -o /path/to/work/bcbiotx/tmpj82rrnlc/rep1.sorted.bam \ /path/to/work/align/rep1/rep1.bam
* samtools index
samtools \ index -@ 16 \ /path/to/work/align/rep1/rep1.sorted.bam \ /path/to/work/bcbiotx/tmpr02on2ol/rep1.sorted.bam.bai
* samtools stats
samtools stats -@ 16 \ /path/to/work/align/rep1/rep1.sorted.bam > \ /path/to/work/bcbiotx/tmp5e6gerdb/rep1.txt
* downsample for fastqc
samtools view -O BAM -@ 16 \ -o /path/to/work/bcbiotx/tmp3z8btzek/rep1.sorted-downsample.bam \ -s 42.735 \ /path/to/work/align/rep1/rep1.sorted.bam
* fastqc
fastqc \ -d /path/to/work/qc/rep1/bcbiotx \ -t 16 \ --extract \ -o /path/to/work/qc/rep1/bcbiotx \ -f bam /path/to/work/qc/rep1/rep1.sorted-downsample.bam
* samtools idxstats
samtools idxstats /path/to/work/align/rep1/rep1.sorted.bam > /path/to/work/bcbiotx/rep1-idxstats.txt
-
multiqc -c /path/to/work/qc/multiqc/multiqc_config.yaml \ -f -l \ /path/to/work/qc/multiqc/list_files.txt \ -o /path/to/work/bcbiotx/
Benchmarking¶
There is an extensive discussion on Bismark and trim_galore performance, Bismark github.
We ran a test with NA12878 nebemseq data, 125 mln reads (72.5mln read pairs).
We tested performance of bismark/bcbio using --parallel
(bismark workers) and -p
(bowtie threads) bismark settings.
We measured the performance only of the alignment step using bcbio-nextgen-commands log timecodes. 16/2/100G RAM was an optimal parameters set, with other having 5X-10X longer runtimes. When running a cohort of samples ~50% passed with 16/2/100G, some processed broken bam files, re-running with 8/2/100G or 4/2/100G solved the issue, see more info here. For Lambda Phage genome we re-used trimming step results and 4/2/30G settings. The allocated RAM depends on the sequencing depth. For ~40GB reads input (r1.fq.gz + r2.fq.gz) we used 50G RAM/4/2 bismark/8threads total. When running out of RAM, bismark could fail without giving a warning, and pipeline continues, which causes lower % mapping efficiency. When it happens, you’ll see an error in the end of bcbio stderr from slurm. Use less cores and more RAM than, say 4/2/50G instead of 16/2/50G.
bcbio.yaml:
details:
- algorithm:
aligner: bismark
kit: nebemseq
analysis: wgbs-seq
description: NA12878_1
files:
- /path/to/NA12878_1.fq.gz
- /path/to/NA12878_2.fq.gz
genome_build: hg38
metadata:
batch: NA12878_1-batch
phenotype: tumor
fc_name: bcbio
resources:
bismark:
bismark_threads: 4
bowtie_threads: 2
upload:
dir: ../final
Tests:
test_N |
bismark_threads |
bowtie_threads |
time |
bcbio -n |
RAM |
---|---|---|---|---|---|
01 |
1 |
2 |
14h 42min |
16 |
50G |
02 |
1 |
4 |
>3.5 days |
16 |
50G |
03 |
1 |
8 |
1d 21h |
16 |
50G |
04 |
1 |
16 |
1d 15h |
32 |
50G |
05 |
1 |
32 |
>3day 14h |
64 |
100G |
06 |
2 |
2 |
2days 4h |
16 |
50G |
07 |
4 |
2 |
13h |
16 |
50G |
08 |
8 |
2 |
3h 34 min |
16 |
50G |
09 |
16 |
2 |
2h 42 min |
32 |
100G |
10 |
32 |
2 |
7h 40 min |
64 |
250G |
11 |
2 |
4 |
1d 23h |
16 |
50G |
12 |
2 |
8 |
11h 30 min |
16 |
50G |
13 |
2 |
16 |
11h 45 min |
32 |
50G |
14 |
2 |
32 |
>3days 14h |
64 |
100G |
15 |
4 |
2 |
4h 4 min |
16 |
50G |
16 |
4 |
4 |
15 h |
16 |
50G |
17 |
4 |
8 |
6 h |
32 |
50G |
18 |
4 |
16 |
15h |
64 |
100G |
19 |
4 |
32 |
1d 2h |
128 |
200G |
20 |
24 |
2 |
8h |
48 |
192G |
ipython parallelization is not implemented for wgbs-seq
, use 1 node / multicore jobs.