Pipeline Detail
BSA-seq 突变位点定位分析
面向 Bulked Segregant Analysis 的变异检测和候选区间定位流程,覆盖数据下载、质控、比对、GATK SNP calling、过滤、SNP-index 与 ED5 下游分析。
Metadata
流程元数据
先看应用场景、输入输出和工具依赖,再进入正文命令细节。
Difficulty
入门
Scenario
遗传定位
Estimated Time
0.5-1 天
Inputs
Workflow DAG
流程图
用步骤节点快速理解这个分析从原始数据到结果报告的流转关系。
测序数据下载
FastQC/Trimmomatic 质控
BWA-MEM 比对 TAIR10
GATK MarkDuplicates
HaplotypeCaller GVCF
CombineGVCFs/GenotypeGVCFs
SNP/INDEL 硬过滤
SNP-index/ED5 下游定位
Protocol
流程文档
正文保留 Markdown 排版、代码语言标识和表格样式,适合边学边复现。
BSA-seq 突变位点定位分析
适用场景
适用于 Bulked Segregant Analysis(BSA)或 QTL-seq 类型项目:将极端表型个体混池测序,与亲本或对照样本比较等位基因频率差异,从而定位候选突变区间。
本流程来自拟南芥 BSA 分析记录,参考基因组使用 TAIR10_chr_all.fas,核心样本包括 1-37、27-3-1、27N、Neg、top1α-10。流程前半段使用 BWA/GATK 产出高质量 VCF,后半段在 R 中计算 SNP-index、ΔSNP-index 和 ED5,并绘制候选区间图。
样本与分组
| Sample | Role | Notes |
|---|---|---|
1-37 | bulk / segregant | 用于 37 组合分析 |
27-3-1 | bulk / segregant | 用于 27 组合分析 |
27N | normal/control | 27 组合对照 |
Neg | wild type/control | 37 组合对照 |
top1α-10 | parent/reference mutant | 亲本或参考突变材料 |
1. 下载原始数据
./lnd login -u X101SC250910114-Z01-J003 -p ggeah41n
./lnd cp -d oss://CP2023071800097/H101SC250910114/RSPD00204/X101SC250910114-Z01/X101SC250910114-Z01-J003/01.RawData ../../../01_raw_data/
2. FastQC 质控与 Trimmomatic 裁剪
module load FastQC/0.11.9
bsub -J fastqc1-37 -n 1 -R "span[hosts=1]" -o %J.out -e %J.err -q normal "fastqc 1-37_*"
bsub -J fastqc27-3 -n 1 -R "span[hosts=1]" -o %J.out -e %J.err -q normal "fastqc 27-3*"
bsub -J fastqc27N -n 1 -R "span[hosts=1]" -o %J.out -e %J.err -q normal "fastqc 27N*"
bsub -J fastqcNeg -n 1 -R "span[hosts=1]" -o %J.out -e %J.err -q normal "fastqc Neg*"
bsub -J fastqctop1 -n 1 -R "span[hosts=1]" -o %J.out -e %J.err -q normal "fastqc top1α*"
for sample in 1-37 27-3-1 27N Neg top1α-10
do
trimmomatic.sh PE -threads 10 -phred33 01_raw_data/${sample}_1.fq.gz 01_raw_data/${sample}_2.fq.gz 02_clean_data/${sample}_R1_paired.fq.gz 02_clean_data/${sample}_R1_unpaired.fq.gz 02_clean_data/${sample}_R2_paired.fq.gz 02_clean_data/${sample}_R2_unpaired.fq.gz HEADCROP:2
done
3. BWA-MEM 比对到 TAIR10
REF="./ref_genomic/TAIR10_chr_all.fas"
for sample in 1-37 27-3-1 27N Neg top1α-10
do
bwa mem -M -t 8 -R "@RG ID:${sample} SM:${sample}" ${REF} 02_clean_data/${sample}_R1_paired.fq.gz 02_clean_data/${sample}_R2_paired.fq.gz | samtools sort -@ 8 -o 03_alignment/${sample}.sorted.bam -
samtools flagstat 03_alignment/${sample}.sorted.bam > 03_alignment/${sample}_flagstat.log
samtools index 03_alignment/${sample}.sorted.bam
done
4. GATK MarkDuplicates 去重
for sample in 1-37 27-3-1 27N Neg top1α-10
do
gatk MarkDuplicates -I 03_alignment/${sample}.sorted.bam -M 03_alignment/${sample}.markup_metrics.txt -O 03_alignment/${sample}.sorted.markup.bam
samtools index 03_alignment/${sample}.sorted.markup.bam
done
5. HaplotypeCaller 生成 GVCF
REF="./ref_genomic/TAIR10_chr_all.fas"
for sample in 1-37 27-3-1 27N Neg top1α-10
do
gatk --java-options "-Xmx10G" HaplotypeCaller -R ${REF} -I 03_alignment/${sample}.sorted.markup.bam -O 04_snp_calling/${sample}.g.vcf.gz -ERC GVCF -stand-call-conf 10
done
6. 合并 GVCF 与联合分型
REF="./ref_genomic/TAIR10_chr_all.fas"
gatk CombineGVCFs -R ${REF} -V 04_snp_calling/1-37.g.vcf.gz -V 04_snp_calling/Neg.g.vcf.gz -V 04_snp_calling/top1α-10.g.vcf.gz -O 04_snp_calling/BSA-37.gvcf.gz
gatk CombineGVCFs -R ${REF} -V 04_snp_calling/27-3-1.g.vcf.gz -V 04_snp_calling/27N.g.vcf.gz -V 04_snp_calling/top1α-10.g.vcf.gz -O 04_snp_calling/BSA-27.gvcf.gz
gatk GenotypeGVCFs -R ${REF} -V 04_snp_calling/BSA-27.gvcf.gz -O 05_snp/BSA_no_filter-27.HC.vcf
7. SNP/INDEL 硬过滤
gatk SelectVariants -R ./ref_genomic/TAIR10_chr_all.fas -V 05_snp/BSA_no_filter-27.HC.vcf -select-type SNP -O 05_snp/SNPs-27.vcf
gatk VariantFiltration -V 05_snp/SNPs-27.vcf --filter-expression "MQ < 30.0" --filter-name "MQ_filter_SNP" --filter-expression "QD < 2.0" --filter-name "QD_filter_SNP" -O 05_snp/SNPs_filter-27.vcf
grep -E '^#|PASS' 05_snp/SNPs_filter-27.vcf > 05_snp/BSA_SNPs_filterPASSED-27.vcf
INDEL 可继续使用 SelectVariants -select-type INDEL 后按 MQ、SOR、QD、FS 进行硬过滤,并与 SNP 结果合并为最终 VCF。
8. R 下游:SNP-index 与 ED5 定位
library(QTLseqr)
library(vcfR)
library(ggplot2)
library(dplyr)
vcf <- read.vcfR("BSA_SNPs_filterPASSED-27.vcf")
chrom <- getCHROM(vcf)
pos <- getPOS(vcf)
ref <- getREF(vcf)
alt <- getALT(vcf)
ad <- extract.gt(vcf, "AD")
gt <- extract.gt(vcf, "GT")
ref_split <- masplit(ad, record = 1, sort = 0)
alt_split <- masplit(ad, record = 2, sort = 0)
df <- data.frame(
CHROM = chrom,
POS = pos,
REF = ref,
ALT = alt,
AD_REF.1_37 = ref_split[, 1],
AD_ALT.1_37 = alt_split[, 1],
AD_REF.Neg = ref_split[, 2],
AD_ALT.Neg = alt_split[, 2],
AD_REF.top1 = ref_split[, 3],
AD_ALT.top1 = alt_split[, 3],
P_Top1 = gt[, "top1α-10"],
son_1_37 = gt[, "1-37"],
son_Neg = gt[, "Neg"]
)
df <- df %>%
filter(
nchar(as.character(REF)) == 1,
nchar(as.character(ALT)) == 1,
!is.na(P_Top1),
(AD_REF.1_37 + AD_ALT.1_37) >= 10,
(AD_REF.Neg + AD_ALT.Neg) >= 10,
(AD_REF.top1 + AD_ALT.top1) >= 10
)
df_result <- df %>%
mutate(
SNPindex.L = AD_ALT.1_37 / (AD_REF.1_37 + AD_ALT.1_37),
SNPindex.S = AD_ALT.Neg / (AD_REF.Neg + AD_ALT.Neg),
deltaSNP = SNPindex.L - SNPindex.S
)
delta_extreme_points <- df_result %>%
filter(deltaSNP > 0.5 | deltaSNP < -0.5)
ggplot(df_result, aes(x = POS / 1e6, y = deltaSNP)) +
geom_point(color = "#000000", size = 0.5, alpha = 0.6) +
geom_point(
data = delta_extreme_points,
aes(color = ifelse(deltaSNP > 0, "Positive", "Negative")),
size = 1.2,
alpha = 0.8
) +
geom_hline(yintercept = c(-0.5, 0.5), linetype = "dashed") +
facet_wrap(~ CHROM, scales = "free_x", nrow = 3) +
labs(x = "Genomic Position (Mb)", y = "Delta SNP-index")
valid_rows <- complete.cases(df) &
df$AD_REF.1_37 + df$AD_ALT.1_37 > 0 &
df$AD_REF.Neg + df$AD_ALT.Neg > 0
df_filtered <- df[valid_rows, ]
ED_values <- apply(df_filtered, 1, function(row) {
ref_mutant <- as.numeric(row["AD_REF.1_37"])
alt_mutant <- as.numeric(row["AD_ALT.1_37"])
ref_wild <- as.numeric(row["AD_REF.Neg"])
alt_wild <- as.numeric(row["AD_ALT.Neg"])
depth_mutant <- ref_mutant + alt_mutant
depth_wild <- ref_wild + alt_wild
freq_ref_mutant <- ref_mutant / depth_mutant
freq_alt_mutant <- alt_mutant / depth_mutant
freq_ref_wild <- ref_wild / depth_wild
freq_alt_wild <- alt_wild / depth_wild
freq_diff <- sqrt(
(freq_ref_wild - freq_ref_mutant)^2 +
(freq_alt_wild - freq_alt_mutant)^2
)
freq_diff^5
})
result_df <- data.frame(
CHROM = df_filtered$CHROM,
POS = df_filtered$POS,
ED = ED_values
)
关键质控与解释
- FASTQ 阶段关注碱基质量、接头残留和头部异常碱基,当前流程使用
HEADCROP:2。 - 比对阶段保留 read group,方便 GATK 正确识别样本。
- GVCF 阶段采用每个样本单独 HaplotypeCaller,再按组合合并和联合分型。
- SNP 过滤至少包含
MQ >= 30与QD >= 2;INDEL 过滤可加入SOR和FS。 - 下游定位同时看
Delta SNP-index和ED5,比只看单一指标更稳。
主要输出
- FastQC 质控报告
- Trimmomatic clean reads
- 排序和去重后的 BAM
- 样本级 GVCF
- 合并分型 VCF
- 过滤后的 SNP/INDEL VCF
- SNP-index / Delta SNP-index 图
- ED5 全基因组定位图