PLINK Tips & Tricks: Speeding Up Large-Scale Genotype Analyses

PLINK is a fast, widely used open-source toolset for whole-genome association studies (GWAS) and population-based genetic analyses. When working with tens or hundreds of thousands of samples and millions of variants, naively running standard PLINK commands can become slow or memory-bound. This article collects practical tips, tricks, and workflows to maximize speed and efficiency with PLINK (both 1.9 and 2.0 where relevant) while preserving data quality.
1) Choose the right PLINK version and build for your task
- PLINK 1.9 is extremely fast for many standard operations (LD pruning, basic QC, case/control association) and has highly optimized C++ codepaths.
- PLINK 2.0 introduces a new data format (.pgen/.pvar/.psam) designed for scalability, improved memory management, and new features (e.g., more genotype-encoding options and on-disk operations). Use PLINK 2.0 for very large projects and when you need its new features; use 1.9 when you require some legacy commands that may still be faster in 1.9.
- If you have multicore hardware, download the multithreaded binaries or compile with OpenMP support.
2) Use efficient input formats
- Use PLINK’s binary formats rather than text (e.g., .bed/.bim/.fam or .pgen/.pvar/.psam). Binary formats dramatically reduce I/O and memory overhead.
- Convert VCFs to PLINK binary formats only once and archive those converted files for repeated analyses rather than reconverting each run.
Example conversion commands:
# VCF to PLINK 1 binary
plink --vcf input.vcf --make-bed --out data

# VCF to PLINK 2 format
plink2 --vcf input.vcf --make-pgen --out data2
3) Limit I/O and redundant computation
- Use --keep, --remove, --extract, and --exclude to subset samples/variants early in your pipeline so downstream steps operate on minimal data.
- When testing parameters or tuning scripts, work with a small chromosome or a random subset of samples first.
- Avoid repeating conversions: centralize converted binary files and access them directly.
Example:
plink --bfile data --extract snplist.txt --make-bed --out data_subset
4) Use chunking and parallelization
- Split by chromosome for embarrassingly parallel tasks (e.g., per-chromosome association tests, per-chromosome QC).
- For variant-level tasks, divide the variant list into chunks and run multiple PLINK instances concurrently (ensure each instance uses distinct output filenames).
- On cluster systems, submit each chromosome/chunk as a separate job.
Example shell snippet to run per-chromosome jobs:
for chr in {1..22}; do
  plink2 --pfile data --chr ${chr} --glm --out assoc_chr${chr} &
done
wait
5) Use PLINK’s built-in multithreading (where available)
- PLINK 2.0 supports multithreading for several commands via the --threads flag. Start with modest thread counts (e.g., 4–8) and tune based on CPU and I/O behavior.
- Be mindful that too many threads can increase memory usage and I/O contention; monitor system load.
Example:
plink2 --pfile data --glm --threads 8 --out assoc
6) Optimize memory usage
- Use PLINK 2.0’s on-disk operations when memory is limited. The pgen format enables streaming-like access to data without loading everything into RAM.
- If using PLINK 1.9, ensure you have enough RAM for whole-dataset operations or work in chunks/chromosomes.
- Remove unused variables and intermediate files; compress outputs that you don’t need frequently.
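If you need to cap PLINK's workspace explicitly (for example on a shared node or under a batch scheduler), both PLINK 1.9 and 2.0 accept a --memory flag, in megabytes. A minimal sketch, assuming the `data_qc` fileset from the QC step below; the 16 GB value is illustrative and should be tuned to your node:

```shell
# Cap PLINK 2's main workspace at ~16 GB (--memory takes MB).
# Fileset and output names here are illustrative.
plink2 --pfile data_qc --memory 16000 --glm --out assoc_capped
```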
7) Quality control steps—do them efficiently
- Standard QC (missingness, MAF filtering, HWE) should be applied early to reduce dataset size.
- Combine multiple QC filters into a single command when possible to minimize multiple passes over the data.
Example one-pass QC:
plink --bfile data --geno 0.05 --mind 0.02 --maf 0.01 --hwe 1e-6 --make-bed --out data_qc
8) Use LD pruning/clumping smartly
- For analyses that need independent variants (e.g., PCA, PRS), use LD pruning (--indep-pairwise) or clumping (--clump) with appropriate window sizes and r2 thresholds.
- Run pruning on a representative subset (unrelated individuals) to save time.
Example:
plink --bfile data_qc --indep-pairwise 200 50 0.2 --out pruned
plink --bfile data_qc --extract pruned.prune.in --make-bed --out data_pruned
9) PCA and relatedness in large cohorts
- Compute PCs on a pruned set of variants and on a subset of unrelated individuals; then project PCs to the full sample if needed.
- For KING/relatedness estimation, use dedicated tools (KING, or PLINK 1.9's --rel-cutoff / PLINK 2.0's --king-cutoff) and run per-chromosome or in chunks if the dataset is huge.
Workflow:
- LD-prune variants.
- Identify unrelated individuals (e.g., using KING).
- Compute PCA on unrelateds.
- Project PCs to related samples.
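The four steps above can be sketched with PLINK 2 alone, using --king-cutoff for relatedness and --pca allele-wts plus --score for projection. This is a hedged sketch, not a drop-in recipe: the kinship threshold, number of PCs, and score column numbers are illustrative, and the .eigenvec.allele column positions should be checked against the output of your PLINK 2 version.

```shell
# 1) LD-prune (window/step/r2 values are a common but illustrative choice)
plink2 --pfile data_qc --indep-pairwise 1000 100 0.1 --out prune

# 2) Flag close relatives with KING-robust kinship (~0.0884 = 3rd degree)
plink2 --pfile data_qc --extract prune.prune.in \
       --king-cutoff 0.0884 --out king

# 3) PCA on the unrelated set, saving allele weights for projection
plink2 --pfile data_qc --extract prune.prune.in \
       --keep king.king.cutoff.in.id \
       --freq counts --pca allele-wts --out pca_unrel

# 4) Project all samples (related ones included) onto those PCs
plink2 --pfile data_qc --extract prune.prune.in \
       --read-freq pca_unrel.acount \
       --score pca_unrel.eigenvec.allele 2 5 header-read \
               no-mean-imputation variance-standardize \
       --score-col-nums 6-15 --out pca_all
```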
10) Association testing: use appropriate models and tools
- For simple single-variant tests, PLINK’s basic association tests are very fast. For mixed models or related samples, consider specialized tools (e.g., BOLT-LMM, SAIGE) that scale better for large sample sizes and control relatedness/population structure.
- Use PLINK for initial scans; pass filtered summary/variant lists to specialized tools when needed.
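As an illustration of that hand-off, a small shell/awk sketch that pulls suggestive variant IDs out of a PLINK-style summary file for follow-up in a mixed-model tool. The file here is a toy stand-in for real --glm output, and the 1e-5 threshold is an arbitrary example:

```shell
# Toy PLINK-style summary file (tab-separated), standing in for
# real association output; columns are CHROM, POS, ID, P.
printf '#CHROM\tPOS\tID\tP\n1\t100\trs1\t2e-7\n1\t200\trs2\t0.3\n2\t300\trs3\t4e-6\n' > scan.tsv

# Keep variant IDs with P below a suggestive threshold (1e-5 here),
# skipping the header row.
awk -F'\t' 'NR > 1 && $4 < 1e-5 { print $3 }' scan.tsv > followup_ids.txt
# followup_ids.txt now holds rs1 and rs3
```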
11) Reduce file sizes: compression and selective output
- Use compressed intermediate storage where possible (gzip for text outputs).
- Use flags to suppress verbose output; only produce the files you need (e.g., --assoc vs. full regression output).
- When using --out for many chunks, organize outputs into folders and later concatenate only necessary results.
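One common pattern for concatenating chunked results: keep the header from the first file only, then append the data rows from every chunk. A minimal self-contained sketch; the per-chromosome files below are toy stand-ins for real PLINK outputs:

```shell
set -e
mkdir -p results
# Toy stand-ins for per-chromosome association outputs
printf 'CHR\tSNP\tP\n1\trs1\t0.01\n' > results/assoc_chr1.txt
printf 'CHR\tSNP\tP\n2\trs2\t0.05\n' > results/assoc_chr2.txt

# Header from the first chunk, data rows from all chunks
head -n 1 results/assoc_chr1.txt > assoc_all.txt
for f in results/assoc_chr*.txt; do
  tail -n +2 "$f" >> assoc_all.txt
done
# assoc_all.txt: 3 lines (one header + one data row per chromosome)
```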
12) Reproducible pipelines and logging
- Script every step (Bash, Snakemake, Nextflow) so processes can be re-run or parallelized across clusters.
- Log commands and timestamps. Save the exact PLINK binary version and parameters with outputs (e.g., write a small metadata file per run).
Example metadata stanza:
echo "plink_version: $(plink2 --version)" > run_metadata.txt
echo "command: plink2 --pfile data --glm --threads 8 --out assoc" >> run_metadata.txt
13) Practical command patterns and examples
- Convert and QC:
plink2 --vcf input.vcf --make-pgen --out data
plink2 --pfile data --geno 0.05 --mind 0.02 --maf 0.01 --make-pgen --out data_qc
- Per-chromosome association (parallel):
for chr in {1..22}; do
  plink2 --pfile data_qc --chr ${chr} --glm --threads 4 --out glm_chr${chr} &
done
wait
- LD pruning and PCA:
plink2 --pfile data_qc --indep-pairwise 200 50 0.2 --out pruned
plink2 --pfile data_qc --extract pruned.prune.in --pca approx --out pca
14) Common pitfalls and how to avoid them
- Over-parallelizing on a single disk causes I/O bottlenecks—use SSDs or limit concurrent jobs.
- Forgetting to filter variants beforehand can blow memory—apply QC early.
- Using too many threads without sufficient RAM leads to OOM kills—monitor memory.
- Not checking strand/allele alignment when merging datasets—use liftover/allele-checking tools and harmonize before merging.
15) When to move beyond PLINK
- For very large-scale mixed-model GWAS with hundreds of thousands of samples, consider tools like BOLT-LMM, SAIGE, or REGENIE—these handle relatedness, case-control imbalance, and scaling more efficiently than basic PLINK regression.
- Use PLINK for QC, subset preparation, and quick exploratory analyses; use specialized tools for the final large-scale association scan.
16) Summary checklist (quick reference)
- Use binary (pgen/bed) formats.
- Filter and QC early.
- Chunk by chromosome or variant for parallel jobs.
- Use --threads (PLINK 2) but monitor memory and I/O.
- LD-prune for PCA and related downstream tasks.
- Use specialized tools (BOLT-LMM/SAIGE) for very large mixed-model GWAS.
- Script everything and store metadata.
Following these guidelines will help you squeeze performance from PLINK pipelines while keeping analyses reproducible and robust.