PLINK Tips & Tricks: Speeding Up Large-Scale Genotype Analyses

PLINK is a fast, widely used open-source toolset for whole-genome association studies (GWAS) and population-based genetic analyses. When working with tens or hundreds of thousands of samples and millions of variants, naively running standard PLINK commands can become slow or memory-bound. This article collects practical tips, tricks, and workflows to maximize speed and efficiency with PLINK (both 1.9 and 2.0 where relevant) while preserving data quality.
1) Choose the right PLINK version and build for your task
- PLINK 1.9 is extremely fast for many standard operations (LD pruning, basic QC, case/control association) and has highly optimized C++ codepaths.
- PLINK 2.0 introduces a new data format (.pgen/.pvar/.psam) designed for scalability, improved memory management, and new features (e.g., more genotype-encoding options and on-disk operations). Use PLINK 2.0 for very large projects and when you need its new features; use 1.9 when you require some legacy commands that may still be faster in 1.9.
- If you have multicore hardware, download the multithreaded binaries or compile with OpenMP support.
2) Use efficient input formats
- Use PLINK’s binary formats rather than text (e.g., .bed/.bim/.fam or .pgen/.pvar/.psam). Binary formats dramatically reduce I/O and memory overhead.
- Convert VCFs to PLINK binary formats only once and archive those converted files for repeated analyses rather than reconverting each run.
Example conversion commands:
# VCF to PLINK 1 binary
plink --vcf input.vcf --make-bed --out data

# VCF to PLINK 2 format
plink2 --vcf input.vcf --make-pgen --out data2
3) Limit I/O and redundant computation
- Use --keep, --remove, --extract, and --exclude to subset samples/variants early in your pipeline so downstream steps operate on minimal data.
- When testing parameters or tuning scripts, work with a small chromosome or a random subset of samples first.
- Avoid repeating conversions: centralize converted binary files and access them directly.
Example:
plink --bfile data --extract snplist.txt --make-bed --out data_subset
4) Use chunking and parallelization
- Split by chromosome for embarrassingly parallel tasks (e.g., per-chromosome association tests, per-chromosome QC).
- For variant-level tasks, divide the variant list into chunks and run multiple PLINK instances concurrently (ensure each instance uses distinct output filenames).
- On cluster systems, submit each chromosome/chunk as a separate job.
Example shell snippet to run per-chromosome jobs:
for chr in {1..22}; do
  plink2 --pfile data --chr ${chr} --glm --out assoc_chr${chr} &
done
wait
5) Use PLINK’s built-in multithreading (where available)
- PLINK 2.0 supports multithreading for several commands via the --threads flag. Start with modest thread counts (e.g., 4–8) and tune based on CPU and I/O behavior.
- Be mindful that too many threads can increase memory usage and I/O contention; monitor system load.
Example:
plink2 --pfile data --glm --threads 8 --out assoc
6) Optimize memory usage
- Use PLINK 2.0’s on-disk operations when memory is limited. The pgen format enables streaming-like access to data without loading everything into RAM.
- If using PLINK 1.9, ensure you have enough RAM for whole-dataset operations or work in chunks/chromosomes.
- Remove unused variables and intermediate files; compress outputs that you don’t need frequently.
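If you need to cap PLINK's workspace explicitly (for example on a shared node or under a batch scheduler), both PLINK 1.9 and 2.0 accept a --memory flag, in megabytes. A minimal sketch, assuming the `data_qc` fileset from the QC step below; the 16 GB value is illustrative and should be tuned to your node:

```shell
# Cap PLINK 2's main workspace at ~16 GB (--memory takes MB).
# Fileset and output names here are illustrative.
plink2 --pfile data_qc --memory 16000 --glm --out assoc_capped
```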
7) Quality control steps—do them efficiently
- Standard QC (missingness, MAF filtering, HWE) should be applied early to reduce dataset size.
- Combine multiple QC filters into a single command when possible to minimize multiple passes over the data.
Example one-pass QC:
plink --bfile data --geno 0.05 --mind 0.02 --maf 0.01 --hwe 1e-6 --make-bed --out data_qc
8) Use LD pruning/clumping smartly
- For analyses that need independent variants (e.g., PCA, PRS), use LD pruning (--indep-pairwise) or clumping (--clump) with appropriate window sizes and r2 thresholds.
- Run pruning on a representative subset (unrelated individuals) to save time.
Example:
plink --bfile data_qc --indep-pairwise 200 50 0.2 --out pruned
plink --bfile data_qc --extract pruned.prune.in --make-bed --out data_pruned
9) PCA and relatedness in large cohorts
- Compute PCs on a pruned set of variants and on a subset of unrelated individuals; then project PCs to the full sample if needed.
- For KING/relatedness estimation, use dedicated tools (KING, or PLINK 1.9's --rel-cutoff / PLINK 2.0's --king-cutoff) and run per-chromosome or in chunks if the dataset is huge.
Workflow:
- LD-prune variants.
- Identify unrelated individuals (e.g., using KING).
- Compute PCA on unrelateds.
- Project PCs to related samples.
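The four steps above can be sketched with PLINK 2 alone, using --king-cutoff for relatedness and --pca allele-wts plus --score for projection. This is a hedged sketch, not a drop-in recipe: the kinship threshold, number of PCs, and score column numbers are illustrative, and the .eigenvec.allele column positions should be checked against the output of your PLINK 2 version.

```shell
# 1) LD-prune (window/step/r2 values are a common but illustrative choice)
plink2 --pfile data_qc --indep-pairwise 1000 100 0.1 --out prune

# 2) Flag close relatives with KING-robust kinship (~0.0884 = 3rd degree)
plink2 --pfile data_qc --extract prune.prune.in \
       --king-cutoff 0.0884 --out king

# 3) PCA on the unrelated set, saving allele weights for projection
plink2 --pfile data_qc --extract prune.prune.in \
       --keep king.king.cutoff.in.id \
       --freq counts --pca allele-wts --out pca_unrel

# 4) Project all samples (related ones included) onto those PCs
plink2 --pfile data_qc --extract prune.prune.in \
       --read-freq pca_unrel.acount \
       --score pca_unrel.eigenvec.allele 2 5 header-read \
               no-mean-imputation variance-standardize \
       --score-col-nums 6-15 --out pca_all
```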
10) Association testing: use appropriate models and tools
- For simple single-variant tests, PLINK’s basic association tests are very fast. For mixed models or related samples, consider specialized tools (e.g., BOLT-LMM, SAIGE) that scale better for large sample sizes and control relatedness/population structure.
- Use PLINK for initial scans; pass filtered summary/variant lists to specialized tools when needed.
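As an illustration of that hand-off, a small shell/awk sketch that pulls suggestive variant IDs out of a PLINK-style summary file for follow-up in a mixed-model tool. The file here is a toy stand-in for real --glm output, and the 1e-5 threshold is an arbitrary example:

```shell
# Toy PLINK-style summary file (tab-separated), standing in for
# real association output; columns are CHROM, POS, ID, P.
printf '#CHROM\tPOS\tID\tP\n1\t100\trs1\t2e-7\n1\t200\trs2\t0.3\n2\t300\trs3\t4e-6\n' > scan.tsv

# Keep variant IDs with P below a suggestive threshold (1e-5 here),
# skipping the header row.
awk -F'\t' 'NR > 1 && $4 < 1e-5 { print $3 }' scan.tsv > followup_ids.txt
# followup_ids.txt now holds rs1 and rs3
```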
11) Reduce file sizes: compression and selective output
- Use compressed intermediate storage where possible (gzip for text outputs).
- Use flags to suppress verbose output; only produce the files you need (e.g., --assoc vs. full regression output).
- When using --out for many chunks, organize outputs into folders and later concatenate only necessary results.
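One common pattern for concatenating chunked results: keep the header from the first file only, then append the data rows from every chunk. A minimal self-contained sketch; the per-chromosome files below are toy stand-ins for real PLINK outputs:

```shell
set -e
mkdir -p results
# Toy stand-ins for per-chromosome association outputs
printf 'CHR\tSNP\tP\n1\trs1\t0.01\n' > results/assoc_chr1.txt
printf 'CHR\tSNP\tP\n2\trs2\t0.05\n' > results/assoc_chr2.txt

# Header from the first chunk, data rows from all chunks
head -n 1 results/assoc_chr1.txt > assoc_all.txt
for f in results/assoc_chr*.txt; do
  tail -n +2 "$f" >> assoc_all.txt
done
# assoc_all.txt: 3 lines (one header + one data row per chromosome)
```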
12) Reproducible pipelines and logging
- Script every step (Bash, Snakemake, Nextflow) so processes can be re-run or parallelized across clusters.
- Log commands and timestamps. Save the exact PLINK binary version and parameters with outputs (e.g., write a small metadata file per run).
Example metadata stanza:
echo "plink_version: $(plink2 --version)" > run_metadata.txt
echo "command: plink2 --pfile data --glm --threads 8 --out assoc" >> run_metadata.txt
13) Practical command patterns and examples
- Convert and QC:
plink2 --vcf input.vcf --make-pgen --out data
plink2 --pfile data --geno 0.05 --mind 0.02 --maf 0.01 --make-pgen --out data_qc
- Per-chromosome association (parallel):
for chr in {1..22}; do
  plink2 --pfile data_qc --chr ${chr} --glm --threads 4 --out glm_chr${chr} &
done
wait
- LD pruning and PCA:
plink2 --pfile data_qc --indep-pairwise 200 50 0.2 --out pruned
plink2 --pfile data_qc --extract pruned.prune.in --pca approx --out pca
14) Common pitfalls and how to avoid them
- Over-parallelizing on a single disk causes I/O bottlenecks—use SSDs or limit concurrent jobs.
- Forgetting to filter variants beforehand can blow memory—apply QC early.
- Using too many threads without sufficient RAM leads to OOM kills—monitor memory.
- Not checking strand/allele alignment when merging datasets—use liftover/allele-checking tools and harmonize before merging.
15) When to move beyond PLINK
- For very large-scale mixed-model GWAS with hundreds of thousands of samples, consider tools like BOLT-LMM, SAIGE, or REGENIE—these handle relatedness, case-control imbalance, and scaling more efficiently than basic PLINK regression.
- Use PLINK for QC, subset preparation, and quick exploratory analyses; use specialized tools for the final large-scale association scan.
16) Summary checklist (quick reference)
- Use binary (pgen/bed) formats.
- Filter and QC early.
- Chunk by chromosome or variant for parallel jobs.
- Use --threads (PLINK 2) but monitor memory and I/O.
- LD-prune for PCA and related downstream tasks.
- Use specialized tools (BOLT-LMM/SAIGE) for very large mixed-model GWAS.
- Script everything and store metadata.
Following these guidelines will help you squeeze performance from PLINK pipelines while keeping analyses reproducible and robust.