### GWAS with GAPIT3

# Introduction

Genome-Wide Association Studies (GWAS) test thousands to millions of genetic variants across many genomes to find those statistically associated with a specific trait or disease. GWAS results have a range of applications, such as gaining insight into a phenotype’s underlying biology, estimating its heritability, and calculating genetic correlations.

Please cite GAPIT3 as:

Wang, J., & Zhang, Z. (2021). GAPIT Version 3: boosting power and accuracy for genomic association and prediction. *Genomics, proteomics & bioinformatics*, *19*(4), 629-640.

# Run GAPIT3 for GWAS

The Genome-Wide Association Studies Tool can be found under **Genetic Variation → Genome Wide Association Study**. The wizard consists of 4 pages and allows you to define the input and output options as well as the analysis parameters (Figure 1, Figure 2, and Figure 3).

## Input

The tool requires two input files:

- **VCF File:** this file must contain the SNPs to be studied. It originates from the Variant Calling step and may be filtered, although this is not necessary.
- **Phenotype Data:** tab-delimited file with the same sample names as in the VCF file in the first column and the traits in the remaining columns. A header is required.

Note that this tool can only associate variants with quantitative traits, not with qualitative ones.
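The input requirements above can be checked before launching the wizard. The following is a minimal sketch, not part of the tool: the function names are hypothetical, and treating empty cells and `NA` as missing values is an assumption. It reads the sample names from the `#CHROM` header line of the VCF and verifies that the phenotype table uses the same names and contains only numeric trait values.

```python
import csv
import io


def vcf_sample_names(vcf_text):
    """Return the sample names from the #CHROM header line of a VCF."""
    for line in vcf_text.splitlines():
        if line.startswith("#CHROM"):
            # The first 9 columns are fixed VCF fields; sample names follow.
            return line.split("\t")[9:]
    raise ValueError("no #CHROM header line found")


def check_phenotype_table(pheno_text, vcf_samples):
    """Check a tab-delimited phenotype table against the VCF sample names.

    Returns (traits, shared_samples, problems).
    """
    rows = list(csv.reader(io.StringIO(pheno_text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    traits = header[1:]
    pheno_samples = [row[0] for row in body]
    problems = []

    for sample in vcf_samples:
        if sample not in pheno_samples:
            problems.append(f"{sample}: in VCF but not in phenotype table")

    for row in body:
        for trait, value in zip(traits, row[1:]):
            if value in ("", "NA"):  # assumption: these mark missing values
                continue
            try:
                float(value)
            except ValueError:
                # Qualitative values cannot be used by the tool.
                problems.append(f"{row[0]}/{trait}: non-numeric value {value!r}")

    vcf_set = set(vcf_samples)
    shared = [s for s in pheno_samples if s in vcf_set]
    return traits, shared, problems
```

Any entry in `problems` points to a sample or value that would need fixing in the phenotype file before the analysis.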

## Configuration 1

On this page, you can set parameters to filter your VCF file by population genetics criteria (for instance, Hardy-Weinberg Equilibrium p-value or Minor Allele Frequency). In addition, you can choose whether to normalize your phenotype data, as it is recommended that the measurements follow a normal distribution.

**Population Genetics Filter Parameters:**

- **Hardy-Weinberg Equilibrium P-value**: assesses sites for Hardy-Weinberg Equilibrium (HWE) using an exact test, as defined by Wigginton, Cutler and Abecasis (2005). Sites with a p-value below the threshold defined by this option are considered to be out of HWE and are therefore excluded.
- **Minor Allele Frequency (MAF) Threshold**: the minor allele frequency (MAF) is the frequency at which the second most common allele occurs in a given population. Only sites with a MAF greater than or equal to this value are included.
- **Missingness Threshold**: exclude **sites** on the basis of the proportion of missing data. If a variant is missing in a higher percentage of samples than the threshold set here, it is excluded.
- **Sample Missingness Threshold**: exclude **samples** on the basis of the proportion of missing data. If a sample does not have the minimum percentage of variants set here, it is excluded.
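To make the MAF and missingness filters concrete, here is a minimal sketch under stated assumptions: sites are biallelic, genotypes are coded as the number of alternate alleles (0, 1 or 2) with `None` for a missing call, and all function names are hypothetical. It is not the tool's implementation, and the Hardy-Weinberg exact test is not covered.

```python
def site_maf(genotypes):
    """Minor allele frequency of one biallelic site.

    genotypes: alt-allele counts (0, 1, 2) per sample, None if missing.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    # The minor allele is whichever of the two alleles is rarer.
    return min(alt_freq, 1 - alt_freq)


def missing_fraction(calls):
    """Fraction of missing calls in a row (site) or column (sample)."""
    return sum(c is None for c in calls) / len(calls)


def filter_sites(matrix, min_maf, max_site_missing):
    """Return indices of sites (rows) that pass both site filters."""
    return [
        i
        for i, row in enumerate(matrix)
        if missing_fraction(row) <= max_site_missing and site_maf(row) >= min_maf
    ]


def filter_samples(matrix, min_call_rate):
    """Return indices of samples (columns) with enough called variants."""
    return [
        j
        for j, column in enumerate(zip(*matrix))
        if 1 - missing_fraction(column) >= min_call_rate
    ]
```

For example, with rows as sites and columns as samples, `filter_sites(matrix, 0.05, 0.3)` keeps sites with MAF of at least 0.05 and at most 30% missing calls.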

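**Phenotype Data Preprocessing:**

- **Remove Phenotype Outliers**: remove outliers in the phenotype data (e.g. values more than 1.5 times the interquartile range above the upper quartile or below the lower quartile).
- **Normalize Phenotype Data**: normalize the phenotype data, since it is recommended that the measurements follow a normal distribution.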

It is not recommended to remove outliers and normalize phenotype data at the same time.
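As an illustration of the outlier rule mentioned above, the sketch below applies the 1.5 × IQR criterion to one trait. It is an assumption-laden example rather than the tool's code: the function name is hypothetical, quartiles are computed with Python's `statistics.quantiles` using the inclusive method (other quartile definitions give slightly different fences), and outliers are replaced by `None` so that values stay aligned with their samples.

```python
import statistics


def remove_outliers(values, k=1.5):
    """Replace values outside the k * IQR fences with None.

    values: numeric phenotype values, None if missing.
    """
    observed = [v for v in values if v is not None]
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(observed, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [
        v if v is not None and lower <= v <= upper else None
        for v in values
    ]
```

With the values `[10, 12, 11, 13, 12, 11, 50]`, the quartiles are 11 and 12.5, so the fences are 8.75 and 14.75 and only the value 50 is removed.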

## Configuration 2

On this page, you can set the parameters used to conduct the GWAS:

- **Use Kinship Matrix:** check this parameter if you want to use your own kinship matrix. Otherwise, it will be calculated inside OmicsBox before running the GWAS. A kinship matrix is an all-vs-all comparison among samples used to measure the degree of relatedness between individuals. It is used in GWAS to control for the effects of ancestry or family relatedness on the trait studied.
- **Kinship Matrix:** file with the kinship matrix. The file must be formatted as an n by n+1 matrix where the first column contains the sample names and the rest is a square symmetric matrix. __The first row of the kinship matrix file does not consist of headers__.
- **Kinship Group:** set which measure you want to use to group your samples in order to build the kinship matrix.
- **Kinship Algorithm:** set the algorithm used to build the kinship matrix.
- **Kinship Cluster:** set how to cluster your samples to perform the kinship analysis.
- **Use Covariate Matrix:** check this parameter if you want to use your own covariate matrix. Otherwise, it will be calculated inside OmicsBox before running the GWAS. Covariates are variables that are thought to be related to the data being analyzed but are not of primary interest. In GWAS, a covariate matrix is a table of variables used to adjust for the effects of other variables on the data being analyzed. It can be obtained using Principal Component Analysis (PCA).
- **Covariate Matrix:** file with the covariate matrix. This file must be formatted like the phenotype file (a header, and sample names in the first column).
- **Number of Dimensions:** number of dimensions used in the PCA to obtain the covariate matrix.
- **Model:** set the model to use in the GWAS analysis. The model determines how the relationship between a trait and genetic variation is analyzed.
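If you supply your own kinship matrix, its layout can be verified beforehand. The sketch below checks the format described above (n rows, n+1 columns, no header row, symmetric values). The function name is hypothetical and the tab delimiter is an assumption, since the delimiter is not stated here.

```python
def read_kinship(text, tol=1e-9):
    """Parse and validate a kinship matrix file.

    Returns (sample_names, matrix). Raises ValueError on a malformed file.
    Assumption: fields are tab-separated.
    """
    names, matrix = [], []
    for line in text.strip().splitlines():
        fields = line.split("\t")
        names.append(fields[0])
        # A header row would fail here, because its cells are not numbers.
        matrix.append([float(x) for x in fields[1:]])

    n = len(names)
    for name, row in zip(names, matrix):
        if len(row) != n:
            raise ValueError(f"{name}: expected {n} values, found {len(row)}")

    for i in range(n):
        for j in range(i + 1, n):
            if abs(matrix[i][j] - matrix[j][i]) > tol:
                raise ValueError(f"not symmetric at ({names[i]}, {names[j]})")
    return names, matrix
```

A file that starts with a header row, has a ragged row, or is not symmetric raises a `ValueError` describing the problem.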

# Results

The GWAS analysis has the following outputs:

- **Association Table:** information for each SNP and phenotype.
- **Genotype Charts:** information related only to the population structure and not to any of the phenotypes provided.
- **Phenotype Charts:** data associated with the phenotypes, so there is one chart per phenotype used.
- **Summary Report**.

## Association Table

This table has the following information for each SNP.

- __Phenotype:__ phenotype associated with that SNP.
- __SNP:__ ID of the SNP. If the VCF did not have any ID, this field contains a combination of the chromosome name and the position.
- __Chromosome:__ chromosome where the variant is found.
- __Position:__ 1-based position in the chromosome where the variant is found.
- __Minor Allele Frequency:__ frequency at which the second most common allele occurs in a given population.
- __Number of Samples Used:__ this number can vary among phenotypes depending on the phenotypic information available for each sample. In addition, some samples may have been removed during the sample filtering step.
- __Effect:__ estimated effect of the variant on the phenotype. A positive value means that the presence of the variant increases the quantitative value of the trait, whereas a negative value means that it decreases it. A greater absolute value means that the variant has a larger influence on the phenotype.
- __P-value:__ significance of the association.
- __Adjusted P-Value:__ p-value adjusted with the Benjamini-Hochberg method.
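For readers unfamiliar with the Benjamini-Hochberg adjustment used for the adjusted p-value column, the following is a minimal sketch of the standard procedure. The function name is hypothetical and this is an illustration of the method, not the tool's own code.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Each raw p-value is multiplied by the number of tests divided by its rank, and a running minimum ensures that adjusted values never decrease as raw p-values get smaller.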

## Genotype Charts

These charts describe the set of variants present in the population and are not related to any phenotype.

- **PCA:** this PCA is computed from the distances among samples, taking into account only their genotypes as they appear in the VCF file. It can be coloured by phenotypic values.
- **MAF Histogram:** distribution of the frequency at which the second most common allele occurs in the whole population.
- **Marker Heterozygosity:** heterozygosity is the condition of having two different alleles at a locus. This histogram shows the proportion of sites that are heterozygous. A high level of heterozygosity indicates low quality.
- **Sample Heterozygosity:** this histogram shows the percentage of heterozygous sites per sample. Again, a high level of heterozygosity indicates low quality.
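The two heterozygosity histograms can be understood through the values they summarize. The sketch below is an illustration under assumptions, not the tool's code: genotypes are coded as alternate allele counts (0, 1, 2, with 1 meaning heterozygous and `None` meaning missing), marker heterozygosity is taken as the fraction of heterozygous calls per site, and the function names are hypothetical.

```python
def heterozygosity(calls):
    """Fraction of called genotypes that are heterozygous (coded as 1)."""
    called = [g for g in calls if g is not None]
    if not called:
        return None  # nothing to measure if every call is missing
    return sum(g == 1 for g in called) / len(called)


def marker_and_sample_heterozygosity(matrix):
    """matrix: rows are sites (markers), columns are samples."""
    per_marker = [heterozygosity(row) for row in matrix]
    per_sample = [heterozygosity(column) for column in zip(*matrix)]
    return per_marker, per_sample
```

Plotting a histogram of each returned list gives one distribution per marker and one per sample.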

## Phenotype Charts

These charts depend on the values of the phenotypic traits in the population, so there are as many of each type as the number of phenotypes included in the analysis. There are two types:

**QQ-plot:** a QQ-plot (quantile-quantile plot) shows the deviation of the observed p-values from those expected under the null hypothesis. The X-axis represents the negative logarithm of the expected p-values, and the Y-axis the negative logarithm of the observed p-values. In a theoretical GWAS with no causal polymorphisms, the points fall along the diagonal line. Variants that are significantly associated with the phenotype are plotted above the diagonal.
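The coordinates of a QQ-plot can be sketched as follows. This is an illustrative example, not the tool's code: the function name is hypothetical, and the expected p-value for the i-th smallest of m observed p-values is taken as i / (m + 1), which is one common convention among several.

```python
import math


def qq_points(pvalues):
    """Return (x, y) pairs: -log10 expected vs. -log10 observed p-values."""
    if any(p <= 0 or p > 1 for p in pvalues):
        raise ValueError("p-values must be in the interval (0, 1]")
    m = len(pvalues)
    points = []
    for rank, observed in enumerate(sorted(pvalues), start=1):
        expected = rank / (m + 1)  # assumption: uniform quantile i / (m + 1)
        points.append((-math.log10(expected), -math.log10(observed)))
    return points
```

Points with a Y value well above their X value correspond to p-values that are smaller than expected under the null hypothesis.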

**Manhattan Plot:** summary chart that represents every variant position in the genome on the X-axis and the negative logarithm of its p-value on the Y-axis. If a SNP is associated with a specific phenotype according to the chosen model, it appears above the red horizontal line, which represents the threshold for a significant adjusted p-value.

The threshold value used for the horizontal line is the critical p-value, that is, the smallest p-value whose BH-adjusted p-value is greater than 0.05.
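The definition of the critical p-value can be written out as a short sketch. It follows the description in the text rather than the tool's source code; the function name is hypothetical, and returning `None` when every adjusted p-value is at or below the cutoff is a choice made for this example.

```python
def critical_p_value(pvalues, alpha=0.05):
    """Smallest raw p-value whose BH-adjusted p-value exceeds alpha.

    Returns None if every adjusted p-value is at or below alpha.
    """
    m = len(pvalues)
    ordered = sorted(pvalues)
    adjusted = [0.0] * m
    running_min = 1.0
    # Benjamini-Hochberg adjustment on the sorted p-values.
    for rank in range(m, 0, -1):
        running_min = min(running_min, ordered[rank - 1] * m / rank)
        adjusted[rank - 1] = running_min
    for raw, adj in zip(ordered, adjusted):
        if adj > alpha:
            return raw
    return None
```

Variants with a raw p-value below the returned value have an adjusted p-value of at most `alpha`, so they appear above the line once both are transformed to the negative logarithm scale.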

## Summary Report

**Report** with information about the filtering step and the GWAS itself:

- **Filtering Summary:** information about the number of SNPs and samples before and after filtering (see filtering parameters).
- **GWAS Summary:** number of significant SNPs associated with the different phenotypes.