NGStests

NGStests is an R package for testing for transcription factor (TF) differential binding in a ChIP-seq experiment, using next-generation sequence data, in a two-sample setting. The R package has currently been tested only in Linux/Unix environment. Download the R package here (version 1.1, May 2015). The package depends on other R packages: VGAM, splines, and stats4.

The procedure is as follows. Given two different sequences (from, say, test and control samples), the NGStests package constructs genomic windows, and gets the read counts per base-pair and per window. It then takes the differences of read counts between the two samples per corresponding location/base-pair. The differences are considered as observations and the underlying rates of both sequences are tested for equality (using the differences rather the original read counts). The main idea of the method is to consider the read counts per window to follow a Poisson distribution. We assume the read counts to be paired observations so that their differences would follow a Poisson Difference (PD) distribution.

Data preparation

The input data for the R package is basically in BED format, althought not all columns are needed for our purpose. It needs only the first two columns of the bed files, as the reads are of equal length, so that we only need the information on the starting position and the other information are not used. If data filtering is required, it needs to be done on the bam files before creating the bed files (see below). To select the first two columns of the bed files, there are many ways to do it, but we use awk or gawk to do it efficiently.

For handling the different file formats (bam and bed), we assume that samtools and bedtools are installed in the directory in your $PATH.

To prepare the data, do the following steps.

Installing (and loading) the package

Using the package

We describe here very briefly some examples to run the functios in the package. Please refer to the help page for more details.

Prior to the analysis, please set your working directory appropriately. Some functions in the package will save the data to disk to save memory. Because of this, please make sure that there is at least 2GB of disk space available (this of course depends on the coverage of your sequence). Please also make sure that you have permission to write to the working directory.

We first read the above input data (test-37-pos.txt and control-37-pos.txt) using

test.pos <- read.table("test-37-pos.txt", header=F)
control.pos <- read.table("control-37-pos.txt", header=F)

We then construct the genomic windows and counting the reads

data.bin <- NGSBinWin(test.pos, control.pos, bin.size=1, window.size=200)

In the above command line, the only quantities need to be input by user are the arguments bin.size and window.size. The value for bin.size can generally be set to one, which means that the reads are counted per base pair. Meanwhile, the value for window.size can be estimated from the NGSoptwin package, which uses the same input files as the NGStests package.

The above function creates RData files and write them to the disk (in the working directory) to save memory. It is important that each analysis is done on separate working directory, and you have enough space on the disk.

To perform the tests, run the following command.

res.all <- NGStestI(data.bin)

The above command, by default, will perform the Wald test (WT), exact test (ET), and likelihood ratio test (LRT) on the read counts data, and export the results to excel files in the working directory. The user can modify the default setting by changing the arguments method and/or output.format. See the help page by typing ?NGStestI [Enter] at the R prompt.

Please note that the p-values as the output of the above tests are raw unadjusted p-values. To reflect the multiplicity in the data, we can use

res.all.adj = NGSAdjSig(res.all, method = "FDR", cutoff = 0.05, out.format = "xls")

For plotting the distribution of reads in a region, you can use the following example command

temp.plot = NGSplot(test.pos, control.pos, chr=1, start=13333200, end=13333399, bin.size=1)

Sometimes, we are interested to test a single region only. This can be done via

res.single = NGStestSingle(test.pos, control.pos, chr=1, start=13333200, end=13333399, bin.size = 1, d = 0, rm.centro = TRUE, plot = TRUE)

Suppose that after performing the tests, we want to shift the windows to check whether we get a different result in some of the windows. Suppose we are interested to shift the location of the windows by 100 bp, we can run

data.bin.shifted = NGSshift(data.bin,shift=100)

and subsequently run the tests using the available commands, for example NGStestI().

Changing the window size for the tests is also possible. Suppose that you are now interested to identify genomic regions in the size of 300 bp, you can run

data.bin.300 = NGSwin(data.bin,window.size=300)

and subsequently run the tests using the available commands, for example NGStestI().

Citation

If you are using the NGStests package for your analysis, please use the following citation:

Nafisah, I, Gusnanto, A, Mahalingam Shanmugiah, V, Taylor, CC, Westhead, DR (2015) Accurate statistical tests to detect genomic changes in ChIP-seq experiments. Submitted




Last modified: Mon June 09 09:46:05 GMT 2015