NGSoptwin

NGSoptwin is an R package for estimating the optimal window size for analysis of low-coverage next-generation sequence data. The R package has currently been tested only in Linux/Unix environment. Download the R package here (version 1.0-2, 04 Feb 2013).

Given a step function on the density of read counts, we calculate the Akaike's information criterion (AIC) and cross-validation (CV) log-likelihood at different window sizes. The optimal window size is estimated at the window size that gives the lowest AIC or the maximum CV log-likelihood.

Please note that the setting of the analysis is in the low coverage sequence data (e.g. less than 0.1X), where 'binning' or 'windowing' is critical to obtain genomic features. However, we have used this analysis in a higher coverage sequence data without any problem.

Data preparation

The input data for the R package is basically in BED format, althought not all columns are needed for our purpose. It needs only the first two columns of the bed files, as the reads are of equal length, so that we only need the information on the starting position and the other information are not used. If data filtering is required, it needs to be done on the bam files before creating the bed files (see below). To select the first two columns of the bed files, there are many ways to do it, but we use awk or gawk to do it efficiently.

For handling the different file formats (bam and bed), we assume that samtools and bedtools are installed in the directory in your $PATH.

To use the LS041 data in the paper, do the following steps. These steps can be used as a guidance to prepare your own data for analysis.

Installing (and loading) the package

Using the package

We describe here very briefly some examples to run an estimation of optimal window size using the package. Please refer to the help page for more details.

We first read the above input data (LS041-test-37-pos.txt and LS041-control-37-pos.txt) using

test.pos <- read.pos("LS041-test-37-pos.txt")
control.pos <- read.pos("LS041-control-37-pos.txt")

We then calculate AIC and CV log-likelihood for several window sizes

LS041.optwin <- opt.win(test.pos, control.pos, win.size=c(seq(10,100,10), seq(150,950,50), seq(1000,2000,by=100))*1000)

The results can be plotted using

plot(LS041.optwin, min=T)# AIC

or

plot(LS041.optwin, "cv", min=T)# CV log-likelihood

If you are analysing only one sample, then you can use the command

LS041.optwin.onesample <- opt.win.onesample(test.pos, win.size=c(seq(10,100,10), seq(150,950,50), seq(1000,2000,by=100))*1000)

The results can be plotted using

plot(LS041.optwin.onesample, min=T)# AIC

or

plot(LS041.optwin.onesample, "cv", min=T)# CV log-likelihood

Once you are sure that the minimum AIC or maximum CV log likelihood has been identified in the figure, the optimal window size corresponding to the maximum/minimum can be identified using the following commands.

attr(LS041.optwin, "winsize")[which.min(LS041.optwin$test.aic)] # Tumour, using AIC

attr(LS041.optwin, "winsize")[which.min(LS041.optwin$control.aic)] # Tumour, using CV

attr(LS041.optwin, "winsize")[which.max(LS041.optwin$test.cv)] # Normal, using AIC

attr(LS041.optwin, "winsize")[which.max(LS041.optwin$control.cv)] # Normal, using CV

Citation

If you are using the NGSoptwin package for your analysis, please use the following citation:

Gusnanto, A., Taylor C.C., Nafisah, I., Wood, H.M., Rabbitts, P., and Berri, S. (2014). Estimating optimal window size for analysis of low-coverage next-generation sequence data, Bioinformatics, advance access, doi: 10.1093/bioinformatics/btu123




Last modified: Wed Apr 16 14:17:05 GMT 2014