implement flag as an op. add License.
brentp committed Apr 29, 2015
1 parent 0cb84e8 commit 07a6a7f
Showing 5 changed files with 106 additions and 39 deletions.
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
The MIT License (MIT)

Copyright (c) 2015 Brent Pedersen and Aaron Quinlan

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
53 changes: 20 additions & 33 deletions README.md
@@ -2,34 +2,29 @@ vcfanno
=======

vcfanno annotates a VCF with any number of *sorted* input BED, BAM, and VCF files.
It does this by finding overlaps as it streams over the sorted files and applying
It does this by finding overlaps as it streams over the data and applying
user-defined operations on the overlapping fields.

For VCF annotations, values are pulled by name from the INFO field. A variant from
the query VCF will only be annotated with a variant from an annotation file if they
have the same position and REF and share at least 1 ALT.
For VCF, values are pulled by name from the INFO field.
For BED, values are pulled from (1-based) column number.
For BAM, only depth (`count`) is currently supported.

For BED files, values are pulled from (1-based) column number.

For BAM files, only depth (`count`) is currently supported.
`vcfanno` is written in [Go](http://golang.org).
It can annotate ~5,000 variants per second with 5 annotations from 3 files on a modest laptop.


`vcfanno` is written in [go](http://golang.org) and will make use of multiple CPU's
if your environment variable `GOMAXPROCS` is set to a number greater than 1. It can
annotate ~ 5,000 variants per second with 5 annotations from 3 files on a modest laptop.

We are actively developing `vcfanno` and appreciate your feedback as we navigate the
[fruit salad](https://www.biostars.org/p/7126/#7136) of the VCF format.
We are actively developing `vcfanno` and appreciate feedback and bug reports.

Usage
=====

Usage looks like:
After downloading the binary for your system (see section below), usage looks like:

vcfanno config.toml $input.vcf > $annotated.vcf
```Shell
./vcfanno example/conf.toml example/query.vcf
```

Where config.toml contains the information on any number of annotation files.
Example entries look like
Where config.toml looks like:

```
[[annotation]]
@@ -52,7 +47,7 @@ names=["ex_bam_depth"]

So from `ExAC.vcf` we will pull the fields from the info field and apply the corresponding
`operation` from the `ops` array. Users can add as many `[[annotation]]` blocks to the
conf file as desired.
conf file as desired. Files can be local as above, or available via http/https.

Example
-------
@@ -61,17 +56,13 @@ the example directory contains the data and conf for a full example. To run, eit
the appropriate binary for your system from **TODO** or build with:

```Shell
go get
go build -o vcfanno
```

from this directory.
Then, you can annotate with:

```Shell
./vcfanno example/conf.toml example/query.vcf > annotated.vcf
```
Or, to get the result a bit sooner:

```Shell
GOMAXPROCS=4 ./vcfanno example/conf.toml example/query.vcf > annotated.vcf
```
@@ -97,10 +88,11 @@ are `reduced`. Valid operations are:
+ mean
+ max
+ min
+ concat
+ count
+ concat // comma delimited list of output
+ count // count the number of overlaps
+ uniq
+ first
+ first
+ flag // presence/absence via VCF flag

Please open an issue if your desired operation is not supported.
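
As a rough illustration of what "reducing" means here, the sketch below applies a few of the listed operations to the values gathered from overlapping annotations. The `Reducer` signature is taken from `reducers.go`; the function bodies are simplified assumptions (for example, `mean` assumes every value is already a `float64`, and `uniq` keeps first-seen order), not the project's actual implementations.

```go
package main

import (
	"fmt"
	"strings"
)

// Reducer matches the signature declared in reducers.go: it collapses the
// values collected from all overlapping annotations into one value.
type Reducer func([]interface{}) interface{}

// mean averages the values. Assumption: every value is a float64.
// An empty slice yields NaN rather than an error.
func mean(vals []interface{}) interface{} {
	sum := 0.0
	for _, v := range vals {
		sum += v.(float64)
	}
	return sum / float64(len(vals))
}

// concat joins the values into a comma-delimited list.
func concat(vals []interface{}) interface{} {
	parts := make([]string, len(vals))
	for i, v := range vals {
		parts[i] = fmt.Sprint(v)
	}
	return strings.Join(parts, ",")
}

// count reports the number of overlaps.
func count(vals []interface{}) interface{} {
	return len(vals)
}

// uniq joins the distinct values, keeping first-seen order (an assumption).
func uniq(vals []interface{}) interface{} {
	seen := map[string]bool{}
	var parts []string
	for _, v := range vals {
		s := fmt.Sprint(v)
		if !seen[s] {
			seen[s] = true
			parts = append(parts, s)
		}
	}
	return strings.Join(parts, ",")
}

func main() {
	// Hypothetical depths pulled from three overlapping records.
	depths := []interface{}{44.0, 88.0, 44.0}
	fmt.Println(mean(depths), concat(depths), count(depths), uniq(depths))
}
```

Each operation sees the same slice of overlapping values, so adding a new op only requires another function with the `Reducer` signature.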

@@ -124,18 +116,13 @@ vt decompose -s $VCF | vt normalize -r $REF - > $NORM_VCF
Development
===========

Again, this, along with the associated go libraries ([vcfgo](https://github.com/brentp/vcfgo),
This, and the associated go libraries ([vcfgo](https://github.com/brentp/vcfgo),
[irelate](https://github.com/brentp/irelate), [xopen](https://github.com/brentp/xopen)) are
under active development. A number of things are not yet supported and a number of features
will be added soon.
under active development. The following are on our radar:

- [ ] add flag op. just check for presence/overlap with annotation.
- [x] strip 'chr' prefix from chroms to prevent lack of overlap due to different names.
- [x] handle structural variants correctly. (SVLEN <DEL/DUP> / <INS> [len=0])
- [ ] decompose, normalize, and get allelic primitives for variants on the fly
(we have code to do this, it just needs to be integrated)
- [ ] improve test coverage for vcfanno (started, but needs more)
- [x] correct order of contigs from vcf writer.
- [ ] embed v8 to allow custom ops.

<!--
11 changes: 9 additions & 2 deletions reducers.go
@@ -2,10 +2,11 @@ package main

import (
"fmt"
"github.com/brentp/irelate"
"github.com/brentp/vcfgo"
"strconv"
"strings"

"github.com/brentp/irelate"
"github.com/brentp/vcfgo"
)

type Reducer func([]interface{}) interface{}
@@ -94,6 +95,11 @@ func first(vals []interface{}) interface{} {
return vals[0]
}

// vflag reports whether any annotation overlapped the variant.
// It is named vflag to avoid a clash with the imported flag package.
func vflag(vals []interface{}) interface{} {
return len(vals) > 0
}

// Collect the fields associated with a variant into a single slice.
func Collect(v *vcfgo.Variant, rels []irelate.Relatable, cfg anno) [][]interface{} {
annos := make([][]interface{}, len(cfg.Names))
@@ -155,6 +161,7 @@ var Reducers = map[string]Reducer{
"count": Reducer(count),
"uniq": Reducer(uniq),
"first": Reducer(first),
"flag": Reducer(vflag),
}
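
One way to see how the new `"flag"` entry is consumed is a name-based lookup that rejects unknown ops. The sketch below is a standalone assumption: the `reducers` map copies only two entries from the real `Reducers` table, and the `lookup` helper is hypothetical and does not appear in this commit.

```go
package main

import "fmt"

// Reducer matches the signature declared in reducers.go.
type Reducer func([]interface{}) interface{}

// first returns the first overlapping value (it panics on an empty slice,
// as the original one-line implementation would).
func first(vals []interface{}) interface{} { return vals[0] }

// vflag reports presence: true when at least one annotation overlapped.
func vflag(vals []interface{}) interface{} { return len(vals) > 0 }

// reducers is a two-entry stand-in for the Reducers map in reducers.go.
var reducers = map[string]Reducer{
	"first": first,
	"flag":  vflag,
}

// lookup is a hypothetical helper: it resolves an op name from the config
// and returns an error instead of a nil Reducer for unknown names.
func lookup(op string) (Reducer, error) {
	r, ok := reducers[op]
	if !ok {
		return nil, fmt.Errorf("unknown op %q", op)
	}
	return r, nil
}

func main() {
	r, err := lookup("flag")
	if err != nil {
		panic(err)
	}
	fmt.Println(r([]interface{}{"chr1"})) // one overlap
	fmt.Println(r(nil))                   // no overlaps

	if _, err := lookup("median"); err != nil {
		fmt.Println(err)
	}
}
```

Checking the `ok` result matters because indexing a Go map with a missing key returns a nil function, and calling it would panic far from the config error that caused it.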

// Partition separates the Related() elements by source.
27 changes: 27 additions & 0 deletions vcfanno.go
@@ -94,6 +94,7 @@ func Anno(queryVCF string, configs Annotations, outw io.Writer) {
panic(err)
}

// the *Prefix functions let 'chr1' == '1'
for interval := range irelate.IRelate(irelate.CheckOverlapPrefix, 0, irelate.LessPrefix, streams...) {
variant := interval.(*irelate.Variant)
if len(variant.Related()) > 0 {
@@ -120,6 +121,29 @@ func updateInfo(v *vcfgo.Variant, sep [][]irelate.Relatable, files []anno) {
}
}

func checkAnno(a anno) error {
if a.Fields == nil {
// Columns: BED/BAM
if a.Columns == nil {
return fmt.Errorf("must specify either 'fields' or 'columns' for %s", a.File)
}
if len(a.Ops) != len(a.Columns) && !strings.HasSuffix(a.File, ".bam") {
return fmt.Errorf("must specify same # of 'columns' as 'ops' for %s", a.File)
}
if len(a.Names) != len(a.Columns) && !strings.HasSuffix(a.File, ".bam") {
return fmt.Errorf("must specify same # of 'names' as 'columns' for %s", a.File)
}
// A columns-only (BED/BAM) config is valid at this point; return so the
// VCF-only checks below do not reject it.
return nil
}
// Fields: VCF
if a.Columns != nil {
return fmt.Errorf("specify only 'fields' or 'columns' not both %s", a.File)
}
if len(a.Ops) != len(a.Fields) {
return fmt.Errorf("must specify same # of 'fields' as 'ops' for %s", a.File)
}
return nil
}

func main() {

flag.Parse()
@@ -135,5 +159,8 @@ func main() {
if _, err := toml.DecodeFile(inFiles[0], &config); err != nil {
panic(err)
}
for _, a := range config.Annotation {
// Fail fast on an invalid config instead of discarding the error.
if err := checkAnno(a); err != nil {
panic(err)
}
}
Anno(inFiles[1], config, os.Stdout)
}
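
The rules that `checkAnno` is meant to enforce can be sketched as a standalone program. The `Config` type and `validate` function below are hypothetical stand-ins for `anno` and `checkAnno`; the sketch treats the BED/BAM (`columns`) and VCF (`fields`) branches as mutually exclusive, so that a valid columns-only config passes, which is an assumption about the intended behaviour.

```go
package main

import (
	"fmt"
	"strings"
)

// Config is a hypothetical stand-in for the anno struct, limited to the
// fields that the validation rules inspect.
type Config struct {
	File    string
	Ops     []string
	Names   []string
	Fields  []string // VCF: INFO field names
	Columns []int    // BED: 1-based column numbers
}

// validate returns nil for a consistent config and an error otherwise.
func validate(c Config) error {
	switch {
	case c.Fields == nil && c.Columns == nil:
		return fmt.Errorf("must specify either 'fields' or 'columns' for %s", c.File)
	case c.Fields != nil && c.Columns != nil:
		return fmt.Errorf("specify only 'fields' or 'columns' not both %s", c.File)
	case c.Fields != nil:
		// VCF: one op per INFO field.
		if len(c.Ops) != len(c.Fields) {
			return fmt.Errorf("must specify same # of 'fields' as 'ops' for %s", c.File)
		}
		return nil
	}
	// Columns only. BAM files are exempt from the length checks.
	if strings.HasSuffix(c.File, ".bam") {
		return nil
	}
	if len(c.Ops) != len(c.Columns) {
		return fmt.Errorf("must specify same # of 'columns' as 'ops' for %s", c.File)
	}
	if len(c.Names) != len(c.Columns) {
		return fmt.Errorf("must specify same # of 'names' as 'columns' for %s", c.File)
	}
	return nil
}

func main() {
	good := Config{File: "ex.bed", Ops: []string{"mean"}, Names: []string{"bed_mean"}, Columns: []int{4}}
	bad := Config{File: "ex.bed", Ops: []string{"mean", "flag"}, Names: []string{"bed_mean"}, Columns: []int{4}}
	fmt.Println(validate(good))
	fmt.Println(validate(bad))
}
```

Returning an `error` rather than panicking inside the validator leaves the caller free to decide whether to abort or to report every invalid `[[annotation]]` block at once.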
33 changes: 29 additions & 4 deletions vcfanno_test.go
@@ -1,10 +1,12 @@
package main

import (
"fmt"
"testing"

"github.com/brentp/irelate"
"github.com/brentp/vcfgo"
. "gopkg.in/check.v1"
"testing"
)

func Test(t *testing.T) { TestingT(t) }
@@ -78,9 +80,9 @@ func (s *AnnoSuite) TestAnno(c *C) {
}
cfgBed := anno{
File: "bed file",
Ops: []string{"mean", "max"},
Columns: []int{4, 5},
Names: []string{"bed_mean", "bed_max"},
Ops: []string{"mean", "max", "flag"},
Columns: []int{4, 5, 1},
Names: []string{"bed_mean", "bed_max", "bedFlag"},
}

sep := Partition(s.v1, 2)
@@ -95,4 +97,27 @@ func (s *AnnoSuite) TestAnno(c *C) {

c.Assert(s.v1.Info["bed_mean"], Equals, float32(111))
c.Assert(s.v1.Info["bed_max"], Equals, float32(222))

c.Assert(s.v1.Info["bedFlag"], Equals, true)

c.Assert(fmt.Sprintf("%s", s.v1.Info), Equals, "DP=35;dp_mean=66;dp_min=44;dp_max=88;dp_concat=44,88;dp_uniq=44,88;dp_first=44;bed_mean=111;bed_max=222;bedFlag")
}

func (s *AnnoSuite) TestCheck(c *C) {
cfgBed := anno{
File: "bed file",
Ops: []string{"mean", "max", "flag"},
Columns: []int{4, 5},
Names: []string{"bed_mean", "bed_max", "bedFlag"},
}
e := checkAnno(cfgBed)
c.Assert(e, ErrorMatches, "must specify same # of 'columns' as 'ops' for bed file")

cfgBed.Fields = []string{"abc", "def"}
e = checkAnno(cfgBed)
c.Assert(e, ErrorMatches, "specify only 'fields' or 'columns' not both bed file")

cfgBed.Columns = nil
e = checkAnno(cfgBed)
c.Assert(e, ErrorMatches, "must specify same # of 'fields' as 'ops' for bed file")
}
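
The string asserted in `TestAnno` ends in a bare `bedFlag`, with no `=value`. A minimal sketch of that serialization rule follows. The `formatInfo` function and its explicit key ordering are assumptions made for illustration, not the `vcfgo` implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// formatInfo is a hypothetical serializer for a VCF INFO column.
// keys fixes the output order; every key is assumed to exist in vals.
// A true bool is written as the bare key, a false bool is omitted,
// and every other value is written as key=value.
func formatInfo(keys []string, vals map[string]interface{}) string {
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		switch v := vals[k].(type) {
		case bool:
			if v {
				parts = append(parts, k)
			}
		default:
			parts = append(parts, fmt.Sprintf("%s=%v", k, v))
		}
	}
	return strings.Join(parts, ";")
}

func main() {
	info := map[string]interface{}{
		"DP":      35,
		"bed_max": float32(222),
		"bedFlag": true,
	}
	fmt.Println(formatInfo([]string{"DP", "bed_max", "bedFlag"}, info))
}
```

Representing the flag as a Go `bool` keeps the reducer output typed until the last moment, so the writer alone decides how presence is spelled in the output file.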
