ESTpipe v0.4
ESTpipe is a pipeline for cleaning, annotating and submitting baboon
EST sequences into dbEST.
It is used in research for the NIH grant "Baboon cDNA
sequencing" (1 R24 RR021863-01A1, CFDA No. 93.389, 2007-2010).
Many things are hard-coded in this initial submission, but it can
easily be modified to process other ESTs.
The small test files that form part of the distribution need to be
overwritten by real data files. Once set up, the pipeline for one EST
library can be run with 'make'.
You can try:
make -s; make clean
When make runs without error, you have installed all the dependencies
and can start processing real data.
ANALYSIS STEPS
1. Reject sequences with fewer than 64 informative residues after masking
2. Remove linkers from the ends (currently disabled because not needed)
3. Reject sequences with a match to the human mitochondrial genome
4. Find a human protein homolog
5. Write out in dbEST submission format
6. Process the quality file to match rejected ESTs
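As an illustration of step 1 only, here is a minimal Python sketch of the
length filter. It is not the pipeline's own code (the pipeline is written
in Perl). It assumes that masked residues appear as N or X and that
"informative" means an unmasked A, C, G or T; the function names are
hypothetical.

```python
MIN_INFORMATIVE = 64  # threshold taken from step 1 above


def informative_length(masked_seq):
    """Count residues that survived masking.

    Assumption: masking replaces residues with N or X, so only
    A, C, G and T (in either case) count as informative.
    """
    return sum(1 for base in masked_seq.upper() if base in "ACGT")


def keep_sequence(masked_seq, minimum=MIN_INFORMATIVE):
    """Return True when the sequence has enough informative residues."""
    return informative_length(masked_seq) >= minimum
```

A sequence with 63 unmasked residues followed by a long run of N would
be rejected under these assumptions, regardless of its total length.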
DIRECTORY STRUCTURE
Since make is particular about identifying the files by their name and
location, it is important to understand the directories used:
data    reference sequences
raw     where EST files are placed for processing
work    all intermediate files are created here
done    final output files are placed here
The flow of EST sequence data is:
raw -> work -> done
bin     programmes that make up the pipeline
lib     library code for the programmes
t       tests for the library
        (passing the tests depends on the state of files in the directories)
DEPENDENCIES
The following programs have to be installed for the pipeline to run:
perl          a recent perl (5.8.0) is recommended
BioPerl       use the 1.6 release or a more recent SVN copy
              http://www.bioperl.org/wiki/Getting_BioPerl
RepeatMasker  http://www.repeatmasker.org/RMDownload.html
CrossMatch    http://www.phrap.org
mdust         http://compbio.dfci.harvard.edu/tgi/software/
NCBI BLAST    blastall and formatdb are used
              http://www.ncbi.nlm.nih.gov/BLAST/download.shtml
Note: If you have RepeatMasker/CrossMatch installed on another machine,
the processed sequence file can be imported instead (see FIND REPEAT
ONLY SEQUENCES).
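Before running make, it can help to check that the external programs are
on the PATH. The following Python sketch is not part of ESTpipe. The
executable names are assumptions based on how these tools are commonly
installed (for example, cross_match for CrossMatch); adjust them to your
installation. BioPerl is a Perl library, so it is not covered here.

```python
import shutil

# Assumed executable names; adjust to match your installation.
REQUIRED_PROGRAMS = (
    "perl",
    "RepeatMasker",
    "cross_match",
    "mdust",
    "blastall",
    "formatdb",
)


def missing_programs(names=REQUIRED_PROGRAMS, which=shutil.which):
    """Return the names that cannot be found on the PATH.

    The `which` argument exists only so the lookup can be replaced
    in tests; by default it is shutil.which.
    """
    return [name for name in names if which(name) is None]
```

Calling missing_programs() returns an empty list when every program was
found. RepeatMasker and cross_match may legitimately be missing if the
masked file is prepared on another machine.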
PREPARE THE UNIPROT DATABASE
Copy the latest human Swiss-Prot and TrEMBL files into the data
directory using these names:
data/uniprot_sprot_human.dat.gz
data/uniprot_trembl_human.dat.gz
Make note of the release number for your records.
These files need to be preprocessed so that the header will contain
information about the source database.
To process these files and format the combined blast database run:
make formatdb
The data directory also contains the human mitochondrial reference
sequence. If that needs to be updated, replace it with an identically
named file (humanmito.fa) and this step will reformat that database,
too.
COPY THE EST FILES
The fasta and quality files for the EST library need to be placed in
the raw directory:
raw/bab.fa
raw/bab.qual
The quality file is not currently used by the pipeline, but it is
processed to remove the same sequences that were removed from the fasta
file for being too short, repetitive or containing mitochondrial
sequence.
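The idea of keeping the quality file in step with the filtered fasta file
can be sketched in Python as follows. This is an illustration, not the
pipeline's Perl code. It assumes that both files use '>' header lines and
that the first word after '>' is the sequence identifier; the function
names are hypothetical.

```python
def read_ids(fasta_lines):
    """Collect sequence identifiers from fasta header lines.

    Assumption: the identifier is the first word after '>'.
    """
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def filter_records(lines, keep_ids):
    """Keep only the records whose identifier is in keep_ids."""
    kept = []
    keeping = False
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            keeping = bool(fields) and fields[0] in keep_ids
        if keeping:
            kept.append(line)
    return kept
```

With the surviving fasta records as input to read_ids, passing the lines
of the quality file to filter_records drops the quality records of every
rejected EST.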
The pipeline is dependent on finding meta information about the
sequence and the library from the fasta header. See the example fasta
file.
FIND REPEAT ONLY SEQUENCES
RepeatMasker/CrossMatch is run as part of the pipeline.
However, if you have it installed on a different machine, the
processed file can be copied into:
raw/bab.fa.masked
This is the default setup. See the Makefile for details on
RepeatMasker command line options to use when running it.
RUN
At the root directory type:
make
Running removes all non-informative and contaminating sequences from
the final output.
The most time-consuming step is running the programme bin/estpipe, which
sequentially compares the sequences to the reference sequences using
BLAST. A detailed log is written to 'work/estpipe.log'. At any time
during the execution, you can analyse the log:
work/estpipe_progress
OUTPUT
The output is written into the done directory. All files are named
according to the EST library name, which is parsed from the fasta header.
An example listing of the final files in the output directory, 'done':
BABEVCEREB-C-01-1-7KB.dbest.1.gz BABEVCEREB-C-01-1-7KB.dbest.7.gz
BABEVCEREB-C-01-1-7KB.dbest.2.gz BABEVCEREB-C-01-1-7KB.dbest.gz
BABEVCEREB-C-01-1-7KB.dbest.3.gz BABEVCEREB-C-01-1-7KB.fasta.gz
BABEVCEREB-C-01-1-7KB.dbest.4.gz BABEVCEREB-C-01-1-7KB.lib.gz
BABEVCEREB-C-01-1-7KB.dbest.5.gz BABEVCEREB-C-01-1-7KB.qual.gz
BABEVCEREB-C-01-1-7KB.dbest.6.gz
The BABEVCEREB-C-01-1-7KB EST library has been written into
correspondingly named fasta, quality and dbEST files.
Importantly, the single dbEST file has been split into smaller files,
each containing a maximum of 10,000 entries, that can be mailed as
attachments to dbEST ([email protected]).
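The splitting into files of at most 10,000 entries can be sketched in
Python as follows. This is an illustration, not the pipeline's Perl code.
It assumes that each entry in the dbEST file ends with a line containing
only '||'; the function names are hypothetical.

```python
MAX_ENTRIES = 10000  # maximum entries per file, as stated above


def split_entries(text, terminator="||"):
    """Split dbEST text into entries.

    Assumption: every entry ends with a line holding only `terminator`.
    """
    entries = []
    current = []
    for line in text.splitlines(keepends=True):
        current.append(line)
        if line.strip() == terminator:
            entries.append("".join(current))
            current = []
    # Keep trailing text that lacks a terminator, unless it is blank.
    if any(line.strip() for line in current):
        entries.append("".join(current))
    return entries


def chunk(entries, size=MAX_ENTRIES):
    """Group entries into consecutive lists of at most `size` items."""
    return [entries[i:i + size] for i in range(0, len(entries), size)]
```

Each list returned by chunk would then be written to its own numbered
file, in the manner of the .dbest.1.gz to .dbest.7.gz files in the
listing above.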
Note that the dbEST lib file in this output was not created
automatically. You have to do it manually. See the dbEST submission
documentation for details.
LICENSE
ESTpipe is licensed under the same terms as Perl itself, which means
it is dually-licensed under either the Artistic or GPL licenses.
Heikki Lehvaslaiho
20 April 2009