Example on how to add a new Task #11

proto-tardigrade · 2019-09-05T23:23:43Z

Hello,

I'd like to add a new task, specifically a SequenceToFloatTask, so that I can fine-tune the models pre-trained on PFAM on that task and evaluate their performance. Unfortunately, after reading through the source code I could not determine how datasets are associated with tasks, so I was unsure how to work this out on my own. The Task ABC has a get_data() method that uses a data folder to load the task's files, and main has an additional get_data() function that receives a data folder and passes it along to each task's get_data() method. However, the eval() protein.command and main() protein.automain, both in main, call this get_data() function without specifying any data_folder. If you could explain how a dataset is specified for, or associated with, a task, I might be able to figure out the rest myself; either way, an example of adding a new task would be really helpful!

Best,
Chase

rmrao · 2019-09-20T17:00:20Z

Hi Chase -

The Task ABC provides an interface for everything you need to define a task. We used the sacred library for configuration, which automatically injects configuration values as function arguments, which is why get_data() can be called in main without an explicit data_folder. (We're moving away from this in future work, as it's odd, cumbersome, and makes the code hard for other people to understand.)

The key things to know are:

The name of the task will, by default, be the name of the class in snake case, without "task" at the end. If you want to change this, just override the class's __str__ method.
The task will automatically look in {data_folder}/str(self)/str(self)_train.tfrecord. E.g. if your class is called MyNewTask, it will look in {data_folder}/my_new/my_new_train.tfrecord.
The main function knows about the tasks via the TaskBuilder class. The easiest solution is to add your class to TaskBuilder manually.
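To make the naming convention concrete, here is a minimal sketch of how a class name could map to a task name and a training file path. The helpers task_name and train_file are hypothetical names written for illustration; they are not functions from the TAPE code base, and the real conversion may handle edge cases differently.

```python
import re
from pathlib import Path


def task_name(class_name: str) -> str:
    """Hypothetical helper: snake-case a class name and drop a trailing 'Task'."""
    if class_name.endswith("Task"):
        class_name = class_name[: -len("Task")]
    # Insert an underscore before every capital letter except the first character.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()


def train_file(data_folder: str, class_name: str) -> Path:
    """Hypothetical helper: build {data_folder}/{name}/{name}_train.tfrecord."""
    name = task_name(class_name)
    return Path(data_folder) / name / f"{name}_train.tfrecord"


print(task_name("MyNewTask"))  # my_new
print(train_file("tape/data", "MyNewTask").as_posix())  # tape/data/my_new/my_new_train.tfrecord
```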

If you want to add a new SequenceToFloatTask, I'd start by looking at the FluorescenceTask. The big things you'll need to change are 1) the name of the task, 2) the deserialization function, and 3) the label (the name of a key in the dictionary of inputs that your deserialization function returns). Everything else can remain exactly the same.
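The relationship between those three pieces can be sketched roughly as follows. Everything here is a stand-in written for illustration: SequenceToFloatTaskSketch is not the real base class, its constructor arguments and the get_target method are assumptions, and the real deserialization function parses a serialized TFRecord example rather than a plain dict. The point is only that the label must match a key in what the deserialization function returns.

```python
from typing import Callable, Dict


def deserialize_stability_example(record: Dict[str, str]) -> Dict[str, object]:
    """Hypothetical deserializer: turn one raw record into a dict of inputs."""
    sequence = record["sequence"]
    return {
        "primary": list(sequence),
        "protein_length": len(sequence),
        "stability_score": float(record["score"]),  # the regression target
    }


class SequenceToFloatTaskSketch:
    """Stand-in base class (assumption, not the real SequenceToFloatTask)."""

    def __init__(self, deserialization_func: Callable[[dict], dict], label: str):
        self.deserialization_func = deserialization_func
        self.label = label

    def get_target(self, record: dict) -> float:
        # The label is looked up as a key in the deserialized dictionary.
        return self.deserialization_func(record)[self.label]


class MyStabilityTask(SequenceToFloatTaskSketch):
    """1) new name, 2) new deserialization function, 3) new label."""

    def __init__(self):
        super().__init__(
            deserialization_func=deserialize_stability_example,
            label="stability_score",
        )
```

If the label does not match a key returned by the deserialization function, the lookup in this sketch fails with a KeyError, which is the kind of mismatch to check for first.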

Then, add your data to the same TAPE data folder (by default tape/data). The folder should have the same name as your task, but in snake case. So if your task is called MyFancyNewTask, you'll create a folder called my_fancy_new containing the files my_fancy_new_train.tfrecord and my_fancy_new_valid.tfrecord. Alternatively, you can override the get_train_files and get_valid_files methods in your new task to return your data files, regardless of where they are or what they're called.
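The two options can be sketched like this. TaskSketch is a stand-in for the Task ABC that models only file lookup; the method signatures (taking a data folder and returning a list of paths), the name attribute, and the /scratch paths are assumptions for illustration, so check the real method signatures before copying this.

```python
from typing import List


class TaskSketch:
    """Stand-in for the Task ABC; only the default file lookup is modeled."""

    name = "task"  # assumption: stands in for str(self) in the real code

    def get_train_files(self, data_folder: str) -> List[str]:
        return [f"{data_folder}/{self.name}/{self.name}_train.tfrecord"]

    def get_valid_files(self, data_folder: str) -> List[str]:
        return [f"{data_folder}/{self.name}/{self.name}_valid.tfrecord"]


class MyFancyNewTask(TaskSketch):
    """Option 1: follow the convention and put files in tape/data/my_fancy_new/."""

    name = "my_fancy_new"


class MyRelocatedTask(TaskSketch):
    """Option 2: override the lookup and point at files stored anywhere."""

    name = "my_relocated"

    def get_train_files(self, data_folder: str) -> List[str]:
        return ["/scratch/experiments/train_split.tfrecord"]  # hypothetical path

    def get_valid_files(self, data_folder: str) -> List[str]:
        return ["/scratch/experiments/valid_split.tfrecord"]  # hypothetical path


print(MyFancyNewTask().get_train_files("tape/data"))
print(MyRelocatedTask().get_train_files("tape/data"))
```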

Finally, import your task in TaskBuilder.py and add it to the dictionary at the beginning. You won't have to worry about main - everything should get auto-imported / understood.
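As a rough picture of what the registration step amounts to, the sketch below uses a dictionary that maps task names to classes. TaskBuilderSketch, its tasks attribute, and build_task are stand-in names chosen for illustration; the actual dictionary name, key format, and lookup method in TaskBuilder.py may differ.

```python
class FluorescenceTask:
    """Placeholder for an existing task class."""


class MyFancyNewTask:
    """Placeholder for the newly written task class."""


class TaskBuilderSketch:
    """Stand-in registry (assumption): maps a task name to its class."""

    tasks = {
        "fluorescence": FluorescenceTask,
        "my_fancy_new": MyFancyNewTask,  # the one line added for the new task
    }

    @staticmethod
    def build_task(name: str):
        try:
            return TaskBuilderSketch.tasks[name]()
        except KeyError:
            raise ValueError(f"Unrecognized task: {name}") from None


task = TaskBuilderSketch.build_task("my_fancy_new")
print(type(task).__name__)  # MyFancyNewTask
```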

I realize that this is pretty unintuitive, and we are looking at releasing a new version that's easier to modify. That should be released in a few months.

Kevin-chen-sheng · 2019-12-17T16:13:19Z

Have you successfully added a new task? @proto-tardigrade

Kevin-chen-sheng · 2019-12-17T17:22:20Z

Sorry, there is no data folder under the tape folder (tape/data). I couldn't find it.
