Example on how to add a new Task #11

Open
proto-tardigrade opened this issue Sep 5, 2019 · 3 comments

Comments

@proto-tardigrade

Hello,

I'm interested in adding a new task, specifically a SequenceToFloatTask, so that I can fine-tune the various models pre-trained on PFAM on that task and evaluate their performance. Unfortunately, while reading through the source code I was unable to determine how datasets are associated with tasks, so I'm unsure how to proceed on my own. I see that the Task ABC has a get_data() method that uses a data folder to load the specific files, and that main has an additional get_data() function that receives a data folder as input and passes it along to each task's get_data() method. However, in both eval() (a protein.command) and main() (the protein.automain), both in main, this get_data() function is called without specifying any data_folder. If you could explain how a dataset is associated with a specific task, I might be able to figure this out on my own; either way, an example of adding a new task would be really helpful!

Best,
Chase

@rmrao
Collaborator

rmrao commented Sep 20, 2019

Hi Chase -

The Task ABC provides an interface for everything you need to define a task. We used the sacred library for configuration, which auto-inserts arguments into function calls (we're moving away from this in future work, as it's odd, cumbersome, and makes the code hard for other people to understand).

The key things to know are:

  1. The name of the task will, by default, be the name of the class in snake case, without "task" at the end. If you want to change this, just override the class's __str__ method.
  2. The task will automatically look in {data_folder}/{str(self)}/{str(self)}_train.tfrecord. E.g., if your class is called MyNewTask, it will look in {data_folder}/my_new/my_new_train.tfrecord.
  3. The main function knows about the tasks via the TaskBuilder class. The easiest solution is to add your class to TaskBuilder manually.
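To make the naming rule in points 1 and 2 concrete, here is a minimal sketch of how a class name could map to a task name and a training file path. The helpers `task_name` and `train_file` are hypothetical names used only for illustration; they are not TAPE functions, and the real implementation may differ in detail.

```python
import re
from pathlib import Path


def task_name(class_name: str) -> str:
    """Convert a class name such as 'MyNewTask' to 'my_new' (hypothetical helper)."""
    # Drop a trailing 'Task' suffix, if present.
    if class_name.endswith("Task"):
        class_name = class_name[: -len("Task")]
    # Put an underscore before every capital letter except the first, then lowercase.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()


def train_file(data_folder: str, class_name: str) -> Path:
    """Build {data_folder}/{name}/{name}_train.tfrecord for a given class name."""
    name = task_name(class_name)
    return Path(data_folder) / name / f"{name}_train.tfrecord"


print(task_name("MyNewTask"))
print(train_file("data", "MyNewTask").as_posix())
```

If the default name does not suit your data layout, overriding `__str__` on the task class changes both the folder and the file prefix at once.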

If you want to add a new SequenceToFloatTask, I'd start by looking at the FluorescenceTask. The big things you'll need to change here are 1) the name of the task, 2) the deserialization function, and 3) the label (the name of a key in the dictionary of inputs that your deserialization function returns). Everything else can remain exactly the same.
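The relationship between the deserialization function and the label can be sketched as follows. This is not the TAPE code: real tasks parse TFRecord examples with TensorFlow, whereas this stand-in parses a JSON string so that it runs without extra dependencies. The key names `primary`, `protein_length` and `my_score` are assumptions chosen for illustration.

```python
import json
from typing import Dict, Union

# Hypothetical label key: the task's label must match a key in the returned dict.
LABEL = "my_score"


def deserialize(record: str) -> Dict[str, Union[str, int, float]]:
    """Stand-in deserialization function: turn one raw record into a dict of inputs."""
    raw = json.loads(record)
    return {
        "primary": raw["primary"],              # the sequence itself
        "protein_length": len(raw["primary"]),  # derived from the sequence
        LABEL: float(raw[LABEL]),               # the float the model regresses onto
    }


example = deserialize('{"primary": "MKV", "my_score": 1.5}')
print(example[LABEL])
```

The point is only that the label is looked up by name in whatever dictionary the deserialization function returns, so the two have to agree.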

Then, add your data to the same TAPE data folder (by default tape/data). The folder should have the same name as your task, but in snake case. So if your task is called MyFancyNewTask, you'll create a folder called my_fancy_new containing the files my_fancy_new_train.tfrecord and my_fancy_new_valid.tfrecord. Alternatively, you can just override the get_train_files and get_valid_files methods in your new task to return your data files, regardless of where they are or what they're called.
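Both options can be sketched with a small stand-in base class. The `Task` class below is a hypothetical simplification, not the real TAPE ABC, and the assumption that `get_train_files` and `get_valid_files` take a `data_folder` argument and return a list of paths should be checked against the source. The paths in `ElsewhereTask` are made up.

```python
from typing import List


class Task:
    """Hypothetical stand-in for the TAPE Task ABC; only file lookup is modelled."""

    def __str__(self) -> str:
        return "task"

    def get_train_files(self, data_folder: str) -> List[str]:
        # Default layout: {data_folder}/{name}/{name}_train.tfrecord
        return [f"{data_folder}/{self}/{self}_train.tfrecord"]

    def get_valid_files(self, data_folder: str) -> List[str]:
        return [f"{data_folder}/{self}/{self}_valid.tfrecord"]


class MyFancyNewTask(Task):
    """Option 1: rely on the default layout, driven by the task name."""

    def __str__(self) -> str:
        return "my_fancy_new"


class ElsewhereTask(Task):
    """Option 2: ignore the default layout and return explicit files."""

    def get_train_files(self, data_folder: str) -> List[str]:
        return ["/some/other/place/train.tfrecord"]

    def get_valid_files(self, data_folder: str) -> List[str]:
        return ["/some/other/place/valid.tfrecord"]


print(MyFancyNewTask().get_train_files("tape/data"))
print(ElsewhereTask().get_valid_files("tape/data"))
```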

Finally, import your task in TaskBuilder.py and add it to the dictionary at the beginning. You won't have to worry about main - everything should get auto-imported / understood.
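The registration step amounts to adding one entry to a name-to-class dictionary. The sketch below is a simplified stand-in, not the contents of TaskBuilder.py: the attribute name `tasks`, the method name `build_task`, and the empty task classes are all assumptions made for illustration.

```python
from typing import Dict, Type


class FluorescenceTask:
    """Placeholder for an existing task class."""


class MyFancyNewTask:
    """Placeholder for the newly added task class."""


class TaskBuilder:
    # Hypothetical registry mapping task names to task classes.
    tasks: Dict[str, Type] = {
        "fluorescence": FluorescenceTask,
        "my_fancy_new": MyFancyNewTask,  # the new entry
    }

    @classmethod
    def build_task(cls, name: str):
        """Instantiate the task registered under `name`."""
        try:
            return cls.tasks[name]()
        except KeyError:
            raise ValueError(f"Unrecognized task: {name}") from None


task = TaskBuilder.build_task("my_fancy_new")
print(type(task).__name__)
```

With a lookup like this, main only ever needs the task's name, which is why it does not have to change when a task is added.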

I realize that this is pretty unintuitive, and we are looking at releasing a new version that's easier to modify. That should be released in a few months.

@Kevin-chen-sheng

Have you successfully added a new task? @proto-tardigrade

@Kevin-chen-sheng

Sorry, there is no data folder under the tape folder (tape/data). I couldn't find it.
