Training Dataset #28

Open
A-BigBao opened this issue Aug 15, 2022 · 20 comments
Labels
enhancement New feature or request

Comments

@A-BigBao

Is "A larger dataset" the training dataset? When will the data be released? The training data is essential for reproducing the model.

@guolinke
Member

The dataset is very large, and we are looking for a data hosting solution. Last week we submitted a request through the "AWS Open Data Sponsorship Application", but we haven't received a response yet.

@lhatsk

lhatsk commented Aug 18, 2022

In the meantime, it would be great if you could upload the scripts that generate the training features. Unfortunately, AFAICT they are missing. I'm especially interested in training the multimer variant. Thanks!

@guolinke
Member

The multimer features are mostly the same as the monomer ones, except for the assembly of multiple chains.
You can refer to this script, https://github.com/dptech-corp/Uni-Fold/blob/main/scripts/get_pdb_assembly.py, to generate the "pdb_assembly.json" we used.

ZiyaoLi added the enhancement (New feature or request) label Sep 8, 2022
ZiyaoLi pinned this issue Oct 8, 2022

@DimaMolod

I am trying to download the "Full training dataset" using modelscope, but MsDataset.load() doesn't work for me because the connection gets reset by the peer. The last message I get is:

File "/home/dmolodenskiy/.conda/envs/py38/lib/python3.8/site-packages/requests/models.py", line 818, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

@guolinke
Member

@DimaMolod did it happen at the very beginning, or when the download was already in progress?

@DimaMolod

DimaMolod commented Nov 18, 2022

Hi @guolinke,
it happens after 10-20 minutes of hanging. It seems to be trying to connect during this time, and the error message finally pops up once the connection times out.
The modelscope directory has been created with the following structure:

modelscope/
    hub/
        datasets/
            downloads/
                DPTech/
                    Uni-Fold-Data/
                        master/
                            Uni-Fold-Data.json
                            dataset_infos.json

Thanks for your help!

(I'll also copy the last few messages from Python here in case you find them useful.)

>>> ds = MsDataset.load(dataset_name='Uni-Fold-Data', namespace='DPTech', split='train')
2022-11-18 09:49:40,975 - modelscope - WARNING - Reusing dataset Uni-Fold-Data's python file (modelscope/hub/datasets/downloads/DPTech/Uni-Fold-Data/master/Uni-Fold-Data.json)
2022-11-18 09:49:41,498 - modelscope - WARNING - Reusing dataset Uni-Fold-Data's python file (modelscope/hub/datasets/downloads/DPTech/Uni-Fold-Data/master/dataset_infos.json)
2022-11-18 09:49:41,499 - modelscope - INFO - No subset_name specified, defaulting to the default

@lhatsk

lhatsk commented Nov 18, 2022

I have the same issue. After retrying, I now get:

RequestError: {'status': -2, 'x-oss-request-id': '', 'details': "RequestError: HTTPSConnectionPool(host='dataset-hub.oss-cn-hangzhou.aliyuncs.com', port=443): Max retries exceeded with url: /public-unzip-dataset%2FDPTech%2FUni-Fold-Data%2Fmaster%2Fdatasets%2Fpdb_features%2F1e0z_A.feature.pkl.gz (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x2b2f5d9bdd50>: Failed to establish a new connection: [Errno -2] Name or service not known'))"}

Does this include the training data for multimer?
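
For anyone debugging the same error: the `[Errno -2] Name or service not known` part means the hostname could not be resolved at all, so the failure happens before any data is transferred. Below is a minimal sketch for checking name resolution from the machine that runs the download. The helper name `can_resolve` is made up for this example, and the host is simply the one quoted in the error above.

```python
import socket


def can_resolve(hostname: str, port: int = 443) -> bool:
    """Return True if the local resolver can map hostname to an address."""
    try:
        socket.getaddrinfo(hostname, port)
        return True
    except socket.gaierror:
        # Same class of failure as "[Errno -2] Name or service not known".
        return False


# Usage (performs a DNS lookup, so it needs network access):
# host = "dataset-hub.oss-cn-hangzhou.aliyuncs.com"
# print(host, "resolves" if can_resolve(host) else "does NOT resolve")
```

If the host does not resolve, the problem is probably on the DNS or network side (for example a cluster node without external name resolution) rather than in the dataset itself.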

@guolinke
Member

We will report the issues to the modelscope team. And yes, the multimer data is included.

@guolinke
Member

The problem is caused by an unstable network connection, as the data is hosted in China. The modelscope team has promised to fix it within the next two weeks.
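
Until that is fixed, one possible workaround is to wrap the download in a retry loop with exponential backoff. This is only a sketch under assumptions: `retry_with_backoff` and `load_unifold_data` are names invented for this example, the `MsDataset.load` arguments are copied from the session log earlier in this thread, and whether modelscope reuses partially downloaded files between attempts has not been verified here.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay_s: float = 10.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn() until it succeeds, doubling the wait after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on as err:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay_s * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed ({err!r}); retrying in {delay:.0f}s")
            sleep(delay)
    raise RuntimeError("max_attempts must be at least 1")


def load_unifold_data():
    # Same call as in the session log above; requires `pip install modelscope`.
    from modelscope.msdatasets import MsDataset

    return MsDataset.load(
        dataset_name="Uni-Fold-Data", namespace="DPTech", split="train"
    )


# Usage (starts the real download, so it is not run here):
# ds = retry_with_backoff(load_unifold_data)
```

The `sleep` parameter exists only so the helper can be exercised without actually waiting.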

@DimaMolod

Thanks! In the meantime, could you provide a script to generate the training dataset directory from scratch (from the downloaded databases)? I couldn't find one in the scripts directory.

@guolinke
Member

@DimaMolod The data generation code is almost the same as the one used in inference, except for the label extraction from mmCIF files. @ZiyaoLi maybe we can add a script for the mmCIF processing.

BTW, our data generation code relies heavily on cloud services (mostly Ali-cloud), because it is impossible to generate the data on a single machine. In particular, it took us several months on hundreds of machines to generate this data. Therefore, we think it is not realistic to generate the data from scratch.
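
To illustrate what "label extraction from mmCIF" could look like in its simplest form, here is a rough standard-library sketch that reads the `_atom_site` loop of an mmCIF file and collects C-alpha coordinates per chain. This is not the Uni-Fold script: the function name `ca_coords_from_mmcif` and the choice of labels (CA atoms of the first model only) are assumptions made for illustration, and the actual training labels may contain more than this.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]
Residue = Tuple[int, str, Coord]  # (label_seq_id, residue name, CA xyz)


def ca_coords_from_mmcif(text: str) -> Dict[str, List[Residue]]:
    """Collect C-alpha coordinates per chain from the _atom_site loop."""
    columns: List[str] = []
    chains: Dict[str, List[Residue]] = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("_atom_site."):
            columns.append(line.split(".", 1)[1])  # e.g. "Cartn_x"
            continue
        if not columns:
            continue  # the _atom_site loop has not started yet
        if not line or line[0] in "#_" or line.startswith("loop_"):
            break  # first non-data line ends the _atom_site loop
        row = dict(zip(columns, line.split()))
        if row["group_PDB"] != "ATOM" or row["label_atom_id"] != "CA":
            continue  # skip heteroatoms and non-CA atoms
        if row.get("pdbx_PDB_model_num", "1") != "1":
            continue  # first model only
        if row.get("label_alt_id", ".") not in (".", "?", "A"):
            continue  # keep a single alternate location
        xyz = (float(row["Cartn_x"]), float(row["Cartn_y"]), float(row["Cartn_z"]))
        chains[row["label_asym_id"]].append(
            (int(row["label_seq_id"]), row["label_comp_id"], xyz)
        )
    return dict(chains)


# Usage (the file name is a hypothetical local path):
# with open("1e0z.cif") as handle:
#     labels = ca_coords_from_mmcif(handle.read())
```

A dedicated parser such as the one in Biopython would be more robust for real files; the plain-text version above is only meant to show the idea without extra dependencies.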

@lhatsk

lhatsk commented Dec 12, 2022

Any news?

@guolinke
Member

@lhatsk we are waiting for the fix from the modelscope team and will post updates here.

@WeianMao

I fixed the bug; please refer to modelscope/modelscope#51.
@guolinke @lhatsk

@lhatsk

lhatsk commented Dec 16, 2022

Thanks! Unfortunately, it still doesn't work for me.

RequestError: {'status': -2, 'x-oss-request-id': '', 'details': "RequestError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))"}

@WeianMao

@lhatsk @guolinke I succeeded yesterday, but failed today. It seems the server is unstable. Is it possible to download the dataset from Baidu drive? This issue has been open for too long.

@DimaMolod

Thank you. It would be very useful if you could upload a script for the label extraction from mmCIF files.

@guolinke
Member

guolinke commented Jan 6, 2023

@teslacool can you merge the code into this repo?

@dingquanyu

dingquanyu commented Jan 13, 2023

Hi,

I managed to resolve the 104 error shown above but then this ReadTimeoutError was reported. Could you maybe increase your default timeout from 60s to something longer?

Thanks a lot.

HTTPConnectionPool(host='www.modelscope.cn', port=80): Max retries exceeded with url: /api/v1/datasets/DPTech/Uni-Fold-Data/oss/tree/?MaxLimit=-1&Revision=master&Recursive=True&FilterDir=True (Caused by ReadTimeoutError("HTTPConnectionPool(host='www.modelscope.cn', port=80): Read timed out. (read timeout=60)"))
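
As a side note for others seeing this timeout: one way to tell whether the endpoint is merely slow or not answering at all is to request it directly with a longer timeout than the 60 s in the error above. The sketch below uses only the standard library; the helper name `probe` and the 300 s default are arbitrary choices for this example, and it does not change the timeout used inside modelscope.

```python
import urllib.error
import urllib.request
from typing import Tuple


def probe(url: str, timeout_s: float = 300.0) -> Tuple[bool, str]:
    """Try to open url, allowing up to timeout_s for each blocking socket call."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return True, f"HTTP {resp.status}"
    except (urllib.error.URLError, OSError) as err:
        # Covers refused connections, DNS failures, HTTP errors, and timeouts.
        return False, repr(err)


# Usage (the URL is the one from the error message above; needs network access):
# url = ("http://www.modelscope.cn/api/v1/datasets/DPTech/Uni-Fold-Data/oss/tree/"
#        "?MaxLimit=-1&Revision=master&Recursive=True&FilterDir=True")
# print(probe(url))
```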

@guolinke
Member

@henrywotton you can report the issue to https://github.com/modelscope/modelscope

ZiyaoLi unpinned this issue Sep 21, 2023
PKUfjh pushed a commit to PKUfjh/Uni-Fold that referenced this issue May 17, 2024
* parse_a3m_fast
* fix typo
* rewrite
* advance
* accel make msa feats
* change default fast
* fix

Co-authored-by: ziyao
Co-authored-by: Ziyao Li

7 participants