A few questions about the usage #1
@jinserk Thank you for your interest in the project!

```python
attributes=[
    # each attribute is (name, dtype, shape)
    ('image', 'float32', (1, 28, 28)),
    ('target', 'int64', (1))
]
```

```python
traindata_saver({
    'image': image,
    'target': target
})
```

```python
traindata_saver({
    'image': image,
    'target': target
}, refresh=True)  # I will add this argument
```

Best regards,
Tae Hwan
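For completeness, here is a rough end-to-end sketch of how I imagine these pieces fitting together. The `DataConfig`/`DataSaver` names, the local MinIO endpoint and credentials, and the dummy `image`/`target` arrays are assumptions for illustration, not a verbatim copy of the API:

```python
import numpy as np
from matorage import DataConfig, DataSaver  # names assumed from README-style usage

# Assumed local MinIO endpoint and credentials -- replace with your own.
traindata_config = DataConfig(
    endpoint='127.0.0.1:9000',
    access_key='minio',
    secret_key='miniosecretkey',
    dataset_name='mnist_train',
    attributes=[
        ('image', 'float32', (1, 28, 28)),
        ('target', 'int64', (1))
    ]
)

traindata_saver = DataSaver(config=traindata_config)

# Dummy batch standing in for real MNIST tensors.
image = np.random.rand(64, 1, 28, 28).astype('float32')
target = np.random.randint(0, 10, size=(64,)).astype('int64')

traindata_saver({
    'image': image,
    'target': target
})
traindata_saver.disconnect()  # flush the buffered data to storage
```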
@graykode Thank you very much for the detailed answers! It's really helpful and very impressive. My first question was actually about heterogeneously shaped tensors, i.e. the case where the image size in your MNIST example could change from sample to sample. Practically, I'm working on a chemistry problem -- molecule classification in chemistry or pharma companies -- and the input features can be graphs whose sizes vary from molecule to molecule. I know this cannot simply be expressed with the attributes mechanism, and that's why I asked about sparse matrix support. I hope this can be implemented and usable soon! :)
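To make the graph case concrete, here is a tiny illustration (using scipy.sparse purely as an example representation, not something matorage supports today):

```python
import numpy as np
from scipy import sparse

# Two molecules with different atom counts: their adjacency matrices have
# different shapes, and most entries are zero.
adj_a = sparse.csr_matrix(np.array([[0, 1, 0],
                                    [1, 0, 1],
                                    [0, 1, 0]], dtype=np.int8))    # 3 atoms
adj_b = sparse.csr_matrix((np.ones(5, dtype=np.int8),
                           ([0, 1, 2, 3, 4], [1, 2, 3, 4, 0])),
                          shape=(5, 5))                            # 5 atoms

print(adj_a.shape, adj_b.shape)  # (3, 3) (5, 5) -- heterogeneous shapes
print(adj_a.nnz, adj_b.nnz)      # only a handful of non-zero entries each
```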
@jinserk I have a question: in order to train a PyTorch model, all input shapes must be the same, so I am curious how a tensor of heterogeneous shape can be fed into a model.
Good question. Basically I'm using a fixed input shape for the model. During training, I just pad the heterogeneously shaped inputs up to a fixed shape given by the maximum dimension values. I ran a quick test storing my whole dataset as padded dense matrices, and the stored file came out almost 100 times bigger, which is totally impractical.
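For reference, the padding I mean is roughly this (a minimal sketch; `MAX_NODES` and `pad_to_fixed` are just illustrative names, not part of my actual code):

```python
import torch
import torch.nn.functional as F

MAX_NODES = 64  # assumed maximum number of atoms across the dataset

def pad_to_fixed(adj: torch.Tensor) -> torch.Tensor:
    """Zero-pad an (n, n) adjacency matrix to (MAX_NODES, MAX_NODES)."""
    n = adj.size(0)
    # F.pad takes (left, right, top, bottom) padding for the last two dims.
    return F.pad(adj, (0, MAX_NODES - n, 0, MAX_NODES - n))

small = torch.ones(10, 10)
print(pad_to_fixed(small).shape)  # torch.Size([64, 64]), mostly zeros
```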
If so, how about storing in matorage the fixed-shape tensor itself that goes into the model's input? That is the core idea of matorage.
Thanks for the suggestion, @graykode! I will try that, since I don't know how well the compression will work. I once tried to store the fixed tensors, but the serialized file was almost 400 GB (used
As I understand it, HDF5 currently has no official support for sparse matrices. (This is not impossible to implement; there is actually an existing implementation at https://github.com/appier/h5sparse.) For that reason, the official PyTables documentation also recommends compression for sparse matrices. In fact, many sources recommend using compression with the HDF5 format (https://stackoverflow.com/a/25678471/5350490). According to that answer, because the matrix is sparse, 512 MB of original data can be compressed down to about 4.5 KB. So, could you run this experiment with your 400 GB of data and let me know the final compressed size? Please try to

In addition, apart from this, we will add a sparse-matrix mechanism to our long-term plans!!
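Here is a minimal sketch of the kind of comparison I'm asking for, using PyTables directly (the array shape and the zlib/level-9 settings are placeholders for illustration; matorage's own compression settings may differ):

```python
import os
import numpy as np
import tables

# A mostly-zero dense array stands in for the padded data.
data = np.zeros((100, 256, 256), dtype=np.float32)
data[:, :8, :8] = np.random.rand(100, 8, 8)

# Uncompressed array.
with tables.open_file('dense.h5', mode='w') as f:
    f.create_array(f.root, 'x', obj=data)

# Chunked, compressed array (zlib filter, compression level 9).
filters = tables.Filters(complevel=9, complib='zlib')
with tables.open_file('dense_zlib9.h5', mode='w') as f:
    f.create_carray(f.root, 'x', obj=data, filters=filters)

print(os.path.getsize('dense.h5') / 1e6, 'MB uncompressed')
print(os.path.getsize('dense_zlib9.h5') / 1e6, 'MB compressed')
```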
It's really fantastic! Thank you so much for sharing this project.

I had a quick test with a `minio` docker process and confirmed it works really well, as expected. I'd like to ask a few questions about the usage:

`DataLoader` and will load the data samples during the training -- so I wonder whether the loaded dataset size will be a multiple of the original dataset size per process or not.
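(For context, the loading pattern I have in mind looks roughly like this; `LazyRemoteDataset` is a hypothetical stand-in for matorage's dataset class, just to illustrate the per-worker memory question.)

```python
import torch
from torch.utils.data import Dataset, DataLoader

class LazyRemoteDataset(Dataset):
    """Hypothetical dataset that fetches one sample at a time on demand,
    rather than preloading everything into each worker process."""
    def __init__(self, num_samples: int):
        self.num_samples = num_samples

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # In the real setup this would read from the HDF5/MinIO backend;
        # here a synthetic sample keeps the sketch self-contained.
        return torch.randn(1, 28, 28), torch.tensor(idx % 10)

loader = DataLoader(LazyRemoteDataset(60000), batch_size=64, num_workers=4)
for images, targets in loader:
    break  # each worker should only hold the batches it is currently producing
```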