ch06/03_bonus_imdb-classification #155
Argh, I updated this last week; I must have forgotten to push the changes. I think it should all be addressed now.
I have pulled every commit up to now and am currently testing on Windows and in a Docker container. For me, the test and validation files are corrupt; downloading on Windows already runs into problems in the reporthook:
Update for Windows: this line of code fixes the download issue: `speed = int(progress_size / (1024 * duration)) if duration else 0`
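For context, here is a minimal sketch of a urllib-style reporthook with that guard applied; everything apart from the fixed line is an assumption about the surrounding code, not the repo's exact implementation:

```python
import time
import urllib.request

start_time = time.time()

def reporthook(count, block_size, total_size):
    duration = time.time() - start_time
    progress_size = count * block_size
    # On Windows the first callback can fire with duration == 0,
    # so guard the division to avoid a ZeroDivisionError:
    speed = int(progress_size / (1024 * duration)) if duration else 0
    percent = min(int(count * block_size * 100 / total_size), 100) if total_size > 0 else 0
    print(f"\r{percent}% | {speed} KB/s | {duration:.1f} sec elapsed", end="")

# urllib.request.urlretrieve(url, filename, reporthook)
```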
Worked on Windows now, same with Docker running Ubuntu. The issue was that the script created broken test and validation sets. The split takes 5-10 minutes to run properly, even though it seems to use only a small amount of resources on my PC; this should be reflected in the README, imho. I think it takes so long because of the text data, which is not ideal for a pandas DataFrame. Maybe there is a way to speed up this splitting process in `download-prepare-dataset.py`.
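One possible speed-up, sketched under the assumption that the split shuffles a pandas DataFrame: permute the index once with NumPy and slice with `.iloc`, rather than moving the text rows around repeatedly. The column names and split fractions here are illustrative, not the script's actual values.

```python
import numpy as np
import pandas as pd

def train_validation_test_split(df, train_frac=0.7, val_frac=0.1, seed=123):
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(df))        # shuffle row positions once
    train_end = int(len(df) * train_frac)
    val_end = train_end + int(len(df) * val_frac)
    # A single .iloc take per split avoids repeated copying of the text column
    return (df.iloc[indices[:train_end]],
            df.iloc[indices[train_end:val_end]],
            df.iloc[indices[val_end:]])
```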
Thanks for testing. I will add the line later and investigate more. On my laptop, the whole thing didn't take more than 40 sec (37 sec to be precise, see below), so maybe there's still something odd going on on Windows. EDIT: That's 20.16 sec for downloading and 17 sec for processing, hence the ~37 sec in total.
As far as I understand, it automatically truncates the input to 512 tokens (BERT doesn't support longer inputs). I can see if I can truncate it manually to suppress the message.
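If the message comes from a Hugging Face tokenizer, explicit truncation usually silences it. A minimal sketch; the model name is just an example, not necessarily the one the script uses:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer(
    "an example movie review ...",
    truncation=True,   # truncate explicitly instead of relying on the default
    max_length=512,    # BERT's maximum supported input length
)
```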
Can you please also fix the README accordingly (val.csv -> validation.csv)?
Oh wow, that should be a solid setup indeed. Out of curiosity, do you have an SSD or an HDD? This could maybe explain the difference. Just a hunch.
Even an M.2 SSD (Crucial P5 Plus 2TB) on an X470 mainboard. I also had no other programs actively running in parallel when testing, so it must be something related to Windows.
I updated the code via #156 as you suggested. Regarding the warning: I think it's a spurious warning that gets triggered even though there is no sequence longer than 256 tokens; I double-checked that. My guess is that it sees token IDs with large values and then assumes there could potentially be longer sequences, but that's not true.
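For readers who want to reproduce that check, here is one way to measure the maximum tokenized length; the tokenizer choice, file name, and column name are assumptions:

```python
import pandas as pd
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
df = pd.read_csv("train.csv")
# Longest tokenized sequence in the dataset
max_len = max(len(tokenizer.encode(t)) for t in df["text"])
print(max_len)  # if this is <= 256, the warning is indeed spurious
```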
That's an interesting one, actually. I think it would be a good comparison for the GPT model as well to alter the attention mask such that it ignores the padding tokens. This is a bit more complicated, as it will require some modifications to the attention implementation.
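A minimal sketch of what such a padding-aware mask could look like for a GPT-style model; the pad token ID and the boolean mask convention are assumptions, not the book's actual code:

```python
import torch

def make_attention_mask(input_ids, pad_token_id=50256):
    # True where attention is allowed; padding positions are masked out
    padding_mask = input_ids != pad_token_id              # (batch, seq_len)
    seq_len = input_ids.size(1)
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Shape (batch, 1, seq_len, seq_len), broadcastable across heads;
    # usable e.g. as attn_mask in torch.nn.functional.scaled_dot_product_attention
    return (causal & padding_mask.unsqueeze(1)).unsqueeze(1)
```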
Oh I see, I thought you had run it in Docker previously when you reported the Windows slowness. And wow, the Ubuntu run via Docker looks super slow as well. You mentioned it took 5-10 minutes on Ubuntu without Docker previously? The increase from 5-10 min to 25 min I can perhaps understand, but even 5-10 min on Ubuntu sounds slow. When I run it on Google Colab or Lightning Studios (both use Ubuntu), it's maybe ~2-3 min. I'm curious, how long does it take to just unzip the downloaded archive?
Thanks for the details. That's interesting: so basically most of the time is spent on the unzipping (3 out of the 3-5 min on Windows). In my case it was a bit quicker: 29 seconds on my laptop (macOS) and 5 seconds on Google Colab (Ubuntu). So maybe the Windows filesystem is not ideal for this large number of small files.

Yes, it's a lot of files, I think 50k based on the description: https://ai.stanford.edu/~amaas/data/sentiment/

This was the dataset I originally used for Chapter 6, but I already had a suspicion that it might test the readers' patience 😅, which is why I swapped it for a smaller one that is easier to work with.

So, I think there is fundamentally no issue anymore after adding your fixes, correct? I will close this issue then. (But please correct me if I'm wrong, and thanks for these additional insights on the runtimes!)
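For reference, a quick way to time just the extraction step in isolation; the archive name matches the Stanford download page, but treat this as a sketch rather than the repo's exact code:

```python
import tarfile
import time

start = time.time()
with tarfile.open("aclImdb_v1.tar.gz", "r:gz") as tar:
    tar.extractall()  # ~50k small files; this step is filesystem-bound
print(f"Extraction took {time.time() - start:.1f} sec")
```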
This might be still WIP, but I have issues reproducing the output in `ch06/03_bonus_imdb-classification`:

- `gpt_download.py` and `previous_chapters.py` are missing in the folder, therefore I cannot run `python train-gpt.py` as instructed in the README.
- `python download-prepare-dataset.py` does not correctly create the test and validation sets (the train set seems to be fine, though).
- After copying the missing files from `ch06/02_bonus_additional-experiments` to `ch06/03_bonus_imdb-classification`, running `python train-gpt.py` results in a val loss of NaNs; same for `python train-bert-hf.py` and `python train-sklearn-logreg.py`.
- Instead of `val.csv` it should be `validation.csv` in `train-sklearn-logreg.py` (as defined in `download-prepare-dataset.py`).
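Since the NaN val loss traced back to broken test/validation files, a quick sanity check on the generated CSVs can catch this early. A minimal sketch, assuming the filenames mentioned above:

```python
import pandas as pd

for name in ("train.csv", "validation.csv", "test.csv"):
    df = pd.read_csv(name)
    # Corrupt or partially written files typically surface as NaN rows
    assert not df.isna().any().any(), f"{name} contains NaN values"
    print(name, df.shape)
```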