
iget_records leaks memory when column count 15,000+ #198

Open
sodiray opened this issue Mar 21, 2020 · 4 comments

sodiray commented Mar 21, 2020

The title is a little accusatory, but this is currently the most likely cause of an issue I'm having 🙏 😄

I have a client who uploads .xlsx files to a service that uses pyexcel to parse them. The client somehow managed to get a little over 16,000 columns into the latest sheet they uploaded, and our hosted server died. We're using iget_records along with free_resources. Reading the docs, this should let us hold a single row in memory at a time rather than reading the entire file at once (seen here and here)

The Issue

With .xlsx files having fewer than 200 columns, memory is managed correctly. Using a profiler, I can see that the block iterating the iget_records generator grows the process's memory by about one row, then releases it on the next iteration. However, when parsing a file with over 15,000 columns, profiling shows that the memory allocated for each row yielded by iget_records is not released at the end of the block. The process's memory soars past 3 GB in around 20 seconds.
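A stdlib-only sketch of how this kind of check can be done with `tracemalloc` (the function name and the per-row stand-in work are illustrative; our actual profiler setup differs): if iteration is truly streaming, the peak should stay near the size of a single row rather than growing with the row count.

```python
import tracemalloc

def peak_bytes_while_consuming(rows):
    """Drain an iterable of rows and report (widest row, peak traced bytes).

    With genuine row-at-a-time streaming, the peak should be on the order
    of one row, not the whole sheet."""
    tracemalloc.start()
    try:
        widest = 0
        for row in rows:
            widest = max(widest, len(row))  # stand-in for real row handling
        _, peak = tracemalloc.get_traced_memory()
        return widest, peak
    finally:
        tracemalloc.stop()
```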

Reproducibility

My powers that be aren't as hip on open source as you --- my thanks and apologies to you for your hard work --- so I can't post exactly what we have. I wanted to get this posted to start the dialog and get your feedback, @chfw. I'm going to start working on a small reproducible script/test.

Similar Issue

I see #131 has a similar story. However, it refers to iget_array --- possibly it's the same situation for iget_records?


chfw commented Mar 21, 2020

Could you experiment with get_records() and see if the memory issue persists?


sodiray commented Mar 31, 2020

To wrap this up: I did try get_records and got the same result (which I believe was expected). Delving deeper into the docs, I found that iget_array isn't really meant to be performant 🤷‍♂️, which is OK. I ended up using xlrd to get the dimensions of the Excel sheet and raising an error if it's over an accepted range (250 columns). So... a non-issue.
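Something like this sketch of the workaround, assuming an xlrd that can open the file (`open_workbook`, `sheet_names`, `sheet_by_name`, `ncols`, `unload_sheet`, and `release_resources` are real xlrd APIs; the 250-column threshold, function names, and error message are illustrative):

```python
MAX_COLUMNS = 250  # illustrative threshold matching the comment above

def exceeds_limit(ncols, max_columns=MAX_COLUMNS):
    """Pure check, split out so the guard is testable without a workbook."""
    return ncols > max_columns

def check_sheet_width(path, max_columns=MAX_COLUMNS):
    """Open the workbook with xlrd only to inspect its dimensions, and
    raise before handing an absurdly wide sheet to the real parser."""
    import xlrd  # deferred so the pure check above stays importable

    book = xlrd.open_workbook(path, on_demand=True)  # lazy sheet loading
    try:
        for name in book.sheet_names():
            sheet = book.sheet_by_name(name)
            if exceeds_limit(sheet.ncols, max_columns):
                raise ValueError(
                    f"sheet {name!r} has {sheet.ncols} columns; "
                    f"the limit is {max_columns}"
                )
            book.unload_sheet(name)  # keep memory flat while scanning
    finally:
        book.release_resources()
```

Note that xlrd 2.x dropped .xlsx support, so this relies on an older xlrd (or an equivalent dimension check in another reader).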


chfw commented Apr 1, 2020

Would you like to try: https://github.com/pyexcel/pyexcel-xlsxr?


chfw commented Jun 8, 2020

Any fixtures for me to reproduce?
