Skip to content

chore!: replace direct download of nyc_taxi with hvsampledata#1462

Merged
hoxbro merged 24 commits intomainfrom
hvsampledata
Dec 1, 2025
Merged

chore!: replace direct download of nyc_taxi with hvsampledata#1462
hoxbro merged 24 commits intomainfrom
hvsampledata

Conversation

@Azaya89
Copy link
Contributor

@Azaya89 Azaya89 commented Sep 26, 2025

This PR replaces the direct download of the large nyc_taxi data with the version from hvsampledata.

Testing with an editable install of hvsampledata

import hvsampledata as hvs
import datashader as ds
import pandas as pd

path = hvs.download("nyc_taxi_remote")
df = pd.read_parquet(path)
print(df.head())

agg = ds.Canvas().points(df, 'dropoff_x', 'dropoff_y')
print(agg.shape)
   tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  ...  tip_amount  dropoff_hour  pickup_hour
0  2015-01-15 19:05:39   2015-01-15 19:23:42                1  ...        3.25            19           19
1  2015-01-10 20:33:38   2015-01-10 20:53:28                1  ...        2.00            20           20
2  2015-01-10 20:33:38   2015-01-10 20:43:41                1  ...        0.00            20           20
3  2015-01-10 20:33:39   2015-01-10 20:35:31                1  ...        0.00            20           20
4  2015-01-10 20:33:39   2015-01-10 20:52:58                1  ...        0.00            20           20

[5 rows x 12 columns]
(600, 600)

@Azaya89 Azaya89 requested a review from Copilot September 26, 2025 13:26

This comment was marked as outdated.

@Azaya89 Azaya89 self-assigned this Sep 26, 2025
Copy link
Member

@hoxbro hoxbro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should remove pyct + setuptools from pyproject.toml + pixi.toml

"import hvsampledata as hvs\n",
"\n",
"df = pd.read_csv('../data/nyc_taxi.csv', usecols=['dropoff_x', 'dropoff_y'])\n",
"df = hvs.nyc_taxi(\"pandas\", engine_kwargs={\"columns\": ['dropoff_x', 'dropoff_y']})\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"df = hvs.nyc_taxi(\"pandas\", engine_kwargs={\"columns\": ['dropoff_x', 'dropoff_y']})\n",
"df = hvs.nyc_taxi_remote(\"pandas\", engine_kwargs={\"columns\": ['dropoff_x', 'dropoff_y']})\n",

Also, mention that this is the first time this cell is run.

pixi.toml Outdated
setuptools = "*" # distutils for pyct
toolz = "*"
xarray = "*"
fastparquet = "*"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check that this has not indirectly pinned anything like pandas.

@codspeed-hq
Copy link

codspeed-hq bot commented Sep 26, 2025

CodSpeed Performance Report

Merging #1462 will improve performances by 48.77%

Comparing hvsampledata (22e7fc1) with main (a5ae17c)

Summary

⚡ 2 improvements
✅ 41 untouched

Benchmarks breakdown

Mode Benchmark BASE HEAD Change
Simulation test_layout[forceatlas2_layout] 63.9 ms 56 ms +14.11%
Simulation test_quadmesh_rectilinear[1024] 137 ms 92.1 ms +48.77%

@hoxbro hoxbro changed the title chore: replace direct download of nyc_taxi with hvsampledata chore!: replace direct download of nyc_taxi with hvsampledata Sep 27, 2025
Copy link
Member

@maximlt maximlt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be first deprecated before being removed https://holoviz.org/about/heps/hep2.html.

@hoxbro
Copy link
Member

hoxbro commented Oct 9, 2025

I think this should be first deprecated before being removed holoviz.org/about/heps/hep2.html.

Are you okay with not having pyct as a required dependency? but still having the entrypoint and explaining that:

  1. the entrypoint is deprecated, but users need to install pyct for this to work and,
  2. maybe guide people to use hvsampledata if they want to download data?

This is also a little bit weird, because we don't break running code, but "only" the CLI.

@maximlt
Copy link
Member

maximlt commented Oct 10, 2025

The CLI is part of public interface so I think it deserves the same treatment as the API. I don't see an urgent reason to immediately remove pyct from the dependencies. I'm in favor of simply adding deprecation warnings, and removing the whole thing in a few releases.

from . import transfer_functions as tf # noqa (API import)
from . import data_libraries # noqa (API import)

with suppress(ImportError):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the right place it should be in main.py.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK!

'numpy',
'pandas',
'param',
'pyct',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think pyct should be removed yet from the dependencies.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not removed totally. Just moved to optional dependencies.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With these changes if I install datashader and run datashader example it breaks. So that's a breaking change. The goal is to deprecate it so people get a chance to stop using datashader example. Then pyct can be removed entirely from the dependencies.

@codecov
Copy link

codecov bot commented Oct 15, 2025

Codecov Report

❌ Patch coverage is 62.50000% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.31%. Comparing base (a5ae17c) to head (22e7fc1).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
datashader/__init__.py 71.42% 4 Missing ⚠️
datashader/__main__.py 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1462      +/-   ##
==========================================
- Coverage   88.34%   88.31%   -0.03%     
==========================================
  Files          96       96              
  Lines       18932    18943      +11     
==========================================
+ Hits        16725    16730       +5     
- Misses       2207     2213       +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Azaya89 Azaya89 marked this pull request as ready for review October 15, 2025 17:32
@Azaya89
Copy link
Contributor Author

Azaya89 commented Oct 15, 2025

I added a temporary patch to allow the CI to pass pending a new hvsampledata release but otherwise I think it's ready for review now.

@Azaya89 Azaya89 requested review from Copilot, hoxbro and maximlt October 15, 2025 17:37

This comment was marked as spam.

@holoviz holoviz deleted a comment from Copilot AI Oct 15, 2025
pyproject.toml Outdated
# 2025-09
"ignore:Signature .* for <class 'numpy.longdouble'>.*:UserWarning",
# 2025-10
"ignore:The 'pyct' package bundled as a datashader dependency is deprecated since version 0.19 and will be removed in version 0.20.*" # https://github.com/holoviz/datashader/pull/1462
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should remove this filtering. Running datashader in Python (and not CLI) should never show this warning.

@hoxbro hoxbro added this to the v0.19.0 milestone Nov 13, 2025
"import hvsampledata as hvs\n",
"import hvsampledata as hvs, pandas as pd\n",
"\n",
"df = hvs.nyc_taxi_remote(\"pandas\", engine_kwargs={\"columns\": ['dropoff_x', 'dropoff_y']})\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let us keep this.

@Azaya89 Azaya89 requested a review from hoxbro November 26, 2025 14:43
pixi.toml Outdated
setuptools = "*" # distutils for pyct
toolz = "*"
xarray = "*"
pyarrow = "*"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be added back to where it previously was.

Comment on lines +4 to +9
with suppress(ImportError):
import pyct.cmd
from datashader import _warn_pyct_deprecated

_warn_pyct_deprecated(stacklevel=1)
pyct.cmd.fetch_data(name="data", path="examples", datasets="datasets.yml")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scripts directory is for internal use only; please remove this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You did not remove this...


def _warn_pyct_deprecated(stacklevel=2):
warnings.warn(
"The 'pyct' package is deprecated since version 0.19 "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a user saw this warning, do you think they would understand it?

I will also move warnings import into the function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a user saw this warning, do you think they would understand it?

Reads simple enough for me, yeah.

I will also move warnings import into the function.

OK.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me rephrase it: what does the pyct package do, and why is it affecting the deprecated functions? Without looking at the code, it seems very confusing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, OK I get it now. A user wouldn't have used pyct directly and would have been confused seeing the mention of it in the message. Will rephrase to mention the actual functions being called.

"""Wrapper to add deprecation warning to pyct functions."""
@wraps(func)
def wrapper(*args, **kwargs):
_warn_pyct_deprecated(stacklevel=2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if these stack levels are set correctly, but they are likely still emitted because the category is a FutureWarning.

You can test this out by setting the CategoryWarning to DeprecationWarning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. Thanks. The correct stacklevel is 3 in this instance. I tested it via

import datashader as ds

ds.fetch_data("datashader")

@Azaya89 Azaya89 requested a review from hoxbro November 27, 2025 16:54
Copy link
Member

@hoxbro hoxbro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small tweaks.

Will have @maximlt have the final say on the deprecations.

"The file `nyc_taxi.csv` used above is located at\n",
"[nyc_taxi.csv](https://github.com/holoviz/datashader/blob/main/examples/data/.data_stubs/nyc_taxi.csv) in the Datashader repository. When running this example, make sure the file is available locally and update the path accordingly.\n",
":::\n"
"The first time this cell is run, it may take some time for the NYC taxi dataset to be downloaded remotely (about 260MB).\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"The first time this cell is run, it may take some time for the NYC taxi dataset to be downloaded remotely (about 260MB).\n",
"The first time this cell is run, it will download the NYC taxi dataset (about 260MB).\n",


# make pyct's example/data commands available if possible
from functools import partial
from functools import partial, wraps
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's del wraps as we did with the others.

Comment on lines 5 to 11
with suppress(ImportError):
import pyct # noqa: F401

from . import _warn_pyct_deprecated
_warn_pyct_deprecated(stacklevel=3)
try:
import pyct.cmd
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
with suppress(ImportError):
import pyct # noqa: F401
from . import _warn_pyct_deprecated
_warn_pyct_deprecated(stacklevel=3)
try:
import pyct.cmd
try:
import pyct.cmd
from . import _warn_pyct_deprecated
_warn_pyct_deprecated(stacklevel=3)

pixi.toml Outdated
shapely = ">=2.0.0"
spatialpandas = "*"
streamz = "*"
hvsampledata = ">=0.1.4"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
hvsampledata = ">=0.1.4"
hvsampledata = ">=0.1.5a3"

Let's use a dev version to clearly indicate which version works here. We can remove the dev version once 0.1.5 is released.


warnings.warn(
"The 'fetch_data()', 'copy_examples()', and 'examples()' functions are "
"deprecated since version 0.19 and will be removed in version 0.20. "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the release candidacy of datashader, I feel like 0.20 would be released at least 6 months after 0.19.

Seems like each minor release has been a year in between (ignoring 0.17.0, which did not correctly drop a Python version).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so do you suggest we remove it in 0.19.xx, and modify the message to that effect?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. I suggest 0.20, and not 0.21. We would never drop something in a patch release.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, the message is correct then. It says 0.20 already.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but still want to highlight it as we normally do two minor releases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, as per HEP2, there's no requirement to indicate the removal version:

image

https://holoviz.org/about/heps/hep2.html is worth a read if you haven't done it yet.

@hoxbro hoxbro merged commit 9e30db1 into main Dec 1, 2025
14 checks passed
@hoxbro hoxbro deleted the hvsampledata branch December 1, 2025 07:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants