This repository has been archived by the owner on Jul 3, 2023. It is now read-only.

Adds first scenario for feature engineering examples #311

Open. Wants to merge 7 commits into base: main.

Conversation

@skrawcz (Collaborator) commented Feb 14, 2023

This example shows how you can write feature definitions once in Hamilton and use them both in an offline (training) setting and in an online (inference) setting.

Assumptions:

  • the API request can provide the same raw data that training uses.
  • if you have aggregation features, you need to store the values computed during training and provide them to the online side.
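The two assumptions above can be sketched without Hamilton's driver at all. The following is a minimal illustration of the pattern using plain pandas; the feature names (`age_mean`, `age_std_dev`, `age_normalized`) and the sample data are hypothetical, not taken from the PR's code:

```python
import pandas as pd

# Hypothetical feature definitions, written once and shared by the
# offline (training) and online (inference) paths.
def age_mean(age: pd.Series) -> float:
    """Aggregation feature: mean age over the training set."""
    return float(age.mean())

def age_std_dev(age: pd.Series) -> float:
    """Aggregation feature: std dev of age over the training set."""
    return float(age.std())

def age_normalized(age: pd.Series, age_mean: float, age_std_dev: float) -> pd.Series:
    """Row-level feature: z-scored age."""
    return (age - age_mean) / age_std_dev

# Offline: aggregates are computed from the full training set and stored.
train = pd.DataFrame({"age": [20.0, 30.0, 40.0]})
stored_mean = age_mean(train["age"])    # 30.0
stored_std = age_std_dev(train["age"])  # 10.0 (pandas .std() uses ddof=1)

# Online: the API request supplies the same raw data; the stored
# aggregates are provided rather than recomputed from a single row.
request_age = pd.Series([25.0])
feature = age_normalized(request_age, stored_mean, stored_std)  # -0.5
```

The key point is that `age_normalized` is the same function in both settings; only where its aggregate inputs come from differs.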

Changes

  • adds feature_engineering folder to examples
  • adds scenario 1

How I tested this

  • ran this code locally

Notes

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passes the pre-commit check and is left cleaner than when first encountered
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

@skrawcz skrawcz changed the title Adds basic scenario 1 for feature engineering Adds examples for feature engineering Feb 14, 2023
@skrawcz skrawcz changed the title Adds examples for feature engineering Adds first scenario for feature engineering examples Feb 19, 2023
@skrawcz skrawcz marked this pull request as ready for review February 19, 2023 22:35
@elijahbenizzy (Collaborator) left a comment


Good start -- I don't think this is going to be clear to most people who haven't really dug into this. A few thoughts:

  1. We can clarify the wording/make it crisper to specify why this is a problem, how it's normally done, and why Hamilton alleviates it
  2. We can give more context about what we're doing here/why it's in an online context
  3. We can root it in tooling that might be familiar to them. While loading fake models/whatnot makes sense, I think it's going to confuse the users. So either load from a model/feature store they're used to, or (more likely) abstract it away and make it very clear that it could be implemented in many different ways.

This stuff is natural to us as we've been building online/batch inference/training tooling for years, but I think this will be extremely complex to most people out there, and fall flat. Hamilton is simple enough and makes this easy enough that this is a good chance to capture market share, but to do so we need to really hammer home a pattern and a motivation.

examples/feature_engineering/README.md (outdated review thread, resolved)
examples/feature_engineering/scenario_1/constants.py (outdated review thread, resolved)
This example shows how one might use Hamilton to compute features in both an offline and an online fashion. The assumption here is that the request passed into the API has all the raw data required to compute features.

This example also shows how one might "override" some values that are required for computing features; in this example they are `age_mean` and `age_std_dev`. This can be required when computing aggregation features does not make sense at inference time.
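To make "override" concrete, here is a toy dependency resolver in the spirit of Hamilton's driver. Hamilton's real `Driver.execute` resolves a full DAG and accepts an `overrides=` keyword; this sketch, and every name in it, is illustrative only. The precedence rule it demonstrates is the one described above: overridden values win over recomputation, which is how stored training-time aggregates like `age_mean` get injected at inference time.

```python
import inspect

def execute(funcs, outputs, inputs, overrides=None):
    """Toy resolver: compute requested outputs from feature functions
    by name. Overrides and inputs take precedence over recomputation."""
    cache = dict(inputs)
    cache.update(overrides or {})
    by_name = {f.__name__: f for f in funcs}

    def resolve(name):
        if name not in cache:
            f = by_name[name]
            # Parameter names refer to other features or raw inputs.
            args = [resolve(p) for p in inspect.signature(f).parameters]
            cache[name] = f(*args)
        return cache[name]

    return {name: resolve(name) for name in outputs}

# Hypothetical feature definitions (population std, ddof=0, for simplicity).
def age_mean(age):
    return sum(age) / len(age)

def age_std_dev(age, age_mean):
    return (sum((a - age_mean) ** 2 for a in age) / len(age)) ** 0.5

def age_zero_mean_unit_var(age, age_mean, age_std_dev):
    return [(a - age_mean) / age_std_dev for a in age]

feature_funcs = [age_mean, age_std_dev, age_zero_mean_unit_var]

# Offline: aggregates fall out of the training data.
offline = execute(feature_funcs,
                  ["age_mean", "age_std_dev", "age_zero_mean_unit_var"],
                  inputs={"age": [20.0, 30.0, 40.0]})

# Online: a one-row request; recomputing mean/std over one row would be
# meaningless, so the stored training-time values are overridden in.
online = execute(feature_funcs,
                 ["age_zero_mean_unit_var"],
                 inputs={"age": [25.0]},
                 overrides={"age_mean": offline["age_mean"],
                            "age_std_dev": offline["age_std_dev"]})
```

With Hamilton itself, the online call would instead look roughly like `dr.execute([...], inputs=request, overrides={"age_mean": ..., "age_std_dev": ...})`.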
@skrawcz (Collaborator, Author) commented Feb 20, 2023

> Good start -- I don't think this is going to be clear to most people who haven't really dug into this. [...]

That's the point of the scenarios: there is no one-size-fits-all. That is, show the simplest possible thing first, then one where there is a feature store, etc.

Will add more to the motivation -- and draw some pictures.

I think this makes it clearer what this file is; it is a lightweight way to register feature sets that are used by a model.
To help set the tone and explain what feature engineering is, as well as give more context about the scenarios and the task.
Expands on the docs to hopefully make the explanation understandable to a novice.
Shows functionality that can be used to highlight which values should be overridden in an online setting.
2 participants