Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(reports): Set a minimum interval for each report's execution #28176

Merged
merged 6 commits into from May 8, 2024

Conversation

Vitor-Avila
Copy link
Contributor

SUMMARY

It's currently possible to create alerts and reports configured to be executed every minute. While it might be a suitable implementation for some use-cases, some users might create assets with a more recurrent frequency needed, leading to additional load to the DB which can cause slowness, timeouts, etc.

This PR introduces two configs that can be specified in superset_config.py:

# Set a minimum interval threshold between executions (for each Alert/Report)
# Value should be the amount of minutes (integer). Min: 1, Max: 60
ALERT_MINIMUM_INTERVAL_MINUTES = None
REPORT_MINIMUM_INTERVAL_MINUTES = None

This minimum interval is validated for both creation of new reports and also the modification of existing ones.

Note that this PR doesn't include a migration, so changing this config won't automatically change existing reports.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

No UI changes.

TESTING INSTRUCTIONS

Both integration and unit tests added. For manual testing:

  1. Specify a minimum interval in config:
ALERT_MINIMUM_INTERVAL_MINUTES = 2
REPORT_MINIMUM_INTERVAL_MINUTES = 2
  1. Try creating a report/alert that exceeds this limit.

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@github-actions github-actions bot added the doc Namespace | Anything related to documentation label Apr 22, 2024
@rusackas rusackas requested a review from eschutho April 22, 2024 20:01
if report_type == ReportScheduleType.ALERT
else "REPORT_MINIMUM_INTERVAL_MINUTES"
)
minimum_interval = current_app.config.get(config_key)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you may want to also perform some validation to make sure that this value is a number, or it will fail below at line 119.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great catch! Just fixed that

# Check if cron is set to every minute (*)
# or if it has an every-minute block (1-5)
if "-" in minutes_interval or minutes_interval == "*":
raise ReportScheduleFrequencyNotAllowed(report_type, minimum_interval)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically the first arg when raising will be the error message, so this feels a bit out of pattern. I'd suggest that we can either keep the error message very generic, and let the UI get more specific, or pass these in as optional kwargs (but not required).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if the new implementation is better -- happy to further change it as well.

Test the ``validate_report_frequency`` method when there's
only a configuration for alerts and user is creating report.
"""
for schedule in TEST_SCHEDULES:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @pytest.mark.parametrize would work well for you on these for loop tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ohh, wasn't familiar with it! just added it

Copy link
Member

@betodealmeida betodealmeida left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit worried about manipulating the crontab as a string to figure out the shortest interval — it can miss edge cases, and it forces constraints to the user due to the implementation details (in this case, the interval has to be in minutes, and has a maximum value of 60, which are artificial requirements).

An alternative solution would be to allow any minimum interval between reports:

ALERT_MINIMUM_INTERVAL_MINUTES = timedelta(seconds=90)

Or, as it seems to be more common for intervals in config.py:

ALERT_MINIMUM_INTERVAL_MINUTES = int(timedelta(seconds=90).total_seconds())

Then you can compute a few execution times programmatically and see if it violates the config (untested):

def is_valid_schedule(cron_schedule: str, config_value: int) -> bool:
    it = croniter(cron_schedule)
    previous = next(it)
    for _ in range(1000):  # how many? I have no idea...
        current = next(it)
        interval, previous = current - previous, current
        if interval.total_seconds() < config_value:
            return False

    return True

@Vitor-Avila
Copy link
Contributor Author

I'm a bit worried about manipulating the crontab as a string to figure out the shortest interval — it can miss edge cases, and it forces constraints to the user due to the implementation details (in this case, the interval has to be in minutes, and has a maximum value of 60, which are artificial requirements).

An alternative solution would be to allow any minimum interval between reports:

ALERT_MINIMUM_INTERVAL_MINUTES = timedelta(seconds=90)

Or, as it seems to be more common for intervals in config.py:

ALERT_MINIMUM_INTERVAL_MINUTES = int(timedelta(seconds=90).total_seconds())

Then you can compute a few execution times programmatically and see if it violates the config (untested):

def is_valid_schedule(cron_schedule: str, config_value: int) -> bool:
    it = croniter(cron_schedule)
    previous = next(it)
    for _ in range(1000):  # how many? I have no idea...
        current = next(it)
        interval, previous = current - previous, current
        if interval.total_seconds() < config_value:
            return False

    return True

hey @betodealmeida that was the route I was initially going for too, but "how many repetitions?" (in the for _ in range(x)) is what made me think of another alternative. There's a lot of flexibility in the configuration (like the user could make it every 3 minutes until minute 58, but then re-execute on 59, select multiple days with a different interval between each in a month, etc).

I can see benefits and concerns in both routes:

  • PR approach: Biggest advantage here is that its more limited scope can validate all executions, since it only cares to the minute piece. Biggest disadvantage is that it's more restrictive (limit can't be higher than 60 minutes), and that it's handling a cron schedule as a string.

  • Implementing cron validation: No limit on the configuration side (user could create any interval needed like days, hours, etc). The solution might not iterate on all executions.

Since I believe most Orgs would like to limit this in the minutes range (like at least 5/10 minutes interval), I thought it made more sense to go with the first route. But I'm happy to migrate to the second approach -- we could start with a loop with 1440 as the repetition count (amount of minutes in a day) and go from there.

@Vitor-Avila
Copy link
Contributor Author

thanks for the feedback, @eschutho 🙏 I'll put this on hold until we align on best route to move forward, before working on these.

@john-bodley
Copy link
Member

@Vitor-Avila how do these setting reflected in the UI given one can specifying whatever cron schedule they desire?

@Vitor-Avila
Copy link
Contributor Author

@Vitor-Avila how do these setting reflected in the UI given one can specifying whatever cron schedule they desire?

@john-bodley I haven't implemented any UI validation, so the user would face the toast error message when trying to save:

image

@codecov-commenter
Copy link

codecov-commenter commented May 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.28%. Comparing base (2e5f3ed) to head (5f18562).
Report is 43 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff             @@
##           master   #28176       +/-   ##
===========================================
+ Coverage   60.49%   83.28%   +22.79%     
===========================================
  Files        1931      520     -1411     
  Lines       76241    36922    -39319     
  Branches     8566        0     -8566     
===========================================
- Hits        46122    30752    -15370     
+ Misses      28015     6170    -21845     
+ Partials     2104        0     -2104     
Flag Coverage Δ
hive 49.19% <13.55%> (+0.03%) ⬆️
javascript ?
mysql 77.56% <96.61%> (?)
postgres 77.69% <96.61%> (?)
presto 53.81% <13.55%> (+0.01%) ⬆️
python 83.28% <100.00%> (+19.80%) ⬆️
sqlite 77.16% <96.61%> (?)
unit 57.74% <42.37%> (+0.12%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Comment on lines +103 to +106
logger.error(
"Invalid value for %s: %s", config_key, minimum_interval, exc_info=True
)
return
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to avoid raising a ValidationError here to prevent blocking users from creating alerts/reports until the configuration is fixed by an admin. Open to feedback, tho

Copy link
Member

@betodealmeida betodealmeida left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! I just left a couple suggestions.

superset/commands/report/exceptions.py Outdated Show resolved Hide resolved
Comment on lines 1330 to 1331
ALERT_MINIMUM_INTERVAL = 0
REPORT_MINIMUM_INTERVAL = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this would be clearer about the expected units of ALERT_MINIMUM_INTERVAL and REPORT_MINIMUM_INTERVAL.

Suggested change
ALERT_MINIMUM_INTERVAL = 0
REPORT_MINIMUM_INTERVAL = 0
ALERT_MINIMUM_INTERVAL = int(timedelta(seconds=0).total_seconds())
REPORT_MINIMUM_INTERVAL = int(timedelta(seconds=0).total_seconds())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ohh that's a great suggestion! I think I'll use int(timedelta(minutes=0).total_seconds()) since seconds are not really exposed in the report configuration, but it's def better than 0. 🙌

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label May 7, 2024
@eschutho eschutho merged commit 1bf0401 into apache:master May 8, 2024
29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc Namespace | Anything related to documentation lgtm This PR has been approved by a maintainer size/XL
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants