Add ClipWriterStage to video splitting pipeline #786

Open
wants to merge 35 commits into
base: aot/ray-video-clip-extraction

Conversation

suiyoubi
Contributor

  • Introduced ClipWriterStage for writing clips and metadata during video processing.
  • Updated video_split_clip_example.py to include the new stage, allowing for clip writing functionality.
  • Enhanced command-line argument parsing for output clip path.
  • Added utility functions for managing storage paths and writing data in various formats.
  • Implemented unit tests for ClipWriterStage to ensure functionality and reliability.
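As a rough illustration of what a clip-writing stage does, the sketch below writes each clip's bytes and a JSON metadata sidecar under an output path. It is not the PR's code: `Clip`, `VideoTask`, `ClipWriterSketch`, the field names, and the `clips/` and `metas/` layout are all hypothetical stand-ins.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Clip:
    """Hypothetical stand-in for the clip type used in the PR."""
    clip_id: str
    span: tuple[float, float]  # (start_s, end_s) within the source video
    buffer: bytes              # encoded clip bytes


@dataclass
class VideoTask:
    """Hypothetical stand-in for the PR's VideoTask."""
    source_video: str
    clips: list[Clip] = field(default_factory=list)


@dataclass
class ClipWriterSketch:
    """Writes clip bytes and a JSON sidecar per clip; returns the task unchanged."""
    output_path: str

    def process(self, task: VideoTask) -> VideoTask:
        clips_dir = Path(self.output_path) / "clips"  # assumed layout
        metas_dir = Path(self.output_path) / "metas"  # assumed layout
        clips_dir.mkdir(parents=True, exist_ok=True)
        metas_dir.mkdir(parents=True, exist_ok=True)
        for clip in task.clips:
            (clips_dir / f"{clip.clip_id}.mp4").write_bytes(clip.buffer)
            meta = {"source_video": task.source_video, "span": list(clip.span)}
            (metas_dir / f"{clip.clip_id}.json").write_text(json.dumps(meta))
        return task
```

Returning the task unchanged mirrors the `ProcessingStage[VideoTask, VideoTask]` signature shown later in the review, so further stages could follow the writer.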

Description

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

suiyoubi added 4 commits July 9, 2025 07:40
…ding stages

- Introduced `video_split_clip_example.py` to demonstrate video splitting functionality.
- Added `ClipTranscodingStage` and `FixedStrideExtractorStage` for processing video clips.
- Implemented command-line arguments for configuring video processing parameters.
- Created utility functions for grouping iterables in `grouping.py`.
- Added unit tests for the new stages in `test_clip_transcoding_stage.py` and `test_fixed_stride_extractor_stage.py`.

Signed-off-by: Ao Tang <[email protected]>
Signed-off-by: Ao Tang <[email protected]>
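To make the commit above concrete, here is a minimal sketch of fixed-stride span extraction and of a chunking helper for iterables. The function names, parameters, and the choice to drop a short trailing span are assumptions for illustration, not the contents of `FixedStrideExtractorStage` or `grouping.py`.

```python
from collections.abc import Iterable, Iterator
from itertools import islice
from typing import TypeVar

T = TypeVar("T")


def fixed_stride_spans(
    duration_s: float, clip_len_s: float, stride_s: float
) -> list[tuple[float, float]]:
    """Return (start, end) spans of length clip_len_s, stepping by stride_s.

    Assumption: a trailing span shorter than clip_len_s is dropped.
    """
    if clip_len_s <= 0 or stride_s <= 0:
        raise ValueError("clip_len_s and stride_s must be positive")
    spans = []
    start = 0.0
    while start + clip_len_s <= duration_s:
        spans.append((start, start + clip_len_s))
        start += stride_s
    return spans


def split_by_chunk_size(items: Iterable[T], chunk_size: int) -> Iterator[list[T]]:
    """Yield consecutive lists of up to chunk_size items; the last may be shorter."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    it = iter(items)
    while chunk := list(islice(it, chunk_size)):
        yield chunk
```

A 25-second video with 10-second clips and a 10-second stride yields two spans under these assumptions, and the spans could then be chunked before transcoding.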
copy-pr-bot bot commented Jul 10, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

praateekmahajan and others added 11 commits July 10, 2025 16:03
…adStage to accept VideoTask. Enhance video reading capabilities with new tests for VideoReaderStage.

Signed-off-by: Ao Tang <[email protected]>
…_read_example to include verbose argument.

Signed-off-by: Ao Tang <[email protected]>
- Introduced a new test package for tasks with an initial test suite for the video tasks module, including tests for the Clip, ClipStats, Video, VideoMetadata, and VideoTask classes.
- Implemented various test cases to validate initialization, property calculations, metadata extraction, and size calculations.

This enhances the testing coverage for video-related functionalities in the ray-curator project.

Signed-off-by: Ao Tang <[email protected]>
- Expanded the test suite for the video tasks module by adding new test cases for the Clip, ClipStats, Video, VideoMetadata, and VideoTask classes.
- Improved coverage for various functionalities including initialization, property calculations, and metadata extraction.

This update strengthens the reliability of video-related features in the ray-curator project.

Signed-off-by: Ao Tang <[email protected]>
suiyoubi force-pushed the aot/ray-video-clip-writer branch from 0b0fcb8 to 1adb5d5 on July 14, 2025 17:39
suiyoubi added 2 commits July 14, 2025 10:45
- Introduced `ClipWriterStage` for writing clips and metadata during video processing.
- Updated `video_split_clip_example.py` to include the new stage, allowing for clip writing functionality.
- Enhanced command-line argument parsing for output clip path.
- Added utility functions for managing storage paths and writing data in various formats.
- Implemented unit tests for `ClipWriterStage` to ensure functionality and reliability.

Signed-off-by: Ao Tang <[email protected]>
suiyoubi force-pushed the aot/ray-video-clip-writer branch from 1adb5d5 to b903c6f on July 14, 2025 17:45
suiyoubi and others added 8 commits July 14, 2025 10:59
- Improved `ClipWriterStage` to support writing additional metadata during video processing.
- Updated related utility functions to accommodate new metadata fields.
- Refined unit tests to cover the new functionality and ensure reliability.

Signed-off-by: Ao Tang <[email protected]>
* add documentfilter implementation

Signed-off-by: Sarah Yurick <[email protected]>

* fix nits and ruff

Signed-off-by: Sarah Yurick <[email protected]>

* add additional logic for setup, setup_on_node, and process_batch

Signed-off-by: Sarah Yurick <[email protected]>

* add pytests

Signed-off-by: Sarah Yurick <[email protected]>

* add dep

Signed-off-by: Sarah Yurick <[email protected]>

* more dep edits

Signed-off-by: Sarah Yurick <[email protected]>

* another dep

Signed-off-by: Sarah Yurick <[email protected]>

* add fasttext dep

Signed-off-by: Sarah Yurick <[email protected]>

* add jieba and mecab

Signed-off-by: Sarah Yurick <[email protected]>

* add default None params for setup_on_node and setup functions

Signed-off-by: Sarah Yurick <[email protected]>

* add praateek's suggestions

Signed-off-by: Sarah Yurick <[email protected]>

* organize imports

Signed-off-by: Sarah Yurick <[email protected]>

* remove process_batch

Signed-off-by: Sarah Yurick <[email protected]>

* add _metadata to result

Signed-off-by: Sarah Yurick <[email protected]>

* add praateek's suggestions

Signed-off-by: Sarah Yurick <[email protected]>

* ruff and post init for _name

Signed-off-by: Sarah Yurick <[email protected]>

* modify test

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
* copy over

Signed-off-by: Praateek <[email protected]>

* copy over

Signed-off-by: Praateek <[email protected]>

* add init to download

Signed-off-by: Praateek <[email protected]>

* move justext

Signed-off-by: Praateek <[email protected]>

* move resiliparse

Signed-off-by: Praateek <[email protected]>

* move trafilatura

Signed-off-by: Praateek <[email protected]>

* move get_stop_list_dict

Signed-off-by: Praateek <[email protected]>

* move download_utils.py to utils/download_utils.py

Signed-off-by: Praateek <[email protected]>

* move out to download.py

Signed-off-by: Praateek <[email protected]>

* move WarcIterator to warc_reader.py

Signed-off-by: Praateek <[email protected]>

* move CommonCrawlWARCExtractor to html_extractor

Signed-off-by: Praateek <[email protected]>

* remove commoncrawl.py

Signed-off-by: Praateek <[email protected]>

* create url_generation.py from download_utils

Signed-off-by: Praateek <[email protected]>

* tests dir

Signed-off-by: Praateek <[email protected]>

* copy over test_download.py as test_common_crawl.py

Signed-off-by: Praateek <[email protected]>

* add html_extractors/__init__

Signed-off-by: Praateek <[email protected]>

* move html_extractor to ProcessingStage

Signed-off-by: Praateek <[email protected]>

* update WarcReader to use ProcessingStage

Signed-off-by: Praateek <[email protected]>

* move to classes for url generation

Signed-off-by: Praateek <[email protected]>

* typo in name

Signed-off-by: Praateek <[email protected]>

* bug fixes in justext; rename resiliparse func; utils modular

Signed-off-by: Praateek <[email protected]>

* init file in for download/text

Signed-off-by: Praateek <[email protected]>

* justext minor change

Signed-off-by: Praateek <[email protected]>

* support str in htmlextractor

Signed-off-by: Praateek <[email protected]>

* add a working example

Signed-off-by: Praateek <[email protected]>

* set source_files so that write can be hashed

Signed-off-by: Praateek <[email protected]>

* use pprint in example

Signed-off-by: Praateek <[email protected]>

* update comment

Signed-off-by: Praateek <[email protected]>

* all tests migrated + work

Signed-off-by: Praateek <[email protected]>

* update defaults in example; comments in stage

Signed-off-by: Praateek <[email protected]>

* add tests for url generation + PR review

Signed-off-by: Praateek <[email protected]>

* update download for aws

Signed-off-by: Praateek <[email protected]>

* rename aws to use_aws_to_donwload

Signed-off-by: Praateek <[email protected]>

* update resources

Signed-off-by: Praateek <[email protected]>

* change url generation to have ray-stage-spec

Signed-off-by: Praateek <[email protected]>

* make download fault tolerant

Signed-off-by: Praateek <[email protected]>

* refactor as per pr reviews; with tests

Signed-off-by: Praateek <[email protected]>

* add readme

Signed-off-by: Praateek <[email protected]>

* bug fix; update tests

Signed-off-by: Praateek <[email protected]>

* update record limit to None

Signed-off-by: Praateek <[email protected]>

* bug fixes

Signed-off-by: Praateek <[email protected]>

* pr comments

Signed-off-by: Praateek <[email protected]>

* add back test html extractor implementations

Signed-off-by: Praateek <[email protected]>

* remove cc example

Signed-off-by: Praateek <[email protected]>

* add column utils

Signed-off-by: Praateek <[email protected]>

* add todos

Signed-off-by: Praateek <[email protected]>

* Add Wikipedia download and extract stage

This commit introduces a comprehensive pipeline for downloading and processing Wikipedia dump files within the ray-curator framework. Key components include:

- **WikipediaUrlGenerator**: Generates URLs for Wikipedia dump files.
- **WikipediaDownloader**: Downloads .bz2 dump files using wget.
- **WikipediaIterator**: Parses Wikipedia XML dumps and extracts article content.
- **WikipediaExtractor**: Cleans Wikipedia markup and extracts meaningful text.

Additionally, an example script demonstrating the usage of the new stage is included, along with tests for each component to ensure functionality and reliability.

Documentation for the new stage is also provided to guide users in implementation and usage.

Signed-off-by: Abhinav Garg <[email protected]>

* merge from main

Signed-off-by: Praateek <[email protected]>

* move deps to text

Signed-off-by: Praateek <[email protected]>

* update dev

Signed-off-by: Praateek <[email protected]>

* update pyproject and test.yml

Signed-off-by: Praateek <[email protected]>

* remove cugraph extra pyproject

Signed-off-by: Praateek <[email protected]>

* move text to optional deps

Signed-off-by: Praateek <[email protected]>

* Refactor pyproject.toml: Remove unused dependencies and clean up dev section

Signed-off-by: Abhinav Garg <[email protected]>

* Remove unused Wikipedia example and related README documentation from the download text stages.

Signed-off-by: Abhinav Garg <[email protected]>

* Add method to fetch JSON dump data for Wikipedia and refactor dump date retrieval logic

- Introduced `_get_data_for_dump` method to handle fetching and parsing JSON dump data.
- Refactored logic in `_get_wikipedia_urls` to iterate through available dumps and check their status.
- Improved error handling for cases where dump data cannot be loaded or is not finished.

Signed-off-by: Abhinav Garg <[email protected]>
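The dump-selection logic described in this commit can be sketched roughly as follows. The JSON layout (a `jobs` mapping whose entries carry a `status` of `"done"`) and both function names are assumptions; the sketch also takes the raw status text as input instead of fetching it, so it says nothing about how `_get_data_for_dump` retrieves the data.

```python
import json
from typing import Optional


def parse_dump_status(raw: str) -> Optional[dict]:
    """Parse raw status text; return None when it cannot be loaded as a JSON object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) else None


def latest_finished_dump(
    dump_dates: list[str], status_by_date: dict[str, str]
) -> Optional[str]:
    """Walk dumps from newest to oldest and return the first fully finished one."""
    for date in sorted(dump_dates, reverse=True):
        data = parse_dump_status(status_by_date.get(date, ""))
        if data is None:
            continue  # status could not be loaded: try an older dump
        jobs = data.get("jobs", {})
        # Assumed convention: a dump is finished when every job reports "done".
        if jobs and all(job.get("status") == "done" for job in jobs.values()):
            return date
    return None
```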

* Add README for custom download pipelines and remove Wikipedia stage documentation

- Introduced a new README.md file detailing the structure and implementation of custom download pipelines.
- Removed the outdated README.md for the Wikipedia download and extract stage to streamline documentation.

Signed-off-by: Abhinav Garg <[email protected]>

* Add num_workers_per_node method to DocumentDownloader and WikipediaDownloader

- Implemented num_workers_per_node method in DocumentDownloader to define the number of workers per node for downloading tasks.
- Overridden num_workers_per_node in WikipediaDownloader to return a fixed value of 1.
- Updated xenna_stage_spec method in DocumentDownloadStage to include the number of workers per node.

Signed-off-by: Abhinav Garg <[email protected]>

* Update WikipediaDownloader to use 2 workers and change logging level in WikipediaIterator

- Modified num_workers_per_node in WikipediaDownloader to return 2, allowing for increased parallelism during downloads.
- Changed logging from info to debug level in WikipediaIterator for extracted articles to reduce log verbosity.

Signed-off-by: Abhinav Garg <[email protected]>
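The per-downloader worker cap from the two commits above might look like the sketch below. Only the class names and `num_workers_per_node` come from the commit messages; the `None` default, the `download` signature, and the `stage_spec` helper (a loose stand-in for `xenna_stage_spec`) are assumptions.

```python
from abc import ABC, abstractmethod
from typing import Any, Optional


class DocumentDownloader(ABC):
    """Base downloader; subclasses may cap how many workers run per node."""

    @abstractmethod
    def download(self, url: str) -> Optional[str]:
        """Download url and return a local path, or None on failure."""

    def num_workers_per_node(self) -> Optional[int]:
        # Assumption: None means "no cap, let the executor decide".
        return None


class WikipediaDownloader(DocumentDownloader):
    def download(self, url: str) -> Optional[str]:
        return None  # placeholder: the real download is out of scope here

    def num_workers_per_node(self) -> Optional[int]:
        # Fixed at 1 in the first commit, raised to 2 in the follow-up.
        return 2


def stage_spec(downloader: DocumentDownloader) -> dict[str, Any]:
    """Hypothetical stage spec that forwards the cap only when one is set."""
    spec: dict[str, Any] = {}
    workers = downloader.num_workers_per_node()
    if workers is not None:
        spec["num_workers_per_node"] = workers
    return spec
```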

---------

Signed-off-by: Praateek <[email protected]>
Signed-off-by: Abhinav Garg <[email protected]>
Co-authored-by: Praateek <[email protected]>
* Refactor Wikipedia extraction and URL generation logic

- Removed redundant return statement in `WikipediaExtractor` class.
- Simplified status check for dump data in `WikipediaUrlGenerator` by directly accessing the dictionary keys.
- Updated logging level in tests to ensure accurate assertions on log calls.
- Enhanced test cases for URL generation to cover various dump statuses.

These changes improve code clarity and maintainability while ensuring robust error handling in the Wikipedia download and extraction process.

Signed-off-by: Abhinav Garg <[email protected]>

* Add mwparserfromhell dependency to pyproject.toml

- Included `mwparserfromhell==0.6.5` in the text dependencies section of `pyproject.toml` to support parsing Wikipedia markup.

This addition enhances the functionality of the project by ensuring the necessary tools for processing Wikipedia data are available.

Signed-off-by: [Your Name] <[email protected]>
Signed-off-by: Abhinav Garg <[email protected]>

---------

Signed-off-by: Abhinav Garg <[email protected]>
Signed-off-by: [Your Name] <[email protected]>


@dataclass
class ClipWriterStage(ProcessingStage[VideoTask, VideoTask]):
Contributor

Is there an alternative design for this? This looks very manual and works only for this specific task. Can we make it more generic?

E.g., for text, if you add a new field, it just shows up in the output parquet, but here we have to do everything manually. Basically, can we reduce the lines in this code by, say, 2x?

Contributor Author

I have a new implementation of the writer, GenericClipWriter, but I am a bit hesitant to replace the current cosmos_curate implementation with it. Let's remove the current one in a later PR, once we have verified that all the stages' outputs (captions/embeddings) can be written successfully the new way.
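One way to get the "new fields just show up" behavior asked for above is to serialize whatever fields a dataclass declares instead of listing them by hand. The sketch below is illustrative only and is not the GenericClipWriter mentioned in the reply; `ClipRecord`, `to_metadata`, and the `skip` set are hypothetical names.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class ClipRecord:
    """Hypothetical clip record; real stages would attach captions, embeddings, etc."""
    clip_id: str
    span: tuple[float, float]
    caption: str = ""
    buffer: bytes = field(default=b"", repr=False)  # raw bytes, written separately


def to_metadata(record: Any, skip: frozenset[str] = frozenset({"buffer"})) -> str:
    """Serialize every dataclass field except those in skip.

    A field added to the record (or to a subclass) appears in the output
    without any change to the writer.
    """
    payload = {k: v for k, v in asdict(record).items() if k not in skip}
    return json.dumps(payload, sort_keys=True)
```

The trade-off is that every non-skipped field must be JSON-serializable, so large or binary fields still need an explicit path.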

Contributor

abhinavg4 left a comment

Looks good, although it is too verbose since it comes from cosmos curator.

suiyoubi and others added 10 commits July 21, 2025 06:26
…DownloadStage

- Replaced separate VideoReaderStage and VideoDownloadStage with a new VideoReaderDownloadStage that combines both functionalities.
- Updated the video_read_example to utilize the new composite stage.
- Adjusted inputs and outputs for VideoDownloadStage to reflect changes in the pipeline.
- Added tests for the new VideoReaderDownloadStage to ensure proper functionality and integration.

This refactor simplifies the video reading and downloading process within the ray-curator framework.

Signed-off-by: Ao Tang <[email protected]>
…ntegrate new functionalities

- Replaced separate VideoReaderStage and VideoDownloadStage with a composite VideoReaderDownloadStage, streamlining the video reading and downloading process.
- Updated ClipTranscodingStage to improve GPU resource allocation and added detailed arguments for better configurability.
- Adjusted tests to reflect changes in resource management, ensuring accurate assertions on GPU usage.

These changes improve the clarity and efficiency of video processing within the ray-curator framework.

Signed-off-by: Ao Tang <[email protected]>
- Introduced MockGpuInfo and MockGpuResources classes to simulate GPU information and resources for testing.
- Updated test_resources_gpu_encoder and test_resources_hwaccel_enabled methods to utilize mocks, ensuring accurate resource assertions without dependency on actual GPU hardware.
- Enhanced test_different_encoder_configurations to validate resource requirements for various encoder configurations, including GPU settings.

These changes improve the robustness of the ClipTranscodingStage tests by isolating them from hardware dependencies, facilitating easier testing and validation.

Signed-off-by: [Your Name] <[email protected]>
Signed-off-by: Ao Tang <[email protected]>
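The idea of isolating resource tests from real hardware can be sketched as below. Only the name `MockGpuInfo` comes from the commit message; its fields, `query_gpu_info`, `encoder_gpu_fraction`, and the fraction formula are hypothetical and are not the logic of `ClipTranscodingStage`.

```python
from dataclasses import dataclass
from typing import Callable
from unittest import mock


@dataclass
class MockGpuInfo:
    """Simulated GPU description (fields are illustrative assumptions)."""
    name: str = "mock-gpu"
    num_nvencs: int = 2


def query_gpu_info() -> MockGpuInfo:
    """Stand-in for a real hardware query; fails on machines without a GPU."""
    raise RuntimeError("no GPU available on this machine")


def encoder_gpu_fraction(
    encoder: str,
    nvencs_per_worker: int = 1,
    gpu_info_fn: Callable[[], MockGpuInfo] = query_gpu_info,
) -> float:
    """Return the GPU fraction a worker requests; CPU encoders request none."""
    if not encoder.endswith("_nvenc"):
        return 0.0
    info = gpu_info_fn()
    return nvencs_per_worker / info.num_nvencs


def test_gpu_encoder_uses_mocked_hardware() -> None:
    fake_query = mock.Mock(return_value=MockGpuInfo(num_nvencs=4))
    assert encoder_gpu_fraction("h264_nvenc", gpu_info_fn=fake_query) == 0.25
    fake_query.assert_called_once()
    # The CPU path must not touch the GPU query at all.
    assert encoder_gpu_fraction("libopenh264", gpu_info_fn=fake_query) == 0.0
    fake_query.assert_called_once()
```

Injecting the query function keeps the test independent of module patching; `unittest.mock.patch` on the real query would be an equivalent approach.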
…uired output path argument

- Introduced a new GenericClipWriterStage for writing video clips and their metadata, consolidating the writing process and improving resource management.
- Updated the video_split_clip_example to require an output clip path, ensuring that users specify where to save the generated clips.
- The new stage supports parallel writing of clips and metadata, enhancing performance and flexibility in video processing workflows.

Signed-off-by: Ao Tang <[email protected]>
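Parallel writing of clips and metadata, as mentioned in this commit, could be approached with a thread pool, since file writes are I/O-bound. This is a sketch under that assumption; `write_all` and its parameters are hypothetical and do not describe how GenericClipWriterStage is implemented.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_all(items: dict[str, bytes], output_dir: str, max_workers: int = 4) -> list[Path]:
    """Write each (file name, payload) pair concurrently; return paths in input order."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    def _write(entry: tuple[str, bytes]) -> Path:
        name, data = entry
        path = out / name
        path.write_bytes(data)
        return path

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order and re-raises the first write error, if any.
        return list(pool.map(_write, items.items()))
```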
suiyoubi force-pushed the aot/ray-video-clip-extraction branch from 29a106c to 012a2e1 on July 22, 2025 20:24
Contributor

abhinavg4 left a comment

Somehow this PR got messed up with force push. Can we correct this?

@suiyoubi
Contributor Author

> Somehow this PR got messed up with force push. Can we correct this?

This is actually because it contains commits from the first merged PR (reader). I rebased aot/ray-video-clip-extraction but have not rebased this branch yet. I will do so once we merge aot/ray-video-clip-extraction.

4 participants