Add ClipWriterStage to video splitting pipeline #786

suiyoubi · 2025-07-10T19:02:25Z

Introduced ClipWriterStage for writing clips and metadata during video processing.
Updated video_split_clip_example.py to include the new stage, allowing for clip writing functionality.
Enhanced command-line argument parsing for output clip path.
Added utility functions for managing storage paths and writing data in various formats.
Implemented unit tests for ClipWriterStage to ensure functionality and reliability.

Description

Usage

# Add snippet demonstrating usage
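
A minimal sketch of what a clip-writing step does, assuming a simple `clips/` and `metas/` layout under the output path. `ToyClip` and `ToyClipWriter` are hypothetical names introduced only for illustration; they are not the `ClipWriterStage` API from this PR, and the directory layout and metadata keys are assumptions.

```python
import json
import tempfile
import uuid
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ToyClip:
    """Hypothetical stand-in for a transcoded clip; not a ray-curator class."""

    clip_id: uuid.UUID
    span: tuple[float, float]  # (start, end) in seconds within the source video
    encoded_data: bytes  # already-encoded clip bytes


@dataclass
class ToyClipWriter:
    """Illustrative writer; the real ClipWriterStage may differ in every detail."""

    output_path: str

    def process(self, clips: list[ToyClip]) -> list[Path]:
        root = Path(self.output_path)
        clip_dir = root / "clips"  # assumed layout
        meta_dir = root / "metas"  # assumed layout
        clip_dir.mkdir(parents=True, exist_ok=True)
        meta_dir.mkdir(parents=True, exist_ok=True)

        written = []
        for clip in clips:
            # Write the encoded bytes, then a small JSON sidecar next to them.
            clip_file = clip_dir / f"{clip.clip_id}.mp4"
            clip_file.write_bytes(clip.encoded_data)
            meta = {
                "clip_id": str(clip.clip_id),
                "span": list(clip.span),
                "num_bytes": len(clip.encoded_data),
            }
            (meta_dir / f"{clip.clip_id}.json").write_text(json.dumps(meta))
            written.append(clip_file)
        return written


# Usage: write one dummy clip into a temporary directory.
out_dir = tempfile.mkdtemp()
clip = ToyClip(clip_id=uuid.uuid4(), span=(0.0, 10.0), encoded_data=b"\x00" * 16)
written = ToyClipWriter(output_path=out_dir).process([clip])
print(written[0])
```

Keeping the payload and its metadata in separate files under one root is only one possible layout; it is chosen here because it makes each clip independently inspectable.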

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

…ding stages
- Introduced `video_split_clip_example.py` to demonstrate video splitting functionality.
- Added `ClipTranscodingStage` and `FixedStrideExtractorStage` for processing video clips.
- Implemented command-line arguments for configuring video processing parameters.
- Created utility functions for grouping iterables in `grouping.py`.
- Added unit tests for the new stages in `test_clip_transcoding_stage.py` and `test_fixed_stride_extractor_stage.py`.
Signed-off-by: Ao Tang <[email protected]>


copy-pr-bot · 2025-07-10T19:02:29Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


- Introduced a new test package for tasks with an initial test suite for the video tasks module, including tests for the Clip, ClipStats, Video, VideoMetadata, and VideoTask classes.
- Implemented various test cases to validate initialization, property calculations, metadata extraction, and size calculations.
This enhances the testing coverage for video-related functionalities in the ray-curator project.
Signed-off-by: Ao Tang <[email protected]>

- Expanded the test suite for the video tasks module by adding new test cases for the Clip, ClipStats, Video, VideoMetadata, and VideoTask classes.
- Improved coverage for various functionalities including initialization, property calculations, and metadata extraction.
This update strengthens the reliability of video-related features in the ray-curator project.
Signed-off-by: Ao Tang <[email protected]>


- Introduced `ClipWriterStage` for writing clips and metadata during video processing.
- Updated `video_split_clip_example.py` to include the new stage, allowing for clip writing functionality.
- Enhanced command-line argument parsing for output clip path.
- Added utility functions for managing storage paths and writing data in various formats.
- Implemented unit tests for `ClipWriterStage` to ensure functionality and reliability.
Signed-off-by: Ao Tang <[email protected]>

- Improved `ClipWriterStage` to support writing additional metadata during video processing.
- Updated related utility functions to accommodate new metadata fields.
- Refined unit tests to cover the new functionality and ensure reliability.
Signed-off-by: Ao Tang <[email protected]>

* add documentfilter implementation Signed-off-by: Sarah Yurick <[email protected]>
* fix nits and ruff Signed-off-by: Sarah Yurick <[email protected]>
* add additional logic for setup, setup_on_node, and process_batch Signed-off-by: Sarah Yurick <[email protected]>
* add pytests Signed-off-by: Sarah Yurick <[email protected]>
* add dep Signed-off-by: Sarah Yurick <[email protected]>
* more dep edits Signed-off-by: Sarah Yurick <[email protected]>
* another dep Signed-off-by: Sarah Yurick <[email protected]>
* add fasttext dep Signed-off-by: Sarah Yurick <[email protected]>
* add jieba and mecab Signed-off-by: Sarah Yurick <[email protected]>
* add default None params for setup_on_node and setup functions Signed-off-by: Sarah Yurick <[email protected]>
* add praateek's suggestions Signed-off-by: Sarah Yurick <[email protected]>
* organize imports Signed-off-by: Sarah Yurick <[email protected]>
* remove process_batch Signed-off-by: Sarah Yurick <[email protected]>
* add _metadata to result Signed-off-by: Sarah Yurick <[email protected]>
* add praateek's suggestions Signed-off-by: Sarah Yurick <[email protected]>
* ruff and post init for _name Signed-off-by: Sarah Yurick <[email protected]>
* modify test Signed-off-by: Sarah Yurick <[email protected]>
---------
Signed-off-by: Sarah Yurick <[email protected]>

* copy over Signed-off-by: Praateek <[email protected]>
* copy over Signed-off-by: Praateek <[email protected]>
* add init to download Signed-off-by: Praateek <[email protected]>
* move justext Signed-off-by: Praateek <[email protected]>
* move resiliparse Signed-off-by: Praateek <[email protected]>
* move trafilatura Signed-off-by: Praateek <[email protected]>
* move get_stop_list_dict Signed-off-by: Praateek <[email protected]>
* move download_utils.py to utils/download_utils.py Signed-off-by: Praateek <[email protected]>
* move out to download.py Signed-off-by: Praateek <[email protected]>
* move WarcIterator towarc_reader.py Signed-off-by: Praateek <[email protected]>
* move CommonCrawlWARCExtractor to html_extractor Signed-off-by: Praateek <[email protected]>
* remove commoncrawl.py Signed-off-by: Praateek <[email protected]>
* create url_generation.py from download_utils Signed-off-by: Praateek <[email protected]>
* tests dir Signed-off-by: Praateek <[email protected]>
* copy over test_download.py as test_common_crawl.py Signed-off-by: Praateek <[email protected]>
* add html_extractors/__init__ Signed-off-by: Praateek <[email protected]>
* move html_extractor to ProcessingStage Signed-off-by: Praateek <[email protected]>
* update WarcReader to use ProecssingStage Signed-off-by: Praateek <[email protected]>
* move to classes for url generation Signed-off-by: Praateek <[email protected]>
* typo in name Signed-off-by: Praateek <[email protected]>
* bug fixes in justext; rename resiliparse func; utils modular Signed-off-by: Praateek <[email protected]>
* init file in for download/text Signed-off-by: Praateek <[email protected]>
* justtext minor change Signed-off-by: Praateek <[email protected]>
* support str in htmlextractor Signed-off-by: Praateek <[email protected]>
* add a working example Signed-off-by: Praateek <[email protected]>
* set source_files so that write can be hashed Signed-off-by: Praateek <[email protected]>
* use pprint in example Signed-off-by: Praateek <[email protected]>
* update comment Signed-off-by: Praateek <[email protected]>
* all tests migrated + work Signed-off-by: Praateek <[email protected]>
* update defaults in example; comments in stage Signed-off-by: Praateek <[email protected]>
* add tests for url generation + PR review Signed-off-by: Praateek <[email protected]>
* update download for aws Signed-off-by: Praateek <[email protected]>
* rename aws to use_aws_to_donwload Signed-off-by: Praateek <[email protected]>
* update resources Signed-off-by: Praateek <[email protected]>
* change url generation to have ray-stage-spec Signed-off-by: Praateek <[email protected]>
* make download fault tolerant Signed-off-by: Praateek <[email protected]>
* refactor as per pr reviews; with tests Signed-off-by: Praateek <[email protected]>
* add readme Signed-off-by: Praateek <[email protected]>
* bug fix; update tests Signed-off-by: Praateek <[email protected]>
* update record limit to None Signed-off-by: Praateek <[email protected]>
* bug fixes Signed-off-by: Praateek <[email protected]>
* pr comments Signed-off-by: Praateek <[email protected]>
* add back test html extractor implementations Signed-off-by: Praateek <[email protected]>
* remove cc example Signed-off-by: Praateek <[email protected]>
* add column utils Signed-off-by: Praateek <[email protected]>
* add todos Signed-off-by: Praateek <[email protected]>
* Add Wikipedia download and extract stage
  This commit introduces a comprehensive pipeline for downloading and processing Wikipedia dump files within the ray-curator framework. Key components include:
  - **WikipediaUrlGenerator**: Generates URLs for Wikipedia dump files.
  - **WikipediaDownloader**: Downloads .bz2 dump files using wget.
  - **WikipediaIterator**: Parses Wikipedia XML dumps and extracts article content.
  - **WikipediaExtractor**: Cleans Wikipedia markup and extracts meaningful text.
  Additionally, an example script demonstrating the usage of the new stage is included, along with tests for each component to ensure functionality and reliability. Documentation for the new stage is also provided to guide users in implementation and usage.
  Signed-off-by: Abhinav Garg <[email protected]>
* merge from main Signed-off-by: Praateek <[email protected]>
* move deps to text Signed-off-by: Praateek <[email protected]>
* update dev Signed-off-by: Praateek <[email protected]>
* update pyproject and test.yml Signed-off-by: Praateek <[email protected]>
* remove cugraph extra pyproject Signed-off-by: Praateek <[email protected]>
* move text to optional deps Signed-off-by: Praateek <[email protected]>
* Refactor pyproject.toml: Remove unused dependencies and clean up dev section Signed-off-by: Abhinav Garg <[email protected]>
* Remove unused Wikipedia example and related README documentation from the download text stages. Signed-off-by: Abhinav Garg <[email protected]>
* Add method to fetch JSON dump data for Wikipedia and refactor dump date retrieval logic
  - Introduced `_get_data_for_dump` method to handle fetching and parsing JSON dump data.
  - Refactored logic in `_get_wikipedia_urls` to iterate through available dumps and check their status.
  - Improved error handling for cases where dump data cannot be loaded or is not finished.
  Signed-off-by: Abhinav Garg <[email protected]>
* Add README for custom download pipelines and remove Wikipedia stage documentation
  - Introduced a new README.md file detailing the structure and implementation of custom download pipelines.
  - Removed the outdated README.md for the Wikipedia download and extract stage to streamline documentation.
  Signed-off-by: Abhinav Garg <[email protected]>
* Add num_workers_per_node method to DocumentDownloader and WikipediaDownloader
  - Implemented num_workers_per_node method in DocumentDownloader to define the number of workers per node for downloading tasks.
  - Overridden num_workers_per_node in WikipediaDownloader to return a fixed value of 1.
  - Updated xenna_stage_spec method in DocumentDownloadStage to include the number of workers per node.
  Signed-off-by: Abhinav Garg <[email protected]>
* Update WikipediaDownloader to use 2 workers and change logging level in WikipediaIterator
  - Modified num_workers_per_node in WikipediaDownloader to return 2, allowing for increased parallelism during downloads.
  - Changed logging from info to debug level in WikipediaIterator for extracted articles to reduce log verbosity.
  Signed-off-by: Abhinav Garg <[email protected]>
---------
Signed-off-by: Praateek <[email protected]>
Signed-off-by: Abhinav Garg <[email protected]>
Co-authored-by: Praateek <[email protected]>


* Refactor Wikipedia extraction and URL generation logic
  - Removed redundant return statement in `WikipediaExtractor` class.
  - Simplified status check for dump data in `WikipediaUrlGenerator` by directly accessing the dictionary keys.
  - Updated logging level in tests to ensure accurate assertions on log calls.
  - Enhanced test cases for URL generation to cover various dump statuses.
  These changes improve code clarity and maintainability while ensuring robust error handling in the Wikipedia download and extraction process.
  Signed-off-by: Abhinav Garg <[email protected]>
* Add mwparserfromhell dependency to pyproject.toml
  - Included `mwparserfromhell==0.6.5` in the text dependencies section of `pyproject.toml` to support parsing Wikipedia markup.
  This addition enhances the functionality of the project by ensuring the necessary tools for processing Wikipedia data are available.
  Signed-off-by: [Your Name] <[email protected]>
  Signed-off-by: Abhinav Garg <[email protected]>
---------
Signed-off-by: Abhinav Garg <[email protected]>
Signed-off-by: [Your Name] <[email protected]>

abhinavg4 · 2025-07-21T05:53:28Z

ray-curator/ray_curator/stages/video/io/clip_writer.py

+
+
+@dataclass
+class ClipWriterStage(ProcessingStage[VideoTask, VideoTask]):


Is there an alternative design for this? It looks very manual and can only work for this specific task. Can we make it more generic?

E.g., for text, if you add a new field, it just shows up in the output parquet, but here we have to do everything manually. Basically, can we cut the amount of code here by, say, 2x?
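
One way to read the request above is a field-driven writer: instead of listing every attribute by hand, the metadata is derived from the record's dataclass fields, so a newly added field is written without touching the writer. The sketch below is only an illustration of that idea under stated assumptions. `ClipRecord`, `ScoredClipRecord`, and `to_metadata` are hypothetical names, not code from this PR, and the choice to record only the size of binary fields is an assumption.

```python
import dataclasses
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ClipRecord:
    """Hypothetical record; not a ray-curator class."""

    clip_id: str
    span: tuple[float, float]
    caption: Optional[str] = None
    encoded_data: Optional[bytes] = None


def to_metadata(record: Any) -> dict[str, Any]:
    """Build a JSON-friendly dict from every dataclass field of `record`."""
    meta: dict[str, Any] = {}
    for f in dataclasses.fields(record):
        value = getattr(record, f.name)
        if isinstance(value, (bytes, bytearray)):
            # Binary payloads are assumed to be written elsewhere; keep only the size.
            meta[f"{f.name}_num_bytes"] = len(value)
        else:
            meta[f.name] = value
    return meta


@dataclass
class ScoredClipRecord(ClipRecord):
    # A new field appears in the output without any change to to_metadata.
    aesthetic_score: float = 0.0


record = ScoredClipRecord(
    clip_id="clip-0", span=(0.0, 5.0), encoded_data=b"abc", aesthetic_score=0.7
)
print(json.dumps(to_metadata(record)))
```

The trade-off is that every field is emitted by default, so fields that should stay out of the output need an explicit exclusion rule.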

suiyoubi · 2025-07-21

I have a new implementation of the writer, GenericClipWriter, but I am a bit hesitant to replace the current cosmos_curate implementation with it. Let's remove the current one in a later PR, once we have verified that all stage outputs (captions/embeddings) can be written successfully the new way.

ray-curator/ray_curator/stages/video/io/clip_writer.py

abhinavg4

Looks good, although it is too verbose because it comes from cosmos curator.

…DownloadStage
- Replaced separate VideoReaderStage and VideoDownloadStage with a new VideoReaderDownloadStage that combines both functionalities.
- Updated the video_read_example to utilize the new composite stage.
- Adjusted inputs and outputs for VideoDownloadStage to reflect changes in the pipeline.
- Added tests for the new VideoReaderDownloadStage to ensure proper functionality and integration.
This refactor simplifies the video reading and downloading process within the ray-curator framework.
Signed-off-by: Ao Tang <[email protected]>


…ntegrate new functionalities
- Replaced separate VideoReaderStage and VideoDownloadStage with a composite VideoReaderDownloadStage, streamlining the video reading and downloading process.
- Updated ClipTranscodingStage to improve GPU resource allocation and added detailed arguments for better configurability.
- Adjusted tests to reflect changes in resource management, ensuring accurate assertions on GPU usage.
These changes improve the clarity and efficiency of video processing within the ray-curator framework.
Signed-off-by: Ao Tang <[email protected]>

- Introduced MockGpuInfo and MockGpuResources classes to simulate GPU information and resources for testing.
- Updated test_resources_gpu_encoder and test_resources_hwaccel_enabled methods to utilize mocks, ensuring accurate resource assertions without dependency on actual GPU hardware.
- Enhanced test_different_encoder_configurations to validate resource requirements for various encoder configurations, including GPU settings.
These changes improve the robustness of the ClipTranscodingStage tests by isolating them from hardware dependencies, facilitating easier testing and validation.
Signed-off-by: [Your Name] <[email protected]>
Signed-off-by: Ao Tang <[email protected]>


…uired output path argument
- Introduced a new GenericClipWriterStage for writing video clips and their metadata, consolidating the writing process and improving resource management.
- Updated the video_split_clip_example to require an output clip path, ensuring that users specify where to save the generated clips.
- The new stage supports parallel writing of clips and metadata, enhancing performance and flexibility in video processing workflows.
Signed-off-by: Ao Tang <[email protected]>


abhinavg4

Somehow this PR got messed up with force push. Can we correct this?

suiyoubi · 2025-07-23T13:33:51Z

Somehow this PR got messed up with force push. Can we correct this?

This is because it contains commits from the first merged PR (reader). I rebased aot/ray-video-clip-extraction but have not rebased this branch yet. I will do that once aot/ray-video-clip-extraction is merged.

suiyoubi added 4 commits July 9, 2025 07:40

Add video io reader

a0f3143

Add test

8261ccb

remove debug test

5108d2c

Signed-off-by: Ao Tang <[email protected]>

praateekmahajan and others added 11 commits July 10, 2025 16:03

[Ray] Add integration test to test backends for a specified pipeline (#770)

b66a504

Add VideoReaderStage to video reading pipeline and update VideoDownloadStage to accept VideoTask. Enhance video reading capabilities with new tests for VideoReaderStage.

0fde549

Signed-off-by: Ao Tang <[email protected]>

Update VideoDownloadStage to support verbose logging and modify video_read_example to include verbose argument.

440992d

Signed-off-by: Ao Tang <[email protected]>

Update outputs for VideoDownloadStage and VideoReaderStage to include additional metadata fields.

6b69764

Signed-off-by: Ao Tang <[email protected]>

Update CI workflow to include video dependencies for testing

4f85180

Signed-off-by: Ao Tang <[email protected]>

Merge remote-tracking branch 'origin/aot/ray-video-reader' into aot/ray-video-clip-extraction

6452e7d

Signed-off-by: Ao Tang <[email protected]>

Add unit tests for grouping utilities

95c519a

Signed-off-by: Ao Tang <[email protected]>

Refactor video splitting pipeline to remove debug mode and enhance stage integration

fa8915b

Signed-off-by: Ao Tang <[email protected]>

Adding with_ for options in ProcessingStage and CompositeStage (#764)

967ac81

suiyoubi force-pushed the aot/ray-video-clip-writer branch from 0b0fcb8 to 1adb5d5 · July 14, 2025 17:39

suiyoubi added 2 commits July 14, 2025 10:45

Add video limit argument to video split clip example

602e27e

Signed-off-by: Ao Tang <[email protected]>

suiyoubi force-pushed the aot/ray-video-clip-writer branch from 1adb5d5 to b903c6f · July 14, 2025 17:45

suiyoubi and others added 8 commits July 14, 2025 10:59

[Ray] Add Download Extract Base Class + Common Crawl Stage (#738)

bd78b80

[Ray] Use Ray Actors where viable (#792)

7c6e88c

Merge remote-tracking branch 'origin/ray-api' into aot/ray-video-reader

f800d1f

Signed-off-by: Ao Tang <[email protected]>

Update pyproject.toml to include a trailing comma for pynvml dependency

808c968

Signed-off-by: Ao Tang <[email protected]>

abhinavg4 reviewed Jul 21, 2025

ray-curator/ray_curator/stages/video/io/clip_writer.py (outdated)

abhinavg4 approved these changes Jul 21, 2025


suiyoubi and others added 10 commits July 21, 2025 06:26

Merge branch 'ray-api' into aot/ray-video-reader

e261d0f

Merge branch 'aot/ray-video-reader' into aot/ray-video-clip-extraction

330f7a9

Signed-off-by: Ao Tang <[email protected]>

Merge branch 'aot/ray-video-clip-extraction' into aot/ray-video-clip-writer

839a910

Signed-off-by: Ao Tang <[email protected]>

Update ClipWriterStage to clarify local storage usage

45cf6d1

Signed-off-by: [Your Name] <[email protected]> Signed-off-by: Ao Tang <[email protected]>

Remove deprecated GPU resource tests from ClipTranscodingStage

42d8429

Signed-off-by: Ao Tang <[email protected]>

Merge branch 'aot/ray-video-clip-extraction' into aot/ray-video-clip-writer

3e27e4c

suiyoubi force-pushed the aot/ray-video-clip-extraction branch from 29a106c to 012a2e1 · July 22, 2025 20:24

abhinavg4 requested changes Jul 22, 2025
