Add ClipWriterStage to video splitting pipeline #786
base: aot/ray-video-clip-extraction
…ding stages
- Introduced `video_split_clip_example.py` to demonstrate video splitting functionality.
- Added `ClipTranscodingStage` and `FixedStrideExtractorStage` for processing video clips.
- Implemented command-line arguments for configuring video processing parameters.
- Created utility functions for grouping iterables in `grouping.py`.
- Added unit tests for the new stages in `test_clip_transcoding_stage.py` and `test_fixed_stride_extractor_stage.py`.
Signed-off-by: Ao Tang <[email protected]>
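As a rough illustration of what fixed-stride clip extraction and iterable grouping involve, here is a minimal sketch. The function names `fixed_stride_spans` and `split_by_chunk_size`, along with all of their parameters, are assumptions made for this example; they are not taken from `FixedStrideExtractorStage` or `grouping.py`.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def fixed_stride_spans(
    duration_s: float,
    clip_len_s: float,
    stride_s: float,
    min_len_s: float = 0.0,
) -> List[Tuple[float, float]]:
    """Return (start, end) spans in seconds, one every `stride_s` seconds.

    Hypothetical helper: the last span is truncated at the end of the video
    and dropped if it is shorter than `min_len_s`.
    """
    if clip_len_s <= 0 or stride_s <= 0:
        raise ValueError("clip_len_s and stride_s must be positive")
    spans = []
    start = 0.0
    while start < duration_s:
        end = min(start + clip_len_s, duration_s)
        if end - start >= min_len_s:
            spans.append((start, end))
        start += stride_s
    return spans


def split_by_chunk_size(items: Iterable[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `chunk_size` items (hypothetical helper)."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    spans = fixed_stride_spans(duration_s=25.0, clip_len_s=10.0, stride_s=10.0)
    for group in split_by_chunk_size(spans, chunk_size=2):
        print(group)
```

Grouping spans into chunks like this is one way to hand a bounded number of clips to each downstream transcoding call; the real stages may batch differently.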
Signed-off-by: Ao Tang <[email protected]>
…adStage to accept VideoTask. Enhance video reading capabilities with new tests for VideoReaderStage. Signed-off-by: Ao Tang <[email protected]>
…_read_example to include verbose argument. Signed-off-by: Ao Tang <[email protected]>
… additional metadata fields. Signed-off-by: Ao Tang <[email protected]>
Signed-off-by: Ao Tang <[email protected]>
- Introduced a new test package for tasks with an initial test suite for the video tasks module, including tests for the Clip, ClipStats, Video, VideoMetadata, and VideoTask classes.
- Implemented various test cases to validate initialization, property calculations, metadata extraction, and size calculations.
This enhances the testing coverage for video-related functionalities in the ray-curator project.
Signed-off-by: Ao Tang <[email protected]>
- Expanded the test suite for the video tasks module by adding new test cases for the Clip, ClipStats, Video, VideoMetadata, and VideoTask classes.
- Improved coverage for various functionalities including initialization, property calculations, and metadata extraction.
This update strengthens the reliability of video-related features in the ray-curator project.
Signed-off-by: Ao Tang <[email protected]>
…ay-video-clip-extraction Signed-off-by: Ao Tang <[email protected]>
Signed-off-by: Ao Tang <[email protected]>
…age integration Signed-off-by: Ao Tang <[email protected]>
Force-pushed from 0b0fcb8 to 1adb5d5
Signed-off-by: Ao Tang <[email protected]>
- Introduced `ClipWriterStage` for writing clips and metadata during video processing.
- Updated `video_split_clip_example.py` to include the new stage, allowing for clip writing functionality.
- Enhanced command-line argument parsing for output clip path.
- Added utility functions for managing storage paths and writing data in various formats.
- Implemented unit tests for `ClipWriterStage` to ensure functionality and reliability.
Signed-off-by: Ao Tang <[email protected]>
Force-pushed from 1adb5d5 to b903c6f
- Improved `ClipWriterStage` to support writing additional metadata during video processing.
- Updated related utility functions to accommodate new metadata fields.
- Refined unit tests to cover the new functionality and ensure reliability.
Signed-off-by: Ao Tang <[email protected]>
* add documentfilter implementation Signed-off-by: Sarah Yurick <[email protected]>
* fix nits and ruff Signed-off-by: Sarah Yurick <[email protected]>
* add additional logic for setup, setup_on_node, and process_batch Signed-off-by: Sarah Yurick <[email protected]>
* add pytests Signed-off-by: Sarah Yurick <[email protected]>
* add dep Signed-off-by: Sarah Yurick <[email protected]>
* more dep edits Signed-off-by: Sarah Yurick <[email protected]>
* another dep Signed-off-by: Sarah Yurick <[email protected]>
* add fasttext dep Signed-off-by: Sarah Yurick <[email protected]>
* add jieba and mecab Signed-off-by: Sarah Yurick <[email protected]>
* add default None params for setup_on_node and setup functions Signed-off-by: Sarah Yurick <[email protected]>
* add praateek's suggestions Signed-off-by: Sarah Yurick <[email protected]>
* organize imports Signed-off-by: Sarah Yurick <[email protected]>
* remove process_batch Signed-off-by: Sarah Yurick <[email protected]>
* add _metadata to result Signed-off-by: Sarah Yurick <[email protected]>
* add praateek's suggestions Signed-off-by: Sarah Yurick <[email protected]>
* ruff and post init for _name Signed-off-by: Sarah Yurick <[email protected]>
* modify test Signed-off-by: Sarah Yurick <[email protected]>
---------
Signed-off-by: Sarah Yurick <[email protected]>
* copy over Signed-off-by: Praateek <[email protected]>
* copy over Signed-off-by: Praateek <[email protected]>
* add init to download Signed-off-by: Praateek <[email protected]>
* move justext Signed-off-by: Praateek <[email protected]>
* move resiliparse Signed-off-by: Praateek <[email protected]>
* move trafilatura Signed-off-by: Praateek <[email protected]>
* move get_stop_list_dict Signed-off-by: Praateek <[email protected]>
* move download_utils.py to utils/download_utils.py Signed-off-by: Praateek <[email protected]>
* move out to download.py Signed-off-by: Praateek <[email protected]>
* move WarcIterator to warc_reader.py Signed-off-by: Praateek <[email protected]>
* move CommonCrawlWARCExtractor to html_extractor Signed-off-by: Praateek <[email protected]>
* remove commoncrawl.py Signed-off-by: Praateek <[email protected]>
* create url_generation.py from download_utils Signed-off-by: Praateek <[email protected]>
* tests dir Signed-off-by: Praateek <[email protected]>
* copy over test_download.py as test_common_crawl.py Signed-off-by: Praateek <[email protected]>
* add html_extractors/__init__ Signed-off-by: Praateek <[email protected]>
* move html_extractor to ProcessingStage Signed-off-by: Praateek <[email protected]>
* update WarcReader to use ProcessingStage Signed-off-by: Praateek <[email protected]>
* move to classes for url generation Signed-off-by: Praateek <[email protected]>
* typo in name Signed-off-by: Praateek <[email protected]>
* bug fixes in justext; rename resiliparse func; utils modular Signed-off-by: Praateek <[email protected]>
* init file in for download/text Signed-off-by: Praateek <[email protected]>
* justext minor change Signed-off-by: Praateek <[email protected]>
* support str in htmlextractor Signed-off-by: Praateek <[email protected]>
* add a working example Signed-off-by: Praateek <[email protected]>
* set source_files so that write can be hashed Signed-off-by: Praateek <[email protected]>
* use pprint in example Signed-off-by: Praateek <[email protected]>
* update comment Signed-off-by: Praateek <[email protected]>
* all tests migrated + work Signed-off-by: Praateek <[email protected]>
* update defaults in example; comments in stage Signed-off-by: Praateek <[email protected]>
* add tests for url generation + PR review Signed-off-by: Praateek <[email protected]>
* update download for aws Signed-off-by: Praateek <[email protected]>
* rename aws to use_aws_to_donwload Signed-off-by: Praateek <[email protected]>
* update resources Signed-off-by: Praateek <[email protected]>
* change url generation to have ray-stage-spec Signed-off-by: Praateek <[email protected]>
* make download fault tolerant Signed-off-by: Praateek <[email protected]>
* refactor as per pr reviews; with tests Signed-off-by: Praateek <[email protected]>
* add readme Signed-off-by: Praateek <[email protected]>
* bug fix; update tests Signed-off-by: Praateek <[email protected]>
* update record limit to None Signed-off-by: Praateek <[email protected]>
* bug fixes Signed-off-by: Praateek <[email protected]>
* pr comments Signed-off-by: Praateek <[email protected]>
* add back test html extractor implementations Signed-off-by: Praateek <[email protected]>
* remove cc example Signed-off-by: Praateek <[email protected]>
* add column utils Signed-off-by: Praateek <[email protected]>
* add todos Signed-off-by: Praateek <[email protected]>
* Add Wikipedia download and extract stage
  This commit introduces a comprehensive pipeline for downloading and processing Wikipedia dump files within the ray-curator framework. Key components include:
  - **WikipediaUrlGenerator**: Generates URLs for Wikipedia dump files.
  - **WikipediaDownloader**: Downloads .bz2 dump files using wget.
  - **WikipediaIterator**: Parses Wikipedia XML dumps and extracts article content.
  - **WikipediaExtractor**: Cleans Wikipedia markup and extracts meaningful text.
  Additionally, an example script demonstrating the usage of the new stage is included, along with tests for each component to ensure functionality and reliability. Documentation for the new stage is also provided to guide users in implementation and usage.
  Signed-off-by: Abhinav Garg <[email protected]>
* merge from main Signed-off-by: Praateek <[email protected]>
* move deps to text Signed-off-by: Praateek <[email protected]>
* update dev Signed-off-by: Praateek <[email protected]>
* update pyproject and test.yml Signed-off-by: Praateek <[email protected]>
* remove cugraph extra pyproject Signed-off-by: Praateek <[email protected]>
* move text to optional deps Signed-off-by: Praateek <[email protected]>
* Refactor pyproject.toml: Remove unused dependencies and clean up dev section Signed-off-by: Abhinav Garg <[email protected]>
* Remove unused Wikipedia example and related README documentation from the download text stages. Signed-off-by: Abhinav Garg <[email protected]>
* Add method to fetch JSON dump data for Wikipedia and refactor dump date retrieval logic
  - Introduced `_get_data_for_dump` method to handle fetching and parsing JSON dump data.
  - Refactored logic in `_get_wikipedia_urls` to iterate through available dumps and check their status.
  - Improved error handling for cases where dump data cannot be loaded or is not finished.
  Signed-off-by: Abhinav Garg <[email protected]>
* Add README for custom download pipelines and remove Wikipedia stage documentation
  - Introduced a new README.md file detailing the structure and implementation of custom download pipelines.
  - Removed the outdated README.md for the Wikipedia download and extract stage to streamline documentation.
  Signed-off-by: Abhinav Garg <[email protected]>
* Add num_workers_per_node method to DocumentDownloader and WikipediaDownloader
  - Implemented num_workers_per_node method in DocumentDownloader to define the number of workers per node for downloading tasks.
  - Overridden num_workers_per_node in WikipediaDownloader to return a fixed value of 1.
  - Updated xenna_stage_spec method in DocumentDownloadStage to include the number of workers per node.
  Signed-off-by: Abhinav Garg <[email protected]>
* Update WikipediaDownloader to use 2 workers and change logging level in WikipediaIterator
  - Modified num_workers_per_node in WikipediaDownloader to return 2, allowing for increased parallelism during downloads.
  - Changed logging from info to debug level in WikipediaIterator for extracted articles to reduce log verbosity.
  Signed-off-by: Abhinav Garg <[email protected]>
---------
Signed-off-by: Praateek <[email protected]>
Signed-off-by: Abhinav Garg <[email protected]>
Co-authored-by: Praateek <[email protected]>
Signed-off-by: Ao Tang <[email protected]>
Signed-off-by: Ao Tang <[email protected]>
* Refactor Wikipedia extraction and URL generation logic
  - Removed redundant return statement in `WikipediaExtractor` class.
  - Simplified status check for dump data in `WikipediaUrlGenerator` by directly accessing the dictionary keys.
  - Updated logging level in tests to ensure accurate assertions on log calls.
  - Enhanced test cases for URL generation to cover various dump statuses.
  These changes improve code clarity and maintainability while ensuring robust error handling in the Wikipedia download and extraction process.
  Signed-off-by: Abhinav Garg <[email protected]>
* Add mwparserfromhell dependency to pyproject.toml
  - Included `mwparserfromhell==0.6.5` in the text dependencies section of `pyproject.toml` to support parsing Wikipedia markup.
  This addition enhances the functionality of the project by ensuring the necessary tools for processing Wikipedia data are available.
  Signed-off-by: [Your Name] <[email protected]>
  Signed-off-by: Abhinav Garg <[email protected]>
---------
Signed-off-by: Abhinav Garg <[email protected]>
Signed-off-by: [Your Name] <[email protected]>
@dataclass
class ClipWriterStage(ProcessingStage[VideoTask, VideoTask]):
Is there an alternative design for this, if possible? This looks very manual and can only work for this specific task. Is there any way we can make this more generic?
E.g., for text, if you add a new field, it just shows up in the output parquet, but here we have to do everything manually. Basically, can we reduce the lines in this code by, say, 2x?
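To make the request above concrete, here is a minimal sketch of a writer that serializes whatever fields a clip record declares, so a newly added field shows up in the output without any writer changes. `DemoClip`, `clip_metadata`, `write_metadata`, and the `metas/<uuid>.json` layout are all hypothetical stand-ins for illustration; none of them is the project's `Clip` class, `ClipWriterStage`, or the `GenericClipWriter` mentioned in this thread.

```python
import json
from dataclasses import dataclass, fields, is_dataclass
from pathlib import Path
from typing import Any, Dict, FrozenSet, Optional


@dataclass
class DemoClip:
    """Hypothetical clip record used only for this sketch."""

    uuid: str
    span: tuple
    buffer: Optional[bytes] = None  # encoded clip bytes, not JSON-serializable
    caption: Optional[str] = None


def clip_metadata(
    clip: Any, skip: FrozenSet[str] = frozenset({"buffer"})
) -> Dict[str, Any]:
    """Collect every declared dataclass field except the skipped binary ones."""
    if not is_dataclass(clip) or isinstance(clip, type):
        raise TypeError("expected a dataclass instance")
    return {f.name: getattr(clip, f.name) for f in fields(clip) if f.name not in skip}


def write_metadata(clip: Any, output_dir: Path) -> Path:
    """Write the clip's metadata to <output_dir>/metas/<uuid>.json (assumed layout)."""
    path = output_dir / "metas" / f"{clip.uuid}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(clip_metadata(clip)))
    return path
```

Because the field list is read from the dataclass itself, a subclass or later revision that adds, for example, a score field would be written out with no extra code. Nested non-JSON types would still need explicit handling.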
I have a new implementation of the writer, GenericClipWriter, but I am a bit hesitant to replace the current cosmos_curate implementation with it. Let's remove the current one in a later PR, once we have verified that all the stage outputs (captions/embeddings) can be written successfully the new way.
Looks good, although it is too verbose since this comes from cosmos curator.
…DownloadStage
- Replaced separate VideoReaderStage and VideoDownloadStage with a new VideoReaderDownloadStage that combines both functionalities.
- Updated the video_read_example to utilize the new composite stage.
- Adjusted inputs and outputs for VideoDownloadStage to reflect changes in the pipeline.
- Added tests for the new VideoReaderDownloadStage to ensure proper functionality and integration.
This refactor simplifies the video reading and downloading process within the ray-curator framework.
Signed-off-by: Ao Tang <[email protected]>
Signed-off-by: Ao Tang <[email protected]>
…ntegrate new functionalities
- Replaced separate VideoReaderStage and VideoDownloadStage with a composite VideoReaderDownloadStage, streamlining the video reading and downloading process.
- Updated ClipTranscodingStage to improve GPU resource allocation and added detailed arguments for better configurability.
- Adjusted tests to reflect changes in resource management, ensuring accurate assertions on GPU usage.
These changes improve the clarity and efficiency of video processing within the ray-curator framework.
Signed-off-by: Ao Tang <[email protected]>
- Introduced MockGpuInfo and MockGpuResources classes to simulate GPU information and resources for testing.
- Updated test_resources_gpu_encoder and test_resources_hwaccel_enabled methods to utilize mocks, ensuring accurate resource assertions without dependency on actual GPU hardware.
- Enhanced test_different_encoder_configurations to validate resource requirements for various encoder configurations, including GPU settings.
These changes improve the robustness of the ClipTranscodingStage tests by isolating them from hardware dependencies, facilitating easier testing and validation.
Signed-off-by: [Your Name] <[email protected]>
Signed-off-by: Ao Tang <[email protected]>
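The idea of isolating resource tests from real GPU hardware can be sketched as follows. This is only an illustration under stated assumptions: the fields of `MockGpuInfo`, the `Resources` class, the `transcoding_resources` function, its `probe_gpus` parameter, and the 0.25 GPU fraction are all invented for this example and do not reflect the actual `ClipTranscodingStage` code or its tests.

```python
from dataclasses import dataclass
from typing import Callable, List
from unittest import mock


@dataclass
class MockGpuInfo:
    """Stand-in for GPU information; the fields here are assumptions."""

    index: int = 0
    name: str = "Mock GPU"


@dataclass
class Resources:
    """Hypothetical resource request for a stage."""

    cpus: float = 1.0
    gpus: float = 0.0


def transcoding_resources(
    encoder: str,
    probe_gpus: Callable[[], List[MockGpuInfo]],
    gpu_fraction: float = 0.25,
) -> Resources:
    """Request a GPU share only for a GPU encoder when a GPU is reported."""
    if encoder == "h264_nvenc" and probe_gpus():
        return Resources(cpus=1.0, gpus=gpu_fraction)
    return Resources(cpus=1.0)


def test_gpu_encoder_requests_gpu() -> None:
    # The probe is a mock, so no real GPU is needed to run this test.
    probe = mock.Mock(return_value=[MockGpuInfo()])
    resources = transcoding_resources("h264_nvenc", probe)
    assert resources.gpus == 0.25
    probe.assert_called_once()


def test_cpu_encoder_needs_no_gpu() -> None:
    probe = mock.Mock(return_value=[MockGpuInfo()])
    assert transcoding_resources("libopenh264", probe).gpus == 0.0
    probe.assert_not_called()
```

Passing the hardware probe in as an argument is one design choice for testability; patching a module-level probe with `unittest.mock.patch` would be an equally plausible route.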
…writer Signed-off-by: Ao Tang <[email protected]>
…uired output path argument
- Introduced a new GenericClipWriterStage for writing video clips and their metadata, consolidating the writing process and improving resource management.
- Updated the video_split_clip_example to require an output clip path, ensuring that users specify where to save the generated clips.
- The new stage supports parallel writing of clips and metadata, enhancing performance and flexibility in video processing workflows.
Signed-off-by: Ao Tang <[email protected]>
Signed-off-by: [Your Name] <[email protected]> Signed-off-by: Ao Tang <[email protected]>
Signed-off-by: Ao Tang <[email protected]>
Force-pushed from 29a106c to 012a2e1
Somehow this PR got messed up by a force push. Can we correct this?
This is actually because it contains commits from the first merged PR (reader). I did the rebase for the …
- Introduced `ClipWriterStage` for writing clips and metadata during video processing.
- Updated `video_split_clip_example.py` to include the new stage, allowing for clip writing functionality.
- Implemented unit tests for `ClipWriterStage` to ensure functionality and reliability.
Description
Usage
# Add snippet demonstrating usage
Checklist