From 27c854da2d336f918d257f4c69712ef011e92a14 Mon Sep 17 00:00:00 2001
From: Aanchal Goyal
Date: Mon, 2 Dec 2024 13:15:18 +0530
Subject: [PATCH 1/4] Updated Resources webpage with latest talks and links

Signed-off-by: Aanchal Goyal
---
 resources.md | 34 +++++++++++++++++++++++++++++-----
 1 file changed, 29 insertions(+), 5 deletions(-)

diff --git a/resources.md b/resources.md
index 4f5657a02..fb413e38b 100644
--- a/resources.md
+++ b/resources.md
@@ -1,3 +1,8 @@
+# New Features & Enhancements
+
+- Support for Docling 2.0 added to DPK in [pdf2parquet](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/pdf2parquet/python) transform. The new updates allow DPK users to ingest other type of documents, e.g. MS Word, MS Powerpoint, Images, Markdown, Asciidocs, etc.
+- Released [Web2parquet](https://github.com/IBM/data-prep-kit/tree/dev/transforms/universal/web2parquet) transform for crawling the web.
+
 # Data Prep Kit Resources
 
 ## 📄 Papers
@@ -7,24 +12,43 @@
 3. [Scaling Granite Code Models to 128K Context](https://arxiv.org/abs/2407.13739)
 
-## 🎤 Talks
+## 🎤 External Events and Showcase
 
 1. **"Building Successful LLM Apps: The Power of high quality data"** - [Video](https://www.youtube.com/watch?v=u_2uiZBBVIE) | [Slides](https://www.slideshare.net/slideshow/data_prep_techniques_challenges_methods-pdf-a190/271527890)
 2. **"Hands on session for fine tuning LLMs"** - [Video](https://www.youtube.com/watch?v=VEHIA3E64DM)
 3. **"Build your own data preparation module using data-prep-kit"** - [Video](https://www.youtube.com/watch?v=0WUMG6HIgMg)
 4. **"Data Prep Kit: A Comprehensive Cloud-Native Toolkit for Scalable Data Preparation in GenAI App"** - [Video](https://www.youtube.com/watch?v=WJ147TGULwo) | [Slides](https://ossaidevjapan24.sched.com/event/1jKBm)
+5. **"RAG with Data Prep Kit" Workshop** @ Mountain View, CA, USA ** - [info](https://github.com/sujee/data-prep-kit-examples/blob/main/events/2024-09-21__RAG-workshop-data-riders.md)
+6. **Tech Educator summit** [IBM CSR Event](https://www.linkedin.com/posts/aanchalaggarwal_github-ibmdata-prep-kit-open-source-project-activity-7254062098295472128-OA_x?utm_source=share&utm_medium=member_desktop)
+7. **Talk and Hands on session** at [MIT Bangalore](https://www.linkedin.com/posts/saptha-surendran-71a4a0ab_ibmresearch-dataprepkit-llms-activity-7261987741087801346-h0no?utm_source=share&utm_medium=member_desktop)
+8. **PyData NYC 2024** - [90 mins Tutorial](https://nyc2024.pydata.org/cfp/talk/AWLTZP/)
+9. **Open Source AI** [Demo Night](https://lu.ma/oss-ai?tk=A8BgIt)
+10. [**Data Exchange Podcast with Ben Lorica**](https://thedataexchange.media/ibm-data-prep-kit/)
+11. Unstructured Data Meetup - SF, NYC, Silicon Valley
+12. IBM TechXchange Las Vegas
+13. Open Source [**RAG Pipeline workshop**](https://www.linkedin.com/posts/sujeemaniyam_dataprepkit-workshop-llm-activity-7256176802383986688-2UKc?utm_source=share&utm_medium=member_desktop) with Data Prep Kit at TechEquity's AI Summit in Silicon Valley
+14. **Data Science Dojo Meetup** - [video](https://datasciencedojo.com/tutorial/data-preparation-toolkit/)
+15. [**DPK tutorial and hands on session at IIIT Delhi**](https://www.linkedin.com/posts/cai-iiitd-97a6a4232_datascience-datapipelines-machinel[%E2%80%A6]65125349376-FG8E/?utm_source=share&utm_medium=member_desktop)
+
 ## Example Code
 
+Find example code in readme section of each tranform and some sample jupyter notebooks for getting started [**here**](examples/notebooks)
 
 ## Blogs / Tutorials
 - [**IBM Developer Blog**](https://developer.ibm.com/blogs/awb-unleash-potential-llms-data-prep-kit/)
+- [**Introductory Blog on DPK**](https://www.linkedin.com/pulse/unleashing-potential-large-language-models-through-data-aanchal-goyal-fgtff)
+- [**DPK Header Cleanser Module Blog by external contributor**](https://www.linkedin.com/pulse/enhancing-data-quality-developing-header-cleansing-tool-kalathiya-i1ohc/?trackingId=6iAeBkBBRrOLijg3LTzIGA%3D%3D)
 
-## Workshops
-- **2024-09-21: "RAG with Data Prep Kit" Workshop** @ Mountain View, CA, USA - [info](https://github.com/sujee/data-prep-kit-examples/blob/main/events/2024-09-21__RAG-workshop-data-riders.md)
-
-## Discord
+# Relevant online communities
 
 - [**Data Prep Kit Discord Channel**](https://discord.com/channels/1276554812359442504/1286046139921207476)
+- [**DPK is now listed in Github Awesome-LLM under LLM Data section**](https://github.com/Hannibal046/Awesome-LLM)
+- [**DPK is now up for access via IBM Skills Build Download**](https://academic.ibm.com/a2mt/downloads/artificial_intelligence#/)
+- [**DPK added to the Application Hub of “AI Sustainability Catalog”**](https://enterprise-neurosystem.github.io/Sustainability-Catalog/)
+
+## We Want Your Feedback!
+ Feel free to contribute to discussions or create a new one to share your [feedback](https://github.com/IBM/data-prep-kit/discussions)
+

From 66efc18dd22327ab749bdd74dbb8aba85d93a8b3 Mon Sep 17 00:00:00 2001
From: Aanchal Goyal
Date: Tue, 3 Dec 2024 11:03:17 +0530
Subject: [PATCH 2/4] Updated links to 15 and Discord

Signed-off-by: Aanchal Goyal
---
 resources.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/resources.md b/resources.md
index fb413e38b..9263dd183 100644
--- a/resources.md
+++ b/resources.md
@@ -28,7 +28,7 @@
 12. IBM TechXchange Las Vegas
 13. Open Source [**RAG Pipeline workshop**](https://www.linkedin.com/posts/sujeemaniyam_dataprepkit-workshop-llm-activity-7256176802383986688-2UKc?utm_source=share&utm_medium=member_desktop) with Data Prep Kit at TechEquity's AI Summit in Silicon Valley
 14. **Data Science Dojo Meetup** - [video](https://datasciencedojo.com/tutorial/data-preparation-toolkit/)
-15. [**DPK tutorial and hands on session at IIIT Delhi**](https://www.linkedin.com/posts/cai-iiitd-97a6a4232_datascience-datapipelines-machinel[%E2%80%A6]65125349376-FG8E/?utm_source=share&utm_medium=member_desktop)
+15. DPK tutorial and hands on session at IIIT Delhi
 
 ## Example Code
 
@@ -43,7 +43,7 @@ Find example code in readme section of each tranform and some sample jupyter not
 
 # Relevant online communities
 
-- [**Data Prep Kit Discord Channel**](https://discord.com/channels/1276554812359442504/1286046139921207476)
+- [**Data Prep Kit Discord Channel**](https://discord.com/channels/1276554812359442504/1303454647427661866)
 - [**DPK is now listed in Github Awesome-LLM under LLM Data section**](https://github.com/Hannibal046/Awesome-LLM)
 - [**DPK is now up for access via IBM Skills Build Download**](https://academic.ibm.com/a2mt/downloads/artificial_intelligence#/)
 - [**DPK added to the Application Hub of “AI Sustainability Catalog”**](https://enterprise-neurosystem.github.io/Sustainability-Catalog/)

From da2c6c1a6d3b4e3b0b1db6d87646ec9e9f2ebdac Mon Sep 17 00:00:00 2001
From: Aanchal Goyal
Date: Tue, 3 Dec 2024 15:47:37 +0530
Subject: [PATCH 3/4] Added working link for 15

Signed-off-by: Aanchal Goyal
---
 resources.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/resources.md b/resources.md
index 9263dd183..9a011c3f0 100644
--- a/resources.md
+++ b/resources.md
@@ -28,7 +28,7 @@
 12. IBM TechXchange Las Vegas
 13. Open Source [**RAG Pipeline workshop**](https://www.linkedin.com/posts/sujeemaniyam_dataprepkit-workshop-llm-activity-7256176802383986688-2UKc?utm_source=share&utm_medium=member_desktop) with Data Prep Kit at TechEquity's AI Summit in Silicon Valley
 14. **Data Science Dojo Meetup** - [video](https://datasciencedojo.com/tutorial/data-preparation-toolkit/)
-15. DPK tutorial and hands on session at IIIT Delhi
+15. [**DPK tutorial and hands on session at IIIT Delhi**](https://www.linkedin.com/posts/cai-iiitd-97a6a4232_datascience-datapipelines-machinel[…]565125349376-FG8E?utm_source=share&utm_medium=member_desktop)
 
 ## Example Code
 

From 9fc6d5bbe4f2a11ba417e32d6d57119cbab45b97 Mon Sep 17 00:00:00 2001
From: Aanchal Goyal
Date: Tue, 3 Dec 2024 18:49:29 +0530
Subject: [PATCH 4/4] Modified link for bullet 15

Signed-off-by: Aanchal Goyal
---
 resources.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/resources.md b/resources.md
index 9a011c3f0..3164f5ce3 100644
--- a/resources.md
+++ b/resources.md
@@ -28,7 +28,7 @@
 12. IBM TechXchange Las Vegas
 13. Open Source [**RAG Pipeline workshop**](https://www.linkedin.com/posts/sujeemaniyam_dataprepkit-workshop-llm-activity-7256176802383986688-2UKc?utm_source=share&utm_medium=member_desktop) with Data Prep Kit at TechEquity's AI Summit in Silicon Valley
 14. **Data Science Dojo Meetup** - [video](https://datasciencedojo.com/tutorial/data-preparation-toolkit/)
-15. [**DPK tutorial and hands on session at IIIT Delhi**](https://www.linkedin.com/posts/cai-iiitd-97a6a4232_datascience-datapipelines-machinel[…]565125349376-FG8E?utm_source=share&utm_medium=member_desktop)
+15. [**DPK tutorial and hands on session at IIIT Delhi**](https://www.linkedin.com/posts/cai-iiitd-97a6a4232_datascience-datapipelines-machinelearning-activity-7263121565125349376-FG8E?utm_source=share&utm_medium=member_desktop)
 
 ## Example Code
 