In Data Preparation for AI and Analytics you’ll:
- Understand the importance of data quality
- Use AI to clean and prepare data
- Take advantage of Python and visual tools like Alteryx
- Apply the right data preparation technique for the right outcome
The Art of Data Alchemy is for anyone who works with data, from seasoned data architects to marketing pros and business analysts. It presents data preparation methods with clear language and concrete examples. You’ll explore tried-and-true approaches along with emerging generative AI techniques. You’ll especially appreciate the insights into automation and data governance.
The Art of Data Alchemy teaches you to tackle the challenges you’ll face as you work with data. You’ll master popular data wrangling tools like Python and Alteryx. Complex data prep concepts are broken down into clear, manageable steps and fully illustrated with engaging data sets—including data on the Titanic disaster, rating video games, sentiment analysis of Los Angeles restaurant recommendations, and more. The book is packed with vital advice for complex tasks, including merging multiple data sets, alerting systems for data quality, and scaling data preparation into large cloud-based pipelines. Learn universal techniques for data enrichment and transformation, and specialized approaches optimized for machine learning, analytics, and creating AI.
For data workers of all skill levels who know Python and the basics of SQL.
Benoît Cayla is a computer engineer with over 25 years of data management experience and an expert in data management and AI. Throughout his career, he has had the privilege of working with major players like IBM, Informatica, and Tableau, contributing to large-scale projects in manufacturing, insurance, and finance.
Install and configure your environment
Some datasets have been modified from their original versions for compatibility with the provided code examples. To ensure the code works as intended, use the modified datasets, which the code examples already reference. Links to the original datasets are also included for reference and additional context.
N.A.
Warning: This chapter requires several specific Python and system libraries to be installed beforehand. Please follow the procedure here.
- Top restaurants in LA (2023)
- BBC News
- Folder Images
- Titanic disaster
Note: This chapter utilizes Google AI's capabilities (specifically, Gemini) because it offers a free-to-use LLM (Large Language Model). To ensure a smooth setup, follow the environment preparation instructions provided here.
Note: In this chapter, we’ll use Alteryx v2024.1.1.93 Patch:3 to demonstrate how to leverage a visual data preparation solution. To get started, you’ll need to install the Alteryx client. The installation procedure is described here.
The Alteryx exports (yxmd files) can be found here. Copy the files to your desktop and open them with the Alteryx client.
Note: In this chapter, we’ll use Databricks Community Edition to illustrate how to manage a dataset in a distributed environment (Spark).
Most of the datasets used in this book have already been profiled. The outcomes can be found here.
