MCO ETL Proof of Concept

This application is an ETL pipeline for MCO, implementing a data integration solution using Airbyte. The project serves as a proof of concept to evaluate Airbyte's capabilities for ingesting CSV data and loading it into an Oracle database.

Project Objectives

Evaluate modern ELT architecture for MCO's data integration needs
Implement automated data ingestion from CSV sources using Airbyte
Establish reliable data transformation pipeline with dbt
Create proof of concept for Oracle database integration
Provide comprehensive documentation and evaluation framework

Background: Modern ELT Architecture

Modern data pipelines often follow the Extract-Load-Transform (ELT) pattern, which differs from traditional ETL in several key aspects:

Data Loading: Raw data is loaded directly into the target system before transformation, allowing for:
- Preservation of source data integrity
- Ability to reprocess data with different transformation logic
- Reduced complexity in the extraction phase
Transformation: Transformations occur within the target database, leveraging:
- Modern data warehouse computing power
- SQL-based transformations for better accessibility
- Version control and testing of transformation logic
Modularity: Each component is specialized and replaceable:
- Extraction tools focus on source system connectivity
- Loading ensures reliable data ingestion
- Transformation tools handle data modeling and business logic

Technology Choices

Our stack leverages best-in-class open-source tools for each component of the ELT pipeline:

Airbyte (Extract & Load)

Open-source data integration platform
Rich ecosystem of pre-built connectors
Handles incremental syncs and schema evolution
Strong community support and regular updates

Dagster (Orchestration)

Modern orchestration framework with rich UI
Asset-based paradigm for data dependencies
Built-in testing and monitoring capabilities
Excellent integration with dbt and other tools

dbt (Transform)

Industry standard for data transformation
Version controlled SQL transformations
Built-in testing and documentation
Modular approach to data modeling

PostgreSQL (Temporary Storage)

Robust and well-understood database
Compatible with both Airbyte and dbt
Serving as a temporary replacement for Oracle
Proven performance for analytical workloads

Superset (Visualization)

Feature-rich BI platform
Self-service analytics capabilities
Modern, interactive dashboards
Strong SQL-first approach

Stack

Airbyte (Extract & Load)
Dagster (Orchestration)
dbt (Transform)
PostgreSQL (Temporary storage, to be replaced with Oracle)
Superset (Visualization)

Architecture

Setup

Clone the repository and navigate to the project directory
Make the setup script executable and run it:
```
chmod +x setup.sh
./setup.sh
```
Note: Initial build may take several minutes. Monitor the console for service readiness.
Once ready, access the following interfaces:
- Dagster UI: http://localhost:3000
- Airbyte UI: http://localhost:8000
- Superset: http://localhost:8088

Development Workflow

Data Source Configuration
- Configure CSV source connectors in Airbyte
- Set up destination database connection
- Define sync schedules and options
Transformation Development
- Write and test dbt models
- Define data quality tests
- Document transformation logic
Pipeline Orchestration
- Configure Dagster workflows
- Set up dependencies between tasks
- Monitor pipeline execution
Data Visualization
- Create Superset dashboards
- Set up automated reporting
- Configure user access

Monitoring and Maintenance

Monitor pipeline health through Dagster UI
Check data quality test results in dbt
Review Airbyte sync logs
Monitor system resource usage

System Requirements

Docker
Minimum 32GB RAM (Airbyte and dbt operations are memory-intensive)
Multi-core CPU recommended (8+ cores for optimal performance)
50GB free disk space for Docker images and data storage

Future Considerations

Production Deployment
- High availability setup
- Backup and recovery procedures
- Monitoring and alerting
- Security hardening
Oracle Integration
- Migration from PostgreSQL
- Performance optimization
- Security configuration
- Backup strategies
Scale Considerations
- Horizontal scaling options
- Performance optimization
- Resource management
- Cost optimization

Credits

This project serves as a proof of concept for MCO's ETL requirements.

Inspired by:

Name		Name	Last commit message	Last commit date
Latest commit History 45 Commits
dagster		dagster
dbt		dbt
generator		generator
resources		resources
shared		shared
superset		superset
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml
setup.sh		setup.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MCO ETL Proof of Concept

Project Objectives

Background: Modern ELT Architecture

Technology Choices

Airbyte (Extract & Load)

Dagster (Orchestration)

dbt (Transform)

PostgreSQL (Temporary Storage)

Superset (Visualization)

Stack

Architecture

Setup

Development Workflow

Monitoring and Maintenance

System Requirements

Future Considerations

Credits

About

Releases

Packages

Contributors 4

Languages

License

andrashejj/ETL

Folders and files

Latest commit

History

Repository files navigation

MCO ETL Proof of Concept

Project Objectives

Background: Modern ELT Architecture

Technology Choices

Airbyte (Extract & Load)

Dagster (Orchestration)

dbt (Transform)

PostgreSQL (Temporary Storage)

Superset (Visualization)

Stack

Architecture

Setup

Development Workflow

Monitoring and Maintenance

System Requirements

Future Considerations

Credits

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 4

Languages

Packages