This repository was archived by the owner on Oct 12, 2023. It is now read-only.

Commit 2399c09

Initial commit
0 parents  commit 2399c09

19 files changed: +924 -0 lines changed

.editorconfig

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
root = true

[*]
end_of_line = lf
charset = utf-8
indent_style = tab
indent_size = 4
insert_final_newline = true

.github/.templateMarker

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
KOLANICH/python_project_boilerplate.py

.github/dependabot.yml

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
version: 2
updates:
  - package-ecosystem: "pip"
    directory: "/"
    schedule:
      interval: "daily"
    allow:
      - dependency-type: "all"

.github/workflows/CI.yml

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
name: CI
on:
  push:
    branches: [master]
  pull_request:
    branches: [master]

jobs:
  build:
    runs-on: ubuntu-22.04
    steps:
      - name: typical python workflow
        uses: KOLANICH-GHActions/typical-python-workflow@master
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}

.gitignore

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
__pycache__
*.pyc
*.pyo
/*.egg-info
/build
/dist
/.eggs

.gitlab-ci.yml

Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
#image: pypy:latest
image: registry.gitlab.com/kolanich/fixed_python:latest
stages:
  - dependencies
  - build
  - test
  - tooling

build:
  tags:
    - shared
  stage: build
  variables:
    GIT_DEPTH: "1"
    PYTHONUSERBASE: ${CI_PROJECT_DIR}/python_user_packages

  before_script:
    - export PYTHON_MODULES_DIR=${PYTHONUSERBASE}/lib/python3.7
    - export EXECUTABLE_DEPENDENCIES_DIR=${PYTHONUSERBASE}/bin
    - export PATH="$PATH:$EXECUTABLE_DEPENDENCIES_DIR" # don't move into `variables` any of them, it is unordered
    - pip3 install --user --pre --upgrade git+https://github.com/berkerpeksag/astor.git git+https://github.com/erikrose/more-itertools.git

  script:
    - mkdir wheels
    - python3 learn/learnWDSeries.py --train --score 10
    - python3 setup.py bdist_wheel
    - coverage run --source=datag tests/test.py || true
    - coverage run --source=datag -m pytest --junitxml=./rspec.xml ./tests/tests.py || true
    - coverage report -m
    - coverage xml
    - ls -l ./dist
    - mv ./dist/*.whl ./wheels/datag-0.CI_python-py3-none-any.whl
    - pip3 install --upgrade --pre --user ./wheels/datag-0.CI_python-py3-none-any.whl

  coverage: /^TOTAL\\s+.+?(\\d{1,3}%)$/

  cache:
    paths:
      - $PYTHONUSERBASE

  artifacts:
    paths:
      - dist
    reports:
      junit: ./rspec.xml
      cobertura: ./coverage.xml

checks:
  stage: tooling
  tags:
    - shared
  image: docker:latest
  variables:
    DOCKER_DRIVER: overlay2
  allow_failure: true
  services:
    - docker:dind
  script:
    - docker run --env SAST_CONFIDENCE_LEVEL=5 --volume "$PWD:/code" --volume /var/run/docker.sock:/var/run/docker.sock "registry.gitlab.com/gitlab-org/security-products/sast:latest" /app/bin/run /code
    #- docker run --env SOURCE_CODE="$PWD" --env CODECLIMATE_VERSION="latest" --volume "$PWD":/code --volume /var/run/docker.sock:/var/run/docker.sock "registry.gitlab.com/gitlab-org/security-products/codequality:latest" /code
    #- docker run --env DEP_SCAN_DISABLE_REMOTE_CHECKS="${DEP_SCAN_DISABLE_REMOTE_CHECKS:-false}" --volume "$PWD:/code" --volume /var/run/docker.sock:/var/run/docker.sock "registry.gitlab.com/gitlab-org/security-products/dependency-scanning:latest" /code

  artifacts:
    reports:
      #codequality: gl-code-quality-report.json
      sast: gl-sast-report.json
      #dependency_scanning: gl-dependency-scanning-report.json

Code_Of_Conduct.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
No codes of conduct!

ReadMe.md

Lines changed: 63 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,63 @@
1+
datag.py [![Unlicensed work](https://raw.githubusercontent.com/unlicense/unlicense.org/master/static/favicon.png)](https://unlicense.org/)
2+
===============
3+
~~![GitLab Build Status](https://gitlab.com/KOLANICH/datag.py/badges/master/pipeline.svg)~~
4+
~~![GitLab Coverage](https://gitlab.com/KOLANICH/datag.py/badges/master/coverage.svg)~~
5+
[![Libraries.io Status](https://img.shields.io/librariesio/github/KOLANICH/datag.py.svg)](https://libraries.io/github/KOLANICH/datag.py)
6+
~~[wheel](https://gitlab.com/KOLANICH/datag.py/-/jobs/artifacts/master/raw/wheels/datag-CI-py3-none-any.whl?job=build)~~
7+
[![Code style: antiflash](https://img.shields.io/badge/code%20style-antiflash-FFF.svg)](https://codeberg.org/KOLANICH-tools/antiflash.py)
8+
9+
This is a data cleansing, standardization and aggregation framework.
10+
11+
Assumme you have a few noisy bad-quality data tables produced by the ones not caring about their quality. These datasets are made just to say "we support open data", but in fact they have multiple issues.
12+
And we need to train a model on this piece of shit. In order to do it we need to make a candy of shit ..

Issues in scope
---------------

| issue | fix |
| --------------------------------------------------------------------------------------------- | ----------------------------------------------------------------- |
| data contains typos, even identifiers meant to uniquely identify stuff contain typos!         | custom function fixing the typo                                   |
| data in different units even for the same column                                              | determine unit for each data in the dataset and validate it       |
| some data is completely junk, for example an atom containing 1000 protons or mass in coulombs | detect junk by incorporating domain knowledge and discard it      |
| column names are semantically incorrect and different datasets use different columns          | rename columns                                                    |
| some columns contain multiple data encoded with some hand-crafted format                      | expand them into different columns, delete the original column    |
| some data field is repeated, but with different values                                        | compute an estimate using the present values or discard the value |
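
A minimal sketch of two fixes from the table, unit validation and junk detection by domain knowledge. All names here (`normalize_mass`, `is_plausible_atom`, `clean`, the record keys) are hypothetical and are not part of datag; the only domain fact used is that no known element has more than 118 protons.

```python
# Hypothetical illustration; none of these names come from datag itself.
MAX_KNOWN_PROTONS = 118  # heaviest confirmed element (oganesson)

# Conversion factors to kilograms for the mass units we expect to meet.
MASS_UNITS_TO_KG = {"kg": 1.0, "g": 1e-3, "mg": 1e-6}


def normalize_mass(value, unit):
    """Convert a mass to kilograms, or return None if the unit is not a mass unit."""
    factor = MASS_UNITS_TO_KG.get(unit)
    if factor is None:
        return None  # e.g. "C" (coulombs) is junk for a mass column
    return value * factor


def is_plausible_atom(record):
    """Reject records whose proton count is physically impossible."""
    protons = record.get("protons")
    return isinstance(protons, int) and 1 <= protons <= MAX_KNOWN_PROTONS


def clean(records):
    """Keep only plausible records and express every mass in kilograms."""
    cleaned = []
    for record in records:
        if not is_plausible_atom(record):
            continue  # junk: discard
        mass_kg = normalize_mass(record["mass"], record["mass_unit"])
        if mass_kg is None:
            continue  # wrong kind of unit: discard
        cleaned.append({"protons": record["protons"], "mass_kg": mass_kg})
    return cleaned
```

Feeding it one good record, one with 1000 protons and one with a mass in coulombs should leave only the first.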


Issues out of scope
-------------------
* Imputation
* (Re)balancing
* encoding
* any stuff doing machine learning (but you can implement it yourself)


Pipeline
--------
* get a formal description of what you want the data to be
    * unit
    * constraints
    *
* for each source:
    * get a raw record from a source
    * apply a transformation
    * apply in-source validation
* do inter-source
    * validation and consistency checks
    * merging and estimation
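
The steps above can be sketched roughly as follows. This is only an illustration under assumptions: sources are plain lists of dicts, the constraint is "value must be positive", the median is used as the estimate, and records whose sources disagree by more than `max_spread` are discarded. None of these names or choices come from datag.

```python
from statistics import median


def transform(raw):
    """Per-source transformation: normalize the identifier, parse the value."""
    return {"id": raw["id"].strip().lower(), "value": float(raw["value"])}


def is_valid(record):
    """In-source validation against a hypothetical constraint: value must be positive."""
    return record["value"] > 0


def run_pipeline(sources, max_spread=0.1):
    # Per-source stage: get raw records, transform them, validate them.
    by_id = {}
    for source in sources:
        for raw in source:
            record = transform(raw)
            if is_valid(record):
                by_id.setdefault(record["id"], []).append(record["value"])

    # Inter-source stage: consistency check, then merging and estimation.
    result = {}
    for identifier, values in by_id.items():
        estimate = median(values)
        spread = (max(values) - min(values)) / estimate
        if spread <= max_spread:  # sources agree well enough
            result[identifier] = estimate
    return result
```

With this sketch, an identifier reported consistently by two sources gets a single estimate, while one whose sources disagree wildly is dropped.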


Task decomposition
------------------
* `Spec` - a way to encode requirements for our data.
* `Record` - just a dict with some additional properties.
* `Source` - gets the records by their identifiers. Has
    * `priority`
    * `spec`
    * `entity`
* `Entity` - a way to discover `Source`s providing us with `Record`s of the same kind. Acts as a namespace and as a final validator. Has
    * `spec`
* `Rule` - transforms the data, detects errors and recovers the missing stuff.
* `Disambiguator` - uses a dictionary for standardization of identifiers.
* `Merger` - combines different datasets into a composite one.
* `Pipeline` - a `Source` of the resulting dataset. Because it is a `Source`, it can be plugged further.
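
One possible shape for these pieces, as a rough sketch only. This commit appears to ship just an empty `datag/__init__.py`, so every signature below is an assumption made for illustration, not datag's API. `Entity` and `Rule` are left out for brevity, and "higher priority wins per field" is just one plausible merging policy.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Spec:
    """Requirements for the data: a constraint per field (hypothetical layout)."""
    constraints: Dict[str, Callable[[object], bool]] = field(default_factory=dict)

    def validate(self, record: dict) -> bool:
        return all(name in record and check(record[name])
                   for name, check in self.constraints.items())


class Record(dict):
    """Just a dict with an additional property: the source it came from."""
    source: Optional["Source"] = None


class Source:
    """Gets records by their identifiers."""

    def __init__(self, data: Dict[str, dict], spec: Spec, priority: int = 0):
        self.data, self.spec, self.priority = data, spec, priority

    def get(self, identifier: str) -> Optional[Record]:
        raw = self.data.get(identifier)
        if raw is None:
            return None
        record = Record(raw)
        record.source = self
        return record if self.spec.validate(record) else None


class Disambiguator:
    """Maps known variants of an identifier to a canonical one via a dictionary."""

    def __init__(self, aliases: Dict[str, str]):
        self.aliases = aliases

    def __call__(self, identifier: str) -> str:
        return self.aliases.get(identifier, identifier)


class Merger:
    """Combines records from several sources; higher priority wins per field."""

    def __call__(self, records: List[Record]) -> Record:
        merged = Record()
        # Apply low-priority records first so later updates override them.
        for record in sorted(records, key=lambda r: r.source.priority):
            merged.update(record)
        return merged


class Pipeline(Source):
    """A Source of the resulting dataset, so it can feed another Pipeline."""

    def __init__(self, sources, disambiguator, merger, spec, priority=0):
        super().__init__({}, spec, priority)
        self.sources, self.disambiguator, self.merger = sources, disambiguator, merger

    def get(self, identifier: str) -> Optional[Record]:
        canonical = self.disambiguator(identifier)
        found = [r for r in (s.get(canonical) for s in self.sources) if r is not None]
        if not found:
            return None
        merged = self.merger(found)
        merged.source = self
        return merged if self.spec.validate(merged) else None
```

Making `Pipeline` a subclass of `Source` is what lets one pipeline be listed among the `sources` of another.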

UNLICENSE

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
This is free and unencumbered software released into the public domain.

Anyone is free to copy, modify, publish, use, compile, sell, or
distribute this software, either in source code form or as a compiled
binary, for any purpose, commercial or non-commercial, and by any
means.

In jurisdictions that recognize copyright laws, the author or authors
of this software dedicate any and all copyright interest in the
software to the public domain. We make this dedication for the benefit
of the public at large and to the detriment of our heirs and
successors. We intend this dedication to be an overt act of
relinquishment in perpetuity of all present and future rights to this
software under copyright law.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.

For more information, please refer to <https://unlicense.org/>

datag/__init__.py

Whitespace-only changes.
