Specification for feature_group_strategies is not working with leave-one-out or leave-one-in #950

Open
ElenaVillano opened this issue Sep 11, 2024 · 1 comment

Comments

@ElenaVillano

Hi everyone,

I'm running triage on Linux [Red Hat 11.3.1-4] with Python 3.10.6, using config version v8. My database is PostgreSQL 15.7 on x86_64-pc-linux-gnu, compiled by GCC (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit.

Configuration details:

config_version: 'v8'

random_seed: 1472385

temporal_config:
    feature_start_time: '2021-11-01'
    feature_end_time: '2022-12-31'

    label_start_time: '2021-11-01'
    label_end_time: '2022-12-31'
      
    model_update_frequency: '1month' # windows

    max_training_histories: '6month' # training period
    training_label_timespans: ['4d'] # time span in which the label can occur
    training_as_of_date_frequencies: '1d' # how often you make the decision

    test_durations: '1week'  # how long you will use that model
    test_label_timespans: ['4d']
    test_as_of_date_frequencies: '1d' 

cohort_config: # Cohort = containers that will arrive at the terminal the next day, based on the ETA
    filepath: 'triage/sql/cohorts/cohorte_antes_de_arribo.sql'       
    name: 'arribo_buque'

label_config:  # Label = whether the container will leave within 2 to 4 days
    filepath: 'triage/sql/labels/label_2_4_dias_estadia.sql'
    name: 'e2_4_dias'

feature_aggregations:
  -
    prefix: 'ecvr' # simple variables
    from_obj: 'ontology.entities' 
    knowledge_date_column: 'fecha_eta'
   
    aggregates_imputation:
        all:
          type: 'mean'

    aggregates:
      - # peso_neto
        quantity: 'peso_neto'
        metrics:
          - 'max'
      - # peso_bruto
        quantity: 'peso_bruto'
        metrics:
          - 'max'

    categoricals_imputation:
      all:
        type: 'null_category' 

    categoricals:
      - # dimension
        column: 'dimension'
        metrics:
          - 'sum' 
        choices: ['20','40','45']
      - # ruta_linea_naviera
        column: 'ruta_linea_naviera'
        metrics:
          - 'sum' 
        choice_query: 'select distinct ruta_linea_naviera from ontology.entities'
    
    intervals: ['all']

  -
    prefix: 'mercha' # merchandise variables
    from_obj: 'ontology.comportamiento'
    knowledge_date_column: 'fecha_eta'
   
    categoricals_imputation:
        all:
          type: 'null_category'

    categoricals:
      - # capitulo
        column: 'capitulo'
        metrics:
          - 'sum'
        choice_query: 'select distinct capitulo from ontology.comportamiento'
      - # seccion
        column: 'seccion' 
        metrics:
          - 'sum'
        choice_query: 'select distinct seccion from ontology.comportamiento'
    
    aggregates_imputation:
        all:
          type: 'mean'

    aggregates:
      - # conteo_capitulo_2sem
        quantity: 
          ccap2s: 'conteo_capitulo_2sem'
        metrics:
          - 'min'
      - # conteo_capitulo_4sem
        quantity: 
          ccap4s: 'conteo_capitulo_4sem'
        metrics:
          - 'min'

    intervals: ['all']

  -
    prefix: 'consig' # consignee variables
    from_obj: 'ontology.comportamiento'
    knowledge_date_column: 'fecha_eta'

    categoricals_imputation:
      all:
        type: 'null_category' 

    categoricals:
      - # top consignees (the query keeps the 100 most frequent)
        column: 'consignatario'
        metrics:
          - 'sum' 
        choice_query: 'with top50 as(select consignatario, count(consignatario) from ontology.comportamiento group by consignatario order by 2 desc limit 100) select consignatario from top50'
   
    aggregates_imputation:
        all:
          type: 'mean'

    aggregates:
      - # conteo_consig_2sem
        quantity: 
          ccons2s: 'conteo_consig_2sem'
        metrics:
          - 'min'
      - # conteo_consig_4sem
        quantity: 
          ccons4s: 'conteo_consig_4sem'
        metrics:
          - 'min'

    intervals: ['all']

  -
    prefix: 'liru' # container shipping line and route variables
    from_obj: 'ontology.comportamiento'
    knowledge_date_column: 'fecha_eta'
   
    aggregates_imputation:
        all:
          type: 'mean'

    aggregates:
      - # conteo_ruta_2sem
        quantity: 
          crut2s: 'conteo_ruta_2sem'
        metrics:
          - 'min'
      - # conteo_ruta_4sem
        quantity: 
          crut4s: 'conteo_ruta_4sem'
        metrics:
          - 'min'

    intervals: ['all']

## all, leave-one-out, leave-one-in, all-combinations
feature_group_strategies: ['leave-one-out']
#feature_group_strategies: ['all-combinations']
 
grid_config:
  'sklearn.tree.DecisionTreeClassifier':
        criterion: ['gini']
        max_depth: [5,10,~] 
        min_samples_split: [10,50,100] 
  'sklearn.ensemble.RandomForestClassifier':
        n_estimators: [200,300]
        criterion: ['gini']
        max_depth: [5,10]
        max_features: ['sqrt']
        min_samples_split: [10,50]
  'triage.component.catwalk.estimators.classifiers.ScaledLogisticRegression':
        penalty: ['l1','l2']
        C: [0.01, 0.1, 1.0, 10]
  'sklearn.dummy.DummyClassifier':
        strategy: ['stratified']
  'sklearn.ensemble.ExtraTreesClassifier':
        n_estimators: [500]
        criterion: ['gini']
        max_depth: [5,10]
        max_features: ['sqrt']
        min_samples_split: [50,100]
  'triage.component.catwalk.baselines.rankers.BaselineRankMultiFeature':
        rules:
            - [{feature: 'ecvr_entity_id_all_peso_neto_max', low_value_high_score: False}]

scoring:
    testing_metric_groups:
       -
          metrics: [precision@, recall@]
          thresholds:
            percentiles: [10, 20, 25, 30]
            top_n: [1000, 1400, 1750, 2100]


    training_metric_groups:
       -
          metrics: [precision@, recall@]
          thresholds:
            percentiles: [10, 20, 25, 30]
            top_n: [1000, 1400, 1750, 2100]

The configuration above worked fine until I set feature_group_strategies to leave-one-out or leave-one-in. In both cases I get the same error (detailed below). With feature_group_strategies: ['all-combinations'] the run completes, but the feature groups are not combined as expected, and I get the same results as with all.
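To make the expectation concrete, here is a minimal sketch of the feature-group subsets I would expect each strategy to produce for the four prefixes in my config. This is only my reading of the strategy names, not triage's actual implementation; the function `expected_subsets` is hypothetical and written just for this illustration.

```python
from itertools import combinations


def expected_subsets(groups, strategy):
    """Return the feature-group subsets one would expect for a strategy name.

    Hypothetical helper for illustration only; it does not call triage.
    """
    groups = list(groups)
    if strategy == "all":
        # a single subset containing every group
        return [groups]
    if strategy == "leave-one-out":
        # one subset per group, each dropping exactly that group
        return [[g for g in groups if g != left_out] for left_out in groups]
    if strategy == "leave-one-in":
        # one subset per group, each containing only that group
        return [[g] for g in groups]
    if strategy == "all-combinations":
        # every non-empty combination of groups
        return [
            list(combo)
            for size in range(1, len(groups) + 1)
            for combo in combinations(groups, size)
        ]
    raise ValueError(f"unknown strategy: {strategy}")


# The four prefixes defined under feature_aggregations in the config above.
groups = ["ecvr", "mercha", "consig", "liru"]
for strategy in ("all", "leave-one-out", "leave-one-in", "all-combinations"):
    subsets = expected_subsets(groups, strategy)
    print(strategy, len(subsets))
```

Under this reading, leave-one-out and leave-one-in should each give 4 subsets and all-combinations should give 15, whereas what I observe with all-combinations looks like the single subset from all.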

Command used:

triage experiment triage/experimentos_arribo/e2_ --n-db-processes 3 --n-processes 8 --no-validate --no-save-predictions

Everything runs smoothly until the matrix building step, where I encounter this error:

2024-09-08 15:17:14 - ERROR Child error
Traceback (most recent call last):
  File "/Ccd/-pyenv/versions/tri-hp/lib/python3.10/site-packages/triage/experiments/multicore.py", line 166, in run_task_with_splatted_arguments
    return task_runner(**task)
  File "/Ccd/pyenv/versions/tri-hp/lib/python3.10/site-packages/triage/component/architect/builders.py", line 321, in build_matrix
    output, labels = self.stitch_csvs(feature_queries, label_query, matrix_store, matrix_uuid)
  File "/Ccd/pyenv/versions/tri-hp/lib/python3.10/site-packages/triage/component/architect/builders.py", line 551, in stitch_csvs
    if len(df_pl.get_column('as_of_date').head(1)[0].split()) > 1:
  File "/Ccd/.pyenv/versions/tri-hp/lib/python3.10/site-packages/polars/dataframe/frame.py", line 6128, in get_column
    return self[name]
exceptions.ColumnNotFoundError: as_of_date

It seems like the as_of_date column is missing or not properly generated during matrix building, specifically when using the leave-one-out or leave-one-in strategies.

I expected the leave-one-out strategy to group the variables accordingly and generate the matrices without this error, but the process halts when it reaches matrix building. However, I checked the matrices generated during the run and confirmed that the as_of_date column is present in them.
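For anyone who wants to repeat that check, this is roughly how one could scan a directory of CSV files for a missing as_of_date header. It is a sketch under assumptions: the files are plain, uncompressed CSVs with a header row, and the directory used here is a temporary one created for the demo, not the location triage actually writes to. The helper name `files_missing_column` is hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def files_missing_column(csv_dir, column="as_of_date"):
    """Return the names of CSV files in csv_dir whose header row lacks `column`."""
    missing = []
    for path in sorted(Path(csv_dir).glob("*.csv")):
        with path.open(newline="") as handle:
            # read only the first row; an empty file yields an empty header
            header = next(csv.reader(handle), [])
        if column not in header:
            missing.append(path.name)
    return missing


# Demo on two small made-up files (not real triage output).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "with_date.csv").write_text("entity_id,as_of_date,f1\n1,2022-01-01,0.5\n")
    Path(tmp, "without_date.csv").write_text("entity_id,f1\n1,0.5\n")
    result = files_missing_column(tmp)

print(result)  # ['without_date.csv']
```

Pointing the function at the directory holding the intermediate files would show whether any of the files being stitched together lacks the column, as opposed to the final matrices.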

My questions would be:

  • Is this a known issue with these feature grouping strategies?
  • Could this be related to how the as_of_date column is handled with these strategies?

Any guidance or suggestions would be greatly appreciated!

Thank you for your help.

@ElenaVillano changed the title from "Especification for feature_group_strategies is not working with leave-one-out or leave-one-in don't work" to "Specification for feature_group_strategies is not working with leave-one-out or leave-one-in" on Sep 11, 2024
@nanounanue
Contributor

Adding to this: all-combinations is not working anymore either; it just runs all ...
