Problem Statement
We commonly work with structured data where the labeling of axes, and of the different datasets, matters. For example, when a 2D numpy array is passed in, one would expect rows to be samples and columns to be features. However, a causal graph whose nodes are just integer indices is hard to read and interpret. We therefore typically use a pandas DataFrame instead, so that each node in the graph can carry a name.
Things get more complicated as we move towards general causal discovery, where we want to support multiple datasets. With a numpy array this means an additional axis to keep track of, plus a convention for what that axis means that everyone has to remember. For purely observational data this is not an issue, because multiple observational datasets can typically just be concatenated along the sample axis. For interventions and multi-environment learning, however, each dataset has to stay distinguishable: an interventional causal discovery algorithm needs to index each dataset separately. There is no good way to do this with pandas. Multi-indexing is confusing imo, and pandas itself has moved away from multi-dimensional containers (the 3D Panel type was deprecated and later removed, with xarray as one of the recommended alternatives).
https://stackoverflow.com/questions/42876278/when-to-use-multiindexing-vs-xarray-in-pandas
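To make the bookkeeping problem concrete, here is a minimal sketch of the two workarounds available today. The node names (`x`, `y`, `z`), the environment labels and the variable names are made up for illustration and do not come from any existing API in this project.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cols = ["x", "y", "z"]  # hypothetical node names

# One observational dataset and one interventional dataset (do(y)),
# with different numbers of samples.
obs = pd.DataFrame(rng.normal(size=(5, 3)), columns=cols)
intv = pd.DataFrame(rng.normal(size=(4, 3)), columns=cols)

# Workaround 1: parallel lists that every function has to accept
# and keep aligned by position.
datasets = [obs, intv]
intervention_targets = [set(), {"y"}]

# Workaround 2: a single frame with an (env, sample) MultiIndex on the rows.
stacked = pd.concat(datasets, keys=["obs", "do_y"], names=["env", "sample"])

# Selecting one environment works, but the intervention targets
# still live outside the frame.
do_y = stacked.xs("do_y", level="env")
```

In both cases the intervention targets travel separately from the data, which is the part that tends to get out of sync.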
Possible solutions
I don't think this is something we need to change right away. We can get by with multi-index hacks or janky APIs in the meantime, but for longer-term stability we might consider defining the internal dataset as an xarray object. We should still accept pandas DataFrames and numpy arrays as input, but internally they would be converted to xarray, which is then used for causal discovery. This removes the need to pass around, e.g., lists of pandas DataFrames together with lists of intervention targets and names, or lists of numpy arrays together with lists of node names and lists of intervention targets. Rather, we should strive to pass around a single instance `data: XArray`, which is guaranteed to carry the relevant information.
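As a rough sketch of what the internal conversion could look like, the helper below packs a list of DataFrames and their intervention targets into one `xr.Dataset`. Everything here is an assumption for discussion, not a proposed final design: the helper name `to_xarray`, the variable names `values` and `is_target`, and the choice to stack samples along a single `sample` dimension (so environments may have different sample counts) are all hypothetical.

```python
import numpy as np
import pandas as pd
import xarray as xr


def to_xarray(frames, targets):
    """Hypothetical helper: pack per-environment DataFrames into one Dataset.

    frames  : list of DataFrames that share the same columns (node names)
    targets : list of sets of node names intervened on in each environment
              (an empty set marks observational data)
    """
    if len(frames) != len(targets):
        raise ValueError("need exactly one target set per dataset")
    nodes = list(frames[0].columns)
    if any(list(df.columns) != nodes for df in frames):
        raise ValueError("all datasets must share the same node names")

    # Stack all samples along one axis and record each sample's environment.
    values = np.concatenate([df.to_numpy() for df in frames], axis=0)
    env_of_sample = np.concatenate(
        [np.full(len(df), i) for i, df in enumerate(frames)]
    )
    # is_target[e, j] is True if node j is intervened on in environment e.
    is_target = np.array([[n in t for n in nodes] for t in targets])

    return xr.Dataset(
        data_vars={
            "values": (("sample", "node"), values),
            "is_target": (("env", "node"), is_target),
        },
        coords={
            "node": nodes,
            "env": np.arange(len(frames)),
            "sample_env": ("sample", env_of_sample),
        },
    )


# Usage with made-up data: one observational and one do(y) dataset.
rng = np.random.default_rng(0)
cols = ["x", "y", "z"]
obs = pd.DataFrame(rng.normal(size=(5, 3)), columns=cols)
intv = pd.DataFrame(rng.normal(size=(4, 3)), columns=cols)
data = to_xarray([obs, intv], [set(), {"y"}])

# Samples of environment 1, selected via the per-sample environment label.
idx = np.flatnonzero(data["sample_env"].values == 1)
env1 = data["values"].isel(sample=idx)

# Names of the nodes intervened on in environment 1.
mask = data["is_target"].sel(env=1).values
env1_targets = list(data["node"].values[mask])
```

With this layout, node names, environment membership and intervention targets all travel inside the single `data` object, so downstream functions would only need one argument. A 3D `(env, sample, node)` array is an alternative, but it would require equal sample counts per environment or padding.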