Add support for netzschleuder #22

Open
schochastics opened this issue Mar 28, 2025 · 8 comments · May be fixed by #23
@schochastics

What is the feature or improvement you would like to see?
Add a function that lets users download network data from https://networks.skewed.de

Use cases for the feature
The website has a good API, and this would give users easy access to a more diverse set of realistic network data.

I already have a prototype implementation. If this is a desired feature, I will start a PR.

(cc @szhorvat)

@maelle

maelle commented Mar 28, 2025

I'd be in favor of this being a separate package under the igraph organization, to be used as a dependency. 😇

@schochastics
Author

Sounds good to me, because that would allow a bit more freedom in development (adding dependencies).
Should that replace or augment igraphdata, or be completely separate?

@maelle

maelle commented Mar 28, 2025

I'd say separate, as igraphdata is about sharing data, not accessing it, and if needed we can make igraphdata depend on the netzschleuder package?

@schochastics
Author

Alright makes sense. Let's see if others agree 🙂

@szhorvat
Member

So this was one of the "mentored projects", announced for Python, and we've looked into it a bit with @ntamas. It seems more complicated than expected. It looks like it may be necessary to implement an importer for the .gt format, as the GML and GraphML representations on Netzschleuder are often difficult to handle.

The GraphML files use custom data types, such as the one used by _pos for positions, which the igraph C core does not currently support. See here for why handling unknown data types in some default manner is non-trivial: igraph/igraph#1731

Some of the GML files were outright corrupt, e.g. contained invalid character codes. Of course we can talk to Tiago and look for a fix.

igraph currently only supports scalar attributes (number, bool, string), which is a major limitation. It does not support, for example, a pair of coordinates as a single attribute. This is why coordinates are typically stored in two separate attributes, x and y. Many of the files do come with non-scalar attributes. These need to be handled somehow.

To be precise, the Python and R interfaces of igraph do support arbitrary Python and R objects as attributes, but these are not accessible from C, so the file format readers can't handle them. Until the attribute handling can be overhauled (one of the major items in our last, rejected CZI application), we can try to handle this by reading such attributes as strings and then deserializing these strings in the host language (Python or R). This may still necessitate changes to the C core format readers so that they actually return this data as a string. As I said, in GraphML unknown types are simply not supported (the format itself does not standardize non-scalar attributes), while in GML composite types are currently ignored with a warning.

Netzschleuder actually serializes some of the non-scalar types into strings, even when the format (such as GML) would support them. In some cases, it does the serialization in a Python-specific way. Deserializing in R may be a challenge.
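
To illustrate the "read as string, deserialize in the host language" idea, here is a minimal base R sketch. It assumes the serialized values look like `array([x, y])`, which is the pattern the prototype later in this thread strips with a regular expression; other attributes may well be serialized differently. `parse_numpy_pairs()` is a hypothetical helper name, not an igraph function.

```r
# Hypothetical helper (not part of igraph): turn strings such as
# "array([0.5, -1.25])" into a numeric matrix with columns x and y.
# Assumption: every string holds exactly two comma-separated numbers.
parse_numpy_pairs <- function(x) {
  # Drop "array(", square brackets and the closing parenthesis
  inner <- gsub("array\\(|\\[|\\]|\\)", "", x)
  parts <- strsplit(inner, ",", fixed = TRUE)
  # vapply() returns a 2 x n matrix: one column per input string
  coords <- vapply(
    parts,
    function(p) as.numeric(trimws(p[1:2])),
    numeric(2)
  )
  # Transpose so that each row corresponds to one vertex
  out <- t(coords)
  colnames(out) <- c("x", "y")
  out
}

pos <- c("array([0.5, -1.25])", "array([ 3.0,  4.0])")
xy <- parse_numpy_pairs(pos)
```

The two columns of `xy` could then be stored as separate scalar vertex attributes, which sidesteps the scalar-only limitation described above.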

I must leave now, but I wanted to type up some of the concerns. We could discuss this in a live meeting sometime.

@schochastics
Author

We definitely need to discuss this live, but the networks also exist as zipped CSV files? I will test a bit more to see if I run into problems.

@schochastics
Author

This is the working(?) prototype. Just tested a few samples so far.
The nonstandard columns including the coordinates can be split into x and y node attributes quite easily.

#' Download a graph from the Netzschleuder data catalogue
#'
#' Netzschleuder (<https://networks.skewed.de/>) is a large online repository for
#' network datasets with the aim of aiding scientific research.
#' @param name character. name of the network dataset.
#' @param net character. If the dataset contains several networks this is the network name.
#' @param directed logical. Whether a directed graph is constructed.
#' @param bipartite logical. Whether a bipartite graph is constructed.
#' @return a new graph object.
#' @keywords graphs
#' @family foreign
#' @export
graph_from_netzschleuder <- function(name, net = NULL, directed = FALSE, bipartite = FALSE) {
  if (is.null(net)) {
    net <- name
  }
  zip_url <- paste0(
    "https://networks.skewed.de/net/", name, "/files/", net, ".csv.zip"
  )

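  temp <- tempfile(fileext = ".zip")
  # Register cleanup right away so the archive is removed even if a later step fails
  on.exit(unlink(temp), add = TRUE)
  # mode = "wb" keeps the binary zip intact on Windows
  download.file(zip_url, temp, mode = "wb", quiet = TRUE)
  zip_contents <- unzip(temp, list = TRUE)
  edge_file_name <- zip_contents$Name[grepl("edge", zip_contents$Name)]
  node_file_name <- zip_contents$Name[grepl("node", zip_contents$Name)]

  edges_df <- read.csv(unz(temp, edge_file_name))
  names(edges_df)[c(1, 2)] <- c("from", "to")
  # Shift only the 0-based endpoint columns to 1-based ids; edge attributes stay untouched
  edges_df[c("from", "to")] <- edges_df[c("from", "to")] + 1

  nodes_df <- read.csv(unz(temp, node_file_name))
  names(nodes_df)[1] <- "id"
  nodes_df$id <- nodes_df$id + 1
  if ("X_pos" %in% names(nodes_df)) {
    # Strip the "array([ ... ])" wrapper around the position and split it into x and y
    pos_array <- gsub("array\\(\\[|\\]|\\)", "", nodes_df[["X_pos"]])
    split_coords <- strsplit(pos_array, ",")

    x_vals <- vapply(split_coords, function(x) as.numeric(trimws(x[1])), numeric(1))
    y_vals <- vapply(split_coords, function(x) as.numeric(trimws(x[2])), numeric(1))

    nodes_df[["X_pos"]] <- NULL
    nodes_df$x <- x_vals
    nodes_df$y <- y_vals
  }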
  g <- graph_from_data_frame(edges_df, directed = directed, vertices = nodes_df)
  if (bipartite) {
    types <- rep(FALSE, vcount(g))
    types[nodes_df$id %in% edges_df[, 1]] <- TRUE
    g <- set_vertex_attr(g, "type", value = types)
  }
  g
}

@schochastics schochastics self-assigned this Mar 29, 2025
@krlmlr krlmlr transferred this issue from igraph/rigraph Apr 3, 2025
@krlmlr
Contributor

krlmlr commented Apr 3, 2025

Can we have it in igraphdata, separated into:

  • a reading function that returns a named list of data frames
  • a cleaning function
  • a simple call to graph_from_data_frame()

?
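
A rough sketch of how this three-way split could look, reusing the logic of the prototype above. All three function names are hypothetical placeholders, and the assumption that the archive contains exactly one file matching "edge" and one matching "node" is taken from the prototype rather than from any documented guarantee.

```r
# Step 1 (hypothetical name): read the tables of an already downloaded
# Netzschleuder .csv.zip archive into a named list of data frames.
read_netzschleuder_zip <- function(zip_path) {
  files <- unzip(zip_path, list = TRUE)$Name
  pick <- function(pattern) {
    hit <- files[grepl(pattern, files)]
    if (length(hit) != 1L) {
      stop("expected exactly one file matching '", pattern, "'")
    }
    read.csv(unz(zip_path, hit))
  }
  list(edges = pick("edge"), nodes = pick("node"))
}

# Step 2 (hypothetical name): standardize column names and shift the
# 0-based vertex ids to 1-based ids, leaving all other columns untouched.
clean_netzschleuder <- function(raw) {
  edges <- raw$edges
  nodes <- raw$nodes
  names(edges)[1:2] <- c("from", "to")
  names(nodes)[1] <- "id"
  edges[c("from", "to")] <- edges[c("from", "to")] + 1
  nodes$id <- nodes$id + 1
  list(edges = edges, nodes = nodes)
}

# Step 3 (hypothetical name): a thin wrapper around graph_from_data_frame().
as_netzschleuder_graph <- function(clean, directed = FALSE) {
  igraph::graph_from_data_frame(
    clean$edges,
    directed = directed,
    vertices = clean$nodes
  )
}

# Usage with small in-memory tables, so no download is needed
raw <- list(
  edges = data.frame(X..source = c(0, 1), target = c(1, 2), weight = c(0.5, 2)),
  nodes = data.frame(X..index = 0:2, label = c("a", "b", "c"))
)
clean <- clean_netzschleuder(raw)
if (requireNamespace("igraph", quietly = TRUE)) {
  g <- as_netzschleuder_graph(clean)
}
```

Keeping the reading and cleaning steps free of igraph calls means they can be tested on plain data frames, and only the last step needs the igraph dependency.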

@schochastics schochastics linked a pull request Apr 5, 2025 that will close this issue
2 tasks