Limitations and usage #39

Open
idimopoulos opened this issue Mar 21, 2023 · 1 comment
We are looking for possible replacements for easyrdf, and hardf was the first one we came across. However, on deeper testing, while the performance is far better than easyrdf (now maintained under sweetrdf), not all edge cases are covered.

What concerns us is the N-Triples format. The TriGWriter escapes only a limited set of characters; to be escaped, a character must match

    const ESCAPE = '/["\\\\\\t\\n\\r\\b\\f]/';

and they are replaced by

    $this->escapeReplacements = [
        '\\' => '\\\\', '"' => '\\"', "\t" => '\\t',
        "\n" => '\\n', "\r" => '\\r', \chr(8) => '\\b', "\f" => '\\f',
    ];

However, this leaves a long list of characters that can make it misbehave.
According to https://www.w3.org/TR/rdf-testcases/#ntrip_strings, many other characters need escaping.
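For illustration, a minimal escape routine following the character table in that document could look like the sketch below. This is not hardf code and the function name is made up; it gives short escapes to the characters that table defines them for and falls back to \uHHHH for the remaining control characters:

```php
<?php

// Sketch of an escape routine per the table in
// https://www.w3.org/TR/rdf-testcases/#ntrip_strings:
// tab, newline, carriage return, double quote and backslash get short
// escapes; all other control characters (0x00-0x1F, 0x7F) become \uHHHH.
// Multi-byte UTF-8 sequences pass through untouched (byte-wise loop).
function escapeNTriplesLiteral(string $value): string
{
    $map = [
        "\t" => '\\t', "\n" => '\\n', "\r" => '\\r',
        '"'  => '\\"', '\\' => '\\\\',
    ];
    $out = '';
    $len = strlen($value);
    for ($i = 0; $i < $len; $i++) {
        $ch = $value[$i];
        $code = ord($ch);
        if (isset($map[$ch])) {
            $out .= $map[$ch];
        } elseif ($code < 0x20 || $code === 0x7F) {
            $out .= sprintf('\\u%04X', $code);
        } else {
            $out .= $ch;
        }
    }
    return $out;
}
```

Note this byte-wise loop escapes only the ASCII-range control characters; characters above 0x7F are left as raw UTF-8.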

I created a small script that tests just the first 255 characters, and the results do not look good. The script is below:

<?php

use pietercolpaert\hardf\TriGWriter;

include 'vendor/autoload.php';

// Generate a string containing the first 255 characters (chr(0) through
// chr(254)), including control and extended characters.
$randomString = '';
for ($i = 0; $i < 255; $i++) {
    $randomString .= chr($i);
}

// Time the serialization of the triple by TriGWriter.
$start = microtime(true);
$serializer = new TriGWriter(['format' => 'triple']);
$serializer->addTriple('http://hardf.org/subject', 'http://hardf.org/predicate', '"' . $randomString . '"');
$output = $serializer->end();
$end = microtime(true);

echo 'TriGWriter::end took ' . ($end - $start) . ' seconds. Memory usage: ' . memory_get_usage() . ' bytes. Memory peak usage: ' . memory_get_peak_usage() . ' bytes.' . PHP_EOL;

$endpoint = 'http://host.docker.internal:8890/sparql';
$graph = 'http://testing.com/';
$subject = 'http://hardf.org/subject';
$predicate = 'http://hardf.org/predicate';
$object = '"' . $randomString . '"';
$query = 'WITH <' . $graph . '> DELETE { <' . $subject . '> <' . $predicate . '> ?d } INSERT { ' . $output . ' } WHERE { <' . $subject . '> <' . $predicate . '> ?d }';
echo $query . PHP_EOL;

$start = microtime(true);
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $endpoint);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $query);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, ['Content-Type: application/sparql-update']);
$response = curl_exec($ch);

if ($response === false) {
    echo "Error: " . curl_error($ch);
}
else {
    echo "Response: " . $response;
}

curl_close($ch);
$end = microtime(true);

echo 'SPARQL INSERT took ' . ($end - $start) . ' seconds. Memory usage: ' . memory_get_usage() . ' bytes. Memory peak usage: ' . memory_get_peak_usage() . ' bytes.' . PHP_EOL;

I was able to support the first 128 characters (the ASCII range) by altering the escape pattern to

    const ESCAPE = '/["\\0\\\\\\t\\n\\r\\b\\f]/';

and the replacements array to

    $this->escapeReplacements = [
        '\\' => '\\\\', '"' => '\\"', "\t" => '\\t',
        "\n" => '\\n', "\r" => '\\r', \chr(8) => '\\b', "\f" => '\\f',
        \chr(0) => '\\u0000',
    ];

(the main addition is \0, the NUL character, which if I am not mistaken marks the end of a stream).
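Rather than adding \0 alone, the pattern could match the whole C0 control range plus DEL and fall back to \uHHHH in a callback for anything without a short escape. A sketch (the variable names here are illustrative, not the actual hardf internals):

```php
<?php

// Short escapes the writer already knows about.
$shortEscapes = [
    '\\' => '\\\\', '"' => '\\"', "\t" => '\\t',
    "\n" => '\\n', "\r" => '\\r', chr(8) => '\\b', "\f" => '\\f',
];

// Match quote, backslash, all C0 controls (0x00-0x1F) and DEL (0x7F);
// anything without a short escape becomes \uHHHH.
$escape = function (string $value) use ($shortEscapes): string {
    return preg_replace_callback(
        '/["\\\\\\x00-\\x1F\\x7F]/',
        fn (array $m): string => $shortEscapes[$m[0]] ?? sprintf('\\u%04X', ord($m[0])),
        $value
    );
};

echo $escape("null: \0, bell: \x07") . PHP_EOL;
```

This keeps the existing replacement table but closes the gap for every remaining ASCII control character in one place.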

However, as soon as I get a couple of characters past 128, the inserted string becomes "0" or the insert fails outright. With easyrdf it takes far, FAR longer to insert large blobs of text, but performance at the cost of integrity is not really performance.

Our use case is a CMS with a WYSIWYG editor into which users can paste anything, so a stray copy/paste can introduce one of these characters. But if the character is intentional, we would rather not strip it.

My question is: are we misusing this library? Are there known or unknown limits to it? Or is there an intended philosophy of not considering non-printable/special characters part of the supported string?

@zozlak
Contributor

zozlak commented Feb 13, 2024

As far as I can tell, your question touches on three topics:

  1. N-Triples literal characters which have to be escaped. The N-Triples specification states that "An N-Triples document is a Unicode character string encoded in UTF-8" and that a literal is (skipping the outer quotes for readability) ([^#x22#x5C#xA#xD] | ECHAR | UCHAR)*. This means that any UTF-8-encoded character other than the four excluded by [^#x22#x5C#xA#xD] (double quote, backslash, new line and carriage return) is a valid N-Triples literal character. The UCHAR part of the definition allows characters to also be written in an escaped form, but that is only an option and, for both performance and readability reasons, it is simply not worth writing characters in the unicode-escaped form when serializing an N-Triples literal.

  2. Parsing of N-Triples literal characters. The parser must handle all variants described by the specification, including the unicode-escape syntax. But here we are talking about the writer, and as far as I know the hardf reader does support unicode escapes.

  3. The SPARQL syntax is not fully in line with either N-Triples or Turtle (see your other question here).
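To make point 1 concrete: under the current grammar, only those four characters require escaping on the writer side, so a spec-minimal escape (a sketch, not the hardf implementation) reduces to:

```php
<?php

// RDF 1.1 N-Triples: inside a STRING_LITERAL_QUOTE only the double quote,
// backslash, line feed and carriage return must be escaped; any other
// UTF-8 character may be written raw.
function escapeLiteralRdf11(string $value): string
{
    return strtr($value, [
        '\\' => '\\\\',
        '"'  => '\\"',
        "\n" => '\\n',
        "\r" => '\\r',
    ]);
}
```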
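And for point 2, the reader side has to accept the escaped variants as well; a sketch of UCHAR decoding (again illustrative, not the hardf parser; requires the mbstring extension for mb_chr):

```php
<?php

// Decode \uXXXX and \UXXXXXXXX escapes back to raw UTF-8, as an
// N-Triples parser must.
function decodeUnicodeEscapes(string $value): string
{
    return preg_replace_callback(
        '/\\\\u([0-9A-Fa-f]{4})|\\\\U([0-9A-Fa-f]{8})/',
        function (array $m): string {
            // $m[2] is set only when the \UXXXXXXXX alternative matched.
            $code = hexdec($m[2] ?? $m[1]);
            return mb_chr($code, 'UTF-8');
        },
        $value
    );
}
```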
