Limitations and usage #39

Open
idimopoulos opened this issue Mar 21, 2023 · 1 comment
We are looking for possible replacements for easyrdf, and hardf was the first one we came across. However, on deeper testing, while the performance is far better than easyrdf (now maintained under sweetrdf), not all edge cases are covered.

What concerns us is the N-Triples format. The TriGWriter escapes only a limited set of characters; to be escaped, a character must match

    const ESCAPE = '/["\\\\\\t\\n\\r\\b\\f]/';

and they are replaced by

    $this->escapeReplacements = [
        '\\' => '\\\\', '"' => '\\"', "\t" => '\\t',
        "\n" => '\\n', "\r" => '\\r', \chr(8) => '\\b', "\f" => '\\f',
    ];

However, this leaves a long list of characters that can make it misbehave.
According to https://www.w3.org/TR/rdf-testcases/#ntrip_strings, many other characters need escaping.
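For illustration, a minimal escape routine following the character table in that document could look like the sketch below. This is not hardf code and the function name is made up; it gives short escapes to the characters that table defines them for and falls back to \uHHHH for the remaining control characters:

```php
<?php

// Sketch of an escape routine per the table in
// https://www.w3.org/TR/rdf-testcases/#ntrip_strings:
// tab, newline, carriage return, double quote and backslash get short
// escapes; all other control characters (0x00-0x1F, 0x7F) become \uHHHH.
// Multi-byte UTF-8 sequences pass through untouched (byte-wise loop).
function escapeNTriplesLiteral(string $value): string
{
    $map = [
        "\t" => '\\t', "\n" => '\\n', "\r" => '\\r',
        '"'  => '\\"', '\\' => '\\\\',
    ];
    $out = '';
    $len = strlen($value);
    for ($i = 0; $i < $len; $i++) {
        $ch = $value[$i];
        $code = ord($ch);
        if (isset($map[$ch])) {
            $out .= $map[$ch];
        } elseif ($code < 0x20 || $code === 0x7F) {
            $out .= sprintf('\\u%04X', $code);
        } else {
            $out .= $ch;
        }
    }
    return $out;
}
```

Note this byte-wise loop escapes only the ASCII-range control characters; characters above 0x7F are left as raw UTF-8.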

I created a small script that tests just the first 255 characters, and the results do not look good. The script is below:

<?php

use pietercolpaert\hardf\TriGWriter;

include 'vendor/autoload.php';

// Generate a string containing the first 255 characters (chr(0) through
// chr(254)), including control and extended characters.
$randomString = '';
for ($i = 0; $i < 255; $i++) {
    $randomString .= chr($i);
}

// Time the serialization of the triple by TriGWriter.
$start = microtime(true);
$serializer = new TriGWriter(['format' => 'triple']);
$serializer->addTriple('http://hardf.org/subject', 'http://hardf.org/predicate', '"' . $randomString . '"');
$output = $serializer->end();
$end = microtime(true);

echo 'TriGWriter::end took ' . ($end - $start) . ' seconds. Memory usage: ' . memory_get_usage() . ' bytes. Memory peak usage: ' . memory_get_peak_usage() . ' bytes.' . PHP_EOL;

$endpoint = 'http://host.docker.internal:8890/sparql';
$graph = 'http://testing.com/';
$subject = 'http://hardf.org/subject';
$predicate = 'http://hardf.org/predicate';
$object = '"' . $randomString . '"';
$query = 'WITH <' . $graph . '> DELETE { <' . $subject . '> <' . $predicate . '> ?d } INSERT { ' . $output . ' } WHERE { <' . $subject . '> <' . $predicate . '> ?d }';
echo $query . PHP_EOL;

$start = microtime(true);
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $endpoint);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $query);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, ['Content-Type: application/sparql-update']);
$response = curl_exec($ch);

if ($response === false) {
    echo "Error: " . curl_error($ch);
}
else {
    echo "Response: " . $response;
}

curl_close($ch);
$end = microtime(true);

echo 'SPARQL INSERT took ' . ($end - $start) . ' seconds. Memory usage: ' . memory_get_usage() . ' bytes. Memory peak usage: ' . memory_get_peak_usage() . ' bytes.' . PHP_EOL;

I was able to support the first 128 characters (the ASCII range) by altering the escape pattern to

    const ESCAPE = '/["\\0\\\\\\t\\n\\r\\b\\f]/';

and the replacements array to

    $this->escapeReplacements = [
        '\\' => '\\\\', '"' => '\\"', "\t" => '\\t',
        "\n" => '\\n', "\r" => '\\r', \chr(8) => '\\b', "\f" => '\\f',
        \chr(0) => '\\u0000',
    ];

(the main addition is \0, the NUL character, which if I am not mistaken marks the end of a stream).
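Rather than adding \0 alone, the pattern could match the whole C0 control range plus DEL and fall back to \uHHHH in a callback for anything without a short escape. A sketch (the variable names here are illustrative, not the actual hardf internals):

```php
<?php

// Short escapes the writer already knows about.
$shortEscapes = [
    '\\' => '\\\\', '"' => '\\"', "\t" => '\\t',
    "\n" => '\\n', "\r" => '\\r', chr(8) => '\\b', "\f" => '\\f',
];

// Match quote, backslash, all C0 controls (0x00-0x1F) and DEL (0x7F);
// anything without a short escape becomes \uHHHH.
$escape = function (string $value) use ($shortEscapes): string {
    return preg_replace_callback(
        '/["\\\\\\x00-\\x1F\\x7F]/',
        fn (array $m): string => $shortEscapes[$m[0]] ?? sprintf('\\u%04X', ord($m[0])),
        $value
    );
};

echo $escape("null: \0, bell: \x07") . PHP_EOL;
```

This keeps the existing replacement table but closes the gap for every remaining ASCII control character in one place.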

However, as soon as I get a couple of characters past 128, the inserted string becomes "0" or the insert fails outright. With easyrdf it takes far, FAR longer to insert large blobs of text, but performance at the cost of integrity is not really performance.

Our use case is a CMS with a WYSIWYG editor into which users can paste anything, so a stray copy/paste can introduce one of these characters. But if the character is intentional, we would rather not strip it.

My question is: are we misusing this library? Are there known or unknown limits to it? Or is there an intended philosophy of not considering non-printable/special characters part of the supported string?

@zozlak
Contributor

zozlak commented Feb 13, 2024

As far as I can tell, your question touches on three topics:

  1. N-Triples literal characters which have to be escaped. The N-Triples specification states that "An N-Triples document is a Unicode character string encoded in UTF-8" and that a literal is (skipping the outer quotes for readability) ([^#x22#x5C#xA#xD] | ECHAR | UCHAR)*. This means that any UTF-8-encoded character other than the four excluded by [^#x22#x5C#xA#xD] (double quote, backslash, new line and carriage return) is a valid N-Triples literal character. The UCHAR part of the definition allows characters to also be written in an escaped form, but that is only an option and, for both performance and readability reasons, it is simply not worth writing characters in the unicode-escaped form when serializing an N-Triples literal.

  2. Parsing of N-Triples literal characters. The parser must handle all variants described by the specification, including the unicode-escape syntax. But here we are talking about the writer, and as far as I know the hardf reader does support unicode escapes.

  3. The SPARQL syntax is not fully in line with either N-Triples or Turtle (see your other question here).
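To make point 1 concrete: under the current grammar, only those four characters require escaping on the writer side, so a spec-minimal escape (a sketch, not the hardf implementation) reduces to:

```php
<?php

// RDF 1.1 N-Triples: inside a STRING_LITERAL_QUOTE only the double quote,
// backslash, line feed and carriage return must be escaped; any other
// UTF-8 character may be written raw.
function escapeLiteralRdf11(string $value): string
{
    return strtr($value, [
        '\\' => '\\\\',
        '"'  => '\\"',
        "\n" => '\\n',
        "\r" => '\\r',
    ]);
}
```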
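And for point 2, the reader side has to accept the escaped variants as well; a sketch of UCHAR decoding (again illustrative, not the hardf parser; requires the mbstring extension for mb_chr):

```php
<?php

// Decode \uXXXX and \UXXXXXXXX escapes back to raw UTF-8, as an
// N-Triples parser must.
function decodeUnicodeEscapes(string $value): string
{
    return preg_replace_callback(
        '/\\\\u([0-9A-Fa-f]{4})|\\\\U([0-9A-Fa-f]{8})/',
        function (array $m): string {
            // $m[2] is set only when the \UXXXXXXXX alternative matched.
            $code = hexdec($m[2] ?? $m[1]);
            return mb_chr($code, 'UTF-8');
        },
        $value
    );
}
```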
