Method Format::guessFormat() doesn't support turtle files without prefixes or a base URI #62

@Michiel-s

The issue

Taking Example 1 from https://www.w3.org/TR/turtle/:

@base <http://example.org/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rel: <http://www.perceive.net/schemas/relationship/> .

<#green-goblin>
    rel:enemyOf <#spiderman> ;
    a foaf:Person ;    # in the context of the Marvel universe
    foaf:name "Green Goblin" .

<#spiderman>
    rel:enemyOf <#green-goblin> ;
    a foaf:Person ;
    foaf:name "Spiderman", "Человек-паук"@ru .

Instantiating an EasyRDF Graph from this data works fine. It parses correctly.

However, when the @base and @prefix directives are not provided, an exception is thrown: 'Unable to parse data of an unknown format.'

E.g.

<#green-goblin>
    rel:enemyOf <#spiderman> ;
    a foaf:Person ;    # in the context of the Marvel universe
    foaf:name "Green Goblin" .

<#spiderman>
    rel:enemyOf <#green-goblin> ;
    a foaf:Person ;
    foaf:name "Spiderman", "Человек-паук"@ru .

Analysis

According to the Turtle grammar:

  • A turtleDoc is a set of statements [1]
  • A statement is a directive OR triples [2]

That means a Turtle document may, but need not, start with @prefix or @base directives.

In the logic of method Format::guessFormat(), Turtle documents are only recognized when the inspected snippet ($short) contains a prefix or base directive (with or without the @).

...
        } elseif (preg_match('/@prefix\s|@base\s/', $short)) {
            return self::getFormat('turtle');
        } elseif (
            preg_match('/prefix\s|base\s/i', $short)
            // see FormatTest::testGuessFormatTurtleByPrefix for an example
            && false === str_contains($short, '<?xml')
        ) {
            return self::getFormat('turtle');
        } elseif (preg_match('/^\s*<.+> <.+>/m', $short)) {
            return self::getFormat('ntriples');
        } else {
            return null;
        }
...
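
To see why the second sample falls through, the branches quoted above can be replayed in isolation. This is a minimal sketch, not the library's code: guessFromQuotedBranches() is a hypothetical helper that copies only the three conditions shown, returns plain strings instead of calling self::getFormat(), and assumes the whole input is passed as $short.

```php
<?php
// Hypothetical stand-in for the quoted branches of Format::guessFormat().
// Returns a format name as a string, or null when no branch matches.
function guessFromQuotedBranches(string $short): ?string
{
    if (preg_match('/@prefix\s|@base\s/', $short)) {
        return 'turtle';
    } elseif (
        preg_match('/prefix\s|base\s/i', $short)
        && false === str_contains($short, '<?xml')
    ) {
        return 'turtle';
    } elseif (preg_match('/^\s*<.+> <.+>/m', $short)) {
        return 'ntriples';
    }

    return null;
}

// First resource of the sample from this issue, without any directives.
$withoutDirectives = <<<'TTL'
<#green-goblin>
    rel:enemyOf <#spiderman> ;
    a foaf:Person ;    # in the context of the Marvel universe
    foaf:name "Green Goblin" .
TTL;

// Same data with a single @prefix directive in front.
$withDirectives = "@prefix foaf: <http://xmlns.com/foaf/0.1/> .\n" . $withoutDirectives;

var_dump(guessFromQuotedBranches($withDirectives));    // string(6) "turtle"
var_dump(guessFromQuotedBranches($withoutDirectives)); // NULL
```

Without directives, neither Turtle branch matches, and the N-Triples pattern fails too because no line holds two IRIs separated by a single space, so the result is null.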

Solution space

I propose to add a few more possibilities to recognize turtle documents.

We need to remain able to distinguish Turtle from N-Triples syntax. If I'm correct, Turtle is a superset of N-Triples: a valid N-Triples document is also valid Turtle, but not the other way around.

There are specific indicators that we are dealing with a turtle document, including:

  • The shorthand a is used for the predicate rdf:type
  • Compact URIs are used, i.e. prefix, colon, local name, enclosed by whitespace
  • A semicolon is used to continue a predicateObjectList

None of these are allowed in N-Triples syntax, so the two formats stay distinguishable. The extra checks might produce more false positives for Turtle when guessing the format, which would surface as parser errors. I don't see a problem with that, because otherwise the guessFormat method returns null, which also results in an exception, as described in this issue.
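
The three indicators could be approximated with regular expressions along these lines. This is only a sketch of the proposal, not a patch: looksLikeTurtle() and its patterns are hypothetical, they ignore the empty prefix (:name), and they can misfire on text inside string literals or comments, which is the false-positive risk mentioned above.

```php
<?php
// Hypothetical helper: returns true when $short shows one of the
// Turtle-only indicators listed in this issue.
function looksLikeTurtle(string $short): bool
{
    // 1. The shorthand "a" for rdf:type, enclosed by whitespace.
    if (preg_match('/\sa\s/', $short)) {
        return true;
    }

    // 2. A compact URI (prefix:localname) enclosed by whitespace.
    //    The prefix must start with a letter, so N-Triples blank
    //    nodes such as "_:b0" are not mistaken for compact URIs.
    if (preg_match('/(?:^|\s)[A-Za-z][\w-]*:[A-Za-z]\w*\s/m', $short)) {
        return true;
    }

    // 3. A semicolon at the end of a line, optionally followed by a comment.
    if (preg_match('/;\s*(?:#.*)?$/m', $short)) {
        return true;
    }

    return false;
}

$turtle = <<<'TTL'
<#spiderman>
    rel:enemyOf <#green-goblin> ;
    a foaf:Person ;
    foaf:name "Spiderman" .
TTL;

$ntriples = <<<'NT'
<http://example.org/s> <http://example.org/p> "v" .
_:b0 <http://example.org/p> <http://example.org/o> .
NT;

var_dump(looksLikeTurtle($turtle));   // bool(true)
var_dump(looksLikeTurtle($ntriples)); // bool(false)
```

Such a check would have to run before the existing N-Triples branch, since a document with full IRIs and semicolons could otherwise still be classified as N-Triples.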

Labels: enhancement