Archive for the ‘xml’ Category

relaxng validation: specify range limit for integer

Saturday, June 20th, 2009

To restrict an integer to a range of values, you may be interested in the following parameters:

  • minInclusive
  • maxInclusive
  • minExclusive
  • maxExclusive

Note that ‘min’ and ‘max’ parameters do NOT exist: you must specify whether the boundaries are included or not.

Example

<define name="mytype.content">
 <data type="positiveInteger">
  <param name="minInclusive">1</param>
  <param name="maxInclusive">31</param>
 </data>
</define>

With the above example, validation will fail on any input that is not an integer or that falls outside the [1, 31] range.
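
The range check above can be tried from php with a minimal sketch like the one below. It assumes the DOM extension is available and wraps the same <data> block in a hypothetical <day> element; the helper name day_is_valid() is also an assumption made for illustration.

```php
<?php
// Collect libxml messages internally instead of emitting php warnings.
libxml_use_internal_errors(true);

// Hypothetical schema: a <day> element holding an integer within [1, 31].
$schema = <<<'RNG'
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <start>
    <element name="day">
      <data type="positiveInteger">
        <param name="minInclusive">1</param>
        <param name="maxInclusive">31</param>
      </data>
    </element>
  </start>
</grammar>
RNG;

// Returns true when <day>$value</day> validates against the schema.
function day_is_valid(string $value, string $schema): bool
{
    $doc = new DOMDocument();
    $doc->loadXML('<day>' . htmlspecialchars($value) . '</day>');
    $ok = $doc->relaxNGValidateSource($schema);
    libxml_clear_errors();
    return $ok;
}

var_dump(day_is_valid('15', $schema));  // inside the range
var_dump(day_is_valid('32', $schema));  // above maxInclusive
var_dump(day_is_valid('abc', $schema)); // not an integer at all
```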

XML validation & query urls (& failure)

Tuesday, May 12th, 2009

I tried to make my website valid xml so that browsers that accept it (currently Firefox 3, Chrome and Opera) can load it as xml and parse it faster.
Moreover, I simply wanted my website to be w3c valid with an XHTML header.

Problems arise when your page contains urls with arguments: validation fails on the ‘&’ variable separator.

At first, I simply html-encoded the separator, replacing ‘&’ with ‘&amp;’. This works fine for content validation, but it breaks server-side code when the raw url is pasted directly into the browser, because php does not treat ‘&amp;’ as the separator. (Browsers normally decode ‘&amp;’ automatically, so when a user clicks on the link the page is served correctly; the error shown below should almost never occur.)

Example:

<a href="myurl?key1=value1&amp;key2=value2">my_link</a>

// server side script corresponding to queried page
var_dump($_GET);
// output:
array(2) {
["key1"]=>
string(6) "value1"
["amp;key2"]=>
string(6) "value2"
}
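
The same behaviour can be reproduced without a web server. The sketch below assumes the default ‘arg_separator.input’ value (‘&’), which parse_str() also relies on; the variable names are made up for the illustration.

```php
<?php
// The query string as it appears in the html source (entity not decoded).
$raw_query = 'key1=value1&amp;key2=value2';

// What php sees when the raw url is pasted as-is into the browser.
parse_str($raw_query, $as_pasted);

// What php sees after the browser has decoded the entity (link clicked).
parse_str(html_entity_decode($raw_query), $as_clicked);

print_r(array_keys($as_pasted));  // second key keeps the "amp;" residue
print_r(array_keys($as_clicked)); // plain key1 / key2
```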

Quick-solution

Personally, I find it pretty annoying to encode ‘&’ into ‘&amp;’ each time a url gets generated, and I also don’t like depending on the browser to decode ‘&amp;’ when it is used in a link.

A much cleaner solution is to change the default variable separator to an xml-compliant character that also gets urlencoded (so that the values of your query variables won’t interfere with it).

‘;’ meets these criteria (note: the W3C HTML 4.01 specification, appendix B.2.2, recommends that servers accept ‘;’ in place of ‘&’).

In php, you simply need to modify php.ini and change the ‘arg_separator.input’ value from ‘&’ to ‘;’ (don’t forget to restart your server for the change to be applied).
Be careful with multi-character values: each character is treated as a separator per se, not the string as a whole.
We can turn this behavior to our advantage: to avoid breaking our code in case we forget to use ‘;’ as the variable separator, we list both characters:

arg_separator.input = ";&"

To implement these changes, do the following:

  • locate php.ini file (/etc/php5/apache2/php.ini  on my server)
  • open it, locate the line with ‘arg_separator.input’, uncomment it by removing the ‘;’ char at the beginning of the line, and set its value as shown above.
  • do the same with ‘arg_separator.output’ if you need to
  • save your changes
  • restart your server (“sudo /etc/init.d/apache2 restart”)

That’s it! (you can start using your new xml-compliant urls)

Warning

If you implement this solution, keep in mind that it might NOT be an optimal choice for search-engine optimization. Search engines might expect the standard ‘&’ character so that they can parse your request uri and index it better.
In my case, this was not a concern because my website is almost entirely private (identification required) and therefore useless for search engines.

A more search-engine-friendly solution would be to use ‘/’ as a separator. This may not even require any change to php per se if your app is built upon a recent framework (such as Zend Framework) that natively handles variable extraction (but you won’t get the values in the $_GET superglobal; check your framework documentation for more details).

Note1: you might also want to modify arg_separator.output and set it to a matching separator (‘;’ here) if you use php functions to build/output urls.

Note2: if you don’t have access to the php.ini file, don’t worry: both settings can also be changed within httpd.conf or .htaccess files using the following directives: “php_value arg_separator.output &amp;” and “php_value arg_separator.input ;&”
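
For Note1, if you would rather not touch arg_separator.output globally, http_build_query() also accepts the separator as its third argument. The sketch below is only an illustration: the parameter values and the ‘myurl’ target are placeholders.

```php
<?php
// Hypothetical parameters; the second value contains the separator itself.
$params = ['key1' => 'value1', 'key2' => 'a;b'];

// Third argument overrides arg_separator.output for this call only.
$query = http_build_query($params, '', ';');

// The ';' inside the value is urlencoded, so it cannot be confused
// with the separator, and no '&' is left to break xml validation.
$link = '<a href="myurl?' . htmlspecialchars($query) . '">my_link</a>';

echo $query, "\n";
echo $link, "\n";
```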

Validate content with regular expression within relax ng

Tuesday, April 14th, 2009

If you want to use a regular expression within your relax-ng schema to validate content, you can do as follows:

<element name="group_ids">
 <data type="string">
  <param name="pattern">(\d+,?)*</param>
 </data>
</element>
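
To get a feel for what this pattern accepts, here is a rough php equivalent. It is only an approximation: relax-ng patterns (XML Schema regular expressions) always match the whole value, which is emulated here with explicit anchors, and the function name is hypothetical.

```php
<?php
// Approximates the relax-ng pattern (\d+,?)* with PCRE.
// \A and \z anchor the match to the whole string, as relax-ng does implicitly.
function looks_like_group_ids(string $value): bool
{
    return preg_match('/\A(?:\d+,?)*\z/', $value) === 1;
}

var_dump(looks_like_group_ids('1,2,3')); // digits separated by commas
var_dump(looks_like_group_ids('1,2,'));  // a trailing comma is allowed by ",?"
var_dump(looks_like_group_ids('1,,2'));  // two commas in a row
var_dump(looks_like_group_ids('a,b'));   // not digits
```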

match any node with relaxng

Wednesday, January 21st, 2009

To match any node with relax ng, you need to define your pattern step by step.

I.e. first you need to specify that any node name is to be matched (using the special <anyName/> node), and then define the content you expect.

For example, if you want to match any node name, but still restrict these nodes to string content only, you can do the following:

<element>
<anyName/>
<data type="string"/>
</element>

Read the official tutorial (section 11) for more details.

Subclass / Overload relax-ng definitions

Tuesday, January 20th, 2009

Relax-ng is so great: you can subclass definitions to better match your needs.

Let’s say you’ve made a definition of an element in file elem1.rng as follows:

<define name="elem1.class">..def1..</define>

Now you want to modify this element.

Add extra choice

Let’s say you want elem1 to be able to match either def1 or another definition called def2.
To add a choice node, simply redefine this element in your current relax-ng file, adding the ‘combine="choice"’ attribute, as follows:

<define name="elem1.class" combine="choice">..def2..</define>

Now it’s the same as having:

<define name="elem1.class"><choice>..def1.. ..def2..</choice></define>

Add extra requested content

Let’s say now that you want to add another constraint to elem1: it needs to match both def1 AND def2, in any order.
Simply redefine this element in your current relax-ng file, adding the ‘combine="interleave"’ attribute, as follows:

<define name="elem1.class" combine="interleave">..def2..</define>

Now it’s the same as having:

<define name="elem1.class"><interleave>..def1.. ..def2..</interleave></define>

Replacing a definition

Now you don’t want elem1 to match def1 but simply to match def2, ie. you want to replace def1 with def2.
Here we don’t use any combine attribute; instead we simply redefine our element within the include node.

Typically the include node is stand-alone, as follows:

<include href="elem1.rng"/>

To subclass a definition, simply redefine your element within the include node as follows:

<include href="elem1.rng">
<define name="elem1.class">...def2...</define>
</include>

Now it’s as if elem1.rng had always contained the above definition referring to def2.

Validate MySQL datetime format with relax-ng

Tuesday, January 13th, 2009

The key to validating string content with relax-ng is to use the “pattern” parameter.

A MySQL datetime looks as follows:

2009-01-13 13:25:06 # output of 'select NOW();'

Corresponding relax-ng validation schema is:

<!-- define mysql datetime format content -->
<define name="global.mysql_datetime">
<data type="string"><param name="pattern">[0-9]{4}(-[0-9]{2}){2} [0-2][0-9](:[0-5][0-9]){2}</param></data>
</define>

Now you can simply reference it in your relax-ng document using <ref name="global.mysql_datetime"/>. For example, to ensure the <creation_date> node has mysql datetime formatted content, you could use:

<element name="creation_date"><ref name="global.mysql_datetime"/></element>
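
Keep in mind that this pattern only checks the shape of the value, not whether the date actually exists. The php sketch below applies the same expression with explicit anchors (relax-ng anchors patterns implicitly); the constant and function names are assumptions made for the illustration.

```php
<?php
// Same expression as in the schema, anchored to the whole string.
const MYSQL_DATETIME_PATTERN =
    '/\A[0-9]{4}(-[0-9]{2}){2} [0-2][0-9](:[0-5][0-9]){2}\z/';

function has_mysql_datetime_format(string $value): bool
{
    return preg_match(MYSQL_DATETIME_PATTERN, $value) === 1;
}

var_dump(has_mysql_datetime_format('2009-01-13 13:25:06')); // output of NOW()
var_dump(has_mysql_datetime_format('2009-1-13 13:25:06'));  // month lacks a digit
// Format-only check: month 99 and hour 29 still fit the pattern.
var_dump(has_mysql_datetime_format('2009-99-99 29:59:59'));
```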

XMLReader fails to load relax-ng schema

Tuesday, December 30th, 2008

I don’t know why the php XMLReader library fails to load every one of my relax-ng schemas, including some pretty basic ones.

Each time I end up with the following message: Warning: XMLReader::setRelaxNGSchemaSource() [xmlreader.setrelaxngschemasource]: Unable to set schema. This must be set prior to reading or schema contains errors. in /var/www/test.php on line 12

source code

Here is the source code that generated above error:

<?php
$rng_schema = <<<RNG
<?xml version="1.0"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
<start>
<element name="node1"><text/></element>
</start>
</grammar>
RNG;
$xml_reader = new XMLReader();
$xml_reader->setRelaxNGSchemaSource($rng_schema);

workaround

My relax-ng schemas seem to work fine with xmllint (part of the libxml2-tools package) or with the Sun Microsystems “msv” validator (https://msv.dev.java.net/).

Both of them return an error code if validation fails. We can use it to emulate a relax-ng validator by calling them from php. Personally I prefer xmllint since it avoids java:

// $rng_path and $xml_path hold the paths to the relax-ng schema and to the xml file
$command = 'xmllint --noout --relaxng ' . escapeshellarg($rng_path) . ' ' . escapeshellarg($xml_path) . ' 2>&1';
exec($command, $output, $exit_code);
if (0 === $exit_code) {
  // when validation succeeds
} else {
  // when validation fails ($output holds the messages printed by xmllint)
}
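
One possible explanation, offered as an assumption rather than a confirmed diagnosis: as far as I can tell, XMLReader only accepts a schema once a document has been opened with XML() or open(), and the same warning is raised when no document is loaded yet. The sketch below reuses the schema from the source code above, loads the document first and only then attaches the schema; the helper name validates_against() is hypothetical.

```php
<?php
libxml_use_internal_errors(true); // keep libxml messages out of the output

$rng_schema = <<<'RNG'
<?xml version="1.0"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
<start>
<element name="node1"><text/></element>
</start>
</grammar>
RNG;

function validates_against(string $xml, string $rng_schema): bool
{
    $reader = new XMLReader();
    $reader->XML($xml); // 1. open the document first
    if (!$reader->setRelaxNGSchemaSource($rng_schema)) { // 2. then attach the schema
        $reader->close();
        return false;
    }
    // 3. validation happens while reading, so walk through the document
    $is_valid = true;
    while ($is_valid && $reader->read()) {
        $is_valid = $reader->isValid();
    }
    $reader->close();
    return $is_valid;
}

var_dump(validates_against('<node1>hello</node1>', $rng_schema));
var_dump(validates_against('<node2>hello</node2>', $rng_schema));
```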

SimpleXML and default namespace

Tuesday, December 16th, 2008

It looks like php 5.2.6 has a problem with xpath queries on SimpleXMLElement with a default namespace!

Let’s say your xml file is the following:

<?xml version="1.0"?>
<request name="/do/action" xmlns="mynamespace">
</request>

Let’s say you create a SimpleXMLElement: $sxe = new SimpleXMLElement($xml_string);

Now you want to select the ‘request’ node through its ‘name’ attribute:

$sxe->xpath('/request[@name]');

This will not return any result because you did not register your default namespace.
Well, let’s do it! The first argument is the prefix, the second is the namespace:

$sxe->registerXPathNamespace('', 'mynamespace');

If you run your xpath query again, you still won’t get any result! Too bad! It looked like a simplexml bug to me, but it is consistent with XPath 1.0, where an unprefixed name in an expression always refers to an element in no namespace.

solution

You must assign a prefix to your namespace when registering it with simplexml, and use that prefix in your xpath query. Let’s say we use ‘default’; then we should do the following:

$sxe->registerXPathNamespace('default', 'mynamespace');
$sxe->xpath('/default:request[@name]');

Now it works.

The bad news is that you need to register xpath namespaces on each SimpleXMLElement on which you call ->xpath() (cf. XPath, SimpleXML and default namespace)

generic solution

A generic solution is given in a user comment on the php simplexml documentation page:

$namespaces = $xml->getNamespaces(true);
if (isset($namespaces[""])) { // if you have a default namespace
  // register a prefix for that default namespace:
  $xml->registerXPathNamespace("default", $namespaces[""]);
  // and use that prefix in all of your xpath expressions:
  $xpath_to_document = "//default:document";
} else {
  $xpath_to_document = "//document";
}
$document = $xml->xpath($xpath_to_document);
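
Putting the pieces of this post together, here is a self-contained sketch run against the sample document from the beginning of the post. The ‘default’ prefix is an arbitrary choice, and libxml warnings are silenced because ‘mynamespace’ is not an absolute URI.

```php
<?php
libxml_use_internal_errors(true); // 'mynamespace' is not an absolute URI

$xml_string = '<?xml version="1.0"?>'
    . '<request name="/do/action" xmlns="mynamespace"></request>';
$sxe = new SimpleXMLElement($xml_string);

// Unprefixed query: looks for <request> in no namespace, finds nothing.
$without_prefix = $sxe->xpath('/request[@name]');

// Register a prefix for the default namespace and use it in the query.
$sxe->registerXPathNamespace('default', 'mynamespace');
$with_prefix = $sxe->xpath('/default:request[@name]');

var_dump(empty($without_prefix));
echo (string) $with_prefix[0]['name'], "\n"; // value of the 'name' attribute
```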

PHP and Relax-NG validation

Tuesday, December 16th, 2008

To validate an xml file with a Relax-NG schema in PHP, you can use the XMLReader library:

$xml_reader = new XMLReader();
// load the document first: the schema can only be attached to an open reader
$xml_reader->XML($xml_input->asXML());
$xml_reader->setRelaxNGSchema($relax_ng_file);
// validation happens while reading, so walk through the whole document
while ($xml_reader->read()) {
  if (!$xml_reader->isValid()) throw new Exception('failed RelaxNG validation');
}

I did not find how to retrieve the corresponding error.
If you need a more verbose relax-ng validator, you can use ‘xmllint’ from the linux command line (part of the ‘libxml2’ package).
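
One way to get at the error messages from php, sketched here with DOMDocument rather than XMLReader, is to let libxml collect its errors and read them back with libxml_get_errors(). The schema, the sample documents and the function name relaxng_errors() are all made up for the illustration.

```php
<?php
// Hypothetical schema: a single <node1> element with text content.
$rng_schema = <<<'RNG'
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="node1"><text/></element>
  </start>
</grammar>
RNG;

// Returns the list of libxml messages raised while validating $xml.
function relaxng_errors(string $xml, string $rng_schema): array
{
    $previous = libxml_use_internal_errors(true); // collect instead of warn
    libxml_clear_errors();

    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $doc->relaxNGValidateSource($rng_schema);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting
    return $messages;
}

print_r(relaxng_errors('<node1>ok</node1>', $rng_schema));
print_r(relaxng_errors('<node2/>', $rng_schema));
```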

xml namespaces

Wednesday, September 17th, 2008

XML namespaces are very easy to use once we understand their purpose.

objective

The purpose of xml namespaces is to prevent name collisions, ie. to prevent two elements from having the same name while meaning two different things.

Below is an illustration taken from w3schools.com

Defining a default namespace for an element saves us from using prefixes in all the child elements. It has the following syntax: xmlns="namespaceURI"

This XML carries HTML table information:

<table xmlns="http://www.w3.org/TR/html4/">
<tr>
<td>Apples</td>
<td>Bananas</td>
</tr>
</table>

This XML carries information about a piece of furniture:

<table xmlns="http://www.w3schools.com/furniture">
<name>African Coffee Table</name>
<width>80</width>
<length>120</length>
</table>

how to define namespace

Basically you have two types of namespaces: default namespaces and prefixed namespaces.
Each namespace must have its own namespace name, which makes it unique (at least different from the other namespaces we want to distinguish it from). This namespace name can be any string; generally we use a url (it does not matter whether this url points to a real document or not, it is simply used as a unique string identifier). Example: ‘xmlns="http://www.qc4web.com"’

A default namespace is declared as follows: <node1 xmlns="mynamespacename">.
By doing so, all nodes within our namespaced <node1> element that do not have a prefix on their node name are considered part of “mynamespacename”, except if a child node defines its own default namespace too.

A prefixed namespace is declared as follows: <h:node2 xmlns:h="myprefixednamespace">.
In this example, we’ve chosen the letter “h” as a prefix, but you can choose another string if you want. From this point on, any node within our <node2> using “h” as node-name prefix will be considered part of “myprefixednamespace”. Any inner node without a prefix belongs to the default namespace declared by its nearest ancestor (a <node1> parent, for instance), or to no namespace if none is declared.

namespace scope

We’ve already seen the characteristics of namespace scope, but here is a more formal (yet crystal clear) definition, taken from the W3C recommendation:

The scope of a default namespace declaration extends from the beginning of the start-tag in which it appears to the end of the corresponding end-tag, excluding the scope of any inner default namespace declarations. In the case of an empty tag, the scope is the tag itself.

This is very good news! It means we do not have to prefix all nodes when we merge two xml documents: we simply need to specify a different default namespace on the root node of each merged document!

default namespaces vs. prefixed namespaces

Let’s go back to the w3schools example above and imagine we want to merge the two xml documents. Using prefixed namespaces, we get:

<two_tables xmlns:g="http://www.w3.org/TR/html4/" xmlns:h="http://www.w3schools.com/furniture">
<g:table>
<g:tr>
<g:td>Apples</g:td>
<g:td>Bananas</g:td>
</g:tr>
</g:table>
<h:table>
<h:name>African Coffee Table</h:name>
<h:width>80</h:width>
<h:length>120</h:length>
</h:table>
</two_tables>

while using default namespace, we get:

<two_tables>
<table xmlns="http://www.w3.org/TR/html4/">
<tr>
<td>Apples</td>
<td>Bananas</td>
</tr>
</table>
<table xmlns="http://www.w3schools.com/furniture">
<name>African Coffee Table</name>
<width>80</width>
<length>120</length>
</table>
</two_tables>

In our case, the result is the same, each <table> has its own namespace (and is considered different).
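
This can be checked from php with the DOM extension. The sketch below loads a shortened copy of the default-namespace version and prints the namespace of each <table>; the variable names are arbitrary.

```php
<?php
// Shortened copy of the merged document using default namespaces.
$merged = <<<'XML'
<two_tables>
  <table xmlns="http://www.w3.org/TR/html4/">
    <tr><td>Apples</td><td>Bananas</td></tr>
  </table>
  <table xmlns="http://www.w3schools.com/furniture">
    <name>African Coffee Table</name>
  </table>
</two_tables>
XML;

$doc = new DOMDocument();
$doc->loadXML($merged);

// Each <table> reports the default namespace declared on it.
$uris = [];
foreach ($doc->getElementsByTagName('table') as $table) {
    $uris[] = $table->namespaceURI;
}
print_r($uris);

// Children inherit the default namespace of their enclosing <table>.
$td = $doc->getElementsByTagName('td')->item(0);
echo $td->namespaceURI, "\n";

// The root declares nothing, so it belongs to no namespace (null).
var_dump($doc->documentElement->namespaceURI);
```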

Basically, using prefixed namespaces instead of default namespaces is useful when you mix elements of different namespaces multiple times within the same document.

sources

  • http://www.w3.org/TR/REC-xml-names/#sec-namespaces (especially section 6, on namespaces scope)
  • http://www.w3schools.com/Xml/xml_namespaces.asp