Aug 06
Updating Windows 8.1 Pro to Windows 10 Pro on a Surface Pro 2 … part 3

I had hoped that, after Updating Windows 8.1 Pro to Windows 10 Pro on a Surface Pro 2 … oh the pain …, my Windows 10 issues would be solved. Well, I had hoped, anyway.

This evening, I lost all of my Windows 10 networking again. I've no idea why – I wasn't playing with any networking settings, or anything like that.

I reverted to Windows 8.1 again, and then installed Windows 10 for the fifth time.

How long will it last until the problem happens again? Sadly, I really don't know.

Aug 05
Updating Windows 8.1 Pro to Windows 10 Pro on a Surface Pro 2 … oh the pain …

I mentioned previously in Updating Windows 8.1 Pro to Windows 10 Pro on a Surface Pro 2 that it took me two goes to get Windows 10 running successfully (with a working network setup) on my Surface Pro 2. Unfortunately, it quickly went bad on me again.

It may have been partially my own fault, curiosity killing the cat. I knew that VPNs were an issue in upgrading to Windows 10. I naively assumed that once I had Windows 10 running, all VPN issues would be gone. So, I had a look at the new VPN support in Windows 10, and tried the "Add VPN" dialog, though I cancelled without adding one.

However, after that, I once again found that all networking was gone. A Surface without networking really is just a brick, so it's one of the worst problems you can have.

I tried reverting to Windows 8.1 again, and then I installed Windows 10 for the *third* time. However, no-go – I had no networking, just like my original attempt at installing Windows 10.

I had to wait a day or two until I had enough time for a fourth attempt. In the meantime, some new Windows updates arrived for Windows 8.1, and they may have helped. I also removed a couple more applications that had built-in VPNs.

So, last night I installed Windows 10 for the fourth time, and I have it working, with networking. I like Windows 10, it's good to have it back. What I don't know, as I write, is whether I'm past these networking issues now, or if they might yet come back to haunt me. We'll see how things unfold over the next week or two.

Jul 31
Updating Windows 8.1 Pro to Windows 10 Pro on a Surface Pro 2

So, Windows 10 was released on 29th July 2015, and the question was how long I should wait before installing it on my Microsoft Surface Pro 2. With previous releases of Windows, it was usual to wait a while for bug fixes to emerge. However, with Windows 10, Microsoft has moved to a strategy of many beta releases, with the "actual" release being the final beta. On that basis, the launch version of Windows 10 should already have been reasonably well sorted out, so I thought I would give it a try yesterday.

I backed up my Surface using 'File History', then ran the Windows 10 updater. It all went smoothly, it all looked good. But did it work?

Well, one big thing didn't work at all. Networking. There wasn't any, and that makes it pretty hard to do anything with a Surface (which depends on Wi-Fi, since it doesn't have an Ethernet plug).

So, I searched the web to see who else was having similar problems. There was no obvious case that sounded exactly like mine, but there were two clear themes:

  • people were recommending removing Cisco VPN software before installing Windows 10;
  • people were recommending installing all available Windows updates for 8.1 before updating to Windows 10.

So, the next step was to revert to Windows 8.1. That's easy to do (for up to a month after installation of Windows 10) using: Settings | Update & security | Recovery | Go back to Windows 8.1.

Once Windows 8.1 was restored, I had Wi-Fi again, and I ran the "Windows Update" app. Annoyingly, the first thing that the update app wanted to do was start downloading Windows 10. I stopped the download, deselected it, and selected all other Windows updates (in fact, I had to keep stopping the Windows 10 download, which was annoying).

I didn't have Cisco VPN installed, but I did have "OpenVPN" installed, so I removed it. I'm not sure if I'll need it for Windows 10, which appears to have a built-in VPN facility (not that I've had time to try it yet).

Finally, after installing the updates and rebooting, I used "Windows Update" again to install the Windows 10 update. This time, everything worked, and I have Wi-Fi.

So far, Windows 10 seems fine; I haven't had any problems with anything else. It's still early days, but it looks good so far.

Sep 17
New build of Sublime Text 3 Beta

I posted about the extensions I've done for the Sublime Text 3 Beta to support Turtle and SPARQL. I had mentioned that it wasn't clear whether Sublime Text was still being actively supported or not. It was a nice surprise today to receive a new update. Here are the changes:

Build 3065
Release Date: 27 August 2014

    * Added sidebar icons
    * Added sidebar loading indicators
    * Sidebar remembers which folders are expanded
    * Tweaked window closing behavior when pressing ctrl+w / cmd+w
    * Improved quote auto pairing logic
    * Selected group is now stored in the session
    * Added remember_full_screen setting
    * Fixed a lockup when transitioning from a blinking to a solid caret
    * Fixed a crash in plugin_host
    * Fixed a crash triggered by Goto Anything cloning views
    * Windows: Added command line helper, subl.exe
    * OSX: Added 'New Window' entry to dock menu
    * Posix: Using correct permissions for newly created files and folders
    * API: Updated to Python 3.3.3

May 13
Syntax Colouring & Completions for Turtle & SPARQL in Sublime Text – The SPARQL Bit

[ Related posts: #1 ]

I haven't posted to the blog for a while because I was moving house. Now that is done, and I have a new Microsoft Surface Pro 2 for my daily train trips, I hope to be posting more regularly.

In the previous post, I described extensions for the Sublime Text 3 beta to support editing of Turtle files for RDF and OWL. I've now added support for SPARQL as well, and released both together as version 1.0 of the sublime-text-turtle-sparql library in GitHub.

I've also re-organised the files – the Turtle files are in the 'Turtle' sub-directory, and the SPARQL files are in the 'SPARQL' sub-directory. I found I had to organise the files this way in order to get Sublime Text to add 'Turtle' and 'SPARQL' to its list of available syntaxes. Just copy these two directories into your Sublime Text 3 'Packages' directory (found using 'Preferences | Browse Packages…').

For the Turtle extension, there are two files (as described in the previous post):

  • syntax definitions;
  • completions.

With SPARQL, there are also a number of 'snippet' files that provide templates for code blocks like 'SELECT', 'CONSTRUCT', etc.

Here's how it looks:

The actual colours that you get depend on which Sublime Text colour theme you have chosen, as the syntax file doesn't directly specify colours, only theme entries.

If you are trying out these extensions, feedback would be very welcome. Let me know what you think I should change or could improve. Thanks!

Mar 23
Syntax Colouring & Completions for Turtle & SPARQL in Sublime Text

Some of the code that I discuss in this blog is written on a netbook as I travel on the train between Cambridge and London. The netbook isn't the right kind of PC on which to run a full-blown IDE like IntelliJ IDEA, in my view, so I use a text editor instead. Text editors that I've used in the past include Emacs and jEdit, and sometimes Notepad++. However, I noticed that Sublime Text seemed to be popular with the Scala community, so I decided to try it out. It's a commercial product, and I decided to buy a licence (partly to support a fellow Australian). If you are thinking about buying it, I suggest you check out the Sublime Text forums first, as some users are asking why it is taking so long for Sublime Text 3 to move out of beta, and wondering where the product is going longer term. That said, I'm using the Sublime Text 3 beta, and I'm happy with it.

When I'm doing RDF/OWL projects, I often want to read/edit one or both of Turtle files and SPARQL queries.

These are not common formats for text editors to support, so I decided to have a go at creating the necessary add-ons for Sublime Text. Sublime Text allows you to add different kinds of extensions; the two I've been interested in are:

  • syntax definitions;
  • completions.

I wrote 'syntax definitions' rather than 'syntax colouring' because Sublime Text separates colours from syntax using 'themes' (but that's a more involved discussion for another time).

To cut to the chase, here's how some Turtle (OWL from FIBO) looks now in Sublime Text:

The layout in this example, such as the extra blank lines, is due to the tool that produced the Turtle; it's nothing to do with Sublime Text.

Here's an example of a blank node:

Completions work by having a list of keywords and common strings, e.g. in the following I'm typing ':Animal a owl:Class', and the completions have popped up as I pressed the 'o' in 'owl:Class':

So far, I've written Turtle support, and SPARQL support is a work in progress (there's a lot more to SPARQL's syntax, but it borrows most of what was done for Turtle). If you want to try it out, you can download it from my 'sublime-text-turtle-sparql' project on GitHub. Here's what to do:

  1. In Sublime Text, select 'Preferences | Browse Packages…' to find where your local packages directory is.
  2. In the directory, there is a 'User' subdirectory.
  3. Into that 'User' subdirectory, copy the files 'turtle.tmLanguage' (syntax) and 'turtle.sublime-completions'.

When you next start Sublime Text, 'Turtle' should be one of the language options. If it isn't, try this fix that has worked for me:

  1. Open a new, blank file.
  2. Select 'File | Save As…'.
  3. Select 'Turtle' as the file type in the 'Save As' dialog.
  4. Save the file somewhere (you can delete it after saving it).

If you want to modify the completions for yourself, see the unofficial Sublime Text documentation on completions (odd as it may sound, the unofficial documentation is much better than the official documentation).

If you want to modify the syntax for yourself, see the unofficial Sublime Text documentation on syntax definitions. Also,

  • Save 'turtle.JSON-tmLanguage' to the same 'User' subdirectory as for the other files, as it's the source file that you will edit to generate the '.tmLanguage' syntax file.
  • Install 'Package Control'.
  • Use 'Package Control' to install 'AAAPackageDev'.
  • As well as reading the unofficial documentation, read the TextMate manual documentation on naming conventions for language grammars.
    • Note: Sublime Text, like some other editors, follows TextMate's approach to syntax.
    • You define regular expressions with a name, and the names are specially chosen to match entries in Sublime Text's 'theme' file. That's how syntax definitions eventually turn into colours in the editor, via the currently selected theme.
    • Sublime Text matches the syntax's regular expressions against your text. Once a piece of text has been matched by one regular expression, it won't match any others, so the order of the regular expressions is crucial.

Mar 19
A Scala API for RDF Processing

My general purpose language of choice these days is Scala. Recently, I wanted to demonstrate some ideas about automated enhancement of RDF and OWL. I could encode some of what I wanted to do as SPARQL queries, but I didn't have any convenient way of chaining a sequence of queries into a workflow. To try and plug that gap, I've written a library that I call 'scala-rdf'. It's not a 'Scala API for RDF' in the sense of something that provides Scala classes for fundamental RDF constructs like statements (aka triples, quads, pents, etc.). Rather, my focus is on how to create an API that lends itself to compact encoding of RDF processing pipelines.

On that basis, 'scala-rdf' is currently a thin Scala wrapper for 'Sesame 2.7', which is a Java implementation of an RDF triple store that supports SPARQL queries.

Now, in a triple store, there is no more than one occurrence of each triple (at least, I believe all current implementations work this way), so a triple store is really a 'set of triples' in a mathematical sense.

Scala has a well-defined generic 'Set' trait (a trait is like a Java interface, except that it can have actual method implementations as well as method signatures). The obvious thing seemed to be to implement a Scala layer to make a Sesame triple store expose the Scala 'Set' trait (interface).

However, the Scala collections library, which is very rich and clever and very useful in most circumstances, proved rather hard to work with when it came to implementing one of its traits with my own class. There are many classes and traits in the collections library, and there are many dependencies between those classes and traits. The current implementations seem to be biased towards

  • in-memory data structures that can be read-write or read-only,
  • which have a definite left-to-right order to their elements, and
  • which can be processed in parallel.

Now, those assumptions are fine for Scala's in-built collections classes, but they don't match the characteristics of an RDF triple store that may be on-disk. The triple store may not be in-memory, and it may not provide read-only functionality, but especially

  • it probably won't provide a parallel processing option, because that's outside of the functionality that SPARQL provides, and
  • the triples won't have a definite order, just as the rows in a table in a relational database don't have a definite order.

So, while I think the Scala 'Set' trait is a good model of the methods you would want, it wasn't practical to use it directly (unless someone would like to advise otherwise on what I could have done).

Instead, I created an 'UnorderedSequentialSet' trait. It borrows the methods from the 'Set' trait that make sense for an unordered set that can only be processed sequentially. It also currently doesn't provide for read-only access, only read-write. It's the kind of 'set' that databases implement, in my experience. My initial implementation of this trait, just a thin layer on top of Sesame's 'Repository' interface, is 'SesameTripleSet'.
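
To make the idea concrete, here is a minimal sketch of what such a trait and its Sesame-backed implementation might look like. The trait and class names come from the post; the choice of methods, the method bodies and the helper names are my own assumptions for illustration, not the actual scala-rdf code, and they assume the org.openrdf packages of Sesame 2.7.

import org.openrdf.model.Statement
import org.openrdf.repository.{Repository, RepositoryConnection}

// Sketch of the trait: methods borrowed from Scala's mutable Set, but with
// no ordering guarantee and no parallel processing support.
trait UnorderedSequentialSet[A] {
  def +=(elem: A): this.type
  def -=(elem: A): this.type
  def contains(elem: A): Boolean
  def size: Long
  def foreach[U](f: A => U): Unit

  // Concrete helpers can be defined in terms of the abstract methods.
  def isEmpty: Boolean = size == 0L
  def exists(p: A => Boolean): Boolean = {
    var found = false
    foreach { a => if (!found && p(a)) found = true }
    found
  }
}

// Sketch of a thin wrapper over a Sesame Repository, exposing its statements
// through the trait above. Error handling and connection reuse are omitted.
class SesameTripleSet(repository: Repository) extends UnorderedSequentialSet[Statement] {
  private def withConnection[T](body: RepositoryConnection => T): T = {
    val conn = repository.getConnection()
    try body(conn) finally conn.close()
  }

  def +=(stmt: Statement): this.type = { withConnection(_.add(stmt)); this }
  def -=(stmt: Statement): this.type = { withConnection(_.remove(stmt)); this }
  def contains(stmt: Statement): Boolean = withConnection(_.hasStatement(stmt, false))
  def size: Long = withConnection(_.size())
  def foreach[U](f: Statement => U): Unit = withConnection { conn =>
    val statements = conn.getStatements(null, null, null, false)
    try { while (statements.hasNext) f(statements.next()) } finally statements.close()
  }
}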

I'll let you look through the code in your own time, but let me talk about testing. Scala has a sophisticated collections library, so if I was going to implement the methods myself, I needed a way to be sure they behaved identically to the equivalent Scala 'Set' methods. To that end, I created the 'UnorderedSequentialHashSet' class. It takes Scala's built-in mutable 'HashSet' class, and extends it to support the 'UnorderedSequentialSet' trait. In all of my testing, I tested both 'SesameTripleSet' and 'UnorderedSequentialHashSet' with the same tests, so that I could confirm that my implementation behaved identically to the built-in equivalent.
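
Again as a sketch only: the real class extends 'HashSet' directly, but the same idea can be shown more briefly by delegating to a 'HashSet', reusing the hypothetical 'UnorderedSequentialSet' trait from the previous snippet.

import scala.collection.mutable

// Reference implementation backed by Scala's own mutable HashSet, so that the
// Sesame-backed implementation can be checked against known-good behaviour.
class UnorderedSequentialHashSet[A] extends UnorderedSequentialSet[A] {
  private val underlying = mutable.HashSet.empty[A]

  def +=(elem: A): this.type = { underlying += elem; this }
  def -=(elem: A): this.type = { underlying -= elem; this }
  def contains(elem: A): Boolean = underlying.contains(elem)
  def size: Long = underlying.size.toLong
  def foreach[U](f: A => U): Unit = underlying.foreach(f)
}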

The tests are written using the 'ScalaTest' library, and can be found here. In particular, worth noting are

  • 'UnorderedSequentialSetSpec', an abstract test class that defines the tests in one place, but can then be extended with the specifics necessary for the actual classes being tested (a rough sketch of this pattern follows this list);
  • 'StatementUnorderedSequentialHashSetSpec', which trivially extends 'UnorderedSequentialSetSpec' for an 'UnorderedSequentialHashSet' of Sesame 'Statements';
  • 'SparqlProcessorSpec', a test class that defines tests for classes which support SPARQL queries (note: this excludes 'UnorderedSequentialHashSet');
  • 'SesameTripleSetSpec', which trivially extends both 'UnorderedSequentialSetSpec' and 'SparqlProcessorSpec'.
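
As a rough illustration of that shared-test pattern (not the actual test code; the class names, method names and ScalaTest style here are my assumptions), an abstract spec can declare a factory for the set under test, and each concrete spec then only supplies the factory:

import org.scalatest.{FlatSpec, Matchers}

// Abstract spec: defines the behavioural tests once, in terms of a factory
// supplied by each concrete subclass.
abstract class UnorderedSequentialSetSpec[A] extends FlatSpec with Matchers {
  def emptySet(): UnorderedSequentialSet[A]   // factory for the class under test
  def sampleElement: A                        // an element to add/remove in tests

  "an UnorderedSequentialSet" should "report elements that have been added" in {
    val s = emptySet()
    s += sampleElement
    s.contains(sampleElement) shouldBe true
    s.size shouldBe 1L
  }

  it should "no longer report elements that have been removed" in {
    val s = emptySet()
    s += sampleElement
    s -= sampleElement
    s.contains(sampleElement) shouldBe false
  }
}

// A trivial concrete spec for the HashSet-backed implementation.
class StringUnorderedSequentialHashSetSpec extends UnorderedSequentialSetSpec[String] {
  def emptySet() = new UnorderedSequentialHashSet[String]
  def sampleElement = "http://example.org/some-resource"
}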

That's a whirlwind introduction to 'scala-rdf'. In future posts, I plan to describe its use in more detail (this post has been more than a little breathless).

Mar 04
Generating an XML test corpus – Testing triple natural key search

[ Related posts: #1 , #2 , #3 , #4 , #5 , #6 , #7 , #8 , #9 ]

In the previous post, I found that MarkLogic 7's triple index seems to be a particularly fast way of implementing linking and navigation between documents in MarkLogic. However, what if you don't have MarkLogic 7? It's still a comparatively new version, and some people aren't yet ready to move on from MarkLogic 6 in production.

The interesting thing about MarkLogic 7's approach to storing triples is that it doesn't implement a triple store that is separate to the XML store. This differs from what some other databases do. In MarkLogic 7, triples are stored as XML, and any XML in the correct format is interpreted as triples automatically. That means that you can store appropriately formatted triples in MarkLogic 6 now. Since they are XML, you can search that XML using the standard XML search facilities. Then, when you move to MarkLogic 7, you will immediately be able to benefit from ML7's triple store.

The question that arises is – how much slower is it to treat triples as XML, using only XML searching? It won't be as fast as using the triple store, but if it is comparable to the other linking/navigation methods, then it may be a no-cost way to prepare now for using ML7 in the future. The one thing you need to do is enable the appropriate indexing in MarkLogic.

So, let's cut to the chase – what are the results?

The time taken to traverse a document link is, on average, 27 +/- 7 milliseconds. This compares to 2.0 +/- 0.7 milliseconds for triple SPARQL queries, 20 +/- 3 milliseconds for compound natural key search, 26 +/- 7 milliseconds for the simple natural key search, 29 +/- 6 milliseconds for primary key search and 22 +/- 8 milliseconds for direct URL access. Please note that these times only apply to MarkLogic 7 running on my home PC; you should run the tests on your own setup if you want to do the same analysis for your own system.

So, in practical terms, using triples without the triple index is not significantly different to the other non-SPARQL approaches. It seems that there is no penalty for using triples encoded as XML now, even if you aren't yet using MarkLogic 7.

Here is the code that was used to make the measurements (the value "$maxDepth" was changed manually for each run, from 0 to 10).

test-triple-natural-key-search.xquery:[download this file]

xquery version "1.0-ml";
(:
Copyright 2014 Anthony Coates

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
:)

declare namespace sem = "http://marklogic.com/semantics";
declare namespace my = "http://ns.contakt.org/ml7/link";

declare variable $collection := "http://ns.contakt.org/ml7/docs/my/";
declare variable $ontology := "http://rdf.contakt.org/ml7/link#";
declare variable $maxCount := 1000000;
declare variable $maxDepth := 10;
declare variable $iterations := 10;

(: Given a document URI, retrieve the next one. In this case, searches for a 'hasRandomFile' triple from $doc to the random document, using simple indexed search. :)
declare function my:retrieveNext($docUri as sem:iri) as sem:iri {
   let $tripleElement := cts:search(
     fn:collection($collection)/my:doc/sem:triple,
     cts:and-query((
       cts:element-value-query(
         xs:QName("sem:predicate"),
         "http://rdf.contakt.org/ml7/link#hasRandomFile",
         "exact"
       ),
       cts:element-value-query(
         xs:QName("sem:subject"),
         $docUri,
         "exact"
       )
     ))
   )
   return sem:iri($tripleElement/sem:object)
};

(: Trace through a list of links recursively. :)
declare function my:traceLinks($docUri as sem:iri, $depth as xs:integer) as sem:iri {
   if ($depth = 0)
   then $docUri
   else my:traceLinks(my:retrieveNext($docUri), $depth - 1)
};

declare function my:timeTraceLinks() as xs:dayTimeDuration {
   let $startTime := xdmp:elapsed-time()
   let $startIndex := xdmp:random($maxCount + 1)
   let $startUrl := concat($collection, $startIndex)
   let $startDoc := fn:doc($startUrl)
   let $results := fn:doc(my:traceLinks($startDoc/my:doc/my:fileUrl, $maxDepth))
   return (xdmp:elapsed-time() - $startTime)
};

let $warmup := my:timeTraceLinks()
let $timings := for $i in 1 to $iterations return my:timeTraceLinks()
let $avg := fn:avg($timings)
let $min := fn:min($timings)
let $max := fn:max($timings)
return
   <results depth="{$maxDepth}" iterations="{$iterations}">
     <min>{$min}</min>
     <avg>{$avg}</avg>
     <max>{$max}</max>
   </results>

Mar 03
Generating an XML test corpus – Testing triple SPARQL queries

[ Related posts: #1 , #2 , #3 , #4 , #5 , #6 , #7 , #8 ]

So far, we've tested various different ways that document linking & navigation can be achieved in MarkLogic, most of them involving the MarkLogic index. In MarkLogic 7, there is a new kind of index, for RDF triples. An RDF triple is made up of three values: a subject, a predicate and an object. The 'subject' is a URI that refers to some object or thing, the 'predicate' is a URI that refers to some property or relationship, and the 'object' is the value of the property or relationship. The object can be a simple number, string, etc., but it can also be a URI for an object or thing. All RDF information is expressed in terms of such triples (some RDF stores have extended this concept to 'quads' and even 'pents', i.e. quadruples or pentuples, but that's beyond the scope of this post).

Referring to the document structure, each document contains the following MarkLogic-specific XML:

<sem:triple xmlns:sem="http://marklogic.com/semantics">
  <sem:subject>http://ns.contakt.org/ml7/docs/my/190700</sem:subject>
  <sem:predicate>http://rdf.contakt.org/ml7/link#hasRandomFile</sem:predicate>
  <sem:object>http://ns.contakt.org/ml7/docs/my/621910</sem:object>
</sem:triple>

This is the MarkLogic XML encoding of an RDF triple. The subject is the document itself, i.e. the subject URI is the same as the document URI. The predicate is a URI that uniquely identifies a 'hasRandomFile' relationship between the subject and the object. The object is the document URI for the random file to which the document has a link.

Triples can be treated specially in MarkLogic 7. They can be indexed in a special way, and they can be searched using the SPARQL query language. You have to enable triple indexing explicitly in MarkLogic 7, but once you enable it, it is enabled for all triples (so unlike element/attribute indices, you don't have to keep adding indices – the triple index is either on or off).

In MarkLogic 7, you typically run SPARQL queries from within an XQuery. In this case, the XQuery was written to implement each document-to-document query as a separate SPARQL query. Before looking at the query code, let's cut to the chase and see how the measurements went.

Now, these curves don't fit a straight line as well as the previous curves did. Why not? Well, I'll get to that. The time taken to traverse a document link is, on average, 2.0 +/- 0.7 milliseconds. This compares to 20 +/- 3 milliseconds for compound natural key search, 26 +/- 7 milliseconds for the simple natural key search, 29 +/- 6 milliseconds for primary key search and 22 +/- 8 milliseconds for direct URL access. Please note that these times only apply to MarkLogic 7 running on my home PC; you should run the tests on your own setup if you want to do the same analysis for your own system.

In short, in my tests, using indexed triples was 10-15 times faster than the other methods. That's an amazing improvement. For some purposes, it may be grounds enough for migrating from MarkLogic 6 to 7. Indeed, the times are so small that I think the poorer fit of the data is largely because the normal measurement errors are proportionally much larger (10x or so) relative to these measurements.

Here is the code that was used to make the measurements (the value "$maxDepth" was changed manually for each run, from 0 to 10). The function "my:retrieveNext" generates the SPARQL query. The basic query is

SELECT ?object
WHERE {
  ?subject <http://rdf.contakt.org/ml7/link#hasRandomFile> ?object .
}

which searches for a triple with some unspecified subject, the specific URI that we are using for the 'hasRandomFile' predicate, and some unspecified object. It then returns only the object. Note that in MarkLogic 7, it doesn't matter in which document these triples occur. They can occur anywhere in the database. In "my:retrieveNext", the value of the subject is fixed in the "sem:sparql" query by specifying its value in the second argument, a map. That fixes the subject to be the document URI, i.e. we restricted our search to 'hasRandomFile' associations from the current document, and we know there is only one of those. Otherwise, the query is much like the previous queries. A document is loaded only at the start and end of each run, not at intermediate stages, as all intermediate information is available in the triple index.

test-triple-sparql-query.xquery:[download this file]

xquery version "1.0-ml";
(:
Copyright 2014 Anthony Coates

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
:)

declare namespace sem = "http://marklogic.com/semantics";
declare namespace my = "http://ns.contakt.org/ml7/link";

declare variable $collection := "http://ns.contakt.org/ml7/docs/my/";
declare variable $ontology := "http://rdf.contakt.org/ml7/link#";
declare variable $maxCount := 1000000;
declare variable $maxDepth := 10;
declare variable $iterations := 10;

(: Given a document URI, retrieve the next one. In this case, searches for a 'hasRandomFile' triple from $doc to the random document. :)
declare function my:retrieveNext($docUri as sem:iri) as sem:iri {
   map:get(
     sem:sparql(
       'SELECT ?object WHERE { ?subject <http://rdf.contakt.org/ml7/link#hasRandomFile> ?object . }',
       map:new(
         map:entry("subject", sem:iri($docUri))
       )
     ),
     "object"
   )
};

(: Trace through a list of links recursively. :)
declare function my:traceLinks($docUri as sem:iri, $depth as xs:integer) as sem:iri {
   if ($depth = 0)
   then $docUri
   else my:traceLinks(my:retrieveNext($docUri), $depth - 1)
};

declare function my:timeTraceLinks() as xs:dayTimeDuration {
   let $startTime := xdmp:elapsed-time()
   let $startIndex := xdmp:random($maxCount + 1)
   let $startUrl := concat($collection, $startIndex)
   let $startDoc := fn:doc($startUrl)
   let $results := fn:doc(my:traceLinks($startDoc/my:doc/my:fileUrl, $maxDepth))
   return (xdmp:elapsed-time() - $startTime)
};

let $warmup := my:timeTraceLinks()
let $timings := for $i in 1 to $iterations return my:timeTraceLinks()
let $avg := fn:avg($timings)
let $min := fn:min($timings)
let $max := fn:max($timings)
return
   <results depth="{$maxDepth}" iterations="{$iterations}">
     <min>{$min}</min>
     <avg>{$avg}</avg>
     <max>{$max}</max>
   </results>

Feb 23
Generating an XML test corpus – Testing compound natural key search

[ Related posts: #1 , #2 , #3 , #4 , #5 , #6 , #7 ]

The previous post looked at searching on a simple, single natural key. However, natural keys are often composed of multiple values. So, in this test, I am using a compound natural key composed of two values.

Referring to the document structure, each document contains the following two elements:

<my:index major="190" minor="700">190700</my:index>
<my:randomIndex major="621" minor="910">621910</my:randomIndex>

In the previous post, I did the search on the element content. In this blog post, the two attributes "major" and "minor" will be used for the search. The "major" and "minor" values are derived from the full index values; the last 3 digits are the minor value, and the digits before that are the major value. So for an index "190700", "700" is the minor value, "190" is the major value.

For this search, I had to set up an attribute index in MarkLogic 7.

So how does this compound natural key compare to the equivalent simple natural key?

From the slopes of the min/average/max curves, the time taken to traverse a document link is, on average, 20 +/- 3 milliseconds. This compares to 26 +/- 7 milliseconds for the simple natural key search, 29 +/- 6 milliseconds for primary key search and 22 +/- 8 milliseconds for direct URL access. Please note that these times only apply to MarkLogic 7 running on my home PC; you should run the tests on your own setup if you want to do the same analysis for your own system.

Now, I can't explain why a compound natural key search would be faster than a simple natural key search. It's not what you would expect. However, what's more important, I would argue, is that there seems to be no penalty in MarkLogic for using this compound natural key rather than the single natural key. That's an important result, because if you consider relational databases, people are often desperate to reduce compound keys to a single column, as they (apparently) don't get sufficient performance out of compound-key queries. For MarkLogic 7 in this case, that's clearly a non-issue, and that's good news because it's better to be able to avoid munging multiple values into a single field just to create a single-value key for search purposes.

Here is the code that was used to make the measurements (the value "$maxDepth" was changed manually for each run, from 0 to 10). It is pretty much identical to the code for simple natural key search; the main difference is the definition of the function "my:retrieveNext".

test-compound-natural-key-search.xquery:[download this file]

xquery version "1.0-ml";
(:
Copyright 2014 Anthony Coates

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
:)

declare namespace sem = "http://marklogic.com/semantics";
declare namespace my = "http://ns.contakt.org/ml7/link";

declare variable $collection := "http://ns.contakt.org/ml7/docs/my/";
declare variable $ontology := "http://rdf.contakt.org/ml7/link#";
declare variable $maxCount := 1000000;
declare variable $maxDepth := 10;
declare variable $iterations := 10;

(: Given a document, retrieve the next one. In this case, searches for a document where the 'my:index' attributes match the 'my:randomIndex' attributes from $doc. :)
(: The two parts of the index are passed as a space separated string. :)
declare function my:retrieveNext($docIndices as xs:string) as xs:string {
   let $indices := fn:tokenize($docIndices, " ")
   let $majorIndex := $indices[1]
   let $minorIndex := $indices[2]
   let $docElement := cts:search(
     fn:collection($collection)/my:doc,
     cts:and-query((
       cts:element-attribute-value-query(
         xs:QName("my:index"),
         xs:QName("major"),
         $majorIndex,
         "exact"
       ),
       cts:element-attribute-value-query(
         xs:QName("my:index"),
         xs:QName("minor"),
         $minorIndex,
         "exact"
       )
     ))
   )
   let $randomIndex := $docElement/my:randomIndex
   return xs:string(fn:concat($randomIndex/@major, " ", $randomIndex/@minor))
};

(: Trace through a list of links recursively. :)
declare function my:traceLinks($docIndices as xs:string, $depth as xs:integer) as xs:string {
   if ($depth = 0)
   then $docIndices
   else my:traceLinks(my:retrieveNext($docIndices), $depth - 1)
};

declare function my:timeTraceLinks() as xs:dayTimeDuration {
   let $startTime := xdmp:elapsed-time()
   let $startIndex := xdmp:random($maxCount + 1)
   let $startUrl := concat($collection, $startIndex)
   let $startDoc := fn:doc($startUrl)
   let $startIndexElem := $startDoc/my:doc/my:index
   let $startIndices := fn:concat($startIndexElem/@major, " ", $startIndexElem/@minor)
   let $resultIndices := my:traceLinks($startIndices, $maxDepth)
   let $indices := fn:tokenize($resultIndices, " ")
   let $majorIndex := $indices[1]
   let $minorIndex := $indices[2]
   let $result := cts:search(
     fn:collection($collection),
     cts:and-query((
       cts:element-attribute-value-query(
         xs:QName("my:index"),
         xs:QName("major"),
         $majorIndex,
         "exact"
       ),
       cts:element-attribute-value-query(
         xs:QName("my:index"),
         xs:QName("minor"),
         $minorIndex,
         "exact"
       )
     ))
   )
   return (xdmp:elapsed-time() - $startTime)
};

let $warmup := my:timeTraceLinks()
let $timings := for $i in 1 to $iterations return my:timeTraceLinks()
let $avg := fn:avg($timings)
let $min := fn:min($timings)
let $max := fn:max($timings)
return
   <results depth="{$maxDepth}" iterations="{$iterations}">
     <min>{$min}</min>
     <avg>{$avg}</avg>
     <max>{$max}</max>
   </results>
