Notes on the Processing of Large XBRL Instances 1.0

Working Group Note 31 October 2012

Copyright © 2012 XBRL International Inc., All Rights Reserved.

This version:
<http://www.xbrl.org/WGN/large-instance-processing/WGN-2012-10-31/large-instance-processing-WGN-2012-10-31.html>
Editor:
Mark Goodhand, CoreFiling <mrg@corefiling.com>
Contributor:
Paul Warren, CoreFiling <pdw@corefiling.com>

Status

Circulation of this Working Group Note is unrestricted. Other documents may supersede this document. Recipients are invited to submit comments to spec@xbrl.org, and to submit notification of any relevant patent rights of which they are aware and provide supporting documentation.

Abstract

This note considers the challenges of processing very large XBRL instances, and outlines a backwards-compatible mechanism for facilitating efficient stream-based processing of such instances.

Table of Contents

1 Introduction
2 XBRL's use of XML
3 Processing approaches
3.1 In-memory models
3.2 Stream-based processing
3.2.1 Validation state
3.3 Persistent data stores
3.3.1 Forms of persistence
3.3.1.1 XBRL awareness on read
3.3.1.2 XBRL awareness on write
3.3.2 Performance considerations
4 XBRL Formula and XPath
5 Supporting efficient stream-based processing
5.1 A syntax for declaring context and unit constraints
5.1.1 Backwards compatibility
5.1.2 Handling footnotes

Appendices

A References
B Intellectual property status (non-normative)
C Acknowledgements (non-normative)
D Document history
E Errata corrections in this document


1 Introduction

Although XBRL [XBRL 2.1] is a general-purpose language for reporting dimensionally-qualified business data, initial adoption focused on financial reports, which tend to be relatively small — tens of contexts and hundreds of facts, amounting to hundreds of kilobytes of data.

More recently, XBRL has been applied to the interchange of large, highly dimensional datasets. This has resulted in XBRL instances that contain hundreds of thousands of contexts and facts, reaching hundreds of megabytes or even gigabytes in size.

These large instances have exposed the performance limitations of existing XBRL tools and prompted discussions about the performance implications of certain XBRL 2.1 design decisions.

2 XBRL's use of XML

One of the earliest design decisions for XBRL 2.1 was the use of XML [XML] as a base format. Compared with binary formats, XML has significant benefits for human readability. Compared with other text formats, such as CSV, it offers a standardised, self-describing syntax, well-defined support for character encoding, a powerful validation mechanism (in the form of XML Schema [XML Schema Structures]), and a range of mature processing tools based on auxiliary specifications (such as XSLT and XQuery).

These advantages made XML a popular choice for the interchange of business data, and a natural starting point for the XBRL standard. Although XML is inherently verbose compared with other representations, it compresses well, and advances in computing power and memory capacity have allowed ever-larger data sets to be handled.

3 Processing approaches

This section considers some alternative approaches for processing XBRL instances.

3.1 In-memory models

One approach to processing XBRL instances is to parse the entire document into an in-memory representation. Such models support fast random access to any fact, unit, or context in the instance.

The most common document-based representation is the Document Object Model, which typically requires memory in the order of 5-10 times the size of the document. The actual requirements will vary according to the implementation and the processing options (e.g. to defer node expansion, 'intern' string values, or include PSVI annotations).

There are a range of alternative models, some based on W3C standards (such as the XPath Data Model), and others developed independently (such as Python's ElementTree API).
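To make the memory profile concrete, the sketch below (in Python, using the ElementTree API mentioned above) loads an entire instance and builds indexes over facts and contexts. The file name 'instance.xml' is a placeholder, and tuple content is ignored for brevity.

import xml.etree.ElementTree as ET

XBRLI = '{http://www.xbrl.org/2003/instance}'

# Parse the whole document up front; memory use grows with document size.
tree = ET.parse('instance.xml')
root = tree.getroot()

# Index contexts by id, and facts by the context they reference.
contexts = {c.get('id'): c for c in root.iter(XBRLI + 'context')}
facts_by_context = {}
for elem in root:  # top-level facts only
    ref = elem.get('contextRef')
    if ref is not None:
        facts_by_context.setdefault(ref, []).append(elem)

# Random access is now cheap, e.g. all facts for context 'c1':
for fact in facts_by_context.get('c1', []):
    print(fact.tag, fact.text)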

Some models are heavier than others, and the lack of a standard XBRL infoset means that models may retain more XML baggage than is strictly necessary. The need for a comprehensive XML model is reinforced by support for unrestricted XPath 2 expressions in XBRL Formula, which require syntax-level details that could otherwise be discarded (see Section 4).

XBRL International's Abstract Modelling Task Force and API Signature Task Force are currently working to define which information is semantically significant from an XBRL perspective. However, even a trimmed-down, XBRL-specific model will require space proportional to the size of the instance document.

3.2 Stream-based processing

A fundamentally different approach is to process instance documents as they arrive over the wire, without at any point assembling a model of the entire document.

Instead of exposing a comprehensive model, the processor serves up a series of events (either in push style, as with SAX, or pull style, as with StAX). To ensure that the XML is well formed, such processors must maintain a list of unclosed tags (ancestors of the current element). An XBRL processor additionally needs to keep track of inherited attributes such as @xml:base and @xml:lang, the set of in-scope namespace bindings, and whatever context is required to support schema validation.
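As a minimal illustration of this state-keeping, the following SAX handler sketch (Python; 'instance.xml' is again a placeholder) tracks the stack of unclosed tags and the inherited xml:lang value. Handling of @xml:base, namespace bindings, and schema-validation context would follow the same stack-based pattern.

import xml.sax

class StreamingHandler(xml.sax.ContentHandler):
    # Sketch of the state a stream-based processor must carry as events arrive.
    def __init__(self):
        super().__init__()
        self.ancestors = []   # unclosed tags (ancestors of the current element)
        self.lang_stack = []  # inherited xml:lang values

    def startElement(self, name, attrs):
        self.ancestors.append(name)
        # xml:lang is inherited: push the new value, or carry the old one forward.
        inherited = self.lang_stack[-1] if self.lang_stack else None
        self.lang_stack.append(attrs.get('xml:lang', inherited))

    def endElement(self, name):
        self.ancestors.pop()
        self.lang_stack.pop()

xml.sax.parse('instance.xml', StreamingHandler())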

Stream-based processors tend to require memory proportional to the depth of the XML tree, rather than the size of the document as a whole, so the flat nature of XBRL instances would seem to make them well suited to streamed processing. Unfortunately, the dimensional data in XBRL instances is stored in normalised form, with contexts, units, and footnotes interspersed with the facts they relate to. While this arrangement may have seemed flexible and efficient for small documents with several facts per context, it significantly undermines stream-based processing, especially for large documents with a large number of contexts.

Whereas a more hierarchical structure would naturally cluster facts by unit and context, XBRL places no constraints on the ordering or scope of these constructs. It does not require contexts and units to appear before the facts that use them, and it provides no means of specifying when all the facts for a given context or unit have been discovered.

An XBRL processor cannot generate an event for a fact until it has encountered the context and unit that the fact refers to. Equally, the processor cannot forget a context or unit until it is sure no further facts will refer to it, and in the absence of external hints, this means retaining it until the end of the document.
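The sketch below (Python; units would be handled analogously but are omitted, and 'instance.xml' is a placeholder) illustrates the resulting buffering: facts that arrive before their context must be held back, and every context must be retained until the end of the document.

import xml.etree.ElementTree as ET

XBRLI = '{http://www.xbrl.org/2003/instance}'

def emit(fact, context):
    # Placeholder for downstream handling of a completed fact event.
    print(fact.tag, context.get('id'))

contexts = {}  # every context seen so far; none can be discarded before document end
pending = {}   # facts whose context has not yet been encountered

for _, elem in ET.iterparse('instance.xml'):
    if elem.tag == XBRLI + 'context':
        contexts[elem.get('id')] = elem
        for fact in pending.pop(elem.get('id'), ()):
            emit(fact, elem)
    elif elem.get('contextRef') is not None:
        ref = elem.get('contextRef')
        if ref in contexts:
            emit(elem, contexts[ref])
        else:
            pending.setdefault(ref, []).append(elem)  # must be buffered in memory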

The situation is further complicated by footnotes, which are represented by XLink extended links, arcs, and resources [XLINK]. If the processor wishes to include footnote information in a fact event, it must wait until all <footnoteLink> elements have been processed, and again this means waiting until the end of the document.

Without additional constraints on the structure of an XBRL instance, single-pass, stream-based processing is practically impossible. A possible solution to this problem is provided in Section 5.

3.2.1 Validation state

While stream-based processing is suitable for many tasks, some operations require processors to keep additional state. A common example is semantic ('business rules') validation, which comes in various forms.

Simple value constraints ("TotalAssets MUST be non-negative") and non-occurrence constraints ("Sales of Product A MUST NOT be reported for Region X") do not require any state at all - an error can be raised as soon as an offending fact is encountered.

Occurrence constraints ("TotalAssets MUST be reported", "SalesCosts MUST be reported in both periods") and co-occurrence constraints ("CurrentAssets MUST be reported if TotalAssets is reported") require state to be held until the rule is satisfied or the end of the document is reached (at minimum, a boolean variable representing whether each fact has been encountered). However, it is common for such rules to apply on a context-by-context basis ("TotalAssets MUST be reported for every context where CurrentAssets is reported"), in which case the validator must keep occurrence markers for every context associated with the relevant facts.
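A sketch of the state required for the last of these rules, assuming a stream of simplified (concept, contextRef) fact events with hypothetical concept names:

# State for: "TotalAssets MUST be reported for every context where
# CurrentAssets is reported". One entry per relevant context only.
current_assets_contexts = set()
total_assets_contexts = set()

def on_fact(concept, context_ref):
    if concept == 'CurrentAssets':
        current_assets_contexts.add(context_ref)
    elif concept == 'TotalAssets':
        total_assets_contexts.add(context_ref)

def on_end_of_document():
    # The rule can only be evaluated conclusively once the whole stream is read.
    for ref in current_assets_contexts - total_assets_contexts:
        print('Error: TotalAssets missing for context', ref)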

Value co-constraints ("TotalAssets MUST equal CurrentAssets + NonCurrentAssets") obviously require a validator to store more information about the facts, and again these will typically apply on a context-by-context basis.

It is important to note that the amount of state that must be maintained is a function of the rules, rather than the size of the instance document. A regulator may require vast amounts of information to be reported, but only want to initially enforce certain basic checks relating to key facts. Even in the case of rules that apply for each context, it may be possible to ignore large portions of the instance document.

The size of the instance document can have a bearing on how long the state must be kept. For example, if we encounter items involved in a summation-item check, we must retain information about those items until we are sure we have encountered the summation item and all of the contributing items (bearing in mind that we may encounter a duplicate at any point). Accordingly, instance consumers may wish to receive a set of smaller instances, clustered according to validation groups, rather than a single large document. However, this may have implications for subsequent processing if the instance boundaries are considered semantically significant.

Note too that whereas processors can easily determine which facts are involved in summation-item relationships, validation rules expressed in other formats (such as XBRL Formula or Java code) may be difficult or impossible to analyse, rendering stream-based instance document filtering impractical without supplementary processing hints to specify which facts are subject to validation.

Finally, it is worth noting that certain rules ("Facts MUST NOT be duplicated", "Duplicate facts MUST be consistent") effectively require a complete model of the instance document, undermining the benefits of stream-based processing. Where such checks are required, instance consumers will have to consider the resource and performance trade-offs between in-memory and persisted models. A possible strategy is to perform certain basic checks in a synchronous fashion (rejecting submissions that fail) while deferring more expensive validation and analysis, perhaps generating reports that can be viewed by filers or analysts at a later point.

3.3 Persistent data stores

It is worth noting a third approach, which may be used independently or in combination with those outlined above. In this model, an instance is written to persistent storage, and subsequently read as needed to perform validation and other onward processing.

3.3.1 Forms of persistence

The instance document may be stored faithfully on disk, placed in an XML database, converted to a relational or 'NoSQL' representation, or transformed into a custom binary format. The time required to store the instance, and to subsequently read from it, will vary according to the specific approach taken. Like in-memory models, persisted models may vary in the degree of syntax-level information they retain.

3.3.1.1 XBRL awareness on read

If the instance is persisted to disk verbatim, as a stream of bytes, it may subsequently be parsed in multiple passes to extract a subset of the information it contains. For example, a processor could select a particular set of facts by name, ignoring all others. It could then scan the document for only those units, contexts, and footnotes that are relevant to the facts under consideration. These separate queries could be handled using a stream-based XML processor, avoiding the above-noted challenges with stream-based XBRL processing. With this approach, XBRL awareness is applied in the reading phase rather than the writing phase. Memory requirements are kept low at the expense of repeated XML parsing, which could be significant for large instances, even if an efficient XML processor is used.
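A sketch of this multi-pass approach (Python; the concept name and file name are hypothetical, and the elem.clear() bookkeeping is simplified):

import xml.etree.ElementTree as ET

XBRLI = '{http://www.xbrl.org/2003/instance}'
WANTED = {'{http://example.com/taxonomy}TotalAssets'}  # hypothetical concepts

# Pass 1: a streaming scan that keeps only the selected facts.
facts, wanted_contexts = [], set()
for _, elem in ET.iterparse('instance.xml'):
    if elem.tag in WANTED:
        facts.append((elem.tag, elem.get('contextRef'), elem.text))
        wanted_contexts.add(elem.get('contextRef'))
    elem.clear()  # discard parsed content to keep memory roughly flat

# Pass 2: a second scan that retains only the contexts those facts use.
contexts = {}
for _, elem in ET.iterparse('instance.xml'):
    if elem.tag == XBRLI + 'context':
        if elem.get('id') in wanted_contexts:
            contexts[elem.get('id')] = elem  # keep intact
        else:
            elem.clear()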

The cost of repeated XML parsing may be avoided by using an XML database, though the choice of implementation and indexing strategy can significantly affect the cost of initial storage and subsequent retrieval.

Alternatively, a simple stream-based XML processor may be used to shred XBRL into a non-XML model that closely mirrors the syntax of an XBRL instance - separate tables for facts, units, contexts, and footnotes, with referential integrity checks absent or deferred until the entire document has been processed. Here again, processes reading from the data store require a certain amount of knowledge about XBRL syntax details - they operate on a physical rather than a logical model.
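For example, a shredding process might populate a store along the following lines (a sketch using SQLite; the table and column names are illustrative only), with the referential integrity check deferred to a query run after loading:

import sqlite3

con = sqlite3.connect('instance.db')
con.executescript("""
    CREATE TABLE contexts (id TEXT PRIMARY KEY, xml TEXT);
    CREATE TABLE units    (id TEXT PRIMARY KEY, xml TEXT);
    -- No foreign keys: referential integrity is checked after loading.
    CREATE TABLE facts (concept TEXT, context_ref TEXT, unit_ref TEXT, value TEXT);
""")

# Deferred integrity check, run once the entire document has been shredded:
dangling = con.execute("""
    SELECT DISTINCT context_ref FROM facts
    WHERE context_ref NOT IN (SELECT id FROM contexts)
""").fetchall()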

3.3.1.2 XBRL awareness on write

Rather than building awareness of XBRL syntax into the processes that read from a persistent data store, processing systems may transform instances into another format that is easier to analyse subsequently. This is effectively a hybrid approach, with the initial analysis handled by an XBRL processor employing an in-memory model (Section 3.1) or a stream-based approach (Section 3.2).

Even if an in-memory model is used for the initial processing, a persistent data store can reduce the overall memory requirements for a consuming system by minimising the time that a given instance is held in main memory. While one process sequentially transforms instances into the data store, other processes (perhaps with no understanding of XBRL syntax) can validate and analyse all of the instances that have previously been transformed.

3.3.2 Performance considerations

Rather than requiring memory proportional to the instance as a whole, processes reading from persistent data stores must keep in memory only the data required to support a given processing task. This memory saving comes at the cost of decreased responsiveness, as data is written to and then repeatedly drawn from sources that are slower than main memory. Caching, indexing, and query optimisation may reduce the amount of data that needs to be read from persistent storage, and advances in hardware (such as solid state disks) may reduce the cost of each read, but such systems will struggle to match the performance of well-written queries against a full in-memory model.

4 XBRL Formula and XPath

XPath is a powerful language for navigating and analysing XML documents. Because XBRL builds on XML Schema, it was natural to use XPath 2 for the expression language at the core of XBRL Formula. This simplified the drafting of the specification, as it avoided the need to precisely specify a wide range of operators and functions for XML Schema data types - work that had already been done by the W3C's XPath working group. It also allowed implementors of XBRL Formula software to take advantage of mature third-party XPath libraries, reducing the burden of development, testing, and optimisation.

By contrast, XPath's document navigation features are of little use. This is partly because XBRL instances tend to be flat, but primarily because XBRL Formula uses a dimensional model, which clusters facts by context. The formula processor iterates through dimensional space, binding appropriate values to variables used in assertions.

A typical 'value assertion' is shown below:

<valueAssertion test="$v:Assets = $v:Liabilities + $v:Equities"/>

Note that the expression above does not access the context node, and has no connection with the syntax of XBRL instance documents.
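To illustrate the point, such an assertion can be evaluated against a purely semantic model of the instance, with no XML data model involved. The sketch below uses hypothetical fact values grouped by context:

# Facts grouped by context: all the information the assertion needs.
facts_by_context = {
    'c1': {'Assets': 500, 'Liabilities': 300, 'Equities': 200},
    'c2': {'Assets': 350, 'Liabilities': 250, 'Equities': 100},
}

for ref, facts in facts_by_context.items():
    if facts['Assets'] != facts['Liabilities'] + facts['Equities']:
        print('Assertion failed for context', ref)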

However, as mentioned in Section 3.1, XBRL Formula allows arbitrary XPath expressions, which are free to explore every XML element in the instance and to load and process any other XML documents (such as rate tables). There may be occasions where this is useful, but it is usually unnecessary, and this flexibility comes at a cost:

  1. Rules become more difficult for humans to understand in their raw form.
  2. Formula editing applications may struggle to present complex rules in a user-friendly format.
  3. Rules can depend on irrelevant syntax details, rather than focusing on semantics.
  4. Processors must maintain and query a comprehensive XML model of the instance document.

This final item is particularly important in the case of large instances. Support for arbitrary XPath expressions requires a full XPath data model. Such models may be backed by an XML database, by DOM, or by a lighter-weight implementation such as Saxon's TinyTree, but because they are comprehensive and general-purpose, these models will struggle to match the performance of models that focus purely on XBRL semantics.

With these concerns in mind, the Formula Working Group is investigating a standard approach to restricting the use of arbitrary XPath in formula linkbases, with a view to being able to map XPath onto other, simpler underlying technologies, including alternatives to an XML data model.

5 Supporting efficient stream-based processing

Although XBRL instances can be analysed effectively through in-memory and persisted models, many circumstances call for efficient stream-based XBRL processing. Unfortunately, for reasons discussed in Section 3.2, the syntax of XBRL 2.1 renders this practically impossible: because facts, contexts, and units can be interspersed, the processor may need to store significant portions of the document in memory as it assembles fact events. While it is possible to imagine alternative syntaxes for conveying the data in XBRL instances, it would take a significant investment of time and resources to achieve standardisation, and this effort would risk confusion in the marketplace, potentially undermining the XBRL standard.

5.1 A syntax for declaring context and unit constraints

The amount that a processor needs to store in memory could be drastically reduced if it could rely on contexts and units appearing before the facts that reference them, and if it could know when it had seen all of the facts that reference them.

This could be achieved by following two simple rules:

  1. Contexts and units must appear before the facts that reference them.
  2. Facts may only reference the most recently declared context and unit.

This would give rise to instance documents that look something like this:

<unit id="u1"/>
<context id="c1"/>
<fact1 contextRef="c1" unitRef="u1"/>
<fact2 contextRef="c1" unitRef="u1"/>
<fact3 contextRef="c1" unitRef="u1"/>
<context id="c2"/>
<fact1 contextRef="c2" unitRef="u1"/>
<fact2 contextRef="c2" unitRef="u1"/>
<fact3 contextRef="c2" unitRef="u1"/>

The benefit to a consuming processor is that it only needs to hold one context and one unit in memory at any given time - as soon as it encounters another context or unit declaration, it can forget about the previous one.
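Under these rules, a consuming processor reduces to a simple single-pass loop. The sketch below (Python; 'instance.xml' is a placeholder) keeps exactly one context and one unit at any time:

import xml.etree.ElementTree as ET

XBRLI = '{http://www.xbrl.org/2003/instance}'
current_context = None  # the only context retained at any time
current_unit = None     # likewise for units

for _, elem in ET.iterparse('instance.xml'):
    if elem.tag == XBRLI + 'context':
        current_context = elem  # the previous context can now be forgotten
    elif elem.tag == XBRLI + 'unit':
        current_unit = elem     # the previous unit can now be forgotten
    elif elem.get('contextRef') is not None:
        # Under the proposed rules, the reference must match the current context.
        assert elem.get('contextRef') == current_context.get('id')
        print(elem.tag, elem.text, current_context.get('id'))
        elem.clear()  # facts need not be retained either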

As it stands, this approach has a significant flaw. Where facts are reported against multiple contexts and multiple units, it results in extensive duplication of unit or context declarations, leading to an instance document that is unnecessarily large. While real-world instance documents often have a large number of contexts (due to the use of dimensions), it is rare to see more than a handful of units. Accordingly, the benefit of applying the second constraint above to units is limited.

To address this, we can allow the instance author to specify one of three serialisation conventions for each of units and contexts. The three options are:

  • None - no constraint on ordering, as per standard XBRL v2.1.
  • Pre-declare - units/contexts must be declared before they are used.
  • Immediate pre-declare - the referenced unit/context must be the most recent declaration.

An instance document would declare which serialisation conventions it adhered to by including two additional attributes, e.g.:

  • contextSerialisationConvention="none|predeclare|immediate"
  • unitSerialisationConvention="none|predeclare|immediate"

These could take the form of either custom attributes on the <xbrli:xbrl> element or, perhaps more appropriately, a processing instruction.
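For example, a declaration via a processing instruction might look as follows (the target and pseudo-attribute names are illustrative only):

<?xbrl-streamable contextSerialisationConvention="immediate" unitSerialisationConvention="predeclare"?>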

The combination of contextSerialisationConvention="immediate" and unitSerialisationConvention="predeclare" is the most likely to be useful for typical documents.

A related but more general solution would indicate exactly how many contexts or units need to be kept in memory at any given time, with an assumption that all are predeclared, e.g.

  • contextBuffer="1..INF"
  • unitBuffer="1..INF"

Such flexibility may be useful for tuples that contain facts with a mix of contexts.
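For example, a tuple whose children reference two different contexts (hypothetical element names below) could still be streamed efficiently if the instance declared contextBuffer="2", telling the consumer that it never needs to retain more than two context declarations at once:

<context id="c1"/>
<context id="c2"/>
<my:positionReport>
  <my:openingBalance contextRef="c1">100</my:openingBalance>
  <my:closingBalance contextRef="c2">150</my:closingBalance>
</my:positionReport>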

5.1.1 Backwards compatibility

It should be noted that the approach described here is completely backwards compatible. Documents conforming to this proposal would be completely valid XBRL v2.1, and could be consumed by any XBRL v2.1 processor (provided that it could cope with the document size).

Where a reporting regime is likely to encounter large documents, it would be open to receivers to specify a minimum level of "streamability". For example, they could insist that units are at least "pre-declared" and that contexts are "immediately pre-declared".

5.1.2 Handling footnotes

The solution outlined above covers only units and contexts. A similar issue exists for footnotes, but with some differences:

  • Each fact has exactly one context and at most one unit, but it may have several footnotes.
  • Whereas facts refer to contexts and units, footnotes refer to facts.
  • Whereas @contextRef and @unitRef are straightforward ID references, footnotes are defined using XLink.
  • While a fact-footnote relationship associates a fact with a resource, additional arcroles have been proposed that associate facts with other facts.

Despite these and other differences, we believe a similar solution can be devised.

Appendix A References

XBRL 2.1
XBRL International Inc. "Extensible Business Reporting Language (XBRL) 2.1 Includes Corrected Errata Up To 2008-07-02". Phillip Engel, Walter Hamscher, Geoff Shuetrim, David vun Kannon, and Hugh Wallis.
(See http://www.xbrl.org/Specification/XBRL-RECOMMENDATION-2003-12-31+Corrected-Errata-2008-07-02.htm)
XLINK
W3C (World Wide Web Consortium). "XML Linking Language (XLink) Version 1.0". Steve DeRose, Eve Maler, and David Orchard.
(See http://www.w3.org/TR/xlink/)
XML
W3C (World Wide Web Consortium). "Extensible Markup Language (XML) 1.0 (Fifth Edition)". Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, and François Yergeau.
(See http://www.w3.org/TR/REC-xml/)
XML Schema Structures
W3C (World Wide Web Consortium). "XML Schema Part 1: Structures Second Edition". Henry S. Thompson, David Beech, Murray Maloney, and Noah Mendelsohn.
(See http://www.w3.org/TR/xmlschema-1/)

Appendix B Intellectual property status (non-normative)

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to XBRL International or XBRL organizations, except as required to translate it into languages other than English. Members of XBRL International agree to grant certain licenses under the XBRL International Intellectual Property Policy (www.xbrl.org/legal).

This document and the information contained herein is provided on an "AS IS" basis and XBRL INTERNATIONAL DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

The attention of users of this document is directed to the possibility that compliance with or adoption of XBRL International specifications may require use of an invention covered by patent rights. XBRL International shall not be responsible for identifying patents for which a license may be required by any XBRL International specification, or for conducting legal inquiries into the legal validity or scope of those patents that are brought to its attention. XBRL International specifications are prospective and advisory only. Prospective users are responsible for protecting themselves against liability for infringement of patents. XBRL International takes no position regarding the validity or scope of any intellectual property or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; neither does it represent that it has made any effort to identify any such rights. Members of XBRL International agree to grant certain licenses under the XBRL International Intellectual Property Policy (www.xbrl.org/legal).

Appendix C Acknowledgements (non-normative)

This document benefited from review and feedback from a number of people both within and outside the Base Specification and Maintenance Working Group. We are grateful, in particular, to the following people for their input:

Appendix D Document history

Date               Author         Details

12 September 2012  Mark Goodhand  Initial draft.

29 October 2012    Mark Goodhand  Addressed feedback on the initial draft. Added a section on
                                  XBRL Formula and XPath and another on the relationship between
                                  validation and stream-based processing.

30 October 2012    Mark Goodhand  Final editorial changes as agreed on the 2012-10-29 spec call.

Appendix E Errata corrections in this document

This appendix contains a list of the errata that have been incorporated into this document. This represents all those errata corrections that have been approved by the XBRL International Base Specification and Maintenance Working Group up to and including 31 October 2012. Hyperlinks to relevant e-mail threads may only be followed by those who have access to the relevant mailing lists. Access to internal XBRL mailing lists is restricted to members of XBRL International Inc.

No errata have been incorporated into this document.