lethe.datastore – Data storage and representation

The current version of the site managed by Lethe is a set of nodes. The data is stored in the format described in this section, with the history being provided by the DVCS. All persistent data is accessed via the DVCS, with the local filesystem being accessed directly only for the index and cache. (Direct filesystem access to the working copy would not work well with branching and concurrent modification.)

Todo

write what index and cache are

Todo

some of this should be moved to user or design documentation, while the rest is for development only

A node is what other wikis call a ‘document’, ‘page’ or ‘article’. A different name is used, since these suggest bigger or more professional objects than what Lethe is designed for. A node can represent an article, a single note, a comment or a tagged URI. (The use of the term ‘node’ comes from Texinfo.)

The API

Data storage.

class lethe.datastore.Store(dvcs, index_path, cache_path)

Data store for a single site.

Constructor arguments:

Parameters:
  • dvcs – lethe.dvcs.Repository instance storing the data of the site
  • index_path – path to the index file or None to use no index
  • cache_path – path to the cache directory

Index and cache must be unique to the site. Both can be regenerated from the data in the DVCS, but they should not be removed, since rebuilding them costs performance.

If the index is specified, it is available via the index attribute and updated on each call to save_nodes.

Todo

redesign the API so old versions can be accessed

cache_path = None

Path to the cache directory.

dvcs = None

The lethe.dvcs.Repository instance storing the data.

get_node(uuid)

Return a node of the specified uuid.

Parameters: uuid – the UUID of the node, as a hexadecimal string
Raises: NodeNotFoundError if the site does not have such a node

get_node_raw(uuid)

Like get_node, but not using the index.

Don’t use this method outside the index code, it is slow.

get_node_unloaded(uuid)

Return a node without loading its data.

Won’t fail for nonexistent nodes.

Use it only if you implement custom loading of node data.

index = None

lethe.index.Index for this store or None if not specified.

iter_nodes()

Iterate all nodes in the store.

Only indexed nodes are found.

iter_nodes_raw()

Iterate all nodes in the store, without using the index.

Don’t use this method outside the index code; it is slow. It is designed only for use in index synchronization code, so that iter_nodes and get_node can then return the same data.

load_props()

Load site properties from the DVCS.

make_node()

Return a new node instance that doesn’t refer to any existing node.

props = None

Site properties as a lethe.props.Properties object.

save_nodes(nodes, description, removed_nodes=(), author=None, date=None)

Commit modified node objects and site properties.

Parameters:
  • nodes – a node object or an iterable of nodes
  • description – commit description
  • removed_nodes – an iterable of node objects to remove
  • author – commit author, leave None for DVCS default
  • date – commit date or None for DVCS default
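Since nodes may be either a single node object or an iterable of nodes, an implementation has to normalize the argument before committing. The following is only a sketch of one way to do that; the Node class here is a stand-in for illustration, and as_node_list is a hypothetical helper, not part of the Lethe API.

```python
class Node:
    """Stand-in for lethe.datastore.Node, for illustration only."""

    def __init__(self, uuid):
        self.uuid = uuid


def as_node_list(nodes):
    """Return a list of nodes from a single node or an iterable of nodes.

    Hypothetical helper: it is an assumption that save_nodes
    distinguishes the two cases with a type check like this one.
    """
    if isinstance(nodes, Node):
        return [nodes]
    return list(nodes)


single = as_node_list(Node("a" * 32))
several = as_node_list(Node(c * 32) for c in "bc")
```
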

Todo

think how to support branches; needs storing base revisions in node objects?

class lethe.datastore.Node(store, uuid)

Representation of a single node.

Don’t construct this object directly; use the Store.get_node and Store.make_node methods instead.

Nodes support access to their data, loading from the DVCS, serialization for committing to the DVCS and extracting words for full text search indexing.

Todo

lazily load node text when needed

binary

Node binary attachment as a byte string or None if there is no binary for this node.

binary_meta

Return a lethe.binmeta.MetadataExtractor instance for the node binary attachment.

canonical_id

Return the identifier used in canonical node URI.

This is the slug or UUID.

extract_binary_words()

Iterate words from the node binary.

extract_text_words()

Iterate words from the node text.

forget_binary()

Make next access to the node binary get it from the DVCS.

Call it if you set node data from a source other than the DVCS and do not load the binary attachment data.

formatted_text

Return a lethe.ext.format.Format instance for the node text.

Raises: KeyError if there is no extension handling the declared content type

has_binary

True if the node has a binary. This works without loading the possibly big binary attachment into memory.

The value set to this property is used if the binary is not loaded. Otherwise, the DVCS is queried for the file.

has_image

True if the node binary attachment has a type that can be used in the img HTML tag.

last_changed

datetime.datetime representing the time of the last committed change of this node.

Returns: the newest revision date or None if unknown

load()

Get node data from the DVCS.

Results are undefined if this node object already has data.

Raises: NodeNotFoundError if there is no such node in the DVCS

props = None

Node properties as a lethe.props.Properties object.

serialize()

Return node data as a dictionary mapping DVCS file name to content string.

tags = None

Node tags as a set of strings.

text = None

Node text as a string.

uuid = None

UUID of the node as a string.

exception lethe.datastore.NodeNotFoundError

Exception raised when loading or getting a node that does not exist or was deleted.

On-disk format

The data format is optimized for diffability (important when merging or reviewing changes in a DVCS). All files use simple text formats (binary node attachments and possibly complex node text files are exceptions to this).

Since a node has no unique and constant data associated with it that could be used in its file names, it is identified by a UUID. Other wikis use titles or slugs, but both are optional in Lethe and titles are not required to be unique. Simple solutions like sequential numbers or words from the original title or text would lead to unrelated nodes getting the same paths, resulting in merge conflicts.
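The path collision problem can be illustrated with a small sketch. Two branches start from the same set of nodes and each adds one node; next_sequential_id is a hypothetical allocator used only for comparison and is not something Lethe provides.

```python
import uuid


def next_sequential_id(existing):
    """Hypothetical allocator: the next number after the existing nodes."""
    return str(len(existing) + 1)


# Both branches see the same base revision with two nodes.
base = {"1", "2"}

# Each branch allocates independently and picks the same identifier.
branch_a = next_sequential_id(base)
branch_b = next_sequential_id(base)
sequential_conflict = branch_a == branch_b

# Random UUIDs are generated without any shared state, so two
# independently created nodes get different paths in practice.
uuid_a = uuid.uuid4().hex
uuid_b = uuid.uuid4().hex
uuid_conflict = uuid_a == uuid_b
```

With sequential numbers, merging the two branches would place two unrelated nodes at the same path; with random UUIDs a collision is possible in theory but vanishingly unlikely.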

Top-level files

The props file is the site property file. Site properties are prepended to node properties when loading a node. They do not work as default values, since in some cases multiple values might be useful for the same property. (For example, specifying the license in site properties and in a node will list both licenses for the node.)
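The prepending behaviour can be sketched as follows. This assumes that properties are held as an ordered list of (name, value) pairs; the actual representation used by lethe.props.Properties is not described in this section, and merged_props and values_of are hypothetical helpers.

```python
def merged_props(site_props, node_props):
    """Return site properties followed by node properties.

    Both arguments are lists of (name, value) pairs (an assumed
    representation). Nothing is overridden: a name present in both
    lists simply ends up with several values.
    """
    return list(site_props) + list(node_props)


def values_of(props, name):
    """Collect every value recorded for a property name, in order."""
    return [value for key, value in props if key == name]


site = [("license", "CC-BY-SA 3.0")]
node = [("title", "Example"), ("license", "GPLv3")]

combined = merged_props(site, node)
licenses = values_of(combined, "license")  # site value first, then node value
```
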

Todo

support overriding site properties in nodes

These two directories are used for the Web interface and HTML output generation:

  • templates: Jinja2 template overrides
  • static: static files, they work as if they were in the HTML site root. This directory is designed for files like robots.txt or favicon.ico, not e.g. images that should be stored as nodes.

There are two additional entries that should not be stored in the DVCS: the index.sqlite file and the cache directory.

Todo

links to sections describing these future files

Node files

Each node is stored in a subdirectory named by its UUID as 32 hexadecimal characters with lowercase letters. Random UUIDs as described by RFC 4122 and implemented by the Python uuid module are used.
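A directory name of this form can be produced with the standard uuid module, as in the sketch below. The helper names and the regular expression are illustrative assumptions, not taken from the Lethe source.

```python
import re
import uuid

# 32 hexadecimal characters, lowercase letters only.
NODE_DIR_RE = re.compile(r"[0-9a-f]{32}")


def new_node_dirname():
    """Return a directory name for a new node from a random (version 4) UUID."""
    return uuid.uuid4().hex


def is_node_dirname(name):
    """Check whether a name has the form used for node subdirectories."""
    return NODE_DIR_RE.fullmatch(name) is not None


name = new_node_dirname()
```

Note that the hex attribute omits the dashes of the usual textual UUID form, so str(uuid.uuid4()) would not be a valid directory name under this check.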

The following files are included in each node subdirectory:

  • text: node text, currently plain text

    Todo

    in future formatted text with property specifying the format; link to the chapter

  • binary: node attachment or image. If both are included, text describes the binary. Use other nodes linking to this one to use it as an attachment or image as in other wikis.

  • props: node property file

  • tags: a sorted list of node tags (any single-line strings), one per line

A minimal valid node contains just an empty props file. Other files missing are interpreted as empty text, empty set of tags and no binary attachment.
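The defaults for missing files and the tags format can be sketched as below. The sketch works on a dictionary mapping file names within the node subdirectory to their contents; this input shape, the trailing newline after each tag, and both function names are assumptions made for illustration.

```python
def parse_node_files(files):
    """Interpret the files of one node, applying defaults for missing ones.

    files maps a file name ("text", "tags", "binary", "props") to its
    content. Returns a (text, tags, binary) tuple; props parsing is
    left out because its format is not described here.
    """
    text = files.get("text", "")                    # missing: empty text
    tags = set(files.get("tags", "").splitlines())  # missing: no tags
    binary = files.get("binary")                    # missing: None
    return text, tags, binary


def format_tags(tags):
    """Write tags sorted, one per line, so that diffs stay small."""
    return "".join(tag + "\n" for tag in sorted(tags))


# A minimal valid node: just an empty props file.
minimal = parse_node_files({"props": ""})

tags_file = format_tags({"wiki", "dvcs", "notes"})
round_trip = parse_node_files({"props": "", "tags": tags_file})[1]
```

Keeping the tags sorted means that adding or removing one tag changes exactly one line of the file, whatever order the tags were entered in.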

Why isn’t a standard extensible format like JSON used to store a node instead of such ad-hoc text files? The formats of the tags and property files work well with a relational database as a cache, and being line-oriented, they are more diffable.