lethe.binmeta – Binary file metadata extraction

This module provides metadata and text extraction for binary attachments.

It is intended to support full-text search on binary attachments and node import with properties derived from the attachment metadata.

class lethe.binmeta.MetadataExtractor(data)

Base class for metadata extractors. There should be one subclass per file type.

An instance is created for a specific binary (the byte string data) and provides methods to extract its metadata.

Subclasses should override all methods and the content_type class attribute.

content_type = (u'application/octet-stream',)

A sequence of binary_content_type property values handled by this extractor.

data = None

The binary data.

extract_words()

Iterate over the words of the binary’s text, yielding each as a string.

The default implementation returns an empty sequence.

get_metadata()

Return a lethe.props.Properties object representing the document metadata.

The binary_content_type property is always returned (the default is application/octet-stream). For formats that support it, bib_author and title properties should be returned too.

classmethod supports_file(file_name, data)

Return True if this extractor can handle the file.

Parameters:
  • file_name – file name or None if unknown
  • data – file content byte string

The default accepts all files.
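
The subclassing pattern described above can be sketched roughly as follows. This is an illustration only: lethe is not imported, so the `MetadataExtractor` class below is a stand-in that mirrors just the documented interface, `get_metadata()` returns a plain dict in place of a lethe.props.Properties object, and `PlainTextExtractor` is a hypothetical subclass that is not part of the module.

```python
import re


class MetadataExtractor(object):
    """Stand-in mirroring the documented base class interface (not lethe's code)."""

    content_type = (u'application/octet-stream',)
    data = None

    def __init__(self, data):
        self.data = data

    @classmethod
    def supports_file(cls, file_name, data):
        # Documented default: accept all files.
        return True

    def extract_words(self):
        # Documented default: an empty sequence.
        return iter(())

    def get_metadata(self):
        # Assumption: a plain dict stands in for lethe.props.Properties.
        return {u'binary_content_type': self.content_type[0]}


class PlainTextExtractor(MetadataExtractor):
    """Hypothetical extractor for UTF-8 plain text files."""

    content_type = (u'text/plain',)

    @classmethod
    def supports_file(cls, file_name, data):
        # Match by extension only; a file with no known name is rejected.
        return file_name is not None and file_name.lower().endswith(u'.txt')

    def extract_words(self):
        # Decode leniently so that invalid bytes do not abort indexing.
        text = self.data.decode('utf-8', 'replace')
        for match in re.finditer(r'\w+', text):
            yield match.group(0)

    def get_metadata(self):
        # Plain text carries no title or author, so only the content type is set.
        return {u'binary_content_type': self.content_type[0]}


extractor = PlainTextExtractor(b'Hello binary world')
words = list(extractor.extract_words())
```

The subclass overrides every method and the content_type attribute, as the base class documentation asks; how the real module splits text into words is not documented here, so the regular expression is an assumption.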

class lethe.binmeta.HTMLExtractor(data)

Metadata extractor for HTML files.

Supports only text and title extraction.

Todo

Implement support for more metadata, if it is generally useful, based on some of the popular HTML metadata standards.
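
As an illustration of what text and title extraction from HTML can look like, here is a minimal sketch built on the standard library's `html.parser`. It is not the HTMLExtractor implementation: the helper names (`_TitleAndTextParser`, `extract_title_and_words`), the UTF-8 default, the skipping of script and style content, and the choice to keep title words out of the body text are all assumptions made for the example.

```python
import re
from html.parser import HTMLParser


class _TitleAndTextParser(HTMLParser):
    """Collect the <title> text and the visible text separately."""

    def __init__(self):
        HTMLParser.__init__(self, convert_charrefs=True)
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self._in_title = True
        elif tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False
        elif tag in ('script', 'style') and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif not self._skip_depth:
            self.text_parts.append(data)


def extract_title_and_words(data, encoding='utf-8'):
    """Return (title, words) for an HTML byte string (hypothetical helper)."""
    parser = _TitleAndTextParser()
    parser.feed(data.decode(encoding, 'replace'))
    parser.close()
    # Collapse runs of whitespace inside the title.
    title = u' '.join(u''.join(parser.title_parts).split())
    words = re.findall(r'\w+', u' '.join(parser.text_parts))
    return title, words


sample = (b'<html><head><title>My  Page</title>'
          b'<style>p {color: red}</style></head>'
          b'<body><p>Hello <b>world</b></p>'
          b'<script>var x = 1;</script></body></html>')
title, words = extract_title_and_words(sample)
```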

class lethe.binmeta.PDFExtractor(data)

Metadata extractor for PDF files.

PyPDF2 is used to parse the file.

Extracted metadata are title and author from the document info dictionary.

class lethe.binmeta.PopplerPDFExtractor(data)

Metadata extractor for PDF files using Poppler via PyGObject.

Extracted metadata are title and author from the document info dictionary.

class lethe.binmeta.MagicExtractor(data)

A metadata extractor using libmagic via python-magic.

Only the binary_content_type property is determined.

lethe.binmeta.get_extractor(data, file_name=None, content_type=None, extractors=(<class 'lethe.binmeta.HTMLExtractor'>, <class 'lethe.binmeta.PopplerPDFExtractor'>, <class 'lethe.binmeta.PDFExtractor'>, <class 'lethe.binmeta.MagicExtractor'>, <class 'lethe.binmeta.MetadataExtractor'>))

Return a MetadataExtractor instance for the specified file.

Parameters:
  • data – file content as a byte string
  • file_name – file name or None to not compare by name
  • content_type – file content type or None if unknown
  • extractors – a sequence of MetadataExtractor subclasses to choose from, the default is all standard ones
Returns:
  a MetadataExtractor subclass instance, or None if a custom extractors sequence was passed and none of them matched
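
The selection behaviour can be pictured with the following sketch. It is not the module's implementation: `pick_extractor`, `BaseExtractor` and `HTMLLikeExtractor` are hypothetical stand-ins, and the matching rule (take the first class whose content_type contains the given content type, or whose supports_file() accepts the file) is an assumption. It does show why the default sequence always yields an instance while a custom one may yield None: the base class, which accepts all files, comes last in the default sequence.

```python
class BaseExtractor(object):
    """Stand-in for the documented base class: accepts every file."""

    content_type = (u'application/octet-stream',)

    def __init__(self, data):
        self.data = data

    @classmethod
    def supports_file(cls, file_name, data):
        return True


class HTMLLikeExtractor(BaseExtractor):
    """Hypothetical extractor that matches on a simple content sniff."""

    content_type = (u'text/html',)

    @classmethod
    def supports_file(cls, file_name, data):
        head = data.lstrip().lower()
        return head.startswith((b'<!doctype html', b'<html'))


def pick_extractor(data, file_name=None, content_type=None,
                   extractors=(HTMLLikeExtractor, BaseExtractor)):
    """Return an instance of the first matching class, or None (assumed logic)."""
    for cls in extractors:
        # Assumption: a known content type listed by the class is a match.
        if content_type is not None and content_type in cls.content_type:
            return cls(data)
        # Otherwise let the class inspect the file name and content.
        if cls.supports_file(file_name, data):
            return cls(data)
    return None


html = pick_extractor(b'<html><body></body></html>')
fallback = pick_extractor(b'\x00\x01')
unmatched = pick_extractor(b'\x00\x01', extractors=(HTMLLikeExtractor,))
by_type = pick_extractor(b'\x00\x01', content_type=u'text/html')
```

Callers that pass their own extractors sequence should therefore be prepared to handle None.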

lethe.binmeta.IMAGE_FORMATS = frozenset([u'image/jpeg', u'image/svg+xml', u'image/png', u'image/x-icon', u'image/gif', u'image/x-bmp', u'image/webp', u'image/bmp'])

A set of MIME content types for images that Web browsers can render.
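
A typical use of such a set is a membership test on a content type. In the sketch below the set is copied from the value documented above rather than imported from lethe, and `is_browser_image` is a hypothetical helper; stripping parameters and lowercasing the value before the lookup are assumptions about what callers may want.

```python
# Copy of the documented value; in real code it would come from lethe.binmeta.
IMAGE_FORMATS = frozenset([
    u'image/jpeg', u'image/svg+xml', u'image/png', u'image/x-icon',
    u'image/gif', u'image/x-bmp', u'image/webp', u'image/bmp',
])


def is_browser_image(content_type):
    """Return True if content_type names a browser-renderable image."""
    if content_type is None:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    bare_type = content_type.split(u';', 1)[0].strip().lower()
    return bare_type in IMAGE_FORMATS
```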
