Pre-release
Edit me on GitHub

Cataloging

Substance D provides application content indexing and querying via a catalog. A catalog is an object named catalog which lives in a service named catalogs within your application's resource tree. A catalog has a number of indexes, each of which keeps a certain kind of information about your content.

The Default Catalog

A default catalog named system is installed into the root folder's catalogs subfolder when you start Substance D. This system catalog contains a default set of indexes:

  • path (a path index)

    Represents the path of the content object.

  • name (a field index), uses content.__name__ exclusively

    Represents the local name of the content object.

  • interfaces (a keyword index)

    Represents the set of interfaces possessed by the content object.

  • content_type (a field index)

    Represents the Substance D content-type of an object.

  • allowed (an allowed index)

    An index which can be used to filter resultsets using principals and permissions.

  • text (a text index)

    Represents the text searched for when you use the filter box within the folder contents view of the SDI.

Adding a Catalog

The system catalog won't have enough information to form all the queries you need. You'll have to add a catalog via code related to your application. The first step is adding a catalog factory.

A catalog factory is a collection of index descriptions. Creating a catalog factory doesn't actually add a catalog to your database, but it makes it possible to add one later.

Here's an example catalog factory named mycatalog:

from substanced.catalog import (
    catalog_factory,
    Text,
    Field,
    )

@catalog_factory('mycatalog')
class MyCatalogFactory(object):
    freaky = Text()
    funky = Field()

In order to activate a @catalog_factory decorator, it must be scanned using the Pyramid config.scan() machinery. This will allow you to use substanced.catalog.CatalogsService.add_catalog() to add a catalog with that factory's name:

# in a module named blog.__init__

from pyramid.config import Configurator

def main(global_config, **settings):
    config = Configurator()
    config.include('substanced')
    config.scan('blog.catalogs')
    # .. and so on ...

Once you've done this, you can then add the catalog to the database in any bit of code that has access to the database. For example, in an event handler when the root object is created for the first time.

from substanced.root import Root
from substanced.event import subscribe_created

@subscribe_created(Root)
def created(event):
    root = event.object
    service = root['catalogs']
    service.add_catalog('mycatalog', update_indexes=True)

Object Indexing

Once a new catalog has been added to the database, each time a new catalogable object is added to the site, its attributes will be indexed by each catalog in its lineage that "cares about" the object. The object will always be indexed in the "system" catalog. To make sure it's cataloged in custom catalogs, you'll need to do some work. To index the object in a custom application index, you will need to create an index view for your content using substanced.catalog.indexview, and scan the resulting index view using pyramid.config.Configurator.scan():

For example:

from substanced.catalog import indexview

class MyCatalogViews(object):
    def __init__(self, resource):
        self.resource = resource

    @indexview(catalog_name='mycatalog')
    def freaky(self, default):
        return getattr(self.resource, 'freaky', default)

An index view class should be a class that accepts a single argument, (conventionally named resource), in its constructor, and which has one or more methods named after potential index names. When it comes time for the system to index your content, Substance D will create an instance of your indexview class, and it will then call one or more of its methods; it will call methods on the indexview object matching the attr passed in to add_indexview. The default value passed in should be returned if the method is unable to compute a value for the content object.

Once this is done, whenever an object is added to the system, a value (the result of the freaky(default) method of the catalog view) will be indexed in the freaky field index.

You can attach multiple index views to the same index view class:

from substanced.catalog import indexview

class MyCatalogViews(object):
    def __init__(self, resource):
        self.resource = resource

    @indexview(catalog_name='mycatalog')
    def freaky(self, default):
        return getattr(self.resource, 'freaky', default)

    @indexview(catalog_name='mycatalog')
    def funky(self, default):
        return getattr(self.resource, 'funky', default)

You can use the "index_name" parameter to indexview to tell the system that the index name is not the same as the method name in the index view:

from substanced.catalog import indexview

class MyCatalogViews(object):
    def __init__(self, resource):
        self.resource = resource

    @indexview(catalog_name='mycatalog')
    def freaky(self, default):
        return getattr(self.resource, 'freaky', default)

    @indexview(catalog_name='mycatalog', index_name='funky')
    def notfunky(self, default):
        return getattr(self.resource, 'funky', default)

You can use the context parameter to indexview to tell the system that this particular index view should only be executed when the class of the resource (or any of its interfaces) matches the value of the context:

from substanced.catalog import indexview

class MyCatalogViews(object):
    def __init__(self, resource):
        self.resource = resource

    @indexview(catalog_name='mycatalog', context=FreakyContent)
    def freaky(self, default):
        return getattr(self.resource, 'freaky', default)

    @indexview(catalog_name='mycatalog', index_name='funky')
    def notfunky(self, default):
        return getattr(self.resource, 'funky', default)

You can use the indexview_defaults class decorator to save typing in each indexview declaration. Keyword argument names supplied to indexview_defaults will be used if the indexview does not supply the same keyword:

from substanced.catalog import (
    indexview,
    indexview_defaults,
    )

@indexview_defaults(catalog_name='mycatalog')
class MyCatalogViews(object):
    def __init__(self, resource):
        self.resource = resource

    @indexview()
    def freaky(self, default):
        return getattr(self.resource, 'freaky', default)

    @indexview()
    def notfunky(self, default):
        return getattr(self.resource, 'funky', default)

The above configuration is the same as:

from substanced.catalog import indexview

class MyCatalogViews(object):
    def __init__(self, resource):
        self.resource = resource

    @indexview(catalog_name='mycatalog')
    def freaky(self, default):
        return getattr(self.resource, 'freaky', default)

    @indexview(catalog_name='mycatalog')
    def notfunky(self, default):
        return getattr(self.resource, 'funky', default)

You can also use the substanced.catalog.add_indexview() directive to add index views imperatively, instead of using the @indexview decorator.

Querying the Catalog

You execute a catalog query using APIs of the catalog's indexes.

from substanced.util import find_catalog

catalog = find_catalog(resource, 'system')
name = catalog['name']
path = catalog['path']
# find me all the objects that exist under /somepath with the name 'somename'
q = name.eq('somename') & path.eq('/somepath')
resultset = q.execute()
for contentob in resultset:
    print contentob

The calls to name.eq() and path.eq() above each return a query object. Those two queries are ANDed together into a single query via the & operator between them (there's also the | character to OR the queries together, but we don't use it above). Parentheses can be used to group query expressions together for the purpose of priority.

Different indexes have different query methods, but most support the eq method. Other methods that are often supported by indexes: noteq, ge, le, gt, any, notany, all, notall, inrange, notinrange. The AllowedIndex supports an additional allows() method.

Query objects support an execute method. This method returns a hypatia.util.ResultSet. A hypatia.util.ResultSet can be iterated over; each iteration returns a content object. hypatia.util.ResultSet also has methods like one and first, which return a single content object instead of a set of content objects. A hypatia.util.ResultSet also has a sort method which accepts an index object (the sort index) and returns another (sorted) hypatia.util.ResultSet.

catalog = find_catalog(resource, 'system')
name = catalog['name']
path = catalog['path']
# find me all the objects that exist under /somepath with the name 'somename'
q = name.eq('somename') & path.eq('/somepath')
resultset = q.execute()
newresultset = resultset.sort(name)

Note

If you don't call sort on the hypatia.util.ResultSet you get back, the results will not be sorted in any particular order.

Querying Across Catalogs

In many cases, you might only have one custom attribute that you need indexed, while the system catalog has everything else you need. You thus need an efficient way to combine results from two catalogs, before executing the query:

system_catalog = find_catalog(resource, 'system')
my_catalog = find_catalog(resource, 'mycatalog')
path = system_catalog['path']
funky = my_catalog['funky']
# find me all funky objects that exist under /somepath
q = funky.eq(True) & path.eq('/somepath')
resultset = q.execute()
newresultset = resultset.sort(system_catalog['name'])

Filtering Catalog Results Using the Allowed Index

The Substance D system catalog at substanced.catalog.system.SystemCatalogFactory contains a number of default indexes, including an allowed index. Its job is to index security information to allow security-aware results in queries. This index allows us to filter queries to the system catalog based on whether the principal issuing the request has a permission on the matching resource.

For example, the below query will find:

  • all of the subresources inside a folder
  • which is of content type News Item
  • which the current user also possesses the view permission against
system_catalog = find_catalog(resource, 'system')
path = system_catalog['path']
interfaces = system_catalog['content_type']
allowed = system_catalog['allowed']
q = ( path.eq(resource, depth=1, include_origin=False) &
      content_type.eq('News Item') &
      allowed.allows(request, 'view')
    )
return q

Filtering Catalog Results Using The Objectmap

It is possible to postfilter catalog results using the substanced.objectmap.ObjectMap.allowed() API. For example:

def get_allowed_to_view(context, request):

    catalog = find_catalog(context, 'system')
    q = catalog['content_type'].eq('News Item')
    resultset = q.execute()

    objectmap = find_objectmap(context)
    return objectmap.allowed(
              resultset.oids, request.effective_principals, 'view')

The result of allowed() is a generator which returns oids, so the result must be listified if you intend to index into it, or slice it, or what-have-you.

Setting ACLs

The objectmap keeps track of ACLs in a cache to make catalog security functionality work. Note that for the object map's cached version of ACLs to be correct, you will need to set ACLs in a way that helps keep track of all the contracts. For this, the helper function substanced.util.set_acl() can be used. For example, the site root at substanced.root.Root finishes with:

set_acl(
    self,
    [(Allow, get_oid(admins), ALL_PERMISSIONS)],
    registry=registry,
    )

Using set_acl this way will generate an event that will keep the objectmap's cache updated. This will allow the allowed index to work and the substanced.objectmap.ObjectMap.allowed() method to work.

Deferred Indexing and Mode Parameters

As a lesson learned from previous cataloging experience, Substance D natively supports deferred indexing. As an example, in many systems the text indexing can be done after the change to the object is committed in the web request's transaction. Doing so has a number of performance benefits: the user's request processes more quickly, the work to extract text from a Word file can be performed later, less chance to have a conflict error, etc.

As such, the substanced.catalog.system.SystemCatalogFactory, by default, has indexes that aren't updated immediately when a resource is changed. For example:

# name is MODE_ATCOMMIT for next-request folder contents consistency
name = Field()

text = Text(action_mode=MODE_DEFERRED)
content_type = Field()

The Field indexes use the default of MODE_ATCOMMIT. The Text overrides the default and set action_mode to MODE_DEFERRED.

There are three such catalog "modes" for indexing:

  • substanced.interfaces.MODE_IMMEDIATE means indexing action should take place as immediately as possible.
  • substanced.interfaces.MODE_ATCOMMIT means indexing action should take place at the successful end of the current transaction.
  • substanced.interfaces.MODE_DEFERRED means indexing action should be performed by an external indexing processor (e.g. drain_catalog_indexing) if one is active at the successful end of the current transaction. If an indexing processor is unavailable at the successful end of the current transaction, this mode will be taken to imply the same thing as MODE_ATCOMMIT.

Running an Indexer Process

Great, we've now deferred indexing to a later time. What exactly do we do at that later time?

Indexer processes are easy to write and schedule with supervisor. Here is an example of a configuration for supervisor.conf that will run in indexer process every five seconds:

[program:indexer]
command = %(here)s/../bin/sd_drain_indexing %(here)s/production.ini
redirect_stderr = true
stdout_logfile = %(here)s/../var/indexing.log
autostart = true
startsecs = 5

This calls sd_drain_indexing which is a console script that Substance D automatically creates in your bin directory. Indexing messages are logged with standard Python logging to the file that you name. You can view these messages with the supervisorctl command tail indexer. For example, here is the output from sd_drain_indexing when changing a simple Document content type:

2013-01-07 11:07:38,306 INFO  [substanced.catalog.deferred][MainThread] no actions to execute
2013-01-07 11:08:38,329 INFO  [substanced.catalog.deferred][MainThread] executing <substanced.catalog.deferred.IndexAction object oid 5886459017869105529 for index u'text' at 0x106e52910>
2013-01-07 11:08:38,332 INFO  [substanced.catalog.deferred][MainThread] executing <substanced.catalog.deferred.IndexAction object oid 5886459017869105529 for index u'interfaces' at 0x106e52dd0>
2013-01-07 11:08:38,333 INFO  [substanced.catalog.deferred][MainThread] executing <substanced.catalog.deferred.IndexAction object oid 5886459017869105529 for index u'content_type' at 0x1076e2ed0>
2013-01-07 11:08:38,334 INFO  [substanced.catalog.deferred][MainThread] committing
2013-01-07 11:08:38,351 INFO  [substanced.catalog.deferred][MainThread] committed

Overriding Default Modes Manually

Above we set the default mode used by an index when Substance D indexes a resource automatically. Perhaps in an evolve script, you'd like to override the default mode for that index and reindex immediately.

The index_resource on an index can be passed an action_mode flag that overrides the configured mode for that index, and instead, does exactly what you want for only that call. It does not permanently change the configured default for indexing mode. This applies also to reindex_resource and unindex_resource. You can also grab the catalog itself and reindex with a mode that overrides all default modes on each index.

Autosync and Autoreindex

If you add substanced.catalogs.autosync = true within your application's .ini file, all catalog indexes will be resynchronized with their catalog factory definitions at application startup time. Indices which were added to the catalog factory since the last startup time will be added to each catalog which uses the index factory. Likewise, indices which were removed will be removed from each catalog, and indices which were modified will be modified according to the catalog factory. Having this setting in your .ini file is like pressing the Update indexes button on the Manage tab of each of your catalogs. The SUBSTANCED_CATALOGS_AUTOSYNC environment variable can also be used to turn this behavior on. For example export SUBSTANCED_CATALOGS_AUTOSYNC=true.

If you add substanced.catalogs.autoreindex = true within your application's .ini file, all catalogs that were changed as the result of an auto-sync will automatically be reindexed. Having this setting in your .ini file is like pressing the Reindex catalog button on the Manage tab of each catalog which was changed as the result of hitting Update indexes. The SUBSTANCED_CATALOGS_AUTOREINDEX environment variable can also be used to turn this behavior on. For example export SUBSTANCED_CATALOGS_AUTOREINDEX=true.

Forcing Deferral of Indexing

There may be times when you'd like to defer all catalog indexing operations, such as during a bulk load of data from a script. Normally, only indexes marked with MODE_DEFERRED use deferred indexing, and actions associated with those indexes are even then only actually deferred if an index processor is active.

You can force Substance D to defer all catalog indexing using the substanced.catalogs.force_deferred flag in your application's .ini file. When this flag is used, all catalog indexing operations will be added to the indexer's queue, even those indexes marked as MODE_IMMEDIATE or MODE_ATCOMMIT. Deferral will also happen whether or not the indexer is running, unlike during normal operations.

When you use this flag, you can stop the indexer process, do your bulk load, and start the indexer again when it's convenient to have all the content indexing done in the background.

The SUBSTANCED_CATALOGS_FORCE_DEFERRED environment variable can also be used to turn this behavior on. For example export SUBSTANCED_CATALOGS_FORCE_DEFERRED=true.