Cataloging
==========

Substance D provides application content indexing and querying via a
*catalog*. A catalog is an object which lives in a service named ``catalogs``
within your application's resource tree. A catalog has a number of indexes,
each of which keeps a certain kind of information about your content.

The Default Catalog
-------------------

A default catalog named ``system`` is installed into the root folder's
``catalogs`` subfolder when you start Substance D. This ``system`` catalog
contains a default set of indexes:

- path (a ``path`` index)

  Represents the path of the content object.

- name (a ``field`` index), uses ``content.__name__`` exclusively

  Represents the local name of the content object.

- interfaces (a ``keyword`` index)

  Represents the set of interfaces possessed by the content object.

- content_type (a ``field`` index)

  Represents the Substance D content-type of an object.

- allowed (an ``allowed`` index)

  An index which can be used to filter resultsets using principals and
  permissions.

- text (a ``text`` index)

  Represents the text searched for when you use the filter box within the
  folder contents view of the SDI.

Adding a Catalog
----------------

The ``system`` catalog won't always index enough information to form every
query you need. When it doesn't, you'll have to add a catalog of your own in
your application's code.

The first step is adding a catalog factory. A catalog factory is a collection
of index descriptions. Creating a catalog factory doesn't actually add a
catalog to your database, but it makes it possible to add one later.

Here's an example catalog factory named ``mycatalog``:

.. code-block:: python

   from substanced.catalog import (
       catalog_factory,
       Text,
       Field,
       )

   @catalog_factory('mycatalog')
   class MyCatalogFactory(object):
       freaky = Text()
       funky = Field()

In order to activate a ``@catalog_factory`` decorator, it must be *scanned*
using the Pyramid ``config.scan()`` machinery.
This will allow you to use
:meth:`substanced.catalog.CatalogsService.add_catalog` to add a catalog with
that factory's name. For example, to scan a module named ``blog.catalogs``:

.. code-block:: python

   # in a module named blog.__init__

   from pyramid.config import Configurator

   def main(global_config, **settings):
       config = Configurator()
       config.include('substanced')
       config.scan('blog.catalogs')
       # .. and so on ...

Once you've done this, you can then add the catalog to the database in any bit
of code that has access to the database. For example, you can do so in an
event handler that runs when the root object is created for the first time:

.. code-block:: python

   from substanced.root import Root
   from substanced.event import subscribe_created

   @subscribe_created(Root)
   def created(event):
       root = event.object
       service = root['catalogs']
       service.add_catalog('mycatalog', update_indexes=True)

Object Indexing
---------------

Once a new catalog has been added to the database, each time a new
*catalogable* object is added to the site, its attributes will be indexed by
each catalog in its lineage that "cares about" the object. The object will
always be indexed in the ``system`` catalog. To make sure it's cataloged in
custom catalogs, you'll need to do some work. To index the object in a custom
application index, you will need to create an *index view* for your content
using :class:`substanced.catalog.indexview`, and **scan** the resulting index
view using :meth:`pyramid.config.Configurator.scan`.

For example:

.. code-block:: python

   from substanced.catalog import indexview

   class MyCatalogViews(object):
       def __init__(self, resource):
           self.resource = resource

       @indexview(catalog_name='mycatalog')
       def freaky(self, default):
           return getattr(self.resource, 'freaky', default)

An index view class should be a class that accepts a single argument
(conventionally named ``resource``) in its constructor, and which has one or
more methods named after potential index names.
When it comes time for the system to index your content, Substance D will
create an instance of your index view class and then call one or more of its
methods; it calls the methods on the index view object that match the ``attr``
passed in to ``add_indexview``. The ``default`` value passed in should be
returned if the method is unable to compute a value for the content object.

Once this is done, whenever an object is added to the system, a value (the
result of the ``freaky(default)`` method of the catalog view) will be indexed
in the ``freaky`` index.

You can attach multiple index views to the same index view class:

.. code-block:: python

   from substanced.catalog import indexview

   class MyCatalogViews(object):
       def __init__(self, resource):
           self.resource = resource

       @indexview(catalog_name='mycatalog')
       def freaky(self, default):
           return getattr(self.resource, 'freaky', default)

       @indexview(catalog_name='mycatalog')
       def funky(self, default):
           return getattr(self.resource, 'funky', default)

You can use the ``index_name`` parameter to ``indexview`` to tell the system
that the index name is not the same as the method name in the index view:

.. code-block:: python

   from substanced.catalog import indexview

   class MyCatalogViews(object):
       def __init__(self, resource):
           self.resource = resource

       @indexview(catalog_name='mycatalog')
       def freaky(self, default):
           return getattr(self.resource, 'freaky', default)

       @indexview(catalog_name='mycatalog', index_name='funky')
       def notfunky(self, default):
           return getattr(self.resource, 'funky', default)

You can use the ``context`` parameter to ``indexview`` to tell the system that
this particular index view should only be executed when the class of the
resource (or any of its interfaces) matches the value of the context:

.. code-block:: python

   from substanced.catalog import indexview

   # FreakyContent stands for a content class (or interface) defined
   # elsewhere in your application.

   class MyCatalogViews(object):
       def __init__(self, resource):
           self.resource = resource

       @indexview(catalog_name='mycatalog', context=FreakyContent)
       def freaky(self, default):
           return getattr(self.resource, 'freaky', default)

       @indexview(catalog_name='mycatalog', index_name='funky')
       def notfunky(self, default):
           return getattr(self.resource, 'funky', default)

You can use the ``indexview_defaults`` class decorator to save typing in each
``indexview`` declaration. Keyword argument names supplied to
``indexview_defaults`` will be used if the ``indexview`` does not supply the
same keyword:

.. code-block:: python

   from substanced.catalog import (
       indexview,
       indexview_defaults,
       )

   @indexview_defaults(catalog_name='mycatalog')
   class MyCatalogViews(object):
       def __init__(self, resource):
           self.resource = resource

       @indexview()
       def freaky(self, default):
           return getattr(self.resource, 'freaky', default)

       @indexview()
       def notfunky(self, default):
           return getattr(self.resource, 'funky', default)

The above configuration is the same as:

.. code-block:: python

   from substanced.catalog import indexview

   class MyCatalogViews(object):
       def __init__(self, resource):
           self.resource = resource

       @indexview(catalog_name='mycatalog')
       def freaky(self, default):
           return getattr(self.resource, 'freaky', default)

       @indexview(catalog_name='mycatalog')
       def notfunky(self, default):
           return getattr(self.resource, 'funky', default)

You can also use the :func:`substanced.catalog.add_indexview` directive to add
index views imperatively, instead of using the ``@indexview`` decorator.

Querying the Catalog
--------------------

You execute a catalog query using APIs of the catalog's indexes.

.. code-block:: python

   from substanced.util import find_catalog

   catalog = find_catalog(resource, 'system')
   name = catalog['name']
   path = catalog['path']
   # find me all the objects that exist under /somepath with the name 'somename'
   q = name.eq('somename') & path.eq('/somepath')
   resultset = q.execute()
   for contentob in resultset:
       print(contentob)

The calls to ``name.eq()`` and ``path.eq()`` above each return a query object.
Those two queries are ANDed together into a single query via the ``&``
operator between them (there's also the ``|`` operator to OR the queries
together, but we don't use it above). Parentheses can be used to group query
expressions and control their precedence.

Different indexes have different query methods, but most support the ``eq``
method. Other methods that are often supported by indexes: ``noteq``, ``ge``,
``le``, ``gt``, ``lt``, ``any``, ``notany``, ``all``, ``notall``,
``inrange``, ``notinrange``. The
:class:`~substanced.catalog.indexes.AllowedIndex` supports an additional
:meth:`~substanced.catalog.indexes.AllowedIndex.allows` method.

Query objects support an ``execute`` method. This method returns a
:class:`hypatia.util.ResultSet`. A :class:`hypatia.util.ResultSet` can be
iterated over; each iteration returns a content object.
:class:`hypatia.util.ResultSet` also has methods like ``one`` and ``first``,
which return a single content object instead of a set of content objects. A
:class:`hypatia.util.ResultSet` also has a ``sort`` method which accepts an
index object (the sort index) and returns another (sorted)
:class:`hypatia.util.ResultSet`.

.. code-block:: python

   from substanced.util import find_catalog

   catalog = find_catalog(resource, 'system')
   name = catalog['name']
   path = catalog['path']
   # find me all the objects that exist under /somepath with the name 'somename'
   q = name.eq('somename') & path.eq('/somepath')
   resultset = q.execute()
   newresultset = resultset.sort(name)

.. note::

   If you don't call ``sort`` on the :class:`hypatia.util.ResultSet` you get
   back, the results will not be sorted in any particular order.

Querying Across Catalogs
------------------------

In many cases, you might only have one custom attribute that you need indexed,
while the ``system`` catalog has everything else you need. You thus need an
efficient way to combine indexes from two catalogs into a single query before
executing it:

.. code-block:: python

   from substanced.util import find_catalog

   system_catalog = find_catalog(resource, 'system')
   my_catalog = find_catalog(resource, 'mycatalog')
   path = system_catalog['path']
   funky = my_catalog['funky']
   # find me all funky objects that exist under /somepath
   q = funky.eq(True) & path.eq('/somepath')
   resultset = q.execute()
   newresultset = resultset.sort(system_catalog['name'])

Filtering Catalog Results Using the Allowed Index
-------------------------------------------------

The Substance D system catalog at
:class:`substanced.catalog.system.SystemCatalogFactory` contains a number of
default indexes, including an ``allowed`` index. Its job is to index security
information to allow security-aware results in queries. This index allows us
to filter queries to the system catalog based on whether the principal issuing
the request has a permission on the matching resource.

For example, the query below finds all resources that:

- are subresources of a folder

- are of content type ``News Item``

- the current user possesses the ``view`` permission against

.. code-block:: python

   from substanced.util import find_catalog

   # news_items_query is a hypothetical helper name
   def news_items_query(resource, request):
       system_catalog = find_catalog(resource, 'system')
       path = system_catalog['path']
       content_type = system_catalog['content_type']
       allowed = system_catalog['allowed']
       q = ( path.eq(resource, depth=1, include_origin=False) &
             content_type.eq('News Item') &
             allowed.allows(request, 'view')
           )
       return q

Filtering Catalog Results Using The Objectmap
---------------------------------------------

It is possible to postfilter catalog results using the
:meth:`substanced.objectmap.ObjectMap.allowed` API.
For example:

.. code-block:: python

   from substanced.util import find_catalog, find_objectmap

   def get_allowed_to_view(context, request):
       catalog = find_catalog(context, 'system')
       q = catalog['content_type'].eq('News Item')
       resultset = q.execute()
       objectmap = find_objectmap(context)
       return objectmap.allowed(
           resultset.oids, request.effective_principals, 'view')

The result of :meth:`~substanced.objectmap.ObjectMap.allowed` is a generator
which yields oids, so you must wrap it in ``list()`` if you intend to index
into it or slice it.

Setting ACLs
------------

The objectmap keeps track of ACLs in a cache to make catalog security
functionality work. For the object map's cached version of ACLs to stay
correct, you need to set ACLs in a way that lets the object map track each
change. For this, the helper function :func:`substanced.util.set_acl` can be
used. For example, the site root at :class:`substanced.root.Root` finishes
with:

.. code-block:: python

   set_acl(
       self,
       [(Allow, get_oid(admins), ALL_PERMISSIONS)],
       registry=registry,
       )

Using ``set_acl`` this way will generate an event that will keep the
objectmap's cache updated. Both the ``allowed`` index and the
:meth:`substanced.objectmap.ObjectMap.allowed` method depend on that cache to
work.

Deferred Indexing and Mode Parameters
-------------------------------------

As a lesson learned from previous cataloging experience, Substance D natively
supports deferred indexing. As an example, in many systems the text indexing
can be done after the change to the object is committed in the web request's
transaction. Doing so has a number of performance benefits: the user's request
processes more quickly, the work to extract text from a Word file can be
performed later, conflict errors are less likely, etc.

As such, the :class:`substanced.catalog.system.SystemCatalogFactory`, by
default, has indexes that aren't updated immediately when a resource is
changed. For example:

.. code-block:: python

   # name is MODE_ATCOMMIT for next-request folder contents consistency
   name = Field()

   text = Text(action_mode=MODE_DEFERRED)

   content_type = Field()

The ``Field`` indexes use the default of ``MODE_ATCOMMIT``. The ``Text`` index
overrides the default and sets ``action_mode`` to ``MODE_DEFERRED``.

There are three such catalog "modes" for indexing:

- :class:`substanced.interfaces.MODE_IMMEDIATE` means indexing action should
  take place as immediately as possible.

- :class:`substanced.interfaces.MODE_ATCOMMIT` means indexing action should
  take place at the successful end of the current transaction.

- :class:`substanced.interfaces.MODE_DEFERRED` means indexing action should be
  performed by an external indexing processor (e.g.
  ``drain_catalog_indexing``) if one is active at the successful end of the
  current transaction. If an indexing processor is unavailable at the
  successful end of the current transaction, this mode will be taken to imply
  the same thing as ``MODE_ATCOMMIT``.

Running an Indexer Process
--------------------------

Great, we've now deferred indexing to a later time. What exactly do we do at
that later time?

Indexer processes are easy to write and schedule with ``supervisor``. Here is
an example of a configuration for ``supervisor.conf`` that will run an indexer
process every five seconds::

    [program:indexer]
    command = %(here)s/../bin/sd_drain_indexing %(here)s/production.ini
    redirect_stderr = true
    stdout_logfile = %(here)s/../var/indexing.log
    autostart = true
    startsecs = 5

This calls ``sd_drain_indexing``, which is a console script that Substance D
automatically creates in your ``bin`` directory. Indexing messages are logged
with standard Python logging to the file that you name. You can view these
messages with the ``supervisorctl`` command ``tail indexer``.
For example, here is the output from ``sd_drain_indexing`` when changing a
simple ``Document`` content type::

    2013-01-07 11:07:38,306 INFO [substanced.catalog.deferred][MainThread] no actions to execute
    2013-01-07 11:08:38,329 INFO [substanced.catalog.deferred][MainThread] executing
    2013-01-07 11:08:38,332 INFO [substanced.catalog.deferred][MainThread] executing
    2013-01-07 11:08:38,333 INFO [substanced.catalog.deferred][MainThread] executing
    2013-01-07 11:08:38,334 INFO [substanced.catalog.deferred][MainThread] committing
    2013-01-07 11:08:38,351 INFO [substanced.catalog.deferred][MainThread] committed

Overriding Default Modes Manually
---------------------------------

Above we set the default mode used by an index when Substance D indexes a
resource automatically. Perhaps in an evolve script, you'd like to override
the default mode for that index and reindex immediately.

The ``index_resource`` method of an index can be passed an ``action_mode``
argument that overrides the configured mode for that index for that call only.
It does not permanently change the index's configured default mode. This
applies also to ``reindex_resource`` and ``unindex_resource``. You can also
grab the catalog itself and reindex with a mode that overrides the default
mode of every index.

Autosync and Autoreindex
------------------------

If you add ``substanced.catalogs.autosync = true`` within your application's
``.ini`` file, all catalog indexes will be resynchronized with their catalog
factory definitions at application startup time. Indexes which were added to
the catalog factory since the last startup time will be added to each catalog
which uses the catalog factory. Likewise, indexes which were removed will be
removed from each catalog, and indexes which were modified will be modified
according to the catalog factory. Having this setting in your ``.ini`` file is
like pressing the ``Update indexes`` button on the ``Manage`` tab of each of
your catalogs.
The ``SUBSTANCED_CATALOGS_AUTOSYNC`` environment variable can also be used to
turn this behavior on. For example
``export SUBSTANCED_CATALOGS_AUTOSYNC=true``.

If you add ``substanced.catalogs.autoreindex = true`` within your
application's ``.ini`` file, all catalogs that were changed as the result of
an auto-sync will automatically be reindexed. Having this setting in your
``.ini`` file is like pressing the ``Reindex catalog`` button on the
``Manage`` tab of each catalog which was changed as the result of hitting
``Update indexes``. The ``SUBSTANCED_CATALOGS_AUTOREINDEX`` environment
variable can also be used to turn this behavior on. For example
``export SUBSTANCED_CATALOGS_AUTOREINDEX=true``.

Forcing Deferral of Indexing
----------------------------

There may be times when you'd like to defer all catalog indexing operations,
such as during a bulk load of data from a script. Normally, only indexes
marked with ``MODE_DEFERRED`` use deferred indexing, and actions associated
with those indexes are even then only actually deferred if an index processor
is active.

You can force Substance D to defer all catalog indexing using the
``substanced.catalogs.force_deferred`` flag in your application's ``.ini``
file. When this flag is used, all catalog indexing operations will be added to
the indexer's queue, even those indexes marked as ``MODE_IMMEDIATE`` or
``MODE_ATCOMMIT``. Deferral will also happen whether or not the indexer is
running, unlike during normal operations.

When you use this flag, you can stop the indexer process, do your bulk load,
and start the indexer again when it's convenient to have all the content
indexing done in the background.

The ``SUBSTANCED_CATALOGS_FORCE_DEFERRED`` environment variable can also be
used to turn this behavior on. For example
``export SUBSTANCED_CATALOGS_FORCE_DEFERRED=true``.
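
The interaction between an index's configured mode, the presence of an
indexing processor, and the ``force_deferred`` flag can be summarized with a
small toy model. This is only a sketch of the rules described in this chapter,
not Substance D's implementation: the string constants are stand-ins for the
real ``substanced.interfaces`` mode objects, and ``effective_mode`` is a
hypothetical helper that does not exist in Substance D.

.. code-block:: python

   # Toy model only. The strings stand in for the real mode objects in
   # substanced.interfaces, and effective_mode is a hypothetical helper.
   MODE_IMMEDIATE = 'immediate'
   MODE_ATCOMMIT = 'atcommit'
   MODE_DEFERRED = 'deferred'

   def effective_mode(index_mode, processor_active, force_deferred=False):
       """Return the mode in which an indexing action would actually run."""
       if force_deferred:
           # force_deferred queues every action, whether or not an
           # indexer process is running
           return MODE_DEFERRED
       if index_mode == MODE_DEFERRED and not processor_active:
           # without an indexing processor, deferred actions fall back
           # to running at commit time
           return MODE_ATCOMMIT
       return index_mode

   # Normal operation with no indexer running: a deferred index falls back.
   print(effective_mode(MODE_DEFERRED, processor_active=False))

   # Bulk load with force_deferred: even an immediate index is queued.
   print(effective_mode(MODE_IMMEDIATE, processor_active=False,
                        force_deferred=True))

Under these assumptions the first call prints ``atcommit`` and the second
prints ``deferred``, which matches the bulk-load scenario above: with the flag
set, actions accumulate in the queue until the indexer process is started
again.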