kamailio/modules/db_cassandra/doc/db_cassandra_admin.xml

<?xml version="1.0" encoding='ISO-8859-1'?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd" [

<!-- Include general documentation entities -->
<!ENTITY % docentities SYSTEM "../../../docbook/entities.xml">
%docentities;

]>

<!-- Module User's Guide -->

<chapter>

	<title>&adminguide;</title>

	<section>
	<title>Overview</title>
	<para>
		Db_cassandra is one of the &siprouter; database modules. It does
		not export any functions executable from the configuration scripts,
		but it exports a subset of functions from the database API, and thus,
		other modules can use it as a database driver, instead of, for
		example, the Mysql module.
	</para>
	<para>
		The storage backend is a <emphasis>Cassandra</emphasis> cluster and
		this module provides an SQL interface to be used by other modules for
		storing and retrieving data. Because Cassandra is a NoSQL distributed
		system, there are limitations on the operations that can be performed.
		The limitations concern the indexes on which queries are performed, as
		it is only possible to have simple conditions (equality comparison only)
		and only two indexing levels.  These issues will be explained in an example
		below.
	</para>
	<para>
		Cassandra DB is especially suited for storing large data or data that requires
		distribution, redundancy or replication. One usage example is
		a distributed location system in a platform that has a cluster of &siprouter;
		servers, with more proxies and registration servers accessing the same location
		database. This was actually the main use case we had in mind when implementing
		this module. Please NOTE that it has only been tested with the
		<emphasis>usrloc</emphasis>, <emphasis>auth_db</emphasis> and <emphasis>domain</emphasis> modules.
	</para>
	<para>
		You can find a configuration file example for this usage in the module - kamailio_cassa.cfg.
	</para>
	<para>
		Because the module has to do the translation from SQL to Cassandra NoSQL
		queries, the schemas for the tables must be known by the module.
		You will find the schemas for location, subscriber and version tables in
		utils/kamctl/dbcassandra directory. You have to provide the path to the
		directory containing the table definitions by setting the module parameter
		schema_path.
	</para>
	<para>
		There is no need to configure a table metadata in Cassandra cluster.
		You only need to define a keyspace with the name of the database and for each table
		a column family inside that keyspace with the name of the table. The comparator
		and validators should be either UTF8Type or ASCIIType.
		Example:
	</para>
<programlisting format="linespecific">
   ...
   create keyspace openser;
   use openser;
   create column family 'location' with comparator='UTF8Type' and
default_validation_class='UTF8Type' and key_validation_class='UTF8Type';
   ...
</programlisting>

	<para>
		Special attention was given to performance in Cassandra. Therefore, the
		implementation uses only the native row indexing in Cassandra and no secondary
		indexes, because they are costly. Instead, we simulate a secondary index by
		using the column names and putting information in them, which is very efficient.
		Also, for deleting expired records, we let Cassandra take care of this with
		its own mechanism (by setting the TTL for columns).
	</para>
	</section>

	<section>
	<title>Dependencies</title>
	<section>
		<title>&siprouter; Modules</title>
		<para>
		The following modules must be loaded before this module:
			<itemizedlist>
			<listitem>
			<para>
				<emphasis>No dependencies on other &siprouter; modules</emphasis>.
			</para>
			</listitem>
			</itemizedlist>
		</para>
	</section>
	<section>
		<title>External Libraries or Applications</title>
		<para>
		The following libraries or applications must be installed before running
		&siprouter; with this module loaded:
			<itemizedlist>
			<listitem>
			<para>
				<emphasis>Thrift library</emphasis> version 0.6.1 .
			</para>
			</listitem>
			</itemizedlist>
		</para>
		<para> The implementation was tested with Cassandra version 1.0.1 .</para>
	</section>
	</section>

	<section>
	<title>Parameters</title>
	<section>
		<title><varname>schema_path</varname> (string)</title>
		<para>
			The directory where the files with the table schemas are located.
			This directory has to contain the subdirectories corresponding to the
			databases' names (name of the directory = name of the database).
			These directories, in turn, contain the files with the table schemas.
			See the schemas in utils/kamctl/dbcassandra directory.
		</para>
		<example>
		<title>Set <varname>schema_path</varname> parameter</title>
<programlisting format="linespecific">
   ...
   modparam("db_cassandra", "schema_path",
               "/usr/local/sip-router/etc/kamctl/dbcassandra")
   ...
</programlisting>
		</example>
	</section>
	</section>

	<section>
	<title>Functions</title>
		<para>
		NONE
		</para>
	</section>

	<section>
	<title>Installation</title>
		<para>
		Because it dependes on an external library, the db_cassandra module is not
		compiled and installed by default. You can use one of these options:
		</para>
		<itemizedlist>
			<listitem>
			<para>
			- edit the "Makefile" and remove "db_cassandra" from "excluded_modules"
			list. Then follow the standard procedure to install &siprouter;:
			"make all; make install".
			</para>
			</listitem>
			<listitem>
			<para>
			- from command line, run: 'make all include_modules="db_cassandra";
			make install include_modules="db_cassandra"'.
			</para>
			</listitem>
		</itemizedlist>
	</section>


	<section>
		<title>Table schema</title>
		<para>
		The module must know the table schema for the tables that will be used.
		You must configure the path to the schema directory by setting the
		<emphasis>schema_path</emphasis> parameter.
		</para>
		<para>
		A table schema document has the following structure:
		<itemizedlist>
			<listitem>
			<para>
			First row: <emphasis>the name and type of the columns</emphasis> in the form name(type)
			separated by spaces. The possible types are: string, int, double and timestamp.
			</para>
			<para>
			The<emphasis>timestamp</emphasis> type has a special meaning. Only one column of this type can
			be defined for a table, and it should contain the expiry time for that record.
			If defined this value will be used to compute the <emphasis>ttl</emphasis> for the columns
			and Cassandra will automatically delete the columns when they expire. Because we want the
			ttl to have meaning for the entire record, we must ensure that when the ttl is updated, it
			is updated for all columns for that record. In other words, to update the expiration time
			of a record, an insert operation must be performed from the point of view of the db_cassandra
			module ("insert" in Cassandra means "replace if exists or insert new record otherwise"). So, if
			you define a table with a timestamp column, the update operations on that table that also
			update the timestamp must update all columns. So, these update operations must in fact be insert
			operations.
			</para>
			</listitem>
			<listitem>
			<para>
			Second row: <emphasis>the columns that form the row key</emphasis> separated by space.
			</para>
			</listitem>
			<listitem>
			<para>
			Third row: <emphasis>the columns that form the secondary key</emphasis> separated by space.
			</para>
			</listitem>
		</itemizedlist>
		</para>
		<para>
		Bellow you can see the schema for the <emphasis>location</emphasis> table:
		</para>
	<para>
	</para>

<programlisting format="linespecific">
   ...
   callid(string) cflags(int) contact(string) cseq(int) <emphasis>expires(timestamp)</emphasis> flags(int) last_modified(int) methods(int) path(string) q(double) received(string) socket(string) user_agent(string) username(string) ruid(string) instance(string) reg_id(int)
   <emphasis>username</emphasis>
   <emphasis>contact</emphasis>
   ...
</programlisting>

	<para>
		Observe first that the <emphasis>row key is the username</emphasis> and the <emphasis>secondary index is the contact</emphasis>.
		We have also defined a timestamp column - <emphasis>expires</emphasis>.
		In this example, both the row key and the secondary index are defined by only one column,
		but they can be formed out of more columns.  You can list them separated by space.
	</para>

	<para>
		To understand why the schema looks like this, we must first see which
		queries are performed on the location table.
		(The 'callid' condition was ignored as it doesn't really have a well defined role in the SIP RFC).
	</para>
	<itemizedlist>
		<listitem>
			<para>
				When Invite received, lookup location: select where <emphasis>username='..'</emphasis>.
			</para>
		</listitem>
		<listitem>
			<para>
				When Register received, update registration:
				update where <emphasis>username='..'</emphasis> and <emphasis>contact='..'</emphasis>.
			</para>
		</listitem>
	</itemizedlist>
	<para>
		So, the relation between these keys is the following:
	</para>
	<itemizedlist>
		<listitem>
		<para>
		The unique key for a table is actually the combination of row key + secondary key.
		</para>
		</listitem>
		<listitem>
		<para>
		A row defined by a row key will contain more records with different secondary keys.
		</para>
		</listitem>
	</itemizedlist>
	<para>
		The timestamp column that leaves the Cassandra cluster to deal with deleting expired
		record can be used only with a modification to the usrloc module that replaces the
		update performed at re-registration with an insert operation (so that all columns
		are updated).
		This behavior can be enabled by setting a parameter in the usrloc module
		<emphasis>db_update_as_insert</emphasis>:
	</para>
	<para>
	</para>

<programlisting format="linespecific">
   ...
   modparam("usrloc", "db_update_as_insert", 1)
   ...
</programlisting>

	<para>
		The alternative would have been to define an index on the expire column and
		run a external job to periodically delete the expired records. However,
		obviously, this would be more costly.
	</para>

	</section>

	<section>
	<title>Limitations</title>
		<para>
			The module can be used only when the queries use only one index, which is also the
			unique key, or have two indexes that form the unique key like in the usrloc usage.
		</para>
	</section>

</chapter>