
Author:  Kevin Bealer
Updated: April 2008

--- Individual Alias Files ---

Naming:   <any-name>.[np]al
Encoding: text
Style:    key-value

  Alias files are used to aggregate database volumes, filter the set
  of sequences included from those volumes using identifier lists (GI
  lists, TI lists, and OID masks), and modify user-visible statistical
  and summary data such as the database "title".  Some alias files are
  generated by scripts or programs, but they can also be hand-written.

  Lines starting with '#' or containing only whitespace are comments,
  and have no effect on the meaning of the alias file.

  Lines that are not comments should be key/value lines; these should
  consist of a non-whitespace key followed by tabs and/or spaces,
  followed by a value.  The result of processing an alias file is a
  key / value map.  Each recognized key has a specific meaning to the
  SeqDB and readdb libraries.  If a key is unrecognized, that line is
  ignored (rationale: forward compatibility).  No key should appear
  twice in the same alias file.  If a key can accept a list of values,
  all values should be listed on one line, separated by spaces.  The
  order of input lines is not important (except in Group Alias Files,
  described below) nor is end-of-line whitespace.
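  The parsing rules above can be sketched in a few lines of Python.
  This is only an illustration, not the SeqDB or readdb code; the
  function name parse_alias_text is hypothetical, and raising an error
  on a duplicate key is a choice made for this sketch (the text above
  only says that duplicates should not occur).

```python
def parse_alias_text(text):
    """Return the key/value map for the text of one individual alias file."""
    result = {}
    for line in text.splitlines():
        # Lines starting with '#' or containing only whitespace are comments.
        if line.startswith("#") or not line.strip():
            continue
        # A key, then tabs and/or spaces, then the value.  strip() drops
        # end-of-line whitespace, which is not significant.
        parts = line.strip().split(None, 1)
        key = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        if key in result:
            # Sketch-only policy: reject duplicates instead of guessing.
            raise ValueError("duplicate key: " + key)
        result[key] = value
    return result
```

  The sketch keeps every key it sees; code consuming the map would
  simply not look at keys it does not recognize.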

--- Group Alias Files ---

Naming:   index.alx
Encoding: text
Style:    key-value

  Group Alias Files are a concatenation of the normal (individual)
  alias files in a directory.  For each individual file that would
  exist, the Group Alias File should contain a line with the key of
  "ALIAS_FILE" and the value of the filename of that alias file,
  followed by that alias file's contents.
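  As an illustration of the layout just described, the following
  Python sketch concatenates alias files into group alias file text.
  It is not the production script; the function name and the
  (filename, contents) input format are assumptions made for this
  example.

```python
def build_group_alias_text(alias_files):
    """Build Group Alias File text from (filename, contents) pairs."""
    pieces = []
    for filename, contents in alias_files:
        # Mark which alias file the following lines belong to.
        pieces.append("ALIAS_FILE " + filename + "\n")
        # Make sure each embedded file ends with a newline, so the next
        # ALIAS_FILE line starts on a line of its own.
        if contents and not contents.endswith("\n"):
            contents += "\n"
        pieces.append(contents)
    return "".join(pieces)
```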

  The filename used with the ALIAS_FILE key should follow the BlastDB
  naming convention (either <abc>.nal for nucleotide databases or
  <abc>.pal for protein databases).  A separate file with this name
  need not actually exist outside the group alias file, though as of
  April 2008, these files are still kept, and are needed by the readdb
  library (which does not support the Group Alias File feature).

  The motivation for this feature was the case of opening several
  hundred microbial databases at once.  Each microbial database is
  implemented as an alias file applying filtering to one or more of a
  small set of "source" databases such as "nt".  This usage pattern
  resulted in very slow construction of SeqDB objects before the Group
  Alias File feature was introduced; with the feature, the startup
  time is much improved.  The improvement is due primarily to a
  reduction in the number of system calls needed, since one file is
  used, instead of many hundreds.  The degree of improvement is
  probably much greater on network-accessed file systems.

  For more information, see the "ALIAS_FILE" key described below.

--- Alias File Keys ---

  The following is a quick description of currently supported or used
  keys and their effects.  It includes keys found in the "standard"
  databases, as well as some that are implemented in the source code
  but not used in production (as far as I know).  For our purposes the
  "standard set" refers to the set of production databases currently
  found under $BLASTDB.  The information written here is probably more
  accurate and complete where it describes SeqDB than where it
  describes readdb.


Key:   ALIAS_FILE
Usage: (only in group alias files)
Type:  special

  This key is not found in normal alias files, but only in the Group
  Alias File, named index.alx.  This file improves SeqDB's performance
  during alias file processing for cases where a large number of alias
  files (in the same directory) are opened by one SeqDB instance.

  The Group Alias File feature works by including all alias files from
  a given directory in one large aggregated alias file named index.alx
  so that only one file needs to be opened and parsed.  This provides
  significant speedups, especially when working with network file
  systems, as it reduces the quantity of system calls, open files, and
  network traffic.

  The data for each included file must be marked to indicate to which
  alias file it belongs.  This is done by including a key/value line
  with the key of ALIAS_FILE, and the value of the original (source)
  filename, followed by the lines of data from that file.  This key is
  not considered part of the individual (embedded) alias file but only
  indicates to which alias file the following lines belong.
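  Reading a group alias file back into per-file key/value maps could
  look like the Python sketch below.  The function name is
  hypothetical, and rejecting key/value lines that appear before the
  first ALIAS_FILE line is an assumption of this sketch rather than
  documented SeqDB behavior.

```python
def split_group_alias_text(text):
    """Map each embedded alias filename to its own key/value map."""
    files = {}
    current = None  # map of the alias file currently being read
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # comment or blank line
        parts = line.strip().split(None, 1)
        key = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        if key == "ALIAS_FILE":
            # Start a new embedded file; the marker itself is not stored
            # as one of that file's keys.
            current = files.setdefault(value, {})
        elif current is None:
            raise ValueError("key/value line before any ALIAS_FILE line")
        else:
            current[key] = value
    return files
```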

  When SeqDB needs to open an alias file in a specific directory, it
  first checks whether it has ever looked for an alias file in that
  directory before.  If not, it looks for a Group Alias File (always
  named "index.alx") in that directory.  If one is found, SeqDB does
  not open, read, or even check for the existence of any individual
  (non-group) alias files from that directory, instead assuming that
  "index.alx" accurately represents all of the individual alias files.
  
  Currently, individual alias files do still exist (for the sake of
  code using the readdb library), and in directories where there is a
  group alias file, it must be kept in sync with the "individual"
  versions of the alias files.  This can be done by rebuilding the
  group alias file each time another alias file changes, or on a fixed
  schedule (once a day, for example).  Currently, the Group Alias File
  feature is only applied to the Microbial directory.  The group alias
  file is produced by the script build-alias-index, found in
  trunk/c++/src/objtools/blast/seqdb_writer.
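
  The lookup order described above (check each directory once for
  "index.alx", and trust it completely when it exists) might be
  sketched in Python as follows.  The class AliasLookup and its
  methods are hypothetical names for this example, not the SeqDB
  implementation.

```python
import os


def _parse_lines(lines):
    """Yield (key, value) pairs, skipping comments and blank lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        parts = line.strip().split(None, 1)
        yield parts[0], (parts[1] if len(parts) > 1 else "")


class AliasLookup:
    """Hypothetical helper that prefers index.alx over individual files."""

    def __init__(self):
        # directory -> {filename: key/value map}, or None when the
        # directory was checked and has no index.alx.
        self._groups = {}

    def _load_group(self, directory):
        path = os.path.join(directory, "index.alx")
        if not os.path.isfile(path):
            return None
        files, current = {}, None
        with open(path) as handle:
            for key, value in _parse_lines(handle):
                if key == "ALIAS_FILE":
                    current = files.setdefault(value, {})
                elif current is not None:
                    current[key] = value
        return files

    def get(self, directory, filename):
        """Return the key/value map for an alias file, or None."""
        # Look for a group alias file only the first time a directory
        # is visited.
        if directory not in self._groups:
            self._groups[directory] = self._load_group(directory)
        group = self._groups[directory]
        if group is not None:
            # index.alx is assumed to be complete: individual files in
            # this directory are never opened or even checked for.
            return group.get(filename)
        path = os.path.join(directory, filename)
        if not os.path.isfile(path):
            return None
        with open(path) as handle:
            return dict(_parse_lines(handle))
```

  Note the consequence shown by the sketch: an individual alias file
  that is missing from a stale "index.alx" is invisible to the lookup,
  which is why the two must be kept in sync.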

Key:   DBLIST
Usage: universal
Type:  data

  This key specifies the list of volumes and/or other alias files
  included by this alias file.  The order of the names listed here is
  not significant.  All filtering options applied by this alias file
  will be applied to all elements of this list.

  Alias files exist to aggregate and modify the contents of BLAST
  database volumes, so this key should appear in every alias file; an
  alias file not containing this key does not include any sequences.
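
  Because a DBLIST may name other alias files as well as volumes,
  finding the underlying volumes is a recursive walk.  The Python
  sketch below illustrates the idea.  It assumes that any name without
  an entry in alias_maps is a volume; that rule, the function name,
  and the cycle check are assumptions of this example, not documented
  SeqDB behavior.

```python
def resolve_volumes(name, alias_maps, _seen=frozenset()):
    """Return the list of volume names reachable from `name`.

    alias_maps maps a database name to that alias file's key/value
    map.  A name with no entry is treated as a volume (sketch-only
    assumption).
    """
    if name not in alias_maps:
        return [name]
    if name in _seen:
        raise ValueError("alias file cycle at " + name)
    seen = _seen | {name}
    volumes = []
    # All DBLIST values are on one line, separated by spaces.  An alias
    # file without DBLIST contributes no volumes at all.
    for entry in alias_maps[name].get("DBLIST", "").split():
        volumes.extend(resolve_volumes(entry, alias_maps, seen))
    return volumes
```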

Key:   FIRST_OID
Usage: (not found)
Type:  filter

  This key specifies the first OID to include from the volume
  specified by this alias file.

  OID ranges are inclusive; a range such as 10-20 is interpreted
  as including both sequence "10" and sequence "20", a total of 11
  sequences.  Since BlastDB text-based formats use 1-based notation
  for OIDs and programmatic APIs assume 0-based notation, the included
  OIDs at the API (readdb or SeqDB) level are 9 through 19.
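
  The conversion in the example above is small enough to state as
  code.  This Python sketch (the function name is hypothetical) turns
  an inclusive, 1-based FIRST_OID/LAST_OID pair into the 0-based OIDs
  seen at the API level.

```python
def api_oid_range(first_oid, last_oid):
    """Convert an inclusive 1-based OID range to 0-based API OIDs."""
    if first_oid < 1 or last_oid < first_oid:
        raise ValueError("invalid OID range")
    # Subtract one from the start; the end stays as-is because Python's
    # range() excludes its stop value while LAST_OID is inclusive.
    return range(first_oid - 1, last_oid)
```

  For the range 10-20, api_oid_range(10, 20) covers OIDs 9 through 19,
  eleven sequences in total.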

  As of 4/1/2008, no database in the standard set uses this key.

Key:   GILIST
Usage: common
Type:  filter

  This key specifies the name of a file which should contain a binary
  or text list of GI values.  For information on how these files are
  encoded, see the documentation for "ID list".

Key:   LAST_OID
Usage: (not found)
Type:  filter

  This key specifies the last OID to include from the volume specified
  by this alias file.

  OID ranges are inclusive; a range such as 10-20 is interpreted
  as including both sequence "10" and sequence "20", a total of 11
  sequences.  Since BlastDB text-based formats use 1-based notation
  for OIDs and programmatic APIs assume 0-based notation, the included
  OIDs at the API (readdb or SeqDB) level are 9 through 19.

  As of 4/1/2008, no database in the standard set uses this key.

Key:   LENGTH
Usage: common
Type:  meta-data

  Each database volume has a length, which is the total length of all
  the sequences in the database volume.  The length of an alias file
  database is taken to be the length of all the sequences in all the
  volumes (and other alias files) it includes, but not including any
  sequences removed via alias file based filtering.

  Ordinarily, if an alias file does not do any filtering, the length
  is taken to be the sum of the lengths of the included volumes and
  alias files.  If the alias file does modify its volumes via
  filtering, the length is computed by summing the lengths of all
  sequences included in the database.  The LENGTH key overrides the
  values produced in these ways.

  The advantage of specifying this field is to avoid the need for an
  iteration procedure in SeqDB, thus saving computation at runtime.
  If the DBLIST refers to databases which increase in size over time,
  the LENGTH field should be updated when the databases are, to avoid
  out-of-date data.  Otherwise, the statistical data from BLAST
  searches loses accuracy over time.  (NOTE: I'm not sure, but I think
  that in readdb, not specifying this field results in the volume size
  being considered as the included sequence size, which is a loss of
  accuracy, compared to SeqDB's spending CPU time to compute a value.)

  Note: When the user provides a GI, TI, or other ID list to the SeqDB
  constructor, the sequence iteration summing procedure will always be
  done, and this field will be ignored.
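
  The precedence described in this section can be summarized in a
  short Python sketch.  It is a simplified model, not SeqDB code: the
  function name, its parameters, and the use of the "filter" type keys
  to detect filtering are assumptions made for illustration.

```python
# Keys documented here with Type "filter".
FILTER_KEYS = {"GILIST", "TILIST", "OIDLIST", "MEMB_BIT",
               "FIRST_OID", "LAST_OID"}


def total_length(alias_map, volume_lengths, included_seq_lengths,
                 user_id_list=False):
    """Pick the database length following the rules in the text.

    alias_map            -- key/value map of the alias file
    volume_lengths       -- lengths of the included volumes/alias files
    included_seq_lengths -- callable that iterates the sequences left
                            after filtering and returns their lengths
                            (the expensive path)
    user_id_list         -- True when the caller supplied an ID list
    """
    if user_id_list:
        # A user-provided list always forces the summing iteration and
        # the LENGTH key is ignored.
        return sum(included_seq_lengths())
    if "LENGTH" in alias_map:
        # The alias file value overrides any computed value.
        return int(alias_map["LENGTH"])
    if FILTER_KEYS.isdisjoint(alias_map):
        # No filtering: add up the lengths of the included objects.
        return sum(volume_lengths)
    # Filtering without LENGTH: iterate and sum what is left.
    return sum(included_seq_lengths())
```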

Key:   MAXOID
Usage: rare
Type:  meta-data, unnecessary

  This key specifies the maximum OID for a database.  This information
  is provided by some programs which write alias files.  However, the
  maximum OID can be computed easily without looking at this field,
  and the alias file information may get out of date, so neither SeqDB
  nor readdb uses or even parses the information in this field.

Key:   MAX_SEQ_LENGTH
Usage: (not found)
Type:  meta-data

  This field is used to override the maximum sequence length field
  stored in database volumes.  This field is not used by SeqDB or
  readdb.

  Note: The purpose of this field was probably to allow pre-allocation
  of a buffer that is large enough to work on any sequence found in a
  database.  In practice this approach can fail due to incorrect alias
  file data, possibly resulting in a buffer overrun condition and a
  core dump or worse.  (A safer technique is to start with a small or
  empty buffer, expanding it whenever the next sequence wouldn't fit.)
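
  The safer technique mentioned in the note might look like the
  Python sketch below.  Python checks bounds on its own, so this only
  illustrates the growth policy; the class name and the doubling
  strategy are assumptions made for this example.

```python
class GrowableBuffer:
    """Reusable buffer that starts empty and grows only when needed."""

    def __init__(self):
        self.data = bytearray()  # no trust placed in MAX_SEQ_LENGTH
        self.used = 0            # bytes occupied by the current sequence

    def load(self, sequence):
        """Copy `sequence` (bytes) into the buffer, expanding if needed."""
        needed = len(sequence)
        if needed > len(self.data):
            # Grow to at least the needed size, and at least double, so
            # repeated small increases do not cause repeated resizing.
            new_capacity = max(needed, 2 * len(self.data))
            self.data.extend(bytes(new_capacity - len(self.data)))
        self.data[:needed] = sequence
        self.used = needed
```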

  As of 4/1/2008, no database in the standard set uses this key.

Key:   MEMB_BIT
Usage: common
Type:  filter

  The membership bit is used for the OID mask filtering technique (see
  also OIDLIST).  OID masks work a little like GI lists, but use bit
  maps of the OID space of each database volume, rather than lists of
  sequence identifiers.

  Each sequence in a database may have multiple deflines due to the
  non-redundancy feature (see blastdb_concepts.txt), so the OID
  masking technique provides a way to select which sets of deflines to
  include for each alias-file-defined database.  The MEMB_BIT in the
  alias file is matched to a set of membership bits found in each
  sequence; the sequence is included, as are those deflines which
  include the specified bit.

  Further filtering may remove additional deflines.  If all deflines
  are removed, the sequence can be considered to have been removed by
  filtering.  In the case of SeqDB (at least), this OID will not be
  removed from OID iteration, because checking for this kind of
  overlapping filtering is expensive given the current database
  design.  To check for this condition, test whether the list of
  deflines is empty.
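
  The defline selection described above can be modeled with a small
  Python sketch.  The data layout is invented for illustration: each
  defline is represented as a (text, bits) pair where bits is a set of
  membership bit numbers.  The real on-disk encoding of membership
  bits is not described here, and the function name is hypothetical.

```python
def filter_deflines(deflines, memb_bit):
    """Keep only the deflines whose membership bits include memb_bit.

    deflines -- list of (text, bits) pairs; `bits` is a set of
                membership bit numbers (a made-up representation)
    """
    return [(text, bits) for text, bits in deflines if memb_bit in bits]
```

  A caller that wants to detect a sequence removed entirely by
  filtering would test whether the returned list is empty, as the
  paragraph above suggests.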

Key:   NSEQ
Usage: common
Type:  meta-data

  This key provides the number of sequences included via this alias
  file.  If no filtering is applied to the data, it is not useful to
  provide this, since the calculation in that case is trivial.  If
  filtering is used, this field can be used to save SeqDB (or readdb)
  some CPU time that would be required to count all of the sequences
  included by the filtering configuration.  However, in some cases,
  such as with user-provided ID/GI/TI list filtering, SeqDB will do
  this iteration anyway, on the basis that the alias file data cannot
  be a correct estimate in light of the user ID list.

  For more information on the tradeoffs involved, see the description
  of the LENGTH field.

Key:   OIDLIST
Usage: common
Type:  filter

  This key specifies the name of a file which should contain a binary
  bitmap of the OIDs in a volume.  For information on how these files
  are encoded, see the documentation for "ID list".

  Typically, this is used for non-redundant databases, where multiple
  deflines exist for each sequence.  Since there is usually a desire
  to filter at the defline (rather than just OID) level, this key is
  usually used in combination with the MEMB_BIT key (see also).

Key:   STATS_NSEQ
Usage: common
Type:  meta-data

  This key provides the number of included sequences, for statistical
  purposes, for this alias file.  This key's value is never computed
  automatically, and does not correspond to a database volume field.
  If this key is missing, the field value is returned as zero. (The
  code using this information will select another value to use.)

Key:   STATS_TOTLEN
Usage: common
Type:  meta-data

  This key provides the total of all sequence lengths, for statistical
  purposes, for this alias file.  This key's value is never computed
  automatically, and does not correspond to a database volume field.
  If this key is missing, the field value is returned as zero. (The
  code using this information will select another value to use.)

Key:   TILIST
Usage: common
Type:  filter

  This key specifies the name of a file which should contain a binary
  or text list of TI (trace identifier) values.  For information on
  how these files are encoded, see the documentation for "ID list".

  As of 4/1/2008, no database in the standard set uses this key.
  (But this feature is new and future use is planned.)

Key:   TITLE
Usage: common
Type:  meta-data

  All BlastDB database volumes include a TITLE field, a string which
  is specified when the volume is formatted.  The title of an alias
  file is defined recursively -- the value from the alias file TITLE
  key line if one is present, otherwise, the concatenated TITLES of
  included volumes and alias-files, interleaved with "; ".
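
  The recursive definition above translates directly into code.  In
  this Python sketch, the function name, the alias_maps and
  volume_titles arguments, and the rule that a name without an alias
  map is a volume are all assumptions made for illustration.

```python
def database_title(name, alias_maps, volume_titles):
    """Return the title of a volume or alias file, as defined above.

    alias_maps    -- database name -> key/value map of its alias file
    volume_titles -- volume name -> TITLE stored in that volume
    """
    alias_map = alias_maps.get(name)
    if alias_map is None:
        # Not an alias file: use the title stored in the volume.
        return volume_titles[name]
    if "TITLE" in alias_map:
        # An explicit TITLE line wins and stops the recursion.
        return alias_map["TITLE"]
    # Otherwise join the titles of everything in DBLIST with "; ".
    parts = [database_title(entry, alias_maps, volume_titles)
             for entry in alias_map.get("DBLIST", "").split()]
    return "; ".join(parts)
```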

  This concatenation will often have annoying duplication for multiple
  volume databases, and for single volume alias files with filtering,
  it can be misleading, such as "All nucleotide sequences.".  Usually
  it is desirable to replace this generated title with something more
  concise and expressive, describing the intention of the alias file
  and possibly any filtering or other modifications.

---
