Humdrum Extras

tindex manpage


    tindex -- Create melodic search indexes from **kern data for use with the themax search engine.


    tindex [-E] [-f string] [-a] [-b] file(s)/directory(ies) [ > output ]


-a Include all pitch and rhythm features in output index data (equivalent to -pr).
-b Include bibliographic (reference) records in output index data.
-B string Filter bibliographic (reference) records according to the list given in string.
-d dir Append directory name to filename tags. Removes any existing directory prefix.
-D Suppress directory prefix in filename tags.
-E Suppress extra fields (key and meter descriptions).
-f string Extract particular features using a list of tags for the features to extract.
-G Do not index grace notes.
-i Store the instrument name within the initial tag.
-l # Limit the number of notes processed for each entry to the first # notes of the file.
-r Include all rhythmic fields in output. If used without -p, then it will suppress the extraction of pitch features. The -a option will also include all rhythm features in the extracted feature index.
-p Include all pitch features in output (default behavior).
-Q Turn off quiet mode. Don't suppress control messages which start with '#' and describe important option settings for the themax and theloc commands.
-t file Substitute the filename in the first column of output data with a different string (typically a control number for database management).
--end For chords (multi-stops), encode the last note in the token rather than the first note.
--fermata Encode fermatas as segmentation boundaries.
--phrase Encode phrase endings as segmentation boundaries.
--moly Extract multiple monophonic entries from a polyphonic data file. Only the first sub-spine layer in each spine will be parsed. The default behavior of tindex is to extract data from all layers of all **kern data spines.
--rest Encode rests as segmentation boundaries.


    The tindex program generates a music search index that is used as input to the music searching program themax. It is used to generate search indexes of short monophonic incipits for Themefinder, but it can also create search indexes for complete polyphonic compositions encoded in the Humdrum file format as **kern data.

    The tindex program is an expanded version of the original themebuilder command written in AWK by David Huron, utilizing Humdrum Toolkit commands to extract pitch features from monophonic **kern data. The tindex program can emulate the original version by using the --mono option.

    The tindex program allows indexing of polyphonic music consisting of strictly monophonic voices (molyphonic) as well as more complicated polyphony characteristic of keyboard music where voices enter and drop out from the overall texture of the music. The tindex/themax program pair can only handle monophonic sequences, so the tindex program can either choose the first or last note listed in a chord (multi-stop token). Other added capabilities include rhythmic feature extraction, reference (bibliographic) record extraction, selective feature extraction, grace note inclusion/exclusion, and segmentation boundary encoding.

    As a basic example, consider the following six Humdrum files:


    Passing these files to tindex will generate a search entry line for each file:

    tindex ex?.krn > index.thema

    Application of the search index

    Once the index data has been created, it can be used by the themax command to search for feature patterns in the index. For example, themax can be used to search for the song(s) which contain the pitch sequence "e-flat, f, g". The default output of the themax command is a list of entries from the original thema index file created by tindex which match the search query. In this case there is one file (ex4.krn) which contains the pitch sequence:

    themax -p "e- f g" index.thema

    If you want to perform an "AND" search with another independent musical feature, then the output from themax can be piped into another call to the program with the matches from the first search. To search for features in parallel (such as pitch and rhythm at the same time), the search queries are given as multiple options to a single call to themax. For example, the "e-flat, f, g" sequence occurs with the durations "dotted-half, eighth, eighth" which can be queried by the -u option:

    themax -p "e- f g" -u "2. 8 8" index.thema

    An additional program called theloc (thema location) can then be used to identify the location in the original file when the --location option is given to themax. In this case "=1B1" means the matched sequence occurs starting at measure 1, beat 1 in the original data file (ex4.krn):

    themax -p "e- f g" -u "2. 8 8" index.thema --location | theloc -N

    An additional output option from the theloc program will also mark the individual notes which caused the match. In the following search, the "e-flat, f, g" sequence is searched without considering the rhythm, and there are two locations in the file where the query is found. The --ending option has to be supplied along with the --location option so that both the starting and ending notes of the matches can be highlighted in the output data (otherwise, only the first note at the start of the match will be marked).

    themax -p "e- f g" index.thema --location --ending | theloc --mark

    Each matched note is marked with an "@" character, and an !!!RDF: record explaining that character's meaning (a matched note) is given at the bottom of the file. The example on the right is generated by adding the --tie option to the theloc program so that it will highlight all tied notes after the first note in a group of notes tied together. The marking character can be used to locate matches in a text editor (by searching for "@" in the resulting file), or it can be used to highlight the notes in graphical music notation, such as by coloring the matched notes red:

    Thema index pitch fields

    Each line contains multiple fields separated by tab characters, and each field except the first one at the start of the line begins with a unique tag character to facilitate searches in the themax command. The ten tab-separated default entries on each line are:

    1. An identification string which, by default, contains the name of the original file followed by two colons (which may be split by an instrument tag), and then the spine number in the original file from which the extracted features occur.
    2. key -- starting with uppercase Z for major modes or lowercase z for minor modes, then the tonic note of the key (in uppercase), terminated by an equals sign (=). Example: ZG= which represents G major. Note that there must be a key designation record in the file in order for the key to be extracted into the index, and only the first key designation in the spine will be encoded in the index.
    3. twelve-tone interval -- starting with an open curly brace ({), then a string of intervals without spaces, using m (minus) for falling melodic intervals, p (positive) for rising intervals (p is also used before repeated notes). Example: {m2m1m4m1p3m2p4m2p3m1p3m12p4p1p0
    4. pitch refined contour -- starting with # and followed by five possible characters: d = down a diatonic step, D = down a diatonic leap (greater than an interval of a 2nd), s = same pitch (repeated note), u = up a step, U = up a leap. Example: #ddDdUdUdUdUDUus
    5. pitch gross contour -- a three-level description of the melodic contour (as opposed to 5 for refined contour). The data field starts with a colon (:), and then has three possible characters: U = up (next note is a higher pitch than the current note), S = same or repeated note, D = down. Example: :DDDDUDUDUDUDUUS
    6. scale degree -- starts with a percent sign (%) followed by the numbers 1 through 7 to indicate the seven diatonic steps of a major or minor scale. Accidentals are ignored, so both C and C# in C major are labeled as 1. Example: %5431721324355711 Note that there must be a key designation record in the file in order for the scale degrees to be extracted from the data.
    7. musical interval -- abbreviations of the standard names for musical intervals. This field starts with a right curly brace (}), followed by a sequence of intervals without spaces which consist of three parts: (1) the interval direction (x for down, X for up), (2) the quality of the diatonic interval (M=major, m=minor, P=perfect, A=augmented, d=diminished), and (3) the diatonic distance as a number, such as 3 for a third. Example: }xM2xm2xM3xm2Xm3xM2XM3xM2Xm3xm2Xm3xP8XM3Xm2P1
    8. twelve-tone pitch class -- starting with a j, followed by the pitch class numbers, with C = 0, C-sharp/D-flat = 1, D = 2, and so on. For two-digit pitch classes, letters of the alphabet are substituted: A-sharp/B-flat = 10 -> A, and B/C-flat = 11 -> B. Example: j20B7697B90B22677
    9. diatonic pitch class -- starting with J and followed by the pitch class names. This field is the only one which separates individual notes by spaces. Diatonic pitch names are in upper case (A through G) followed by any accidentals: # for sharps/double sharps, and - for flats/double flats. Example: JD C B G F# A G B A C B D D F# G G
    10. metric description -- starting with an M, followed by the numeric values for the time signature, and then followed by quadruple, triple, etc which describes the type of metric cycle, followed by simple or compound depending on if the top number in the time signature is divisible by 3. Example: M4/4quadruplesimple
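
    The tag characters above make an index line easy to take apart programmatically. The following Python sketch is illustrative only and is not part of tindex or themax: it splits one tab-separated entry into its tagged fields, decodes the twelve-tone interval field, and rederives the pitch gross contour from it. The sample line is assembled from the example values in the list above; the identification string ex1.krn::1 and all function names are hypothetical.

```python
import re

# Sample entry assembled from the example values listed above.
# ASSUMPTION: the identification string "ex1.krn::1" is a made-up placeholder.
SAMPLE_ENTRY = "\t".join([
    "ex1.krn::1",
    "ZG=",
    "{m2m1m4m1p3m2p4m2p3m1p3m12p4p1p0",
    "#ddDdUdUdUdUDUus",
    ":DDDDUDUDUDUDUUS",
    "%5431721324355711",
    "}xM2xm2xM3xm2Xm3xM2XM3xM2Xm3xm2Xm3xP8XM3Xm2P1",
    "j20B7697B90B22677",
    "JD C B G F# A G B A C B D D F# G G",
    "M4/4quadruplesimple",
])


def split_entry(line):
    """Return (identifier, {tag character: field body}) for one index line."""
    fields = line.rstrip("\n").split("\t")
    features = {field[0]: field[1:] for field in fields[1:] if field}
    return fields[0], features


def decode_intervals(body):
    """Turn a string such as 'm2p3p0' into signed semitone counts [-2, 3, 0]."""
    return [-int(size) if sign == "m" else int(size)
            for sign, size in re.findall(r"([mp])(\d+)", body)]


def gross_contour(intervals):
    """Map each interval to U (up), D (down) or S (same pitch)."""
    return "".join("U" if i > 0 else "D" if i < 0 else "S" for i in intervals)


identifier, features = split_entry(SAMPLE_ENTRY)
semitones = decode_intervals(features["{"])
print(identifier)
print(semitones)
# The contour derived from the intervals agrees with the ':' field.
print(gross_contour(semitones) == features[":"])
```

    The dictionary is keyed by the tag character itself, so the key field appears under "Z" for major keys and under "z" for minor keys.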

    Rhythmic analysis option

    Using the -r option extracts eight rhythmic features into the output search index. When the -r option is used alone, the pitch features are suppressed. To include both pitch and rhythm features, use the option pair -p -r, or use -a (for all) to include all musical features.

    1. duration gross contour (~)
    2. duration refined contour (^)
    3. duration (as an inter-onset-interval) (!)
    4. beat level (&)
    5. metric level (`)
    6. metric refined contour (')
    7. metric gross contour (@)
    8. beat position (=)

    Listed below are example rhythmic features extracted from the six melodies given above. The first example extracted only the rhythmic features with the -r option, while the second example extracts all musical features with the -a option (both pitch and rhythm features).

    tindex -r ex?.krn > index.thema

    tindex -a ex?.krn > index.thema

    Selective feature indexing

    By default, tindex will extract all pitch features and no rhythmic features (to simulate the behavior of the original themebuilder program). To extract only the rhythm features, use the -r option. To extract all pitch and all rhythm features, use either -a or -p -r.

    However, if you only want a specific subset of the extractable features, use the -f option followed by a list of the features to extract according to the feature tags in the following table. This option is useful when only specific musical features will be searched. In this case, the index file size will be minimized and search processing time will be decreased, because only the desired musical features are included.

    Feature tag(s) -- musical feature
    PCH, P, PC -- Diatonic Pitch Class: C, C#, D-, D, E--, F#, F##, G, etc.
    MI, DI, INT, I -- Diatonic Interval: +P5, -3, P1, +P8, etc.
    SD, S, D -- Diatonic Scale Degree: 1, 2, 3, 4, 5, 6, 7.
    Twelve-tone Pitch Class: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B.
    Twelve-tone Interval: 0, -1, +2, +5, -7, etc.
    PGC, GC, CON -- Pitch Gross Contour: U, D, S.
    PRC, RC -- Pitch Refined Contour: U, u, D, d, S.
    DUR, IOI -- Duration: 8, 4, 16, 8., 12, etc.
    DGC, RGC -- Duration Gross Contour:
    DRC, RRC -- Duration Refined Contour:
    Beat Level
    Metric Level
    Metric Position
    Metric Refined Contour
    Metric Gross Contour

    When a particular feature has more than one tag, the tags are aliases for the same musical feature. For example, to extract only the diatonic pitch class feature, use the option -f "PCH", or equivalently -f "P" or -f "PC".

    tindex -f "PCH" ex?.krn > output.thema

    The musical key and meter features can be suppressed with the -E option (meaning: no extra features).

    tindex -E -f "PCH" ex?.krn > output.thema

    Multiple selected features can be extracted by adding them to the -f option string, separated by one or more space characters. The ordering of the features in the option string does not matter: all features will be output in the canonical order required for searching with the themax command. In the following example, the pitch and duration features are extracted. Even though the features are listed in the order duration/pitch, the output index is ordered pitch/duration.

    tindex -E -f "DUR PCH" ex?.krn > output.thema
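
    The alias resolution and reordering described above can be modeled with a small lookup table. This Python sketch is not taken from the tindex source. It assumes that the canonical order follows the order in which the fields are listed in this manpage, and it covers only the features whose tags appear in the table above. The names ALIASES and resolve_features are hypothetical.

```python
# Alias groups copied from the feature tag table above.
# ASSUMPTION: the canonical output order is the order in which this manpage
# lists the fields (pitch features first, then rhythm features).
ALIASES = {
    "PRC": ("PRC", "RC"),
    "PGC": ("PGC", "GC", "CON"),
    "SD": ("SD", "S", "D"),
    "MI": ("MI", "DI", "INT", "I"),
    "PCH": ("PCH", "P", "PC"),
    "DGC": ("DGC", "RGC"),
    "DRC": ("DRC", "RRC"),
    "DUR": ("DUR", "IOI"),
}
CANONICAL_ORDER = list(ALIASES)  # dicts preserve insertion order
LOOKUP = {alias: name for name, group in ALIASES.items() for alias in group}


def resolve_features(option_string):
    """Map a space-separated -f style string to canonical names, in output order."""
    chosen = set()
    for tag in option_string.split():
        if tag not in LOOKUP:
            raise ValueError(f"unknown feature tag: {tag}")
        chosen.add(LOOKUP[tag])
    return [name for name in CANONICAL_ORDER if name in chosen]


print(resolve_features("DUR PCH"))  # ['PCH', 'DUR']
```

    Collecting the requested features in a set first means that repeated tags or several aliases of the same feature select it only once.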

    Including segmentation markers

    Rests, fermatas and phrase endings can be used to segment the output stream of pitches/rhythms so that searches will not cross over boundaries defined by these features. If any of the following options are used to include segment markers (R) in the output data, control messages will be output with the index data if the -Q option is used to enable them.

    Rests as segmentation boundaries

    The --rest option will cause "R" segmentation markers to be placed within all extracted pitch and rhythm features whenever pitch sequences are separated by one or more rests. Only one rest marker will be inserted between two pitch/rhythm features, even if there are multiple intervening rests.

    For pitch/rhythm interval features, if a single pitch is surrounded by rests both before and after the note, there will be two "R" markers in a row in the extracted features data (see example below). This is used to keep track of the number of notes in the original musical data for later alignment of the search matches in the musical data.

    tindex -a input > output
    tindex -a --rest input > output
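
    The difference between note features and interval features can be seen in a small model. The Python sketch below is an illustration consistent with the description above, not the tindex implementation. Notes are given as MIDI note numbers and rests as the string "r". It assumes that rests before the first note or after the last note produce no marker, which this manpage does not state. All function names are hypothetical.

```python
def split_at_rests(tokens):
    """Group notes into segments; any run of rests ends the current segment."""
    segments, current = [], []
    for token in tokens:
        if token == "r":
            if current:
                segments.append(current)
                current = []
        else:
            current.append(token)
    if current:
        segments.append(current)
    return segments


def note_feature(tokens):
    """Per-note feature: one R between segments, however many rests intervene."""
    out = []
    for index, segment in enumerate(split_at_rests(tokens)):
        if index > 0:
            out.append("R")
        out.extend(segment)
    return out


def interval_feature(tokens):
    """Per-interval feature: a one-note segment has no interval, so R R results."""
    out = []
    for index, segment in enumerate(split_at_rests(tokens)):
        if index > 0:
            out.append("R")
        out.extend(b - a for a, b in zip(segment, segment[1:]))
    return out


melody = [60, 62, "r", "r", 64, "r", 65, 67]
print(note_feature(melody))      # [60, 62, 'R', 64, 'R', 65, 67]
print(interval_feature(melody))  # [2, 'R', 'R', 2]
```

    The isolated note 64 contributes no interval of its own, which is why the two markers end up adjacent in the interval feature while the note count can still be recovered.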

    Fermatas as segmentation boundaries

    Like rests, fermatas can function as segmentation boundaries, particularly by implicitly marking phrase endings. Use the --fermata option to add an R marker to the feature sequences when a pitch with a fermata is found in the input data. Fermatas must occur on the first note in a chord (multi-stop) **kern token. Fermatas on rests are ignored; include the --rest option instead. The following example shows some sample music with fermatas and rests. The data can be extracted in three ways which will cause different segmentation markers (R) to be inserted within the output indexing data. Notice that when --fermata and --rest are both used at the same time, only one R segmentation marker will be generated in the output index data.

    no segmentation
    --fermata only
    --fermata & --rest
    --rest only

    Phrase endings as segmentation boundaries

    Phrase endings (}) can also be used to mark segmentation boundaries with the R character in the output feature index data. Phrase endings can fall on rests.

    no segmentation
    --phrase only
    --phrase & --rest
    --rest only

    Grace notes

    By default, grace note information is extracted from the input music. If grace notes should be suppressed in the output index, use the -G option. Use the -Q option to output the #NOGRACE control message, which is needed by the theloc command to work properly. Also use the -Q option in themax when passing the data to theloc.

    tindex input > output
    adding the -G option

    Polyphonic option

    Polyphonic data extraction can be done by using the --poly option. This option extracts multiple entries for a file, with one line for each **kern spine in the file.

    Only the first layer of a spine is used for building an index. For example, here is a file with two spines of **kern data:


    Running the command "tindex --poly poly.krn" will generate two entries:

    Each entry adds a double colon (::) after the filename (or text string substitution when using the -t option), followed by the spine number from which the indexing data was extracted. Note that the second column of data in the second spine is currently ignored.

    Fully polyphonic melodic extraction

    Use the --poly2 option to extract all sub-spine melodic sequences from a file. The --poly option will only extract the first layer (first occurrence of a given spine on the line), while --poly2 will extract all layers (all sub-spine occurrences on the line).

    tindex --poly2 poly.krn -E -f "P"

    Notice that with the --poly2 option, an additional line is added to the output index data. The third line in the above index data represents the pitch sequence found in the second sub-spine (i.e., the second layer) of the second spine. Secondary sub-spine data is indicated in the voice number that follows the filename: the sub-spine number is placed after the primary spine number, separated from it by a period. In the above example "2.2" means that the sequence is from the second spine in the file (first 2), and in the second sub-spine in spine 2 (second 2).

    When secondary sub-spines are not contiguous, a segmentation marker will be added to the output data.


    tindex --poly2 poly2.krn -EfP

    In the above example, the pitch sequence "ff gg" in the second sub-spine of the second spine is not immediately followed by the pitch sequence "bb ccc" later in the sub-spine. Therefore, a segmentation marker (R) is added between these two sub-sequences. In addition, since the second sub-spine (second layer) of the second spine does not start at the beginning of the music, a segmentation marker starts the index data for the second layer of the second spine.

    Instrument label

    In order to allow searching by instrument, the -i option can be given to store an instrument name within the initial tag field of an index entry. Instrument names are given in Humdrum **kern data with a tandem interpretation starting with the characters *I:. Currently the instrument label will only work when the --poly or --poly2 option is also given.


    Running the command "tindex -i --poly instrument.krn" will generate two entries which include the instrument label:

    Including bibliographic records

    When the -b option is given to tindex, all bibliographic (reference) records found in the input Humdrum file will be appended to the end of the feature list on an output index line. All bibliographic records will be placed in sorted ASCII order which is required for searching in multiple records using themax. Each bibliographic entry will be separated by a tab character on the output index line.

    The -B option can be used to select only particular bibliographic records to store in the output index data. For example, to only store title records (if they are present in the input data), then the option would be -B "OTL". In this case all other bibliographic records, such as COM (composer's name records) will be suppressed.

    To allow more than one bibliographic record type in the -B record filter string, the bibliographic keys should be separated by spaces, colons, and/or commas. For example, to allow only the composer and title in the output index, use -B "COM, OTL" or -B "COM:OTL". The order of the bibliographic keys in the argument string for -B is not important, since the output index data will always list bibliographic records in sorted ASCII order.

    The bibliographic keys within the -B string are actually regular expressions. This allows for more specific filtering rules, such as:

    1. -B "^C" == allow all bibliographic records which start with a capital C, such as COM (composer) and CDT (composer's dates).
    2. -B "^C,^O" == allow all bibliographic records which start with either a C or an O.
    3. -B "^OTL$" == allow bibliographic records which match exactly to OTL, and suppress records such as OTL1, OTL@@FRE (title in the original language of French), or OTL@ENG (title translated into English).

    By default -B "OTL" will match to bibliographic keys such as: OTL, OTL1, OTL@@FRE, OTL@ENG since all of them contain the string "OTL". The regular expression anchors for start and end of line (^ and $) are local to each bibliographic key in a -B option string.
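
    The filtering rules above can be checked with a short Python model. This is a sketch of the behavior as described in this section, not code from tindex: the filter string is split on spaces, colons and commas, each piece is treated as a regular expression searched within a single bibliographic key, and the surviving keys are returned in sorted ASCII order. The function name filter_bib_keys is hypothetical, and Python's re module may differ in detail from the regular expression library that tindex uses.

```python
import re


def filter_bib_keys(keys, filter_string):
    """Keep keys matched by any pattern in a -B style filter string."""
    patterns = [p for p in re.split(r"[\s:,]+", filter_string) if p]
    kept = {key for key in keys
            if any(re.search(pattern, key) for pattern in patterns)}
    # sorted() orders plain ASCII strings by character code.
    return sorted(kept)


keys = ["OTL@ENG", "COM", "OTL", "CDT", "OTL1", "OTL@@FRE"]
print(filter_bib_keys(keys, "OTL"))      # ['OTL', 'OTL1', 'OTL@@FRE', 'OTL@ENG']
print(filter_bib_keys(keys, "^OTL$"))    # ['OTL']
print(filter_bib_keys(keys, "^C"))       # ['CDT', 'COM']
print(filter_bib_keys(keys, "COM:OTL"))  # ['COM', 'OTL', 'OTL1', 'OTL@@FRE', 'OTL@ENG']
```

    Because each pattern is applied to one key at a time, the ^ and $ anchors refer to the start and end of that key, which matches the statement above that the anchors are local to each key.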

    Control messages

    Command-line settings which can affect the operation of themax and theloc are stored in control messages in the output data if the -Q option is specified. These messages start with a hash sign (#). All of these messages are suppressed in the output if the -Q option is not given. Messages will not contain tab characters on the line, which could interfere with the search mechanism within themax. Here is a list of the messages which may occur:

    The --rest option was given in the command-line call to tindex. This option includes "R" markers in the output data which are used to prevent pitch sequences from crossing rest boundaries. The default behavior, #NOREST, will be included in the index if the --rest option has not been used, so that multiple indexes extracted with different option settings can be processed together properly.
    The --fermata option was given in the command-line call to tindex. This option includes "R" markers in the output data which are used in a similar manner to the --rest option to prevent pitch sequences from crossing phrase boundaries. Fermatas are encoded in **kern data as semicolons (;). The default behavior, #NOFERMATA, will be included in the index if the --fermata option has not been used, so that multiple indexes extracted with different option settings can be processed together properly.
    The --phrase option was given in the command-line call to tindex. This option includes "R" markers in the output data which are used in a similar manner to the --rest and --fermata options to prevent pitch sequences from crossing phrase boundaries. Phrase endings are encoded in **kern data as closing curly braces (}). The default behavior, #NOPHRASE, will be included in the index if the --phrase option has not been used, so that multiple indexes extracted with different option settings can be processed together properly.
    The -G option was given in the command-line call to tindex. This option suppresses grace note indexing in the output data. If you use the -G option, this control message is required as input into the theloc command (or the option to specify that grace notes were ignored). The default behavior, #GRACE, will be included in the index if the -G option has not been used, so that multiple indexes extracted with different option settings can be processed together properly.
    The --overlap option in themax will cause the #OVERLAP message to be printed in its output (currently only in certain cases). A "#NOOVERLAP" message may be present in the output from themax to indicate that this option was not used.
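
    A program that reads index data written with -Q has to tell control messages apart from index entries. The Python sketch below shows one way to do that, based only on the rule above that control messages start with a hash sign. The sample lines and the function name split_index are hypothetical.

```python
def split_index(lines):
    """Separate control messages (lines starting with '#') from index entries."""
    messages, entries = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            messages.append(line)
        else:
            entries.append(line)
    return messages, entries


# Hypothetical index data: two control messages and two entries.
sample = [
    "#NOREST\n",
    "#NOGRACE\n",
    "ex1.krn::1\t#ddU\tJC B A C\n",
    "ex2.krn::1\t#uud\tJE- F G F\n",
]
messages, entries = split_index(sample)
print(messages)      # ['#NOREST', '#NOGRACE']
print(len(entries))  # 2
```

    The pitch refined contour field also begins with a hash sign, but it never starts a line because every entry begins with its identification string, so testing only the first character of the line is sufficient.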

    Directory processing

    The tindex program can be given a mixture of files and directories as command-line arguments. Each Humdrum file will generate a line of data in the output index. If a directory is given to the program, then all files within that directory and its subdirectories will be processed if they end in .krn or .thm. The path names of the files will be included in the output index data.
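
    The file selection rule can be sketched in Python as follows. This is a model of the behavior described above rather than the tindex implementation: directories are searched recursively for names ending in .krn or .thm, while arguments that are not directories are passed through unchanged. The function name collect_inputs is hypothetical, and the alphabetical visiting order is an assumption made here so that the result is deterministic.

```python
import os


def collect_inputs(paths, extensions=(".krn", ".thm")):
    """Expand a mixed list of files and directories into a list of files."""
    found = []
    for path in paths:
        if os.path.isdir(path):
            for root, dirs, files in os.walk(path):
                dirs.sort()  # visit subdirectories in a stable order
                for name in sorted(files):
                    if name.endswith(extensions):
                        found.append(os.path.join(root, name))
        else:
            # Explicitly named files are kept whatever their extension.
            found.append(path)
    return found
```

    Each returned path keeps its directory prefix, corresponding to the path names that appear in the output index data.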

    Chord processing

    The tindex program processes sequences of notes, and therefore it is not useful for searching notes occurring at the same time (see sonority for that). When tindex encounters a chord (or multi-stop) token, it processes only the first note in the token. Typically this note is the lowest note in the chord (although this is not required). If you instead prefer the highest note in the chord, use the --end option to extract the last note in multi-stop tokens.


    tindex -E -f "PCH" chord.krn > output.thema

    tindex --end -E -f "PCH" chord.krn > output.thema
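
    The choice between the first and last note of a chord amounts to picking one sub-token. In **kern data the notes of a multi-stop token are separated by spaces, so a Python sketch of the selection (not the tindex source; the function name chord_note is hypothetical) can be as short as this:

```python
def chord_note(token, use_last=False):
    """Pick the note to index from a **kern token that may be a chord."""
    notes = token.split()
    if not notes:
        raise ValueError("empty token")
    # Default: first note in the token; use_last mimics the --end option.
    return notes[-1] if use_last else notes[0]


print(chord_note("4c 4e 4g"))                 # 4c
print(chord_note("4c 4e 4g", use_last=True))  # 4g
print(chord_note("8dd"))                      # 8dd
```

    A token holding a single note is returned unchanged in both modes, so the option only affects chords.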

    Note offsets

    When data is processed with tindex, the usual assumption is that the first note of the data is the first note in the music. If you want to partially index a musical score, chop it into selected pieces and index each piece separately. In order to link back to the score with theloc, add a comment like this to the start of the extracted **kern spine:
    This comment will be read by tindex as a note offset value which will be stored after the voice number, preceded by a semicolon.
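
    Taken together with the earlier sections, the identification string at the start of an index line can carry a filename, an optional instrument tag between the two colons, a spine number, an optional sub-spine number after a period, and an optional note offset after a semicolon. The Python sketch below parses that layout. The exact syntax is inferred from the descriptions in this manpage and may not match every string that tindex produces; the instrument tag "violn" and the function name parse_identifier are hypothetical.

```python
import re

# ASSUMED layout: filename:instrument:spine[.subspine][;offset]
ID_PATTERN = re.compile(
    r"^(?P<filename>.*?):(?P<instrument>[^:]*):"
    r"(?P<spine>\d+)(?:\.(?P<subspine>\d+))?(?:;(?P<offset>\d+))?$"
)


def parse_identifier(identifier):
    """Split an identification string into its named parts."""
    match = ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"unrecognized identifier: {identifier!r}")
    parts = match.groupdict()
    for key in ("spine", "subspine", "offset"):
        if parts[key] is not None:
            parts[key] = int(parts[key])
    return parts


print(parse_identifier("poly.krn::2.2"))
print(parse_identifier("twometer.krn::1;7"))
print(parse_identifier("instrument.krn:violn:1"))
```

    Parts that are absent from the string are reported as None, except for the instrument tag, which is an empty string when the two colons are adjacent.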

    For example, the following example contains music in 2/4 and 3/4. Since each entry in a thema index can only indicate a single key and meter, the music can be chopped into two segments, one for each section. The second segment of the music starts with the 7th note of the original music, so add !noff:7 before the first data line in the second segment:

    1st part
    2nd part

    When tindex processes the two parts, the note offset value will be stored in the entry for the second segment:

    tindex -p twometer[AB].krn

    In order to fully link back to the original file, add a global comment to the segmented files which gives the name of the original file:

          !!original-filename: twometer.krn

    Then when the index data is created with tindex the original filename will be used instead of the segment's filename:

    1st part
    2nd part
    tindex -p twometer[AB]2.krn

    Now when themax is used, the correct note numbers will be marked. For example, searching for the pitch sequence "G A" should find two matches—one starting on note 4 and the other starting on note 7 in the original file.
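
    The arithmetic behind these note numbers is simple. The Python sketch below follows the example above, in which a segment marked !noff:7 begins on note 7 of the original file. It is not code from themax or theloc, and the default offset of 1 for a segment without a !noff comment is an assumption. The function name original_note_number is hypothetical.

```python
def original_note_number(local_note, note_offset=1):
    """Convert a 1-based note number within a segment to the original file.

    ASSUMPTION: note_offset is the original note number of the segment's
    first note, and a segment without a !noff comment starts at note 1.
    """
    if local_note < 1 or note_offset < 1:
        raise ValueError("note numbers are 1-based")
    return note_offset + local_note - 1


print(original_note_number(4))                 # 4: match in the first segment
print(original_note_number(1, note_offset=7))  # 7: first note of the second segment
```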

    themax -p "ga" --loc twometer2.index

    This information can be fed into theloc to mark the matched notes in the original file:

    cat twometer.thema | theloc -m

    Which can then be converted to highlighted notes in a conversion to graphical music notation:

    tindex twometer[AB]2.krn | themax --loc -p "ga" | theloc -m \
    | autostem | hum2muse | muse2ps =z21j | pstopnm -dpi=300 \
    | convert - -trim -negate -alpha copy -resize '33%' -negate towmeter.png

    If you want to search only the music in triple meter, the split data segments make this possible:

    tindex twometer[AB]2.krn | themax --loc -T 3/4 -p "ga" | theloc -m \
    | autostem | hum2muse | muse2ps =z21j | pstopnm -dpi=300 \
    | convert - -trim -negate -alpha copy -resize '33%' -negate towmeterTriple.png



    Input arguments or piped data which are expected to be Humdrum files can also be web addresses. For example, if a program can process files like this:
           program file.krn
    It can also read the data over the web:
    Piped data works in a somewhat similar manner:
           cat file.krn | program
    is equivalent to a web file using this form:
           echo | program

    Besides the http:// protocol, there is another special resource indicator prefix called humdrum:// which downloads data from the kernscores website. For example, using the URI humdrum://brandenburg/bwv1046a.krn:

          program humdrum://brandenburg/bwv1046a.krn
    will download the URL:
    Which is found in the Musedata Bach Brandenburg Concerto collection.

    This online-access of Humdrum data can also interface with the classical Humdrum Toolkit commands by using humcat to download the data from the kernscores website. For example, try the command pipeline:

          humcat humdrum://brandenburg/bwv1046a.krn | census -k


    • thememakerx: extracts incipits from Humdrum files of entire compositions.
    • themax: uses the output index data from tindex to search for melodic sequences.
    • theloc: takes output from themax and identifies the location of matches within the original files which were indexed by tindex.
    • sonority: useful for harmonic searching rather than melodic searching.


    The compiled tindex program can be downloaded for the following platforms:
    • Linux (i386 processors) (dynamically linked) compiled on 6 Oct 2013.
    • Windows compiled on 29 Jun 2012.
    • Mac OS X/i386 compiled on 13 Nov 2013.

    The source code for the program was last modified on 7 Apr 2013.