Humdrum Extras

simil manpage


COMMAND

    simil -- Edit-distance similarity measurements.

SYNOPSIS

    simil [-nr[-x lengthinputfile [templatefile[> output]

OPTIONS

-n Do not normalize the similarity measurement by the length of the template. (See also the -R option)
-r reverse the order of the input and template files on the command-line (so that one or the other can be input from standard input or piped from another command).
-x # variable template mode, examining template substrings of length # (which is less than the length of the template).

Extended options:
-a Append input source data used in analysis. See -s option as well. (not implemented yet)
-N Ignore null records in source spine.
-p Prepend input source data used in analysis. See -s option as well. (not implemented yet)
-R Raw edit distances. Do not scale or normalize the edit distance values.
-s display the search sequences in a parallel spine to the **simil data. [more info]
-S Do not print spaces between tokens when using the -s option. [more info]
-t # Print matches which only exceed a threshold similarity value. The similarity measurements are values in the range from 0 (low similarity) to 1 (high similarity). The default value is 0. [more info]
-w weights Import edit weights from a file instead of using default values of 1.0 for all editing operations. [more info]
-W Print a list of the edit distance weights used in the analysis as global comments at the bottom of the output. [more info]

Individual editing weights can be changed on the command-line:

--d1 # Change the score for deleting a non-repeated token in the source string.
--d2 # Change the score for deleting a non-repeated token in the template string.
--r1 # Change the score for deleting a repeated token in the source string.
--r2 # Change the score for deleting a repeated token in the template string.
--s0 # Change the score for substituting a token that is repeated in neither the source string nor the template string.
--s1 # Change the score for substituting a token that is repeated in the source string only.
--s2 # Change the score for substituting a token that is repeated in template string only.
--s3 # Change the score for substituting a token that is repeated in the source string and in the template string.

DESCRIPTION

    The simil program measures similarity between two sequences by using the Damerau-Levenshtein distance, which is the minimum numbers of editing operations needed in order to transform one sequence into another.

    Edit distance (Damerau-Levenshtein distance)

    Edit distance measures how many insertions, deletions and substitutions are needed in order to convert one sequence into another. Suppose that you want to convert the letters in "DESK" into the letters in "RAVEN". The following figure shows all of the possible insertions, deletions and substitutions which can be done to their letters when converting between the two words.

    There are many paths through the above grid which can be followed from "DESK" at the top left corner to "RAVEN" in the bottom right corner. However, the most interesting path(s) is the one which takes the fewest steps. In this case, it takes at least five steps to get from DESK to RAVEN. The following figure highlights four possible paths which each require five edit operations. Substituting the letter "E" in DESK for the letter "E" in RAVEN is assigned zero cost, while all other segments are given a cost of one unit.

    In each case, the paths require five changes to go from DESK to RAVEN. No path through the set of edits can have fewer than five steps, and the maximum value of nine is equivalent to following along the edges in either direction. Here are the intermediate text states for the each of the above paths:

    DESK
    RESK
    RASK
    RAVK
    RAVE
    RAVEN
    DESK
    RDESK
    RADESK
    RAVESK
    RAVENK
    RAVEN
    DESK
    RESK
    RAESK
    RAVSK
    RAVEK
    RAVEN
    DESK
    RDESK
    RAESK
    RAVESK
    RAVEK
    RAVEN

    Basic simil usage

    The simil program requires two inputs: (1) a source file in the Humdrum file format, and (2) a search template file containing only a sequence of search strings. The resulting output is a Humdrum file containing a list of similarity measurements.

    simil source template > output
    template
    source
    output

    In this case, each line of the output file gives the similarity of the template sequence to a sequence in the source starting on the same line of data. For example, the first similarity value of 0.51 is the similarity value between the string "ABC" in the template when compared to the first three data tokens in the source file: "XAB". A similarity value of 1.00 indicates an exact match, so notice that the second similarity value compares "ABC" in the template to "ABC" in the source file starting on the second line of data. Values towards zero indicate less similarity between the search template and the source data.

    By default, simil scales the edit distance scores by the equation: e-d/L, where d is the raw edit distance score, L is the number of elements in the template sequence, and e is 2.718281828.... If you want to see the raw edit distance scores which are inverted such that 0 indicates an exact match and larger values indicate less similarity, then use the -R option. The -n option is a unnormalized variant of the default output, where the similarity score is mapped with the equation e-d.

    simil source template > output
    template
    source
    default
    -n
    -R

    The -R option makes is clearer what has happened in the comparison of "ABC" to "XAB". In this case the 2 is generated by deleting the "X" at the start of the source sequence, and adding a "C" at the end of the source sequence.

    Note that the search template elements and the source data tokens do not have to be single-letter values, but can be any arbitrary strings. The simil command treats the entire string as a single unit when comparing elements between the target and source data. Note that it does not analyze internal similarities between strings, so "cat" and "hat" may seem similar, but as string elements being compared, simil will determine that they are not equivalent in any way. Here is an example of using simil with longer source and template tokens:

    simil source template > output
    template
    source
    output

    In this case, the highest similarity (0.56) occurs when the sequence "ominously, phantasia, and phantasma" lines up between the source and the search template. The value 0.37 represents the case when there is no similarity between the source and template.

    Editing weights

    By default, all editing operations cost the same amount (1.0 units). However, you can alter the default values in two ways: (1) by loading them from a initialization file, or (2) by changing the weights from the command-line.

    A file containing the new editing weights can be read by using the -w option followed by the name of a file which contains the weights. The weights file can contains lines of comments which start with #, and lines of code which start with the edit rule name followed by the revised score for that rule. Here is a sample file (not all rules have to be given in the file, only the ones which differ from the default weights of 1.0):

    Alternatively, individual editing weights can be set from the command line as in the following example:

    simil --s0 1.6 --s1 0.7 --s3 0.7 source template > output
    template
    source
    output
    Note that the two dashes in front of the edit rule option names are required since the option name length is more than one character. The editing rule abbreviations are case insensitive in both the initialization file and in the command-line options. So, for example, "D2" and "d2" can be used interchangeably.

    By specifying the -W option, a list of editing weights will be appended to the output data as global comments:

    simil -W --s0 1.6 --s1 0.7 --s3 0.7 source template > output
    template
    source
    output

    Sub-sequence template matching

    By using the -x option, you can search for the best substring match from the template in the source at each point. For example, with a template of the letters "ABC" and the option -x 2, all substrings of length two which can be generated from "ABC" will be compared to the source data, and a list of the top matching subsequences will be included in the output analysis.

    simil -x 2 source template > output
    template
    source
    output

    The **simixrf contains a list of integers which indicates the position of the substring(s) in the template which best match the source file at each position. In this case there are two substrings being compared to the source data: (1) "AB" and (2) "BC". When there is a similarity measurement tie between two or more substrings, indexes for all best-matching substrings are displayed, separated by commas.

    Various options

    To output an extra spine containing the tokens sequence used from the source file for a particular simil score, use the -s option. Additionally, the -S (capital S) option can be used to suppress a space from being printed between the successive elements in the sequence.

    simil [-s [-S]] source template > output
    template
    source
    default
    -s
    -s -S

    To selectivly print information about only higher similarity values, use the -t option to suppress printing of data of low-similarity. For example, adding -t 0.70 to the command line options will prevent any similarity value less than 0.70 from being displayed in the output:

    simil -Sst 0.7 source template > output
    template
    source
    output

    In the output, lines with simil scores less than the threshold value are displayed as null tokens so that the relative locations of the values are preserved. You can remove the null token lines by piping the output into the Humdrum command "rid -d".

EXAMPLES

    As an example application of the simil program, consider these transcriptions of the Exultet (Easter Proclamation) in 28 manuscripts from the 13th and 14th centuries. In these sources, the most common 15-note diatonic pitch sequence is GFGAGAGFGAGAGFF, which occurs 270 times in the 28 sources (use the "context -n 15" and "kern -x" commands on the data files to verify -- removing editorial rests and accidentals).

    Beyond knowing that the GFGAGAGFGAGAGFF sequence occurs 270 times in the sources, it may be useful to know where these sequences occur in the music. To do this, the simil command can be used to search the pitch sequence of the music and the value of 1.00 can be searched for in the resulting analysis spine. Here is the analysis for Exultet from the manuscript Av:

    simil source template -p > output
    template
    source
    output

    Scroll through the output column above and look for a value of 1.00 in the first column. Whenever this occurs, the sequence GFGAGAGFGAGAGFF will be found in the second column starting on that line. To see patterns in the occurrence of the sequence, it is more illuminating to plot the edit-distance similarity measurements throughout the music as shown below.

    In Exultet source M2, however, the sequence GFGAGAGFGAGAGFF never occurs. Instead, the sequence has been altered.

    simil source template -p > output
    template
    source
    output

    The following command displays the sequences in the M2 source which are most similar to the template sequence. In this case GFGAGAGFAGAGFFG is the most common pattern which substitutes for GFGAGAGFGAGAGFF.

          simil source template -Ss -t 0.8| \
              grep '       GF' | cut -f2 | sort | uniq -c | sort -nr
     
    Examining the two patterns, it becomes clear that the difference is due to a deleted "G" in the middle of the pattern (or an inserted "G", depending on which source is earlier or your point of view):
         Av:   GFGAGAGFGAGAGFF
         M2:   GFGAGAGF AGAGFFG

    When comparing the two sequences, the last G in M2 should be ignored since the limit of 15 notes in the analysis may have chopped the G from the end of the sequences in the Av source.

REFERENCES

SEE ALSO

    • Humdrum Toolkit's simil documentation which also applies to the simil program except for the usage of the initialization file, simil.rc, which must be read using the -w option in simil.
    • Balign: Aygün, E.; Oommen, B. & Cataltepe, Z. Peptide classification using optimal and information theoretic syntactic modeling Pattern Recognition, Elsevier, 2010

DOWNLOAD

    The compiled simil program can be downloaded for the following platforms:
    • Linux (i386 processors) (dynamically linked) compiled on 28 Jun 2012.
    • Windows compiled on 29 Jun 2012.
    • Mac OS X/i386 compiled on 13 Nov 2013.

    The source code for the program was last modified on 8 Dec 2009. Click here to go to the full source-code download page.