GTM
(General Text Matcher)
version 1.3
available from: http://nlp.cs.nyu.edu/GTM/
written by: (in chronological order of first contribution) Ryan Green, Joseph P. Turian, I. Dan Melamed, Luke Shen, Ali Argyle, Ben Wellington, Daniel Galron

TABLE OF CONTENTS
  1. Introduction
  2. Building the Package
  3. Running a Sanity Check
  4. Invocation
  5. Input File Format
  6. Output Format
  7. Sample Output
  8. Feedback

***

1. INTRODUCTION

This software is available at http://nlp.cs.nyu.edu/GTM/
The website contains updates, as well as digests and signatures so that
you can verify the integrity of this package.

This software measures the similarity between texts. To learn more
about using this software to automatically evaluate machine
translation, and to read about the details of the scoring method,
please consult Proteus technical report #03-005 "Evaluation of Machine
Translation and its Evaluation" available at
http://nlp.cs.nyu.edu/eval/

To learn more about our work, please visit the Proteus Project homepage
at http://nlp.cs.nyu.edu/

The license for this software is in the file LICENSE.

The Proteus Project is supported by grants and contracts from the
National Science Foundation (NSF) and the Defense Advanced Research
Projects Agency (DARPA), as well as by gifts from Sun Microsystems.

***

2. BUILDING THE PACKAGE

Note that this section only pertains to the source distribution.
If you are using the binary distribution, you can skip to section 3.

To build the package on a Unix system, just type 'make'.

To build the package on a Windows system, invoke javac on the source
files:
	javac *.java

***

3. RUNNING A SANITY CHECK

If you would like to run a sanity check on the package, you can do so
automatically or manually.


AUTOMATIC SANITY CHECK:

There is an included test suite. This will only work under Unix. If you
are under Windows, you will need to run a manual sanity check.

To invoke the test suite, please type:
	./sanitycheck.pl
(You may have to edit sanitycheck.pl to change the invocation string.)

The sanitycheck.pl script will read in tests/optionlist which contains,
on each line, the test name (test00 ... test99) and the options to pass
to GTM. For each test, the script will invoke GTM with the test's
options and compare the output to tests/test??.out

If everything goes well, all 100 test cases should pass, and you can
move on to Section 4 (Invocation).

If one or more of the tests fail, the diff between the expected and the
actual output will be displayed. You should run a manual sanity check
(subsequently described) and email us a bug report.


MANUAL SANITY CHECK:

If the sanity check fails or if you are under Windows, you can invoke
the program manually to make sure that the actual output matches the
expected output.

Section 7 (Sample Output) contains sample command lines, as well as
the expected output. If the output your program generates does not
match our output, please email us a bug report.

If you would like to invoke the individual tests, you can find
the test options in tests/optionlist and the expected output in
tests/tests??.out

***

4. INVOCATION

The program can be invoked with the following options:

java gtm					\
	[-v]					\
	[-e exponent]				\
	[+s|-s|--scoresegs|--noscoresegs]	\
	[+d|-d|--scoredocs|--noscoredocs]	\
	[-t| --textin]                          \
        [-l| --lineFile]                                \
	[--dumphits hitsfile]			\
	[--dumpruns runsfile]			\
	testfile referencefile(s)

If you are using the pre-compiled JAR file, to invoke the program you
will use:

java -jar gtm.jar				\
	[-v]					\
	[-e exponent]				\
	[+s|-s|--scoresegs|--noscoresegs]	\
	[+d|-d|--scoredocs|--noscoredocs]	\
	[-t| --textin]                          \
        [-l| --lineFile]                                \
	[--dumphits hitsfile]			\
	[--dumpruns runsfile]			\
	testfile referencefile(s)

The parameters are as follows:

	-v			Enable verbose mode.

				At the beginning of program execution,
				all program options will be displayed.

				Whenever a score is output, verbose
				mode will subsequently display those
				variables used to calculate the score,
				as well as the segment's precision and
				recall (alternative scores to the
				standard F-measure).

	-e exponent             Use this exponent, instead of the
				default 1.0, for calculating the score
				of a run.

	+s / --scoresegs        Show the scores for individual
				segments.

	-s / --noscoresegs      Do not show the scores for individual
				segments.
				(default)

	+d / --scoredocs        Show the scores for documents.
				(default)

	-d / --noscoredocs      Do not show the scores for documents.

	-t / --textin		The testfile and the referencefile(s)
				are in plain text format.
				Default is sgml format

	--dumpfile file		Specify the file for dumping the hits
				and runs

	--dumphits		toggle dumping hits

	--dumpruns		toggle dumping runs

	testfile                The location of an SGML file containing
				the test documents. This testfile
				can be in plain text format if the -t 
				option is used. 

	referencefile(s)        The location of one or more SGML files
				containing the reference documents. 
				This testfile can be in plain text 
				format if the -t option is used. 


As indicated above, the invocation:
	java gtm testfile referencefile(s)
is equivalent to:
	java gtm -e 1.0 --noscoresegs --noscoredocs	\
		testfile referencefile(s)

If options are provided that clobber earlier options, the latter
options are used. For example:
	java gtm --noscoresegs --scoresegs ...
is equivalent to:
	java gtm --scoresegs ...

***

5. INPUT FILE FORMAT

The input files by default are SGML files that conform to the following
specification. This specification is a subset of the NIST DTD for
evaluation of language translation (version 1.1), which is located at
http://www.nist.gov/speech/tests/mt/doc/mteval.dtd

As an alternative to SGML, in version 1.1 the files can be plain text
ifthe -t option is specified.

Each input line is parsed individually. The fields are case-sensitive.

Note that *no* pre-processing is done on the input texts. Users who
want case-insensitive matching or stemming should pre-process their
input files prior to invoking GTM.

Also note that different machines can produce inconsistent
string-matching results on non-standard ASCII depending on what
the system default encoding is, i.e. text that is
not alphabetic, numerical, or a punctuation mark. 
Therefore as of version 1.1 of gtm all files are read in as 
ISO8859-15 (Latin characters)
If your reference texts contain non Latin characters, you 
can either preprocess them or use a different encoding in gtm.java.

SGML specs:

A new document is started with the line:
<doc docid="DOCID" sysid="SYSID">
	DOCID is the document ID. The document ID's for different
	input files must be aligned. In other words, the first docid in
	each input file must be the same, the second docid in each
	input file must be the same, &tc.

	SYSID is the system ID. There must be only one SYSID in one
	input file. This field is present for compatibility with the
	MT DTD.

A segment within a document is given by:
<seg id=SEGID> TEXT </seg>
	SEGID is the segment ID. As with the docid's, the segment
	ID's for different input files must have the same values in the
	same order, but they need not be continuously numbered or even
	numeric.

	TEXT contains the text of the segment. The text is tokenized at
	whitespace boundaries.

All other input lines are silently ignored.

To emphasize what was stated above, each input line is parsed
*individually*. For example, if a <seg id=SEGID> and its </seg> appear on
different lines, this segment will be *ignored*, the parser looks for
both tags on the same line of input.

Sample input files are located in samples/
	samples/ara-sys[12]     Sample system (machine) translations
	samples/ara-ref[12]     Sample reference (human) translations

The files were all translated from the same source documents.  Each
file contains translations for 2 documents (named "sampledoc-0" and
"sampledoc-2"), with each document containing 3 segments (numbered 1
through 3).

Plain Text:

If you are using the -t option, the entire input file will be wrapped 
inside one segment onside one document. 



***

6. OUTPUT FORMAT

Segment scores, if they are output (--scoresegs), will be displayed on
lines containing three fields:
DOCID SEGID score

Document scores, if they are output (--scoredocs), will be displayed on
lines containing two fields:
DOCID score

Note that the document score is not merely the average of the segment
scores. Rather, segment scoring is a special case of document scoring:
Segments are treated as documents that contain only one segment.  For a
precise definition of how document scores are calculated, please
consult Proteus technical report #03-005 "Evaluation of Machine
Translation and its Evaluation" available at
http://nlp.cs.nyu.edu/eval/

If the input files have DOCIDs or SEGIDs that do not match, the program
will abort with an error message upon reaching this mis-match.

***

7. SAMPLE OUTPUT

Scoring the segments but not the documents of sys1 using ref2 as a
reference:
	% java gtm +s -d samples/ara-sys1.txt samples/ara-ref2.txt
	sampledoc-0 1 0.13333333333333333
	sampledoc-0 2 0.45217391304347826
	sampledoc-0 3 0.5714285714285714
	sampledoc-1 1 0.28571428571428564
	sampledoc-1 2 0.3777777777777778
	sampledoc-1 3 0.35051546391752575

Scoring the documents of sys1 using both references ref1 and ref2:
	% java gtm						\
		samples/ara-sys1.txt				\
		samples/ara-ref1.txt samples/ara-ref2.txt
	sampledoc-0 0.517412935323383
	sampledoc-1 0.5083932853717026

Scoring the segments and documents of sys2 using both references ref1
and ref2:
	% java gtm +s +d					\
		samples/ara-sys2.txt				\
		samples/ara-ref1.txt samples/ara-ref2.txt
	sampledoc-0 1 0.3448275862068966
	sampledoc-0 2 0.5862068965517241
	sampledoc-0 3 0.6101694915254238
	sampledoc-0 0.5588235294117647
	sampledoc-1 1 0.4347826086956522
	sampledoc-1 2 0.5287356321839081
	sampledoc-1 3 0.6627218934911242
	sampledoc-1 0.5758354755784062

***

8. FEEDBACK

Questions? Comments? Suggestions?
	Bugs? Code patches? Feature requests?

We welcome all feedback.
Please email feedback to I. Dan Melamed : <lastname> at cs dot nyu dot edu

***

$Id: README 7 2008-01-27 21:16:28Z melamed $
