 05  Alignment of Multilingual Parallel Corpus



Vanilla Alignment and HunAlign Alignment


The file is entirely XML; we must use UTF-8 encoding to handle all character sets (French-Greek, for example).

Example of use for Lithuanian-Swedish alignment:

Before launching it, make sure you have uncompressed the alignment file (using the gunzip command, for example):

    gunzip jrc-lt-sv.xml.gz

Then you need to get and unpack the two corpora:

    tar xzf jrc-lt.tgz
    tar xzf jrc-sv.tgz

Then you can launch the program using a perl5 interpreter:

    perl getAlignmentWithText.pl -acquisDir . jrc-lt-sv.xml > jrc-lt-sv_withText.xml

=head1 COMMENTS
We have deliberately chosen to parse the texts without an XML parser.
The format of the XML texts is well known, and the script has to be as fast as possible to handle 8000 texts in less than 5 minutes.
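
The idea of scanning a well-known XML layout with a regular expression instead of a full XML parser can be sketched as follows. This is a minimal Python illustration, not the code of getAlignmentWithText.pl: the paragraph markup `<p n="...">...</p>`, the name `PARAGRAPH_RE`, and the helper `extract_paragraphs` are all assumptions made for the example.

```python
import re

# Assumed paragraph markup: <p n="3">text</p>.
# This pattern is a hypothetical illustration, not taken from the Perl script.
PARAGRAPH_RE = re.compile(r'<p n="(\d+)">(.*?)</p>', re.DOTALL)


def extract_paragraphs(xml_text):
    """Return {paragraph number: text} using a regex scan, with no XML parser."""
    return {int(n): body.strip() for n, body in PARAGRAPH_RE.findall(xml_text)}


# Hypothetical sample input in the assumed layout.
sample = '<body>\n<p n="1">Pirmas sakinys.</p>\n<p n="2">Antras sakinys.</p>\n</body>'
paragraphs = extract_paragraphs(sample)
print(paragraphs[2])
```

A scan like this only works when the markup is as regular as assumed; nested or irregular elements would need a real parser, which is the trade-off the comment above accepts in exchange for speed.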
=head1 AUTHORS

camelia.ignat@jrc.it, bruno.pouliquen@jrc.it

=cut


Manual Alignment  EN-CS
139 segments > 107 segments    71 segments > 101 segments


Copyright  2006 Milan Condak  www.condak.cz