| The file is entirely XML; we must use the UTF-8 encoding to handle all character sets (the French-Greek pair, for example). |
| Example of use for the Lithuanian-Swedish alignment: before launching the script, make sure you have uncompressed the alignment file (using the gunzip command, for example). |
| gunzip jrc-lt-sv.xml.gz |
| Then, you need to get and unpack the two corpora: |
| tar xzf jrc-lt.tgz |
| tar xzf jrc-sv.tgz |
| Then you can launch this program using a perl5 interpreter: |
| perl getAlignmentWithText.pl -acquisDir . jrc-lt-sv.xml > jrc-lt-sv_withText.xml |
| =head1 COMMENTS |
| We have deliberately chosen to parse the texts without an XML parser. |
| The format of the XML texts is well known, and the script has to be as fast as possible in order to handle 8000 texts in less than 5 minutes. |
| =head1 AUTHORS |
| camelia.ignat@jrc.it, bruno.pouliquen@jrc.it |
| =cut |
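The COMMENTS section says the texts are parsed without an XML parser because their format is well known. The following is only a minimal sketch of what such regex-based extraction might look like, not the script's actual code. The helper name `extract_paragraphs`, the `<p n="...">` paragraph layout, and the sample input are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not taken from getAlignmentWithText.pl):
# collect paragraph text keyed by the "n" attribute, using a regular
# expression instead of an XML parser.
# Assumption: each paragraph looks like <p n="12">text</p>, with "n" as
# the first attribute and no nested <p> elements.
sub extract_paragraphs {
    my ($xml) = @_;
    my %text_of;
    # /s lets "." match newlines, so multi-line paragraphs are captured;
    # the non-greedy (.*?) stops at the first closing </p>.
    while ($xml =~ m{<p\s+n="(\d+)"[^>]*>(.*?)</p>}gs) {
        $text_of{$1} = $2;
    }
    return \%text_of;
}

# Made-up sample input, for illustration only.
my $sample = <<'XML';
<body>
<p n="1">First paragraph.</p>
<p n="2">Second paragraph.</p>
</body>
XML

my $paragraphs = extract_paragraphs($sample);
for my $n (sort { $a <=> $b } keys %$paragraphs) {
    print "$n: $paragraphs->{$n}\n";
}
```

A pattern like this is quick because it scans each text once and never builds a document tree. It only works, however, as long as the input really follows the assumed layout, which is the trade-off the COMMENTS section describes.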