mc-dir.pl is a tool for converting a directory of XML documents into one XML file suitable for the Tamino massload utility inoxmld -- version 1.0 by Jan Harmsen, 08-January-2002
mc-dir.pl is not an official Software AG product; please read mc-dir.pl for further information.
mc-dir.pl has been tested in conjunction with Tamino 2.3.1 / Tamino 3.1 on SuSE Linux / Win2K.
Watch out:
The Tamino massloader inoxmld consumes approximately 10 times the space of the raw XML data as temporary working space!
If your index is large, even more space is needed.
To massload 200 MB of XML data you will need at least 2 GB of temporary working space!
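The rule of thumb above can be turned into a quick pre-flight check. The following is a minimal sketch in Python, not part of mc-dir.pl: the function names and the `temp_dir` parameter are hypothetical, and the factor of 10 is only the approximate figure quoted above (a large index needs more).

```python
import os
import shutil


def raw_xml_size(directory):
    """Total size in bytes of the .xml/.XML files directly inside `directory`."""
    return sum(
        entry.stat().st_size
        for entry in os.scandir(directory)
        if entry.is_file() and entry.name.lower().endswith(".xml")
    )


def estimated_temp_space(raw_bytes, factor=10):
    """Rule-of-thumb estimate: about `factor` times the raw XML size."""
    return raw_bytes * factor


def has_enough_temp_space(directory, temp_dir, factor=10):
    """True if `temp_dir` has at least the estimated free space.

    `temp_dir` is a hypothetical parameter: point it at whatever location
    your Tamino installation uses for temporary working space.
    """
    needed = estimated_temp_space(raw_xml_size(directory), factor)
    return shutil.disk_usage(temp_dir).free >= needed


if __name__ == "__main__":
    mb = 1000 * 1000
    # 200 MB of raw XML -> 2.0 GB of temporary working space
    print(estimated_temp_space(200 * mb) / (1000 * mb), "GB")
```

Treat the result as a lower bound only; the text above notes that a large index increases the requirement.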
mc-dir.pl is a script that converts a directory of XML documents into one big XML file, which can be loaded directly into Tamino with the Tamino massload utility inoxmld.
| name: | mc-dir.pl (massload conversion of a directory) |
| input parameters: | mandatory: name of a directory with XML documents, e.g. c:/mc-directory; optional: -nodocname to prevent usage of the XML filename for the attribute ino:docname |
| output: | XML output file (for use with the Tamino massload utility) + log file |
| usage: | perl mc-dir.pl c:\mc-directory |
The files in the input directory must be well-formed xml documents, i.e. they must have the following format:
<?xml version="1.0" encoding=.....?>
<root_element>
..... content ....
</root_element>
Any Doctype declaration will be removed automatically by mc-dir.pl because the massload utility inoxmld requires this. Any Processing Instructions (PIs) or comments appearing BEFORE the root element are kept.
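To illustrate what removing the Doctype declaration while keeping PIs and comments can look like, here is a minimal sketch in Python. It is not the code used by mc-dir.pl; the regular expression is a simplification that assumes no `>` inside quoted identifiers, no `]` inside the internal subset, and no Doctype-like text inside a comment before the root element.

```python
import re

# Matches <!DOCTYPE ...> with an optional internal subset in [ ... ].
# Simplified on purpose; see the assumptions stated above.
DOCTYPE_RE = re.compile(r"<!DOCTYPE\s[^\[>]*(?:\[.*?\]\s*)?>", re.DOTALL)


def strip_doctype(xml_text):
    """Remove the first Doctype declaration; leave PIs and comments alone."""
    return DOCTYPE_RE.sub("", xml_text, count=1)


if __name__ == "__main__":
    sample = (
        '<?xml version="1.0"?>\n'
        "<!-- kept comment -->\n"
        '<!DOCTYPE note SYSTEM "note.dtd">\n'
        "<?kept-pi value?>\n"
        "<note>text</note>\n"
    )
    print(strip_doctype(sample))
```

A regular expression is enough for a sketch; documents with unusual prologs would need a real XML parser.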
To test mc-dir.pl, simply run test-mc.bat; this runs mc-dir.pl on the test directory ./mc-directory
To use mc-dir.pl:
1. Put XML documents of the same doctype (i.e. all XML docs have the same root element) in a directory, e.g. C:/mc-directory
2. Make sure that perl is installed and included in the $PATH: open a shell / command prompt and type "perl -v". You should get a message telling you which version of perl is installed. If perl is not installed, install it. Perl for Microsoft Windows is available for free from www.activestate.com. mc-dir.pl does not require any special modules.
3. Run the massload conversion program by typing e.g.
   perl C:\bin\mc-dir.pl D:\Data\mc-directory
   This converts the XML documents from the files in D:\Data\mc-directory\ into the format needed by the Tamino massload utility inoxmld. The output of mc-dir.pl is stored in the output file D:\Data\mc-directory.xml; important information is logged in D:\Data\mc-directory.log
After the conversion you can load the data into Tamino with inoxmld, e.g.:
inoxmld server=localhost:3204 collection=ccnlpub/CCNLpub input=D:\Data\mc-directory.xml log=D:\Data\mc-directory-log.xml norejects
The XML port number of the database (3204 in the example above) can be determined with the Tamino Manager: click on Databases --> DB-Name --> Properties --> Ports. Use the number shown for the XML port.
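The file naming in the example above (input directory D:\Data\mc-directory, converted file D:\Data\mc-directory.xml, load log D:\Data\mc-directory-log.xml) can be derived mechanically. The following Python sketch only assembles the argument list; the function name is hypothetical, and it uses nothing but the inoxmld parameters that appear in the example above.

```python
from pathlib import PureWindowsPath


def inoxmld_args(input_dir, server, collection):
    """Build the inoxmld argument list for a directory converted by mc-dir.pl.

    Only the parameters shown in the example above are used; the naming of
    the load log (<directory>-log.xml) follows that example as well.
    """
    directory = PureWindowsPath(input_dir)
    xml_file = directory.with_name(directory.name + ".xml")
    log_file = directory.with_name(directory.name + "-log.xml")
    return [
        "inoxmld",
        f"server={server}",
        f"collection={collection}",
        f"input={xml_file}",
        f"log={log_file}",
        "norejects",
    ]


if __name__ == "__main__":
    args = inoxmld_args(r"D:\Data\mc-directory", "localhost:3204", "ccnlpub/CCNLpub")
    print(" ".join(args))
```

The resulting list could be handed to `subprocess.run` on a machine where inoxmld is on the path; that step is left out here because it depends on the local Tamino installation.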
1. The filenames in mc-directory are read into an array.
2. Each file with the extension .xml or .XML is read line by line. If a line contains no complete tag, the next line is appended automatically (until the root element is found).
3. Further processing depends on whether the first document is being processed or has already been processed.
   If the first XML document is being processed:
   - The XML declaration, <ino:request> and <ino:object> are written.
   - PIs and comments which appear before the root element are written; if there is a DTD declaration, it is removed.
   - The root element is extracted.
   - All lines are written until the EOF (end of file) is reached.
   - The closing tag </ino:object> is written.
   After the first XML document has been processed:
   - The XML declaration is skipped and <ino:object> is written.
   - PIs and comments which appear before the root element are written; if there is a DTD declaration, it is removed.
   - The occurrence of the correct root element (from the first processed XML document) is checked. If the document doesn't contain the correct root element, an error message is printed.
   - All lines are written until the EOF (end of file) is reached.
   - The closing tag </ino:object> is written.
4. When there are no files left, the closing tag </ino:request> is written and the output file is closed.
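The steps above can be sketched compactly in Python. This is an illustration of the described flow, not a port of mc-dir.pl, and it rests on several assumptions: every file is read whole as UTF-8 instead of line by line, a document with the wrong root element is reported and still written, and any attributes on <ino:request> or <ino:object> (namespace declarations, ino:docname) are left out because the description above does not show them. All function names are hypothetical.

```python
import os
import re

XML_DECL_RE = re.compile(r"^\s*<\?xml\s.*?\?>\s*", re.DOTALL)
DOCTYPE_RE = re.compile(r"<!DOCTYPE\s[^\[>]*(?:\[.*?\]\s*)?>", re.DOTALL)
COMMENT_OR_PI_RE = re.compile(r"<!--.*?-->|<\?.*?\?>", re.DOTALL)
START_TAG_RE = re.compile(r"<([A-Za-z_][\w.:-]*)")


def split_prolog(text):
    """Return (xml declaration or '', body without declaration/Doctype, root name)."""
    match = XML_DECL_RE.match(text)
    decl = match.group(0).strip() if match else ""
    body = text[match.end():] if match else text
    body = DOCTYPE_RE.sub("", body, count=1).strip()
    # Ignore comments and PIs when looking for the first start tag.
    root = START_TAG_RE.search(COMMENT_OR_PI_RE.sub("", body))
    return decl, body, root.group(1) if root else None


def convert_directory(directory, out_path, log=print):
    """Wrap every .xml/.XML file in <ino:object> inside one <ino:request>."""
    names = sorted(n for n in os.listdir(directory) if n.lower().endswith(".xml"))
    expected_root = None
    with open(out_path, "w", encoding="utf-8") as out:
        for index, name in enumerate(names):
            with open(os.path.join(directory, name), encoding="utf-8") as src:
                decl, body, root = split_prolog(src.read())
            if index == 0:
                # First document: write its declaration and open the request.
                expected_root = root
                if decl:
                    out.write(decl + "\n")
                out.write("<ino:request>\n")
            elif root != expected_root:
                log(f"{name}: root element <{root}> does not match <{expected_root}>")
            out.write("<ino:object>\n" + body + "\n</ino:object>\n")
        if names:
            out.write("</ino:request>\n")
    return expected_root
```

Write the output file outside the input directory, as mc-dir.pl does with D:\Data\mc-directory.xml, so that a second run does not pick up its own output.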
The time needed to upload XML documents with inoxmld depends
mainly on the complexity of the schema and on the size and number
of XML documents.
Here are some test results for Tamino 2.3.1.4 on a WinNT workstation
with a 1.7 GHz Pentium 4 and 512 MB RAM (the Tamino TSD2 schema was
2.5 MB in size; loading 10 GB of XML data took 20 hours):
| schema | number of docs | filesize in bytes | time in seconds | avg. docsize in kB | loadtime ms/doc | load kB/s |
| 1 | 7419 | 233,039,371 | 1283 | 31 | 172 | 177 |
| 1 | 2547 | 332,300,218 | 1739 | 127 | 682 | 186 |
| 1 | 5925 | 336,734,739 | 1572 | 55 | 265 | 209 |
| 1 | 8500 | 206,796,668 | 1108 | 24 | 130 | 182 |
| 2 | 7419 | 233,039,371 | 821 | 31 | 111 | 277 |
| 2 | 2547 | 332,300,218 | 1138 | 127 | 447 | 285 |
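The last three columns of the table above follow from the first three. The Python sketch below recomputes them; it assumes 1 kB = 1024 bytes, which is not stated in the text but reproduces every reported figure to within 1. The function name and the `ROWS` list are introduced here for illustration only.

```python
# (schema, number of docs, filesize in bytes, time in seconds), copied from the table.
ROWS = [
    (1, 7419, 233_039_371, 1283),
    (1, 2547, 332_300_218, 1739),
    (1, 5925, 336_734_739, 1572),
    (1, 8500, 206_796_668, 1108),
    (2, 7419, 233_039_371, 821),
    (2, 2547, 332_300_218, 1138),
]


def derived_columns(docs, size_bytes, seconds):
    """Return (avg. docsize in kB, loadtime in ms/doc, load rate in kB/s)."""
    avg_doc_kb = size_bytes / docs / 1024
    ms_per_doc = seconds / docs * 1000
    kb_per_second = size_bytes / seconds / 1024
    return avg_doc_kb, ms_per_doc, kb_per_second


if __name__ == "__main__":
    for schema, docs, size_bytes, seconds in ROWS:
        avg_kb, ms_doc, kb_s = derived_columns(docs, size_bytes, seconds)
        print(f"schema {schema}: {avg_kb:.1f} kB/doc, {ms_doc:.1f} ms/doc, {kb_s:.1f} kB/s")
```

Comparing the two schemas on identical input (7419 and 2547 documents) shows how strongly the schema affects load time, as stated above.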