Saturday, January 12, 2008

Split XML Records with Perl Script

A colleague, Mahlon Anderson, and I were thinking of ways to split up a fairly large XML file, which had approximately 27K records in it. I wanted to split this file into smaller ones of about 250-300 records each, because my former web host service kept complaining about constant CPU quota overloads during uploads. A Perl-based splitter script quickly came to mind.

With the web host service, I had plenty of disk space and plenty of bandwidth but limited CPU usage. Apparently, I didn't notice that sticky point in the fine print when signing up for the service.

A different splitting implementation was later used as the permanent solution, but here is Mahlon's "quick and dirty" Perl XML splitter -- printed with his permission, of course.

# vi split.pl
#!/usr/bin/perl
$file = @ARGV[0];

open(_FH, "< $file") or die "Unable to open file\n";

$count = 0;
$files_counter=1;
$max_records = 300;

while (<_FH>)
{
    if ($count == 0)
    {
        $filename = $file . "_part_" . $files_counter;
        open(FH2, "> $filename") or die "Unable to open file: $filename\n";
        $count++;
    }

    if (grep /<\/item>/, $_ )
    {
        $count++;
    }

    print FH2 $_;

    if ($count == $max_records + 1)
    {
        $count = 0;
        $files_counter++;
        close(FH2);
    }
}
:wq!

# ./split.pl bigxmlfile.xml

9 comments:

Jean-Marc Liotier said...

Handy, but it is actually more an XML splitter than an XML parser. In addition, it assumes the flat structure of a steady stream of items.

esofthub said...

A very good point jean-marc liotier.

I wrote an AWK-based parser script to get it into the XML format, which I shared a few months ago.

Anyways, I'll make the changes to better reflect its functionality. For our small project, it was a steady stream.

UX-admin said...

Right. AWK is the perfect tool for the job in this particular case.

Anyways, if the web hoster is giving you crap, why not just roll out your own infrastructure?

All you need is a basement with electricity, a used rack, and some sniping on eBay, or, if you're good with logistics, brand new servers at a fraction of the price of those cheap Dells (some assembly required).

I'm kind of surprised. With your expertise, I didn't think you'd stand for depending on someone else for hosting.

esofthub said...

ux-admin,

From previous posts, I know you're a big AWK supporter.

I thought about that scenario many times but living in South Korea as a US expat legally limits my enterprise options on their soil. If I lived in the US, it would be a completely different story.

Believe me, I hate depending on someone else, almost to a fault. Obviously, I'm still looking for a solution to this problem...

Ideas?

Erek Dyskant said...

I'm assuming a simple file upload isn't causing their CPU to go crazy (or they've got other problems), so I'd guess you're parsing the XML on the server side as it's uploaded.

Have you considered nicing the receiving process down, or adding some sleep statements between iterations to make your %CPU usage look lower?

Cheers,
Erek
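For what it's worth, Erek's idea can be sketched directly in Perl: renice the process from inside the script and pause briefly after each record. This is only a sketch of the throttling technique, not anything from the original post; the niceness value of 19, the 0.05-second pause, and the "count records" placeholder work are all arbitrary choices to illustrate the shape of it.

```perl
#!/usr/bin/perl
# throttle.pl -- sketch of Erek's suggestion for a record-at-a-time loop.
# setpriority(0, 0, 19) renices the current process to the lowest
# scheduling priority (same effect as launching it under "nice -n 19"),
# and a fractional sleep after each record caps sustained %CPU.
use strict;
use warnings;
use Time::HiRes qw(sleep);

setpriority(0, 0, 19);    # PRIO_PROCESS, current pid, niceness 19

my $records = 0;
while ( my $line = <> ) {
    if ( $line =~ m{</item>} ) {
        $records++;       # placeholder for the real per-record work
        sleep 0.05;       # brief pause so sustained CPU usage stays low
    }
}
print "processed $records records\n";
```

Tune the pause (or sleep only every N records) until the host's usage meter stops complaining.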

esofthub said...

Erek,

IMHO, they had other problems. I created another tool, but they wouldn't let me run it because it had special requirements.

At first, I tried to load the larger XML file, but it was killing their CPU. Then I used this splitter to better manage it. For some reason, my web host was even sensitive to these smaller files after several uploads. I guess you get what you pay for.

ux-admin is right -- I have bigger issues. Allowing someone else to run my site is crap, but my choices are limited right now. I'm looking for an alternative solution.

Interestingly enough, the servers I work with are very powerful and host several web servers; the web servers are just a minor part of their workload.

starfrit said...

I tried it because I'm experiencing the same problem, but the script doesn't work.

I'm getting a syntax error at split.pl line 29, near ":wq!"
Execution aborted because of compilation errors.

What do I have to do to correct this?

esofthub said...

Starfrit,

I just ran it, and it worked fine. Maybe Blogger parsed a tag. Did you do a copy and paste? Note that below, I added an underscore ("_") in the while statement, <_FH>, because Blogger was complaining about <FH> during the submission of this comment. Remove the underscore in your real script.

# more splitter.pl
#!/usr/bin/perl
$file = @ARGV[0];

open(FH, "< $file") or die "Unable to open file\n";

$count = 0;
$files_counter=1;
$max_records = 300;

while ( <_FH> )
{
    if ($count == 0)
    {
        $filename = $file . "_part_" . $files_counter;
        open(FH2, "> $filename") or die "Unable to open file: $filename\n";
        $count++;
    }

    if (grep /<\/item>/, $_ )
    {
        $count++;
    }

    print FH2 $_;

    if ($count == $max_records + 1)
    {
        $count = 0;
        $files_counter++;
        close(FH2);
    }
}

Now run it.
# ./splitter.pl myxml.xml

Here's the output...
# ls -l myxml.xml*
-rwxrwxrwx 1 root other 25372146 Sep 19 2007 myxml.xml
-rw-r--r-- 1 root other 308835 Aug 16 22:05 myxml.xml_part_1
-rw-r--r-- 1 root other 282294 Aug 16 22:05 myxml.xml_part_10
-rw-r--r-- 1 root other 295288 Aug 16 22:05 myxml.xml_part_11
-rw-r--r-- 1 root other 298320 Aug 16 22:05 myxml.xml_part_12
-rw-r--r-- 1 root other 303570 Aug 16 22:05 myxml.xml_part_13
-rw-r--r-- 1 root other 297563 Aug 16 22:05 myxml.xml_part_14
-rw-r--r-- 1 root other 304841 Aug 16 22:05 myxml.xml_part_15
-rw-r--r-- 1 root other 298786 Aug 16 22:05 myxml.xml_part_16
-rw-r--r-- 1 root other 293452 Aug 16 22:05 myxml.xml_part_17
-rw-r--r-- 1 root other 293747 Aug 16 22:05 myxml.xml_part_18
-rw-r--r-- 1 root other 309752 Aug 16 22:05 myxml.xml_part_19
-rw-r--r-- 1 root other 284603 Aug 16 22:05 myxml.xml_part_2
-rw-r--r-- 1 root other 304117 Aug 16 22:05 myxml.xml_part_20
-rw-r--r-- 1 root other 310830 Aug 16 22:05 myxml.xml_part_21
...

John Bokma said...

If you use lexical filehandles, you not only circumvent Blogger's mangling but also make your life easier.

E.g.:

open my $fh, '<', $filename or die "Can't open '$filename' for reading: $!";
while ( my $line = <$fh> ) {
...
}
close $fh or die "Can't close '$filename': $!";

Notice that I've added some more good Perl programming advice (like correctly reporting errors, using the three-argument version of open, etc.).

Also:
$file = @ARGV[ 0 ];

is technically not correct; you want:

$file = $ARGV[ 0 ];

or

$file = shift;
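Putting John's suggestions together, the splitter might look something like the sketch below. It keeps the same algorithm and the 300-record limit from Mahlon's script; the variable names and error messages are my own choices, and it hasn't been run against the original 27K-record file.

```perl
#!/usr/bin/perl
# split.pl -- same algorithm as Mahlon's splitter, reworked along the
# lines of John Bokma's advice: strict/warnings, lexical filehandles,
# and the three-argument form of open with proper error reporting.
use strict;
use warnings;

my $file        = shift or die "Usage: $0 <xmlfile>\n";
my $max_records = 300;

open my $in, '<', $file or die "Can't open '$file' for reading: $!";

my $count         = 0;
my $files_counter = 1;
my $out;

while ( my $line = <$in> ) {
    if ( $count == 0 ) {
        my $filename = $file . "_part_" . $files_counter;
        open $out, '>', $filename
            or die "Can't open '$filename' for writing: $!";
        $count++;
    }

    $count++ if $line =~ m{</item>};    # one more record completed

    print {$out} $line;

    if ( $count == $max_records + 1 ) { # current part file is full
        $count = 0;
        $files_counter++;
        close $out or die "Can't close output file: $!";
    }
}

close $in or die "Can't close '$file': $!";
```

Lexical filehandles also survive Blogger's tag-mangling, since `<$in>` contains no bare word for it to mistake for HTML.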