speed optimization

Subject:	speed optimization
Summary:	Here is a little parsing algorithm I wrote for myself yesterday
Messages:	5
Author:	atanas markov
Date:	2007-06-20 04:36:44
Update:	2007-06-26 04:54:48

1. speed optimization

atanas markov - 2007-06-20 04:36:44

Hello,
My name is Atanas Markov. Yesterday I needed a parser to split MIME messages. I found yours and nameko, but both are really slow. On my laptop, a 50MB message containing 3 mp3 files and the Skype installer took more than 900 seconds to parse; ordinary messages of 2-3MB took a few seconds. So I decided to write a faster parsing routine. It now parses the 50MB message in 17-18 seconds, so in short it is more than 10 times faster. I am not claiming that my routine is fully optimized or that it has all of your functionality, but it could easily be extended to return everything your parser returns now. I only needed file splitting and header extraction. I needed a fast routine because I am writing part of a support system that receives a lot of mail, and many of the messages are bigger than 2-3MB.
So here is my code. It is not pretty and not well commented, but I hope it helps you.

<?php

$time= microtime(true);

ini_set('max_execution_time',"900");

$f= fopen('test/sample/mnogo.eml','r');

if (!$f){
die('file problem');
}

/*

written by Atanas Markov([email protected])

*/

/*
Routine start here

*/
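$parts= array();
//header+body
$hdr= true;
$headers= array();
$bounds= array();
//initialize state that can be read before the first body line sets it
$hasbody= false;
$firstpos= 0;
$lastpos= 0;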

while (!feof($f)){
$s= fgets($f);
if (!$hdr){
//we read body

/*
check for a boundary, but do it with as few checks as possible
*/

//check for starting --
$flag= false;
$len= strlen($s);
//stop one short of the end so that $s[$i+1] stays in range
for ($i=0;$i<$len-1;$i++){
if ($s[$i]!=' '){
if ($s[$i]=='-'&&$s[$i+1]=='-'){
$flag= true;
}
break;
}
}

$bound= false;

if ($flag){
//we have starting --, so check for boundaries
$sneed= trim($s);
foreach ($bounds as $boundval){
if ($sneed==$boundval||$sneed==$boundval.'--'){
$bound= true;
break;
}
}
}

if (!$bound){
if (!$hasbody&&trim($s)!=''){
//do we have a body or just a boundary
$hasbody= true;
}
$lastpos= ftell($f); //get file position
} else {
if ($hasbody){
//we store headers only and body start and end. if we have sth like $body.= $line it's really slow
$parts[]= array('headers'=>$headers,
'body'=>array('start'=>$firstpos,
'end'=>$lastpos
));

}
$hdr= true;
$body='';
}

} else {
if (trim($s)==''){
//empty line- header end
$hdr= false;
$body= '';
$hasbody= false;
$firstpos= ftell($f);
} else {
//we may slow a bit if we parse headers here or just parse them later
//now the routine writes all headers from previous parts in the new one.
//I think this may be improved, but it doesn't make problems for my parsing
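//split on the first colon only, and do not overwrite the $hdr state flag
$pair= explode(':',$s,2);
if (count($pair)==2){
$headers[trim($pair[0])]= trim($pair[1]);
}
//new boundary: take the text between the quotes, whatever the line ending is
$p= strpos($s,'boundary="');
if ($p!==false){
$p+= strlen('boundary="');
$q= strpos($s,'"',$p);
if ($q!==false){
$bounds[]= '--'.substr($s,$p,$q-$p);
}
}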
}
}
}

if ($hasbody){
//save last part on eof
$parts[]= array('headers'=>$headers,
'body'=>array('start'=>$firstpos,
'end'=>$lastpos
));
}

/*
Read data parts (now it's much faster as it's done at once)
*/
foreach ($parts as $id=>$value){
fseek($f,$parts[$id]['body']['start']);

if ($parts[$id]['body']['end']-$parts[$id]['body']['start']>60000){
//this is a test and I had no memory to put all data in it, but
//needed to check read speed
$pos= $parts[$id]['body']['start'];
$end= $parts[$id]['body']['end'];
$chunk= $end-$pos;
if ($chunk > 60000) {
$chunk= 60000;
}

while ($pos < $end){
//fread takes only the handle and a length; fseek already positioned the handle
$data= fread($f,$chunk);

$pos= $pos+$chunk;
$chunk= $end-$pos;
if ($chunk > 60000) {
$chunk= 60000;
}
}
} else {
$parts[$id]['body']['data']= fread($f,$parts[$id]['body']['end']-$parts[$id]['body']['start']);
}

if ($parts[$id]['body']['end']-$parts[$id]['body']['start']>60000) {
$parts[$id]['body']['data']= '...';
}

}

fclose($f);

print ('<br><br>'.sprintf('%.6f',microtime(true)-$time)).'<br>
<br>
<br>
';

print_r($parts);

?>

I hope my idea can help you.

[email protected]

2. Re: speed optimization

Manuel Lemos - 2007-06-20 21:23:26 - In reply to message 1 from atanas markov

Thank you for sharing your approach.

As you said, your parser is not as complete.

It is hard to take advantage of your code because it is a different approach.

Feel free to post other suggestions to optimize the parsing of the current implementation of the MIME parser class.

3. Re: speed optimization

atanas markov - 2007-06-23 06:39:16 - In reply to message 2 from Manuel Lemos

Yes, the approach is a bit different (though not by much: it reads headers and looks for boundaries in an optimized way, but the concept is the same), and it is more than 50 times faster. With extended header parsing it would give you all the functionality. I have no time to write a new version of your parser based on my concept, but I think that using fewer regular expressions and reading data in one piece would speed up your parser 20-30 times or more. So I gave you what works fine for me.
I have corrected the code. If you are interested in seeing the new version and some examples of how to parse headers with this algorithm, send me a mail at [email protected]
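
The cheap boundary test mentioned in this message could be sketched roughly as below. This is only an illustration, not code from either parser: `isBoundaryLine()` is a hypothetical helper name, and it assumes that every entry in `$bounds` already carries its leading `--`, as in the posted script.

```php
<?php
// Hypothetical helper (the name is illustrative, not from either parser).
// Assumes each entry in $bounds already carries its leading "--".
function isBoundaryLine($line, array $bounds)
{
    // Cheap prefilter: most body lines do not start with "--",
    // so they are rejected without looking at the boundary list.
    $trimmed = trim($line);
    if (strncmp($trimmed, '--', 2) !== 0) {
        return false;
    }
    foreach ($bounds as $b) {
        // A part delimiter is "--boundary"; the closing one adds "--".
        if ($trimmed === $b || $trimmed === $b . '--') {
            return true;
        }
    }
    return false;
}

$bounds = array('--abc123');
var_dump(isBoundaryLine("--abc123\r\n", $bounds));   // a part delimiter
var_dump(isBoundaryLine("--abc123--\r\n", $bounds)); // the closing delimiter
var_dump(isBoundaryLine("-- \r\n", $bounds));        // a signature separator
```

Only plain string functions are used here, so no regular expression is evaluated for ordinary body lines.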

As for the speed advantage, here is a sample: a 50MB file containing 5 mp3 files and the Skype installer is parsed in more than 900 seconds by your parser and in 18 seconds by mine on my laptop. The same test on a hosting machine with RAID and 2GB of memory took 2 seconds with my parser. Even the small messages from your samples directory are parsed in 0.002 seconds versus 0.2 seconds. So I think it is worth considering a change to the parser algorithm.
In a few minutes I will post my corrections to my parser here (it now handles multiline headers, and there is some header parsing). That will give you an example of how to do almost everything your parser does. I am not saying that my code needs no further work to do it all. It was written in 2 days for a specific task: take incoming mails and save them as attachments, body and very basic header info.
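
Handling of multiline (folded) headers can be illustrated with a minimal sketch like the one below. It is written under stated assumptions rather than taken from the posted script: `parseHeaderBlock()` is a hypothetical name, the input is assumed to be the raw header text up to the first blank line, and repeated headers such as `Received` simply overwrite each other.

```php
<?php
// Hypothetical helper: turns a raw header block into name => value pairs,
// joining folded (continuation) lines onto the preceding header.
function parseHeaderBlock($raw)
{
    $headers = array();
    $last = null;
    foreach (preg_split('/\r\n|\n|\r/', $raw) as $line) {
        if ($line === '') {
            break; // a blank line ends the header block
        }
        if (($line[0] === ' ' || $line[0] === "\t") && $last !== null) {
            // Continuation line: it belongs to the previous header.
            $headers[$last] .= ' ' . trim($line);
            continue;
        }
        $pair = explode(':', $line, 2);
        if (count($pair) === 2) {
            $last = trim($pair[0]);
            $headers[$last] = trim($pair[1]);
        }
    }
    return $headers;
}

$raw = "Subject: a long\r\n subject line\r\n"
     . "Content-Type: multipart/mixed;\r\n\tboundary=\"abc\"\r\n\r\nbody";
print_r(parseHeaderBlock($raw));
```

Checking for leading whitespace before looking for a colon matters, because a folded line may itself contain a colon (a time stamp, for instance).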

4. Re: speed optimization

atanas markov - 2007-06-23 06:46:21 - In reply to message 3 from atanas markov

So here is a demo of how to make things work. This is an excerpt from my mail parsing script. It includes everything except the database access that saves the info...

#!/usr/bin/php -nq
<?php
define ('ATT_PATH','/www/host.bg/cp2/root/fork/attachments');

ini_set('auto_detect_line_endings',true);

//virus code from supportparse.php
function decodeHeader($encodedtext) {
$elements=imap_mime_header_decode($encodedtext);
$result='';
for($i=0;$i<count($elements);$i++){
$charset= strtolower($elements[$i]->charset);
$charset= str_replace('windows-','cp',$charset);

if ($charset=='default'||
!$charset||
$charset==''){
$charset= 'cp1251';
}

if ($charset!='utf-8'){
$result.=iconv($charset, 'utf-8', $elements[$i]->text);
global $log;
if ($log){
fputs($log,'ICONV CALL HERE'.$elements[$i]->charset."\n");
}
} else {
$result.= $elements[$i]->text;
}
}
return $result;
}

/**
* Reads a MIME file and gets its parts into array. Doesn't read data, but
* gives a filename, beginning and end of data bodies. Must be parsed later,
* but is really fast
*
* @param string $file
* @return array Body parameters(headers, etc...)
*/
function splitmime($file){
global $log;
//mime splitting written by A.Markov([email protected])
$f= fopen($file,'r');

if (!$f){
die('file problem');
}

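$parts= array();
$hdr= true;
$headers= array();
$bounds= array();
//initialize state that can be read before it is first set
$hasbody= false;
$firstpos= 0;
$lastpos= 0;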

while (!feof($f)){
$s= fgets($f);

if (!$hdr){
$flag= false;
$len= strlen($s);
for ($i=0;$i<$len-1;$i++){
if ($s[$i]!=' '){
if ($s[$i]=='-'&&$s[$i+1]=='-'){
$flag= true;
}
break;
}
}

$bound= false;

if ($flag){
$sneed= trim($s);
foreach ($bounds as $boundval){
if ($sneed==$boundval||$sneed==$boundval.'--'){
$bound= true;
break;
}
}
}

if (!$bound){
if (!$hasbody&&trim($s)!=''){
$hasbody= true;
}
$lastpos= ftell($f);
} else {
if ($hasbody){
$parts[]= array('headers'=>$headers,
'body'=>array('start'=>$firstpos,
'end'=>$lastpos,
'file'=>$file
));

}
$hdr= true;
$body='';

//clear these...
unset($headers['Content-Type']);
unset($headers['Content-Transfer-Encoding']);
unset($headers['Content-Disposition']);
unset($headers['Content-ID']);
}

} else {
if (trim($s)==''){
$hdr= false;
$body= '';
$hasbody= false;
$firstpos= ftell($f);
} else {
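if (($s[0]==' '||$s[0]=="\t")&&isset($lasthdr)&&isset($headers[$lasthdr])){
//folded (continuation) line: append it to the previous header
$headers[$lasthdr].= ' '.trim($s);
} elseif (strpos($s,':')!==false){
//use a separate variable so the $hdr state flag is not overwritten
$pair= explode(':',$s,2);
$lasthdr= trim($pair[0]);
$headers[$lasthdr]= trim($pair[1]);
}
//take the boundary text between the quotes, whatever the line ending is
$p= strpos($s,'boundary="');
if ($p!==false){
$p+= strlen('boundary="');
$q= strpos($s,'"',$p);
if ($q!==false){
$bounds[]= '--'.substr($s,$p,$q-$p);
}
}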
}
}
}

fclose($f);

if ($hasbody){
$parts[]= array('headers'=>$headers,
'body'=>array('start'=>$firstpos,
'end'=>$lastpos,
'file'=>$file
));
}

foreach ($parts as $id=>$value){
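$enc='';
//parts without this header are left with an empty encoding
if (!empty($value['headers']['Content-Transfer-Encoding'])){
$enc= $value['headers']['Content-Transfer-Encoding'];
}

if (strtolower($enc)=='quoted-printable')
$enc='qp';
else if (strtolower($enc)=='base64')
$enc='b64';

//a part may have no Content-Type header at all
$ctype= isset($value['headers']['Content-Type'])?$value['headers']['Content-Type']:'';
$ctype= explode(';',$ctype);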

$cset= false;
foreach ($ctype as $val){
if (preg_match('#charset=("|\')?([^\s\'":;]+)#i', $val, $m1)){
$cset=$m1[2];
}
}

$ctype= $ctype[0];

$parts[$id]['body']['encoding']= $enc;
$parts[$id]['body']['content_type']= $ctype;
$parts[$id]['content_type']= $ctype;
$parts[$id]['body']['charset']= $cset;
$parts[$id]['charset']= $cset;
}

return $parts;
}

/**
* Get mime body data
*
* @param array $body
* @param string $savefile
* @return string|null Decoded data, or nothing when it was written to $savefile
*/
function getMimeBody(&$body,$savefile= false){
global $log;
$f= fopen($body['file'],'r');

fseek($f,$body['start']);

$data= fread($f,$body['end']-$body['start']);

fclose($f);
if ($log) fputs($log,$savefile);
if ($body['encoding']=='qp'){
$data=imap_qprint($data);
} elseif ($body['encoding']=='b64'){
$data=base64_decode($data);
}

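if ($body['charset']){
//iconv returns false on failure; keep the original data in that case
$converted= iconv($body['charset'],'utf-8',$data);
if ($converted!==false){
$data= $converted;
}
}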
if ($log) fputs($log,$savefile.'...');
if ($savefile){
$save= fopen($savefile,'w');
fwrite($save,$data);
fclose($save);
chmod($savefile,0666);
if ($log) fputs($log,$savefile);
unset($data);
} else {
return $data;
}
}

global $file;
global $argv;

$file=$argv[2]; //we need the second file
if (!file_exists($file)){
exit;
}

$parts= splitmime($file);

$message= '';

//join all headers
$headers= array();
foreach ($parts as $part){
$headers= $headers+$part['headers'];
}

$line= 'From:'.$headers['From'];
if (preg_match("#^From:\s*([^\012\015]+)#i", $line, $m)){
$name=$m[1];
$mail=$m[1];
if (preg_match("#(.*)<([^>]+)>#", $mail, $m)){
$name=decodeHeader(trim($m[1]));
$mail=$m[2];
$name=str_replace('"', '', $name);
}
}

$tid= false;
$line= 'Subject:'.$headers['Subject'];
if (preg_match("#^Subject:\s*([^\012\015]+)#i", $line, $m)){
$subject=decodeHeader($m[1]);
if (preg_match('#\[ne_iztrivai_id: (\d+)\](.*)#i', $subject, $m)){
$subject=trim($m[2]);
$tid=$m[1];
}
}

$message_text= false;

foreach ($parts as $part){
if (strpos($part['headers']['Content-Type'],'text/html')!==false&&
!$part['headers']['Content-Disposition']){

$message_text= getMimeBody($part['body']);
$message_text=preg_replace('#<br(\s*/)?>#i', "\015\012", $message_text);
//strip style, script, links
$message_text = preg_replace("'<script[^>]*>.*</script>'siU",'',$message_text);
$message_text = preg_replace('/<a\s+.*?href="([^"]+)"[^>]*>([^<]+)<\/a>/is', '\2 (\1)', $message_text);
$message_text = preg_replace("'<style[^>]*>.*</style>'siU",'',$message_text);
//js events replace
$message_text=str_replace('onkey','on_key',$message_text);
$message_text=str_replace('onmouse','on_mouse',$message_text);
$message_text=str_replace('onclick','on_click',$message_text);
$message_text=str_replace('onchange','on_change',$message_text);
$message_text=trim(preg_replace('#(\015?\12)+#', "\015\012", $message_text));
if (!$message_text){
$message_text= '   ';
}
} elseif (!$message_text&&
strpos($part['headers']['Content-Type'],'text/plain')!==false&&
!$part['headers']['Content-Disposition']){
//get plain only if no html
$message_text= getMimeBody($part['body']);
}
}

if (!$message_text){
$message_text= 'Mail not parsed.../Empty body';
}

if (!file_exists(ATT_PATH)){
mkdir(ATT_PATH);
}

foreach ($parts as $part){
if ($part['headers']['Content-Disposition']){
$fname= false;
$arr= explode(';',$part['headers']['Content-Type']);
if ($log) fputs($log,print_r($arr,true));
foreach ($arr as $value){
$val= explode('=',$value);
if (trim($val[0])=='name'){
$fname= str_replace('"','',$val[1]);
}
}

$arr= explode(';',$part['headers']['Content-Disposition']);
if ($log) fputs($log,print_r($arr,true));
foreach ($arr as $value){
$val= explode('=',$value);
if (trim($val[0])=='filename'){
$fname= str_replace('"','',$val[1]);
}
}

if ($fname){
do{
$tmpfile= time().mt_rand();
} while (file_exists(ATT_PATH.'/'.$tmpfile));

getMimeBody($part['body'],ATT_PATH.'/'.$tmpfile);
}
}

}

ini_restore('auto_detect_line_endings');

?>

5. Re: speed optimization

Manuel Lemos - 2007-06-26 04:54:48 - In reply to message 3 from atanas markov

I am sure there is plenty of room for optimization in the MIME parser class. However, certain things that you used in your approach are not very convenient for many users.

For instance, you use extensions like IMAP and ICONV that are not available in all PHP environments. The MIME parser class uses only base PHP code and avoids special extensions that may be missing from some distributions, where code depending on them would fail.
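
A rough sketch of how the decoding step could avoid a hard dependency on those extensions is shown below. The helper names `decodeBodyPortable()` and `toUtf8Portable()` are hypothetical, and the short encoding names `qp` and `b64` are the ones used by the script posted in this thread. `quoted_printable_decode()` and `base64_decode()` are core PHP functions; the charset conversion is attempted only when `iconv()` exists.

```php
<?php
// Hypothetical helper: decodes a body with core PHP functions only.
// $encoding uses the thread's short names: 'qp', 'b64', anything else means "as is".
function decodeBodyPortable($data, $encoding)
{
    if ($encoding === 'qp') {
        return quoted_printable_decode($data); // core PHP, no IMAP extension needed
    }
    if ($encoding === 'b64') {
        return base64_decode($data);
    }
    return $data;
}

// Hypothetical helper: converts to UTF-8 when iconv is available,
// and returns the data unchanged when it is not or when conversion fails.
function toUtf8Portable($data, $charset)
{
    if (!$charset || strtolower($charset) === 'utf-8') {
        return $data;
    }
    if (function_exists('iconv')) {
        $converted = @iconv($charset, 'utf-8', $data);
        if ($converted !== false) {
            return $converted;
        }
    }
    return $data;
}

echo decodeBodyPortable("caf=C3=A9", 'qp'), "\n";
echo decodeBodyPortable("aGVsbG8=", 'b64'), "\n";
echo toUtf8Portable("plain ascii", 'utf-8'), "\n";
```

Leaving the data unconverted when no conversion function is present is a design choice for this sketch; a caller might prefer to report the problem instead.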

Another aspect is that you seem to load the whole message into memory. If the message is too large, it may not be parsed in most PHP environments, which are limited by configuration to 8MB for each script. The MIME parser class can load messages in chunks of limited size, so it can parse arbitrarily long messages without exceeding the PHP memory limits.

Actually it can read large messages directly from a POP3 mailbox without storing the message data in intermediate files and without exceeding the PHP memory limits.

All the parsing is buffered. Currently it uses an 8000 byte buffer when reading messages from files or streams of data. If it would improve the parser performance, it would not be hard to make the buffer size configurable, so you could buffer more data at once if you have enough memory.
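
To illustrate the idea of a limited, configurable buffer with the start and end offsets recorded by the script earlier in this thread, here is a minimal sketch. It is not code from the MIME parser class: `copyPartInChunks()` is a hypothetical helper, the default of 8000 bytes only mirrors the figure quoted above, and the sketch copies raw bytes without decoding base64 or quoted-printable data.

```php
<?php
// Hypothetical helper: copies the byte range [$start, $end) of $srcPath
// into $dstPath, holding at most $chunkSize bytes in memory at a time.
function copyPartInChunks($srcPath, $start, $end, $dstPath, $chunkSize = 8000)
{
    $in = fopen($srcPath, 'rb');
    if (!$in) {
        return false;
    }
    $out = fopen($dstPath, 'wb');
    if (!$out) {
        fclose($in);
        return false;
    }
    fseek($in, $start);
    $remaining = $end - $start;
    while ($remaining > 0 && !feof($in)) {
        $chunk = fread($in, min($chunkSize, $remaining));
        if ($chunk === false || $chunk === '') {
            break; // read error or unexpected end of file
        }
        fwrite($out, $chunk);
        $remaining -= strlen($chunk);
    }
    fclose($in);
    fclose($out);
    return $remaining === 0; // true only when the whole range was copied
}

// Usage with throwaway files and a deliberately tiny buffer.
$src = tempnam(sys_get_temp_dir(), 'eml');
$dst = tempnam(sys_get_temp_dir(), 'part');
file_put_contents($src, "HEADER\n\n0123456789\n--end");
var_dump(copyPartInChunks($src, 8, 18, $dst, 4));
echo file_get_contents($dst), "\n";
unlink($src);
unlink($dst);
```

Because the buffer size is a parameter, the same routine can be run with a larger value on machines that have memory to spare.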

Anyway, I think the speed bottlenecks are not in the buffering. It is a matter of using a profiler tool to find which parts of the code take the most time and could be worth optimizing with faster approaches, without compromising the flexibility that the class provides. For now, I do not have time to do it. Maybe somebody with more free time can take a look at that.
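
Short of a full profiler, a first rough measurement could be taken with a small timing helper like the sketch below. `SectionTimer` is a hypothetical class written for this illustration, the two timed loops are synthetic stand-ins rather than code from the MIME parser class, and a dedicated profiler (Xdebug, for example) would give far more detail.

```php
<?php
// Hypothetical helper: accumulates wall-clock time per labelled section.
class SectionTimer
{
    private $totals = array();
    private $started = array();

    public function start($label)
    {
        $this->started[$label] = microtime(true);
    }

    public function stop($label)
    {
        if (!isset($this->started[$label])) {
            return; // stop() without a matching start() is ignored
        }
        $elapsed = microtime(true) - $this->started[$label];
        if (!isset($this->totals[$label])) {
            $this->totals[$label] = 0.0;
        }
        $this->totals[$label] += $elapsed;
        unset($this->started[$label]);
    }

    public function report()
    {
        arsort($this->totals); // slowest section first
        return $this->totals;
    }
}

// Synthetic comparison: a regex test versus a plain string test per line.
$timer = new SectionTimer();
$lines = array_fill(0, 10000, "some body line\r\n");

$timer->start('regex');
foreach ($lines as $l) {
    preg_match('/^\s*--/', $l);
}
$timer->stop('regex');

$timer->start('strncmp');
foreach ($lines as $l) {
    strncmp(ltrim($l), '--', 2);
}
$timer->stop('strncmp');

print_r($timer->report());
```

The numbers depend on the machine and PHP version, so they only indicate where a closer look might pay off.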
