Tuesday, 6 September 2011

Firefox Development: PHP: Parsing Firefox bookmarks.html

In this post I will cover how to parse Firefox's bookmarks.html file.

I am sure many of you have been turned off by the way Firefox stores its bookmarks. I personally find it not pretty at all. My main complaint is that it uses an old definition-list (<DL>) format that somewhat resembles HTML, but once you start parsing it, it turns out to be nowhere near. One of the main features(?) of this format is that tags such as <DT>, <DD> and <p> are never closed, and only the lists that contain nested children get a closing </DL>, so you have to rely on newline characters to figure out what the document is trying to tell you.
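To make the line-oriented nature of the format concrete, here is a minimal sketch. The sample markup is a hypothetical fragment shaped like a typical export, and classifyLine() is an illustrative helper name of my own, not part of the parser in this post. It only shows that each line can be identified by the tag it starts with.

```php
<?php
// Hypothetical sample, shaped like a typical Firefox export (assumption, not real data).
$sSample = <<<'HTML'
<DL><p>
    <DT><H3 ADD_DATE="1315267200" LAST_MODIFIED="1315267200">News</H3>
    <DD>Sites I read daily
    <DL><p>
        <DT><A HREF="http://example.com/" ADD_DATE="1315267200">Example</A>
    </DL><p>
</DL><p>
HTML;

// Illustrative helper: classify one line by the tag it starts with (case-insensitive).
function classifyLine($sLine) {
    $s = trim($sLine);
    if (preg_match('/^<DL>/i', $s))    { return 'list-open'; }
    if (preg_match('/^<\/DL>/i', $s))  { return 'list-close'; }
    if (preg_match('/^<DT><H3/i', $s)) { return 'folder'; }
    if (preg_match('/^<DT><A/i', $s))  { return 'link'; }
    if (preg_match('/^<DD>/i', $s))    { return 'description'; }
    return 'other';
}

// One classification per line of the sample.
$aKinds = array_map('classifyLine', explode("\n", $sSample));
print implode(', ', $aKinds) . "\n";
```

Note that the <DD> description line belongs to whatever <DT> came before it, which is why a parser for this format has to remember the kind of the previous entry.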

Why, tell me, why not just use XML like every normal application?
Beats me.

Anyways, I had to write a parser in PHP to accommodate that ugly format. It consists of two functions: the starter and the recursive helper. To get a nested array of bookmarks, just call FF_parseBookmarks($s), where $s is the text your code has read from bookmarks.html.


<?php
// Firefox library

// Recursive helper: parses one <DL> block starting at $aLines[$i].
// $i is passed by reference so the caller resumes at the matching </DL>.
function FF_parseBookmarks_($aLines, &$i = 0, $aProps = NULL) {
    if (!$aProps) { $aProps = array(); }
    // Only the outermost call starts at line 0.
    $bBeginning = ($i == 0);
    $a = array_merge($aProps, array(
        'nodes' => array()
    ));
    $sDTMode = 'folder';
    $sNewFolderProps = array();
    for (; $i < count($aLines); $i++) {
        $s = trim($aLines[$i]);
        if (preg_match('/^<DL>/i', $s)) {
            // A nested list holds the children of the folder described by the preceding <H3>.
            $i++;
            $a['nodes'][] = FF_parseBookmarks_($aLines, $i, $sNewFolderProps);
            $sNewFolderProps = array();
        } else if (preg_match('/^<DT><H3([^>]+)>([^<]+)/i', $s, $aRegs)) {
            // Folder heading: remember its properties until its <DL> shows up.
            $sDTMode = 'folder';
            $sCreated = $sModified = '';
            if (preg_match('/ADD_DATE="([^"]+)"/i', $aRegs[1], $aRegs1)) {
                $sCreated = $aRegs1[1];
            }
            if (preg_match('/LAST_MODIFIED="([^"]+)"/i', $aRegs[1], $aRegs1)) {
                $sModified = $aRegs1[1];
            }
            if (preg_match('/PERSONAL_TOOLBAR_FOLDER="true"/i', $aRegs[1])) {
                $sNewFolderProps['is_toolbar'] = '1';
            }
            $sNewFolderProps['title'] = $aRegs[2];
            $sNewFolderProps['created'] = $sCreated;
            $sNewFolderProps['modified'] = $sModified;
        } else if (preg_match('/^<DD>([^>]+)/i', $s, $aRegs)) {
            // A description belongs to the folder or link that came just before it.
            if ($sDTMode == 'folder') {
                $sNewFolderProps['description'] = $aRegs[1];
            } else {
                $a['nodes'][count($a['nodes']) - 1]['description'] = $aRegs[1];
            }
        } else if (preg_match('/^<DT><A([^>]+)>([^<]+)/i', $s, $aRegs)) {
            // Bookmark link.
            $sDTMode = 'link';
            $sURL = $sCreated = $sModified = $sTags = '';
            if (preg_match('/href="([^"]+)"/i', $aRegs[1], $aRegs1)) {
                $sURL = $aRegs1[1];
            }
            if (preg_match('/ADD_DATE="([^"]+)"/i', $aRegs[1], $aRegs1)) {
                $sCreated = $aRegs1[1];
            }
            if (preg_match('/LAST_MODIFIED="([^"]+)"/i', $aRegs[1], $aRegs1)) {
                $sModified = $aRegs1[1];
            }
            // SHORTCUTURL holds the bookmark's keyword; it is stored under 'tags' here.
            if (preg_match('/SHORTCUTURL="([^"]+)"/i', $aRegs[1], $aRegs1)) {
                $sTags = $aRegs1[1];
            }

            $a['nodes'][] = array(
                'url' => $sURL,
                'title' => $aRegs[2],
                'created' => $sCreated,
                'modified' => $sModified,
                'tags' => $sTags
            );
        } else if (preg_match('/^<\/DL>/i', $s)) {
            // End of this folder's list.
            return $a;
        }
    }
    if ($bBeginning) {
        return $a['nodes'];
    }
    return $a;
}

// Starter: takes the raw text of bookmarks.html and returns a nested array.
function FF_parseBookmarks($s) {
    // Drop the inline favicon data, it only bloats the lines.
    $s = preg_replace('/ICON="[^"]+"/', '', $s);
    // Skip the file header: everything before the first <DL>.
    $a = explode('<DL>', $s, 2);
    if (count($a) < 2) {
        return array(); // no bookmark list found
    }
    $s = '<DL>'.$a[1];
    //$s = str_replace('<HR>',"</DL>\n<DL>\n",$s);
    $aLines = explode("\n", $s);
    return FF_parseBookmarks_($aLines);
}

?>
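
Once you have the nested array, you will probably want to walk it. Below is a minimal sketch of one way to do that. The sample array is hand-written in the shape the parser appears to return (folders carry a 'nodes' array, links carry a 'url'), and flattenBookmarks() is a hypothetical helper name of my own. It assumes ADD_DATE is a Unix timestamp in seconds, which is how the format stores it.

```php
<?php
// Hypothetical sample in the shape the parser appears to return (assumption, not real data).
$aBookmarks = array(
    array(
        'nodes' => array(
            array(
                'title' => 'News', 'created' => '1315267200', 'modified' => '',
                'nodes' => array(
                    array('url' => 'http://example.com/', 'title' => 'Example',
                          'created' => '1315267200', 'modified' => '', 'tags' => ''),
                ),
            ),
            array('url' => 'http://example.org/', 'title' => 'Loose link',
                  'created' => '', 'modified' => '', 'tags' => ''),
        ),
    ),
);

// Illustrative helper: collect every link together with its folder path.
function flattenBookmarks($aNodes, $sPath = '') {
    $aOut = array();
    foreach ($aNodes as $aNode) {
        if (isset($aNode['nodes'])) {
            // Folder: extend the path (the untitled root adds nothing) and descend.
            $sTitle = isset($aNode['title']) ? $aNode['title'] : '';
            $sSub = ($sTitle === '') ? $sPath : $sPath . '/' . $sTitle;
            $aOut = array_merge($aOut, flattenBookmarks($aNode['nodes'], $sSub));
        } else {
            // Link: record where it lives and when it was added.
            $aOut[] = array(
                'path'  => ($sPath === '') ? '/' : $sPath,
                'title' => $aNode['title'],
                'url'   => $aNode['url'],
                'added' => ($aNode['created'] !== '') ? gmdate('Y-m-d', (int)$aNode['created']) : '',
            );
        }
    }
    return $aOut;
}

foreach (flattenBookmarks($aBookmarks) as $aLink) {
    print $aLink['path'] . ' | ' . $aLink['title'] . ' | ' . $aLink['url'] . ' | ' . $aLink['added'] . "\n";
}
```

With real data you would pass the result of FF_parseBookmarks($s) instead of the hand-written sample.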
