Boyer-Moore string matching algorithm php implementation

Discussion in 'PHP' started by pein87, Aug 17, 2010.

  1. pein87

    pein87 Active Member

    Joined:
    Aug 6, 2010
    Messages:
    173
    Likes Received:
    47
    Trophy Points:
    28
    Occupation:
    Web Dev
    Location:
    Limbo
    I tried to port the Boyer-Moore string matching algorithm from C to php, I have not tested it out yet because I have a project thats taking all of my time. I was wondering if someone would give it a go and let me know how it works and provide some feed back/help improve it. I picked them because they are said to be the fastest methods to do string matches. Yeah I know since php and C are so much alike I'm hoping that it works perfectly else I write my own algorithm for it using php.

    Boyer-Moore string match algorithm php version
    PHP:
    <?php
     
    define
    ('ALPHABET_SIZE',1);
     
    function 
    compute_prefix($str,$size,&$result=0
    {
        
    $result $size 1;
        
    $q;
        
    $k;
        
    $result[0] = 0;
     
        
    $k 0;
        for (
    $q 1$q $size$q++) 
        {
            while (
    $k && $str[$k] != $str[$q])
                
    $k $result[$k-1];
     
            if (
    $str[$k] == $str[$q])
                
    $k++;
            
    $result[$q] = $k;
        }
    }
     
    function 
    prepare_badcharacter_heuristic($str,$size,$result=ALPHABET_SIZE)
    {
     
        
    $i 0;
     
        for (
    $i 0$i ALPHABET_SIZE$i++)
            
    $result[$i] = -1;
     
        for (
    $i 0$i $size$i++)
            
    $result[(int) $str[$i]] = $i;
    }
     
    function 
    prepare_goodsuffix_heuristic($normal$size, &$result=1
    {
        
    $result $size 1;
        
    $left = (string) $normal;
        
    $right $left $size;
        
    $reversed[$size+1];
        
    $tmp $reversed $size;
        
    $i;
     
        
    /* reverse string */
        
    $tmp 0;
        while (
    $left $right)
            
    $tmp $left++;
     
        
    $prefix_normal[$size];
        
    $prefix_reversed[$size];
     
        
    compute_prefix($normal$size$prefix_normal);
        
    compute_prefix($reversed$size$prefix_reversed);
     
        for (
    $i 0$i <= $size$i++) 
        {
            
    $result[$i] = $size $prefix_normal[$size-1];
        }
     
        for (
    $i 0$i $size$i++) 
        {
            
    $j $size $prefix_reversed[$i];
            
    $k $i $prefix_reversed[$i]+1;
     
            if (
    $result[$j] > $k) { $result[$j] = $k; }
        }
    }
    /*
     * Boyer-Moore search algorithm
     */
    function boyermoore_search($haystack,$needle) {
        
    /*
         * Calc string sizes
         */
        
    $needle_len
        
    $haystack_len;
        
    $needle_len strlen($needle);
        
    $haystack_len strlen($haystack);
     
        
    /*
         * Simple checks
         */
        
    if($haystack_len == 0) { return NULL; }
        if(
    $needle_len == 0) { return $haystack; }
     
        
    /*
         * Initialize heuristics
         */
        
    $badcharacter[ALPHABET_SIZE];
        
    $goodsuffix[$needle_len+1];
     
        
    prepare_badcharacter_heuristic($needle$needle_len$badcharacter);
        
    prepare_goodsuffix_heuristic($needle$needle_len$goodsuffix);
     
        
    /*
         * Boyer-Moore search
         */
        
    $s 0;
        while(
    $s <= ($haystack_len $needle_len))
        {
            
    $j $needle_len;
            while(
    $j && $needle[$j-1] == $haystack[$s+$j-1])
                
    $j--;
     
            if(
    $j 0)
            {
                
    $k $badcharacter[(int) $haystack[$s+$j-1]];
                
    $m;
                if(
    $k < (int)$j && ($m $j-$k-1) > $goodsuffix[$j])
                    
    $s += $m;
                else
                    
    $s += $goodsuffix[$j];
            }
            else
            {
                return 
    $haystack $s;
            }
        }
     
        return 
    NULL;
    }
    ?>
    Boyer-Moore-Horspool string matching algorithm php version
    PHP:
    <?php
    define
    ('UCHAR_MAX',255);
    function 
    boyermoore_horspool_memmem($haystack,$needle)
    {
        
    $hlen strlen($haystack);
        
    $nlen strlen($needle);
        
    $scan 0;
        
    $bad_char_skip[UCHAR_MAX 1];
     
        if (
    $nlen <= || !$haystack || !$needle) { return NULL; }
     
        for (
    $scan 0$scan <= UCHAR_MAX$scan $scan 1)
            
    $bad_char_skip[$scan] = $nlen;
     
        
    $last $nlen 1;
     
        for (
    $scan 0$scan $last$scan $scan 1)
            
    $bad_char_skip[$needle[$scan]] = $last $scan;
     
        while (
    $hlen >= $nlen)
        {
            for (
    $scan $last$haystack[$scan] == $needle[$scan]; $scan $scan 1)
                if (
    $scan == 0) { return $haystack; }
     
           
            
    $hlen -= $bad_char_skip[$haystack[$last]];
            
    $haystack += $bad_char_skip[$haystack[$last]];
        }
        
        return 
    NULL;
    }


    ?>
    Boyer-Moore string match C version
    Code:
    #include <limits.h>
    #include <string.h>
     
    #define ALPHABET_SIZE (1 << CHAR_BIT)
     
    static void compute_prefix(const char* str, size_t size, int result[size]) {
    	size_t q;
    	int k;
    	result[0] = 0;
     
    	k = 0;
    	for (q = 1; q < size; q++) {
    		while (k > 0 && str[k] != str[q])
    			k = result[k-1];
     
    		if (str[k] == str[q])
    			k++;
    		result[q] = k;
    	}
    }
     
    static void prepare_badcharacter_heuristic(const char *str, size_t size,
    		int result[ALPHABET_SIZE]) {
     
    	size_t i;
     
    	for (i = 0; i < ALPHABET_SIZE; i++)
    		result[i] = -1;
     
    	for (i = 0; i < size; i++)
    		result[(size_t) str[i]] = i;
    }
     
    void prepare_goodsuffix_heuristic(const char *normal, size_t size,
    		int result[size + 1]) {
     
    	char *left = (char *) normal;
    	char *right = left + size;
    	char reversed[size+1];
    	char *tmp = reversed + size;
    	size_t i;
     
    	/* reverse string */
    	*tmp = 0;
    	while (left < right)
    		*(--tmp) = *(left++);
     
    	int prefix_normal[size];
    	int prefix_reversed[size];
     
    	compute_prefix(normal, size, prefix_normal);
    	compute_prefix(reversed, size, prefix_reversed);
     
    	for (i = 0; i <= size; i++) {
    		result[i] = size - prefix_normal[size-1];
    	}
     
    	for (i = 0; i < size; i++) {
    		const int j = size - prefix_reversed[i];
    		const int k = i - prefix_reversed[i]+1;
     
    		if (result[j] > k)
    			result[j] = k;
    	}
    }
    /*
     * Boyer-Moore search algorithm
     */
    const char *boyermoore_search(const char *haystack, const char *needle) {
    	/*
    	 * Calc string sizes
    	 */
    	size_t needle_len, haystack_len;
    	needle_len = strlen(needle);
    	haystack_len = strlen(haystack);
     
    	/*
    	 * Simple checks
    	 */
    	if(haystack_len == 0)
    		return NULL;
    	if(needle_len == 0)
    		return haystack;
     
    	/*
    	 * Initialize heuristics
    	 */
    	int badcharacter[ALPHABET_SIZE];
    	int goodsuffix[needle_len+1];
     
    	prepare_badcharacter_heuristic(needle, needle_len, badcharacter);
    	prepare_goodsuffix_heuristic(needle, needle_len, goodsuffix);
     
    	/*
    	 * Boyer-Moore search
    	 */
    	size_t s = 0;
    	while(s <= (haystack_len - needle_len))
    	{
    		size_t j = needle_len;
    		while(j > 0 && needle[j-1] == haystack[s+j-1])
    			j--;
     
    		if(j > 0)
    		{
    			int k = badcharacter[(size_t) haystack[s+j-1]];
    			int m;
    			if(k < (int)j && (m = j-k-1) > goodsuffix[j])
    				s+= m;
    			else
    				s+= goodsuffix[j];
    		}
    		else
    		{
    			return haystack + s;
    		}
    	}
     
    	return NULL; // not found
    }
    Boyer-Moore-Horspool string matching in C
    Code:
    #include <string.h>
    #include <limits.h>
     
    /* The constant UCHAR_MAX is assumed to contain the maximum
     * value of the input character type. Typically it's 255.
     * size_t is an unsigned type for representing sizes.
     * If your system doesn't have it, replace with
     * unsigned int.
     */
     
    /* Returns a pointer to the first occurrence of "needle"
     * within "haystack", or NULL if not found. Works like 
     * memmem().
     */
    const unsigned char *
    boyermoore_horspool_memmem(const unsigned char* haystack, ssize_t hlen,
                               const unsigned char* needle,   ssize_t nlen)
    {
        size_t scan = 0;
        size_t bad_char_skip[UCHAR_MAX + 1]; /* Officially called:
                                              * bad character shift */
     
        /* Sanity checks on the parameters */
        if (nlen <= 0 || !haystack || !needle)
            return NULL;
     
        /* ---- Preprocess ---- */
        /* Initialize the table to default value */
        /* When a character is encountered that does not occur
         * in the needle, we can safely skip ahead for the whole
         * length of the needle.
         */
        for (scan = 0; scan <= UCHAR_MAX; scan = scan + 1)
            bad_char_skip[scan] = nlen;
     
        /* [URL=http://www.go4expert.com/articles/c-arrays-t29952/]C arrays[/URL] have the first byte at [0], therefore:
         * [nlen - 1] is the last byte of the array. */
        size_t last = nlen - 1;
     
        /* Then populate it with the analysis of the needle */
        for (scan = 0; scan < last; scan = scan + 1)
            bad_char_skip[needle[scan]] = last - scan;
     
        /* ---- Do the matching ---- */
     
        /* Search the haystack, while the needle can still be within it. */
        while (hlen >= nlen)
        {
            /* scan from the end of the needle */
            for (scan = last; haystack[scan] == needle[scan]; scan = scan - 1)
                if (scan == 0) /* If the first byte matches, we've found it. */
                    return haystack;
     
            /* otherwise, we need to skip some bytes and start again. 
               Note that here we are getting the skip value based on the last byte
               of needle, no matter where we didn't match. So if needle is: "abcd"
               then we are skipping based on 'd' and that value will be 4, and
               for "abcdd" we again skip on 'd' but the value will be only 1.
               The alternative of pretending that the mismatched character was 
               the last character is slower in the normal case (Eg. finding 
               "abcd" in "...azcd..." gives 4 by using 'd' but only 
               4-2==2 using 'z'. */
            hlen     -= bad_char_skip[haystack[last]];
            haystack += bad_char_skip[haystack[last]];
        }
     
        return NULL;
    }
     
    Last edited: Aug 17, 2010
    Seansa Seven likes this.
  2. pein87

    pein87 Active Member

    Joined:
    Aug 6, 2010
    Messages:
    173
    Likes Received:
    47
    Trophy Points:
    28
    Occupation:
    Web Dev
    Location:
    Limbo
    Fixed a problem that I saw in the first two functions I tried to add a number and a string instead of using it as the array index key fixed that part. Also fixed the constant error as the function param another instance were it should be the array index key.

    PHP:
    <?php
     
    define
    ('ALPHABET_SIZE',1);
     
    function 
    compute_prefix($str,$size,&$result=0
    {
        
    $result $result[$size 1];
        
    $q;
        
    $k;
        
    $result[0] = 0;
     
        
    $k 0;
        for (
    $q 1$q $size$q++) 
        {
            while (
    $k && $str[$k] != $str[$q])
                
    $k $result[$k-1];
     
            if (
    $str[$k] == $str[$q])
                
    $k++;
            
    $result[$q] = $k;
        }
    }
     
    function 
    prepare_badcharacter_heuristic($str,$size,$result)
    {
        
    $result $result[ALPHABET_SIZE];
     
        
    $i 0;
     
        for (
    $i 0$i ALPHABET_SIZE$i++)
            
    $result[$i] = -1;
     
        for (
    $i 0$i $size$i++)
            
    $result[(int) $str[$i]] = $i;
    }
     
    function 
    prepare_goodsuffix_heuristic($normal$size, &$result=1
    {
        
    $result $result[$size 1];
        
    $left = (string) $normal;
        
    $right $left $size;
        
    $reversed[$size+1];
        
    $tmp $reversed $size;
        
    $i;
     
        
    /* reverse string */
        
    $tmp 0;
        while (
    $left $right)
            
    $tmp $left++;
     
        
    $prefix_normal[$size];
        
    $prefix_reversed[$size];
     
        
    compute_prefix($normal$size$prefix_normal);
        
    compute_prefix($reversed$size$prefix_reversed);
     
        for (
    $i 0$i <= $size$i++) 
        {
            
    $result[$i] = $size $prefix_normal[$size-1];
        }
     
        for (
    $i 0$i $size$i++) 
        {
            
    $j $size $prefix_reversed[$i];
            
    $k $i $prefix_reversed[$i]+1;
     
            if (
    $result[$j] > $k) { $result[$j] = $k; }
        }
    }
    /*
     * Boyer-Moore search algorithm
     */
    function boyermoore_search($haystack,$needle) {
        
    /*
         * Calc string sizes
         */
        
    $needle_len
        
    $haystack_len;
        
    $needle_len strlen($needle);
        
    $haystack_len strlen($haystack);
     
        
    /*
         * Simple checks
         */
        
    if($haystack_len == 0) { return NULL; }
        if(
    $needle_len == 0) { return $haystack; }
     
        
    /*
         * Initialize heuristics
         */
        
    $badcharacter[ALPHABET_SIZE];
        
    $goodsuffix[$needle_len+1];
     
        
    prepare_badcharacter_heuristic($needle$needle_len$badcharacter);
        
    prepare_goodsuffix_heuristic($needle$needle_len$goodsuffix);
     
        
    /*
         * Boyer-Moore search
         */
        
    $s 0;
        while(
    $s <= ($haystack_len $needle_len))
        {
            
    $j $needle_len;
            while(
    $j && $needle[$j-1] == $haystack[$s+$j-1])
                
    $j--;
     
            if(
    $j 0)
            {
                
    $k $badcharacter[(int) $haystack[$s+$j-1]];
                
    $m;
                if(
    $k < (int)$j && ($m $j-$k-1) > $goodsuffix[$j])
                    
    $s += $m;
                else
                    
    $s += $goodsuffix[$j];
            }
            else
            {
                return 
    $haystack $s;
            }
        }
     
        return 
    NULL;
    }
    ?>
     
    Seansa Seven and shabbir like this.
  3. Seansa Seven

    Seansa Seven New Member

    Joined:
    Jun 10, 2011
    Messages:
    3
    Likes Received:
    0
    Trophy Points:
    0
    Occupation:
    Student
    Location:
    Indonesia
    Hi, I'm a newbie in programming...
    Thank you for this code, I search for this... :)
    I have tried this code, but I find error at line 49 :
    $tmp = $left++;
    "Fatal error: Maximum execution time of 60 seconds exceeded in C:\xampp\htdocs\boyermoore2\boyermoore.php on line 49"
    I think it's because the looping... And line 40 :
    $left = (string) $normal;
    Can you explain syntax "(string)" and that line?
     
  4. pein87

    pein87 Active Member

    Joined:
    Aug 6, 2010
    Messages:
    173
    Likes Received:
    47
    Trophy Points:
    28
    Occupation:
    Web Dev
    Location:
    Limbo
    remove the (string) since php does not require type defs, and change
    PHP:
    while ($left $right)
            
    $tmp $left++;
    to
    PHP:
    while ($left $right) {
            
    $tmp $left++;
    }
    This should fix the problem, I've personally never tested this code though it was just something I did to kill time. Hope it works for you now, although it would be slow because its in php and not C. Might also need to raise your execution limit from 60 to something higher.
     
  5. Seansa Seven

    Seansa Seven New Member

    Joined:
    Jun 10, 2011
    Messages:
    3
    Likes Received:
    0
    Trophy Points:
    0
    Occupation:
    Student
    Location:
    Indonesia
    Thanks. I tried your suggestion, but It's still error...
    I think line 40 :
    PHP:
    $left $normal;
    It's must be number, cause it use for limit in line :
    PHP:
    $tmp 0;
    while (
    $left $right) {
            
    $tmp $left++;
    I changed line 40 to :
    PHP:
    $left strlen($normal)
    It's works, but sometimes the pattern cannot find...
     
  6. pein87

    pein87 Active Member

    Joined:
    Aug 6, 2010
    Messages:
    173
    Likes Received:
    47
    Trophy Points:
    28
    Occupation:
    Web Dev
    Location:
    Limbo
    Last edited: Jun 11, 2011
  7. Seansa Seven

    Seansa Seven New Member

    Joined:
    Jun 10, 2011
    Messages:
    3
    Likes Received:
    0
    Trophy Points:
    0
    Occupation:
    Student
    Location:
    Indonesia
    I've tried this too...
    The code can run, but it's always return false even the pattern found or not... :thinking:
    Do you have another code?
     

Share This Page

  1. This site uses cookies to help personalise content, tailor your experience and to keep you logged in if you register.
    By continuing to use this site, you are consenting to our use of cookies.
    Dismiss Notice