Boost Regex won't return value [C++]

thekevin07's Avatar, Join Date: Sep 2008
Go4Expert Member
Hi

I've spent 2 hours on a regex and I can't get it to work.

Code:
\<div\\sclass=\"Summary\">(.*?)\<\/div>
I'm using Boost regex in C++ with Visual Studio 2008. I can't seem to get the regex above to return the content between <div class="Summary"> and </div>. Below is an example of what I'm trying to do:

<div class=\"Summary\">
<b>heading</b>
</br>desc
<div class="anotherDiv">

</div>
</div>

When I run the regex I should get this back:

Code:
<b>heading</b>
     </br>desc
     <div class="anotherDiv"></div>

Thanks in advance.
oogabooga's Avatar
Ambitious contributor
You have an extra backslash in your regex:
\<div\\sclass=\"Summary\">(.*?)\<\/div>
And you shouldn't have the backslashes before the quotes in your html:
<div class=\"Summary\">
thekevin07's Avatar, Join Date: Sep 2008
Go4Expert Member
That didn't work. Removing the \ before the " gave a compile error, and removing the \ before the \s didn't return anything.

thanks
oogabooga's Avatar
Ambitious contributor
Sorry, I didn't realize the html was a string constant.
I assumed you were reading from an actual html file,
which of course shouldn't have backslashes before the quotes.

But the extra backslash in your regex is definitely a problem.
Two backslashes in a row give a literal backslash, so you
have to remove one of them to give a proper \s.

Your problem may be that the period doesn't usually match
newlines. In Perl you make it do so with the /s option.
Perhaps there is something similar in boost.
oogabooga's Avatar
Ambitious contributor
It just occurred to me that since this is all taking place in C strings
you may have to double up all (or most) of the backslashes, like this:
\\<div\\sclass=\"Summary\">(.*?)\\<\\/div>
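The two layers of escaping (C++ string literal first, regex engine second) can be checked in isolation. The sketch below is only an illustration: it uses std::regex from C++11 as a stand-in for Boost, and patternFound is a made-up helper name, not something from the code in this thread.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns true if `pattern`, as the regex engine
// receives it after the compiler has processed the string literal,
// matches somewhere in `text`.
bool patternFound(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern);
    return std::regex_search(text, re);
}

// In source code:  "<div\\sclass"    -> engine sees  <div\sclass   (\s = whitespace)
// In source code:  "<div\\\\sclass"  -> engine sees  <div\\sclass  (literal backslash, then 's')
```

Called on the text <div class="Summary">, the first literal form should be found and the second should not, since the input contains no backslash. One caveat about doubling every backslash: as far as I can tell from the Boost Perl-syntax documentation, \< is a word-start assertion there rather than a literal <, so a doubled backslash in front of < is probably best left out.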
thekevin07's Avatar, Join Date: Sep 2008
Go4Expert Member
I tried that. I doubled up on everything and tried every variation I could think of. Sometimes, though, it does return this part:

Code:
<div class="Summary">
but nothing else, which is really odd.
oogabooga's Avatar
Ambitious contributor
One last thought. Try this:

boost::regex re ("<div\\sclass=\"Summary\">(.*?)</div>", boost::regex::mod_s);

If that doesn't work, try double-backslashes before the forward slash.
The mod_s switch ensures that the period can match newlines.
You've spurred me into installing boost (for the regexes if nothing else),
but I haven't done it yet, so I can't test it myself!
oogabooga's Avatar
Ambitious contributor
Hey thekev,
I've had dinner, installed boost, and I think I've found the problem.
Try this as your regex string:
".*<div\\sclass=\"Summary\">(.*?)</div>.*"
thekevin07's Avatar, Join Date: Sep 2008
Go4Expert Member
Hmm, that didn't work either. Same result as before: it just shows a blank line. Here is some more code. The first part is a function that returns the string I need based on the regex. I have tested this function with about 8 other regexes, so it should work with this one. The first parameter is the actual regex, the second is the buffer string that contains our HTML page, and the third clears out extra tags I don't need, but I have omitted that so it shouldn't be a problem. The second part is the actual call of the function with the regex we are trying to figure out.

Code:
string getListingData(string regexstring,string string1,string replaceregex)
{
	string content;
	boost::regex expression(regexstring, boost::regex::mod_s);
	boost::smatch match;
	expression.assign(regexstring, boost::regex_constants::icase);
	while(boost::regex_search(string1,match,expression,boost::match_not_dot_newline) )
	{
		content=match[0];
		//used to clear xtra chars on content
		//content=boost::regex_replace(content, boost::regex(replaceregex), "");
		string1 = match.suffix();
	}
	return content;
}
Code:
string tmp;
tmp=getListingData("<div\\sclass=\"Summary\">(.*?)</div>",string1,"");
cout<<tmp<<endl;
I have tried every variation of the regex you gave, but I still have the same problem.
thekevin07's Avatar, Join Date: Sep 2008
Go4Expert Member
Just in case, here are my includes as well:

Code:
#include <string>
#include <cstring>
#include <iostream>
#include "curl.h"
#include <sstream>
#include "boost/regex.hpp"
#include <iterator>
#include <boost/algorithm/string/regex.hpp>