Hi all, I'm fairly new to C and would like to know how I could use regular expressions, or some other parsing method, to scan through multiple files for specific values and store them in C variables. For example, a file contains name: "abcdef" ; id:12345 .. msgs... etc... name:"pqrs"; msg... id: 9876 .... and I want to traverse the whole file, grep ALL the name and id values, and store them in some kind of struct array in C for later use. How could I make use of regexec? How would I load the file as input so that I can scan through it and pick out the values I want? Any suggestions or sample code would be greatly appreciated. Thanks in advance.
I don't think C supports regular expressions natively. You'd need a regexp library for that. Here are some I found. http://www.tropicsoft.com/Components/RegularExpression/ http://sourceforge.net/project/showfiles.php?group_id=7586&package_id=8041 http://research.microsoft.com/projects/greta/ I'm not sure if the last one is free. As for how to use them I guess they should come with documentation.
You should consider the regularity of the format. Regex is not an efficient way to find things if there is a sensible alternative (substring searches, etc.). If the information is essentially chaotic, regex is hard to beat. Since your format apparently has labels, I would consider another method of parsing/tokenizing. You might, for instance, find that your data is a tabular representation. In other words, lines might represent rows containing the various fields. If there are a fixed number of fields, always in the same order, parsing can be relatively simple. If there are a variable number of fields, or if they are not in some fixed order, then the use of the labels as delimiters is definitely warranted.
Thanks for the suggestions. Now, the question I have is this: I am not sure a parsing/tokenizing method would work so well with my data. There is a lot of info that I would like to ignore. Below is a sample of my file content. ================================================================ #alert tcp !$ICCP_CLIENT any -> $ICCP_SERVER $ICCP_PORT (flow:from_client,established; content:"|03 00|"; depth:2; content:"|E0 00 00|"; distance:3; depth:3; msg:"ICCP - COTP Connection Request From Unauthorized Client"; reference:scada,1111401.htm; classtype:bad-unknown; sid:1111401; rev:1; priority:2 ########################################################################## alert tcp $ICCP_SERVER $ICCP_PORT -> !$ICCP_CLIENT any (flow:established; content:"|03 00|"; depth:2; content:"|D0|"; distance:3; depth:1; msg:"ICCP - Unauthorized COTP Connection Established"; reference:scada,1111402.htm; classtype:bad-unknown; sid:1111402; rev:1; priority:1 # # pass: 12/19/05 # # #[**] [1:1111402:1] ICCP - Unauthorized COTP Connection Established [**] #[Classification: Potentially Bad Traffic] [Priority: 1] #12/04-22:47:56.048574 10.10.10.101:102 -> 10.10.10.102:3075 #TCP TTL:128 TOS:0x0 ID:41486 IpLen:20 DgmLen:82 DF #***AP*** Seq: 0x4EF110A Ack: 0x4D7C50DC Win: 0x4446 TcpLen: 20 #[Xref => http://www.digitalbond.com/SCADA_security/Snort_rules/1111402.htm] # =============================================================== I only want to grep the msg label and the sid label, and only if the rule is not commented out. So in the above case, I only want to return the values msg: ICCP - Unauthorized COTP Connection Established, and sid: 1111402. If I don't use regex and choose another parsing/tokenizing method, which method would you suggest? At the beginning, I used a script to grep all the fields and store them in a file; however, I was asked to avoid calling scripts from C and to implement it using C library functions for performance reasons.
I am not too familiar with the parsing functions in C, so would I use strtok or strchr to scan the whole file, with the label names as delimiters? Could I use, say, "alert" as a delimiter and then scan the rest of the line with "msg" and "sid" as delimiters? It sounds like I would need to write my own function to parse these values, correct?
It isn't clear from the posted reproduction where the line endings are. I'm guessing that the first rule (the one starting with #alert) is all one line (terminated by newline, carriage return, or both), and that the second rule is also a single line. I base this on the presence of a single '#' in the first case. Commented content is probably defined as any line that begins with '#' (or has '#' as its first non-whitespace character). Knowing that is key information. Obviously, the first thing is to strip the data of all comments. That's a trivial thing to do. It appears that the items of interest are all delimited by labels. strtok works with a collection of delimiters, but its delimiters are single characters, not multiple-character entities like the labels. A regex to pick up a label would be simple to write (begins with whitespace, ends with ':'). It wouldn't be too hard to achieve that without regex. You mention "performance reasons." One can't tell from the context how important that is, or what is actually considered to separate poor performance from good performance. Perhaps you could clarify some of that and I could give you some example code. Incidentally, you can prevent the smilies from appearing in atypical, cluttered text (or code) by using the advanced posting option and disabling smilies. EDIT: You might take a cut of the file and attach it, so that copy/paste isn't involved and can't garble the file by dropping line endings and such. That would provide some relevant information.
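The comment test described in the post above can be sketched as follows. This is only an illustration: the function name is_comment_line is made up here, and it assumes that a comment is any line whose first non-whitespace character is '#', which the thread guesses but does not confirm.

```c
#include <ctype.h>
#include <stdbool.h>

/* Returns true when the first non-whitespace character of line is '#'.
   Assumption: that is how this rule file marks commented-out content. */
bool is_comment_line(const char *line)
{
    /* Skip leading blanks and tabs; isspace() needs an unsigned char value. */
    while (*line != '\0' && isspace((unsigned char)*line))
        line++;
    return *line == '#';
}
```

A caller would run this on each line it reads and simply skip the lines for which it returns true.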
Yes, I apologize for the confusion. All of that information is on the same line, and the line ends with a newline. If I were to use regex, doesn't it take an input string? How would I do that if I'm reading from a file? I am not entirely certain why the script is considered bad for performance either; however, I was told that it is best to use the C library functions instead of calling a script to do the same job. If I find out the reasons, I will let you know. Thanks for the tip. I'll keep that in mind in the future.
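On the question of feeding a file to regexec: the function works on a C string, so one option is to read each line into a buffer and hand that buffer to it. The sketch below assumes a POSIX system (regex.h is not part of standard C); the names find_sid and print_sids and the 4096-byte line limit are invented for illustration.

```c
#include <regex.h>   /* POSIX, not ISO C */
#include <stdio.h>
#include <string.h>

/* Copies the digits that follow "sid:" into out.
   Returns 1 on a match, 0 otherwise. */
int find_sid(const char *line, char *out, size_t outsz)
{
    regex_t re;
    regmatch_t m[2];   /* m[0] = whole match, m[1] = the parenthesised digits */
    int found = 0;

    if (outsz == 0 ||
        regcomp(&re, "sid:[[:space:]]*([0-9]+)", REG_EXTENDED) != 0)
        return 0;

    if (regexec(&re, line, 2, m, 0) == 0) {
        size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);
        if (len >= outsz)
            len = outsz - 1;          /* truncate rather than overflow */
        memcpy(out, line + m[1].rm_so, len);
        out[len] = '\0';
        found = 1;
    }
    regfree(&re);
    return found;
}

/* Reads fp one line at a time and prints every sid value found.
   Note: this does not skip commented-out rules. */
void print_sids(FILE *fp)
{
    char line[4096];   /* assumed upper bound on line length */
    char sid[32];

    while (fgets(line, sizeof line, fp) != NULL)
        if (find_sid(line, sid, sizeof sid))
            printf("sid = %s\n", sid);
}
```

Compiling the pattern on every call keeps the sketch short; a real program would probably call regcomp once and reuse the compiled regex_t for all lines.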
You don't operate on file data. You get it into memory, either whole or piecemeal, and work with it. My approach would be to read the file line by line, discard the comments, and put the result into a working file or retain it in memory. The approach depends upon your memory resources. I don't think I'd use regex (I might change my mind after additional thought, but not likely). I'd take the line as a string, locate the ':'s, backtrack to the whitespace, check the enveloped result for a match to the desired labels, and go from there. It's easy enough to do in C, easier in C++.
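The colon-and-backtrack idea from the post above might look roughly like this. It is a sketch under assumptions, not a finished parser: get_field is a hypothetical name, a label is assumed to start after whitespace, '(' or ';', and a colon inside a quoted value could still produce a false match.

```c
#include <ctype.h>
#include <string.h>

/* Looks for `label` (given without the colon) in line and copies its value
   into out. Quoted values are copied without the quotes; unquoted values
   run up to the next ';' or end of line. Returns 1 if found, 0 otherwise. */
int get_field(const char *line, const char *label, char *out, size_t outsz)
{
    size_t lablen = strlen(label);
    const char *colon = line;

    if (outsz == 0)
        return 0;

    while ((colon = strchr(colon, ':')) != NULL) {
        /* Backtrack from the colon to where the label starts. */
        const char *start = colon;
        while (start > line && !isspace((unsigned char)start[-1]) &&
               start[-1] != '(' && start[-1] != ';')
            start--;

        if ((size_t)(colon - start) == lablen &&
            strncmp(start, label, lablen) == 0) {
            const char *v = colon + 1;
            size_t len;

            while (isspace((unsigned char)*v))
                v++;
            if (*v == '"') {
                v++;
                len = strcspn(v, "\"");       /* up to the closing quote */
            } else {
                len = strcspn(v, ";\r\n");    /* up to ';' or line end */
            }
            if (len >= outsz)
                len = outsz - 1;
            memcpy(out, v, len);
            out[len] = '\0';
            return 1;
        }
        colon++;   /* not our label; continue after this colon */
    }
    return 0;
}
```

Because the whole label is compared, asking for "id" does not accidentally match "sid", which a plain substring search for "id:" would do.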
If I were to read the file line by line, I would use, say, fopen and fgets, correct? What functions could I use to remove the lines that start with a comment character, and I take it I would use the newline delimiter as the stopping condition? Once I use ":" as my delimiter, would I then match what precedes it against my label names using strcmp? If there is a match, how do I store the value I want? I am uncertain how to iterate through the string. Do you have any materials, links, or examples I could refer to? I have not perfected my string manipulation yet, so any help you could offer would be greatly appreciated. Thanks.
If you read a line with the normal invocation of fgets, it terminates on a newline. It will terminate earlier if you give it a maximum length that is shorter than the line. If you copy non-comment lines to a temporary file, and don't copy comment lines, you have discarded them, right? You are admittedly not a C expert. That doesn't matter at this point. You go sit behind the barn and watch the cotton grow while thinking about your problem, in logical terms. Once you understand what you have to do, in logical terms, given your data, then you translate those operations into the language of your choice. One of the reasons we are here is to help you do that last part correctly. We also help you with the first part, if you need it. It's called 'design.'
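Once the design is settled, the fgets loop discussed above could be translated along these lines. Everything named here (struct rule, load_rules, the buffer sizes) is an assumption made for the sketch, and it keeps the results in memory instead of a temporary file. It also assumes each rule fits on one line shorter than the buffer and that msg is always quoted.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

struct rule {
    char msg[256];   /* text between the quotes after msg: */
    long sid;        /* number after sid: */
};

/* Reads fp line by line, skips comment and blank lines, and stores the msg
   and sid of each remaining line in rules[]. Returns the number stored. */
size_t load_rules(FILE *fp, struct rule *rules, size_t max)
{
    char line[4096];   /* assumed upper bound on one rule line */
    size_t n = 0;

    while (n < max && fgets(line, sizeof line, fp) != NULL) {
        const char *p = line;
        const char *msg, *sid;

        while (isspace((unsigned char)*p))
            p++;
        if (*p == '#' || *p == '\0')
            continue;                  /* comment or blank line */

        /* Simple substring search; assumes these labels do not also
           appear inside some other quoted value on the line. */
        msg = strstr(p, "msg:\"");
        sid = strstr(p, " sid:");
        if (msg != NULL && sid != NULL &&
            sscanf(msg, "msg:\"%255[^\"]", rules[n].msg) == 1 &&
            sscanf(sid, " sid:%ld", &rules[n].sid) == 1)
            n++;                       /* keep only lines with both fields */
    }
    return n;
}
```

A caller would open the file with fopen, pass the FILE pointer and an array such as struct rule rules[1000], and then loop over the returned count.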
Thanks. I think I will retrace my steps and clarify what is needed first, then figure out the next steps afterwards.