how to parse fields in a file and store them in C struc

cky1123 · Jan 17, 2007

hi all,

I'm fairly new to C and would like to know how I could use regular expressions, or some other parsing method, to scan through multiple files for specific values and store them in C variables.

For example, the file contains: name: "abcdef" ; id:12345 .. msgs... etc... name:"pqrs"; msg... id: 9876 ....

I want to traverse the whole file, grep ALL the name and id values, and store them in some type of struct array in C for later use.

How could I make use of regexec? How would I load the file as input so that I could scan through it and pick out the values I want? Any suggestions or sample code would be greatly appreciated. Thanks in advance.

ever_thus · Jan 17, 2007

I don't think C supports regular expressions natively. You'd need a regexp library for that. Here are some I found:
http://www.tropicsoft.com/Components/RegularExpression/
http://sourceforge.net/project/showfiles.php?group_id=7586&package_id=8041
http://research.microsoft.com/projects/greta/
I'm not sure if the last one is free. As for how to use them, I guess they should come with documentation.
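
Editor's note: the question asks specifically about regexec. That function belongs to the POSIX <regex.h> interface (regcomp, regexec, regfree), which is not part of ISO C but ships with most Unix-like systems. It works on a string already in memory, for example a line read with fgets. Below is a minimal sketch, assuming a POSIX platform; the helper name regex_capture and the patterns in the usage note are illustrative, not something the posters supplied.

```c
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Copy the first parenthesized group of `pattern` found in `line` into `out`.
   Returns 0 on success, -1 if the pattern fails to compile or does not match.
   `regex_capture` is a hypothetical helper name. */
static int regex_capture(const char *line, const char *pattern,
                         char *out, size_t outsz)
{
    regex_t re;
    regmatch_t m[2];   /* m[0] = whole match, m[1] = first group */
    int rc = -1;

    if (outsz == 0 || regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    if (regexec(&re, line, 2, m, 0) == 0 && m[1].rm_so != -1) {
        size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);
        if (len >= outsz)
            len = outsz - 1;            /* truncate to fit the buffer */
        memcpy(out, line + m[1].rm_so, len);
        out[len] = '\0';
        rc = 0;
    }
    regfree(&re);
    return rc;
}
```

With the sample text from the first post, a call such as regex_capture(line, "name: *\"([^\"]*)\"", buf, sizeof buf) would be expected to capture the name, and "id: *([0-9]+)" the id. Compiling the pattern on every call keeps the sketch short; code that scans many lines would more likely compile once and reuse the regex_t.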

DaWei · Jan 17, 2007

You should consider the regularity of the format. Regex is not an efficient way to find things if there is a sensible alternative (substring searches, etc.). If the information is essentially chaotic, regex is hard to beat. Since your format apparently has labels, I would consider another method of parsing/tokenizing. You might, for instance, find that your data is a tabular representation. In other words, lines might represent rows containing the various fields. If there are a fixed number of fields, always in the same order, parsing can be relatively simple. If there are a variable number of fields, or if they are not in some fixed order, then the use of the labels as delimiters is definitely warranted.
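
Editor's note: as one illustration of the substring-search alternative mentioned above, the sketch below uses strstr to locate a label and then copies the value that follows it. The function name value_after_label and the rules for where a value ends (closing quote, or the next ';' or whitespace) are assumptions made for the example, based only on the sample text in the first post.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Find `label` (e.g. "name:") in `text` and copy the value after it into `out`.
   A quoted value runs to the closing quote; an unquoted one runs to the next
   ';' or whitespace. Returns a pointer just past the value so the caller can
   continue scanning, or NULL if the label is not found.
   `value_after_label` is a hypothetical helper name. */
static const char *value_after_label(const char *text, const char *label,
                                     char *out, size_t outsz)
{
    const char *p = strstr(text, label);
    size_t n = 0;

    if (p == NULL || outsz == 0)
        return NULL;

    p += strlen(label);
    while (*p == ' ' || *p == '\t')     /* allow "id: 9876" as well as "id:12345" */
        p++;

    if (*p == '"') {
        p++;                            /* skip the opening quote */
        while (*p != '\0' && *p != '"') {
            if (n + 1 < outsz)
                out[n++] = *p;
            p++;
        }
        if (*p == '"')
            p++;                        /* skip the closing quote */
    } else {
        while (*p != '\0' && *p != ';' && !isspace((unsigned char)*p)) {
            if (n + 1 < outsz)
                out[n++] = *p;
            p++;
        }
    }
    out[n] = '\0';
    return p;
}
```

Calling it repeatedly, each time passing the returned pointer as the new starting point, walks through every name and id in the text. One caveat of a plain substring search: "id:" also matches inside a longer label such as "sid:", so the label boundaries need checking if the data contains both.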

cky1123 · Jan 18, 2007

Thanks for the suggestions.

DaWei said:

You should consider the regularity of the format. Regex is not an efficient way to find things if there is a sensible alternative (substring searches, etc.). If the information is essentially chaotic, regex is hard to beat. Since your format apparently has labels, I would consider another method of parsing/tokenizing. You might, for instance, find that your data is a tabular representation. In other words, lines might represent rows containing the various fields. If there are a fixed number of fields, always in the same order, parsing can be relatively simple. If there are a variable number of fields, or if they are not in some fixed order, then the use of the labels as delimiters is definitely warranted.

Now, the question I have is this: I am not sure a parsing/tokenizing method would work well with my data, because there is a lot of information that I would like to ignore. Below is a sample of my file content.

================================================================
#alert tcp !$ICCP_CLIENT any -> $ICCP_SERVER $ICCP_PORT (flow:from_client,established; content:"|03 00|"; depth:2; content:"|E0 00 00|"; distance:3; depth:3; msg:"ICCP - COTP Connection Request From Unauthorized Client"; reference:scada,1111401.htm; classtype:bad-unknown; sid:1111401; rev:1; priority:2
##########################################################################
alert tcp $ICCP_SERVER $ICCP_PORT -> !$ICCP_CLIENT any (flow:established; content:"|03 00|"; depth:2; content:"|D0|"; distance:3; depth:1; msg:"ICCP - Unauthorized COTP Connection Established"; reference:scada,1111402.htm; classtype:bad-unknown; sid:1111402; rev:1; priority:1
#
# pass: 12/19/05
#
#
#[**] [1:1111402:1] ICCP - Unauthorized COTP Connection Established [**]
#[Classification: Potentially Bad Traffic] [Priority: 1]
#12/04-22:47:56.048574 10.10.10.101:102 -> 10.10.10.102:3075
#TCP TTL:128 TOS:0x0 ID:41486 IpLen:20 DgmLen:82 DF
#***AP*** Seq: 0x4EF110A Ack: 0x4D7C50DC Win: 0x4446 TcpLen: 20
#[Xref => http://www.digitalbond.com/SCADA_security/Snort_rules/1111402.htm]
#
===============================================================

I want to grep the msg label and the sid label only if the rule is not commented out. So in the above case, I only want to return the values msg: ICCP - Unauthorized COTP Connection Established, and sid: 1111402.

If I don't use regex and choose another parsing/tokenizing method, which method would you suggest? At the beginning, I used a script to grep all the fields and store them in a file. However, I was asked to avoid calling scripts from C and to implement it with C library functions for performance reasons. I am not too familiar with the parsing functions in C, so would I use strtok or strchr to scan the whole file, with the label names as delimiters? Would I be able to use a delimiter of, say, "alert", then scan through the rest of the line with the delimiters "msg" and "sid"?

It sounds like I would need to write my own function to parse these values, correct?

DaWei · Jan 18, 2007

It isn't clear, from a posted reproduction, where the line endings are. I'm guessing that

#alert tcp !$ICCP_CLIENT any -> $ICCP_SERVER $ICCP_PORT (flow:from_client,established; content:"|03 00|"; depth:2; content:"|E0 00 00|"; distance:3; depth:3; msg:"ICCP - COTP Connection Request From Unauthorized Client"; reference:scada,1111401.htm; classtype:bad-unknown; sid:1111401; rev:1; priority:2

is all one line (terminated by newline, carriage return, or both), and that

alert tcp $ICCP_SERVER $ICCP_PORT -> !$ICCP_CLIENT any (flow:established; content:"|03 00|"; depth:2; content:"|D0|"; distance:3; depth:1; msg:"ICCP - Unauthorized COTP Connection Established"; reference:scada,1111402.htm; classtype:bad-unknown; sid:1111402; rev:1; priority:1

is also a single line. I base this on the presence of a single '#' in the first case. Commented content is probably defined as any line that begins with '#' (or has '#' as its first non-whitespace character).
Knowing that is key information.

Obviously, the first thing is to strip the data of all comments. That's a trivial thing to do.
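
Editor's note: the comment test described above (a line whose first non-whitespace character is '#') can be sketched in a few lines. The function name is_comment_line is an assumption for illustration; the rule itself is the one guessed at in this post and confirmed later in the thread.

```c
#include <ctype.h>

/* Return 1 if the first non-whitespace character of `line` is '#', else 0.
   An empty or all-whitespace line is not treated as a comment here.
   `is_comment_line` is a hypothetical helper name. */
static int is_comment_line(const char *line)
{
    while (isspace((unsigned char)*line))
        line++;
    return *line == '#';
}
```

A '#' that appears later in a line does not count, so a rule containing '#' inside its options would still be kept.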

It appears that items of interest are all delimited by labels. strtok works with a collection of delimiters, but the delimiters are not multiple-character entities, like the labels. Regex to pick up a label would be simple to write (begins with whitespace, ends with ':'). It wouldn't be too hard to achieve that without regex. You mention "performance reasons." One can't tell from the context how important that is or what is actually considered to separate poor performance from good performance.

Perhaps you could clarify some of that and I could give you some example code. Incidentally, you can prevent the smilies from appearing in atypical, cluttered text (or code) by using the advanced posting option and disabling smilies.

EDIT: You might take a cut of the file and attach it, so that copy/paste isn't involved and can't garble the file with spurious line endings and such. That would provide some relevant information.

cky1123 · Jan 18, 2007

DaWei said:

It isn't clear, from a posted reproduction, where the line endings are. I'm guessing that is all one line (terminated by newline, carriage return, or both), and that is also a single line. I base this on the presence of a single '#' in the first case. Commented content is probably defined as any line that begins with '#' (or has '#' as its first non-whitespace character).
Knowing that is key information.

Yes, I apologize for the confusion. All of that information is on the same line, which ends with a newline.

DaWei said:

Obviously, the first thing is to strip the data of all comments. That's a trivial thing to do.

It appears that items of interest are all delimited by labels. strtok works with a collection of delimiters, but the delimiters are not multiple-character entities, like the labels. Regex to pick up a label would be simple to write (begins with whitespace, ends with ':'). It wouldn't be too hard to achieve that without regex.

If I were to use regex, doesn't it take a string as input? How would I be able to do that if I'm reading from a file?

DaWei said:

You mention "performance reasons." One can't tell from the context how important that is or what is actually considered to separate poor performance from good performance.

I am not entirely certain why it is considered bad performance either. However, I was told that it is best to use C library functions instead of calling a script to do the same job. If I find out the reasons, I will let you know.

DaWei said:

Perhaps you could clarify some of that and I could give you some example code. Incidentally, you can prevent the smilies from appearing in atypical, cluttered text (or code) by using the advanced posting option and disabling smilies.

EDIT: You might take a cut of the file and attach it so that copy/paste wouldn't be involved, and garfle up the file with non-existent line endings, and such. That would provide some relevant information.

Thanks for the tip. I'll keep that in mind in the future.

DaWei · Jan 18, 2007

You don't operate on file data. You get it into memory, either whole or piecemeal, and work with it. My approach would be to read the file line by line, discard the comments, and put the result into a working file or retain it in memory. The approach depends upon your memory resources.
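
Editor's note: for the "whole file in memory" option mentioned above, one common pattern is to read the stream in chunks with fread and grow the buffer as needed. The sketch below assumes the file fits comfortably in memory; the function name read_all and the starting capacity are arbitrary choices for the example.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read everything remaining in `fp` into a malloc'd, NUL-terminated buffer.
   Stores the byte count in *len_out when len_out is non-NULL.
   Returns NULL on allocation or read failure; the caller frees the result.
   `read_all` is a hypothetical helper name. */
static char *read_all(FILE *fp, size_t *len_out)
{
    size_t cap = 4096, len = 0, n;
    char *buf = malloc(cap);

    if (buf == NULL)
        return NULL;

    /* Always leave one byte free for the terminating NUL. */
    while ((n = fread(buf + len, 1, cap - len - 1, fp)) > 0) {
        len += n;
        if (cap - len - 1 == 0) {       /* buffer full: double it */
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
    }
    if (ferror(fp)) {
        free(buf);
        return NULL;
    }
    buf[len] = '\0';
    if (len_out != NULL)
        *len_out = len;
    return buf;
}
```

The returned buffer can then be scanned with ordinary string functions. If memory is tight, reading line by line is the piecemeal alternative.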

I don't think I'd use regex (I might change my mind after additional thought, but not likely). I'd take the line as a string, locate the ':'s, backtrack to the whitespace, check the enveloped result for a match to the desired labels, and go from there. It's easy enough to do in C, easier in C++.
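
Editor's note: the steps described above (locate each ':', backtrack to the preceding whitespace, compare the enclosed text with the wanted label) might look roughly like the following. This is a sketch under assumptions, not DaWei's code: the name find_field is invented, values are taken to end at a closing quote or at ';', and colons inside quoted strings are skipped, a detail the post does not mention.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Search `line` for `label` followed by ':' and copy its value into `out`.
   Returns 1 if the label was found, 0 otherwise.
   `find_field` is a hypothetical helper name. */
static int find_field(const char *line, const char *label,
                      char *out, size_t outsz)
{
    size_t lablen = strlen(label);
    int in_quotes = 0;
    const char *p;

    if (outsz == 0)
        return 0;

    for (p = line; *p != '\0'; p++) {
        if (*p == '"') {
            in_quotes = !in_quotes;     /* ignore colons inside quoted text */
        } else if (*p == ':' && !in_quotes) {
            const char *start = p;

            /* Backtrack to the whitespace (or line start) before the label. */
            while (start > line && !isspace((unsigned char)start[-1]))
                start--;

            if ((size_t)(p - start) == lablen &&
                strncmp(start, label, lablen) == 0) {
                const char *v = p + 1;
                char end = ';';         /* unquoted values end at ';' */
                size_t n = 0;

                if (*v == '"') {        /* quoted values end at the closing quote */
                    end = '"';
                    v++;
                }
                while (*v != '\0' && *v != end) {
                    if (n + 1 < outsz)
                        out[n++] = *v;
                    v++;
                }
                out[n] = '\0';
                return 1;
            }
        }
    }
    return 0;
}
```

On the uncommented sample rule, find_field(line, "msg", ...) and find_field(line, "sid", ...) should pick out the two values of interest. Because the backtracking stops only at whitespace, an option that directly follows the opening parenthesis, such as "(flow:", is seen as the label "(flow" and will not match "flow"; that is acceptable here since only msg and sid are wanted.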

cky1123 · Jan 18, 2007

If I were to read the file line by line, I would use, say, fopen and fgets, correct? What functions could I use to remove the lines that start with a comment character? And I take it I would use the newline delimiter as a stopping condition?

Once I use ":" as my delimiter, would I then match the text before it against my label name using strcmp? If there is a match, how do I store the value I want? I am uncertain how to iterate through the string. Do you have any materials, links or examples I could refer to? I have not perfected my string manipulation yet, so any help you could offer would be greatly appreciated. Thanks.

DaWei · Jan 18, 2007

If you read a line with the normal invocation of fgets, it terminates on a newline. It will terminate earlier if you give it a maximum length that is shorter than the line. If you copy non-comment lines to a temporary file, and don't copy comment lines, you have discarded them, right?
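
Editor's note: putting the pieces of this thread together, a line-by-line reader that skips comments and stores msg and sid in an array of structs could be sketched as below. The struct layout, the names load_rules, MAX_LINE and MAX_MSG, and the buffer sizes are all assumptions for illustration. Instead of copying to a temporary file, the sketch simply skips comment lines in the loop. As noted above, fgets stops early when a line is longer than the buffer, so a rule longer than MAX_LINE would be split and handled incorrectly here.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LINE 2048    /* assumed upper bound on the length of one rule */
#define MAX_MSG  256     /* assumed upper bound on the msg text */

struct rule {            /* hypothetical record for one uncommented rule */
    long sid;
    char msg[MAX_MSG];
};

/* Read rules from `fp`, skipping comment and blank lines, and fill `rules`
   (capacity `max`). Returns the number of rules stored. */
static size_t load_rules(FILE *fp, struct rule *rules, size_t max)
{
    char line[MAX_LINE];
    size_t count = 0;

    while (count < max && fgets(line, sizeof line, fp) != NULL) {
        const char *p = line;
        const char *msg, *sid, *end;
        size_t len;

        while (isspace((unsigned char)*p))
            p++;
        if (*p == '#' || *p == '\0')
            continue;                   /* comment or blank line */

        msg = strstr(p, "msg:\"");
        sid = strstr(p, "sid:");
        if (msg == NULL || sid == NULL)
            continue;                   /* no msg/sid pair on this line */

        msg += 5;                       /* step past msg:" */
        end = strchr(msg, '"');
        if (end == NULL)
            continue;                   /* unterminated msg string */

        len = (size_t)(end - msg);
        if (len >= MAX_MSG)
            len = MAX_MSG - 1;          /* truncate to fit */
        memcpy(rules[count].msg, msg, len);
        rules[count].msg[len] = '\0';
        rules[count].sid = strtol(sid + 4, NULL, 10);
        count++;
    }
    return count;
}
```

A caller would open the rules file with fopen, pass the FILE pointer and an array such as struct rule rules[1000] to load_rules, and close the file afterwards. The plain strstr lookups are a simplification: they would be fooled by a msg text that itself contains "sid:".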

You are admittedly not a C expert. That doesn't matter at this point. You go sit behind the barn and watch the cotton grow while thinking about your problem, in logical terms. Once you understand what you have to do, in logical terms, given your data, then you translate those operations into the language of your choice. One of the reasons we are here is to help you do that last part correctly. We also help you with the first part, if you need it. It's called 'design.'

cky1123 · Jan 19, 2007

Thanks. I think I will retrace my steps and clarify what is needed first, then figure out the next steps afterwards.
