Thanks for the suggestions.

Quote:
Originally Posted by DaWei
You should consider the regularity of the format. Regex is not an efficient way to find things if there is a sensible alternative (substring searches, etc.). If the information is essentially chaotic, regex is hard to beat. Since your format apparently has labels, I would consider another method of parsing/tokenizing. You might, for instance, find that your data is a tabular representation. In other words, lines might represent rows containing the various fields. If there are a fixed number of fields, always in the same order, parsing can be relatively simple. If there are a variable number of fields, or if they are not in some fixed order, then the use of the labels as delimiters is definitely warranted.
Now, the question I have is this: I am not sure the parsing/tokenizing method would work well with my data. There is a lot of information that I would like to ignore. Below is a sample of my file content.

================================================================
#alert tcp !$ICCP_CLIENT any -> $ICCP_SERVER $ICCP_PORT (flow:from_client,established; content:"|03 00|"; depth:2; content:"|E0 00 00|"; distance:3; depth:3; msg:"ICCP - COTP Connection Request From Unauthorized Client"; reference:scada,1111401.htm; classtype:bad-unknown; sid:1111401; rev:1; priority:2
##########################################################################
alert tcp $ICCP_SERVER $ICCP_PORT -> !$ICCP_CLIENT any (flow:established; content:"|03 00|"; depth:2; content:"|D0|"; distance:3; depth:1; msg:"ICCP - Unauthorized COTP Connection Established"; reference:scada,1111402.htm; classtype:bad-unknown; sid:1111402; rev:1; priority:1
#
# pass: 12/19/05
#
#
#[**] [1:1111402:1] ICCP - Unauthorized COTP Connection Established [**]
#[Classification: Potentially Bad Traffic] [Priority: 1]
#12/04-22:47:56.048574 10.10.10.101:102 -> 10.10.10.102:3075
#TCP TTL:128 TOS:0x0 ID:41486 IpLen:20 DgmLen:82 DF
#***AP*** Seq: 0x4EF110A Ack: 0x4D7C50DC Win: 0x4446 TcpLen: 20
#[Xref => http://www.digitalbond.com/SCADA_sec...s/1111402.htm]
#
===============================================================

I only want to extract the msg and sid values, and only if the rule is not commented out. So in the above case, I want to return only msg: ICCP - Unauthorized COTP Connection Established and sid: 1111402.

If I don't use regex and choose another parsing/tokenizing method, which method would you suggest? At the beginning, I used a script to grep all the fields and store them in a file. However, I was asked to avoid using scripts from my C code and to implement it with C library functions for performance reasons. I am not too familiar with the parsing functions in C, so would I use strtok or strchr to scan the whole file, with the label names as delimiters? Could I use, say, "alert" as a delimiter and then scan through the rest of the line with "msg" and "sid" as delimiters?

It sounds like I would need to write my own function to parse these values, correct?
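
To make the question concrete, here is a rough sketch of the kind of function I imagine, using only fgets, strstr and strchr from the standard C library rather than strtok. The names copy_field, parse_rule and scan_rules are ones I made up for illustration. It assumes each rule sits on a single line, that active rules begin with "alert" (so anything starting with '#' is skipped), that the msg text has no embedded double quote, and that the string "sid:" does not appear earlier in the line than the real sid option. Treat it as a starting point, not a finished parser.

```c
#include <stdio.h>
#include <string.h>
#include <ctype.h>

/* Copy the text that follows `label` in `line`, up to (not including) the
 * first `end` character, into `out`. Returns 1 on success, 0 if not found. */
static int copy_field(const char *line, const char *label, char end,
                      char *out, size_t outsz)
{
    const char *start = strstr(line, label);
    const char *stop;
    size_t len;

    if (start == NULL || outsz == 0)
        return 0;
    start += strlen(label);          /* step past the label itself */
    stop = strchr(start, end);
    if (stop == NULL)
        return 0;
    len = (size_t)(stop - start);
    if (len >= outsz)
        len = outsz - 1;             /* truncate rather than overflow */
    memcpy(out, start, len);
    out[len] = '\0';
    return 1;
}

/* Returns 1 and fills msg/sid if `line` is an active (uncommented) alert
 * rule containing both fields; returns 0 otherwise. */
int parse_rule(const char *line, char *msg, size_t msgsz,
               char *sid, size_t sidsz)
{
    while (isspace((unsigned char)*line))
        line++;
    if (strncmp(line, "alert ", 6) != 0)   /* skips '#' comments and blanks */
        return 0;
    return copy_field(line, "msg:\"", '"', msg, msgsz) &&
           copy_field(line, "sid:", ';', sid, sidsz);
}

/* Print msg and sid for every active rule read from fp; returns the count.
 * Assumption: no rule line is longer than the 4096-byte buffer. */
int scan_rules(FILE *fp)
{
    char line[4096], msg[512], sid[32];
    int count = 0;

    while (fgets(line, sizeof line, fp) != NULL) {
        if (parse_rule(line, msg, sizeof msg, sid, sizeof sid)) {
            printf("msg: %s, sid: %s\n", msg, sid);
            count++;
        }
    }
    return count;
}
```

The idea would be to open the rules file with fopen and pass the FILE pointer to scan_rules. I went with strstr instead of strtok here because strtok modifies the buffer and splits on single characters, while the labels I care about are multi-character strings.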