Could anyone please help me compare two files containing lines of text? The two files contain similar data, except that one contains duplicates while the other contains only unique lines. I have to make sure that the file with the unique data contains all the lines present in the file with the duplicates. Both files look something like the sample below, and I only have to compare the lines that contain the substring CME, which appears right before the first comma in the line.

Code:
BREACH:0:40:GE:20100601-07:34:22.796
0:1:ORDER ID:0000D9DB
0:2:ORDER ID:0000D9DC
0:1:TRDR:GRC
0:2:TRDR:GRC
0:0:TRADE CROSSING IDS:14ijpol1vsu7l7,yfrea51dwd8jx
0:1:OrderReceive,01.06.2010 07:34:07.875,0G4LHZ013,0000D9DB,A,50,0,50,0,99155
0:1:OrderReceive,01.06.2010 07:26:00.400,0G4LHZ013,0000D9DB,A,50,0,50,0,99155
0:2:OrderReceive,01.06.2010 07:34:10.578,0G4LHZ014,0000D9DC,A,50,0,50,0,140
0:2:OrderReceive,01.06.2010 07:26:03.103,0G4LHZ014,0000D9DC,A,50,0,50,0,140
0:1:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,14ijpol1vsu7l7,Fill,0000D9DB,B,00000,99.155,2,99.155,20100601,07:27:34
0:1:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,1iyucih1bcmpso,Fill,0000D9DB,B,00000,99.155,1,99.155,20100601,07:27:34
0:1:CME,20100601,07:34:23.359,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,xl7a2t1fwh479,Fill,0000D9DB,B,00000,99.155,17,99.155,20100601,07:27:34
0:1:CME,20100601,07:34:24.406,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,1ynu8nhiqkwyz,Fill,0000D9DB,B,00000,99.155,30,99.155,20100601,07:27:35
0:0:WASH-ORD-TIME-DIFF,2.703
0:0:2nd-ORD-TO-WASH-TIME-DIFF,499.693
BREACH:1:40:GE:20100601-07:34:22.796
1:1:ORDER ID:0000D9DB
1:2:ORDER ID:0000D9DC
1:1:TRDR:GRC
1:2:TRDR:GRC
1:0:TRADE CROSSING IDS:1iyucih1bcmpso,d88hmz15psx80
1:1:OrderReceive,01.06.2010 07:34:07.875,0G4LHZ013,0000D9DB,A,50,0,50,0,99155
1:1:OrderReceive,01.06.2010 07:26:00.400,0G4LHZ013,0000D9DB,A,50,0,50,0,99155
1:2:OrderReceive,01.06.2010 07:34:10.578,0G4LHZ014,0000D9DC,A,50,0,50,0,140
1:2:OrderReceive,01.06.2010 07:26:03.103,0G4LHZ014,0000D9DC,A,50,0,50,0,140
1:1:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,14ijpol1vsu7l7,Fill,0000D9DB,B,00000,99.155,2,99.155,20100601,07:27:34
1:1:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,1iyucih1bcmpso,Fill,0000D9DB,B,00000,99.155,1,99.155,20100601,07:27:34
1:1:CME,20100601,07:34:23.359,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,xl7a2t1fwh479,Fill,0000D9DB,B,00000,99.155,17,99.155,20100601,07:27:34
1:1:CME,20100601,07:34:24.406,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,1ynu8nhiqkwyz,Fill,0000D9DB,B,00000,99.155,30,99.155,20100601,07:27:35
1:0:WASH-ORD-TIME-DIFF,2.703
1:0:2nd-ORD-TO-WASH-TIME-DIFF,499.693
BREACH:40:40:GE:20100601-14:08:35.406
40:1:ORDER ID:0000D9XN
40:2:ORDER ID:0000DBHJ
40:1:TRDR:DAF
40:2:TRDR:DAF
40:0:TRADE CROSSING IDS:4hr6iu6smidw,1t6btger8juyg
40:1:OrderReceive,01.06.2010 09:58:50.031,0323YK058,0000D9XN,A,25,0,25,0,-35
40:1:OrderReceive,01.06.2010 09:50:42.290,0323YK058,0000D9XN,A,25,0,25,0,-35
40:2:OrderReceive,01.06.2010 14:07:29.062,0323YK153,0000DBHJ,A,7,0,7,0,160
40:2:OrderReceive,01.06.2010 13:59:20.853,0323YK153,0000DBHJ,A,7,0,7,0,160
40:1:CME,20100601,12:45:46.250,DAF,GE,201012,FUT,XGDAF,0323YK058,1kxdklb1oghepj,Leg Fill,0000D9XN,B,00000,-3.5,2,99,20100601,12:38:57
40:1:CME,20100601,12:45:46.250,DAF,GE,201103,FUT,XGDAF,0323YK058,1v0f5tr1da5176,Leg Fill,0000D9XN,S,00000,-3.5,2,98.875,20100601,12:38:57
40:1:CME,20100601,12:45:46.250,DAF,GE,201103,FUT,XGDAF,0323YK058,61wcagnnjmjl,Leg Fill,0000D9XN,S,00000,-3.5,2,98.875,20100601,12:38:57
40:1:CME,20100601,12:45:46.265,DAF,GE,201106,FUT,XGDAF,0323YK058,v6d4b2agp1hr,Leg Fill,0000D9XN,B,00000,-3.5,2,98.715,20100601,12:38:57
40:0:WASH-ORD-TIME-DIFF,14918.6
40:0:2nd-ORD-TO-WASH-TIME-DIFF,554.553
BREACH:101:30:GE:20100601-07:18:05.015
101:1:ORDER ID:0001U8QR
101:2:ORDER ID:0001W0PJ
101:1:TRDR:MTJ
101:2:TRDR:FDC
101:0:TRADE CROSSING IDS:1ua0o7twia2cx,p3mqxj1it2iao
101:1:OrderReceive,01.06.2010 07:18:05.015,082X7Y007,0001U8QR,A,1,0,1,0,99025
101:1:OrderReceive,01.06.2010 07:09:57.556,082X7Y007,0001U8QR,A,1,0,1,0,99025
101:2:OrderReceive,01.06.2010 07:18:04.468,0323X8076,0001W0PJ,A,10,0,10,0,145
101:2:OrderReceive,01.06.2010 07:09:57.009,0323X8076,0001W0PJ,A,10,0,10,0,145
101:1:CME,20100601,07:18:05.015,MTJ,GE,201012,FUT,XGMTJ,082X7Y007,1ua0o7twia2cx,Fill,0001U8QR,S,00000,99.025,1,99.025,20100601,07:11:16
101:1:CME,20100601,07:09:57.556,MTJ,GE,201012,FUT,XGMTJ,082X7Y007,1ua0o7twia2cx,Fill,0001U8QR,S,00000,99.025,1,99.025,20100601,07:11:16
101:2:CME,20100601,07:18:04.468,FDC,GE,201009,FUT,XGFDC,0323X8076,1t6wkc41c2ki0m,Leg Fill,0001W0PJ,S,00000,14.5,3,99.17,20100601,07:11:15
101:2:CME,20100601,07:18:04.468,FDC,GE,201012,FUT,XGFDC,0323X8076,15vxdjj1r1imja,Leg Fill,0001W0PJ,B,00000,14.5,3,99.025,20100601,07:11:15
101:0:CROSS-ORD-TIME-DIFF,0.547
"I have to make sure if the file having unique data contains all the lines present in the file containing duplicates." OK, so the comparison fails if the file U having unique data does not contain one or more of the lines present in the file D containing duplicates. So if we can find a line in D that is not in U, the comparison fails. Question: does U have to contain *only* the lines present in D? U={1,2,3,4} D={4,3,2,1,4,3,2,1} comparison succeeds, of course. D={1,1,2,3,4,5} fail: U does not contain 5. D={1,1,2,3} does this succeed or fail? U contains all lines present in D, i.e. 1,2,3, but D does not contain 4. Assuming this last one succeeds, this algorithm should do the trick: for each line in the "duplicates" file D check if that line is present in the "uniques" file U if it isn't, you can terminate the loop early with "comparison failed" - i.e., you've found a line in the D file that isn't in U. You can do fancy things like cache the U file in memory, store the lines in a linked list (optionally sorted) etc, so if you're expected to do all that then do, but if you're not and the test is *only* to do the comparison, you'll probably find that the above algorithm will complete in next to no time just using standard file IO, unless the files are enormous (but then, they might be too big to fit in RAM).
"OK, so the comparison fails if the file U having unique data does not contain one or more of the lines present in the file D containing duplicates" that's TRUE "So if we can find a line in D that is not in U, the comparison fails." TRUE "Question: does U have to contain *only* the lines present in D?" YES because it is basically extracted from D "D={1,1,2,3} does this succeed or fail? U contains all lines present in D, i.e. 1,2,3, but D does not contain 4." it fails because U contains an entry which is not present in its parent or D file I dont know how to start searching from the lines that contains CME in it? Because there are lines which do not contains CME in it. Code: 0:1:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,14ijpol1vsu7l7,Fill,0000D9DB,B,00000,99.155,2,99.155,20100601,07:27:34 0:1:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,1iyucih1bcmpso,Fill,0000D9DB,B,00000,99.155,1,99.155,20100601,07:27:34 0:1:CME,20100601,07:34:23.359,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,xl7a2t1fwh479,Fill,0000D9DB,B,00000,99.155,17,99.155,20100601,07:27:34 0:1:CME,20100601,07:34:24.406,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,1ynu8nhiqkwyz,Fill,0000D9DB,B,00000,99.155,30,99.155,20100601,07:27:35
strstr checks whether a string contains a substring. If a line doesn't contain CME, just exclude it from the comparison.

OK, so my first algorithm won't work, because it wouldn't fail for D = {1, 1, 2, 3}. I'll need to think up a new one...
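For the CME filtering, a minimal sketch: strstr returns NULL when the substring is absent, so lines without CME can be skipped before they ever enter the comparison. The helper names here are illustrative, not from the original post.

```c
#include <string.h>

/* A line takes part in the comparison only if it contains "CME". */
int is_cme_line(const char *line)
{
    return strstr(line, "CME") != NULL;
}

/* Copies the CME lines from src into dst and returns how many were
   kept. When reading from a file, apply the same is_cme_line() test
   to each line fetched with fgets before storing or comparing it. */
int filter_cme(const char *src[], int n, const char *dst[])
{
    int kept = 0;
    for (int i = 0; i < n; i++)
        if (is_cme_line(src[i]))
            dst[kept++] = src[i];
    return kept;
}
```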
OK, here's a quick one. Read both U and D into sorted linked lists, but don't add duplicate lines to the D list. The two lists should then match exactly; if there is any difference (one list shorter than the other, or some element U[n] != D[n]), the comparison fails.
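A sketch of that idea in C, using sorted arrays rather than linked lists (the principle is identical and it keeps the example short); the function names and the fixed array sizes are assumptions for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of C strings. */
static int cmp_str(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Sorts lines[0..n) and drops adjacent duplicates; returns the new count. */
int sort_unique(const char *lines[], int n)
{
    qsort((void *)lines, n, sizeof lines[0], cmp_str);
    int out = 0;
    for (int i = 0; i < n; i++)
        if (out == 0 || strcmp(lines[i], lines[out - 1]) != 0)
            lines[out++] = lines[i];
    return out;
}

/* Returns 1 if sorted U matches the sorted, de-duplicated D exactly. */
int files_match(const char *u[], int nu, const char *d[], int nd)
{
    int n = sort_unique(d, nd);
    qsort((void *)u, nu, sizeof u[0], cmp_str);
    if (n != nu)
        return 0;                      /* one sequence shorter than the other */
    for (int i = 0; i < n; i++)
        if (strcmp(u[i], d[i]) != 0)   /* element U[n] != D[n] */
            return 0;
    return 1;
}
```

This catches both failure modes from the thread: a line in D that is missing from U, and an extra line in U that never appeared in D.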