Could anyone please help me compare two files containing lines of text? The two files contain similar data, except that one contains duplicates while the other contains only unique lines. I have to make sure that the file with the unique data contains all the lines present in the file with the duplicates. Both files look something like the sample below, and I only have to compare the lines that contain the substring CME, which appears right before the first comma in the line.

Code:
BREACH:0:40:GE:20100601-07:34:22.796
0:1:ORDER ID:0000D9DB
0:2:ORDER ID:0000D9DC
0:1:TRDR:GRC
0:2:TRDR:GRC
0:0:TRADE CROSSING IDS:14ijpol1vsu7l7,yfrea51dwd8jx
0:1:OrderReceive,01.06.2010 07:34:07.875,0G4LHZ013,0000D9DB,A,50,0,50,0,99155
0:1:OrderReceive,01.06.2010 07:26:00.400,0G4LHZ013,0000D9DB,A,50,0,50,0,99155
0:2:OrderReceive,01.06.2010 07:34:10.578,0G4LHZ014,0000D9DC,A,50,0,50,0,140
0:2:OrderReceive,01.06.2010 07:26:03.103,0G4LHZ014,0000D9DC,A,50,0,50,0,140
0:1:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,14ijpol1vsu7l7,Fill,0000D9DB,B,00000,99.155,2,99.155,20100601,07:27:34
0:1:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,1iyucih1bcmpso,Fill,0000D9DB,B,00000,99.155,1,99.155,20100601,07:27:34
0:1:CME,20100601,07:34:23.359,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,xl7a2t1fwh479,Fill,0000D9DB,B,00000,99.155,17,99.155,20100601,07:27:34
0:1:CME,20100601,07:34:24.406,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,1ynu8nhiqkwyz,Fill,0000D9DB,B,00000,99.155,30,99.155,20100601,07:27:35
0:0:WASH-ORD-TIME-DIFF,2.703
0:0:2nd-ORD-TO-WASH-TIME-DIFF,499.693
BREACH:1:40:GE:20100601-07:34:22.796
1:1:ORDER ID:0000D9DB
1:2:ORDER ID:0000D9DC
1:1:TRDR:GRC
1:2:TRDR:GRC
1:0:TRADE CROSSING IDS:1iyucih1bcmpso,d88hmz15psx80
1:1:OrderReceive,01.06.2010 07:34:07.875,0G4LHZ013,0000D9DB,A,50,0,50,0,99155
1:1:OrderReceive,01.06.2010 07:26:00.400,0G4LHZ013,0000D9DB,A,50,0,50,0,99155
1:2:OrderReceive,01.06.2010 07:34:10.578,0G4LHZ014,0000D9DC,A,50,0,50,0,140
1:2:OrderReceive,01.06.2010 07:26:03.103,0G4LHZ014,0000D9DC,A,50,0,50,0,140
1:1:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,14ijpol1vsu7l7,Fill,0000D9DB,B,00000,99.155,2,99.155,20100601,07:27:34
1:1:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,1iyucih1bcmpso,Fill,0000D9DB,B,00000,99.155,1,99.155,20100601,07:27:34
1:1:CME,20100601,07:34:23.359,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,xl7a2t1fwh479,Fill,0000D9DB,B,00000,99.155,17,99.155,20100601,07:27:34
1:1:CME,20100601,07:34:24.406,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,1ynu8nhiqkwyz,Fill,0000D9DB,B,00000,99.155,30,99.155,20100601,07:27:35
1:0:WASH-ORD-TIME-DIFF,2.703
1:0:2nd-ORD-TO-WASH-TIME-DIFF,499.693
BREACH:40:40:GE:20100601-14:08:35.406
40:1:ORDER ID:0000D9XN
40:2:ORDER ID:0000DBHJ
40:1:TRDR:DAF
40:2:TRDR:DAF
40:0:TRADE CROSSING IDS:4hr6iu6smidw,1t6btger8juyg
40:1:OrderReceive,01.06.2010 09:58:50.031,0323YK058,0000D9XN,A,25,0,25,0,-35
40:1:OrderReceive,01.06.2010 09:50:42.290,0323YK058,0000D9XN,A,25,0,25,0,-35
40:2:OrderReceive,01.06.2010 14:07:29.062,0323YK153,0000DBHJ,A,7,0,7,0,160
40:2:OrderReceive,01.06.2010 13:59:20.853,0323YK153,0000DBHJ,A,7,0,7,0,160
40:1:CME,20100601,12:45:46.250,DAF,GE,201012,FUT,XGDAF,0323YK058,1kxdklb1oghepj,Leg Fill,0000D9XN,B,00000,-3.5,2,99,20100601,12:38:57
40:1:CME,20100601,12:45:46.250,DAF,GE,201103,FUT,XGDAF,0323YK058,1v0f5tr1da5176,Leg Fill,0000D9XN,S,00000,-3.5,2,98.875,20100601,12:38:57
40:1:CME,20100601,12:45:46.250,DAF,GE,201103,FUT,XGDAF,0323YK058,61wcagnnjmjl,Leg Fill,0000D9XN,S,00000,-3.5,2,98.875,20100601,12:38:57
40:1:CME,20100601,12:45:46.265,DAF,GE,201106,FUT,XGDAF,0323YK058,v6d4b2agp1hr,Leg Fill,0000D9XN,B,00000,-3.5,2,98.715,20100601,12:38:57
40:0:WASH-ORD-TIME-DIFF,14918.6
40:0:2nd-ORD-TO-WASH-TIME-DIFF,554.553
BREACH:101:30:GE:20100601-07:18:05.015
101:1:ORDER ID:0001U8QR
101:2:ORDER ID:0001W0PJ
101:1:TRDR:MTJ
101:2:TRDR:FDC
101:0:TRADE CROSSING IDS:1ua0o7twia2cx,p3mqxj1it2iao
101:1:OrderReceive,01.06.2010 07:18:05.015,082X7Y007,0001U8QR,A,1,0,1,0,99025
101:1:OrderReceive,01.06.2010 07:09:57.556,082X7Y007,0001U8QR,A,1,0,1,0,99025
101:2:OrderReceive,01.06.2010 07:18:04.468,0323X8076,0001W0PJ,A,10,0,10,0,145
101:2:OrderReceive,01.06.2010 07:09:57.009,0323X8076,0001W0PJ,A,10,0,10,0,145
101:1:CME,20100601,07:18:05.015,MTJ,GE,201012,FUT,XGMTJ,082X7Y007,1ua0o7twia2cx,Fill,0001U8QR,S,00000,99.025,1,99.025,20100601,07:11:16
101:1:CME,20100601,07:09:57.556,MTJ,GE,201012,FUT,XGMTJ,082X7Y007,1ua0o7twia2cx,Fill,0001U8QR,S,00000,99.025,1,99.025,20100601,07:11:16
101:2:CME,20100601,07:18:04.468,FDC,GE,201009,FUT,XGFDC,0323X8076,1t6wkc41c2ki0m,Leg Fill,0001W0PJ,S,00000,14.5,3,99.17,20100601,07:11:15
101:2:CME,20100601,07:18:04.468,FDC,GE,201012,FUT,XGFDC,0323X8076,15vxdjj1r1imja,Leg Fill,0001W0PJ,B,00000,14.5,3,99.025,20100601,07:11:15
101:0:CROSS-ORD-TIME-DIFF,0.547
"I have to make sure if the file having unique data contains all the lines present in the file containing duplicates." OK, so the comparison fails if the file U having unique data does not contain one or more of the lines present in the file D containing duplicates. So if we can find a line in D that is not in U, the comparison fails. Question: does U have to contain *only* the lines present in D? U={1,2,3,4} D={4,3,2,1,4,3,2,1} comparison succeeds, of course. D={1,1,2,3,4,5} fail: U does not contain 5. D={1,1,2,3} does this succeed or fail? U contains all lines present in D, i.e. 1,2,3, but D does not contain 4. Assuming this last one succeeds, this algorithm should do the trick: for each line in the "duplicates" file D check if that line is present in the "uniques" file U if it isn't, you can terminate the loop early with "comparison failed" - i.e., you've found a line in the D file that isn't in U. You can do fancy things like cache the U file in memory, store the lines in a linked list (optionally sorted) etc, so if you're expected to do all that then do, but if you're not and the test is *only* to do the comparison, you'll probably find that the above algorithm will complete in next to no time just using standard file IO, unless the files are enormous (but then, they might be too big to fit in RAM).
"OK, so the comparison fails if the file U having unique data does not contain one or more of the lines present in the file D containing duplicates" that's TRUE "So if we can find a line in D that is not in U, the comparison fails." TRUE "Question: does U have to contain *only* the lines present in D?" YES because it is basically extracted from D "D={1,1,2,3} does this succeed or fail? U contains all lines present in D, i.e. 1,2,3, but D does not contain 4." it fails because U contains an entry which is not present in its parent or D file I dont know how to start searching from the lines that contains CME in it? Because there are lines which do not contains CME in it. Code: 0:1:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,14ijpol1vsu7l7,Fill,0000D9DB,B,00000,99.155,2,99.155,20100601,07:27:34 0:1:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,1iyucih1bcmpso,Fill,0000D9DB,B,00000,99.155,1,99.155,20100601,07:27:34 0:1:CME,20100601,07:34:23.359,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,xl7a2t1fwh479,Fill,0000D9DB,B,00000,99.155,17,99.155,20100601,07:27:34 0:1:CME,20100601,07:34:24.406,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,1ynu8nhiqkwyz,Fill,0000D9DB,B,00000,99.155,30,99.155,20100601,07:27:35
strstr checks whether a string contains a substring. If a line doesn't contain CME, just exclude it from the comparison.

OK, so my first algorithm won't work, because it wouldn't fail for D = {1, 1, 2, 3}. I'll need to think up a new one...
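For the CME filtering, a minimal sketch: strstr returns NULL when the substring is absent, so lines without CME can be skipped before they ever enter the comparison. The helper names here are illustrative, not from the original post.

```c
#include <string.h>

/* A line takes part in the comparison only if it contains "CME". */
int is_cme_line(const char *line)
{
    return strstr(line, "CME") != NULL;
}

/* Copies the CME lines from src into dst and returns how many were
   kept. When reading from a file, apply the same is_cme_line() test
   to each line fetched with fgets before storing or comparing it. */
int filter_cme(const char *src[], int n, const char *dst[])
{
    int kept = 0;
    for (int i = 0; i < n; i++)
        if (is_cme_line(src[i]))
            dst[kept++] = src[i];
    return kept;
}
```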
OK, here's a quick one. Read both U and D into sorted linked lists, but don't add duplicate lines to the D list. The two lists should then match exactly; if there is any difference (one list shorter than the other, or some element U[n] != D[n]), the comparison fails.
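A sketch of that idea in C, using sorted arrays rather than linked lists (the principle is identical and it keeps the example short); the function names and the fixed array sizes are assumptions for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of C strings. */
static int cmp_str(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Sorts lines[0..n) and drops adjacent duplicates; returns the new count. */
int sort_unique(const char *lines[], int n)
{
    qsort((void *)lines, n, sizeof lines[0], cmp_str);
    int out = 0;
    for (int i = 0; i < n; i++)
        if (out == 0 || strcmp(lines[i], lines[out - 1]) != 0)
            lines[out++] = lines[i];
    return out;
}

/* Returns 1 if sorted U matches the sorted, de-duplicated D exactly. */
int files_match(const char *u[], int nu, const char *d[], int nd)
{
    int n = sort_unique(d, nd);
    qsort((void *)u, nu, sizeof u[0], cmp_str);
    if (n != nu)
        return 0;                      /* one sequence shorter than the other */
    for (int i = 0; i < n; i++)
        if (strcmp(u[i], d[i]) != 0)   /* element U[n] != D[n] */
            return 0;
    return 1;
}
```

This catches both failure modes from the thread: a line in D that is missing from U, and an extra line in U that never appeared in D.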