compare two files

Discussion in 'C' started by heidik, Nov 1, 2010.

  1. heidik

    heidik New Member

    Joined:
    Oct 23, 2010
    Messages:
    69
    Likes Received:
    0
    Trophy Points:
    0
    Could anyone please help me compare two files containing lines of text? The two files contain similar data, except that one contains duplicates while the other contains only unique lines. I have to make sure that the file with unique data contains all the lines present in the file containing duplicates.

    Both files look something like this, and I only have to compare the lines that have the substring CME right before the first comma in the line.

    Code:
    BREACH:0:40:GE:20100601-07:34:22.796
    0:1:ORDER ID:0000D9DB
    0:2:ORDER ID:0000D9DC
    0:1:TRDR:GRC
    0:2:TRDR:GRC
    0:0:TRADE CROSSING IDS:14ijpol1vsu7l7,yfrea51dwd8jx
    0:1:OrderReceive,01.06.2010 07:34:07.875,0G4LHZ013,0000D9DB,A,50,0,50,0,99155
    0:1:OrderReceive,01.06.2010 07:26:00.400,0G4LHZ013,0000D9DB,A,50,0,50,0,99155
    0:2:OrderReceive,01.06.2010 07:34:10.578,0G4LHZ014,0000D9DC,A,50,0,50,0,140
    0:2:OrderReceive,01.06.2010 07:26:03.103,0G4LHZ014,0000D9DC,A,50,0,50,0,140
    0:1:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,14ijpol1vsu7l7,Fill,0000D9DB,B,00000,99.155,2,99.155,20100601,07:27:34
    0:1:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,1iyucih1bcmpso,Fill,0000D9DB,B,00000,99.155,1,99.155,20100601,07:27:34
    0:1:CME,20100601,07:34:23.359,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,xl7a2t1fwh479,Fill,0000D9DB,B,00000,99.155,17,99.155,20100601,07:27:34
    0:1:CME,20100601,07:34:24.406,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,1ynu8nhiqkwyz,Fill,0000D9DB,B,00000,99.155,30,99.155,20100601,07:27:35
    0:0:WASH-ORD-TIME-DIFF,2.703
    0:0:2nd-ORD-TO-WASH-TIME-DIFF,499.693
    BREACH:1:40:GE:20100601-07:34:22.796
    1:1:ORDER ID:0000D9DB
    1:2:ORDER ID:0000D9DC
    1:1:TRDR:GRC
    1:2:TRDR:GRC
    1:0:TRADE CROSSING IDS:1iyucih1bcmpso,d88hmz15psx80
    1:1:OrderReceive,01.06.2010 07:34:07.875,0G4LHZ013,0000D9DB,A,50,0,50,0,99155
    1:1:OrderReceive,01.06.2010 07:26:00.400,0G4LHZ013,0000D9DB,A,50,0,50,0,99155
    1:2:OrderReceive,01.06.2010 07:34:10.578,0G4LHZ014,0000D9DC,A,50,0,50,0,140
    1:2:OrderReceive,01.06.2010 07:26:03.103,0G4LHZ014,0000D9DC,A,50,0,50,0,140
    1:1:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,14ijpol1vsu7l7,Fill,0000D9DB,B,00000,99.155,2,99.155,20100601,07:27:34
    1:1:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,1iyucih1bcmpso,Fill,0000D9DB,B,00000,99.155,1,99.155,20100601,07:27:34
    1:1:CME,20100601,07:34:23.359,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,xl7a2t1fwh479,Fill,0000D9DB,B,00000,99.155,17,99.155,20100601,07:27:34
    1:1:CME,20100601,07:34:24.406,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,1ynu8nhiqkwyz,Fill,0000D9DB,B,00000,99.155,30,99.155,20100601,07:27:35
    1:0:WASH-ORD-TIME-DIFF,2.703
    1:0:2nd-ORD-TO-WASH-TIME-DIFF,499.693
    BREACH:40:40:GE:20100601-14:08:35.406
    40:1:ORDER ID:0000D9XN
    40:2:ORDER ID:0000DBHJ
    40:1:TRDR:DAF
    40:2:TRDR:DAF
    40:0:TRADE CROSSING IDS:4hr6iu6smidw,1t6btger8juyg
    40:1:OrderReceive,01.06.2010 09:58:50.031,0323YK058,0000D9XN,A,25,0,25,0,-35
    40:1:OrderReceive,01.06.2010 09:50:42.290,0323YK058,0000D9XN,A,25,0,25,0,-35
    40:2:OrderReceive,01.06.2010 14:07:29.062,0323YK153,0000DBHJ,A,7,0,7,0,160
    40:2:OrderReceive,01.06.2010 13:59:20.853,0323YK153,0000DBHJ,A,7,0,7,0,160
    40:1:CME,20100601,12:45:46.250,DAF,GE,201012,FUT,XGDAF,0323YK058,1kxdklb1oghepj,Leg Fill,0000D9XN,B,00000,-3.5,2,99,20100601,12:38:57
    40:1:CME,20100601,12:45:46.250,DAF,GE,201103,FUT,XGDAF,0323YK058,1v0f5tr1da5176,Leg Fill,0000D9XN,S,00000,-3.5,2,98.875,20100601,12:38:57
    40:1:CME,20100601,12:45:46.250,DAF,GE,201103,FUT,XGDAF,0323YK058,61wcagnnjmjl,Leg Fill,0000D9XN,S,00000,-3.5,2,98.875,20100601,12:38:57
    40:1:CME,20100601,12:45:46.265,DAF,GE,201106,FUT,XGDAF,0323YK058,v6d4b2agp1hr,Leg Fill,0000D9XN,B,00000,-3.5,2,98.715,20100601,12:38:57
    40:0:WASH-ORD-TIME-DIFF,14918.6
    40:0:2nd-ORD-TO-WASH-TIME-DIFF,554.553
    BREACH:101:30:GE:20100601-07:18:05.015
    101:1:ORDER ID:0001U8QR
    101:2:ORDER ID:0001W0PJ
    101:1:TRDR:MTJ
    101:2:TRDR:FDC
    101:0:TRADE CROSSING IDS:1ua0o7twia2cx,p3mqxj1it2iao
    101:1:OrderReceive,01.06.2010 07:18:05.015,082X7Y007,0001U8QR,A,1,0,1,0,99025
    101:1:OrderReceive,01.06.2010 07:09:57.556,082X7Y007,0001U8QR,A,1,0,1,0,99025
    101:2:OrderReceive,01.06.2010 07:18:04.468,0323X8076,0001W0PJ,A,10,0,10,0,145
    101:2:OrderReceive,01.06.2010 07:09:57.009,0323X8076,0001W0PJ,A,10,0,10,0,145
    101:1:CME,20100601,07:18:05.015,MTJ,GE,201012,FUT,XGMTJ,082X7Y007,1ua0o7twia2cx,Fill,0001U8QR,S,00000,99.025,1,99.025,20100601,07:11:16
    101:1:CME,20100601,07:09:57.556,MTJ,GE,201012,FUT,XGMTJ,082X7Y007,1ua0o7twia2cx,Fill,0001U8QR,S,00000,99.025,1,99.025,20100601,07:11:16
    101:2:CME,20100601,07:18:04.468,FDC,GE,201009,FUT,XGFDC,0323X8076,1t6wkc41c2ki0m,Leg Fill,0001W0PJ,S,00000,14.5,3,99.17,20100601,07:11:15
    101:2:CME,20100601,07:18:04.468,FDC,GE,201012,FUT,XGFDC,0323X8076,15vxdjj1r1imja,Leg Fill,0001W0PJ,B,00000,14.5,3,99.025,20100601,07:11:15
    101:0:CROSS-ORD-TIME-DIFF,0.547
    
     
  2. xpi0t0s

    xpi0t0s Mentor

    Joined:
    Aug 6, 2004
    Messages:
    3,009
    Likes Received:
    203
    Trophy Points:
    63
    Occupation:
    Senior Support Engineer
    Location:
    England
    "I have to make sure if the file having unique data contains all the lines present in the file containing duplicates."

    OK, so the comparison fails if the file U having unique data does not contain one or more of the lines present in the file D containing duplicates.

    So if we can find a line in D that is not in U, the comparison fails.

    Question: does U have to contain *only* the lines present in D?

    U={1,2,3,4}

    D={4,3,2,1,4,3,2,1} comparison succeeds, of course.
    D={1,1,2,3,4,5} fail: U does not contain 5.
    D={1,1,2,3} does this succeed or fail? U contains all lines present in D, i.e. 1,2,3, but D does not contain 4.

    Assuming this last one succeeds, this algorithm should do the trick:

    for each line in the "duplicates" file D:
        check if that line is present in the "uniques" file U
        if it isn't, terminate the loop early with "comparison failed"
        (i.e., you've found a line in the D file that isn't in U)

    You can do fancy things like caching the U file in memory or storing the lines in a linked list (optionally sorted). If you're expected to do all that, then do. But if the task is *only* to do the comparison, the above algorithm will probably complete in next to no time using just standard file I/O, unless the files are enormous (in which case they might be too big to fit in RAM anyway).
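
    A minimal sketch of that loop in C, using only standard file I/O. The function names and the MAX_LINE limit are made up for illustration, and it assumes no line is longer than MAX_LINE - 1 characters. It compares every line as-is; it does not yet skip the lines without CME.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_LINE 1024   /* assumed upper bound on line length */

/* Return 1 if `wanted` equals some line of u, else 0.
   Rescans u from the start on every call. */
static int file_has_line(FILE *u, const char *wanted)
{
    char buf[MAX_LINE];

    rewind(u);
    while (fgets(buf, sizeof buf, u) != NULL) {
        buf[strcspn(buf, "\r\n")] = '\0';   /* strip the line ending */
        if (strcmp(buf, wanted) == 0)
            return 1;
    }
    return 0;
}

/* Return 1 if every line of d also occurs in u, else 0. */
static int u_contains_all_of_d(FILE *d, FILE *u)
{
    char buf[MAX_LINE];

    assert(d != NULL && u != NULL);
    rewind(d);
    while (fgets(buf, sizeof buf, d) != NULL) {
        buf[strcspn(buf, "\r\n")] = '\0';
        if (!file_has_line(u, buf))
            return 0;   /* found a line in D that is not in U */
    }
    return 1;
}
```

    Because U is rescanned for every line of D, the cost grows with (lines in D) x (lines in U). That is usually fine for small files and is the price of not caching anything in memory.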
     
  3. heidik

    heidik New Member

    "OK, so the comparison fails if the file U having unique data does not contain one or more of the lines present in the file D containing duplicates"

    that's TRUE

    "So if we can find a line in D that is not in U, the comparison fails."

    TRUE

    "Question: does U have to contain *only* the lines present in D?"

    YES because it is basically extracted from D

    "D={1,1,2,3} does this succeed or fail? U contains all lines present in D, i.e. 1,2,3, but D does not contain 4."

    It fails, because U contains an entry that is not present in its parent file, D.

    I don't know how to restrict the search to the lines that contain CME, because there are also lines that do not contain CME.

    Code:
    0:1:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,14ijpol1vsu7l7,Fill,0000D9DB,B,00000,99.155,2,99.155,20100601,07:27:34
    0:1:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,1iyucih1bcmpso,Fill,0000D9DB,B,00000,99.155,1,99.155,20100601,07:27:34
    0:1:CME,20100601,07:34:23.359,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,xl7a2t1fwh479,Fill,0000D9DB,B,00000,99.155,17,99.155,20100601,07:27:34
    0:1:CME,20100601,07:34:24.406,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,1ynu8nhiqkwyz,Fill,0000D9DB,B,00000,99.155,30,99.155,20100601,07:27:35
    
     
  4. xpi0t0s

    xpi0t0s Mentor

    strstr checks if a string contains a substring. If the line doesn't contain CME, just exclude it from the comparison.
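
    As a rough sketch, the filter could look like one of the two helpers below. Both names are made up for illustration. The first is the plain strstr test. The second is a stricter variant that assumes, going by the sample data, that CME is the whole field ending at the first comma and is preceded by a colon.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Loose test: 1 if "CME" appears anywhere in the line, else 0. */
static int contains_cme(const char *line)
{
    assert(line != NULL);
    return strstr(line, "CME") != NULL;
}

/* Stricter test: 1 only if the text right before the first comma
   is exactly "CME", either at the start of the line or after ':'. */
static int is_cme_record(const char *line)
{
    const char *comma;

    assert(line != NULL);
    comma = strchr(line, ',');
    if (comma == NULL || comma - line < 3)
        return 0;                       /* no comma, or too short */
    if (strncmp(comma - 3, "CME", 3) != 0)
        return 0;
    return comma - line == 3 || comma[-4] == ':';
}
```

    The loose test is enough for the sample shown in the thread. The stricter one only matters if "CME" could turn up elsewhere in a line, for example inside an ID.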

    OK, so my first algorithm won't work, because it wouldn't fail for D={1,1,2,3}. I'll need to think up a new one...
     
  5. xpi0t0s

    xpi0t0s Mentor

    OK, here's a quick one. Read both U and D into sorted linked lists, but don't add duplicate lines to the D linked list. The two lists should then match exactly; if there are any differences (one list is shorter than the other, or element U[n]!=D[n]), the comparison fails.
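
    Here is one way that idea might look in C. Note that this sketch swaps the sorted linked lists for a growable array sorted with qsort, which is simpler to write with the standard library; the comparison itself is the same. All names and the MAX_LINE limit are made up for illustration, only lines containing CME are kept, and duplicates are dropped from both files (harmless for U if it really is unique).

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LINE 1024   /* assumed upper bound on line length */

static int cmp_str(const void *a, const void *b)
{
    return strcmp(*(char *const *)a, *(char *const *)b);
}

/* Free an array of n heap-allocated strings. */
static void free_lines(char **v, size_t n)
{
    for (size_t i = 0; i < n; i++)
        free(v[i]);
    free(v);
}

/* Read the lines containing "CME" from fp, sort them, drop duplicates.
   Returns 0 on success, -1 if memory runs out. */
static int load_sorted_unique(FILE *fp, char ***out, size_t *count)
{
    char buf[MAX_LINE];
    char **v = NULL;
    size_t n = 0, cap = 0, kept = 0;

    while (fgets(buf, sizeof buf, fp) != NULL) {
        buf[strcspn(buf, "\r\n")] = '\0';       /* strip the line ending */
        if (strstr(buf, "CME") == NULL)
            continue;                           /* not a CME line */
        if (n == cap) {                         /* grow the array */
            size_t ncap = cap ? cap * 2 : 64;
            char **tmp = realloc(v, ncap * sizeof *tmp);
            if (tmp == NULL) { free_lines(v, n); return -1; }
            v = tmp;
            cap = ncap;
        }
        v[n] = malloc(strlen(buf) + 1);
        if (v[n] == NULL) { free_lines(v, n); return -1; }
        strcpy(v[n], buf);
        n++;
    }
    if (n > 0)
        qsort(v, n, sizeof *v, cmp_str);
    for (size_t i = 0; i < n; i++) {            /* equal lines are adjacent now */
        if (kept > 0 && strcmp(v[kept - 1], v[i]) == 0)
            free(v[i]);
        else
            v[kept++] = v[i];
    }
    *out = v;
    *count = kept;
    return 0;
}

/* Returns 1 if both files hold the same set of CME lines,
   0 if they differ, -1 if memory runs out. Reads from the
   current position of each stream. */
static int same_cme_lines(FILE *u, FILE *d)
{
    char **ul = NULL, **dl = NULL;
    size_t un = 0, dn = 0;
    int result;

    assert(u != NULL && d != NULL);
    if (load_sorted_unique(u, &ul, &un) != 0)
        return -1;
    if (load_sorted_unique(d, &dl, &dn) != 0) {
        free_lines(ul, un);
        return -1;
    }
    result = (un == dn);                        /* one shorter than the other? */
    for (size_t i = 0; result && i < un; i++)
        result = (strcmp(ul[i], dl[i]) == 0);   /* U[n] != D[n]? */
    free_lines(ul, un);
    free_lines(dl, dn);
    return result;
}
```

    This compares whole lines, including the numeric prefix before CME. If that prefix should be ignored, the comparison would have to start after it, which the thread does not specify.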
     
  6. heidik

    heidik New Member

    Thanks a lot xpi0t0s. I will work on what you suggested :)
     
