finding duplicates for part of a string

Discussion in 'C' started by heidik, Nov 1, 2010.

  1. heidik

    heidik New Member

    Joined:
    Oct 23, 2010
    Messages:
    69
    Likes Received:
    0
    Trophy Points:
    0
    Hello everyone,

    I am trying to find duplicates for part of a string, not the whole string. The strings are stored in a file. Each line of the file contains one string, and many of the lines (though not all) look something like this:

    Code:
    0:1:CME,20100601,14:07:53.375,CCD,GE,201009,FUT,XGCCD,0G4L7D294,1ig53ov1n1qm3z,Leg Fill,00006L3W,S,00000,2,1,99.175,20100601,14:01:04
    where the '0' at the very beginning is common to a whole block of lines. The next block has a '1' in common throughout, and so on. The part of the string from CCD until the end can be duplicated, and I have to find how many such duplicate lines there are for each '0', '1', and so on. The file can contain any combination of strings, not just the one in the example above, but if it contains duplicates at all, then it is the part from the 'C' of 'CCD' to the end that is repeated.

    After I find the duplicates, I have to compare the file with another file, which contains all the unique strings extracted from the first file (the one with duplicates). I want to know whether the file of non-duplicate values contains every string that appears in the first file. In other words, I want to make sure that all of the strings have been extracted uniquely and stored in the other file.

    Can anyone please help? I would be grateful.
     
  2. zaster

    zaster New Member

    Joined:
    Oct 9, 2010
    Messages:
    6
    Likes Received:
    0
    Trophy Points:
    0
    What part of this problem do you find troublesome? I assume it's string comparison, so I'll explain accordingly.

    1. Read one line at a time into a string.
    2. Use strtok to tokenize the string, with ',' as the delimiter.
    3. Compare the lines from the 4th token onwards.
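
    The comparison in step 3 could be sketched roughly as follows. This is only an illustration, assuming the part to compare always begins at the 4th comma-separated field (the 'CCD' in the sample line). Because strtok overwrites the delimiters in the buffer, this sketch swaps in strchr to skip three commas and then compares the rest of the line in one go. The names field_start and same_tail are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return a pointer to the start of the n-th comma-separated field
 * (0-based) inside line, or NULL if the line has fewer fields.
 * Hypothetical helper, not taken from the original post. */
const char *field_start(const char *line, int n)
{
    assert(line != NULL);
    while (n-- > 0) {
        line = strchr(line, ',');
        if (line == NULL)
            return NULL;
        line++;                 /* step past the comma */
    }
    return line;
}

/* Return 1 if both lines are identical from the 4th field to the end,
 * 0 otherwise (including when either line has fewer than 4 fields). */
int same_tail(const char *a, const char *b)
{
    const char *ta = field_start(a, 3);
    const char *tb = field_start(b, 3);
    return ta != NULL && tb != NULL && strcmp(ta, tb) == 0;
}
```

    Lines read with fgets still carry their trailing newline, so it is probably worth stripping it before comparing.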
     
  3. virxen

    virxen Active Member

    Joined:
    Nov 24, 2009
    Messages:
    387
    Likes Received:
    90
    Trophy Points:
    28
    Post more lines of the file or, better, the whole file.
    Then try to explain more clearly what you want to do by giving an example
    with the expected results.
     
  4. heidik

    heidik New Member

    Joined:
    Oct 23, 2010
    Messages:
    69
    Likes Received:
    0
    Trophy Points:
    0
    Below are some more lines from the file. The actual file contains almost 70000 lines.

    Code:
    0:1:CME,20100601,14:07:53.375,CCD,GE,201009,FUT,XGCCD,0G4L7D294,1ig53ov1n1qm3z,Leg Fill,00006L3W,S,00000,2,1,99.175,20100601,14:01:04
    0:1:CME,20100601,13:59:45.556,CCD,GE,201009,FUT,XGCCD,0G4L7D294,1oilw3g1bvmoyo,Leg Fill,00006L3W,S,00000,2,1,99.175,20100601,14:01:04
    0:1:CME,20100601,13:59:45.150,CCD,GE,201009,FUT,XGCCD,0G4L7D294,1oyr7uiubtx0l,Leg Fill,00006L3W,S,00000,2,1,99.175,20100601,14:01:04
    0:1:CME,20100601,13:59:45.165,CCD,GE,201009,FUT,XGCCD,0G4L7D294,v09wp1112gneo,Leg Fill,00006L3W,S,00000,2,1,99.175,20100601,14:01:04
    
    1:2:CME,20100601,10:22:33.078,CFD,GE,201106,FUT,XGCFD,0G4LGP101,5o5wv61n8ds4w,Leg Fill,0000DA3F,S,00000,18.5,2,98.675,20100601,10:15:44
    1:2:CME,20100601,10:14:25.275,CFD,GE,201106,FUT,XGCFD,0G4LGP101,7ga0hh1psbfa5,Leg Fill,0000DA3F,S,00000,18.5,2,98.675,20100601,10:15:44
    1:2:CME,20100601,10:22:33.078,CFD,GE,201106,FUT,XGCFD,0G4LGP101,a46o111s2tk45,Leg Fill,0000DA3F,S,00000,18.5,3,98.675,20100601,10:15:44
    1:2:CME,20100601,10:22:33.046,CFD,GE,201106,FUT,XGCFD,0G4LGP101,k13xp1tfotis,Leg Fill,0000DA3F,S,00000,18.5,4,98.675,20100601,10:15:44
    
    50:1:CME,20100601,12:07:24.384,DCE,GE,201109,FUT,XGDCE,0G4LCN103,mge46sg0pe1k,Fill,0001VVCH,S,00000,98.48,1,98.48,20100601,12:08:43
    50:2:CME,20100601,12:07:24.384,GVM,GE,201109,FUT,XGGVM,0G4L9J144,14xceud1jgrquv,Leg Fill,0001W2UK,B,00000,24,2,98.48,20100601,12:08:43
    50:2:CME,20100601,12:15:32.390,GVM,GE,201109,FUT,XGGVM,0G4L9J144,1hchm0pwgw0ip,Leg Fill,0001W2UK,B,00000,24,6,98.48,20100601,12:08:43
    50:2:CME,20100601,12:07:24.415,GVM,GE,201109,FUT,XGGVM,0G4L9J144,1igm2qn1djr0ry,Leg Fill,0001W2UK,B,00000,24,1,98.48,20100601,12:08:43
    
    I have just been able to extract the duplicates from the actual file based on some condition, and have stored all the unique strings in a different file. Now I want to compare the two files, the original file (with duplicates) and the new file (with unique strings), to make sure that each line of the original file has been read and checked for duplicates. If any line found in the original file is missing from the unique file, the program should output an error message. Likewise, if any line from the original file appears more than once under one ID in the unique file, the program should again output an error message. The file is made up of blocks of lines, and each block has its own unique ID. A line should not appear more than once under one ID. If the same line appears twice in the unique file but under different IDs, that is not considered an error.

    Code:
    0:1:CME,20100601,14:07:53.375,CCD,GE,201009,FUT,XGCCD,0G4L7D294,1ig53ov1n1qm3z,Leg Fill,00006L3W,S,00000,2,1,99.175,20100601,14:01:04
    
    The zero (0) at the beginning of the above string is the string (line) ID. IDs can go up to 70000, and there are around 10 strings (lines) per ID, e.g. there can be 10 lines for ID 0, 10 for ID 1, and so on.
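
    A rough sketch of the check described above, under some loudly stated assumptions: both files have already been read into arrays of strings (blank lines skipped, newlines stripped), and two lines count as "the same" when they share the leading ID and are identical from the 4th comma-separated field onwards. The actual duplicate condition is not given in this thread, so that key is a guess. All function names are hypothetical. Sorting with qsort and looking up with bsearch avoids comparing every line against every other line, which matters at around 70000 lines.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Part of the line from the 4th comma-separated field to the end.
 * Lines with fewer than four fields yield an empty tail. */
const char *tail_of(const char *line)
{
    int commas = 0;
    const char *p;
    for (p = line; *p != '\0'; p++)
        if (*p == ',' && ++commas == 3)
            return p + 1;
    return p;
}

/* Leading numeric ID, i.e. the digits before the first ':'. */
long id_of(const char *line)
{
    return strtol(line, NULL, 10);
}

/* Order lines by ID first, then by tail (assumed duplicate key). */
int cmp_lines(const void *a, const void *b)
{
    const char *x = *(const char *const *)a;
    const char *y = *(const char *const *)b;
    long ix = id_of(x), iy = id_of(y);
    if (ix != iy)
        return (ix > iy) - (ix < iy);
    return strcmp(tail_of(x), tail_of(y));
}

/* Report every key repeated in uniq and every orig line whose key is
 * missing from uniq. Sorts uniq in place; returns the error count. */
size_t check_unique_file(const char **orig, size_t n_orig,
                         const char **uniq, size_t n_uniq)
{
    size_t errors = 0, i;
    assert(orig != NULL && uniq != NULL);
    qsort(uniq, n_uniq, sizeof *uniq, cmp_lines);
    for (i = 1; i < n_uniq; i++)
        if (cmp_lines(&uniq[i - 1], &uniq[i]) == 0) {
            fprintf(stderr, "repeated under one ID: %s\n", uniq[i]);
            errors++;
        }
    for (i = 0; i < n_orig; i++)
        if (bsearch(&orig[i], uniq, n_uniq, sizeof *uniq, cmp_lines) == NULL) {
            fprintf(stderr, "missing from unique file: %s\n", orig[i]);
            errors++;
        }
    return errors;
}
```

    A return value of 0 would mean that every line of the original file is represented exactly once per ID in the unique file, under the assumed key.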

    I hope I have made my point clear.
     
