finding duplicates for part of a string

heidik · Nov 1, 2010

Hello everyone

I am trying to find duplicates for part of a string, not the whole string. The strings are stored in a file, one per line, and many of the lines (though not all) look something like this:
Code:
0:1:CME,20100601,14:07:53.375,CCD,GE,201009,FUT,XGCCD,0G4L7D294,1ig53ov1n1qm3z,Leg Fill,00006L3W,S,00000,2,1,99.175,20100601,14:01:04
Here the '0' at the very beginning is common throughout a block of lines. The next block will have a '1' common throughout, and so on. The part of the string from CCD until the end can be duplicated, and I have to find how many such duplicate lines there are against each '0', '1', and so on. The file can contain any combination of strings, not just the one in the example above, but if it contains duplicates at all, then the part starting at the 'C' of 'CCD' and running to the end is what would be repeated.

After I find the duplicates, I have to compare the result with another file, which contains all the unique strings extracted from the first file (the one with duplicates). I want to know whether the file with the non-duplicate values contains every string that appears in the first file, that is, to make sure that all of the strings have been extracted uniquely and stored in the other file.

Can anyone please help? I would be grateful.

zaster · Nov 1, 2010

What part of this problem do you find troublesome? I assume it's string comparison, so I'll explain accordingly.

1. Read one line at a time into a string.
2. Use strtok to tokenize the string, with ',' as the delimiter.
3. Compare the lines from the 4th token onwards.
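
The steps above can be sketched roughly as follows. This is only a sketch under some assumptions: the block ID is taken to be the text before the first ':', the compared part is everything after the third ',' (the 4th token onwards), and lines that do not have this shape are skipped. Instead of strtok it searches for the commas directly, since strtok modifies its input and silently skips empty fields. The names split_line, Counts and count_tails are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical helper: splits a line into (block ID, tail).
// Assumption: the ID is the text before the first ':' and the tail
// starts at the 4th comma-separated field.
// Returns false if the line does not have that shape.
bool split_line(const std::string& line, std::string& id, std::string& tail) {
    std::size_t colon = line.find(':');
    if (colon == std::string::npos) return false;
    std::size_t pos = 0;
    for (int i = 0; i < 3; ++i) {          // step past the first three commas
        pos = line.find(',', pos);
        if (pos == std::string::npos) return false;
        ++pos;
    }
    id = line.substr(0, colon);
    tail = line.substr(pos);
    return true;
}

// Maps (block ID, tail) to the number of lines carrying that tail.
typedef std::map<std::pair<std::string, std::string>, int> Counts;

// Reads the stream line by line and counts each tail per block ID.
Counts count_tails(std::istream& in) {
    Counts counts;
    std::string line, id, tail;
    while (std::getline(in, line)) {
        if (split_line(line, id, tail)) ++counts[std::make_pair(id, tail)];
    }
    return counts;
}
```

Any entry whose count is greater than 1 is a tail that is duplicated within that ID. The same tail under two different IDs ends up in two separate entries, so it is not counted as a duplicate.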

virxen · Nov 1, 2010

Send more lines of the file, or better, the whole file.
Then try to explain better what you want to do by giving an example
with the expected results.

heidik · Nov 1, 2010

Below are some more lines from the file. The actual file contains almost 70000 lines.
Code:
0:1:CME,20100601,14:07:53.375,CCD,GE,201009,FUT,XGCCD,0G4L7D294,1ig53ov1n1qm3z,Leg Fill,00006L3W,S,00000,2,1,99.175,20100601,14:01:04
0:1:CME,20100601,13:59:45.556,CCD,GE,201009,FUT,XGCCD,0G4L7D294,1oilw3g1bvmoyo,Leg Fill,00006L3W,S,00000,2,1,99.175,20100601,14:01:04
0:1:CME,20100601,13:59:45.150,CCD,GE,201009,FUT,XGCCD,0G4L7D294,1oyr7uiubtx0l,Leg Fill,00006L3W,S,00000,2,1,99.175,20100601,14:01:04
0:1:CME,20100601,13:59:45.165,CCD,GE,201009,FUT,XGCCD,0G4L7D294,v09wp1112gneo,Leg Fill,00006L3W,S,00000,2,1,99.175,20100601,14:01:04

1:2:CME,20100601,10:22:33.078,CFD,GE,201106,FUT,XGCFD,0G4LGP101,5o5wv61n8ds4w,Leg Fill,0000DA3F,S,00000,18.5,2,98.675,20100601,10:15:44
1:2:CME,20100601,10:14:25.275,CFD,GE,201106,FUT,XGCFD,0G4LGP101,7ga0hh1psbfa5,Leg Fill,0000DA3F,S,00000,18.5,2,98.675,20100601,10:15:44
1:2:CME,20100601,10:22:33.078,CFD,GE,201106,FUT,XGCFD,0G4LGP101,a46o111s2tk45,Leg Fill,0000DA3F,S,00000,18.5,3,98.675,20100601,10:15:44
1:2:CME,20100601,10:22:33.046,CFD,GE,201106,FUT,XGCFD,0G4LGP101,k13xp1tfotis,Leg Fill,0000DA3F,S,00000,18.5,4,98.675,20100601,10:15:44

50:1:CME,20100601,12:07:24.384,DCE,GE,201109,FUT,XGDCE,0G4LCN103,mge46sg0pe1k,Fill,0001VVCH,S,00000,98.48,1,98.48,20100601,12:08:43
50:2:CME,20100601,12:07:24.384,GVM,GE,201109,FUT,XGGVM,0G4L9J144,14xceud1jgrquv,Leg Fill,0001W2UK,B,00000,24,2,98.48,20100601,12:08:43
50:2:CME,20100601,12:15:32.390,GVM,GE,201109,FUT,XGGVM,0G4L9J144,1hchm0pwgw0ip,Leg Fill,0001W2UK,B,00000,24,6,98.48,20100601,12:08:43
50:2:CME,20100601,12:07:24.415,GVM,GE,201109,FUT,XGGVM,0G4L9J144,1igm2qn1djr0ry,Leg Fill,0001W2UK,B,00000,24,1,98.48,20100601,12:08:43
I have just been able to extract the duplicates from the actual file based on some condition, and I have stored all the unique strings in a different file. Now I want to compare the two files, the original file (with duplicates) and the new file (with unique strings), to make sure that each line of the original file has been read and checked for duplicates.

If any line found in the original file is missing from the unique file, the program should output an error message. Likewise, if any line from the original file appears more than once against one ID in the unique file, the program should again output an error message.

The lines come in blocks, and each block has a unique ID. A line should not appear more than once against one ID. If the same line appears twice in the unique file but against different IDs, that is not considered an error.
Code:
0:1:CME,20100601,14:07:53.375,CCD,GE,201009,FUT,XGCCD,0G4L7D294,1ig53ov1n1qm3z,Leg Fill,00006L3W,S,00000,2,1,99.175,20100601,14:01:04
The zero (0) at the beginning of the above string is the string (line) ID. It can go up to 70000, and there are around 10 strings (lines) against each ID, e.g. there can be 10 lines against zero (0), 10 against one (1), and so on.
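
The check described above could look roughly like the sketch below. It rests on several assumptions: both files keep the same line format, the ID is the text before the first ':', and two lines count as "the same" when they match from the 4th comma-separated field onwards. The names Key, make_key and verify are made up for illustration, and the error messages are returned in a list rather than printed.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, std::string> Key;  // (block ID, compared part)

// Hypothetical helper. Assumption: the ID is the text before the first ':'
// and the compared part starts at the 4th comma-separated field.
bool make_key(const std::string& line, Key& key) {
    std::size_t colon = line.find(':');
    if (colon == std::string::npos) return false;
    std::size_t pos = 0;
    for (int i = 0; i < 3; ++i) {          // step past the first three commas
        pos = line.find(',', pos);
        if (pos == std::string::npos) return false;
        ++pos;
    }
    key = Key(line.substr(0, colon), line.substr(pos));
    return true;
}

// Returns one message per problem found; an empty result means both checks passed.
std::vector<std::string> verify(std::istream& original, std::istream& unique) {
    std::vector<std::string> errors;
    std::map<Key, int> seen;               // occurrences in the unique file
    std::string line;
    Key key;
    // Check 1: no line may appear twice against the same ID in the unique file.
    while (std::getline(unique, line)) {
        if (make_key(line, key) && ++seen[key] == 2)
            errors.push_back("duplicate in unique file, ID " + key.first + ": " + key.second);
    }
    // Check 2: every line of the original must be present in the unique file.
    std::set<Key> reported;                // report each missing line only once
    while (std::getline(original, line)) {
        if (make_key(line, key) && seen.find(key) == seen.end() &&
            reported.insert(key).second)
            errors.push_back("missing from unique file, ID " + key.first + ": " + key.second);
    }
    return errors;
}
```

Because the ID is part of the key, the same line stored under two different IDs in the unique file is not reported.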

I hope I have made my point clear.
