Need help with duplicate removal

Discussion in 'C++' started by heidik, Oct 23, 2010.

  1. heidik

    heidik New Member

    Joined:
    Oct 23, 2010
    Messages:
    69
    Likes Received:
    0
    Trophy Points:
    0
    Hello everyone. I am totally new to using STL.

    I have a vector which contains struct type of elements and that struct contains 16 string type of members. There are over 60000 elements in that vector. The vector contains duplicate elements (of struct). I have to remove all the duplicates along with the actual element (having duplicates) and store them in another vector or any container. I copied the whole vector into a multimap. Key and values both contain the same vector e-g in case of simple integers:

    key value
    1 1
    2 2
    3 3
    4 4
    1 1
    3 3
    5 5
    5 5

    and I want it to return:

    (pos) key value
    (1) 1 1 1 (there are two 1s, one at pos 1 and the other one at pos 5)
    (2) 2 2
    (3) 3 3 3 3 (there are three 3s, at pos 3, pos 6 and pos 9)
    (4) 4 4
    (5) 1 1 1
    (6) 3 3 3 3
    (7) 5 5 5
    (8) 5 5 5
    (9) 3 3 3 3

    it works fine with simple ints or string type of vectors but it doesnt work with struct type of elements and instead returns:

    key value
    1 1
    2 2
    3 3
    4 4
    1 1
    3 3
    5 5
    5 5

    that is NO difference.

    Afterwards I want this multimap to contain only the duplicate elements i-e

    (pos) key value
    (1) 1 1 1 (there are two 1s, one at pos 1 and the other one at pos 5)
    (3) 3 3 3 3 (there are three 3s, at pos 3, pos 6 and pos 9)
    (7) 5 5 5

    struct contains string type of variables e-g a, x, y, z and if string x, y, z (i-e excluding variable a) are all equal to each other then convert a and z to int (which I am sure there would be some function in C++ for this type of conversion) and fnd out the the difference between a and z of each such duplicate and then push_back the vector with minimum a-z difference to the original vector. could you please help me do this?
     
  2. jimblumberg

    jimblumberg New Member

    Joined:
    May 30, 2010
    Messages:
    120
    Likes Received:
    29
    Trophy Points:
    0
    Please don't double post.

    Please show what you have tried.

    Also please provide a small sample of your input file (10 records max).

    Jim
     
  3. heidik

    heidik New Member

    Joined:
    Oct 23, 2010
    Messages:
    69
    Likes Received:
    0
    Trophy Points:
    0
    Actually I dont have the code for string type vectors. It is on my office PC (along with the data file). However I have the code for int type vectors I was working on at home (not struct type vectors because I dont know how to do it with structs and that's what my problem is).

    Code:
     multimap<vector<int>::iterator,vector<int>::iterator> mymm;
     multimap<vector<int>::iterator,vector<int>::iterator>::iterator it;
        pair<multimap<vector<int>::iterator,vector<int>::iterator>::iterator,multimap<vector<int>::iterator,vector<int>::iterator>::iterator> ret;
    
        for(vector<int>::iterator j = vPred.begin(); j != vPred.end(); j++)
        {    
            mymm.insert(pair<vector<int>::iterator,vector<int>::iterator>(j, j));    
        }
    
        /*for (it = mymm.begin(); it != mymm.end(); ++it)
        {
                   std::cout << "(key = first): " << *it->first << endl;
                   std::cout << "(value = second): " << *it->second << std::endl;
        }*/
    
        for (multimap<vector<int>::iterator,vector<int>::iterator>::iterator itr = mymm.begin(); itr != mymm.end(); ++itr)
        {
            cout << *itr->first << " => ";
            
            ret = mymm.equal_range(itr->first);
    
            for (it = ret.first; it != ret.second; ++it)
            {
                      cout << *((*it).second) << endl;
            }
            cout << endl;
        }
    
    The struct is: (I am unable to do it with structs but just with simple strings and ints)

    Code:
    struct MyPred
    {
       int a, x, y, z; (which should actually be of type string)
    };
    
    and the file contains something like this:

    startDateTime (yyyy/mm/dd hh:mm:ss), TrID, OrdID, Vol, Price ..... EndDate (yyyy/mm/dd hh:mm:ss). All are string variables e-g the TrID could be "TDX6I70MNH". Same is the case with OrdID and so on. I need to match the variables from TrID till EndDate for equality and if all match then find the time difference between EndDate and startDate and push_back the element which has minimum difference between the two times.

    I hope I have explained it well.

    Thanks for offering your help :)
     
  4. jimblumberg

    jimblumberg New Member

    Joined:
    May 30, 2010
    Messages:
    120
    Likes Received:
    29
    Trophy Points:
    0
    Have you been able to get a map, mutlimap to work with std::vector<std::string>?

    Jim
     
  5. heidik

    heidik New Member

    Joined:
    Oct 23, 2010
    Messages:
    69
    Likes Received:
    0
    Trophy Points:
    0
    I just copied the vector elements to a multimap. Could you please suggest a better way with the help of code. Pleaseeeeee help me do it.
     
  6. jimblumberg

    jimblumberg New Member

    Joined:
    May 30, 2010
    Messages:
    120
    Likes Received:
    29
    Trophy Points:
    0
    I think the below is something close to what you want. Please consider it pseudo code.

    Code:
    struct Pred;
    {
       std::string startDate;
       std::string TrID;
       std::string OrdID
       std::string endDate;
    };
    
    std::vector<Pred> mypred;
    ///// fill in the vector.
    std::multimap<std::string, Pred > mmap;
    for(size_t i = 0; i < mypred.size(); ++i)
    {
       mmap.insert(make_pair((TrID + OrdID),  mpred[i]));
    }
    
    

    Jim
     
  7. heidik

    heidik New Member

    Joined:
    Oct 23, 2010
    Messages:
    69
    Likes Received:
    0
    Trophy Points:
    0
    Thank you once again. Is there any way I can do the whole thing with vectors instead of copying elements from vectors to multimap and then again copying it back to the vector?

    1) populating the vector with struct type of elements (over 60,000 elements).
    2) struct contains 16 string type of member variables.
    e-g string a, x, y, z;
    3) finding duplicates in the vector.
    4) storing the element and its duplicates in another vector (likewise all such duplicate elements) and removing the element from the original vector so that the original vector contains only those elements which do not have any duplicates
    5) finding the difference between the a and z (by converting them to ints) for all duplicates for each element.
    6) push_back the vector having the smallest a-z difference to the original vector
     
  8. jimblumberg

    jimblumberg New Member

    Joined:
    May 30, 2010
    Messages:
    120
    Likes Received:
    29
    Trophy Points:
    0
    The only other way I know of finding the duplicates would be to sort the vector. There might be other ways but I not sure how that would be.

    For item 4 you could leave the initial vector alone and create another vector to hold pointers to the initial vector and just use the pointers to display/compute item 5 and 6.

    Code:
       std::vector<Pred*> nodupes;
       std::vector<Pred*> dupes;
       if( IT IS A DUPE)
          dupes.push_back(&pred[x] ) 
    
    Only a thought might/might not work.

    Jim
     
  9. heidik

    heidik New Member

    Joined:
    Oct 23, 2010
    Messages:
    69
    Likes Received:
    0
    Trophy Points:
    0
    Thanks Jim. Please have a look at the code below. It is working fine but I dont know how much time will it take for processing more than 60000 records/vector elements? Do you have any idea? or do I have to test it with 60000 elements to make sure?

    More over, suppose there are four struct members i-e a, x, y, z. I have to convert a and z of each duplicate element to int and then find the difference between them and will push_back that vector which has a and z with minimum difference to the original vector. How do I go about it? Could you please help.

    Code:
    #include <string>
    #include <vector>
    #include <iostream>
    #include <algorithm>
    
    struct MyPred
    {
    	std::string x;
    	std::string y;
    
    	MyPred(const std::string& x, const std::string& y): x(x), y(y) {}
    
    	bool operator==(const MyPred& p) const
    	{
    		return x == p.x && y == p.y;
    	}
    
    	bool operator<(const MyPred& p) const
    	{
    		if(x < p.x) return true;
    		if(x > p.x) return false;
    		if(y < p.y) return true;
    		if(y > p.y) return false;
    		return false;
    	}
    };
    
    
    int main()
    {
    	std::vector<MyPred> vPred;
    	vPred.push_back(MyPred("a", "a"));
    	vPred.push_back(MyPred("a", "b"));
    	vPred.push_back(MyPred("a", "a"));
    	vPred.push_back(MyPred("b", "a"));
    	vPred.push_back(MyPred("a", "b"));
    	vPred.push_back(MyPred("a", "c"));
    	vPred.push_back(MyPred("b", "b"));
    	vPred.push_back(MyPred("a", "a"));
    	vPred.push_back(MyPred("a", "a"));
    
    	// The values need to be in order for equal_range() to work
    	std::sort(vPred.begin(), vPred.end());
    
    	std::vector<MyPred> uPred; // values that were always unique
    	std::vector<MyPred> dPred; // values that were duplicated
    
    	std::pair<std::vector<MyPred>::iterator, std::vector<MyPred>::iterator> ret;
    
    	for(std::vector<MyPred>::iterator i = vPred.begin(); i != vPred.end(); i = ret.second)
    	{
    		ret = std::equal_range(i, vPred.end(), *i);
    		
    		if(ret.second - ret.first != 1) // duplicates
    		{
    				for(std::vector<MyPred>::iterator j = ret.first; j != ret.second; ++j)
    				{
    					dPred.push_back(*i); //put each duplicate onto a new vector
    				}
    		}
    		else if(ret.second - ret.first == 1)
    		{
    			uPred.push_back(*i);
    		}
    	}
    
    	std::cout << "vPred: Sorted input\n";
    	for(std::vector<MyPred>::iterator i = vPred.begin(); i != vPred.end(); ++i)
    	{
    		std::cout << "[" << i->x << ", " << i->y << "]" << '\n';
    	}
    
    	std::cout << "dPred: Only the values that were duplicated\n";
    	for(std::vector<MyPred>::iterator i = dPred.begin(); i != dPred.end(); ++i)
    	{
    		std::cout << "[" << i->x << ", " << i->y << "]" << '\n';
    	}
    
    	std::cout << "uPred: Only the values that were unique\n";
    	for(std::vector<MyPred>::iterator i = uPred.begin(); i != uPred.end(); ++i)
    	{
    		std::cout << "[" << i->x << ", " << i->y << "]" << '\n';
    	}
    }
    
     
  10. jimblumberg

    jimblumberg New Member

    Joined:
    May 30, 2010
    Messages:
    120
    Likes Received:
    29
    Trophy Points:
    0
    In your structure:
    Code:
    	std::string x;
    	std::string y;
    
    	MyPred(const std::string& x, const std::string& y): x(x), y(y) {}
    
    My compiler complains about shadowing of the variables x and y. I would change your constructor to:

    Code:
    	MyPred(const std::string& x, const std::string& y): x(xx), y(yy) {}
    
    As far as how long? I would first start out with a limited number of records, say 10 - 60, to insure the program is working correctly. Then once I got the program working for the limited set I would then increase the number by say 100. Time both sets and then maybe you will have a sense of the time required.

    To lower the time required make sure you are not displaying everything on the screen. Displaying items can increase the time of execution dramatically.

    How often will this need to be done? If it's only once then I wouldn't worry too much about the time it takes, except to display something so the user knows the program is not malfunctioning.

    My worries would be: can vector be completely held in memory. The things that will probably take the longest are reading the information from the file, sorting, then searching.

    You might be able to save a little time by converting the values when you read them, and store them as the numeric type.

    I not quite sure what you mean by this.


    Jim
     
    shabbir likes this.
  11. heidik

    heidik New Member

    Joined:
    Oct 23, 2010
    Messages:
    69
    Likes Received:
    0
    Trophy Points:
    0
    Thanks for your advice about how to find out the time complexity.

    I meant that:

    there are four string type variables in the struct a, x, y, z. I have to find out the difference between member variable 'a' and member variable 'z' of the struct. Since these variables are string type but like "123" so I have to convert them to integers and then find out the difference. By duplicates I mean struct type of object having similar x, y, and z variables (excluding 'a'). After finding the difference I have to put the struct object again to the original vector (which was initially having duplicates), the object which is having minimum difference between 'a' and 'z'. I hope I have explained it well :)
     
  12. jimblumberg

    jimblumberg New Member

    Joined:
    May 30, 2010
    Messages:
    120
    Likes Received:
    29
    Trophy Points:
    0
    I understand everything above but:
    Do you mean that you are adding a "record" to the vector or you need to alter the contents of one of the existing "records"?

    Also I believe your program is not doing what you want. When do your print out the section titled: "dPred: Only the values that were duplicated" is only storing the first duplicate record. I added another field to your structure to see this. Here is my output.

    Jim
     
  13. heidik

    heidik New Member

    Joined:
    Oct 23, 2010
    Messages:
    69
    Likes Received:
    0
    Trophy Points:
    0
    Hello Jim. I have ran the program again and it is giving me the correct result :). The only problem is now the speed. According to what you suggested I tried it first with 10 test values, then with 100 and then with 60000 values and the program almost became DEAD with 60000 values specially when looking for duplicates ;'(. Do you have any idea how can I make it better, fast, only the duplicate part ?

    Thanks in advance.
     
  14. heidik

    heidik New Member

    Joined:
    Oct 23, 2010
    Messages:
    69
    Likes Received:
    0
    Trophy Points:
    0
    Hello Jim. I have run the program again and it is giving me the correct result . The only problem is now the speed. According to what you suggested I tried it first with 10 test values, then with 100 and then with 60000 values and the program almost became DEAD with 60000 values specially when looking for duplicates ;'(. Do you have any idea how can I make it better, fast, only the duplicate part ?

    Thanks in advance.
     
  15. jimblumberg

    jimblumberg New Member

    Joined:
    May 30, 2010
    Messages:
    120
    Likes Received:
    29
    Trophy Points:
    0
    The only thing I can suggest is to change your dPred and uPred vectors to pointers.

    Code:
    	std::vector<MyPred*> uPred; // values that were always unique
    	std::vector<MyPred*> dPred; // values that were duplicated
    
    This will mean instead of copying all the duplicates to a new vector you will only push the pointer to the current record into the new vector. This should speed things up quite a bit.

    You will have to change the code where you push_back() into the new vector.

    Code:
    dPred.push_back(&(*j));  //put each duplicate onto a new vector
    
    And also where you print them out.

    Hope this helps.

    Jim
     
  16. heidik

    heidik New Member

    Joined:
    Oct 23, 2010
    Messages:
    69
    Likes Received:
    0
    Trophy Points:
    0
    Alright Thank you Jim. I have the changes you suggested and it is working. I actually need the uPred (unique elements) and dont need dPred and vPred so I made them pointer type and deleted them in the end.

    Could you please help me complete this assignment. I am actually reading from a file containing strings (lines). The first and last string elements of each line are start date/time and end date/time respectively. I have to find the duplicates in original vector where all string elements in a line are the same except the first element i-e start date/time. I have to find the difference between the minits of the start date/time and the minits of the end date/time and push_back the element from the duplicates to the original vector where the time difference is the minimum. e-g

    Start date/time of '1' element is 2006/06/01 16:34:43
    End date/time of that element is 2006/06/01/ 16:55: 51

    Start date/time of duplicate element (of '1') is 2006/06/01 16:24:43
    End date/time of that element is 2006/06/01/ 16:55: 51

    I now have to find the difference between 34 (minits of '1' element's start date/time) and 55 (minits of '1' element's end date/time) which is equal to 21

    and the difference between 24 (minits of duplicate (of '1') element's start date/time) and 55 (minits of duplicate (of '1') element's end date/time) which is equal to 31.

    since 21 is less than 31 so I will push_back the element with time difference of 21 to the original vector and will discard all its duplicates.

    Could you help me do this? I hope I have explained well. For the time being I have put random int+string combination to struct first (start date/time) and last (end date/time) member variables instead of proper date/time to find the difference between the int part of that variable. Here's my code

    Code:
    struct MyPred
    {
    	std::string a;
    	std::string x;
    	std::string y;
    	std::string z;
    
    	MyPred(const std::string& a, const std::string& x, const std::string& y, const std::string& z): a(a), x(x), y(y), z(z) {}
    
    	bool operator==(const MyPred& p) const
    	{
    		return x == p.x && y == p.y && z == p.z; // a == p.a && 
    	}
    
    	bool operator<(const MyPred& p) const
    	{
    		//if(a < p.a) return true;
    		//if(a > p.a) return false;
    		if(x < p.x) return true;
    		if(x > p.x) return false;
    		if(y < p.y) return true;
    		if(y > p.y) return false;
    		if(z < p.z) return true;
    		if(z > p.z) return false;
    		return false;
    	}
    };
    
    
    int main()
    {
    	std::vector<MyPred>* vPred = new std::vector<MyPred>;
    	vPred->push_back(MyPred("a2c", "1Gak", "c", "d4f"));
    	vPred->push_back(MyPred("j4h", "b", "c", "j87h"));
    	vPred->push_back(MyPred("d4f", "1Gak", "c", "d4f"));
    	vPred->push_back(MyPred("n7s", "1Gak", "c", "d4f"));
    	vPred->push_back(MyPred("l9m", "b", "c", "j87h"));
    	vPred->push_back(MyPred("p24a", "x", "c", "p43a"));
    	vPred->push_back(MyPred("q56r", "l", "m", "q90r"));
    	vPred->push_back(MyPred("g11v", "8f", "h", "g63v"));
    	vPred->push_back(MyPred("u3w", "v", "d", "u11w"));
    	vPred->push_back(MyPred("k76l", "x", "c", "p43a"));
    	vPred->push_back(MyPred("p24a", "g", "z", "p43a"));
    
    	// The values need to be in order for equal_range() to work
    	std::sort(vPred->begin(), vPred->end());
    
    	std::vector<MyPred> uPred; // values that were always unique
    	std::vector<MyPred>* dPred = new std::vector<MyPred>; // values that were duplicated
    
    	std::pair<std::vector<MyPred>::iterator, std::vector<MyPred>::iterator> ret;
    
    	for(std::vector<MyPred>::iterator i = vPred->begin(); i != vPred->end(); i = ret.second)
    	{
    		/*ret = std::equal_range(i, vPred.end(), *i);
    		if(ret.second - ret.first == 1)
    		{
    			uPred.push_back(*i);
    		}
    		else
    		{
    			dPred.push_back(*i);
    		}*/
    
    		ret = std::equal_range(i, vPred->end(), *i);
    		
    		if(ret.second - ret.first != 1) // duplicates
    		{
    				for(std::vector<MyPred>::iterator j = ret.first; j != ret.second; ++j)
    				{
    					dPred->push_back(*j); //put each duplicate onto a new vector
    				}
    		}
    		else if(ret.second - ret.first == 1)
    		{
    			uPred.push_back(*i);
    		}
    	}
    
    	std::cout << "vPred: Sorted input\n";
    	for(std::vector<MyPred>::iterator i = vPred->begin(); i != vPred->end(); ++i)
    	{
    		std::cout << "[" << i->a << ", " << i->x << ", " << i->y << ", " << i->z << "]" << '\n';
    	}
    
    	std::cout << "dPred: Only the values that were duplicated\n";
    	for(std::vector<MyPred>::iterator i = dPred->begin(); i != dPred->end(); ++i)
    	{
    		std::cout << "[" << i->a << ", " << i->x << ", " << i->y << ", " << i->z << "]" << '\n';
    	}
    
    	std::cout << "uPred: Only the values that were unique\n";
    	for(std::vector<MyPred>::iterator i = uPred.begin(); i != uPred.end(); ++i)
    	{
    		std::cout << "[" << i->a << ", " << i->x << ", " << i->y << ", " << i->z << "]" << '\n';
    	}
    
    	delete vPred;
    	delete dPred;
    }
    
     
  17. jimblumberg

    jimblumberg New Member

    Joined:
    May 30, 2010
    Messages:
    120
    Likes Received:
    29
    Trophy Points:
    0

    For this you should check out the functions in the ctime header.

    Create a structure tm to hold each of the dates and times. Then use difftime to compute the difference between the start and end times. I would not forget about the date part because if your time passes over midnight you will get incorrect results.

    I would use stringstream to break apart your date, time string.

    Jim
     
    Last edited: Oct 24, 2010
  18. heidik

    heidik New Member

    Joined:
    Oct 23, 2010
    Messages:
    69
    Likes Received:
    0
    Trophy Points:
    0
    Pleaseeeeee Jim could you please do it for me because someone else suggested me the same thing but I havent used ctime etc ever. I have to submit it tomorrow ;'(. Pleaseeeeeeeeeeeee. Could you please make changes to my code and do it for me. I shall really be grateful to you.
     
  19. jimblumberg

    jimblumberg New Member

    Joined:
    May 30, 2010
    Messages:
    120
    Likes Received:
    29
    Trophy Points:
    0
    I will help you, but I will not do the assignment for you. If I just give you the code what will you learn??

    Create a function returning the difference in time (int) with the parameters of the start time, and end time strings.

    The first thing to do is to strip out the individual parts ( year, month, day, hours, minutes, seconds. You can use std::string.substr() to do this. Or you could use getline() your choice.

    Convert the individual strings that you obtained above to ints.

    Create a structure tm for start and end times. Here is what struct tm looks like: http://www.cplusplus.com/reference/clibrary/ctime/tm/

    Put the individual ints into the correct element of your structures.

    Use mktime() to make the times from your structures. http://www.cplusplus.com/reference/clibrary/ctime/mktime/

    Then use difftime() to compute the differences. http://www.cplusplus.com/reference/clibrary/ctime/difftime/

    Jim
     
  20. heidik

    heidik New Member

    Joined:
    Oct 23, 2010
    Messages:
    69
    Likes Received:
    0
    Trophy Points:
    0
    it isnt a university assignment. It has been given to me in office ;'(. I am new to STL and the code they have given me uses STL and it has almost been 2 weeks since I am trying to do it. I had to understand that code first and then tried to make changes and it has taken me almost 2 weeks. I just want to finish it before my boss says something to me ;'(. Could you please help?
     

Share This Page

  1. This site uses cookies to help personalise content, tailor your experience and to keep you logged in if you register.
    By continuing to use this site, you are consenting to our use of cookies.
    Dismiss Notice