Need help with duplicate removal

heidik's Avatar, Join Date: Oct 2010
Contributor
Hello everyone. I am totally new to using STL.

I have a vector which contains struct type of elements and that struct contains 16 string type of members. There are over 60000 elements in that vector. The vector contains duplicate elements (of struct). I have to remove all the duplicates along with the actual element (having duplicates) and store them in another vector or any container. I copied the whole vector into a multimap. Key and values both contain the same vector e-g in case of simple integers:

key value
1 1
2 2
3 3
4 4
1 1
3 3
5 5
5 5

and I want it to return:

(pos) key value
(1) 1 1 1 (there are two 1s, one at pos 1 and the other one at pos 5)
(2) 2 2
(3) 3 3 3 3 (there are three 3s, at pos 3, pos 6 and pos 9)
(4) 4 4
(5) 1 1 1
(6) 3 3 3 3
(7) 5 5 5
(8) 5 5 5
(9) 3 3 3 3

it works fine with simple ints or string type of vectors but it doesnt work with struct type of elements and instead returns:

key value
1 1
2 2
3 3
4 4
1 1
3 3
5 5
5 5

that is NO difference.

Afterwards I want this multimap to contain only the duplicate elements i-e

(pos) key value
(1) 1 1 1 (there are two 1s, one at pos 1 and the other one at pos 5)
(3) 3 3 3 3 (there are three 3s, at pos 3, pos 6 and pos 9)
(7) 5 5 5

struct contains string type of variables e-g a, x, y, z and if string x, y, z (i-e excluding variable a) are all equal to each other then convert a and z to int (which I am sure there would be some function in C++ for this type of conversion) and fnd out the the difference between a and z of each such duplicate and then push_back the vector with minimum a-z difference to the original vector. could you please help me do this?
0
jimblumberg's Avatar
Ambitious contributor
Please don't double post.

Please show what you have tried.

Also please provide a small sample of your input file (10 records max).

Jim
0
heidik's Avatar, Join Date: Oct 2010
Contributor
Actually I dont have the code for string type vectors. It is on my office PC (along with the data file). However I have the code for int type vectors I was working on at home (not struct type vectors because I dont know how to do it with structs and that's what my problem is).

Code:
 multimap<vector<int>::iterator,vector<int>::iterator> mymm;
 multimap<vector<int>::iterator,vector<int>::iterator>::iterator it;
    pair<multimap<vector<int>::iterator,vector<int>::iterator>::iterator,multimap<vector<int>::iterator,vector<int>::iterator>::iterator> ret;

    for(vector<int>::iterator j = vPred.begin(); j != vPred.end(); j++)
    {    
        mymm.insert(pair<vector<int>::iterator,vector<int>::iterator>(j, j));    
    }

    /*for (it = mymm.begin(); it != mymm.end(); ++it)
    {
               std::cout << "(key = first): " << *it->first << endl;
               std::cout << "(value = second): " << *it->second << std::endl;
    }*/

    for (multimap<vector<int>::iterator,vector<int>::iterator>::iterator itr = mymm.begin(); itr != mymm.end(); ++itr)
    {
        cout << *itr->first << " => ";
        
        ret = mymm.equal_range(itr->first);

        for (it = ret.first; it != ret.second; ++it)
        {
                  cout << *((*it).second) << endl;
        }
        cout << endl;
    }
The struct is: (I am unable to do it with structs but just with simple strings and ints)

Code:
struct MyPred
{
   int a, x, y, z; (which should actually be of type string)
};
and the file contains something like this:

startDateTime (yyyy/mm/dd hh:mm:ss), TrID, OrdID, Vol, Price ..... EndDate (yyyy/mm/dd hh:mm:ss). All are string variables e-g the TrID could be "TDX6I70MNH". Same is the case with OrdID and so on. I need to match the variables from TrID till EndDate for equality and if all match then find the time difference between EndDate and startDate and push_back the element which has minimum difference between the two times.

I hope I have explained it well.

Thanks for offering your help
0
jimblumberg's Avatar
Ambitious contributor
Have you been able to get a map, mutlimap to work with std::vector<std::string>?

Jim
0
heidik's Avatar, Join Date: Oct 2010
Contributor
I just copied the vector elements to a multimap. Could you please suggest a better way with the help of code. Pleaseeeeee help me do it.
0
jimblumberg's Avatar
Ambitious contributor
I think the below is something close to what you want. Please consider it pseudo code.

Code:
struct Pred;
{
   std::string startDate;
   std::string TrID;
   std::string OrdID
   std::string endDate;
};

std::vector<Pred> mypred;
///// fill in the vector.
std::multimap<std::string, Pred > mmap;
for(size_t i = 0; i < mypred.size(); ++i)
{
   mmap.insert(make_pair((TrID + OrdID),  mpred[i]));
}

Jim
0
heidik's Avatar, Join Date: Oct 2010
Contributor
Thank you once again. Is there any way I can do the whole thing with vectors instead of copying elements from vectors to multimap and then again copying it back to the vector?

1) populating the vector with struct type of elements (over 60,000 elements).
2) struct contains 16 string type of member variables.
e-g string a, x, y, z;
3) finding duplicates in the vector.
4) storing the element and its duplicates in another vector (likewise all such duplicate elements) and removing the element from the original vector so that the original vector contains only those elements which do not have any duplicates
5) finding the difference between the a and z (by converting them to ints) for all duplicates for each element.
6) push_back the vector having the smallest a-z difference to the original vector
0
jimblumberg's Avatar
Ambitious contributor
The only other way I know of finding the duplicates would be to sort the vector. There might be other ways but I not sure how that would be.

For item 4 you could leave the initial vector alone and create another vector to hold pointers to the initial vector and just use the pointers to display/compute item 5 and 6.

Code:
   std::vector<Pred*> nodupes;
   std::vector<Pred*> dupes;
   if( IT IS A DUPE)
      dupes.push_back(&pred[x] )
Only a thought might/might not work.

Jim
0
heidik's Avatar, Join Date: Oct 2010
Contributor
Thanks Jim. Please have a look at the code below. It is working fine but I dont know how much time will it take for processing more than 60000 records/vector elements? Do you have any idea? or do I have to test it with 60000 elements to make sure?

More over, suppose there are four struct members i-e a, x, y, z. I have to convert a and z of each duplicate element to int and then find the difference between them and will push_back that vector which has a and z with minimum difference to the original vector. How do I go about it? Could you please help.

Code:
#include <string>
#include <vector>
#include <iostream>
#include <algorithm>

struct MyPred
{
	std::string x;
	std::string y;

	MyPred(const std::string& x, const std::string& y): x(x), y(y) {}

	bool operator==(const MyPred& p) const
	{
		return x == p.x && y == p.y;
	}

	bool operator<(const MyPred& p) const
	{
		if(x < p.x) return true;
		if(x > p.x) return false;
		if(y < p.y) return true;
		if(y > p.y) return false;
		return false;
	}
};


int main()
{
	std::vector<MyPred> vPred;
	vPred.push_back(MyPred("a", "a"));
	vPred.push_back(MyPred("a", "b"));
	vPred.push_back(MyPred("a", "a"));
	vPred.push_back(MyPred("b", "a"));
	vPred.push_back(MyPred("a", "b"));
	vPred.push_back(MyPred("a", "c"));
	vPred.push_back(MyPred("b", "b"));
	vPred.push_back(MyPred("a", "a"));
	vPred.push_back(MyPred("a", "a"));

	// The values need to be in order for equal_range() to work
	std::sort(vPred.begin(), vPred.end());

	std::vector<MyPred> uPred; // values that were always unique
	std::vector<MyPred> dPred; // values that were duplicated

	std::pair<std::vector<MyPred>::iterator, std::vector<MyPred>::iterator> ret;

	for(std::vector<MyPred>::iterator i = vPred.begin(); i != vPred.end(); i = ret.second)
	{
		ret = std::equal_range(i, vPred.end(), *i);
		
		if(ret.second - ret.first != 1) // duplicates
		{
				for(std::vector<MyPred>::iterator j = ret.first; j != ret.second; ++j)
				{
					dPred.push_back(*i); //put each duplicate onto a new vector
				}
		}
		else if(ret.second - ret.first == 1)
		{
			uPred.push_back(*i);
		}
	}

	std::cout << "vPred: Sorted input\n";
	for(std::vector<MyPred>::iterator i = vPred.begin(); i != vPred.end(); ++i)
	{
		std::cout << "[" << i->x << ", " << i->y << "]" << '\n';
	}

	std::cout << "dPred: Only the values that were duplicated\n";
	for(std::vector<MyPred>::iterator i = dPred.begin(); i != dPred.end(); ++i)
	{
		std::cout << "[" << i->x << ", " << i->y << "]" << '\n';
	}

	std::cout << "uPred: Only the values that were unique\n";
	for(std::vector<MyPred>::iterator i = uPred.begin(); i != uPred.end(); ++i)
	{
		std::cout << "[" << i->x << ", " << i->y << "]" << '\n';
	}
}
0
jimblumberg's Avatar
Ambitious contributor
In your structure:
Code:
	std::string x;
	std::string y;

	MyPred(const std::string& x, const std::string& y): x(x), y(y) {}
My compiler complains about shadowing of the variables x and y. I would change your constructor to:

Code:
	MyPred(const std::string& x, const std::string& y): x(xx), y(yy) {}
As far as how long? I would first start out with a limited number of records, say 10 - 60, to insure the program is working correctly. Then once I got the program working for the limited set I would then increase the number by say 100. Time both sets and then maybe you will have a sense of the time required.

To lower the time required make sure you are not displaying everything on the screen. Displaying items can increase the time of execution dramatically.

How often will this need to be done? If it's only once then I wouldn't worry too much about the time it takes, except to display something so the user knows the program is not malfunctioning.

My worries would be: can vector be completely held in memory. The things that will probably take the longest are reading the information from the file, sorting, then searching.

Quote:
More over, suppose there are four struct members i-e a, x, y, z. I have to convert a and z of each duplicate element to int and then find the difference between them and will push_back that vector which has a and z with minimum difference to the original vector.
You might be able to save a little time by converting the values when you read them, and store them as the numeric type.

Quote:
will push_back that vector which has a and z with minimum difference to the original vector.
I not quite sure what you mean by this.


Jim
shabbir like this