Need help with duplicate removal

heidik · Oct 23, 2010

Hello everyone. I am totally new to using STL.

I have a vector which contains struct type of elements and that struct contains 16 string type of members. There are over 60000 elements in that vector. The vector contains duplicate elements (of struct). I have to remove all the duplicates along with the actual element (having duplicates) and store them in another vector or any container. I copied the whole vector into a multimap. Key and values both contain the same vector e-g in case of simple integers:

key value
1 1
2 2
3 3
4 4
1 1
3 3
5 5
5 5

and I want it to return:

(pos) key value
(1) 1 1 1 (there are two 1s, one at pos 1 and the other one at pos 5)
(2) 2 2
(3) 3 3 3 3 (there are three 3s, at pos 3, pos 6 and pos 9)
(4) 4 4
(5) 1 1 1
(6) 3 3 3 3
(7) 5 5 5
(8) 5 5 5
(9) 3 3 3 3

it works fine with simple ints or string type of vectors but it doesnt work with struct type of elements and instead returns:

key value
1 1
2 2
3 3
4 4
1 1
3 3
5 5
5 5

that is NO difference.

Afterwards I want this multimap to contain only the duplicate elements i-e

(pos) key value
(1) 1 1 1 (there are two 1s, one at pos 1 and the other one at pos 5)
(3) 3 3 3 3 (there are three 3s, at pos 3, pos 6 and pos 9)
(7) 5 5 5

struct contains string type of variables e-g a, x, y, z and if string x, y, z (i-e excluding variable a) are all equal to each other then convert a and z to int (which I am sure there would be some function in C++ for this type of conversion) and fnd out the the difference between a and z of each such duplicate and then push_back the vector with minimum a-z difference to the original vector. could you please help me do this?

jimblumberg · Oct 23, 2010

Please don't double post.

Please show what you have tried.

Also please provide a small sample of your input file (10 records max).

Jim

heidik · Oct 23, 2010

Actually I dont have the code for string type vectors. It is on my office PC (along with the data file). However I have the code for int type vectors I was working on at home (not struct type vectors because I dont know how to do it with structs and that's what my problem is).
Code:
 multimap<vector<int>::iterator,vector<int>::iterator> mymm;
 multimap<vector<int>::iterator,vector<int>::iterator>::iterator it;
    pair<multimap<vector<int>::iterator,vector<int>::iterator>::iterator,multimap<vector<int>::iterator,vector<int>::iterator>::iterator> ret;

    for(vector<int>::iterator j = vPred.begin(); j != vPred.end(); j++)
    {    
        mymm.insert(pair<vector<int>::iterator,vector<int>::iterator>(j, j));    
    }

    /*for (it = mymm.begin(); it != mymm.end(); ++it)
    {
               std::cout << "(key = first): " << *it->first << endl;
               std::cout << "(value = second): " << *it->second << std::endl;
    }*/

    for (multimap<vector<int>::iterator,vector<int>::iterator>::iterator itr = mymm.begin(); itr != mymm.end(); ++itr)
    {
        cout << *itr->first << " => ";
        
        ret = mymm.equal_range(itr->first);

        for (it = ret.first; it != ret.second; ++it)
        {
                  cout << *((*it).second) << endl;
        }
        cout << endl;
    }
The struct is: (I am unable to do it with structs but just with simple strings and ints)
Code:
struct MyPred
{
   int a, x, y, z; (which should actually be of type string)
};
and the file contains something like this:

startDateTime (yyyy/mm/dd hh:mm:ss), TrID, OrdID, Vol, Price ..... EndDate (yyyy/mm/dd hh:mm:ss). All are string variables e-g the TrID could be "TDX6I70MNH". Same is the case with OrdID and so on. I need to match the variables from TrID till EndDate for equality and if all match then find the time difference between EndDate and startDate and push_back the element which has minimum difference between the two times.

I hope I have explained it well.

Thanks for offering your help

jimblumberg · Oct 23, 2010

Have you been able to get a map, mutlimap to work with std::vector<std::string>?

Jim

heidik · Oct 23, 2010

I just copied the vector elements to a multimap. Could you please suggest a better way with the help of code. Pleaseeeeee help me do it.

jimblumberg · Oct 23, 2010

I think the below is something close to what you want. Please consider it pseudo code.

Code:

struct Pred;
{
   std::string startDate;
   std::string TrID;
   std::string OrdID
   std::string endDate;
};

std::vector<Pred> mypred;
///// fill in the vector.
std::multimap<std::string, Pred > mmap;
for(size_t i = 0; i < mypred.size(); ++i)
{
   mmap.insert(make_pair((TrID + OrdID),  mpred[i]));
}

Jim

heidik · Oct 23, 2010

Thank you once again. Is there any way I can do the whole thing with vectors instead of copying elements from vectors to multimap and then again copying it back to the vector?

1) populating the vector with struct type of elements (over 60,000 elements).
2) struct contains 16 string type of member variables.
e-g string a, x, y, z;
3) finding duplicates in the vector.
4) storing the element and its duplicates in another vector (likewise all such duplicate elements) and removing the element from the original vector so that the original vector contains only those elements which do not have any duplicates
5) finding the difference between the a and z (by converting them to ints) for all duplicates for each element.
6) push_back the vector having the smallest a-z difference to the original vector

jimblumberg · Oct 23, 2010

The only other way I know of finding the duplicates would be to sort the vector. There might be other ways but I not sure how that would be.

For item 4 you could leave the initial vector alone and create another vector to hold pointers to the initial vector and just use the pointers to display/compute item 5 and 6.
Code:
   std::vector<Pred*> nodupes;
   std::vector<Pred*> dupes;
   if( IT IS A DUPE)
      dupes.push_back(&pred[x] ) 
Only a thought might/might not work.

Jim

heidik · Oct 23, 2010

Thanks Jim. Please have a look at the code below. It is working fine but I dont know how much time will it take for processing more than 60000 records/vector elements? Do you have any idea? or do I have to test it with 60000 elements to make sure?

More over, suppose there are four struct members i-e a, x, y, z. I have to convert a and z of each duplicate element to int and then find the difference between them and will push_back that vector which has a and z with minimum difference to the original vector. How do I go about it? Could you please help.

Code:

#include <string>
#include <vector>
#include <iostream>
#include <algorithm>

struct MyPred
{
	std::string x;
	std::string y;

	MyPred(const std::string& x, const std::string& y): x(x), y(y) {}

	bool operator==(const MyPred& p) const
	{
		return x == p.x && y == p.y;
	}

	bool operator<(const MyPred& p) const
	{
		if(x < p.x) return true;
		if(x > p.x) return false;
		if(y < p.y) return true;
		if(y > p.y) return false;
		return false;
	}
};


int main()
{
	std::vector<MyPred> vPred;
	vPred.push_back(MyPred("a", "a"));
	vPred.push_back(MyPred("a", "b"));
	vPred.push_back(MyPred("a", "a"));
	vPred.push_back(MyPred("b", "a"));
	vPred.push_back(MyPred("a", "b"));
	vPred.push_back(MyPred("a", "c"));
	vPred.push_back(MyPred("b", "b"));
	vPred.push_back(MyPred("a", "a"));
	vPred.push_back(MyPred("a", "a"));

	// The values need to be in order for equal_range() to work
	std::sort(vPred.begin(), vPred.end());

	std::vector<MyPred> uPred; // values that were always unique
	std::vector<MyPred> dPred; // values that were duplicated

	std::pair<std::vector<MyPred>::iterator, std::vector<MyPred>::iterator> ret;

	for(std::vector<MyPred>::iterator i = vPred.begin(); i != vPred.end(); i = ret.second)
	{
		ret = std::equal_range(i, vPred.end(), *i);
		
		if(ret.second - ret.first != 1) // duplicates
		{
				for(std::vector<MyPred>::iterator j = ret.first; j != ret.second; ++j)
				{
					dPred.push_back(*i); //put each duplicate onto a new vector
				}
		}
		else if(ret.second - ret.first == 1)
		{
			uPred.push_back(*i);
		}
	}

	std::cout << "vPred: Sorted input\n";
	for(std::vector<MyPred>::iterator i = vPred.begin(); i != vPred.end(); ++i)
	{
		std::cout << "[" << i->x << ", " << i->y << "]" << '\n';
	}

	std::cout << "dPred: Only the values that were duplicated\n";
	for(std::vector<MyPred>::iterator i = dPred.begin(); i != dPred.end(); ++i)
	{
		std::cout << "[" << i->x << ", " << i->y << "]" << '\n';
	}

	std::cout << "uPred: Only the values that were unique\n";
	for(std::vector<MyPred>::iterator i = uPred.begin(); i != uPred.end(); ++i)
	{
		std::cout << "[" << i->x << ", " << i->y << "]" << '\n';
	}
}

jimblumberg · Oct 23, 2010

In your structure:
Code:
	std::string x;
	std::string y;

	MyPred(const std::string& x, const std::string& y): x(x), y(y) {}
My compiler complains about shadowing of the variables x and y. I would change your constructor to:
Code:
	MyPred(const std::string& x, const std::string& y): x(xx), y(yy) {}
As far as how long? I would first start out with a limited number of records, say 10 - 60, to insure the program is working correctly. Then once I got the program working for the limited set I would then increase the number by say 100. Time both sets and then maybe you will have a sense of the time required.

To lower the time required make sure you are not displaying everything on the screen. Displaying items can increase the time of execution dramatically.

How often will this need to be done? If it's only once then I wouldn't worry too much about the time it takes, except to display something so the user knows the program is not malfunctioning.

My worries would be: can vector be completely held in memory. The things that will probably take the longest are reading the information from the file, sorting, then searching.

More over, suppose there are four struct members i-e a, x, y, z. I have to convert a and z of each duplicate element to int and then find the difference between them and will push_back that vector which has a and z with minimum difference to the original vector.
Click to expand...

You might be able to save a little time by converting the values when you read them, and store them as the numeric type.

will push_back that vector which has a and z with minimum difference to the original vector.
Click to expand...

I not quite sure what you mean by this.

Jim

heidik · Oct 23, 2010

Thanks for your advice about how to find out the time complexity.

I meant that:

there are four string type variables in the struct a, x, y, z. I have to find out the difference between member variable 'a' and member variable 'z' of the struct. Since these variables are string type but like "123" so I have to convert them to integers and then find out the difference. By duplicates I mean struct type of object having similar x, y, and z variables (excluding 'a'). After finding the difference I have to put the struct object again to the original vector (which was initially having duplicates), the object which is having minimum difference between 'a' and 'z'. I hope I have explained it well

jimblumberg · Oct 23, 2010

heidik said: ↑

Thanks for your advice about how to find out the time complexity.

I meant that:

there are four string type variables in the struct a, x, y, z. I have to find out the difference between member variable 'a' and member variable 'z' of the struct. Since these variables are string type but like "123" so I have to convert them to integers and then find out the difference. By duplicates I mean struct type of object having similar x, y, and z variables (excluding 'a'). After finding the difference I have to put the struct object again to the original vector (which was initially having duplicates), the object which is having minimum difference between 'a' and 'z'. I hope I have explained it well
Click to expand...

I understand everything above but:

After finding the difference I have to put the struct object again to the original vector
Click to expand...

Do you mean that you are adding a "record" to the vector or you need to alter the contents of one of the existing "records"?

Also I believe your program is not doing what you want. When do your print out the section titled: "dPred: Only the values that were duplicated" is only storing the first duplicate record. I added another field to your structure to see this. Here is my output.

vPred: Sorted input
[a, a, A]
[a, a, C]
[a, a, H]
[a, a, I]
[a, b, B]
[a, b, E]
[a, c, F]
[b, a, D]
[b, b, G]
dPred: Only the values that were duplicated
[a, a, A]
[a, a, A]
[a, a, A]
[a, a, A]
[a, b, B]
[a, b, B]
uPred: Only the values that were unique
[a, c, F]
[b, a, D]
[b, b, G]
Click to expand...

Jim

heidik · Oct 24, 2010

Hello Jim. I have ran the program again and it is giving me the correct result . The only problem is now the speed. According to what you suggested I tried it first with 10 test values, then with 100 and then with 60000 values and the program almost became DEAD with 60000 values specially when looking for duplicates ;'(. Do you have any idea how can I make it better, fast, only the duplicate part ?

Thanks in advance.

heidik · Oct 24, 2010

Hello Jim. I have run the program again and it is giving me the correct result . The only problem is now the speed. According to what you suggested I tried it first with 10 test values, then with 100 and then with 60000 values and the program almost became DEAD with 60000 values specially when looking for duplicates ;'(. Do you have any idea how can I make it better, fast, only the duplicate part ?

Thanks in advance.

jimblumberg · Oct 24, 2010

The only thing I can suggest is to change your dPred and uPred vectors to pointers.
Code:
	std::vector<MyPred*> uPred; // values that were always unique
	std::vector<MyPred*> dPred; // values that were duplicated
This will mean instead of copying all the duplicates to a new vector you will only push the pointer to the current record into the new vector. This should speed things up quite a bit.

You will have to change the code where you push_back() into the new vector.
Code:
dPred.push_back(&(*j));  //put each duplicate onto a new vector
And also where you print them out.

Hope this helps.

Jim

heidik · Oct 24, 2010

Alright Thank you Jim. I have the changes you suggested and it is working. I actually need the uPred (unique elements) and dont need dPred and vPred so I made them pointer type and deleted them in the end.

Could you please help me complete this assignment. I am actually reading from a file containing strings (lines). The first and last string elements of each line are start date/time and end date/time respectively. I have to find the duplicates in original vector where all string elements in a line are the same except the first element i-e start date/time. I have to find the difference between the minits of the start date/time and the minits of the end date/time and push_back the element from the duplicates to the original vector where the time difference is the minimum. e-g

Start date/time of '1' element is 2006/06/01 16:34:43
End date/time of that element is 2006/06/01/ 16:55: 51

Start date/time of duplicate element (of '1') is 2006/06/01 16:24:43
End date/time of that element is 2006/06/01/ 16:55: 51

I now have to find the difference between 34 (minits of '1' element's start date/time) and 55 (minits of '1' element's end date/time) which is equal to 21

and the difference between 24 (minits of duplicate (of '1') element's start date/time) and 55 (minits of duplicate (of '1') element's end date/time) which is equal to 31.

since 21 is less than 31 so I will push_back the element with time difference of 21 to the original vector and will discard all its duplicates.

Could you help me do this? I hope I have explained well. For the time being I have put random int+string combination to struct first (start date/time) and last (end date/time) member variables instead of proper date/time to find the difference between the int part of that variable. Here's my code
Code:
struct MyPred
{
	std::string a;
	std::string x;
	std::string y;
	std::string z;

	MyPred(const std::string& a, const std::string& x, const std::string& y, const std::string& z): a(a), x(x), y(y), z(z) {}

	bool operator==(const MyPred& p) const
	{
		return x == p.x && y == p.y && z == p.z; // a == p.a && 
	}

	bool operator<(const MyPred& p) const
	{
		//if(a < p.a) return true;
		//if(a > p.a) return false;
		if(x < p.x) return true;
		if(x > p.x) return false;
		if(y < p.y) return true;
		if(y > p.y) return false;
		if(z < p.z) return true;
		if(z > p.z) return false;
		return false;
	}
};


int main()
{
	std::vector<MyPred>* vPred = new std::vector<MyPred>;
	vPred->push_back(MyPred("a2c", "1Gak", "c", "d4f"));
	vPred->push_back(MyPred("j4h", "b", "c", "j87h"));
	vPred->push_back(MyPred("d4f", "1Gak", "c", "d4f"));
	vPred->push_back(MyPred("n7s", "1Gak", "c", "d4f"));
	vPred->push_back(MyPred("l9m", "b", "c", "j87h"));
	vPred->push_back(MyPred("p24a", "x", "c", "p43a"));
	vPred->push_back(MyPred("q56r", "l", "m", "q90r"));
	vPred->push_back(MyPred("g11v", "8f", "h", "g63v"));
	vPred->push_back(MyPred("u3w", "v", "d", "u11w"));
	vPred->push_back(MyPred("k76l", "x", "c", "p43a"));
	vPred->push_back(MyPred("p24a", "g", "z", "p43a"));

	// The values need to be in order for equal_range() to work
	std::sort(vPred->begin(), vPred->end());

	std::vector<MyPred> uPred; // values that were always unique
	std::vector<MyPred>* dPred = new std::vector<MyPred>; // values that were duplicated

	std::pair<std::vector<MyPred>::iterator, std::vector<MyPred>::iterator> ret;

	for(std::vector<MyPred>::iterator i = vPred->begin(); i != vPred->end(); i = ret.second)
	{
		/*ret = std::equal_range(i, vPred.end(), *i);
		if(ret.second - ret.first == 1)
		{
			uPred.push_back(*i);
		}
		else
		{
			dPred.push_back(*i);
		}*/

		ret = std::equal_range(i, vPred->end(), *i);
		
		if(ret.second - ret.first != 1) // duplicates
		{
				for(std::vector<MyPred>::iterator j = ret.first; j != ret.second; ++j)
				{
					dPred->push_back(*j); //put each duplicate onto a new vector
				}
		}
		else if(ret.second - ret.first == 1)
		{
			uPred.push_back(*i);
		}
	}

	std::cout << "vPred: Sorted input\n";
	for(std::vector<MyPred>::iterator i = vPred->begin(); i != vPred->end(); ++i)
	{
		std::cout << "[" << i->a << ", " << i->x << ", " << i->y << ", " << i->z << "]" << '\n';
	}

	std::cout << "dPred: Only the values that were duplicated\n";
	for(std::vector<MyPred>::iterator i = dPred->begin(); i != dPred->end(); ++i)
	{
		std::cout << "[" << i->a << ", " << i->x << ", " << i->y << ", " << i->z << "]" << '\n';
	}

	std::cout << "uPred: Only the values that were unique\n";
	for(std::vector<MyPred>::iterator i = uPred.begin(); i != uPred.end(); ++i)
	{
		std::cout << "[" << i->a << ", " << i->x << ", " << i->y << ", " << i->z << "]" << '\n';
	}

	delete vPred;
	delete dPred;
}

jimblumberg · Oct 24, 2010

Start date/time of '1' element is 2006/06/01 16:34:43
End date/time of that element is 2006/06/01/ 16:55: 51

Start date/time of duplicate element (of '1') is 2006/06/01 16:24:43
End date/time of that element is 2006/06/01/ 16:55: 51

I now have to find the difference between 34 (minits of '1' element's start date/time) and 55 (minits of '1' element's end date/time) which is equal to 21

and the difference between 24 (minits of duplicate (of '1') element's start date/time) and 55 (minits of duplicate (of '1') element's end date/time) which is equal to 31.

since 21 is less than 31 so I will push_back the element with time difference of 21 to the original vector and will discard all its duplicates.
Click to expand...

For this you should check out the functions in the ctime header.

Create a structure tm to hold each of the dates and times. Then use difftime to compute the difference between the start and end times. I would not forget about the date part because if your time passes over midnight you will get incorrect results.

I would use stringstream to break apart your date, time string.

Jim

heidik · Oct 24, 2010

Pleaseeeeee Jim could you please do it for me because someone else suggested me the same thing but I havent used ctime etc ever. I have to submit it tomorrow ;'(. Pleaseeeeeeeeeeeee. Could you please make changes to my code and do it for me. I shall really be grateful to you.

jimblumberg · Oct 24, 2010

heidik said: ↑

Pleaseeeeee Jim could you please do it for me because someone else suggested me the same thing but I havent used ctime etc ever. I have to submit it tomorrow ;'(. Pleaseeeeeeeeeeeee. Could you please make changes to my code and do it for me. I shall really be grateful to you.
Click to expand...

I will help you, but I will not do the assignment for you. If I just give you the code what will you learn??

Create a function returning the difference in time (int) with the parameters of the start time, and end time strings.

The first thing to do is to strip out the individual parts ( year, month, day, hours, minutes, seconds. You can use std::string.substr() to do this. Or you could use getline() your choice.

Convert the individual strings that you obtained above to ints.

Create a structure tm for start and end times. Here is what struct tm looks like: http://www.cplusplus.com/reference/clibrary/ctime/tm/

Put the individual ints into the correct element of your structures.

Use mktime() to make the times from your structures. http://www.cplusplus.com/reference/clibrary/ctime/mktime/

Then use difftime() to compute the differences. http://www.cplusplus.com/reference/clibrary/ctime/difftime/

Jim

heidik · Oct 24, 2010

it isnt a university assignment. It has been given to me in office ;'(. I am new to STL and the code they have given me uses STL and it has almost been 2 weeks since I am trying to do it. I had to understand that code first and then tried to make changes and it has taken me almost 2 weeks. I just want to finish it before my boss says something to me ;'(. Could you please help?

Log in or Sign up

Need help with duplicate removal

heidik New Member

jimblumberg New Member

heidik New Member

jimblumberg New Member

heidik New Member

jimblumberg New Member

heidik New Member

jimblumberg New Member

heidik New Member

jimblumberg New Member

heidik New Member

jimblumberg New Member

heidik New Member

heidik New Member

jimblumberg New Member

heidik New Member

jimblumberg New Member

heidik New Member

jimblumberg New Member

heidik New Member

Share This Page

Log in or Sign up

Need help with duplicate removal

heidik New Member

jimblumberg New Member

heidik New Member

jimblumberg New Member

heidik New Member

jimblumberg New Member

heidik New Member

jimblumberg New Member

heidik New Member

jimblumberg New Member

heidik New Member

jimblumberg New Member

heidik New Member

heidik New Member

jimblumberg New Member

heidik New Member

jimblumberg New Member

heidik New Member

jimblumberg New Member

heidik New Member

Share This Page

Useful Searches