Go4Expert (http://www.go4expert.com/)
-   Engineering Concepts (http://www.go4expert.com/forums/engineering-concepts/)
-   -   Efficient sorting into categories with choices (http://www.go4expert.com/forums/efficient-sorting-categories-choices-t11861/)

 Liglogthepanda 5Jul2008 01:16

Efficient sorting into categories with choices

So here's the problem.

I have 8 categories, and about a hundred items. Each item can fit into 1, 2, 3 or 4 of these categories (and we know which ones each item can fit into). The task is to have 'the most even distribution' of the items among the categories, i.e. it is best if all the categories have something in them before putting more than one into a category. Once an item has been placed in a category, it cannot be repeated in another.

Example 1, with eight items
Categories labelled, A to H. Items are represented by X's.
Code:

```A | B | C | D | E | F | G | H ---------------------------- X | X | X | X | X | X | X | X .......Table (1) A | B | C | D | E | F | G | H ---------------------------- X | X | X | X | X |  | X | X X |  |  |  |  |  |  |  .......Table (2) A | B | C | D | E | F | G | H ---------------------------- X | X | X | X | X |  |  | X X |  | X |  |  |  |  |  .......Table (3)```
In the above case (with eight items), 1 is better than 2 or 3. Basically, to fill a row as quickly as possible is the task, given that some items will only fit into some categories, and that you can't put the same item into two categories in the end!

Example 2, with ten items
Categories labelled, A to H. Items are represented by X's.
Code:

```A | B | C | D | E | F | G | H ---------------------------- X | X | X | X | X | X | X | X .......Table (1) X |  | X |  |  |  |  | A | B | C | D | E | F | G | H ---------------------------- X | X | X | X | X |  | X | X X |  | X | X |  |  |  |  .......Table (2) A | B | C | D | E | F | G | H ---------------------------- X | X | X | X | X |  |  |  X | X | X | X |  |  |  |   | X |  |  |  |  |  |  .......Table (3)```
Again, 1 is better than 2 or 3.

Example data:
Code:

```Item X could fit into category A. Item Y could fit into categories A or B. Item Z could fit into category C. Item W could fit into categories B or F.```
The second stage after this is filling the next row up as quickly as possible, but I'm not too concerned about this at the moment (one step at a time, right!)

I'm having some difficulty thinking of an algorithm for this, and I don't know what to search for in my favourite search engine (which may or may not begin with a G), so help, please!

 xpi0t0s 9Sep2008 17:06

Re: Efficient sorting into categories with choices

Interesting problem. I think it could be a variation on the knapsack algorithm.

I had a play with this, first with 10 items. Some interesting ideas but no final solution. Got the computer to generate some random data:

Item000 Categories: FG
Item001 Categories: H
Item002 Categories: EF
Item003 Categories: BCDF
Item004 Categories: A
Item005 Categories: E
Item006 Categories: F
Item007 Categories: FG
Item008 Categories: BDFG
Item009 Categories: DEF

With 10 items obviously the best result is 1 full row and 2 left over. This can be determined directly from the number of items, so the best result for 20 is 2 full rows and 4 left over (20=2*8+4) and with 100 it's 12 rows and 4 over (12*8=96).

There are some heuristics we can apply to reduce the search space. Obviously 1,4,5 and 6 can only go into one category so this gives us A:4 B: C: D: E:5 F:6 G: H:1 and leaves:
Item000 Categories: FG
Item002 Categories: EF
Item003 Categories: BCDF
Item007 Categories: FG
Item008 Categories: BDFG
Item009 Categories: DEF

Now only one item (3) can go in C. Then there's only 1 in B(8) then 1 in D(9) so this leaves:
A:4 B:8 C:3 D:9 E:5 F:6 G: H:1
Item000 Categories: FG
Item002 Categories: EF
Item007 Categories: FG

To achieve the ideal result (1 full row and 2 left over) we just have to drop one of these in G and be careful not to pick FF for the remaining 2. Final result A:4 B:8 C:3 D:9 E:2,5 F:6,7 G:0 H:1 (one of 6 possibilities (0 or 7 in G, then EF,EG or FG for the remaining 2)).

Then I tried with 20 items:
B B ACD H BEGH EF AD H AH CDFH CDGH F CEGH ADGH H ABF ABCF BFG BH E. Fixing the singletons gives A: B:0,1 C: D: E:19 F:11 G: H:3,7,14. Interestingly H is now "full", meaning for the best solution, we don't want to put any more in there. So this reduces the categories to
- - ACD - BEG EF AD - A CDF CDG - CEG ADG - ABF ABCF BFG B - which now leaves singletons 8 and 18 in A and B. Result: A:8 B:0,1,18 C: D: E:19 F:11 G: H:3,7,14. B is now full so let's remove it from the category list in the search for the ideal solution:
- - ACD - EG EF AD - - CDF CDG - CEG ADG - AF ACF FG - -

So now what do we complete the first row with? The choices for C are 2,9,10,12,16; for D 2,6,9,10,13; for G 4,10,12,13,17. This could be where the search starts.

On the other hand, given the speed of modern hardware we could just loop through all possibilities asking the simple question can we pick 2 A's, 2 B's, 2 C's... 2 H's in such a way as to leave four singletons at the end? With just 100 items and only 4 categories the time to perform the search could be considerably less than the time it would take a highly paid engineer to determine the "ideal" solution, particularly as we can bail out of the loop on the first occurrence of a 2*8+4 solution.

From the initial data B B ACD H BEGH EF AD H AH CDFH CDGH F CEGH ADGH H ABF ABCF BFG BH E category A appears in items 2 6 8 13 15 16 so there are 6*5=30 possible choices for 2 A's. The first choice, 2,6, leaves the following data: B B - H BEGH EF - H AH CDFH CDGH F CEGH ADGH H ABF ABCF BFG BH E (not removing the other A's, because it's still OK to have an A left over at the end).

So in part the answer depends on the type of course you're on (if any). If it's a purely academic environment where the determination of the ideal solution is the goal regardless of how long it takes (undergrads are a free infinite resource we can waste as much of as we like) then that determines one solution. On the other hand if it's a business environment where time equals money then of all the results that are "good enough" the cheapest (i.e. quickest) could be best.

If the test data is horribly skewed so that 50% of the items appear in A,B,C the program may need to be able to cope with that kind of possibility.

 All times are GMT +5.5. The time now is 19:26.