Choosing appropriate Hashing algorithms
Hi all,
I'm new here. I have been given a task regarding Hash tables. I have a file which contains one random number per line. I am to read the contents of the file and then hash them before finding some statistics. Here are the requirements : 1. The hash table should be implemented as an array with 5,000 elements (buckets). Each bucket should be able to store multiple entries to handle collisions; there should be no limit on the number of entries each bucket can store. Optimize your design for fast insertions. 2. A good hashing algorithm should be chosen and implemented to ensure that each bucket has approximately the same number of entries. 3. The following statistics should be stored in the output file: a. Total count of random numbers received by system. b. Total time taken to insert the numbers into hashtable. c. Average time taken to insert a number into hashtable. d. Average number of random numbers stored in each bucket. e. The number of entries in each bucket. The maximum and minimum should be highlighted. I basically need to know which kinds of hashing function/algorithm will be best suited for this task. Thanks, T.W. 
Re: Choosing appropriate Hashing algorithms
A hashing algorithm in C, not C++.

Re: Choosing appropriate Hashing algorithms
Basically, the problem statement is that I'm to read numbers from a file, store them efficiently using a hash table and then provide the statistics mentioned above. Any help would be greatly appreciated.

Re: Choosing appropriate Hashing algorithms
Very simple, by the looks of it. All you need to do is determine the maximum number M in the input file, then if the numbers are completely random in the sense that the probability of any number occurring is exactly equal to the probability of any other number, then you can simply store sequential numbers in each bucket, for example if the M=14999, then the values 0,1,2 can be stored in bucket[0], the values 3,4,5 in bucket[1], 6,7,8 in bucket[2],..., 3n, 3n+1,3n+2 in bucket[n].
Then if bucket is defined as int bucket[5000];, all you need to is initialise this to all zeros then for each number N in the input file just do bucket[N/3]++; You may also need to determine the lowest number L in the input file, in which case bucket[0] contains L, L+1, L+2 (assuming M=L+14999). Generalise for M=L+<any number> and determine how many items I would go in each bucket then the counter would be bucket[(NL)/I]++;. Analysing the probability distribution is more tricky  you will have to make assumptions and test for various possibilities. The obvious one to check for is a flat line distribution, i.e. P(n)=1/n. Depends how complicated this project is. Another obvious one to check for is a straight line (not flat) distribution, where there are more small numbers than large, or vice versa. Getting less obvious, you could also check for a normal distribution, and of course the extent of this depends on what kinds of probability distributions you've covered; if this is a basic programming course then I would have thought a flat line distribution was the most likely, but if you're doing a PhD in statistics then of course this would probably need to cover all distributions mankind has ever thought up. The allocation of numbers to buckets would then be the "inverse" of the distribution curve, for example if you take a normal distribution and flip it horizontally (so that it looks a bit like y=cos(x) for x=0..2pi) then this would ensure the less probable extreme values are compressed over fewer buckets and the more probable middle values spread out over more, so that the outcome is approximately the same number of items in each bucket. 
Re: Choosing appropriate Hashing algorithms
Thanks alot xpi0t0s. This is very helpful. I do have some more questions.
Regarding this implementation, is it possible to retrieve the numbers from the hash table? Because it doesn't seem like it. Is there another way in which I can retrieve the stored numbers too ? Regards, T.W. 
All times are GMT +5.5. The time now is 20:00. 