Choosing appropriate hashing algorithms

Discussion in 'C' started by mtwaha, Apr 5, 2011.

  1. mtwaha (New Member)

    Hi all,

    I'm new here. I have been given a task involving hash tables. I have a file that contains one random number per line. I need to read the contents of the file, hash the numbers into a table, and then compute some statistics on the result.

    Here are the requirements:
    1. The hash table should be implemented as an array with 5,000 elements (buckets). Each bucket should be able to store multiple entries to handle collisions; there should be no limit on the number of entries each bucket can store. Optimize your design for fast insertions.

    2. A good hashing algorithm should be chosen and implemented to ensure that each bucket has approximately the same number of entries.

    3. The following statistics should be stored in the output file:
    a. The total count of random numbers received by the system.
    b. The total time taken to insert the numbers into the hash table.
    c. The average time taken to insert a number into the hash table.
    d. The average number of random numbers stored in each bucket.
    e. The number of entries in each bucket, with the maximum and minimum highlighted.

    I basically need to know what kind of hashing function/algorithm would be best suited to this task.
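
    For reference, here is a rough skeleton of the kind of chained table I think requirement 1 describes; the struct and function names are just placeholders I made up, and the hash function is the part I'm unsure about:

    Code:
        #include <stdio.h>
        #include <stdlib.h>

        #define NUM_BUCKETS 5000

        /* one entry in a bucket's linked list */
        struct node {
            long value;
            struct node *next;
        };

        static struct node *table[NUM_BUCKETS];   /* static, so all NULL at start */

        /* placeholder hash function - this is the part I need help choosing;
           the unsigned conversion keeps the index in range even for negative input */
        static unsigned int hash(long value)
        {
            return (unsigned int)((unsigned long)value % NUM_BUCKETS);
        }

        /* insert at the head of the bucket's list: O(1) per insertion,
           which should satisfy "optimize your design for fast insertions" */
        static void insert(long value)
        {
            unsigned int b = hash(value);
            struct node *n = malloc(sizeof *n);
            if (n == NULL) {
                fprintf(stderr, "out of memory\n");
                exit(EXIT_FAILURE);
            }
            n->value = value;
            n->next = table[b];
            table[b] = n;
        }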

    Thanks, T.W.
     
  2. mtwaha (New Member)

    To clarify, I need a hashing algorithm in C, not C++.
     
  3. mtwaha (New Member)

    Basically, the problem is to read numbers from a file, store them efficiently in a hash table, and then produce the statistics listed above. Any help would be greatly appreciated.
     
  4. xpi0t0s (Mentor, Senior Support Engineer, England)

    Very simple, by the looks of it. First determine the maximum number M in the input file. If the numbers are completely random, in the sense that the probability of any number occurring is exactly equal to the probability of any other number, then you can simply store sequential values in each bucket. For example, if M=14999, the values 0,1,2 can be stored in bucket[0], the values 3,4,5 in bucket[1], 6,7,8 in bucket[2], ..., and 3n, 3n+1, 3n+2 in bucket[n].

    Then if bucket is defined as int bucket[5000];, all you need to do is initialise it to all zeros, then for each number N in the input file just do bucket[N/3]++;

    You may also need to determine the lowest number L in the input file, in which case bucket[0] contains L, L+1, L+2 (assuming M=L+14999). Generalise for M=L+<any number>: determine how many items I should go in each bucket, and the counter update becomes bucket[(N-L)/I]++;.
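
    A minimal sketch of that counting pass, assuming one integer per line in the file and that L (lowest value) and I (items per bucket) have already been worked out from a first pass over the data; the names follow the ones above:

    Code:
        #include <stdio.h>

        #define NUM_BUCKETS 5000

        int bucket[NUM_BUCKETS];    /* static storage, so already all zeros */

        /* second pass over the file: count how many values land in each bucket */
        void count_values(FILE *fp, long L, long I)
        {
            long N;
            while (fscanf(fp, "%ld", &N) == 1) {
                long idx = (N - L) / I;
                if (idx >= NUM_BUCKETS)          /* guard the top end against rounding */
                    idx = NUM_BUCKETS - 1;
                bucket[idx]++;
            }
        }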

    Analysing the probability distribution is trickier - you will have to make assumptions and test for various possibilities. The obvious one to check for is a flat (uniform) distribution, i.e. P(x)=1/K for each of the K possible values. Another obvious one is a straight-line (but not flat) distribution, where there are more small numbers than large, or vice versa. Getting less obvious, you could also check for a normal distribution; how far you take this depends on how complicated the project is and on what kinds of probability distributions you've covered. If this is a basic programming course then I would have thought a uniform distribution was the most likely, but if you're doing a PhD in statistics then it would probably need to cover every distribution mankind has ever thought up.

    The allocation of numbers to buckets would then be the "inverse" of the distribution curve. For example, if you take a normal distribution and flip it vertically (so that it looks a bit like y=cos(x) for x=0..2pi), the less probable extreme values are compressed into fewer buckets and the more probable middle values are spread out over more buckets, so that the outcome is approximately the same number of items in each bucket.
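
    If you can afford to hold all the values in memory and make a second pass, one concrete way to get that "inverse of the distribution curve" effect (my own suggestion, not something required by the assignment) is to let each value's position in the sorted data, i.e. its empirical CDF, pick the bucket, which flattens any distribution:

    Code:
        #include <stdlib.h>

        #define NUM_BUCKETS 5000

        static int cmp_long(const void *a, const void *b)
        {
            long x = *(const long *)a, y = *(const long *)b;
            return (x > y) - (x < y);
        }

        /* Bucket each value by its rank in the sorted data (the empirical
           inverse of the distribution), so every bucket ends up with roughly
           count/NUM_BUCKETS entries whatever the underlying distribution. */
        void fill_counts(long *values, size_t count, int bucket[NUM_BUCKETS])
        {
            size_t i;
            qsort(values, count, sizeof values[0], cmp_long);
            for (i = 0; i < count; i++)
                bucket[i * NUM_BUCKETS / count]++;
        }

    Note this sketch assigns buckets by rank rather than by a function of the value alone, so equal values near a boundary can end up split across two buckets; it is only meant to illustrate the flattening idea.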
     
  5. mtwaha (New Member)

    Thanks a lot, xpi0t0s. This is very helpful. I do have some more questions.

    Regarding this implementation, is it possible to retrieve the numbers from the hash table? It doesn't seem like it, since only counts are stored. Is there another way in which I can also retrieve the stored numbers?

    Regards,

    T.W.
     
