Go4Expert


Sanskruti 13Jul2007 17:01

Win32 Application to capture email addresses from any appropriate file
 
Recently, I developed a Win32 application for one of my clients to capture email addresses from a file. Though simple and straightforward, it was an interesting project, so I thought I should share it with you.

Program Description



This program extracts email addresses from an input file. It reads a file called "mail.dat" (the input file) and writes every email address it finds to a file called "addresses.dat". Alternatively, you can also use 'txt' files and some other types of files.

Please note that this program considers a valid email address to be a string that contains ONLY letters, numbers, dashes, underscores, the '@' sign, and the dot ('.'). The program also expects an email address to start with a letter.
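
The rule above can be sketched as two small helpers. This is only an illustration of the rule, not the code used in the program below; the names `is_email_char` and `looks_like_email` are hypothetical. The sketch does not require an '@' sign, because the program finds candidates by searching for '@' first.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if ch is one of the characters the program
// accepts inside an email address.
bool is_email_char(char ch)
{
    return (ch >= 'a' && ch <= 'z') ||
           (ch >= 'A' && ch <= 'Z') ||
           (ch >= '0' && ch <= '9') ||
           ch == '-' || ch == '_' || ch == '@' || ch == '.';
}

// Hypothetical helper: true if s is non-empty, starts with a letter and
// contains only accepted characters.
bool looks_like_email(const std::string &s)
{
    if (s.empty()) return false;

    char first = s[0];
    bool starts_with_letter = (first >= 'a' && first <= 'z') ||
                              (first >= 'A' && first <= 'Z');
    if (!starts_with_letter) return false;

    for (std::size_t i = 0; i < s.size(); ++i)
    {
        if (!is_email_char(s[i])) return false;
    }
    return true;
}
```

With these helpers, "john_doe@example.com" is accepted, while a string that starts with a digit or contains a space is rejected.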

Important



The file "mail.dat" must be in the same directory as your 'cpp' file. You do not need to create "addresses.dat"; it is created automatically.

Keep the "mail.dat" file in the same directory as the application file "ExtractEmails". Run "ExtractEmails" and the email addresses in "mail.dat" will be extracted to "addresses.dat". After the project compiles successfully, you can find the executable in the 'debug' folder under your project.

I wrote this program so that it should work in all of the following environments: VC++ 5.0, VC++ 6.0, Visual Studio 2003, VC++ Express Edition and VC++ 2005 Professional edition, because I wanted to run it on all MS-Windows versions such as Windows 98, Windows ME, Windows 2000, Windows XP, Windows Vista and others. I have already checked it in these environments and it works fine. It should compile without errors or warnings in each of them with little or no modification.

I have also put comments in the source code wherever I felt they were necessary. Too many comments make a program unreadable, so I have kept them to a minimum.

Here, I am also providing a brief description of some of the important functions used in the program. I am sure those who know C and C++ are already familiar with them.

'malloc' function: a brief description

The 'malloc' function allocates a memory block of at least the requested number of bytes and returns a pointer to it, or NULL if the allocation fails. The block may be larger than requested because of the space required for alignment and maintenance information.
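
As a minimal sketch of typical 'malloc' usage, the helper below copies a string into a block that is exactly large enough for it. The name `copy_string` is hypothetical and is not part of the program below.

```cpp
#include <stdlib.h>
#include <string.h>

// Hypothetical helper: returns a heap-allocated copy of text, or NULL if
// malloc fails. The caller must release the copy with free().
char *copy_string(const char *text)
{
    size_t size = strlen(text) + 1;       // +1 for the terminating '\0'
    char *copy = (char *) malloc(size);   // the cast is required in C++

    if (copy == NULL) return NULL;        // allocation failed

    memcpy(copy, text, size);             // copies the '\0' as well
    return copy;
}
```

Sizing the block with 'strlen' rather than 'sizeof' of a fixed buffer keeps each stored address as small as possible.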

File I/O functions used in program: a brief description

The 'fseek' function moves the file pointer (if any) associated with the stream to a new location that is 'offset' bytes from 'origin'.

The 'ftell' function retrieves the current position of the file pointer (if any) associated with the stream.

The 'fgetc' function returns the character read as an int or returns EOF to indicate an error or end of file.

The 'fputs' function writes a string to a stream. It copies a string to the output stream at the current position.

The 'rewind' function repositions the file pointer associated with stream to the beginning of the file.
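
The functions described above can be combined as in the following sketch, which writes a small file, measures its size with 'fseek' and 'ftell', returns to the start with 'rewind', and counts '@' signs with 'fgetc'. The names `write_text` and `scan_file` are hypothetical; the program below does the same work in a different way.

```cpp
#include <stdio.h>

// Hypothetical helper: writes text to path with fputs; true on success.
bool write_text(const char *path, const char *text)
{
    FILE *fp = fopen(path, "w");
    if (fp == NULL) return false;

    bool ok = (fputs(text, fp) != EOF);   // fputs returns EOF on error
    return (fclose(fp) == 0) && ok;
}

// Hypothetical helper: stores the file size in *size and the number of
// '@' signs in *at_signs; true on success.
bool scan_file(const char *path, long *size, long *at_signs)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL) return false;

    if (fseek(fp, 0L, SEEK_END) != 0)     // move to the end of the file
    {
        fclose(fp);
        return false;
    }
    *size = ftell(fp);                    // offset of the end = size in bytes
    rewind(fp);                           // back to the beginning

    *at_signs = 0L;
    int ch;                               // int, so EOF can be told apart
    while ((ch = fgetc(fp)) != EOF)
    {
        if (ch == '@') (*at_signs)++;
    }

    fclose(fp);
    return true;
}
```

Note that 'ch' is declared as an int, not a char, so that the EOF value returned by 'fgetc' cannot be confused with a valid character.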

Prerequisites

In order to complete the example application, you should be familiar with the following:

C, C++, VC++

Source file: ExtractEmails.cpp



Code: cpp

// Purpose:    collect email addresses from file "mail.dat" and copy them into "addresses.dat"
//
//             Pseudo-code:
//             open file
//             find '@' sign
//               "back up" to start of email address
//               place email address in memory
//             repeat steps above until end of file is reached

#define _CRT_SECURE_NO_DEPRECATE 1  // this is to suppress deprecation warnings
#define _CRT_NONSTDC_NO_DEPRECATE 1   // this is to suppress deprecation warnings

#include <stdio.h>
#include <stdlib.h>  // malloc, exit
#include <string.h>  // strcpy, strcat, strcmp, strlen
#include <iostream>

using namespace std;

FILE *fp;   // input file pointer
FILE *ofp;  // output file pointer

long byte_count = 0L; 
long at_count = 0L;

char fname[200]; //refers to file name
char collected[200]; //refers to email address

char curr_string[200]; //refers to current email address
char prev_string[200]; //refers to previous email address
 
char *hold[250000]; // array of pointers to char, 250,000 maximum, ***holds email addresses***
int mx = 0;   // index into memory

//function prototypes
void get_emails();
void copy_emails();
size_t validate(char the_string[]);

int main()
{
    get_emails(); // extract emails

    copy_emails(); // copies emails to addresses.dat
 
    return (0);
}

void get_emails() // this function extracts emails from "mail.dat"
{
    char disp_msg[500];  //to display message
    char temp_str[200];  //temporary string
    int ch;     //character: to read from file
    long fpos = 0L;   //position of byte/cursor in file
    int idx;

    char umsg[500]; // for debug purposes

    strcpy(fname, "mail.dat");
   
    if ( (fp = fopen(fname, "rb")) == NULL)
    {
        printf("Can not open file mail.dat\n\n\n");
        exit(1);
    }

    // counts number of emails (at_count) in file based on '@' and calculates file size (byte_count)
    while ((ch = fgetc(fp)) != EOF)
    {
        if (ch == '@')
            at_count++;   
       
        byte_count++;
    }
 
    fclose(fp);

    sprintf(disp_msg, "\nFile name: %s\n", fname);
    sprintf(temp_str, "\nFile Size: %ld bytes\n", byte_count);
    strcat(disp_msg, temp_str);
    sprintf(temp_str, "\nCounted %ld '@' signs in the file.\n", at_count);
    strcat(disp_msg, temp_str);

    printf("%s", disp_msg); // pass the text as data, not as a format string

    if ( (fp = fopen(fname, "rb")) == NULL)
    {
        cout<< "Can not open file mail.dat\n\n\n";
        exit(1);
    }

    int valid = 0;
 
    while ((ch = fgetc(fp)) != EOF && (fpos <= byte_count))
    {
        if (ch == '@')
        {
          // '@' signs were already counted in the first pass, so at_count is not incremented again
   
          fpos = ftell(fp) - 1L;

          sprintf(umsg, "\n'@' sign found at byte position %ld", fpos);    
          printf(umsg);

          if (fpos >= 1L) fpos--;

          fseek(fp, fpos, SEEK_SET); // step back to the character before '@'
          ch = fgetc(fp);

          printf("\nstart backing up ...");       
          
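          while (
                  (ch >= 'a' && ch <= 'z') ||
                  (ch >= 'A' && ch <= 'Z') ||
                  (ch >= '0' && ch <= '9') ||
                  (ch == '_' || ch == '-'  || ch == '.')
                )
          {
              if (fpos == 0)
              {
                  rewind(fp); // the address starts at the first byte of the file
                  break;
              }
              else
              {
                  fpos--;
                  fseek(fp, fpos, SEEK_SET);
                  ch = fgetc(fp);
              }

              if (ch == EOF) break; // read failed: stop backing up; fp is closed at the end of the function
          }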

           idx = 0;

          printf("\nFinished backing up...\n\n");
       
          while ( (ch = fgetc(fp)) != EOF)
            { 
                //printf("\nstart collecting ... ");
   
              valid = 0;
                   
                if (ch >= 'a' && ch <= 'z') valid = 1;
                if (ch >= 'A' && ch <= 'Z') valid = 1;
                if (ch >= '0' && ch <= '9') valid = 1;
                if (ch == '_' || ch == '-') valid = 1;
                if (ch == '@' || ch == '.') valid = 1;
       
                if (!valid) break;
     
                if (idx >= (int) sizeof(collected) - 1) break; // leave room for the terminating '\0'

                collected[idx] = (char) ch;
                idx++;
            }          
     
            collected[idx] = '\0';
         
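            if (mx >= 250000) break; // hold[] is full: stop scanning the file

            hold[mx] = (char *) malloc(strlen(collected) + 1); // +1 for the terminating '\0'

            if (hold[mx] == NULL)
            {
                printf("Out of memory.\n");
                exit(1);
            }

            strcpy(hold[mx], collected);

            mx++;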

        }  // end of outer if          
    }

    fclose(fp);
}

void copy_emails()
{
    int first = 1;
    int found_dot = 0;
    int x;
    int z;
    char s1[800];

    if ( (ofp = fopen("addresses.dat", "w") ) == NULL)
    {
        printf("Unable to open output file.");
        exit(1);
    }

    if (mx == 0)
    {
        sprintf(s1, "\n  File: %s", fname);    
        strcat(s1, "\n  No email addresses were found in the file.     ");
        printf("%s", s1); // pass the text as data, not as a format string
    }

    for (x = 0; x < mx; x++)
    {
        if (first) // first address to display
        { 
            first = 0; // turn off this flag
            strcpy(curr_string, hold[x]);

            if ((curr_string[0] >= 'a' && curr_string[0] <= 'z') || (curr_string[0] >= 'A' && curr_string[0] <= 'Z'))
            {
                validate(curr_string);
                fputs(curr_string, ofp);
                fputs("\n", ofp);
                continue;
            }
      }

        if (!first && x >= 1 && x <= mx) 
        {
            strcpy(curr_string, (char *) hold[x]);
            strcpy(prev_string, (char *) hold[x - 1]);

            if (strcmp(curr_string, prev_string) == 0) // duplicate
            {
                continue;
            }
            else
            {
                found_dot = 0;
                z = 0;
               
                while (curr_string[z]) // look for a dot in email address
                { 
                    if (curr_string[z] == '.') found_dot = 1;
                    z++;              
                }
 
                if (found_dot) // if there is a dot in address, display it
                { 
                    if ((curr_string[0] >= 'a' && curr_string[0] <= 'z') || (curr_string[0] >= 'A' && curr_string[0] <= 'Z'))
                    {
                            validate(curr_string);

                            fputs(curr_string, ofp);   // write to output file
                            fputs("\n", ofp);
                    }
                }
            }
        }
    }
    fclose(ofp);   // close output file
}

size_t validate(char curr_string[])
{
  // validate function makes sure that last char of "curr_string"
  // ends in range of 'a' to 'z' (or 'A' to 'Z'.)
  // This function removes any trailing period, dash, etc.
  // Return value is length of validated string
 
  size_t x = strlen(curr_string); // size_t: unsigned integer type

  // Strip trailing characters until the string ends in a letter or becomes empty.
  // Testing x > 0 first keeps the unsigned index from wrapping around.
  while (x > 0)
  {
    char last = curr_string[x - 1];
    if ((last >= 'a' && last <= 'z') || (last >= 'A' && last <= 'Z')) break;

    curr_string[x - 1] = '\0'; // replace trailing char with the null terminator
    x--;
  }
  return x;
}


shabbir 13Jul2007 18:19

Re: Win32 Application to capture email addresses from any appropriate file
 
You should not predefine the array size; use a pointer instead, at least for char *hold[250000];, because you never know the size of the file you are parsing.
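
One way to avoid the fixed limit, sketched here under the assumption that standard library containers are acceptable in the project, is to keep the addresses in a std::vector<std::string> that grows as needed. The function name `store_address` is hypothetical.

```cpp
#include <string>
#include <vector>

// Hypothetical replacement for "hold[mx] = malloc(...); strcpy(...); mx++;".
// The vector allocates more room on demand, so there is no fixed maximum.
void store_address(std::vector<std::string> &addresses, const char *collected)
{
    addresses.push_back(collected);   // copies the characters of collected
}
```

The number of stored addresses is then addresses.size(), which takes the place of the separate mx counter.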

Sanskruti 13Jul2007 18:58

Re: Win32 Application to capture email addresses from any appropriate file
 
Yes, that's true. But in this case, I was told that the file size would not exceed the limit given in the program.

Sanskruti 13Jul2007 18:59

Re: Win32 Application to capture email addresses from any appropriate file
 
With some effort, this program can definitely be generalized. The same result can be achieved by using regular expressions or by using plain loops.
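
A minimal sketch of the regular expression approach, assuming a C++11 compiler with <regex> (so not VC++ 5.0 or 6.0). The function name `find_emails` and the pattern are illustrative assumptions; the pattern follows the rules described in the article (starts with a letter, contains a dot, ends with a letter) but is not guaranteed to match exactly the same strings as the posted program in every edge case.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every substring of text that matches the
// pattern, in order of appearance.
std::vector<std::string> find_emails(const std::string &text)
{
    // Local part: a letter, then letters, digits, '.', '_' or '-'.
    // Domain: one or more labels separated by dots, ending in letters.
    static const std::regex pattern(
        R"([A-Za-z][A-Za-z0-9._-]*@[A-Za-z0-9_-]+(?:\.[A-Za-z0-9_-]+)*\.[A-Za-z]+)");

    std::vector<std::string> found;
    for (std::sregex_iterator it(text.begin(), text.end(), pattern), end;
         it != end; ++it)
    {
        found.push_back(it->str());
    }
    return found;
}
```

For the text "Contact john_doe@example.com or x@bad, also a.b-c@mail.example.org." this returns the two addresses that contain a dot after the '@' and skips "x@bad".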

shabbir 13Jul2007 19:31

Re: Win32 Application to capture email addresses from any appropriate file
 
Quote:

Originally Posted by Sanskruti
Yes, that's true. But in this case, I was told that the file size would not exceed the limit given in the program.

Even if you are given the file size, it's always good to have the most general solution, because clients don't like a program that crashes on different test cases.

Sanskruti 14Jul2007 12:01

Re: Win32 Application to capture email addresses from any appropriate file
 
I understand and I agree.

parvez.yu 6Mar2008 14:51

Re: Win32 Application to capture email addresses from any appropriate file
 
I feel we should always use char * instead of arrays.


All times are GMT +5.5.