Win32 Application to capture email addresses from any appropriate file

Discussion in 'Win32' started by Sanskruti, Jul 13, 2007.

  1. Sanskruti

    Sanskruti New Member

    Joined:
    Jan 7, 2007
    Messages:
    108
    Likes Received:
    18
    Trophy Points:
    0
    Occupation:
    Software Consultant
    Location:
    Mumbai, India
    Recently, I developed a Win32 application for one of my clients to capture email addresses from a file. Although simple and straightforward, it was an interesting project, so I thought I should share it with you.

    Program Description



    This program extracts email addresses from an input file. It reads a file called "mail.dat" (the input file) and writes every email address it finds to a file called "addresses.dat". You can also use 'txt' files and some other file types as input.

    Please note that this program considers a valid email address to be a string that contains ONLY letters, numbers, dashes, underscores, the '@' sign, and the dot ('.'). The program also expects an email address to start with a letter.
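
    The rule above can be sketched as a small check. This is only an illustration of the stated rule, not the code used in the program below; the helper names is_email_char and looks_like_email are hypothetical, and the sketch assumes the default "C" locale so that only ASCII letters and digits are accepted. It does not check that an '@' is present.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: true if c is one of the characters the rule allows.
bool is_email_char(unsigned char c)
{
    return std::isalnum(c) || c == '-' || c == '_' || c == '@' || c == '.';
}

// Hypothetical helper: applies the two stated rules to a candidate string.
bool looks_like_email(const std::string &s)
{
    // Rule 1: the address must start with a letter.
    if (s.empty() || !std::isalpha(static_cast<unsigned char>(s[0])))
        return false;

    // Rule 2: every character must come from the allowed set.
    for (std::string::size_type i = 0; i < s.size(); ++i)
        if (!is_email_char(static_cast<unsigned char>(s[i])))
            return false;

    return true;
}
```

    For example, looks_like_email("john_doe@example.com") is accepted, while a candidate that starts with a digit or contains a space is rejected.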

    Important



    The file "mail.dat" must be in the same directory as your 'cpp' file. You do not need to create "addresses.dat"; it is created automatically.

    If you run the executable directly, keep "mail.dat" in the same directory as the application file "ExtractEmails". Click on "ExtractEmails" and the email addresses in "mail.dat" are extracted to "addresses.dat". After the project compiles successfully, you can find the executable in the 'debug' folder under your project.

    I wrote this program so that it should work in ALL of these environments: VC++ 5.0, VC++ 6.0, Visual Studio 2003, VC++ Express Edition, and VC++ 2005 Professional Edition, because I wanted to run it on all MS-Windows versions such as Windows 98, Windows ME, Windows 2000, Windows XP, Windows Vista and others. I have already checked it in these environments and it works fine. It should compile without errors or warnings in each of them with little or no modification.

    I have also put comments in the source code wherever I felt they were necessary for understanding the program. However, too many comments make a program unreadable, so I have kept them to a minimum.

    Here is a brief description of some of the important functions used in the program. I am sure those who know C and C++ are already aware of them.

    'malloc' function: a brief description

    The 'malloc' function allocates a memory block of at least 'size specified' bytes. The block may be larger than 'size specified' bytes because of space required for alignment and maintenance information.
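
    As a minimal sketch of how 'malloc' is typically used to store a string, the helper below allocates exactly as many bytes as the string needs, including the terminating '\0'. The name copy_string is hypothetical and is not part of the program that follows.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: returns a heap copy of src, or NULL if malloc fails.
// The caller is responsible for releasing the copy with free().
char *copy_string(const char *src)
{
    std::size_t len = std::strlen(src) + 1;   // +1 for the terminating '\0'
    char *dst = static_cast<char *>(std::malloc(len));

    if (dst != NULL)
        std::memcpy(dst, src, len);           // copies the text and the '\0'

    return dst;
}
```

    Checking the return value for NULL matters because 'malloc' reports failure that way instead of terminating the program.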

    File I/O functions used in program: a brief description

    The 'fseek' function moves the file pointer (if any) associated with stream to a new location that is offset bytes from origin.

    The 'ftell' function retrieves the current position of the file pointer (if any) associated with stream.

    The 'fgetc' function returns the character read as an int or returns EOF to indicate an error or end of file.

    The 'fputs' function writes a string to a stream. It copies a string to the output stream at the current position.

    The 'rewind' function repositions the file pointer associated with stream to the beginning of the file.
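
    The functions above combine naturally. As a rough sketch, assuming a stream opened in binary mode ("rb") on a platform where seeking to the end of such a stream is supported (as it is on Windows), the size of a file can be found with 'fseek' and 'ftell', and 'rewind' then puts the file pointer back at the start. The name file_size is hypothetical.

```cpp
#include <cstdio>

// Hypothetical helper: returns the size in bytes of an open binary stream,
// or -1L on failure, and leaves the file pointer at the start of the file.
long file_size(std::FILE *fp)
{
    if (std::fseek(fp, 0L, SEEK_END) != 0)   // move to the end of the file
        return -1L;

    long size = std::ftell(fp);              // offset of the end == byte count
    std::rewind(fp);                         // back to the beginning

    return size;
}
```

    The program below takes a different route and counts bytes one at a time with 'fgetc'; both approaches give the byte count of the file.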

    Prerequisites

    In order to complete the example application, you should be familiar with the following:

    C, C++, VC++

    Source file: ExtractEmails.cpp



    Code:
    
    // Purpose:    collect email addresses from file "mail.dat" and copy them into "addresses.dat"
    //
    //             Pseudo-code:
    //             open file
    //             find '@' sign
    //               "back up" to start of email address
    //               place email address in memory
    //             repeat steps above until end of file is reached
    
    #define _CRT_SECURE_NO_DEPRECATE 1	// this is to suppress deprecation warnings
    #define _CRT_NONSTDC_NO_DEPRECATE 1   // this is to suppress deprecation warnings
    
    #include <stdio.h>
    #include <stdlib.h>   // malloc, exit
    #include <string.h>   // strcpy, strcat, strcmp, strlen
    #include <iostream>
    
    using namespace std;
    
    FILE *fp;   // input file pointer
    FILE *ofp;  // output file pointer
    
    long byte_count = 0L;  
    long at_count = 0L;
    
    char fname[200]; //refers to file name
    char collected[200]; //refers to email address
    
    char curr_string[200]; //refers to current email address
    char prev_string[200]; //refers to previous email address
     
    char *hold[250000];  //  array of pointers to char, 250,000 maximum, ***holds email addresses***
    int mx = 0;   // index into memory 
    
    //function prototypes
    void get_emails();
    void copy_emails();
    size_t validate(char the_string[]);
    
    int main()
    {
    	get_emails(); // extract emails
    
    	copy_emails(); // copies emails to addresses.dat
      
    	return (0);
    }
    
    void get_emails() // this function extracts emails from "mail.dat"
    {
    	char disp_msg[500];		//to display message
    	char temp_str[200];		//temporary string 
    	int ch;					//character: to read from file
    	long fpos = 0L;			//position of byte/cursor in file
    	int idx;
    
    	char umsg[500];  // for debug purpose
    
    	strcpy(fname, "mail.dat");
    	
    	if ( (fp = fopen(fname, "rb")) == NULL) 
    	{
    	    printf("Can not open file mail.dat\n\n\n");
        	exit(1);
    	}
    
    	// counts number of emails (at_count) in file based on '@' and calculates file size (byte_count)
    	while ((ch = fgetc(fp)) != EOF) 
    	{
    		if (ch == '@') 
    			at_count++;   
    		
    		byte_count++;
    	}
      
    	fclose(fp);
    
    	sprintf(disp_msg, "\nFile name: %s\n", fname);
    	sprintf(temp_str, "\nFile Size: %ld bytes\n", byte_count);
    	strcat(disp_msg, temp_str);
    	sprintf(temp_str, "\nCounted %ld '@' signs in the file.\n", at_count);
    	strcat(disp_msg, temp_str);
    
    	printf("%s", disp_msg);  // never pass a built-up string as the format argument
    
    	if ( (fp = fopen(fname, "rb")) == NULL) 
    	{
    		cout<< "Can not open file mail.dat\n\n\n";
    		exit(1);
    	}
    
    	int valid = 0;
      
    	while ((ch = fgetc(fp)) != EOF && (fpos <= byte_count)) 
    	{
    		if (ch == '@') 
    		{
          		// '@' signs were already counted in the first pass, so at_count is not incremented again
        
          		fpos = ftell(fp) - 1L;
    
          		sprintf(umsg, "\n'@' sign found at byte position %ld", fpos);	    
          		printf("%s", umsg);
    
    	  		if (fpos >= 1L) fpos--;
          		
    			fseek(fp, fpos, SEEK_SET);  
          		ch = fgetc(fp);
    
          		printf("\nstart backing up ...");       
         	 	
    			while	(  
    						(ch >= 'a' && ch <= 'z') ||
    		       			(ch >= 'A' && ch <= 'Z') ||
    			   			(ch >= '0' && ch <= '9') ||
    			   			(ch == '_' || ch == '-'  || ch == '.') 
    					) 
    			{
        	    		if (fpos == 0) 
    					{
              				rewind(fp); 
    		  				break;
            			}
    					else 
    					{
    		  				fpos--;  
              				fseek(fp, fpos, SEEK_SET);  
              				ch = fgetc(fp);
    					}		
            			
    					if (ch == EOF) break;  // read error: stop backing up; the file is closed once, at the end
          		}            
    
         		idx = 0;
    
    	  		printf("\nFinished backing up...\n\n");
           
          		while (idx < (int) sizeof(collected) - 1 && (ch = fgetc(fp)) != EOF)  // leave room for '\0'
    			{  
    				//printf("\nstart collecting ... ");
       
         			valid = 0; 
            			
    				if (ch >= 'a' && ch <= 'z') valid = 1;
    				if (ch >= 'A' && ch <= 'Z') valid = 1;
    				if (ch >= '0' && ch <= '9') valid = 1;
    				if (ch == '_' || ch == '-') valid = 1;
    				if (ch == '@' || ch == '.') valid = 1;
    		
    				if (!valid) break;
         
            			collected[idx] = ch;      
            			idx++;
                }	        
          
    			collected[idx] = '\0';
          		
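    			if (mx >= 250000) break;  // hold[] is full; stop scanning the file

    			hold[mx] = (char *) malloc(strlen(collected) + 1);  // exact length plus terminating '\0'

    			if (hold[mx] == NULL) 
    			{
    				printf("Out of memory\n");
    				exit(1);
    			}
          		
    			strcpy(hold[mx], collected);

    			mx++;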
    
        	}  // end of outer if    	    
    	}
    
    	fclose(fp);
    }
    
    void copy_emails()
    {
    	int first = 1;
    	int found_dot = 0;
    	int x;
    	int z;
      	char s1[800];
    
    	if ( (ofp = fopen("addresses.dat", "w") ) == NULL) 
    	{
    		printf("Unable to open output file.");
    		exit(1);
    	}
    
    	if (mx == 0) 
    	{
    		sprintf(s1, "\n  File: %s", fname);	   
    		strcat(s1, "\n  No email addresses were found in the file.     ");
    		printf("%s", s1);
    	}
    
    	for (x = 0; x < mx; x++) 
    	{
    		if (first) // first address to display
    		{  
    			first = 0;  // turn off this flag
    			strcpy(curr_string, hold[x]);
    
    			if (curr_string[0] >= 'a' && curr_string[0] <= 'z' || curr_string[0] >= 'A' && curr_string[0] <= 'Z') 
    			{
    				validate(curr_string);
    				fputs(curr_string, ofp);
    				fputs("\n", ofp);
    				continue;
    			} 
      		}
    
    		if (!first && x >= 1 && x <= mx)  
    		{
    			strcpy(curr_string, (char *) hold[x]);
    			strcpy(prev_string, (char *) hold[x - 1]);
    
    			if (strcmp(curr_string, prev_string) == 0) // duplicate
    			{ 
    				continue; 
    			}
    			else 
    			{
    				found_dot = 0;
    				z = 0;
    				
    				while (curr_string[z]) // look for a dot in email address
    				{  
    					if (curr_string[z] == '.') found_dot = 1;
    					z++;			  
    				}
      
    				if (found_dot) // if there is a dot in address, display it
    				{  
    					if (curr_string[0] >= 'a' && curr_string[0] <= 'z' ||  curr_string[0] >= 'A' && curr_string[0] <= 'Z') 
    					{
    							validate(curr_string);
    
    		                  	fputs(curr_string, ofp);   // write to output file
                				fputs("\n", ofp);
    					}
    				} 
    			} 
    		}
    	}
    	fclose(ofp);   // close output file
    }
    
    size_t validate(char curr_string[])
    {
      // validate function makes sure that last char of "curr_string" 
      // ends in range of 'a' to 'z' (or 'A' to 'Z'.)
      // This function removes any trailing period, dash, etc.
      // Return value is length of validated string
      
      size_t x = strlen(curr_string); // size_t: unsigned integer, so never decrement it below zero
      
      while (x > 0) 
      {
        char last = curr_string[x - 1];

        if (last >= 'a' && last <= 'z') break;
        if (last >= 'A' && last <= 'Z') break;
    	
        curr_string[x - 1] = '\0'; // replace trailing char with the null terminator
        x--;
      }
      return x; // x is now the length of the validated string
    }
    
    
     
  2. shabbir

    shabbir Administrator Staff Member

    Joined:
    Jul 12, 2004
    Messages:
    15,375
    Likes Received:
    388
    Trophy Points:
    83
    You should not hard-code the array size but use a pointer instead, at least for char *hold[250000];, because you never know the size of the file you are parsing.
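
    One way to sketch that suggestion in C++ is to let a std::vector of std::string grow as addresses are found, so that no maximum count has to be chosen up front. This is an illustration only, not a drop-in change to the posted program; the name hold_address is hypothetical, and skipping an immediately repeated address is an assumption modelled on the duplicate check in copy_emails().

```cpp
#include <string>
#include <vector>

// Hypothetical helper: stores one address in a container that grows on demand.
// An address identical to the one stored just before it is skipped.
void hold_address(std::vector<std::string> &held, const std::string &address)
{
    if (!held.empty() && held.back() == address)
        return;                    // consecutive duplicate: nothing to store

    held.push_back(address);       // the vector reallocates itself as needed
}
```

    The vector also releases its memory automatically, so there is no matching free() call to remember.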
     
  3. Sanskruti

    Sanskruti New Member

    Yes, that's true. But in this case, I was told that the file size would not exceed the limit set in the program.
     
  4. Sanskruti

    Sanskruti New Member

    With some effort, this program can definitely be generalized. The same result can also be achieved with regular expressions or with plain loops.
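
    As a rough sketch of the regular-expression route, assuming a compiler that supports the C++11 <regex> header: the pattern below is my own approximation of the rules described in the first post (start with a letter, allow letters, digits, '_', '.', '-' and '@', end with a letter) and is not a full email validator. The name extract_emails is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every substring of text that matches the
// approximate address pattern, in the order in which it appears.
std::vector<std::string> extract_emails(const std::string &text)
{
    // letter, then allowed characters, '@', allowed characters, final letter
    static const std::regex pattern(
        "[A-Za-z][A-Za-z0-9_.-]*@[A-Za-z0-9_.-]*[A-Za-z]");

    std::vector<std::string> found;

    for (std::sregex_iterator it(text.begin(), text.end(), pattern), end;
         it != end; ++it)
        found.push_back(it->str());

    return found;
}
```

    Because the pattern must end with a letter, a trailing period after an address at the end of a sentence is left out of the match, which is similar in effect to what validate() does in the posted program.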
     
    Last edited: Jul 13, 2007
  5. shabbir

    shabbir Administrator Staff Member

    Even if you are given the file size, it's always good to have the most general solution, because clients don't like a program crashing on different test cases.
     
  6. Sanskruti

    Sanskruti New Member

    I understand and I agree.
     
  7. parvez.yu

    parvez.yu New Member

    Joined:
    Feb 14, 2008
    Messages:
    100
    Likes Received:
    0
    Trophy Points:
    0
    I feel we should always use char * instead of arrays.
     
