Recently, I developed a Win32 application for one of my clients to capture email addresses from a file. Though very simple and straightforward, it was an interesting project, so I thought I should share it with you.
Program Description
This program extracts email addresses from an input file. It reads a file called "mail.dat" (the input file) and writes every email address it finds, one per line, to a file called "addresses.dat". The input is read one byte at a time, so you can also use 'txt' files and some other types of files by changing the file name in the source code.
Please note that this program considers a valid email address to be a string that contains ONLY letters, numbers, dashes, underscores, the '@' sign, and the dot ('.'). The program also expects an email address to start with a letter, and it trims trailing characters so that every address it writes ends with a letter.
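The character rules above can be written as two small predicates. This is only a minimal sketch: the names is_address_char and is_candidate_address are made up for illustration and do not appear in ExtractEmails.cpp, which performs the same comparisons inline.

```cpp
// Sketch of the character rules described above; the function names are illustrative.

// Returns true if c may appear anywhere inside an address under these rules.
bool is_address_char(int c)
{
    return (c >= 'a' && c <= 'z') ||
           (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') ||
           c == '_' || c == '-' || c == '.' || c == '@';
}

// Returns true if s starts with a letter and contains only allowed characters.
// An empty string is rejected because its first character is not a letter.
bool is_candidate_address(const char *s)
{
    if (!((s[0] >= 'a' && s[0] <= 'z') || (s[0] >= 'A' && s[0] <= 'Z')))
        return false;
    for (const char *p = s; *p != '\0'; ++p)
        if (!is_address_char(*p))
            return false;
    return true;
}
```

Note that this sketch checks characters only. It does not verify that the string actually contains an '@' sign or a dot, so it is a filter for candidates rather than a full validator.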
Important
The program opens "mail.dat" by name only, without a path. When you build and run the project, "mail.dat" must be in the same directory as your 'cpp' file. However, you need not create "addresses.dat", as it will be created automatically.
When you run the application file "ExtractEmails" directly, keep "mail.dat" in the same directory as the application file. Click on "ExtractEmails" and the emails will be extracted from "mail.dat" to "addresses.dat". After the project compiles successfully, you can find the executable file in the 'debug' folder under your project.
I wrote this program so that it should work in all of the following environments: VC++ 5.0, VC++ 6.0, Visual Studio 2003, VC++ Express Edition, VC++ 2005 Professional edition and so on, as I wanted to run it on all MS-Windows versions such as Windows 98, Windows ME, Windows 2000, Windows XP, Windows Vista and others. I have already checked it in these environments and it works fine. It should compile without errors or warnings in each of these compilers with little or no modification.
I have also put comments in the source code wherever I felt they were necessary for understanding the program. However, too many comments make a program unreadable, so I have kept them to a minimum.
Here is a brief description of some of the important functions used in the program. I am sure those who know C and C++ are already aware of them.
'malloc' function: a brief description
The 'malloc' function allocates a memory block of at least the requested number of bytes and returns a pointer to it, or NULL if the memory cannot be allocated. The block may be larger than the requested size because of space required for alignment and maintenance information.
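As an illustration of how 'malloc' is typically used to store a string, here is a minimal sketch. The helper name copy_string is an assumption made for this example and is not part of the program; it requests exactly the length of the string plus one byte for the terminating null and checks the result for NULL.

```cpp
#include <stdlib.h>
#include <string.h>

// Illustrative helper (not part of ExtractEmails.cpp).
// Returns a heap copy of text, or NULL if malloc fails. The caller must free() it.
char *copy_string(const char *text)
{
    size_t size = strlen(text) + 1;      // +1 for the terminating '\0'
    char *copy = (char *) malloc(size);  // the cast is required in C++
    if (copy == NULL)
        return NULL;                     // allocation failed
    memcpy(copy, text, size);            // copies the characters and the '\0'
    return copy;
}
```

A caller would write something like char *p = copy_string("user@example.com"); and later release the block with free(p).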
File I/O functions used in program: a brief description
The 'fseek' function moves the file pointer (if any) associated with a stream to a new location that is 'offset' bytes from 'origin'.
The 'ftell' function retrieves the current position of the file pointer (if any) associated with a stream.
The 'fgetc' function returns the character read as an int or returns EOF to indicate an error or end of file.
The 'fputs' function writes a string to a stream. It copies a string to the output stream at the current position.
The 'rewind' function repositions the file pointer associated with stream to the beginning of the file.
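The following sketch shows how these functions can work together. The helper names file_size and count_char are hypothetical and chosen only for this example. Note that the program itself measures the file by counting bytes with 'fgetc'; seeking to the end and calling 'ftell' is a common alternative that I am assuming behaves as expected for files opened in binary mode on your platform.

```cpp
#include <stdio.h>

// Illustrative helper: returns the size of the file in bytes, or -1L on error.
// Leaves the stream positioned at the beginning of the file.
long file_size(FILE *stream)
{
    if (fseek(stream, 0L, SEEK_END) != 0)  // move to the end of the file
        return -1L;
    long size = ftell(stream);             // position at the end == size in bytes
    rewind(stream);                        // back to the beginning
    return size;
}

// Illustrative helper: counts how often target occurs from the current position on.
long count_char(FILE *stream, int target)
{
    long count = 0L;
    int ch;                                // int, not char, so EOF can be told apart
    while ((ch = fgetc(stream)) != EOF)
        if (ch == target)
            count++;
    return count;
}
```

For a stream opened with fopen("mail.dat", "rb"), calling file_size and then count_char(fp, '@') would report the two figures that the program prints before it starts extracting.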
Prerequisites
In order to complete the example application, you should be familiar with the following:
C, C++, VC++
Source file: ExtractEmails.cpp
Code: cpp
// Purpose: collect email addresses from file "mail.dat" and copy them into "addresses.dat"
//
// Pseudo-code:
//   open file
//   find '@' sign
//   "back up" to start of email address
//   place email address in memory
//   repeat steps above until end of file is reached
#define _CRT_SECURE_NO_DEPRECATE 1  // this is to suppress deprecation warnings
#define _CRT_NONSTDC_NO_DEPRECATE 1 // this is to suppress deprecation warnings
#include <stdio.h>
#include <stdlib.h> // malloc, exit
#include <string.h> // strcpy, strcat, strcmp, strlen
#include <iostream>
using namespace std;
FILE *fp;  // input file pointer
FILE *ofp; // output file pointer
long byte_count = 0L; // size of the input file in bytes
long at_count = 0L;   // number of '@' signs in the input file
char fname[200];       // refers to file name
char collected[200];   // refers to email address being collected
char curr_string[200]; // refers to current email address
char prev_string[200]; // refers to previous email address
char *hold[250000];    // array of pointers to char, 250,000 maximum, ***holds email addresses***
int mx = 0;            // number of addresses stored in hold[]
// function prototypes
void get_emails();
void copy_emails();
size_t validate(char the_string[]);
int main()
{
    get_emails();  // extract emails
    copy_emails(); // copy emails to addresses.dat
    return (0);
}
void get_emails() // this function extracts emails from "mail.dat"
{
    char disp_msg[500]; // to display message
    char temp_str[200]; // temporary string
    int ch;             // character: to read from file
    long fpos = 0L;     // position of byte/cursor in file
    int idx;            // index into collected[]
    int valid;          // 1 if ch may be part of an email address
    strcpy(fname, "mail.dat");
    if ( (fp = fopen(fname, "rb")) == NULL)
    {
        printf("Cannot open file mail.dat\n\n\n");
        exit(1);
    }
    // first pass: counts '@' signs (at_count) and calculates file size (byte_count)
    while ((ch = fgetc(fp)) != EOF)
    {
        if (ch == '@')
            at_count++;
        byte_count++;
    }
    fclose(fp);
    sprintf(disp_msg, "\nFile name: %s\n", fname);
    sprintf(temp_str, "\nFile Size: %ld bytes\n", byte_count);
    strcat(disp_msg, temp_str);
    sprintf(temp_str, "\nCounted %ld '@' signs in the file.\n", at_count);
    strcat(disp_msg, temp_str);
    printf("%s", disp_msg); // pass the message as data, not as the format string
    if ( (fp = fopen(fname, "rb")) == NULL)
    {
        cout << "Cannot open file mail.dat\n\n\n";
        exit(1);
    }
    // second pass: find each '@', back up to the start of the address, then collect it
    while ((ch = fgetc(fp)) != EOF && (fpos <= byte_count))
    {
        if (ch == '@')
        {
            fpos = ftell(fp) - 1L; // position of the '@' itself
            printf("\n'@' sign found at byte position %ld", fpos);
            if (fpos >= 1L) fpos--;
            fseek(fp, fpos, SEEK_SET);
            ch = fgetc(fp);
            printf("\nstart backing up ...");
            while (
                (ch >= 'a' && ch <= 'z') ||
                (ch >= 'A' && ch <= 'Z') ||
                (ch >= '0' && ch <= '9') ||
                (ch == '_' || ch == '-' || ch == '.')
            )
            {
                if (fpos == 0)
                {
                    rewind(fp); // the address starts at the first byte of the file
                    break;
                }
                fpos--;
                fseek(fp, fpos, SEEK_SET);
                ch = fgetc(fp);
            }
            idx = 0;
            printf("\nFinished backing up...\n\n");
            // read forward until a character that cannot be part of an address
            while ( (ch = fgetc(fp)) != EOF)
            {
                valid = 0;
                if (ch >= 'a' && ch <= 'z') valid = 1;
                if (ch >= 'A' && ch <= 'Z') valid = 1;
                if (ch >= '0' && ch <= '9') valid = 1;
                if (ch == '_' || ch == '-') valid = 1;
                if (ch == '@' || ch == '.') valid = 1;
                if (!valid) break;
                // store only what fits, but keep reading to the end of the run
                if (idx < (int) sizeof(collected) - 1)
                {
                    collected[idx] = (char) ch;
                    idx++;
                }
            }
            collected[idx] = '\0';
            if (mx < (int) (sizeof(hold) / sizeof(hold[0]))) // do not overrun hold[]
            {
                hold[mx] = (char *) malloc(strlen(collected) + 1);
                if (hold[mx] == NULL)
                {
                    printf("Out of memory.\n");
                    exit(1);
                }
                strcpy(hold[mx], collected);
                mx++;
            }
        } // end of outer if
    }
    fclose(fp);
}
void copy_emails() // this function writes the collected emails to "addresses.dat"
{
    int first = 1;
    int found_dot = 0;
    int x;
    int z;
    char s1[800];
    if ( (ofp = fopen("addresses.dat", "w") ) == NULL)
    {
        printf("Unable to open output file.");
        exit(1);
    }
    if (mx == 0)
    {
        sprintf(s1, "\n File: %s", fname);
        strcat(s1, "\n No email addresses were found in the file. ");
        printf("%s", s1);
    }
    for (x = 0; x < mx; x++)
    {
        if (first) // first address: written without the duplicate and dot checks
        {
            first = 0; // turn off this flag
            strcpy(curr_string, hold[x]);
            if ((curr_string[0] >= 'a' && curr_string[0] <= 'z') || (curr_string[0] >= 'A' && curr_string[0] <= 'Z'))
            {
                validate(curr_string);
                fputs(curr_string, ofp);
                fputs("\n", ofp);
            }
            continue;
        }
        strcpy(curr_string, hold[x]);
        strcpy(prev_string, hold[x - 1]);
        if (strcmp(curr_string, prev_string) == 0) // same as the previous address: duplicate
        {
            continue;
        }
        found_dot = 0;
        z = 0;
        while (curr_string[z]) // look for a dot in email address
        {
            if (curr_string[z] == '.') found_dot = 1;
            z++;
        }
        if (found_dot) // if there is a dot in address, write it
        {
            if ((curr_string[0] >= 'a' && curr_string[0] <= 'z') || (curr_string[0] >= 'A' && curr_string[0] <= 'Z'))
            {
                validate(curr_string);
                fputs(curr_string, ofp); // write to output file
                fputs("\n", ofp);
            }
        }
    }
    fclose(ofp); // close output file
}
size_t validate(char curr_string[])
{
    // validate function makes sure that the last char of "curr_string"
    // is in the range 'a' to 'z' (or 'A' to 'Z').
    // This function removes any trailing period, dash, digit, etc.
    // Return value is length of validated string
    size_t x = strlen(curr_string); // size_t : unsigned integer
    while (x > 0) // stop at the start of the string instead of wrapping around
    {
        if (curr_string[x - 1] >= 'a' && curr_string[x - 1] <= 'z') break;
        if (curr_string[x - 1] >= 'A' && curr_string[x - 1] <= 'Z') break;
        curr_string[x - 1] = '\0'; // replace trailing char with the terminating null
        x--;
    }
    return x;
}
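To make the "find '@', back up, collect" idea easier to follow without the file handling, here is a minimal sketch that applies the same steps to a string in memory. The names is_local_char and extract_first are assumptions made for this illustration and are not part of ExtractEmails.cpp; the sketch handles only the first '@' in the text.

```cpp
#include <string.h>

// Characters accepted while backing up from the '@' sign (illustrative helper).
static bool is_local_char(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') || c == '_' || c == '-' || c == '.';
}

// Finds the first '@' in text, backs up to the start of the address, then copies
// characters forward into out (capacity out_size) until one is not allowed.
// Returns true if an '@' was found.
bool extract_first(const char *text, char *out, size_t out_size)
{
    const char *at = strchr(text, '@');
    if (at == NULL || out_size == 0)
        return false;
    const char *start = at;
    while (start > text && is_local_char(start[-1]))  // "back up"
        --start;
    size_t n = 0;
    for (const char *p = start; *p != '\0' && (is_local_char(*p) || *p == '@'); ++p)
        if (n + 1 < out_size)                         // keep room for '\0'
            out[n++] = *p;
    out[n] = '\0';
    return true;
}
```

Given the text "Contact: john.doe@example.com, thanks" and a 200-byte buffer, extract_first leaves "john.doe@example.com" in the buffer. The result would still need the start-with-a-letter check and the trailing-character trimming that the program applies before writing an address.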


