Need help with a script for a project

perlproject (Newbie Member, Join Date: Jul 2012)
I have to write a Perl script for a project I am working on and I am kind of stuck, so I would appreciate any feedback. I am working with an epistolary novel (a novel made up of letters written by different characters), and I am supposed to separate the letters by character, so that I end up with one file per character containing all of that character's letters.

To start, I want to extract all of the letters written by a character named Anna Howe. I noticed that her letters begin with either Jan., Febr., or March followed by a date, and that they end with <p class="left">ANNA HOWE. Usually there is something between the <p class="left"> and her name, such as "yours truly" or "your sister". So I am trying to match the beginning and the end of each letter using regular expressions, and then have the script write everything between those two patterns to a file named "Anna". This is the script that I wrote, but it doesn't work:
#!/usr/bin/perl -w
#Attempting to extract Anna's letters from Clarissa Text.

$input = "Clarissa_CleanText_Vol1";
$output = "Anna";

open(INPUT, $input) || die("couldn't open $input");
open(OUTPUT, ">$output");

@curr_letter_lines = (); # array that holds the lines of the current letter

$ref_to_curr_lines_arr = \@curr_letter_lines;

while(defined($inline = <INPUT>)){
    if(begin_letter($inline) eq "yes"){
        @curr_letter_lines = ($inline);
    }elsif(end_letter($inline) eq "yes"){
        $ref_to_curr_lines_arr = \@curr_letter_lines;
        print_curr_letter_lines(\@curr_letter_lines, $output);
        @curr_letter_lines = ();
    }else{
        push(@curr_letter_lines, $inline);
    }
}

sub print_curr_letter_lines{
    my($ref_to_curr_letter_lines_arr, $output) = @_;
    foreach $line(@{$ref_to_curr_letter_lines_arr}) {
        print(OUTPUT $line);
    }
}

sub begin_letter {
    my ($all_letters) = @_;
    my ($want_begin_anna) = "no";
    if ($all_letters =~ /^Jan\.|Febr\.|March \d{2}\./) {
        $want_begin_anna = "yes";
    }
    return $want_begin_anna;
}

sub end_letter {
    my ($all_letters) = @_;
    my ($want_end_anna) = "no";
    if ($all_letters =~ /<p class="left">.*?ANNA HOWE\./) {
        $want_end_anna = "yes";
    }
    return $want_end_anna;
}

Last edited by shabbir; 6 Jul 2012 at 08:29. Reason: Code blocks
awatson (Newbie Member, Join Date: Feb 2008)
One approach would be to split the string that holds all of the letters on her signature (ANNA HOWE). That breaks it into chunks, each with one of her letters at the end. Then process each chunk: look for the date string that begins her letter and grab everything from there to the end of the chunk, using the 's' modifier so that '.' also matches newlines.
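
A minimal sketch of that split approach might look like the following. It rests on several assumptions that are not confirmed by the thread: the signature sits on a single line, the date header starts a line, and Anna's letter is the last thing in each chunk before her signature. The helper name `extract_anna_letters` and the two pattern variables are made up for illustration. One thing worth noting along the way: in the original pattern `/^Jan\.|Febr\.|March \d{2}\./`, alternation has the lowest precedence, so `^` applies only to `Jan\.` and `\d{2}\.` only to `March`. Grouping the month names, as below, avoids that.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed patterns, based on the description in the question; adjust them
# to the real markup. The month names are grouped so that ^ and the day
# number apply to every alternative.
my $DATE_RE = qr/^(?:Jan\.|Febr\.|March)\s+\d{1,2}\b/m;
my $SIGN_RE = qr/<p class="left">[^\n]*?ANNA HOWE\.?[^\n]*\n?/;

# Hypothetical helper: returns the list of Anna's letters found in $text.
sub extract_anna_letters {
    my ($text) = @_;
    my @letters;

    # The capturing group makes split() return the signatures too, so the
    # list alternates: chunk, signature, chunk, signature, ..., leftover.
    my @parts = split /($SIGN_RE)/, $text;

    while (@parts >= 2) {
        my ($chunk, $signature) = splice(@parts, 0, 2);

        # With /s the leading greedy .* runs to the end of the chunk and
        # backtracks, so $1 starts at the LAST date header in the chunk.
        if ($chunk =~ /\A.*($DATE_RE.*)\z/s) {
            push @letters, $1 . $signature;
        }
    }
    return @letters;
}

# Usage: perl extract_anna.pl Clarissa_CleanText_Vol1 Anna
if (@ARGV == 2) {
    my ($input, $output) = @ARGV;

    open(my $in, '<', $input) or die "couldn't open $input: $!";
    my $text = do { local $/; <$in> };    # slurp the whole file at once
    close $in;

    open(my $out, '>', $output) or die "couldn't open $output: $!";
    print {$out} "$_\n" for extract_anna_letters($text);
    close $out;
}
```

Reading the whole file into one string is what makes the split possible; the line-by-line loop in the question never sees more than one line at a time. If other characters' letters also open with the same month names, this sketch still picks the right start, but only as long as the last date header before each ANNA HOWE signature really belongs to her letter.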