Need help with a script for a project

Discussion in 'Perl' started by perlproject, Jul 5, 2012.

  1. perlproject

    I have to write a Perl script for a project I am working on and I am stuck, so I would appreciate any feedback. I am working with an epistolary novel (a novel made up of letters written by different characters), and I need to separate the letters by character so that I end up with one file per character, each containing all of that character's letters.

    To start, I want to extract all of the letters written by a character named Anna Howe. Her letters begin with either Jan., Febr., or March followed by a date, and they end with <p class="left">ANNA HOWE, usually with something between the <p class="left"> and her name, such as "yours truly" or "your sister".

    So I am trying to match the beginning and the end of each letter with regular expressions, and then have the script write everything between those two patterns to a file named "Anna". This is the script that I wrote, but it doesn't work:
    Code:
    #!/usr/bin/perl -w
    #Attempting to extract Anna's letters from Clarissa Text.
    
    $input = "Clarissa_CleanText_Vol1";
    $output = "Anna";
    
    open(INPUT, $input) || die("couldn't open $input");
    open(OUTPUT, ">$output");
    
    
    @curr_letter_lines = (); # array that holds the lines of the current letter
    
    $ref_to_curr_lines_arr = \@curr_letter_lines;
    
    while(defined($inline = <INPUT>)){
        if(begin_letter($inline) eq "yes"){
            @curr_letter_lines = ($inline);
        }elsif(end_letter($inline) eq "yes"){
            $ref_to_curr_lines_arr = \@curr_letter_lines;
            print_curr_letter_lines(\@curr_letter_lines, $output);
            @curr_letter_lines = ();
        }else{
            push(@curr_letter_lines, $inline);
        }
    }
    
    sub print_curr_letter_lines{
    
        my($ref_to_curr_letter_lines_arr, $output) = @_;
        foreach $line(@{$ref_to_curr_letter_lines_arr}) {
            print(OUTPUT $line);  
        }
    }
    
    
    sub begin_letter {
    
        my ($all_letters) = @_;
    
        my ($want_begin_anna) = "no";
    
        if ($all_letters =~ /^Jan\.|Febr\.|March \d{2}\./) {
    
            $want_begin_anna = "yes";
        }
        return $want_begin_anna;
    }
    
    
    sub end_letter {
    
        my ($all_letters) = @_;
    
        my ($want_end_anna)    = "no";
    
        if ($all_letters =~ /<p class="left">.*?ANNA HOWE\./) {
    
            $want_end_anna = "yes"
    
        }
        return $want_end_anna;
    }
     
    Last edited by a moderator: Jul 6, 2012
  2. awatson

    One approach would be to read the whole text into a single string and split it on her signature (ANNA HOWE). That breaks the text into chunks, each with one of her letters at the end. Then process each chunk: look for the date string that begins the letter and grab everything from there to the end of the chunk (using the 's' modifier so that '.' also matches newlines).
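
    A rough sketch of that approach follows, in case it helps. It is untested against the real Clarissa text and rests on a few assumptions: the signature sits on a single line that starts the <p class="left"> tag, the date is a one- or two-digit day number, and the last date line before a signature is the start of Anna's letter. The sub name extract_letters and the command-line handling are made up for illustration. One thing that is certain: in the original pattern /^Jan\.|Febr\.|March \d{2}\./ the alternation is not grouped, so it reads as "^Jan\." or "Febr\." or "March \d{2}\.", and any line containing "Febr." anywhere counts as the start of a letter. The sketch groups the month names with (?: ... ).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return Anna's letters found in $text, one string per letter.
# Assumptions (not confirmed against the real file): the signature is on
# one line, and a letter starts on a line beginning with "Jan.", "Febr."
# or "March" followed by a day number.
sub extract_letters {
    my ($text) = @_;

    # The whole signature line, including its trailing newline if present.
    my $signature = qr/<p class="left">[^\n]*ANNA HOWE[^\n]*\n?/;

    # (?: ... ) groups the month names so that ^ and the day apply to all.
    my $date_line = qr/^(?:Jan\.|Febr\.|March)[ \t]+\d{1,2}\b/m;

    # The capturing parentheses make split return the signatures too, so
    # the list alternates: chunk, signature, chunk, signature, ..., tail.
    my @parts = split /($signature)/, $text;

    my @letters;
    while (@parts >= 2) {
        my ($chunk, $sig) = splice(@parts, 0, 2);

        # With /s the greedy .* runs over newlines and then backtracks,
        # so the capture starts at the LAST date line in the chunk.
        if ($chunk =~ /.*(${date_line}.*)\z/s) {
            push @letters, $1 . $sig;
        }
    }
    # Whatever is left in @parts follows the final signature; ignore it.
    return @letters;
}

# Only touch the filesystem when a file name is given on the command line.
if (@ARGV) {
    my ($input, $output) = @ARGV;
    $output = 'Anna' unless defined $output;

    open(my $in, '<', $input) or die "couldn't open $input: $!";
    my $text = do { local $/; <$in> };    # slurp the whole file
    close($in);

    open(my $out, '>', $output) or die "couldn't open $output: $!";
    print {$out} "$_\n" for extract_letters($text);
    close($out) or die "couldn't close $output: $!";
}
```

    Saved under some name such as extract_anna.pl (a placeholder), it would be run as: perl extract_anna.pl Clarissa_CleanText_Vol1 Anna. If other characters' letters also open with a date line, taking the last date line before the signature keeps them out of the output, but a line inside Anna's own letter that happens to start with a month and a day would cut the letter short, so the result is worth checking by eye.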
     
