Sunday, March 15, 2015

From Natural Language to Calendar Entries with Clojure - Part 2

In the first installment of this series I created an NLP process to extract calendar event attributes from plain text. The goal was to identify the participants, date and time, as well as duration from the request. I leveraged the models that have been trained and made available by the Apache Foundation for the OpenNLP library, which I used via the clojure-opennlp library. More specifically, the English name and Date/Time models were used. However, phrases specifying duration (e.g. "for 2 hours", "for 30 minutes") were parsed "manually" since a model does not exist for this particular purpose. 

In part 2 of this series I will improve on the entity extraction capabilities by training a model to detect and extract Duration phrases.  To accomplish this I will need a large body of annotated or tagged sentences to use as inputs into the training algorithm. The clojure-opennlp library has a succinct description of this process and of the tagging syntax. 

For the impatient: In this installment I create an OpenNLP model to detect phrases indicating the duration of a meeting or appointment (a calendar event), such as "for 1 hour" or "for 45 minutes". I then incorporate that into existing code from the first installment to detect and parse the Participants, DateTime, and Duration of a meeting request written in plain text English.  The code can be seen here.

In the first section, I will describe the Clojure code to generate training sentences. These sentences will follow a pattern made of fragments in a pre-defined sequence. Each fragment will have a fragment generator capable of producing variations, sometimes pseudo-randomly, sometimes based on a pre-defined set of values.

In the second section, I will go over the model-training code and the results obtained, highlighting interesting and unexpected observations and describing the limitations caused by the training data. In the third section, I will present the results of cross-validating the model against a set of automatically generated sentences. Lastly, in the fourth section, I will describe the integration of the trained Duration model into the original code. 

1. Generating Training Sentences

Training sentences will have the structure or grammar I expect for this problem domain. For example: "Please schedule a meeting with Andrew Smith and Joseph Cannon on January 4 2016 at 2:30PM for 1 hour to discuss the sale of the Brooklyn Bridge." This type of sentence can be generally described by the following fragments:

Request        = "Please schedule a meeting"
Participants = "with Andrew Smith and Joseph Cannon"
DateTime      = "on January 4 2016 at 2:30PM"
Duration       = "for 1 hour"
Subject         = "to discuss the sale of the Brooklyn Bridge"

As described here, the entity being extracted needs to be surrounded with <START:duration> and <END> SGML tags. This indicates to the training algorithm the fragment of text that should be identified as an entity. Thus, the Duration fragment will be created as:

Duration      = "for <START:duration> 1 hour <END>"

The end result, if we continue with the same example sentence from above, is as follows:

"Please schedule a meeting with Andrew Smith and Joseph Cannon on January 4 2016 at 2:30PM for <START:duration> 1 hour <END> to discuss the sale of the Brooklyn Bridge."

To satisfy the requirements of the training algorithm we need to generate at least 15,000 sample sentences, each built with our grammar.  

I structured the code in terms of fragment generators, each generating a properly formatted phrase using a pseudo-random process. A sentence generator arranges them in the proper sequence. But enough preamble, let's take a look at each of the generators.

Full Name Sub-Generator: First and last names are randomly selected from lists loaded from disk. The lists of available first and last names were created from this site and stored in individual files under the "./resources" folder off the root of the project. The content of each is read into memory (these are small files, on the order of 150 names each) as Clojure collections. These are passed into the (generate-fullname) function, which randomly chooses a combination and joins them into a single string. This sub-generator forms the core of the Participants generator described below.

(defn- generate-fullname
  [last-names first-names]
  (cljstr/join " " [(rand-nth first-names) (rand-nth last-names)]))
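The (read-file) helper used to load these lists is not shown in this post. A minimal sketch of what it might look like, assuming one name per line in each resource file (the implementation here is my guess, not the original):

```clojure
(require '[clojure.java.io :as io])

;; Hypothetical sketch of the read-file helper (not from the original
;; project): reads a text file and returns its lines as a vector.
(defn read-file
  [filename]
  (with-open [rdr (io/reader filename)]
    (vec (line-seq rdr))))
```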

Participants Generator: I chose a range of 1 to 3 participants (plus the sender, who is implied); the participant count is picked at random from that range. Clojure's (repeatedly) is used to generate a lazy sequence of full names with the (generate-fullname) function. Depending on the generated count, the syntax of the fragment differs. For example, two participants will be generated as "Participant1 and Participant2", whereas three would be "Participant1, Participant2, and Participant3". 

(defn- generate-participants-clause
  [a-fn a-ln]
  ;; arbitrarily set to 1-3 participants
  (let [participant-count (+ (rand-int 3) 1)
        seq-p (take participant-count
                 (repeatedly #(generate-fullname a-fn a-ln)))
        participants (case participant-count
                       3 (str
                            (first seq-p) ", "
                            (nth seq-p 1) ", and "
                            (nth seq-p 2))
                       2 (str
                            (first seq-p) " and "
                            (second seq-p))
                       1 (first seq-p))]
    (str "with " participants)))
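The (case) above handles each participant count explicitly. The same English list style can be expressed for any count; here is a small self-contained sketch (join-names is my name for it, and it does not appear in the project):

```clojure
(require '[clojure.string :as cstr])

;; Joins names in English list style: "A", "A and B", "A, B, and C".
(defn join-names
  [names]
  (case (count names)
    1 (first names)
    2 (str (first names) " and " (second names))
    ;; default branch: three or more names, Oxford comma included
    (str (cstr/join ", " (butlast names)) ", and " (last names))))

(join-names ["Ada Lovelace" "Alan Turing"])
;; => "Ada Lovelace and Alan Turing"
```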

Date/Time Generator: This is a bit more complex. Basically, a random date is generated with today's date (at noon) as the starting point. Random days, hours, and minutes are added. The range of random values generated for each of these was semi-arbitrarily chosen.

(defn- generate-datetime
  []
  (let [d (clj-time.core/plus
              (clj-time.core/today-at 12 00)
              (clj-time.core/days (rand-int 365))
              (clj-time.core/hours (rand-int 23))
              (clj-time.core/minutes (rand-nth [10 15 30 45 60])))]
     (ttf/unparse custom-formatter d)))

A custom date formatter was used to emit the format we expect. This format appears to have a high probability of being extracted using the date-time model provided by the OpenNLP library.

(def custom-formatter     
       (ttf/formatter "MMMM d yyyy 'at' hh:mma"))
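With a fixed date, the formatter's output is easy to sanity-check at the REPL. This snippet assumes clj-time is on the classpath; the date itself is illustrative:

```clojure
(require '[clj-time.core :as t]
         '[clj-time.format :as ttf])

(def custom-formatter
  (ttf/formatter "MMMM d yyyy 'at' hh:mma"))

;; 2016-01-04 14:30 rendered with the 12-hour pattern:
(ttf/unparse custom-formatter (t/date-time 2016 1 4 14 30))
;; => "January 4 2016 at 02:30PM"
```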

Duration Generator: I made the assumption that durations expressed in minutes will likely have the values 15, 30, or 45 minutes. All other requests would be in hours. The duration is randomly picked from an array of likely durations. If the selected number is less than 15, the phrase is expressed in hours; thus, the numbers 1..10 will generate "N hours" (or "1 hour"). If the selected number is 15, 30, or 45, then the phrase is "N minutes". Note that I did not use abbreviations; for now I expect these two words to be written out in full.

(defn- generate-duration
  "Generates the possible or likely specifications for the
   duration of a calendar event."
  []
  (let [atime [1 2 3 4 5 6 7 8 9 10 15 30 45]
        time  (rand-nth atime)
        dim   (cond
                (== time 1)  "hour"
                (<= time 10) "hours"
                (> time 10)  "minutes")]
    (str time " " dim)))

The Subject generator is trivial and self-explanatory. I will not go into any details; suffice it to say that it is very similar to the first- or last-name generator.

(defn- generate-subject
  [subject-clauses]
  (rand-nth subject-clauses))

The subject-clauses parameter is an array loaded with a pre-defined set of subjects stored on disk. These are the values used:

to discuss world peace
to have a brief discussion
to chat about the case
to fine tune the points
to brainstorm about the main topic
to brainstorm about the contract
to discuss the contract
to discuss our hiring process
to talk about our hiring process
to discuss our testing strategy
to brainstorm about our testing strategy
to discuss our QA approach
to brainstorm about our QA strategy
to discuss economic trends
to review the summary statistics
to fine tune our pitch

Each of these generators can easily be invoked from the REPL. The following session illustrates their respective outputs. Notice that I first move into the namespace where all the Duration-model training code resides. The files containing the last and first names are loaded and then passed into multiple invocations of the (generate-fullname) function. Sample executions of (generate-duration) and (generate-datetime) are also illustrated.

nlp.core=> (ns
nil
> (def last-names (read-file "resources/last-names"))
#'
> (def first-names (read-file "resources/first-names"))
#'
> (generate-fullname last-names first-names)
"Jan Paterson"
> (generate-fullname last-names first-names)
"Robert Poole"
> (generate-fullname last-names first-names)
"Sophie White"
> (generate-participants-clause first-names last-names)
"with Rees Dylan, Smith Harry, and Reid Isaac"
> (generate-participants-clause first-names last-names)
"with Taylor Leonard and Hardacre Phil"
> (generate-participants-clause first-names last-names)
"with Gibson Christian and Forsyth Stephen"
> (generate-participants-clause first-names last-names)
"with Russell Gabrielle, Gill Faith, and Lawrence Megan"
> (generate-participants-clause first-names last-names)
"with Gill Alexander"
> (generate-duration)
"4 hours"
> (generate-duration)
"10 hours"
> (generate-duration)
"7 hours"
> (generate-datetime)
"January 24 2016 at 01:10AM"
> (generate-datetime)
"September 26 2015 at 01:30AM"
> (generate-datetime)
"September 20 2015 at 10:10PM"
> (generate-datetime)
"October 3 2015 at 02:10AM"

With these individual generators at hand, my next step is to define the Sentence generator. This is composed of two functions: one that generates a single sentence given all the necessary inputs, and another that takes N sentences from a lazy sequence and stores them to disk.

(defn- generate-sentence
  "Generates a single sentence with the following specification.

   Where each of the clauses is generated randomly from a possible
   set of pre-defined values. An example:

      [Please schedule a meeting]
      [with Adam Smith and Sonya Smith]
      [on January 2016 at 1:30pm]
      [for 1 hour] [to discuss x, y and z].

     is-training: whether the sentence is being generated for training
                  or not (cross-validation)
     afn:         array of first names
     aln:         array of last names
     areq:        array of requests
     asub:        array of subjects"
  [is-training afn aln areq asub]
  (str (generate-request-clause areq)
       " "
       (generate-participants-clause aln afn)
       " on "
       (generate-datetime)
       " for "
       (if is-training " <START:duration> " " ")
       (generate-duration)
       (if is-training " <END> " " ")
       (generate-subject asub)))

(defn- generate-sentences
  "Generates and writes [cnt] training sentences to [filename],
   one per line."
  [cnt filename]
  ;; coerce the filename parameter to a string to avoid it being
  ;; mistaken for an input stream.
  (let [file_name (str filename)
        last-names (read-file last-name-file)
        first-names (read-file first-name-file)
        request-clauses (read-file request-clause-file)
        subject-clauses (read-file subject-clause-file)]
    (with-open [wrt (io/writer file_name)]
      (doseq [sentence (take cnt
                             (repeatedly
                               #(generate-sentence true
                                                   first-names
                                                   last-names
                                                   request-clauses
                                                   subject-clauses)))]
        (.write wrt (str sentence "\n")))))) ;; write line to file

The (generate-sentence) function is parameterized to generate either a training or a non-training sentence. The latter excludes the tagging since it will be used for cross-validation. During cross-validation we want sentences that resemble our expected input (i.e., without tags).

At this point all the code necessary to generate training data is in place. In the next section I will describe the training code that consumes this data to create a model that extracts Duration phrases from sentences. This turns out to be the simplest part of this process.

2. Training the Duration Model

Training a name-finder with the clojure-opennlp library is a trivial exercise and can be accomplished with the following steps:

  • Harvest and tag training data.
  • Train the model with the function (train-name-finder) using the training data. This function returns the trained model.
  • Save the model to disk for subsequent use.

We've already done the first, and as it turns out, the most difficult step. Training the model is, well, simple:

(defn- train-duration-model
  "Trains a name-finder model using sentences that have
   been generated by the (generate-sentences) form."
  [training-filename output-filename]
  (let [duration-finder-model
          (train/train-name-finder training-filename)]
    (store-model-file duration-finder-model output-filename)))

The parameters to this function are the path to the training data created by the process in the previous section and the name of the output file in which to store the trained model. The (train/train-name-finder) function encapsulates the vagaries of the underlying OpenNLP Java code. I chose to take the default behavior of this function. However, you should be aware that it comes with three different signatures. 

Source from clojure-opennlp:
(defn ^TokenNameFinderModel train-name-finder
  "Returns a trained name finder based on a given training file. Uses a
  non-deprecated train() method that allows for perceptron training with minimum
  modification. Optional arguments include the type of entity (e.g \"person\"),
  custom feature generation and a knob for switching to perceptron training
  (maXent is the default). For perceptron prefer cutoff 0, whereas for
  maXent 5."
  ([in] (train-name-finder "en" in))
  ([lang in] (train-name-finder lang in 100 5))
  ([lang in iter cut & {:keys [entity-type feature-gen classifier]
                        ;;MUST be either "MAXENT" or "PERCEPTRON"
                        :or  {entity-type "default" classifier "MAXENT"}}]

These allow the selection of the training algorithm as well as of the CUTOFF and ITERATIONS parameters. The details of these are beyond what I set out to cover in this installment - perhaps a source for a more in-depth writeup about OpenNLP in the future. But we can note that by default the MAXENT classifier is used, with an ITERATIONS value of 100 and a CUTOFF of 5.
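For reference, the third arity could be used to depart from those defaults - for instance, switching to perceptron training with a cutoff of 0, as the docstring suggests. The file path below is hypothetical, and I have not verified that these settings improve this particular model:

```clojure
;; Hypothetical: explicit language, iterations, cutoff, and classifier,
;; instead of taking the MAXENT defaults (100 iterations, cutoff 5).
(def duration-finder-model
  (train/train-name-finder "en" "training-sentences.txt" 100 0
                           :classifier "PERCEPTRON"))
```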

The (store-model-file) function simply stores the generated model to disk for subsequent use in our - let's not forget the goal - calendar event NLP processing engine.

(defn- store-model-file
  "Store the binary model to disk for subsequent reuse."
  [bin-model model-file-name]
  (let [ out-stream (FileOutputStream. model-file-name)]
     (train/write-model bin-model out-stream)))

Executing the Training Code

My goal is to produce a model file, evaluate its efficacy at detecting a duration phrase, and then simply reuse this model as an instance of a (name-finder) within the code that parses calendar event requests. Thus, executing this code interactively in the REPL is sufficient, and that's what the next listing illustrates. It is composed of two simple commands: (1) change into the project's training namespace, and (2) execute the (create-duration-model) function.

nlp.core=> (ns
nil
> (create-duration-model)
Indexing events using cutoff of 5

 Computing event counts...  done. 346732 events
 Indexing...  done.
Sorting and merging events... done. Reduced 346732 events to 139005.
Done indexing.
Incorporating indexed data for training...  
 Number of Event Tokens: 139005
     Number of Outcomes: 3
   Number of Predicates: 10464
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-380924.0360749354 0.9134778445600636
  2:  ... loglikelihood=-160295.82794628432 0.978294475271968
  3:  ... loglikelihood=-101054.77116702979 0.99729185653473
  4:  ... loglikelihood=-73205.73603332718 0.999997115928152
  5:  ... loglikelihood=-57133.830716095385 1.0
  6:  ... loglikelihood=-46735.585043210194 1.0
  7:  ... loglikelihood=-39485.96976033249 1.0
 99:  ... loglikelihood=-2556.899197362149 1.0
100:  ... loglikelihood=-2531.3059065857246 1.0
Note: The execution of this code on my MBA (1.8 GHz i7, 4GB of RAM) took just a few seconds. The output is truncated to show the early iterations and the end.

With the model created and saved to disk, we can proceed to testing and validating it.

3. Testing the Duration Model

Interactive Testing

Testing the results of training the model on the REPL is accomplished easily with the code we've written so far. First we load the resources required by the sentence generator, then create a name-finder function from our model, and finally we invoke it on randomly generated sentences.

In more detail, we:

  • Bring the opennlp.nlp namespace into scope
  • Define the duration-find model based on the en-duration.bin file created by the (create-duration-model) function
  • Define a tokenizer
  • Load the first name, last name, request clauses, and subject clauses collections.
  • Define a date-time formatter needed by the sentence generator.
  • Execute (generate-sentence) just to illustrate its output.
  • Finally, iteratively invoke the (duration-find) function to extract the duration phrase from randomly generated sentences, displaying the extracted text.

> (use 'opennlp.nlp)
nil
> (def duration-find (make-name-finder "models/en-duration.bin"))
#'
> (def tokenize (make-tokenizer "models/en-token.bin"))
#'
> (def last-names (read-file "resources/last-names"))
#'
> (def first-names (read-file "resources/first-names"))
#'
> (def request-clauses (read-file "resources/request-clauses"))
#'
> (def subject-clauses (read-file "resources/subject-clauses"))
#'
> (def custom-formatter (ttf/formatter "MMMM d yyyy 'at' hh:mma"))
#'
> (generate-sentence false first-names last-names request-clauses subject-clauses)
"Schedule an appointment with Audrey Ince and Christopher Lawrence on June 16 2015 at 08:00AM for  8 hours to fine tune our pitch."
> (duration-find (tokenize (generate-sentence false first-names last-names request-clauses subject-clauses)))
("2 hours")
> (duration-find (tokenize (generate-sentence false first-names last-names request-clauses subject-clauses)))
("30 minutes")
> (duration-find (tokenize (generate-sentence false first-names last-names request-clauses subject-clauses)))
("4 hours")
> (duration-find (tokenize (generate-sentence false first-names last-names request-clauses subject-clauses)))
("1 hour")
> (duration-find (tokenize (generate-sentence false first-names last-names request-clauses subject-clauses)))
("1 hour")
> (duration-find (tokenize (generate-sentence false first-names last-names request-clauses subject-clauses)))
("3 hours")
> (duration-find (tokenize (generate-sentence false first-names last-names request-clauses subject-clauses)))
("2 hours")


As with any statistical learning algorithm, cross-validation is a necessary component. It turns out that the nature of the training data and of the expected input results in perfect recall; meaning, every duration clause of the form "N [minute(s) | hour(s)]" is correctly detected. 

An interesting observation, however, is that my original definition of the duration clause included the "for" preposition (e.g., "for 2 hours"). When I first trained the model to detect these types of entities, the results were dismal - in the order of 55-70% recall accuracy. This was discouraging given that the goal of creating this model was to avoid the hard-coded parsing I had done in the first version of the calendar event request code. 

I never expected 100% recall accuracy. However, excluding the "for" preposition from the tagged text increased the recall accuracy to 100%. I think this is the result of the semi-structured nature of the training sentences. The fact that the "for" preposition always appears right before the duration tokens likely had a significant impact on this result.

The following REPL session illustrates multiple executions of the cross-validation function. Each execution takes 100 randomly generated sentences as input. The (duration-find) function is then applied to each sentence. The cross-validation code counts how many matching clauses are returned by (duration-find), which should be one per sentence. This count is then (reduce)'d and the sum divided by the number of sentences in the input. The result is displayed as the "Accuracy Ratio".

> (use 'clojure.pprint)
nil
> (pprint (str "Accuracy Ratio: " (cross-validate-duration-model 100)))
"Accuracy Ratio: 1.0"
nil
> (pprint (str "Accuracy Ratio: " (cross-validate-duration-model 100)))
"Accuracy Ratio: 1.0"
nil
> (pprint (str "Accuracy Ratio: " (cross-validate-duration-model 100)))
"Accuracy Ratio: 1.0"
nil
> (pprint (str "Accuracy Ratio: " (cross-validate-duration-model 100)))
"Accuracy Ratio: 1.0"

The cross-validation function is illustrated below. There isn't much that requires explanation other than what was already described above. 

(defn cross-validate-duration-model
  "Cross-validates the model by generating a set of sentences using the
   same rules as those used for training and then using the trained
   model to extract the Duration entity from each. The efficacy of the
   model is described by the success/total ratio."
  [sample-count]
  (let [aln (read-file last-name-file)
        afn (read-file first-name-file)
        areqs (read-file request-clause-file)
        asubs (read-file subject-clause-file)
        success (reduce +
                        (take sample-count
                              (repeatedly
                                #(count (nlp.core/duration-find
                                          (generate-sentence false
                                                             afn aln
                                                             areqs asubs))))))]
    (/ (float success) (float sample-count))))
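The final ratio is just the summed match counts over the sample count. In isolation, with illustrative match data (one entry per cross-validated sentence; the data and helper name are mine, not the project's):

```clojure
;; 1 = duration phrase found in the sentence, 0 = missed (illustrative)
(def match-counts [1 1 1 0 1])

;; Same arithmetic as the tail of cross-validate-duration-model.
(defn accuracy-ratio
  [counts]
  (/ (float (reduce + counts)) (float (count counts))))

(accuracy-ratio match-counts)
;; => 0.8
```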

4. Integrating the Duration Model

Finally, we come to the goal of this installment. Recall that my objective is to improve on the capabilities of the NLP code started in the first installment of this series. In that initial implementation, the duration phrase was extracted using a crude process: looking for the "for" preposition and then parsing the next two tokens in the sentence. This was not ideal; relying on a single occurrence of that token is faulty at best. So, the task is to replace that original duration-parsing code with the use of this trained model. 

Original implementation of the duration-parsing code:
(defn parse-duration
  "Looks for tokens that specify the duration of the
   meeting. We use a very simple approach:
   1. find location of the 'for' token.
   2. assume the next two tokens are of the form 'N [minutes|hours]'."
  [tokens]
  (let [duration_index (.indexOf tokens "for")
        duration (nth tokens (+ duration_index 1))
        dimension (nth tokens (+ duration_index 2))]
    {:duration duration :time dimension}))

That implementation is then replaced with the (duration-find) finder, which leverages our newly trained model.

(defn parse-duration
  [tokens]
  (let [duration-tokens (str/split (first (duration-find tokens)) #"\s+")]
    {:duration (first duration-tokens) :time (second duration-tokens)}))

The (duration-find) function is defined as follows:

(def duration-find (make-name-finder "models/en-duration.bin"))

We can easily test this on the REPL:

nlp.core=> (duration-find (tokenize "Schedule this meeting for 2 hours."))
("2 hours")

At this point I know that the Duration model works and have integrated it with the original code which parses requests for meetings or calendar events into structured data - data needed to eventually create a calendar entry. All that is left is to execute the (parse-message) function I built in the first installment of this series. Again, I turn to the REPL and execute the function with various inputs. The first outcome is representative of successful parsing. 

   "Please schedule a meeting with John Smith, Kevin Cooper, 
    and Steve Green on March 2 2015 at 02:30PM 
    for 45 minutes to discuss the master plan.")

{:duration {:duration "45" :time "minutes"}
 :people ("John Smith" "Kevin Cooper" "Steve Green")
 :starts-at #<org.joda.time.DateTime@3b25c7e2 2015-03-02T14:30:00.000-05:00>}

However, the next outcome illustrates the type of issues I am likely to encounter. Notice that the hour component of the meeting time is expressed as a single digit - 2:30PM as opposed to 02:30PM. This format has a very low chance of being detected with the default date-time finder provided by the OpenNLP library.

nlp.core=> (parse-message "Please schedule a meeting with
  John Smith, Kevin Cooper, and Steve Green 
  on March 2 2015 at 2:30PM for 45 minutes 
  to discuss the master plan.")

{:duration {:duration "45" :time "minutes"}
 :people ("John Smith" "Kevin Cooper" "Steve Green")
 :starts-at #<org.joda.time.DateTime@67bc9eb7 2015-03-02T00:00:00.000-05:00>}

Similarly, putting a space between the minutes and the PM|AM token causes the time component to be missed by the date-time finder. 

Observations and Next Steps

I set out to improve on the NLP extraction code from the first installment. The main objective was to replace the hard-coded duration-phrase parsing with an NLP model capable of detecting the type of phrases I expect to encounter. This was accomplished by training a name-finder model using the clojure-opennlp wrapper for the OpenNLP Java library. To do this, a training-sentences generator was developed. This generator composed sentences in the format we expect to encounter, with the duration phrase in each sentence properly tagged with the <START:duration> <END> tags. The generated model proved to have perfect recall; all instances of duration phrases encountered in the cross-validation process were successfully extracted.

I now have the capability to retrieve structured data from plain text; data that represents a request to create a calendar event - a meeting, an appointment, etc. However, I am not satisfied yet. As illustrated above, the model can still get thrown off by simple things, such as a space between the minutes and [PM|AM]. The code is also unable to handle single-digit hours. This is far from ideal, as it would be unnatural (and goes against the whole premise of using Natural Language Processing) to expect users to write 02:30PM rather than 2:30PM. 

So, in the next installment, I will attempt to make the model more robust, to allow more flexibility in the specification of dates and times. 

<- First installment
