Thursday, March 26, 2015

From Natural Language to Calendar Entries with Clojure - Part 3


Training An OpenNLP Date-Time Model


Part 1: From Natural Language to Calendar Entries with Clojure
Part 2: From Natural Language to Calendar Entries with Clojure - Part 2

In the first installment of this series, I created a simple Clojure library to detect the participant names, date-time, and duration of a calendar event (appointment or meeting) request expressed in natural language. I used the NER (Named Entity Recognition) models for English names and date-times provided by the OpenNLP library. However, the duration of a calendar event, expressed in a form such as "for 2 hours" or "for 45 minutes", is not an entity for which a model already existed. In Part 2 of the series I trained such a model and incorporated it into the library. The end result was not sufficiently robust, as the date-time NER model provided by the OpenNLP library fails to recognize some formats. In this installment I will illustrate how to train a named entity model to detect date/times in various formats. The groundwork for most of this code is described in detail in the two previous installments, so I will keep this entry to the bare minimum necessary to describe the training code. The full source code can be seen here.

Generating Training Data

As was the case in the previous entry, training data needs to be generated. I reused most of the code written in the second installment of this series to set up a sentence generator built from individual phrase generators. If you have read the two previous articles you will notice some changes in this code. First, the signature of the (generate-sentence) function was changed to accept two hash maps: one containing the data needed by each generator, and a second containing the generator function for each phrase in a sentence. In the previous implementation of (generate-sentence) the generators were called explicitly from within the function. This change provides more control over which generators are used for each phrase. This leads to the second change: the use of "tagged" generators to produce training data (as opposed to testing data), which obviates the need for the "is-training" parameter of the previous implementation.
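A minimal sketch of how this two-map design might be wired up (the stub generators and keys here are illustrative only, not the library's actual ones):

```clojure
;; Illustrative sketch of the two-map design (stub generators, not the
;; library's actual ones): the data map carries the word lists and the
;; generators map carries one function per phrase slot.
(require '[clojure.string :as string])

(defn generate-sentence
  [data generators]
  (string/join " "
               [((:gen-request generators) data)
                ((:gen-datetime generators) data)
                ((:gen-duration generators) data)]))

(def data {:data-requests ["Schedule a meeting with John"]})

(def generators
  {:gen-request  (fn [d] (rand-nth (:data-requests d)))
   :gen-datetime (fn [_] "on March 31 2015 at 2:30 PM")
   :gen-duration (fn [_] "for 2 hours")})

(generate-sentence data generators)
;; => "Schedule a meeting with John on March 31 2015 at 2:30 PM for 2 hours"
```

Swapping a tagged generator into the generators map is then a one-line change, which is the point of the refactoring.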



(defn generate-sentences-for-datetime-training
  "Generate and writes [count] training sentences to [file],
   one per line."
  [cnt filename]
  ;; coerce the filename parameter to a string to avoid its being
  ;; mistaken for an input stream.
  (let [file_name (str filename)
        last-names (nlp.training/read-file nlp.training/last-name-file)
        first-names (nlp.training/read-file nlp.training/first-name-file)
        request-clauses (nlp.training/read-file nlp.training/request-clause-file)
        subject-clauses (nlp.training/read-file nlp.training/subject-clause-file)
        data {:data-lastnames last-names
              :data-firstname first-names
              :data-requests request-clauses
              :data-subjects subject-clauses}
        generators {:gen-datetime generate-datetime-tagged
                    :gen-duration nlp.training/generate-duration
                    :gen-subject  nlp.training/generate-subject
                    :gen-request  nlp.training/generate-request-clause}
        ]
    (with-open [wrt (io/writer file_name )]
      (doseq [sentence (take cnt
                             (repeatedly
                              #(nlp.training/generate-sentence data generators))) ]
        (.write wrt (str sentence "\n" )) ;; write line to file
        ))))

The goal of this sentence generator is to produce sentences with a variety of date-time formats, namely the formats I would expect in this problem domain. As described in the second installment, the model provided by the OpenNLP library for date/time extraction has limitations that make it impractical for my purposes, the main one being the requirement that hours be expressed with two digits. Thus, with the built-in model, 8:00PM would have to be written as 08:00PM in order for it to be successfully extracted. The following code illustrates how I chose to produce dates in various formats, including single- and double-digit hours as well as a space (or no space) between the hours:minutes term and the period (AM or PM).

(def datetime-formats ["MMM d yyyy 'at' h:mma"
                       "MMMM d yyyy 'at' hh:mma"
                       "MMM d yyyy 'at' h:mm a"
                       "MMMM d yyyy 'at' h:mm a"])

(def custom-formatter
     (ttf/formatter "MMMM d yyyy 'at' hh:mma"))

(def formatters (map #(ttf/formatter %) datetime-formats) )

(defn- rand-datetime
  [formats]
  (clj-time.format/unparse
    (rand-nth formats)
    (clj-time.format/parse custom-formatter (nlp.training/generate-datetime)))
  )

(defn generate-datetime-tagged
  []
  (str "<START:datetime> " (rand-datetime formatters) " <END>")
  )
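The single- versus double-digit distinction these patterns encode can be checked directly with the JDK's java.time classes (used here purely for illustration; the post itself uses clj-time):

```clojure
;; "h" prints the hour as-is while "hh" zero-pads it -- the latter is
;; the form the built-in OpenNLP model effectively required.
(import '(java.time LocalDateTime)
        '(java.time.format DateTimeFormatter))

(def en java.util.Locale/ENGLISH)
(def dt (LocalDateTime/of 2015 3 31 20 0))

(.format dt (DateTimeFormatter/ofPattern "h:mma" en))   ;; => "8:00PM"
(.format dt (DateTimeFormatter/ofPattern "hh:mma" en))  ;; => "08:00PM"
```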


You will notice that I used a sequence of formatter functions produced by this simple but powerful use of the Clojure (map) function. I simply used (map) to iterate over each of the defined formats and, for each, create a formatter function using clj-time.format/formatter. The latter wraps the DateTimeFormatterBuilder class of the Joda-Time Java library.



(def formatters (map #(ttf/formatter %) datetime-formats) )


Now, I can pick a random formatter to generate each sentence. The following code illustrates how this is accomplished. The [formats] parameter is the collection of formatters built by the previous listing.


(defn- rand-datetime
  [formats]
  (clj-time.format/unparse
    (rand-nth formats)
    (clj-time.format/parse custom-formatter (nlp.training/generate-datetime)))
  )

With this core code in place I can now train a named entity recognition (NER) model using the capabilities of the clojure-opennlp library (itself a wrapper of the Java OpenNLP library). This is accomplished with the following:


(defn train-datetime-model
  [training-filename output-filename]
  (let [datetime-finder-model (train/train-name-finder training-filename)]
    (nlp.training/store-model-file datetime-finder-model
                                   output-filename)
    ))

(defn create-datetime-model
  []
  (let [sentences-filename "models/en-event-datetime.sentences"
        sentences-count 20000
        output-filename "models/en-event-datetime.bin"
        ]
    (generate-sentences-for-datetime-training sentences-count
                                              sentences-filename)
    (train-datetime-model sentences-filename
                          output-filename)))

The function (create-datetime-model) generates the training sentences and then invokes (train-datetime-model) to train the NER model to extract date/time tokens. I chose to generate 20,000 training sentences, well above the recommended 15,000-observation count for the MaxEnt (Maximum Entropy) learning algorithm used by default in the OpenNLP library.


Testing and Validation

With this model generated I can now turn to validating and testing its full capabilities. But first let me step back and describe what is in place. The current library is designed to parse natural text in the general format of "Schedule a meeting with John Witman on March 31 2015 at 2:30 PM for 2 hours to discuss the project." and extract the meeting participants, date, time and duration. I am using the built-in English Names NER to extract participant names. The date, time and duration are extracted with the models described in this and the previous installment.
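To make the target concrete, this is roughly the shape of result I would expect such a parse to produce for the sentence above (a hypothetical example; the library's actual keys and return value may differ):

```clojure
;; Hypothetical result shape for the example sentence; the real
;; library's keys and values may differ.
(def parsed
  {:names    ["John Witman"]
   :datetime "March 31 2015 at 2:30 PM"
   :duration "2 hours"})

(:duration parsed)
;; => "2 hours"
```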

To test the accuracy of these models I used the same sentence-generation code used for training. The generated sentences are parsed and the accuracy of the extracted entities validated via the following code. This is a very simple evaluation: I simply count and reduce (sum) the number of successful date-time extractions. Note that this is possible because (datetime-find) returns an empty sequence () if no matching entity is found; thus, (count) can be used to validate whether the date-time tokens were extracted for that sentence.


(defn test-event-datetime-model
  "Cross-validates the model by generating a set of sentences using the
   same rules as those used for training and then using the trained
   model to extract the DateTime entity from each. The efficacy of the
   model is described by the success/total ratio."
  [sample-count]
  (let [
        datetime-find     (nlp/make-name-finder "models/en-event-datetime.bin")
        last-names        (nlp.training/read-file  nlp.training/last-name-file)
        first-names       (nlp.training/read-file  nlp.training/first-name-file)
        request-clauses   (nlp.training/read-file  nlp.training/request-clause-file)
        subject-clauses   (nlp.training/read-file  nlp.training/subject-clause-file)
        data {:data-lastnames last-names
              :data-firstname first-names
              :data-requests  request-clauses
              :data-subjects  subject-clauses}
        generators {:gen-datetime nlp.training/generate-datetime
                    :gen-duration nlp.training/generate-duration
                    :gen-subject  nlp.training/generate-subject
                    :gen-request  nlp.training/generate-request-clause}
        success   (reduce +
                           (take sample-count
                                 (repeatedly
                                  #(count (datetime-find
                                           (@nlp.core/tokenize
                                            (nlp.training/generate-sentence
                                                 data
                                                generators)))))))]
    (/ (float success) (float sample-count))
    )
  )
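The counting trick is easy to see in isolation: (count) of the empty sequence a finder returns on a miss is 0, so summing the per-sentence counts yields the number of successful extractions. A self-contained miniature (simulated finder results, not the real model's output):

```clojure
;; Each vector simulates the entity seq the finder returns for one
;; sentence; an empty vector means nothing was extracted.
(def results [["March 31 2015 at 2:30PM"] [] ["Apr 1 2015 at 8:00 PM"]])

(def success (reduce + (map count results)))

(/ (float success) (float (count results)))
;; => roughly 0.67
```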


Conclusion


I think the best way to illustrate the end result of this library is to share my screen. Since I can't really do that, the next best thing follows (which is pretty cool, I think). Please note that there are two significant delays in the screencast. First, when the REPL is launched, there are a few seconds of delay, caused by Leiningen bootstrapping the JVM. The second can be seen the first time (parse-message) is invoked; it is caused by the various OpenNLP NER models being loaded from disk. Subsequent calls to (parse-message) do not incur this cost.



I hope these installments offer some valuable information to those attempting to leverage OpenNLP via the clojure-opennlp library to process natural language.
