Saturday, March 7, 2015

From Natural Language to Calendar Entries, with Clojure

If you have been paying attention during the past few years, no doubt you have come across an article or two about Natural Language Processing (NLP), even in the mainstream media. Most likely you have used NLP to command your smartphone to take some form of action. Some automated interactive voice response (IVR) system has probably asked you a few questions and interpreted your utterances. The proliferation of computing devices without traditional keyboards has led to more applications of speech-to-text and the necessary interpretation of that text into commands. Although not a new concept, these applications have brought NLP to the forefront of many software development efforts.

Natural Language Processing (NLP) opens the door to the possibility of turning otherwise inert text into meaningful or, more interestingly, actionable information. It is the latter that I am interested in and what this installment will focus on. I will explore the basics of NLP using the OpenNLP library and Clojure to convert a sentence into a useful structure to store or act on. More specifically, my goal is to take simple sentences that indicate the desire to create a meeting request or an appointment and extract the date, duration and participants.

The applications of this are obvious. You can easily imagine turning an email message containing the sentence "Please schedule a meeting with Adam Smith and Sally Keynes, on November 22 2015, at 1:30pm, for 1 hour, to discuss the perils of economic forecasting." into an appointment in which [Adam Smith, Sally Keynes] are identified as the participants, [Start => 2015-11-22T13:30:00, End => 2015-11-22T14:30:00] becomes the appointment start and end time, and [discuss the perils of economic forecasting] is identified as the appointment subject.

The number of syntactic variations that express the same intent can be quite large. For example, a terser variation could be "Need meeting with Adam Smith and Sally Keynes on Nov 22 at 1:30pm to discuss the perils of economic forecasting". The first obvious difference is that this is a fragment. In contrast to the liberal use of commas in the first example, this sentence omits all commas, and it omits the year as well. This is just one possible variation.

The large number of possible variations makes it very hard to leverage regular expressions or other types of hand-written parsing to extract the relevant data from natural text. In what I plan to be the first of several installments about NLP, I will explore the basic concepts of applied NLP with the modest goal of creating calendar entries from plain text such as the examples above. To this end I will leverage the OpenNLP library via the Clojure clojure-opennlp API.
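To make the brittleness concrete, here is a minimal sketch of a regular expression written against the first example sentence. The names meeting-re, verbose-request and terse-request are hypothetical and exist only for this illustration; the pattern is one of many that could be written, not a recommended one.

```clojure
;; Hypothetical pattern tailored to "on <Month> <DD> <YYYY>, at <H:MM>am|pm".
(def meeting-re #"on (\w+ \d{1,2} \d{4}), at (\d{1,2}:\d{2}[ap]m)")

(def verbose-request
  "Please schedule a meeting with Adam Smith and Sally Keynes, on November 22 2015, at 1:30pm, for 1 hour, to discuss the perils of economic forecasting.")

(def terse-request
  "Need meeting with Adam Smith and Sally Keynes on Nov 22 at 1:30pm to discuss the perils of economic forecasting")

;; The pattern matches the sentence it was written for ...
(re-find meeting-re verbose-request)
;; => ["on November 22 2015, at 1:30pm" "November 22 2015" "1:30pm"]

;; ... but the terse variation (no year, no comma) slips through.
(re-find meeting-re terse-request)
;; => nil
```

Every new variation would need another pattern or another optional group, which is the maintenance burden a trained model is meant to avoid.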

I assume a basic knowledge of Clojure and Leiningen. If you are reading this article without the benefit of hands-on experience with either, I highly recommend the Clojure For the Brave and True site, which provides an excellent introduction to the language and the Leiningen toolchain.

Exploratory code

Clojure provides an excellent REPL for exploring concepts prior to committing to a given approach. The following code, all of which is for execution within the REPL, will explore the capabilities of the OpenNLP library to identify and extract entities - names and dates.  

But before I can run any of the exploratory code I must set up the project and obtain the model files used by the library. To create the project, I used Leiningen:

$ lein new app nlp

Next I add references to the two libraries I will use in this phase, clojure-opennlp and clj-time, to project.clj.

(defproject nlp "0.1.0-SNAPSHOT"
  :description "FIXME: write description"
  :url ""
  :license {:name "Eclipse Public License"
            :url ""}
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [clojure-opennlp "0.3.3"]
                 [clj-time "0.9.0"]]
  :main ^:skip-aot nlp.core
  :target-path "target/%s"
  :profiles {:uberjar {:aot :all}})

The model files can be downloaded from here. The layout shown below is how I chose to store these models. Please note that this setup closely follows the process documented on the clojure-opennlp GitHub page.

I am a novice in both NLP in general and the OpenNLP implementation specifically. I cannot describe the details of the model files, but I expect that each one reflects the result of training the library for a specific concept. Corrections welcome.

models
  |- en-chunker.bin
  |- en-parser-chunking.bin
  |- en-pos-maxent.bin
  |- en-sent.bin
  |- en-token.bin
  |- english-detokenizer.xml
  |- namefind
    |- en-ner-date.bin
    |- en-ner-person.bin
    |- en-ner-time.bin

At this point we can start our exploratory steps. As stated before, my goal is to extract names and dates. Let's try names first.

(use 'opennlp.nlp)
(def tokenize (make-tokenizer "models/en-token.bin"))
(def find-name (make-name-finder "models/namefind/en-ner-person.bin"))

(find-name (tokenize "My name is Ernie de Feria."))

(find-name (tokenize "My name is Ernie Deferia."))
("Ernie Deferia")

(find-name (tokenize "Adam Smith was a very smart person."))
("Adam Smith")

(find-name (tokenize "Annie Cannon and Marie Curie were great scientists."))
("Marie Curie")

(find-name (tokenize "Anna Cannon and Marie Curie were great scientists."))
("Anna Cannon" "Marie Curie")

In the first line I bring the opennlp.nlp namespace into scope. The next two lines make use of two function generators from the library. The first, make-tokenizer, creates a tokenizer function. The second, make-name-finder, creates find-name, a function that identifies names based on the English name model "en-ner-person.bin".

The next five incantations invoke the find-name function with sentences that explore some of the edges of the NLP library. The first sentence includes my full name, "Ernie de Feria", which contains the Spanish preposition "de". From the first result we can see that surnames of this type are not readily detected as part of a name. This is not a situation one is likely to find in the problem domain I care about. The workaround, as is the case in many data storage systems, is to concatenate the preposition and surname into one term. The second sentence illustrates this. The fourth sentence illustrates another problem, one more likely to be encountered. In this case the name "Annie" seems to throw off the name-finding model. Notice how, in the fifth sentence, the name "Anna" is successfully extracted along with the surname "Cannon".

At this point I have not researched why this is so. My intuition is that the NLP library uses a probabilistic or contextual algorithm to determine the likelihood of a name being used as such. I based this partially on the following observation. While the first sentence doesn't yield any results, the second does.

(find-name (tokenize "Annie Cannon was a great scientist."))

(find-name (tokenize "My name is Annie Cannon."))
("Annie Cannon")

So far I have simply exercised the capability to extract names from free text. Let's now turn to extracting dates.

Similar to the steps taken above we simply create the date finder and then invoke it with various fragments containing dates.

(def date-find (make-name-finder "models/namefind/en-ner-date.bin")) 

(date-find (tokenize "Meet me on November 22 2012."))
("November 22" "2012")

(date-find (tokenize "Meet me on Nov 22 2012."))
("Nov 22 2012")

(date-find (tokenize "Meet me on Nov 22 2012 at 1:30pm."))
("Nov 22 2012")

(date-find (tokenize "Meet me on Nov 22 2012 at 01:30pm."))
("Nov 22 2012 at 01:30pm")
(date-find (tokenize "Meet me on Nov 22 2012 @ 01:30pm."))
("Nov 22 2012 @ 01:30pm")

(date-find (tokenize "Meet me on 22 November 2012 at 01:30pm."))
("November 2012 at 01:30pm")

(date-find (tokenize "Meet me on 22 Nov 2012 at 01:30pm."))
("Nov 2012 at 01:30pm")

These and other tests allowed me to generalize the date extraction capabilities as follows:

  • Overall format must be [Month DD YYYY], where Month can be the abbreviated or fully spelled month.
  • The format [Month DD YYYY at HH:MM[pm|am]] is extracted as a single entity.
  • The format [Month DD YYYY ^at HH:MM[pm|am]] is extracted as two date entities, where ^at means anything other than 'at'.

Ultimately, the extracted date/time string must be converted to a proper datetime object in order for it to be useful. I will use the clj-time library, which wraps the Java Joda Time library. The first step is to understand the formats this library can parse. To get a glimpse of the built-in formats, one can use the incantation in the following listing. But first, note that at this point I have brought the following namespaces into scope:

(:use [clojure.pprint :as pp]
      [clj-time.format :as ttf]
      [clj-time.local :as l]
      [clj-time.core :as tt :exclude [second]])

Now, to list the formats available. The first thing to notice is that :rfc822 is the only format that comes close to what most people would naturally write in a meeting or appointment request.

nlp.core=> (ttf/show-formatters)
:basic-date                             20150308
:basic-date-time                        20150308T022453.110Z
:basic-date-time-no-ms                  20150308T022453Z
:basic-ordinal-date                     2015067
:basic-ordinal-date-time                2015067T022453.110Z
:basic-ordinal-date-time-no-ms          2015067T022453Z
:basic-t-time                           T022453.110Z
:basic-t-time-no-ms                     T022453Z
:basic-time                             022453.110Z
:basic-time-no-ms                       022453Z
:basic-week-date                        2015W107
:basic-week-date-time                   2015W107T022453.110Z
:basic-week-date-time-no-ms             2015W107T022453Z
:date                                   2015-03-08
:date-hour                              2015-03-08T02
:date-hour-minute                       2015-03-08T02:24
:date-hour-minute-second                2015-03-08T02:24:53
:date-hour-minute-second-fraction       2015-03-08T02:24:53.110
:date-hour-minute-second-ms             2015-03-08T02:24:53.110
:date-time                              2015-03-08T02:24:53.110Z
:date-time-no-ms                        2015-03-08T02:24:53Z
:hour                                   02
:hour-minute                            02:24
:hour-minute-second                     02:24:53
:hour-minute-second-fraction            02:24:53.110
:hour-minute-second-ms                  02:24:53.110
:mysql                                  2015-03-08 02:24:53
:ordinal-date                           2015-067
:ordinal-date-time                      2015-067T02:24:53.110Z
:ordinal-date-time-no-ms                2015-067T02:24:53Z
:rfc822                                 Sun, 08 Mar 2015 02:24:53 +0000
:t-time                                 T02:24:53.110Z
:t-time-no-ms                           T02:24:53Z
:time                                   02:24:53.110Z
:time-no-ms                             02:24:53Z
:week-date                              2015-W10-7
:week-date-time                         2015-W10-7T02:24:53.110Z
:week-date-time-no-ms                   2015-W10-7T02:24:53Z
:weekyear                               2015
:weekyear-week                          2015-W10
:weekyear-week-day                      2015-W10-7
:year                                   2015
:year-month                             2015-03
:year-month-day                         2015-03-08

Since none of the built-in formats matches the date strings extracted above, custom formatters come to the rescue, more specifically a multi-formatter. This can be easily achieved as follows:

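(def multi-parser
  ;; Given a time zone and several patterns, ttf/formatter builds a
  ;; formatter that tries each pattern in turn when parsing.
  (ttf/formatter (tt/default-time-zone)
                 "yyyy-MM-dd @ hh:mma"
                 "YYYY/MM/dd @ hh:mma"
                 "YYYY/MM/dd 'at' hh:mma"
                 "MMM d yyyy @ hh:mma"
                 "MMM d yyyy 'at' hh:mma"))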

Notice how I have specified formats that we know cannot be extracted by the OpenNLP library. I have left these in there for the sake of completeness - to have a more complete parser. The formats relevant to the problem I set out to solve with this effort are the last two. 

As we saw from the sample extraction of dates, in some cases these can be extracted by OpenNLP as separate items in the returned sequence. The following code brings it all together - name and date extraction, joining dates when extracted separately, and conversion of date strings into Joda datetime objects.

(ttf/parse multi-parser 
   (clojure.string/join " " (date-find 
       (tokenize "Meet me on November 1 2012 at 01:30pm."))))

#<org.joda.time.DateTime@2225a84e 2012-11-01T13:30:00.000-04:00>

At this point we can confidently extract date/times with a given format, a format which is commonly used. However, there are restrictions on how these dates can be expressed and properly parsed. For example, the time component must use a two-digit representation of the hour.
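One way to soften the two-digit-hour restriction is to normalize the text before it is tokenized. The following is a minimal sketch; pad-hour is a hypothetical helper that is not part of the code in this article, and it only covers times written as H:MMam or H:MMpm. Whether the date model then extracts the padded time is an expectation based on the REPL results above, not something this sketch verifies.

```clojure
(require '[clojure.string :as str])

(defn pad-hour
  "Hypothetical helper: left-pads single-digit hours in times such as
   1:30pm so that they read 01:30pm. Two-digit hours are left untouched."
  [s]
  ;; \b ensures the digit starts a token, so the second digit of 11:30pm
  ;; is not mistaken for a single-digit hour.
  (str/replace s #"(?i)\b(\d):(\d{2})(am|pm)" "0$1:$2$3"))

(pad-hour "Meet me on Nov 22 2012 at 1:30pm.")
;; => "Meet me on Nov 22 2012 at 01:30pm."
```

The helper would be applied to the raw sentence before calling tokenize.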

The missing part in this first installment is that of parsing the duration of the requested calendar event. None of the existing models handle this. In the example sentence, "Please schedule a meeting with Adam Smith and Sally Keynes, on November 22 2015, at 01:30pm, for 1 hour, to discuss the perils of economic forecasting."  the segment "for 1 hour" is representative of duration specifications. The goal is to handle other specifications such as "for N minutes" and "for N hours". I will use a very simple approach to detect duration:

(defn parse-duration
  "Looks for tokens that specify the duration of the
   meeting. We use a very simple approach:
   1. find location of the 'for' token.
   2. assume the next two tokens are of the form 'N [minutes|hours]'."
  [tokens]
  (let [duration_index (.indexOf tokens "for")
        duration (nth tokens (+ duration_index 1))
        dimension (nth tokens (+ duration_index 2))]
    {:duration duration :time dimension}))

This is not an optimal approach, but it is consistent with the goal set for this installment. The long-term objective is to continue exploring the capabilities of OpenNLP to learn new models, such as the ones that might be required to detect and extract these segments.
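As a hedged illustration of where the simple approach can go wrong, the sketch below scans every "for" in the sentence instead of only the first one and returns nil when no duration is present. The function duration-in-minutes is hypothetical and not part of the code in this article; it assumes tokens is a sequence of strings like the one tokenize returns.

```clojure
(defn duration-in-minutes
  "Hypothetical variant of parse-duration: returns the duration in minutes,
   or nil when no 'for N minutes|hours' triple is found."
  [tokens]
  (some (fn [[kw n unit]]
          ;; Only accept a 'for' that is followed by a whole number.
          (when (and (= kw "for") (re-matches #"\d+" n))
            (cond
              (re-matches #"minutes?" unit) (Long/parseLong n)
              (re-matches #"hours?" unit)   (* 60 (Long/parseLong n)))))
        ;; Sliding window over every run of three consecutive tokens.
        (partition 3 1 tokens)))

(duration-in-minutes ["meeting" "for" "Adam" "," "for" "2" "hours"])
;; => 120
```

A sentence such as "Thanks for the help" yields nil here, whereas a lookup based on the first "for" would report "the help" as a duration.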

With this we come to the final part of this installment. The following listing brings it all together into a cohesive and concise implementation.

(ns nlp.core
  (:require [clojure.string])
  (:use [opennlp.nlp]
        [clojure.pprint :as pp]
        [clj-time.format :as ttf]
        [clj-time.local :as l]
        [clj-time.core :as tt :exclude [second]]))

;; Functions generated from the downloaded OpenNLP model files.
(def get-sentences (make-sentence-detector "models/en-sent.bin"))
(def tokenize      (make-tokenizer "models/en-token.bin"))
(def detokenize    (make-detokenizer "models/english-detokenizer.xml"))
(def pos-tag       (make-pos-tagger "models/en-pos-maxent.bin"))
(def name-find     (make-name-finder "models/namefind/en-ner-person.bin"))
(def date-find     (make-name-finder "models/namefind/en-ner-date.bin"))
(def time-find     (make-name-finder "models/namefind/en-ner-time.bin"))

;; Multi-format parser: each pattern is tried in turn.
(def multi-parser
  (ttf/formatter (tt/default-time-zone)
                 "yyyy-MM-dd @ hh:mma"
                 "YYYY/MM/dd @ hh:mma"
                 "MMM d yyyy @ hh:mma"
                 "YYYY/MM/dd 'at' hh:mma"
                 "MMM d yyyy 'at' hh:mma"))

(defn find-persons
  "Finds any properly capitalized names from a tokenized sentence."
  [tokens]
  (name-find tokens))

(defn find-datetime-tokens
  "Finds date/time related tokens from a tokenized sentence and returns a lazy sequence of them."
  [tokens]
  (date-find tokens))

(defn unify-datetime-tokens
  "Joins all date & time related tokens into a single string."
  [tokens]
  (clojure.string/join " " tokens))

(defn parse-datetime
  "Parse a date string into a valid datetime object."
  [s-date]
  (ttf/parse multi-parser s-date))

(defn parse-duration
  "Looks for tokens that specify the duration of the meeting. We use a very simple approach:
   1. find location of the 'for' token.
   2. assume the next two tokens are of the form 'N [minutes|hours]'."
  [tokens]
  (let [duration_index (.indexOf tokens "for")
        duration (nth tokens (+ duration_index 1))
        dimension (nth tokens (+ duration_index 2))]
    {:duration duration :time dimension}))

(defn parse-message
  "Parse the message and extract people, start time and duration for the calendar event.
   Return a map with {:people :starts-at :duration}."
  [s]
  (let [tokens    (tokenize s)
        people    (find-persons tokens)
        starts-at (-> tokens find-datetime-tokens unify-datetime-tokens parse-datetime)
        duration  (parse-duration tokens)]
    {:people people :starts-at starts-at :duration duration}))

(defn -main
  "I don't do a whole lot ... yet. But here are a couple of sample OpenNLP functions in action:"
  [& args])


The function [parse-message] uses the underlying functions we have built to convert a single-sentence meeting or event request into a structure. In this case, the structure is a map with the keys:

{:people :starts-at :duration}

What follows is the result of invoking this function on some sample sentences.

(parse-message "Please schedule a meeting with Adam Smith and Sally Keynes, on November 22 2015 at 01:30pm, for 1 hour, to discuss the perils of economic forecasting.")

{:duration {:duration "1" :time "hour"}
 :people ("Adam Smith" "Sally Keynes")
 :starts-at #<org.joda.time.DateTime@168da839 2015-11-22T13:30:00.000-05:00>}

(parse-message "Please schedule a meeting with Adam Smith and Sally Keynes, on November 22 2015 at 12:30pm, for 30 minutes, to discuss the perils of economic forecasting.")

{:duration {:duration "30" :time "minutes"}
 :people ("Adam Smith" "Sally Keynes")
 :starts-at #<org.joda.time.DateTime@272f0dfd 2015-11-22T12:30:00.000-05:00>}
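The example at the top of this article also called for an end time. A minimal sketch of how one could be derived from the map above is shown next; ends-at is a hypothetical helper that is not part of the listing, and it assumes the :duration entry has the {:duration "N" :time "hour(s)|minute(s)"} shape returned by parse-duration. It relies on plus, hours and minutes from clj-time.core.

```clojure
(require '[clj-time.core :as t])

(defn ends-at
  "Hypothetical helper: adds a parse-duration style map to a Joda DateTime.
   Returns nil when the unit is neither hours nor minutes."
  [starts-at {n :duration unit :time}]
  (let [amount (Long/parseLong n)]
    (cond
      (re-matches #"hours?" unit)   (t/plus starts-at (t/hours amount))
      (re-matches #"minutes?" unit) (t/plus starts-at (t/minutes amount)))))

;; 13:30 plus one hour gives 14:30 on the same day.
(ends-at (t/date-time 2015 11 22 13 30) {:duration "1" :time "hour"})
```

With the result of parse-message bound to a hypothetical local m, the call would be (ends-at (:starts-at m) (:duration m)).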


I illustrated the basic use of OpenNLP to create structured data representing a calendar event by processing natural language text. I cheated a bit and implemented some "hard-coded" parsing to determine the duration of the event. I also identified some of the peculiarities of OpenNLP in identifying date-time entities. In subsequent installments I plan to expand the capabilities of the code written so far to better extract duration entities using trained models rather than the naive approach I used.
