Porting word2phrase #1

@Prog19

A quick solution to this issue from the Java implementation would be to download this code file (from the original C tool), compile it, and execute it from Clojure. The tool marks multi-word phrases in the training text corpus by joining their words with an underscore. (Refer to 'From words to phrases and beyond' from here.)

Below is code that runs the executable placed in /resources in the project directory, either through a Java Runtime instance or, alternatively, by shelling out from Clojure. The input is read from /resources/train.txt, the output is written to /resources/output/out.txt, and the other word2phrase training parameters take their default values.

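(require '[clojure.java.shell :refer [sh]]
         'clojure.string)
(import '(java.io BufferedReader InputStreamReader))

(let [tmp (-> (System/getProperty "user.dir")
              (.replace "\\" "/")) ; Normalize to forward slashes for Unix;
                                   ; Windows accepts both path styles.
      res (str tmp "/resources/")]
  ;; Alternative 1 (disabled): launch the process through java.lang.Runtime.
  ;; Passing the arguments as an array avoids splitting paths that contain spaces.
  (comment
    (let [proc (.exec (Runtime/getRuntime)
                      (into-array String [(str res "word2phrase.exe")
                                          "-train" (str res "train.txt")
                                          "-output" (str res "output/out.txt")]))]
      ;; with-open closes the reader even if reading throws.
      (with-open [br (BufferedReader. (InputStreamReader. (.getInputStream proc)))]
        (println (clojure.string/join "\n" (line-seq br))))))

  ;; Alternative 2: shell out with clojure.java.shell/sh and print the tool's stdout.
  (println (:out (sh (str res "word2phrase.exe")
                     "-train" (str res "train.txt")
                     "-output" (str res "output/out.txt"))))
  ;; sh reads the process output on future threads, which can delay JVM
  ;; shutdown, so exit explicitly.
  (System/exit 0))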
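The shell-out variant above prints only :out, so a failed run (for example, a missing train.txt) could go unnoticed. A minimal sketch of a checked wrapper follows; `run-checked` is a hypothetical helper name, and it assumes the tool signals failure through a non-zero exit code, which is not confirmed in this issue.

```clojure
(require '[clojure.java.shell :refer [sh]])

(defn run-checked
  "Hypothetical helper: runs cmd (a sequence of strings) with sh and returns
  its stdout. Throws ex-info carrying the exit code and stderr when the exit
  code is non-zero (assumed here to mean failure)."
  [cmd]
  (let [{:keys [exit out err]} (apply sh cmd)]
    (if (zero? exit)
      out
      (throw (ex-info (str "Command failed with exit code " exit)
                      {:cmd cmd :exit exit :err err})))))

;; Usage with the paths from this issue (not evaluated here):
(comment
  (let [res (str (System/getProperty "user.dir") "/resources/")]
    (println (run-checked [(str res "word2phrase.exe")
                           "-train" (str res "train.txt")
                           "-output" (str res "output/out.txt")]))))
```

The map returned by sh also contains :err, so the thrown exception's data can be inspected with ex-data to see what the tool wrote to stderr.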
