
Identify An English Word As A Thing Or Product?

Write a program with the following objective - be able to identify whether a word/phrase represents a thing/product. For example - 1) 'A glove comprising at least an index finger r

Solution 1:

What you want to do is actually pretty difficult. It is a sort of (very specific) semantic labelling task. The possible solutions are:

  • build your own labelling model: create training data, train, test, evaluate, and finally tag your data
  • use an existing knowledge base (lexicon) to extract semantic labels for each target word

The first option is a complex research project in itself. Do it if you have the time and resources.

The second option will only give you the labels that are available in the knowledge base, and these might not match your needs. I would give it a try with Python, NLTK, and WordNet (NLTK already ships a WordNet interface); you might be able to use synset hypernyms for your problem.
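As a rough illustration of the hypernym idea, here is a minimal sketch. It assumes that "thing/product" can be approximated by "some noun sense has `artifact.n.01` among its hypernyms"; that choice of root, and the helper names `synset_is_thing` and `word_is_thing`, are my own assumptions and not part of NLTK. Running it against real WordNet requires `nltk` plus the WordNet corpus (`nltk.download("wordnet")`).

```python
# Assumption: a "thing/product" is any noun with artifact.n.01 as a hypernym.
# Adjust TARGET_ROOTS (e.g. add "physical_entity.n.01") to widen the definition.
TARGET_ROOTS = {"artifact.n.01"}


def synset_is_thing(synset, roots=TARGET_ROOTS):
    """True if any hypernym path of this synset passes through a target root."""
    return any(
        node.name() in roots
        for path in synset.hypernym_paths()
        for node in path
    )


def word_is_thing(word, lookup=None, roots=TARGET_ROOTS):
    """True if at least one noun sense of `word` descends from a target root.

    `lookup` maps a word to a list of synsets; by default it queries WordNet
    through NLTK (imported lazily so the helper above stays dependency-free).
    """
    if lookup is None:
        from nltk.corpus import wordnet as wn  # needs nltk + the wordnet corpus

        def lookup(w):
            # WordNet stores multi-word entries with underscores.
            return wn.synsets(w.replace(" ", "_"), pos=wn.NOUN)

    return any(synset_is_thing(s, roots) for s in lookup(word))
```

Calling `word_is_thing("glove")` checks every noun sense of the word, so ambiguous words are accepted if any one sense is an artifact; picking the right sense in context is a separate problem that this sketch does not address.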

Solution 2:

This task is a named entity recognition (NER) problem.

EDIT: There is no clean definition of NER in the NLP community, so one could say this is not an NER task but an instance of the more general sequence labeling problem. Either way, there is still no tool that can do this out of the box.

Out of the box, Stanford NLP can only recognize the following types:

Recognizes named (PERSON, LOCATION, ORGANIZATION, MISC), numerical (MONEY, NUMBER, ORDINAL, PERCENT), and temporal (DATE, TIME, DURATION, SET) entities

so it is not suitable for this task. There are some commercial solutions that can possibly do the job; they are easy to find by googling "product name named entity recognition", and some of them offer free trial plans. I don't know of any free, ready-to-deploy solution.

Of course, you can create your own model by hand-annotating about 1000 or so sentences that contain product names and training a classifier such as a Conditional Random Field (CRF) on some basic features (here is a documentation page that explains how to do that with Stanford NLP). This solution should work reasonably well, although it won't be perfect (no system will be perfect, but some solutions are better than others).

EDIT: This is a complex task per se, but not that complex unless you want state-of-the-art results. You can create a reasonably good model in just 2-3 days. Here is an example of step-by-step instructions for doing this with an open source tool:

  1. Download CRF++ and look at the provided examples; they are in a simple text format
  2. Annotate your data in a similar manner, one token and its label per line:

    a OTHER
    glove PRODUCT
    comprising OTHER
    ...

and so on.

Split your annotated data into two files: train.txt (80%) and dev.txt (20%).

  3. Use the following baseline feature template (paste it into a file named template). Each feature needs its own ID, otherwise features taken from different positions get merged:

    U00:%x[-2,0]
    U01:%x[-1,0]
    U02:%x[0,0]
    U03:%x[1,0]
    U04:%x[2,0]
    U05:%x[-1,0]/%x[0,0]
    U06:%x[0,0]/%x[1,0]

  4. Run:

    crf_learn template train.txt model
    crf_test -m model dev.txt > result.txt
  5. Look at result.txt. It contains your hand-assigned labels and, in the last column, the machine-predicted labels. You can compare the two columns, compute accuracy, and so on. After that you can feed new unlabeled data into crf_test and get your labels.
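The 80/20 split from the steps above can be scripted. This is a minimal sketch that assumes the annotated file uses the CRF++ layout (one token per line, sentences separated by a blank line) and splits at sentence level so no sentence is cut in half. The function names and the input file name `annotated.txt` are hypothetical.

```python
import random


def read_sentences(text):
    """Group non-empty lines into sentences; blank lines act as separators."""
    sentences, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def split_train_dev(sentences, dev_ratio=0.2, seed=0):
    """Shuffle sentences reproducibly and return (train, dev)."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_dev = round(len(shuffled) * dev_ratio)
    return shuffled[n_dev:], shuffled[:n_dev]


def write_sentences(sentences, path):
    """Write sentences back in the same layout, one blank line after each."""
    with open(path, "w", encoding="utf-8") as handle:
        for sentence in sentences:
            handle.write("\n".join(sentence) + "\n\n")


def split_file(source_path, train_path="train.txt", dev_path="dev.txt"):
    """Read an annotated file and write the train/dev files used by CRF++."""
    with open(source_path, encoding="utf-8") as handle:
        sentences = read_sentences(handle.read())
    train, dev = split_train_dev(sentences)
    write_sentences(train, train_path)
    write_sentences(dev, dev_path)
    return len(train), len(dev)
```

Calling `split_file("annotated.txt")` would produce the `train.txt` and `dev.txt` files that the `crf_learn` and `crf_test` commands expect. The fixed seed keeps the split stable between runs, which makes later accuracy numbers comparable.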

As I said, this won't be perfect, but I would be very surprised if it weren't reasonably good (I actually solved a very similar task not long ago), and it will certainly be better than just using a few keywords/templates.
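To put a number on "reasonably good", the two label columns in result.txt can be compared with a few lines of Python. This sketch assumes each non-blank line ends with the hand-assigned label followed by the predicted label, which matches the layout described above for `crf_test` output; the function name `evaluate` and the choice of metrics are my own.

```python
def evaluate(lines, target="PRODUCT"):
    """Token accuracy plus precision/recall for one label.

    Each non-blank line is expected to end with: <gold label> <predicted label>.
    """
    total = correct = true_pos = false_pos = false_neg = 0
    for line in lines:
        cols = line.split()
        if len(cols) < 3:  # blank separator or malformed line
            continue
        gold, pred = cols[-2], cols[-1]
        total += 1
        if gold == pred:
            correct += 1
        if pred == target and gold == target:
            true_pos += 1
        elif pred == target:
            false_pos += 1
        elif gold == target:
            false_neg += 1
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    return {
        "accuracy": correct / total if total else 0.0,
        "precision": true_pos / predicted if predicted else 0.0,
        "recall": true_pos / actual if actual else 0.0,
    }
```

Passing the lines of result.txt (for example `evaluate(open("result.txt", encoding="utf-8"))`) returns the three scores. Precision and recall on PRODUCT are usually more informative than plain accuracy here, because most tokens are OTHER and a model that never predicts PRODUCT would still score a high accuracy.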

ENDNOTE: this ignores many details and some best practices for solving such tasks. It won't be good enough for academic research and is not 100% guaranteed to work, but it is still useful as a relatively quick solution for this and many similar problems.
