In the Thai NLP group, there is a poll asking people’s opinions about the existence of Thai sentences. As a student of linguistics, I thought I might have something that may be of interest (and hopefully of use) to say to NLP people who try to do tasks like sentence segmentation in Thai. I’ll try to answer the question, “is there a sentence in Thai?” here.

What is a sentence, anyway?

Before trying to answer the question of whether Thai has sentences, we need a definition of a sentence. Believe it or not, coming up with a satisfactory definition of a sentence (or of any other deceptively simple grammatical concept) is surprisingly difficult. This is partly because the criteria for being a sentence in one language often fail in another, and partly because the sentence is a fuzzy concept that we simply develop intuitions for in class. Most linguists today no longer concern themselves with the sentence per se, but recast it as something else that is more theory-specific. They only use “sentence” informally.

There are, however, rather standard definitions that are in line with our intuition of what a sentence is. Here are two which I think are the most intuitive and, as we shall see, least problematic when applied to Thai:

  • Definition 1: A sentence is a clause that is not part of another clause (from Martin Haspelmath’s blog post)
    • Definition 1.1: A clause is a formal correlate of a proposition (Payne 2006: 324)
  • Definition 2: A linguistic unit with a single illocutionary force

Definition 1 basically says that if something is a sentence, then we should be able to talk about its truth status (i.e. it is a clause, which expresses a proposition, something that can be true or false), and structurally it should not be part of another clause. For example, “I want you,” uttered alone, is a sentence, because it is a clause and is not part of any other clause. On the other hand, “I wanted to go” as part of “I told her that I wanted to go” is not a sentence, because it is part of that larger clause.

Although a clause is a formal correlate of a proposition, it does not have to be an assertion, indicating that some event/state is true of the world or not. Instead, it could “do” other things to a proposition: you can turn it into a question, asking whether something is true (“Are you coming with me?”), or leave some participant of an event out to find out its identity (“Who are you talking to?”). Or you can order people around (“Go to hell!”), and so on. These things you can do are part of what is called illocutionary force in linguistics.

This leads us to Definition 2, which I find more intuitive (although others may disagree). A sentence is essentially whatever has a single illocutionary force: what you intend to do when you say something. A subordinate clause like “that I wanted to go” above, on its own, in that particular context, has no illocutionary force, although the whole sentence does (the force is assertive). Note that a sentence type (declarative, imperative, interrogative…) can be used for more than one type of illocutionary force. For example, you can use a declarative sentence to ask a question by raising your pitch at the end of the sentence: “You have done your homework?”

Some people’s intuitions seem to disagree with this definition, however, in cases such as the following:

  1. Hello!
  2. Kitties! (the speaker is a child, pointing at kittens)

1) has an illocutionary force of greeting, but it does not feel like a sentence to some people. 2) could mean something like “Look at those kittens!” or “these are kittens,” all of which have an illocutionary force of assertion, but the form is so different and brief that many would not think of it as a sentence.

So, is there a sentence in Thai?

With some definitions in hand, we can now turn to the original question. Let us talk about Definition 2 first, because it is easier to deal with. According to Definition 2, there is, indeed, a sentence in Thai. All of the following are sentences:

  1. ฉันจะไปตลาด
  2. หิวข้าว
  3. งู!!!
  4. สวัสดีครับ
  5. ฉันซื้อข้าวให้แม่
  6. พ่อให้ฉันไปตลาดเพื่อซื้อข้าว
  7. ฉันอยากกลับบ้าน

Turning to Definition 1, we can apply it to most of these cases, so again there is, quite trivially, a sentence in Thai in this sense. At the same time, we see that defining a clause as a formal correlate of a proposition gets us into some trouble here. Obviously, this definition excludes 4., which is not a correlate of a proposition. But is 2. a formal correlate of a proposition? We can of course understand it as having some subject (most likely the first person pronoun). But if we do allow omission of participants, then surely 3. should be a sentence too? Isn’t it just a highly elliptical version of “that’s a snake,” “here’s a snake,” or (perhaps less likely) “you’re a snake!” or some other proposition involving a snake? How much omission and reliance on context are we going to allow here? I can think of no real solutions right now.

Still, before jumping to any conclusion, let us explore the data and analysis a bit more.

Hard cases

In his influential paper on word and sentence segmentation, Professor Wirote Aroonmanakul mentions a few hard cases. We should now see how the two definitions fare with these (examples of serial verbs from Prof. Kingkarn Thepkanjana’s slides):

  • Sequential serial verbs: เขาจุดบุหรี่สูบ
  • Resultative serial verbs: เขาปาแก้วแตก

It is quite clear that the definitions can handle the two cases above. Each expresses a proposition, and has a single illocutionary force. There is no way to divide the sentences into two sentences while maintaining grammaticality thanks to the fact that one argument is shared (บุหรี่ is shared between จุด and สูบ; แก้ว is shared between ปา and แตก).

The following are more challenging:

  • Directional serial verbs: พ่อเดินตรงย้อนออกมา
  • Manner serial verbs: เขากวักมือเรียกฉัน
  • Posture serial verbs: เขายืนอ่านหนังสือ

The question here is whether each of these putative sentences is actually multiple independent clauses with omitted subjects. For example, is เขากวักมือเรียกฉัน actually เขากวักมือ (เขา)เรียกฉัน, in which case there are two sentences, not one, according to either definition?

Fortunately, there is one test linguists often employ when analyzing serial verbs: the negation test. Usually, if you negate the head of the serialized verbs (the first verb in this case), the whole sequence is negated, not only that verb. This would indicate that the verbs form a verbal complex and do not consist of multiple independent clauses. This seems to be the case here: negating the first verb negates the whole sequence.

  • พ่อไม่ [เดินตรงย้อนออกมา]
  • เขาไม่ [กวักมือเรียกฉัน]
  • เขาไม่ [ยืนอ่านหนังสือ]

Clause-like compounds are also mentioned in Prof. Wirote’s paper. I assume that he meant something like the following:

  • Clause-like compound: คนขับรถ

This structure is syntactically ambiguous. It could be a lexicalized compound, meaning a driver (an occupation); a phrase, meaning a person who drives; or a sentence, meaning “people drive cars.” To distinguish between the first two cases is very difficult, but to check whether this is a sentence is usually quite simple. Context will tell you whether it has an illocutionary force (of assertion) or not, or expresses a proposition or not.

Before moving on from Prof. Wirote’s paper, one remark is in order about the intuitive judgment of sentence status. As Prof. Wirote notes, the sentence as we intuitively understand it is likely a fuzzy, invented concept shaped by orthography. However, this is not a problem if we have a concrete definition like the ones we have here and apply it consistently enough.

Finally, let us look at a real hard case, the last boss, which is probably the following:

  • Parataxis: ฉันตื่น ไปโรงเรียน กลับบ้าน นอน

With paratactic structures like this one, it is unclear whether the parts are coordinated without a conjunction (note that many languages have no conjunctions) or are independent of each other. As discussed, since Thai sentences allow subject omission, this could be either

  • ฉันตื่น (ฉัน)ไปโรงเรียน (ฉัน)กลับบ้าน (ฉัน)นอน

which is four different sentences, with four illocutionary forces (of the same type)/propositions, or

  • ฉันตื่น ไปโรงเรียน กลับบ้าน (และ)นอน

which is a single compound sentence; or anything in between. Each of the three gaps either is or is not a sentence boundary, so there are 2 × 2 × 2 = 8 possible ways of dividing it in total.
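To make the count concrete, here is a minimal Python sketch that enumerates every way of grouping the four chunks into sentences. The chunking itself is taken as given (it simply follows the spaces in the example), and the function name `segmentations` is my own, purely for illustration.

```python
from itertools import product

# The four chunks of the paratactic example, split at the written spaces.
chunks = ["ฉันตื่น", "ไปโรงเรียน", "กลับบ้าน", "นอน"]

def segmentations(chunks):
    """Yield every way of grouping consecutive chunks into sentences."""
    if not chunks:
        return
    gaps = len(chunks) - 1
    # Each gap between two chunks either is a sentence boundary (True) or is not (False).
    for boundaries in product([False, True], repeat=gaps):
        sentences = [[chunks[0]]]
        for chunk, is_boundary in zip(chunks[1:], boundaries):
            if is_boundary:
                sentences.append([chunk])    # start a new sentence
            else:
                sentences[-1].append(chunk)  # continue the current sentence
        yield [" ".join(s) for s in sentences]

all_ways = list(segmentations(chunks))
for way in all_ways:
    print(" | ".join(way))
print(len(all_ways))  # 8
```

The two analyses spelled out above are the two extremes of this list: one sentence containing all four chunks, and four sentences of one chunk each. With n chunks there are 2^(n-1) candidates, which is why longer paratactic runs quickly become hard to annotate consistently.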

There may be detectable differences between these two if we look at sound patterns such as pause duration, but with text alone we are pretty hopeless.

Is the sentence even appropriate as a unit of analysis for Thai?

At this point you may be thinking, “What if we’re using the wrong elementary unit?” Maybe we’re looking at this the wrong way round. Maybe the sentence is not an appropriate basic unit of analysis in Thai. Maybe Thai is different from languages like English. Or maybe we got it all wrong and the sentence is not the basic unit of any language at all!?

Well, you are not alone. There are different lines of thinking in linguistics that challenge the mainstream idea that the clause/sentence is the basic unit. One is that the discourse-level unit is the basic unit, as in Functional Discourse Grammar. It would be too much to go through all the views here, but I will mention one that is particularly influential.

The idea is that Thai is a topic-prominent language. This means that the Thai sentence is based on the basic information structure of Topic-Comment, not the traditional Subject-Predicate structure. The Topic is what you are talking about, while the Comment is what you say about the Topic. For example,

  1. การบ้าน (Topic) ยังไม่เสร็จ (Comment)
  2. หัวหน้าห้อง (Topic) เลือกกันนานมาก (Comment)
  3. หนังสือเรื่องนี้ (Topic) แต่งขึ้นช่วงสามปีก่อน (Comment)

Analysts often face difficulty when encountering sentences like 3. It seems that แต่ง receives a passive-like meaning, but without any indication of passive marking. Topic-Comment structure supporters would say that the passive-like meaning is not surprising at all, because the relationship between Topic and Comment is not specific but more contextual (the aboutness relation).

A Topic-Comment analysis could be a solution to the parataxis case above. Since ตื่น, ไปโรงเรียน, กลับบ้าน and นอน all share the same Topic, we can give a bipartite analysis,

  • ฉัน (Topic) ตื่น ไปโรงเรียน กลับบ้าน นอน (Comment)

with the Comment potentially continuing until a new Topic is instantiated.
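As a rough illustration of what “continuing until a new Topic is instantiated” could mean for segmentation, here is a small Python sketch. It assumes the Topics have already been annotated by hand (finding them automatically is the hard part and is not attempted here); the function name `group_by_topic` and the input format are hypothetical.

```python
from typing import List, Optional, Tuple

# Each item is (topic, comment). A topic of None means that no new Topic
# is introduced, so the chunk continues the Comment of the current Topic.
Chunk = Tuple[Optional[str], str]

def group_by_topic(chunks: List[Chunk]) -> List[Tuple[Optional[str], List[str]]]:
    """Group chunks into Topic-Comment units, opening a new unit at each new Topic."""
    units: List[Tuple[Optional[str], List[str]]] = []
    for topic, comment in chunks:
        if topic is not None or not units:
            units.append((topic, [comment]))  # a new Topic opens a new unit
        else:
            units[-1][1].append(comment)      # same Topic: extend its Comment
    return units

# Hand-annotated input: the paratactic example, followed by example 1 above.
chunks = [
    ("ฉัน", "ตื่น"),
    (None, "ไปโรงเรียน"),
    (None, "กลับบ้าน"),
    (None, "นอน"),
    ("การบ้าน", "ยังไม่เสร็จ"),
]

units = group_by_topic(chunks)
for topic, comments in units:
    print(topic, "(Topic)", " ".join(comments), "(Comment)")
```

Under this sketch the four paratactic chunks end up in a single unit, and a boundary appears only where การบ้าน takes over as the new Topic.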

Alternatively, one might adopt a hybrid analysis that has both Topic-Comment and Subject-Predicate structures, with Subject ellipsis.

  • [หนังสือเรื่องนี้ (Topic)] [[(เขา) (Subject)] [แต่งขึ้นช่วงสามปีก่อน (Predicate)] (Comment)]

But since Topic-Comment structures are recursive, we can face a problem not found with sentences (because sentences are maximal clauses):

  • [คณิตศาสตร์ (Topic)] [[สอบกลางภาค (Topic) ทำไม่ได้เลย (Comment)] (Comment)]

Since a single Topic can host structures that are hierarchically deep and linearly long, one may end up marking very few boundaries, which is not particularly informative.
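The recursion can be pictured as a nested data structure. The following Python sketch is only an illustration: the class `TopicComment` and the helpers `depth` and `flatten` are names I made up, and the analysis is entered by hand rather than produced by any parser.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class TopicComment:
    topic: str
    # The Comment is either plain text or another Topic-Comment structure.
    comment: Union[str, "TopicComment"]

def depth(node: Union[str, TopicComment]) -> int:
    """Count how many Topic-Comment layers are nested inside each other."""
    if isinstance(node, str):
        return 0
    return 1 + depth(node.comment)

def flatten(node: Union[str, TopicComment]) -> str:
    """Return the surface string, dropping all the bracketing."""
    if isinstance(node, str):
        return node
    return node.topic + " " + flatten(node.comment)

# The recursive example above, entered by hand.
example = TopicComment("คณิตศาสตร์", TopicComment("สอบกลางภาค", "ทำไม่ได้เลย"))

print(depth(example))    # 2
print(flatten(example))  # คณิตศาสตร์ สอบกลางภาค ทำไม่ได้เลย
```

If boundaries are marked only around the outermost Topic, the whole string comes out as one unit no matter how many layers sit inside it, which is the problem described above.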

What should we do?

So, after all this discussion, what should we do? The two definitions and the Topic-Comment analysis all have problems. Does this mean we should abandon all hope? Or should we just adopt Definition 2, which seems to be least problematic?

I believe that the right question to ask in application-oriented fields such as NLP is not of the descriptive kind, like “Is there unit X in language Y?” or “Where are the word boundaries in the sequence XYZ?” Rather, it should be normative: “How should we do task X to best achieve our goals?” And yes, this means that sometimes what the theory says about words, sentences, etc. does not play much of a role.

In the case of sentence segmentation, we should focus our energy on thinking about the purposes of this task: what downstream tasks is the segmentation going to affect? What are our priorities? If we want to focus on machine translation, for example, we might want to consider using Definition 1, because it covers a narrow range of structures that have clearer analogues across languages. If we want to prioritize intention detection, we may want to use Definition 2, because it is related directly to the speaker’s intention. Or, if we want to use it for discourse-related tasks, using the Topic-Comment structure as the basis is probably not the worst idea. You may combine analyses/definitions, creating whatever criteria you see fit so that they reflect your priorities, as long as they can be applied consistently to your data.

I hope my post helps at least some of you to some extent. Thank you for reading this (surprisingly) long post.