On Justification in Pico
by Eduardo Chappa
Seattle WA
Last Updated: July 25, 2002.

Introduction
------------

This document describes how the patch for several levels of
justification works and why it works the way it does. There will
probably come a moment when I will not be able to update this patch, so
it is important to write a document that explains what the patch does,
why it does it, and how. This may also be useful to someone who wants to
implement the same type of functionality in other editors.

When a person writes a document, one would normally like all the text to
fit on the screen, without lines being wrapped unnaturally. In a way,
one would like every line to wrap naturally at the end of the screen. A
more realistic approach is to have lines wrapped at a certain specified
width. This is needed, for example, when one goes back to write text in
a line that is already full. Normally Pine will flush text that does not
fit in the line to the next one, leaving some lines short and others
long. Since text laid out like this is unpleasant to read, one would
normally like to have all lines filled up to the maximum allowed width.

If you start to fill the lines manually, you will realize immediately
that this is an algorithmic process: you add the following line to the
end of the previous line and press return at the place where the screen
ends (or the maximum width is reached), leaving you another line for
which you have to repeat the same process. It is natural for a user to
think that Pine could do this automatically, and this is the topic that
we will talk about here.

Basic Justification
-------------------

All functions related to justification can be found in the file
pico/word.c. Here we will discuss several steps towards a very
sophisticated justification algorithm. First we will see how to justify
a clean message or text, like this one, without any indent-string (a
topic which we will discuss later).
The procedure that does this is in the function "fillpara". I will only
talk about the part that does the justification; later we will talk
about other things that happen in that function. There are a few
variables that appear in this procedure:

- qlen is the length of the indent string, equal to zero in this case.
- fillcol is the maximum width allowed for a justified line.
- "c" is a character. This character is taken from a buffer that
  contains the text to be justified, and each character is obtained by
  the call "c = fremove(i++)", where "i" is the position in the buffer.
  When "c" is not a space or an end of line, it is put in a buffer
  called "word". The length of the word buffer is saved in the variable
  word_len. If "c" is a space, the variable "spaces" is increased by 1.
- line_len is the length of the line that has been written so far.
- When you get to a space, you stop filling the word buffer and check
  whether it is possible to write the word without going over the margin
  (the line overflows when line_len + word_len + 1 > fillcol). If the
  word would overflow the line, create a new line and write the word
  there; if it would not, write the word on the current line. Then keep
  reading the buffer until the end.
- Only one space is written between words, except if you already had
  more than one space after one of the characters ".?!:;\"", in which
  case two spaces are written.

The function "linsert(n,c)" inserts the character "c" "n" times, so you
will always see calls like "linsert(1, ' ')", "linsert(1, word[n])", and
so on. I'll leave it to you to read the code and see how all this is
done.

Indent-String
-------------

An indent-string is text that the user chooses to write at the beginning
of each line of quoted text to differentiate it from text that s/he is
writing. The most common indent strings are ">" and "> ". If people only
used these two strings, justification would be very simple (as compared
to the algorithm that I will describe later).
Most of the people who do not use either of these strings use the
following characters: "|)]" (no quotes included). Less common are
":%$*#-~". The patch uses the following philosophy about these
characters.

********************************************************************************
Fundamental Principle: If a character is hardly ever used at the
beginning of a word, then that character is a good character for an
indent-string.
********************************************************************************

I believe that the characters "|)]>" satisfy that condition. Other
characters like ":%$*#-~" do not necessarily satisfy it, for different
reasons. People normally use characters like these to format text; for
example, you can see above that I used the "*" character to draw
separator lines. This usage is typical of "%*#-", so what we do in the
algorithm is to discourage the use of these characters: if they appear
too often or do not follow certain rules, we do not include them in the
indent-string.

This patch adds support for ":*#-~", but not yet for "%$". It's likely
that I will add support for the character "%" in the same way that it
was added for "#", but I have not found a way to add support for the "$"
character. In fact, it seems to me that there is no good rule to
distinguish when it is used as an indent-string and when it is not. For
example:

Here it's not used as indent-string:

> $ cd mail/
> $ rm -f *
> $ cd

or

> $ ENV="This is an environment variable, don't try to justify me because I'm too long for the width of your screen!"
> $ export ENV; echo "This is the value of the variable ENV: $ENV"

Here it is used as indent-string:

$ Hello you!
$ Are you coming to my party
$ on Saturday? Can't decide what to do there yet.

For the moment, "$" is not an allowed quote character, and justifying
text that uses it as an indent-string will fail (not all hope is lost,
as we will see later).

Another interesting character for quoting is "~".
Previous versions of this patch used to consider "~" as a quote
character on the same footing as ">)]", but it took one message to show
me that I was wrong. In fact, a message which contained a line like:

:) I ~knew~ about this....

made the algorithm recognize ":) I ~" as the indent string, which is
wrong. As a consequence, the new algorithm has very strict rules about
what characters can follow a word. In particular, letters can only be
followed by other letters (up to a maximum of 3), a ">" character or a
" >" string.

The same comment about "~" applies to the character ":". This character
can be used only under very restricted rules.

I will talk about the rules that are used to determine the indent string
later. First we will concentrate on understanding how justification of
text with one level of indentation works.

Justification of quoted-text (Part I)
-------------------------------------

Again we go back to the function fillpara. First we find the quote
string of the last line of the paragraph that we want to justify. This
is done in the call

    qstr = (Pmaster && Pmaster->quote_str
            && quote_match(Pmaster->quote_str, curwp->w_dotp, qstr2, NSTRING)
            && *qstr2) ? qstr2 : NULL;

When you are calling fillpara from Pine, Pmaster is a variable created
by this call and contains the quote string. For simplicity we will
assume that Pmaster->quote_str = "> ". When this variable exists, the
quote string "qstr" is determined by the function quote_match. This is
called with several parameters:

- Pmaster->quote_str is the quote string specified by the user: "> " in
  this example.
- curwp->w_dotp is the line that is used to determine the quote string.
  When this call is made, this line is the last line of the paragraph to
  be justified.
- qstr2 is a pointer to the buffer where quote_match returns the quote
  string that it found.
- NSTRING is the maximum size of a buffer.
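The shape of that call can be illustrated with a minimal sketch. It is not the patch's code: match_user_quote and pick_quote are hypothetical names, and the sketch works on plain C strings rather than on Pico's line structure. It only assumes the contract described above, namely that the matcher returns nonzero and fills the caller's buffer when the line starts with the user's quote string.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for quote_match, working on a plain C string.
 * If `line` begins with the user's quote string `q`, copy that prefix
 * into `buf` and return 1; otherwise leave `buf` empty and return 0.
 */
int match_user_quote(const char *q, const char *line, char *buf, size_t buflen)
{
    size_t n = strlen(q);

    if (buflen == 0)
        return 0;
    buf[0] = '\0';
    /* Reject an empty quote string, one too long for buf, or a mismatch. */
    if (n == 0 || n >= buflen || strncmp(line, q, n) != 0)
        return 0;
    memcpy(buf, line, n);
    buf[n] = '\0';
    return 1;
}

/* Same shape as the call in fillpara: a usable quote string, or NULL. */
const char *pick_quote(const char *user_q, const char *line,
                       char *buf, size_t buflen)
{
    return (user_q && match_user_quote(user_q, line, buf, buflen) && *buf)
               ? buf : NULL;
}
```

With user_q set to "> ", a line such as "> hello" yields a pointer to the buffer holding "> ", while an unquoted line or a missing user quote string yields NULL, which the caller can treat as "no indent string".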
For the sake of simplicity we will assume that qstr2 was returned as
"> ", so that qlen = strlen(qstr) = 2. Then the same algorithm described
in the "Basic Justification" section will be executed, except for some
points:

* When you get to the end of a line in "c = fremove(i++)", the position
  "i" in that buffer skips the quote string by means of "i += qlen".
  Before, it just continued.
* When you check whether you can write a word and determine that there
  is not enough space to write it in the current line, you end the
  current line, add a new line, add the quote string "> " and write the
  word after that. This is done in the lines:

      if(line_len && line_len + word_len + 1 > fillcol
         && (line_len = fpnewline(qstr)))

  (fpnewline adds the line beginning with qstr, and line_len is set to
  the length of qstr, which fpnewline returns). The code following the
  above code adds the word to the paragraph.

As you can see, this is a very general procedure, so in terms of
deciding where to add code that recognizes more general quote strings,
it's clear that we must put the intelligence in the function
quote_match. We will describe what happens in this function in several
parts too.

The Function quote_match (Part I)
---------------------------------

Initially this function was very simple. It copied the contents of the
given line to a buffer and sent this buffer to the function "is_quote",
which returned the position where the indent-string ends. It then sent
that answer back to the calling function, writing the "qstr2" buffer
accordingly. Below is a simple implementation of this function:

    for (i = 0; (i < NSTRING) && (i

[...] ]" is a quote character and can appear as many times as wanted in
   the quote string.
2. The indent-string given by the user (if any) is always valid, even if
   it would not validate otherwise.
3. Special characters like "*-#~" follow special rules:
   a) "*" is only valid if followed by a quote character as above.
   b) "-" is valid only if it appears followed by ">" or "->". It can't
      follow a word.
   c) "#" can't follow a word, and can only be followed by a quote
      character, a word, a dash, or another "#".
   d) "~" and ":" can only be followed by a quote character or a space.
4. Words follow rules too: they can't be longer than 3 characters, can't
   be at the end of a quote string, and if they are present in a quote
   string, they must be followed by a space or a space followed by the
   character ">".
5. There can be spaces in the middle of an indent-string, but only if
   they are followed by indent strings too. Examples: ":) > ABC> :)" is
   allowed, but ":) > ABC> Hello" is not allowed. In this case the
   indent-string is ":) > ABC> ".
6. If you find a space at the end of an indent string, then put it in
   the indent-string; if there's no space, do not add one.

The algorithm here is as follows. We start from the beginning of the
line and advance past the indent-string given by the user. We do this by
calling "advance_indent_string". After we have advanced as much as
possible with that function, we see if we can advance one more character
by applying the rules above. If we can, then we do it, and repeat the
same process all over again (we call advance_indent_string again!). And
so on. This function always returns at most one space at the end of a
quote string.

The Function quote_match (Part II)
----------------------------------

All this is great so far. We could end the search for a quote string
right now; after all, we have given all the necessary steps to complete
that task. However, there are some problems. Consider the following
paragraph:

> Hello you!
>Are you coming to my party
>on Saturday?

The first line has quote string "> ", the second ">" and the third ">",
so from the point of view of the algorithm, there are two paragraphs
(paragraphs are determined by a change in the quote string or by blank
lines).
In particular, justifying this paragraph does not have the effect that
is wanted.

One way to solve this is by saying that the quote string never contains
a space at the end. In that way the above paragraph would be justified
correctly. The problem with this is that the following paragraph would
not be justified at all

> Hello you!
> Are you coming to my party
> on Saturday?

since it would be considered as three paragraphs, and each line would be
justified separately.

In other words, there's no way to win in this situation. If you add the
space, you make a mistake; if you don't, you make one too. However, the
second mistake seems more serious to me than the first one, because in
the second case you end up with more lines justified wrongly and much
more to fix than in the first case, where you would have to fix the
indentation only once. So, although it's not always correct, we had
better insist on returning the space at the end if it is present.

What quote_match can do for us now is to make an extra test. Notice that
if we could remove the space at the end we would be fine. This is done
by testing the line following the given line (this is found by computing
lforw(l)). If the following line does not have a space at the end of the
indent string, then we back up one space and get rid of the space at the
end. The code that does this is (the variable "c" is the value to be
returned by the function):

    c = is_quote(GLine);          /* Current or Given line */
    if (n) n = is_quote(NLine);   /* Next Line */
    if ((c == (n + 1)) && (GLine[n] == ' ') && NLine[n])
        c--;                      /* delete last space! */

That's all that is needed (notice that we do not check that the quote
strings of both lines are equal! This will be explained later).

Then another idea came: why not add support for justification of
paragraphs that have been indented with spaces to their left, something
like

:)     This paragraph was indented to its left by lots
:)     of spaces, and I am writing it to show you an example
:)     of one.
The function is_quote would return ":) " as the indent string, and this
paragraph would be treated as three paragraphs. We already know that
this is bad, so we need a fix. Again the fix consists of checking
whether, when one advances past the spaces, one gets the same indent
string as in the next line. In the case of the last line of the
paragraph we can't do this, so what we do instead is to check the
previous line. The code that does this is the following:

    i = c;
    if (GLine[c] && ((GLine[c] == ' ') || (GLine[c] == TAB)))
        for (i = c; (GLine[i] == ' ') || (GLine[i] == TAB); i++);
    if (i > c){   /* More than one space after the indent string? */
        j = n;
        if (NLine[n] && ((NLine[n] == ' ') || (NLine[n] == TAB)))
            for (j = n; (NLine[j] == ' ') || (NLine[j] == TAB); j++);
        if (c == n){
            if (i == j){   /* same number of spaces */
                c = i;
            }
            else
                c = prev_line_string(l, c, i);
        }
        else
            c = prev_line_string(l, c, i);
    }

Here prev_line_string is documented by comments in the patch; it fixes
"c", if there's a need to fix it, using the line previous to the given
line, as opposed to using the next line (the use of the next line is in
the assignment "c = i;").

Justification of quoted-text (Part II)
--------------------------------------

As we have seen, we have introduced changes in the quote_match function
that adjust the result handed back by is_quote based on nearby text.
Because of these changes, there is now a possibility that the quote
string may contain TABs. The effect of this is that calls like

    qlen = strlen(qstr);
    line_len = fpnewline(qstr)

are not accurate anymore. We need accuracy in these calls, and this is
achieved by adding a call "strlenis(qstr)", which gives the real length
of the indent string qstr when displayed on screen. TABs are accounted
for.

If you do the recount in the right place, and use strlenis where you
should, you should not have any problem adding TABs to indent-strings.
The way to do it is in the patch.
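To make the difference between strlen and an on-screen length concrete, here is a minimal sketch of such a measurement. It is not the patch's strlenis: display_width is a hypothetical name, and the sketch assumes that the string is printed starting at column 0 and that tab stops fall every 8 columns.

```c
#include <stddef.h>

#define TAB_WIDTH 8   /* assumed spacing between tab stops */

/*
 * Hypothetical helper: width of s on screen when printed from column 0.
 * Every character counts as one column, except a TAB, which advances
 * the column to the next tab stop.
 */
size_t display_width(const char *s)
{
    size_t col = 0;

    for (; *s != '\0'; s++) {
        if (*s == '\t')
            col += TAB_WIDTH - (col % TAB_WIDTH);   /* jump to next stop */
        else
            col++;
    }
    return col;
}
```

Under these assumptions the indent string "> " measures 2 columns, the same value strlen gives, while ">" followed by a TAB measures 8 columns even though strlen reports 2. That gap is why a line length based on strlen stops being accurate once TABs are allowed in the quote string.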
TO DO
-----

* I am looking into the possibility of accepting words of any length in
  an indent string, as long as they are followed by ">" or " >". This
  part of the code is still experimental and has not been released.
* I would like to explore the possibility of adding support for
  justification of paragraphs that are structured in a different way,
  like the ones on this list. What distinguishes these paragraphs is the
  fact that their first line is indented with a special character. I'll
  see if I can do that easily.