Grammar development using ANTLR4

With ANTLR4 set up on our machine, we can start writing the grammar. ANTLR4 grammar files use the .g4 extension, for example Grammar.g4. There are some best practices we should follow, such as splitting the grammar into two files:

Lexer file: This file contains all the tokens, such as keywords and constant values, that the grammar file will use. It also contains the regular expressions for text, the rules that skip comments and white space, and all the fragments.
Parser file: This file holds all the rules and productions, and it should import the lexer file.

Let's take an example:

Create a folder on any drive and name it ANTLR4-EXAMPLE. Inside this folder, create two files: GrammarLexer.g4 (the lexer file) and GrammarParser.g4 (the parser file).

For better readability, install the ANTLR4 grammar syntax support extension for VS Code.

GrammarLexer.g4

lexer grammar GrammarLexer;
// Skip Sections
WHITESPACE: [ \r\f] -> skip;
WS: [ \t\n] -> channel(HIDDEN);
MULTI_LINE_COMMENT: '/*' .*? '*/' -> skip;
SINGLE_LINE_COMMENT: '--' ~[\r\n]* ('\r'? '\n' | EOF) -> skip;

// Keywords (should come before regular expressions to resolve ambiguity in favour of keywords)
K_CREATE: C R E A T E;
K_ALTER: A L T E R;
K_DATA: D A T A;
// Symbols (alphabetically sorted)
BRACKET_CLOSE: ']';
BRACKET_OPEN: '[';
N_QUOTE: 'N\'';
SINGLE_QUOTE: '\'';
DOUBLE_QUOTE: '"';
SEMICOLON: ';';
COMMA: ',';
DOT: '.';
// Regular Expressions
BRACKETED_NAME1: (BRACKET_OPEN ~[\]]* BRACKET_CLOSE) NAME (
		BRACKET_OPEN ~[\]]* BRACKET_CLOSE
	);
BRACKETED_NAME2: (BRACKET_OPEN ~[\]]* BRACKET_CLOSE) (
		DOT BRACKET_OPEN ~[\]]* BRACKET_CLOSE
	)? NAME;
BRACKETED_NAME3: NAME (BRACKET_OPEN ~[\]]* BRACKET_CLOSE) NAME;
BRACKETED_NAME4:
	NAME (BRACKET_OPEN ~[\]]* BRACKET_CLOSE DOT)? (
		BRACKET_OPEN ~[\]]* BRACKET_CLOSE
	);
BRACKETED_NAME5: (BRACKET_OPEN ~[\]]* BRACKET_CLOSE DOT)* (
		BRACKET_OPEN ~[\]]* BRACKET_CLOSE
	);
INTEGER: [+\-]? DIGIT+;
NUMBER: [+\-]? DIGIT+ (DOT DIGIT*)? (E [+\-]? DIGIT*)?;
NAME: (NAME_CHAR | DOT)+;
N_QUOTED_NAME: N_QUOTE ~[']* SINGLE_QUOTE;
SINGLE_QUOTED_NAME: SINGLE_QUOTE ~[']* SINGLE_QUOTE;
DOUBLE_QUOTED_NAME: DOUBLE_QUOTE ~["]* DOUBLE_QUOTE;
// Fragments
fragment DIGIT: [0-9];
fragment NAME_CHAR: ([0-9a-zA-Z] | [_@$#]);
fragment ESC_CHAR: '\\' (["\\/bfnrt] | UNICODE);
fragment UNICODE: 'u' HEX HEX HEX HEX;
fragment HEX: [0-9a-fA-F];
fragment A: [aA];
fragment B: [bB];
fragment C: [cC];
fragment D: [dD];
fragment E: [eE];
fragment F: [fF];
fragment G: [gG];
fragment H: [hH];
fragment I: [iI];
fragment J: [jJ];
fragment K: [kK];
fragment L: [lL];
fragment M: [mM];
fragment N: [nN];
fragment O: [oO];
fragment P: [pP];
fragment Q: [qQ];
fragment R: [rR];
fragment S: [sS];
fragment T: [tT];
fragment U: [uU];
fragment V: [vV];
fragment W: [wW];
fragment X: [xX];
fragment Y: [yY];
fragment Z: [zZ];
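
The comment about placing keywords before the regular expressions relies on how the ANTLR4 lexer chooses a token: the longest match wins, and when two rules match the same length, the rule defined first wins. The Python sketch below imitates that behaviour with regular expressions. It is only an illustration, not how ANTLR4 is implemented; `RULES` and `next_token` are hypothetical names, and only three of the rules above are modelled.

```python
import re

# Simplified stand-ins for three lexer rules, listed in grammar order.
# The [cC][rR]... pattern mirrors K_CREATE: C R E A T E built from fragments.
RULES = [
    ("K_CREATE", re.compile(r"[cC][rR][eE][aA][tT][eE]")),
    ("K_DATA", re.compile(r"[dD][aA][tT][aA]")),
    ("NAME", re.compile(r"[0-9a-zA-Z_@$#.]+")),
]


def next_token(text, pos=0):
    """Return (rule_name, lexeme) for the best match starting at pos."""
    best = None
    for name, pattern in RULES:
        match = pattern.match(text, pos)
        # Only a strictly longer match replaces the current best,
        # so on a tie the rule listed first is kept.
        if match and (best is None or len(match.group()) > len(best[1])):
            best = (name, match.group())
    return best


print(next_token("CREATE"))   # keyword and NAME tie on length
print(next_token("created"))  # NAME matches one character more
```

In this sketch "CREATE" comes out as K_CREATE because the keyword rule is listed first, while "created" comes out as NAME because that match is longer. If NAME were listed before the keywords, every keyword would be tokenized as NAME instead.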

GrammarParser.g4

grammar GrammarParser;
// import the GrammarLexer
import GrammarLexer;
// stream is starting point of grammar.
stream: (sqlStatement SEMICOLON)+ EOF;
sqlStatement: createStatement | alterStatement;
createStatement: createData;
alterStatement: alterData;
//createData Productions.....
createData: K_CREATE K_DATA objectName dataOption?;
dataOption: (NAME | SINGLE_QUOTED_NAME);
//alterData Productions......
alterData: K_ALTER K_DATA objectName dataOptionChange;
dataOptionChange: dataOption (COMMA dataOption)*;
//General Productions
objectName:
	NAME
	| SINGLE_QUOTED_NAME
	| BRACKETED_NAME1
	| BRACKETED_NAME2
	| BRACKETED_NAME3
	| BRACKETED_NAME4
	| BRACKETED_NAME5
	| DOUBLE_QUOTED_NAME
	| N_QUOTED_NAME;

Now that both the lexer and parser files are ready, it is time to generate the parse tree.
Open the VS Code terminal and first process the lexer file with the command below.

> antlr4 GrammarLexer.g4

ANTLR4 has now generated a .tokens file, which maps each token name to a specific number.
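
As a rough illustration of that mapping, the Python sketch below reads lines of the `NAME=number` form used by .tokens files into a dictionary. The sample text and its numbers are made up for illustration and are not copied from a real run, and `parse_tokens` is a hypothetical helper.

```python
def parse_tokens(text):
    """Map each token name (or quoted literal) to its numeric token type."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last '=' so that a literal such as '=' still parses.
        name, _, number = line.rpartition("=")
        mapping[name] = int(number)
    return mapping


# Illustrative sample only; in practice, read GrammarLexer.tokens from disk.
SAMPLE = """\
K_CREATE=5
K_ALTER=6
K_DATA=7
';'=13
"""

print(parse_tokens(SAMPLE))
```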

After the lexer file, we process the parser file with the command below.

> antlr4 GrammarParser.g4

Here ANTLR4 generates all the required Java files, such as the Listener, BaseListener, Lexer and Parser. ANTLR4 is a Java tool, which is why it generates .java files by default.

Now we can pass any input data to the generated lexer and parser, and they will produce a parse tree.

input.txt

create data 'data1' ;
create data 'data2' abc098777;
alter data  'data1' abc098778;
alter data  'data1' abc098778,'abc098779',a898999988;

Pass the above .txt file to ANTLR4 with the command below.

> antlr4-parse .\GrammarParser.g4 stream input.txt -gui

GrammarParser.g4 --> parser grammar file
stream --> start rule of the parser grammar (refer to GrammarParser.g4)
input.txt --> input file

Parse tree: After executing the above command, ANTLR4 displays the parse tree for the input.

This is how we can use ANTLR4 to parse any text or script and generate a parse tree. The parse tree can then be used in our code base, where we traverse all the nodes of the tree and process them.
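
To make the idea of traversing the tree concrete, here is a small Python sketch that walks a hand-built tree shaped like the one for the first input statement, depth first. The nested-tuple representation and the `walk` and `object_names` helpers are assumptions for illustration only; they are not the ANTLR4 runtime API, which has its own tree classes and the generated listener.

```python
# A node is (rule_name, [children]); a leaf is a plain string (token text).
# Hand-built to mirror: create data 'data1' ;
TREE = ("stream", [
    ("sqlStatement", [
        ("createStatement", [
            ("createData", [
                "create",
                "data",
                ("objectName", ["'data1'"]),
            ]),
        ]),
    ]),
    ";",
    "<EOF>",
])


def walk(node, depth=0):
    """Yield (depth, label) for every node in depth-first order."""
    if isinstance(node, str):
        yield depth, node
        return
    rule_name, children = node
    yield depth, rule_name
    for child in children:
        yield from walk(child, depth + 1)


def object_names(node):
    """Collect the token text found under every objectName node."""
    if isinstance(node, str):
        return []
    rule_name, children = node
    if rule_name == "objectName":
        return [child for child in children if isinstance(child, str)]
    found = []
    for child in children:
        found.extend(object_names(child))
    return found


for depth, label in walk(TREE):
    print("  " * depth + label)
print(object_names(TREE))
```

The same pattern, visiting every node and reacting only to the rules we care about, is what the generated Listener and BaseListener classes are meant to support.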

We can also write the grammar directly in the ANTLR4 lab for practice: antlr4-lab
