Cover parser combinator usage in brhoades/marko-hs
Text.ParserCombinators.ReadP
(marko-hs IRC parser)
ByteStrings
Core types
type Parser = Parser ByteString
type Result = IResult ByteString
data IResult i r =
Fail i [String] String -- parse failed
| Partial (i -> IResult i r) -- continue feeding input
| Done i r -- complete
Starting functions
-- If you want to feed more data in chunks
parse :: Parser a -> ByteString -> Result a
-- Take one string, return result or error
parseOnly :: Parser a -> ByteString -> Either String a
Word8
s are unsigned 8-bit integersByteStrings
are Word8
arrays in a whole grain compiler wrapdata {-# CTYPE "HsWord8" #-} Word8 = W8# Word# -- :s
data ByteString = \
PS {-# UNPACK #-} !(ForeignPtr Word8) -- payload
{-# UNPACK #-} !Int -- offset
{-# UNPACK #-} !Int -- length
deriving (Typeable)
pack :: [Word8] -> ByteString
unpack :: ByteString -> [Word8]
-- Getting to/from Word8 is most easily done
-- by massaging a Char into something with Num
toW :: (Num c, Enum a) => a -> c
toW = fromIntegral . fromEnum
> :t pack [toW 'w']
pack [toW 'w'] :: ByteString
toBS str = pack $ map toW str
> :t toBS "Hello!"
toBS "Hello!" :: ByteString
-- Source: A dog jumps over the log! The log was very long.
-- We want:
["A", "dog", "jumps", "over", "the", "log", "!",
"The", "log", "was", "very", "long", "."]
-- So:
parseWords :: Parser [ByteString]
-- Which'll be easier if we find out how to parse a single word first
parseWord :: Parser ByteString
Attoparsec takes Word8 -> Bool
functions to do matching work. We’ll need those to determine when we hit a “separator” or “punctuation”.
splitter = (== toW ' ')
endline = \x -> x == (toW '\n') || x == (toW '\r')
punct = \x -> x == (toW '.')
|| x == (toW '!')
|| x == (toW '?')
wordTerminator = \x -> (splitter x || endline x || punct x)
Attoparsec provides utility functions to do this, but they’re slow. Documentation explicitly advises against their use.
inClass :: String -> Word8 -> Bool
notInClass :: String -> Word8 -> Bool
vowel = inClass "aeiou"
probablyConsonant = notInClass "aeiou"
Parser
MonadParser
wraps your parsing functions. It carries parsed input and unused input forward as you consume it.
parseWords :: Parser [ByteString]
parseWords = undefined -- soon!
parseOnly :: Parser a -> ByteString -> Either String a
-- We will implement parseWord
parseWords :: ByteString -> [ByteString]
parseWords str = case parseOnly parseWords str of
Left _ -> error "Parser error"
Right c -> c
parseWord
takeWhile :: (Word8 -> Bool) -> Parser ByteString
-- Input: toBS "abc 123 monad"
-- Output: ["abc", "123", "monad"]
-- Start with parsing individual words termed by spaces/punc
parseWord :: Parser ByteString
parseWord = do
res <- takeWhile1 (not . wordTerminator)
return res
Word8
matcher (predicate) for punctuation, spaces, or newlinestakeWhile1
requires a match and consumes input while the predicate is satisfied
takeWhile
allows no matchparseWord
parseWord :: Parser ByteString
parseWord = do
res <- takeWhile1 (not . wordTerminator)
return res
> parseOnly parseWord (toBS "Hello there!")
Right "Hello"
Wow, way cool
There’s some nasty edge cases if you begin your words with a splitter.
> parseOnly parseWord (toBS " Hello there!")
Left "Failed reading: takeWhile1"
We don’t want these separators in our words anyway, so let’s skip them.
skipWhile :: (Word8 -> Bool) -> Parser ()
parseWord = do
skipWhile splitter -- (== toW ' ')
res <- takeWhile1 (not . wordTerminator)
skipWhile splitter -- another edge case, dw about it
return res
> parseOnly parseWord (toBS " Hello there!")
Right "Hello" -- 😂
skipWhile
is a takeWhile
that discards. Not skipWhile1
because if there are no spaces, that’s fine.
We need to parse words over and over. We have the inner bit done but how do we chain them?
manyTill :: Alternative f => f a -> f b -> f [a]
satisfy :: (Word8 -> Bool) -> Parser Word8
parseWords :: ByteString -> [ByteString]
parseWords = do
skipWhile (\x -> endline x || splitter x)
manyTill parseWord (satisfy endline)
manyTill
calls one parser until the other consumes inputYes!
> parseOnly parseWords (toBS "Hello there\n")
Right ["Hello","there"]
But not for punctuation
> parseOnly parseWords (toBS "Hello there!\n")
Left "Failed reading: takeWhile1"
-- inside parseWord:
takeWhile1 (not . wordTerminator)
-- Input remaining: "Hello there!"
-- First call: "Hello"
-- Input remaining: "there!"
-- Second call: "There"
-- Input remaining: "!"
-- Third call: ""
-- Error: takeWhile1 requires at least one Word8 match
Punctuation is an important part of a sentence. We should consume it and return it if it’s there.
-- We need to call this if there's punctuation
-- but takeWhile1 requires 1 punctuation
takeWhile1 punct
-- \x -> x == (toW '.') || x == (toW '!') || x == (toW '?')
choice :: Alternative f => [f a] -> f a
parseWord :: Parser (ByteString)
parseWord = do
skipWhile splitter
res <- choice [takeWhile1 (not . wordTerminator),
takeWhile1 punct]
skipWhile splitter
return res
Now we can parse a word that’s only punctuation.
> parseOnly parseWords (toBS "Hello there!\n")
Right ["Hello","there","!"]
> parseOnly parseWords (
toBS "Hello there! How are you doing?\n"
)
Right ["Hello","there","!","How","are","you","doing","?"]
Perfect, it’s all separated now. This is (mostly) how marko-hs uses attoparsec to parse sentences into words.
ByteString
s are very memory efficientByteString
s is faster than parsing into Data.Text
ReadP
ByteString
s activated my C PTSDHere’s the bare essentials of IRC that our bot needs:
Examples of each:
PING irc.freenode.net
:irc.freenode.net MODE <umask> +b
:irc.freenode.net PRIVMSG <umask> #monads :guys...
Marko uses an event-based approach for parsing IRC
data Event = MessageEvent Source Message
| ModeEvent User String
| Ping Server
-- Parser gets events
ircParser :: ReadP Event
Main bot loops does stuff with Events:
-- Ignore the man behind the curtain. See that Event?!
handleEvent :: StdGen -> ChainData -> ChainData
-> Event -> Maybe (String, String)
ReadP
: Same great taste, ByteString-freeIRC parsing was done with base Text.ParserCombinators.ReadP
Network.Simple.TCP.TLS
likes Data.Text
Major differences
ReadP
is like the Parser
MonadtoW
/toBS
takeWhile
becomes munch
readP_to_S
is like parseOnly
, sans Result
PING irc.freenode.net
:irc.freenode.net MODE <umask> +b
:irc.freenode.net PRIVMSG <umask> #monads :guys...
There’s clearly two formats here: PING and not PING. This looks like choice
.
ircParser :: ReadP Event -- ReadP = Parser
ircParser = choice [otherMessages, ping]
Bam, that was easy.
Ping is easy too.
PING irc.freenode.net
Let’s start by defining some predicates:
eol :: Char -> Bool
eol = \x -> (x == '\n' || x == '\r')
eolws :: Char -> Bool
eolws = \x -> (x == '\n' || x == '\r' || x == ' ')
Some utility functions too. These just gobble arbitrary input.
_char :: Char -> ReadP ()
_char x = char x >> return ()
_string :: String -> ReadP ()
_string x = string x >> return ()
Ping has two parts:
So we have two things to parse:
ping :: ReadP (Maybe Event)
ping = do
-- get rid of leftover whitespace
skipMany $ satisfy eolws
-- 1) Discard PING.
_string "PING "
-- 2) Grab everything else (server name)
-- munch (not . eolws)
server <- consumeEOLWS
-- A Ping Event for our event-based bot
Just $ Ping server
ircParser :: ReadP (Maybe Event)
ircParser = choice [otherMessages, ping]
otherMessages starts off like PING. Get the event type and parse each event differently.
otherMessages :: ReadP (Maybe Event)
otherMessages = do
skipMany $ satisfy eolws
-- ':' is on these; if this fails, choice runs "ping"
_char ':'
server <- munch1 (not . eolws) -- munch1 is takeWhile1
_char ' '
eventType <- munch1 (/= ' ')
_char ' '
getEvent server eventType
:irc.freenode.net MODE <umask> +b
:irc.freenode.net PRIVMSG <umask> #monads :guys...
getEvent :: String -> eventType -> ReadP (Maybe Event)
getEvent server x
| x == "PRIVMSG" = handleMSG server
| x == "MODE" = do
-- Very easy. Get a umask, get the mode.
umask <- munch1 (not . eolws)
let user = parseUser umask
_char ' '
mode <- munch1 (not . eolws)
return $ Just $ ModeEvent user mode
| otherwise = do
-- Unsupported IRC event
_ <- consumeEOL
return Nothing
handleMSG
:irc.freenode.net PRIVMSG <umask> #monads :guys...
handleMSG :: String -> ReadP (Maybe Event)
handleMSG user = do
-- handles parsing umasks; boring
let nick = parseUser user
channel <- munch1 (/= ' ')
_string " :"
msg <- consumeEOL
return $ Just $ MessageEvent ("", channel, Just $ nick) msg
That’s all the IRC parsing that you need to get into a server. It’s incredibly simple. Just hook it up to your HTTP socket:
parseIncoming :: String -> Maybe Event
parseIncoming str = do
-- readP_to_S is parseOnly that includes all the intermediate steps.
-- unreducedEvents :: [(Event, String)]
let unreducedEvents = readP_to_S ircParser str
if length unreducedEvents > 0 then
-- We just want the last result's event
Just $ fst . last $ unreducedEvents
else
Nothing
Want more parsing?