2019-12-12 - Progress
As part of my work on superglue I have resumed work on the WebDriver scripts I started in January. Predictably, because they were a barely working mess, it took me a while to remember how to get them working again.
So I thought it might be worth writing a little tutorial describing how I am using WebDriver. These notes have nothing to do with my scripts or the DNS; they are just about the logistics of scripting a web site.
- What you will learn
- What you should know
- What you don't need to know
- Start
- Configuring a session
- Ending a session
- My dev setup
- Locating elements
- Client code helpers
- Error checking
- Reading the page
- Filling forms
- Other gotchas
- That's it!
What you will learn
WebDriver: the standard remote control protocol for web browsers, originating in but now somewhat separate from the Selenium project.
How to use geckodriver to automate Firefox.
What you should know
Scripting JSON-over-HTTP: use the programming language of your choice, so long as you have convenient libraries for REST-flavoured web APIs.
I'm going to use the command-line program HTTPie in the examples because it makes ad-hoc experiments pretty easy.
HTML: you need to be comfortable looking at the source code of web pages.
CSS selectors: you need to be able to write CSS selectors to pick the web page elements you want your script to act on.
XPath: sometimes CSS selectors aren't powerful enough, so it's helpful to be able to write XPath queries or at least navigate this XPath cheat sheet.
Firefox dev tools: the web page inspector makes this work so much easier. (The other tools are not so relevant.)
What you don't need to know
- JavaScript
- Selenium
- node.js
- webdriver.io
A lot of the existing web browser automation ecosystem is oriented around testing (specifically Selenium and the node.js framework webdriver.io), but my purpose is to script web sites that don't provide the APIs I need.
Start
Get Firefox if you don't already have it.
Download a copy of geckodriver for your system, unpack it, and copy it to ~/bin or some other suitable place on your $PATH.
geckodriver is a proxy between the standard WebDriver protocol and Firefox's less convenient native "marionette" remote-control protocol.
In a terminal window, run geckodriver. It will sit there waiting for something to happen. Keep the terminal open; geckodriver will use it for logging.
geckodriver's default WebDriver endpoint is a web server running on localhost port 4444. Open a second terminal window and start a WebDriver session by running:
    $ echo '{}' | http POST http://localhost:4444/session
    HTTP/1.1 200 OK
    content-type: application/json; charset=utf-8

    {
        "value": {
            "capabilities": {
                ... snip ...
            },
            "sessionId": "570b8399-bc01-2745-b37b-ed6c641156b3"
        }
    }
geckodriver should start a new copy of Firefox with an ephemeral profile (so it won't have your cookies or history or settings or extensions etc.). The address bar will have a stripey orange background and a little picture of a robot so you know it is being automated.
HTTPie prints a JSON response containing a lot of information about the browser. The important part is the session ID, like
"sessionId": "570b8399-bc01-2745-b37b-ed6c641156b3"
All the actions you perform on the browser will be associated with this session by using a URL prefix like
http://localhost:4444/session/570b8399-bc01-2745-b37b-ed6c641156b3
This URL is really long so let's call it $wds for "WebDriver session".
    sessionId=570b8399-bc01-2745-b37b-ed6c641156b3
    wds=http://localhost:4444/session/$sessionId
Now, make the browser navigate to a URL with the command:
    $ http -v POST $wds/url url=http://www.dns.cam.ac.uk
    POST /session/570b8399-bc01-2745-b37b-ed6c641156b3/url HTTP/1.1
    Host: localhost:4444
    Content-Type: application/json

    {
        "url": "http://www.dns.cam.ac.uk"
    }

    HTTP/1.1 200 OK
    content-type: application/json; charset=utf-8

    {
        "value": null
    }
If you see this purple web site then you have started scripting a browser!
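The same two requests can be made from any language with an HTTP client. What follows is a minimal Python sketch using only the standard library, assuming geckodriver is listening on localhost port 4444 as above. The helper names (session_url, build_request, send, start_and_navigate) are invented for illustration and are not part of WebDriver or geckodriver.

```python
import json
import urllib.request

BASE = "http://localhost:4444"  # geckodriver's default endpoint


def session_url(session_id, base=BASE):
    """Return the URL prefix for a session (what the text calls $wds)."""
    return f"{base}/session/{session_id}"


def build_request(method, url, body=None):
    """Build a JSON-over-HTTP request; a POST with no body sends {}."""
    data = None
    headers = {}
    if method == "POST":
        data = json.dumps({} if body is None else body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=data, headers=headers, method=method)


def send(method, url, body=None):
    """Send a request and return the "value" member of the JSON response."""
    with urllib.request.urlopen(build_request(method, url, body)) as resp:
        return json.load(resp)["value"]


def start_and_navigate(url):
    """Start a session and load a page (needs a running geckodriver)."""
    value = send("POST", f"{BASE}/session", {})
    wds = session_url(value["sessionId"])
    send("POST", f"{wds}/url", {"url": url})
    return wds
```

Calling start_and_navigate("http://www.dns.cam.ac.uk") should have the same effect as the two HTTPie commands above, and returns the session URL for later requests.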
Configuring a session
As you have seen, it is easy to start a session with the default settings. Normally when starting a session I use options like:
    {
        "capabilities": {
            "alwaysMatch": {
                "timeouts": {
                    "implicit": 2000,
                    "pageLoad": 60000
                },
                "moz:firefoxOptions": {
                    "args": [ "-headless" ]
                }
            }
        }
    }
The "implicit" timeout is to do with waiting for page elements to appear. By default it is 0 milliseconds, but I set it to 2 seconds. I am not convinced this is as helpful as I hoped because I have still had to write code that polls the browser waiting for JavaScript to finish faffing around.
The "pageLoad" timeout is by default 300000 milliseconds (5 minutes) which is ridiculous. I have set it to 60 seconds which is still a lot more generous than should be necessary.
I normally leave the "moz:firefoxOptions" member out, because I'm normally doing interactive development and I need to see what my script is doing. But this example shows how a fully-automated and operational script would start a session. (Annoyingly, geckodriver returns a "moz:headless" capability, but it doesn't accept it in requests, so we have to send it a longer version.)
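A script that switches between interactive and headless runs can build this request body from a flag. The following Python sketch shows one way to do it; the function name new_session_body and its parameters are hypothetical, and the default timeouts are simply the values used above.

```python
import json


def new_session_body(headless=False, implicit_ms=2000, page_load_ms=60000):
    """Build the JSON body for the POST /session request."""
    always = {"timeouts": {"implicit": implicit_ms, "pageLoad": page_load_ms}}
    if headless:
        # Headless mode is requested as a Firefox command-line argument.
        always["moz:firefoxOptions"] = {"args": ["-headless"]}
    return {"capabilities": {"alwaysMatch": always}}


# Show the body a fully automated script would send.
print(json.dumps(new_session_body(headless=True), indent=4))
```

Leaving headless at its default omits the "moz:firefoxOptions" member entirely, which matches the interactive development case.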
Ending a session
It's best not to quit Firefox or kill geckodriver when there is an active session because it's possible to leave remnants of the ephemeral browser profile cluttering up your disk. Instead, delete the WebDriver session as follows, which quits the browser and deletes its ephemeral profile. (I'm including a reminder of what $wds is short for - your sessionId will be different!)
    $ sessionId=570b8399-bc01-2745-b37b-ed6c641156b3
    $ wds=http://localhost:4444/session/$sessionId
    $ http DELETE $wds
    HTTP/1.1 200 OK
    content-type: application/json; charset=utf-8

    {
        "value": null
    }
Once the session is deleted, you can start a new one re-using the same geckodriver (but you can't have multiple concurrent sessions). Or you can safely kill an idle geckodriver which has no active session.
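In a script, one way to make sure the session is always deleted, even when something goes wrong part-way through, is to wrap it in a context manager. This is a Python sketch under stated assumptions: send(method, url, body) is a hypothetical callable that performs the HTTP request and returns the "value" member of the response, and webdriver_session is an invented name.

```python
from contextlib import contextmanager


@contextmanager
def webdriver_session(send, base="http://localhost:4444", body=None):
    """Start a session, yield its URL prefix, and always delete it afterwards.

    `send(method, url, body)` is a hypothetical helper that makes the
    HTTP request and returns the "value" member of the JSON response.
    """
    value = send("POST", f"{base}/session", body or {})
    wds = f"{base}/session/{value['sessionId']}"
    try:
        yield wds
    finally:
        # Deleting the session quits Firefox and removes its ephemeral profile.
        send("DELETE", wds, None)
```

The body of a with webdriver_session(send) as wds: block can then raise freely; the DELETE request is still sent on the way out.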
My dev setup
When I am writing a script to control a web site, I work with several windows:
Firefox under control of geckodriver (not in headless mode), for seeing what my script does to the web page
Firefox web page inspector, for working out the CSS selectors for the HTML elements I want to manipulate (this can be docked as part of the main browser window but I prefer to separate it)
An editor window for writing my script
A terminal window for running my script and logging a trace of the WebDriver protocol JSON messages, or for experiments with HTTPie
Another terminal window where geckodriver chatters (this is less informative and not necessary to keep visible)
Locating elements
Most WebDriver interaction consists of pairs of HTTP requests:
locate an element
do something with the element
The WebDriver protocol has several ways to locate elements:
css selector
link text
partial link text
tag name
xpath
Let's try an example:
    $ http -v $wds/element using='link text' value='About this site'
    POST /session/c33be620-65b5-6944-bc41-cff38a372823/element HTTP/1.1
    ... headers ...

    {
        "using": "link text",
        "value": "About this site"
    }

    HTTP/1.1 200 OK
    ... headers ...

    {
        "value": {
            "element-6066-11e4-a52e-4f735466cecf": "8a6f5a50-d197-c84f-a2b3-cae767dc6dab"
        }
    }
Grab the ID out of the response, and try this action:
    $ elem="8a6f5a50-d197-c84f-a2b3-cae767dc6dab"
    $ echo {} | http POST $wds/element/$elem/click
You should see the "About this site" menu appear on the web page.
"using" pairs
The request has an object with a "using" member containing the location strategy, in this case "link text", and a "value" member that should identify the element we want.
element IDs
For obscure reasons, element IDs are returned in an object with a member named element-6066-11e4-a52e-4f735466cecf. This is a fixed string that is part of the protocol; it isn't an ID! The element ID in this example is "8a6f5a50-d197-c84f-a2b3-cae767dc6dab".
In the rest of this tutorial, when I locate an element I will set the elem shell variable to the element's ID. You will need to substitute the actual ID you get from your WebDriver response.
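In a script, pulling the ID out of the response is a one-line lookup on the fixed member name. A small Python sketch follows; the names ELEMENT_KEY, element_id and element_url are invented here for illustration.

```python
# The member name is fixed by the protocol; it is not itself an ID.
ELEMENT_KEY = "element-6066-11e4-a52e-4f735466cecf"


def element_id(value):
    """Extract the element ID from the "value" member of an element response."""
    return value[ELEMENT_KEY]


def element_url(wds, elem):
    """Return the URL prefix for requests about one element."""
    return f"{wds}/element/{elem}"


# The response from the example above, after JSON decoding.
response = {"value": {ELEMENT_KEY: "8a6f5a50-d197-c84f-a2b3-cae767dc6dab"}}
elem = element_id(response["value"])
print(element_url("http://localhost:4444/session/SESSION", elem) + "/click")
```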
Client code helpers
In my WebDriver code I have a different representation of elements in the web page, which is a lot more convenient than the WebDriver protocol representation.
Because I use them so heavily, a simple string is interpreted as a CSS selector.
Other locator strategies are represented like { "link text" : "About this site" } because it's much shorter to omit the "using" and "value" strings.
Or if the element has already been located, it is represented in raw WebDriver form like { "element-6066-11e4-a52e-4f735466cecf": "8a6f5a50-d197-c84f-a2b3-cae767dc6dab" }
Whenever an action method in my code (such as click) is passed a locator rather than a raw WebDriver element, it automatically makes an element request to locate the element. This neatly wraps up the two steps of locate and action for me.
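To make the three representations concrete, here is a rough Python sketch of how such helpers might look. It is not the author's actual code: the names locator_body, resolve and click are hypothetical, and send(method, url, body) is assumed to be a helper that performs the request and returns the "value" member of the response.

```python
ELEMENT_KEY = "element-6066-11e4-a52e-4f735466cecf"
STRATEGIES = {"css selector", "link text", "partial link text", "tag name", "xpath"}


def locator_body(spec):
    """Turn a shorthand locator into a WebDriver "using"/"value" body."""
    if isinstance(spec, str):
        # A bare string is treated as a CSS selector.
        return {"using": "css selector", "value": spec}
    if isinstance(spec, dict) and len(spec) == 1:
        using, value = next(iter(spec.items()))
        if using in STRATEGIES:
            return {"using": using, "value": value}
    raise ValueError(f"not a locator: {spec!r}")


def resolve(send, wds, spec):
    """Return an element ID, making an element request only if needed."""
    if isinstance(spec, dict) and ELEMENT_KEY in spec:
        return spec[ELEMENT_KEY]  # already located
    return send("POST", f"{wds}/element", locator_body(spec))[ELEMENT_KEY]


def click(send, wds, spec):
    """Locate (if necessary) and click, wrapping up both steps."""
    elem = resolve(send, wds, spec)
    return send("POST", f"{wds}/element/{elem}/click", {})
```

With these, click(send, wds, {"link text": "About this site"}) would make the same pair of requests as the two HTTPie commands in the previous section.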
Sometimes I explicitly locate elements. This typically happens when I'm dealing with sub-elements such as rows of a table or fields in a form. It's neater to use a $wds/element/$elem/element sub-element request than to use string concatenation to build CSS or XPath selectors.
Error checking
The element request returns either one element or an error.
    $ http $wds/element using='link text' value='weasels'
    HTTP/1.1 404 Not Found
    ... headers ...

    {
        "value": {
            "error": "no such element",
            "message": "Unable to locate element: weasels",
            ... snip ...
        }
    }
In my WebDriver scripts, the low-level HTTP request code catches errors like this, reports the problem and aborts the script. This is usually good, because the script will not blunder on when its idea of what is happening diverges from reality.
There is also an elements request which can be used to find multiple elements in one go (such as the rows of a table) or test whether an element exists.
    $ http $wds/elements using='link text' value='weasels'
    HTTP/1.1 200 OK
    ... headers ...

    {
        "value": []
    }
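The abort-on-error behaviour can be sketched in Python as a check applied to every response. This is only an illustration: WebDriverError, check and exists are invented names, and the check assumes that geckodriver signals errors with an HTTP status of 400 or above, as in the 404 example.

```python
class WebDriverError(Exception):
    """Raised when the WebDriver endpoint reports an error."""

    def __init__(self, status, error, message):
        super().__init__(f"{status} {error}: {message}")
        self.status = status
        self.error = error


def check(status, body):
    """Return the "value" member, or raise if the response is an error."""
    value = body.get("value")
    if status >= 400:
        detail = value if isinstance(value, dict) else {}
        raise WebDriverError(
            status,
            detail.get("error", "unknown error"),
            detail.get("message", ""),
        )
    return value


def exists(elements_value):
    """Interpret an elements response: an empty list means no match."""
    return len(elements_value) > 0
```

A script that lets WebDriverError propagate stops at the first failed request, while exists(check(200, {"value": []})) gives a way to test for an element without treating its absence as fatal.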
Reading the page
There are several WebDriver requests for inspecting elements.
The one that I have found most useful is the text request, which I have used to look at the page to check that things are working as expected, for extracting status messages, etc.
    $ http -b $wds/element using='css selector' value='h1'
    {
        "value": {
            "element-6066-11e4-a52e-4f735466cecf": "1ec41bf0-63cb-dc43-9b7e-728779d7b920"
        }
    }

    $ elem="1ec41bf0-63cb-dc43-9b7e-728779d7b920"
    $ http -b $wds/element/$elem/text
    {
        "value": "Overview"
    }
I also use the property/value request for getting the current state of a form. When I'm looking at a pre-filled form that might need changes I can use this to avoid submitting if changes turn out not to be necessary.
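The check for unnecessary changes amounts to reading each field and comparing it with the value wanted. A possible Python sketch, where read_form and fields_to_change are hypothetical names and send(method, url, body) is an assumed helper that returns the "value" member of the response:

```python
def read_form(send, wds, fields):
    """Read the current value of each form field.

    `fields` maps a field name to an already-located element ID.
    """
    return {
        name: send("GET", f"{wds}/element/{elem}/property/value", None)
        for name, elem in fields.items()
    }


def fields_to_change(current, desired):
    """Return only the desired values that differ from the current ones."""
    return {
        name: want
        for name, want in desired.items()
        if current.get(name) != want
    }
```

If fields_to_change returns an empty dictionary, the script can skip filling and submitting the form.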
Filling forms
My main reason for writing WebDriver scripts is to automatically fill in forms. This is superficially easy, but there are traps for the unwary.
text boxes
Let's navigate to this tutorial page and get the id of the simple text box that appears just below.
    $ http -b POST $wds/url \
        url=http://www.dns.cam.ac.uk/news/2019-12-12-webdriver.html
    {
        "value": null
    }

    $ http -b $wds/element using='css selector' value='#wd-text'
    {
        "value": {
            "element-6066-11e4-a52e-4f735466cecf": "7cfbe5ea-903e-c945-898b-d3182852691c"
        }
    }

    $ elem="7cfbe5ea-903e-c945-898b-d3182852691c"
We can enter something in the box:
    $ http -b $wds/element/$elem/value text='badger'
    {
        "value": null
    }
You should see a badger in the wd-text box. If you run the command more than once, you will see multiple badgers in the box.
The value request does not set the value of a form input as you might hope. Instead it simulates typing!
So, to correctly fill a text input you need to clear it first, like:
    $ echo '{}' | http POST $wds/element/$elem/clear
    {
        "value": null
    }

    $ http -b $wds/element/$elem/value text='snake'
    {
        "value": null
    }
Then you can be sure you have only a snake.
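Since forgetting the clear step is an easy mistake, it is worth bundling the two requests into one helper. A minimal Python sketch; fill_text is an invented name, and send(method, url, body) is an assumed helper that performs the HTTP request.

```python
def fill_text(send, wds, elem, text):
    """Replace the contents of a text input.

    The value request simulates typing, so clear the input first to
    avoid appending to whatever is already there.
    """
    send("POST", f"{wds}/element/{elem}/clear", {})
    send("POST", f"{wds}/element/{elem}/value", {"text": text})
```

Calling fill_text twice with the same text leaves one copy in the box rather than two.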
selection menus
Because it pretends to type at an element, the value request is no use for setting the value of a menu.
    $ http -b $wds/element using='css selector' value='#wd-sel'
    {
        "value": {
            "element-6066-11e4-a52e-4f735466cecf": "9b4fa642-d7e2-e942-a90a-b6700d1b9eef"
        }
    }

    $ elem="9b4fa642-d7e2-e942-a90a-b6700d1b9eef"

    $ http -b $wds/element/$elem/value text='bcde'
    {
        "value": null
    }

    $ http -b $wds/element/$elem/property/value
    {
        "value": "cdef"
    }
If you try this you will find it doesn't select the option as expected - my property/value request read back "cdef" not "bcde"! (It doesn't even behave in anything like a way that I can understand!)
Instead you need to click on the relevant option, like:
    $ http -b $wds/element using='css selector' \
        value='#wd-sel option[value="bcde"]'
    {
        "value": {
            "element-6066-11e4-a52e-4f735466cecf": "450dc8b3-aa9c-b241-b15b-3b66cdefa91a"
        }
    }

    $ elem="450dc8b3-aa9c-b241-b15b-3b66cdefa91a"
    $ echo '{}' | http POST $wds/element/$elem/click
    {
        "value": null
    }
In cases where the option values don't have straightforward meanings, I have found it helpful to use XPath to match the option text, like:
    $ http -b $wds/element using='xpath' \
        value='//select[@id="wd-sel"]/option[text()="bcde"]'
    {
        "value": {
            "element-6066-11e4-a52e-4f735466cecf": "450dc8b3-aa9c-b241-b15b-3b66cdefa91a"
        }
    }
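When a script selects options by their text in several places, the XPath expression can be generated. This Python sketch uses the hypothetical names option_xpath and select_by_text, and assumes a send(method, url, body) helper that returns the "value" member of the response. For simplicity it refuses strings containing double quotes, since XPath 1.0 string literals have no escape syntax.

```python
ELEMENT_KEY = "element-6066-11e4-a52e-4f735466cecf"


def option_xpath(select_id, text):
    """Build an XPath query matching an option of a menu by its text."""
    for part in (select_id, text):
        if '"' in part:
            # Keep the sketch simple: no quoting tricks.
            raise ValueError(f"double quote not supported: {part!r}")
    return f'//select[@id="{select_id}"]/option[text()="{text}"]'


def select_by_text(send, wds, select_id, text):
    """Locate the matching option and click it."""
    body = {"using": "xpath", "value": option_xpath(select_id, text)}
    elem = send("POST", f"{wds}/element", body)[ELEMENT_KEY]
    send("POST", f"{wds}/element/{elem}/click", {})
```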
Other gotchas
There are a few other tricky cases that I have encountered.
hide and click
One of my scripts has to deal with a pop-up date picker. Fortunately I can just type into the date box and ignore the picker - except that the picker pops over another element that I want to click on. In that situation, WebDriver returns an error saying you can't click on an obscured element.
So I had to make my script click elsewhere to dismiss the date picker, before clicking on the obscured drop-down menu.
synchronous vs asynchronous
Most WebDriver actions return a response after the action has completed, so scripts don't have to worry about all the multi-process machinery that is making it work.
However, when a click activates some JavaScript that does the actual thing, WebDriver returns a response immediately. There are cases where the thing is slow (such as performing a back-end API request) so it is fairly obvious that the WebDriver script gets a response before the browser is done.
My scripts handle this by repeatedly making elements requests until the expected element appears. There's a timeout in case something unexpected happens.
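A polling loop of this kind might look like the following Python sketch. The name wait_for and its parameters are invented for illustration; find is assumed to be a zero-argument callable that makes an elements request and returns the resulting list, and the clock and sleep parameters exist only so the loop can be exercised without real delays.

```python
import time


def wait_for(find, timeout=10.0, interval=0.25,
             clock=time.monotonic, sleep=time.sleep):
    """Call find() until it returns a non-empty list or the timeout expires."""
    deadline = clock() + timeout
    while True:
        found = find()
        if found:
            return found
        if clock() >= deadline:
            # Something unexpected happened; give up rather than spin forever.
            raise TimeoutError(f"nothing found after {timeout} seconds")
        sleep(interval)
```

The timeout and interval defaults are arbitrary; a slow back-end request may need a longer timeout.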
You also need to beware of cases where the thing is fast (such as manipulating the DOM to adjust a form) because that can lead to tricky race conditions between the WebDriver script turn-around time and the JavaScript completion time.
That's it!
That is basically everything I have needed to learn about WebDriver to make it useful for scripting web sites.
I have found that most of the work scripting a site is finding out how to automatically navigate the site while ensuring that it is working as expected. WebDriver itself has not been much of a pain point!
There are a bunch of other things that you can do with WebDriver such as manipulating windows and taking screenshots, but I haven't needed them.