
Sunday, January 24, 2010

eOPAC

With this Libmondo web application you really do run into everything that can go wrong. Today: cookie handling. HttpClient comes with automatic cookie handling, except that:

RFC2109 is the first official cookie specification released by the W3C. Theoretically, all servers that handle version 1 cookies should use this specification and as such this specification is used by default within HttpClient. Unfortunately, many servers either incorrectly implement this standard or are still using the Netscape draft so occasionally this specification is too strict. If this is the case, you should switch to the compatibility specification as described below.

What does the cookie look like?

Set-Cookie: SID=S296412643528879287930254966; path=/; Version="1"

Too bad, so something has to be done after all.

HttpClientParams.setCookiePolicy(httpClient.getParams(), CookiePolicy.BROWSER_COMPATIBILITY);

Now it works.
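To look at the header from above in isolation, here is a small sketch that parses it with the JDK's own java.net.HttpCookie. This is explicitly not HttpClient's RFC2109 implementation, so it says nothing about why the default policy rejected the cookie; it only shows which attributes the header carries. The class name CookieInspect and the helper parseFirst are made up for this illustration.

```java
import java.net.HttpCookie;
import java.util.List;

final class CookieInspect {

    // The header exactly as the server sent it (see above).
    static final String HEADER =
            "Set-Cookie: SID=S296412643528879287930254966; path=/; Version=\"1\"";

    // Hypothetical helper: parses one Set-Cookie header line with the JDK parser
    // (not HttpClient's cookie spec) and returns the first cookie found.
    static HttpCookie parseFirst(String headerLine) {
        List<HttpCookie> cookies = HttpCookie.parse(headerLine);
        return cookies.get(0);
    }

    public static void main(String[] args) {
        HttpCookie cookie = parseFirst(HEADER);
        System.out.println("name    = " + cookie.getName());
        System.out.println("value   = " + cookie.getValue());
        System.out.println("path    = " + cookie.getPath());
        System.out.println("version = " + cookie.getVersion());
    }
}
```

Running it prints the name, value, path and version the JDK parser extracts, which is handy for comparing against what the scraper actually sends back on the next request.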

Here is also a nice text on scraping in general:

This scenario describes a hobbyist usage of HTTP, in other words: a bad practice. Web sites are designed for user interaction, not as an application programming interface (API). The interface of a web site is the user interface displayed by a browser. The HTTP communication between the browser and the server is an internal API, subject to change without notice. A web site can be redesigned at any point in time. The server then sends different documents and a browser will display the new content. The user easily adjusts to click the appropriate links, and the browser communicates via HTTP as specified by the new documents from the server. Your application that only mimicks a browser will simply break. Nevertheless, implementing this scenario will help you to get familiar with HTTP communication. It is also "good enough" for hobbyists applications, for example if you want to download the latest installment of your favorite daily webcomic to install it as the screen background. There is no big damage if such an application breaks. If you want to implement a solid application, you should use only published APIs.

It is all true: "hobbyists applications", "bad practice", "use only published APIs", but it is equally true that there is no API...

Posted by tixus at 7:30 PM