Issue Details

Key: FLM-18
Type: Improvement
Status: Closed
Resolution: Fixed
Priority: Minor
Assignee: Marcin Mościcki
Reporter: Marcin Mościcki
Votes: 0
Watchers: 0

Filmaster

Enhance Criticker import

Created: 26/Mar/09 07:09 PM   Updated: 06/Jun/09 04:49 PM
Component/s: Fetchers
Affects Version/s: 1.0
Fix Version/s: 1.0

Time Tracking:
Not Specified


Description
The current Criticker importer has a bug: the film to be updated is looked up by title using the heuristic, user-facing search engine, which means the rating may end up on a completely different film that has a similar title and is more popular.

As a minimum, the fix should take the production year into account.

Further improvements:
1. Choice of the algorithm for converting the Criticker score. Options: tier (as now), round(score/10), score (assumes ratings of 1-10), and auto (scales the rating based on the maximum and minimum rating in the file).
2. Ability to overwrite existing ratings. This is mainly for me, since my first import used a different algorithm than I had assumed.
3. Ability to import reviews from Criticker - this seems sensible for the English-language version. Unfortunately, Filmaster's review length limit is shorter, so a review that exceeds it will be ignored.
4. The XML produced by Criticker is double-escaped, which confuses Filmaster. Try to guess what the author meant.
I should submit this tomorrow.
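
The four conversion options in point 1 could be sketched roughly as below. This is only an illustration, not the importer's actual code: the function name convert_score and its parameters are hypothetical, Criticker scores are assumed to run from 0 to 100, and the linear scaling used for "auto" (as well as its fallback when the file has a single distinct score) is an assumption about what "scales the rating" means.

```python
def convert_score(score, tier=None, mode="tier", lowest=None, highest=None):
    """Map a Criticker score to a 1-10 rating (hypothetical helper).

    lowest/highest are the minimum and maximum scores found in the
    imported file; they are only used by the "auto" mode.
    """
    if mode == "tier":
        # Use the tier that Criticker itself assigned to the rating.
        if tier is None:
            raise ValueError("tier mode needs the tier value from the file")
        rating = int(tier)
    elif mode == "round":
        # round(score/10), rounding halves up rather than to even.
        rating = int(score / 10.0 + 0.5)
    elif mode == "score":
        # The user already rates on a 1-10 scale: take the score as is.
        rating = int(score)
    elif mode == "auto":
        if lowest is None or highest is None or highest <= lowest:
            # Assumption: with no usable range, fall back to round(score/10).
            rating = int(score / 10.0 + 0.5)
        else:
            # Assumption: stretch [lowest, highest] linearly onto [1, 10].
            rating = 1 + int((score - lowest) * 9.0 / (highest - lowest) + 0.5)
    else:
        raise ValueError("unknown mode: %r" % (mode,))
    # Keep the result inside the 1-10 range whatever the input was.
    return max(1, min(10, rating))
```

With scores in a file ranging from 40 to 90, "auto" would map 40 to 1 and 90 to 10, whereas "round" would map them to 4 and 9.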





Comments
Marcin Mościcki added a comment - 26/Mar/09 07:10 PM
I added this issue because the implementation has been almost ready for some time and I want to finish it before I move on to more urgent matters - it should be done tomorrow.

Borys Musielak added a comment - 26/Mar/09 08:32 PM
In English, please!!!

Marcin Mościcki added a comment - 27/Mar/09 06:02 PM - edited
Import should be working a lot better now:

New functionality:
1) possible to overwrite previous ratings
2) possible to import reviews from criticker

Improvements:
1) When importing from imdb, we first look up by imdb_code. However, this fails - see FLM-19;
2) When looking up by title, we take the year into account. For criticker, where no year is given, we try to guess it from the url if possible;
3) We don't take the first match as the search result. Instead, we look in the matches for an exact match first. If more than one is found (and no year is present), the rating is not imported.
4) unescaping of imdb and criticker data, which previously confused the import engine
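
For illustration, undoing doubly escaped data could look something like the sketch below. It is not the code that was committed: unescape_repeatedly and max_passes are hypothetical names, and html.unescape comes from the Python 3 standard library, which is an assumption about the environment.

```python
import html


def unescape_repeatedly(text, max_passes=3):
    """Decode HTML/XML entities until the text stops changing.

    Doubly escaped input such as 'H&amp;eacute;ros' needs two passes.
    max_passes bounds the loop for pathological input.
    """
    for _ in range(max_passes):
        decoded = html.unescape(text)
        if decoded == text:
            # Nothing left to decode.
            break
        text = decoded
    return text
```

The trade-off is that a title which legitimately contains the literal text of an entity would be decoded as well, so this really is a guess at what the author meant.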

Borys Musielak added a comment - 27/Mar/09 07:57 PM
Good work. A few remarks/questions though:

1. I tried importing my ratings on dev.filmaster.com and got:

- Films that could not be imported:
The Blind Swordsman: Zatoichi; Solaris; Run Lola Run; The Time of the Wolf; Man of Marble; Metropolis; Psycho; Casino Royale; The Saragossa Manuscript; Once Upon a Time in the West;

Why is Casino Royale there? It does have an English title in the database.

2. When importing with the Polish locale version, does it try to import the short reviews with LANG=pl into our database, or is it smart enough to hardcode it to 'en' for Criticker reviews? Or does it skip importing the reviews altogether?

Marcin Mościcki added a comment - 05/Apr/09 03:05 PM - edited
First, answers:
1. There are two films titled 'Casino Royale', and we took the safe approach of not guessing which one the user had in mind. In some cases it was possible to deduce the release year from the criticker link, and then it would pick the right one.

2. If the 'import reviews' flag is set, we import reviews regardless of the language. This is because some people reviewed in languages other than English and I wasn't trying to be too smart - after all, the default is false. If you request it, I can show this field only when the LANGUAGE application variable is 'en'.

Second, new improvements:

I. Criticker import:
 a) added criticker_id column to core_film. Currently it is set only when a film is recognized during ratings import.
 b) if a film is not found based on the data in the xml file, we parse the criticker film page to get the year and aka, and try again
 c) not found films are shown in the summary as links to ADD_FILM form, which will automatically set 'title' field to 'title (year)'

II. Imdb import:
 a) if a film is not found in db, it is automatically imported from imdb: IS IT A BEHAVIOR YOU REALLY WANT? Some of these titles may be TV, series, or adult.
 b) modified imdb import code (and criticker import) to unescape xml entities present in some titles (e.g. the entity-escaped form of 'Héros' -> 'Héros')

III. Generic:
 a) more thought-out title matching:
 - exact ignore-case title match, if unique
 - normalized title match, if unique
 - best match from search algorithm, if it yields a single result

Sometimes criticker states a different release year than imdb, and we could try a 'fuzzy' match if pressed, but I guess we can live without it.
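
The three-step title matching from section III could be sketched as follows. This is a simplified stand-in rather than the deployed algorithm: films are plain (title, year) tuples instead of core_film rows, normalize_title and match_film are hypothetical names, the normalization rules (dropping accents, punctuation and a leading article) are an assumption, and search stands for whatever search function the site provides.

```python
import re
import unicodedata


def normalize_title(title):
    """Lowercase, strip accents and punctuation, drop a leading article."""
    decomposed = unicodedata.normalize("NFKD", title)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    words = re.findall(r"[a-z0-9]+", stripped.lower())
    if len(words) > 1 and words[0] in ("the", "a", "an"):
        words = words[1:]
    return " ".join(words)


def match_film(title, films, year=None, search=None):
    """Return the single (title, year) tuple matching title, or None.

    Each step is accepted only if it yields exactly one candidate.
    """
    # If the year is known, restrict the candidates to that year first.
    candidates = [f for f in films if year is None or f[1] == year]

    # Step 1: exact title match, ignoring case.
    exact = [f for f in candidates if f[0].lower() == title.lower()]
    if len(exact) == 1:
        return exact[0]

    # Step 2: match on normalized titles.
    wanted = normalize_title(title)
    normalized = [f for f in candidates if normalize_title(f[0]) == wanted]
    if len(normalized) == 1:
        return normalized[0]

    # Step 3: fall back to the search algorithm, if one was supplied.
    if search is not None:
        found = search(title, candidates)
        if len(found) == 1:
            return found[0]

    # Ambiguous or not found: do not guess.
    return None
```

With two films titled 'Casino Royale' in the list and no year supplied, every step sees two candidates and the function returns None, which corresponds to the rating not being imported.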

Adam Zieliński added a comment - 05/Apr/09 10:32 PM
"Some of these titles may be TV, series, or adult. ..."

The fetcher only saves titles whose kind is not "tv series". The call a.do_adult_search(0) removes adult movies from the search query.

Marcin Mościcki added a comment - 09/Apr/09 08:15 PM
I deployed another slight change to the algorithm - the problem with the most recently committed version was, for a change, too many false positives.