Full pipeline logic: Naver + Yandex suggest

The script takes a list of Korean keywords and expands each one into a rich set of related search suggestions, scores them by relevance, and saves the results.


Stage 1 — Input

The frontend sends a JSON payload { "words": ["바둑이", "서울 맛집", ...] } via POST. PHP reads and validates it, then loops over each keyword one by one.
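The script itself is PHP; as a language-neutral illustration, the validation step might look like the minimal Python sketch below. The function name parse_words and the exact rules (trim whitespace, drop empty or non-string entries, reject an empty list) are assumptions, not taken from the script.

```python
import json


def parse_words(raw_body: str) -> list[str]:
    """Return the cleaned keyword list from a JSON request body.

    Raises ValueError when the body is not usable. The validation rules
    here are assumed for illustration.
    """
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError as exc:
        raise ValueError("body is not valid JSON") from exc

    if not isinstance(payload, dict):
        raise ValueError("body must be a JSON object")

    words = payload.get("words")
    if not isinstance(words, list):
        raise ValueError('"words" must be a list')

    # Keep only non-empty strings, with surrounding whitespace trimmed.
    cleaned = [w.strip() for w in words if isinstance(w, str) and w.strip()]
    if not cleaned:
        raise ValueError("no usable keywords")
    return cleaned
```

The returned list is what the per-keyword loop would then iterate over.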


Stage 2 — Suggestion gathering (4 sources per keyword)

For each keyword, suggestions are pulled from four sources:

Yandex suggest queries suggest.yandex.ru with uil=ko (Korean language context). In practice Yandex knows very little about Korean searches, so this source rarely contributes much — but it costs almost nothing to include.

Naver direct queries ac.search.naver.com with the raw keyword as-is. This is the primary source — Naver is the dominant Korean search engine and its autocomplete reflects real Korean search behaviour. For 바둑이 this returns things like 바둑이 룰, 바둑이 강아지 etc.

Naver consonant expansion is the new layer. It fires 14 separate Naver queries, one per basic Korean consonant: 바둑이 ㄱ, 바둑이 ㄴ ... 바둑이 ㅎ. Naver treats the trailing consonant as the initial consonant of the next word and returns suggestions whose next word starts with it, surfacing results that the plain direct query never returns, such as 바둑이 게임, 바둑이 뜻, 바둑이 분양. An 80 ms pause follows each call to avoid rate limiting.
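A minimal Python sketch of the consonant expansion (the script itself is PHP). The fetch argument is a hypothetical callable standing in for the real Naver request, whose URL parameters are not described here; only the 14 consonants and the 80 ms pause come from the description above.

```python
import time
from typing import Callable

# The 14 basic Korean consonants, in dictionary order.
CONSONANTS = "ㄱㄴㄷㄹㅁㅂㅅㅇㅈㅊㅋㅌㅍㅎ"


def consonant_queries(keyword: str) -> list[str]:
    """Build one query per consonant: '<keyword> ㄱ' ... '<keyword> ㅎ'."""
    return [f"{keyword} {c}" for c in CONSONANTS]


def expand_with_consonants(
    keyword: str,
    fetch: Callable[[str], list[str]],  # hypothetical: query -> suggestions
    pause_s: float = 0.08,              # 80 ms between calls
) -> list[str]:
    """Run every consonant query in turn and collect all suggestions."""
    results: list[str] = []
    for query in consonant_queries(keyword):
        results.extend(fetch(query))
        time.sleep(pause_s)  # pause after each call to avoid rate limiting
    return results
```

With 14 calls and an 80 ms pause each, this layer adds a little over a second of waiting per keyword on top of the request time.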

N-gram expansion splits the keyword by spaces and generates its contiguous 2- and 3-token sub-phrases. For a single word like 바둑이 this produces nothing (minimum is 2 tokens). But for 서울 맛집 추천 it generates 서울 맛집 and 맛집 추천; the only 3-token sub-phrase is the keyword itself, which the direct queries already cover. It then queries both Yandex and Naver with each sub-phrase, pulling in suggestions that are anchored to parts of the original query rather than the whole thing.
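The sub-phrase generation can be sketched in Python as follows (the script itself is PHP). The function name sub_phrases is hypothetical, and skipping the full keyword is an assumption inferred from the example above, which lists only 서울 맛집 and 맛집 추천.

```python
def sub_phrases(keyword: str, sizes: tuple[int, ...] = (2, 3)) -> list[str]:
    """Return contiguous 2- and 3-token sub-phrases of a keyword.

    The full keyword is skipped (assumed: the direct queries already
    cover it), and duplicates are dropped while keeping order.
    """
    tokens = keyword.split()
    full = " ".join(tokens)
    phrases: list[str] = []
    for n in sizes:
        # Slide a window of n tokens across the keyword.
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase != full and phrase not in phrases:
                phrases.append(phrase)
    return phrases
```

For a four-token keyword this yields three 2-token and two 3-token sub-phrases, each of which is then sent to both suggest services.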


Stage 3 — Merge and deduplicate

All results from the four sources are flattened into one array and array_unique removes any duplicates. At this point the list may contain 50–100+ raw suggestions for a single keyword.
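In Python terms (the script itself is PHP and uses array_unique), the merge step amounts to the sketch below. Like array_unique, it keeps the first occurrence of each value; the function name merge_unique is hypothetical.

```python
from itertools import chain
from typing import Iterable


def merge_unique(*sources: Iterable[str]) -> list[str]:
    """Flatten several suggestion lists and drop duplicates.

    dict.fromkeys preserves insertion order, so the first occurrence of
    each suggestion keeps its position.
    """
    return list(dict.fromkeys(chain.from_iterable(sources)))
```

Note that this is exact-match deduplication: two suggestions that differ only in spacing still count as distinct.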


Stage 4 — Filtering

Each suggestion passes two checks before it qualifies:

Stop word filter drops anything containing words like 다운로드 (download), 무료 (free), 사진 (photo), 동영상 (video), 인스타그램, 유튜브 etc. These are commercially useless navigational or media queries.

Minimum token filter drops any single-word result. Everything that survives must contain at least one space — i.e. be a genuine multi-word phrase. This removes noise like Naver returning the bare keyword itself as a suggestion.
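Both checks can be sketched in Python as below (the script itself is PHP). The stop list contains only the six words named above; the real list is longer, and the function name keep is hypothetical.

```python
# Only the stop words named in the text; the script's list is longer.
STOP_WORDS = ("다운로드", "무료", "사진", "동영상", "인스타그램", "유튜브")


def keep(suggestion: str) -> bool:
    """Return True when a suggestion passes both filters."""
    # Stop word filter: reject if any stop word appears anywhere.
    if any(stop in suggestion for stop in STOP_WORDS):
        return False
    # Minimum token filter: require at least one inner space.
    return " " in suggestion.strip()
```

Applied as a list filter, `[s for s in suggestions if keep(s)]` leaves only multi-word phrases free of stop words.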


Stage 5 — Scoring and sorting

Each surviving suggestion is scored against the original keyword using two multibyte-safe measures, since PHP's native similar_text() and levenshtein() are byte-based and would miscount Korean characters (each Hangul character is 3 bytes in UTF-8).
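The byte-versus-character gap is easy to see in a short Python illustration; it is the same gap that separates PHP's strlen() from mb_strlen().

```python
word = "바둑이"

# Number of Unicode characters (what a multibyte-safe measure sees).
char_count = len(word)

# Number of bytes in UTF-8 (what a byte-based function sees).
byte_count = len(word.encode("utf-8"))

print(char_count, byte_count)  # 3 9
```

A byte-based comparison would therefore treat a one-character difference in Hangul as a three-unit difference.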

mb_similar_score counts shared Unicode characters between the original query and the suggestion, expressed as a percentage of total characters across both strings. Suggestions that share more characters with the original keyword score higher.

mb_length_penalty measures the difference in length, in characters, between the original and the suggestion. Each character of difference costs a small amount in the final score, which slightly favours shorter, tighter suggestions over very long ones.

The final formula is score = similarity × 1.5 − length_penalty × 0.1. Results are sorted descending by score, so the closest-matching suggestions appear first.
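A Python sketch of the scoring, under stated assumptions (the script itself is PHP). The names mb_similar_score and mb_length_penalty come from the text, but their bodies here are one plausible reading: "shared characters" is taken as the multiset overlap of the two strings, counted once per string and divided by the combined length, and the penalty is the absolute length difference. Only the weights 1.5 and 0.1 are given in the formula above.

```python
from collections import Counter


def mb_similar_score(a: str, b: str) -> float:
    """Shared characters as a percentage of the combined length (assumed reading)."""
    total = len(a) + len(b)
    if total == 0:
        return 0.0
    # Multiset intersection: each character counts as often as it
    # appears in both strings.
    shared = sum((Counter(a) & Counter(b)).values())
    return 2 * shared / total * 100


def mb_length_penalty(a: str, b: str) -> int:
    """Absolute difference in character length (assumed reading)."""
    return abs(len(a) - len(b))


def score(original: str, suggestion: str) -> float:
    """score = similarity * 1.5 - length_penalty * 0.1"""
    return (mb_similar_score(original, suggestion) * 1.5
            - mb_length_penalty(original, suggestion) * 0.1)


def rank(original: str, suggestions: list[str]) -> list[str]:
    """Sort suggestions by descending score."""
    return sorted(suggestions, key=lambda s: score(original, s), reverse=True)
```

Under this reading, 바둑이 against 바둑이 룰 shares 3 characters out of 8 combined, giving a similarity of 75 and a length penalty of 2, so the score is 75 × 1.5 − 2 × 0.1 = 112.3. The similarity term dominates; the length penalty mostly acts as a tie-breaker.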


Stage 6 — Output

The final sorted list for each keyword is written to a timestamped .txt file in /PALS/suggests/ on the server, and the full result is returned as JSON to the browser. The frontend renders each keyword block as a heading with its suggestions listed below, and provides a download link to the saved file.
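The file-writing step might look like the Python sketch below (the script itself is PHP). The filename pattern suggests_YYYYMMDD_HHMMSS.txt and the layout inside the file (keyword on its own line, suggestions indented beneath it) are assumptions; only the timestamped .txt file and the target directory come from the description above.

```python
from datetime import datetime
from pathlib import Path


def save_results(results: dict[str, list[str]], out_dir: str) -> Path:
    """Write all keyword blocks to one timestamped .txt file and return its path."""
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)

    # Assumed filename pattern; the real script's pattern is not specified.
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = directory / f"suggests_{stamp}.txt"

    lines: list[str] = []
    for keyword, suggestions in results.items():
        lines.append(keyword)                         # block heading
        lines.extend(f"  {s}" for s in suggestions)   # ranked suggestions
        lines.append("")                              # blank line between blocks
    path.write_text("\n".join(lines), encoding="utf-8")
    return path
```

When serialising the same results for the browser in Python, json.dumps(results, ensure_ascii=False) keeps the Hangul readable instead of escaping it.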
