Schnelle Suche nach einem String in einer Textdatei

Allen Savio

Ich habe diesen Code geschrieben, um nach einer Zeichenfolge in einer .txtDatei zu suchen . Ist es möglich, den Code so zu optimieren, dass er auf schnellstem Weg nach dem String sucht? Angenommen, die Textdatei wäre groß (500 MB-1 GB)

Ich möchte keine Regex verwenden.


public class StringFinder {

public static void main(String[] args)
    double count = 0,countBuffer=0,countLine=0;
    String lineNumber = "";
    String filePath = "C:\\Users\\allen\\Desktop\\TestText.txt";
    BufferedReader br;
    String inputSearch = "are";
    String line = "";

    try {
        br = new BufferedReader(new FileReader(filePath));
        try {
            while((line = br.readLine()) != null)
                String[] words = line.split(" ");

                for (String word : words) {
                  if (word.equals(inputSearch)) {

                if(countBuffer > 0)
                    countBuffer = 0;
                    lineNumber += countLine + ",";

        } catch (IOException e) {
            // TODO Auto-generated catch block
    } catch (FileNotFoundException e) {
        // TODO Auto-generated catch block

    System.out.println("Times found at--"+count);
    System.out.println("Word found at--"+lineNumber);
Haben Sie schon untersucht, welche Algorithmen verwendet werden sollen? Bei der Suche nach einer festen Zeichenfolge (dh ohne Platzhalterzeichen) ist [Knuth-Morris-Pratt] ( erheblich schneller als naives Matching zum Beispiel. David Richerby vor 6 Jahren 1
Dies könnte wahrscheinlich einen Unterschied machen, verwenden Sie `StringBuilder` Marc-Andre vor 6 Jahren 0

6 Antworten auf die Frage


Fast comes at a price.... code complexity and perhaps readability.

Assuming that your code produces the right results now.... and it is a big assumption because:

  • it expects the word to be at the beginning/end of the line, or surrounded by spaces (not commas, punctuation, etc.)
  • it does not look for the word inside another string, it will match 'are', but not 'bare'.

OK, a much faster way (keeping it as Java), is to do the following:

  1. Convert the search string ('are') to a byte-array in the same encoding as the file.
  2. Open a memory-mapped byte-buffer from a File-Channel on the file.
  3. Scan the ByteBuffer, looking for matches to the search byte-array
  4. count the newlines as you go.
  5. close the ByteBuffer

If the file is larger than your memory, you will have to re-position the byte-buffer occasionally. I recommend using a emopry-mapped size of about 4MB plus the size of the search-string. That way you can search the 4MB window, and then start the next window at the next 4mb boundary.

Once you get in to it, it will make sense.

This system will be fast because you will never have to copy the file's data in to Java. Everything will actually happen in the native side of things.

There is a lot to read to make it work.

I would start with a tutorial....

of course, if you want really fast, use grep.

Here is some example code that may get you started:

import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class NIOGrep {

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            throw new IllegalArgumentException();
        String grepfor = args[0];
        Path path = Paths.get(args[1]);

        String report = searchFor(grepfor, path);



    private static final int MAPSIZE = 4 * 1024 ; // 4K - make this * 1024 to 4MB in a real system.

    private static String searchFor(String grepfor, Path path) throws IOException {
        final byte[] tosearch = grepfor.getBytes(StandardCharsets.UTF_8);
        StringBuilder report = new StringBuilder();
        int padding = 1; // need to scan 1 character ahead in case it is a word boundary.
        int linecount = 0;
        int matches = 0;
        boolean inword = false;
        boolean scantolineend = false;
        try (FileChannel channel =, StandardOpenOption.READ)) {
            final long length = channel.size();
            int pos = 0;
            while (pos < length) {
                long remaining = length - pos;
                // int conversion is safe because of a safe MAPSIZE.. Assume a reaosnably sized tosearch.
                int trymap = MAPSIZE + tosearch.length + padding;
                int tomap = (int)Math.min(trymap, remaining);
                // different limits depending on whether we are the last mapped segment.
                int limit = trymap == tomap ? MAPSIZE : (tomap - tosearch.length);
                MappedByteBuffer buffer =, pos, tomap);
                System.out.println("Mapped from " + pos + " for " + tomap);
                pos += (trymap == tomap) ? MAPSIZE : tomap;
                for (int i = 0; i < limit; i++) {
                    final byte b = buffer.get(i);
                    if (scantolineend) {
                        if (b == '\n') {
                            scantolineend = false;
                            inword = false;
                            linecount ++;
                    } else if (b == '\n') {
                        inword = false;
                    } else if (b == '\r' || b == ' ') {
                        inword = false;
                    } else if (!inword) {
                        if (wordMatch(buffer, i, tomap, tosearch)) {
                            i += tosearch.length - 1;
                            if (report.length() > 0) {
                                report.append(", ");
                            scantolineend = true;
                        } else {
                            inword = true;
        return "Times found at--" + matches + "\nWord found at--" + report;

    private static boolean wordMatch(MappedByteBuffer buffer, int pos, int tomap, byte[] tosearch) {
        //assume at valid word start.
        for (int i = 0; i < tosearch.length; i++) {
            if (tosearch[i] != buffer.get(pos + i)) {
                return false;
        byte nxt = (pos + tosearch.length) == tomap ? (byte)' ' : buffer.get(pos + tosearch.length); 
        return nxt == ' ' || nxt == '\n' || nxt == '\r';
Diese Idee lässt sich möglicherweise mit einem Map-Reduction-Stil-Algorithmus beschleunigen, bei dem mehrere Bereiche der Datei gleichzeitig durchsucht werden. mauhiz vor 6 Jahren 2
`" Dieses System wird schnell sein, da Sie die Daten der Datei niemals in Java kopieren müssen. Alles wird tatsächlich auf der nativen Seite der Dinge geschehen. "" Nicht ganz richtig, Sie kopieren die Daten nach Java, ein Byte bei Zeit mit `buffer.get (i)`. Emily L. vor 6 Jahren 1

If you want to go for performance, you could try a different algorithm. This is what grep does :

GNU grep uses the well-known Boyer-Moore algorithm, which looks first for the final letter of the target string, and uses a lookup table to tell it how far ahead it can skip in the input whenever it finds a non-matching character.

from Why GNU grep is fast (you'll find on this page other smart ideas).

You can find more details on the corresponding Wikipedia page.

Emily L.

If you are looking to match "are" with spaces around it, simply add spaces like so: " are " and see if the line contains that string (taking some edge cases into account).

        String paddedInput = " " + inputSearch + " ";
        String paddedInputStart = inputSearch + " ";
        String paddedInputEnd = " " +inputSearch ;

        while((line = br.readLine()) != null)

            if(line.equals(inputSearch) || 
               line.startsWith(paddedInputStart) ||
               line.endsWith(paddedInputEnd) || 
               (line.contains(paddedInput)) {
                 lineNumber += countLine + ",";

Do the least expensive to fail checks first. Equals will check string length first so it's a quick check unless the line is equally long to the search space (not so frequent). Functions startsWith and endsWith are fast checks as they do not perform a search; contains is done last because it's the most expensive.

The above avoids the splitting (that can be slow) and iteration over the word list. Instead letting the string API that is most likely implemented in native code do the work for you. The strings used must be constructed before the loop to avoid repeated string operations although I think that the java compiler will optimize that, I'm not sure.

A good implementation of String.contains() would use Boyer-Moore but it is not necessary that it will. Java does not dictate which algorithm it is. If you want to be sure, see link in the answer:

Yeh @Emily you are right. The only thing that bothers me now is that i have to handle all the punctuations. Eg: Punctuations like ",.!)'->/ Allen Savio vor 6 Jahren 0
In diesem Fall ist es einfacher, `int i = String.indexOf (inputSearch)` und `if (i! = -1)` zu verwenden, und dann `line.charAt (i-1)` und `line.charAt (i + inputSearch.length) `für Interpunktion oder Leerzeichen (mit offensichtlichen Überprüfungen des Zeilenbereichs). Wenn die Übereinstimmung fehlgeschlagen ist, suchen Sie mit `String.indexOf (inputSearch, i + inputSearch.length)` nach dem nächsten Vorkommen und wiederholen Sie den Vorgang. Emily L. vor 6 Jahren 0

Since you want to find the line numbers where the match succeeds, I would try make improvements based on your current strategy based on BufferedReader.readLine(), and resort to more exotic means such as NIO only if necessary.

Two somewhat costly operations are string splitting and string concatenation.

When you split a line into words, it has to allocate and copy the characters of a new String for each word, as well as an array to store the results in. Instead, you could search within the line, and check whether there the beginning and end of the match occur at word boundaries.

int len = inputSearch.length();
int countInLine = 0;
int pos = -1;
while ((pos = line.indexOf(inputSearch, pos + 1)) >= 0) {
    if ((pos == 0 || Character.isWhitespace(line.charAt(pos - 1))) &&
        (pos + len == line.length() || Character.isWhitespace(line.charAt(pos + len))) {

You should probably treat punctuation as well as whitespace as word boundaries.

The other inefficiency is repeated string concatenation. Strings in Java are immutable. Whenever you write a + b for strings a and b, the code actually compiles to new StringBuilder(a).append(b).toString(). Therefore, lineNumber should be a StringBuilder instead, so you can keep appending to it efficiently.

Note that FileNotFoundException is a kind of IOException. You could use one catch-block to handle both. However, if an IOException occurs, you probably shouldn't try to report the word count, which will probably be invalid. To accomplish that, you could just eliminate all try-catch blocks from main(), and declare that main(String[] args) throws IOException. Then, in case of error, it will just print a stack trace and exit.

Now that we handle white-space. There would be loads of punctuations to deal with. Allen Savio vor 6 Jahren 0

Ich möchte keine Regex verwenden.

Vielleicht solltest du.

Eine wenig bekannte Tatsache ist, dass a Matchernicht Stringals Argument, sondern als CharSequence. Und Stringimplementiert diese Schnittstelle.

Nun, da Sie mit großen Textdateien zu tun haben, habe ich genau die Bibliothek für Sie: Largetext . Es implementiert CharSequenceüber eine große Textdatei, was bedeutet, dass eine LargeTextInstanz direkt mit einem Matcher:

private static final Pattern PATTERN = ...
private static final LargeTextFactory = LargeTextFactory.defaultFactory();

// in code;

final Path path = Paths.get("...");

try (
    final LargeText largeText = factory.fromPath(...);
) {
    final Matcher m = PATTERN.matcher(largeText);
    // do stuff with m
Ich habe Ihre Bibliothek ausprobiert und funktioniert - gut gemacht! -, aber das Problem ist, dass nur der Index des Übereinstimmens und nicht die Zeilennummer zurückgegeben wird. Obwohl LargeText cool ist, bin ich mir nicht sicher, ob er dafür gut geeignet ist Aufgabe. james.garriss vor 5 Jahren 0
@ james.garriss eh, sorry, `Matcher` hat auch seine Grenzen;) fge vor 5 Jahren 0

You could search a string as a byte array: check my version of public static int search(byte[] input, byte[] searchedFor) on

Of course while doing the byte-search you have to catch all "newline" chars and count them, in order to give to the user a line number of where the matches were found.